Extracting a Flutter voice SDK from a production app and publishing it on pub.dev

June 2026·10 min

Why voice in a field service app?

HomeGuild Pro is a mobile app for service professionals — HVAC techs, plumbers, electricians, the people who keep buildings running. They spend their days on ladders, under sinks, and inside crawl spaces. They do not want to tap through screens while their hands are full of copper fittings.

We built voice mode so they can talk to the AI assistant hands-free. The flow looks like this:

User speaks into the phone
Deepgram transcribes speech to text over a WebSocket
Transcribed text gets sent to our backend
Backend streams a response via SSE
We buffer the streaming text into complete sentences
Each sentence gets sent to OpenAI TTS as soon as it's ready
Audio plays back sequentially

That pipeline has to handle real-world conditions: the user interrupts mid-sentence, background noise from a job site, tool call announcements ("Let me look up that customer..."), and filler words ("um", "uh", "you know") that Deepgram faithfully transcribes but the backend doesn't need.

The architecture was right from the start: three clean layers with clear seams between them. But the implementation was tangled with HomeGuild infrastructure in ways that only became visible when we tried to pull it out. After a few months in production, it was clear the voice orchestration layer had no business being coupled to our app. So we extracted it.

The three-layer architecture

The extraction worked because the original code already had clear boundaries between three concerns. We just had to formalize them into interfaces — and design those interfaces for the consumer, not the implementation.

STT Provider

The speech-to-text layer is an abstract interface. You implement startListening(), stopListening(), and dispose(), and expose a stream of transcript results. Each result carries the text, whether it's interim or final, and an optional confidence score.

abstract class STTProvider {
  Stream<STTResult> get transcriptStream;
  Future<void> startListening();
  Future<void> stopListening();
  void dispose();
}

Our Deepgram implementation handles the WebSocket connection, keep-alive pings, voice activity detection events, and the nova-2 model configuration. But if you want to swap in Google Speech-to-Text or Whisper, you implement the same four members and you're done.
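To make "implement the same four members" concrete, here is a minimal sketch of a scripted provider that replays canned results instead of opening a microphone, which is handy in tests. The STTResult shape (text, isInterim, confidence) is our guess from the description above, and ScriptedSTTProvider is a hypothetical name, not something the packages ship.

```dart
import 'dart:async';

// Assumed shape of a transcript result; the real STTResult may differ.
class STTResult {
  const STTResult(this.text, {required this.isInterim, this.confidence});
  final String text;
  final bool isInterim;
  final double? confidence;
}

// The interface as shown in the article.
abstract class STTProvider {
  Stream<STTResult> get transcriptStream;
  Future<void> startListening();
  Future<void> stopListening();
  void dispose();
}

// Hypothetical test double: emits a fixed script of results.
class ScriptedSTTProvider implements STTProvider {
  ScriptedSTTProvider(this._script);

  final List<STTResult> _script;
  final _controller = StreamController<STTResult>.broadcast();
  bool _listening = false;

  @override
  Stream<STTResult> get transcriptStream => _controller.stream;

  @override
  Future<void> startListening() async {
    _listening = true;
    for (final result in _script) {
      if (!_listening || _controller.isClosed) break;
      _controller.add(result);
      // Yield to the event loop so listeners see each result in turn.
      await Future<void>.delayed(Duration.zero);
    }
  }

  @override
  Future<void> stopListening() async {
    _listening = false;
  }

  @override
  void dispose() {
    _listening = false;
    _controller.close();
  }
}
```

Because the stream is a broadcast stream, subscribe to transcriptStream before calling startListening(), otherwise early results are dropped.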

TTS Provider

The text-to-speech layer is similarly minimal. It speaks a single sentence, can stop and clear its queue, and tells you whether it's currently speaking.

abstract class TTSProvider {
  Future<void> speakSentence(String sentence);
  Future<void> stopAndClear();
  Stream<bool> get isSpeakingStream;
  void dispose();
}

Our OpenAI implementation manages an internal sentence queue, downloads audio for each sentence, and plays them sequentially. The queue is important — streaming responses arrive faster than TTS can speak them, so sentences stack up and play in order.
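The queueing behaviour can be sketched without any audio code. This is a rough illustration, not the OpenAI provider itself: playSentence is a hypothetical callback standing in for "synthesize, download, play", and QueuedTTSProvider is a made-up name.

```dart
import 'dart:async';

// The interface as shown in the article.
abstract class TTSProvider {
  Future<void> speakSentence(String sentence);
  Future<void> stopAndClear();
  Stream<bool> get isSpeakingStream;
  void dispose();
}

// Hypothetical: plays queued sentences one at a time, in arrival order.
class QueuedTTSProvider implements TTSProvider {
  QueuedTTSProvider(this.playSentence);

  // Assumed to complete when playback of that sentence has finished.
  final Future<void> Function(String sentence) playSentence;
  final _pending = <String>[];
  final _speaking = StreamController<bool>.broadcast();
  bool _draining = false;
  int _generation = 0; // bumped to invalidate a drain loop that is in flight

  @override
  Stream<bool> get isSpeakingStream => _speaking.stream;

  @override
  Future<void> speakSentence(String sentence) async {
    _pending.add(sentence);
    if (_draining) return; // the running drain loop will pick it up
    _draining = true;
    final generation = _generation;
    _speaking.add(true);
    while (_pending.isNotEmpty && generation == _generation) {
      await playSentence(_pending.removeAt(0));
    }
    if (generation == _generation) {
      _draining = false;
      _speaking.add(false);
    }
  }

  @override
  Future<void> stopAndClear() async {
    _generation++;
    _pending.clear();
    if (_draining) {
      _draining = false;
      _speaking.add(false);
    }
    // A real provider would also stop the audio player here; this sketch
    // can only drop sentences that have not started yet.
  }

  @override
  void dispose() {
    _generation++;
    _draining = false;
    _pending.clear();
    _speaking.close();
  }
}
```

In this sketch the first speakSentence() call completes once the queue has drained, while calls made during playback return as soon as the sentence is queued. Whether the real provider behaves the same way is not something the interface tells you.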

VoiceStreamAdapter

This is the interface that was hardest to get right, and the one that matters most. The conversation manager doesn't know about HTTP, GraphQL, SSE, WebSockets, or any specific API. It calls streamResponse() and gets back a stream of typed chunks.

abstract class VoiceStreamAdapter {
  Stream<VoiceStreamChunk> streamResponse({
    required String message,
    required Map<String, dynamic> context,
  });
}

A VoiceStreamChunk carries a type (narrative, toolCall, progress, done, error) and a payload. The manager uses the type to decide what to do: narrative text gets buffered into sentences and sent to TTS, tool calls trigger spoken announcements, errors get surfaced to the UI.

This is the seam that lets the voice layer work with any backend. In HomeGuild, our AgentStreamingService implements VoiceStreamAdapter and maps our internal SSE streaming protocol to the generic interface. Someone building a chatbot could implement it with a simple HTTP POST that yields a single narrative chunk and a done chunk.
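For the chatbot case, a sketch might look like the following. The article names the chunk types but not the class layout, so VoiceChunkType and the VoiceStreamChunk constructor here are assumptions, and ask is a hypothetical stand-in for your one HTTP POST.

```dart
import 'dart:async';

// Assumed shapes, based on the chunk types listed above.
enum VoiceChunkType { narrative, toolCall, progress, done, error }

class VoiceStreamChunk {
  const VoiceStreamChunk({required this.type, this.payload});
  final VoiceChunkType type;
  final Object? payload;
}

// The interface as shown in the article.
abstract class VoiceStreamAdapter {
  Stream<VoiceStreamChunk> streamResponse({
    required String message,
    required Map<String, dynamic> context,
  });
}

// Hypothetical: one request to the backend that returns the full reply text.
typedef AskBackend = Future<String> Function(
    String message, Map<String, dynamic> context);

class SingleShotAdapter implements VoiceStreamAdapter {
  SingleShotAdapter(this.ask);

  final AskBackend ask;

  @override
  Stream<VoiceStreamChunk> streamResponse({
    required String message,
    required Map<String, dynamic> context,
  }) async* {
    try {
      final reply = await ask(message, context);
      // The whole reply arrives as a single narrative chunk...
      yield VoiceStreamChunk(type: VoiceChunkType.narrative, payload: reply);
      // ...followed by the end-of-response marker.
      yield const VoiceStreamChunk(type: VoiceChunkType.done);
    } catch (e) {
      // Surface failures as a chunk instead of a stream error.
      yield VoiceStreamChunk(type: VoiceChunkType.error, payload: e.toString());
    }
  }
}
```

Injecting ask as a function keeps the adapter independent of any particular HTTP client, so you can plug in whichever one your app already uses.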

The orchestrator

VoiceConversationManager ties it all together with a state machine:

idle → listening → processing → speaking → idle

When the user starts talking, the manager transitions to listening and feeds audio to the STT provider. When a final transcript arrives, it cleans filler words, transitions to processing, and sends the text through the stream adapter. As narrative chunks arrive, it buffers them into sentences and queues them for TTS. When the last sentence finishes playing, it transitions back to idle (or listening, if continuous mode is on).
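The sentence buffering step is easy to underestimate, so here is a minimal sketch of one way to do it. SentenceBuffer is a hypothetical helper, and the boundary rule (terminal punctuation followed by whitespace) is a simplification we chose for illustration; the manager's real rule may differ.

```dart
// Hypothetical helper: accumulates streamed text and releases whole sentences.
class SentenceBuffer {
  final _pending = StringBuffer();

  // A sentence ends at ., ! or ? followed by whitespace. Requiring the
  // whitespace keeps "3.5" intact, but abbreviations like "Dr. Smith"
  // will still be split.
  static final _boundary = RegExp(r'[.!?]+\s+');

  /// Adds a chunk and returns every sentence that is now complete.
  List<String> add(String chunk) {
    _pending.write(chunk);
    var text = _pending.toString();
    final sentences = <String>[];
    var match = _boundary.firstMatch(text);
    while (match != null) {
      sentences.add(text.substring(0, match.end).trim());
      text = text.substring(match.end);
      match = _boundary.firstMatch(text);
    }
    // Keep the unfinished tail for the next chunk.
    _pending
      ..clear()
      ..write(text);
    return sentences;
  }

  /// Returns whatever is left when the stream ends, or null if nothing is.
  String? flush() {
    final rest = _pending.toString().trim();
    _pending.clear();
    return rest.isEmpty ? null : rest;
  }
}
```

Call add() for every narrative chunk and hand each returned sentence to the TTS provider, then call flush() when the done chunk arrives so a final sentence without trailing whitespace is not lost.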

Interruptions are the interesting part. If the user starts speaking while the manager is in the speaking state, it calls stopAndClear() on the TTS provider, drops back to listening, and picks up the new input. No state corruption, no orphaned audio.
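A transition table makes the interruption path explicit. This is a sketch of the idea rather than the manager's actual code: VoiceStateMachine, allowedTransitions, and the stopTts callback are hypothetical names, and the table contains only the edges described in this post.

```dart
enum VoiceState { idle, listening, processing, speaking }

// Edges from the article: the main loop, plus speaking -> listening for
// interruptions and continuous mode.
const allowedTransitions = <VoiceState, Set<VoiceState>>{
  VoiceState.idle: {VoiceState.listening},
  VoiceState.listening: {VoiceState.processing},
  VoiceState.processing: {VoiceState.speaking},
  VoiceState.speaking: {VoiceState.idle, VoiceState.listening},
};

class VoiceStateMachine {
  VoiceStateMachine({required this.stopTts});

  /// Called on interruption; in the SDK this would be TTSProvider.stopAndClear.
  final Future<void> Function() stopTts;

  VoiceState state = VoiceState.idle;

  /// Moves to [next], rejecting any edge that is not in the table.
  void transition(VoiceState next) {
    if (!allowedTransitions[state]!.contains(next)) {
      throw StateError('Illegal transition: $state -> $next');
    }
    state = next;
  }

  /// Call when voice activity from the user is detected.
  Future<void> onUserSpeechStarted() async {
    if (state == VoiceState.speaking) {
      // Cut playback and drop queued sentences before listening again.
      await stopTts();
      transition(VoiceState.listening);
    } else if (state == VoiceState.idle) {
      transition(VoiceState.listening);
    }
    // In listening or processing, new speech is ignored in this sketch.
  }
}
```

Rejecting illegal edges with a StateError is a design choice for the sketch: it turns state corruption into a loud failure during development instead of a silent bug in the field.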

What had to change for extraction

Riverpod removal

The original code was wired with Riverpod providers everywhere. The open-source packages export plain classes. You construct them with explicit parameters and wire them into whatever DI system you use. Inside HomeGuild Pro, we still use Riverpod, but the providers now just create and expose instances of the plain classes:

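final sttProviderPod = Provider<STTProvider>((ref) {
  // watch (not read) so the provider is rebuilt if the config changes
  final config = ref.watch(appConfigProvider);
  final stt = DeepgramSTTProvider(apiKey: config.deepgramApiKey);
  // Release the provider's resources when Riverpod disposes it
  ref.onDispose(stt.dispose);
  return stt;
});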

Logger swap

Internally we use a StructuredLogger that ships logs directly to OpenSearch. We replaced all logging callsites with the standard Dart logging package:

static final Logger _logger = Logger('VoiceConversationManager');

_logger.info('State transition: $oldState → $newState');
_logger.severe('STT connection failed: $e', e, stackTrace);

Consumers attach whatever handler they want. Our internal code attaches the OpenSearch handler.

Configurable tool announcements

Originally a hardcoded map inside the manager. Now an injectable parameter:

final manager = VoiceConversationManager(
  sttProvider: sttProvider,
  ttsProvider: ttsProvider,
  streamAdapter: adapter,
  toolAnnouncements: {
    'search': ['Looking that up...', 'Let me check...'],
    'create': ['On it...'],
  },
);
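How the manager turns that map into speech is not shown here, but the lookup could be as small as the sketch below. pickAnnouncement is a hypothetical helper, and matching the tool name exactly against the map keys is an assumption; the real manager might match prefixes or categories instead.

```dart
import 'dart:math';

/// Hypothetical helper: picks one spoken phrase for a tool call, or returns
/// null when nothing is configured for that tool.
String? pickAnnouncement(
  Map<String, List<String>> toolAnnouncements,
  String toolName, {
  Random? random,
}) {
  final phrases = toolAnnouncements[toolName];
  // Unknown tool or empty list: stay silent rather than guess.
  if (phrases == null || phrases.isEmpty) return null;
  // Vary the phrase so repeated tool calls do not sound robotic.
  return phrases[(random ?? Random()).nextInt(phrases.length)];
}
```

Passing a seeded Random makes the choice reproducible in tests, while production code can omit it.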

Configurable filler words

final manager = VoiceConversationManager(
  sttProvider: sttProvider,
  ttsProvider: ttsProvider,
  streamAdapter: adapter,
  fillerWords: ['um', 'uh', 'like', 'you know', 'basically'],
);

Default is a reasonable English set. For Spanish, pass ['este', 'eh', 'o sea', 'pues']. For no cleaning, pass an empty list.

Filler word cleaning matters more than you'd expect. Without it, the backend gets "um so uh can you like look up uh the Johnson estimate you know the one from like Tuesday." Strip the filler and it becomes "can you look up the Johnson estimate the one from Tuesday." Eight of the nineteen words were noise, and the LLM no longer has to read past them.
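A cleaning pass like this can be sketched with a single regular expression. cleanFillers is a hypothetical helper, not the package's API, and the list in the usage note adds "so" so that it reproduces the Johnson example above.

```dart
/// Hypothetical helper: removes filler words, matching whole words or
/// phrases only and ignoring case.
String cleanFillers(String transcript, List<String> fillerWords) {
  if (fillerWords.isEmpty) return transcript; // empty list disables cleaning

  // Longest first, so multi-word fillers such as "you know" are tried
  // before shorter alternatives.
  final sorted = [...fillerWords]..sort((a, b) => b.length.compareTo(a.length));

  // \b keeps "so" from matching inside "Johnson" and "um" inside "album".
  final pattern = RegExp(
    r'\b(?:' + sorted.map(RegExp.escape).join('|') + r')\b',
    caseSensitive: false,
  );

  return transcript
      .replaceAll(pattern, ' ')
      .replaceAll(RegExp(r'\s+'), ' ') // collapse the gaps left behind
      .trim();
}
```

Called with ['um', 'uh', 'like', 'you know', 'so'], this turns the raw transcript from the example into "can you look up the Johnson estimate the one from Tuesday". The sketch is deliberately blunt: it also strips legitimate uses such as "I like that one", and it leaves behind any punctuation that was attached to a filler, so a production version would want more context than a word list.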

VoiceStreamAdapter bridge

class AgentStreamingService implements VoiceStreamAdapter {
  @override
  Stream<VoiceStreamChunk> streamResponse({
    required String message,
    required Map<String, dynamic> context,
  }) async* {
    await for (final event in _internalStream(message, context)) {
      yield VoiceStreamChunk(
        type: _mapEventType(event.type),
        payload: event.data,
      );
    }
  }
}

Fifteen minutes of work. The abstraction was already right; we just had to make it explicit.

The pub.dev publishing workflow

Three packages under the homeguild.ai verified publisher:

homeguild_voice_kit — core orchestration, state machine, widget library, abstract interfaces
homeguild_voice_deepgram — Deepgram STT provider
homeguild_voice_openai_tts — OpenAI TTS provider

Publish core first. Provider packages depend on it:

dependencies:
  homeguild_voice_kit: ^0.1.0
  web_socket_channel: ^3.0.0

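Always run the dry-run before publishing. You cannot unpublish a version; the most pub.dev offers is retracting it within seven days of publishing, and a retracted version is hidden from new dependency resolutions rather than deleted.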

flutter pub publish --dry-run

Using the package

final manager = VoiceConversationManager(
  sttProvider: DeepgramSTTProvider(apiKey: deepgramKey),
  ttsProvider: OpenAITTSProvider(apiKey: openaiKey, voice: 'nova'),
  streamAdapter: MyBackendAdapter(),
);

// Subscribe before starting so no early state change or transcript is missed.
manager.stateStream.listen((state) => updateUI(state));
manager.transcriptStream.listen((t) => showTranscript(t.text, isInterim: t.isInterim));

await manager.startConversation();

// Later, when the screen or session is torn down:
manager.dispose();

Your MyBackendAdapter connects to your own backend. In practice you'd stream with SSE or WebSockets and yield chunks as they arrive. The voice layer doesn't care how your backend works.

What we learned

Extracting from a production app is harder than writing from scratch. Every implicit coupling becomes visible only when you try to remove it.

Abstract interfaces should be designed for the consumer, not the implementation. Our first draft of VoiceStreamAdapter exposed too much internal protocol. The final version has one method. Everything else is the implementer's problem.

Publishing order matters, and you can't take it back. Run --dry-run. Fix the warnings. Then publish.

The packages are live on pub.dev under the homeguild.ai verified publisher.