Reference links for my voice AI talk at AI Engineer April 2026 Meetup in Minneapolis.
Cloud STT Providers

Start here. Get an API key and have something working in ten minutes.

  • Deepgram — WebSocket streaming, VAD and diarization included
  • AssemblyAI — strong diarization and summarization features
  • OpenAI Audio API — Whisper endpoint plus TTS in one place
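Streaming providers like Deepgram take small, steady PCM chunks over a WebSocket rather than one big upload. A minimal sketch of just the framing step (the chunk duration and sizes here are common defaults, not any provider's required values):

```python
def pcm_chunks(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20,
               sample_width: int = 2) -> list[bytes]:
    """Slice raw mono PCM into fixed-duration frames for streaming STT.

    Streaming APIs generally want a steady feed of 20-100 ms chunks
    rather than the whole recording at once.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * sample_width  # 640 bytes at 16 kHz / 20 ms
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]

# One second of 16-bit mono audio at 16 kHz -> fifty 20 ms frames.
frames = pcm_chunks(b"\x00" * 32000)
```

Each frame would then be sent over the open WebSocket; see the provider's docs for the exact endpoint URL and auth header.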

Open Source Models

  • Whisper — OpenAI’s open source model; the benchmark everything else is measured against
  • Parakeet — NVIDIA’s streaming-capable ASR model collection; CC-BY-4.0
  • Moonshine — small, fast model designed for edge devices; runs on ONNX; weights on Hugging Face
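When models are "measured against" Whisper, the usual yardstick is word error rate (WER). A self-contained sketch of the standard word-level Levenshtein computation, not tied to any particular toolkit:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat on the mat", "the cat sat on a mat")` is 1/6: one substitution over six reference words.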

On-Device / Apple Silicon

  • Whisper.cpp — C++ port of Whisper; runs on CPU, no GPU required
  • WhisperKit — Whisper optimized for Apple Silicon via Core ML; open source
  • FluidAudio — open source Swift SDK; runs Parakeet on the Apple Neural Engine; includes VAD and diarization; iOS 17+ / macOS 14+
  • Argmax — commercial SDK for Mac and iOS; Parakeet and Whisper on Core ML; WebSocket API mirrors Deepgram so you can swap cloud for on-device in one line of code
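The "swap in one line" claim works because a local server that mirrors a cloud provider's WebSocket API reduces the switch to a config change. A sketch of the idea (the local URL is hypothetical; `wss://api.deepgram.com/v1/listen` is Deepgram's documented streaming endpoint, but verify against current docs):

```python
# Which endpoint the rest of the app streams audio to is just configuration.
STT_ENDPOINTS = {
    "cloud": "wss://api.deepgram.com/v1/listen",   # hosted
    "local": "ws://localhost:8765/v1/listen",      # hypothetical on-device server
}

def stt_url(provider: str) -> str:
    """Return the WebSocket URL for the chosen STT backend."""
    return STT_ENDPOINTS[provider]
```

Everything downstream (audio framing, message handling) stays identical; only the URL changes.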

Text-to-Speech

Not the focus of this talk, but most real voice apps need both directions: speech-to-text in, text-to-speech out.
Voice Interface Products

  • Wispr Flow — voice dictation for Mac/Windows/iOS; works in any text field; requires accessibility access
  • Willow — open source, self-hosted voice assistant; GitHub

Concepts

  • Core ML — Apple’s on-device ML framework; routes inference to CPU, GPU, or ANE depending on model and device
  • Silero VAD — widely used open source Voice Activity Detection model; runs on ONNX
  • pyannote.audio — the standard open source library for speaker diarization
  • ONNX — open standard for ML model interoperability across runtimes
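Silero VAD is a learned model, but every VAD exposes the same contract: a short audio frame in, a speech/no-speech decision out. A toy energy-gate version, purely to illustrate that interface (real detectors like Silero use a neural net and are far more robust to noise):

```python
def is_speech(frame: list[int], threshold: float = 500.0) -> bool:
    """Toy VAD: flag a frame as speech when its RMS energy crosses a threshold.

    `frame` holds 16-bit PCM samples. Production VADs replace this energy
    heuristic with a trained model but keep the frame-in / decision-out shape.
    """
    if not frame:
        return False
    rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
    return rms > threshold
```

A downstream pipeline typically gates the STT stream on this decision, sending only speech frames to the transcriber.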

Managed Inference / Hosting

  • Baseten — host specialized models on GPU infrastructure; more control than cloud APIs, less ops than self-hosting

Products Referenced