Voice AI: Speech-to-Text on Apple Silicon — Talk Notes

Apr 6, 2026

Reference links for my voice AI talk at the AI Engineer April 2026 Meetup in Minneapolis.

Cloud STT Providers

Start here. Get an API key and have something working in ten minutes.

Deepgram — WebSocket streaming, VAD and diarization included
AssemblyAI — strong diarization and summarization features
OpenAI Audio API — Whisper endpoint plus TTS in one place

Open Source Models

Whisper — OpenAI’s open source model; the benchmark everything else is measured against
Parakeet — NVIDIA’s streaming-capable ASR model collection; CC-BY-4.0
Moonshine — small, fast model designed for edge devices; runs on ONNX Runtime; weights on Hugging Face

On-Device / Apple Silicon

Whisper.cpp — C++ port of Whisper; runs on CPU, no GPU required
WhisperKit — Whisper optimized for Apple Silicon via Core ML; open source
FluidAudio — open source Swift SDK; runs Parakeet on the Apple Neural Engine; includes VAD and diarization; iOS 17+ / macOS 14+
Argmax — commercial SDK for Mac and iOS; Parakeet and Whisper on Core ML; WebSocket API mirrors Deepgram so you can swap cloud for on-device in one line of code
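
The one-line swap described in the Argmax entry can be pictured as a client whose only configurable part is the base URL. This is a minimal sketch, not either vendor's SDK: `build_listen_url` and `ON_DEVICE_BASE` are hypothetical names, the localhost port is made up for illustration, and the query parameters are assumed to be accepted by both backends.

```python
from urllib.parse import urlencode

# Deepgram's hosted streaming endpoint (assumed here; check the current docs).
CLOUD_BASE = "wss://api.deepgram.com/v1/listen"
# Hypothetical address for a local, Deepgram-compatible on-device server.
ON_DEVICE_BASE = "ws://localhost:8080/v1/listen"


def build_listen_url(base: str, **params: object) -> str:
    """Return a streaming URL; only `base` differs between cloud and on-device."""
    # Booleans are lowercased because query strings conventionally use "true"/"false".
    query = urlencode(
        {k: str(v).lower() if isinstance(v, bool) else v for k, v in sorted(params.items())}
    )
    return f"{base}?{query}" if query else base


# The "one line" to change: swap CLOUD_BASE for ON_DEVICE_BASE.
BASE = CLOUD_BASE
url = build_listen_url(BASE, diarize=True, interim_results=True)
```

Keeping the base URL in a single constant means the rest of the client (audio capture, message parsing) would not need to know which backend it is talking to, provided the two really do speak the same protocol.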

Text-to-Speech

Not the focus of this talk, but most real apps use both directions.

ElevenLabs — high-quality cloud TTS
OpenAI Audio API — TTS and STT in one API (see above)

Voice Interface Products

Wispr Flow — voice dictation for Mac/Windows/iOS; works in any text field; requires accessibility access
Willow — open source, self-hosted voice assistant; GitHub

Concepts

Core ML — Apple’s on-device ML framework; routes inference to CPU, GPU, or ANE depending on model and device
Silero VAD — widely used open source Voice Activity Detection model; runs on ONNX Runtime
pyannote.audio — the standard open source library for speaker diarization
ONNX — open standard for ML model interoperability across runtimes
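
To make the VAD idea concrete: a detector takes short audio frames in and returns a speech / no-speech flag per frame. Silero VAD does this with a neural network. The sketch below swaps in a much simpler energy threshold purely to show the shape of the problem. `frame_rms`, `energy_vad`, the 160-sample frame length, and the 0.02 threshold are all illustrative choices, not anything taken from Silero.

```python
import math
from typing import List


def frame_rms(samples: List[float], frame_len: int) -> List[float]:
    """Root-mean-square energy of each full, non-overlapping frame."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


def energy_vad(samples: List[float], frame_len: int = 160, threshold: float = 0.02) -> List[bool]:
    """Mark a frame as speech when its RMS energy exceeds `threshold`."""
    return [e > threshold for e in frame_rms(samples, frame_len)]


# Toy signal at an assumed 8 kHz: 20 ms of silence, 20 ms of a 440 Hz tone, 20 ms of silence.
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(160)]
flags = energy_vad(silence + tone + silence)
print(flags)  # [False, True, False]
```

An energy threshold like this breaks down quickly with background noise, which is a large part of why a trained model is the usual choice in practice.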

Managed Inference / Hosting

Baseten — host specialized models on GPU infrastructure; more control than cloud APIs, less ops than self-hosting

Products Referenced

Notion AI Meeting Notes — shipped in 2025; combines transcription with LLM summarization
Anthropic Claude — added device-level voice control in 2026