Voice AI: Speech-to-Text on Apple Silicon — Talk Notes
Reference links for my voice AI talk at the AI Engineer April 2026 Meetup in Minneapolis.
Cloud STT Providers
Start here. Get an API key and have something working in ten minutes.
- Deepgram — WebSocket streaming, VAD and diarization included
- AssemblyAI — strong diarization and summarization features
- OpenAI Audio API — Whisper endpoint plus TTS in one place
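As a rough illustration of the "ten minutes" claim, here is a minimal sketch that sends an audio file to the OpenAI Audio API using only the Python standard library. The endpoint path and the `whisper-1` model name reflect OpenAI's public API as I understand it; the helper names, the `application/octet-stream` content type, and the sample file name are assumptions made for this example, not something from the talk.

```python
import json
import os
import uuid
import urllib.request

API_URL = "https://api.openai.com/v1/audio/transcriptions"


def build_transcription_request(audio_bytes, filename, api_key, model="whisper-1"):
    """Build (but do not send) a multipart/form-data POST for transcription."""
    boundary = uuid.uuid4().hex
    # Form field carrying the model name.
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    # Header of the form field carrying the audio file itself.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + file_header + audio_bytes + closing
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


def transcribe(path, api_key):
    """Send the file at `path` and return the transcript text (needs network access)."""
    with open(path, "rb") as f:
        req = build_transcription_request(f.read(), os.path.basename(path), api_key)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]


# Example call (hypothetical file name; requires a real API key):
# print(transcribe("sample.wav", os.environ["OPENAI_API_KEY"]))
```

In practice the provider SDKs wrap this in a single call; the raw request is shown only to make clear how little is involved.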
Open Source Models
- Whisper — OpenAI’s open source model; the benchmark everything else is measured against
- Parakeet — NVIDIA’s streaming-capable ASR model collection; CC-BY-4.0
- Moonshine — small, fast model designed for edge devices; runs via ONNX Runtime; weights on Hugging Face
On-Device / Apple Silicon
- whisper.cpp — C/C++ port of Whisper; runs on CPU, no GPU required
- WhisperKit — Whisper optimized for Apple Silicon via Core ML; open source
- FluidAudio — open source Swift SDK; runs Parakeet on the Apple Neural Engine; includes VAD and diarization; iOS 17+ / macOS 14+
- Argmax — commercial SDK for Mac and iOS; Parakeet and Whisper on Core ML; WebSocket API mirrors Deepgram so you can swap cloud for on-device in one line of code
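The "swap in one line" idea from the Argmax entry can be sketched as follows, assuming the on-device server accepts the same path and query parameters as Deepgram's streaming endpoint. The Deepgram URL and the `encoding` / `sample_rate` parameters are from Deepgram's public API as I understand it; the local host, port, and the `listen_url` helper are hypothetical placeholders, not Argmax's documented values.

```python
from urllib.parse import urlencode

# Cloud endpoint for Deepgram's streaming API.
CLOUD_BASE = "wss://api.deepgram.com/v1/listen"
# Hypothetical address of a local, Deepgram-compatible server (assumption).
LOCAL_BASE = "ws://localhost:8080/v1/listen"


def listen_url(base, **params):
    """Return the streaming URL for `base` with the given query parameters."""
    return f"{base}?{urlencode(params)}" if params else base


# The one line that changes between cloud and on-device:
BASE = CLOUD_BASE  # or LOCAL_BASE

url = listen_url(BASE, encoding="linear16", sample_rate=16000)
print(url)
```

Everything downstream of `url` (opening the WebSocket, sending audio frames, reading transcripts) would stay the same under this assumption, which is what makes the swap cheap.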
Text-to-Speech
Not the focus of this talk, but most real apps use both directions.
- ElevenLabs — high-quality cloud TTS
- OpenAI Audio API — TTS and STT under one API (see above)
Voice Interface Products
- Wispr Flow — voice dictation for Mac/Windows/iOS; works in any text field; requires accessibility access
- Willow — open source, self-hosted voice assistant; source on GitHub
Concepts
- Core ML — Apple’s on-device ML framework; routes inference to CPU, GPU, or ANE depending on model and device
- Silero VAD — widely used open source Voice Activity Detection model; runs via ONNX Runtime
- pyannote.audio — the standard open source library for speaker diarization
- ONNX — open standard for ML model interoperability across runtimes
Managed Inference / Hosting
- Baseten — host specialized models on GPU infrastructure; more control than cloud APIs, less ops than self-hosting
Products Referenced
- Notion AI Meeting Notes — shipped in 2025; combines transcription with LLM summarization
- Anthropic Claude — added device-level voice control in 2026