From Microphone to Insight: Extracting Structured Data From Voice Notes — Talk Notes

Hey all. Here’s a summary and notes from my talk at MinneAnalytics Data Tech 2026.

Full session link here.

Thanks to everyone who came out, and for the great questions during Q&A.

Something like this has probably happened to you.

Your boss, your CEO, someone’s product lead tracks you down in the hallway. “Quick question.” The loaded kind.

“We’re transcribing all our meetings now. What are we actually doing with that data?”

That’s the question this talk is built around. And I think the answer is going to define a lot of interesting work over the next few years.

The thesis

Ambient data from everyday systems — voice, calendar, contacts, meeting notes — is being captured at a scale that didn’t exist two years ago. The interesting part is feeding that data into workflows and getting something useful out of it.

AI meeting notes products went from basically zero to everywhere in about six months in 2025. The enabling factor wasn’t just transcription. LLMs got good enough at summarization that a wall of transcript text became something actually readable. That tipped the market.

So now we have an explosion of unused audio data. This talk is a mental model for what’s possible and how to get started.

The models

Two main classes of speech-to-text: streaming and non-streaming. Non-streaming takes a full audio chunk and spits out text. Streaming sends continuous audio and returns text as words are recognized. Both have their place.
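One common way to get streaming-like behavior out of a non-streaming model is to cut the incoming audio into fixed-length chunks and transcribe each one as it fills. Here is a minimal sketch of just the chunking step. The 16 kHz sample rate and the `fixed_chunks` name are assumptions for illustration, and the seven-second default echoes the interval mentioned for Parakeet in the model list.

```python
from typing import Iterator, List

SAMPLE_RATE = 16_000  # assumed: 16 kHz mono audio


def fixed_chunks(samples: List[float], chunk_seconds: float = 7.0,
                 sample_rate: int = SAMPLE_RATE) -> Iterator[List[float]]:
    """Yield back-to-back chunks of at most chunk_seconds of audio."""
    chunk_len = int(chunk_seconds * sample_rate)
    if chunk_len < 1:
        raise ValueError("chunk must contain at least one sample")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# 20 seconds of placeholder audio becomes chunks of 7 s, 7 s, and 6 s.
audio = [0.0] * (20 * SAMPLE_RATE)
for chunk in fixed_chunks(audio):
    # A real pipeline would hand each chunk to a speech-to-text model here.
    print(len(chunk) / SAMPLE_RATE, "seconds")
```

Fixed boundaries can land in the middle of a word, so real pipelines often overlap chunks or cut on detected silence instead.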

Three models worth knowing:

Whisper — OpenAI’s open source model. The OG. May not have the best accuracy anymore, but it’s the baseline everything else is measured against. Thirty-plus language support, detects the spoken language automatically. If you don’t know where to start, start here.

Parakeet — NVIDIA’s streaming-capable ASR model. Faster than Whisper, still accurate. Not the same language coverage, but popular for English and European languages. I was running it during this talk, chunking in roughly seven-second intervals.

Moonshine — designed for edge and small devices. Wearables, appliances, anything resource-constrained. Trade-off: if you want a different language, you need a different model. Can’t just swap a config.

All three run on-device. You don’t need the cloud for any of them.

Voice Activity Detection

Don’t send silence to your model. Sounds obvious. Matters a lot.

Burning tokens on dead air is expensive. Worse, models hallucinate when you throw garbage at them. Send pure silence to a speech-to-text model and it will confidently spit out text. Nonsense text. Not a great look.

Silero VAD is the widely used open source option. It signals when speech starts and stops, filters out ambient noise, and keeps your inputs clean. Use it.
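To show the shape of what a VAD hands you, here is a deliberately crude sketch that marks speech wherever short-frame energy crosses a threshold. This is a plain RMS energy gate, not Silero VAD, which is a trained model and far more robust to background noise. The function names, the 20 ms frame size, and the 0.02 threshold are all assumptions picked for illustration.

```python
import math
from typing import List, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_regions(samples: Sequence[float], sample_rate: int = 16_000,
                   frame_ms: int = 20, threshold: float = 0.02
                   ) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose frame energy meets the threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    regions: List[Tuple[float, float]] = []
    start = None  # start time of the region currently open, if any
    for i in range(0, len(samples), frame_len):
        active = rms(samples[i:i + frame_len]) >= threshold
        t = i / sample_rate
        if active and start is None:
            start = t                      # energy rose: open a region
        elif not active and start is not None:
            regions.append((start, t))     # energy fell: close the region
            start = None
    if start is not None:                  # audio ended while still active
        regions.append((start, len(samples) / sample_rate))
    return regions
```

Only the returned spans would be sent on to the speech-to-text model; everything else is dropped before it can cost money or trigger a hallucination.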

Diarization

Diarization is figuring out who said what. It’s important. It’s also not solved.

80–90% accuracy is roughly where things stand. Useful, but not bulletproof. And the failure modes matter. Someone at a previous talk shared a story about a doctor whose transcription app picked up the doctor describing their own cancer journey — and attributed it to the patient. That patient now has notes indicating a cancer history they don’t have. That’s not a minor bug.

Keep humans in the loop. Don’t let these systems make decisions without review. The tech is good enough to be useful. It’s not good enough to be trusted unsupervised.

pyannote.audio is the standard open source library if you want to dig in. Two-speaker setups are well-handled. A 12-person Zoom meeting is a different problem entirely.
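In practice, "who said what" usually means lining up two timelines: speaker turns from a diarizer and word timestamps from the speech-to-text model. Below is a minimal sketch of that join, with a flag for words that need a human to look at them. The `Word` and `Turn` types, the `attribute` function, and the 0.8 overlap cutoff are hypothetical. They are not pyannote.audio's API, and a real system would tune the cutoff.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time spans."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attribute(words: List[Word], turns: List[Turn], min_share: float = 0.8
              ) -> List[Tuple[str, Optional[str], bool]]:
    """Return (word, best-guess speaker or None, needs_review) per word."""
    out = []
    for w in words:
        best, best_ov = None, 0.0
        for t in turns:
            ov = overlap(w.start, w.end, t.start, t.end)
            if ov > best_ov:
                best, best_ov = t.speaker, ov
        duration = w.end - w.start
        share = best_ov / duration if duration > 0 else 0.0
        # Flag words that straddle a speaker change or fall outside every turn.
        out.append((w.text, best, share < min_share))
    return out
```

Words that come back flagged are the ones to route to a reviewer instead of writing straight into a record.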

Cloud vs. on-device

Cloud is where most commercial products live. Deepgram is the one-stop shop — WebSocket streaming, VAD, diarization, the works. Get an API key, open Claude Code, and you’ve got a working transcription app in ten minutes. OpenAI’s audio API hosts Whisper if you want more control over which model you’re running.

Downsides: you don’t control the model. If the provider goes down, you’re stuck. Latency is a real constraint: anything that has to run at the edge or offline can’t depend on a network round trip. And if you’re sending confidential client conversations to a third-party provider that may use them for training data… that’s a problem depending on your industry.

On-device is maturing fast. I transcribed this whole talk on a MacBook Air with no cloud connection, using FluidAudio running Parakeet on the Apple Neural Engine. Works well. Lawyers, hospitals, defense contractors — anyone with data sensitivity has real reasons to stay local.

WhisperKit and whisper.cpp are both solid on-device options. Argmax has a commercial SDK for Mac and iOS that runs Parakeet and Whisper on Core ML, and its WebSocket API mirrors Deepgram, so you can swap cloud for on-device in one line of code. Baseten sits in the middle: they’ll host your models on specialized hardware if you want more control than a cloud API but less ops burden than running your own infrastructure.
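To make the one-line swap concrete, here is a minimal sketch where the only thing that differs between cloud and local is the base URL. The cloud URL is assumed to be Deepgram's streaming endpoint, so check their current docs before relying on it. The local URL, its port, and the `listen_url` helper are hypothetical and not taken from any vendor's documentation.

```python
from urllib.parse import urlencode

BACKENDS = {
    # Assumed Deepgram streaming endpoint; verify against current docs.
    "cloud": "wss://api.deepgram.com/v1/listen",
    # Hypothetical address for an on-device server exposing a compatible API.
    "local": "ws://localhost:8080/v1/listen",
}

BACKEND = "local"  # the one line you would change


def listen_url(backend: str, **params: str) -> str:
    """Build the WebSocket URL for the chosen backend plus query parameters."""
    base = BACKENDS[backend]
    query = urlencode(sorted(params.items()))
    return f"{base}?{query}" if query else base


print(listen_url(BACKEND, diarize="true"))
```

Everything downstream of the URL (sending audio frames, reading transcript messages) stays the same, which is the whole appeal of a mirrored API.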

Adding value beyond the transcript

Meeting notes apps are a commodity now. Wispr Flow is great, and a lot of people love voice-to-text dictation. But if you’re building something, you need a differentiation angle.

A few:

Ambient knowledge. Pull in your wikis, Slack history, previous meeting notes from the same people. Give the model context before summarization happens. It helps with names, topics, terminology. Notion’s AI meeting notes does this well enough that it correctly spells my name — which no out-of-the-box model does by default.

Domain-specific terminology. Air traffic control, medicine, law — these fields have acronyms and terms that weren’t in the training data. If you’re building for a specialized vertical, tuning the model on that vocabulary is real differentiation.

Audio environments. Noisy environments, bad microphones, non-standard acoustics all degrade accuracy. Audio preprocessing is a lever most products ignore.

Multimodal. This is where it gets interesting. Text alone loses a lot. “I’m doing great” reads the same regardless of tone. Combine the transcript with emotion data from the audio and you’ve got something richer. Hume AI does speech emotion recognition, and an audience member mentioned they had a client already using it in production. The combination of what was said and how it was said is a much stronger signal than text alone.
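Two of these angles, ambient knowledge and multimodal signal, come down to what you put in front of the summarizer. Here is a rough sketch that assembles a prompt from a glossary of known names and terms plus transcript lines tagged with a tone label. The `build_prompt` function, the prompt wording, and the tone labels are all hypothetical. A real system would pull the glossary from its own wikis or notes and the tone from whatever emotion model it uses.

```python
from typing import List, Tuple


def build_prompt(glossary: List[str],
                 segments: List[Tuple[str, str, str]]) -> str:
    """Assemble a summarization prompt.

    glossary: names and terms the model should spell exactly as given.
    segments: (speaker, text, tone) triples, tone being a label from the audio.
    """
    lines = ["Known names and terms (spell exactly as written):"]
    lines += [f"- {term}" for term in glossary]
    lines.append("")
    lines.append("Transcript (tone detected from audio in brackets):")
    for speaker, text, tone in segments:
        lines.append(f"{speaker} [{tone}]: {text}")
    lines.append("")
    lines.append("Summarize the decisions and action items.")
    return "\n".join(lines)


prompt = build_prompt(
    glossary=["Saoirse", "Parakeet"],
    segments=[("Alex", "I'm doing great", "flat")],
)
print(prompt)
```

With the tone label in place, "I'm doing great" said flatly no longer looks identical to the same words said with enthusiasm.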

How audio gets into models

Quick note on the mechanics, since it came up: the dominant technique is the log-mel spectrogram. Audio data is dense. A full-quality WAV file is a lot of data points. A log-mel spectrogram converts that to a representation of audio energy over time, with frequency laid out on the mel scale, a perceptual pitch scale that spends more resolution on the lower frequencies where speech lives. 16 kHz is the standard sample rate because a sampled signal can only represent frequencies up to half its sample rate, and most speech content sits below 8 kHz.
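The arithmetic behind those two numbers is short enough to sketch. This uses the common 2595 * log10(1 + f / 700) mel formula, which is one of several variants in use, and it covers only the frequency scale. A full log-mel front end also needs a short-time Fourier transform and a log step, which are left out here. The function names and the choice of 80 bands are assumptions for illustration.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """Convert frequency in Hz to mels (2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2.0


def mel_band_centers(n_bands: int, sample_rate_hz: float) -> List[float]:
    """Center frequencies in Hz of bands spaced evenly on the mel scale."""
    top_mel = hz_to_mel(nyquist_hz(sample_rate_hz))
    step = top_mel / (n_bands + 1)
    return [mel_to_hz(step * (i + 1)) for i in range(n_bands)]


centers = mel_band_centers(80, 16_000)
print(nyquist_hz(16_000))              # 8000.0
print(centers[1] - centers[0])         # narrow spacing at the low end
print(centers[-1] - centers[-2])       # much wider spacing near 8 kHz
```

The bands bunch up at low frequencies and spread out toward the top, which is why the representation holds on to speech detail while staying compact.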

Getting started

If you want to try this on your own machine: Ghost Pepper is an on-device Mac transcription app that’s open source and downloadable. Run it, see what on-device STT actually feels like.

If you want to build something: Deepgram API key, Claude Code, ten minutes.

The data is being captured. The models are accessible. 2027 is when we actually start doing things with it.

Talk transcribed by FluidAudio’s Parakeet implementation. Entirely on-device from a MacBook Air microphone. Summary generated by Claude.