Interviews Vector
6
17 lessons

Speech & Audio

Hear, understand, speak.

01

Audio Fundamentals: Waveforms, Sampling, FFT

Learn
Python

Waveforms are the raw signal. Spectrograms are the representation. Mel features are the ML-friendly form. Every modern ASR and TTS pipeline walks this ladder, and the first rung…
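
As a rough illustration of the waveform-to-spectrum step, the sketch below samples a pure tone and finds its frequency with a naive discrete Fourier transform. The 440 Hz tone, the 8 kHz sample rate, and the function names are assumptions chosen for the example, not values from the lesson; a real pipeline would use an FFT library rather than this O(N^2) loop.

```python
import cmath
import math


def sample_sine(freq_hz, sample_rate, n_samples):
    """Sample a unit-amplitude sine wave at times n / sample_rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def dft_magnitudes(signal):
    """Naive O(N^2) DFT; returns magnitudes for bins 0 .. N // 2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(signal))
        mags.append(abs(acc))
    return mags


# Illustrative values (assumptions): 0.1 s of a 440 Hz tone at 8 kHz.
SAMPLE_RATE = 8000
N = 800
signal = sample_sine(440.0, SAMPLE_RATE, N)
mags = dft_magnitudes(signal)

# Each bin k corresponds to k * SAMPLE_RATE / N hertz.
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N
```

With these values the bin spacing is 10 Hz, so the tone lands exactly on a bin; a tone between bins would smear across its neighbours.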

02

Spectrograms, Mel Scale & Audio Features

Build
Python

Neural nets do not consume raw waveforms well. They consume spectrograms. They consume mel spectrograms even better. Every ASR, TTS, and audio classifier in 2026 lives or dies b…
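
To make the mel scale concrete, here is a minimal sketch of the commonly used HTK-style conversion between hertz and mels, plus band edges spaced evenly in mel. The choice of 8 bands over 0 to 8000 Hz and the helper names are assumptions for illustration; audio libraries may use a slightly different mel variant.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style mel formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels, f_min, f_max):
    """Return n_mels + 2 edge frequencies (Hz), equally spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


# Illustrative values (assumptions): 8 bands over 0 .. 8000 Hz.
edges = mel_band_edges(n_mels=8, f_min=0.0, f_max=8000.0)
widths = [b - a for a, b in zip(edges, edges[1:])]
```

The band widths in hertz grow with frequency, which is the point of the scale: fine resolution at low frequencies, coarse resolution at high ones.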

03

Audio Classification

Build
Python

Everything from "dog barking vs siren" to "which language is this" is audio classification. The features are mels. The architecture moves each decade. The evaluation stays AUC, …

04

Speech Recognition (ASR)

Build
Python

Speech recognition is audio classification at every timestep, glued together by a sequence model that knows English and silence. CTC, RNN-T, and attention are the three ways to …
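
The "classification at every timestep" view can be sketched with greedy CTC decoding: pick the best label per frame, merge consecutive repeats, then drop the blank symbol. The four-symbol vocabulary and the per-frame scores below are made up for the example, and greedy decoding is only the simplest of several decoding strategies.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC: argmax per frame, collapse repeats, remove blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    decoded = []
    prev = None
    for idx in best:
        # Emit a label only when it changes and is not the blank symbol.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


# Hypothetical vocabulary and scores (assumptions for illustration).
VOCAB = ["_", "c", "a", "t"]  # index 0 is the CTC blank
frame_scores = [
    [0.10, 0.80, 0.05, 0.05],  # c
    [0.10, 0.70, 0.10, 0.10],  # c (repeat, merged)
    [0.90, 0.04, 0.03, 0.03],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.20, 0.10, 0.60, 0.10],  # a (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.00, 0.10, 0.80],  # t
]
labels = ctc_greedy_decode(frame_scores)
text = "".join(VOCAB[i] for i in labels)
```

The blank is what allows genuine double letters: the frame sequence a, blank, a decodes to two separate "a" labels, while a, a decodes to one.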

05

Whisper: Architecture & Fine-Tuning

Build
Python

Whisper is a 30-second-window transformer encoder-decoder, trained on 680k hours of multilingual weakly-supervised audio-text pairs. One architecture, multiple tasks, robust acr…

06

Speaker Recognition & Verification

Build
Python

ASR asks "what did they say?" Speaker recognition asks "who said it?" The math looks the same — embeddings plus cosine — but every production decision hinges on a single EER num…
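
A minimal sketch of the "embeddings plus cosine" idea and of the equal error rate (EER), the threshold at which false accepts and false rejects occur at the same rate. The score lists are invented for the example, and the threshold sweep is a simple approximation rather than the interpolated EER that evaluation toolkits usually report.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(genuine, impostor):
    """Sweep thresholds; return (eer, threshold) where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical cosine scores (assumptions): same-speaker vs different-speaker trials.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.5, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
```

Here one genuine trial scores below one impostor trial, so no threshold separates the two sets cleanly and the EER comes out at 25%.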

07

Text-to-Speech (TTS)

Build
Python

ASR maps speech to text; TTS runs the mapping in reverse. The 2026 stack is three parts: text → tokens, tokens → mel, mel → waveform. Each part has a default model that fits in a …

08

Voice Cloning & Voice Conversion

Build
Python

Voice cloning reads your text in someone else's voice. Voice conversion rewrites your voice into someone else's while preserving what you said. Both hang on the same decompositi…

09

Music Generation

Build
Python

2026 music generation: Suno v5 and Udio v4 dominate the commercial market; MusicGen, Stable Audio Open, and ACE-Step lead the open-source field. The technical problem is mostly solved. The legal pro…

10

Audio-Language Models

Build
Python

2026 audio-language models reason over speech + environmental sound + music. Qwen2.5-Omni-7B matches GPT-4o Audio on MMAU-Pro. Audio Flamingo Next beats Gemini 2.5 Pro on LongAu…

11

Real-Time Audio Processing

Build
Python

Batch pipelines process a file. Real-time pipelines process the next 20 milliseconds before the next 20 arrive. Every conversational AI, broadcast studio, and telephony bot live…
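
The 20-millisecond budget can be expressed as frame arithmetic plus a real-time factor check. The sketch below assumes 16 kHz audio and a hypothetical 5 ms processing time per frame; both numbers are illustrative, not taken from the lesson.

```python
def frame_size(sample_rate, frame_ms):
    """Number of samples in one frame of frame_ms milliseconds."""
    return sample_rate * frame_ms // 1000


def iter_frames(samples, size):
    """Yield consecutive non-overlapping frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


def real_time_factor(processing_s, audio_s):
    """RTF below 1.0 means processing keeps up with incoming audio."""
    return processing_s / audio_s


# Illustrative values (assumptions): 16 kHz audio, 20 ms frames.
SIZE = frame_size(16000, 20)
frames = list(iter_frames([0] * 1000, SIZE))

# Hypothetical measurement: 5 ms of compute per 20 ms frame.
rtf = real_time_factor(0.005, 0.020)
keeps_up = rtf < 1.0
```

An RTF at or above 1.0 means frames queue up faster than they are consumed, so latency grows without bound.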

12

Build a Voice Assistant Pipeline

Build
Python

Everything from lessons 01-11, stitched together. Build a voice assistant that listens, reasons, and talks back. In 2026 that is a solved engineering problem, not a research pro…

13

Neural Audio Codecs — EnCodec, SNAC, Mimi, DAC

Learn
Python

2026 audio generation is almost all tokens. EnCodec, SNAC, Mimi, and DAC turn continuous waveforms into discrete sequences that a transformer can predict. The semantic-vs-acoust…

14

Voice Activity Detection & Turn-Taking

Build
Python

Every voice agent lives or dies on two decisions: is the user speaking now, and are they done? VAD answers the first. Turn-detection (VAD + silence-hangover + semantic endpoint …
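
The two decisions can be sketched with an energy threshold for "is the user speaking now" and a silence hangover for "are they done". The threshold, the hangover length, and the energy values are assumptions for illustration; production systems typically use a learned VAD model, and the semantic endpoint stage mentioned above is not modelled here.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def vad_with_hangover(energies, threshold, hangover_frames):
    """Per-frame speech flags; speech stays on for hangover_frames after energy drops."""
    flags = []
    remaining = 0
    for energy in energies:
        if energy >= threshold:
            remaining = hangover_frames  # reset the hangover on every speech frame
            flags.append(True)
        elif remaining > 0:
            remaining -= 1               # bridge short pauses inside an utterance
            flags.append(True)
        else:
            flags.append(False)
    return flags


def end_of_turn_index(flags):
    """Index of the first non-speech frame after speech was seen, else None."""
    seen_speech = False
    for i, is_speech in enumerate(flags):
        if is_speech:
            seen_speech = True
        elif seen_speech:
            return i
    return None


# Hypothetical per-frame energies (assumptions): silence, speech, then silence.
energies = [0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.0]
flags = vad_with_hangover(energies, threshold=0.1, hangover_frames=2)
turn_end = end_of_turn_index(flags)
```

A longer hangover avoids cutting users off mid-sentence at the cost of a slower response once they have actually finished.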

15

Streaming Speech-to-Speech — Moshi, Hibiki

Learn
Python

2024-2026 redefined voice AI. Moshi ships a single model that listens and speaks simultaneously at 200 ms latency. Hibiki does speech-to-speech translation chunk-by-chunk. Both …

16

Voice Anti-Spoofing & Audio Watermarking

Build
Python

Voice cloning shipped faster than defenses. 2026 production voice systems need two things: a detector (AASIST, RawNet2) that classifies real vs fake speech, and a watermark (Aud…

17

Audio Evaluation — WER, MOS, MMAU, Leaderboards

Learn
Python

You cannot ship what you cannot measure. This lesson names the 2026 metrics for every audio task: ASR (WER, CER, RTFx), TTS (MOS, UTMOS, SECS, WER-on-ASR-round-trip), audio-lang…
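
Of the metrics listed, word error rate (WER) is the simplest to compute by hand: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below assumes whitespace tokenization and no text normalization; the sentences are invented for the example.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical transcripts (assumptions): one substitution and one deletion.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count against the hypothesis, WER can exceed 1.0 when the system outputs many more words than the reference contains.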