5
29 lessons

NLP: Foundations to Advanced

Language is the interface to intelligence.

01

Text Processing: Tokenization, Stemming, Lemmatization

Build
Python

Language is continuous. Models are discrete. Preprocessing is the bridge.

02

Bag of Words, TF-IDF & Text Representation

Build
Python

Count first, think later. TF-IDF still beats embeddings on well-defined tasks in 2026.

03

Word Embeddings: Word2Vec from Scratch

Build
Python

A word is the company it keeps. Train a shallow net on that idea and geometry falls out.

04

GloVe, FastText & Subword Embeddings

Build
Python

Word2Vec trained one embedding per word. GloVe factorized the co-occurrence matrix. FastText embedded the pieces. BPE bridged to transformers.

05

Sentiment Analysis

Build
Python

The canonical NLP task. Most of what you need to know about classical text classification shows up here.

06

Named Entity Recognition (NER)

Build
Python

Pull the names out. Sounds easy until you deal with ambiguous boundaries, nested entities, and domain jargon.

07

POS Tagging & Syntactic Parsing

Build
Python

Grammar was unfashionable for a while. Then every LLM pipeline needed to validate structured extraction, and it came back.

08

Text Classification — CNNs & RNNs for Text

Build
Python

Convolutions learn n-grams. Recurrences remember. Both are superseded by attention. Both still matter on constrained hardware.

09

Sequence-to-Sequence Models

Build
Python

Two RNNs pretending to be a translator. The bottleneck they hit is the reason attention exists.

10

Attention Mechanism — The Breakthrough

Build
Python

The decoder stops squinting at a compressed summary and starts looking at the whole source. Everything after this is attention plus engineering.

11

Machine Translation

Build
Python

Translation is the task that paid for NLP research for thirty years and keeps paying now.

12

Text Summarization

Build
Python

Extractive systems tell you what the document said. Abstractive systems tell you what the author meant. Different tasks, different pitfalls.

13

Question Answering Systems

Build
Python

Three systems shaped modern QA. Extractive found spans. Retrieval-augmented grounded them in documents. Generative produced answers. Every modern AI assistant is a mix of the th…

14

Information Retrieval & Search

Build
Python

BM25 is precise but brittle. Dense casts a wide net but misses keywords. Hybrid is the 2026 default. Everything else is tuning.

15

Topic Modeling: LDA, BERTopic

Build
Python

LDA: documents are mixtures of topics, topics are distributions over words. BERTopic: documents cluster in embedding space, clusters are topics. Same goal, different decompositi…

16

Text Generation

Build
Python

If a word is surprising, the model is bad. Perplexity makes surprise a number. Smoothing keeps it finite.
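
A minimal sketch of that idea, assuming a toy unigram model with add-alpha smoothing and a closed vocabulary known in advance (the corpus, the helper names, and the choice of a unigram model are illustrative assumptions, not the lesson's own code). Without smoothing, one unseen word gives probability zero and infinite perplexity; with smoothing, the number stays finite.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count token frequencies; returns (counts, total number of tokens)."""
    return Counter(tokens), len(tokens)


def perplexity(test_tokens, counts, total, vocab_size, alpha=1.0):
    """Perplexity of a unigram model with add-alpha smoothing.

    alpha=0 means no smoothing: any unseen token yields infinite perplexity.
    """
    log_prob = 0.0
    for tok in test_tokens:
        p = (counts.get(tok, 0) + alpha) / (total + alpha * vocab_size)
        if p == 0.0:
            return math.inf  # the model assigned zero probability to this token
        log_prob += math.log(p)
    # exp of the average negative log-likelihood per token
    return math.exp(-log_prob / len(test_tokens))


# Hypothetical toy data; "dog" never appears in training.
train = "the cat sat on the mat".split()
test = "the dog sat".split()

counts, total = train_unigram(train)
vocab = set(train) | set(test)  # assumption: closed vocabulary

ppl_unsmoothed = perplexity(test, counts, total, len(vocab), alpha=0.0)
ppl_smoothed = perplexity(test, counts, total, len(vocab), alpha=1.0)

print(ppl_unsmoothed)  # infinite: "dog" has probability 0
print(ppl_smoothed)    # finite once every token gets some probability mass
```

Add-alpha is the simplest smoothing scheme and is used here only because it fits in a few lines; the same perplexity formula applies to n-gram or neural models.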

17

Chatbots: Rule-Based to Neural

Build
Python

ELIZA replied with pattern matches. DialogFlow mapped intents. GPT answered from weights. Claude runs tools and verifies. Each era solved the previous one's worst failure.

18

Multilingual NLP

Build
Python

One model, 100+ languages, zero labeled task data for most of them. Cross-lingual transfer is the practical miracle of the 2020s.

19

Subword Tokenization: BPE, WordPiece, Unigram, SentencePiece

Learn
Python

Word tokenizers choke on unseen words. Character tokenizers blow up sequence length. Subword tokenizers split the difference. Every modern LLM ships on one.

20

Structured Outputs & Constrained Decoding

Build
Python

Ask an LLM for JSON. Get JSON most of the time. In production, "most" is the problem. Constrained decoding turns "most" into "always" by editing the logits before sampling.
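
A minimal sketch of "editing the logits before sampling", assuming a six-token toy vocabulary, made-up logits, and a hand-written set of tokens that are valid at the current position (all hypothetical; real systems derive the allowed set from a grammar or JSON schema at every step). Disallowed tokens get a logit of negative infinity, so they receive probability zero after the softmax.

```python
import math


def mask_logits(logits, allowed_ids):
    """Set every disallowed token's logit to -inf so its probability becomes 0."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; assumes at least one finite logit."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical vocabulary and raw model scores for the first output token.
vocab = ["{", "}", '"name"', ":", "hello", "Sure!"]
logits = [1.0, 0.5, 0.2, 0.1, 0.3, 3.0]  # unconstrained, "Sure!" would win

# Assumption: a JSON object must start with "{", so only token 0 is allowed.
allowed = {0}

probs = softmax(mask_logits(logits, allowed))
choice = vocab[probs.index(max(probs))]  # greedy pick among allowed tokens

print(choice)
```

The model's preferences among the allowed tokens are left untouched; the mask only removes continuations that would break the format.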

21

NLI & Textual Entailment

Learn
Python

"t entails h" means a human reading t would conclude h is true. NLI is the task of predicting entailment / contradiction / neutral. Boring on the surface, load-bearing in produc…

22

Embedding Models Deep Dive

Learn
Python

Word2Vec gave you a vector per word. Modern embedding models give you a vector per passage, cross-lingual, with sparse, dense, and multi-vector views, sized to fit your index. P…

23

Chunking Strategies for RAG

Build
Python

Chunking configuration influences retrieval quality as much as the choice of embedding model (Vectara NAACL 2025). Get chunking wrong and no amount of reranking saves you.

24

Coreference Resolution

Learn
Python

"She called him. He did not answer. The doctor was at lunch." Three references to two people and nobody is named. Coreference resolution figures out who is who.

25

Entity Linking & Disambiguation

Build
Python

NER found "Paris." Entity linking decides: Paris, France? Paris Hilton? Paris, Texas? Paris (the Trojan prince)? Without linking, your knowledge graph stays ambiguous.

26

Relation Extraction & Knowledge Graph Construction

Build
Python

NER found the entities. Entity linking anchored them. Relation extraction finds the edges between them. A knowledge graph is the sum of nodes, edges, and their provenance.

27

LLM Evaluation: RAGAS, DeepEval, G-Eval

Build
Python

Exact-match and F1 miss semantic equivalence. Human review does not scale. LLM-as-judge is the production answer — with enough calibration to trust the number.

28

Long-Context Evaluation: NIAH, RULER, LongBench, MRCR

Learn
Python

Gemini 3 Pro advertises 1M tokens of context. At 1M tokens, 8-needle MRCR drops to 26.3%. Advertised ≠ usable. Long-context evaluation tells you the actual capacity of the mode…

29

Dialogue State Tracking

Build
Python

"I want a cheap restaurant in the north... actually make it moderate... and add Italian." Three turns, three state updates. DST keeps the slot-value dict in sync so the booking …
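
The three turns above can be sketched as three updates to a slot-value dict. This is a minimal sketch that assumes the per-turn slot values have already been extracted from the user's text; the slot names (`price`, `area`, `food`) and the `update_state` helper are hypothetical.

```python
def update_state(state, turn_slots):
    """Return a new dialogue state where this turn's slots override older values."""
    new_state = dict(state)  # copy so earlier snapshots stay unchanged
    new_state.update(turn_slots)
    return new_state


# Assumption: an upstream NLU step produced these slot values for each turn.
turns = [
    {"price": "cheap", "area": "north"},  # "I want a cheap restaurant in the north"
    {"price": "moderate"},                # "actually make it moderate"
    {"food": "italian"},                  # "and add Italian"
]

state = {}
history = []
for turn_slots in turns:
    state = update_state(state, turn_slots)
    history.append(state)

print(state)
```

Keeping a snapshot per turn makes it easy to see that the second turn overwrote `price` while `area` carried over unchanged.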