Text Processing: Tokenization, Stemming, Lemmatization
Language is continuous. Models are discrete. Preprocessing is the bridge.
Bag of Words, TF-IDF & Text Representation
Count first, think later. TF-IDF can still beat embeddings on narrow, well-defined tasks in 2026.
Word Embeddings: Word2Vec from Scratch
A word is the company it keeps. Train a shallow net on that idea and geometry falls out.
GloVe, FastText & Subword Embeddings
Word2Vec trained one embedding per word. GloVe factorized the co-occurrence matrix. FastText embedded the pieces. BPE bridged to transformers.
Sentiment Analysis
The canonical NLP task. Most of what you need to know about classical text classification shows up here.
Named Entity Recognition (NER)
Pull the names out. Sounds easy until you deal with ambiguous boundaries, nested entities, and domain jargon.
POS Tagging & Syntactic Parsing
Grammar was unfashionable for a while. Then every LLM pipeline needed to validate structured extraction, and it came back.
Text Classification — CNNs & RNNs for Text
Convolutions learn n-grams. Recurrences remember. Both have been largely superseded by attention. Both still matter on constrained hardware.
Sequence-to-Sequence Models
Two RNNs pretending to be a translator. The bottleneck they hit is the reason attention exists.
Attention Mechanism — The Breakthrough
The decoder stops squinting at a compressed summary and starts looking at the whole source. Everything after this is attention plus engineering.
Machine Translation
Translation is the task that paid for NLP research for thirty years and keeps paying now.
Text Summarization
Extractive systems tell you what the document said. Abstractive systems tell you what the author meant. Different tasks, different pitfalls.
Question Answering Systems
Three systems shaped modern QA. Extractive found spans. Retrieval-augmented grounded them in documents. Generative produced answers. Every modern AI assistant is a mix of the th…
Information Retrieval & Search
BM25 is precise but brittle. Dense retrieval casts a wide net but misses exact keywords. Hybrid is the 2026 default. Everything else is tuning.
Topic Modeling: LDA, BERTopic
LDA: documents are mixtures of topics, topics are distributions over words. BERTopic: documents cluster in embedding space, clusters are topics. Same goal, different decompositi…
Text Generation
If real text surprises the model, the model is bad. Perplexity makes surprise a number. Smoothing keeps it finite.
Chatbots: Rule-Based to Neural
ELIZA replied with pattern matches. Dialogflow mapped intents. GPT answered from weights. Claude runs tools and verifies. Each era solved the previous one's worst failure.
Multilingual NLP
One model, 100+ languages, zero labeled task data for most of them. Cross-lingual transfer is the practical miracle of the 2020s.
Subword Tokenization: BPE, WordPiece, Unigram, SentencePiece
Word tokenizers choke on unseen words. Character tokenizers blow up sequence length. Subword tokenizers split the difference. Every modern LLM ships on one.
Structured Outputs & Constrained Decoding
Ask an LLM for JSON. Get JSON most of the time. In production, "most" is the problem. Constrained decoding turns "most" into "always" by editing the logits before sampling.
NLI & Textual Entailment
"t entails h" means a human reading t would conclude h is true. NLI is the task of predicting entailment / contradiction / neutral. Boring on the surface, load-bearing in produc…
Embedding Models Deep Dive
Word2Vec gave you a vector per word. Modern embedding models give you a vector per passage, cross-lingual, with sparse, dense, and multi-vector views, sized to fit your index. P…
Chunking Strategies for RAG
Chunking configuration influences retrieval quality as much as the choice of embedding model (Vectara NAACL 2025). Get chunking wrong and no amount of reranking saves you.
Coreference Resolution
"She called him. He did not answer. The doctor was at lunch." Four mentions, two people, and nobody is named. Coreference resolution figures out who is who.
Entity Linking & Disambiguation
NER found "Paris." Entity linking decides: Paris, France? Paris Hilton? Paris, Texas? Paris (the Trojan prince)? Without linking, your knowledge graph stays ambiguous.
Relation Extraction & Knowledge Graph Construction
NER found the entities. Entity linking anchored them. Relation extraction finds the edges between them. A knowledge graph is the sum of nodes, edges, and their provenance.
LLM Evaluation: RAGAS, DeepEval, G-Eval
Exact-match and F1 miss semantic equivalence. Human review does not scale. LLM-as-judge is the production answer — with enough calibration to trust the number.
Long-Context Evaluation: NIAH, RULER, LongBench, MRCR
Gemini 3 Pro advertises 1M tokens of context. At 1M tokens, 8-needle MRCR drops to 26.3%. Advertised ≠ usable. Long-context evaluation tells you the actual capacity of the mode…
Dialogue State Tracking
"I want a cheap restaurant in the north... actually make it moderate... and add Italian." Three turns, three state updates. DST keeps the slot-value dict in sync so the booking …