NLP: Foundations to Advanced

Language is continuous. Models are discrete. Preprocessing is the bridge.

Bag of Words, TF-IDF & Text Representation

Count first, think later. TF-IDF still beats embeddings on well-defined tasks in 2026.
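
As a rough illustration of "count first", here is a minimal TF-IDF sketch in plain Python. The whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the three toy documents are assumptions made for this example; library implementations typically add smoothing and normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tfidf(docs)
```

A term that appears in every document ("the") gets weight zero, while a term unique to one document ("dog") outweighs one shared by two ("cat").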

Word Embeddings: Word2Vec from Scratch

A word is the company it keeps. Train a shallow net on that idea and geometry falls out.

GloVe, FastText & Subword Embeddings

Word2Vec trained one embedding per word. GloVe factorized the co-occurrence matrix. FastText embedded the pieces. BPE bridged to transformers.

Sentiment Analysis

The canonical NLP task. Most of what you need to know about classical text classification shows up here.

Named Entity Recognition (NER)

Pull the names out. Sounds easy until you deal with ambiguous boundaries, nested entities, and domain jargon.

POS Tagging & Syntactic Parsing

Grammar was unfashionable for a while. Then every LLM pipeline needed to validate structured extraction, and it came back.

Text Classification — CNNs & RNNs for Text

Convolutions learn n-grams. Recurrences remember. Both have been superseded by attention. Both still matter on constrained hardware.

Sequence-to-Sequence Models

Two RNNs pretending to be a translator. The bottleneck they hit is the reason attention exists.

Attention Mechanism — The Breakthrough

The decoder stops squinting at a compressed summary and starts looking at the whole source. Everything after this is attention plus engineering.

Machine Translation

Translation is the task that paid for NLP research for thirty years and keeps paying now.

Text Summarization

Extractive systems tell you what the document said. Abstractive systems tell you what the author meant. Different tasks, different pitfalls.

Question Answering Systems

Three systems shaped modern QA. Extractive found spans. Retrieval-augmented grounded them in documents. Generative produced answers. Every modern AI assistant is a mix of the th…

Information Retrieval & Search

BM25 is precise but brittle. Dense casts a wide net but misses keywords. Hybrid is the 2026 default. Everything else is tuning.

Topic Modeling: LDA, BERTopic

LDA: documents are mixtures of topics, topics are distributions over words. BERTopic: documents cluster in embedding space, clusters are topics. Same goal, different decompositi…

Text Generation

If a word is surprising, the model is bad. Perplexity makes surprise a number. Smoothing keeps it finite.
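
To make "surprise as a number" concrete, the sketch below computes perplexity for a bigram model with add-alpha smoothing. The bigram order, the default `alpha=1.0`, and the toy sentences are assumptions for illustration, and the code assumes every evaluated word comes from the training vocabulary.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Collect context counts, bigram counts, and the vocabulary."""
    contexts = Counter(tokens[:-1])             # how often each word is a context
    bigrams = Counter(zip(tokens, tokens[1:]))  # how often each pair occurs
    vocab = set(tokens)
    return contexts, bigrams, vocab

def perplexity(tokens, contexts, bigrams, vocab, alpha=1.0):
    """exp of the average negative log-probability per predicted word."""
    v = len(vocab)
    log_prob = 0.0
    n = 0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-alpha smoothing: unseen bigrams keep a small nonzero probability.
        # With alpha=0 an unseen bigram would have probability 0 and
        # math.log would raise, which is the "infinite perplexity" case.
        p = (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * v)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

model = train_bigram("the cat sat on the mat".split())
ppl_seen = perplexity("the cat sat".split(), *model)
ppl_unseen = perplexity("the mat sat".split(), *model)
```

The second sequence contains a bigram never seen in training ("mat sat"), so its perplexity is higher, but smoothing keeps it finite.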

Chatbots: Rule-Based to Neural

ELIZA replied with pattern matches. Dialogflow mapped intents. GPT answered from weights. Claude runs tools and verifies. Each era solved the previous one's worst failure.

Multilingual NLP

One model, 100+ languages, zero training data for most of them. Cross-lingual transfer is the practical miracle of the 2020s.

Subword Tokenization: BPE, WordPiece, Unigram, SentencePiece

Word tokenizers choke on unseen words. Character tokenizers blow up sequence length. Subword tokenizers split the difference. Every modern LLM ships on one.
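
One way to see how a subword tokenizer "splits the difference" is a toy byte-pair-encoding learner. This is a simplified sketch, not the procedure of any particular library: it works on characters rather than bytes, has no end-of-word marker, and the function names `learn_bpe`, `merge_pair`, and `encode` are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    # Each word starts as a tuple of characters; values are word frequencies.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to a possibly unseen word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = learn_bpe(corpus, num_merges=3)
pieces = encode("slowest", merges)  # "slowest" never appears in the corpus
```

An unseen word still encodes without error: it falls back to smaller pieces, down to single characters where no merge applies.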

Structured Outputs & Constrained Decoding

Ask an LLM for JSON. Get JSON most of the time. In production, "most" is the problem. Constrained decoding turns "most" into "always" by editing the logits before sampling.
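
The logit-editing idea can be sketched with a toy example. Everything here is an assumption for illustration: the seven-token vocabulary, the `ALLOWED` table standing in for a real grammar or JSON-schema automaton, the `fake_logits` stand-in model, and greedy argmax in place of sampling.

```python
import json
import math

VOCAB = ["{", '"ok"', ":", "true", "false", "}", "hello"]
# Hypothetical grammar for {"ok": true|false}: allowed token ids at each step.
ALLOWED = [{0}, {1}, {2}, {3, 4}, {5}]

def fake_logits(step):
    # Stand-in for a model: it always prefers the invalid token "hello".
    return [0.1, 0.2, 0.3, 1.0, 0.5, 0.4, 5.0]

def mask_logits(logits, allowed_ids):
    """Set every disallowed token's logit to -inf so it gets probability 0."""
    if not allowed_ids:
        raise ValueError("grammar allows no token at this step")
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

def softmax(logits):
    m = max(logits)  # finite, because at least one token is allowed
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_greedy():
    out = []
    for step, allowed in enumerate(ALLOWED):
        probs = softmax(mask_logits(fake_logits(step), allowed))
        best = max(range(len(probs)), key=probs.__getitem__)
        out.append(VOCAB[best])
    return "".join(out)

text = constrained_greedy()
parsed = json.loads(text)  # parses because only grammar-valid tokens survive
```

Without the mask, the stand-in model would emit "hello" at every step. With it, the invalid token has probability zero, so the output is valid by construction rather than by luck.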

NLI & Textual Entailment

"t entails h" means a human reading t would conclude h is true. NLI is the task of predicting entailment / contradiction / neutral. Boring on the surface, load-bearing in produc…

Embedding Models Deep Dive

Word2Vec gave you a vector per word. Modern embedding models give you a vector per passage, cross-lingual, with sparse, dense, and multi-vector views, sized to fit your index. P…

Chunking Strategies for RAG

Chunking configuration influences retrieval quality as much as the choice of embedding model (Vectara NAACL 2025). Get chunking wrong and no amount of reranking saves you.

Coreference Resolution

"She called him. He did not answer. The doctor was at lunch." Three references to two people and nobody is named. Coreference resolution figures out who is who.

Entity Linking & Disambiguation

NER found "Paris." Entity linking decides: Paris, France? Paris Hilton? Paris, Texas? Paris (the Trojan prince)? Without linking, your knowledge graph stays ambiguous.

Relation Extraction & Knowledge Graph Construction

NER found the entities. Entity linking anchored them. Relation extraction finds the edges between them. A knowledge graph is the sum of nodes, edges, and their provenance.

LLM Evaluation: RAGAS, DeepEval, G-Eval

Exact-match and F1 miss semantic equivalence. Human review does not scale. LLM-as-judge is the production answer — with enough calibration to trust the number.

Long-Context Evaluation: NIAH, RULER, LongBench, MRCR

Gemini 3 Pro advertises 1M tokens of context. At 1M tokens, 8-needle MRCR drops to 26.3%. Advertised ≠ usable. Long-context evaluation tells you the actual capacity of the mode…

Dialogue State Tracking