Build Your Own Mini Framework
You have built neurons, layers, networks, backprop, activations, loss functions, optimizers, regularization, initialization, and LR schedules. All as separate pieces. Now wire them together into a framework. Not PyTorch. Not TensorFlow. Yours.
Type: Build Languages: Python Prerequisites: All of Phase 03 (Lessons 01-09) Time: ~120 minutes
Learning Objectives
- Build a complete deep learning framework (~500 lines) with Module, Linear, ReLU, Sigmoid, Dropout, BatchNorm, Sequential, loss functions, optimizers, and DataLoader
- Explain the Module abstraction (forward, backward, parameters) and why train/eval mode toggling is necessary
- Wire all components into a working training loop that trains a 4-layer network on circle classification
- Map each component of your framework to its PyTorch equivalent (nn.Module, nn.Sequential, optim.Adam, DataLoader)
The Problem
You have ten lessons of building blocks scattered across separate files. A Value class here, a training loop there, weight initialization in another file, learning rate schedules in yet another. To train a network, you copy-paste from five different lessons and wire them together by hand.
That is what frameworks solve. PyTorch gives you nn.Module, nn.Sequential, optim.Adam, DataLoader, and a training loop pattern that ties them together. TensorFlow gives you keras.Layer, keras.Sequential, keras.optimizers.Adam. These are not magic. They are organizational patterns that make it possible to define, train, and evaluate networks without reinventing the plumbing every time.
You are going to build the same thing in ~500 lines of Python. No numpy. No external dependencies. A framework that can define any feedforward network, train it with SGD or Adam, batch the data, apply dropout and batch normalization, use any activation, and schedule the learning rate.
When you finish, you will understand exactly what happens when you write model = nn.Sequential(...) in PyTorch. You will understand why model.train() and model.eval() exist. You will understand why optimizer.zero_grad() is a separate call. You will understand all of it, because you built all of it.
The Concept
The Module Abstraction
Every layer in PyTorch inherits from nn.Module. A Module has three responsibilities:
- forward() -- compute the output given inputs
- parameters() -- return all trainable weights
- backward() -- compute gradients (handled by autograd in PyTorch, explicit in ours)
A Linear layer is a Module. A ReLU activation is a Module. A dropout layer is a Module. A batch normalization layer is a Module. They all have the same interface.
Sequential Container
nn.Sequential chains Modules. Forward pass: feed data through Module 1, then Module 2, then Module 3. Backward pass: reverse the chain. The container itself is a Module -- it has forward(), parameters(), and backward(). This is the composite pattern: a sequence of Modules is itself a Module.
Training vs Evaluation Mode
Dropout randomly zeroes neurons during training but passes everything through during evaluation. Batch normalization uses batch statistics during training but running averages during evaluation. The train() and eval() methods toggle this behavior. Every Module has a training flag.
Optimizer
The optimizer updates parameters using their gradients. SGD: param -= lr * grad. Adam: maintains momentum and variance estimates, then updates. The optimizer does not know about the network architecture -- it only sees a flat list of parameters and their gradients.
DataLoader
Batching matters for two reasons. First, you cannot fit the entire dataset in memory for large problems. Second, mini-batch gradient descent provides noise that helps escape local minima. The DataLoader splits data into batches and optionally shuffles between epochs.
Framework Architecture
graph TD subgraph "Modules" Linear["Linear<br/>W*x + b"] ReLU["ReLU<br/>max(0, x)"] Sigmoid["Sigmoid<br/>1/(1+e^-x)"] Dropout["Dropout<br/>random zero mask"] BatchNorm["BatchNorm<br/>normalize activations"] end subgraph "Containers" Sequential["Sequential<br/>chains modules"] end subgraph "Loss Functions" MSE["MSELoss<br/>(pred - target)^2"] BCE["BCELoss<br/>binary cross-entropy"] end subgraph "Optimizers" SGD["SGD<br/>param -= lr * grad"] Adam["Adam<br/>adaptive moments"] end subgraph "Data" DataLoader["DataLoader<br/>batching + shuffle"] end Sequential --> |"contains"| Linear Sequential --> |"contains"| ReLU Sequential --> |"forward/backward"| MSE SGD --> |"updates"| Sequential DataLoader --> |"feeds"| Sequential
Training Loop
sequenceDiagram participant DL as DataLoader participant M as Model participant L as Loss participant O as Optimizer loop Each Epoch DL->>M: batch of inputs M->>M: forward pass (layer by layer) M->>L: predictions L->>L: compute loss L->>M: backward pass (gradients) M->>O: parameters + gradients O->>M: updated parameters O->>O: zero gradients end
Module Hierarchy
classDiagram class Module { +forward(x) +backward(grad) +parameters() +train() +eval() } class Linear { -weights -biases +forward(x) +backward(grad) } class ReLU { +forward(x) +backward(grad) } class Sequential { -modules[] +forward(x) +backward(grad) +parameters() } Module <|-- Linear Module <|-- ReLU Module <|-- Sequential Sequential *-- Module
Build It
Step 1: Module Base Class
The abstract interface that every layer implements.
class Module: def __init__(self): self.training = True def forward(self, x): raise NotImplementedError def backward(self, grad): raise NotImplementedError def parameters(self): return [] def train(self): self.training = True def eval(self): self.training = False
Step 2: Linear Layer
The fundamental building block. Stores weights and biases, computes Wx + b forward, and weight/input gradients backward.
import math import random class Linear(Module): def __init__(self, fan_in, fan_out): super().__init__() std = math.sqrt(2.0 / fan_in) self.weights = [[random.gauss(0, std) for _ in range(fan_in)] for _ in range(fan_out)] self.biases = [0.0] * fan_out self.weight_grads = [[0.0] * fan_in for _ in range(fan_out)] self.bias_grads = [0.0] * fan_out self.fan_in = fan_in self.fan_out = fan_out self.input = None def forward(self, x): self.input = x output = [] for i in range(self.fan_out): val = self.biases[i] for j in range(self.fan_in): val += self.weights[i][j] * x[j] output.append(val) return output def backward(self, grad): input_grad = [0.0] * self.fan_in for i in range(self.fan_out): self.bias_grads[i] += grad[i] for j in range(self.fan_in): self.weight_grads[i][j] += grad[i] * self.input[j] input_grad[j] += grad[i] * self.weights[i][j] return input_grad def parameters(self): params = [] for i in range(self.fan_out): for j in range(self.fan_in): params.append((self.weights, i, j, self.weight_grads)) params.append((self.biases, i, None, self.bias_grads)) return params
Step 3: Activation Modules
ReLU, Sigmoid, and Tanh as Modules. Each caches what it needs for the backward pass.
class ReLU(Module): def __init__(self): super().__init__() self.mask = None def forward(self, x): self.mask = [1.0 if v > 0 else 0.0 for v in x] return [max(0.0, v) for v in x] def backward(self, grad): return [g * m for g, m in zip(grad, self.mask)] class Sigmoid(Module): def __init__(self): super().__init__() self.output = None def forward(self, x): self.output = [] for v in x: v = max(-500, min(500, v)) self.output.append(1.0 / (1.0 + math.exp(-v))) return self.output def backward(self, grad): return [g * o * (1 - o) for g, o in zip(grad, self.output)] class Tanh(Module): def __init__(self): super().__init__() self.output = None def forward(self, x): self.output = [math.tanh(v) for v in x] return self.output def backward(self, grad): return [g * (1 - o * o) for g, o in zip(grad, self.output)]
Step 4: Dropout Module
Randomly zeroes elements during training. Scales remaining elements by 1/(1-p) so expected values stay the same. Does nothing during eval.
class Dropout(Module): def __init__(self, p=0.5): super().__init__() self.p = p self.mask = None def forward(self, x): if not self.training: return x self.mask = [0.0 if random.random() < self.p else 1.0 / (1 - self.p) for _ in x] return [v * m for v, m in zip(x, self.mask)] def backward(self, grad): if self.mask is None: return grad return [g * m for g, m in zip(grad, self.mask)]
Step 5: BatchNorm Module
Normalizes activations to zero mean and unit variance per feature across the batch. Maintains running statistics for eval mode.
class BatchNorm(Module): def __init__(self, size, momentum=0.1, eps=1e-5): super().__init__() self.size = size self.gamma = [1.0] * size self.beta = [0.0] * size self.gamma_grads = [0.0] * size self.beta_grads = [0.0] * size self.running_mean = [0.0] * size self.running_var = [1.0] * size self.momentum = momentum self.eps = eps self.x_norm = None self.std_inv = None self.batch_input = None def forward_batch(self, batch): batch_size = len(batch) output_batch = [] if self.training: mean = [0.0] * self.size for sample in batch: for j in range(self.size): mean[j] += sample[j] mean = [m / batch_size for m in mean] var = [0.0] * self.size for sample in batch: for j in range(self.size): var[j] += (sample[j] - mean[j]) ** 2 var = [v / batch_size for v in var] self.std_inv = [1.0 / math.sqrt(v + self.eps) for v in var] self.x_norm = [] self.batch_input = batch for sample in batch: normed = [(sample[j] - mean[j]) * self.std_inv[j] for j in range(self.size)] self.x_norm.append(normed) output = [self.gamma[j] * normed[j] + self.beta[j] for j in range(self.size)] output_batch.append(output) for j in range(self.size): self.running_mean[j] = (1 - self.momentum) * self.running_mean[j] + self.momentum * mean[j] self.running_var[j] = (1 - self.momentum) * self.running_var[j] + self.momentum * var[j] else: std_inv = [1.0 / math.sqrt(v + self.eps) for v in self.running_var] for sample in batch: normed = [(sample[j] - self.running_mean[j]) * std_inv[j] for j in range(self.size)] output = [self.gamma[j] * normed[j] + self.beta[j] for j in range(self.size)] output_batch.append(output) return output_batch def forward(self, x): result = self.forward_batch([x]) return result[0] def backward(self, grad): if self.x_norm is None: return grad for j in range(self.size): self.gamma_grads[j] += self.x_norm[0][j] * grad[j] self.beta_grads[j] += grad[j] return [grad[j] * self.gamma[j] * self.std_inv[j] for j in range(self.size)] def parameters(self): params = [] for j in range(self.size): params.append((self.gamma, j, None, self.gamma_grads)) params.append((self.beta, j, None, self.beta_grads)) return params
Step 6: Sequential Container
Chains modules. Forward goes left-to-right, backward goes right-to-left.
class Sequential(Module): def __init__(self, *modules): super().__init__() self.modules = list(modules) def forward(self, x): for module in self.modules: x = module.forward(x) return x def backward(self, grad): for module in reversed(self.modules): grad = module.backward(grad) return grad def parameters(self): params = [] for module in self.modules: params.extend(module.parameters()) return params def train(self): self.training = True for module in self.modules: module.train() def eval(self): self.training = False for module in self.modules: module.eval()
Step 7: Loss Functions
MSE and Binary Cross-Entropy. Each returns the loss value and provides a backward() that returns the gradient.
class MSELoss: def __call__(self, predicted, target): self.predicted = predicted self.target = target n = len(predicted) self.loss = sum((p - t) ** 2 for p, t in zip(predicted, target)) / n return self.loss def backward(self): n = len(self.predicted) return [2 * (p - t) / n for p, t in zip(self.predicted, self.target)] class BCELoss: def __call__(self, predicted, target): self.predicted = predicted self.target = target eps = 1e-7 n = len(predicted) self.loss = 0 for p, t in zip(predicted, target): p = max(eps, min(1 - eps, p)) self.loss += -(t * math.log(p) + (1 - t) * math.log(1 - p)) self.loss /= n return self.loss def backward(self): eps = 1e-7 n = len(self.predicted) grads = [] for p, t in zip(self.predicted, self.target): p = max(eps, min(1 - eps, p)) grads.append((-t / p + (1 - t) / (1 - p)) / n) return grads
Step 8: SGD and Adam Optimizers
Both take a parameter list and update weights using gradients.
class SGD: def __init__(self, parameters, lr=0.01): self.params = parameters self.lr = lr def step(self): for container, i, j, grad_container in self.params: if j is not None: container[i][j] -= self.lr * grad_container[i][j] else: container[i] -= self.lr * grad_container[i] def zero_grad(self): for container, i, j, grad_container in self.params: if j is not None: grad_container[i][j] = 0.0 else: grad_container[i] = 0.0 class Adam: def __init__(self, parameters, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8): self.params = parameters self.lr = lr self.beta1 = beta1 self.beta2 = beta2 self.eps = eps self.t = 0 self.m = [0.0] * len(parameters) self.v = [0.0] * len(parameters) def step(self): self.t += 1 for idx, (container, i, j, grad_container) in enumerate(self.params): if j is not None: g = grad_container[i][j] else: g = grad_container[i] self.m[idx] = self.beta1 * self.m[idx] + (1 - self.beta1) * g self.v[idx] = self.beta2 * self.v[idx] + (1 - self.beta2) * g * g m_hat = self.m[idx] / (1 - self.beta1 ** self.t) v_hat = self.v[idx] / (1 - self.beta2 ** self.t) update = self.lr * m_hat / (math.sqrt(v_hat) + self.eps) if j is not None: container[i][j] -= update else: container[i] -= update def zero_grad(self): for container, i, j, grad_container in self.params: if j is not None: grad_container[i][j] = 0.0 else: grad_container[i] = 0.0
Step 9: DataLoader
Splits data into batches, optionally shuffles each epoch.
class DataLoader: def __init__(self, data, batch_size=32, shuffle=True): self.data = data self.batch_size = batch_size self.shuffle = shuffle def __iter__(self): indices = list(range(len(self.data))) if self.shuffle: random.shuffle(indices) for start in range(0, len(indices), self.batch_size): batch_indices = indices[start:start + self.batch_size] batch = [self.data[i] for i in batch_indices] inputs = [item[0] for item in batch] targets = [item[1] for item in batch] yield inputs, targets def __len__(self): return (len(self.data) + self.batch_size - 1) // self.batch_size
Step 10: Train a 4-Layer Network on Circle Classification
Wire everything together. Define a model, pick a loss, pick an optimizer, run the training loop.
def make_circle_data(n=500, seed=42): random.seed(seed) data = [] for _ in range(n): x = random.uniform(-2, 2) y = random.uniform(-2, 2) label = 1.0 if x * x + y * y < 1.5 else 0.0 data.append(([x, y], [label])) return data def train(): random.seed(42) model = Sequential( Linear(2, 16), ReLU(), Linear(16, 16), ReLU(), Linear(16, 8), ReLU(), Linear(8, 1), Sigmoid(), ) criterion = BCELoss() optimizer = Adam(model.parameters(), lr=0.01) data = make_circle_data(500) split = int(len(data) * 0.8) train_data = data[:split] test_data = data[split:] loader = DataLoader(train_data, batch_size=16, shuffle=True) model.train() for epoch in range(100): total_loss = 0 total_correct = 0 total_samples = 0 for batch_inputs, batch_targets in loader: batch_loss = 0 for x, t in zip(batch_inputs, batch_targets): pred = model.forward(x) loss = criterion(pred, t) batch_loss += loss optimizer.zero_grad() grad = criterion.backward() model.backward(grad) optimizer.step() predicted_class = 1.0 if pred[0] >= 0.5 else 0.0 if predicted_class == t[0]: total_correct += 1 total_samples += 1 total_loss += batch_loss avg_loss = total_loss / total_samples accuracy = total_correct / total_samples * 100 if epoch % 10 == 0 or epoch == 99: print(f"Epoch {epoch:3d} | Loss: {avg_loss:.6f} | Train Accuracy: {accuracy:.1f}%") model.eval() correct = 0 for x, t in test_data: pred = model.forward(x) predicted_class = 1.0 if pred[0] >= 0.5 else 0.0 if predicted_class == t[0]: correct += 1 test_accuracy = correct / len(test_data) * 100 print(f"\nTest Accuracy: {test_accuracy:.1f}% ({correct}/{len(test_data)})") return model, test_accuracy
Use It
Here is the PyTorch equivalent of what you just built:
import torch import torch.nn as nn from torch.utils.data import DataLoader, TensorDataset model = nn.Sequential( nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid(), ) criterion = nn.BCELoss() optimizer = torch.optim.Adam(model.parameters(), lr=0.01) for epoch in range(100): model.train() for inputs, targets in dataloader: optimizer.zero_grad() predictions = model(inputs) loss = criterion(predictions, targets) loss.backward() optimizer.step() model.eval() with torch.no_grad(): test_predictions = model(test_inputs)
The structure is identical. Sequential, Linear, ReLU, Sigmoid, BCELoss, Adam, zero_grad, backward, step, train, eval. Every concept maps one-to-one. The difference is that PyTorch handles autograd automatically (no need to implement backward() in each module), runs on GPU, and has been optimized for years. But the bones are the same.
Now when you see PyTorch code, you know exactly what is happening at every line. That understanding is the whole point.
Ship It
This lesson produces:
outputs/prompt-framework-architect.md-- a prompt for designing neural network architectures using framework abstractions
Exercises
-
Add a
SoftmaxCrossEntropyLossclass for multi-class classification. Softmax the predictions, compute cross-entropy loss, and handle the combined backward pass. Test it on a 3-class spiral dataset. -
Implement learning rate scheduling in the optimizer: add a
set_lr()method and wire in the cosine schedule from Lesson 09. Train the circle classifier with warmup + cosine and compare to constant LR. -
Add a
save()andload()method to Sequential that serializes all weights to a JSON file and loads them back. Verify that a loaded model produces the same predictions as the original. -
Implement weight decay (L2 regularization) in the Adam optimizer. Add a
weight_decayparameter that shrinks weights toward zero each step. Compare training with decay=0 vs decay=0.01. -
Replace the per-sample training loop with proper mini-batch gradient accumulation: accumulate gradients across all samples in a batch, then divide by batch size and take one optimizer step. Measure whether this changes convergence speed.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Module | "A layer" | The base abstraction in a framework -- anything with forward(), backward(), and parameters() |
| Sequential | "Stack layers in order" | A container that chains modules, applying them in sequence for forward and reverse for backward |
| Forward pass | "Run the network" | Computing the output by passing input through each module in order |
| Backward pass | "Compute gradients" | Propagating the loss gradient through each module in reverse to compute parameter gradients |
| Parameters | "The trainable weights" | All values in the network that the optimizer can update -- weights and biases |
| Optimizer | "The thing that updates weights" | An algorithm that uses gradients to update parameters, implementing SGD, Adam, or other rules |
| DataLoader | "The thing that feeds data" | An iterator that splits a dataset into batches, optionally shuffling between epochs |
| Training mode | "model.train()" | A flag that enables stochastic behavior like dropout and batch normalization with batch stats |
| Evaluation mode | "model.eval()" | A flag that disables dropout and uses running statistics for batch normalization |
| Zero grad | "Clear the gradients" | Resetting all parameter gradients to zero before computing the next batch's gradients |
Further Reading
- Paszke et al., "PyTorch: An Imperative Style, High-Performance Deep Learning Library" (2019) -- the paper describing PyTorch's design decisions
- Chollet, "Deep Learning with Python, Second Edition" (2021) -- Chapter 3 covers Keras internals with the same module/layer abstraction
- Johnson, "Tiny-DNN" (https://github.com/tiny-dnn/tiny-dnn) -- a header-only C++ deep learning framework for understanding framework internals