Logistic Regression
Logistic regression bends a straight line into an S-curve to answer yes-or-no questions with probabilities.
Type: Build
Languages: Python
Prerequisites: Phase 2 Lessons 1-2 (What Is ML, Linear Regression)
Time: ~90 minutes
Learning Objectives
- Implement logistic regression from scratch using the sigmoid function and binary cross-entropy loss
- Compute and interpret precision, recall, F1 score, and the confusion matrix for binary classification
- Explain why MSE fails for classification and why binary cross-entropy produces a convex cost surface
- Build a softmax regression model for multi-class classification and evaluate threshold tuning tradeoffs
The Problem
You want to predict whether a tumor is malignant or benign given its size. You try linear regression. It outputs numbers like 0.3 or 1.7 or -0.5. What do those mean? Is 1.7 "very malignant"? Is -0.5 "very benign"? Linear regression outputs unbounded numbers. Classification needs bounded probabilities between 0 and 1, and a clear decision: yes or no.
Logistic regression solves this. It takes the same linear combination (wx + b) and passes it through the sigmoid function, which squashes any number into the range (0, 1). The output is a probability. You set a threshold (usually 0.5) and make a decision.
This is one of the most widely used algorithms in practice. Despite its name, logistic regression is a classification algorithm, not a regression algorithm. The name comes from the logistic (sigmoid) function it uses.
The Concept
Why Linear Regression Fails for Classification
Imagine predicting pass/fail (1/0) based on study hours. Linear regression fits a line through the data:
hours: 1 2 3 4 5 6 7 8 9 10
actual: 0 0 0 0 1 1 1 1 1 1
A linear fit might produce predictions like -0.2 at hour 1 and 1.3 at hour 10. These values are not probabilities. They go below 0 and above 1. Worse, a single outlier (someone who studied 50 hours) would drag the entire line, changing predictions for everyone.
Classification needs a function that:
- Outputs values between 0 and 1 (probabilities)
- Creates a sharp transition (a decision boundary)
- Is not distorted by outliers far from the boundary
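To make the failure concrete, here is a minimal sketch that fits an ordinary least-squares line to the study-hours data above using the closed-form formulas. The helper name `fit_line` and the extra 50-hour student are illustrative assumptions, not part of the lesson code.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
passed = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

slope, intercept = fit_line(hours, passed)
pred_low = slope * 1 + intercept
pred_high = slope * 10 + intercept
print(f"Prediction at 1 hour:   {pred_low:.3f}")   # falls below 0
print(f"Prediction at 10 hours: {pred_high:.3f}")  # rises above 1

# Hypothetical outlier: one student who studied 50 hours and passed.
slope_out, intercept_out = fit_line(hours + [50], passed + [1])
print(f"Slope without outlier: {slope:.3f}")
print(f"Slope with outlier:    {slope_out:.3f}")
```

The outlier flattens the line, which shifts every prediction even though that one student was never in doubt.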
The Sigmoid Function
The sigmoid function does exactly this:
sigmoid(z) = 1 / (1 + e^(-z))
Properties:
- When z is large and positive, sigmoid(z) approaches 1
- When z is large and negative, sigmoid(z) approaches 0
- When z = 0, sigmoid(z) = 0.5
- The output is always between 0 and 1
- The function is smooth and differentiable everywhere
The derivative has a convenient form: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)). This makes gradient computation efficient.
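As a quick sanity check on these properties, the sketch below compares the closed-form derivative against a central finite difference. The helper names `sigmoid_derivative` and `numerical_derivative` are illustrative, and the step size `h` is an arbitrary small value.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    # Closed form: sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)


def numerical_derivative(f, z, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(z + h) - f(z - h)) / (2 * h)


for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    analytic = sigmoid_derivative(z)
    numeric = numerical_derivative(sigmoid, z)
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  "
          f"analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The derivative peaks at z = 0 (where it equals 0.25) and shrinks toward 0 at both extremes, which is why saturated predictions learn slowly.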
Logistic Regression = Linear Model + Sigmoid
The model computes z = wx + b (same as linear regression), then applies sigmoid:
flowchart LR
    X[Input features x] --> L["Linear: z = wx + b"]
    L --> S["Sigmoid: p = 1/(1+e^-z)"]
    S --> D{"p >= 0.5?"}
    D -->|Yes| P[Predict 1]
    D -->|No| N[Predict 0]
The output p is interpreted as P(y=1 | x), the probability that the input belongs to class 1. The decision boundary is where wx + b = 0, which makes sigmoid output exactly 0.5.
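A minimal sketch of this forward pass for a single feature follows. The values of `w` and `b` are hand-picked assumptions for illustration, not learned parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for a one-feature model (not learned from data).
w = 1.5
b = -6.0

# The boundary is where w*x + b = 0, so x = -b / w.
boundary = -b / w
print(f"Decision boundary at x = {boundary}")

for x in [2.0, 4.0, 6.0]:
    z = w * x + b
    p = sigmoid(z)
    label = 1 if p >= 0.5 else 0
    print(f"x={x}: z={z:+.1f}, p={p:.4f}, predict {label}")
```

Inputs below the boundary get p < 0.5, inputs above it get p > 0.5, and the boundary itself sits at exactly 0.5.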
Binary Cross-Entropy Loss
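MSE is a poor loss for logistic regression. Combined with a sigmoid, it produces a non-convex cost surface with flat regions where the gradient vanishes, so gradient descent can stall even when a prediction is confidently wrong. Instead, use binary cross-entropy (log loss):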
Loss = -(1/n) * sum(y * log(p) + (1-y) * log(1-p))
Why this works:
- When y=1 and p is close to 1: log(p) is near 0, so loss is near 0 (correct, low cost)
- When y=1 and p is close to 0: log(p) approaches negative infinity, and the leading minus sign makes the loss huge (wrong, high cost)
- When y=0 and p is close to 0: log(1-p) is near 0, so loss is near 0 (correct, low cost)
- When y=0 and p is close to 1: log(1-p) approaches negative infinity, and the leading minus sign makes the loss huge (wrong, high cost)
This loss function is convex in w and b for logistic regression, so any minimum that gradient descent finds is the global minimum.
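The four cases above can be checked numerically. This is a small sketch; the function name `binary_cross_entropy` and the clipping constant `eps` are choices made here for illustration.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Mean log loss; probabilities are clipped so log(0) never occurs."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = max(eps, min(1 - eps, p))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


cases = [
    ("y=1, p=0.99 (confident, right)", [1], [0.99]),
    ("y=1, p=0.50 (unsure)", [1], [0.50]),
    ("y=1, p=0.01 (confident, wrong)", [1], [0.01]),
    ("y=0, p=0.01 (confident, right)", [0], [0.01]),
    ("y=0, p=0.99 (confident, wrong)", [0], [0.99]),
]
for name, y, p in cases:
    print(f"{name}: loss = {binary_cross_entropy(y, p):.4f}")
```

Note how the penalty grows much faster than the error itself: moving from p=0.50 to p=0.01 on a positive example multiplies the loss several times over.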
Gradient Descent for Logistic Regression
The gradients for binary cross-entropy with sigmoid have a clean form:
dL/dw = (1/n) * sum((p - y) * x)
dL/db = (1/n) * sum(p - y)
These have the same form as the linear regression gradients under MSE (up to a constant factor). The difference is that p = sigmoid(wx + b) instead of p = wx + b. The sigmoid introduces the nonlinearity, but the gradient update rule stays the same.
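One way to convince yourself that these gradient formulas are right is to compare them with finite differences of the loss. The sketch below does that for a one-feature model; the toy data, the starting parameters, and the step size `h` are all arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, xs, ys):
    """Mean binary cross-entropy for a one-feature model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(xs)


def analytic_gradients(w, b, xs, ys):
    """dL/dw = mean((p - y) * x), dL/db = mean(p - y)."""
    n = len(xs)
    dw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    db = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
    return dw, db


# Toy data and parameters chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]
w, b = 0.3, -0.5

dw, db = analytic_gradients(w, b, xs, ys)

# Central finite differences of the loss with respect to w and b.
h = 1e-6
dw_num = (loss(w + h, b, xs, ys) - loss(w - h, b, xs, ys)) / (2 * h)
db_num = (loss(w, b + h, xs, ys) - loss(w, b - h, xs, ys)) / (2 * h)

print(f"dL/dw analytic={dw:.6f} numeric={dw_num:.6f}")
print(f"dL/db analytic={db:.6f} numeric={db_num:.6f}")
```

If the two columns disagree beyond rounding noise, the gradient code has a bug. This kind of check is worth running whenever you hand-derive gradients.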
flowchart TD
    A[Initialize w=0, b=0] --> B[Forward pass: z = wx+b, p = sigmoid z]
    B --> C[Compute loss: binary cross-entropy]
    C --> D["Compute gradients: dw = (1/n) * sum((p-y)*x)"]
    D --> E[Update: w = w - lr*dw, b = b - lr*db]
    E --> F{Converged?}
    F -->|No| B
    F -->|Yes| G[Model trained]
The Decision Boundary
For a 2D input (two features), the decision boundary is the line where:
w1*x1 + w2*x2 + b = 0
Points on one side get classified as 1, points on the other side as 0. Logistic regression always produces a linear decision boundary. If you need a curved boundary, you either add polynomial features or use a nonlinear model.
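To see how extra features bend the boundary, here is a sketch with hand-picked (not trained) weights on squared features. The helper `add_squares` and the specific weights are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def add_squares(x):
    """Expand [x1, x2] into [x1, x2, x1^2, x2^2]."""
    return [x[0], x[1], x[0] ** 2, x[1] ** 2]


# Hand-picked (not trained) parameters: z = 4 - x1^2 - x2^2.
# The model stays linear in the expanded features, but in the original
# (x1, x2) plane the boundary z = 0 is a circle of radius 2.
weights = [0.0, 0.0, -1.0, -1.0]
bias = 4.0


def predict(x):
    features = add_squares(x)
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


for point in [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 0.0]]:
    print(f"{point} -> class {predict(point)}")
```

Points inside the circle are assigned class 1 and points outside it class 0, even though the model itself is still a linear function passed through a sigmoid.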
Multi-Class Classification with Softmax
Binary logistic regression handles two classes. For k classes, use the softmax function:
softmax(z_i) = e^(z_i) / sum(e^(z_j) for all j)
Each class has its own weight vector. The model computes a score z_i for each class, then softmax converts scores to probabilities that sum to 1. The predicted class is the one with the highest probability.
The loss function becomes categorical cross-entropy:
Loss = -(1/n) * sum_i(sum_k(y_ik * log(p_ik)))
where y_ik is 1 if sample i belongs to class k and 0 otherwise (one-hot encoding), and p_ik is the predicted probability of class k for sample i.
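A small numeric sketch of softmax and the resulting loss for one sample follows. The scores are made up, and subtracting the maximum score before exponentiating is a common stability trick rather than part of the definition.

```python
import math


def softmax(scores):
    """Subtracting the max score avoids overflow without changing the result."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """With one-hot labels, only the true-class term survives the sum."""
    return -math.log(max(probs[true_class], 1e-15))


# Hypothetical scores for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

print("Probabilities:", [round(p, 3) for p in probs])
print("Sum:", round(sum(probs), 6))
print("Predicted class:", probs.index(max(probs)))
print(f"Loss if true class is 0: {cross_entropy(probs, 0):.4f}")
print(f"Loss if true class is 2: {cross_entropy(probs, 2):.4f}")
```

Because the labels are one-hot, the double sum collapses to the negative log of the probability assigned to the true class, which is why the loss is larger when the true class received a low score.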
Evaluation Metrics
Accuracy alone is not enough. For a dataset with 95% negative and 5% positive, a model that always predicts negative gets 95% accuracy but is useless.
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Precision: Of all predicted positives, how many are actually positive?
Precision = TP / (TP + FP)
Recall (Sensitivity): Of all actual positives, how many did we catch?
Recall = TP / (TP + FN)
F1 Score: Harmonic mean of precision and recall. Balances both metrics.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
When to prioritize:
- Precision: when false positives are costly (spam filter, you do not want to block legitimate email)
- Recall: when false negatives are costly (cancer screening, you do not want to miss a tumor)
- F1: when you need a single balanced metric
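The sketch below works through these formulas on the 95%/5% example from earlier in this section, then on a second set of made-up counts. The function name `precision_recall_f1` and the second set of counts are illustrative assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Always-negative model on 95 negatives and 5 positives.
tp, fp, fn, tn = 0, 0, 5, 95
accuracy = (tp + tn) / (tp + tn + fp + fn)
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"Always negative: accuracy={accuracy:.2f}, "
      f"precision={p:.2f}, recall={r:.2f}, f1={f1:.2f}")

# Made-up counts for a model that actually finds positives.
tp2, fp2, fn2, tn2 = 4, 2, 1, 93
accuracy2 = (tp2 + tn2) / (tp2 + tn2 + fp2 + fn2)
p2, r2, f1_2 = precision_recall_f1(tp2, fp2, fn2)
print(f"Useful model:    accuracy={accuracy2:.2f}, "
      f"precision={p2:.2f}, recall={r2:.2f}, f1={f1_2:.2f}")
```

The two models differ by only two points of accuracy, but recall and F1 separate them clearly.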
Build It
Step 1: Sigmoid function and data generation
import random
import math


def sigmoid(z):
    z = max(-500, min(500, z))
    return 1.0 / (1.0 + math.exp(-z))


random.seed(42)

N = 200
X = []
y = []

for _ in range(N // 2):
    X.append([random.gauss(2, 1), random.gauss(2, 1)])
    y.append(0)

for _ in range(N // 2):
    X.append([random.gauss(5, 1), random.gauss(5, 1)])
    y.append(1)

combined = list(zip(X, y))
random.shuffle(combined)
X, y = zip(*combined)
X = list(X)
y = list(y)

print(f"Generated {N} samples (2 classes, 2 features)")
print(f"Class 0 center: (2, 2), Class 1 center: (5, 5)")
print(f"First 5 samples:")
for i in range(5):
    print(f" Features: [{X[i][0]:.2f}, {X[i][1]:.2f}], Label: {y[i]}")
Step 2: Logistic regression from scratch
class LogisticRegression:
    def __init__(self, n_features, learning_rate=0.01):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.lr = learning_rate
        self.loss_history = []

    def predict_proba(self, x):
        z = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return sigmoid(z)

    def predict(self, x, threshold=0.5):
        return 1 if self.predict_proba(x) >= threshold else 0

    def compute_loss(self, X, y):
        n = len(y)
        total = 0.0
        for i in range(n):
            p = self.predict_proba(X[i])
            p = max(1e-15, min(1 - 1e-15, p))
            total += y[i] * math.log(p) + (1 - y[i]) * math.log(1 - p)
        return -total / n

    def fit(self, X, y, epochs=1000, print_every=200):
        n = len(y)
        n_features = len(X[0])
        for epoch in range(epochs):
            dw = [0.0] * n_features
            db = 0.0
            for i in range(n):
                p = self.predict_proba(X[i])
                error = p - y[i]
                for j in range(n_features):
                    dw[j] += error * X[i][j]
                db += error
            for j in range(n_features):
                self.weights[j] -= self.lr * (dw[j] / n)
            self.bias -= self.lr * (db / n)
            loss = self.compute_loss(X, y)
            self.loss_history.append(loss)
            if epoch % print_every == 0:
                print(f" Epoch {epoch:4d} | Loss: {loss:.4f} | w: [{self.weights[0]:.3f}, {self.weights[1]:.3f}] | b: {self.bias:.3f}")
        return self

    def accuracy(self, X, y):
        correct = sum(1 for i in range(len(y)) if self.predict(X[i]) == y[i])
        return correct / len(y)


split = int(0.8 * N)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print("\n=== Training Logistic Regression ===")
model = LogisticRegression(n_features=2, learning_rate=0.1)
model.fit(X_train, y_train, epochs=1000, print_every=200)

print(f"\nTrain accuracy: {model.accuracy(X_train, y_train):.4f}")
print(f"Test accuracy: {model.accuracy(X_test, y_test):.4f}")
print(f"Weights: [{model.weights[0]:.4f}, {model.weights[1]:.4f}]")
print(f"Bias: {model.bias:.4f}")
Step 3: Confusion matrix and metrics from scratch
class ClassificationMetrics:
    def __init__(self, y_true, y_pred):
        self.tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        self.tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        self.fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        self.fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def accuracy(self):
        total = self.tp + self.tn + self.fp + self.fn
        return (self.tp + self.tn) / total if total > 0 else 0

    def precision(self):
        denom = self.tp + self.fp
        return self.tp / denom if denom > 0 else 0

    def recall(self):
        denom = self.tp + self.fn
        return self.tp / denom if denom > 0 else 0

    def f1(self):
        p = self.precision()
        r = self.recall()
        return 2 * p * r / (p + r) if (p + r) > 0 else 0

    def print_confusion_matrix(self):
        print(f"\n Confusion Matrix:")
        print(f"              Predicted")
        print(f"              Pos   Neg")
        print(f" Actual Pos  {self.tp:4d}  {self.fn:4d}")
        print(f" Actual Neg  {self.fp:4d}  {self.tn:4d}")

    def print_report(self):
        self.print_confusion_matrix()
        print(f"\n Accuracy: {self.accuracy():.4f}")
        print(f" Precision: {self.precision():.4f}")
        print(f" Recall: {self.recall():.4f}")
        print(f" F1 Score: {self.f1():.4f}")


y_pred_test = [model.predict(x) for x in X_test]

print("\n=== Classification Report (Test Set) ===")
metrics = ClassificationMetrics(y_test, y_pred_test)
metrics.print_report()
Step 4: Decision boundary analysis
print("\n=== Decision Boundary ===")
w1, w2 = model.weights
b = model.bias
print(f"Decision boundary: {w1:.4f}*x1 + {w2:.4f}*x2 + {b:.4f} = 0")
if abs(w2) > 1e-10:
    print(f"Solved for x2: x2 = {-w1/w2:.4f}*x1 + {-b/w2:.4f}")

print("\nSample predictions near the boundary:")
test_points = [
    [3.0, 3.0],
    [3.5, 3.5],
    [4.0, 4.0],
    [2.5, 2.5],
    [5.0, 5.0],
]
for point in test_points:
    prob = model.predict_proba(point)
    pred = model.predict(point)
    print(f" [{point[0]}, {point[1]}] -> prob={prob:.4f}, class={pred}")
Step 5: Multi-class with softmax
class SoftmaxRegression:
    def __init__(self, n_features, n_classes, learning_rate=0.01):
        self.n_features = n_features
        self.n_classes = n_classes
        self.lr = learning_rate
        self.weights = [[0.0] * n_features for _ in range(n_classes)]
        self.biases = [0.0] * n_classes

    def softmax(self, scores):
        max_score = max(scores)
        exp_scores = [math.exp(s - max_score) for s in scores]
        total = sum(exp_scores)
        return [e / total for e in exp_scores]

    def predict_proba(self, x):
        scores = [
            sum(self.weights[k][j] * x[j] for j in range(self.n_features)) + self.biases[k]
            for k in range(self.n_classes)
        ]
        return self.softmax(scores)

    def predict(self, x):
        probs = self.predict_proba(x)
        return probs.index(max(probs))

    def fit(self, X, y, epochs=1000, print_every=200):
        n = len(y)
        for epoch in range(epochs):
            grad_w = [[0.0] * self.n_features for _ in range(self.n_classes)]
            grad_b = [0.0] * self.n_classes
            total_loss = 0.0
            for i in range(n):
                probs = self.predict_proba(X[i])
                for k in range(self.n_classes):
                    target = 1.0 if y[i] == k else 0.0
                    error = probs[k] - target
                    for j in range(self.n_features):
                        grad_w[k][j] += error * X[i][j]
                    grad_b[k] += error
                true_prob = max(probs[y[i]], 1e-15)
                total_loss -= math.log(true_prob)
            for k in range(self.n_classes):
                for j in range(self.n_features):
                    self.weights[k][j] -= self.lr * (grad_w[k][j] / n)
                self.biases[k] -= self.lr * (grad_b[k] / n)
            if epoch % print_every == 0:
                print(f" Epoch {epoch:4d} | Loss: {total_loss / n:.4f}")
        return self

    def accuracy(self, X, y):
        correct = sum(1 for i in range(len(y)) if self.predict(X[i]) == y[i])
        return correct / len(y)


random.seed(42)

X_3class = []
y_3class = []
centers = [(1, 1), (5, 1), (3, 5)]
for label, (cx, cy) in enumerate(centers):
    for _ in range(50):
        X_3class.append([random.gauss(cx, 0.8), random.gauss(cy, 0.8)])
        y_3class.append(label)

combined = list(zip(X_3class, y_3class))
random.shuffle(combined)
X_3class, y_3class = zip(*combined)
X_3class = list(X_3class)
y_3class = list(y_3class)

split_3 = int(0.8 * len(X_3class))
X_train_3 = X_3class[:split_3]
y_train_3 = y_3class[:split_3]
X_test_3 = X_3class[split_3:]
y_test_3 = y_3class[split_3:]

print("\n=== Multi-class Softmax Regression (3 classes) ===")
softmax_model = SoftmaxRegression(n_features=2, n_classes=3, learning_rate=0.1)
softmax_model.fit(X_train_3, y_train_3, epochs=1000, print_every=200)

print(f"\nTrain accuracy: {softmax_model.accuracy(X_train_3, y_train_3):.4f}")
print(f"Test accuracy: {softmax_model.accuracy(X_test_3, y_test_3):.4f}")

print("\nSample predictions:")
for i in range(5):
    probs = softmax_model.predict_proba(X_test_3[i])
    pred = softmax_model.predict(X_test_3[i])
    print(f" True: {y_test_3[i]}, Predicted: {pred}, Probs: [{', '.join(f'{p:.3f}' for p in probs)}]")
Step 6: Threshold tuning
print("\n=== Threshold Tuning ===")
print("Default threshold: 0.5. Adjusting the threshold trades precision for recall.\n")

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
print(f"{'Threshold':>10} {'Accuracy':>10} {'Precision':>10} {'Recall':>10} {'F1':>10}")
print("-" * 52)
for t in thresholds:
    y_pred_t = [1 if model.predict_proba(x) >= t else 0 for x in X_test]
    m = ClassificationMetrics(y_test, y_pred_t)
    print(f"{t:>10.1f} {m.accuracy():>10.4f} {m.precision():>10.4f} {m.recall():>10.4f} {m.f1():>10.4f}")
Use It
Now the same thing with scikit-learn.
from sklearn.linear_model import LogisticRegression as SklearnLR
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

np.random.seed(42)
X_0 = np.random.randn(100, 2) + [2, 2]
X_1 = np.random.randn(100, 2) + [5, 5]
X_sk = np.vstack([X_0, X_1])
y_sk = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X_sk, y_sk, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc = scaler.transform(X_te)

lr = SklearnLR()
lr.fit(X_tr_sc, y_tr)
y_pred = lr.predict(X_te_sc)

print("=== Scikit-learn Logistic Regression ===")
print(f"Accuracy: {accuracy_score(y_te, y_pred):.4f}")
print(f"Precision: {precision_score(y_te, y_pred):.4f}")
print(f"Recall: {recall_score(y_te, y_pred):.4f}")
print(f"F1: {f1_score(y_te, y_pred):.4f}")
print(f"\nConfusion Matrix:\n{confusion_matrix(y_te, y_pred)}")
print(f"\nClassification Report:\n{classification_report(y_te, y_pred)}")
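Your from-scratch implementation should produce a similar decision boundary and similar metrics, though not identical ones: this example draws its data with NumPy, standardizes the features, and scikit-learn applies L2 regularization by default (C=1.0). Scikit-learn also adds solver options (liblinear, lbfgs, saga), multi-class strategies (one-vs-rest, multinomial), and numerical stability optimizations.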
Ship It
This lesson produces:
- code/logistic_regression.py: logistic regression from scratch with metrics
Exercises
- Generate a dataset that is NOT linearly separable (e.g., two concentric circles). Train logistic regression and observe its failure. Then add polynomial features (x1^2, x2^2, x1*x2) and train again. Show that the accuracy improves.
- Implement a multi-class confusion matrix for the 3-class softmax model. Compute per-class precision and recall. Which class is hardest to classify?
- Build an ROC curve from scratch. For 100 threshold values from 0 to 1, compute the true positive rate and false positive rate. Calculate the AUC (area under the curve) using the trapezoidal rule.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Logistic regression | "Regression for classification" | A linear model followed by a sigmoid function that outputs class probabilities |
| Sigmoid function | "The S-curve" | The function 1/(1+e^(-z)) that maps any real number to the range (0, 1) |
| Binary cross-entropy | "Log loss" | The loss function -[y*log(p) + (1-y)*log(1-p)] that penalizes confident wrong predictions severely |
| Decision boundary | "The dividing line" | The surface where the model's output probability equals 0.5, separating predicted classes |
| Softmax | "Multi-class sigmoid" | A function that converts a vector of scores into probabilities that sum to 1 |
| Precision | "How many selected are relevant" | TP / (TP + FP), the fraction of positive predictions that are actually positive |
| Recall | "How many relevant are selected" | TP / (TP + FN), the fraction of actual positives that the model correctly identifies |
| F1 score | "Balanced accuracy" | The harmonic mean of precision and recall: 2PR / (P+R) |
| Confusion matrix | "The error breakdown" | A table of counts for each (actual, predicted) class pair: TP, TN, FP, and FN in the binary case |
| Threshold | "The cutoff" | The probability value above which the model predicts class 1 (default 0.5, tunable) |
| One-hot encoding | "Binary columns for categories" | Representing class k as a vector of zeros with a 1 at position k |
| Categorical cross-entropy | "Multi-class log loss" | The extension of binary cross-entropy to k classes using one-hot encoded labels |