Linear Regression From Scratch

Overview

This project implements four regression models entirely from scratch using only NumPy — no scikit-learn, no PyTorch, no black-box libraries. Each model is built with a modular, vectorized training pipeline and validated against sklearn baselines for numerical equivalence.

The implementation covers four models, each with distinct regularization strategies:

OLS (Ordinary Least Squares) — No penalty. Solved via gradient descent or the normal equation. Produces unbiased estimates with minimum variance.

Ridge Regression — L2 penalty on coefficient magnitude. Shrinks all weights toward zero without eliminating any. Handles multicollinearity gracefully.

Lasso Regression — L1 penalty via coordinate descent. Produces sparse solutions by zeroing out irrelevant features — automatic feature selection.

ElasticNet — Combines L1 and L2 penalties. Balances Lasso's sparsity with Ridge's grouping effect, controlled by a mixing ratio parameter.

The project includes 8 progressive notebooks, an interactive p5.js visualizer, 6 evaluation metrics, and a real-world experiment on the California Housing dataset (20,640 samples).

Project Architecture

linear-regression/
├── src/
│   ├── linear_regression.py    # 452 lines — all 4 models + modular pipeline
│   ├── metrics.py              # 124 lines — MAE, MSE, RMSE, R², Adj R², MAPE
│   ├── visualization.py        # 173 lines — matplotlib plotting utilities
│   ├── utils.py                # 106 lines — data generation helpers
│   ├── demo.py                 # 340 lines — 4 experiment runner
│   └── __init__.py             # 61 lines — public API exports
── notebooks/
│   ├── 01_regression_intuition.ipynb
│   ├── 02_maths.ipynb
│   ├── 03_linear_regression.ipynb
│   ├── 04_visualization_and_metrics.ipynb
│   ├── 05_sklearn_implementation.ipynb
│   ├── 06_regression_types.ipynb
│   ├── 07_assumptions_and_limitations.ipynb
│   └── 08_real_world_dataset.ipynb
├── data/
│   └── housing.csv             # California Housing — 20,640 rows
├── visualize.html              # Interactive p5.js training visualizer
└── README.md                   # Full documentation

Mathematical Foundation

Hypothesis Function

For a dataset with n features, the linear model predicts:

$h_{w,b}(x) = w^T x + b = \sum_{j=1}^{n} w_j x_j + b$

In vectorized form across m samples:

$\hat{y} = Xw + b$

Cost Function — Mean Squared Error

$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} \|Xw + b - y\|_2^2$

Gradient Descent — OLS

The gradients for ordinary least squares:

$\frac{\partial J}{\partial w} = \frac{2}{m} X^T (Xw + b - y)$

$\frac{\partial J}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})$

Parameter updates with learning rate α:

$w := w - \alpha \frac{\partial J}{\partial w}, \quad b := b - \alpha \frac{\partial J}{\partial b}$

Ridge Regression — L2 Regularization

Adds a penalty on the magnitude of coefficients:

$J_{ridge}(w, b) = \frac{1}{2m} \|Xw + b - y\|_2^2 + \alpha \|w\|_2^2$

Gradient with L2 penalty:

$\frac{\partial J_{ridge}}{\partial w} = \frac{2}{m} X^T (Xw + b - y) + 2\alpha w$

Closed-form solution (Normal Equation with regularization):

$w = (X^T X + \alpha I)^{-1} X^T y$

Lasso Regression — L1 Regularization

Uses L1 penalty which produces sparse solutions:

$J_{lasso}(w, b) = \frac{1}{2m} \|Xw + b - y\|_2^2 + \alpha \|w\|_1$

Solved via coordinate descent with soft thresholding:

$w_j := \frac{S(\rho_j, \alpha)}{1}$

Where the soft threshold operator:

$S(\rho, \lambda) = \begin{cases} \rho - \lambda & \text{if } \rho > \lambda \\ \rho + \lambda & \text{if } \rho < -\lambda \\ 0 & \text{if } |\rho| \leq \lambda \end{cases}$

ElasticNet — Combined L1 + L2

Blends Lasso and Ridge penalties with mixing ratio ρ:

$J_{elastic}(w, b) = \frac{1}{2m} \|Xw + b - y\|_2^2 + \alpha \left[ \rho \|w\|_1 + \frac{1-\rho}{2} \|w\|_2^2 \right]$

Coordinate descent update:

$w_j := \frac{S(\rho_j, \alpha \rho)}{1 + \alpha(1-\rho)}$

Modular Training Pipeline

The implementation uses a fully vectorized, modular pipeline that is shared across all models:

def initialize_parameters(n_features, random_state=None):
    weights = np.zeros(n_features)
    bias = 0.0
    return weights, bias

def forward_pass(X, weights, bias):
    return X @ weights + bias

def compute_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def compute_gradients_ols(X, y_true, y_pred):
    n = len(y_true)
    dw = (2 / n) * X.T @ (y_pred - y_true)
    db = (2 / n) * np.sum(y_pred - y_true)
    return dw, db

def update_parameters(weights, bias, dw, db, learning_rate):
    weights -= learning_rate * dw
    bias -= learning_rate * db
    return weights, bias

Base Class — Feature Standardization

All models inherit from _BaseRegression which handles Z-score normalization:

def _standardize(self, X, fit=True):
    if fit:
        self.X_mean = np.mean(X, axis=0)
        self.X_std = np.std(X, axis=0)
    return (X - self.X_mean) / self.X_std

def _unstandardize_weights(self):
    # Map learned weights back to original feature scale
    self.weights = self.weights / self.X_std
    self.bias = self.bias - np.dot(self.X_mean, self.weights)

This ensures that:

Gradient descent converges faster (features on same scale)
Learned coefficients are interpretable in original units
Regularization penalties are applied fairly across features

Model Implementations

1. OLS — Ordinary Least Squares

Two solvers provided:

Gradient Descent — iterative, works for any dataset size:

model = LinearRegressionFromScratch(learning_rate=0.05, n_iters=1500)
model.fit(X_train, y_train)

Closed-Form (Normal Equation) — exact solution in one step:

model.fit_closed_form(X_train, y_train)

Validation: GD and closed-form produce identical results with max prediction difference of 4.26 × 10⁻¹⁴.

2. Ridge Regression — L2 Regularization

model = RidgeRegressionFromScratch(alpha=1.0, learning_rate=0.05, n_iters=3000)
model.fit(X_train, y_train)

Also supports closed-form: model.fit_closed_form(X_train, y_train)

Key behavior: shrinks all coefficients toward zero but never exactly zero. Effective when features are correlated.

3. Lasso Regression — L1 Regularization

model = LassoRegressionFromScratch(alpha=0.1, n_iters=5000)
model.fit(X_train, y_train)

Uses coordinate descent (no learning rate). Produces sparse solutions — on the 13-feature collinear dataset, Lasso zeroed out 7.7% of coefficients, performing automatic feature selection.

4. ElasticNet — Combined Regularization

model = ElasticNetRegressionFromScratch(alpha=0.1, l1_ratio=0.5, n_iters=5000)
model.fit(X_train, y_train)

Balances Lasso's sparsity with Ridge's grouping effect. When l1_ratio=1.0 → pure Lasso; when l1_ratio=0.0 → pure Ridge.

Evaluation Metrics

Six metrics implemented from scratch in src/metrics.py:

MAE

\frac{1}{n}\sum|y - \hat{y}|

Robust to outliers

MSE

\frac{1}{n}\sum(y - \hat{y})^2

Penalizes large errors

RMSE

\sqrt{MSE}

Same units as target

R²

1 - \frac{SS_{res}}{SS_{tot}}

Variance explained

Adjusted R²

1 - (1-R^2)\frac{n-1}{n-p-1}

Penalizes extra features

MAPE

\frac{100}{n}\sum|\frac{y-\hat{y}}{y}|

Percentage error

Experiments

Experiment 1 — Simple Univariate Regression

Data: 300 samples, 1 feature, true coefficient w = 3.0
Models: OLS (GD), OLS (closed-form), sklearn LinearRegression
Result: All three produce identical fitted lines. GD converges in ~800 iterations at lr=0.05.

Experiment 2 — Multi-Feature with Collinearity

Data: 500 samples, 13 features (3 collinear pairs added)
Models: OLS, Ridge (α=1.0), Lasso (α=0.1), ElasticNet (α=0.1, ρ=0.5)
Key findings:
- OLS coefficients explode on collinear features
- Ridge shrinks all coefficients smoothly
- Lasso zeros out 7.7% of features
- ElasticNet balances both behaviors

Experiment 3 — Polynomial Regression

Data: 200 samples, nonlinear y = 2x - 0.5x² + 0.1x³
Features: Polynomial features up to degree 3 (from-scratch implementation)
Models: OLS, Ridge (α=5.0), Lasso (α=0.5)
Result: Degree-3 polynomial captures the true curve. Ridge prevents overfitting on higher-degree terms.

Experiment 4 — California Housing (Real-World)

Data: 20,640 samples, 8 features (median income, rooms, age, etc.)
Pipeline: StandardScaler → all 4 models
Results:

Model

R²

RMSE

OLS

0.625

69,200

Ridge (α=1.0)

0.625

69,200

Lasso (α=0.1)

0.618

70,100

ElasticNet (α=0.1, ρ=0.5)

0.622

69,600

Top features: median_income (coefficient ≈ 43,000), latitude, longitude
Lasso zeroed out 2 of 8 features (automatic feature selection)

Interactive Visualization

The project includes visualize.html — a real-time p5.js training visualizer with:

Left panel: 2D scatter plot with 4 simultaneous regression lines
Top-right: Live coefficient bar chart (learned vs. true weights)
Bottom-right: 4 loss curves on log scale
Controls: Model selector, α slider, speed control, pause/reset
Interaction: Click to add data points, Space to pause, R to reset

All 4 models train simultaneously in the browser, allowing direct visual comparison of convergence behavior.

Five Regression Assumptions

The implementation documents and demonstrates all five classical assumptions:

Linearity — Relationship between features and target is linear
Independence — Residuals are uncorrelated (Durbin-Watson test)
Homoscedasticity — Constant variance of residuals across predictions
Normality — Residuals follow a normal distribution (Shapiro-Wilk test, Q-Q plots)
No Multicollinearity — Features are not highly correlated (VIF analysis)

Notebook 07 provides visual demonstrations of each assumption and remedies for violations.

Key Learnings

Feature scaling is non-negotiable — Without standardization, gradient descent diverges or converges extremely slowly. Z-score normalization reduced convergence iterations by ~60%.
Closed-form vs. iterative — The normal equation gives exact solutions but requires inverting an n×n matrix (O(n³)). For n > 10,000 features, gradient descent is the only practical option.
Regularization prevents overfitting — On the collinear dataset, OLS produced coefficients as large as ±500. Ridge capped them at ±50. Lasso zeroed out irrelevant features entirely.
Coordinate descent for L1 — Gradient descent cannot handle the non-differentiable L1 penalty. Coordinate descent updates one weight at a time using soft thresholding — elegant and efficient.
Vectorization is everything — The fully vectorized training loop (X @ weights + bias) runs 50-100× faster than explicit Python loops. NumPy's BLAS-backed operations are the difference between a usable and unusable implementation.

References

Ng, A. (2012). CS229 Lecture Notes: Supervised Learning. Stanford University.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning.
GitHub Repository: mahirmlk/linear-regression