Have you ever performed linear regression involving multiple predictor variables and run into this expression $$\hat \beta = (X^TX)^{-1}X^Ty$$? It’s called the OLS solution via Normal Equations. To find out where it comes from, read on!

In the linear regression framework, we model an output variable $$y$$ (in this case a scalar) as a linear combination of some independent input variables $$X$$ plus some independent noise $$\epsilon$$. The linear combination of the independent variables is defined by a parameter vector $$\beta$$:

$y = X \beta + \epsilon$

We also assume that the noise term $$\epsilon$$ is drawn from a Normal distribution with zero mean and a noise variance $$\sigma_{\epsilon}^2$$ (generally assumed to be equal to one):

$\epsilon \sim N(0,\sigma_{\epsilon}^2)$

For some estimate of the model parameters $$\hat \beta$$, the model’s prediction errors (a.k.a. residuals) $$e$$ are the difference between the model prediction and the observed ouput values:

$e = y - X\hat \beta$

The Ordinary Least Squares (OLS) solution to the problem–i.e. determining an optimal solution for $$\hat \beta$$–requires minimizing the sum of the squared errors with respect to the model parameters $$\hat \beta$$. It turns out, the sum of squared errors is equal to the inner product of the residuals vector with itself $$\sum_i e_i^2 = e^Te$$ :

\begin{align} e^T e &= (y - X \hat \beta)^T (y - X \hat \beta) \\ &= y^Ty - y^T (X \hat \beta) - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &= y^Ty - (X \hat \beta)^T y - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &= y^Ty - 2(X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &= y^Ty - 2\hat \beta^T X^T y + \hat \beta^T X^T X \hat \beta \text{,} \tag{1} \end{align}

where we take advantage of the matrix identity $$(AB)^T = B^TA^T$$ in steps 2-3 above.

To determine the parameters $$\hat \beta$$ we minimize the sum of squared errors with respect to the parameters:

\begin{align} \frac{\partial}{\partial \beta} \left[ e^T e \right] &= 0 \\ \frac{\partial}{\partial \beta} \left[ y^Ty - 2\hat \beta^T X^T y + \hat \beta^T X^T X \hat \beta \right ] &= 0 \;\; \text{, via Eq. (1)}\\ -2X^Ty + 2X^TX \hat \beta &= 0 \\ -X^Ty + X^TX \hat \beta &= 0 \\ X^TX \hat \beta&= X^Ty \text{,} \tag{2} \end{align}

where we note to the matrix derivative identity $$\frac{\partial \mathbf{a}^T \mathbf{b}}{\partial \mathbf{a}} = \mathbf{b}$$, for vectors $$\mathbf{a}$$ and $$\mathbf{b}$$ in step 2-3 above.

The relationship in Equation 2 is the matrix form of what are known as the Normal Equations. Solving for $$\hat \beta$$ gives the analytical solution to the Ordinary Least Squares problem.

$\hat \beta = (X^TX)^{-1}X^Ty$

…and voila!

# Notes

This post is a refactor of content with the same title originally posted on The Clever Machine Wordpress blog.