Ridge regression is the name given to least-squares regression with squared Euclidean norm regularisation added. Given $n$ example vectors $x_{i}$ of dimension $m$ with scalar labels $y_{i}$, the problem is expressed as finding the weight vector $w$ and scalar bias $b$ which minimise the objective function

$f(w, b) = \frac{1}{2} \sum_{i = 1}^{n} \left( x_{i}^{T} w + b - y_{i} \right)^{2} + \frac{\lambda}{2} \| w \|^{2} .$

### Eliminating the bias

Setting the derivative of $f$ with respect to $b$ to zero yields

\begin{aligned} \frac{\partial f}{\partial b}(w, b) = \sum_{i = 1}^{n} \left( x_{i}^{T} w + b - y_{i} \right) & = 0, & b & = \bar{y} - \bar{x}^{T} w \end{aligned}

and therefore the problem is to find the minimiser of

$h(w) = f(w, b(w)) = \frac{1}{2} \sum_{i = 1}^{n} \left[ (x_{i} - \bar{x})^{T} w - (y_{i} - \bar{y}) \right]^{2} + \frac{\lambda}{2} \| w \|^{2} .$
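As a quick numerical check of the bias formula, the following sketch (NumPy, with illustrative names and random data) verifies that $b = \bar{y} - \bar{x}^{T} w$ makes the derivative of $f$ with respect to $b$ vanish for an arbitrary fixed $w$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 10                       # feature dimension and number of examples (arbitrary)
X = rng.normal(size=(m, n))        # columns are the example vectors x_i
y = rng.normal(size=n)             # labels
w = rng.normal(size=m)             # an arbitrary fixed weight vector

b_star = y.mean() - X.mean(axis=1) @ w   # b = ybar - xbar^T w

# df/db = sum_i (x_i^T w + b - y_i); it should vanish at b_star
# (the regulariser is omitted since it does not involve b)
residual_sum = np.sum(X.T @ w + b_star - y)
```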

From this point on we will assume that the example vectors and the labels have been pre-processed to have zero mean, leading to the simplified form

$h(w) = \frac{1}{2} \sum_{i = 1}^{n} \left( x_{i}^{T} w - y_{i} \right)^{2} + \frac{\lambda}{2} \| w \|^{2} .$

Let $X$ be the $m \times n$ matrix whose columns are the example vectors and $y$ the vector of corresponding labels, so that the objective becomes $h(w) = \frac{1}{2} \| X^{T} w - y \|^{2} + \frac{\lambda}{2} \| w \|^{2}$.
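The sum and matrix forms of the objective are the same quantity, which a short sketch (illustrative names and random data) can confirm:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 3, 10, 0.5             # dimensions and regularisation strength (arbitrary)
X = rng.normal(size=(m, n))        # columns are the example vectors x_i
y = rng.normal(size=n)
w = rng.normal(size=m)

# Sum form: (1/2) sum_i (x_i^T w - y_i)^2 + (lambda/2) ||w||^2
h_sum = 0.5 * sum((X[:, i] @ w - y[i]) ** 2 for i in range(n)) + 0.5 * lam * (w @ w)

# Matrix form: (1/2) ||X^T w - y||^2 + (lambda/2) ||w||^2
h_mat = 0.5 * np.sum((X.T @ w - y) ** 2) + 0.5 * lam * (w @ w)
```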

### Solving for the weights in the primal

Expanding the square and dropping the constant term $\frac{1}{2} \| y \|^{2}$, which does not depend on $w$, the problem above can be re-written as

$\arg\min_{w} \left[ \frac{1}{2} w^{T} (S + \lambda I) w - w^{T} X y \right]$

where $S = X X^{T}$ is the $m \times m$ (unnormalised) covariance matrix of the zero-mean examples. The solution to this unconstrained quadratic program is simply $w = (S + \lambda I)^{-1} X y$.
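A minimal sketch of the primal solution, assuming NumPy and random data, solving the $m \times m$ linear system directly rather than forming an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 3, 10, 0.5
X = rng.normal(size=(m, n))
y = rng.normal(size=n)

S = X @ X.T                                        # m x m matrix S = X X^T
w = np.linalg.solve(S + lam * np.eye(m), X @ y)    # w = (S + lambda I)^{-1} X y

# The gradient of the quadratic objective should vanish at the solution
grad = (S + lam * np.eye(m)) @ w - X @ y
```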

### The dual problem

By introducing the residual vector $r$, the problem can be converted into the equivalent constrained minimisation

$\arg\min_{w, r} \left[ \frac{1}{2} \| r \|^{2} + \frac{\lambda}{2} \| w \|^{2} \right] \text{ subject to } r = X^{T} w - y$

whose Lagrangian is

$L(w, r, \alpha) = \frac{1}{2} \| r \|^{2} + \frac{\lambda}{2} \| w \|^{2} + \alpha^{T} (r - X^{T} w + y) .$

Setting derivatives with respect to the primal variables to zero, we obtain

\begin{aligned} \frac{\partial L}{\partial w}(w, r, \alpha) & = \lambda w - X \alpha = 0, & w &= \frac{1}{\lambda} X \alpha \\ \frac{\partial L}{\partial r}(w, r, \alpha) & = r + \alpha = 0, & r &= -\alpha . \end{aligned}

Making these substitutions to eliminate $r$ and $w$ gives the dual function

\begin{aligned} g(\alpha) & = L(w(\alpha), r(\alpha), \alpha) \\ & = \frac{1}{2} \| \alpha \|^{2} + \frac{1}{2 \lambda} \| X \alpha \|^{2} + \alpha^{T} \left( -\alpha - \frac{1}{\lambda} X^{T} X \alpha + y \right) \\ & = -\frac{1}{2} \| \alpha \|^{2} - \frac{1}{2 \lambda} \| X \alpha \|^{2} + \alpha^{T} y . \end{aligned}

The dual problem is to maximise $g$; negating and scaling by $\lambda$ turns this into the equivalent minimisation

$\arg\min_{\alpha} \left[ \frac{1}{2} \alpha^{T} (K + \lambda I) \alpha - \lambda \alpha^{T} y \right]$

where $K = X^{T} X$ is the $n \times n$ kernel matrix. The solution is $\alpha = \lambda (K + \lambda I)^{-1} y$, from which $w = \frac{1}{\lambda} X \alpha = X (K + \lambda I)^{-1} y$.
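The dual solution can be sketched in the same way (NumPy, random data; the stationarity conditions above also imply $r = X^{T} w - y = -\alpha$ at the optimum, which the sketch checks):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 3, 10, 0.5
X = rng.normal(size=(m, n))
y = rng.normal(size=n)

K = X.T @ X                                             # n x n kernel matrix K = X^T X
alpha = lam * np.linalg.solve(K + lam * np.eye(n), y)   # alpha = lambda (K + lambda I)^{-1} y
w = (X @ alpha) / lam                                   # w = (1/lambda) X alpha
```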

### Primal vs dual

We now have two equivalent expressions for the solution, one using the $m \times m$ covariance matrix and the other the $n \times n$ kernel matrix; their equality also follows from the identity $(X X^{T} + \lambda I) X = X (X^{T} X + \lambda I)$. The primal form requires solving an $m \times m$ system and the dual an $n \times n$ system, so the cheaper choice depends on whether there are more features or more examples.

$w = (S + \lambda I)^{-1} X y = X (K + \lambda I)^{-1} y$
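The equivalence of the two formulas is easy to confirm numerically; a sketch with illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, lam = 3, 10, 0.5
X = rng.normal(size=(m, n))
y = rng.normal(size=n)

w_primal = np.linalg.solve(X @ X.T + lam * np.eye(m), X @ y)   # (S + lambda I)^{-1} X y
w_dual = X @ np.linalg.solve(X.T @ X + lam * np.eye(n), y)     # X (K + lambda I)^{-1} y
```

Here the primal solve involves a $3 \times 3$ system and the dual a $10 \times 10$ one, so with more examples than features the primal form is the cheaper of the two.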