Ridge regression is the name given to least-squares regression with squared Euclidean norm regularisation added. Given $n$ example vectors $x_{i}$ of dimension $m$ with scalar labels $y_{i}$, the problem is to find the weight vector $w$ and scalar bias $b$ that minimise the objective function

$$f(w, b) = \frac{1}{2} \sum_{i=1}^{n} \left( w^{T} x_{i} + b - y_{i} \right)^{2} + \frac{1}{2} \lambda \| w \|^{2},$$

where $\lambda > 0$ is the regularisation parameter.
### Eliminating the bias

Setting the derivative of $f$ with respect to $b$ to zero yields

$$\sum_{i=1}^{n} \left( w^{T} x_{i} + b - y_{i} \right) = 0 \quad \Longrightarrow \quad b = \bar{y} - w^{T} \bar{x}, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}, \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_{i},$$
and therefore the problem is to find the minimiser of

$$g(w) = \frac{1}{2} \sum_{i=1}^{n} \left( w^{T} (x_{i} - \bar{x}) - (y_{i} - \bar{y}) \right)^{2} + \frac{1}{2} \lambda \| w \|^{2}.$$
From this point on we will assume that the example vectors and the labels have been pre-processed to have zero mean, leading to the simplified form

$$g(w) = \frac{1}{2} \sum_{i=1}^{n} \left( w^{T} x_{i} - y_{i} \right)^{2} + \frac{1}{2} \lambda \| w \|^{2}.$$
Let us introduce matrix notation: $X$ is the $m \times n$ matrix whose columns are the example vectors, and $y$ is the vector of corresponding labels. The objective then becomes $h(w) = \frac{1}{2} \| X^{T} w - y \|^{2} + \frac{1}{2} \lambda \| w \|^{2}$.
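As a concrete sketch of this setup (the variable names, sizes, and toy data below are my own choices, not from the text), the zero-mean preprocessing and the objective $h$ can be written in NumPy as:

```python
import numpy as np

# Hypothetical toy data: m = 3 features, n = 50 examples.
# Following the text's convention, columns of X are the example vectors.
rng = np.random.default_rng(0)
m, n, lam = 3, 50, 0.1
X_raw = rng.standard_normal((m, n)) + 2.0  # deliberately non-zero mean
y_raw = rng.standard_normal(n) + 5.0

# Zero-mean preprocessing; the bias can be recovered afterwards
# as b = y_bar - w^T x_bar.
x_bar = X_raw.mean(axis=1, keepdims=True)
y_bar = y_raw.mean()
X = X_raw - x_bar
y = y_raw - y_bar

def h(w):
    """Objective h(w) = 0.5 * ||X^T w - y||^2 + 0.5 * lam * ||w||^2."""
    return 0.5 * np.sum((X.T @ w - y) ** 2) + 0.5 * lam * np.sum(w ** 2)
```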

### Solving for the weights in the primal

The problem above can be re-written as

$$h(w) = \frac{1}{2} w^{T} \left( S + \lambda I \right) w - w^{T} X y + \frac{1}{2} \| y \|^{2},$$

where $S = X X^{T}$ is the $m \times m$ (unnormalised) covariance matrix. The solution to this unconstrained quadratic program is simply $w = (S + \lambda I)^{-1} X y$.
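A minimal NumPy sketch of the primal solve (the toy data and names are assumptions for illustration, not from the text):

```python
import numpy as np

# Hypothetical toy problem: m = 3 features, n = 50 zero-mean examples.
rng = np.random.default_rng(0)
m, n, lam = 3, 50, 0.1
X = rng.standard_normal((m, n))
X -= X.mean(axis=1, keepdims=True)
y = rng.standard_normal(n)
y -= y.mean()

# Primal solution w = (S + lambda I)^{-1} X y, with S = X X^T of size m x m.
# Solving the linear system is preferable to forming the inverse explicitly.
S = X @ X.T
w_primal = np.linalg.solve(S + lam * np.eye(m), X @ y)
```

Note the only matrix ever inverted (or factorised) here is $m \times m$, independent of the number of examples.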

### The dual problem

The problem can be converted into the constrained minimisation problem

$$\min_{w, r} \; \frac{1}{2} \| r \|^{2} + \frac{1}{2} \lambda \| w \|^{2} \quad \text{subject to} \quad r = X^{T} w - y,$$
whose Lagrangian is

$$L(w, r, \alpha) = \frac{1}{2} \| r \|^{2} + \frac{1}{2} \lambda \| w \|^{2} - \alpha^{T} \left( X^{T} w - y - r \right).$$
Setting derivatives with respect to the primal variables to zero, we obtain

$$\frac{\partial L}{\partial r} = r + \alpha = 0 \;\Longrightarrow\; r = -\alpha, \qquad \frac{\partial L}{\partial w} = \lambda w - X \alpha = 0 \;\Longrightarrow\; w = \frac{1}{\lambda} X \alpha.$$
Making these substitutions to eliminate $r$ and $w$ gives the dual function

$$q(\alpha) = \alpha^{T} y - \frac{1}{2} \| \alpha \|^{2} - \frac{1}{2 \lambda} \alpha^{T} K \alpha = \alpha^{T} y - \frac{1}{2 \lambda} \alpha^{T} \left( K + \lambda I \right) \alpha,$$
and the dual problem is

$$\max_{\alpha} \; \alpha^{T} y - \frac{1}{2 \lambda} \alpha^{T} \left( K + \lambda I \right) \alpha,$$

where $K = X^{T} X$ is the $n \times n$ kernel matrix. Setting the gradient to zero gives $\alpha = \lambda (K + \lambda I)^{-1} y$, and then $w = \frac{1}{\lambda} X \alpha = X (K + \lambda I)^{-1} y$.
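The dual route can be sketched the same way (the same hypothetical toy data as above); note that it only ever factorises the $n \times n$ matrix $K + \lambda I$:

```python
import numpy as np

# Same hypothetical toy problem: m = 3 features, n = 50 zero-mean examples.
rng = np.random.default_rng(0)
m, n, lam = 3, 50, 0.1
X = rng.standard_normal((m, n))
X -= X.mean(axis=1, keepdims=True)
y = rng.standard_normal(n)
y -= y.mean()

# Dual solution: alpha = lambda (K + lambda I)^{-1} y with K = X^T X,
# then the weights are recovered as w = (1/lambda) X alpha.
K = X.T @ X
alpha = lam * np.linalg.solve(K + lam * np.eye(n), y)
w_dual = X @ alpha / lam
```

As a sanity check, $\alpha$ equals the residual $y - X^{T} w$ at the optimum, which follows from $r = -\alpha$ in the stationarity conditions.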

### Primal vs dual

We now have two equivalent expressions for the solution: the primal form $w = (S + \lambda I)^{-1} X y$ requires solving an $m \times m$ system in the covariance matrix, while the dual form $w = X (K + \lambda I)^{-1} y$ requires solving an $n \times n$ system in the kernel matrix. The primal is therefore cheaper when the dimension $m$ is smaller than the number of examples $n$, and the dual is cheaper when $n < m$; the dual also accesses the data only through inner products, which is what the kernel matrix name alludes to.
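The equivalence is easy to check numerically (toy sizes below are my own choice): it rests on the matrix identity $(X X^{T} + \lambda I)^{-1} X = X (X^{T} X + \lambda I)^{-1}$, so both routes return the same weight vector.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 4, 200, 0.5   # here n >> m, so the primal solve is much cheaper
X = rng.standard_normal((m, n))
y = rng.standard_normal(n)

# Primal: solve an m x m system in the covariance matrix S = X X^T.
w_primal = np.linalg.solve(X @ X.T + lam * np.eye(m), X @ y)
# Dual: solve an n x n system in the kernel matrix K = X^T X.
w_dual = X @ np.linalg.solve(X.T @ X + lam * np.eye(n), y)

print(np.allclose(w_primal, w_dual))
```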