Ridge regression is least-squares regression with squared Euclidean norm regularisation added to the objective. Given \(n\) example vectors \(x_{i}\) of dimension \(m\) with scalar labels \(y_{i}\), the problem is to find the weight vector \(w\) and scalar bias \(b\) which minimise the objective function

\[f(w, b) = \frac{1}{2} \sum_{i = 1}^{n} \left( x_{i}^{T} w + b - y_{i} \right)^{2} + \frac{\lambda}{2} \| w \|^{2} .\]
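For concreteness, the objective can be evaluated directly in code. The sketch below assumes NumPy, takes the examples as a sequence of vectors, and uses illustrative names (ridge_objective, lam) that are not from the text.

```python
import numpy as np

def ridge_objective(w, b, xs, ys, lam):
    """f(w, b) = 1/2 * sum_i (x_i^T w + b - y_i)^2 + lam/2 * ||w||^2."""
    residuals = np.array([x @ w + b - y for x, y in zip(xs, ys)])
    return 0.5 * residuals @ residuals + 0.5 * lam * w @ w
```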

Eliminating the bias

Setting the derivative of \(f\) with respect to \(b\) to zero yields

\[\begin{aligned} \frac{\partial f}{\partial b}(w, b) = \sum_{i = 1}^{n} \left( x_{i}^{T} w + b - y_{i} \right) & = 0, & b & = \bar{y} - \bar{x}^{T} w \end{aligned}\]
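Once \(w\) is known, the bias therefore follows from the sample means alone. A minimal sketch, assuming NumPy and an illustrative helper name:

```python
import numpy as np

def optimal_bias(w, xs, ys):
    """b = y_bar - x_bar^T w, the value that zeroes df/db for a fixed w."""
    x_bar = np.mean(xs, axis=0)  # mean example vector
    y_bar = np.mean(ys)          # mean label
    return y_bar - x_bar @ w
```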

and therefore the problem is to find the minimiser of

\[h(w) = f(w, b(w)) = \frac{1}{2} \sum_{i = 1}^{n} \left[ (x_{i} - \bar{x})^{T} w - (y_{i} - \bar{y}) \right]^{2} + \frac{\lambda}{2} \| w \|^{2} .\]

From this point on we will assume that the example vectors and the labels have been pre-processed to have zero mean, leading to the simplified form

\[h(w) = \frac{1}{2} \sum_{i = 1}^{n} \left( x_{i}^{T} w - y_{i} \right)^{2} + \frac{\lambda}{2} \| w \|^{2} .\]

Let \(X\) be the \(m \times n\) matrix whose columns are the example vectors and \(y\) the vector of corresponding labels, so that the objective can be written \(h(w) = \frac{1}{2} \| X^{T} w - y \|^{2} + \frac{\lambda}{2} \| w \|^{2}\).
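In code, the centring step and the matrix form of the objective might look as follows; this is a sketch assuming NumPy, with X stored as an \(m \times n\) array of column examples as in the text, and with function names chosen purely for illustration.

```python
import numpy as np

def centre(X, y):
    """Subtract the mean example from each column of X and the mean label from y."""
    return X - X.mean(axis=1, keepdims=True), y - y.mean()

def h(w, X, y, lam):
    """h(w) = 1/2 ||X^T w - y||^2 + lam/2 ||w||^2 on centred data."""
    r = X.T @ w - y
    return 0.5 * r @ r + 0.5 * lam * w @ w
```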

Solving for the weights in the primal

The problem above can be rewritten, dropping terms that do not depend on \(w\), as

\[\arg\min_{w} \left[ \frac{1}{2} w^{T} (S + \lambda I) w - w^{T} X y \right]\]

where \(S = X X^{T}\) is the \(m \times m\) covariance matrix. The solution to this unconstrained quadratic program, obtained by setting the gradient to zero, is \(w = (S + \lambda I)^{-1} X y\).
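A direct implementation of the primal solution, again assuming NumPy: solving the linear system is used rather than forming the inverse explicitly, which is the standard numerically preferable choice.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (S + lam*I) w = X y, where S = X X^T is m x m."""
    m = X.shape[0]
    S = X @ X.T
    return np.linalg.solve(S + lam * np.eye(m), X @ y)
```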

The dual problem

The problem can be converted into a constrained minimisation problem

\[\arg\min_{w, r} \left[ \frac{1}{2} \| r \|^{2} + \frac{\lambda}{2} \| w \|^{2} \right] \text{ subject to } r = X^{T} w - y\]

whose Lagrangian is

\[L(w, r, \alpha) = \frac{1}{2} \| r \|^{2} + \frac{\lambda}{2} \| w \|^{2} + \alpha^{T} (r - X^{T} w + y) .\]

Setting derivatives with respect to the primal variables to zero, we obtain

\[\begin{aligned} \frac{\partial L}{\partial w}(w, r, \alpha) & = \lambda w - X \alpha = 0, & w &= \frac{1}{\lambda} X \alpha \\ \frac{\partial L}{\partial r}(w, r, \alpha) & = r + \alpha = 0, & r &= -\alpha . \end{aligned}\]

Making these substitutions to eliminate \(r\) and \(w\) gives the dual function

\[\begin{aligned} g(\alpha) & = L(w(\alpha), r(\alpha), \alpha) \\ & = \frac{1}{2} \| \alpha \|^{2} + \frac{1}{2 \lambda} \| X \alpha \|^{2} + \alpha^{T} \left( -\alpha - \frac{1}{\lambda} X^{T} X \alpha + y \right) \\ & = -\frac{1}{2} \| \alpha \|^{2} - \frac{1}{2 \lambda} \| X \alpha \|^{2} + \alpha^{T} y . \end{aligned}\]
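Since \(g(\alpha)\) is the unconstrained minimum of the Lagrangian over \((w, r)\), it lower-bounds \(h(w)\) for every \(\alpha\) (weak duality). The small numerical sketch below, using NumPy and random data, checks this bound; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 5, 20, 0.1
X = rng.standard_normal((m, n))   # examples as columns
y = rng.standard_normal(n)

def h(w):                         # primal objective on centred data
    r = X.T @ w - y
    return 0.5 * r @ r + 0.5 * lam * w @ w

def g(alpha):                     # dual function
    return -0.5 * alpha @ alpha - (X @ alpha) @ (X @ alpha) / (2 * lam) + alpha @ y

# Weak duality: g(alpha) <= h(w) for any alpha and any w.
assert all(g(rng.standard_normal(n)) <= h(rng.standard_normal(m)) for _ in range(1000))
```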

The dual problem is to maximise \(g(\alpha)\), which, after negating and scaling by \(\lambda\), is equivalent to

\[\arg\min_{\alpha} \left[ \frac{1}{2} \alpha^{T} (K + \lambda I) \alpha - \lambda \alpha^{T} y \right]\]

where \(K = X^{T} X\) is the \(n \times n\) kernel matrix. Setting the gradient to zero gives \(\alpha = \lambda (K + \lambda I)^{-1} y\), and hence \(w = \frac{1}{\lambda} X \alpha = X (K + \lambda I)^{-1} y\).
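The dual route in code, again a sketch assuming NumPy with illustrative names; it returns both the recovered weights and the dual variables.

```python
import numpy as np

def ridge_dual(X, y, lam):
    """Solve (K + lam*I) alpha = lam*y with K = X^T X (n x n), then w = X alpha / lam."""
    n = X.shape[1]
    K = X.T @ X
    alpha = lam * np.linalg.solve(K + lam * np.eye(n), y)
    return X @ alpha / lam, alpha
```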

Primal vs dual

We now have two equivalent solutions, one using the covariance matrix and the other the kernel matrix.

\[w = (S + \lambda I)^{-1} X y = X (K + \lambda I)^{-1} y\]
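A quick numerical check of this identity, assuming NumPy and random data; the two routes agree to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 10, 50, 0.5
X = rng.standard_normal((m, n))   # examples as columns
y = rng.standard_normal(n)

w_primal = np.linalg.solve(X @ X.T + lam * np.eye(m), X @ y)   # m x m solve
w_dual = X @ np.linalg.solve(X.T @ X + lam * np.eye(n), y)     # n x n solve

assert np.allclose(w_primal, w_dual)
```

Since the primal form requires an \(m \times m\) solve and the dual an \(n \times n\) solve, the cheaper route depends on whether the number of features exceeds the number of examples.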