SVM with linear features
This post derives a result from the 2010 PAMI paper by Ashraf, Lucey and Chen, "Reinterpreting the Application of Gabor Filters as a Manipulation of the Margin in Linear Support Vector Machines". Thanks to Mark Cox for discussion.
The support vector machine optimisation problem is
\[\min_{w} \left\{ \frac{1}{2} \|w\|^{2} + C \sum_{i} \max(0, 1 - y_{i} \phi(x_{i})^{T} w) \right\} .\]Here we consider the case where the feature transform is linear \(\phi(x) = A x\), such that \(w^{T} \phi(x) = \langle w, A x \rangle = \langle A^{T} w, x \rangle\). This can also be expressed as the constrained optimisation problem
\[\begin{aligned} \min_{w, z} \; & \left\{ \frac{1}{2} \|w\|^{2} + C \sum_{i} \max(0, 1 - y_{i} x_{i}^{T} z) \right\} \\ \text{s.t.} \; & z = A^{T} w \enspace . \end{aligned}\]We now look to eliminate \(w\) from this problem. There are a few different cases depending on the shape and rank of \(A\).
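As a quick sanity check of the identity \(w^{T} \phi(x) = \langle A^{T} w, x \rangle\), here is a minimal numpy sketch (the dimensions and random data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
A = rng.standard_normal((m, n))   # linear feature transform phi(x) = A x
x = rng.standard_normal(n)        # raw example
w = rng.standard_normal(m)        # weights in feature space

score_feature_space = w @ (A @ x)   # <w, A x>
score_input_space = (A.T @ w) @ x   # <A^T w, x>, i.e. z = A^T w
assert np.isclose(score_feature_space, score_input_space)
```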
Higher dimension, full rank
Let’s assume that \(A\) is \(m \times n\) where \(m \ge n\) and the rank of \(A\) is \(n\) (the maximum possible). This is the case where \(A\) increases the dimension of the examples without losing any information. The constraint \(z = A^{T} w\) implies that
\[z \in \operatorname{col}(A^{T}) = \mathbb{R}^{n}\]which, importantly, means that \(z\) itself need not be constrained. However, for a given \(z\) there will be many \(w\) that satisfy the constraint.
To resolve the ambiguity of \(w\), we examine the regularisation term. First, let’s decompose \(A = U S V^{T}\) into its “thin” SVD, where \(U\) is \(m \times n\), \(V\) is \(n \times n\) and \(S\) is \(n \times n\) and diagonal. We have \(U^{T} U = I_{n}\) and \(V^{T} V = V V^{T} = I_{n}\). Then
\[\begin{aligned} z & = A^{T} w = V S U^{T} w \\ S^{-1} V^{T} z & = U^{T} w \enspace . \end{aligned}\]All solutions for \(w\) can be parametrised as
\[w = U S^{-1} V^{T} z + U_{\bot} a\]for arbitrary \(a\), where \(U_{\bot}\) is \(m \times (m - n)\) and its columns form an orthonormal basis for \(\operatorname{null}(A^{T})\), satisfying \(U^{T} U_{\bot} = 0\) and \(U_{\bot}^{T} U_{\bot} = I_{m-n}\). Note that \(U S^{-1} V^{T} = (A^{T})^{\dagger}\) is the pseudo-inverse.
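To make the parametrisation concrete, here is a numpy sketch (illustrative dimensions; \(U_{\bot}\) is read off the full SVD) verifying that every such \(w\) satisfies the constraint:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
A = rng.standard_normal((m, n))      # m >= n, rank n with probability 1
z = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(A, full_matrices=True)
U_thin, U_perp = U[:, :n], U[:, n:]  # columns of U_perp span null(A^T)

w_min = U_thin @ ((Vt @ z) / s)      # U S^{-1} V^T z = (A^T)^+ z
assert np.allclose(w_min, np.linalg.pinv(A.T) @ z)

for _ in range(3):                   # any choice of a gives a feasible w
    w = w_min + U_perp @ rng.standard_normal(m - n)
    assert np.allclose(A.T @ w, z)
```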
Substituting this parametrisation into the regularisation term, we see
\[w^{T} w = z^{T} V S^{-2} V^{T} z + a^{T} a = z^{T} (A^{T} A)^{-1} z + a^{T} a\]since the cross terms vanish by \(U^{T} U_{\bot} = 0\). The objective then becomes separable:
\[\min_{z} \left\{ \frac{1}{2} \|z\|^{2}_{(A^{T} A)^{-1}} + C \sum_{i} \max(0, 1 - y_{i} x_{i}^{T} z) \right\} + \min_{a} \frac{1}{2} \|a\|^{2} \enspace .\]The second minimisation is trivially solved by \(a = 0\). Thus the solution is the same as the original SVM, except that the margin is measured in a weighted space.
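The decomposition of the regulariser is easy to verify numerically. A sketch (illustrative data) checking \(w^{T} w = z^{T} (A^{T} A)^{-1} z + a^{T} a\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
A = rng.standard_normal((m, n))
z = rng.standard_normal(n)
a = rng.standard_normal(m - n)

U, s, Vt = np.linalg.svd(A, full_matrices=True)
w = U[:, :n] @ ((Vt @ z) / s) + U[:, n:] @ a   # w for the given (z, a)

lhs = w @ w
rhs = z @ np.linalg.solve(A.T @ A, z) + a @ a  # weighted norm plus ||a||^2
assert np.isclose(lhs, rhs)
```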
Lower dimension, full rank
When \(m < n\) things are not so simple. The constraint \(z = A^{T} w\) will restrict
\[z \in \operatorname{col}(A^{T}) \subset \mathbb{R}^{n} \enspace .\]It is still possible to make the substitution \(w^{T} w = z^{T} V S^{-2} V^{T} z\), where the thin SVD \(A = U S V^{T}\) now has \(U\) and \(S\) of size \(m \times m\) and \(V\) of size \(n \times m\). However, it is no longer the case that \(V S^{-2} V^{T} = (A^{T} A)^{-1}\), since \(A^{T} A\) is singular. Instead we must use the norm weighted by the pseudo-inverse \((A^{T} A)^{\dagger} = A^{\dagger} (A^{\dagger})^{T} = A^{T} (A A^{T})^{-2} A\) in the constrained optimisation problem
\[\begin{aligned} \min_{z} \; & \frac{1}{2} \|z\|^{2}_{(A^{T} A)^{\dagger}} + C \sum_{i} \max(0, 1 - y_{i} x_{i}^{T} z) \\ \text{s.t.} \; & z \in \operatorname{col}(A^{T}) \enspace . \end{aligned}\]This might be considered a weighted margin where some of the eigenvalues are infinite, confining the solution to a subspace.
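The pseudo-inverse identities used above can likewise be checked numerically. A sketch assuming \(m < n\) and full row rank:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.standard_normal((m, n))       # m < n, rank m with probability 1

# (A^T A)^+ = A^+ (A^+)^T = A^T (A A^T)^{-2} A
AAT_inv = np.linalg.inv(A @ A.T)
G = A.T @ AAT_inv @ AAT_inv @ A
assert np.allclose(G, np.linalg.pinv(A.T @ A))

# For z in col(A^T), w = (A^T)^+ z is unique, and the regulariser
# equals the weighted norm: w^T w = z^T (A^T A)^+ z.
w = rng.standard_normal(m)
z = A.T @ w                           # a point in col(A^T)
assert np.isclose(w @ w, z @ G @ z)
```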