The least-squares solution to a system of linear equations is a function of the parameters of the problem

This note is interested in determining the derivatives

Let the dimensions of $A$ be $m \times n$. It is assumed that $m \ge n$ and that $A$ is full rank, that is, its rank is $n$.

First, some notation. If $f(x)$ is a function that maps $\mathbb{R}^{n} \to \mathbb{R}^{m}$, then $\partial f / \partial x$ is a function that maps $x \in \mathbb{R}^{n}$ to the $m \times n$ matrix that defines the linear approximation at $x$

To compute the derivative of a function $f(X)$ with respect to a matrix-valued argument $X$, the matrix will be treated as a vector containing the same elements. We will use lower-case to denote a vectorised variable $x = \operatorname{vec}(X)$. If $f(X)$ is a function that maps an $m \times n$ matrix to a vector of length $p$, then its linear approximation is

Case 1: Square Matrix

If $m = n$, then $x^{\star}(A, b) = A^{-1} b$. The derivative with respect to $b$ is trivial

The derivative with respect to $A$ can be obtained using the vectorisation identity

and the derivative of a matrix inverse

to give the derivative

Matrix-vector products can be computed without constructing the explicit matrices according to

These expressions involving $A^{-1}$ should be computed using a cached factorisation for numerical stability (LU decomposition in the most general case).

Case 2: Rectangular Matrix

If $m > n$, then $x^{\star}(A, b) = (A^{T} A)^{-1} A^{T} b$.

The derivative with respect to $b$ is again trivial

To obtain the derivative with respect to $A$, however, it is necessary to introduce the product rule for matrix-valued functions. If $H(x) = F(x) G(x)$, where $F(x)$ is $p \times r$, $G(x)$ is $r \times q$ and $x$ is a vector of length $n$, then

and the product rule is

Applying this to the expression for $x^{\star}(A, b)$ gives

Introduce $J_{m n}$ to denote the operator such that $\operatorname{vec}(A^{T}) = J_{m n} \operatorname{vec}(A)$ for an $m \times n$ matrix $A$.

Using the derivative of the Gram matrix

it is possible to obtain the derivative of its inverse

The overall derivative is then

Matrix-vector products with this expression are

This reveals a simpler form for the derivative

For numerical stability, this should be computed using QR factorisations.

The complete source for this experiment can be found on Github.