Derivative of linear least-squares minimiser
The least-squares solution to a system of linear equations is a function of the parameters of the problem
\[x^{\star}(A, b) = \arg\min_{x} \|A x - b\|^{2} \enspace .\]This note is concerned with determining the derivatives
\[\frac{\partial x^{\star}}{\partial A}(A, b) \quad \text{and} \quad \frac{\partial x^{\star}}{\partial b}(A, b) \enspace .\]Let the dimensions of $A$ be $m \times n$. It is assumed that $m \ge n$ and that $A$ has full column rank, that is, its rank is $n$.
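As a concrete illustration of the problem (a minimal NumPy sketch; the variable names are illustrative and not taken from the source referenced at the end), the minimiser can be computed numerically and checked against the first-order optimality condition $A^{T}(A x^{\star} - b) = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))   # full-rank with probability one
b = rng.standard_normal(m)

# Least-squares minimiser of ||A x - b||^2.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# At the minimiser the gradient of ||A x - b||^2 vanishes: 2 A^T (A x* - b) = 0.
assert np.allclose(A.T @ (A @ x_star - b), 0.0, atol=1e-8)
```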
First, some notation. If $f(x)$ is a function that maps $\mathbb{R}^{n} \to \mathbb{R}^{m}$, then $\partial f / \partial x$ is a function that maps $x \in \mathbb{R}^{n}$ to the $m \times n$ matrix that defines the linear approximation at $x$
\[f(x + \delta) \approx f(x) + \left(\frac{\partial f}{\partial x}(x)\right) \delta \enspace .\]To compute the derivative of a function $f(X)$ with respect to a matrix-valued argument $X$, the matrix will be treated as a vector containing the same elements. We will use lower-case to denote a vectorised variable $x = \operatorname{vec}(X)$. If $f(X)$ is a function that maps an $m \times n$ matrix to a vector of length $p$, then its linear approximation is
\[f(X + \delta X) \approx f(X) + \underbrace{\left(\frac{\partial f}{\partial x}(X)\right)}_{p \times (m, n)} \operatorname{vec}(\delta X) \enspace .\]
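As a quick sanity check of this convention (a minimal NumPy sketch assuming the standard column-stacking $\operatorname{vec}$; all names are illustrative), the Jacobian of $f(X) = X c$ with respect to $\operatorname{vec}(X)$ can be built column by column with finite differences and compared against the Kronecker form $c^{T} \otimes I_{m}$ implied by $\operatorname{vec}(X c) = (c^{T} \otimes I_{m}) \operatorname{vec}(X)$:

```python
import numpy as np

def vec(X):
    # Column-stacking vectorisation (Fortran order), matching the Kronecker identities.
    return X.reshape(-1, order="F")

def jacobian_wrt_matrix(f, X, eps=1e-6):
    # Finite-difference approximation of the p x (m*n) Jacobian of f with respect to vec(X).
    m, n = X.shape
    cols = []
    for j in range(m * n):
        dX = np.zeros(m * n)
        dX[j] = eps
        dX = dX.reshape(m, n, order="F")
        cols.append((f(X + dX) - f(X - dX)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
m, n = 4, 3
X = rng.standard_normal((m, n))
c = rng.standard_normal(n)

f = lambda Y: Y @ c                      # maps an m x n matrix to a vector of length p = m
J_numeric = jacobian_wrt_matrix(f, X)    # shape (m, m * n)
J_exact = np.kron(c, np.eye(m))          # kron(c^T, I_m), from vec(Xc) = (c^T kron I_m) vec(X)
assert np.allclose(J_numeric, J_exact, atol=1e-6)
```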
Case 1: Square Matrix
If $m = n$, then $x^{\star}(A, b) = A^{-1} b$. The derivative with respect to $b$ is trivial
\[\frac{\partial x^{\star}}{\partial b}(A, b) = A^{-1} \enspace .\]The derivative with respect to $A$ can be obtained using the vectorisation identity
\[A^{-1} b = (b^{T} \otimes I_{n}) \operatorname{vec}(A^{-1})\]and the derivative of a matrix inverse
\[\frac{\partial}{\partial a} \operatorname{vec}(A^{-1}) = - A^{-T} \otimes A^{-1}\]to give the derivative
\[\frac{\partial x^{\star}}{\partial a}(A, b) = - (b^{T} \otimes I_{n}) (A^{-T} \otimes A^{-1}) = - (x^{\star})^{T} \otimes A^{-1} \enspace .\]Matrix-vector products can be computed without constructing the explicit matrices according to
\[\left[\frac{\partial x^{\star}}{\partial b}(A, b)\right] v = A^{-1} v\] \[\left[\frac{\partial x^{\star}}{\partial a}(A, b)\right] \operatorname{vec}(V) = - \left((x^{\star})^{T} \otimes A^{-1}\right) \operatorname{vec}(V) = - A^{-1} V x^{\star} \enspace .\]These expressions involving $A^{-1}$ should be evaluated using a cached factorisation of $A$ (an LU decomposition in the most general case) rather than an explicit inverse, for numerical stability.
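A minimal sketch of these two products (NumPy/SciPy; the names are illustrative), with the solve performed through a cached LU factorisation and both directional derivatives checked against central differences:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
u = rng.standard_normal(n)          # direction for the derivative with respect to b
V = rng.standard_normal((n, n))     # direction for the derivative with respect to A

lu = lu_factor(A)                   # cache the LU factorisation of A
x_star = lu_solve(lu, b)

dx_db_u = lu_solve(lu, u)             # [dx*/db] u      = A^{-1} u
dx_da_V = -lu_solve(lu, V @ x_star)   # [dx*/da] vec(V) = -A^{-1} V x*

# Finite-difference checks of both directional derivatives.
eps = 1e-6
fd_b = (np.linalg.solve(A, b + eps * u) - np.linalg.solve(A, b - eps * u)) / (2 * eps)
fd_A = (np.linalg.solve(A + eps * V, b) - np.linalg.solve(A - eps * V, b)) / (2 * eps)
assert np.allclose(dx_db_u, fd_b, atol=1e-6)
assert np.allclose(dx_da_V, fd_A, atol=1e-6)
```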
Case 2: Rectangular Matrix
If $m > n$, then $x^{\star}(A, b) = (A^{T} A)^{-1} A^{T} b$, the solution of the normal equations $A^{T} A \, x = A^{T} b$.
The derivative with respect to $b$ is again trivial
\[\frac{\partial x^{\star}}{\partial b}(A, b) = (A^{T} A)^{-1} A^{T} \enspace .\]To obtain the derivative with respect to $A$, however, it is necessary to introduce the product rule for matrix-valued functions. If $H(x) = F(x) G(x)$, where $F(x)$ is $p \times r$, $G(x)$ is $r \times q$ and $x$ is a vector of length $n$, then
\[h(x) = (G(x)^{T} \otimes I_{p}) f(x) = (I_{q} \otimes F(x)) g(x)\]and the product rule is
\[\underbrace{\frac{\partial h}{\partial x}(x)}_{(p, q) \times n} = \underbrace{(G(x)^{T} \otimes I_{p})}_{(p, q) \times (p, r)} \underbrace{\left(\frac{\partial f}{\partial x}(x)\right)}_{(p, r) \times n} + \underbrace{(I_{q} \otimes F(x))}_{(p, q) \times (r, q)} \underbrace{\left(\frac{\partial g}{\partial x}(x)\right)}_{(r, q) \times n} \enspace .\]Applying this to the expression for $x^{\star}(A, b)$ gives
\[\begin{align} \frac{\partial x^{\star}}{\partial a} (A, b) & = \frac{\partial}{\partial a} G(A)^{-1} (A^{T} b) \\ & = (b^{T} A \otimes I_{n}) \left(\frac{\partial}{\partial a} \operatorname{vec}(G(A)^{-1})\right) + G^{-1} \left(\frac{\partial}{\partial a} A^{T} b\right) \enspace . \end{align}\]Here $G(A) = A^{T} A$ denotes the Gram matrix, abbreviated $G$ where the argument is clear. Introduce $J_{m n}$ to denote the operator such that $\operatorname{vec}(A^{T}) = J_{m n} \operatorname{vec}(A)$ for an $m \times n$ matrix $A$.
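This $J_{m n}$ is the commutation matrix; a minimal NumPy sketch of the operator (illustrative names, assuming the column-stacking $\operatorname{vec}$):

```python
import numpy as np

def vec(X):
    # Column-stacking vectorisation.
    return X.reshape(-1, order="F")

def commutation_matrix(m, n):
    # J_{mn} such that J_{mn} @ vec(A) = vec(A.T) for an m x n matrix A.
    J = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            J[j + i * n, i + j * m] = 1.0
    return J

A = np.arange(6.0).reshape(3, 2)
assert np.allclose(commutation_matrix(3, 2) @ vec(A), vec(A.T))
```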
\[\frac{\partial}{\partial a} A^{T} b = \frac{\partial}{\partial a} (b^{T} \otimes I_{n}) \operatorname{vec}(A^{T}) = (b^{T} \otimes I_{n}) J_{m n} \enspace .\]Using the derivative of the Gram matrix
\[\frac{\partial g}{\partial a}(A) = \frac{\partial}{\partial a} \operatorname{vec}(A^{T} A) = (I_{n} \otimes A^{T}) + (A^{T} \otimes I_{n}) J_{m n}\]it is possible to obtain the derivative of its inverse
\[\begin{align} \frac{\partial}{\partial a} \operatorname{vec} (G(A)^{-1}) & = \left(\frac{\partial}{\partial g} \operatorname{vec}(G^{-1})\right) \left(\frac{\partial g}{\partial a} (A)\right) \\ & = - \left[G^{-T} \otimes G^{-1}\right] \left[(I_{n} \otimes A^{T}) + (A^{T} \otimes I_{n}) J_{m n}\right] \enspace . \end{align}\]The overall derivative is then
\[\begin{align} \frac{\partial x^{\star}}{\partial a} (A, b) & = - (b^{T} A \otimes I_{n}) \left[G^{-T} \otimes G^{-1}\right] \left[(I_{n} \otimes A^{T}) + (A^{T} \otimes I_{n}) J_{m n}\right] + G^{-1} (b^{T} \otimes I_{n}) J_{m n} \\ & = - [(x^{\star})^{T} \otimes G^{-1}] \left[(I_{n} \otimes A^{T}) + (A^{T} \otimes I_{n}) J_{m n}\right] + G^{-1} (b^{T} \otimes I_{n}) J_{m n} \\ & = - [(x^{\star})^{T} \otimes G^{-1} A^{T}] - [(A x^{\star})^{T} \otimes G^{-1}] J_{m n} + G^{-1} (b^{T} \otimes I_{n}) J_{m n} \enspace . \end{align}\]Matrix-vector products with this expression are
\[\begin{align} \left[\frac{\partial x^{\star}}{\partial a} (A, b)\right] \operatorname{vec}(V) & = - G^{-1} A^{T} V x^{\star} - G^{-1} V^{T} A x^{\star} + G^{-1} V^{T} b \\ & = - (A^{T} A)^{-1} [A^{T} V x^{\star} + V^{T} (A x^{\star} - b)] \enspace . \end{align}\]This reveals a simpler form for the derivative
\[\begin{align} \frac{\partial x^{\star}}{\partial a} (A, b) & = -(x^{\star})^{T} \otimes G^{-1} A^{T} - [(A x^{\star} - b)^{T} \otimes G^{-1}] J_{m n} \enspace . \end{align}\]For numerical stability, these expressions should be evaluated using a QR factorisation of $A$ rather than by forming $(A^{T} A)^{-1}$ explicitly.
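A minimal sketch of the derivative-vector product in this case (NumPy/SciPy; names are illustrative, not from the source referenced below): with the thin factorisation $A = Q R$ one has $G^{-1} w = R^{-1} R^{-T} w$, so the product can be evaluated with two triangular solves and verified against finite differences of the least-squares solution.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(0)
m, n = 6, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
V = rng.standard_normal((m, n))    # direction for the derivative with respect to A

# Thin QR factorisation A = Q R; then G = A^T A = R^T R and x* = R^{-1} Q^T b.
Q, R = qr(A, mode="economic")
x_star = solve_triangular(R, Q.T @ b)

def solve_gram(w):
    # G^{-1} w = R^{-1} R^{-T} w, via two triangular solves instead of forming (A^T A)^{-1}.
    return solve_triangular(R, solve_triangular(R, w, trans="T"))

# [dx*/da] vec(V) = -G^{-1} [A^T V x* + V^T (A x* - b)]
dx_da_V = -solve_gram(A.T @ (V @ x_star) + V.T @ (A @ x_star - b))

# Finite-difference check against the least-squares solutions of perturbed problems.
eps = 1e-6
x_plus, *_ = np.linalg.lstsq(A + eps * V, b, rcond=None)
x_minus, *_ = np.linalg.lstsq(A - eps * V, b, rcond=None)
assert np.allclose(dx_da_V, (x_plus - x_minus) / (2 * eps), atol=1e-5)
```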
The complete source for this experiment can be found on GitHub.