# Derivative of non-linear least-squares minimiser

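If $y^{\star}(x)$ is the minimiser of a non-linear least-squares problem

$$ y^{\star}(x) = \arg\min_{y} \tfrac{1}{2} \| r(x, y) \|^{2} $$

where the function $r$ maps $x \in \mathbb{R}^{m}$ and $y \in \mathbb{R}^{n}$ to a vector $r(x, y) \in \mathbb{R}^{p}$, then how can we estimate its derivative

$$ \frac{\partial y^{\star}}{\partial x}(x) \, ? $$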

This page uses some techniques from the generalised Wiberg paper of Strelow (pdf). It also makes use of some notation and results from the post about the linear problem.

Assume that we are able to obtain an estimate of the minimiser $\hat{y} \approx y^{\star}(x)$. The true minimiser can be represented as a displacement from this estimate, $y^{\star}(x) = \hat{y} + \delta y^{\star}(x)$. Treating $\hat{y}$ as a constant, the derivative of this displacement with respect to $x$ is equal to that of the minimiser

$$ \frac{\partial y^{\star}}{\partial x}(x) = \frac{\partial \delta y^{\star}}{\partial x}(x). $$

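If the estimate is good, then the optimal displacement $\delta y^{\star}(x)$ is small and

$$ \delta y^{\star}(x) \approx \arg\min_{\delta y} \tfrac{1}{2} \left\| r(x, \hat{y}) + \frac{\partial r}{\partial y}(x, \hat{y}) \, \delta y \right\|^{2} $$

and hopefully (perhaps this can be shown with Lipschitz continuity?)

$$ \frac{\partial \delta y^{\star}}{\partial x}(x) \approx \frac{\partial}{\partial x} \arg\min_{\delta y} \tfrac{1}{2} \left\| r(x, \hat{y}) + \frac{\partial r}{\partial y}(x, \hat{y}) \, \delta y \right\|^{2}. $$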

This is the solution to a linear least-squares objective

$$ \arg\min_{\delta y} \tfrac{1}{2} \| A(x) \, \delta y - b(x) \|^{2} = A^{\dagger} b $$

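where $A(x) = \frac{\partial r}{\partial y}(x, \hat{y})$ is $p \times n$ and $b(x) = -r(x, \hat{y})$ is a vector of length $p$. The derivative of this expression is

$$ \frac{\partial \delta y^{\star}}{\partial x} = \frac{\partial (A^{\dagger} b)}{\partial a} \frac{\partial a}{\partial x} + A^{\dagger} \frac{\partial b}{\partial x} $$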

where $A^{\dagger} = (A^{T} A)^{-1} A^{T}$ is the left-inverse and $a \equiv \operatorname{vec}(A)$.

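The derivatives of $\delta y^{\star}$ with respect to $A$ and $b$ are known from the post about the linear problem. The derivatives of the parameters of the linear system, with $\hat{y}$ held fixed, are

$$ \frac{\partial a}{\partial x} = \frac{\partial}{\partial x} \operatorname{vec}\left( \frac{\partial r}{\partial y}(x, \hat{y}) \right), \qquad \frac{\partial b}{\partial x} = -\frac{\partial r}{\partial x}(x, \hat{y}). $$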

This is enough to compute the derivative numerically.
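As a minimal sketch of such a computation (not the original experiment), the snippet below evaluates the derivative one element of $x$ at a time and checks it against central finite differences. The toy residual $r(x, y) = M(x) y - c(x)$ and the names `M`, `c`, `dM`, `dc` and `minimiser_jacobian` are assumptions made for illustration. The residual is deliberately affine in $y$, so the linearisation is exact and the comparison is clean; for a general residual the result is only the linearised approximation described above.

```python
import numpy as np

# Hypothetical toy problem: r(x, y) = M(x) y - c(x), with m = 2, n = 2, p = 3.
def M(x):
    return np.array([[1.0 + x[0] ** 2, x[1]],
                     [x[0] * x[1], 2.0],
                     [np.sin(x[0]), np.cos(x[1])]])

def c(x):
    return np.array([x[0], x[1] ** 2, 1.0])

def dM(x):
    # dM(x)[i] is the p x n matrix dM/dx_i, which equals d^2 r / (dx_i dy).
    return np.array([
        [[2.0 * x[0], 0.0], [x[1], 0.0], [np.cos(x[0]), 0.0]],
        [[0.0, 1.0], [x[0], 0.0], [0.0, -np.sin(x[1])]],
    ])

def dc(x):
    # dc(x)[i] is the length-p vector dc/dx_i.
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, 2.0 * x[1], 0.0]])

def r(x, y):
    return M(x) @ y - c(x)

def solve(x):
    # Exact minimiser over y (possible here because r is affine in y).
    return np.linalg.lstsq(M(x), c(x), rcond=None)[0]

def minimiser_jacobian(x, y_hat):
    A = M(x)                          # dr/dy at (x, y_hat)
    b = -r(x, y_hat)
    A_pinv = np.linalg.pinv(A)        # left-inverse (A has full column rank)
    G_inv = np.linalg.inv(A.T @ A)
    dy = A_pinv @ b                   # displacement from y_hat
    e = b - A @ dy                    # residual of the linear system
    cols = []
    for C_i, dc_i in zip(dM(x), dc(x)):
        d_i = -(C_i @ y_hat - dc_i)   # db/dx_i = -dr/dx_i at (x, y_hat)
        cols.append(-A_pinv @ C_i @ dy + G_inv @ C_i.T @ e + A_pinv @ d_i)
    return np.stack(cols, axis=1)     # n x m Jacobian

x0 = np.array([0.3, -0.7])
y_hat = solve(x0) + np.array([1e-2, -2e-2])   # an imperfect estimate
J = minimiser_jacobian(x0, y_hat)

# Central finite differences of the exact minimiser, for comparison.
h = 1e-6
J_fd = np.stack([(solve(x0 + h * e_i) - solve(x0 - h * e_i)) / (2 * h)
                 for e_i in np.eye(2)], axis=1)
assert np.allclose(J, J_fd, atol=1e-6)
```

The explicit inverse of $A^{T} A$ is used only to keep the sketch close to the formulas; a factorisation of $A$ is preferable in practice.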

The complete source for this experiment can be found on GitHub.

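Writing $\otimes$ for the Kronecker product and $\delta y^{\star} = A^{\dagger} b$, the overall expression for the derivative is

$$ \frac{\partial \delta y^{\star}}{\partial x} = \left[ \left( (b - A \, \delta y^{\star})^{T} \otimes G^{-1} \right) J_{p n} - \left( (\delta y^{\star})^{T} \otimes A^{\dagger} \right) \right] \frac{\partial a}{\partial x} + A^{\dagger} \frac{\partial b}{\partial x} $$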

where $G = A^{T} A$ and $J_{m n}$ is the linear operator such that $\operatorname{vec}(X^{T}) = J_{m n} \operatorname{vec}(X)$ if $X$ is $m \times n$.
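To make the operator $J_{m n}$ concrete, here is a small sketch that builds it as an explicit permutation matrix and checks the defining identity. It assumes that $\operatorname{vec}$ stacks columns (column-major order), which the text does not state; the function name `commutation_matrix` is chosen for illustration.

```python
import numpy as np

def commutation_matrix(m, n):
    """Return J (mn x mn) such that vec(X.T) == J @ vec(X) for X of shape (m, n).

    Assumes vec stacks columns, i.e. vec(X) = X.flatten(order="F").
    """
    J = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # X[i, j] sits at index i + j*m in vec(X)
            # and at index j + i*n in vec(X.T).
            J[j + i * n, i + j * m] = 1.0
    return J

X = np.arange(6.0).reshape(2, 3)
J_23 = commutation_matrix(2, 3)
assert np.array_equal(J_23 @ X.flatten(order="F"), X.T.flatten(order="F"))
```

Since $J_{m n}$ is only a permutation, an implementation would normally apply it by reshaping and transposing rather than by forming the matrix.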

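There might be some tensor notation that expresses this better. The derivatives with respect to single elements of $x$ at least have a neater expression

$$ \frac{\partial \delta y^{\star}}{\partial x_{i}} = G^{-1} C_{i}^{T} (b - A \, \delta y^{\star}) + A^{\dagger} (d_{i} - C_{i} \, \delta y^{\star}) $$

where $C_{i}$ is a $p \times n$ matrix and $d_{i}$ is a vector of length $p$

$$ C_{i} = \frac{\partial A}{\partial x_{i}} = \frac{\partial^{2} r}{\partial x_{i} \partial y}(x, \hat{y}), \qquad d_{i} = \frac{\partial b}{\partial x_{i}} = -\frac{\partial r}{\partial x_{i}}(x, \hat{y}). $$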

Therefore it is necessary to compute $\frac{\partial r}{\partial x}(x, \hat{y})$, $\frac{\partial r}{\partial y}(x, \hat{y})$ and $\frac{\partial^{2} r}{\partial x \partial y}(x, \hat{y})$ as well as the QR factorisation of $\frac{\partial r}{\partial y}(x, \hat{y})$.
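One way the QR factorisation might be used is sketched below. With the reduced factorisation $A = Q R$, both $A^{\dagger} = R^{-1} Q^{T}$ and $(A^{T} A)^{-1} = R^{-1} R^{-T}$ can be applied through solves with $R$, without forming $A^{T} A$. The function name `lstsq_derivative_qr` and the random test data are assumptions for illustration, and the check uses a linear system that varies along a single scalar parameter $t$.

```python
import numpy as np

def lstsq_derivative_qr(A, b, C, d):
    """Derivative of dy = argmin ||A dy - b||^2 when dA = C and db = d.

    A and C are p x n (A with full column rank); b and d have length p.
    """
    Q, R = np.linalg.qr(A)   # reduced QR: Q is p x n, R is n x n upper triangular

    def apply_pinv(v):
        # A^+ v = R^{-1} Q^T v
        return np.linalg.solve(R, Q.T @ v)

    def apply_G_inv(v):
        # (A^T A)^{-1} v = R^{-1} R^{-T} v
        return np.linalg.solve(R, np.linalg.solve(R.T, v))

    dy = apply_pinv(b)
    e = b - A @ dy           # residual of the linear system
    return apply_G_inv(C.T @ e) + apply_pinv(d - C @ dy)

# Check against central finite differences on A(t) = A0 + t C, b(t) = b0 + t d.
rng = np.random.default_rng(0)
A0, C = rng.standard_normal((5, 3)), rng.standard_normal((5, 3))
b0, d = rng.standard_normal(5), rng.standard_normal(5)

def solution(t):
    return np.linalg.lstsq(A0 + t * C, b0 + t * d, rcond=None)[0]

h = 1e-6
deriv_fd = (solution(h) - solution(-h)) / (2 * h)
deriv_qr = lstsq_derivative_qr(A0, b0, C, d)
assert np.allclose(deriv_qr, deriv_fd, atol=1e-5)
```

`np.linalg.solve` is used here for brevity and does not exploit the triangular structure of $R$; a dedicated triangular solver would be the natural replacement.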