Introduction to differentials

Differential of a vector function

Let $f(x)$ be a function that maps ${\mathbb{R}}^{n} \to {\mathbb{R}}^{m}$. The derivative $\partial f(x) / \partial x$ is an $m \times n$ matrix that defines the linear approximation

$$f(x + u) \approx f(x) + \frac{\partial f(x)}{\partial x}\, u .$$

The differential $df$ is an alternative way of expressing this approximation

$$f(x + u) \approx f(x) + df(x; u) ,$$

where $df(x; u)$ must be linear in $u$.
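
As a quick numerical sanity check (the particular function $f$, the sizes and the step size below are arbitrary illustrative choices, not from the text), we can verify that $f(x + u) \approx f(x) + df(x; u)$ for a small perturbation $u$:

```python
import numpy as np

# Hypothetical example: f(x) = A @ tanh(x), a map from R^n to R^m.
rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.standard_normal((m, n))

def f(x):
    return A @ np.tanh(x)

def df(x, u):
    # df(x; u) = (df/dx) u, with df/dx = A @ diag(1 - tanh(x)^2).
    return A @ ((1 - np.tanh(x) ** 2) * u)

x = rng.standard_normal(n)
u = 1e-6 * rng.standard_normal(n)

# First-order approximation: f(x + u) ≈ f(x) + df(x; u).
print(np.allclose(f(x + u), f(x) + df(x, u)))  # True, up to O(|u|^2)
```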

Chain rule for differentials

The chain rule for differentials states that if $h(x) = g(f(x))$ then

$$dh(x; u) = dg(f(x); df(x; u)) .$$

The differential $dh$ is clearly linear in $u$ because it is the composition of two linear functions.

Differential of a variable

The chain rule enables some useful notation. Let us introduce the variable $y = f(x)$ and write the differential of the variable $y$ as a function of the differential of $x$

$$dy = df(x; dx) .$$

We can similarly introduce $z = g(y)$ and write the differential of $z$

$$dz = dg(y; dy) .$$

While these expressions were obtained in isolation, each without knowledge of the other, they can be combined and the result agrees with the chain rule

$$dz = dg(y; dy) = dg(f(x); df(x; dx)) = dh(x; dx) .$$

Therefore, knowledge of the entire system in which the expression resides is not required to take differentials. That is, the expression of $dz$ in terms of $dy$ was valid whether $dy$ was an arbitrary vector or a function of $dx$. This is known as Cauchy’s rule of invariance.
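
A minimal sketch of Cauchy's rule in action, again with arbitrary illustrative functions: the expression for $dz$ in terms of $dy$ makes no reference to $x$, yet substituting $dy = df(x; dx)$ reproduces the differential of the composition.

```python
import numpy as np

# Hypothetical functions: y = f(x) = tanh(x), z = g(y) = A @ y**2.
rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))

f = lambda x: np.tanh(x)
g = lambda y: A @ (y ** 2)
df = lambda x, dx: (1 - np.tanh(x) ** 2) * dx   # dy = df(x; dx)
dg = lambda y, dy: A @ (2 * y * dy)             # dz = dg(y; dy)

x = rng.standard_normal(n)
dx = 1e-6 * rng.standard_normal(n)

dy = df(x, dx)           # differential of y, now a function of dx
dz = dg(f(x), dy)        # same expression for dz, valid by Cauchy's rule

# Compare against the change in the composition h(x) = g(f(x)).
print(np.allclose(g(f(x + dx)) - g(f(x)), dz))  # True to first order
```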

Common rules in differential form

The product rule becomes

$$d(u \odot v) = du \odot v + u \odot dv ,$$

where the product $\odot$ can be any operation that distributes over addition

$$(u + v) \odot w = u \odot w + v \odot w , \qquad u \odot (v + w) = u \odot v + u \odot w .$$

This includes the element-wise product, matrix multiplication, the inner product, the Kronecker product, etc.
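
For instance, taking $\odot$ to be matrix multiplication, a quick check that $d(UV) = dU\, V + U\, dV$ holds to first order (the sizes and step size are arbitrary choices):

```python
import numpy as np

# Product rule d(UV) = dU V + U dV for matrix multiplication.
rng = np.random.default_rng(2)
U = rng.standard_normal((3, 4))
V = rng.standard_normal((4, 2))
dU = 1e-6 * rng.standard_normal((3, 4))
dV = 1e-6 * rng.standard_normal((4, 2))

lhs = (U + dU) @ (V + dV) - U @ V          # exact change
rhs = dU @ V + U @ dV                      # differential (first order)
print(np.allclose(lhs, rhs))               # True up to the dU @ dV term
```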

Extension to matrices

The differential form is easily extended to matrix functions $F : {\mathbb{R}}^{m \times n} \to {\mathbb{R}}^{p \times q}$

$$F(X + U) \approx F(X) + dF(X; U) ,$$

where the differential $dF(X; U)$ is linear in the elements of $U$.

We will now use matrix differentials and the product rule to find the derivative of the $n \times n$ matrix inverse $Y = f(X) = X^{-1}$. This function can alternatively be defined implicitly by $X Y = I$. Taking differentials of both sides gives

$$dX\, Y + X\, dY = 0 \quad\Rightarrow\quad dY = -X^{-1}\, dX\, X^{-1} .$$

This gives an expression for the differential $dY = df(X; dX)$ that is linear in $dX$ as expected.
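
A small numerical check of $dY = -X^{-1}\, dX\, X^{-1}$ (the matrix size and perturbation scale are arbitrary choices):

```python
import numpy as np

# Check dY = -X^{-1} dX X^{-1} for Y = X^{-1}.
rng = np.random.default_rng(3)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # keep X well-conditioned
dX = 1e-6 * rng.standard_normal((n, n))

Xinv = np.linalg.inv(X)
dY = -Xinv @ dX @ Xinv
print(np.allclose(np.linalg.inv(X + dX) - Xinv, dY))  # True to first order
```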

Finding derivatives using differentials

Note that the differential provides the derivative as an operator

$$u \mapsto df(x; u) = \frac{\partial f(x)}{\partial x}\, u ,$$

rather than as an array of values. The explicit derivative can be trivially extracted from an expression for the differential $df(x; dx)$ by taking the coefficients of $dx$ since

$$df(x; dx) = A(x)\, dx \quad\Longleftrightarrow\quad \frac{\partial f(x)}{\partial x} = A(x) .$$

This is known as the identification theorem.

For the example of the matrix inverse, this must be generalized to matrices

$$dF_{ij}(X; dX) = \sum_{k,l} a_{ijkl}\, dx_{kl} \quad\Longleftrightarrow\quad \frac{\partial F_{ij}(X)}{\partial x_{kl}} = a_{ijkl} .$$

Comparing this to the element-wise expression of the differential

$$dy_{ij} = -\sum_{k,l} (X^{-1})_{ik}\, dx_{kl}\, (X^{-1})_{lj}$$

provides the derivative

$$\frac{\partial y_{ij}}{\partial x_{kl}} = -(X^{-1})_{ik}\, (X^{-1})_{lj} .$$
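
This element-wise derivative can be verified against finite differences; the sketch below builds the full four-index array $\partial y_{ij} / \partial x_{kl}$ for a small matrix (the size and step size are arbitrary choices):

```python
import numpy as np

# Verify dy_ij/dx_kl = -(X^{-1})_ik (X^{-1})_lj for Y = X^{-1}
# against central finite differences.
rng = np.random.default_rng(4)
n, eps = 3, 1e-6
X = rng.standard_normal((n, n)) + n * np.eye(n)   # keep X well-conditioned
Xinv = np.linalg.inv(X)

# Analytic derivative from the identification theorem.
deriv = -np.einsum('ik,lj->ijkl', Xinv, Xinv)

# Finite differences, one (k, l) entry of X at a time.
fd = np.zeros((n, n, n, n))
for k in range(n):
    for l in range(n):
        P = np.zeros((n, n)); P[k, l] = eps
        fd[:, :, k, l] = (np.linalg.inv(X + P) - np.linalg.inv(X - P)) / (2 * eps)

print(np.allclose(deriv, fd, atol=1e-5))
```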

Back-propagation and differentials

Back-propagation

Back-propagation is an algorithm for computing the derivatives of a scalar energy $E$, which is a composition of $n$ functions

$$E = f_{n}(f_{n-1}(\cdots f_{1}(x; \theta_{1}) \cdots; \theta_{n-1}); \theta_{n}) ,$$

with respect to the parameters $\theta_{i}$ of each function $f_{i}$. That is, we want to find

$$\frac{\partial E}{\partial \theta_{i}} \quad \text{for } i = 1, \dots, n .$$

Let the variables $y_{i}$ represent the intermediate results so that

$$y_{i} = f_{i}(y_{i-1}; \theta_{i}) ,$$

with $y_{0} = x$ and $y_{n} = E$.

It is trivial to obtain the derivatives with respect to $y_{n-1}$ and $\theta_{n}$ from the derivatives of $f_{n}$ according to

$$\frac{\partial E}{\partial y_{n-1}} = \frac{\partial f_{n}}{\partial y_{n-1}} , \qquad \frac{\partial E}{\partial \theta_{n}} = \frac{\partial f_{n}}{\partial \theta_{n}} .$$

Back-propagation proceeds by obtaining the derivatives with respect to $\theta_{i}$ and $y_{i-1}$ from the derivative with respect to $y_{i}$

$$\frac{\partial E}{\partial \theta_{i}} = \frac{\partial E}{\partial y_{i}}\, \frac{\partial f_{i}}{\partial \theta_{i}} , \qquad \frac{\partial E}{\partial y_{i-1}} = \frac{\partial E}{\partial y_{i}}\, \frac{\partial f_{i}}{\partial y_{i-1}} ,$$

and so on.
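
A minimal sketch of this recursion for a chain of two hypothetical layers $f_{i}(y; \theta_{i}) = \tanh(\theta_{i} y)$ with $E$ the sum of the final output (the layer choice and sizes are purely illustrative):

```python
import numpy as np

# Hypothetical layers f_i(y; theta_i) = tanh(theta_i @ y), with E = sum(y_n).
rng = np.random.default_rng(5)
sizes = [4, 3, 2]
thetas = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(2)]
x = rng.standard_normal(sizes[0])

# Forward pass, storing the intermediate results y_i.
ys = [x]
for W in thetas:
    ys.append(np.tanh(W @ ys[-1]))

# Backward pass: propagate dE/dy_i from the output towards the input.
dE_dy = np.ones(sizes[-1])                 # dE/dy_n for E = sum(y_n)
grads = [None] * len(thetas)
for i in reversed(range(len(thetas))):
    W, y_prev, y = thetas[i], ys[i], ys[i + 1]
    s = dE_dy * (1 - y ** 2)               # dE/dy_i pulled through the tanh
    grads[i] = np.outer(s, y_prev)         # dE/dtheta_i
    dE_dy = W.T @ s                        # dE/dy_{i-1}

print([g.shape for g in grads])            # one gradient per theta_i
```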

Forward propagation using differentials

Whereas back-propagation computes, for each $i$,

$$\frac{\partial E}{\partial \theta_{i}} = \left( \left( \frac{\partial E}{\partial y_{n-1}}\, \frac{\partial y_{n-1}}{\partial y_{n-2}} \right) \cdots \frac{\partial y_{i+1}}{\partial y_{i}} \right) \frac{\partial y_{i}}{\partial \theta_{i}} ,$$

forward propagation computes

$$\frac{\partial E}{\partial \theta_{i}} = \frac{\partial E}{\partial y_{n-1}} \left( \frac{\partial y_{n-1}}{\partial y_{n-2}} \cdots \left( \frac{\partial y_{i+1}}{\partial y_{i}}\, \frac{\partial y_{i}}{\partial \theta_{i}} \right) \right) .$$

Forward propagation is much less efficient for two main reasons: the computation is not shared across different $i$, and it multiplies matrices by matrices, whereas in back-propagation one of the matrix dimensions is always 1 because $E$ is a scalar.

Breaking forward propagation down into steps, the first step is to compute

$$\frac{\partial y_{i}}{\partial \theta_{i}} = \frac{\partial f_{i}}{\partial \theta_{i}}(y_{i-1}; \theta_{i})$$

and the next steps are to compute, for $j = i+1, \dots, n$,

$$\frac{\partial y_{j}}{\partial \theta_{i}} = \frac{\partial f_{j}}{\partial y_{j-1}}(y_{j-1}; \theta_{j})\, \frac{\partial y_{j-1}}{\partial \theta_{i}} .$$

Note that this product is exactly the operator that is defined by the differential. Therefore, if we have the differential $df_{j}(y_{j-1}, \theta_{j}; dy_{j-1}, d\theta_{j})$, we can use it to perform forward propagation

$$\frac{\partial y_{j}}{\partial \theta_{i}} = df_{j}\!\left(y_{j-1}, \theta_{j};\; \frac{\partial y_{j-1}}{\partial \theta_{i}},\, 0\right) ,$$

applying the operator to each column of $\partial y_{j-1} / \partial \theta_{i}$.
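
A sketch of forward propagation through the same kind of hypothetical $\tanh$ chain as before, pushing a single perturbation direction $d\theta_{1}$ through the differential operator of each layer:

```python
import numpy as np

# Hypothetical layers f_j(y; theta) = tanh(theta @ y); push one perturbation
# direction dtheta_1 forward using the differential operator of each layer.
rng = np.random.default_rng(6)
sizes = [4, 3, 2]
thetas = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(2)]
x = rng.standard_normal(sizes[0])

def f(y, W):
    return np.tanh(W @ y)

def df(y, W, dy, dW):
    # Differential of tanh(W y): (1 - f(y, W)^2) * (dW @ y + W @ dy).
    return (1 - f(y, W) ** 2) * (dW @ y + W @ dy)

dtheta1 = rng.standard_normal(thetas[0].shape)    # perturbation direction

y, dy = x, np.zeros(sizes[0])
for j, W in enumerate(thetas):
    dW = dtheta1 if j == 0 else np.zeros_like(W)
    y, dy = f(y, W), df(y, W, dy, dW)             # value and differential

# Compare against a finite-difference directional derivative of y_n.
eps, y_pert = 1e-6, x
for j, W in enumerate(thetas):
    y_pert = f(y_pert, W + eps * dtheta1 if j == 0 else W)
print(np.allclose(dy, (y_pert - y) / eps, atol=1e-4))
```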

However, recall that forward propagation is much less efficient than back-propagation. The question remains: how to use differentials to obtain an operator for back-propagation?

Back-propagation using differentials

Consider a scalar cost $E = g(y)$ where $y = f(x)$, and define $h(x) = g(f(x))$. The definition of the differential and Cauchy’s rule of invariance provide

$$dE = \frac{\partial E}{\partial x}\, dx = \frac{\partial E}{\partial y}\, dy .$$

Since $E$ is a scalar, it’s more natural to write this as an inner product

$$dE = \langle \nabla_{x} E,\, dx \rangle = \langle \nabla_{y} E,\, dy \rangle ,$$

where the gradient $\nabla_{x} E$ denotes a vector in the same space as $x$, instead of the operator $\partial E / \partial x$ which maps that space to ${\mathbb{R}}$. When $x$ is a vector, the gradient is the transpose of the derivative $\nabla_{x} E = (\partial E / \partial x)^{T}$. A useful short-hand is to adopt $\bar{x} = \nabla_{x} E$, since no specific knowledge of $E$ is required for this identity

$$\langle \bar{x}, dx \rangle = \langle \bar{y}, dy \rangle .$$

Manipulation of this equation after substituting an expression for $dy$ in terms of $dx$ can provide an expression for $\bar{x} = \nabla_{x} E$ as a linear function of $\bar{y} = \nabla_{y} E$, precisely the operator required for back-propagation!

Critically, observe that it is much easier to preserve the structure of the intermediate variables with the back-propagation operator, whereas the forward-propagation operator requires all variables to be vectorized.

Let’s return to the example of the matrix inverse. For matrices, it’s useful to recall the inner product $\langle A, B \rangle = {\operatorname{tr}}(A^{T} B)$ and the trace rotation ${\operatorname{tr}}(A B C) = {\operatorname{tr}}(C A B) = {\operatorname{tr}}(B C A)$ for compatible dimensions. Substituting the expression for $dY$ in the identity gives

$$\langle \bar{X}, dX \rangle = \langle \bar{Y}, dY \rangle = \langle \bar{Y}, -X^{-1}\, dX\, X^{-1} \rangle = {\operatorname{tr}}(-\bar{Y}^{T} X^{-1}\, dX\, X^{-1}) = {\operatorname{tr}}(-X^{-1} \bar{Y}^{T} X^{-1}\, dX) = \langle -X^{-T} \bar{Y} X^{-T},\, dX \rangle ,$$

and therefore

$$\bar{X} = -X^{-T}\, \bar{Y}\, X^{-T} .$$
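
This backward operator is easy to check numerically: fix $\bar{Y}$, take $E = \langle \bar{Y}, X^{-1} \rangle$ so that $\nabla_{Y} E = \bar{Y}$, and compare $\bar{X}$ with finite differences (the size and step size are arbitrary choices):

```python
import numpy as np

# Check Xbar = -X^{-T} Ybar X^{-T} for E = <Ybar, X^{-1}> with fixed Ybar.
rng = np.random.default_rng(7)
n, eps = 3, 1e-5
X = rng.standard_normal((n, n)) + n * np.eye(n)
Ybar = rng.standard_normal((n, n))

E = lambda X: np.sum(Ybar * np.linalg.inv(X))     # <Ybar, Y> with Y = X^{-1}

Xinv = np.linalg.inv(X)
Xbar = -Xinv.T @ Ybar @ Xinv.T

# Central finite differences of E with respect to each entry of X.
fd = np.zeros((n, n))
for k in range(n):
    for l in range(n):
        P = np.zeros((n, n)); P[k, l] = eps
        fd[k, l] = (E(X + P) - E(X - P)) / (2 * eps)

print(np.allclose(Xbar, fd, atol=1e-6))
```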

In the case of the matrix inverse, this expression could have been obtained from the forward operator using vectorization and Kronecker product identities

$${\operatorname{vec}}(dY) = -\left(X^{-T} \otimes X^{-1}\right) {\operatorname{vec}}(dX) ,$$

and then

$${\operatorname{vec}}(\bar{X}) = -\left(X^{-T} \otimes X^{-1}\right)^{T} {\operatorname{vec}}(\bar{Y}) = -\left(X^{-1} \otimes X^{-T}\right) {\operatorname{vec}}(\bar{Y}) .$$

However, it is somewhat messier. Since we used differentials to find the forward operator, we might as well use them to find the backward operator.
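
A short check that the vectorized form agrees with the matrix form, using column-major $\operatorname{vec}$ so that $\operatorname{vec}(ABC) = (C^{T} \otimes A)\operatorname{vec}(B)$:

```python
import numpy as np

# vec(Xbar) = -(X^{-1} kron X^{-T}) vec(Ybar) versus Xbar = -X^{-T} Ybar X^{-T}.
rng = np.random.default_rng(8)
n = 3
X = rng.standard_normal((n, n)) + n * np.eye(n)
Ybar = rng.standard_normal((n, n))
Xinv = np.linalg.inv(X)

vec = lambda A: A.flatten('F')                    # column-major vectorization
lhs = -np.kron(Xinv, Xinv.T) @ vec(Ybar)          # vectorized backward operator
rhs = vec(-Xinv.T @ Ybar @ Xinv.T)                # matrix form of the same result
print(np.allclose(lhs, rhs))
```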

More examples

Gram matrix

Consider the function $Y = f(X) = X^{T} X$. Taking differentials and applying the product rule gives

$$dY = dX^{T} X + X^{T}\, dX ,$$

then making the substitution gives

$$\langle \bar{X}, dX \rangle = \langle \bar{Y},\, dX^{T} X + X^{T}\, dX \rangle = \langle X \bar{Y}^{T}, dX \rangle + \langle X \bar{Y}, dX \rangle ,$$

and therefore $\bar{X} = X (\bar{Y} + \bar{Y}^{T})$.
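
As with the matrix inverse, this can be checked by fixing $\bar{Y}$, defining $E = \langle \bar{Y}, X^{T} X \rangle$, and comparing against finite differences (the sizes and step size are arbitrary choices):

```python
import numpy as np

# Check Xbar = X (Ybar + Ybar^T) for E = <Ybar, X^T X> with fixed Ybar.
rng = np.random.default_rng(9)
m, n, eps = 4, 3, 1e-5
X = rng.standard_normal((m, n))
Ybar = rng.standard_normal((n, n))

E = lambda X: np.sum(Ybar * (X.T @ X))            # <Ybar, Y> with Y = X^T X
Xbar = X @ (Ybar + Ybar.T)

# Central finite differences of E with respect to each entry of X.
fd = np.zeros((m, n))
for k in range(m):
    for l in range(n):
        P = np.zeros((m, n)); P[k, l] = eps
        fd[k, l] = (E(X + P) - E(X - P)) / (2 * eps)

print(np.allclose(Xbar, fd, atol=1e-6))
```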

Circular convolution

Consider the function $y = f(x) = a * x$. Taking differentials and applying the product rule gives

$$dy = a * dx ,$$

then making the substitution gives

$$\langle \bar{x}, dx \rangle = \langle \bar{y},\, a * dx \rangle .$$

This inner product has an equivalent expression in the Fourier domain since inner products are preserved (Parseval's theorem; $\hat{u}$ denotes the DFT of $u$ and $^{*}$ complex conjugation)

$$\langle \bar{y},\, a * dx \rangle = \tfrac{1}{n} \langle \hat{\bar{y}},\, \hat{a} \odot \widehat{dx} \rangle = \tfrac{1}{n} \langle \hat{a}^{*} \odot \hat{\bar{y}},\, \widehat{dx} \rangle ,$$

and therefore

$$\bar{x} = a \star \bar{y} ,$$

where $\star$ denotes circular cross-correlation.
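
A final numerical check: with $E = \langle \bar{y}, a * x \rangle$ for fixed $\bar{y}$, the gradient $\bar{x}$ should equal the circular cross-correlation $a \star \bar{y}$. Both the convolution and the correlation below are implemented with the FFT (the sizes are arbitrary choices):

```python
import numpy as np

# Check xbar = a (cross-correlation) ybar for E = <ybar, a * x>,
# where * is circular convolution; both operations use the FFT.
rng = np.random.default_rng(10)
n = 8
a, x, ybar = rng.standard_normal((3, n))

cconv = lambda u, v: np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)))
ccorr = lambda u, v: np.real(np.fft.ifft(np.conj(np.fft.fft(u)) * np.fft.fft(v)))

E = lambda x: np.dot(ybar, cconv(a, x))
xbar = ccorr(a, ybar)

# Central finite differences of E with respect to each entry of x.
eps, fd = 1e-6, np.zeros(n)
for k in range(n):
    d = np.zeros(n); d[k] = eps
    fd[k] = (E(x + d) - E(x - d)) / (2 * eps)

print(np.allclose(xbar, fd, atol=1e-6))
```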