Artificial neural networks (ANNs) are a powerful class of models used for nonlinear regression and classification tasks, motivated by biological neural computation. The general idea behind ANNs is pretty straightforward: map some input onto a desired target value using a distributed cascade of nonlinear transformations (see Figure 1). However, for many, myself included, the learning algorithm used to train ANNs can be difficult to get your head around at first. In this post I give a step-by-step walkthrough of the derivation of the gradient descent algorithm commonly used to train ANNs, aka the “backpropagation” algorithm. Along the way, I’ll also try to provide some high-level insights into the computations being performed during learning¹.

Some Background and Notation

An ANN consists of an input layer, an output layer, and any number (including zero) of hidden layers situated between the input and output layers. Figure 1 diagrams an ANN with a single hidden layer. The feed-forward computations performed by the ANN are as follows:

  1. The signals from the input layer, $a_i$, are multiplied by a set of weights $w_{ij}$ connecting each input to a node in the hidden layer.
  2. These weighted signals are then summed (indicated by the $\Sigma$ in Figure 1) and combined with a bias $b_j$ (not displayed in Figure 1). This calculation forms the pre-activation signal $z_j = b_j + \sum_i a_i w_{ij}$ for the hidden layer.
  3. The pre-activation signal is then transformed by the hidden layer activation function $g_j$ to form the feed-forward activation signals $a_j$ leaving the hidden layer.
  4. In a similar fashion, the hidden layer activation signals $a_j$ are multiplied by the weights connecting the hidden layer to the output layer, $w_{jk}$, summed, and a bias $b_k$ is added.
  5. The resulting output layer pre-activation $z_k$ is transformed by the output activation function $g_k$ to form the network output $a_k$.
  6. The computed output $a_k$ is then compared to a desired target value $t_k$, and the error between $a_k$ and $t_k$ is calculated. This error is used to determine how to update the model parameters, as we’ll discuss in the remainder of the post.


Figure 1: Diagram of an artificial neural network with a single hidden layer (bias units not shown)
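
To make these feed-forward steps concrete, here is a minimal NumPy sketch of the computations described above. The dimensions, the use of a sigmoid for both $g_j$ and $g_k$, and all variable names are illustrative assumptions, not part of the derivation:

```python
import numpy as np

def sigmoid(z):
    """An example activation function; any differentiable g will do."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dimensions: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
a_i  = rng.normal(size=3)           # input activations           (step 1)
W_ij = rng.normal(size=(3, 4))      # input-to-hidden weights
b_j  = np.zeros(4)                  # hidden-layer biases
W_jk = rng.normal(size=(4, 2))      # hidden-to-output weights
b_k  = np.zeros(2)                  # output-layer biases
t_k  = np.array([1.0, 0.0])         # desired target values

z_j = b_j + a_i @ W_ij              # hidden pre-activations      (steps 1-2)
a_j = sigmoid(z_j)                  # hidden activations          (step 3)
z_k = b_k + a_j @ W_jk              # output pre-activations      (step 4)
a_k = sigmoid(z_k)                  # network output              (step 5)
error = a_k - t_k                   # error vs. the target        (step 6)
```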


Training a neural network involves determining the set of parameters $\theta = \{\mathbf{W}, \mathbf{b}\}$ that reduces the errors that the network makes. Often the choice for the error function is the sum of the squared errors between the target values $t_k$ and the network output $a_k$:

$$E = \frac{1}{2}\sum_{k=1}^{K}(a_k - t_k)^2 \tag{1}$$

where $K$ is the dimensionality of the target/output for a single observation. This parameter optimization problem can be solved using gradient descent, which requires determining $\frac{\partial E}{\partial \theta}$ for all $\theta$ in the model.
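
In other words, once we have those gradients, every parameter is nudged a small step against its own gradient; the rest of the post is devoted to deriving the gradients for each parameter group:

$$\theta \leftarrow \theta - \eta\,\frac{\partial E}{\partial \theta}$$

where $\eta$ is a small positive step size (introduced more formally as the “learning rate” later in the post).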

Before we begin, let’s define the notation that will be used in the remainder of the derivation. Please refer to Figure 1 for any clarifications.

  • $z_j$: input to node $j$ in layer $l$
  • $g_j$: activation function for node $j$ in layer $l$ (applied to $z_j$)
  • $a_j = g_j(z_j)$: the output/activation of node $j$ in layer $l$
  • $b_j$: bias/offset for unit $j$ in layer $l$
  • $w_{ij}$: weights connecting node $i$ in layer $(l-1)$ to node $j$ in layer $l$
  • $t_k$: target value for node $k$ in the output layer

Also note that the parameters for an ANN can be broken up into two distinct sets: those parameters that are associated with the output layer (i.e. $\theta_k = \{w_{jk}, b_k\}$), and thus directly affect the network output error; and the remaining parameters that are associated with the hidden layer(s), and thus affect the output error indirectly. We’ll first derive the gradients for the output layer parameters, then extend these results to the hidden layer parameters.

Gradients for Output Layer Parameters

Output layer connection weights, $w_{jk}$

Since the output layer parameters directly affect the value of the error function, determining the gradient of the error function with respect to those parameters is fairly straightforward using an application of the chain rule²:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\frac{1}{2}\sum_k(a_k - t_k)^2 = (a_k - t_k)\frac{\partial}{\partial w_{jk}}(a_k - t_k) \tag{2}$$

The derivative of $t_k$ with respect to $w_{jk}$ is zero because the target does not depend on $w_{jk}$. We can also use the fact that $a_k = g_k(z_k)$, and re-apply the chain rule to give

$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\frac{\partial a_k}{\partial w_{jk}} = (a_k - t_k)\frac{\partial}{\partial w_{jk}}g_k(z_k) = (a_k - t_k)\,g_k'(z_k)\frac{\partial z_k}{\partial w_{jk}}. \tag{3}$$

Now, recall that $z_k = b_k + \sum_j g_j(z_j)\,w_{jk}$ and therefore $\frac{\partial z_k}{\partial w_{jk}} = g_j(z_j) = a_j$, thus giving us:

$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\,g_k'(z_k)\,a_j \tag{4}$$

From Equation 4 we can see that the gradient of the error function with respect to the output layer weights $w_{jk}$ is a product of three terms:

  • $(a_k - t_k)$: the difference between the network output $a_k$ and the target value $t_k$.
  • $g_k'(z_k)$: the derivative of the output layer activation function $g_k(\cdot)$. For more details on activation function derivatives, please refer to this post (a sigmoid example is also sketched just below this list).
  • $a_j$: the activation signal of node $j$ from the hidden layer feeding into the output layer.
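
As a concrete illustration of the middle term, here is what $g_k'(z_k)$ looks like for the common (but here purely illustrative) choice of a sigmoid activation; this helper is reused in the later sketches:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```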

If we define $\delta_k$ to be all the terms that involve index $k$:

$$\delta_k = (a_k - t_k)\,g_k'(z_k) \tag{5}$$

Then we get the “delta form” of the error function gradient for the output layer weights:

$$\frac{\partial E}{\partial w_{jk}} = \delta_k\,a_j \tag{6}$$

Here the $\delta_k$ terms can be interpreted as the network output error after being “backpropagated” through the output activation function $g_k$, thus creating an “error signal”. Loosely speaking, Equation 6 can be interpreted as determining how much each $w_{jk}$ contributes to the error signal by weighting the error by the magnitude of the output activation from the previous (hidden) layer. The gradients with respect to each $w_{jk}$ are thus considered to be the “contribution” of that parameter to the total error signal, and should be “negated” during learning. This gives the following gradient descent update rule for the output layer weights:

$$w_{jk} \leftarrow w_{jk} - \eta\frac{\partial E}{\partial w_{jk}} = w_{jk} - \eta\,(\delta_k\,a_j) \tag{7}$$

where $\eta$ is some step size, often referred to as the “learning rate”. Similar update rules are used to update the remaining parameters, once $\frac{\partial E}{\partial \theta}$ has been determined.
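
Continuing the hypothetical NumPy sketch from Figure 1 (and reusing `sigmoid_deriv` from above), Equations 5-7 for the output layer weights come down to a few lines; the learning rate `eta` is an arbitrary illustrative choice:

```python
eta = 0.1                                    # learning rate (step size)

# Equation 5: backpropagate the output error through g_k
delta_k = (a_k - t_k) * sigmoid_deriv(z_k)   # shape: (n_outputs,)

# Equation 6: gradients for all output-layer weights w_jk at once
dE_dW_jk = np.outer(a_j, delta_k)            # shape: (n_hidden, n_outputs)

# Equation 7: gradient descent step (applied only after all gradients
# have been computed, so the backward pass still sees the original weights)
W_jk_new = W_jk - eta * dE_dW_jk
```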

As we’ll see shortly, the process of “backpropagating” the error signal can be repeated all the way back to the input layer by successively projecting $\delta_k$ back through $w_{jk}$, then through the derivative of the hidden layer activation function, $g_j'(z_j)$, to give the error signal $\delta_j$, and so on. This backpropagation concept is central to training neural networks with more than one layer.

Output layer biases, $b_k$

As for the gradient of the error function with respect to the output layer biases, we follow the same routine as above for $w_{jk}$. However, the third term in Equation 3 becomes $\frac{\partial z_k}{\partial b_k} = \frac{\partial}{\partial b_k}\left[b_k + \sum_j g_j(z_j)\,w_{jk}\right] = 1$, giving the following gradient for the output biases:

$$\frac{\partial E}{\partial b_k} = (a_k - t_k)\,g_k'(z_k)(1) = \delta_k \tag{8}$$

Thus the gradient for the biases is simply the back-propagated error signal $\delta_k$ from the output units. One interpretation of this is that the biases are weights on activations that are always equal to one, regardless of the feed-forward signal. Thus the bias gradients aren’t affected by the feed-forward signal, only by the error.
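
In the same hypothetical sketch, the output bias update therefore requires no additional computation:

```python
# Equation 8: the output bias gradient is just the backpropagated error signal
dE_db_k = delta_k
b_k_new = b_k - eta * dE_db_k
```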

Gradients for Hidden Layer Parameters

Now that we’ve derived the gradients for the output layer parameters and established the notion of backpropagation, let’s continue with this information in hand in order to derive the gradients for the remaining layers.

Hidden layer connection weights, $w_{ij}$

Due to the indirect effect of the hidden layer on the output error, calculating the gradients for the hidden layer weights $w_{ij}$ is somewhat more involved. However, the process starts out just the same as for the output layer³:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\frac{1}{2}\sum_k(a_k - t_k)^2 = \sum_k(a_k - t_k)\frac{\partial a_k}{\partial w_{ij}} \tag{9}$$

Continuing on, noting that $a_k = g_k(z_k)$ and again applying the chain rule, we obtain:

$$\frac{\partial E}{\partial w_{ij}} = \sum_k(a_k - t_k)\frac{\partial}{\partial w_{ij}}g_k(z_k) = \sum_k(a_k - t_k)\,g_k'(z_k)\frac{\partial z_k}{\partial w_{ij}} \tag{10}$$

Ok, now here’s where things get slightly more involved. Notice that the partial derivative $\frac{\partial z_k}{\partial w_{ij}}$ in Equation 10 is with respect to $w_{ij}$, but the term $z_k$ is indexed by $k$. How the heck do we deal with that!? If we expand $z_k$ a little, we find that it is composed of other sub-functions:

$$z_k = b_k + \sum_j a_j w_{jk} = b_k + \sum_j g_j(z_j)\,w_{jk} = b_k + \sum_j g_j\!\left(b_j + \sum_i a_i w_{ij}\right)w_{jk} \tag{11}$$

From Equation 11 we see that $z_k$ is indirectly dependent on $w_{ij}$. Equation 10 also suggests that we can again use the chain rule to calculate $\frac{\partial z_k}{\partial w_{ij}}$. This is probably the trickiest part of the derivation, and also requires noting that $z_j = b_j + \sum_i a_i w_{ij}$ and $a_j = g_j(z_j)$:

$$\begin{aligned}
\frac{\partial z_k}{\partial w_{ij}} &= \frac{\partial z_k}{\partial a_j}\frac{\partial a_j}{\partial w_{ij}} \\
&= \frac{\partial}{\partial a_j}\left(b_k + \sum_j a_j w_{jk}\right)\frac{\partial a_j}{\partial w_{ij}} \\
&= w_{jk}\frac{\partial a_j}{\partial w_{ij}} \\
&= w_{jk}\frac{\partial g_j(z_j)}{\partial w_{ij}} \\
&= w_{jk}\,g_j'(z_j)\frac{\partial z_j}{\partial w_{ij}} \\
&= w_{jk}\,g_j'(z_j)\frac{\partial}{\partial w_{ij}}\left(b_j + \sum_i a_i w_{ij}\right) \\
&= w_{jk}\,g_j'(z_j)\,a_i
\end{aligned} \tag{12}$$

Now, plugging Equation 12 in for $\frac{\partial z_k}{\partial w_{ij}}$ in Equation 10 gives the following expression for $\frac{\partial E}{\partial w_{ij}}$:

$$\frac{\partial E}{\partial w_{ij}} = \sum_k(a_k - t_k)\,g_k'(z_k)\,w_{jk}\,g_j'(z_j)\,a_i = \left(\sum_k \delta_k w_{jk}\right)g_j'(z_j)\,a_i \tag{13}$$

Notice that the gradient for the hidden layer weights has a similar form to that of the gradient for the output layer weights. Namely the gradient is composed of three terms:

  • the derivative of the current layer’s activation function, $g_j'(z_j)$
  • the output activation signal from the layer below, $a_i$
  • an error term, $\sum_k \delta_k w_{jk}$

For the output layer weight gradients, the error term was simply the difference between the target and the output layer activation, $a_k - t_k$. Here, the error term includes not only the output layer error signal $\delta_k$, but that error signal further projected back onto $w_{jk}$. Analogous to the output layer weights, the gradient for the hidden layer weights can be interpreted as a proxy for the “contribution” of the weights to the output error signal. However, for hidden layers, this error can only be “observed” from the point-of-view of the weights by backpropagating the error signal through the layers above the hidden layer.

To make this idea more explicit, we can define the resulting error signal backpropagated to layer $j$ as $\delta_j$, which includes all terms in Equation 13 that involve index $j$. This definition results in the following error signal for the hidden layer:

$$\delta_j = g_j'(z_j)\sum_k \delta_k w_{jk} \tag{14}$$

Thus giving the final expression for the gradient:

$$\frac{\partial E}{\partial w_{ij}} = \delta_j\,a_i \tag{15}$$

Equation 15 suggests that in order to calculate the weight gradients at any layer $l$ in an arbitrarily-deep neural network, we simply need to calculate the backpropagated error signal $\delta_l$ that reaches that layer from the “above” layers, and weight it by the feed-forward signal $a_{l-1}$ feeding into that layer.
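
In the running NumPy sketch, Equations 14 and 15 amount to projecting the output deltas back through `W_jk` (a single matrix-vector product) and taking one more outer product with the inputs:

```python
# Equation 14: backpropagate the output deltas through W_jk, then through g_j
delta_j = sigmoid_deriv(z_j) * (W_jk @ delta_k)   # shape: (n_hidden,)

# Equation 15: gradients for all hidden-layer weights w_ij
dE_dW_ij = np.outer(a_i, delta_j)                 # shape: (n_inputs, n_hidden)
W_ij_new = W_ij - eta * dE_dW_ij
```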

Hidden Layer Biases, $b_j$

Calculating the error gradients with respect to the hidden layer biases $b_j$ follows a very similar procedure to that for the hidden layer weights where, as in Equation 12, we use the chain rule to calculate $\frac{\partial z_k}{\partial b_j}$.

$$\frac{\partial E}{\partial b_j} = \sum_k(a_k - t_k)\frac{\partial}{\partial b_j}g_k(z_k) = \sum_k(a_k - t_k)\,g_k'(z_k)\frac{\partial z_k}{\partial b_j} \tag{16}$$

Again, using the chain rule to solve for $\frac{\partial z_k}{\partial b_j}$:

$$\begin{aligned}
\frac{\partial z_k}{\partial b_j} &= \frac{\partial z_k}{\partial a_j}\frac{\partial a_j}{\partial b_j} \\
&= \frac{\partial}{\partial a_j}\left(b_k + \sum_j a_j w_{jk}\right)\frac{\partial a_j}{\partial b_j} \\
&= w_{jk}\frac{\partial a_j}{\partial b_j} \\
&= w_{jk}\frac{\partial g_j(z_j)}{\partial b_j} \\
&= w_{jk}\,g_j'(z_j)\frac{\partial z_j}{\partial b_j} \\
&= w_{jk}\,g_j'(z_j)\frac{\partial}{\partial b_j}\left(b_j + \sum_i a_i w_{ij}\right) \\
&= w_{jk}\,g_j'(z_j)(1)
\end{aligned} \tag{17}$$

Plugging Equation 17 in for $\frac{\partial z_k}{\partial b_j}$ in Equation 16 gives the final expression for the hidden layer bias gradients:

$$\frac{\partial E}{\partial b_j} = \sum_k(a_k - t_k)\,g_k'(z_k)\,w_{jk}\,g_j'(z_j) = g_j'(z_j)\sum_k \delta_k w_{jk} = \delta_j \tag{18}$$

In a similar fashion to the calculation of the bias gradients for the output layer, the gradients for the hidden layer biases are simply the backpropagated error signal reaching that layer. This suggests that we can also calculate the bias gradients at any layer $l$ in an arbitrarily-deep network by simply calculating the backpropagated error signal that reaches that layer, $\delta_l$. Pretty cool!
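
Putting all of the pieces together, the complete backward pass for the hypothetical single-hidden-layer network from the earlier sketches (sigmoid activations assumed throughout) is just:

```python
def backward(a_i, z_j, a_j, z_k, a_k, t_k, W_jk):
    """Gradients of the squared error for the single-hidden-layer sketch."""
    delta_k = (a_k - t_k) * sigmoid_deriv(z_k)       # output error signal (Eq. 5)
    delta_j = sigmoid_deriv(z_j) * (W_jk @ delta_k)  # hidden error signal (Eq. 14)
    return {
        "W_jk": np.outer(a_j, delta_k),  # Eq. 6
        "b_k":  delta_k,                 # Eq. 8
        "W_ij": np.outer(a_i, delta_j),  # Eq. 15
        "b_j":  delta_j,                 # Eq. 18
    }
```

Each parameter is then updated by subtracting `eta` times its entry in the returned dictionary, exactly as in Equation 7.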

Wrapping up

In this post we went over some of the formal details of the backpropagation learning algorithm. The math covered here allows us to train arbitrarily deep neural networks by re-applying the same basic computations. In a later post, we’ll go a bit deeper into the implementation and applications of neural networks, referencing this post for the formal development of the underlying calculus required for gradient descent.



Notes

  1. Though, I guess these days with autograd, who really needs to understand how the calculus for gradient descent works, amiright? (hint: that is a joke) 

  2. You may also notice that the summation disappears in the derivative. This is because we take the partial derivative with respect to $w_{jk}$, which only affects the $k$-th output node. Therefore the only term that survives in the error gradient is the $k$-th, and we can thus ignore the remaining terms in the summation. 

  3. Notice here that the sum does not disappear in the derivative as it did for the output layer parameters. This is due to the fact that the hidden layers are fully connected, and thus each of the hidden unit outputs affects the state of each output unit.