Artificial neural networks (ANNs) are a powerful class of models, motivated by biological neural computation, that are used for nonlinear regression and classification. The general idea behind ANNs is pretty straightforward: map some input onto a desired target value using a distributed cascade of nonlinear transformations (see Figure 1). However, for many, myself included, the learning algorithm used to train ANNs can be difficult to get your head around at first. In this post I give a step-by-step walkthrough of the derivation of the gradient descent algorithm commonly used to train ANNs, also known as the “backpropagation” algorithm. Along the way, I’ll also try to provide some high-level insights into the computations being performed during learning [1].
Some Background and Notation
An ANN consists of an input layer, an output layer, and any number (including zero) of hidden layers situated between the input and output layers. Figure 1 diagrams an ANN with a single hidden layer. The feed-forward computations performed by the ANN are as follows:
- The signals from the input layer $a_i$ are multiplied by a set of weights $w_{ij}$ connecting each input to a node in the hidden layer.
- These weighted signals are then summed (indicated by $\Sigma$ in Figure 1) and combined with a bias $b_j$ (not displayed in Figure 1). This calculation forms the pre-activation signal $z_j = b_j + \sum_i a_i w_{ij}$ for the hidden layer.
- The pre-activation signal is then transformed by the hidden layer activation function $g_j$ to form the feed-forward activation signals $a_j = g_j(z_j)$ leaving the hidden layer.
- In a similar fashion, the hidden layer activation signals $a_j$ are multiplied by the weights $w_{jk}$ connecting the hidden layer to the output layer, summed, and combined with a bias $b_k$.
- The resulting output layer pre-activation $z_k$ is transformed by the output activation function $g_k$ to form the network output $a_k$.
- The computed output $a_k$ is then compared to a desired target value $t_k$, and the error between $a_k$ and $t_k$ is calculated. This error is used to determine how to update model parameters, as we’ll discuss in the remainder of the post.
Figure 1: Diagram of an artificial neural network with a single hidden layer (bias units not shown)
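The feed-forward steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the logistic sigmoid activation, the toy 2-2-1 layer sizes, the numeric values, and the helper name `layer_forward` are all made up for the example and are not specified by the post. Explicit loops are used instead of a matrix library so that the indices line up with the $i$, $j$, $k$ subscripts in the text.

```python
import math

def sigmoid(z):
    """Logistic activation; an assumed choice, the post leaves g unspecified."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(a_prev, w, b, g):
    """Return (z, a) for one layer.

    z[j] = b[j] + sum_i a_prev[i] * w[i][j]   (pre-activation)
    a[j] = g(z[j])                            (activation)
    """
    z = [b[j] + sum(a_prev[i] * w[i][j] for i in range(len(a_prev)))
         for j in range(len(b))]
    a = [g(z_j) for z_j in z]
    return z, a

# Hypothetical toy network: 2 inputs -> 2 hidden units -> 1 output.
a_i = [1.0, 0.5]                    # input signals
w_ij = [[0.1, -0.2], [0.4, 0.3]]    # input-to-hidden weights, w_ij[i][j]
b_j = [0.0, 0.1]                    # hidden biases
w_jk = [[0.5], [-0.6]]              # hidden-to-output weights, w_jk[j][k]
b_k = [0.2]                         # output bias

z_j, a_j = layer_forward(a_i, w_ij, b_j, sigmoid)   # hidden layer
z_k, a_k = layer_forward(a_j, w_jk, b_k, sigmoid)   # output layer
print("hidden pre-activations:", z_j)
print("network output:", a_k)
```

The same helper is applied twice, once per layer, which is the "distributed cascade" of transformations in code form.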
Training a neural network involves determining the set of parameters $\theta$ (the weights and biases) that minimizes the errors that the network makes. Often the choice for the error function is the sum of the squared errors between the target values $t_k$ and the network output $a_k$:
$$E = \frac{1}{2}\sum_{k \in K}(a_k - t_k)^2 \tag{1}$$
where $K$ is the dimensionality of the target/output for a single observation. This parameter optimization problem can be solved using gradient descent, which requires determining $\frac{\partial E}{\partial \theta}$ for every parameter $\theta$ in the model.
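As a small illustration of the error function, here is a sketch for a single observation. It assumes the conventional factor of 1/2 in front of the sum, which is what makes the derivative with respect to each output come out as a plain difference; the function names and the three-dimensional example values are hypothetical.

```python
def sum_squared_error(a_k, t_k):
    """E = 1/2 * sum_k (a_k - t_k)^2 for a single observation."""
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a_k, t_k))

def error_grad_wrt_output(a_k, t_k):
    """dE/da_k = (a_k - t_k); the 1/2 cancels the 2 from the square."""
    return [a - t for a, t in zip(a_k, t_k)]

# Hypothetical 3-dimensional network output and target.
a_k = [0.8, 0.1, 0.3]
t_k = [1.0, 0.0, 0.0]

E = sum_squared_error(a_k, t_k)
dE_da = error_grad_wrt_output(a_k, t_k)
print("error:", E)
print("dE/da_k:", dE_da)
```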
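Before we begin, let’s define the notation that will be used in the remainder of the derivation. Throughout, index $i$ runs over input layer nodes, $j$ over hidden layer nodes, and $k$ over output layer nodes. Please refer to Figure 1 for any clarifications.
- $z_j$: input to node $j$ in layer $l$
- $g_j$: activation function for node $j$ in layer $l$ (applied to $z_j$)
- $a_j = g_j(z_j)$: the output/activation of node $j$ in layer $l$
- $b_j$: bias/offset for unit $j$ in layer $l$
- $w_{ij}$: weights connecting node $i$ in layer $(l-1)$ to node $j$ in layer $l$
- $t_k$: target value for node $k$ in the output layer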
Also note that the parameters for an ANN can be broken up into two distinct sets: those parameters that are associated with the output layer (i.e. $w_{jk}$ and $b_k$), and thus directly affect the network output error; and the remaining parameters that are associated with the hidden layer(s), and thus affect the output error indirectly. We’ll first derive the gradients for the output layer parameters, then extend these results to the hidden layer parameters.
Gradients for Output Layer Parameters
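Output layer connection weights, $w_{jk}$
Since the output layer parameters directly affect the value of the error function, determining the gradient of the error function with respect to those parameters is fairly straightforward using an application of the chain rule [2]:
$$\frac{\partial E}{\partial w_{jk}} = \frac{1}{2}\sum_{k' \in K}\frac{\partial}{\partial w_{jk}}(a_{k'} - t_{k'})^2 = (a_k - t_k)\frac{\partial}{\partial w_{jk}}(a_k - t_k) \tag{2}$$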
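The derivative with respect to $t_k$ is zero because it does not depend on $w_{jk}$. We can also use the fact that $a_k = g_k(z_k)$, and re-apply the chain rule to give
$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\frac{\partial}{\partial w_{jk}} a_k = (a_k - t_k)\frac{\partial}{\partial w_{jk}} g_k(z_k) = (a_k - t_k)\, g_k'(z_k)\, \frac{\partial}{\partial w_{jk}} z_k \tag{3}$$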
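Now, recall that $z_k = b_k + \sum_j g_j(z_j)\, w_{jk}$ and thus $\frac{\partial z_k}{\partial w_{jk}} = g_j(z_j) = a_j$, giving us:
$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\, g_k'(z_k)\, a_j \tag{4}$$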
From Equation 4 we can see that the gradient of the error function with respect to the output layer weights $w_{jk}$ is a product of three terms:
- $(a_k - t_k)$: the difference between the network output $a_k$ and the target value $t_k$.
- $g_k'(z_k)$: the derivative of the output layer activation function $g_k$. For more details on activation function derivatives, please refer to this post.
- $a_j$: the activation signal of node $j$ from the hidden layer feeding into the output layer.
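If we define $\delta_k$ to be all the terms that involve index $k$:
$$\delta_k = (a_k - t_k)\, g_k'(z_k) \tag{5}$$
Then we get the “delta form” of the error function gradient for the output layer weights:
$$\frac{\partial E}{\partial w_{jk}} = \delta_k\, a_j \tag{6}$$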
Here the $\delta_k$ terms can be interpreted as the network output error after being “backpropagated” through the output activation function $g_k$, thus creating an “error signal”. Loosely speaking, Equation 6 can be interpreted as determining how much each $w_{jk}$ contributes to the error signal by weighting the error by the magnitude of the output activation from the previous (hidden) layer. The gradients with respect to each $w_{jk}$ are thus considered to be the “contribution” of that parameter to the total error signal and should be “negated” during learning. This gives the following gradient descent update rule for the output layer weights:
$$w_{jk} \leftarrow w_{jk} - \eta\,\frac{\partial E}{\partial w_{jk}} = w_{jk} - \eta\,\delta_k\, a_j \tag{7}$$
where $\eta$ is some step size, often referred to as the “learning rate”. Similar update rules are used to update the remaining parameters, once $\frac{\partial E}{\partial \theta}$ has been determined for each of them.
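To make the output layer result concrete, the sketch below computes the error signal $\delta_k = (a_k - t_k)\,g_k'(z_k)$, forms the weight gradient $\delta_k a_j$, compares one entry against a central finite-difference estimate, and takes a single gradient descent step. The sigmoid activation, the fixed hidden activations `a_j`, all numeric values, and the function names are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def output_forward(a_j, w_jk, b_k):
    """z_k = b_k + sum_j a_j * w_jk[j][k], then a_k = g_k(z_k)."""
    z_k = [b_k[k] + sum(a_j[j] * w_jk[j][k] for j in range(len(a_j)))
           for k in range(len(b_k))]
    return z_k, [sigmoid(z) for z in z_k]

def error(a_j, w_jk, b_k, t_k):
    _, a_k = output_forward(a_j, w_jk, b_k)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a_k, t_k))

def output_weight_grads(a_j, w_jk, b_k, t_k):
    """Return (delta_k, dE/dw_jk) using the delta form of the gradient."""
    z_k, a_k = output_forward(a_j, w_jk, b_k)
    delta_k = [(a_k[k] - t_k[k]) * sigmoid_prime(z_k[k])
               for k in range(len(t_k))]
    grad_w_jk = [[delta_k[k] * a_j[j] for k in range(len(t_k))]
                 for j in range(len(a_j))]
    return delta_k, grad_w_jk

# Hypothetical values: 2 hidden activations feeding 2 output units.
a_j = [0.6, 0.4]
w_jk = [[0.5, -0.3], [0.2, 0.8]]
b_k = [0.1, -0.1]
t_k = [1.0, 0.0]
eta = 0.5  # learning rate (assumed)

delta_k, grad_w_jk = output_weight_grads(a_j, w_jk, b_k, t_k)

# Central finite-difference estimate of dE/dw_jk for j=0, k=1.
eps = 1e-6
w_plus = [row[:] for row in w_jk]
w_minus = [row[:] for row in w_jk]
w_plus[0][1] += eps
w_minus[0][1] -= eps
numeric_grad = (error(a_j, w_plus, b_k, t_k)
                - error(a_j, w_minus, b_k, t_k)) / (2 * eps)

# One gradient descent step: w_jk <- w_jk - eta * dE/dw_jk.
w_jk_new = [[w_jk[j][k] - eta * grad_w_jk[j][k] for k in range(len(t_k))]
            for j in range(len(a_j))]

print("analytic:", grad_w_jk[0][1], "numeric:", numeric_grad)
print("error before:", error(a_j, w_jk, b_k, t_k),
      "after:", error(a_j, w_jk_new, b_k, t_k))
```

A finite-difference comparison like this is a common sanity check on a hand-derived gradient; it is far too slow for training but cheap for a single entry.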
As we’ll see shortly, the process of “backpropagating” the error signal can be repeated all the way back to the input layer by successively projecting $\delta_k$ back through $w_{jk}$, then through the derivative of the activation function for the hidden layer, $g_j'(z_j)$, to give the error signal $\delta_j$, and so on. This backpropagation concept is central to training neural networks with more than one layer.
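Output layer biases, $b_k$
As for the gradient of the error function with respect to the output layer biases, we follow the same routine as above for $w_{jk}$. However, the third term in Equation 3 becomes $\frac{\partial}{\partial b_k} z_k = \frac{\partial}{\partial b_k}\left[b_k + \sum_j g_j(z_j)\, w_{jk}\right] = 1$, giving the following gradient for the output biases:
$$\frac{\partial E}{\partial b_k} = (a_k - t_k)\, g_k'(z_k)\,(1) = \delta_k \tag{8}$$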
Thus the gradient for the biases is simply the back-propagated error signal from the output units. One interpretation of this is that the biases are weights on activations that are always equal to one, regardless of the feed-forward signal. Thus the bias gradients aren’t affected by the feed-forward signal, only by the error.
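The "bias as a weight on a constant one" interpretation can be checked directly. In the sketch below (sigmoid activation and all values are assumed for illustration), the bias vector is appended to the weight table as an extra row and a constant 1.0 is appended to the hidden activations; the weight gradient for that extra row then matches the bias gradient $\delta_k$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical hidden activations, output weights, biases, and targets.
a_j = [0.6, 0.4]
w_jk = [[0.5, -0.3], [0.2, 0.8]]
b_k = [0.1, -0.1]
t_k = [1.0, 0.0]
n_out = len(b_k)

# Explicit-bias formulation.
z_k = [b_k[k] + sum(a_j[j] * w_jk[j][k] for j in range(len(a_j)))
       for k in range(n_out)]
a_k = [sigmoid(z) for z in z_k]
# For the sigmoid, g'(z) = a * (1 - a).
delta_k = [(a_k[k] - t_k[k]) * a_k[k] * (1.0 - a_k[k]) for k in range(n_out)]
grad_b_k = list(delta_k)  # bias gradient is the error signal itself

# Augmented formulation: the bias is a weight on an always-one activation.
a_aug = a_j + [1.0]
w_aug = w_jk + [b_k]
z_aug = [sum(a_aug[j] * w_aug[j][k] for j in range(len(a_aug)))
         for k in range(n_out)]
grad_w_aug = [[delta_k[k] * a_aug[j] for k in range(n_out)]
              for j in range(len(a_aug))]

print("bias gradient:        ", grad_b_k)
print("last row of aug. grad:", grad_w_aug[-1])
```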
Gradients for Hidden Layer Parameters
Now that we’ve derived the gradients for the output layer parameters and established the notion of backpropagation, let’s use these results to derive the gradients for the remaining layers.
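Hidden layer connection weights, $w_{ij}$
Due to the indirect effect of the hidden layer on the output error, calculating the gradients for the hidden layer weights is somewhat more involved. However, the process starts just the same as for the output layer [3]:
$$\frac{\partial E}{\partial w_{ij}} = \frac{1}{2}\sum_{k \in K}\frac{\partial}{\partial w_{ij}}(a_k - t_k)^2 = \sum_{k}(a_k - t_k)\frac{\partial}{\partial w_{ij}} a_k \tag{9}$$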
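Continuing on, noting that $a_k = g_k(z_k)$ and again applying the chain rule, we obtain:
$$\frac{\partial E}{\partial w_{ij}} = \sum_{k}(a_k - t_k)\frac{\partial}{\partial w_{ij}} g_k(z_k) = \sum_{k}(a_k - t_k)\, g_k'(z_k)\, \frac{\partial}{\partial w_{ij}} z_k \tag{10}$$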
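Ok, now here’s where things get slightly more involved. Notice that the partial derivative in Equation 10 is with respect to $w_{ij}$, but the quantity being differentiated, $z_k$, is a function of index $k$. How the heck do we deal with that!? If we expand $z_k$ a little, we find that it is composed of other sub functions:
$$z_k = b_k + \sum_j a_j w_{jk} = b_k + \sum_j g_j(z_j)\, w_{jk} = b_k + \sum_j g_j\Big(b_j + \sum_i a_i w_{ij}\Big)\, w_{jk} \tag{11}$$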
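From Equation 11 we see that $z_k$ is indirectly dependent on $w_{ij}$. Equation 10 also suggests that we can again use the chain rule to calculate $\frac{\partial z_k}{\partial w_{ij}}$. This is probably the trickiest part of the derivation, and also requires noting that $z_j = b_j + \sum_i a_i w_{ij}$ and $\frac{\partial z_j}{\partial w_{ij}} = a_i$:
$$\begin{aligned} \frac{\partial z_k}{\partial w_{ij}} &= \frac{\partial z_k}{\partial a_j}\frac{\partial a_j}{\partial w_{ij}} = w_{jk}\frac{\partial a_j}{\partial w_{ij}} \\ &= w_{jk}\frac{\partial g_j(z_j)}{\partial w_{ij}} = w_{jk}\, g_j'(z_j)\frac{\partial z_j}{\partial w_{ij}} \\ &= w_{jk}\, g_j'(z_j)\, a_i \end{aligned} \tag{12}$$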
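Now, plugging Equation 12 into Equation 10 gives the following expression for $\frac{\partial E}{\partial w_{ij}}$, where $\delta_k = (a_k - t_k)\, g_k'(z_k)$ is the output layer error signal:
$$\frac{\partial E}{\partial w_{ij}} = \sum_{k}(a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j)\, a_i = g_j'(z_j)\, a_i \sum_{k}\delta_k w_{jk} \tag{13}$$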
Notice that the gradient for the hidden layer weights has a similar form to that of the gradient for the output layer weights. Namely, the gradient is composed of three terms:
- the derivative of the current layer’s activation function, $g_j'(z_j)$.
- the output activation signal from the layer below, $a_i$.
- an error term, $\sum_k \delta_k w_{jk}$.
For the output layer weight gradients, the error term was simply the difference between the output layer activations and the targets, $a_k - t_k$. Here, the error term is built from the output layer error signal $\delta_k$, which is further projected back onto the weights $w_{jk}$. Analogous to the output layer weights, the gradient for the hidden layer weights can be interpreted as a proxy for the “contribution” of the weights to the output error signal. However, for hidden layers, this error can only be “observed” from the point of view of the weights by backpropagating the error signal through the layers above the hidden layer.
To make this idea more explicit, we can define the resulting error signal backpropagated to layer $j$ as $\delta_j$, which collects all terms in Equation 13 that involve index $j$:
$$\delta_j = g_j'(z_j)\sum_{k}\delta_k w_{jk} \tag{14}$$
Substituting this definition into Equation 13 gives the final expression for the hidden layer weight gradient:
$$\frac{\partial E}{\partial w_{ij}} = \delta_j\, a_i \tag{15}$$
Equation 15 suggests that in order to calculate the weight gradients at any layer in an arbitrarily-deep neural network, we simply need to calculate the backpropagated error signal that reaches that layer from the “above” layers, and weight it by the feed-forward signal feeding into that layer.
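The hidden layer result can be sketched the same way: backpropagate the output error signal through the hidden-to-output weights, scale by the hidden activation derivative to get $\delta_j$, and weight by the incoming signal $a_i$. The example below assumes sigmoid activations in both layers and a made-up 2-3-2 network, and checks one gradient entry against a central finite-difference estimate. All names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(a_i, w_ij, b_j, w_jk, b_k):
    """Feed-forward pass through one hidden layer and the output layer."""
    z_j = [b_j[j] + sum(a_i[i] * w_ij[i][j] for i in range(len(a_i)))
           for j in range(len(b_j))]
    a_j = [sigmoid(z) for z in z_j]
    z_k = [b_k[k] + sum(a_j[j] * w_jk[j][k] for j in range(len(a_j)))
           for k in range(len(b_k))]
    a_k = [sigmoid(z) for z in z_k]
    return z_j, a_j, z_k, a_k

def error(a_i, w_ij, b_j, w_jk, b_k, t_k):
    a_k = forward(a_i, w_ij, b_j, w_jk, b_k)[3]
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a_k, t_k))

def hidden_weight_grads(a_i, w_ij, b_j, w_jk, b_k, t_k):
    """Return (delta_j, dE/dw_ij) via the backpropagated error signal."""
    z_j, a_j, z_k, a_k = forward(a_i, w_ij, b_j, w_jk, b_k)
    # Output error signal: delta_k = (a_k - t_k) * g_k'(z_k).
    delta_k = [(a_k[k] - t_k[k]) * sigmoid_prime(z_k[k])
               for k in range(len(t_k))]
    # Hidden error signal: delta_j = g_j'(z_j) * sum_k delta_k * w_jk.
    delta_j = [sigmoid_prime(z_j[j])
               * sum(delta_k[k] * w_jk[j][k] for k in range(len(delta_k)))
               for j in range(len(z_j))]
    # Weight gradient: dE/dw_ij = delta_j * a_i.
    grad_w_ij = [[delta_j[j] * a_i[i] for j in range(len(delta_j))]
                 for i in range(len(a_i))]
    return delta_j, grad_w_ij

# Hypothetical 2-3-2 network.
a_i = [1.0, -0.5]
w_ij = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.6]]
b_j = [0.0, 0.1, -0.1]
w_jk = [[0.5, -0.3], [0.2, 0.8], [-0.7, 0.4]]
b_k = [0.1, -0.1]
t_k = [1.0, 0.0]

delta_j, grad_w_ij = hidden_weight_grads(a_i, w_ij, b_j, w_jk, b_k, t_k)

# Central finite-difference estimate of dE/dw_ij for i=1, j=2.
eps = 1e-6
w_plus = [row[:] for row in w_ij]
w_minus = [row[:] for row in w_ij]
w_plus[1][2] += eps
w_minus[1][2] -= eps
numeric_grad = (error(a_i, w_plus, b_j, w_jk, b_k, t_k)
                - error(a_i, w_minus, b_j, w_jk, b_k, t_k)) / (2 * eps)

print("analytic:", grad_w_ij[1][2], "numeric:", numeric_grad)
```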
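Hidden layer biases, $b_j$
Calculating the error gradients with respect to the hidden layer biases follows a very similar procedure to that for the hidden layer weights where, as in Equation 12, we use the chain rule to calculate $\frac{\partial z_k}{\partial b_j}$:
$$\frac{\partial E}{\partial b_j} = \sum_{k}(a_k - t_k)\frac{\partial}{\partial b_j} g_k(z_k) = \sum_{k}(a_k - t_k)\, g_k'(z_k)\, \frac{\partial z_k}{\partial b_j} \tag{16}$$
Again, using the chain rule to solve for $\frac{\partial z_k}{\partial b_j}$:
$$\frac{\partial z_k}{\partial b_j} = \frac{\partial z_k}{\partial a_j}\frac{\partial a_j}{\partial b_j} = w_{jk}\, g_j'(z_j)\frac{\partial z_j}{\partial b_j} = w_{jk}\, g_j'(z_j)\frac{\partial}{\partial b_j}\Big(b_j + \sum_i a_i w_{ij}\Big) = w_{jk}\, g_j'(z_j)\,(1) \tag{17}$$
Plugging Equation 17 into the expression for $\frac{\partial z_k}{\partial b_j}$ in Equation 16 gives the final expression for the hidden layer bias gradients:
$$\frac{\partial E}{\partial b_j} = \sum_{k}(a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j) = g_j'(z_j)\sum_{k}\delta_k w_{jk} = \delta_j \tag{18}$$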
In a similar fashion to the calculation of the bias gradients for the output layer, the gradients for the hidden layer biases are simply the backpropagated error signal reaching that layer. This suggests that we can also calculate the bias gradients at any layer in an arbitrarily-deep network by simply calculating the backpropagated error signal reaching that layer. Pretty cool!
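Because every layer uses the same two rules (weight gradient = error signal times incoming activation, bias gradient = error signal), the whole procedure fits in one loop that walks the layers in reverse. The following is a minimal sketch under stated assumptions, not a reference implementation: it assumes sigmoid activations everywhere, a hypothetical 2-3-3-1 network with randomly initialized weights, a single training example, and a learning rate of 0.5.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Return the activations of every layer, starting with the input."""
    activations = [x]
    for w, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [b[j] + sum(a_prev[i] * w[i][j] for i in range(len(a_prev)))
             for j in range(len(b))]
        activations.append([sigmoid(v) for v in z])
    return activations

def error(x, t, weights, biases):
    out = forward(x, weights, biases)[-1]
    return 0.5 * sum((a - tk) ** 2 for a, tk in zip(out, t))

def backprop(x, t, weights, biases):
    """Return per-layer weight and bias gradients of the squared error."""
    activations = forward(x, weights, biases)
    out = activations[-1]
    # Output error signal; for the sigmoid, g'(z) = a * (1 - a).
    delta = [(a - tk) * a * (1.0 - a) for a, tk in zip(out, t)]
    grads_w, grads_b = [], []
    for layer in reversed(range(len(weights))):
        a_prev = activations[layer]  # signal feeding into this layer
        grads_w.insert(0, [[a_prev[i] * d for d in delta]
                           for i in range(len(a_prev))])
        grads_b.insert(0, list(delta))
        if layer > 0:
            # Project the error signal back through this layer's weights,
            # then through the previous layer's activation derivative.
            w = weights[layer]
            delta = [a_prev[i] * (1.0 - a_prev[i])
                     * sum(w[i][j] * delta[j] for j in range(len(delta)))
                     for i in range(len(a_prev))]
    return grads_w, grads_b

# Hypothetical 2-3-3-1 network with a fixed seed for repeatability.
rng = random.Random(0)
sizes = [2, 3, 3, 1]
weights = [[[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[0.0] * n_out for n_out in sizes[1:]]
x, t = [0.5, -1.0], [1.0]

# Finite-difference check on one first-layer bias.
grads_w0, grads_b0 = backprop(x, t, weights, biases)
eps = 1e-6
b_plus = [b[:] for b in biases]
b_minus = [b[:] for b in biases]
b_plus[0][1] += eps
b_minus[0][1] -= eps
numeric = (error(x, t, weights, b_plus)
           - error(x, t, weights, b_minus)) / (2 * eps)

# A few hundred plain gradient descent steps on the single example.
eta = 0.5
e_before = error(x, t, weights, biases)
for _ in range(200):
    grads_w, grads_b = backprop(x, t, weights, biases)
    weights = [[[w[i][j] - eta * gw[i][j] for j in range(len(w[i]))]
                for i in range(len(w))]
               for w, gw in zip(weights, grads_w)]
    biases = [[b[j] - eta * gb[j] for j in range(len(b))]
              for b, gb in zip(biases, grads_b)]
e_after = error(x, t, weights, biases)

print("bias gradient (analytic vs numeric):", grads_b0[0][1], numeric)
print("error before:", e_before, "after:", e_after)
```

Nothing in `backprop` depends on the number of layers, which is the practical payoff of the delta formulation: adding depth only adds iterations to the reverse loop.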
In this post we went over some of the formal details of the backpropagation learning algorithm. The math covered in this post allows us to train arbitrarily deep neural networks by re-applying the same basic computations. In a later post, we’ll go a bit deeper in implementation and applications of neural networks, referencing this post for the formal development of the underlying calculus required for gradient descent.
[1] Though, I guess these days with autograd, who really needs to understand how the calculus for gradient descent works, amiright? (hint: that is a joke)
[2] You may also notice that the summation disappears in the derivative. This is because we take the partial derivative with respect to the $k$-th dimension/node, so the only term that survives in the error gradient is the $k$-th, and we can thus ignore the remaining terms in the summation.
[3] Notice here that the sum does not disappear in the derivative as it did for the output layer parameters. This is due to the fact that the hidden layers are fully connected, and thus each of the hidden unit outputs affects the state of each output unit.