Backpropagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks. It can be used to train Elman networks. The algorithm was independently derived by numerous researchers.
Algorithm
The training data for a recurrent neural network is an ordered sequence of n input-output pairs, ⟨a[0], y[0]⟩, ⟨a[1], y[1]⟩, ..., ⟨a[n−1], y[n−1]⟩. An initial value must be specified for the hidden state x[0]; typically, a vector of all zeros is used. BPTT begins by unfolding the recurrent neural network in time. The unfolded network contains k inputs and outputs, but every copy of the network shares the same parameters. The backpropagation algorithm is then used to find the gradient of the cost with respect to all the network parameters.

Consider an example of a neural network that contains a recurrent layer f and a feedforward layer g. There are different ways to define the training cost, but the total cost is always the average of the costs of the individual time steps, and the cost of each time step can be computed separately. The figure above shows how the cost at a single time step can be computed, by unfolding the recurrent layer f for three time steps and adding the feedforward layer g. Each instance of f in the unfolded network shares the same parameters, so the weight updates computed for the individual instances are summed together.
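The unfolding and summation described above can be illustrated with a minimal sketch. It makes several assumptions that are not in the text: the hidden state is a single scalar, the recurrent layer (called f here) is a tanh unit with hypothetical weights w and u, the feedforward layer (called g here) is a linear unit with hypothetical weight v, and the cost is half the squared error. The sketch unfolds f once per input, backpropagates through the copies, and keeps the gradient contribution of each copy separate so that the summation over shared weights is visible.

```python
import math


def f(x, a, w, u):
    """Recurrent layer (assumed form): next hidden state from state x and input a."""
    return math.tanh(w * x + u * a)


def g(x, v):
    """Feedforward output layer (assumed form): linear read-out of the hidden state."""
    return v * x


def unfolded_cost(x0, inputs, target, w, u, v):
    """Forward pass through len(inputs) copies of f followed by g."""
    x = x0
    for a in inputs:
        x = f(x, a, w, u)
    return 0.5 * (g(x, v) - target) ** 2


def unfolded_gradients(x0, inputs, target, w, u, v):
    """Backpropagate through the unfolded network.

    Returns the per-copy gradient contributions for the shared weights
    w and u (one entry per copy of f), plus the gradient for v.
    """
    # Forward pass, keeping the hidden state that enters each copy of f.
    states = [x0]
    for a in inputs:
        states.append(f(states[-1], a, w, u))

    dp = g(states[-1], v) - target   # dC/dp for C = 0.5 * (p - target)**2
    dv = dp * states[-1]             # gradient for the output weight
    dx = dp * v                      # gradient at the output of the last copy of f

    dw_parts, du_parts = [], []
    for i in reversed(range(len(inputs))):
        dz = dx * (1.0 - states[i + 1] ** 2)   # back through tanh
        dw_parts.insert(0, dz * states[i])     # contribution of copy i to dC/dw
        du_parts.insert(0, dz * inputs[i])     # contribution of copy i to dC/du
        dx = dz * w                            # pass the gradient to the previous copy
    return dw_parts, du_parts, dv


# Unfold for three time steps, starting from a zero hidden state.
dw_parts, du_parts, dv = unfolded_gradients(0.0, [0.5, -0.3, 0.8], 0.2,
                                            w=0.7, u=-0.4, v=1.1)
dw = sum(dw_parts)   # the shared weight w receives the sum over all copies
du = sum(du_parts)
print("per-copy dC/dw:", dw_parts, "summed:", dw)
```

Because every copy of f uses the same w and u, the gradient with respect to a shared weight is the sum of the per-copy contributions, which is why the summed values are the ones applied in a weight update.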
Pseudocode
Pseudocode for a truncated version of BPTT, where the training data contains n input-output pairs, but the network is unfolded for k time steps:

    Back_Propagation_Through_Time(a, y)   // a[t] is the input at time t. y[t] is the output
        Unfold the network to contain k instances of f
        do until stopping criterion is met:
            x := the zero-magnitude vector   // x is the current context
            for t from 0 to n − k do         // t is time. n is the length of the training sequence
                Set the network inputs to x, a[t], a[t+1], ..., a[t+k−1]
                p := forward-propagate the inputs over the whole unfolded network
                e := y[t+k−1] − p            // error = target − prediction
                Back-propagate the error, e, back across the whole unfolded network
                Sum the weight changes in the k instances of f together.
                Update all the weights in f and g.
                x := f(x, a[t])              // compute the context for the next time-step
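The pseudocode above can be turned into a small runnable sketch. The following is one possible reading, not a reference implementation. It assumes a scalar hidden state, a tanh recurrent layer f with hypothetical weights w and u, a linear output layer g with hypothetical weight v, a cost of half the squared error, a fixed number of passes as the stopping criterion, and a target aligned with the last input of each window. The helper `mean_cost` averages the per-step costs over the whole sequence.

```python
import math


def mean_cost(a, y, w, u, v):
    """Average of the per-time-step costs over the whole sequence."""
    x, total = 0.0, 0.0
    for a_t, y_t in zip(a, y):
        x = math.tanh(w * x + u * a_t)        # recurrent layer f
        total += 0.5 * (y_t - v * x) ** 2     # cost of this time step
    return total / len(a)


def truncated_bptt(a, y, k, weights=(0.5, 0.5, 0.5), epochs=200, lr=0.05):
    """Truncated BPTT for a scalar network (illustrative assumptions only).

    a, y    : input and target sequences of equal length n
    k       : number of time steps the network is unfolded for
    weights : initial (w, u, v); w and u belong to f, v belongs to g
    """
    n = len(a)
    w, u, v = weights
    for _ in range(epochs):              # stopping criterion: fixed number of passes
        x = 0.0                          # zero initial context
        for t in range(n - k + 1):       # t = 0 .. n - k inclusive
            # Forward pass over k copies of f, then g.
            states = [x]
            for i in range(k):
                states.append(math.tanh(w * states[-1] + u * a[t + i]))
            p = v * states[-1]
            e = y[t + k - 1] - p         # error = target - prediction

            # Backward pass for the cost 0.5 * e**2.
            dv = -e * states[-1]
            dx = -e * v
            dw = du = 0.0
            for i in reversed(range(k)):
                dz = dx * (1.0 - states[i + 1] ** 2)
                dw += dz * states[i]     # sum the contributions of the k copies of f
                du += dz * a[t + i]
                dx = dz * w

            # Update all the weights in f and g.
            w -= lr * dw
            u -= lr * du
            v -= lr * dv

            # Compute the context for the next time step.
            x = math.tanh(w * x + u * a[t])
    return w, u, v


# Synthetic toy data: targets come from a "teacher" network of the same form.
a = [math.sin(0.7 * t) for t in range(30)]
y, x = [], 0.0
for a_t in a:
    x = math.tanh(0.3 * x + 0.9 * a_t)
    y.append(1.2 * x)

trained = truncated_bptt(a, y, k=3)
print("mean cost before:", mean_cost(a, y, 0.5, 0.5, 0.5))
print("mean cost after: ", mean_cost(a, y, *trained))
```

The gradient in each window only flows back through k copies of f, so any dependence of the context x on earlier weights is ignored. That is the truncation, and it trades gradient accuracy for a bounded cost per update.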
BPTT has difficulty with local optima. With recurrent neural networks, local optima are a much more significant problem than with feed-forward neural networks. The recurrent feedback in such networks tends to create chaotic responses in the error surface, which cause local optima to occur frequently and in poor locations on the error surface.