Catastrophic interference
Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to completely and abruptly forget previously learned information upon learning new information. Neural networks are an important part of the network approach and connectionist approach to cognitive science. These networks use computer simulations to try to model human behaviours, such as memory and learning. Catastrophic interference is an important issue to consider when creating connectionist models of memory. It was originally brought to the attention of the scientific community by research from McCloskey and Cohen, and Ratcliff.
Catastrophic interference is a radical manifestation of the 'sensitivity-stability' dilemma, also called the 'stability-plasticity' dilemma. Specifically, these terms refer to the problem of building an artificial neural network that is sensitive to, but not disrupted by, new information. Lookup tables and connectionist networks lie at opposite ends of the stability-plasticity spectrum. The former remain completely stable in the presence of new information but lack the ability to generalize, i.e. infer general principles, from new inputs. Connectionist networks such as the standard backpropagation network, on the other hand, are very sensitive to new information and can generalize from new inputs.
Backpropagation models can be considered good models of human memory insofar as they mirror the human ability to generalize, but these networks often exhibit less stability than human memory. Notably, backpropagation networks are susceptible to catastrophic interference. This is a problem when attempting to model human memory because, unlike these networks, humans typically do not show catastrophic forgetting. Thus, catastrophic interference must be eliminated from backpropagation models in order to enhance their plausibility as models of human memory.
History of catastrophic interference
The term catastrophic interference was originally coined by McCloskey and Cohen, but the problem was also brought to the attention of the scientific community by research from Ratcliff.
''The Sequential Learning Problem'': McCloskey and Cohen (1989)
McCloskey and Cohen noted the problem of catastrophic interference during two different experiments with backpropagation neural network modelling.
- Experiment 1: Learning the ones and twos addition facts
- Experiment 2: Replication of Barnes and Underwood study
McCloskey and Cohen tried to reduce interference through a number of manipulations, including changing the number of hidden units, changing the value of the learning rate parameter, overtraining on the A-B list, freezing certain connection weights, and using target values of 0 and 1 instead of 0.1 and 0.9. However, none of these manipulations satisfactorily reduced the catastrophic interference exhibited by the networks.
Overall, McCloskey and Cohen concluded that:
- at least some interference will occur whenever new learning alters the weights involved in representing old knowledge
- the greater the amount of new learning, the greater the disruption in old knowledge
- interference was catastrophic in the backpropagation networks when learning was sequential but not concurrent
''Constraints Imposed by Learning and Forgetting Functions'': Ratcliff (1990)
- Well-learned information was catastrophically forgotten as new information was learned in both small and large backpropagation networks.
- Discrimination between the studied items and previously unseen items decreased as the network learned more.
Proposed solutions
Many researchers have suggested that the main cause of catastrophic interference is overlap in the representations at the hidden layer of distributed neural networks. In a distributed representation, any given input will tend to change the weights to many of the nodes. Catastrophic forgetting occurs because, when many of the weights where "knowledge is stored" are changed, it is impossible for prior knowledge to be kept intact. During sequential learning, the inputs become mixed, with the new input being superimposed on top of the old input.
Another way to conceptualize this is to visualize learning as movement through a weight space. This weight space can be likened to a spatial representation of all of the possible combinations of weights that the network can possess. When a network first learns to represent a set of patterns, it has found a point in weight space which allows it to recognize all of the patterns that it has seen. However, when the network learns a new set of patterns sequentially, it will move to a place in the weight space that allows it to recognize only the new patterns. To recognize both sets of patterns, the network must find a place in weight space that can represent both the new and the old patterns. One way to do this is by connecting a hidden unit to only a subset of the input units. This reduces the likelihood that two different inputs will be encoded by the same hidden units and weights, and so decreases the chance of interference. Indeed, a number of the proposed solutions to catastrophic interference involve reducing the amount of overlap that occurs when storing information in these weights.
Many of the early techniques for reducing representational overlap involved making either the input vectors or the hidden unit activation patterns orthogonal to one another. Lewandowsky and Li noted that the interference between sequentially learned patterns is minimized if the input vectors are orthogonal to each other.
Input vectors are said to be orthogonal to each other if the pairwise products of their elements sum to zero. For example, the patterns [0, 0, 1, 0] and [0, 1, 0, 0] are orthogonal because (0×0 + 0×1 + 1×0 + 0×0) = 0. One of the techniques which can create orthogonal representations at the hidden layers involves bipolar feature coding (i.e., coding with −1 and 1 rather than 0 and 1). Orthogonal patterns tend to produce less interference with each other. However, not all learning problems can be represented using these types of vectors, and some studies report that the degree of interference is still problematic with orthogonal vectors.
Simple techniques such as varying the learning rate parameters in the backpropagation equation were not successful in reducing interference. Varying the number of hidden nodes has also been used to try to reduce interference. However, the findings have been mixed, with some studies finding that more hidden units decrease interference and other studies finding that they do not.
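The orthogonality condition described above can be checked directly. The following is a minimal sketch in plain Python; the function names and the example patterns are illustrative assumptions, not taken from the cited studies.

```python
def dot(u, v):
    """Sum of the pairwise products of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def are_orthogonal(u, v, tol=1e-12):
    """Two vectors are orthogonal when their dot product is zero."""
    return abs(dot(u, v)) <= tol


# Illustrative binary patterns (hypothetical, chosen for this sketch).
p1 = [0, 0, 1, 0]
p2 = [0, 1, 0, 0]
p3 = [0, 1, 1, 0]

# Illustrative bipolar patterns, coded with -1 and 1.
b1 = [1, -1, 1, -1]
b2 = [1, 1, -1, -1]

print(are_orthogonal(p1, p2))  # no shared active element
print(are_orthogonal(p1, p3))  # share one active element
print(are_orthogonal(b1, b2))  # products cancel out
```

Binary patterns are orthogonal only when they share no active element, whereas bipolar patterns can be orthogonal while every element is non-zero, because positive and negative products cancel.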
Below are a number of techniques which have empirical support in successfully reducing catastrophic interference in backpropagation neural networks:
Node sharpening technique
French proposed that catastrophic interference arises in feedforward backpropagation networks due to the interaction of node activations, or activation overlap, that occurs in distributed representations at the hidden layer. Specifically, he defined this activation overlap as the average shared activation over all units in the hidden layer, calculated by taking the lower of the two activations at each hidden node, summing these values, and dividing by the number of nodes. For example, if the activations at the hidden layer from one input are (0.3, 0.1, 0.9, 1.0) and the activations from the next input are (0.0, 0.9, 0.1, 0.9), the activation overlap would be (0.0 + 0.1 + 0.1 + 0.9) / 4 = 0.275. When input vectors use a binary representation, activation overlap ranges from 0 to 1, where 0 indicates no activation overlap and 1 indicates full activation overlap.
French noted that neural networks which employ very localized representations do not show catastrophic interference because of the lack of overlap at the hidden layer. That is to say, each input pattern creates a hidden layer representation that involves the activation of only one node, so different inputs have an activation overlap of 0. Thus, he suggested that reducing the value of activation overlap at the hidden layer would reduce catastrophic interference in distributed networks. Specifically, he proposed that this could be done by changing the distributed representations at the hidden layer to 'semi-distributed' representations. A 'semi-distributed' representation has fewer hidden nodes that are active, and/or a lower activation value for these nodes, for each representation, which makes the representations of different inputs overlap less at the hidden layer.
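The activation overlap measure described above can be sketched in a few lines of plain Python. The function name and the example activation values are illustrative assumptions.

```python
def activation_overlap(first, second):
    """Average, over hidden nodes, of the lower of the two activations."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty vectors of equal length")
    shared = sum(min(a, b) for a, b in zip(first, second))
    return shared / len(first)


# Illustrative distributed hidden-layer activations for two inputs.
distributed = activation_overlap([0.3, 0.1, 0.9, 1.0], [0.0, 0.9, 0.1, 0.9])
print(distributed)  # approximately 0.275

# Fully localized representations: one active node each, no shared node.
localized = activation_overlap([1, 0, 0, 0], [0, 1, 0, 0])
print(localized)
```

The localized case has no shared activation at all, which is the property that motivates moving distributed representations toward 'semi-distributed' ones.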
French recommended that this could be done through 'activation sharpening', a technique which slightly increases the activation of a certain number of the most active nodes in the hidden layer, slightly reduces the activation of all the other units, and then changes the input-to-hidden layer weights to reflect these activation changes. Overall, the guidelines for the process of 'activation sharpening' are as follows:
- Perform a forward activation pass by feeding an input from the input layer to the hidden layer and record the activations at the hidden layer
- "Sharpen" the activation of the x most active nodes by a sharpening factor α:
  - Anew = Aold + α(1 − Aold) for nodes to be sharpened, i.e. more activated
  - Anew = Aold − αAold for all other nodes
  - French suggested that the number of nodes to be sharpened should be log n, where n is the number of hidden layer nodes
- Use the difference between the old activation and the sharpened activation as an error, backpropagate this error to the input layer, and modify the input-to-hidden weights appropriately
- Do a full forward pass with the input through to the output layer
- Backpropagate as usual from the output to the input layer
- Repeat
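The sharpening step in the list above can be sketched as follows. This is a minimal illustration, not French's implementation: the function names and default values are hypothetical, the increase for sharpened nodes is scaled by (1 − Aold) so that activations stay at or below 1 (an assumption of this sketch), and the natural logarithm is assumed for "log n" because the base is not stated.

```python
import math


def sharpen(activations, alpha=0.2, k=None):
    """Return sharpened copies of hidden-layer activations.

    The k most active nodes are pushed toward 1; all others toward 0.
    """
    n = len(activations)
    if k is None:
        # Assumption: natural log, rounded, with at least one node sharpened.
        k = max(1, round(math.log(n)))
    ranked = sorted(range(n), key=lambda i: activations[i], reverse=True)
    top = set(ranked[:k])
    return [
        a + alpha * (1 - a) if i in top else a - alpha * a
        for i, a in enumerate(activations)
    ]


def sharpening_error(activations, alpha=0.2, k=None):
    """Difference between sharpened and original activations.

    This is the error signal that would be backpropagated to adjust
    the input-to-hidden weights.
    """
    sharpened = sharpen(activations, alpha, k)
    return [s - a for s, a in zip(sharpened, activations)]


hidden = [0.2, 0.9, 0.5, 0.1]
print(sharpen(hidden, alpha=0.2, k=1))
print(sharpening_error(hidden, alpha=0.2, k=1))
```

Only the most active node receives a positive error here; every other node receives a negative error proportional to its current activation.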
According to French, sharpened activations interfere less with the weights in the network than unsharpened activations, and this is due specifically to the way that the backpropagation algorithm calculates weight changes. Activations near 0 change the weights of links less than activations near 1. Consequently, when there are many nodes with low activations, the weights to and from these nodes are modified much less than the weights on very active nodes. As a result, when a new input is fed into the network, sharpening reduces activation overlap by limiting the number of highly active hidden units, and reduces the likelihood of representational overlap by reducing the number of weights that are changed. Thus, node sharpening decreases the amount of disruption to the old weights, which store prior input patterns, thereby reducing the likelihood of catastrophic forgetting.
Novelty rule
Kortge proposed a learning rule for training neural networks, called the 'novelty rule', to help alleviate catastrophic interference. As its name suggests, this rule helps the neural network to learn only the components of a new input that differ from an old input. Consequently, the novelty rule changes only the weights that were not previously dedicated to storing information, thereby reducing the overlap in representations at the hidden units. Thus, even when inputs are somewhat similar to one another, dissimilar representations can be made at the hidden layer. To apply the novelty rule, the input pattern is replaced during learning by a novelty vector that represents the components that differ. The novelty vector for the first layer is determined by taking the target pattern away from the current output of the network. For the second layer, the novelty vector is simply the activation of the hidden units that resulted from using the novelty vector as an input through the first layer. Weight changes in the network are computed using a modified delta rule in which the novelty vector replaces the activation value.
When the novelty rule is used in a standard backpropagation network, there is no, or lessened, forgetting of old items when new items are presented sequentially. However, this rule can only be applied to auto-encoder or auto-associative networks, in which the target response for the output layer is identical to the input pattern. This is because the novelty vector would be meaningless if the desired output were not identical to the input, as it would be impossible to calculate how much a new input differed from the old input.
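As a rough illustration of the modified delta rule described above, the standard rule and its novelty variant might be written as follows. This is a sketch based only on the description in this section; the symbols (learning rate η, error term δ_j, sending activation a_i, novelty component d_i, output o and target t) are notational assumptions rather than Kortge's own notation.

```latex
% Standard delta rule for the weight from unit i to unit j:
\Delta w_{ij} = \eta \, \delta_j \, a_i

% Novelty rule: the novelty component d_i replaces the activation a_i.
\Delta w_{ij} = \eta \, \delta_j \, d_i

% First-layer novelty vector: the target pattern t taken away from the
% current output o (in an auto-encoder, t is the input pattern itself).
d = o - t
```

Under this reading, a component of the input that the network already reproduces correctly gives d_i = 0, so the weights attached to it are left unchanged.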
Pre-training networks
McRae and Hetherington argued that humans, unlike most neural networks, do not take on new learning tasks with a random set of weights. Rather, people tend to bring a wealth of prior knowledge to a task, and this helps to avoid the problem of interference. They proposed that when a network is pre-trained on a random sample of data before starting a sequential learning task, this prior knowledge will naturally constrain how the new information can be incorporated. This would occur because training on a random sample of data from a domain with a high degree of internal structure, such as the English language, would capture the regularities, or recurring patterns, found within that domain. Since the domain is based on regularities, a newly learned item will tend to be similar to the previously learned information, which allows the network to incorporate new data with little interference with existing data. Specifically, an input vector which follows the same pattern of regularities as the previously trained data should not cause a drastically different pattern of activation at the hidden layer or drastically alter the weights.
To test their hypothesis, McRae and Hetherington compared the performance of a naïve and a pre-trained auto-encoder backpropagation network on three simulations of verbal learning tasks. The pre-trained network was trained using letter-based representations of English monosyllabic words or English word pairs. All three tasks involved learning some consonant-vowel-consonant (CVC) strings or CVC pairs, followed by training on a second list of these items. Afterwards, the distributions of the hidden node activations were compared between the naïve and pre-trained networks. In all three tasks, the representations of a CVC in the naïve network tended to be spread fairly evenly across all hidden nodes, whereas most hidden nodes were inactive in the pre-trained network.
Furthermore, in the pre-trained network the representational overlap between CVCs was reduced compared to the naïve network. The pre-trained network also retained some similarity information, as the representational overlap between similar CVCs, like "JEP" and "ZEP", was greater than for dissimilar CVCs, such as "JEP" and "YUG". This suggests that the pre-trained network had a better ability to generalize, i.e. notice the patterns, than the naïve network. Most importantly, this reduction in hidden unit activation and representational overlap resulted in significantly less forgetting in the pre-trained network than in the naïve network, essentially eliminating catastrophic interference. In effect, the pre-training created an internal orthogonalization of the activations at the hidden layer, which reduced interference. Thus, pre-training is a simple way to reduce catastrophic forgetting in standard backpropagation networks.
Pseudo-recurrent networks
French proposed the idea of a pseudo-recurrent backpropagation network in order to help reduce catastrophic interference. In this model the network is separated into two functionally distinct but interacting sub-networks. This model is biologically inspired and is based on research from McClelland, McNaughton, and O'Reilly. In this research, McClelland et al. suggested that the hippocampus and neocortex act as separable but complementary memory systems. Specifically, the hippocampus acts as short term memory storage and works gradually over time to transfer memories into the neocortex for long term memory storage. They suggest that the information that is stored can be "brought back" to the hippocampus during active rehearsal, reminiscence, and sleep, and that this renewed activation is what transfers the information to the neocortex over time.
In the pseudo-recurrent network, one of the sub-networks acts as an early processing area, akin to the hippocampus, and functions to learn new input patterns. The other sub-network acts as a final-storage area, akin to the neocortex. However, unlike in McClelland et al.'s model, the final-storage area sends internally generated representations back to the early processing area. This creates a recurrent network. French proposed that this interleaving of old representations with new representations is the only way to reduce radical forgetting. Since the brain would most likely not have access to the original input patterns, the patterns that would be fed back to the neocortex would be internally generated representations called pseudopatterns. These pseudopatterns are approximations of previous inputs, and they can be interleaved with the learning of new inputs. The use of these pseudopatterns could be biologically plausible, as there are parallels between the consolidation of learning that occurs during sleep and the use of interleaved pseudopatterns.
Specifically, they both serve to integrate new information with old information without disrupting the old information.
When an input is fed into the pseudo-recurrent network, the network acts as follows:
- When a pattern is fed from the environment, the information travels both to the early processing area and to the final storage area; however, the teacher nodes inhibit the output from the final storage area
- The new pattern is learned by the early processing area by the standard backpropagation algorithm
- At the same time random input is also fed into the network and causes pseudopatterns to be generated by the final storage area
- Output from the final-storage area, in the form of pseudopatterns, will be used as a teacher for the early-processing area. In this way, the pseudopatterns are interleaved with the 'real inputs' from the environment
- Once the new pattern and the pseudopattern are learned by the early processing area, its weights are copied to the corresponding weights in the final storage area.
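The pseudopattern steps in the list above can be sketched as follows. This is a deliberately simplified illustration rather than French's model: the final-storage area is reduced to a single layer of sigmoid units, and all function names, weights and patterns are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, inputs):
    """Output of a single-layer net; weights[j] holds the weights into unit j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def make_pseudopatterns(final_storage_weights, n_inputs, count, rng):
    """Feed random binary inputs through the final-storage net.

    Each (input, output) pair is a pseudopattern: an approximation of
    what the final-storage area has previously learned.
    """
    patterns = []
    for _ in range(count):
        x = [rng.choice([0.0, 1.0]) for _ in range(n_inputs)]
        patterns.append((x, forward(final_storage_weights, x)))
    return patterns


def interleave(new_patterns, pseudopatterns, rng):
    """Mix real patterns with pseudopatterns into one training set."""
    mixed = list(new_patterns) + list(pseudopatterns)
    rng.shuffle(mixed)
    return mixed


rng = random.Random(0)
storage = [[0.5, -0.5, 0.1], [0.2, 0.3, -0.4]]   # hypothetical learned weights
new_items = [([1.0, 0.0, 1.0], [1.0, 0.0])]      # hypothetical new pattern
training_set = interleave(new_items, make_pseudopatterns(storage, 3, 5, rng), rng)
print(len(training_set))
```

The early processing area would then be trained on the mixed set with ordinary backpropagation, so that each new pattern is learned alongside reminders of what the final-storage area already knows.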
Not only did the pseudo-recurrent model show reduced interference, but it also models the list-length and list-strength effects seen in humans. The list-length effect means that adding new items to a list harms the memory of earlier items. Like humans, the pseudo-recurrent network showed more gradual forgetting as the to-be-trained list was lengthened. The list-strength effect means that when the strength of recognition for one item is increased, there is no effect on the recognition of the other list items. This is an important finding, as other models often exhibit a decrease in the recognition of other list items when one list item is strengthened. Since the direct copying of weights from the early processing area to the final storage area does not seem highly biologically plausible, the transfer of information to the final storage area can instead be done by training the final storage area with pseudopatterns created by the early processing area. However, a disadvantage of the pseudo-recurrent model is that the number of hidden units in the early processing and final storage sub-networks must be identical.
Neural networks with self-refreshing memory
Following the same basic idea contributed by Robins, Ans and Rousset have also proposed a two-network artificial neural architecture with memory self-refreshing that overcomes catastrophic interference when sequential learning tasks are carried out in distributed networks trained by backpropagation. The principle is to interleave the new external patterns, at the time they are learned, with internally generated pseudopatterns, or 'pseudo-memories', that reflect the previously learned information. What mainly distinguishes this model from those that use classical pseudorehearsal in feedforward multilayer networks is the reverberating process used for generating pseudopatterns. After a number of activity re-injections from a single random seed, this process tends to converge on nonlinear network attractors, which makes it more suitable for optimally capturing the deep structure of previously learned knowledge than a single feedforward pass of random activation. Ans and Rousset have shown that the learning mechanism they proposed, which avoids catastrophic forgetting, provides a more appropriate way to deal with knowledge transfer, as measured by learning speed, ability to generalize, and vulnerability to network damage. Musca, Rousset and Ans have also shown that pseudopatterns originating from an artificial reverberating neural network could induce familiarity in humans with never-seen items, in the way predicted by simulations conducted with a two-network artificial neural architecture.
Furthermore, Ans has implemented a version of the self-refreshing mechanism using only one network trained by the Contrastive Hebbian Learning rule, a training rule considered more realistic than the widely used backpropagation algorithm, but fortunately equivalent to the latter.
So far, the different solutions to catastrophic interference that have been presented concern sequential learning tasks involving only non-temporally ordered lists of items. But, to be credible, the self-refreshing mechanism for 'static' learning has to encompass the human ability to learn many temporal sequences of patterns serially without catastrophic interference. This was done by Ans, Rousset, French and Musca, who have presented, in addition to simulation work, an experiment that shows a close similarity between the behaviour of humans and the behaviour of the proposed neuromimetic architecture.
Latent learning
Latent learning is a technique used by Gutstein & Stump both to mitigate catastrophic interference and to take advantage of transfer learning. Rather than manipulating the representations for new classes used by the hidden nodes, this approach tries to train optimal representations for new classes into the output nodes. It chooses output encodings that are least likely to catastrophically interfere with existing responses.
Given a net that has learned to discriminate among one set of classes using Error Correcting Output Codes (ECOC), optimal encodings for new classes are chosen by observing the net's average responses to them. Since these average responses arose while learning the original set of classes without any exposure to the new classes, they are referred to as 'Latently Learned Encodings'. This terminology borrows from the concept of latent learning, as introduced by Tolman in 1930. In effect, this technique uses transfer learning to avoid catastrophic interference, by making a net's responses to new classes as consistent as possible with existing responses to classes already learned.
Elastic weight consolidation
Kirkpatrick et al. demonstrated a method to train a single artificial neural network on multiple tasks using a technique called elastic weight consolidation (EWC). This technique supposes that some weights of a trained neural network are more important for previously learned tasks than others. During training on a subsequent task, the network's weights therefore elastically resist change, and this resistance grows with their importance. To estimate the importance of the network weights, EWC uses probabilistic mechanisms, in particular the Fisher information matrix.
Kutalev introduced a modification of the EWC method that uses the total absolute signal transmitted through the corresponding connection of the trained network as the weight importance. It has been supposed that such a modification emulates the process of myelination in the brain of vertebrates.
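The idea of weights that resist change in proportion to their importance is commonly written as a quadratic penalty added to the loss on the new task. The following is a minimal sketch under stated assumptions: only the diagonal of the Fisher information matrix is used, weights are held in flat Python lists, and the function names, the strength parameter `lam` and the example values are all hypothetical.

```python
def ewc_penalty(params, old_params, fisher_diag, lam):
    """Quadratic penalty for moving weights away from their old values.

    Each squared change is scaled by that weight's importance, here an
    assumed diagonal Fisher information estimate.
    """
    return 0.5 * lam * sum(
        f * (p - p_old) ** 2
        for p, p_old, f in zip(params, old_params, fisher_diag)
    )


def ewc_loss(new_task_loss, params, old_params, fisher_diag, lam):
    """Loss on the new task plus the consolidation penalty."""
    return new_task_loss + ewc_penalty(params, old_params, fisher_diag, lam)


old = [0.0, 2.0]      # weights after training on the first task
fisher = [2.0, 5.0]   # hypothetical importance of each weight

# Moving the less important weight by 1 costs less than moving the
# more important weight by the same amount.
print(ewc_penalty([1.0, 2.0], old, fisher, lam=1.0))
print(ewc_penalty([0.0, 3.0], old, fisher, lam=1.0))
```

A gradient-based optimizer minimizing `ewc_loss` is thus free to change unimportant weights while being pulled back toward the old values of important ones.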