Catastrophic interference


Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to completely and abruptly forget previously learned information upon learning new information. Neural networks are an important part of the network approach and connectionist approach to cognitive science. These networks use computer simulations to try to model human behaviours, such as memory and learning. Catastrophic interference is an important issue to consider when creating connectionist models of memory. It was originally brought to the attention of the scientific community by research from McCloskey and Cohen, and Ratcliff. It is a radical manifestation of the 'sensitivity-stability' dilemma or the 'stability-plasticity' dilemma. Specifically, these problems refer to the issue of being able to make an artificial neural network that is sensitive to, but not disrupted by, new information. Lookup tables and connectionist networks lie on the opposite sides of the stability plasticity spectrum. The former remains completely stable in the presence of new information but lacks the ability to generalize, i.e. infer general principles, from new inputs. On the other hand, connectionist networks like the standard backpropagation network are very sensitive to new information and can generalize on new inputs. Backpropagation models can be considered good models of human memory insofar as they mirror the human ability to generalize but these networks often exhibit less stability than human memory. Notably, these backpropagation networks are susceptible to catastrophic interference. This is considered an issue when attempting to model human memory because, unlike these networks, humans typically do not show catastrophic forgetting. Thus, the issue of catastrophic interference must be eradicated from these backpropagation models in order to enhance the plausibility as models of human memory.

History of catastrophic interference

The term catastrophic interference was originally coined by McCloskey and Cohen but was also brought to the attention of the scientific community by research from Ratcliff.

''The Sequential Learning Problem'': McCloskey and Cohen (1989)

McCloskey and Cohen noted the problem of catastrophic interference during two different experiments with backpropagation neural network modelling.
In their first experiment they trained a standard backpropagation neural network on a single training set consisting of 17 single-digit ones problems until the network could represent and respond properly to all of them. The error between the actual output and the desired output steadily declined across training sessions, which reflected that the network learned to represent the target outputs better across trials. Next they trained the network on a single training set consisting of 17 single-digit twos problems until the network could represent, respond properly to all of them. They noted that their procedure was similar to how a child would learn their addition facts. Following each learning trial on the twos facts, the network was tested for its knowledge on both the ones and twos addition facts. Like the ones facts, the twos facts were readily learned by the network. However, McCloskey and Cohen noted the network was no longer able to properly answer the ones addition problems even after one learning trial of the twos addition problems. The output pattern produced in response to the ones facts often resembled an output pattern for an incorrect number more closely than the output pattern for a correct number. This is considered to be a drastic amount of error. Furthermore, the problems 2+1 and 2+1, which were included in both training sets, even showed dramatic disruption during the first learning trials of the twos facts.
In their second connectionist model, McCloskey and Cohen attempted to replicate the study on retroactive interference in humans by Barnes and Underwood. They trained the model on A-B and A-C lists and used a context pattern in the input vector, to differentiate between the lists. Specifically the network was trained to responds with the right B response when shown the A stimulus and A-B context pattern and to respond with the correct C response when shown the A stimulus and the A-C context pattern. When the model was trained concurrently on the A-B and A-C items then the network readily learned all of the associations correctly. In sequential training the A-B list was trained first, followed by the A-C list. After each presentation of the A-C list, performance was measured for both the A-B and A-C lists. They found that the amount of training on the A-C list in Barnes and Underwood study that lead to 50% correct responses, lead to nearly 0% correct responses by the backpropagation network. Furthermore, they found that the network tended to show responses that looked like the C response pattern when the network was prompted to give the B response pattern. This indicated that the A-C list apparently had overwritten the A-B list. This could be likened to learning the word dog, followed by learning the word stool and then finding that you cannot recognize the word cat well but instead think of the word stool when presented with the word dog.
McCloskey and Cohen tried to reduce interference through a number of manipulations including changing the number of hidden units, changing the value of the learning rate parameter, overtraining on the A-B list, freezing certain connection weights, changing target values 0 and 1 instead 0.1 and 0.9. However none of these manipulations satisfactorily reduced the catastrophic interference exhibited by the networks.
Overall, McCloskey and Cohen concluded that:
Ratcliff used multiple sets of backpropagation models applied to standard recognition memory procedures, in which the items were sequentially learned. After inspecting the recognition performance models he found two major problems:
Even one learning trial with new information resulted in a significant loss of the old information, paralleling the findings of McCloskey and Cohen. Ratcliff also found that the resulting outputs were often a blend of the previous input and the new input. In larger networks, items learned in groups were more resistant to forgetting than were items learned singly. However, the forgetting for items learned in groups was still large. Adding new hidden units to the network did not reduce interference.
This finding contradicts with studies on human memory, which indicated that discrimination increases with learning. Ratcliff attempted to alleviate this problem by adding 'response nodes' that would selectively respond to old and new inputs. However, this method did not work as these response nodes would become active for all inputs. A model which used a context pattern also failed to increase discrimination between new and old items.

Proposed solutions

Many researchers have suggested that the main cause of catastrophic interference is overlap in the representations at the hidden layer of distributed neural networks. In a distributed representation any given input will tend to create changes in the weights to many of the nodes. Catastrophic forgetting occurs because when many of the weights where "knowledge is stored" are changed, it is impossible for prior knowledge to be kept intact. During sequential learning, the inputs become mixed with the new input being superimposed over top of the old input. Another way to conceptualize this is through visualizing learning as movement through a weight space. This weight space can be likened to a spatial representation of all of the possible combinations of weights that the network can possess. When a network first learns to represent a set of patterns, it has found a point in weight space which allows it to recognize all of the patterns that it has seen. However, when the network learns a new set of patterns sequentially it will move to a place in the weight space that allows it to only recognize the new pattern. To recognize both sets of patterns, the network must find a place in weight space that can represent both the new and the old output. One way to do this is by connecting a hidden unit to only a subset of the input units. This reduces the likelihood that two different inputs will be encoded by the same hidden units and weights, and so will decrease the chance of interference. Indeed, a number of the proposed solutions to catastrophic interference involve reducing the amount of overlap that occurs when storing information in these weights.
Many of the early techniques in reducing representational overlap involved making either the input vectors or the hidden unit activation patterns orthogonal to one another. Lewandowsky and Li noted that the interference between sequentially learned patterns is minimized if the input vectors are orthogonal to each other. Input vectors are said to be orthogonal to each other if the pairwise product of their elements across the two vectors sum to zero. For example, the patterns and are said to be orthogonal because = 0. One of the techniques which can create orthogonal representations at the hidden layers involves bipolar feature coding. Orthogonal patterns tend to produce less interference with each other. However, not all learning problems can be represented using these types of vectors and some studies report that the degree of interference is still problematic with orthogonal vectors. Simple techniques such as varying the learning rate parameters in the backpropagation equation were not successful in reducing interference. Varying the number of hidden nodes has also been used to try and reduce interference. However, the findings have been mixed, with some studies finding that more hidden units decrease interference and other studies finding it does not.
Below are a number of techniques which have empirical support in successfully reducing catastrophic interference in backpropagation neural networks:

Node sharpening technique

French proposed that catastrophic interference arises in feedforward backpropagation networks due to the interaction of node activations, or activation overlap, that occur in distributed representations at the hidden layer. Specifically, he defined this activation overlap as the average shared activation over all units in the hidden layer, calculated by summing the lowest activation of the nodes at the hidden layer and averaging this sum. For example, if the activations at the hidden layer from one input are and the activations from the next input are the activation overlap would be / 4 = 0.275. When using binary representation of input vectors, activation values will be 0 through 1, where 0 indicates no activation overlap and 1 indicates full activation overlap. French noted that neural networks which employ very localized representations do not show catastrophic interference because of the lack of overlap at the hidden layer. That is to say, each input pattern will create a hidden layer representation that involves the activation of only one node, so differed inputs will have an activation overlap of 0. Thus, he suggested that reducing the value of activation overlap at the hidden layer would reduce catastrophic interference in distributed networks. Specifically he proposed that this could be done through changing the distributed representations at the hidden layer to 'semi-distributed' representations. A 'semi-distributed' representation has fewer hidden nodes that are active, and/or a lower activation value for these nodes, for each representation, which will make the representations of the different inputs overlap less at the hidden layer. French recommended that this could be done through 'activation sharpening', a technique which slightly increases the activation of a certain number of the most active nodes in the hidden layer, slightly reduces the activation of all the other units and then changes the input-to-hidden layer weights to reflect these activation changes. Overall the guidelines for the process of 'activation sharpening' are as follows:
  1. Perform a forward activation pass by feeding an input from the input layer to the hidden layer and record the activations at the hidden layer
  2. "Sharpen" the activation of x number of most active nodes by a sharpening factor α:
  3. : Anew = Aold + α For nodes to be sharpened, i.e. more activated
  4. : Anew = AoldαAold For all other nodes
  5. : French suggested the number of nodes to be sharpened should be log n nodes, where n is the number of hidden layer nodes
  6. Use the difference between the old activation and the sharpened activation as an error, backpropagate this error to the input layer, and modify the weights of input-to-output appropriately
  7. Do a full forward pass with the input through to the output layer
  8. Backpropagate as usual from the output to the input layer
  9. Repeat
In his tests of an 8-8-8 node backpropagation network where one node was sharpened, French found that this sharpening paradigm did result in one node being much more active than the other seven. Moreover, when sharpened, this network took one fourth the time to relearn the initial inputs than a standard backpropagation without node sharpening. Relearning is a measure of memory savings and thus extent of forgetting, where more time to relearn suggests more forgetting. A two-node sharpened network performed even slightly better, however if more than two nodes were sharpened forgetting increased again.
According to French, the sharpened activations interfere less with weights in the network than unsharpened weights and this is due specifically to the way that backpropagation algorithm calculates weight changes. Activations near 0 will change the weights of links less than activations near 1. Consequently, when there are many nodes with low activations, the weights to and from these nodes will be modified much less than the weights on very active nodes. As a result, when a new input is fed into the network, sharpening will reduce activation overlap by limiting the number of highly active hidden units and will reduce the likelihood of representational overlap by reducing the number of weights that are to be changed. Thus, node sharpening will decrease the amount of disruption in the old weights, which store prior input patterns, thereby reducing the likelihood of catastrophic forgetting.

Novelty rule

Kortge proposed a learning rule for training neural networks, called the 'novelty rule', to help alleviate catastrophic interference. As its name suggests, this rule helps the neural network to learn only the components of a new input that differ from an old input. Consequently, the novelty rule changes only the weights that were not previously dedicated to storing information, thereby reducing the overlap in representations at the hidden units. Thus, even when inputs are somewhat similar to another, dissimilar representations can be made at the hidden layer. In order to apply the novelty rule, during learning the input pattern is replaced by a novelty vector that represents the components that differ. The novelty vector for the first layer is determined by taking the target pattern away from the current output of the network. For the second layer the novelty vector is simply the activation of the hidden units that resulted from using the novelty vector as an input through the first layer. Weight changes in the network are computed by using a modified delta rule with the novelty vector replacing the activation value :
When the novelty rule is used in a standard backpropagation network there is no, or lessened, forgetting of old items when new items are presented sequentially. However, this rule can only apply to auto-encoder or auto-associative networks, in which the target response for the output layer is identical to the input pattern. This is because the novelty vector would be meaningless if the desired output was not identical to the input as it would be impossible to calculate how much a new input differed from the old input.

Pre-training networks

McRae and Hetherington argued that humans, unlike most neural networks, do not take on new learning tasks with a random set of weights. Rather, people tend to bring a wealth of prior knowledge to a task and this helps to avoid the problem of interference. They proposed that when a network is pre-trained on a random sample of data prior to starting a sequential learning task that this prior knowledge will naturally constrain how the new information can be incorporated. This would occur because a random sample of data from a domain which has a high degree of internal structure, such as the English language, training would capture the regularities, or recurring patterns, found within that domain. Since the domain is based on regularities, a newly learned item will tend to be similar to the previously learned information, which will allow the network to incorporate new data with little interference with existing data. Specifically, an input vector which follows the same pattern of regularities as the previously trained data should not cause a drastically different pattern of activation at the hidden layer or drastically alter weights.
To test their hypothesis, McRae and Hetherington compared the performance of a naïve and pre-trained auto-encoder backpropagation network on three simulations of verbal learning tasks. The pre-trained network was trained using letter based representations of English monosyllabic words or English word pairs. All three tasks involved the learning of some consonant-vowel-consonant strings or CVC pairs, followed by training on a second list of these items. Afterwards, the distributions of the hidden node activations were compared between the naïve and pre-trained network. In all three tasks, the representations of a CVC in the naïve network tended to be spread fairly evenly across all hidden nodes, whereas most hidden nodes were inactive in the pre-trained network. Furthermore, in the pre-trained network the representational overlap between CVCs was reduced compared to the naïve network. The pre-trained network also retained some similarity information as the representational overlap between similar CVCs, like "JEP" and "ZEP", was greater than for dissimilar CVCs, such as "JEP" and "YUG". This suggests that the pre-trained network had a better ability to generalize, i.e. notice the patterns, than the naïve network. Most importantly, this reduction in hidden unit activation and representational overlap resulted in significantly less forgetting in the pre-trained network than the naïve network, essentially eliminating catastrophic interference. Essentially, the pre-training acted to create internal orthogonalization of the activations at the hidden layer, which reduced interference. Thus, pre-training is a simple way to reduce catastrophic forgetting in standard backpropagation networks.

Pseudo-recurrent networks

French proposed the idea of a pseudo-recurrent backpropagation network in order to help reduce catastrophic interference. In this model the network is separated into two functionally distinct but interacting sub-networks. This model is biologically inspired and is based on research from McClelland, McNaughton, and O'Reilly. In this research McClelland et al., suggested that the hippocampus and neocortex act as separable but complementary memory systems. Specifically, the hippocampus short term memory storage and acts gradually over time to transfer memories into the neocortex for long term memory storage. They suggest that the information that is stored can be "brought back" to the hippocampus during active rehearsal, reminiscence, and sleep and renewed activation is what acts to transfer the information to the neocortex over time. In the pseudo-recurrent network, one of the sub-networks acts as an early processing area, akin to the hippocampus, and functions to learn new input patters. The other sub-network acts as a final-storage area, akin to the neocortex. However, unlike in McClelland et al. model, the final-storage area sends internally generated representation back to the early processing area. This creates a recurrent network. French proposed that this interleaving of old representations with new representations is the only way to reduce radical forgetting. Since the brain would most likely not have access to the original input patterns, the patterns that would be fed back to the neocortex would be internally generated representations called pseudopatterns. These pseudopatterns are approximations of previous inputs and they can be interleaved with the learning of new inputs. The use of these pseudopatterns could be biologically plausible as parallels between the consolidation of learning that occurs during sleep and the use of interleaved pseudopatterns. Specifically, they both serve to integrate new information with old information without disruption of the old information. When given an input is fed into the pseudo-recurrent network would act as follows:
When tested on sequential learning of real world patterns, categorization of edible and poisonous mushrooms, the pseudo-recurrent network was shown less interference than a standard backpropagation network. This improvement was with both memory savings and exact recognition of old patterns. When the activation patterns of the pseudo-recurrent network were investigated, it was shown that this network automatically formed semi-distributed representations. Since these types of representations involve fewer nodes being activated for each pattern, it is likely what helped to reduce interference.
Not only did the pseudo-recurrent model show reduced interference but also it models list-length and list-strength effects seen in humans. The list-length effect means that adding new items to a list harms the memory of earlier items. Like humans, the pseudo recurrent network showed a more gradual forgetting when to be trained list is lengthened. The list-strength effect means that when the strength of recognition for one item is increased, there is no effect on the recognition of the other list items. This is an important finding as other models often exhibit a decrease in the recognition of other list items when one list item is strengthened. Since the direct copying of weights from the early processing area to the final storage area does not seem highly biologically plausible, the transfer of information to the final storage area can be done through training the final storage area with pseudopatterns created by the early processing area. However, a disadvantage of the pseudo-recurrent model is that the number of hidden units in the early processing and final storage sub-networks must be identical.

Neural networks with self-refreshing memory

Following the same basic idea contributed by Robins, Ans and Rousset have also proposed a two-network artificial neural architecture with memory self-refreshing that overcomes catastrophic interference when sequential learning tasks are carried out in distributed networks trained by backpropagation. The principle is to interleave, at the time when new external patterns are learned, those to-be-learned new external patterns with internally generated pseudopatterns, or 'pseudo-memories', that reflect the previously learned information. What mainly distinguishes this model from those that use classical pseudorehearsal in feedforward multilayer networks is a reverberating process that is used for generating pseudopatterns. This process which, after a number of activity re-injections from a single random seed, tends to go up to nonlinear network attractors, is more suitable for optimally capturing the deep structure of previously learned knowledge than a single feedforward pass of random activation. Ans and Rousset have shown that the learning mechanism they proposed avoiding catastrophic forgetting, provides a more appropriate way to deal with knowledge transfer as measured by learning speed, ability to generalize and vulnerability to network damages. Musca, Rousset and Ans have also shown that pseudopatterns originating from an artificial reverberating neural network could induce familiarity in humans with never seen items in the way predicted by simulations conducted with a two-network artificial neural architecture. Furthermore, Ans has implemented a version of the self-refreshing mechanism using only one network trained by the Contrastive Hebbian Learning rule, a training rule considered as more realistic than the largely used backpropagation algorithm, but fortunately equivalent to the latter.


So far, the different solutions to catastrophic interference that have been presented concern tasks of sequential learning involving only non-temporally ordered lists of items. But, to be credible, the self-refreshing mechanism for 'static' learning has to encompass our human ability to learn serially many temporal sequences of patterns without catastrophic interference. This was done by Ans, Rousset, French and Musca who have presented, in addition to simulation work, an experiment that evidences a close similarity between the behaviour of humans and the behaviour of the proposed neuromimetic architecture.

Latent learning

Latent Learning is a technique used by Gutstein & Stump both to mitigate catastrophic interference and to take advantage of transfer learning. Rather than manipulating the representations for new classes used by the hidden nodes, this approach tries to train optimal representations for new classes into the output nodes. It chooses output encodings that are least likely to catastrophically interfere with existing responses.
Given a net that has learned to discriminate among one set of classes using Error Correcting Output Codes , optimal encodings for new classes are chosen by observing the net's average responses to them. Since these average responses arose while learning the original set of classes without any exposure to the new classes, they are referred to as 'Latently Learned Encodings'. This terminology borrows from the concept of Latent Learning, as introduced by Tolman in 1930. In effect, this technique uses transfer learning to avoid catastrophic interference, by making a net's responses to new classes as consistent as possible with existing responses to classes already learned.

Elastic weight consolidation

Kirkpatrick et al. demonstrated a method to train a single artificial neural network on multiple tasks using a technique called elastic weight consolidation. This technique suppose that some weights of trained neural network are more important for learned task than the others. So, during further training neural network on next task the weights of network must elasticly resist to changes the more the greater their importance. To determine the importance of network weights elastic weight consolidation method uses probabilistic mechanisms, in particular the Fisher information matrix.
Kutalev introduced a modification of the EWC method that uses the total absolute signal transmitted through the corresponding connection of the trained network as the weight importance. It has been supposed that such a modification emulates the process of myelination in the brain of vertebrates.

Anapoiesis

proposes that biological systems solve the problem of catastrophic interference by storing long-term memories only in a general form, not applicable to a given situation but instead loosely applicable to a class of different situations. In order to adjust the loosely applicable knowledge to the given current situation, the process of anapoiesis is applied. Anapoiesis stands for "reconstruction of knowledge"—transforming knowledge from a general form to a specific one. Practopoietic theory is founded in the theorems of cybernetics and is concerned with the question of how cybernetic systems obtain their capabilities to control and act.