Event correlation


Event correlation is a technique for making sense of a large number of events and pinpointing the few events that are really important in that mass of information. This is accomplished by looking for and analyzing relationships between events.

History

Event correlation has been used in various fields for many years.
Integrated management is traditionally subdivided into various fields.
Event correlation takes place in different components depending on the field of study.
In this article, we focus on event correlation in integrated management and provide links to other fields.

Event correlation in integrated management

The goal of integrated management is to integrate the management of networks, systems and IT services in a coherent manner. The scope of this discipline notably includes network management, systems management and service-level management.

Events and event correlator

Event correlation usually takes place inside one or several management platforms. It is implemented by a piece of software known as the event correlator. This component is automatically fed with events originating from managed elements, monitoring tools, the trouble ticket system, etc. Each event captures something noteworthy that happened in the domain of interest to the event correlator, which varies with the type of analysis the correlator is attempting to perform.
The event correlator plays a key role in integrated management, because it is the only place where events from many disparate sources come together and can be compared across sources. For instance, this is where the failure of a service can be ascribed to a specific failure in the underlying IT infrastructure, or where the root cause of a potential security attack can be identified.
Most event correlators can receive events from trouble ticket systems. However, only some of them are able to notify trouble ticket systems when a problem is solved, which partly explains why service desks have difficulty staying up to date. In theory, integrating management in an organization requires two-way communication between the event correlator and the trouble ticket system.
An event may convey an alarm or report an incident, but not necessarily. It may also report that a situation has returned to normal, or simply convey information that its source deems relevant. The severity of an event is an indication, given by the event source to the event destination, of the priority that the event should receive during processing.
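
As an illustration only, an event of the kind described above could be modeled as a small record carrying its source and severity. The field names, the severity scale and the `by_priority` helper are assumptions made for this sketch; they are not taken from any particular management platform.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical scale; real platforms define their own levels.
    CLEAR = 0      # a situation has returned to normal
    INFO = 1       # information the source deems relevant
    WARNING = 2
    CRITICAL = 3   # alarm or incident


@dataclass(frozen=True)
class Event:
    source: str        # managed element, monitoring tool, ticket system
    message: str
    severity: Severity
    timestamp: float   # seconds since the epoch


def by_priority(events):
    """Order events so that the most severe, then the oldest, come first."""
    return sorted(events, key=lambda e: (-e.severity, e.timestamp))


events = [
    Event("router-1", "link up", Severity.CLEAR, 100.0),
    Event("db-1", "SLA violated", Severity.CRITICAL, 105.0),
    Event("server-7", "disk 80% full", Severity.WARNING, 101.0),
]
ordered = by_priority(events)
```

Here the severity assigned by the source is the only input to the ordering; a real correlator may well weigh it against other criteria.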

Step-by-step decomposition

Event correlation can be decomposed into four steps: event filtering, event aggregation, event masking and root cause analysis. A fifth step, action triggering, is often associated with event correlation and is therefore briefly described as well.

Event filtering

Event filtering consists of discarding events that the event correlator deems irrelevant. For instance, a number of low-end devices are difficult to configure and occasionally send events of no interest to the management platform. Another example is the filtering of informational or debugging events by an event correlator that is only interested in availability and faults.
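
The two examples above can be sketched as a simple predicate applied to each incoming event. The level names, the list of noisy sources and the function names are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str
    level: str      # hypothetical levels: "debug", "info", "warning", "error"
    message: str


# Assumed policy: the correlator is only interested in availability and faults.
IGNORED_LEVELS = {"debug", "info"}
# Assumed list of devices known to send events of no interest.
NOISY_SOURCES = {"cheap-switch-3"}


def is_relevant(event):
    """Return True if the event should be kept for further correlation."""
    if event.level in IGNORED_LEVELS:
        return False
    if event.source in NOISY_SOURCES:
        return False
    return True


def filter_events(events):
    """Discard every event that the policy above deems irrelevant."""
    return [e for e in events if is_relevant(e)]


incoming = [
    Event("server-7", "debug", "cache refreshed"),
    Event("cheap-switch-3", "warning", "unknown trap"),
    Event("router-1", "error", "link down"),
]
kept = filter_events(incoming)
```

Only the `router-1` event survives in this example; the other two are dropped because of their level and their source respectively.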

Event aggregation

Event aggregation is a technique where multiple events that are very similar are combined into an aggregate that represents the underlying event data. Its main objective is to summarize a collection of input events into a smaller collection that can be processed using various analytics methods. For example, the aggregate may provide statistical summaries of the underlying events and the resources that are affected by those events. Another example is temporal aggregation, in which the same problem is reported repeatedly by the event source until it is finally solved.
Event de-duplication is a special type of event aggregation that consists of merging exact duplicates of the same event. Such duplicates may be caused by network instability.
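
A minimal sketch of both ideas follows, assuming that two events are "very similar" when they share the same source and message. That similarity key, like the `Aggregate` record and the function names, is an assumption of this example rather than a standard definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str
    message: str
    timestamp: float


@dataclass
class Aggregate:
    source: str
    message: str
    first_seen: float
    last_seen: float
    count: int


def deduplicate(events):
    """Drop exact duplicates, keeping the first occurrence of each event."""
    # Frozen dataclasses are hashable, and dicts preserve insertion order.
    return list(dict.fromkeys(events))


def aggregate(events):
    """Combine events with the same (source, message) into one aggregate."""
    groups = {}
    for e in sorted(events, key=lambda ev: ev.timestamp):
        key = (e.source, e.message)
        agg = groups.get(key)
        if agg is None:
            groups[key] = Aggregate(e.source, e.message,
                                    e.timestamp, e.timestamp, 1)
        else:
            agg.last_seen = e.timestamp
            agg.count += 1
    return list(groups.values())


events = [
    Event("server-7", "disk full", 10.0),
    Event("server-7", "disk full", 10.0),   # exact duplicate
    Event("server-7", "disk full", 70.0),   # same problem reported again
    Event("router-1", "link down", 20.0),
]
unique = deduplicate(events)
summary = aggregate(unique)
```

After de-duplication three events remain, and temporal aggregation reduces them to two aggregates, one of which records that the disk problem was reported twice between timestamps 10.0 and 70.0.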

Event masking

Event masking consists of ignoring events pertaining to systems that are downstream of a failed system. For example, servers that are downstream of a crashed router will fail availability polling; the resulting events can be masked because they are a consequence of the router failure.
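
The router example can be sketched with a small topology table in which each element points to the element directly upstream of it. The element names, the `UPSTREAM` table and the functions are hypothetical, and a single upstream path per element is assumed for simplicity.

```python
# Hypothetical topology: each element maps to the element directly upstream.
UPSTREAM = {
    "server-a": "router-1",
    "server-b": "router-1",
    "router-1": "core-switch",
    "server-c": "core-switch",
}


def upstream_chain(element, upstream=UPSTREAM):
    """Yield every element upstream of `element`, nearest first."""
    seen = set()  # guards against accidental cycles in the table
    current = upstream.get(element)
    while current is not None and current not in seen:
        seen.add(current)
        yield current
        current = upstream.get(current)


def mask(failed_elements, upstream=UPSTREAM):
    """Keep only the failed elements that have no failed element upstream."""
    failed = set(failed_elements)
    return [
        e for e in failed_elements
        if not any(u in failed for u in upstream_chain(e, upstream))
    ]


polling_failures = ["server-a", "router-1", "server-b", "server-c"]
unmasked = mask(polling_failures)
```

In this example the failures of `server-a` and `server-b` are masked by the failure of `router-1`, while `server-c` is kept because nothing upstream of it is known to have failed.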

Root cause analysis

Root cause analysis is the last and most complex step of event correlation. It consists of analyzing dependencies between events, based for instance on a model of the environment and dependency graphs, to detect whether some events can be explained by others. For example, if database D runs on server S and this server becomes persistently overloaded, the event “the SLA for database D is no longer fulfilled” can be explained by the event “server S is persistently overloaded”.
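
The database example can be sketched with a small dependency model and a set of causal rules. Everything below (the `DEPENDS_ON` and `EXPLAINED_BY` tables, the symptom names, the extra component "web app W" and the function names) is hypothetical; real correlators typically rely on far richer models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    component: str
    symptom: str


# Hypothetical model of the environment: component -> components it runs on.
DEPENDS_ON = {
    "database D": ["server S"],
    "web app W": ["database D"],
}

# Hypothetical causal rules: a symptom on a component can be explained by
# any of the listed symptoms on a component it depends on.
EXPLAINED_BY = {
    "sla_violated": ("overloaded", "sla_violated"),
}


def explain(events):
    """Map each event to the observed events that directly explain it."""
    present = set(events)
    explanations = {}
    for e in events:
        causes = []
        for dep in DEPENDS_ON.get(e.component, []):
            for symptom in EXPLAINED_BY.get(e.symptom, ()):
                candidate = Event(dep, symptom)
                if candidate in present:
                    causes.append(candidate)
        explanations[e] = causes
    return explanations


def root_causes(events):
    """Return the events that no other observed event explains."""
    return [e for e, causes in explain(events).items() if not causes]


events = [
    Event("web app W", "sla_violated"),
    Event("database D", "sla_violated"),
    Event("server S", "overloaded"),
]
roots = root_causes(events)
```

With these inputs, the SLA violation of the web application is explained by that of the database, which is in turn explained by the overload of server S, leaving the overload as the only unexplained event.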

Action triggering

At this stage, the event correlator is left with at most a handful of events that need to be acted upon. Strictly speaking, event correlation ends here. However, by a loose use of the term, the event correlators found on the market sometimes also include problem-solving capabilities. For instance, they may automatically trigger corrective actions or further investigations.

Event correlation in other fields

Event correlation in ITIL

The scope of ITIL is larger than that of integrated management. However, event correlation in ITIL is quite similar to event correlation in integrated management.
In the ITIL version 2 framework, event correlation spans three processes: Incident Management, Problem Management and Service Level Management.
In the ITIL version 3 framework, event correlation takes place in the Event Management process. The event correlator is called a correlation engine.

Event correlation in publish-subscribe systems

Event correlation in complex event processing

Event correlation in business activity monitoring

Event correlation in industrial process control