Check MK


Checkmk is software written in Python and C++ for monitoring IT infrastructure. It is used to monitor servers, applications, networks, cloud infrastructures, containers, storage, databases and environment sensors.
Checkmk is available in three editions: an open source edition, a commercial enterprise edition and a commercial edition for managed service providers. The editions are available for a range of platforms, in particular for various versions of Debian, Ubuntu, SLES and Red Hat / CentOS, and also as a Docker image. In addition, physical appliances of various sizes and a virtual appliance are offered; these simplify the administration of the underlying operating system through a graphical user interface and enable high-availability setups.
The agents used by Checkmk to collect data are available for 11 platforms, including Windows.

History

Checkmk originated in 2008 as an agent-substituting shell script for inetd, and was published in April 2009 under the GPL. It was initially based on Nagios, which it extended with a number of new components. The open source edition continues to be based on the Nagios core, and bundles it with additional open source components into a complete system.
Over the years, Checkmk's commercial editions have evolved into a self-contained monitoring system that has replaced all essential Nagios components with its own, including its own monitoring core. Most developments for the commercial editions, in particular all plug-ins, are also available in the Checkmk Raw Edition.
While Checkmk was originally designed for monitoring large and heterogeneous on-premises environments, since version 1.5 it also supports the monitoring of AWS, Azure, Docker and Kubernetes services.
Checkmk is developed by tribe29 GmbH in Munich, Germany, which operated under the name Mathias Kettner GmbH until 16 April 2019. Along with the company name, the product name was changed from "Check_MK" to "Checkmk".
tribe29 GmbH follows an open core business model. The open source edition is available under several open source licenses, mostly GPLv2, while large parts of the commercial editions are licensed under the proprietary "Checkmk Enterprise License".

The Product

Checkmk combines three types of IT monitoring:
To cover a broad range of systems, Checkmk currently ships with more than 1,700 plug-ins in each edition, all of which are licensed under GPLv2. These plug-ins are maintained as part of the product and are regularly supplemented with additional plug-ins or extensions. Existing legacy Nagios plug-ins can be connected as well.
To simplify setup and operation, all components of Checkmk are delivered fully integrated. A rule-based 1:n configuration and a high degree of automation significantly accelerate workflows. This includes:
In addition, playbooks are available for configuration and deployment tools such as Ansible or Salt.
Checkmk is often used in very large distributed environments in which a high number of sites and/or well over 100,000 devices are monitored. This is possible, among other things, because Checkmk's microcore consumes far fewer CPU resources than, for example, Nagios, and therefore offers significantly higher performance on the same hardware. Furthermore, non-persistent data is held in RAM, which significantly improves access times.

Components

The Monitoring Core ("Checkmk Microcore - CMC")

The commercial editions of Checkmk use their own monitoring core, written in C++, which performs considerably better than the Nagios core. In addition, as of version 1.6 it allows objects with a short lifespan, such as containers, to be recorded dynamically. This is possible because, in contrast to the Nagios core, the Checkmk Microcore does not need to be restarted when the configuration changes. The open source version, the "Checkmk Raw Edition", currently still uses the Nagios core.

Configuration & Check Engine

Checkmk offers self-contained service discovery and configuration generation. Checkmk uses its own method when carrying out the checks: within each check interval, each host is contacted only once, and the results are passed to the monitoring core as passive checks. This significantly improves performance on the monitoring server as well as on the hosts being monitored.
Checkmk uses different methods to access the data in the target systems. These include agents installed on the target system, "special agents" that run on the monitoring server and communicate with the API of the target system, SNMP for monitoring, for example, network devices and printers, and HTTP/TCP protocols for communicating with web and internet services. By default, Checkmk follows the "pull principle": the monitoring system explicitly queries the data, so that a system that suddenly fails and no longer responds to a "pull" is identified quickly. As an alternative, a "push" can be configured, in which the system transfers its data directly to Checkmk or to an intermediate host.
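The agent-based pull described above can be illustrated with a minimal sketch. It rests on assumptions that go beyond the text: that the agent is reachable over TCP (port 6556 is the commonly used default), that it sends plain text and then closes the connection, and that the output is divided into sections introduced by `<<<name>>>` header lines. The helper names `fetch_agent_output` and `split_sections` are hypothetical and not part of Checkmk.

```python
import socket
from collections import defaultdict


def fetch_agent_output(host: str, port: int = 6556, timeout: float = 10.0) -> str:
    """Pull the complete agent output from a host over one connection.

    Assumption: the agent writes its data and then closes the connection.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:  # peer closed the connection: output is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def split_sections(agent_output: str) -> dict[str, list[str]]:
    """Split agent output into lists of lines keyed by section name."""
    sections = defaultdict(list)
    current = None
    for line in agent_output.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            current = line[3:-3]
            sections[current]  # register the section even if it stays empty
        elif current is not None:
            sections[current].append(line)
    return dict(sections)
```

A single fetch yields many sections, each of which could feed a separate service check; this is one way to picture how a host needs to be contacted only once per check interval.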

Data Interface ("Livestatus")

Livestatus is the main data interface in Checkmk. It provides live access to all data from the monitored hosts and services. The data is fetched directly from RAM, which avoids slow hard disk access and gives fast access to the information without placing much load on the system. Access uses a simple text-based protocol and is possible from any programming language without a special library.
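As an illustration of access without a special library, the following sketch sends a query over a Unix socket using only the Python standard library. The query syntax (`GET`, `Columns:`, `Filter:`, `OutputFormat:` header lines followed by a blank line) follows the Livestatus documentation as best recalled here; the socket path passed by the caller depends on the installation and is an assumption, and the function names are hypothetical.

```python
import json
import socket


def build_query(table, columns, filters=()):
    """Assemble a Livestatus query as text; a blank line ends the request."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {condition}" for condition in filters]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


def query_livestatus(socket_path, query):
    """Send one query to a Livestatus Unix socket and decode the JSON reply.

    Assumption: socket_path points at the site's Livestatus socket.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))
```

For example, `build_query("hosts", ["name", "state"], ["state = 1"])` produces a request for the names and states of all hosts whose state equals 1; the reply would be a JSON list with one row per host.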

Web-GUI ("Multisite")

Multisite is Checkmk’s web GUI. In addition to fast page loading, it offers user-definable views and dashboards, distributed monitoring by integrating multiple monitoring instances via Livestatus, integration of NagVis, an integrated LDAP connection, access to status data via web services, and more. Dashboards and views can be tailored to individual users or groups of users, for example vSphere-specific views for VMware admins. The web GUI is currently available in both German and English.

Web Administration ("WATO")

The Web Administration Tool (WATO) makes a system based on Checkmk completely administrable via the browser. This includes managing users, roles, groups, time periods, and more. Permissions can be granted in a granular way using a role concept, and existing role-based access controls can be used for this. WATO is rule-based, so that the configuration remains manageable even in complex environments and the effort required stays low. Automatic discovery and configuration, as well as automatic agent updates, further accelerate the configuration process. An HTTP API can also be used to integrate CMDBs for accelerated configuration.
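The idea of rule-based configuration, where one rule applies to many hosts, can be sketched as follows. This is an illustrative model only: representing conditions as sets of host tags, letting the first matching rule win, and the name `effective_value` are all assumptions made for the example, not a description of how WATO evaluates its rules.

```python
def effective_value(host_tags, rules, default=None):
    """Return the value of the first rule whose required tags the host carries.

    rules is an ordered list of (required_tags, value) pairs.
    """
    tags = set(host_tags)
    for required_tags, value in rules:
        if set(required_tags) <= tags:  # every required tag is present
            return value
    return default


# Hypothetical threshold rules: a specific rule first, a general one after it.
cpu_rules = [
    ({"prod", "linux"}, {"warn": 80, "crit": 90}),
    ({"linux"}, {"warn": 90, "crit": 95}),
]
```

With these rules, any number of hosts tagged `prod` and `linux` receive the stricter thresholds without per-host settings, while all other Linux hosts fall through to the general rule.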

Alert System

Several notification channels can be set up and configured with different rules for each user. For example, emails can be triggered at any time of the day, while notifications via SMS are sent only for important issues during on-call hours. Notifications can be addressed to all users or to specific teams, e.g. only the storage admins are notified about a failed hard drive. Duplicate notifications are grouped together so that no user is notified twice through a particular channel. Furthermore, users can configure their own notifications. In distributed environments, alerts can be managed centrally. For detected issues, actions can be triggered automatically via scripts. Checkmk includes integrations with email and SMS gateways as well as with communication and IT service management solutions such as Slack, Jira, PagerDuty, OpsGenie, VictorOps and ServiceNow.
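The email and SMS example above can be modelled in a few lines. This is a simplified sketch under stated assumptions: severities are plain integers, each rule carries a single daily time window, and the names `Rule` and `channels_for` are hypothetical rather than taken from Checkmk.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Rule:
    channel: str                   # e.g. "email" or "sms"
    min_severity: int              # 1 = warning, 2 = critical (assumed scale)
    start: time = time(0, 0)       # window start, inclusive
    end: time = time(23, 59, 59)   # window end, inclusive


def channels_for(rules, severity, now):
    """Return the channels to notify, listing each channel at most once."""
    selected = []
    for rule in rules:
        in_window = rule.start <= now <= rule.end
        if severity >= rule.min_severity and in_window and rule.channel not in selected:
            selected.append(rule.channel)
    return selected


user_rules = [
    Rule("email", min_severity=1),                           # any time of day
    Rule("sms", min_severity=2, start=time(18, 0), end=time(23, 0)),
    Rule("email", min_severity=2),                           # overlaps the first rule
]
```

A critical problem at 20:00 would select both email and SMS, and the overlapping email rule does not produce a second email; the same problem at 09:00 would select email only.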

Business Intelligence

The BI module is integrated into the graphical user interface. In a rule-based manner, it aggregates the states of many individual hosts and services into the overall status of business processes and of the complex applications and IT infrastructure elements they depend on. It can also be used to represent applications made up of microservices, which in turn consist of Kubernetes pods and deployments. In addition, worst-case scenarios can be simulated in real time and historical data can be analyzed to understand the causes of performance degradation.
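A minimal sketch of such an aggregation is shown below. It assumes a single "worst state wins" function, the conventional numeric states 0 to 3, and a tree given as nested tuples; the service names and the functions `aggregate` and `simulate` are hypothetical, and a real BI module may offer other aggregation functions.

```python
OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3

# Assumed severity ranking for "worst": OK < WARN < UNKNOWN < CRIT.
_BADNESS = {OK: 0, WARN: 1, UNKNOWN: 2, CRIT: 3}


def worst(states):
    """Return the most severe state; an empty input counts as OK."""
    return max(states, key=_BADNESS.__getitem__, default=OK)


def aggregate(node, states):
    """Aggregate a tree whose nodes are service names or (name, children) pairs."""
    if isinstance(node, str):
        return states.get(node, UNKNOWN)  # services without data are UNKNOWN
    _name, children = node
    return worst(aggregate(child, states) for child in children)


def simulate(tree, states, overrides):
    """What-if analysis: aggregate with some service states replaced."""
    return aggregate(tree, {**states, **overrides})


webshop = ("webshop", [
    ("frontend", ["web01/HTTP", "web02/HTTP"]),
    ("database", ["db01/MySQL"]),
])
current = {"web01/HTTP": OK, "web02/HTTP": WARN, "db01/MySQL": OK}
```

Here `aggregate(webshop, current)` evaluates to `WARN`, and `simulate(webshop, current, {"db01/MySQL": CRIT})` shows how a database failure would turn the whole process critical without touching the live data.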

Event Console

The Event Console integrates the processing of log messages and SNMP traps into the monitoring. It is configured via a flexible set of rules that decide whether incoming messages are discarded or how they are classified. It can count and correlate messages, expect messages, rewrite them, and more. Similar messages can be grouped into a single event so that the list of events stays manageable. It also has a built-in syslog daemon that receives messages directly on port 514, and an SNMP trap receiver that receives traps on port 162.
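The rule-driven handling of messages (discard, classify, count, group) might look roughly like the following. The sketch assumes regular-expression rules evaluated in order with the first match deciding, and groups messages by rule name; `EventRule` and `process` are hypothetical names, and the real rule set is considerably richer.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EventRule:
    name: str
    pattern: str            # regular expression searched in the message
    state: Optional[str]    # "ok", "warn" or "crit"; None discards the message


def process(messages, rules):
    """Classify messages and group matches of the same rule into one event."""
    events = {}
    for message in messages:
        for rule in rules:
            if re.search(rule.pattern, message):
                if rule.state is not None:
                    event = events.setdefault(
                        rule.name, {"state": rule.state, "count": 0}
                    )
                    event["count"] += 1
                break  # the first matching rule decides
    return events


rules = [
    EventRule("debug noise", r"\bDEBUG\b", None),
    EventRule("disk errors", r"I/O error", "crit"),
    EventRule("failed logins", r"Failed password", "warn"),
]
```

Messages that match no rule are ignored in this sketch; whether unmatched messages should be dropped or kept is a design decision the example leaves open.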

Metrics Graphing

The commercial Checkmk editions use their own metric and graphing system. It allows time series metrics to be analysed over long intervals using interactive HTML5 graphs, with a maximum resolution of one second. Data can be imported from a variety of data sources and metrics formats and is stored on disk for long-term retention.
Alternatively, Graphite or InfluxDB can be connected via an export interface. As of version 1.5p16 of the enterprise edition (CEE), a plug-in is also available for integrating data directly from Checkmk into Grafana for visualization. The Checkmk Raw Edition currently uses PNP4Nagios as its graphing system.

Reporting

Reporting enables the direct delivery of PDF reports, either ad hoc or automatically at regular intervals. It includes availability analysis, in which the history of states over any desired time period can be retrieved with a click. Availability calculations can exclude unmonitored times, adjust the resolution, or ignore short intervals. In addition to the availability calculations, reporting also includes SLA reporting, with which complex SLAs can be monitored. Reporting is only available in the commercial editions of Checkmk.
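The effect of excluding unmonitored times and ignoring short intervals can be shown with a small calculation. The sketch assumes a timeline given as (duration in seconds, state) pairs and treats an ignored short outage as available time; both choices, and the function name `availability`, are assumptions for illustration.

```python
def availability(intervals, ignore_shorter_than=0.0):
    """Return the fraction of monitored time in which the object was available.

    intervals: iterable of (duration_seconds, state) pairs, where state is
    "ok", "down" or "unmonitored". Returns None if nothing was monitored.
    """
    up = down = 0.0
    for duration, state in intervals:
        if state == "unmonitored":
            continue  # excluded from both numerator and denominator
        if state == "down" and duration >= ignore_shorter_than:
            down += duration
        else:
            up += duration  # "ok", or an outage short enough to ignore
    total = up + down
    return up / total if total else None


timeline = [
    (3600, "ok"),
    (30, "down"),            # short glitch
    (1800, "unmonitored"),   # e.g. outside the service period
    (570, "down"),
    (3000, "ok"),
]
```

For this timeline, 7200 seconds are monitored. Counting every outage gives 6600/7200, about 91.7 %; ignoring outages shorter than 60 seconds gives 6630/7200, about 92.1 %.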

Hardware/Software Inventory

The hardware/software inventory can be used, for example, to monitor hardware and software changes, to verify the presence of installed security updates, and to update static data with dynamic parameters. A deep integration with the configuration management database (CMDB) i-doit enables the exchange of CMDB data with monitoring data.