Machine-generated data is information automatically generated by a computer process, application, or other mechanism without the active intervention of a human. While the term dates back over fifty years, its exact scope remains a matter of debate. Monash Research's Curt Monash defines it as "data that was produced entirely by machines OR data that is more about observing humans than recording their choices." Meanwhile, Daniel Abadi, a computer science professor at Yale, proposes a narrower definition: "Machine-generated data is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action." Despite their differences, both definitions exclude data entered manually by a person. Machine-generated data crosses all industry sectors, and humans are often, and increasingly, unaware that their actions are generating it.
Relevance
Machine-generated data has no single form; rather, its type, format, metadata, and frequency reflect the particular business purpose it serves. Machines often create it on a defined time schedule or in response to a state change, action, transaction, or other event. Because the data records an event that has already occurred, it is rarely updated or modified afterward. Partly because of this quality, U.S. courts consider machine-generated data highly reliable. Machine-generated data is the lifeblood of the Internet of Things.
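As a minimal sketch, the snippet below shows what such an event-driven, append-only record might look like, assuming a hypothetical sensor that emits a reading whenever its state changes; the function and field names are illustrative only, not part of any standard:

```python
import json
import time

def emit_state_change_event(sensor_id, old_value, new_value):
    """Build an immutable, timestamped record for a state-change event.

    Machine-generated records like this are typically appended to a log
    as they occur and never updated or modified afterward.
    """
    event = {
        "timestamp": time.time(),      # when the machine observed the event
        "sensor_id": sensor_id,        # which device produced the record
        "event_type": "state_change",
        "old_value": old_value,
        "new_value": new_value,
    }
    return json.dumps(event)  # serialized for an append-only log

# Example: a temperature sensor crossing a threshold, no human involved
print(emit_state_change_event("sensor-042", 19.8, 21.3))
```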
Growth
In 2009, Gartner projected that data would grow by 650% over the following five years, with most of that growth coming from machine-generated data. IDC estimated that by 2020 there would be 26 times more connected things than people, and Wikibon forecast that $514 billion would be spent on the Industrial Internet in 2020.
Processing
Given the fairly static yet voluminous nature of machine-generated data, data owners rely on highly scalable tools to process and analyze the resulting datasets. Almost all machine-generated data is created in an unstructured form and then derived into a common structure; these derived structures typically contain many data points (columns), and the main challenge lies in analyzing the data across them. Given the high performance requirements and large data sizes, traditional database indexing and partitioning limit the size and history of the dataset that can be processed. Columnar databases offer an alternative approach, since a given analysis usually needs to access only particular "columns" of the dataset.
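A rough Python sketch of both steps follows: hypothetical raw log lines are first parsed into a common structure, then pivoted into per-field arrays that mimic columnar storage, so an analysis over one field reads only that "column". The log format, regular expression, and field names are assumptions for illustration, not a standard:

```python
import re
from collections import defaultdict

# Hypothetical raw, unstructured machine-generated log lines
RAW_LOGS = [
    '2024-05-01T12:00:01Z host1 GET /index.html 200 512',
    '2024-05-01T12:00:02Z host2 GET /api/data 500 87',
    '2024-05-01T12:00:03Z host1 POST /api/data 201 1024',
]

LINE_RE = re.compile(
    r'(?P<ts>\S+) (?P<host>\S+) (?P<method>\S+) '
    r'(?P<path>\S+) (?P<status>\d+) (?P<bytes>\d+)'
)

# Step 1: derive a common structure from the unstructured lines
records = [LINE_RE.match(line).groupdict() for line in RAW_LOGS]

# Step 2: pivot the row-oriented records into columns, mimicking
# columnar storage: each field becomes its own independent array
columns = defaultdict(list)
for rec in records:
    for field, value in rec.items():
        columns[field].append(value)

# An analysis over one field now touches only that column, not whole rows
statuses = columns["status"]
error_rate = sum(int(s) >= 500 for s in statuses) / len(statuses)
print(f"error rate: {error_rate:.0%}")
```

A production system would use a purpose-built columnar engine rather than in-memory lists, but the access pattern is the same: queries scan only the columns they need instead of reading every full record.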