ELKI

ELKI is a data mining software framework developed for use in research and teaching. It was originally at the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany, and now continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

Description

The ELKI framework is written in Java and built around a modular architecture. Most currently included algorithms belong to clustering, outlier detection and database indexes. The object-oriented architecture allows the combination of arbitrary algorithms, data types, distance functions, indexes, and evaluation measures. The Java just-in-time compiler optimizes all combinations to a similar extent, making benchmarking results more comparable if they share large parts of the code. When developing new algorithms or index structures, the existing components can be easily reused, and the type safety of Java detects many programming errors at compile time.
ELKI has been used in data science for example to cluster sperm whale codas, phoneme clustering, for anomaly detection in spaceflight operations, for bike sharing redistribution, and traffic prediction.

Objectives

The university project is developed for use in teaching and research. The source code is written with extensibility and reusability in mind, but is also optimized for performance. The experimental evaluation of algorithms depends on many environmental factors and implementation details can have a large impact on the runtime. ELKI aims at providing a shared codebase with comparable implementations of many algorithms.
As research project, it currently does not offer integration with business intelligence applications or an interface to common database management systems via SQL. The copyleft license may also be a hindrance to an integration in commercial products; nevertheless it can be used to evaluate algorithms prior to developing an own implementation for a commercial product. Furthermore, the application of the algorithms requires knowledge about their usage, parameters, and study of original literature. The audience are students, researchers, data scientists, and software engineers.

Architecture

ELKI is modeled around a database-inspired core, which uses a vertical data layout that stores data in column groups. This database core provides nearest neighbor search, range/radius search, and distance query functionality with index acceleration for a wide range of dissimilarity measures. Algorithms based on such queries can be implemented easily and benefit from the index acceleration.
The database core also provides fast and memory efficient collections for object collections and associative structures such as nearest neighbor lists.
ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration.
ELKI uses a service loader architecture to allow publishing extensions as separate jar files.
ELKI uses optimized collections for performance rather than the standard Java API. For loops for example are written similar to C++ iterators:

for ; iter.valid; iter.advance)

In contrast to typical Java iterators, this conserves memory, because the iterator can internally use primitive values for data storage. The reduced garbage collection improves the runtime. Optimized collections libraries such as GNU Trove3, Koloboke, and fastutil employ similar optimizations. ELKI includes data structures such as object collections and heaps using such optimizations.

Visualization

The visualization module uses SVG for scalable graphics output, and Apache Batik for rendering of the user interface as well as lossless export into PostScript and PDF for easy inclusion in scientific publications in LaTeX.
Exported files can be edited with SVG editors such as Inkscape. Since cascading style sheets are used, the graphics design can be restyled easily.
Unfortunately, Batik is rather slow and memory intensive, so the visualizations are not very scalable to large data sets.

Awards

Version 0.4, presented at the "Symposium on Spatial and Temporal Databases" 2011, which included various methods for spatial outlier detection, won the conference's "best demonstration paper award".

Included algorithms

Select included algorithms:

Cluster analysis:
*K-means clustering
*K-medians clustering
*K-medoids clustering
*Expectation-maximization algorithm for Gaussian mixture modeling
*Hierarchical clustering
*Single-linkage clustering
*Leader clustering
*DBSCAN
*OPTICS, including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH
*HDBSCAN
*Mean-shift clustering
*BIRCH clustering
*SUBCLU
*CLIQUE clustering
*ORCLUS and PROCLUS clustering
*COPAC, ERiC and 4C clustering
*CASH clustering
*DOC and FastDOC subspace clustering
*P3C clustering
*Canopy clustering algorithm
Anomaly detection:
*k-Nearest-Neighbor outlier detection
*LOF
*LoOP
*OPTICS-OF
*DB-Outlier
*LOCI
*LDOF
*EM-Outlier
*SOD
*COP
Frequent Itemset Mining and association rule learning
*Apriori algorithm
*Eclat
*FP-growth
Dimensionality reduction
*Principal component analysis
*Multidimensional scaling
*T-distributed stochastic neighbor embedding
Spatial index structures and other search indexes:
*R-tree
*R*-tree
*M-tree
*k-d tree
*X-tree
*Cover tree
*iDistance
*NN descent
*Locality sensitive hashing
Evaluation:
*Precision and recall, F1 score, Average Precision
*Receiver operating characteristic
*Discounted cumulative gain
*Silhouette index
*Davies–Bouldin index
*Dunn index
*Density-based cluster validation
Visualization
*Scatter plots
*Histograms
*Parallel coordinates
Other:
*Statistical distributions and many parameter estimators, including robust MAD based and L-moment based estimators
*Dynamic time warping
*Change point detection in time series
*Intrinsic dimensionality estimators
Version history

Version 0.1 contained several Algorithms from cluster analysis and anomaly detection, as well as some index structures such as the R*-tree. The focus of the first release was on subspace clustering and correlation clustering algorithms.
Version 0.2 added functionality for time series analysis, in particular distance functions for time series.
Version 0.3 extended the choice of anomaly detection algorithms and visualization modules.
Version 0.4 added algorithms for geo data mining and support for multi-relational database and index structures.
Version 0.5 focuses on the evaluation of cluster analysis results, adding new visualizations and some new algorithms.
Version 0.6 introduces a new 3D adaption of parallel coordinates for data visualization, apart from the usual additions of algorithms and index structures.
Version 0.7 adds support for uncertain data types, and algorithms for the analysis of uncertain data.
Version 0.7.5 adds additional clustering algorithms, anomaly detection algorithms, evaluation measures, and indexing structures.

Similar applications

Scikit-learn: machine learning library in python
Weka: A similar project by the University of Waikato, with a focus on classification algorithms
RapidMiner: An application available commercially
KNIME: An open source platform which integrates various components for machine learning and data mining

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...