MemSQL
MemSQL is a distributed, relational, SQL database management system that features ANSI SQL support and is known for speed in data ingest, transaction processing, and query processing. MemSQL supports blended workloads, commonly referred to as translytical,HTAP, HOAP, and ATP workloads. For queries, it compiles Structured Query Language into machine code. The same MemSQL database engine runs wherever Linux does, including on-premise installations, public cloud providers and in private clouds, in containers via a Kubernetes operator, or as a hosted service in the cloud known as memSQL Helios.
Rather than the traditional B-tree index, MemSQL uses skiplists optimized for fast, lock-free processing in memory. Due to its fully distributed nature, MemSQL is fully scalable, and it achieves high speeds for time-critical processing of large data sets. MemSQL is compatible with the MySQL wire protocol, meaning most MySQL and MariaDB drivers and products work on MemSQL. MemSQL includes pipeline features for real-time ingest of data from Kafka, Spark, Amazon S3, Azure Blob Storage, and on-disk files. MemSQL can store data in in-memory rowstore tables or disk-based columnstore tables. MemSQL primarily stores relational data, though it can also store JSON data, graph data, and time series data. On April 23, 2013, MemSQL launched its first generally available version of the database to the public.
Because of MemSQL's unique distributed workload processing and ANSI SQL processing, it is ideal for HTAP workloads requiring real-time data processing at scale. “HTAP solves the issue of analytic latency in several ways, including eliminating the need for multiple copies of the same data and the requirement for data to be offloaded from operational databases to data warehouses via ETL processes.”
Distribution Formats
MemSQL can be downloaded for free and run on Linux for systems up to 4 leaf nodes of 32 gigs RAM each; an Enterprise license is required for larger deployments and for official MemSQL support. MemSQL clusters can be managed in containers using the MemSQL Kubernetes Operator. MemSQL is also available as a managed service named memSQL Helios, available in various regions in Google Cloud and Amazon Web Services, with a Microsoft Azure implementation promised for the near future. The underlying engine and potential system performance are identical in all distribution formats.Row and column table formats
MemSQL can store database tables either as rowstores or columnstores. The format used is determined by the user when creating the table.Data for all rowstore tables is stored completely in-memory, making random reads fast, with snapshots and transaction logs persisted to disk. Data for all columnstore tables is stored on-disk, supporting fast sequential reads and compression that typically reaches 5-10x.
Rowstore tables, as the name implies, store information in row format, which is the traditional data format used by RDBMS systems. Rowstores are optimized for singleton or small insert, update or delete queries and are most closely associated with OLTP use cases.
Columnstores are optimized for complex SELECT queries, typically associated with OLAP / data warehousing use cases. As an example, a large clinical data set for data analysis is best stored in columnar format, since queries run against it will typically be ad hoc queries where aggregates are computed over large numbers of similar data items.
Distributed architecture
A MemSQL database is distributed across many commodity machines. Data is stored in partitions on leaf nodes, and users connect to aggregator nodes. A single piece of software is installed for MemSQL aggregator and leaf nodes; administrators merely designate the machine’s purpose when installing. An aggregator node is responsible for receiving SQL queries, breaking them up across leaf nodes, and aggregating results back to the client. A leaf node stores MemSQL data and processes queries from the aggregator. All communication between aggregators and leaf nodes is done over the network using SQL. MemSQL uses hash partitioning to distribute data uniformly across the number of leaf nodes.Real-time Ingest
MemSQL Pipelines allow for real-time data ingest. A pipeline is a native connector to data sources such as Apache Kafka, Apache Spark, Amazon S3 buckets, Azure Blob Storage, or files on disk. The Pipeline pulls data into the system at high speed. Because of the lock-free skip lists, queries can retrieve the data as soon as it lands, but are not blocked from continuing while data is imported.Durability
Durability for the in-memory rowstore is implemented with a write-ahead log and snapshots, similar to checkpoints. With default settings, as soon as a transaction is acknowledged in memory, the database will asynchronously write the transaction to disk as fast as the disk allows.The on-disk columnstore is actually fronted by an in-memory rowstore-like structure. This structure has the same durability guarantees as the MemSQL rowstore. Apart from that, the columnstore is durable, since its data is stored on disk.
Replication
A MemSQL cluster can be configured in "High Availability" mode, where every data partition is automatically created with master and slave versions on two separate leaf nodes. In HA mode, aggregators send transactions to the master partitions, which then send logs to the slave partitions. In the event of an unexpected master failure, the slave partitions take over as master partitions, in a fully online operation with no downtime.MemSQL Components
MemSQL ships with a set of installation, management, and monitoring tools called MemSQL Tools. When installing MemSQL, Tools can be used to set up the distributed MemSQL database across machines.MemSQL also provides a browser-based query and management UI called MemSQL Studio. MemSQL Studio provides query processing and database monitoring, and shows health and informational details about the running cluster.