A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases, semi-structured data, unstructured data and binary data. A data lake can be established "on premises" or "in the cloud". A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or provides little value.
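As a rough illustration of the idea of keeping raw copies alongside transformed data, the following minimal sketch lands four kinds of payload unchanged in one store and then derives a transformed copy from one of them. The `raw/` and `curated/` zone names, the source names, the file names and the helper functions are all hypothetical choices made for this example; a local temporary directory stands in for an object store, and nothing here reflects a particular vendor's product.

```python
from __future__ import annotations

import csv
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload byte-for-byte under the (hypothetical) raw/ zone."""
    target = lake_root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def curate_orders(lake_root: Path) -> Path:
    """Derive a transformed copy (JSON lines) from the raw CSV export.

    The raw file is only read, never modified, so the original stays available.
    """
    raw_csv = lake_root / "raw" / "crm" / "orders.csv"
    target = lake_root / "curated" / "orders.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with raw_csv.open(newline="") as src, target.open("w") as dst:
        for row in csv.DictReader(src):
            row["amount"] = float(row["amount"])  # give the amount a numeric type
            dst.write(json.dumps(row) + "\n")
    return target


def build_demo_lake(lake_root: Path) -> list[str]:
    """Land one example of each kind of data, then list every file in the lake."""
    # Structured: a CSV export, as might come from a relational table.
    land_raw(lake_root, "crm", "orders.csv", b"order_id,amount\n1,9.50\n2,20.00\n")
    # Semi-structured: a JSON event.
    land_raw(lake_root, "web", "events.json", b'{"event": "click", "page": "/home"}')
    # Unstructured: free text.
    land_raw(lake_root, "support", "ticket-17.txt", b"Customer reports a late delivery.")
    # Binary: arbitrary bytes.
    land_raw(lake_root, "media", "logo.bin", bytes([0x89, 0x50, 0x4E, 0x47]))
    curate_orders(lake_root)
    return sorted(
        p.relative_to(lake_root).as_posix()
        for p in lake_root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in build_demo_lake(Path(tmp)):
            print(path)
```

Running the script lists the one curated file and the four raw files. Because the transformed copy is written to a separate zone, the raw source data remains available for other uses, such as a different transformation later.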
Background
James Dixon, then chief technology officer at Pentaho, coined the term to contrast it with a data mart, which is a smaller repository of interesting attributes derived from raw data. In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers said that data lakes could "put an end to data silos." In their study on data lakes, they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository." Hortonworks, Google, Oracle, Microsoft, Zaloni, Teradata, Impetus Technologies, Cloudera, and Amazon now all have data lake offerings.
In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data". PricewaterhouseCoopers was also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics, on this point, and describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization.
Another criticism is that the concept is fuzzy and arbitrary: the term has been applied to any tool or data management practice that does not fit into the traditional data warehouse architecture. The data lake has variously been referred to as a particular technology, labeled a raw data reservoir or a hub for ETL offload, and defined as a central hub for self-service analytics. The concept has thus been overloaded with meanings, which calls the usefulness of the term into question.
While critiques of data lakes are warranted, in many cases they are overly broad and could be applied to any technology endeavor generally and to data projects specifically. For example, the term "data warehouse" currently suffers from the same opaque and changing definition as "data lake", and not all data warehouse efforts have been successful either. In response to various critiques, McKinsey noted that the data lake should be viewed as a service model for delivering business value within the enterprise, not a technology outcome.