Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. These downstream uses may include further munging, data visualization, data aggregation, and training a statistical model, among many other possibilities. A data wrangler is a person who performs these transformation operations. Data munging as a process typically follows a set of general steps: extracting the data in a raw form from the data source, "munging" the raw data using algorithms or parsing it into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.
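As a minimal sketch of these general steps, assuming a hypothetical raw_measurements.csv source file and a JSON file as the data sink, a Python script might look like this:

    import csv
    import json

    # Extract: read raw rows from the data source (hypothetical file).
    with open("raw_measurements.csv", newline="") as src:
        raw_rows = list(csv.DictReader(src))

    # Munge: parse each row into a predefined structure,
    # coercing types and discarding records that fail to parse.
    records = []
    for row in raw_rows:
        try:
            records.append({
                "station": row["station"].strip().upper(),
                "timestamp": row["timestamp"],
                "temperature_c": float(row["temperature"]),
            })
        except (KeyError, ValueError):
            continue  # skip malformed rows

    # Deposit: write the parsed records into a data sink for future use.
    with open("clean_measurements.json", "w") as sink:
        json.dump(records, sink, indent=2)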
Background
The "wrangler" non-technical term is often said to derive from work done by the United States Library of Congress's National Digital Information Infrastructure and Preservation Program and their program partner the Emory University Libraries based MetaArchive Partnership. The term "mung" has roots in munging as described in the Jargon File. The term "Data Wrangler" was also suggested as the best analogy to coder for someone working with data. The terms data wrangling and data wrangler had sporadic use in the 1990s and early 2000s. One of the earliest business mentions of data wrangling was in an article in Byte Magazine in 1997 referencing “Perl’s data wrangling services”. In 2001 it was reported that CNN hired “a dozen data wranglers” to help track down information for news stories. One of the first mentions of data wrangling in a scientific context was by Donald Cline during the NASA/NOAA Cold Lands Processes Experiment. Cline stated the data wranglers “coordinate the acquisition of the entire collection of the experiment data.” Cline also specifies duties typically handled by a storage administrator for working with large amounts of data. This can occur in areas like major research projects and the making of films with a large amount of complex computer-generated imagery. In research, this involves both data transfer from research instrument to storage grid or storage facility as well as data manipulation for re-analysis via high performance computing instruments or access via cyberinfrastructure-based digital libraries.
Typical use
The data transformations are typically applied to distinct entities within a data set, and could include actions such as extraction, parsing, joining, standardization, augmentation, cleansing, consolidation, and filtering to create desired wrangling outputs that can be leveraged downstream. The recipients could be individuals, such as data architects or data scientists who will investigate the data further; business users who will consume the data directly in reports; or systems that will further process the data and write it into targets such as data warehouses, data lakes, or downstream applications.
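A few of these actions can be sketched in Python with the pandas library (the tables and column names below are invented for illustration):

    import pandas as pd

    # Hypothetical inputs: a customer table and an orders table.
    customers = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "country": ["us", "DE", " us "],
    })
    orders = pd.DataFrame({
        "customer_id": [1, 1, 2, 4],
        "amount": [120.0, None, 75.5, 30.0],
    })

    # Standardize: normalize the country codes.
    customers["country"] = customers["country"].str.strip().str.upper()

    # Cleanse: drop orders with missing amounts.
    orders = orders.dropna(subset=["amount"])

    # Join: attach customer attributes to each order.
    combined = orders.merge(customers, on="customer_id", how="inner")

    # Filter: keep only the rows relevant downstream.
    result = combined[combined["amount"] > 50]
    print(result)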
Modus operandi
Depending on the amount and format of the incoming data, data wrangling has traditionally been performed manually or via scripts in languages such as Python or SQL. R, a language often used in data mining and statistical data analysis, is now also often used for data wrangling. Visual data wrangling systems were developed to make data wrangling accessible to non-programmers, and simpler for programmers. Some of these also include embedded AI recommenders and Programming by Example facilities to provide user assistance, and Program Synthesis techniques to auto-generate scalable dataflow code. Early prototypes of visual data wrangling tools include OpenRefine and the Stanford/Berkeley Wrangler research system; the latter evolved into Trifacta. Other terms for these processes have included data franchising, data preparation, and data munging.
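As a minimal illustration of such scripted wrangling, a SQL transformation can be run through Python's built-in sqlite3 module (the raw_events table and its columns are invented for the example):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (user TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [("alice", " ACTIVE "), ("bob", "inactive"), ("alice", "active")],
    )

    # Standardize and deduplicate in a single SQL statement.
    cleaned = conn.execute(
        """
        SELECT DISTINCT user, LOWER(TRIM(status)) AS status
        FROM raw_events
        """
    ).fetchall()
    print(cleaned)  # e.g. [('alice', 'active'), ('bob', 'inactive')]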