Data pre-processing


Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, and so on. Analyzing data that has not been carefully screened for such problems can produce misleading results. The representation and quality of the data must therefore be ensured before any analysis is run.
Often, data preprocessing is the most important phase of a machine learning project, especially in computational biology.
If much irrelevant and redundant information is present, or the data are noisy and unreliable, then knowledge discovery during the training phase becomes more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data preprocessing includes cleaning, instance selection, normalization, transformation, feature extraction and selection, among other steps. The product of data preprocessing is the final training set.
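A minimal sketch of two of these tasks, cleaning and normalization, written in Python with pandas and scikit-learn; the column names and value ranges here are hypothetical.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.DataFrame({
        "age":    [25, 190, None, 41],          # 190 is an out-of-range value
        "income": [32000, 45000, 58000, None],  # one missing value
    })

    # Cleaning: mark impossible ages as missing, then fill all gaps
    # with the column median.
    df.loc[~df["age"].between(0, 120), "age"] = None
    df = df.fillna(df.median(numeric_only=True))

    # Normalization: rescale both columns to the [0, 1] range.
    df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])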
Data pre-processing may affect the way in which outcomes of the final data processing can be interpreted. This aspect should be considered carefully when interpretation of the results is a key point, such as in the multivariate processing of chemical data.

Tasks of data pre-processing

Data preprocessing has its origins in data mining, where the idea is to aggregate existing information and search its content. It was later recognized that machine learning and neural network models need a preprocessing step as well, so the technique has since become a universal one used in computing in general.
From a user's perspective, data preprocessing often amounts to combining existing comma-separated values files into one. Data are usually stored in files: the CSV format has already been mentioned, but the data may equally be stored in a Microsoft Excel sheet or in a JSON file. A self-written script, typically in Python or R, is then applied to the files.
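As an illustration of such a script, the following Python snippet merges the three formats into one table with pandas; the file names are hypothetical placeholders, and reading Excel files additionally requires an engine such as openpyxl.

    import pandas as pd

    csv_part   = pd.read_csv("measurements.csv")
    excel_part = pd.read_excel("measurements.xlsx")
    json_part  = pd.read_json("measurements.json")

    # Stack the fragments into one table and write the result back out.
    combined = pd.concat([csv_part, excel_part, json_part], ignore_index=True)
    combined.to_csv("combined.csv", index=False)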
A user transforms existing files into a new one for many reasons: data preprocessing aims to fill in missing values, aggregate information, label data with categories, and smooth a trajectory. More advanced techniques such as principal component analysis and feature selection work with statistical formulas and are applied to complex datasets, for example those recorded by GPS trackers and motion capture devices.
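The following sketch shows two of these operations, filling in missing values and principal component analysis, with scikit-learn; the data is randomly generated as a stand-in for a six-channel motion recording.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 6))           # 100 samples, six channels
    X[rng.random(X.shape) < 0.05] = np.nan  # inject some missing values

    # Fill each missing entry with its column mean.
    X_filled = SimpleImputer(strategy="mean").fit_transform(X)

    # Project the six channels onto the two main directions of variance.
    X_reduced = PCA(n_components=2).fit_transform(X_filled)
    print(X_reduced.shape)  # (100, 2)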

Semantic data preprocessing

Complex problems call for more elaborate techniques for analyzing existing information. Instead of writing a simple script that aggregates different numerical values into one, it makes sense to focus on semantics-based data preprocessing. Here the idea is to build a dedicated ontology that describes, on a higher level, what the problem is about; Protégé is a standard tool for this purpose. A second, more advanced technique is fuzzy preprocessing, in which numerical values are grounded with linguistic information and raw data are transformed into natural language.
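As a sketch of fuzzy preprocessing, the Python snippet below grounds numerical speed values in linguistic terms using triangular membership functions; the terms and their boundaries are assumptions chosen for illustration.

    def triangular(x, a, b, c):
        """Membership degree of x in a triangle with feet a, c and peak b."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    # Hypothetical linguistic terms for a speed measured in km/h.
    terms = {"slow": (0, 15, 40), "moderate": (20, 50, 80), "fast": (60, 85, 110)}

    def to_linguistic(speed):
        # Report the term with the highest membership degree.
        return max(terms, key=lambda t: triangular(speed, *terms[t]))

    for v in (10, 45, 90):
        print(v, "->", to_linguistic(v))  # slow, moderate, fast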