Data transformation (statistics)

In statistics, data transformation is the application of a deterministic mathematical function to each point in a data set—that is, each data point z_i is replaced with the transformed value y_i = f, where f is a function. Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs.
Nearly always, the function that is used to transform the data is invertible, and generally is continuous. The transformation is usually applied to a collection of comparable measurements. For example, if we are working with data on peoples' incomes in some currency unit, it would be common to transform each person's income value by the logarithm function.

Motivation

Guidance for how data should be transformed, or whether a transformation should be applied at all, should come from the particular statistical analysis to be performed. For example, a simple way to construct an approximate 95% confidence interval for the population mean is to take the sample mean plus or minus two standard error units. However, the constant factor 2 used here is particular to the normal distribution, and is only applicable if the sample mean varies approximately normally. The central limit theorem states that in many situations, the sample mean does vary normally if the sample size is reasonably large. However, if the population is substantially skewed and the sample size is at most moderate, the approximation provided by the central limit theorem can be poor, and the resulting confidence interval will likely have the wrong coverage probability. Thus, when there is evidence of substantial skew in the data, it is common to transform the data to a symmetric distribution before constructing a confidence interval. If desired, the confidence interval can then be transformed back to the original scale using the inverse of the transformation that was applied to the data.
Data can also be transformed to make them easier to visualize. For example, suppose we have a scatterplot in which the points are the countries of the world, and the data values being plotted are the land area and population of each country. If the plot is made using untransformed data, most of the countries would be plotted in tight cluster of points in the lower left corner of the graph. The few countries with very large areas and/or populations would be spread thinly around most of the graph's area. Simply rescaling units will not change this. However, following logarithmic transformations of both area and population, the points will be spread more uniformly in the graph.
Another reason for applying data transformation is to improve interpretability, even if no formal statistical analysis or visualization is to be performed. For example, suppose we are comparing cars in terms of their fuel economy. These data are usually presented as "kilometers per liter" or "miles per gallon". However, if the goal is to assess how much additional fuel a person would use in one year when driving one car compared to another, it is more natural to work with the data transformed by applying the reciprocal function, yielding liters per kilometer, or gallons per mile.

In regression

Data transformation may be used as a remedial measure to make data suitable for modeling with linear regression if the original data violates one or more assumptions of linear regression. For example, the simplest linear regression models assume a linear relationship between the expected value of Y and each independent variable. If linearity fails to hold, even approximately, it is sometimes possible to transform either the independent or dependent variables in the regression model to improve the linearity. For example, addition of quadratic functions of the original independent variables may lead to a linear relationship with expected value of Y, resulting in a polynomial regression model, a special case of linear regression.
Another assumption of linear regression is homoscedasticity, that is the variance of errors must be the same regardless of the values of predictors. If this assumption is violated, it may be possible to find a transformation of Y alone, or transformations of both X and Y, such that the homoscedasticity assumption holds true on the transformed variables and linear regression may therefore be applied on these.
Yet another application of data transformation is to address the problem of lack of normality in error terms. Univariate normality is not needed for least squares estimates of the regression parameters to be meaningful. However confidence intervals and hypothesis tests will have better statistical properties if the variables exhibit multivariate normality. Transformations that stabilize the variance of error terms often also help make the error terms approximately normal.

Examples

Equation:
Equation:

Equation:
Equation:

Alternative

provide a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. GLMs allow the linear model to be related to the response variable via a link function and allow the magnitude of the variance of each measurement to be a function of its predicted value.

Common cases

The logarithm and square root transformations are commonly used for positive data, and the multiplicative inverse transformation can be used for non-zero data. The power transformation is a family of transformations parameterized by a non-negative value λ that includes the logarithm, square root, and multiplicative inverse as special cases. To approach data transformation systematically, it is possible to use statistical estimation techniques to estimate the parameter λ in the power transformation, thereby identifying the transformation that is approximately the most appropriate in a given setting. Since the power transformation family also includes the identity transformation, this approach can also indicate whether it would be best to analyze the data without a transformation. In regression analysis, this approach is known as the Box–Cox technique.
The reciprocal transformation, some power transformations such as the Yeo–Johnson transformation, and certain other transformations such as applying the inverse hyperbolic sine, can be meaningfully applied to data that include both positive and negative values. However, when both negative and positive values are observed, it is sometimes common to begin by adding a constant to all values, producing a set of non-negative data to which any power transformation can be applied.
A common situation where a data transformation is applied is when a value of interest ranges over several orders of magnitude. Many physical and social phenomena exhibit such behavior — incomes, species populations, galaxy sizes, and rainfall volumes, to name a few. Power transforms, and in particular the logarithm, can often be used to induce symmetry in such data. The logarithm is often favored because it is easy to interpret its result in terms of "fold changes."
The logarithm also has a useful effect on ratios. If we are comparing positive quantities X and Y using the ratio X / Y, then if X < Y, the ratio is in the unit interval, whereas if X > Y, the ratio is in the half-line, where the ratio of 1 corresponds to equality. In an analysis where X and Y are treated symmetrically, the log-ratio log is zero in the case of equality, and it has the property that if X is K times greater than Y, the log-ratio is the equidistant from zero as in the situation where Y is K times greater than X and −log.
If values are naturally restricted to be in the range 0 to 1, not including the end-points, then a logit transformation may be appropriate: this yields values in the range.

Transforming to normality

1. It is not always necessary or desirable to transform a data set to resemble a normal distribution. However, if symmetry or normality are desired, they can often be induced through one of the power transformations.;
2. A linguistic power function is distributed according to the Zipf-Mandelbrot law. The distribution is extremely spiky and leptokurtic, reason why researchers had to turn their backs to statistics to solve e.g. authorship attribution problems. Nevertheless, usage of Gaussian statistics is perfectly possible by applying data transformation.
3. To assess whether normality has been achieved after transformation, any of the standard normality tests may be used. A graphical approach is usually more informative than a formal statistical test and hence.a normal quantile plot is commonly used to assess the fit of a data set to a normal population. Alternatively, rules of thumb based on the sample skewness and kurtosis have also been proposed.

Transforming to a uniform distribution or an arbitrary distribution

If we observe a set of n values X₁,..., X_n with no ties, we can replace X_i with the transformed value Y_i = k, where k is defined such that X_i is the k^th largest among all the X values. This is called the rank transform, and creates data with a perfect fit to a uniform distribution. This approach has a population analogue.
Using the probability integral transform, if X is any random variable, and F is the cumulative distribution function of X, then as long as F is invertible, the random variable U = F follows a uniform distribution on the unit interval .
From a uniform distribution, we can transform to any distribution with an invertible cumulative distribution function. If G is an invertible cumulative distribution function, and U is a uniformly distributed random variable, then the random variable G⁻¹ has G as its cumulative distribution function.
Putting the two together, if X is any random variable, F is the invertible cumulative distribution function of X, and G is an invertible cumulative distribution function then the random variable G⁻¹ has G as its cumulative distribution function.

Variance stabilizing transformations

Many types of statistical data exhibit a "variance-on-mean relationship", meaning that the variability is different for data values with different expected values. As an example, in comparing different populations in the world, the variance of income tends to increase with mean income. If we consider a number of small area units and obtain the mean and variance of incomes within each county, it is common that the counties with higher mean income also have higher variances.
A variance-stabilizing transformation aims to remove a variance-on-mean relationship, so that the variance becomes constant relative to the mean. Examples of variance-stabilizing transformations are the Fisher transformation for the sample correlation coefficient, the square root transformation or Anscombe transform for Poisson data, the Box–Cox transformation for regression analysis, and the arcsine square root transformation or angular transformation for proportions. While commonly used for statistical analysis of proportional data, the arcsine square root transformation is not recommended because logistic regression or a logit transformation are more appropriate for binomial or non-binomial proportions, respectively, especially due to decreased type-II error.

Transformations for multivariate data

Univariate functions can be applied point-wise to multivariate data to modify their marginal distributions. It is also possible to modify some attributes of a multivariate distribution using an appropriately constructed transformation. For example, when working with time series and other types of sequential data, it is common to difference the data to improve stationarity. If data generated by a random vector X are observed as vectors X_i of observations with covariance matrix Σ, a linear transformation can be used to decorrelate the data. To do this, the Cholesky decomposition is used to express Σ = A A'. Then the transformed vector Y_i = A⁻¹X_i has the identity matrix as its covariance matrix.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...