Multicollinearity
In statistics, multicollinearity is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation, the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with collinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.
Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase "no multicollinearity" usually refers to the absence of perfect multicollinearity, that is, of an exact linear relation among the predictors. When perfect multicollinearity is present, the data matrix $X$ has less than full rank, and therefore the moment matrix $X^\mathsf{T}X$ cannot be inverted. Under these circumstances, for a general linear model $y = X\beta + \varepsilon$, the ordinary least squares estimator $\hat{\beta}_{OLS} = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y$ does not exist.
In any case, multicollinearity is a characteristic of the data matrix, not the underlying statistical model. Since it is generally more severe in small samples, Arthur Goldberger went so far as to call it "micronumerosity."
Definition
Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. For example, $X_1$ and $X_2$ are perfectly collinear if there exist parameters $\lambda_0$ and $\lambda_1$ such that, for all observations $i$, we have

$$X_{2i} = \lambda_0 + \lambda_1 X_{1i}.$$

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if, for example as in the equation above, the correlation between two independent variables is equal to 1 or −1. In practice, we rarely face perfect multicollinearity in a data set. More commonly, the issue of multicollinearity arises when there is an approximate linear relationship among two or more independent variables.
Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships among some of the variables. For example, we may have

$$\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} = 0$$

holding for all observations $i$, where the $\lambda_j$ are constants and $X_{ji}$ is the $i$th observation on the $j$th explanatory variable. We can explore one issue caused by multicollinearity by examining the process of attempting to obtain estimates for the parameters of the multiple regression equation

$$Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i.$$
The ordinary least squares estimates involve inverting the matrix

$$X^\mathsf{T} X,$$

where

$$X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}$$

is an $N \times (k+1)$ matrix, where $N$ is the number of observations and $k$ is the number of explanatory variables. If there is an exact linear relationship among the independent variables, at least one of the columns of $X$ is a linear combination of the others, so the rank of $X$ is less than $k+1$ and the matrix $X^\mathsf{T}X$ will not be invertible.
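As an illustration, the following minimal Python sketch (using hypothetical simulated data) builds a design matrix whose third column is an exact linear combination of the constant and the first regressor; the rank falls below $k+1$ and the moment matrix cannot be reliably inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50

x1 = rng.normal(size=N)
x2 = 3.0 + 2.0 * x1                          # exact linear combination of the constant and x1
X = np.column_stack([np.ones(N), x1, x2])    # N x (k+1) design matrix with k = 2

print(np.linalg.matrix_rank(X))              # 2 < k+1 = 3: less than full column rank
XtX = X.T @ X
print(np.linalg.det(XtX))                    # essentially zero: X'X is singular

try:
    np.linalg.inv(XtX)                       # may raise, or silently return a meaningless "inverse"
except np.linalg.LinAlgError as err:
    print("X'X is not invertible:", err)
```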
Perfect multicollinearity is fairly common when working with raw datasets, which frequently contain redundant information. Once redundancies are identified and removed, however, nearly multicollinear variables often remain due to correlations inherent in the system being studied. In such a case, instead of the above equation holding exactly, we have that equation in modified form with an error term $v_i$:

$$\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} + v_i = 0.$$

In this case, there is no exact linear relationship among the variables, but the variables are nearly perfectly multicollinear if the variance of $v_i$ is small for some set of values of the $\lambda$'s. In this case, the matrix $X^\mathsf{T}X$ has an inverse, but it is ill-conditioned, so a given computer algorithm may or may not be able to compute an approximate inverse; if it does, the resulting computed inverse may be highly sensitive to slight variations in the data and so may be very inaccurate or very sample-dependent.
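A small variation on the earlier sketch (again with hypothetical simulated data) shows the nearly collinear case: $X^\mathsf{T}X$ is now invertible in principle, but its condition number is enormous, and the normal-equations solution can shift noticeably under a tiny perturbation of the data even though the fitted values barely move.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x1 = rng.normal(size=N)
x2 = 2.0 * x1 + rng.normal(scale=1e-3, size=N)    # nearly collinear: small error term v_i
X = np.column_stack([np.ones(N), x1, x2])
y = 1.0 + x1 + x2 + rng.normal(scale=0.01, size=N)

print(np.linalg.cond(X))                          # very large condition number

beta = np.linalg.solve(X.T @ X, X.T @ y)          # normal-equations estimate

# Re-estimate after a slight perturbation of y: the coefficients on the
# nearly collinear pair x1 and x2 can move substantially.
y_perturbed = y + rng.normal(scale=0.01, size=N)
beta_perturbed = np.linalg.solve(X.T @ X, X.T @ y_perturbed)
print(beta, beta_perturbed)

# The fitted values, by contrast, change very little.
print(np.linalg.norm(X @ (beta_perturbed - beta)))
```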
Detection of multicollinearity
Indicators that multicollinearity may be present in a model include the following:
- Large changes in the estimated regression coefficients when a predictor variable is added or deleted
- Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the joint hypothesis that those coefficients are all zero
- If a multivariable regression finds an insignificant coefficient of a particular explanator, yet a simple linear regression of the explained variable on this explanatory variable shows its coefficient to be significantly different from zero, this situation indicates multicollinearity in the multivariable regression.
- Some authors have suggested a formal detection tolerance or the variance inflation factor (VIF) for multicollinearity: $\text{tolerance} = 1 - R_j^2$ and $\text{VIF} = 1/\text{tolerance}$, where $R_j^2$ is the coefficient of determination of a regression of explanator $j$ on all the other explanators. A tolerance below 0.20 or 0.10, or equivalently a VIF of 5 or 10 and above, indicates a multicollinearity problem (see the sketch after this list).
- Farrar–Glauber test: If the variables are found to be orthogonal, there is no multicollinearity; if the variables are not orthogonal, then at least some degree of multicollinearity is present. C. Robert Wichers has argued that the Farrar–Glauber partial correlation test is ineffective in that a given partial correlation may be compatible with different multicollinearity patterns. The Farrar–Glauber test has also been criticized by other researchers.
- Condition number test: The standard measure of ill-conditioning in a matrix is the condition index. A large condition index indicates that inversion of the matrix is numerically unstable with finite-precision numbers, and hence that the computed inverse is potentially sensitive to small changes in the original matrix. The condition number is computed as the square root of the ratio of the maximum to the minimum eigenvalue of the moment matrix $X^\mathsf{T}X$, or equivalently the ratio of the largest to the smallest singular value of the design matrix $X$ (it is also computed in the sketch after this list). If the condition number is above 30, the regression may have severe multicollinearity; multicollinearity exists if, in addition, two or more of the variables related to the high condition number have high proportions of variance explained. One advantage of this method is that it also shows which variables are causing the problem.
- Perturbing the data. Multicollinearity can be detected by adding random noise to the data, re-running the regression many times, and seeing how much the coefficients change.
- Construction of a correlation matrix among the explanatory variables will yield indications as to the likelihood that any given pair of right-hand-side variables are creating multicollinearity problems. Correlation values of at least 0.4 are sometimes interpreted as indicating a multicollinearity problem. This procedure is, however, highly problematic and cannot be recommended: correlation describes a bivariate relationship, whereas collinearity is a multivariate phenomenon, so collinearity involving three or more variables can go undetected.
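As a rough illustration of the tolerance/VIF and condition-number diagnostics above, the following Python sketch (using hypothetical simulated data) computes both. It relies on the fact that, for standardized predictors, the VIF of each variable equals the corresponding diagonal element of the inverse correlation matrix, and that the condition number is the ratio of the largest to the smallest singular value of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200

# Hypothetical predictors: x3 is close to a linear combination of x1 and x2.
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
x3 = x1 + x2 + rng.normal(scale=0.05, size=N)
predictors = np.column_stack([x1, x2, x3])

# Variance inflation factors: VIF_j = 1 / (1 - R_j^2), which for standardized
# predictors equals the j-th diagonal element of the inverse correlation matrix.
corr = np.corrcoef(predictors, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print("VIF:", vif)                              # values of 5-10 or more suggest a problem

# Condition number of the design matrix (including the intercept column),
# with the predictors standardized first so scale differences do not dominate.
Z = (predictors - predictors.mean(axis=0)) / predictors.std(axis=0)
X = np.column_stack([np.ones(N), Z])
print("condition number:", np.linalg.cond(X))   # values above ~30 are a warning sign
```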
Consequences of multicollinearity
In the presence of multicollinearity, the estimate of one variable's impact on the dependent variable while controlling for the others tends to be less precise than if the predictors were uncorrelated with one another. The usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one-unit change in an independent variable, $X_1$, holding the other variables constant. If $X_1$ is highly correlated with another independent variable, $X_2$, in the given data set, then we have a set of observations for which $X_1$ and $X_2$ have a particular linear stochastic relationship. We don't have a set of observations for which all changes in $X_1$ are independent of changes in $X_2$, so we have an imprecise estimate of the effect of independent changes in $X_1$.
In some sense, the collinear variables contain the same information about the dependent variable. If nominally "different" measures actually quantify the same phenomenon then they are redundant. Alternatively, if the variables are accorded different names and perhaps employ different numeric measurement scales but are highly correlated with each other, then they suffer from redundancy.
One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large. In that case, the test of the hypothesis that the coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanator, a type II error.
Another issue with multicollinearity is that small changes to the input data can lead to large changes in the model, even resulting in changes of sign of parameter estimates.
A principal danger of such data redundancy is that of overfitting in regression analysis models. The best regression models are those in which the predictor variables each correlate highly with the dependent variable but correlate at most only minimally with each other. Such a model is often called "low noise" and will be statistically robust.
So long as the underlying specification is correct, multicollinearity does not actually bias results; it just produces large standard errors in the related independent variables. More importantly, the usual use of regression is to take coefficients from the model and then apply them to other data. Since multicollinearity causes imprecise estimates of coefficient values, the resulting out-of-sample predictions will also be imprecise. And if the pattern of multicollinearity in the new data differs from that in the data that was fitted, such extrapolation may introduce large errors in the predictions.
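These consequences can be made concrete with a short simulation under a hypothetical data-generating process: across repeated samples with two highly correlated predictors, the individual coefficient estimates vary wildly and can flip sign, while predictions made at points that follow the same multicollinearity pattern remain comparatively stable.

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 100, 500
betas, preds = [], []

for _ in range(reps):
    x1 = rng.normal(size=N)
    x2 = x1 + rng.normal(scale=0.05, size=N)        # x2 highly correlated with x1
    y = 2.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=N)
    X = np.column_stack([np.ones(N), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    betas.append(b)
    # Prediction at a point consistent with the collinearity pattern (x2 close to x1).
    preds.append(b @ np.array([1.0, 1.0, 1.0]))

betas = np.array(betas)
print("std of individual coefficient estimates:", betas.std(axis=0))   # large for x1 and x2
print("std of prediction at (x1, x2) = (1, 1):", np.std(preds))        # comparatively small
```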
Remedies for multicollinearity
- Make sure you have not fallen into the dummy variable trap; including a dummy variable for every category and including a constant term in the regression together guarantee perfect multicollinearity.
- Try seeing what happens if you use independent subsets of your data for estimation and apply those estimates to the whole data set. Theoretically you should obtain somewhat higher variance from the smaller datasets used for estimation, but the expectation of the coefficient values should be the same. Naturally, the observed coefficient values will vary, but look at how much they vary.
- Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't affect the efficiency of extrapolating the fitted model to new data provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.
- Drop one of the variables. An explanatory variable may be dropped to produce a model with significant coefficients. However, you lose information. Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable.
- Obtain more data, if possible. This is the preferred solution. More data can produce more precise parameter estimates, as seen from the formula in variance inflation factor for the variance of the estimate of a regression coefficient in terms of the sample size and the degree of multicollinearity.
- Mean-center the predictor variables. Generating polynomial terms or interaction terms can cause some multicollinearity if the variable in question has a limited range. Mean-centering will eliminate this special kind of multicollinearity. However, mean-centering has no effect on multicollinearity in general. It can be useful in overcoming problems arising from rounding and other computational steps if a carefully designed computer program is not used.
- Standardize your independent variables. This may help reduce a false flagging of a condition index above 30.
- It has also been suggested that the Shapley value, a game-theory tool, could be used to account for the effects of multicollinearity; it assigns an importance to each predictor by assessing its contribution over all possible combinations of predictors.
- Ridge regression, principal component regression, or partial least squares regression can be used (a short ridge-regression sketch follows this list).
- If the correlated explanators are different lagged values of the same underlying explanator, then a distributed lag technique can be used, imposing a general structure on the relative values of the coefficients to be estimated.
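As a sketch of the ridge-regression remedy mentioned above (using hypothetical simulated data and scikit-learn, assuming it is available), a small L2 penalty stabilizes the coefficient estimates of two nearly collinear predictors at the cost of a little bias.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(4)
N = 100
x1 = rng.normal(size=N)
x2 = x1 + rng.normal(scale=0.05, size=N)     # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=N)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)           # alpha is the L2 penalty strength

print("OLS coefficients:  ", ols.coef_)      # can land far from (1, 1), e.g. one large, one negative
print("Ridge coefficients:", ridge.coef_)    # shrunk toward each other, typically near (1, 1)
```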
Examples of contexts in which multicollinearity arises