2 Multicollinearity

In Finance and Economics, it is not uncommon to consider explanatory variables of a model that are positively or negatively correlated. One of the assumptions in regression models is that the independent variables are not strongly correlated. The reason of this assumption is that if two or more independent variables of a multiple regression model are strongly correlated, then the standard error of their beta coefficients will be very volatile, or very unstable, meaning that for a minor change in the data set (such as incorporating a new observation), then both the beta coefficient of an independent variable that has strong correlation with another, and their standard errors will radically change. This happens because the formula to calculate a standard error of a beta coefficient is a ratio where the denominator is (1−R2i). Then, when the partial coefficient of determination R2i is close to 1, this ratio will tend to be a very big number. Then, in sum, a strong correlation between two independent variables can cause regression coefficients and/or standard errors to be very “inflated”. Then, how can we detect a potential multicollienarity problem? Fortunately, there is a simple measure called Variance Inflation Factor or VIF that can be easily computed in any programming language like R or a statistical software like R. We have to calculate the VIF for each independent variable to identify which independent variables are causing the multicollienarity problem. The formula for VIF of the variable Xi is:

VIFi=11−R2i

Where R2i is the coefficient of determination when running a regression where the Xi is the dependent variable and the rest of the independent variables X’s are the independent variables of this regression.

In R, the VIF for all independent variables can be easily calculated with the vif() function from the car (Companion to Applied Regression) package.

The next question is, what is the minimum value of VIF for a variable to be considered having a multicollienarity problem? The rule of thumb is 10. If VIFi>10, then Xi is causing the multicollienarity problem. With the VIF, we cannot know with which other variable(s) Xi is strongly correlated. We can know this by estimating the correlation matrix of the independent variables. We can run this in R with the cor() function.

The easiest way and the most common solution for multicollienarity is dropping the variable with VIF greater than 10, and re-run the regression and calculate the VIF’s and repeat the process until there is no VIF greater than 10.

The other possible solution is to examine whether there is an omitted variable that might be very correlated to two independent variables. If this is the case, then it is possible to include the ommitted variable instead of these two independent variables. Unfortunately, in real world, it is not easy to find these omitted variables.

2.1 Data collection and data management

You have to do the following:

Download the panel data set of the BMV firms from http://www.apradie.com/datos/datamx2020q4.xlsx. Keep only the observations for the 4th quarter of 2019 of active firms. Then, you will end up with a cross-sectional data set with all BMV firms with accounting and finance items of the financial statements of 2019, quarter 4.

# load the package readxl 
library(readxl)

# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx", "dataw10.xlsx", mode="wb")
# Save the data from the file in an R object
data <- read_excel("dataw10.xlsx")

# Select only active firms:
data <- data[data$status=="active",]
# Keep only the last quarter of 2019
data$quarter <- as.Date(data$quarter)
data <- as.data.frame(data[(data$quarter=="2019-10-01"),])

2.2 Testing for multicollienarity

Run a multiple regression model to examine whether sales (revenue), cost of good sold (cogs), and operating administrative expenses (administrativeexpenses) are significantly related to net income (netincome).

model1 <- lm(netincome ~ revenue + cogs + administrativeexpenses, data = data)
s_model1 <- summary(model1)
s_model1
## 
## Call:
## lm(formula = netincome ~ revenue + cogs + administrativeexpenses, 
##     data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -18083082   -917266   -385004    634700  36387570 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.582e+05  4.681e+05   0.765 0.445656    
## revenue                 1.472e-01  2.346e-02   6.274 6.03e-09 ***
## cogs                   -1.291e-01  3.728e-02  -3.463 0.000745 ***
## administrativeexpenses -1.843e-03  7.802e-02  -0.024 0.981189    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4784000 on 118 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.7385, Adjusted R-squared:  0.7318 
## F-statistic: 111.1 on 3 and 118 DF,  p-value: < 2.2e-16

You have to install the car package.

# Load package car
library(car)
## Loading required package: carData
# Use the fucntion vif(), which receives an lm object as argument
car::vif(model1)
##                revenue                   cogs administrativeexpenses 
##              46.417521              52.547229               4.389234
# Store the results in an object names vif_model1 
# Adding the column of the inverse VIF  (1/VIF) is optional
vif_model1 <- data.frame("VIF" = vif(model1), "inverseVIF" = 1/vif(model1))

# Create vector of VIF values only in order to graph
vif_values <- vif(model1)

# Create horizontal bar chart to display each VIF value
barplot(vif_values, main = "VIF Values", horiz = TRUE, col = "steelblue")
abline(v = 10, lwd = 3, lty = 2)

# Compute correlation between independent variables: 
indvar <- cbind(data$revenue,data$cogs, data$administrativeexpenses)
colnames(indvar) <- c("revenue", "cogs", "administrativeexpenses")
cor(na.omit(indvar))
##                          revenue     cogs administrativeexpenses
## revenue                1.0000000 0.989032              0.8595498
## cogs                   0.9890320 1.000000              0.8770930
## administrativeexpenses 0.8595498 0.877093              1.0000000

Run the VIF for all variables, check for multicollienarity problems, and solve the problem as suggested above. If you end up with a different regression model, run it. If you found multicollienarity problem, does the initial regression coefficients are much bigger than the coefficient(s) in the last regression model after correcting for multicollienarity? # WRITE your response, diagnosis and conclusion of the process.

I run the regression without the COGS

model2 <- lm(netincome ~ revenue + administrativeexpenses, data = data)
summary(model2)
## 
## Call:
## lm(formula = netincome ~ revenue + administrativeexpenses, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -13065922   -887117   -450337    547803  36713025 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.516e+05  4.884e+05   0.925    0.357    
## revenue                 6.937e-02  7.044e-03   9.849   <2e-16 ***
## administrativeexpenses -9.839e-02  7.616e-02  -1.292    0.199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5e+06 on 119 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.7119, Adjusted R-squared:  0.7071 
## F-statistic:   147 on 2 and 119 DF,  p-value: < 2.2e-16
car::vif(model2)
##                revenue administrativeexpenses 
##               3.828862               3.828862

BOTH VIFS ARE <10, MEANING THERE IS NO MULTICOLIENARITY PROBLEM IN THIS MODEL THE COEFFICIEN REVENUE IS 0.069 AND IN THE FIRST ONE IT WAS 0.14.