What is Collineartiy and What can we do about it?

In this post we will attempt to better understand what Collinearity is what some thoughts on how we can deal with it. As part of this exploration we will use the Data from the RBA’s website and look into the Commodity Price Index (https://rba.gov.au/statistics/frequency/commodity-prices/2020/).

For the purposes of this discussion we will look to better understand which elements influence the Commodity Index influence the most.

What is Collinearity

Collinearity exists where there is a high correlation between predictor variables in a linear regression model. This means that when highly correlated variables are used in the same model it is hard to determine which variable are influencing the dependant variable. It is of particular concern when there is an increase in the p value of one predictor when another predictor variable is added. It ultimately undermines the strength of the prediction within the model as the effects the variables are having on the dependant variable are lost in the ‘noise’.

The Data

The Data Set has been downloaded from the RBA website as above. It is the list of the Commodity price indices from the early ’80’s to the current month. The observations are monthly and they are the average price across the month the observation is made. There variables we are concerned with are:

Commodity Prices
Rural Commodity Prices
Non-Rural Commodity Prices
Base Metals Prices
Bulk Commodities Prices

The intent is to determine which of Rural, Non-Rural, Base Metals and Bulk Commodities prices influence the Commodity Price Index the most. We will assume that we have imported and tidied the Data and the Data set we will work with is below. For the purposes of this we wont consider in detail the time aspect of this problem, this will be covered in a subsequent post…

## # A tibble: 6 x 6
##   Date       Commodity_Prices Rural_Commodity… Non_Rural_Commo… Base_Metals_Pri…
##   <date>                <dbl>            <dbl>            <dbl>            <dbl>
## 1 1982-07-01             28.9             39.3             28.1             24.7
## 2 1982-08-01             29.2             38.9             28.8             25.1
## 3 1982-09-01             30.3             39.7             30.3             25.2
## 4 1982-10-01             30.6             40.2             30.5             25.1
## 5 1982-11-01             30.6             40.3             30.6             24.1
## 6 1982-12-01             30.1             39.4             30.2             23.4
## # … with 1 more variable: Bulk_Commodities_Prices <dbl>

An initial inspection of the Variables

We will do a basic analysis of the variables. To do this the skim function from the skimr package. From this we can see that we have no missing Data, the highest mean is Rural Commodity Price and the lowest mean is the Bulk Commodity Price. The Standard Deviation for the Rural Commodity Price is the lowest and the Bulk Commodities standard deviation is the highest.

df3 %>%
  skim()

Data summary
Name	Piped data
Number of rows	454
Number of columns	5
_______________________
Column type frequency:
numeric	5
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Commodity_Prices	1	57.30	24.63	28.9	37.7	43.35	75.95	125.0	▇▂▂▂▁
Rural_Commodity_Prices	1	67.79	15.80	38.9	56.1	65.50	80.27	107.4	▅▇▅▅▂
Non_Rural_Commodity_Prices	1	55.18	26.96	27.9	33.8	38.75	75.40	132.2	▇▁▂▂▁
Base_Metals_Prices	1	65.42	24.94	23.4	46.7	60.50	80.60	149.5	▆▇▅▂▁
Bulk_Commodities_Prices	1	52.76	31.63	23.5	29.6	33.60	72.58	159.5	▇▂▂▁▁

How do the variables interact with each other

We can see from the table below that all of the variables are positively correlated and generally the correlation level is strong.

(corrdf3 <- round(cor(df3), 3))

##                            Commodity_Prices Rural_Commodity_Prices
## Commodity_Prices                      1.000                  0.862
## Rural_Commodity_Prices                0.862                  1.000
## Non_Rural_Commodity_Prices            0.998                  0.828
## Base_Metals_Prices                    0.774                  0.777
## Bulk_Commodities_Prices               0.976                  0.766
##                            Non_Rural_Commodity_Prices Base_Metals_Prices
## Commodity_Prices                                0.998              0.774
## Rural_Commodity_Prices                          0.828              0.777
## Non_Rural_Commodity_Prices                      1.000              0.756
## Base_Metals_Prices                              0.756              1.000
## Bulk_Commodities_Prices                         0.984              0.646
##                            Bulk_Commodities_Prices
## Commodity_Prices                             0.976
## Rural_Commodity_Prices                       0.766
## Non_Rural_Commodity_Prices                   0.984
## Base_Metals_Prices                           0.646
## Bulk_Commodities_Prices                      1.000

Graph for visual representation

Another way to view this is in a table format using the corrplot function. We can see below that Commodity Price index is perfectly correlated with Non-Rural Commodity Prices and it has a very strong correlation with Bulk Commodity Prices as well. The weakest correlation is with the Base Metals Prices and even this is consider a strong correlation. Probably of more interest for the purpose of this is how the variables interact with each other. We can see that they appear to all be strongly correlated with not only the Commodity Price but with each other. The Bulk Commodity Price has the weakest correlation with the Base Metals Price.

corrplot(corrdf3, method = "number")

Scatter plots with linear modelling lines

Let’s graph the relationship of the Commodity Prices with each of the inputs to visually understand this.

ggplot(df3, aes(x = Rural_Commodity_Prices, y = Commodity_Prices)) +
  geom_point() +
  labs(x = "Rural Commodity Prices $A", y = "Commodity Prices $A", title = "Rural Commodity Prices and Commodity Price") +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

ggplot(df3, aes(x = Non_Rural_Commodity_Prices, y = Commodity_Prices)) +
  geom_point() +
  labs(x = "Non - Rural Commodity Prices $A", y = "Commodity Prices $A", title = "Non - Rural Commodity Prices and Commodity Price") +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

ggplot(df3, aes(x = Base_Metals_Prices, y = Commodity_Prices)) +
  geom_point() +
  labs(x = "Base Metals Commodity Prices $A", y = "Commodity Prices $A", title = "Base Metals Commodity Prices and Commodity Price") +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

ggplot(df3, aes(x = Bulk_Commodities_Prices, y = Commodity_Prices)) +
  geom_point() +
  labs(x = "Bulk Commodity Prices $A", y = "Commodity Prices $A", title = "Bulk Commodity Prices and Commodity Price") +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

We can see the Strength of correlation is highlighted in the scatter plots above.

Initial model to see how it performs

commodity_model <- lm(Commodity_Prices ~ Rural_Commodity_Prices + Non_Rural_Commodity_Prices + Base_Metals_Prices + Bulk_Commodities_Prices, data = df3)

commodity_model

## 
## Call:
## lm(formula = Commodity_Prices ~ Rural_Commodity_Prices + Non_Rural_Commodity_Prices + 
##     Base_Metals_Prices + Bulk_Commodities_Prices, data = df3)
## 
## Coefficients:
##                (Intercept)      Rural_Commodity_Prices  
##                   -0.22366                     0.17716  
## Non_Rural_Commodity_Prices          Base_Metals_Prices  
##                    0.73033                     0.02606  
##    Bulk_Commodities_Prices  
##                    0.06669

summary(commodity_model)

## 
## Call:
## lm(formula = Commodity_Prices ~ Rural_Commodity_Prices + Non_Rural_Commodity_Prices + 
##     Base_Metals_Prices + Bulk_Commodities_Prices, data = df3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8202 -0.3407  0.1824  0.4270  1.0986 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -0.223661   0.148211  -1.509    0.132    
## Rural_Commodity_Prices      0.177158   0.003707  47.784  < 2e-16 ***
## Non_Rural_Commodity_Prices  0.730332   0.013288  54.960  < 2e-16 ***
## Base_Metals_Prices          0.026055   0.003136   8.309 1.15e-15 ***
## Bulk_Commodities_Prices     0.066686   0.009426   7.075 5.79e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6095 on 449 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 1.849e+05 on 4 and 449 DF,  p-value: < 2.2e-16

(step_commodity_model <- stepAIC(commodity_model, direction = "forward"))

## Start:  AIC=-444.66
## Commodity_Prices ~ Rural_Commodity_Prices + Non_Rural_Commodity_Prices + 
##     Base_Metals_Prices + Bulk_Commodities_Prices

## 
## Call:
## lm(formula = Commodity_Prices ~ Rural_Commodity_Prices + Non_Rural_Commodity_Prices + 
##     Base_Metals_Prices + Bulk_Commodities_Prices, data = df3)
## 
## Coefficients:
##                (Intercept)      Rural_Commodity_Prices  
##                   -0.22366                     0.17716  
## Non_Rural_Commodity_Prices          Base_Metals_Prices  
##                    0.73033                     0.02606  
##    Bulk_Commodities_Prices  
##                    0.06669

Results of the first model

We can see from the results above are as expected, there is an over fitting of the model. The very low P value and the very high T for each of the variables after the other variables have been added to the model. This shows that there is partial effect of adding that variable to the model, this works when there is a relatively small number of variables, as the number of variables increases then this method is less reliable. This is where we start to consider methods to determine which variables should be selected in model. The Akaike Information Criterion (AIC) along with the adjusted R² can assist in deciding the variables that should be included in the model. The most methodical way is to run models with each of the variables individually then in various combinations with each other, with a total of 5² models to consider, this is neither efficient nor practical.

Forward Stepwise selection

The modelling in R helps with some of these issues, the forward stepwise selection will add the variables to the model in the order it considers to be most relevant. This can help determine which of the variables has the most influence on the outcome variable in priority order in additional improvement order. Based on this, simply taking the last added variable out of the model can help reduce some of the collineratity.

Backward Stepwise selection

This does a similar thing to forward stepwise but in reverse, it builds the complete model then removes the variables that have the least effect on the model based on the RSS.

Considering AIC

Another way to consider the variable selection in set of variables that are highly correlated is to, the smaller the AIC the more influence on the model the variable will have. This is the opposite of R².

The way forward

As we can see from the results above it would appear that Non-Rural Commodity Prices and Rural Commodity Prices have the most influence on the Commodity Price Index. When we compare this finding with the correlation between these two variables we can see that they are highly correlated with a value of 0.83 - the second highest value between the variables. With this in mind let’s choose one of them and drop the other one from the model to see how it will perform - we will drop Non Rural Commodity Prices. This choice is further reinforced when we consider the relationship between Bulk Commodity Prices and Non-Rural Commodity Prices, these two items are very highly correlated. Therefore, in order to reduce some of the collinear relationships within the model we will drop Non-Rural Commodity Prices to see the effects.

commodity_model1 <- lm(Commodity_Prices ~ Rural_Commodity_Prices +  Base_Metals_Prices + Bulk_Commodities_Prices, data = df3)

commodity_model1

## 
## Call:
## lm(formula = Commodity_Prices ~ Rural_Commodity_Prices + Base_Metals_Prices + 
##     Bulk_Commodities_Prices, data = df3)
## 
## Coefficients:
##             (Intercept)   Rural_Commodity_Prices       Base_Metals_Prices  
##                 -1.2233                   0.2529                   0.1657  
## Bulk_Commodities_Prices  
##                  0.5788

summary(commodity_model1)

## 
## Call:
## lm(formula = Commodity_Prices ~ Rural_Commodity_Prices + Base_Metals_Prices + 
##     Bulk_Commodities_Prices, data = df3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8219 -1.0224 -0.0236  0.7838  4.3995 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.223254   0.408435  -2.995   0.0029 ** 
## Rural_Commodity_Prices   0.252946   0.009556  26.469   <2e-16 ***
## Base_Metals_Prices       0.165722   0.005101  32.486   <2e-16 ***
## Bulk_Commodities_Prices  0.578838   0.003940 146.928   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.692 on 450 degrees of freedom
## Multiple R-squared:  0.9953, Adjusted R-squared:  0.9953 
## F-statistic: 3.184e+04 on 3 and 450 DF,  p-value: < 2.2e-16

(step_commodity_model1 <- stepAIC(commodity_model1, direction = "forward"))

## Start:  AIC=481.67
## Commodity_Prices ~ Rural_Commodity_Prices + Base_Metals_Prices + 
##     Bulk_Commodities_Prices

## 
## Call:
## lm(formula = Commodity_Prices ~ Rural_Commodity_Prices + Base_Metals_Prices + 
##     Bulk_Commodities_Prices, data = df3)
## 
## Coefficients:
##             (Intercept)   Rural_Commodity_Prices       Base_Metals_Prices  
##                 -1.2233                   0.2529                   0.1657  
## Bulk_Commodities_Prices  
##                  0.5788

As we can see from the results above, this has changed the influence the variables are having on the model but the overall result is still a ‘good’ result.

Conclusion

As we can see from the models above, removing collinear effects from a multiple-regression model can be challenging. One small change can influence the outcome of the model significantly. A combination of both model analysis and variable interaction is a way to determine the effects of the variables on the model. Other, more advanced ways to determine the strength of a model is to use methods such as Ridge Regression and The Lasso. These two methods are forms of shrinkage methods.