Today we will go through several steps of evaluating the quality of linear models.
Let us load data on WGI indicators and get rid of missing values:
wgi <- read.csv("https://raw.githubusercontent.com/allatambov/cluster-analysis/master/clust1/wgi_fh.csv", dec = ",")
wgi <- na.omit(wgi)
As we do not need all variables for our model, we can choose some of them via select() from dplyr:
library(dplyr)
wgi_small <- wgi %>% select(fh, va, rq, rl, ge)
Now we can run a model that will try to explain how the Freedom House score (fh, a proxy for the level of freedom in a country) is affected by the level of voice and accountability (va), rule of law (rl), regulatory quality (rq) and government effectiveness (ge).
model <- lm(data = wgi_small, fh ~ va + rq + rl + ge)
summary(model)
##
## Call:
## lm(formula = fh ~ va + rq + rl + ge, data = wgi_small)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61625 -0.24693 -0.03692  0.22139  1.56133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.28897    0.02857 115.117   <2e-16 ***
## va          -2.07400    0.04502 -46.071   <2e-16 ***
## rq           0.12101    0.08830   1.370    0.172    
## rl           0.13139    0.09971   1.318    0.189    
## ge          -0.12479    0.09988  -1.249    0.213    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3976 on 190 degrees of freedom
## Multiple R-squared: 0.9616, Adjusted R-squared: 0.9608
## F-statistic: 1189 on 4 and 190 DF, p-value: < 2.2e-16
If we look at the output, we will see a strange thing: R-squared is very high, very close to 1, but the model contains only one statistically significant coefficient besides the intercept! There are no substantial reasons why it should be like this: it is clear that all these indicators should be related to the level of freedom.
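If you prefer to extract these quantities from the model object rather than read them off the printed summary, they are stored in the summary object (a small optional sketch, not part of the original code):
# pull R-squared and the coefficient p-values out of the summary object
summary(model)$r.squared          # multiple R-squared
summary(model)$coefficients[, 4]  # p-values of the coefficients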
Such a strange situation occurs when our model suffers from multicollinearity, i.e. when the predictors (independent variables) are highly correlated. Let us check!
cor(wgi_small)
##            fh         va         rq         rl         ge
## fh  1.0000000 -0.9796781 -0.6808791 -0.7171862 -0.6452727
## va -0.9796781  1.0000000  0.7213949  0.7566152  0.6800161
## rq -0.6808791  0.7213949  1.0000000  0.9239895  0.9336181
## rl -0.7171862  0.7566152  0.9239895  1.0000000  0.9402080
## ge -0.6452727  0.6800161  0.9336181  0.9402080  1.0000000
All independent variables are highly correlated; some pairs have a correlation coefficient greater than 0.9. This is logical: these indicators come from WGI, a set of indices that measure the quality of government, so they ‘duplicate’ each other to some extent. We can see the same picture if we create a scatterplot matrix via ggpairs():
library(GGally)
ggpairs(wgi_small)
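As an optional programmatic complement to the plot (not part of the original code), we could also list the predictor pairs whose absolute correlation exceeds 0.9; cors here is just an illustrative name:
# keep each predictor pair once and flag those with |correlation| > 0.9
cors <- cor(wgi_small[, c("va", "rq", "rl", "ge")])
diag(cors) <- NA
cors[lower.tri(cors)] <- NA
which(abs(cors) > 0.9, arr.ind = TRUE)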
To understand whether this problem is really serious, we can calculate the VIF (variance inflation factor), which will show us the problematic variables, i.e. those that cause multicollinearity. To do this we will need the car library:
install.packages("car")
Load the library and compute the VIF:
library(car)
vif(model)
##        va        rq        rl        ge 
##  2.471387  9.418184 11.917750 12.016688 
If the VIF is greater than 10, such variables are problematic: they are highly correlated with the others. We should either exclude them from the model (leave only one variable from a pair of correlated variables) or combine them using methods like PCA (which we will discuss later).
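To make the logic behind the VIF more transparent, here is a small optional sketch (not in the original code) of computing it by hand: the VIF of a predictor equals 1 / (1 - R^2) of the auxiliary regression of that predictor on all the other predictors, shown here for rl (r2_rl is just an illustrative name):
# auxiliary regression of rl on the other predictors
r2_rl <- summary(lm(rl ~ va + rq + ge, data = wgi_small))$r.squared
1 / (1 - r2_rl)  # should closely match the value for rl in vif(model)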
We can also get a single measure: calculate the mean VIF for the whole model:
mean(vif(model))
## [1] 8.956002
It is usually thought that an average VIF greater than 5 serves as evidence of multicollinearity.
In our case large VIFs correspond to rq, rl and ge. So, let us exclude some of these variables and look at the summary of a simplified model:
model2 <- lm(data = wgi_small, fh ~ va + rl)
summary(model2)
##
## Call:
## lm(formula = fh ~ va + rl, data = wgi_small)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54328 -0.22815 -0.04739  0.21810  1.59500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.28902    0.02859 115.045   <2e-16 ***
## va          -2.05862    0.04383 -46.972   <2e-16 ***
## rl           0.11428    0.04420   2.585   0.0105 *  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3979 on 192 degrees of freedom
## Multiple R-squared: 0.9611, Adjusted R-squared: 0.9607
## F-statistic: 2373 on 2 and 192 DF, p-value: < 2.2e-16
R-squared is still high, but now all coefficients are significant, as expected!
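As an additional check that was not part of the original materials, the simplified model can be compared with the full one via a partial F-test; it tests whether dropping rq and ge significantly worsens the fit:
# partial F-test comparing the nested models
anova(model2, model)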
Then, let us test whether heteroskedasticity is present in our model. We will use the Breusch-Pagan test from the lmtest library:
# install it
install.packages("lmtest")
# load and use test
library(lmtest)
bptest(model2)
##
## studentized Breusch-Pagan test
##
## data: model2
## BP = 1.0953, df = 2, p-value = 0.5783
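As a visual complement to the test (an optional sketch, not part of the original code): under constant variance, a plot of residuals against fitted values should show no clear fan or funnel pattern.
# residuals vs fitted values for a quick visual check
plot(fitted(model2), resid(model2),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)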
The null hypothesis states that the variance of the residuals is constant, i.e. there is no heteroskedasticity. In our case the p-value is high (0.5783), so the null hypothesis is not rejected: we have no evidence against the assumption of constant residual variance. As we discussed, even when heteroskedasticity is detected, it does not necessarily mean that the model has to be changed: if we have no substantial reasons to suspect it (and we do not see serious patterns in the residual plots), we can leave everything as is. However, if you do want to adjust your model, you can use a different type of standard errors for the coefficients that are robust, i.e. insensitive to heteroskedasticity. To do this you will need the sandwich library:
install.packages("sandwich")
Now you can load and use it:
library(sandwich)
coeftest(model2, vcov = vcovHC(model2, type="HC1"))
##
## t test of coefficients:
##
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  3.289023   0.028269 116.3474  < 2e-16 ***
## va          -2.058624   0.033182 -62.0400  < 2e-16 ***
## rl           0.114276   0.042659   2.6788  0.00803 ** 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comments: we put the model that needs correction inside coeftest() and specify a different type of variance-covariance matrix for the residuals; "HC1" here stands for one type of heteroskedasticity-consistent standard errors.
The output of the code above contains the same coefficient estimates, but with new standard errors and p-values that are robust to heteroskedasticity.
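If you only need the robust standard errors themselves (say, to build a results table by hand), they can be taken as the square roots of the diagonal of the robust variance-covariance matrix (a small optional sketch, not part of the original code):
# robust (HC1) standard errors of the coefficients
sqrt(diag(vcovHC(model2, type = "HC1")))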
Now only one thing is left: to check whether there are influential observations. To detect them more easily we will need graphs from the ggfortify library:
# install it
install.packages("ggfortify")
# load it
library(ggfortify)
The function autoplot() from this library produces a lot of different graphs; we will need only two of them (the fourth and the fifth):
autoplot(model2, which = 4:5)
In the first graph we see the values of Cook’s distance, which is used as a measure of influence. The x-axis shows the observation numbers, the y-axis the values of Cook’s distance; longer vertical lines correspond to higher values. It is commonly thought that influential observations have a Cook’s distance greater than 1, which is not the case here. Although some observations are marked (31, 136, 148), they still do not have much influence.
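The same check can be done numerically (an optional sketch, not in the original code): extract the Cook’s distances and see whether any exceed the conventional cut-off of 1 (cooks is just an illustrative name).
cooks <- cooks.distance(model2)
which(cooks > 1)                      # expected to be empty here
sort(cooks, decreasing = TRUE)[1:3]   # the three largest values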
The second graph is a scatter plot of standardized residuals versus leverage with a trend line added. Influential observations are those that are both non-typical and have high leverage, so we should look for points that lie far to the right and far from the line y = 0. Here, again, we have no clear influential points. If we had them, it would be best to simply exclude such points from the analysis (delete them from the data set), as sketched below.
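If influential points were present, excluding them could look like the following sketch (purely illustrative; influential, wgi_clean and model3 are hypothetical names, and with this data nothing is actually dropped):
# refit the model without observations whose Cook's distance exceeds 1
influential <- which(cooks.distance(model2) > 1)
wgi_clean <- if (length(influential) > 0) wgi_small[-influential, ] else wgi_small
model3 <- lm(fh ~ va + rl, data = wgi_clean)
summary(model3)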