Data Analysis in the Social Sciences

Linear models with categorical predictors

Let us work with the WGI (Worldwide Governance Indicators) data set again, this time with a modified version. It has one more column: the status of a country assigned by Freedom House (free, partly free, not free).

Load the data, delete rows with missing values, and check the structure of the data frame to make sure all variable types are correct:

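# stringsAsFactors = TRUE reads text columns as factors, as in the output below
# (this is no longer the default since R 4.0)
wgi <- read.csv("http://math-info.hse.ru/f/2018-19/pep/wgi-new.csv",
                stringsAsFactors = TRUE)
wgi <- na.omit(wgi) # delete rows with NAs
str(wgi) # check that the types are correct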

## 'data.frame':    195 obs. of  16 variables:
##  $ X.1        : int  2 3 4 5 7 8 9 10 11 12 ...
##  $ X          : int  2 3 4 6 9 10 12 13 14 15 ...
##  $ country    : Factor w/ 202 levels "Afghanistan",..: 4 1 5 2 7 8 6 10 11 12 ...
##  $ cnt_code   : Factor w/ 202 levels "ABW","ADO","AFG",..: 2 3 4 5 7 8 9 10 11 12 ...
##  $ year       : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ va         : num  1.2 -1.09 -1.17 0.16 0.54 -0.62 0.65 1.3 1.29 -1.6 ...
##  $ ps         : num  1.4 -2.75 -0.39 0.26 0.22 -0.6 1.01 0.96 0.82 -0.87 ...
##  $ ge         : num  1.86 -1.22 -1.04 0 0.18 -0.15 0.27 1.58 1.51 -0.16 ...
##  $ rq         : num  0.87 -1.33 -1 0.19 -0.47 0.25 0.34 1.9 1.44 -0.28 ...
##  $ rl         : num  1.56 -1.62 -1.08 -0.35 -0.35 -0.11 0.51 1.75 1.78 -0.57 ...
##  $ cc         : num  1.23 -1.56 -1.41 -0.4 -0.31 -0.57 0.69 1.77 1.54 -0.87 ...
##  $ fh_score   : num  1 6 6 3 2 4.5 2 1 1 6.5 ...
##  $ not_free   : int  0 1 1 0 0 0 0 0 0 1 ...
##  $ partly_free: int  0 0 0 1 0 1 0 0 0 0 ...
##  $ free       : int  1 0 0 0 1 0 1 1 1 0 ...
##  $ fh_type    : Factor w/ 3 levels "free","not_free",..: 1 2 2 3 1 3 1 1 1 2 ...
##  - attr(*, "na.action")= 'omit' Named int  1 6 44 73 75 113 196
##   ..- attr(*, "names")= chr  "1" "6" "44" "73" ...

Now we will run a model that explains how Voice and Accountability (va) is related to Control of Corruption (cc), taking into account the status of a country assigned by Freedom House (fh_type). The variable fh_type is categorical (a factor), so R will automatically split it into a set of binary dummy variables and add them to the model, excluding the one that corresponds to the base level of fh_type. The base level (the reference level against which the other levels are compared) is also chosen automatically: it is the first level of fh_type when its levels are sorted alphabetically.

model1 <- lm(data = wgi, va ~ cc + fh_type)
summary(model1)

## 
## Call:
## lm(formula = va ~ cc + fh_type, data = wgi)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78496 -0.19360  0.02069  0.17467  0.74308 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.71577    0.03577   20.01   <2e-16 ***
## cc                  0.29037    0.02743   10.59   <2e-16 ***
## fh_typenot_free    -1.86105    0.06481  -28.71   <2e-16 ***
## fh_typepartly_free -0.81188    0.05744  -14.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.286 on 191 degrees of freedom
## Multiple R-squared:  0.919,  Adjusted R-squared:  0.9177 
## F-statistic:   722 on 3 and 191 DF,  p-value: < 2.2e-16

Judging by the output, the base level in our case is free, since its dummy is absent from the model. R dropped it to avoid perfect multicollinearity: together with the intercept, the three dummies would be linearly dependent. So we will compare values of our dependent variable (va) in the other groups with those in free countries.
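The following is a minimal sketch of what R does behind the scenes. It uses a small made-up data frame (toy is hypothetical and is not the WGI file) to show the alphabetical base level, the dummy columns R creates, and why a dummy for the base level cannot be added alongside the intercept.

```r
# Toy data (hypothetical, NOT the WGI file): six countries, two per status
toy <- data.frame(
  va = c(1.2, 0.9, -1.1, -1.4, 0.1, -0.2),
  cc = c(1.0, 0.5, -1.2, -0.8, 0.0, -0.3),
  fh_type = factor(c("free", "free", "not_free", "not_free",
                     "partly_free", "partly_free"))
)

# Levels are sorted alphabetically, so "free" comes first and is the base level
levels(toy$fh_type)

# Design matrix for the formula: intercept, cc,
# and one 0/1 column for each non-base level of fh_type
X <- model.matrix(va ~ cc + fh_type, data = toy)
colnames(X)

# Adding a dummy for the base level as well makes the columns linearly
# dependent: the three dummies add up to the intercept column
X_all <- cbind(X, free = as.integer(toy$fh_type == "free"))
ncol(X_all)      # 5 columns
qr(X_all)$rank   # but only 4 of them are linearly independent
```

Because the rank is smaller than the number of columns, the coefficients of such a model cannot all be estimated, which is why one dummy has to be left out.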

Now let us proceed to interpretation.

The coefficient of fh_typenot_free shows that, all else equal, values of Voice and Accountability (va) are on average 1.86 lower in not free countries than in free countries. This coefficient is statistically significant, so we can reject the hypothesis that there is no difference between these two groups.
The coefficient of fh_typepartly_free shows that, all else equal, values of Voice and Accountability (va) are on average 0.81 lower in partly free countries than in free countries. This coefficient is also statistically significant, so we can reject the hypothesis of no difference here as well.

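The other coefficients are interpreted in the usual way. The intercept (0.72) is the expected value of va for a free country with cc equal to 0. The coefficient of cc shows that, with the Freedom House status held fixed, a one-unit increase in cc is associated with an average increase of 0.29 in va.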
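To make the interpretation concrete, here is a small sketch that computes fitted values by hand from the estimates printed in the summary above. The helper fitted_va() is a hypothetical name introduced only for this illustration; with the fitted model at hand, predict() would do the same job.

```r
# Estimates copied from the summary(model1) output above
b0            <- 0.71577   # intercept: free country with cc = 0
b_cc          <- 0.29037   # slope of cc
b_not_free    <- -1.86105  # shift for not free countries
b_partly_free <- -0.81188  # shift for partly free countries

# Hypothetical helper: fitted va for a given cc and Freedom House status
fitted_va <- function(cc, status) {
  shift <- switch(status,
                  free = 0,
                  not_free = b_not_free,
                  partly_free = b_partly_free)
  b0 + b_cc * cc + shift
}

fitted_va(0, "free")       # 0.71577, the intercept
fitted_va(0, "not_free")   # -1.14528, lower by 1.86105
fitted_va(1, "free") - fitted_va(0, "free")   # 0.29037, the effect of cc
```

The gap between two statuses is the same at every value of cc, because the model has no interaction term: the three groups share one slope and differ only in their intercepts.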

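This model shows that there are significant differences in the Voice and Accountability indicator across the three types of countries. Let us visualise these differences using boxplots (note that boxplots show raw differences between the groups, not adjusted for cc):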

library(ggplot2)

# x-axis: grouping variable (group by fh_type)
# y-axis: variable of interest (distribution of va)

ggplot(data = wgi, 
       aes(x = fh_type, y = va)) +
  geom_boxplot()

During the lecture we discussed that we can choose the base level ourselves. This can be done in R, and it is better to do it outside the model, simply by changing the order of levels in the factor variable. Look at the structure of fh_type first:

str(wgi$fh_type)

##  Factor w/ 3 levels "free","not_free",..: 1 2 2 3 1 3 1 1 1 2 ...

By default, levels (unique values of fh_type) are sorted alphabetically. We can set the order we want by passing a vector of levels in that order to the levels argument of the factor() function:

wgi$fh_type2 <- factor(wgi$fh_type,
                       levels = c("not_free",
                                  "partly_free",
                                  "free"))

We put not_free first, so it becomes the base level. Let’s see the changes:

str(wgi$fh_type2)

##  Factor w/ 3 levels "not_free","partly_free",..: 3 1 1 2 3 2 3 3 3 1 ...
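If only the base level matters and the order of the remaining levels does not, base R's relevel() is an alternative to listing all levels in factor(). The sketch below uses a toy factor (status is a hypothetical name, not a column of wgi) to show what it does.

```r
# Toy factor (hypothetical values) with the default alphabetical level order
status <- factor(c("free", "not_free", "partly_free", "free"))
levels(status)    # "free" "not_free" "partly_free"

# relevel() moves the chosen level to the first position
# and keeps the other levels in their original order
status2 <- relevel(status, ref = "not_free")
levels(status2)   # "not_free" "free" "partly_free"

# The data themselves are unchanged, only the order of levels differs
as.character(status2)
```

Note that the resulting order (not_free, free, partly_free) differs from the one set with factor() above, but the base level is the same, so the model compares free and partly free countries with not free ones in both cases.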

Now we can run the model with the reordered factor variable fh_type2.

model2 <- lm(data = wgi, va ~ cc + fh_type2)
summary(model2)

## 
## Call:
## lm(formula = va ~ cc + fh_type2, data = wgi)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78496 -0.19360  0.02069  0.17467  0.74308 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.14529    0.04615  -24.82   <2e-16 ***
## cc                   0.29037    0.02743   10.59   <2e-16 ***
## fh_type2partly_free  1.04918    0.05547   18.91   <2e-16 ***
## fh_type2free         1.86105    0.06481   28.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.286 on 191 degrees of freedom
## Multiple R-squared:  0.919,  Adjusted R-squared:  0.9177 
## F-statistic:   722 on 3 and 191 DF,  p-value: < 2.2e-16

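The fit is the same (the residuals, the coefficient of cc, R-squared). What changes is the set of dummies and the intercept: now the dummies for partly_free and free are included and not_free is left out as the base level, so the intercept refers to not free countries. Interpretation: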

The coefficient of fh_type2partly_free shows that, all else equal, values of Voice and Accountability (va) are on average 1.05 higher in partly free countries than in not free countries. This coefficient is statistically significant, so we can reject the hypothesis that there is no difference between these two groups.
The coefficient of fh_type2free shows that, all else equal, values of Voice and Accountability (va) are on average 1.86 higher in free countries than in not free countries. This coefficient is also statistically significant, so we can reject the hypothesis of no difference here as well.
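The two models are the same model written with different base levels, so the estimates of one can be recovered from the other. The short check below uses only the numbers printed in the two summaries above; the tiny discrepancies in the last digit come from rounding in the printed output.

```r
# Estimates copied from summary(model1), base level: free
m1 <- c(intercept = 0.71577, not_free = -1.86105, partly_free = -0.81188)
# Estimates copied from summary(model2), base level: not_free
m2 <- c(intercept = -1.14529, partly_free = 1.04918, free = 1.86105)

# Intercept of model2 = intercept of model1 + shift for not free countries
m1[["intercept"]] + m1[["not_free"]]     # -1.14528 (reported: -1.14529)

# Partly free vs not free = difference of the two shifts in model1
m1[["partly_free"]] - m1[["not_free"]]   # 1.04917 (reported: 1.04918)

# Free vs not free is the not_free coefficient with the opposite sign
-m1[["not_free"]]                        # 1.86105
```

This is why changing the base level never changes the fit: it only changes which pairwise differences are shown directly in the coefficient table.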


Alla Tambovtseva
