Example 1: Predicting the mark by books and attendance

Data from J.Miles and M.Shevlin (2000)

library(foreign)
df <- read.csv("data1_1.csv")
model1 <- lm(grade ~ books, data = df)
#summary(model1)
model2 <- lm(grade ~ attend, data = df)
#summary(model2)
model3 <- lm(grade ~ books + attend, data = df)
#summary(model3)
df$attend_cent <- scale(df$attend, center = T, scale = F) # centering attendance
df$attend_cent <- as.numeric(df$attend_cent)
model4 <- lm(grade ~ attend_cent, data = df)
model5 <- lm(grade ~ attend_cent + books, data = df)
#summary(model5)
anova(model4, model5)

## Analysis of Variance Table
## 
## Model 1: grade ~ attend_cent
## Model 2: grade ~ attend_cent + books
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     38 8353.4                              
## 2     37 7306.2  1    1047.1 5.3028 0.02701 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(sjPlot)
tab_model(model5)

	grade
Predictors	Estimates	CI	p
(Intercept)	55.48	47.34 – 63.61	<0.001
attend cent	1.28	0.13 – 2.43	0.035
books	4.04	0.60 – 7.47	0.027
Observations	40
R² / adjusted R²	0.329 / 0.292

Question: Is books or attend more important? Use beta coefficients to compare the effects within one model.

library(QuantPsyc)
lm.beta(model5)

## attend_cent       books 
##   0.3286422   0.3460987

sjPlot::plot_model(model5, type = "std")

Shown are betas, standardised coefficients. They can be compared across models and should be reported along with non-standardised coefficient.

The graph shows that the beta coefficient for books is slightly higher than for attendance. But the confidence intervals are really wide, therefore, both variables can predict “grade” to the same extent here.

Deliver models in a table (1)

library(stargazer)
stargazer(model1, model4, model5, type = "text")

## 
## =======================================================================================
##                                             Dependent variable:                        
##                     -------------------------------------------------------------------
##                                                    grade                               
##                              (1)                    (2)                    (3)         
## ---------------------------------------------------------------------------------------
## books                      5.738***                                      4.037**       
##                            (1.647)                                       (1.753)       
##                                                                                        
## attend_cent                                       1.883***               1.283**       
##                                                   (0.555)                (0.587)       
##                                                                                        
## Constant                  52.075***              63.550***              55.476***      
##                            (4.035)                (2.344)                (4.151)       
##                                                                                        
## ---------------------------------------------------------------------------------------
## Observations                  40                     40                    40          
## R2                          0.242                  0.233                  0.329        
## Adjusted R2                 0.222                  0.212                  0.292        
## Residual Std. Error    14.735 (df = 38)       14.826 (df = 38)      14.052 (df = 37)   
## F Statistic         12.130*** (df = 1; 38) 11.512*** (df = 1; 38) 9.059*** (df = 2; 37)
## =======================================================================================
## Note:                                                       *p<0.1; **p<0.05; ***p<0.01

tab_model(model1, model4, model5)

	grade			grade			grade
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	52.07	44.17 – 59.98	<0.001	63.55	58.96 – 68.14	<0.001	55.48	47.34 – 63.61	<0.001
books	5.74	2.51 – 8.97	0.001				4.04	0.60 – 7.47	0.027
attend cent				1.88	0.80 – 2.97	0.002	1.28	0.13 – 2.43	0.035
Observations	40			40			40
R² / adjusted R²	0.242 / 0.222			0.233 / 0.212			0.329 / 0.292

Deliver models in a plot (2)

plot_model(model5, type = "pred")

## $attend_cent

## 
## $books

Let’s try interaction:

model6 <- lm(grade ~ attend_cent * books, data = df)
tab_model(model6)

	grade
Predictors	Estimates	CI	p
(Intercept)	53.29	45.25 – 61.34	<0.001
attend cent	-0.14	-1.86 – 1.58	0.877
books	4.15	0.87 – 7.44	0.018
attend_cent:books	0.73	0.05 – 1.42	0.042
Observations	40
R² / adjusted R²	0.402 / 0.352

anova(model5, model6)

## Analysis of Variance Table
## 
## Model 1: grade ~ attend_cent + books
## Model 2: grade ~ attend_cent * books
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     37 7306.2                              
## 2     36 6506.4  1    799.78 4.4252 0.04246 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model with interaction (model 6) is better than the additive model (model 5).

How can one summarise the “pure” relationship between books and grade, taking attendance into account (“net of attendance”)?

One way is to calculate the “average marginal effect”. Another way is to show plots. Use either way when suitable.

The principle of calculating marginal effects is shown in the following gif:

Nick HK @nickchk https://twitter.com/nickchk/status/1067136325751529472

Marginal effects show the expected change in the outcome associate with 1-unit change of the predictor, taking into account all the other predictors. In other words, it is the “pure” effect of predictors.

library(margins)
margins(model6)

##  attend_cent books
##        1.333 4.155

margins(model6, at = list(books = 0:4))

##  at(books) attend_cent books
##          0     -0.1368 4.155
##          1      0.5982 4.155
##          2      1.3331 4.155
##          3      2.0680 4.155
##          4      2.8030 4.155

library(interplot)
interplot(m = model6, var1 = "attend_cent", var2 = "books") + xlab("Books read") + 
    ylab("Estimated Coefficient for Attendance") + ggtitle("Estimated Coefficient of Books on Grade by Attendance") + 
    theme(plot.title = element_text(face = "bold")) + geom_hline(yintercept = 0, 
    linetype = "dashed") + theme_bw()

Both the table and the plot convey the same message: The higher the number of books read by a student, the higher the positive relationship between attendance and grade.

Interaction Plots: Predicted Values of Y

When there are two continuous variables in an interaction effect, the moderator variable is chosen by researcher and there are a few options of how to show it:

plot_model(model6, type = "pred", terms = c("books", "attend_cent[-8, 5]"), 
    axis.lim = c(30, 100))  # set y-axis limits for comparing the plots

plot_model(model6, type = "int", terms = "attend_cent", mdrt.values = "minmax", 
    axis.lim = c(30, 100))

plot_model(model6, type = "int", terms = "attend_cent", mdrt.values = "meansd", 
    axis.lim = c(30, 100))

plot_model(model6, type = "int", terms = "attend_cent", mdrt.values = "quart", 
    axis.lim = c(30, 100))

When the student has attended at least the average number of classes, there is a statistically significant difference (for the grade) between reading a minimal (zero) and a maximal (four) books.

However, there is no statistically significant difference by attendance if the student has read an average number of books, +- 1 standard deviation (from .5 of a book to 3.4 books). The same picture appears when comparing students by quartiles of book reading.

In other words, the difference is substantial only for the most and least reading students.

Example 2: Modelling the ideal retirement age

Data from the ESS round 5

library(foreign)
data <- read.spss("ESS5e03.sav", to.data.frame = TRUE)
savevars <- c("eduyrs", "agea", "agertr", "gndr", "cntry", "icmnact", "hinctnta", 
    "domicil")
ESS1 <- data[savevars]
rm(data, savevars)
str(ESS1, give.attr = F)

## 'data.frame':    52458 obs. of  8 variables:
##  $ eduyrs  : Factor w/ 45 levels "0","1","2","3",..: 16 16 14 16 16 14 18 14 13 13 ...
##  $ agea    : Factor w/ 88 levels "14","15","16",..: 9 30 6 10 45 49 13 27 35 55 ...
##  $ agertr  : Factor w/ 77 levels "0","1","2","5",..: NA NA NA NA 45 45 NA NA 43 46 ...
##  $ gndr    : Factor w/ 2 levels "Male","Female": 1 1 2 2 2 2 1 1 1 1 ...
##  $ cntry   : Factor w/ 27 levels "Belgium","Bulgaria",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ icmnact : Factor w/ 3 levels "In paid work",..: 3 1 3 1 3 2 1 1 1 2 ...
##  $ hinctnta: Factor w/ 10 levels "J","R","C","M",..: NA 5 1 10 6 5 6 2 6 8 ...
##  $ domicil : Factor w/ 5 levels "A big city","Suburbs or outskirts of big city",..: 1 3 1 1 2 2 3 1 4 4 ...

ESS1$agertr <- as.numeric(as.character(ESS1$agertr))
ESS1$eduyrs <- as.numeric(as.character(ESS1$eduyrs))
ESS1$agea <- as.numeric(as.character(ESS1$agea))
ESS1$income <- as.numeric(ESS1$hinctnta)  #decile income, 1 - min, 10 - max

ESS2 <- na.omit(subset(ESS1, icmnact == "In paid work"))
ESS2$agea_c <- ESS2$agea - mean(ESS2$agea)  #center age
ESS2$agea_c <- as.numeric(ESS2$agea_c)
str(ESS2, give.attr = F)

## 'data.frame':    7415 obs. of  10 variables:
##  $ eduyrs  : num  12 14 15 17 8 15 21 15 14 19 ...
##  $ agea    : num  48 59 50 46 48 52 62 50 59 56 ...
##  $ agertr  : num  58 60 62 65 60 56 65 65 60 65 ...
##  $ gndr    : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 2 2 1 ...
##  $ cntry   : Factor w/ 27 levels "Belgium","Bulgaria",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ icmnact : Factor w/ 3 levels "In paid work",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ hinctnta: Factor w/ 10 levels "J","R","C","M",..: 6 8 5 9 4 7 10 7 8 8 ...
##  $ domicil : Factor w/ 5 levels "A big city","Suburbs or outskirts of big city",..: 4 3 4 2 3 4 4 4 2 4 ...
##  $ income  : num  6 8 5 9 4 7 10 7 8 8 ...
##  $ agea_c  : num  -6.08 4.92 -4.08 -8.08 -6.08 ...

m1 <- lm(agertr ~ agea_c, data = ESS2)
# summary(m1)
m2 <- lm(agertr ~ agea_c + gndr, data = ESS2)
# summary(m2)
m3 <- lm(agertr ~ agea_c + gndr + income, data = ESS2)
# summary(m3)
anova(m1, m2, m3)

## Analysis of Variance Table
## 
## Model 1: agertr ~ agea_c
## Model 2: agertr ~ agea_c + gndr
## Model 3: agertr ~ agea_c + gndr + income
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1   7413 198259                                   
## 2   7412 193946  1    4313.0 165.546 < 2.2e-16 ***
## 3   7411 193080  1     865.9  33.236 8.489e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m4 <- lm(agertr ~ agea_c + gndr + income + eduyrs, data = ESS2)
# summary(m4)
m5 <- update(m4, . ~ . + domicil)
anova(m3, m4, m5)

## Analysis of Variance Table
## 
## Model 1: agertr ~ agea_c + gndr + income
## Model 2: agertr ~ agea_c + gndr + income + eduyrs
## Model 3: agertr ~ agea_c + gndr + income + eduyrs + domicil
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   7411 193080                                  
## 2   7410 191464  1    1615.8 63.878 1.524e-15 ***
## 3   7406 187334  4    4130.8 40.826 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The largest model (m5) is also the best model: it explains the most of agertr’s variance (12 per cent vs. 10 per cent in model m4).

tab_model(m4, m5)

	agertr			agertr
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	59.10	58.62 – 59.57	<0.001	58.27	57.73 – 58.82	<0.001
agea c	0.25	0.23 – 0.27	<0.001	0.24	0.22 – 0.26	<0.001
Female	-1.52	-1.76 – -1.29	<0.001	-1.46	-1.69 – -1.23	<0.001
income	0.07	0.02 – 0.12	0.003	0.08	0.03 – 0.12	0.001
eduyrs	0.13	0.10 – 0.17	<0.001	0.13	0.10 – 0.17	<0.001
Suburbs or outskirts of big city				1.87	1.45 – 2.29	<0.001
Town or small city				0.51	0.19 – 0.84	0.002
Country village				0.56	0.23 – 0.89	0.001
Farm or home in countryside				2.62	2.13 – 3.10	<0.001
Observations	7415			7415
R² / adjusted R²	0.102 / 0.101			0.121 / 0.120

lm.beta(m5)

##                                  agea_c 
##                              0.26170497 
##                              gndrFemale 
##                             -0.13600567 
##                                  income 
##                              0.03734194 
##                                  eduyrs 
##                              0.09354962 
## domicilSuburbs or outskirts of big city 
##                              0.43437170 
##               domicilTown or small city 
##                              0.55451400 
##                  domicilCountry village 
##                              0.05258001 
##      domicilFarm or home in countryside 
##                              1.27287355

plot_model(m5, type = "std")

# table(ESS2$domicil)

By standardised coefficients, respondent’s age can predict the desired retirement age better than domicile, years of education, or income.

Interaction between a categorical and a continuous variables

Is the relationship between the respondent’s income and desired retirement age the same across all types of domicile?

m5i <- lm(agertr ~ agea_c + gndr + eduyrs + domicil * income, data = ESS2)
anova(m5, m5i)

## Analysis of Variance Table
## 
## Model 1: agertr ~ agea_c + gndr + income + eduyrs + domicil
## Model 2: agertr ~ agea_c + gndr + eduyrs + domicil * income
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1   7406 187334                                
## 2   7402 186987  4    346.78 3.4319 0.008256 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tab_model(m5, m5i)

	agertr			agertr
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	58.27	57.73 – 58.82	<0.001	58.16	57.38 – 58.94	<0.001
agea c	0.24	0.22 – 0.26	<0.001	0.24	0.22 – 0.26	<0.001
Female	-1.46	-1.69 – -1.23	<0.001	-1.44	-1.67 – -1.21	<0.001
income	0.08	0.03 – 0.12	0.001	0.09	-0.01 – 0.19	0.068
eduyrs	0.13	0.10 – 0.17	<0.001	0.14	0.10 – 0.17	<0.001
Suburbs or outskirts of big city	1.87	1.45 – 2.29	<0.001	2.02	0.88 – 3.17	0.001
Town or small city	0.51	0.19 – 0.84	0.002	0.29	-0.57 – 1.15	0.511
Country village	0.56	0.23 – 0.89	0.001	0.58	-0.29 – 1.44	0.191
Farm or home in countryside	2.62	2.13 – 3.10	<0.001	4.30	3.10 – 5.50	<0.001
domicilSuburbs or outskirts of big city:income				-0.02	-0.19 – 0.14	0.776
domicilTown or small city:income				0.04	-0.09 – 0.16	0.565
domicilCountry village:income				-0.00	-0.13 – 0.13	0.985
domicilFarm or home in countryside:income				-0.29	-0.47 – -0.10	0.002
Observations	7415			7415
R² / adjusted R²	0.121 / 0.120			0.123 / 0.121

sjPlot::plot_model(m5i, type = "int")

margins(m5i)

##  agea_c eduyrs income gndrFemale domicilSuburbs or outskirts of big city
##  0.2404  0.135 0.0781      -1.44                                   1.875
##  domicilTown or small city domicilCountry village
##                     0.5174                 0.5706
##  domicilFarm or home in countryside
##                               2.511

m5i2 <- lm(agertr ~ agea_c + gndr + eduyrs + income * domicil, data = ESS2)
plot_model(m5i2, type = "int")

The model with interaction between domicile type and income fits better than the additive model. It means that the relationship between income and desired retirement age depends on where the respondent reside.

In the countryside, it is the richest who want to retire earlier, whereas in small towns it is the poorest who want to retire earlier.

In big cities, suburbs, or village, the difference is not statistically significant. But we can see that, except for the farms, poor people would like to retire earlier.

Marginal effects show that, according to model 5i, living on a farm (as compared to a big city) has a positive relationship comparable by size to being 10 years older than the average age, for predicting the desired retirement age. In other words, people 10 years older than average age want to retire 2.5 years later than people of average age. People living on a farm want to retire 2.5 years later than those living in a big city.

Interaction between two categorical variables

Is the relationship between the type of domicile and the desired age of retirement the same for men and women?

m5ii <- lm(agertr ~ agea_c + gndr * domicil + eduyrs + income, data = ESS2)
anova(m5, m5ii)

## Analysis of Variance Table
## 
## Model 1: agertr ~ agea_c + gndr + income + eduyrs + domicil
## Model 2: agertr ~ agea_c + gndr * domicil + eduyrs + income
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
## 1   7406 187334                               
## 2   7402 186985  4     348.5 3.449 0.008014 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tab_model(m5, m5ii)

	agertr			agertr
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	58.27	57.73 – 58.82	<0.001	58.60	58.00 – 59.20	<0.001
agea c	0.24	0.22 – 0.26	<0.001	0.24	0.22 – 0.26	<0.001
Female	-1.46	-1.69 – -1.23	<0.001	-2.08	-2.58 – -1.58	<0.001
income	0.08	0.03 – 0.12	0.001	0.08	0.03 – 0.12	0.002
eduyrs	0.13	0.10 – 0.17	<0.001	0.13	0.10 – 0.17	<0.001
Suburbs or outskirts of big city	1.87	1.45 – 2.29	<0.001	1.10	0.51 – 1.68	<0.001
Town or small city	0.51	0.19 – 0.84	0.002	0.17	-0.30 – 0.64	0.476
Country village	0.56	0.23 – 0.89	0.001	0.19	-0.27 – 0.66	0.423
Farm or home in countryside	2.62	2.13 – 3.10	<0.001	2.36	1.70 – 3.03	<0.001
gndrFemale:domicilSuburbs or outskirts of big city				1.55	0.72 – 2.39	<0.001
gndrFemale:domicilTown or small city				0.65	0.01 – 1.30	0.048
gndrFemale:domicilCountry village				0.72	0.07 – 1.38	0.030
gndrFemale:domicilFarm or home in countryside				0.45	-0.52 – 1.43	0.360
Observations	7415			7415
R² / adjusted R²	0.121 / 0.120			0.123 / 0.121

sjPlot::plot_model(m5ii, type = "int")

The interaction (moderation, multiplicative) model fits better than the additive model.

On average, for men there is no difference whether they live in a big city (reference) or in a town/small city, or in a country village - they want to retire at appr. 61. However, men living in suburbs or on a farm would like to retire later, at 62 or 63.

Women living in suburbs or on a farm would like to retire at 61, which is earlier than males, but this is more than one year later than for women living in a big city/town/country village.

In other words, the relationship between gender and the desired retirement age is, indeed, dependent on where the respondents live. Although it is slightly different for males and females (the interaction effect is statistically significant), it is the same types of domicile (farm / suburbs) where both men and women are planning to retire later than in other domicile types.

Data Analysis in Sociology: Interaction Effects in a Linear Regression