Anna Shirokanova and Olesya Volchenko
df <- read.csv("data1_1.csv")  # base-R read.csv(); the foreign package is not needed for CSV files
model1 <- lm(grade ~ books, data = df)   # grade on the number of books read
# summary(model1)
model2 <- lm(grade ~ attend, data = df)  # grade on attendance
# summary(model2)
model3 <- lm(grade ~ books + attend, data = df)  # additive model
# summary(model3)
df$attend_cent <- scale(df$attend, center = TRUE, scale = FALSE)  # centering attendance at its mean
df$attend_cent <- as.numeric(df$attend_cent)  # scale() returns a matrix; convert back to a plain vector
model4 <- lm(grade ~ attend_cent, data = df)          # grade on centred attendance
model5 <- lm(grade ~ attend_cent + books, data = df)  # additive model with centred attendance
# summary(model5)
anova(model4, model5)
## Analysis of Variance Table
##
## Model 1: grade ~ attend_cent
## Model 2: grade ~ attend_cent + books
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 38 8353.4
## 2 37 7306.2 1 1047.1 5.3028 0.02701 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
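The formatted regression tables below look like sjPlot output; the chunk itself is not shown, but a call along the following lines (an assumed reconstruction) would produce such a table:

library(sjPlot)
tab_model(model5)  # table with estimates, 95% confidence intervals and p-values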
Model 5 (dependent variable: grade):

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 55.48 | 47.34 – 63.61 | <0.001 |
| attend_cent | 1.28 | 0.13 – 2.43 | 0.035 |
| books | 4.04 | 0.60 – 7.47 | 0.027 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.329 / 0.292 | | |
Question: Is books or attend more important? Use beta coefficients to compare the effects within one model.
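The chunk that produced the output below is not shown; one possible way to obtain such standardised coefficients (an assumption, here using the QuantPsyc package) is:

library(QuantPsyc)
lm.beta(model5)  # standardised (beta) coefficients of model 5

An equivalent check is to refit the model on z-scored variables, e.g. coef(lm(scale(grade) ~ scale(attend_cent) + scale(books), data = df)).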
## attend_cent books
## 0.3286422 0.3460987
Shown above are the betas, i.e. standardised coefficients. They allow predictors measured on different scales to be compared within one model and should be reported alongside the unstandardised coefficients.
The graph shows that the beta coefficient for books is slightly higher than that for attendance. However, the confidence intervals are very wide, so both variables predict grade to roughly the same extent here.
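The text table below appears to be stargazer output; a hedged reconstruction of the call (the chunk is not shown) is:

library(stargazer)
stargazer(model1, model4, model5, type = "text")  # books only, centred attendance only, both predictors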
##
## =======================================================================================
## Dependent variable:
## -------------------------------------------------------------------
## grade
## (1) (2) (3)
## ---------------------------------------------------------------------------------------
## books 5.738*** 4.037**
## (1.647) (1.753)
##
## attend_cent 1.883*** 1.283**
## (0.555) (0.587)
##
## Constant 52.075*** 63.550*** 55.476***
## (4.035) (2.344) (4.151)
##
## ---------------------------------------------------------------------------------------
## Observations 40 40 40
## R2 0.242 0.233 0.329
## Adjusted R2 0.222 0.212 0.292
## Residual Std. Error 14.735 (df = 38) 14.826 (df = 38) 14.052 (df = 37)
## F Statistic 12.130*** (df = 1; 38) 11.512*** (df = 1; 38) 9.059*** (df = 2; 37)
## =======================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Dependent variable: grade (the same three models as in the text table above):

| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 52.07 | 44.17 – 59.98 | <0.001 | 63.55 | 58.96 – 68.14 | <0.001 | 55.48 | 47.34 – 63.61 | <0.001 |
| books | 5.74 | 2.51 – 8.97 | 0.001 | | | | 4.04 | 0.60 – 7.47 | 0.027 |
| attend_cent | | | | 1.88 | 0.80 – 2.97 | 0.002 | 1.28 | 0.13 – 2.43 | 0.035 |
| Observations | 40 | | | 40 | | | 40 | | |
| R2 / adjusted R2 | 0.242 / 0.222 | | | 0.233 / 0.212 | | | 0.329 / 0.292 | | |
Model with the attend_cent × books interaction (model 6; dependent variable: grade):

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 53.29 | 45.25 – 61.34 | <0.001 |
| attend_cent | -0.14 | -1.86 – 1.58 | 0.877 |
| books | 4.15 | 0.87 – 7.44 | 0.018 |
| attend_cent:books | 0.73 | 0.05 – 1.42 | 0.042 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.402 / 0.352 | | |
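The chunk fitting the interaction model is not shown; based on the model formulas in the ANOVA output below, it was presumably fitted and compared with the additive model like this:

model6 <- lm(grade ~ attend_cent * books, data = df)  # main effects plus attend_cent:books
# summary(model6)
anova(model5, model6)  # additive vs. interaction model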
## Analysis of Variance Table
##
## Model 1: grade ~ attend_cent + books
## Model 2: grade ~ attend_cent * books
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 37 7306.2
## 2 36 6506.4 1 799.78 4.4252 0.04246 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model with the interaction (model 6) fits significantly better than the additive model (model 5): F = 4.43, p = 0.042.
How can one summarise the “pure” relationship between books and grade while taking attendance into account (“net of attendance”)?
One way is to calculate the “average marginal effect”; another is to show plots. Use whichever is more suitable.
The principle behind marginal effects is illustrated in a GIF by Nick HK (@nickchk): https://twitter.com/nickchk/status/1067136325751529472
Marginal effects show the expected change in the outcome associated with a one-unit change in a predictor, holding all other predictors constant. In other words, they give the “pure” effect of each predictor.
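The output below was presumably produced with the margins package (an assumption; the chunk is not shown): first the average marginal effects of model 6, then the marginal effect of attendance evaluated at zero to four books read.

library(margins)
margins(model6)                          # average marginal effects
margins(model6, at = list(books = 0:4))  # marginal effect of attendance at fixed numbers of books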
## attend_cent books
## 1.333 4.155
## at(books) attend_cent books
## 0 -0.1368 4.155
## 1 0.5982 4.155
## 2 1.3331 4.155
## 3 2.0680 4.155
## 4 2.8030 4.155
library(interplot)
library(ggplot2)  # for the labelling and theme functions
interplot(m = model6, var1 = "attend_cent", var2 = "books") +
    xlab("Books read") +
    ylab("Estimated coefficient for attendance") +
    ggtitle("Estimated coefficient of attendance on grade, by books read") +
    theme_bw() +  # apply the complete theme first ...
    theme(plot.title = element_text(face = "bold")) +  # ... then tweak it, so the bold title is not overwritten
    geom_hline(yintercept = 0, linetype = "dashed")
Both the table and the plot convey the same message: the more books a student has read, the stronger the positive relationship between attendance and grade.
When an interaction involves two continuous variables, the moderator is chosen by the researcher, and there are several ways to display the effect:
library(sjPlot)  # plot_model()
plot_model(model6, type = "pred", terms = c("books", "attend_cent[-8, 5]"),
    axis.lim = c(30, 100))  # fix the y-axis limits so the plots are comparable
plot_model(model6, type = "int", terms = "attend_cent", mdrt.values = "minmax",
    axis.lim = c(30, 100))
plot_model(model6, type = "int", terms = "attend_cent", mdrt.values = "meansd",
    axis.lim = c(30, 100))
plot_model(model6, type = "int", terms = "attend_cent", mdrt.values = "quart",
    axis.lim = c(30, 100))
When the student has attended at least the average number of classes, there is a statistically significant difference in grade between reading the minimum (zero) and the maximum (four) number of books.
However, there is no statistically significant difference by attendance if the student has read an average number of books, plus or minus one standard deviation (from about 0.5 to 3.4 books). The same picture appears when comparing students by quartiles of books read.
In other words, the difference is substantial only for the students who read the most and the least.
Data from the ESS round 5
library(foreign)  # read.spss() for the SPSS data file
data <- read.spss("ESS5e03.sav", to.data.frame = TRUE)
savevars <- c("eduyrs", "agea", "agertr", "gndr", "cntry", "icmnact", "hinctnta",
    "domicil")
ESS1 <- data[savevars]  # keep only the variables used below
rm(data, savevars)
str(ESS1, give.attr = FALSE)
## 'data.frame': 52458 obs. of 8 variables:
## $ eduyrs : Factor w/ 45 levels "0","1","2","3",..: 16 16 14 16 16 14 18 14 13 13 ...
## $ agea : Factor w/ 88 levels "14","15","16",..: 9 30 6 10 45 49 13 27 35 55 ...
## $ agertr : Factor w/ 77 levels "0","1","2","5",..: NA NA NA NA 45 45 NA NA 43 46 ...
## $ gndr : Factor w/ 2 levels "Male","Female": 1 1 2 2 2 2 1 1 1 1 ...
## $ cntry : Factor w/ 27 levels "Belgium","Bulgaria",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ icmnact : Factor w/ 3 levels "In paid work",..: 3 1 3 1 3 2 1 1 1 2 ...
## $ hinctnta: Factor w/ 10 levels "J","R","C","M",..: NA 5 1 10 6 5 6 2 6 8 ...
## $ domicil : Factor w/ 5 levels "A big city","Suburbs or outskirts of big city",..: 1 3 1 1 2 2 3 1 4 4 ...
ESS1$agertr <- as.numeric(as.character(ESS1$agertr))  # desired retirement age
ESS1$eduyrs <- as.numeric(as.character(ESS1$eduyrs))  # years of education
ESS1$agea <- as.numeric(as.character(ESS1$agea))  # age; convert factors via character to keep the values
ESS1$income <- as.numeric(ESS1$hinctnta)  # household income decile, 1 = lowest, 10 = highest
ESS2 <- na.omit(subset(ESS1, icmnact == "In paid work"))  # respondents in paid work, complete cases only
ESS2$agea_c <- ESS2$agea - mean(ESS2$agea)  # centre age at the sample mean
ESS2$agea_c <- as.numeric(ESS2$agea_c)
str(ESS2, give.attr = FALSE)
## 'data.frame': 7415 obs. of 10 variables:
## $ eduyrs : num 12 14 15 17 8 15 21 15 14 19 ...
## $ agea : num 48 59 50 46 48 52 62 50 59 56 ...
## $ agertr : num 58 60 62 65 60 56 65 65 60 65 ...
## $ gndr : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 2 2 1 ...
## $ cntry : Factor w/ 27 levels "Belgium","Bulgaria",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ icmnact : Factor w/ 3 levels "In paid work",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ hinctnta: Factor w/ 10 levels "J","R","C","M",..: 6 8 5 9 4 7 10 7 8 8 ...
## $ domicil : Factor w/ 5 levels "A big city","Suburbs or outskirts of big city",..: 4 3 4 2 3 4 4 4 2 4 ...
## $ income : num 6 8 5 9 4 7 10 7 8 8 ...
## $ agea_c : num -6.08 4.92 -4.08 -8.08 -6.08 ...
m1 <- lm(agertr ~ agea_c, data = ESS2)  # desired retirement age on centred age
# summary(m1)
m2 <- lm(agertr ~ agea_c + gndr, data = ESS2)  # + gender
# summary(m2)
m3 <- lm(agertr ~ agea_c + gndr + income, data = ESS2)  # + income decile
# summary(m3)
anova(m1, m2, m3)  # compare the nested models
## Analysis of Variance Table
##
## Model 1: agertr ~ agea_c
## Model 2: agertr ~ agea_c + gndr
## Model 3: agertr ~ agea_c + gndr + income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7413 198259
## 2 7412 193946 1 4313.0 165.546 < 2.2e-16 ***
## 3 7411 193080 1 865.9 33.236 8.489e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m4 <- lm(agertr ~ agea_c + gndr + income + eduyrs, data = ESS2)  # + years of education
# summary(m4)
m5 <- update(m4, . ~ . + domicil)  # the same model plus type of domicile
anova(m3, m4, m5)
## Analysis of Variance Table
##
## Model 1: agertr ~ agea_c + gndr + income
## Model 2: agertr ~ agea_c + gndr + income + eduyrs
## Model 3: agertr ~ agea_c + gndr + income + eduyrs + domicil
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7411 193080
## 2 7410 191464 1 1615.8 63.878 1.524e-15 ***
## 3 7406 187334 4 4130.8 40.826 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The largest model (m5) is also the best one: it explains the largest share of the variance in agertr (12 per cent, vs. 10 per cent in model m4).
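The side-by-side table below can be produced with sjPlot (an assumed call; the chunk is not shown in the original):

tab_model(m4, m5)  # sjPlot comparison table of the two models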
Models m4 and m5 (dependent variable: agertr):

| Predictors | Estimates | CI | p | Estimates | CI | p |
|---|---|---|---|---|---|---|
| (Intercept) | 59.10 | 58.62 – 59.57 | <0.001 | 58.27 | 57.73 – 58.82 | <0.001 |
| agea_c | 0.25 | 0.23 – 0.27 | <0.001 | 0.24 | 0.22 – 0.26 | <0.001 |
| Female | -1.52 | -1.76 – -1.29 | <0.001 | -1.46 | -1.69 – -1.23 | <0.001 |
| income | 0.07 | 0.02 – 0.12 | 0.003 | 0.08 | 0.03 – 0.12 | 0.001 |
| eduyrs | 0.13 | 0.10 – 0.17 | <0.001 | 0.13 | 0.10 – 0.17 | <0.001 |
| Suburbs or outskirts of big city | | | | 1.87 | 1.45 – 2.29 | <0.001 |
| Town or small city | | | | 0.51 | 0.19 – 0.84 | 0.002 |
| Country village | | | | 0.56 | 0.23 – 0.89 | 0.001 |
| Farm or home in countryside | | | | 2.62 | 2.13 – 3.10 | <0.001 |
| Observations | 7415 | | | 7415 | | |
| R2 / adjusted R2 | 0.102 / 0.101 | | | 0.121 / 0.120 | | |
## agea_c
## 0.26170497
## gndrFemale
## -0.13600567
## income
## 0.03734194
## eduyrs
## 0.09354962
## domicilSuburbs or outskirts of big city
## 0.43437170
## domicilTown or small city
## 0.55451400
## domicilCountry village
## 0.05258001
## domicilFarm or home in countryside
## 1.27287355
Judging by the standardised coefficients, the respondent’s age predicts the desired retirement age better than domicile, years of education, or income.
Is the relationship between the respondent’s income and desired retirement age the same across all types of domicile?
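The chunk fitting the interaction model is not shown; based on the formulas in the ANOVA output below (and the name “model 5i” used later in the text), it was presumably:

m5i <- lm(agertr ~ agea_c + gndr + eduyrs + domicil * income, data = ESS2)  # income slope may differ by domicile
anova(m5, m5i)  # additive vs. interaction model
# tab_model(m5, m5i) would produce the comparison table further below (assumed sjPlot call)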
## Analysis of Variance Table
##
## Model 1: agertr ~ agea_c + gndr + income + eduyrs + domicil
## Model 2: agertr ~ agea_c + gndr + eduyrs + domicil * income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7406 187334
## 2 7402 186987 4 346.78 3.4319 0.008256 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Additive model (m5) and the model with the domicil × income interaction (dependent variable: agertr):

| Predictors | Estimates | CI | p | Estimates | CI | p |
|---|---|---|---|---|---|---|
| (Intercept) | 58.27 | 57.73 – 58.82 | <0.001 | 58.16 | 57.38 – 58.94 | <0.001 |
| agea_c | 0.24 | 0.22 – 0.26 | <0.001 | 0.24 | 0.22 – 0.26 | <0.001 |
| Female | -1.46 | -1.69 – -1.23 | <0.001 | -1.44 | -1.67 – -1.21 | <0.001 |
| income | 0.08 | 0.03 – 0.12 | 0.001 | 0.09 | -0.01 – 0.19 | 0.068 |
| eduyrs | 0.13 | 0.10 – 0.17 | <0.001 | 0.14 | 0.10 – 0.17 | <0.001 |
| Suburbs or outskirts of big city | 1.87 | 1.45 – 2.29 | <0.001 | 2.02 | 0.88 – 3.17 | 0.001 |
| Town or small city | 0.51 | 0.19 – 0.84 | 0.002 | 0.29 | -0.57 – 1.15 | 0.511 |
| Country village | 0.56 | 0.23 – 0.89 | 0.001 | 0.58 | -0.29 – 1.44 | 0.191 |
| Farm or home in countryside | 2.62 | 2.13 – 3.10 | <0.001 | 4.30 | 3.10 – 5.50 | <0.001 |
| domicilSuburbs or outskirts of big city:income | | | | -0.02 | -0.19 – 0.14 | 0.776 |
| domicilTown or small city:income | | | | 0.04 | -0.09 – 0.16 | 0.565 |
| domicilCountry village:income | | | | -0.00 | -0.13 – 0.13 | 0.985 |
| domicilFarm or home in countryside:income | | | | -0.29 | -0.47 – -0.10 | 0.002 |
| Observations | 7415 | | | 7415 | | |
| R2 / adjusted R2 | 0.121 / 0.120 | | | 0.123 / 0.121 | | |
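The average marginal effects for model m5i referred to below were presumably obtained with the margins package (an assumption, as above):

margins(m5i)  # average marginal effects of the interaction model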
## agea_c eduyrs income gndrFemale domicilSuburbs or outskirts of big city
## 0.2404 0.135 0.0781 -1.44 1.875
## domicilTown or small city domicilCountry village
## 0.5174 0.5706
## domicilFarm or home in countryside
## 2.511
m5i2 <- lm(agertr ~ agea_c + gndr + eduyrs + income * domicil, data = ESS2)  # same predictors; the term order affects how plot_model() displays the interaction
plot_model(m5i2, type = "int")
The model with the interaction between domicile type and income fits better than the additive model. This means that the relationship between income and the desired retirement age depends on where the respondent resides.
In the countryside, it is the richest who want to retire earlier, whereas in small towns it is the poorest who want to retire earlier.
In big cities, suburbs, and country villages the difference is not statistically significant, but we can see that, everywhere except on farms, poorer people would like to retire earlier.
The marginal effects show that, according to model 5i, living on a farm (compared with living in a big city) is associated with a desired retirement age about 2.5 years higher, an effect comparable in size to being ten years older than the average age (10 × 0.24 ≈ 2.4). In other words, people ten years older than the average want to retire about 2.5 years later, and people living on a farm want to retire about 2.5 years later than those living in a big city.
Is the relationship between the type of domicile and the desired age of retirement the same for men and women?
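The code for the gender-by-domicile model is again not shown; a hedged reconstruction based on the ANOVA output below (the model name m5g is hypothetical):

m5g <- lm(agertr ~ agea_c + gndr * domicil + eduyrs + income, data = ESS2)
anova(m5, m5g)  # additive vs. interaction model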
## Analysis of Variance Table
##
## Model 1: agertr ~ agea_c + gndr + income + eduyrs + domicil
## Model 2: agertr ~ agea_c + gndr * domicil + eduyrs + income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7406 187334
## 2 7402 186985 4 348.5 3.449 0.008014 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Additive model (m5) and the model with the gndr × domicil interaction (dependent variable: agertr):

| Predictors | Estimates | CI | p | Estimates | CI | p |
|---|---|---|---|---|---|---|
| (Intercept) | 58.27 | 57.73 – 58.82 | <0.001 | 58.60 | 58.00 – 59.20 | <0.001 |
| agea_c | 0.24 | 0.22 – 0.26 | <0.001 | 0.24 | 0.22 – 0.26 | <0.001 |
| Female | -1.46 | -1.69 – -1.23 | <0.001 | -2.08 | -2.58 – -1.58 | <0.001 |
| income | 0.08 | 0.03 – 0.12 | 0.001 | 0.08 | 0.03 – 0.12 | 0.002 |
| eduyrs | 0.13 | 0.10 – 0.17 | <0.001 | 0.13 | 0.10 – 0.17 | <0.001 |
| Suburbs or outskirts of big city | 1.87 | 1.45 – 2.29 | <0.001 | 1.10 | 0.51 – 1.68 | <0.001 |
| Town or small city | 0.51 | 0.19 – 0.84 | 0.002 | 0.17 | -0.30 – 0.64 | 0.476 |
| Country village | 0.56 | 0.23 – 0.89 | 0.001 | 0.19 | -0.27 – 0.66 | 0.423 |
| Farm or home in countryside | 2.62 | 2.13 – 3.10 | <0.001 | 2.36 | 1.70 – 3.03 | <0.001 |
| gndrFemale:domicilSuburbs or outskirts of big city | | | | 1.55 | 0.72 – 2.39 | <0.001 |
| gndrFemale:domicilTown or small city | | | | 0.65 | 0.01 – 1.30 | 0.048 |
| gndrFemale:domicilCountry village | | | | 0.72 | 0.07 – 1.38 | 0.030 |
| gndrFemale:domicilFarm or home in countryside | | | | 0.45 | -0.52 – 1.43 | 0.360 |
| Observations | 7415 | | | 7415 | | |
| R2 / adjusted R2 | 0.121 / 0.120 | | | 0.123 / 0.121 | | |
The interaction (moderation, multiplicative) model fits better than the additive model.
On average, for men there is no difference whether they live in a big city (the reference category), a town/small city, or a country village: they want to retire at approximately 61. However, men living in the suburbs or on a farm would like to retire later, at about 62 or 63.
Women living in the suburbs or on a farm would like to retire at about 61, earlier than men in the same locations, but more than a year later than women living in a big city, town, or country village.
In other words, the relationship between gender and the desired retirement age does indeed depend on where the respondents live. Although the pattern differs slightly for men and women (the interaction effect is statistically significant), it is the same types of domicile (farm and suburbs) in which both men and women plan to retire later than elsewhere.