###Setting Up the Environment
# List of packages
packages <- c("tidyverse", "infer", "fst", "modelsummary", "effects", "survey", "MASS", "aod", "interactions", "kableExtra", "flextable", "scales") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: carData
##
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
##
## Loading required package: grid
##
## Loading required package: Matrix
##
##
## Attaching package: 'Matrix'
##
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
##
## Loading required package: survival
##
##
## Attaching package: 'survey'
##
##
## The following object is masked from 'package:graphics':
##
## dotchart
##
##
##
## Attaching package: 'MASS'
##
##
## The following object is masked from 'package:dplyr':
##
## select
##
##
##
## Attaching package: 'aod'
##
##
## The following object is masked from 'package:survival':
##
## rats
##
##
##
## Attaching package: 'kableExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## group_rows
##
##
##
## Attaching package: 'flextable'
##
##
## The following objects are masked from 'package:kableExtra':
##
## as_image, footnote
##
##
## The following object is masked from 'package:purrr':
##
## compose
##
##
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:readr':
##
## col_factor
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "infer" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "fst" "infer" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
##
## [[4]]
## [1] "modelsummary" "fst" "infer" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "effects" "carData" "modelsummary" "fst" "infer"
## [6] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [11] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [16] "stats" "graphics" "grDevices" "utils" "datasets"
## [21] "methods" "base"
##
## [[6]]
## [1] "survey" "survival" "Matrix" "grid" "effects"
## [6] "carData" "modelsummary" "fst" "infer" "lubridate"
## [11] "forcats" "stringr" "dplyr" "purrr" "readr"
## [16] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [21] "graphics" "grDevices" "utils" "datasets" "methods"
## [26] "base"
##
## [[7]]
## [1] "MASS" "survey" "survival" "Matrix" "grid"
## [6] "effects" "carData" "modelsummary" "fst" "infer"
## [11] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [16] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [21] "stats" "graphics" "grDevices" "utils" "datasets"
## [26] "methods" "base"
##
## [[8]]
## [1] "aod" "MASS" "survey" "survival" "Matrix"
## [6] "grid" "effects" "carData" "modelsummary" "fst"
## [11] "infer" "lubridate" "forcats" "stringr" "dplyr"
## [16] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [21] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [26] "datasets" "methods" "base"
##
## [[9]]
## [1] "interactions" "aod" "MASS" "survey" "survival"
## [6] "Matrix" "grid" "effects" "carData" "modelsummary"
## [11] "fst" "infer" "lubridate" "forcats" "stringr"
## [16] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [21] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [26] "utils" "datasets" "methods" "base"
##
## [[10]]
## [1] "kableExtra" "interactions" "aod" "MASS" "survey"
## [6] "survival" "Matrix" "grid" "effects" "carData"
## [11] "modelsummary" "fst" "infer" "lubridate" "forcats"
## [16] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [21] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [26] "grDevices" "utils" "datasets" "methods" "base"
##
## [[11]]
## [1] "flextable" "kableExtra" "interactions" "aod" "MASS"
## [6] "survey" "survival" "Matrix" "grid" "effects"
## [11] "carData" "modelsummary" "fst" "infer" "lubridate"
## [16] "forcats" "stringr" "dplyr" "purrr" "readr"
## [21] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [26] "graphics" "grDevices" "utils" "datasets" "methods"
## [31] "base"
##
## [[12]]
## [1] "scales" "flextable" "kableExtra" "interactions" "aod"
## [6] "MASS" "survey" "survival" "Matrix" "grid"
## [11] "effects" "carData" "modelsummary" "fst" "infer"
## [16] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [21] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [26] "stats" "graphics" "grDevices" "utils" "datasets"
## [31] "methods" "base"
Use data for Germany and the variable coding for model3 and model4 from the tutorial (i.e., both for the cleaning & recode, as well as the linear model formulas). Do a modelsummary table displaying both model outputs. Interpret the coefficients for the MLR without the interaction (i.e., the first displayed model), and interpret model fit metrics for both models.
First load Germany data and rename to df to keep germany_data as is to refer back to the original dataset needed:
germany_data <- read_fst("germany_data.fst")
df <- germany_data
Next weight the data since the most accurate estimates are obtained only after weighing data:
df$weight <- df$dweight * df$pweight
survey_desgign <- svydesign(ids = ~1, data = df, weights = ~weight)
Next perform the variable recoding for model3 and model4 from the tutorial:
# Use authoritarian scale
df <- df %>%
mutate(behave = ipbhprp,
secure = impsafe,
safety = ipstrgv,
tradition = imptrad,
rules = ipfrule) %>%
mutate(across(c("behave", "secure", "safety", "tradition", "rules"),
~ na_if(.x, 7) %>% na_if(8) %>% na_if(9))) %>%
# Apply the reverse coding
mutate(across(c("behave", "secure", "safety", "tradition", "rules"), ~ 7 - .x ))
# Now you can calculate 'schwartzauth' after the NA recoding
df$auth <- scales::rescale(df$behave +
df$secure +
df$safety +
df$tradition +
df$rules, to=c(0,100), na.rm=TRUE)
df <- df %>% filter(!is.na(auth))
# Add covariates of interest: lrscale with 3 categories (left/right/moderate) and a cleaned version of how religious respondents are from 0-19. We are testing ideas related to the political right and being more religious as associated to authoritarian values.
df <- df %>%
mutate(
polID = case_when(
lrscale %in% 0:3 ~ "Left",
lrscale %in% 7:10 ~ "Right",
lrscale %in% 4:6 ~ "Moderate",
lrscale %in% c(77, 88, 99) ~ NA_character_
),
religious = case_when(
rlgdgr %in% c(77, 88, 99) ~ NA_real_,
TRUE ~ rlgdgr
)
)
Next run the linear model formulas for model3 and model4. Model 3 is a MLR and Model 4 is a MLR with interaction between religiousness and ideological self-placement as left, moderate and right.
model3 <- lm(auth ~ polID + religious, data = df, weights = weight)
model4 <- lm(auth ~ polID + religious + polID*religious, data = df, weights = weight)
Do a model summary table displaying both model outputs:
modelsummary(
list(model3, model4),
fmt = 1,
estimate = c( "{estimate} ({std.error}){stars}",
"{estimate} ({std.error}){stars}"),
statistic = NULL)
| Â (1) | Â Â (2) | |
|---|---|---|
| (Intercept) | 55.3 (0.3)*** | 56.4 (0.3)*** |
| polIDModerate | 4.6 (0.3)*** | 3.4 (0.4)*** |
| polIDRight | 7.7 (0.4)*** | 4.2 (0.7)*** |
| religious | 0.9 (0.0)*** | 0.6 (0.1)*** |
| polIDModerate × religious | 0.3 (0.1)*** | |
| polIDRight × religious | 0.8 (0.1)*** | |
| Num.Obs. | 23373 | 23373 |
| R2 | 0.048 | 0.050 |
| R2 Adj. | 0.048 | 0.050 |
| AIC | 200594.1 | 200558.6 |
| BIC | 200634.4 | 200615.0 |
| Log.Lik. | −100292.049 | −100272.298 |
| RMSE | 17.27 | 17.24 |
Interpret the coefficients for the MLR without interactions, and the model fit metrics for both models:
The intercept of 55.3 represents the expected value of the outcome variable when all the predictor variables are set to their reference levels, which would be polID as left and religiousness as 0. Therefore, the baseline group that is omitted for the categorical variable is interpreted as part of the intercept.
The polIDModerate coefficient is 4.6. This suggests that comparing individuals with the same religiousness, the average outcome for individuals identifying as politically moderate are 4.6 units higher compared to the reference category of identifying ideologically as on the left on the authoritarian values scale.
The polIDRight coefficient is 7.7. This coefficient suggests that comparing individuals with the same religiousness, individuals identifying as politically right are expected to be, on average, 7.7 units higher compared than the reference category of identifying ideologically as on the left on the authoritarian values scale.
The religious coefficient is 0.9. This coefficient of 0.9 means that with every one-unit increase in religiousness, there is an associated expected average increase of 0.9 units on the authoritarian values scale, assuming that we are comparing people with the same ideological categorization, that is holding the ideological self-placement variable as constant.
All three of the coefficients have the three stars (***) indicating a p-value < 0.001. This means that there is a less than 0.1% probability that the observed result is due to chance and that we incorrectly reject the null hypothesis under frequentist assumptions.
Looking at the model fit metrics, the R squared and adjusted R squared values are very close between the MLR and the MLR with interactions. Therefore, the MLR with interactions does not appear to explain much more proportion in variance of the outcome variable than the MLR model does. The AIC and BIC measures between the two models are also very close. While the MLR with interactions has model fit metrics that are slightly better than the MLR, the difference in these metrics is very small, and the interaction coefficients themselves are also quite small.
Now generate the model4 interaction plot that we did in the tutorial, but again using the German data instead of the French. Interpret.
Generate the model4 interaction plot using German data
interaction_plot <- effect("polID*religious", model4, na.rm=TRUE)
plot(interaction_plot,
main="Interaction effect",
xlab="Religiousness",
ylab="Authoritarian attitudes scale")
interaction_plot
##
## polID*religious effect
## religious
## polID 0 2 5 8 10
## Left 56.40706 57.56775 59.30879 61.04983 62.21053
## Moderate 59.85063 61.66176 64.37844 67.09513 68.90626
## Right 60.60237 63.35032 67.47224 71.59416 74.34211
Interpret the interaction plot:
Referring back to the model output for model4, we saw that the interaction coefficients are statistically significant, indicated by the ***, however, the coefficients are small. Also, the model fit metrics were only very slightly better than the model fit metrics in the MLR model. Looking at an interaction plot, we know that if the lines are parallel that indicates no interaction effect, and if the lines are not parallel, that indicates an interaction effect. In this model interaction plot for model4, although the lines are not exactly parallel with exactly equal slopes, the slopes of these lines appear to be very close and look close to parallel. The interaction plots for polID=Left and polID=Moderate are almost parallel, and the interaction plots for polID=Moderate and polID=Right are almost parallel. There are no sections of the plot where the lines are moving in different directions. All three interaction plots are linearly increasing over the scale. We can calculate the slopes of the lines to see how similar the slopes are:
m1 <- (62.21053-56.40706)/(10-0)
m2 <- (68.90626-59.85063)/(10-0)
m3 <- (74.34211-60.60237)/(10-0)
m1
## [1] 0.580347
m2
## [1] 0.905563
m3
## [1] 1.373974
As I explained above, the plots between polID=Left and polID=Moderate looked almost parallel and the slopes of polID=Moderate and polID=Right looked almost parallel. Looking at the slopes though, the slopes between the first two plots are close and between the second and third plots are close but you can see that the lines are not parallel because the slopes of the lines are not equal. Therefore there is some interaction effect between polID and religiousness.
Use data for the Netherlands and the variable coding for model5 and model6 from the tutorial (i.e., both for the cleaning & recode, as well as the linear model formulas). Do a modelsummary table displaying both model outputs. Interpret the coefficients for the MLR without the interaction (i.e., the first displayed model), and interpret model fit metrics for both models.
First load Netherlands data and rename to df to keep netherlands_data as is to refer back to the original dataset needed:
netherlands_data <- read_fst("netherlands_data.fst")
df <- netherlands_data
Next weight the data since the most accurate estimates are obtained only after weighing data:
df$weight <- df$dweight * df$pweight
survey_desgign <- svydesign(ids = ~1, data = df, weights = ~weight)
Next perform the variable recoding for model5 and model6 from the tutorial:
# Use authoritarian scale
df <- df %>%
mutate(behave = ipbhprp,
secure = impsafe,
safety = ipstrgv,
tradition = imptrad,
rules = ipfrule) %>%
mutate(across(c("behave", "secure", "safety", "tradition", "rules"),
~ na_if(.x, 7) %>% na_if(8) %>% na_if(9))) %>%
# Apply the reverse coding
mutate(across(c("behave", "secure", "safety", "tradition", "rules"), ~ 7 - .x ))
# Now you can calculate 'schwartzauth' after the NA recoding
df$auth <- scales::rescale(df$behave +
df$secure +
df$safety +
df$tradition +
df$rules, to=c(0,100), na.rm=TRUE)
df <- df %>% filter(!is.na(auth))
# Use a predictor about whether a respondent belongs to a particular religion or not
df <- df %>%
mutate(religion = case_when(
rlgblg == 2 ~ "No",
rlgblg == 1 ~ "Yes",
rlgblg %in% c(7, 8, 9) ~ NA_character_,
TRUE ~ as.character(rlgblg)
))
# check
table(df$religion)
##
## No Yes
## 10913 6750
# Recode lrscale as left and right, omitting 5
df <- df %>%
mutate(ID = case_when(
lrscale >= 0 & lrscale <= 4 ~ "Left",
lrscale >= 6 & lrscale <= 10 ~ "Right",
lrscale > 10 ~ NA_character_, # Set values above 10 as NA
TRUE ~ NA_character_ # Ensure value 5 and any other unexpected values are set as NA
))
table(df$ID)
##
## Left Right
## 5449 7010
# Recode gen which is needed for model5 and model6
df <- df %>%
mutate(
cohort = ifelse(yrbrn < 1930 | yrbrn > 2000, NA, yrbrn),
# Recoding generational cohorts based on the year of birth (yrbrn).
# The year of birth is categorized into different generational cohorts.
# Interwar (1900-1945), Baby Boomers (1946-1964), Gen X (1965-1979), Millennials (1980-1996).
# The 'TRUE' line is a catch-all that keeps the original year of birth for those not in these ranges.
gen = case_when(
yrbrn %in% 1900:1945 ~ "1",
yrbrn %in% 1946:1964 ~ "2",
yrbrn %in% 1965:1979 ~ "3",
yrbrn %in% 1980:1996 ~ "4",
TRUE ~ as.character(yrbrn)
),
# After recoding, the gen variable is converted into a factor with labels for clearer interpretation.
# Factors are used in R to handle categorical variables.
gen = factor(gen,
levels = c("1", "2", "3", "4"),
labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
)
table(df$gen)
##
## Interwar Baby Boomers Gen X Millennials
## 3771 6147 4548 2730
Next run the linear model formulas for model5 and model6. Model 5 is a MLR and Model 6 is a MLR with interaction between belonging to a particular religion and ideological self-placement as left and right.
model5 <- lm(auth ~ religion + ID + gen, data = df, weights = weight)
model6 <- lm(auth ~ religion + ID + gen + religion*gen, data = df, weights = weight)
Do a model summary table displaying both model outputs:
modelsummary(
list(model5, model6),
fmt = 1,
estimate = c( "{estimate} ({std.error}){stars}",
"{estimate} ({std.error}){stars}"),
statistic = NULL,
coef_omit = "Intercept")
| Â (1) | Â Â (2) | |
|---|---|---|
| religionYes | 7.2 (0.3)*** | 5.7 (0.7)*** |
| IDRight | 4.2 (0.3)*** | 4.3 (0.3)*** |
| genBaby Boomers | −5.2 (0.4)*** | −5.9 (0.6)*** |
| genGen X | −7.7 (0.4)*** | −8.8 (0.6)*** |
| genMillennials | −8.4 (0.5)*** | −9.7 (0.6)*** |
| religionYes × genBaby Boomers | 1.1 (0.8) | |
| religionYes × genGen X | 2.3 (0.9)** | |
| religionYes × genMillennials | 2.9 (1.0)** | |
| Num.Obs. | 12116 | 12116 |
| R2 | 0.124 | 0.125 |
| R2 Adj. | 0.124 | 0.125 |
| AIC | 101219.3 | 101213.5 |
| BIC | 101271.2 | 101287.5 |
| Log.Lik. | −50602.672 | −50596.757 |
| RMSE | 15.27 | 15.26 |
Interpret the coefficients for the MLR without interactions, and the model fit metrics for both models:
The religionYes coefficient is 7.2. This suggests that comparing individuals with the same political ID and generation (i.e. holding these 2 variables constant), the average outcome for individuals identifying as belonging to a particular religion is 7.2 units higher, compared to the reference category of not identifying as belonging to a particular religion, on the authoritarian values scale.
The IDRight coefficient is 4.2. This suggests that comparing individuals with the same religion belonging and generation (i.e. holding these 2 variables constant), the average outcome for individuals identifying as politically right is expected to be 7.2 units higher, compared to the reference category of not identifying as politically left, on the authoritarian values scale.
The genBaby Boomers coefficient is -5.2. This suggests that comparing individuals with the same religion belonging and political ID (i.e. holding these 2 variables constant), the average outcome for individuals in generation Baby Boomers is expected to be 5.2 units lower, compared to the reference category of generation Interwar, on the authoritarian values scale.
The genGen X coefficient is -7.7. This suggests that comparing individuals with the same religion belonging and political ID (i.e. holding these 2 variables constant), the average outcome for individuals in generation Gen X is expected to be 7.7 units lower, compared to the reference category of generation Interwar, on the authoritarian values scale.
The genMillenials coefficient is -8.4. This suggests that comparing individuals with the same religion belonging and political ID (i.e. holding these 2 variables constant), the average outcome for individuals in generation Gen X is expected to be 8.4 units lower, compared to the reference category of generation Interwar, on the authoritarian values scale.
All of the coefficients have the three stars (***) indicating a p-value < 0.001 and that all coefficients are statistically significant. This means that there is a less than 0.1% probability that the observed result is due to chance and that we incorrectly reject the null hypothesis under frequentist assumptions.
Looking at the model fit metrics, the R squared and adjusted R squared values are very close between the MLR and the MLR with interactions. Both the R squared and adjusted R squared values are 0.124 in the MLR, and 0.125 in the MLR with interactions. Therefore, the MLR with interactions does not appear to explain much more proportion in variance of the outcome variable than the MLR model does. The AIC and BIC measures between the two models are also very close. The AIC appears to be better in the MLR with interactions as the AIC is lower, but the BIC appears to be better in the MLR as the BIC is lower in this model. While the MLR with interactions has more model fit metrics that are slightly better than the MLR (R squared, adjusted R squared, AIC), the difference in these metrics is very small. Unlike in model 4, however, the coefficients of the interaction terms are larger.
Produce the model7 interaction plot from the tutorial, but again using the Netherlands data. Interpret.
We need to run the linear model formula and produce the interaction plot for model7 which uses cohort and polID which we have not yet recoded for the Netherlands dataset.
First recode cohort and polID for the netherlands dataset:
df <- df %>%
mutate(
cohort = ifelse(yrbrn < 1930 | yrbrn > 2000, NA, yrbrn),
polID = case_when(
lrscale %in% 0:3 ~ "Left",
lrscale %in% 7:10 ~ "Right",
lrscale %in% 4:6 ~ "Moderate",
lrscale %in% c(77, 88, 99) ~ NA_character_
))
Next run the linear model formula and produce the interaction plot from the tutorial for model7 which is a MLR with interaction between cohort and political ideological self-placement as left, moderate and right:
model7 <- lm(auth ~ + cohort + polID + cohort*polID, data = df, weights = weight)
interact_plot(model7, pred = cohort, modx = polID, jnplot = TRUE)
Interpret:
Referring back to the model output for model6, we saw that not all the interaction coefficients are statistically significant, as the religionYesxgenBaby Boomers interaction term does not have the three stars ***. Also, as mentioned above, the coefficients are not as small as we saw in the model4. Looking at an interaction plot, we know that if the lines are parallel that indicates no interaction effect, and if the lines are not parallel, that indicates an interaction effect. In this model interaction plot for model6, the lines are not parallel and do not have equal slopes. The slope for the polIDRight line appears to be a lot different than the other 2 lines, it even intersects near the cohort 2000. There are no sections of the plot where the lines are moving in different directions. All three interaction plots are linearly decreasing over the scale. Since the slopes of the lines are not the same and the lines are not parallel, you can conclude that there is an interaction effect between polID and cohort.
Produce the model8 interaction plot from the tutorial, but again using the Netherlands data. Interpret.
We need to run the linear model formula and product the interaction plot for model8 which uses religious and ID. We have already recoded ID for the netherlands dataset but have not recoded the religious variable.
First recode religious for the netherlands dataset:
df <- df %>%
mutate(
polID = case_when(
lrscale %in% 0:3 ~ "Left",
lrscale %in% 7:10 ~ "Right",
lrscale %in% 4:6 ~ "Moderate",
lrscale %in% c(77, 88, 99) ~ NA_character_
),
religious = case_when(
rlgdgr %in% c(77, 88, 99) ~ NA_real_,
TRUE ~ rlgdgr
))
Next run the linear model formula and produce the interaction plot from the tutorial for model8 which is a MLR with interaction between political ideological self-placement on the left right scale and religiousness
model8 <- lm(auth ~ religious + ID + religious*ID, data = df, weights = weight)
interact_plot(model8, pred = religious, modx = ID, jnplot = TRUE)
Interpret:
Looking at the interaction plot, although the lines are not exactly parallel with the exact same slope, the lines are very close to parallel and if the slope was calculated, the values would be very close. These two points lead us to conclude that there is no interaction effect between ideologically identifying as politically left or right, and how religious the respondent is. We can calculate these slopes using estimates of the points and see how close the values are:
m1_left <- (57.5-54)/(2.5-0)
m2_right <- (62-58)/(2.5-0)
m1_left
## [1] 1.4
m2_right
## [1] 1.6
The slopes of these two lines are very close as expected, therefore we can conclude that there is no significant interaction effect between these two variables.