# List of packages
packages <- c("tidyverse", "infer", "fst", "modelsummary", "effects", "survey", "interactions", "tidycat", "broom")
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: 패키지 'infer'는 R 버전 4.3.3에서 작성되었습니다
## Warning: 패키지 'effects'는 R 버전 4.3.3에서 작성되었습니다
## 필요한 패키지를 로딩중입니다: carData
## Warning: 패키지 'carData'는 R 버전 4.3.3에서 작성되었습니다
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
## Warning: 패키지 'survey'는 R 버전 4.3.3에서 작성되었습니다
## 필요한 패키지를 로딩중입니다: grid
## 필요한 패키지를 로딩중입니다: Matrix
##
## 다음의 패키지를 부착합니다: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## 필요한 패키지를 로딩중입니다: survival
##
## 다음의 패키지를 부착합니다: 'survey'
##
## The following object is masked from 'package:graphics':
##
## dotchart
## Warning: 패키지 'interactions'는 R 버전 4.3.3에서 작성되었습니다
## Warning: 패키지 'tidycat'는 R 버전 4.3.3에서 작성되었습니다
## Registered S3 methods overwritten by 'broom':
## method from
## tidy.glht jtools
## tidy.summary.glht jtools
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "infer" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "fst" "infer" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
##
## [[4]]
## [1] "modelsummary" "fst" "infer" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "effects" "carData" "modelsummary" "fst" "infer"
## [6] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [11] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [16] "stats" "graphics" "grDevices" "utils" "datasets"
## [21] "methods" "base"
##
## [[6]]
## [1] "survey" "survival" "Matrix" "grid" "effects"
## [6] "carData" "modelsummary" "fst" "infer" "lubridate"
## [11] "forcats" "stringr" "dplyr" "purrr" "readr"
## [16] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [21] "graphics" "grDevices" "utils" "datasets" "methods"
## [26] "base"
##
## [[7]]
## [1] "interactions" "survey" "survival" "Matrix" "grid"
## [6] "effects" "carData" "modelsummary" "fst" "infer"
## [11] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [16] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [21] "stats" "graphics" "grDevices" "utils" "datasets"
## [26] "methods" "base"
##
## [[8]]
## [1] "tidycat" "interactions" "survey" "survival" "Matrix"
## [6] "grid" "effects" "carData" "modelsummary" "fst"
## [11] "infer" "lubridate" "forcats" "stringr" "dplyr"
## [16] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [21] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [26] "datasets" "methods" "base"
##
## [[9]]
## [1] "broom" "tidycat" "interactions" "survey" "survival"
## [6] "Matrix" "grid" "effects" "carData" "modelsummary"
## [11] "fst" "infer" "lubridate" "forcats" "stringr"
## [16] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [21] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [26] "utils" "datasets" "methods" "base"
Graph the histogram distribution using the populist scale (as we did in the tutorial) for the country of Sweden. What do you note?
sweden_data <- read.fst("sweden_data.fst")
sweden_data$trstplt <- ifelse(sweden_data$trstplt > 10, NA, sweden_data$trstplt)
sweden_data$trstprl <- ifelse(sweden_data$trstprl > 10, NA, sweden_data$trstprl)
sweden_data$trstprt <- ifelse(sweden_data$trstprt > 10, NA, sweden_data$trstprt)
sweden_data$trust <- scales::rescale(sweden_data$trstplt + sweden_data$trstprl + sweden_data$trstprt, na.rm = TRUE, to = c(0, 100))
sweden_data$populist <- scales::rescale(sweden_data$trust, na.rm = TRUE, to=c(100,0))
ggplot(sweden_data, aes(x = populist)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Populist Scale for Sweden",
x = "Populist Scale",
y = "Count")
## Warning: Removed 2503 rows containing non-finite values (`stat_bin()`).
I note that the distribution is unimodal; between 47.5 and 50 of populist scale is where the values occur most frequently in this histogram. Additionally, there are more people below 50 of populist scale compared to those above 50 of populist scale, indicating that there are more people that are less interested populism than those who are more interested in populism, for the country of Sweden.
Run a linear regression with populist attitudes as the outcome, educational attainment (recoded as a binary “BA or more” and “No BA”), and using the Swedish data. Print the intercept and coefficients only and interpret. Then do a tidy model table displaying the p-value (with digits = 3) and interpret.
df <- read.fst("sweden_data.fst")
model_data <- df %>%
mutate(
educ.ba = case_when(
essround < 5 & edulvla == 5 ~ "BA or more",
essround >= 5 & edulvlb > 600 ~ "BA or more",
TRUE ~ "No BA"),
edulvla = ifelse(edulvla %in% c(77, 88, 99), NA_integer_, edulvla),
edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
educ.ba = factor(educ.ba, levels = c("No BA", "BA or more")),
ID = case_when(
lrscale %in% 0:4 ~ "Left",
lrscale %in% 6:10 ~ "Right",
lrscale %in% c(77, 88, 99) ~ NA_character_,
TRUE ~ NA_character_
),
)
model_data$trstplt <- ifelse(model_data$trstplt > 10, NA, model_data$trstplt)
model_data$trstprl <- ifelse(model_data$trstprl > 10, NA, model_data$trstprl)
model_data$trstprt <- ifelse(model_data$trstprt > 10, NA, model_data$trstprt)
model_data$trust <- scales::rescale(model_data$trstplt + model_data$trstprl + model_data$trstprt, na.rm = TRUE, to = c(0, 100))
model_data$populist <- scales::rescale(model_data$trust, na.rm = TRUE, to=c(100,0))
table(model_data$populist)
##
## 0 3.33333333333333 6.66666666666667 10
## 46 15 47 69
## 13.3333333333333 16.6666666666667 20 23.3333333333333
## 123 180 398 505
## 26.6666666666667 30 33.3333333333333 36.6666666666667
## 806 879 931 1011
## 40 43.3333333333333 46.6666666666667 50
## 1143 1118 956 1161
## 53.3333333333333 56.6666666666667 60 63.3333333333333
## 804 841 683 631
## 66.6666666666667 70 73.3333333333333 76.6666666666667
## 544 596 431 364
## 80 83.3333333333333 86.6666666666667 90
## 331 257 200 164
## 93.3333333333333 96.6666666666667 100
## 155 80 244
table(model_data$educ.ba)
##
## No BA BA or more
## 13477 4739
model1 <- lm(populist ~ educ.ba, data = model_data)
coefficients_sweden <- coef(model1)
print(coefficients_sweden)
## (Intercept) educ.baBA or more
## 51.358680 -7.868121
The intercept value of 51.4 represents the average score on the populist scale for the “no BA” group. Simply put, the expected average populist scale score would be 51.4 for those who identify as “no BA” in Sweden.
The coefficient of -7.9 associated with the “BA or more” group indicates that, on average, those in in the “BA or more” group have populist scale scores that are 7.9 points lower than those who identify as “No BA” (based on the re-coding of variables) in Sweden.
According to these outputs, for Sweden, people who identify as “No BA”, on average, possess stronger populists attitudes than those who identify as “BA or more”.
tidy(model1) |>
knitr::kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 51.359 | 0.181 | 283.363 | 0 |
| educ.baBA or more | -7.868 | 0.350 | -22.494 | 0 |
The p-value of 0 indicates that it is statistically significant. Thus, identifying as ‘No BA’ is statistically associated with stronger populist attitudes and vice versa, for the country of Sweden.
Now using the Italian dataset and the authoritarian values scale (as we did in the tutorial), graph the average by survey year. Interpret.
Between 2012 and 2016, there is a small decrease (75.2 for 2012, 72.6 for 2016) in the authoritarian scale average in Italy. Then, beginning from the year of 2016, there is a very small, but still noticeable, increasing trend in the authoritarian scale average in Italy (72.7 for 2016, 72.8 for 2018, and 73.2 for 2020).
Now run a linear regression model, using the modelsummary package, with authoritarian values as the outcome, and generations as the predictor/potential explanatory variable, and using the Italian data again. Rename the coefficients using the coef_rename function. Interpret the coefficients, whether they are statistically significant, and the adjusted R-squared.
df <- read.fst("italy_data.fst")
model_data <- df %>%
mutate(
age = ifelse(agea == 999, NA, agea),
cohort = ifelse(yrbrn < 1930 | yrbrn > 2000, NA, yrbrn),
gen = case_when(
yrbrn %in% 1900:1945 ~ "1",
yrbrn %in% 1946:1964 ~ "2",
yrbrn %in% 1965:1979 ~ "3",
yrbrn %in% 1980:1996 ~ "4",
TRUE ~ as.character(cohort)
),
gen = factor(gen,
levels = c("1", "2", "3", "4"),
labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
)
table(model_data$gen)
##
## Interwar Baby Boomers Gen X Millennials
## 1470 3179 2636 2031
model_data <- model_data %>%
mutate(behave = ipbhprp,
secure = impsafe,
safety = ipstrgv,
tradition = imptrad,
rules = ipfrule) %>%
mutate(across(c("behave", "secure", "safety", "tradition", "rules"),
~ na_if(.x, 7) %>% na_if(8) %>% na_if(9))) %>%
mutate(across(c("behave", "secure", "safety", "tradition", "rules"), ~ 7 - .x ))
model_data$auth <- scales::rescale(model_data$behave +
model_data$secure +
model_data$safety +
model_data$tradition +
model_data$rules, to=c(0,100), na.rm=TRUE)
model_data <- model_data %>% filter(!is.na(auth))
table(model_data$secure)
##
## 1 2 3 4 5 6
## 54 139 626 1856 2876 2881
model2 <- lm(auth ~ gen, data = model_data)
modelsummary(
list(model2),
fmt = 1,
estimate = c( "{estimate} ({std.error}){stars}"),
statistic = NULL,
coef_omit = "Intercept",
coef_rename = c("genBaby Boomers" = "Baby Boomers", "genGen X" = "Gen X", "genMillennials" = "Millennials")
)
| (1) | |
|---|---|
| Baby Boomers | -1.1 (0.6)* |
| Gen X | -2.1 (0.6)*** |
| Millennials | -3.6 (0.6)*** |
| Num.Obs. | 7631 |
| R2 | 0.006 |
| R2 Adj. | 0.006 |
| AIC | 62934.0 |
| BIC | 62968.7 |
| Log.Lik. | -31462.009 |
| RMSE | 14.94 |
tidy(model2)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 75.3 0.467 162. 0
## 2 genBaby Boomers -1.12 0.552 -2.04 4.14e- 2
## 3 genGen X -2.08 0.564 -3.69 2.29e- 4
## 4 genMillennials -3.59 0.584 -6.15 8.01e-10
Coefficients interpretation:
Baby Boomers (-1.1, p<0.05): Compared to the Interwar generation (reference category that is omitted), Baby Boomers present scores that are, on average, 1.1 points lower on the authoritarian values scale (0-100). The single asterisk (*) denotes statistical significance at a lower p-value than the alpha threshold of 0.05 (p = 0.04142871).
Gen X (-2.1, p<0.001): Generation X shows an even more pronounced decrease, with scores 2.1 points lower than the Interwar generation on the authoritarian values scale. The triple asterisks (***) denote statistical significance at a very lower p-value than the alpha threshold of 0.05 (p = 0.00002291382).
Millennials (-3.6, p<0.001): Millennials exhibit the largest decrease in authoritarian values, with their scores being 3.6 points lower than those of the Interwar generation. This marked difference is statistically significant, indicating that Millennials diverge most strongly from the Interwar generation in comparing their authoritarian values. Again, the triple asterisks (***) denote statistical significance at a very, very, very lower p-value than the alpha threshold of 0.05 (p = 0.0000000008013214).
Adjusted R-Squared interpretation (R2 Adj. = 0.006):
The adjusted R-squared value of 0.006 means that the ‘gen’ variable (with the four coded generations) explain approximately 0.6% of the variance in the authoritarian values scale for Italy, after accounting for generations as the predictor. While statistically significant, this suggests that the model explains only a small portion of the variability in authoritarian values, pointing to other factors outside generational differences that may play critical roles in shaping these attitudes.
Now visualize, using the modelplot() function from the modelsummary package, coefficient estimates and 95% confidence intervals for the following three models:
Model 1 = populist attitudes scale as the outcome; left/right as the single predictor (omitting moderates as we did in the tutorial), and using data for Greece.
Model 2 = populist attitudes scale as the outcome; gender as the single predictor, and using data for Greece.
Model 3 = populist attitudes scale as the outcome; generations as the single predictor (using the four generational category recode we did in the tutorial), and using data for Greece. Interpret.
df <- read.fst("greece_data.fst")
model_data <- df %>%
mutate(
ID = case_when(
lrscale %in% 0:4 ~ "Left",
lrscale %in% 6:10 ~ "Right",
lrscale %in% c(77, 88, 99) ~ NA_character_,
TRUE ~ NA_character_,
),
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
gndr == 9 ~ NA_character_,
TRUE ~ as.character(gndr)
),
age = ifelse(agea == 999, NA, agea),
cohort = ifelse(yrbrn < 1930 | yrbrn > 2000, NA, yrbrn),
gen = case_when(
yrbrn %in% 1900:1945 ~ "1",
yrbrn %in% 1946:1964 ~ "2",
yrbrn %in% 1965:1979 ~ "3",
yrbrn %in% 1980:1996 ~ "4",
TRUE ~ as.character(cohort)
),
gen = factor(gen,
levels = c("1", "2", "3", "4"),
labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
)
model_data$trstplt <- ifelse(model_data$trstplt > 10, NA, model_data$trstplt)
model_data$trstprl <- ifelse(model_data$trstprl > 10, NA, model_data$trstprl)
model_data$trstprt <- ifelse(model_data$trstprt > 10, NA, model_data$trstprt)
model_data$trust <- scales::rescale(model_data$trstplt + model_data$trstprl + model_data$trstprt, na.rm = TRUE, to = c(0, 100))
model_data$populist <- scales::rescale(model_data$trust, na.rm = TRUE, to=c(100,0))
table(model_data$ID)
##
## Left Right
## 2688 3522
table(model_data$gndr)
##
## Female Male
## 6929 5629
table(model_data$gen)
##
## Interwar Baby Boomers Gen X Millennials
## 2978 3484 3498 2330
model3 <- lm(populist ~ ID, data = model_data)
model4 <- lm(populist ~ gndr, data = model_data)
model5 <- lm(populist ~ gen, data = model_data)
models2 <- list(
"Model 1" = lm(populist ~ ID, data = model_data),
"Model 2" = lm(populist ~ gndr, data = model_data),
"Model 3" = lm(populist ~ gen, data = model_data))
modelsummary(models2, fmt = 1,
estimate = c("{estimate} ({std.error}){stars}",
"{estimate} ({std.error}){stars}",
"{estimate} ({std.error}){stars}"),
coef_rename = c("genBaby Boomers" = "Baby Boomers", "genGen X" = "Gen X", "genMillennials" = "Millennials"),
statistic = NULL,
coef_omit = "Intercept")
| Model 1 | Model 2 | Model 3 | |
|---|---|---|---|
| IDRight | -8.8 (0.6)*** | ||
| gndrMale | -1.0 (0.5)* | ||
| Baby Boomers | 3.5 (0.7)*** | ||
| Gen X | 5.6 (0.7)*** | ||
| Millennials | 5.8 (0.7)*** | ||
| Num.Obs. | 4936 | 9823 | 9562 |
| R2 | 0.037 | 0.000 | 0.009 |
| R2 Adj. | 0.036 | 0.000 | 0.009 |
| AIC | 44723.8 | 88873.2 | 86473.9 |
| BIC | 44743.3 | 88894.8 | 86509.7 |
| Log.Lik. | -22358.894 | -44433.590 | -43231.944 |
| RMSE | 22.44 | 22.30 | 22.25 |
tidy(model3)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 72.0 0.478 151. 0
## 2 IDRight -8.80 0.643 -13.7 6.33e-42
tidy(model4)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 70.0 0.305 230. 0
## 2 gndrMale -0.967 0.452 -2.14 0.0324
tidy(model5)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 65.8 0.508 130. 0
## 2 genBaby Boomers 3.55 0.661 5.37 7.97e- 8
## 3 genGen X 5.58 0.659 8.47 2.85e-17
## 4 genMillennials 5.79 0.706 8.21 2.51e-16
modelplot(models2, coef_omit = 'Interc', coef_rename = c("genBaby Boomers" = "Baby Boomers", "genGen X" = "Gen X", "genMillennials" = "Millennials"))
Coefficient estimates interpretation:
Millennials (5.8, p<0.001): Millennials present the largest increase in populist values, with their scores being 5.8 points higher than those of the Interwar generation (as the reference group). This marked difference is statistically significant, indicating that Millennials, on average, possess the strongest populist attitudes among the four types of generations in the dataset, for the country of Greece. The triple asterisks (***) denote statistical significance at a very, very, very lower p-value than the alpha threshold of 0.05 (p = 0.0000000000000002513449).
Gen X (5.6, p<0.001): Generation X presents a slightly less pronounced increase compared to Millennials, with their scores being 5.6 points higher than the Interwar generation (the reference group) on the populist values scale. Again, the triple asterisks (***) denote statistical significance at a very, very, very lower p-value than the alpha threshold of 0.05 (p = 0.00000000000000002853823).
Baby Boomers (3.6, p<0.001): Compared to the Interwar generation (reference category that is omitted), Baby Boomers present scores that are, on average, 3.6 points higher on the populist values scale. Again, the triple asterisks (***) denote statistical significance at a very, very lower p-value than the alpha threshold of 0.05 (p = 0.00000007974151).
gnderMale (-1.0, p<0.05): Male is statistically significant as their scores are 1.0 point lower than female on the populist values scale, with the p-value of 0.032, indicating that males, on average, possess a slightly weaker populist attitudes than females, for the country of Greece. The single asterisk (*) denotes statistical significance at a lower p-value than the alpha threshold of 0.05 (p = 0.03244758).
IDRight (-8.8, p<0.001): Compared to those who lean politically left, those who lean politically right present scores that are, on average, 8.8 points lower on the populist values scale. This marked difference is statistically significant, indicating that those who lean politically right, on average, possess quite weaker populist attitudes than those who lean politically left, for the country of Greece. The triple asterisks (***) denote statistical significance at an exceedingly lower p-value than the alpha threshold of 0.05 (p = 0.000000000000000000000000000000000000000006333249).
confint(model3, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 71.0653 72.938927
## IDRight -10.0591 -7.539754
confint(model4, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 69.416151 70.60995391
## gndrMale -1.852559 -0.08087043
confint(model5, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 64.777052 66.768087
## genBaby Boomers 2.254559 4.845308
## genGen X 4.288926 6.872488
## genMillennials 4.409253 7.175453
95% confidence interval interpretation:
For the variable IDRight, the confidence interval ranging from 61.0 (71.1-10.1) to 65.4 (72.9-7.5) of populist scores scale is expected to situate the true population parameter, across 95% of random samples.
For the variable gndrMale, the confidence interval ranging from 67.5 (69.4-1.9) to 70.5 (70.6-0.1) of populist scores scale is expected to situate the true population parameter, across 95% of random samples.
For the variable genBaby Boomers, the confidence interval ranging from 67.0 (64.8+2.2) to 71.6 (66.8+4.8) of populist scores scale is expected to situate the true population parameter, across 95% of random samples.
For the variable genGen X, the confidence interval ranging from 69.1 (64.8+4.3) to 73.7 (66.8+6.9) of populist scores scale is expected to situate the true population parameter, across 95% of random samples.
For the variable genMillennials, the confidence interval ranging from 69.2 (64.8+4.4) to 74.0 (66.8+7.2) of populist scores scale is expected to situate the true population parameter, across 95% of random samples.