getwd()
## [1] "/Users/hayaouri/Desktop/SOC222 R"
rm(list=ls()); gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 529512 28.3 1178182 63 NA 669417 35.8
## Vcells 977455 7.5 8388608 64 16384 1851787 14.2
# List of packages
packages <- c("tidyverse", "broom", "infer", "fst", "modelsummary", "effects", "survey", "interactions") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: carData
##
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
##
## Loading required package: grid
##
## Loading required package: Matrix
##
##
## Attaching package: 'Matrix'
##
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
##
## Loading required package: survival
##
##
## Attaching package: 'survey'
##
##
## The following object is masked from 'package:graphics':
##
## dotchart
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "broom" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "infer" "broom" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
##
## [[4]]
## [1] "fst" "infer" "broom" "lubridate" "forcats" "stringr"
## [7] "dplyr" "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [13] "tidyverse" "stats" "graphics" "grDevices" "utils" "datasets"
## [19] "methods" "base"
##
## [[5]]
## [1] "modelsummary" "fst" "infer" "broom" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[6]]
## [1] "effects" "carData" "modelsummary" "fst" "infer"
## [6] "broom" "lubridate" "forcats" "stringr" "dplyr"
## [11] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [16] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [21] "datasets" "methods" "base"
##
## [[7]]
## [1] "survey" "survival" "Matrix" "grid" "effects"
## [6] "carData" "modelsummary" "fst" "infer" "broom"
## [11] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [16] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [21] "stats" "graphics" "grDevices" "utils" "datasets"
## [26] "methods" "base"
##
## [[8]]
## [1] "interactions" "survey" "survival" "Matrix" "grid"
## [6] "effects" "carData" "modelsummary" "fst" "infer"
## [11] "broom" "lubridate" "forcats" "stringr" "dplyr"
## [16] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [21] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [26] "datasets" "methods" "base"
Graph the histogram distribution using the populist scale (as we did in the tutorial) for the country of Sweden. What do you note?
sweden_data <- read.fst("sweden_data.fst")
# Populist scale (Norris/Inglehart use low levels of trust in parliaments, politicians and parties as an indicator of populism)
# Note: Schafer does it 2 ways, we will focus now on the most straightforward
# Setting values greater than 10 to NA
sweden_data$trstplt <- ifelse(sweden_data$trstplt > 10, NA, sweden_data$trstplt)
sweden_data$trstprl <- ifelse(sweden_data$trstprl > 10, NA, sweden_data$trstprl)
sweden_data$trstprt <- ifelse(sweden_data$trstprt > 10, NA, sweden_data$trstprt)
# Creating and rescaling the trust variable
sweden_data$trust <- scales::rescale(sweden_data$trstplt + sweden_data$trstprl + sweden_data$trstprt, na.rm = TRUE, to = c(0, 100))
# Here's how you would "flip it" because populist is conceived by N + I as 'anti-trust'
# As Schafer explains there are some issues with that conceptualization, but N + I is also widely applied
sweden_data$populist <- scales::rescale(sweden_data$trust, na.rm = TRUE, to=c(100,0))
ggplot(sweden_data, aes(x = populist)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Populist Scale for Sweden",
x = "Populist Scale",
y = "Count")
## Warning: Removed 2503 rows containing non-finite values (`stat_bin()`).
Sweden graph shows a trend of majority of people being left wing
populist. The mode of the histogram distribution is less than 50, and it
appears to be significantly skewed to the right. The majority of people
in Sweden have low levels of trust in parliaments, politicians, and
political parties because to low populist scales. Additionally, there
are fewer counts for 0 to 25 and 75 to 100 in the populist scale than
for 25 to 75. However, in the popular scale, 25 to 50 has more counts
than 50 to 75.
Run a linear regression with populist attitudes as the outcome, educational attainment (recoded as a binary “BA or more” and “No BA”), and using the Swedish data. Print the intercept and coefficients only and interpret. Then do a tidy model table displaying the p-value (with digits = 3) and interpret.
sweden_data <- read.fst("sweden_data.fst")
# Populist scale (Norris/Inglehart use low levels of trust in parliaments, politicians and parties as an indicator of populism)
# Note: Schafer does it 2 ways, we will focus now on the most straightforward
# Setting values greater than 10 to NA
sweden_data$trstplt <- ifelse(sweden_data$trstplt > 10, NA, sweden_data$trstplt)
sweden_data$trstprl <- ifelse(sweden_data$trstprl > 10, NA, sweden_data$trstprl)
sweden_data$trstprt <- ifelse(sweden_data$trstprt > 10, NA, sweden_data$trstprt)
# Creating and rescaling the trust variable
sweden_data$trust <- scales::rescale(sweden_data$trstplt + sweden_data$trstprl + sweden_data$trstprt, na.rm = TRUE, to = c(0, 100))
# Here's how you would "flip it" because populist is conceived by N + I as 'anti-trust'
# As Schafer explains there are some issues with that conceptualization, but N + I is also widely applied
sweden_data$populist <- scales::rescale(sweden_data$trust, na.rm = TRUE, to=c(100,0))
sweden_data <- sweden_data %>%
mutate(
# Recoding education
educ.ba = case_when(
essround < 5 & edulvla == 5 ~ "BA or more",
essround >= 5 & edulvlb > 600 ~ "BA or more",
TRUE ~ "No BA"
),
# Handle NAs for education levels
edulvla = ifelse(edulvla %in% c(77, 88, 99), NA_integer_, edulvla),
edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
# Explicitly making 'No BA' the reference category
educ.ba = factor(educ.ba, levels = c("No BA", "BA or more")),
)
sweden_pop_model <- lm(populist ~ educ.ba, data = sweden_data)
coefficients_sweden <- coef(sweden_pop_model)
print(coefficients_sweden)
## (Intercept) educ.baBA or more
## 51.358680 -7.868121
Interpretation: Intercept (51.4): this number is the populist scale’s average score for the Swedish reference group, which consists of individuals who have “No BA.” Simplified, those with lower educational attainment should expect an average populist scale score of 51.4. Higher scores denote more strongly held populist views. This score represents a certain degree of populism. Coefficient (-7.9): According to the negative coefficient linked to the “BA or more” category, individuals in Sweden who have completed their bachelor’s degree or more often score 7.9 points lower on the populist scale than those who have not. The p-value in the neat model table that shows the p-value is 0 for both “No BA” and “BA or more.” It suggests that there is a statistically significant relationship between the Populist scale and educational level in the Swedish data.
Now using the Italian dataset and the authoritarian values scale (as we did in the tutorial), graph the average by survey year. Interpret.
df <- read.fst("italy_data.fst")
## Creates an authoritarian values scale based on human modules items (can also do libertarian value scale)
df <- df %>%
mutate(behave = ipbhprp,
secure = impsafe,
safety = ipstrgv,
tradition = imptrad,
rules = ipfrule) %>%
mutate(across(c("behave", "secure", "safety", "tradition", "rules"),
~ na_if(.x, 7) %>% na_if(8) %>% na_if(9))) %>%
# Apply the reverse coding
mutate(across(c("behave", "secure", "safety", "tradition", "rules"), ~ 7 - .x ))
# Now you can calculate 'schwartzauth' after the NA recoding
df$auth <- scales::rescale(df$behave +
df$secure +
df$safety +
df$tradition +
df$rules, to=c(0,100), na.rm=TRUE)
df <- df %>% filter(!is.na(auth))
table(df$secure)
##
## 1 2 3 4 5 6
## 54 139 626 1856 2876 2881
table(df$auth)
##
## 0 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80 84
## 7 4 4 9 13 20 27 57 113 155 163 323 367 596 647 789 823 897 992 733
## 88 92 96 100
## 632 465 299 297
df$year <- NA
replacements <- c(2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020)
for(i in 1:10){
df$year[df$essround == i] <- replacements[i]
}
auth_avg <- df %>%
group_by(year) %>%
summarize(auth_avg = mean(auth, na.rm = TRUE))
plot_auth <- ggplot(auth_avg, aes(x = year, y = auth_avg)) +
geom_point(aes(color = auth_avg), alpha = 1.0) +
geom_line(aes(group = 1), color = "blue", linetype = "dashed") +
labs(title = "Authoritarian Scale Average by Year (Italy)",
x = "Survey Year",
y = "Authoritarian Scale Average") +
theme_minimal() +
theme(legend.position = "none") +
scale_y_continuous(limits = c(0, 100))
print(plot_auth)
Italy’s average exhibits only little variations throughout time, staying
largely constant at 75. The mean score on the authoritarian scale has
seen little fluctuation between 2002 and 2020, suggesting that the
general public’s sentiments toward authoritarianism have remained stable
during this time. Even though there might have been some political or
social shifts during these years, the average score of 75 indicates that
the Italian people generally have an authoritarian mentality.
Now run a linear regression model, using the modelsummary package, with authoritarian values as the outcome, and generations as the predictor/potential explanatory variable, and using the Italian data again. Rename the coefficients using the coef_rename function. Interpret the coefficients, whether they are statistically significant, and the adjusted R-squared.
df <- read.fst("italy_data.fst")
## Creates an authoritarian values scale based on human modules items (can also do libertarian value scale)
df <- df %>%
mutate(behave = ipbhprp,
secure = impsafe,
safety = ipstrgv,
tradition = imptrad,
rules = ipfrule) %>%
mutate(across(c("behave", "secure", "safety", "tradition", "rules"),
~ na_if(.x, 7) %>% na_if(8) %>% na_if(9))) %>%
# Apply the reverse coding
mutate(across(c("behave", "secure", "safety", "tradition", "rules"), ~ 7 - .x ))
# Now you can calculate 'schwartzauth' after the NA recoding
df$auth <- scales::rescale(df$behave +
df$secure +
df$safety +
df$tradition +
df$rules, to=c(0,100), na.rm=TRUE)
df <- df %>% filter(!is.na(auth))
table(df$secure)
##
## 1 2 3 4 5 6
## 54 139 626 1856 2876 2881
model_data <- df %>%
mutate(
# Recoding cohort variable, setting years before 1930 and after 2000 to NA
cohort = ifelse(yrbrn < 1930 | yrbrn > 2000, NA, yrbrn),
# Recoding generational cohorts based on year of birth
gen = case_when(
yrbrn %in% 1900:1945 ~ "1",
yrbrn %in% 1946:1964 ~ "2",
yrbrn %in% 1965:1979 ~ "3",
yrbrn %in% 1980:1996 ~ "4",
TRUE ~ as.character(cohort) # Keeping other values as character if they do not fit the ranges
),
# Converting cohort to a factor with labels
gen = factor(gen,
levels = c("1", "2", "3", "4"),
labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
)
table(model_data$gen)
##
## Interwar Baby Boomers Gen X Millennials
## 1026 2580 2214 1811
model1 <- lm(auth ~ gen, data = model_data)
modelsummary(
list(model1),
fmt = 1,
estimate = c( "{estimate} ({std.error}){stars}"),
statistic = NULL,
coef_rename = c("genBaby Boomers" = "Baby Boomers", "genGen X" = "Gen X", "genMillennials" = "Millennials"),
coef_omit = "Intercept")
| (1) | |
|---|---|
| Baby Boomers | −1.1 (0.6)* |
| Gen X | −2.1 (0.6)*** |
| Millennials | −3.6 (0.6)*** |
| Num.Obs. | 7631 |
| R2 | 0.006 |
| R2 Adj. | 0.006 |
| AIC | 62934.0 |
| BIC | 62968.7 |
| Log.Lik. | −31462.009 |
| RMSE | 14.94 |
On the authoritarian values scale, Baby Boomers exhibit an average decline of 1.1 points, with statistical significance (p < 0.001), when compared to the Interwar generation. Millennials have the biggest decline of 3.6 points, while Generation X shows an even more marked fall of 2.1 points. The statistical significance of these differences (p < 0.001) indicates that Millennials deviate from the Interwar generation in terms of authoritarian ideals the most. Nevertheless, the corrected R-squared value of 0.006 indicates that only approximately 0.6% of the variability in authoritarian values in Italy can be explained by the “gen” variable, which represents the four generations. This suggests that these opinions are highly influenced by causes other than generational differences.
Now visualize, using the modelplot() function from the modelsummary package, coefficient estimates and 95% confidence intervals for the following three models:
Model 1 = populist attitudes scale as the outcome; left/right as the single predictor (omitting moderates as we did in the tutorial), and using data for Greece.
Model 2 = populist attitudes scale as the outcome; gender as the single predictor, and using data for Greece.
Model 3 = populist attitudes scale as the outcome; generations as the single predictor (using the four generational category recode we did in the tutorial), and using data for Greece. Interpret.
greece_data <- read.fst("greece_data.fst")
# Populist scale (Norris/Inglehart use low levels of trust in parliaments, politicians and parties as an indicator of populism)
# Note: Schafer does it 2 ways, we will focus now on the most straightforward
# Setting values greater than 10 to NA
greece_data$trstplt <- ifelse(greece_data$trstplt > 10, NA, greece_data$trstplt)
greece_data$trstprl <- ifelse(greece_data$trstprl > 10, NA, greece_data$trstprl)
greece_data$trstprt <- ifelse(greece_data$trstprt > 10, NA, greece_data$trstprt)
# Creating and rescaling the trust variable
greece_data$trust <- scales::rescale(greece_data$trstplt + greece_data$trstprl + greece_data$trstprt, na.rm = TRUE, to = c(0, 100))
# Here's how you would "flip it" because populist is conceived by N + I as 'anti-trust'
# As Schafer explains there are some issues with that conceptualization, but N + I is also widely applied
greece_data$populist <- scales::rescale(greece_data$trust, na.rm = TRUE, to=c(100,0))
model_data <- greece_data %>%
mutate(
# Recoding gender
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
gndr == 9 ~ NA_character_,
TRUE ~ as.character(gndr)
),
lrscale = case_when(
lrscale %in% 0:4 ~ "Left", # This time: Values 0 to 3 in 'lrscale' are categorized as "Left"
lrscale %in% 6:10 ~ "Right", # This time: Values 6 to 10 in 'lrscale' are categorized as "Right"
lrscale %in% c(77, 88, 99) ~ NA_character_, # Values 77, 88, 99 in 'lrscale' are set as NA
TRUE ~ NA_character_ # Any other values not explicitly matched above
),
# Recoding cohort variable, setting years before 1930 and after 2000 to NA
cohort = ifelse(yrbrn < 1930 | yrbrn > 2000, NA, yrbrn),
# Recoding generational cohorts based on year of birth
gen = case_when(
yrbrn %in% 1900:1945 ~ "1",
yrbrn %in% 1946:1964 ~ "2",
yrbrn %in% 1965:1979 ~ "3",
yrbrn %in% 1980:1996 ~ "4",
TRUE ~ as.character(cohort) # Keeping other values as character if they do not fit the ranges
),
# Converting cohort to a factor with labels
gen = factor(gen,
levels = c("1", "2", "3", "4"),
labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
)
table(model_data$gen)
##
## Interwar Baby Boomers Gen X Millennials
## 2978 3484 3498 2330
model1 <- lm(populist ~ lrscale, data = model_data)
model2 <- lm(populist ~ gndr, data = model_data)
model3 <- lm(populist ~ gen, data = model_data)
models <- list(
"Model 1" = lm(populist ~ lrscale, data = model_data),
"Model 2" = lm(populist ~ gndr, data = model_data),
"Model 3" = lm(populist ~ gen, data = model_data))
modelsummary(models, fmt = 1,
estimate = c("{estimate} ({std.error}){stars}",
"{estimate} ({std.error}){stars}",
"{estimate} ({std.error}){stars}"),
statistic = NULL)
| Model 1 | Model 2 | Model 3 | |
|---|---|---|---|
| (Intercept) | 72.0 (0.5)*** | 70.0 (0.3)*** | 65.8 (0.5)*** |
| lrscaleRight | −8.8 (0.6)*** | ||
| gndrMale | −1.0 (0.5)* | ||
| genBaby Boomers | 3.5 (0.7)*** | ||
| genGen X | 5.6 (0.7)*** | ||
| genMillennials | 5.8 (0.7)*** | ||
| Num.Obs. | 4936 | 9823 | 9562 |
| R2 | 0.037 | 0.000 | 0.009 |
| R2 Adj. | 0.036 | 0.000 | 0.009 |
| AIC | 44723.8 | 88873.2 | 86473.9 |
| BIC | 44743.3 | 88894.8 | 86509.7 |
| Log.Lik. | −22358.894 | −44433.590 | −43231.944 |
| RMSE | 22.44 | 22.30 | 22.25 |
modelplot(models, coef_omit = 'Interc')
Coefficients Interpretation: genBaby Boomers (3.5): Baby Boomers have
scores that are, on average, 3.5 points higher on the authoritarian
values scale (0-100) than the Interwar generation (the reference
category that is removed). A very low p-value (below our alpha criterion
of 0.05) indicates statistical significance, indicated by the triple
asterisks (). genGen X (5.6): On the authoritarian values
scale, Generation X exhibits an even more marked increase, scoring 5.3
points higher than the Interwar generation. Once more, statistical
significance at a very low p-value (below our alpha criterion of 0.05)
is shown by the triple asterisks (). Millennials (5.8):
Millennials’ scores are 5.8 points higher than the Interwar
generation’s, indicating the largest increase in authoritarian
attitudes. When it comes to authoritarian principles, Millennials depart
from the Interwar generation the least when compared to one another, as
evidenced by this notable and statistically significant difference.
Adjusted R-Squared (Adj. R² = 0.009): After taking into account the number of predictors (in this example, just one), the adjusted R-squared value of 0.009 indicates that the “gen” variable, with our four coded generations, explain roughly 0.9% of the variance in the authoritarian values scale for Greece. Statistically significant as it may be, this indicates that the model only partially explains the diversity in authoritarian views, suggesting that variables other than generational variations may be important in forming these beliefs.
Interpretation: There is no statistically significant difference between males and females. Additionally, the left/right model demonstrates statistical relevance, with those who choose to remain in the “Right” group (as opposed to the “Left” reference category) scoring, on average, 8.8 points lower on the authoritarian value scale (which ranges from 0-100). However, the biggest variance explained, at 3.6%, still remains. Confidence intervals provide ranges for true population paramteres across 95% of random samples for each predictor variable.