Naccarato_Isabella_Homework

###Setting Up the Environment

# List of packages
packages <- c("tidyverse", "infer", "fst", "modelsummary", "effects", "survey", "interactions","broom") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: carData
## 
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
## 
## Loading required package: grid
## 
## Loading required package: Matrix
## 
## 
## Attaching package: 'Matrix'
## 
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## 
## Loading required package: survival
## 
## 
## Attaching package: 'survey'
## 
## 
## The following object is masked from 'package:graphics':
## 
##     dotchart
## 
## 
## Registered S3 methods overwritten by 'broom':
##   method            from  
##   tidy.glht         jtools
##   tidy.summary.glht jtools

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"     
## 
## [[4]]
##  [1] "modelsummary" "fst"          "infer"        "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "effects"      "carData"      "modelsummary" "fst"          "infer"       
##  [6] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [11] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [16] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [21] "methods"      "base"        
## 
## [[6]]
##  [1] "survey"       "survival"     "Matrix"       "grid"         "effects"     
##  [6] "carData"      "modelsummary" "fst"          "infer"        "lubridate"   
## [11] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [16] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [21] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [26] "base"        
## 
## [[7]]
##  [1] "interactions" "survey"       "survival"     "Matrix"       "grid"        
##  [6] "effects"      "carData"      "modelsummary" "fst"          "infer"       
## [11] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [16] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [21] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [26] "methods"      "base"        
## 
## [[8]]
##  [1] "broom"        "interactions" "survey"       "survival"     "Matrix"      
##  [6] "grid"         "effects"      "carData"      "modelsummary" "fst"         
## [11] "infer"        "lubridate"    "forcats"      "stringr"      "dplyr"       
## [16] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [21] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"

Task 1

Graph the histogram distribution using the populist scale (as we did in the tutorial) for the country of Sweden. What do you note?

MY ANSWER

# Read in the Sweden dataset:
sweden_data <- read.fst("sweden_data.fst")

# Populist scale (Norris/Inglehart use low levels of trust in parliaments, politicians and parties as an indicator of populism)

# Setting values greater than 10 to NA
sweden_data$trstplt <- ifelse(sweden_data$trstplt > 10, NA, sweden_data$trstplt)
sweden_data$trstprl <- ifelse(sweden_data$trstprl > 10, NA, sweden_data$trstprl)
sweden_data$trstprt <- ifelse(sweden_data$trstprt > 10, NA, sweden_data$trstprt)

# Creating and rescaling the trust variable
sweden_data$trust <- scales::rescale(sweden_data$trstplt + sweden_data$trstprl + sweden_data$trstprt, na.rm = TRUE, to = c(0, 100))

# Here's how you would "flip it" because populist is conceived by N + I as 'anti-trust'
# As Schafer explains there are some issues with that conceptualization, but N + I is also widely applied
sweden_data$populist <- scales::rescale(sweden_data$trust, na.rm = TRUE, to=c(100,0))

# Graphing the histogram distribution using the populist scale:
ggplot(sweden_data, aes(x = populist)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Populist Scale for Sweden",
       x = "Populist Scale",
       y = "Count")

## Warning: Removed 2503 rows containing non-finite values (`stat_bin()`).

Looking at the histogram distribution for Sweden, I see that there is one mode slightly less than 50 on the populist scale, as compared to Hungary and Spain that we saw in the tutorial had 2 modes at slightly less than 50 and 100. I also see that te graph seems to be slightly skewed to the right with high counts between 0-50 on the populist scale, and lower counts above 50 on the populist scale. This histogram looks very similar to the Norway histogram we saw in the tutorial.

Task 2

Run a linear regression with populist attitudes as the outcome, educational attainment (recoded as a binary “BA or more” and “No BA”), and using the Swedish data. Print the intercept and coefficients only and interpret. Then do a tidy model table displaying the p-value (with digits = 3) and interpret.

MY ANSWER

First set up the variables that we need, then run a linear regression:

# Recode educational attainment as a binary
sweden_data <- sweden_data %>%
  mutate(
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      TRUE ~ "No BA"
    ),
    # Handle NAs for education levels
    edulvla = ifelse(edulvla %in% c(77, 88, 99), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    
    # Explicitly making 'No BA' the reference category
    educ.ba = factor(educ.ba, levels = c("No BA", "BA or more"))
  )

# Check that the recoding worked as expected:
table(sweden_data$educ.ba)

## 
##      No BA BA or more 
##      13477       4739

# Run a linear regression
sweden_pop_model <- lm(populist ~ educ.ba, data = sweden_data)

# Print the intercept and coefficients only:
coefficients_sweden <- coef(sweden_pop_model)
print(coefficients_sweden)

##       (Intercept) educ.baBA or more 
##         51.358680         -7.868121

The intercept of 51.36 represents the average score on the populist scale for the reference group in Sweden which is the “No BA” group. The expected average populist scale score would be 51.36 for the “No BA” group since it is the reference category. Higher scores indicate stronger populist attitudes while lower scores indicate lower populist attitudes.

The educ.baBA or more coefficient of -7.86 associated with the “BA or more” group indicates that on average, individuals who have a BA or more have populist scale scores that are 7.96 points lower than those who have “No BA”. Comparing “BA or more” individuals to “No BA” individuals in Sweden, we expect to see the average populist scale scores to be 7.86 points lower for the “BA or more” group, suggesting that “No BA” individuals exhibit stronger populist attitudes than the “BA or more” individuals.

# Now do a tidy model table displaying p-value and interpret:
tidy(sweden_pop_model) |>
knitr::kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	51.359	0.181	283.363	0
educ.baBA or more	-7.868	0.350	-22.494	0

The p-value of 0 (we know it is a very small number close to 0) is very small indicating that it is statistically significant and we can say that there is a statistically significant association between populist attitudes and educational attainment.

Task 3

Now using the Italian dataset and the authoritarian values scale (as we did in the tutorial), graph the average by survey year. Interpret.

MY ANSWER

italy_data <- read.fst("italy_data.fst")

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Read the Italian dataset
italy_data <- read.fst("italy_data.fst")

# Create authoritarian values scale based on human modules items
italy_data <- italy_data %>%
  mutate(behave = ipbhprp,
         secure = impsafe,
         safety = ipstrgv,
         tradition = imptrad,
         rules = ipfrule) %>%
  mutate(across(c("behave", "secure", "safety", "tradition", "rules"),
                ~ na_if(.x, 7) %>% na_if(8) %>% na_if(9))) %>%
  # Apply reverse coding
  mutate(across(c("behave", "secure", "safety", "tradition", "rules"), ~ 7 - .x ))

# Now you can calculate 'schwartzauth' after the NA recoding
italy_data$auth <- scales::rescale(italy_data$behave + 
                      italy_data$secure + 
                      italy_data$safety + 
                      italy_data$tradition + 
                      italy_data$rules, to=c(0,100), na.rm=TRUE)


italy_data <- italy_data %>% filter(!is.na(auth))
table(italy_data$secure)

## 
##    1    2    3    4    5    6 
##   54  139  626 1856 2876 2881

table(italy_data$auth)

## 
##   0  12  16  20  24  28  32  36  40  44  48  52  56  60  64  68  72  76  80  84 
##   7   4   4   9  13  20  27  57 113 155 163 323 367 596 647 789 823 897 992 733 
##  88  92  96 100 
## 632 465 299 297

# Now doing a year-by-year average and plotting it:
italy_data$year <- NA
replacements <- c(2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020)
for(i in 1:10){
  italy_data$year[italy_data$essround == i] <- replacements[i]
}
auth_avg <- italy_data %>%
  group_by(year) %>%
  summarize(auth_avg = mean(auth, na.rm = TRUE))

plot_auth <- ggplot(auth_avg, aes(x = year, y = auth_avg)) +
  geom_point(aes(color = auth_avg), alpha = 1.0) + 
  geom_line(aes(group = 1), color = "blue", linetype = "dashed") +  
  labs(title = "Authoritarian Scale Average by Year (Portugal)",
       x = "Survey Year",
       y = "Authoritarian Scale Average") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(0, 100))

print(plot_auth)

This visualization shows that there are small flucuations in the authoritarian scale average by survey year. Starting from an average in survey year 2012 of 57, the average drops to around 74 by 2016, rising slightly to about 74.5 bewteen survey year 2018 and 2020.

Task 4

Now run a linear regression model, using the modelsummary package, with authoritarian values as the outcome, and generations as the predictor/potential explanatory variable, and using the Italian data again. Rename the coefficients using the coef_rename function. Interpret the coefficients, whether they are statistically significant, and the adjusted R-squared.

MY ANSWER

First need to set up the variables by recoding the generations predictor variable.

model_data <- italy_data %>%
  mutate(
    # Recoding cohort variable, setting years before 1930 and after 2000 to NA
    cohort = ifelse(yrbrn < 1930 | yrbrn > 2000, NA, yrbrn),

    # Recoding generational cohorts based on year of birth
    gen = case_when(
      yrbrn %in% 1900:1945 ~ "1",
      yrbrn %in% 1946:1964 ~ "2",
      yrbrn %in% 1965:1979 ~ "3",
      yrbrn %in% 1980:1996 ~ "4",
      TRUE ~ as.character(cohort)  # Keeping other values as character if they do not fit the ranges
    ),

    # Converting cohort to a factor with labels
    gen = factor(gen,
                 levels = c("1", "2", "3", "4"),
                 labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
  )
table(model_data$gen)

## 
##     Interwar Baby Boomers        Gen X  Millennials 
##         1026         2580         2214         1811

# Next, run a linear regression model and get the summary outputs using the modelsummary package, and rename the coefficients:
model <- lm(auth ~ gen, data = model_data)
modelsummary(
  list(model),coef_rename = c("genBaby Boomers" = "Baby Boomers", "genGen X" = "Gen X", "genMillennials" = "Millennials"))

	(1)
(Intercept)	75.345
	(0.467)
Baby Boomers	−1.125
	(0.552)
Gen X	−2.080
	(0.564)
Millennials	−3.592
	(0.584)
Num.Obs.	7631
R2	0.006
R2 Adj.	0.006
AIC	62934.0
BIC	62968.7
Log.Lik.	−31462.009
RMSE	14.94

Coefficients and statistical significance interpretation: Baby Boomers - Compared to the Interwar generation which is the reference category that is omitted, Baby Boomers exhibit scores that are, on average, 1.125 points lower on the authoritarian values scale (0-100). There is no triple asterisks for this coefficient indicating that we do not have a very low p-value below our 0.05 alpha threshold, and therefore we cannot say that this is statistically significant. Gen X - Compared to the Interwar generation, Gen X exhibits scores that are, on average, 2.080 points lower on the authoritarian values scale. This is a larger decrease than was observed for the Baby Boomers generation compared to the Interwar generation. Again, there is no triple asterisks for this coefficient indicating that we do not have a very low p-value below our 0.05 alpha threshold, and therefore we cannot say that this is statistically significant. Millennials - Compared to the Interwar generation, Millenials exhibits scores that are, on average, 3.592 points lower on the authoritarian values scale. This is a larger decrease than was observed for both the Baby Boomers generation and the Gen X generation compared to the Interwar generation. Again, there is no triple asterisks for this coefficient indicating that we do not have a very low p-value below our 0.05 alpha threshold, and therefore we cannot say that this is statistically significant.

Adjusted R squared interpretaion. From this summary output, we observe that the adjusted R squared value is 0.006. This small value means that the ‘gen’ variable explains approximately 0.6% of the variance in authoritarian values scales for Italy, after accounting for the number of predictors which is 1 (taking into account number of predictors is used in adjusted R squared but not R squared). This suggests that this model only explains a very small portion of the variability in authoritarian values and that there are likely other factors outside of generational differences that may play a significant role in shaping these attitudes. It also makes sense that the adjusted R squared value is very low for this model, considering the fact that none of the generation variables were statistically significant.

Task 5

Now visualize, using the modelplot() function from the modelsummary package, coefficient estimates and 95% confidence intervals for the following three models:

Model 1 = populist attitudes scale as the outcome; left/right as the single predictor (omitting moderates as we did in the tutorial), and using data for Greece.

Model 2 = populist attitudes scale as the outcome; gender as the single predictor, and using data for Greece.

Model 3 = populist attitudes scale as the outcome; generations as the single predictor (using the four generational category recode we did in the tutorial), and using data for Greece. Interpret.

MY ANSWER

Create Model 1

# Read in the Sweden dataset:
greece_data <- read.fst("greece_data.fst")

# Populist scale (Norris/Inglehart use low levels of trust in parliaments, politicians and parties as an indicator of populism)

# Setting values greater than 10 to NA
greece_data$trstplt <- ifelse(greece_data$trstplt > 10, NA, greece_data$trstplt)
greece_data$trstprl <- ifelse(greece_data$trstprl > 10, NA, greece_data$trstprl)
greece_data$trstprt <- ifelse(greece_data$trstprt > 10, NA, greece_data$trstprt)

# Creating and rescaling the trust variable
greece_data$trust <- scales::rescale(greece_data$trstplt + greece_data$trstprl + greece_data$trstprt, na.rm = TRUE, to = c(0, 100))

# Here's how you would "flip it" because populist is conceived by N + I as 'anti-trust'
# As Schafer explains there are some issues with that conceptualization, but N + I is also widely applied
greece_data$populist <- scales::rescale(greece_data$trust, na.rm = TRUE, to=c(100,0))

# Recode lrscale variable
greece_data <- greece_data %>%
  mutate(
    ID = case_when(
      lrscale %in% 0:4 ~ "Left",                     # This time: Values 0 to 3 in 'lrscale' are categorized as "Left"
      lrscale %in% 6:10 ~ "Right",                   # This time: Values 6 to 10 in 'lrscale' are categorized as "Right"
      lrscale %in% c(77, 88, 99) ~ NA_character_,    # Values 77, 88, 99 in 'lrscale' are set as NA
      TRUE ~ NA_character_                           # Any other values not explicitly matched above
    )
  )

greece_lr_model <- lm(populist ~ ID, data = greece_data)

Create Model 2

# Recode gender
greece_data <- greece_data %>%
  mutate(
    # Recoding gender
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      gndr == 9 ~ NA_character_,
      TRUE ~ as.character(gndr)
    ))

greece_gndr_model <- lm(populist ~ gndr, data = greece_data)

Create Model 3

greece_data <- greece_data %>%
  mutate(
    # Recoding cohort variable, setting years before 1930 and after 2000 to NA
    cohort = ifelse(yrbrn < 1930 | yrbrn > 2000, NA, yrbrn),

    # Recoding generational cohorts based on year of birth
    gen = case_when(
      yrbrn %in% 1900:1945 ~ "1",
      yrbrn %in% 1946:1964 ~ "2",
      yrbrn %in% 1965:1979 ~ "3",
      yrbrn %in% 1980:1996 ~ "4",
      TRUE ~ as.character(cohort)  # Keeping other values as character if they do not fit the ranges
    ),

    # Converting cohort to a factor with labels
    gen = factor(gen,
                 levels = c("1", "2", "3", "4"),
                 labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
  )
table(greece_data$gen)

## 
##     Interwar Baby Boomers        Gen X  Millennials 
##         2978         3484         3498         2330

greece_gen_model <- lm(populist ~ gen, data = greece_data)

Now visualize coefficients estimates and 95% confidence intervals for the 3 models using the modelplot() function:

models <- list(
   "Model 1" = lm(populist ~ ID, data = greece_data),
   "Model 2" = lm(populist ~ gndr, data = greece_data),
   "Model 3" = lm(populist ~ gen, data = greece_data))
modelsummary(models, fmt = 1,
  estimate  = c("{estimate} ({std.error}){stars}",
                "{estimate} ({std.error}){stars}",
                "{estimate} ({std.error}){stars}"),
  statistic = NULL)

	Model 1	Model 2	Model 3
(Intercept)	72.0 (0.5)***	70.0 (0.3)***	65.8 (0.5)***
IDRight	−8.8 (0.6)***
gndrMale		−1.0 (0.5)*
genBaby Boomers			3.5 (0.7)***
genGen X			5.6 (0.7)***
genMillennials			5.8 (0.7)***
Num.Obs.	4936	9823	9562
R2	0.037	0.000	0.009
R2 Adj.	0.036	0.000	0.009
AIC	44723.8	88873.2	86473.9
BIC	44743.3	88894.8	86509.7
Log.Lik.	−22358.894	−44433.590	−43231.944
RMSE	22.44	22.30	22.25

modelplot(models, coef_omit = 'Interc')

The lrscale model shows statistical significance (as there are 3 asterisks beside the coefficient), with right-winged (relative to the reference category of left-winged) holding a score, on average, which is 8.8 points lower on the populist scale. Not much of the variance in the populist scales values for Greece is explained by this model though as the adjusted R squared value is only 3.6%. However, this model explains more of the variance than Models 2 and 3. Looking at the output for the gender model, you can see that Male relative to Female is not statistically singificant as there is only one asterisk beside the coefficient. This model also has an adjusted R squared value of 0 so none of the variance in the populist scales values for Greece is explained by this model. The generations model shows statistical significance (as there are 3 asterisks beside the 3 generation coefficients), with Baby Boomers (relative to the reference category of Interwar) holding a score, on average, which is 3.5 points higher on the populist scale, GenX having a score, on average, which is 5.6 points higher on the populist scale, and Millenials having a score, on average, which is 5.8 points higher on the populist scale. In this model, 0.9% of the variation in the populist scales values for Greece is explained, given by the adjusted R squared value.

The plot of coefficient estiamates and 95% confidence intervals is shown above. This shows us the value of the coefficient, given by the dot, and the confidence interval that we would expect the esimate to be in 95% of the time. The three generations coefficients are above 0, with the Millenials coefficient beint the largest as seen in the model summary. The gender coefficient is negative and closer to 0 than the IDRight coefficient which is also negative but has a large value.

Naccarato_Isabella_Homework_4

Isabella Naccarato

2024-03-18

Task 1

MY ANSWER

Task 2

MY ANSWER

Task 3

MY ANSWER

Task 4

MY ANSWER

Task 5

MY ANSWER