exercice3

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Note: Loading all required packages for Exercise 3.

packages <- c(
 "tidyverse",
 "sjPlot",
 "broom",
 "modelsummary",
 "gt",
 "knitr",
 "scales",
 "kableExtra",
 "flextable",
 "car",
 "fst"
)
# Install and load packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## Warning: package 'sjPlot' was built under R version 4.4.3

## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
##   backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
## 
## Revert to `kableExtra` for one session:
## 
##   options(modelsummary_factory_default = 'kableExtra')
##   options(modelsummary_factory_latex = 'kableExtra')
##   options(modelsummary_factory_html = 'kableExtra')
## 
## Silence this message forever:
## 
##   config_modelsummary(startup_message = FALSE)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

## Warning: package 'kableExtra' was built under R version 4.4.3

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

## Warning: package 'flextable' was built under R version 4.4.3

## 
## Attaching package: 'flextable'
## 
## The following objects are masked from 'package:kableExtra':
## 
##     as_image, footnote
## 
## The following object is masked from 'package:purrr':
## 
##     compose

## Warning: package 'car' was built under R version 4.4.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.4.3

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "sjPlot"    "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "broom"     "sjPlot"    "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"     
## 
## [[4]]
##  [1] "modelsummary" "broom"        "sjPlot"       "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "gt"           "modelsummary" "broom"        "sjPlot"       "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[6]]
##  [1] "knitr"        "gt"           "modelsummary" "broom"        "sjPlot"      
##  [6] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [11] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [16] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [21] "methods"      "base"        
## 
## [[7]]
##  [1] "scales"       "knitr"        "gt"           "modelsummary" "broom"       
##  [6] "sjPlot"       "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "kableExtra"   "scales"       "knitr"        "gt"           "modelsummary"
##  [6] "broom"        "sjPlot"       "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[9]]
##  [1] "flextable"    "kableExtra"   "scales"       "knitr"        "gt"          
##  [6] "modelsummary" "broom"        "sjPlot"       "lubridate"    "forcats"     
## [11] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [16] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [21] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[10]]
##  [1] "car"          "carData"      "flextable"    "kableExtra"   "scales"      
##  [6] "knitr"        "gt"           "modelsummary" "broom"        "sjPlot"      
## [11] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [16] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [21] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [26] "methods"      "base"        
## 
## [[11]]
##  [1] "fst"          "car"          "carData"      "flextable"    "kableExtra"  
##  [6] "scales"       "knitr"        "gt"           "modelsummary" "broom"       
## [11] "sjPlot"       "lubridate"    "forcats"      "stringr"      "dplyr"       
## [16] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [21] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"

Task 1: Life Satisfaction in Atlantic Canada Regions

Data Preparation

Note: Reading the data set chs.

chs_data <- read_csv("CHS2021ECL_PUMF.csv",show_col_types = FALSE)
head(chs_data)

## # A tibble: 6 × 109
##   PUMFID EHA_10 EHA_10A EHA_10B EHA_25 DWS_05A DWI_05A DWI_05B DWI_05C DWI_05D
##    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  63501      3       6       6      2       3       2       2       2       2
## 2  63502      2       6       2      2       3       2       2       2       1
## 3  63503      2       6       2      2       2       2       2       2       2
## 4  63504      5       6       6      2       3       2       2       2       2
## 5  63505      3       6       6      2       1       2       2       2       2
## 6  63506      3       6       6      2       3       2       2       2       2
## # ℹ 99 more variables: NES_05A <dbl>, NSC_30A <dbl>, NSC_30B <dbl>,
## #   NSC_30C <dbl>, NEI_05A <dbl>, NEI_05B <dbl>, NEI_05C <dbl>, NEI_05D <dbl>,
## #   NEI_05E <dbl>, NEI_05F <dbl>, NEI_05G <dbl>, NEI_05H <dbl>, NEI_05I <dbl>,
## #   WSA_05 <dbl>, SDH_05 <dbl>, CER_05 <dbl>, CER_20 <dbl>, LIS_10 <dbl>,
## #   COS_10 <dbl>, COS_15 <dbl>, GH_05 <dbl>, GH_10 <dbl>, REGION <chr>,
## #   PAGEGR1 <dbl>, PAGEGR2 <dbl>, PAGEGR3 <dbl>, PAGEGR4 <dbl>, PAGEP1 <dbl>,
## #   PCER_10 <dbl>, PCER_15 <dbl>, PCHN <dbl>, PCOS_05 <dbl>, PDCLASS <dbl>, …

Note: Filtering the data to look at just Atlantic Canada regions.

atlantic_data <- chs_data %>%
  filter(PGEOGR %in% c('01','02','03','04','05','06'))
head(atlantic_data)

## # A tibble: 6 × 109
##   PUMFID EHA_10 EHA_10A EHA_10B EHA_25 DWS_05A DWI_05A DWI_05B DWI_05C DWI_05D
##    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  63501      3       6       6      2       3       2       2       2       2
## 2  63510      4       6       6      2       3       2       2       2       2
## 3  63513      3       6       6      2       3       2       2       2       2
## 4  63537      4       6       6      1       1       2       2       2       2
## 5  63541      4       6       6      2       3       2       2       2       2
## 6  63543      4       6       6      2       3       2       2       2       2
## # ℹ 99 more variables: NES_05A <dbl>, NSC_30A <dbl>, NSC_30B <dbl>,
## #   NSC_30C <dbl>, NEI_05A <dbl>, NEI_05B <dbl>, NEI_05C <dbl>, NEI_05D <dbl>,
## #   NEI_05E <dbl>, NEI_05F <dbl>, NEI_05G <dbl>, NEI_05H <dbl>, NEI_05I <dbl>,
## #   WSA_05 <dbl>, SDH_05 <dbl>, CER_05 <dbl>, CER_20 <dbl>, LIS_10 <dbl>,
## #   COS_10 <dbl>, COS_15 <dbl>, GH_05 <dbl>, GH_10 <dbl>, REGION <chr>,
## #   PAGEGR1 <dbl>, PAGEGR2 <dbl>, PAGEGR3 <dbl>, PAGEGR4 <dbl>, PAGEP1 <dbl>,
## #   PCER_10 <dbl>, PCER_15 <dbl>, PCHN <dbl>, PCOS_05 <dbl>, PDCLASS <dbl>, …

Note: Filtering out the NA data.

atlantic_data <- atlantic_data %>%
  filter(!is.na(PLIS_05) & PLIS_05 != 99) 
sum(is.na(atlantic_data))

## [1] 0

Note: creating new factor variable ‘Region’ with descriptive region names for Atlantic Canada.

atlantic_data <- atlantic_data %>%
  mutate(Region = factor(PGEOGR,
                         levels = c('01', '02', '03', '04','05','06'),
                         labels = c("Newfoundland and Labrador",
                                    "Prince Edward Island",
                                    "Halifax",
                                    "Nova Scotia",
                                    "Saint John and Moncton  ",
                                    "New Brunswick")))
atlantic_data

## # A tibble: 11,177 × 110
##    PUMFID EHA_10 EHA_10A EHA_10B EHA_25 DWS_05A DWI_05A DWI_05B DWI_05C DWI_05D
##     <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  63501      3       6       6      2       3       2       2       2       2
##  2  63510      4       6       6      2       3       2       2       2       2
##  3  63513      3       6       6      2       3       2       2       2       2
##  4  63537      4       6       6      1       1       2       2       2       2
##  5  63541      4       6       6      2       3       2       2       2       2
##  6  63543      4       6       6      2       3       2       2       2       2
##  7  63544      2       6       1      2       3       2       2       2       2
##  8  63547      3       6       6      2       1       2       2       2       2
##  9  63549      3       6       6      2       3       2       2       2       2
## 10  63550      2       6       1      2       3       2       1       2       2
## # ℹ 11,167 more rows
## # ℹ 100 more variables: NES_05A <dbl>, NSC_30A <dbl>, NSC_30B <dbl>,
## #   NSC_30C <dbl>, NEI_05A <dbl>, NEI_05B <dbl>, NEI_05C <dbl>, NEI_05D <dbl>,
## #   NEI_05E <dbl>, NEI_05F <dbl>, NEI_05G <dbl>, NEI_05H <dbl>, NEI_05I <dbl>,
## #   WSA_05 <dbl>, SDH_05 <dbl>, CER_05 <dbl>, CER_20 <dbl>, LIS_10 <dbl>,
## #   COS_10 <dbl>, COS_15 <dbl>, GH_05 <dbl>, GH_10 <dbl>, REGION <chr>,
## #   PAGEGR1 <dbl>, PAGEGR2 <dbl>, PAGEGR3 <dbl>, PAGEGR4 <dbl>, PAGEP1 <dbl>, …

Note: Summarizing the data and grouping Atlantic regions to find the average life satisfaction and the median of how satisfied people are across Atlantic Canada.

summary_by_region <- atlantic_data %>%
  group_by(Region) %>%
  summarise(
    Avg_Life_Satisfaction = mean(PLIS_05, na.rm = TRUE),
    Median_Life_Satisfaction = median(PLIS_05, na.rm = TRUE),
    Count = n()
  )
summary_by_region

## # A tibble: 6 × 4
##   Region                      Avg_Life_Satisfaction Median_Life_Satisfac…¹ Count
##   <fct>                                       <dbl>                  <dbl> <int>
## 1 "Newfoundland and Labrador"                  6.69                      7  2425
## 2 "Prince Edward Island"                       6.77                      7  1707
## 3 "Halifax"                                    6.25                      7  1084
## 4 "Nova Scotia"                                6.62                      7  1962
## 5 "Saint John and Moncton  "                   6.52                      7  2018
## 6 "New Brunswick"                              6.71                      7  1981
## # ℹ abbreviated name: ¹Median_Life_Satisfaction

Visualization with Boxplot

Note: Using a box plot to look at how life satisfaction is distributed across Atlantic Canada.

ggplot(atlantic_data, aes(x = Region, y = PLIS_05, fill = Region)) +
  geom_boxplot() +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate and enlarge x-axis labels
    axis.text.y = element_text(size = 12),  # Increase font size for y-axis labels
    plot.title = element_text(face = "bold", size = 16),  # Bold and enlarge title
    legend.position = "none"
  ) +
  labs(
    x = "Region", 
    y = "Life satisfaction", 
    title = "Distribution of Life Satisfaction by Region"
  )

Descriptive Analysis:

The box pot named “Distribution of Life Satisfaction by Region” gives us insight on individuals, life satisfaction across Atlanta, Canada. The six regions are Newfoundland And Labrador, Prince Edward Island, Nova Scotia, Saint John and Moncton, and New Brunswick . they are all individually colour coated to tell the difference between each Atlantic region

When looking at the box plot, it seems the median life satisfaction scores are very similar. Most regions life satisfaction falls around 7 to 7.5 however, Newfoundland and New Brunswick‘s median scores are higher in comparison to other regions the scoring would suggest that residents living in these areas are generally satisfied or happy with their lives.

The IQ are or enter qualitative range shows that an all regions third typical score for their level of life satisfaction is six there can be outliers seen showing as dots on the bottom of the whiskers. These outliers show that there is lower life satisfaction in some regions regions like Prince Edward Island in Nova Scotia, have the most eye out lairs. This means there is more variety in peoples answers on life satisfaction.

In general, the box plot shows, life satisfaction levels are mostly consistent throughout Atlanta, Canada. There is many factors as to why there is variation between regions, such as economic or social further research could be done to look into what causes these different answers.

Regression Analysis

Note: Creating simple regression model with life satisfaction as outcome and region as the predictor in Atlantic_data data set.

model_atlantic <- lm(PLIS_05 ~ Region, data = atlantic_data)

summary(model_atlantic)

## 
## Call:
## lm(formula = PLIS_05 ~ Region, data = atlantic_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7721 -0.7721  0.3052  1.4802  2.7500 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.69485    0.04166 160.702  < 2e-16 ***
## RegionPrince Edward Island      0.07727    0.06482   1.192  0.23323    
## RegionHalifax                  -0.44485    0.07495  -5.935 3.03e-09 ***
## RegionNova Scotia              -0.07711    0.06230  -1.238  0.21582    
## RegionSaint John and Moncton   -0.17502    0.06182  -2.831  0.00464 ** 
## RegionNew Brunswick             0.01389    0.06213   0.224  0.82313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.052 on 11171 degrees of freedom
## Multiple R-squared:  0.005023,   Adjusted R-squared:  0.004577 
## F-statistic: 11.28 on 5 and 11171 DF,  p-value: 7.196e-11

Note: Viewing the results using broom libary.

tidy(model_atlantic, conf.int = TRUE)

## # A tibble: 6 × 7
##   term                   estimate std.error statistic p.value conf.low conf.high
##   <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 "(Intercept)"            6.69      0.0417   161.    0         6.61      6.78  
## 2 "RegionPrince Edward …   0.0773    0.0648     1.19  2.33e-1  -0.0498    0.204 
## 3 "RegionHalifax"         -0.445     0.0750    -5.93  3.03e-9  -0.592    -0.298 
## 4 "RegionNova Scotia"     -0.0771    0.0623    -1.24  2.16e-1  -0.199     0.0450
## 5 "RegionSaint John and…  -0.175     0.0618    -2.83  4.64e-3  -0.296    -0.0539
## 6 "RegionNew Brunswick"    0.0139    0.0621     0.224 8.23e-1  -0.108     0.136

Note:Creating a coefficients table using sjplot for life satisfaction model.

plot_model(model_atlantic, 
           type = "est",         # Type of plot: estimates (coefficients)
           vline.color = "red",  # Vertical reference line in red
           show.values = TRUE,   # Display coefficient values on the plot
           show.p = TRUE) +
  labs(title = "Regression Coefficients for Life Satisfaction Model",
       subtitle = "Effect of region on Life Satisfaction",
       x = "Estimated Effect (Coefficient Value)",
       y = "Predictors") +
  theme_minimal()# Show significance levels

Regression Interpretation:

The regression model displayed above, looks at the differences of life satisfaction between Atlantic Canada regions, using Newfoundland and Labrador as a reference to compare.

It can be seen that Halifax has the most negative life satisfaction from the region of reference falling at -0.44 St. John and Moncton also have negative coefficient of -0.18 but have less magnitude than Halifax is, Nova Scotia has a negative coefficient but it is -0.08 meaning that it has no impact or magnitude. Prince Edward Island has 0.08 and New Brunswick has 0.01 making them the only ones to have positive coefficients. Their coefficient are too low to have any significant impact even though they are positive.

Halifax and St. John and Moncton stand out as areas were life satisfaction noticeably lower because their intervals do not overlap with the zero line. Where on the other hand New Brunswick and Prince Edward Island are almost on the zero line. since the R² value is low it means that there’s other factors that could be playing in a roll in affecting peoples over all life satisfaction besides where they currently live.

In this case, I would say, regions are not a strong factor, and tell why peoples life satisfaction can be affected. Halifax and Saint johns and Moncton report lower life satisfaction, and this could be due to other things such as Socio economic challenges or things in their personal life. By adding another variable, like their employment status or income level, it could help to understand what is contributing to lower life satisfaction.

Note: creating a professional regression table for life satisfaction model.

tab_model(model_atlantic, auto.label=TRUE)

	PLIS_05
Predictors	Estimates	CI	p
(Intercept)	6.69	6.61 – 6.78	<0.001
Region [Prince Edward Island]	0.08	-0.05 – 0.20	0.233
Region [Halifax]	-0.44	-0.59 – -0.30	<0.001
Region [Nova Scotia]	-0.08	-0.20 – 0.05	0.216
RegionSaint John and Moncton	-0.18	-0.30 – -0.05	0.005
Region [New Brunswick]	0.01	-0.11 – 0.14	0.823
Observations	11177
R² / R² adjusted	0.005 / 0.005

Task 2: Populist Attitudes in Portugal

part A: Data Preparation

Note:Loading the data for Portugal.

portugal_data <- read_fst("portugal_data.fst")

Note:Cleaning the trust variables.

head(portugal_data[, c("trstplt", "trstprl", "trstprt", "gndr")])

##   trstplt trstprl trstprt gndr
## 1      88      88      NA    2
## 2       1      88      NA    2
## 3       2      88      NA    1
## 4       0      88      NA    1
## 5       3      88      NA    2
## 6       0       0      NA    1

Note:Checking the occurrence of each unique entries of the trstprt variable in Portugal_data data set.

table(portugal_data$trstprt)

## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88   99 
## 4544 2049 2418 2243 1715 1902  545  360  152   45   37   37  314    9

Note:Creating a trust scale combining trust variables and rescaling them to 0-100.I also Flipped the scale to create a populist measure where higher values indicate stronger populist attitudes.

# Step 1: Clean the trust variables (set values > 10 to NA)
portugal_data$trstplt <- ifelse(portugal_data$trstplt > 10, NA, portugal_data$trstplt)
portugal_data$trstprl <- ifelse(portugal_data$trstprl > 10, NA, portugal_data$trstprl)
portugal_data$trstprt <- ifelse(portugal_data$trstprt > 10, NA, portugal_data$trstprt)

# Step 2: Create a trust scale (0-100)
portugal_data$trust <- scales::rescale(
  portugal_data$trstplt + portugal_data$trstprl + portugal_data$trstprt, 
  na.rm = TRUE, 
  to = c(0, 100)
)

# Step 3: Flip the scale to create a populist measure (higher = more populist)
# This reverses the direction so that lower trust = higher populism
portugal_data$populist <- scales::rescale(portugal_data$trust, na.rm = TRUE, to = c(100, 0))

summary(portugal_data$populist, na.rm = TRUE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   56.67   73.33   72.64   90.00  100.00    2350

Note:Creating a generational variable based on birth year with proper factor ordering.

portugal_data <- portugal_data %>%
  mutate(
    # Recode year of birth, setting invalid values to NA
    birth_year = ifelse(yrbrn < 1928 | yrbrn > 2005, NA, yrbrn),
    
    # Create generation variable
    generation = case_when(
      birth_year %in% 1928:1945 ~ "Interwar",
      birth_year %in% 1946:1964 ~ "Baby Boomers",
      birth_year %in% 1965:1980 ~ "Generation X",
      birth_year %in% 1981:1996 ~ "Millennials",
      birth_year %in% 1997:2005 ~ "Gen Z",
      TRUE ~ NA_character_
    ),
    
    # Convert to factor with correct order
    generation = factor(generation, 
                      levels = c("Interwar", "Baby Boomers", 
                               "Generation X", "Millennials", "Gen Z"))
  )

Part B: Descriptive Analysis Note:Converting numeric codes to male and female from gender variable.

unique(portugal_data$gndr)

## [1] 2 1

portugal_data$gender <- factor(portugal_data$gndr, 
                           levels = c(1, 2), 
                           labels = c("Male", "Female"))

Note:Cleaning the generation and populist variable and creating a boxplot visualization showing populist attitudes by generation.

cleaned_data <- portugal_data %>%
  filter(!is.na(generation) & !is.na(populist))

ggplot(cleaned_data, aes(x = generation, y = populist, fill = generation)) +
  geom_boxplot() +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate and enlarge x-axis labels
    axis.text.y = element_text(size = 12),  # Increase font size for y-axis labels
    plot.title = element_text(face = "bold", size = 16),  # Bold and enlarge title
    legend.position = "none"
  ) +
  labs(
    x = "Generation", 
    y = "Popoulist", 
    title = "Distribution of populist attitude by generation"
  )

The box part titled “distribution of populist attitude by generation” looks at five different generations as follows: interwar, baby boomers, generation, X, millennials, and Gen z.The Interwar, Baby Boomer, and Generation X cohorts all display median scores around 75. This score would suggest that generations below millennials have stronger populist attitudes than those younger then them. Millennials score is only slightly lower than the older generations where as gen z has the lowest populist attitude out of all generations landing a score of 60.

Gen Z has the most variation of scores making it the widest interquartile range (IQR), indicating greater dispersion in their attitudes. Some Gen Z individuals reported high levels of populism while others display much lower scores. In comparison, the Interwar, Baby Boomer, and Gen X groups have more similar distributions, with their upper quartile reaching close to 100.

Outliers are present across all generations, and can be seen at the bottom whiskers.Some individuals report very low levels of populist sentiment falling between 0-25.This would suggest that while trends exist across different generations, there is still an amount of individual variation.

One could conclude from this visual information that there is a potential decline in populist attitudes with in younger generations. Gen Z is the generation that shows the most notable shift away from populism.

Part C: Regression Models

Note:Creating simple regression model where populist is the outcome variable and generation is the predictor.

model_gen <- lm(populist ~ gender, data = portugal_data)

Note: Creating multivariate liner regression where populist attitudes is predicted by both gender and generation

# Run multiple regression with both predictors
model_gender_generation <- lm(populist ~ gender + generation, data = portugal_data)

Note:Creating a side by side comparison table.

modelsummary(
  list("Gender Only" = model_gen, "Gender + Generation" = model_gender_generation),
  fmt = 2,  # Format numbers to 2 decimal places
  estimate = "{estimate} [{conf.low}, {conf.high}]",  # Show coefficient estimates with confidence intervals
  statistic = NULL,  # Remove standard errors to keep table clean
  stars = TRUE,  # Add significance stars
  coef_rename = c(
    "(Intercept)" = "Intercept",
    "genderFemale" = "Female (vs Male)",
    "generationGen X" = "Gen X (vs Gen Z)",
    "generationBoomers" = "Boomers (vs Gen Z)"
  ),  # Rename coefficients for clarity
  gof_map = c("nobs", "r.squared", "adj.r.squared", "aic", "bic"),  # Include model fit statistics
  title = "Comparison of Populist Attitudes by Gender and Generation",
  notes = "Data: Portugal Survey Data 2024"
)

Comparison of Populist Attitudes by Gender and Generation
	Gender Only	Gender + Generation
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
Data: Portugal Survey Data 2024
Intercept	72.12 [71.63, 72.60]	73.71 [72.96, 74.47]
Female (vs Male)	0.89 [0.26, 1.53]	0.75 [0.11, 1.40]
generationBaby Boomers		-0.67 [-1.53, 0.19]
generationGeneration X		-1.65 [-2.54, -0.75]
generationMillennials		-3.49 [-4.52, -2.47]
generationGen Z		-13.92 [-16.28, -11.55]
Num.Obs.	15531	15001
R2	0.000	0.012
R2 Adj.	0.000	0.011
AIC	137097.9	132243.9
BIC	137120.8	132297.2

Part D: Regression Interpretation

The regression results examine the level of impact gender and generation have on individuals populist attitudes. Two models were constructed the first being Model 1 which looks at the effects of gender by itself and then secondly Model 2 which adds generation as an additional variable to predict populist attitudes.

We will look into model one first. In the model it can be seen that the intercept value 72.12 represents a baseline level of peoples populist attitudes. The coefficient for women is +0.89 which means that women on average have a slightly higher populist attitudes than men. However, the magnitude of this effect is relatively small. This is because the R² value for this model is low. Since it is low that could mean that gender by itself does not explain the variety of populist attitudes.

Now looking at model two where generation is added as another variable. Generation X, Millennials, and Gen Z all show lower populist attitudes when compared to the reference category. It can be seen that Gen Z has the largest decline of -13.92, indicating a generational shift away from having populist attitudes. Baby Boomers have -0.67 which does not show a statistical difference from the reference group, this could mean that their populist attitudes are more consistent regardless of additional variables.

We can see that genders effect is still present just not as strong being +0.75 which means that once generation is applied it plays a much larger role in explaining differences in peoples populist attitudes.

The improved R² in Model 2 tells viewers overall that generation is a much stronger predictor than gender. The generational decline in populist attitudes could have to do with societal changes in social norms. This could be because younger generations are less drawn to politics. It could also be due to many factors like modern day media consumption and current changes in economics.

Task 3: Authoritarian Values in France

Part A: Data Preparation

Note: reading the fst data for France.

france_data <- read_fst("france_data.fst")

Note: creating a value scale.

# Create authoritarian values scale from Schwartz human values items
france_data <- france_data %>%
  mutate(
    # Rename variables for clarity
    behave = ipbhprp,   # Proper behavior
    secure = impsafe,   # Security
    safety = ipstrgv,   # Strong government
    tradition = imptrad, # Tradition
    rules = ipfrule     # Following rules
  ) %>%
  # Set invalid values to NA
  mutate(across(c("behave", "secure", "safety", "tradition", "rules"),
                ~ na_if(.x, 7) %>% na_if(8) %>% na_if(9))) %>%
  # Apply reverse coding (original scale is 1=very much like me, 6=not like me at all)
  # After reversing: higher scores = more authoritarian
  mutate(across(c("behave", "secure", "safety", "tradition", "rules"), ~ 7 - .x))

# Calculate the authoritarian scale (0-100)
france_data$auth <- scales::rescale(
  france_data$behave + 
  france_data$secure + 
  france_data$safety + 
  france_data$tradition + 
  france_data$rules, 
  to = c(0, 100), 
  na.rm = TRUE
)

# Examine the distribution
summary(france_data$auth)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   48.00   60.00   59.75   72.00  100.00     800

Note:Cleaning the 3 variables data.

france_data <- france_data %>%
  mutate(
    # Recode education level, setting invalid values to NA
    ignore = ifelse(eisced == 55 | eisced == 5, NA, eisced),
    ignorehincfel = ifelse(hincfel == 3 | hincfel == 4, NA, hincfel),

    # Create education level variable
    education_level = case_when(
       ignore  %in% c(1,2) ~ "Lower Secondary or Less",
      ignore  %in% c(3,4) ~ "Upper Secondary",
        ignore  %in% c(6,7) ~ "Tertiary Education",
      
      TRUE ~ NA_character_
    ),
    
    residential_area = case_when(
       domicil  %in% c(1,2) ~ "Urban",
       domicil  %in% c(3,4,5) ~ "Rural",
        
      TRUE ~ NA_character_
    ),
     economic_status = case_when(
        ignorehincfel %in% c(1) ~ "Economic Comfort",
        ignorehincfel  %in% c(2) ~ "Economic Strain",
        
      TRUE ~ NA_character_
    ),
    
    # Convert to factor with correct order
    education_level = factor(education_level, 
                      levels = c("Lower Secondary or Less", "Upper Secondary", 
                               "Tertiary education")),
    residential_area = factor(residential_area, 
                      levels = c("Urban", "Rural")),
    economic_status = factor(economic_status, 
                      levels = c("Economic Comfort", "Economic Strain"))
  )

table(france_data$education_level, useNA = "ifany")

## 
## Lower Secondary or Less         Upper Secondary      Tertiary education 
##                    4218                    6529                       0 
##                    <NA> 
##                    8291

table(france_data$residential_area, useNA = "ifany")

## 
## Urban Rural  <NA> 
##  5945 13087     6

table(france_data$economic_status, useNA = "ifany")

## 
## Economic Comfort  Economic Strain             <NA> 
##             4981             7868             6189

Part B: Descriptive Analysis

Note:Creating Violin plot.

cleaned_education <- france_data %>%
  filter(!is.na(education_level))

ggplot(cleaned_education, aes(x = education_level, y = auth, fill = education_level)) +
    geom_violin(alpha = 0.7, trim = FALSE) +  # Violin plot for distribution
    geom_boxplot(width = 0.15, color = "black", alpha = 0.5) +  # Internal boxplot for summary statistics
    theme_minimal() +  # Clean theme
    labs(
        title = "Distribution of Authoritarian Values by Education Level",
        subtitle = "Higher scores indicate stronger authoritarian values",
        x = "Education Level",
        y = "Authoritarian Values (0-100)"
    ) +
    theme(
        axis.text.x = element_text(angle = 30, hjust = 1, size = 12),  # Rotate x-axis labels
        axis.text.y = element_text(size = 12),  # Adjust y-axis text
        plot.title = element_text(face = "bold", size = 16),  # Bold title
        legend.position = "none"  # Remove legend
    ) +
    scale_fill_manual(values = c("#0072B2", "#D55E00", "#009E73"))  # Custom colors for education levels

## Warning: Removed 343 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

## Warning: Removed 343 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The violin plot “Distribution of Authoritarian Values by Education Level” shows differences in authoritarian values between those with lower secondary education and those with upper secondary education. Lower Secondary or Less can be seen in blue and Upper Secondary is orange. Individuals who hold a upper secondary education seem to have lower authoritarian values then those with lower secondary education.

While both groups show a various range of authoritarian values all the way from 0–100. Lower Secondary or Less group has a higher median authoritarian score. Authoritarian attitudes in Secondary or Less group are more varied. In comparison the Upper Secondary group has a lower median score. This means that Individuals in this group have more moderate authoritarian values.

This shows that education can play a hand in shaping peoples attitudes toward authority. It could shape peoples perspectives by exposing them to a diversity of perspectives and encouraging critical thinking. Since authoritarian values are found in both groups this means that education is a factor but there can still be other influences.

Part C: Progressive Model Building

Note:Running regression model one.

france_model1 <- lm(auth ~ education_level, data = france_data)

summary(france_model1)

## 
## Call:
## lm(formula = auth ~ education_level, data = france_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.924 -11.924   0.527  12.527  40.527 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     63.9244     0.2912  219.49   <2e-16 ***
## education_levelUpper Secondary  -4.4510     0.3719  -11.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.47 on 10402 degrees of freedom
##   (8634 observations deleted due to missingness)
## Multiple R-squared:  0.01358,    Adjusted R-squared:  0.01349 
## F-statistic: 143.2 on 1 and 10402 DF,  p-value: < 2.2e-16

Note: Running regression model 2.

france_model2 <- lm(auth ~ education_level + residential_area, data = france_data)

summary(france_model2)

## 
## Call:
## lm(formula = auth ~ education_level + residential_area, data = france_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.416 -12.913   0.584  13.041  41.041 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     65.3707     0.4204 155.514  < 2e-16 ***
## education_levelUpper Secondary  -4.4572     0.3715 -12.000  < 2e-16 ***
## residential_areaRural           -1.9546     0.4101  -4.766  1.9e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.45 on 10399 degrees of freedom
##   (8636 observations deleted due to missingness)
## Multiple R-squared:  0.01569,    Adjusted R-squared:  0.0155 
## F-statistic:  82.9 on 2 and 10399 DF,  p-value: < 2.2e-16

Note: running the regression model 3.

france_model3 <- lm(auth ~ education_level + residential_area + economic_status, data = france_data)

summary(france_model3)

## 
## Call:
## lm(formula = auth ~ education_level + residential_area + economic_status, 
##     data = france_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.458 -11.880   0.704  12.704  42.091 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     63.8798     0.5703 112.004  < 2e-16 ***
## education_levelUpper Secondary  -4.1622     0.4189  -9.937  < 2e-16 ***
## residential_areaRural           -1.8090     0.4647  -3.893 9.99e-05 ***
## economic_statusEconomic Strain   1.3874     0.4430   3.132  0.00174 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.32 on 8152 degrees of freedom
##   (10882 observations deleted due to missingness)
## Multiple R-squared:  0.01504,    Adjusted R-squared:  0.01468 
## F-statistic:  41.5 on 3 and 8152 DF,  p-value: < 2.2e-16

Note:Creating regression table comparing the 3 models.

pred_labels <- c(
  "(Intercept)" = "Intercept",
  "education_levelUpper Secondary" = "Upper Secondary",
  "residential_areaRural" = "Rural",
  "economic_statusEconomic Strain" = "Economic Strain"
  
)

tab_model(
  france_model1, france_model2, france_model3,
  pred.labels = pred_labels,
  dv.labels = c("Model 1:\nEducation_level", "Model 2:\n+residential_area ", "Model 3:\n+ economic_status"),
  string.pred = "Predictors",
  string.est = "Î²",
  string.ci = "95% CI",
  string.p = "p-value",
  p.style = "stars",
  p.threshold = c(0.05, 0.01, 0.001),
  show.reflvl = TRUE,
  show.aic = TRUE,
  show.r2 = TRUE,
  show.fstat = TRUE,
  show.loglik = FALSE,
  digits = 3,
  title = "Table: Predicting Authoritarian Values",
  CSS = list(
    css.table = "font-size: 12px; width: auto; margin-bottom: 20px;",
    css.thead = "font-weight: bold; border-bottom: 1px solid #ddd;",
    css.tdata = "padding: 5px; border-bottom: 1px solid #f0f0f0;",
    css.caption = "font-weight: bold; color: #333333; font-size: 14px; margin-bottom: 10px;",
    css.footnote = "font-size: 11px; color: #555555;",
    css.depvarhead = "color: #333333; font-weight: bold; text-align: center; border-bottom: 1px solid #ddd;",
    css.centeralign = "text-align: center;",
    css.firsttablecol = "font-weight: bold; font-size: 12px;"
  )
)

Table: Predicting Authoritarian Values
	Model 1: Education_level		Model 2: +residential_area		Model 3: + economic_status
Predictors	Î²	95% CI	Î²	95% CI	Î²	95% CI
Intercept	63.924 ^***	63.354 – 64.495	65.371 ^***	64.547 – 66.195	63.880 ^***	62.762 – 64.998
Upper Secondary	-4.451 ^***	-5.180 – -3.722	-4.457 ^***	-5.185 – -3.729	-4.162 ^***	-4.983 – -3.341
Rural			-1.955 ^***	-2.758 – -1.151	-1.809 ^***	-2.720 – -0.898
Economic Strain					1.387 ^**	0.519 – 2.256
Observations	10404		10402		8156
R² / R² adjusted	0.014 / 0.013		0.016 / 0.016		0.015 / 0.015
AIC	90211.654		90167.847		70588.377
p<0.05 p<0.01 * p<0.001

Note: creating data to compare AIC and BIC

aic_bic_data <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3"),
  AIC = c(AIC(france_model1), AIC(france_model2), AIC(france_model3)),
  BIC = c(BIC(france_model1), BIC(france_model2), BIC(france_model3))
)

kable(aic_bic_data, digits = 1, caption = "Comparison of AIC and BIC Across Models")

Comparison of AIC and BIC Across Models
Model	AIC	BIC
Model 1	90211.7	90233.4
Model 2	90167.8	90196.8
Model 3	70588.4	70623.4

Note: creating regression model for authoritarian predictions.

plot_model(france_model3, 
           type = "est",         # Type of plot: estimates (coefficients)
           vline.color = "red",  # Vertical reference line in red
           show.values = TRUE,   # Display coefficient values on the plot
           show.p = TRUE) +
  labs(title = "Regression Coefficients for Authorotrian prediction Model",
       subtitle = "Effect of education level,economic status and residential level on authoritorian values ",
       x = "Estimated Effect (Coefficient Value)",
       y = "Predictors") +
  theme_minimal()# Show significance levels

Part D: Model Comparison and Interpretation

I used 3 regression models to help examine the relationship between education, residential area, and economic status in shaping individuals authoritarian values. Model 1 focuses on the variable education by itself. It could be seen based on the score -4.45 that Upper Secondary education is generally associated with having lower authoritarian values. This model alone explains about 1.4% of the difference in peoples authoritarian values. Along with that its AIC is 90,211. Model 2 adds another variable to education which is residential area peoples reside in. It shows us that Upper Secondarys score remains around −4.46. Rural residents score is −1.96 which concludes that even with its addition their is still a result of lower authoritarian values. Their is a shift in R² meaning that there is more strength in the relationship between education and residential areas.AIC decreased only a bit now being 90,168.

The final model, model 3 adds the variable of Economic strain to education and residential areas. Upper Secondary education score is -4.16 and rural residential area is -1.81 with the addition of economic status meaning there is still a connection. Economic strain is positively connected with authoritarian values scoring +1.39. This score means that individuals experiencing financial struggles or strain in turn have stronger authoritarian beliefs. With the added variable of economic strain AIC drops to 70,588. R² remains low now being 0.015 which means the three variables explain only some of the differences between authoritarian values.

The takeaway from these results could be that education has the strongest impact on authoritarian values.Living in rural areas also can be seen to be a aspect of why some people have lower authoritarian values. In contrast to the first variables economic strain increased authoritarian values. In conclusion even with 3 variables it is clear that something else is impacting peoples authoritarian values.

exercice3

Madeline Perkey

2025-03-19

R Markdown