Task 1

Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.

rm(list=ls()); gc()
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 527509 28.2    1172460 62.7         NA   669417 35.8
## Vcells 972201  7.5    8388608 64.0      16384  1851610 14.2
# List of packages
packages <- c("tidyverse", "infer", "fst") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"
ess <- read_fst("All-ESS-Data.fst")
france_data <- ess %>%
  # Filter for France
  filter(cntry == "FR") %>%
  
  # Recoding clsprty and cleaning trstplt (will use later)
  mutate(
    clsprty = case_when(
      clsprty == 1 ~ "Yes",  # Recode 1 to "Yes"
      clsprty == 2 ~ "No",   # Recode 2 to "No"
      clsprty %in% c(7, 8, 9) ~ NA_character_,  # Missing-value codes (refusal, don't know, no answer)
      TRUE ~ as.character(clsprty)  # Keep any other values, coerced to character
    ),
    trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt),
    
    # Recoding gender
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      gndr == 9 ~ NA_character_,
      TRUE ~ as.character(gndr)
    ),
    
    # Clean the education missing-value codes *before* deriving educ.ba;
    # otherwise codes such as 7777/8888 in edulvlb (which are > 600) would be
    # misclassified as "BA or more"
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    
    # Recoding education (edulvla applies before ESS round 5, edulvlb from round 5 on)
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      TRUE ~ "No BA"  # includes respondents with missing education
    )
  )
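
A quick sanity check of the recodes can catch mistakes before any tables are built (a small sketch; counts are not shown here):

# Tabulate the recoded variables, including NAs, to confirm categories and NA handling
table(france_data$clsprty, useNA = "ifany")
table(france_data$educ.ba, useNA = "ifany")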
# Not strictly necessary, but renaming before table operations makes it easy
# to backtrack to the clean dataset
tab_dat <- france_data

# Create a crosstable and save it into the object mytab
mytab <- table(tab_dat$clsprty, tab_dat$educ.ba)

# Calculate row-wise sums of counts in table 
rsums <- rowSums(mytab)

# Calculate column-wise sums of counts in table
csums <- colSums(mytab)

# Get the total count of the table (N)
N <- sum(mytab)

# Generate the table of expected proportions under independence:
# the outer product of the row and column marginal proportions
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
##           [,1]      [,2]
## [1,] 0.1101248 0.3815066
## [2,] 0.1138739 0.3944946
# Table of expected frequencies: N times the expected proportions
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
##          [,1]     [,2]
## [1,] 2059.444 7134.556
## [2,] 2129.556 7377.444
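
As a cross-check, chisq.test() computes the same expected counts internally and exposes them in its return value (a sketch; the output should match ftab):

# The 'expected' component of a chisq.test() result holds the expected frequencies
chisq.test(mytab, correct = FALSE)$expected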
# Critical Value for Independence:
alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE) # df = (2-1)*(2-1) = 1 for a 2x2 table
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841

Interpretation: 3.841 is the value that a chi-squared random variable with 1 degree of freedom exceeds with probability 0.05, as computed above with qchisq. Should the computed chi-squared test statistic surpass this critical value, we would reject the null hypothesis at the 0.05 significance level, which would indicate a statistically significant association between feeling close to a party and education level in these data.
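
For tables of other dimensions, the degrees of freedom need not be hardcoded; they can be derived from the table itself (a small sketch reusing mytab and alpha from above):

# Degrees of freedom for an r x c table: (r - 1) * (c - 1); here (2 - 1) * (2 - 1) = 1
df <- (nrow(mytab) - 1) * (ncol(mytab) - 1)
qchisq(alpha, df = df, lower.tail = FALSE)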

# Pearson's chi-squared statistic: the sum over all cells of
# (observed - expected)^2 / expected
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 166.1849

Contrast with the critical value: the observed chi-squared statistic of 166.18 far exceeds the 3.841 threshold computed above.

Interpretation: The discrepancy between the observed counts and those expected under independence is far larger than chance alone would produce, so the result is statistically significant. There is, therefore, a statistically significant relationship between feeling close to a party and educational level: the likelihood of feeling close to a party differs across levels of education.
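
Statistical significance alone says nothing about the strength of the association: with N = 18,701, even a weak relationship produces a huge chi-squared statistic. Cramér's V is a common effect-size companion, shown here as a sketch (it is not part of the tutorial's steps):

# Cramér's V = sqrt(X^2 / (N * min(r - 1, c - 1))); for a 2x2 table, sqrt(X^2 / N)
sqrt(test_stat / (N * min(nrow(mytab) - 1, ncol(mytab) - 1)))  # roughly 0.09: a weak association despite the tiny p-value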

# p-value:
p_val <- pchisq(test_stat, df = 1, lower.tail = FALSE)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
# Using chisq.test for validation (correct = FALSE disables the Yates continuity
# correction, so the result matches the hand computation):
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = FALSE))
## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 166.18, df = 1, p-value < 2.2e-16

p-value < 2.2e-16: the p-value of the chi-squared test. R prints p-values below machine precision (about 2.2e-16) this way rather than showing 0; the value is essentially zero, indicating very strong evidence against the null hypothesis. In other words, there is a statistically significant association between feeling close to a party and education level.
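
Relatedly, the hand-computed p_val above displayed as 0 only because it was rounded to four digits. When reporting, avoid rounding tiny p-values; base R's format.pval() is one option (a sketch):

# format.pval() prints values below machine precision as "< 2.22e-16" rather than a misleading 0
format.pval(p_val)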

Task 2

Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.

See variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%22874a86e2-a0f6-40b4-aef7-51c8eed98a7d/1%22]

Provide interpretations of the output (consider the variable info).

france_data <- ess %>%
  # Filter for France
  filter(cntry == "FR") %>%
  
  # Recoding sbbsntx to its 5 response categories
  mutate(
    sbbsntx = case_when(
      sbbsntx == 1 ~ "Agree strongly",
      sbbsntx == 2 ~ "Agree",
      sbbsntx == 3 ~ "Neither agree nor disagree",
      sbbsntx == 4 ~ "Disagree",
      sbbsntx == 5 ~ "Disagree strongly",
      sbbsntx %in% c(7, 8, 9) ~ NA_character_,  # missing-value codes (refusal, don't know, no answer)
      TRUE ~ as.character(sbbsntx)  # keep any other values, coerced to character
    ),
    
    # Recoding domicil to urban vs. rural; per the task, 2 (suburbs/peri-urban)
    # counts as urban
    domicil = case_when(
      domicil == 1 ~ "Urban",   # big city
      domicil == 2 ~ "Urban",   # suburbs or outskirts of big city
      domicil == 3 ~ "Rural",   # town or small city
      domicil == 4 ~ "Rural",   # country village
      domicil == 5 ~ "Rural",   # farm or home in countryside
      domicil %in% c(7, 8, 9) ~ NA_character_,  # missing-value codes
      TRUE ~ as.character(domicil)
    )
  )
test_stat <- france_data %>%
  specify(explanatory = domicil,   # explanatory variable
          response = sbbsntx) %>%  # outcome of interest
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq") # chi-squared: the appropriate statistic for two categorical variables
## Warning: Removed 14999 rows containing missing values.
print(test_stat$stat)
## X-squared 
##  19.97187

Interpretation: The chi-squared statistic of 19.97 measures how far the observed crosstable of domicil (urban vs. rural) by sbbsntx departs from the frequencies expected if the two variables were independent. For a 5 × 2 table the statistic has (5 − 1)(2 − 1) = 4 degrees of freedom, and 19.97 far exceeds the 0.05 critical value of 9.488, suggesting that agreement that social benefits/services cost businesses too much in taxes and charges differs between urban and rural residents. Steps 2 and 3 assess this formally against the permutation null distribution.
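
A quick benchmark against the theoretical chi-squared distribution points the same way (a sketch; the permutation approach in Steps 2 and 3 does not rely on this approximation):

# Tail probability of the observed statistic under a chi-squared distribution with 4 degrees of freedom
pchisq(test_stat$stat, df = 4, lower.tail = FALSE)  # approximately 0.0005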

null_distribution <- france_data %>%
  specify(explanatory = domicil,
          response = sbbsntx) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% # simulate the null by permuting the response; reps = 1000 is a standard choice (lower it if too computationally demanding)
  calculate(stat = "Chisq")
## Warning: Removed 14999 rows containing missing values.
p_val <- null_distribution %>%
  get_p_value(obs_stat = test_stat, direction = "greater") # "greater": only large statistics count against the null
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Interpretation: A simulated p-value of 0 means that none of the 1,000 permuted statistics was as large as the observed 19.97. As the warning notes, the p-value is not literally zero; with 1,000 replicates it is best reported as p < 1/1000 = 0.001. Under the assumptions of frequentist hypothesis testing, this is strong evidence to reject the null hypothesis of independence.
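
The simulated p-value is just the proportion of permuted statistics at least as large as the observed one, so its resolution is limited by the number of replicates (a sketch using the objects from above):

# Equivalent to get_p_value(..., direction = "greater")
mean(null_distribution$stat >= test_stat$stat)
# With 1,000 replicates, the smallest nonzero p-value detectable is 1/1000 = 0.001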

Task 3

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile") # central 95% of the simulated null statistics


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater") +
  shade_confidence_interval(endpoints = conf_int)
## Warning in min(diff(unique_loc)): no non-missing arguments to min; returning
## Inf

null_distribution
## Response: sbbsntx (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1  2.61
##  2         2  3.36
##  3         3  2.43
##  4         4  5.56
##  5         5  3.20
##  6         6  2.31
##  7         7  1.96
##  8         8  4.82
##  9         9  2.96
## 10        10 12.3 
## # ℹ 990 more rows

Interpretation: The null distribution in the plot is right-skewed, as expected for a chi-squared statistic: most permuted values are small, with a long tail of larger ones. The shaded percentile interval (roughly 0 to 10) covers the central 95% of the statistics we would expect if domicil and sbbsntx were independent; note that it describes the null distribution, not a plausible range for a population parameter. The observed statistic of about 20 (the red line) lies well outside this interval, in the far right tail, which is why the simulated p-value is below 1/1000 and the null hypothesis of independence is rejected.
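
For reference, the percentile interval is nothing more than the 2.5th and 97.5th percentiles of the simulated statistics, which can be verified directly (a sketch; the result should match conf_int):

# Same endpoints as get_confidence_interval(level = 0.95, type = "percentile")
quantile(null_distribution$stat, probs = c(0.025, 0.975))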

Task 4

Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.

Variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%2283b08cf2-508e-49a0-9fc8-c3cc281290ec/4%22]

Provide interpretations of the output (consider the variable info).

france_data <- ess %>%
  # Filter for France
  filter(cntry == "FR") %>%
  
  # Recoding 'lrscale' to "Left" and "Right", omitting "moderates" as in Task 1
  mutate(
    lrscale = case_when(
      lrscale %in% 0:3 ~ "Left",                    # values 0 to 3 are categorized as "Left"
      lrscale %in% 7:10 ~ "Right",                  # values 7 to 10 are categorized as "Right"
      lrscale %in% 4:6 ~ NA_character_,             # "moderates" (values 4 to 6) are omitted
      lrscale %in% c(77, 88, 99) ~ NA_character_    # missing-value codes are set to NA
    ),
    
    ccrdprs = ifelse(ccrdprs %in% c(66, 77, 88, 99), NA, ccrdprs)  # set missing-value codes in 'ccrdprs' to NA
  )
test_stat <- france_data %>%
  specify(explanatory = lrscale,   # explanatory variable
          response = ccrdprs) %>%  # outcome of interest
  hypothesize(null = "independence") %>%
  calculate(stat = "F") # F statistic: numeric response with a categorical explanatory variable
## Warning: Removed 15413 rows containing missing values.
print(test_stat$stat)
## [1] 15.41871

Interpretation: The observed F statistic of 15.42 measures how much the mean of ccrdprs (personal responsibility to try to reduce climate change, on a 0-10 scale from "not at all" to "a great deal") differs between the left and right groups, relative to the variation within each group. A value this large suggests a clear link between political self-placement and felt responsibility for fighting climate change. The statistic alone, however, does not say which side feels more responsible, nor whether it is extreme enough to reject the null hypothesis; that is what Steps 2 and 3 assess.
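
The direction of the difference is easy to check descriptively, for instance by comparing group means (a sketch reusing france_data; output not shown here):

# Mean felt responsibility (0-10) by recoded left-right placement
france_data %>%
  filter(!is.na(lrscale), !is.na(ccrdprs)) %>%
  group_by(lrscale) %>%
  summarise(mean_ccrdprs = mean(ccrdprs), n = n())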

null_distribution <- france_data %>%
  specify(explanatory = lrscale,
          response = ccrdprs) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% # simulate the null by permuting the response; reps = 1000 is a standard choice (lower it if too computationally demanding)
  calculate(stat = "F")
## Warning: Removed 15413 rows containing missing values.
p_val <- null_distribution %>%
  get_p_value(obs_stat = test_stat, direction = "greater") # "greater": only large F statistics count against the null
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

With a simulated p-value of 0 (that is, p < 1/1000: none of the 1,000 permuted F statistics reached 15.42), there is, under the assumptions of frequentist hypothesis testing, compelling evidence to reject the null hypothesis.
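
As a rough benchmark against the theoretical F distribution (a sketch only; the permutation test above does not rely on these assumptions, and df2 = 5000 merely stands in for the actual residual sample size, which is in the thousands here):

# With two groups, df1 = k - 1 = 1; df2 = 5000 is a stand-in for the residual degrees of freedom
pf(test_stat$stat, df1 = 1, df2 = 5000, lower.tail = FALSE)
# on the order of 1e-4, consistent with the simulated p < 1/1000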

Task 5

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater") +
  shade_confidence_interval(endpoints = conf_int)
## Warning in min(diff(unique_loc)): no non-missing arguments to min; returning
## Inf

null_distribution
## Response: ccrdprs (numeric)
## Explanatory: lrscale (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 1.23  
##  2         2 1.56  
##  3         3 0.0724
##  4         4 0.156 
##  5         5 0.497 
##  6         6 0.709 
##  7         7 0.188 
##  8         8 0.0912
##  9         9 2.19  
## 10        10 0.951 
## # ℹ 990 more rows

Interpretation: The null distribution of the F statistic is likewise right-skewed: most permuted values lie well below 5, with a thin right tail. The shaded percentile interval (roughly 0 to 5) covers the central 95% of the F statistics expected if lrscale and ccrdprs were independent; again, it describes the null distribution rather than a range for a population parameter. The observed F statistic of about 15.4 (the red line) falls far outside this interval, in the extreme right tail, consistent with the simulated p-value below 1/1000 and the rejection of the null hypothesis.