Homework 7 (2.5%) due Nov. 13

Submit on Quercus using assignment tab – not on the discussion board. Make sure to provide both your markdown file and R pubs link. If you do not submit both, you will be penalized 1 pt. out of the 2.5 total.

packages <- c("tidyverse", "infer", "fst") 

new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"
library(fst)

ess <- read_fst("All-ESS-Data.fst")

Task 1

Set up the calculations for an exploration of educ_level (as BA, No BA) and brncntr (recoded as yes/no *shown in tutorial 4) for Switzerland. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.

switzerland_data <- ess %>%
  filter(cntry == "CH") %>%
  
  mutate(born_in_country = recode(brncntr,
                                  `1` = "Yes",
                                  `2` = "No",
                                  `7` = NA_character_,
                                  `8` = NA_character_,
                                  `9` = NA_character_),
    edulvla = case_when(
      essround < 5 & edulvla == 55 ~ NA_real_,
      TRUE ~ edulvla
    ),

    edulvlb = case_when(
      essround >= 5 & edulvlb == 5555 ~ NA_real_,
      TRUE ~ edulvlb
    ),

    educ_level = case_when(
      essround < 5 & edulvla == 5 ~ "BA",
      essround >= 5 & edulvlb > 600 ~ "BA",
      TRUE ~ "No BA"
    ),

    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      gndr == 9 ~ NA_character_,
      TRUE ~ as.character(gndr)
    )
  )
tab_dat <- switzerland_data

mytab <- table(tab_dat$brncntr, tab_dat$educ_level)

rsums <- rowSums(mytab)

csums <- colSums(mytab)

N <- sum(mytab)

ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
##              [,1]         [,2]
## [1,] 1.757478e-01 5.956555e-01
## [2,] 5.205397e-02 1.764246e-01
## [3,] 1.346107e-05 4.562312e-05
## [4,] 1.346107e-05 4.562312e-05
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
##              [,1]         [,2]
## [1,] 2974.5309306 1.008147e+04
## [2,]  881.0134121 2.985987e+03
## [3,]    0.2278287 7.721713e-01
## [4,]    0.2278287 7.721713e-01
alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841

Interpretation of the critical value

We reject the null hypothesis at the 0.05 significance level because the value is 3.841 or greater.There is a significant association between being born in Switzerland and education level.

test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 65.7174

Interpretation of the Pearson’s chi-squared statistic

This value is much larger than the critical value. There is a significant connection between education level and being born in Switzerland. This means that the probability of being born in the country isn’t the same across different education levels.

p_val <- pchisq(test_stat, df = 1, lower.tail = F)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = F))
## Warning in chisq.test(mytab, correct = F): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 65.717, df = 3, p-value = 3.523e-14

Interpretation of the P-value

The p-value is incredibly close to 0. Therefore, a strong connection between the two variables exist–further emphasized by the large value granted by the X-square. We can reject the null hypothesis.

Task 2

Using the infer package and following the 4 steps outlined in the tutorial and during the lecture, conduct and visualize the hypothesis testing for domicil (recoded as urban/rural – see tutorial 4) and gincdif (Government should reduce differences in income levels) for the country of Netherlands. For gincdif, you must recode for the five categories (i.e, cannot keep as numeric).

See variable info here:

https://ess-search.nsd.no/en/variable/1ad9f42e-0c02-4b13-9a67-ee83d82ac576

Make decision over appropriate test given table provided during lecture. After conducting the 4 steps, interpret all the output and discuss your takeaways.

netherlands_data <- ess %>%
  filter(cntry == "NL") %>%
  mutate(geo = recode(as.character(domicil), 
                      '1' = "Urban", 
                      '2' = "Peri-Urban",
                      '3' = "Rural", 
                      '4' = "Rural", 
                      '5' = "Rural",
                      '7' = NA_character_,
                      '8' = NA_character_,
                      '9' = NA_character_),
         gincdif = case_when(
           gincdif == '1' ~ "Agree strongly",
           gincdif == '2' ~ "Agree",
           gincdif == '3' ~ "Neither agree nor disagree",
           gincdif == '4' ~ "Disagree",
           gincdif == '5' ~ "Disagree strongly",
           gincdif %in% c(7, 8, 9) ~ NA_character_,
           TRUE ~ as.character(gincdif)
         )


)
test_stat <- netherlands_data %>%
  specify(explanatory = geo, 
          response = gincdif) %>% 
  hypothesize(null = "independence") %>%
  calculate(stat = "chisq")
## Warning: Removed 198 rows containing missing values.
print(test_stat$stat)
## X-squared 
##  24.81544
null_distribution <- netherlands_data %>%
  specify(explanatory = geo,
          response = gincdif) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "chisq")
## Warning: Removed 198 rows containing missing values.
p_val <- null_distribution %>% # replace name here if you assigned something other than null_distribution above
  get_pvalue(obs_stat = test_stat, direction = "greater") # would only replace test_stat if assigned another name in Step 1

p_val
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.003
null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater")

null_distribution
## Response: gincdif (factor)
## Explanatory: geo (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1  4.07
##  2         2  4.58
##  3         3  2.02
##  4         4  2.99
##  5         5 25.3 
##  6         6  4.81
##  7         7  4.09
##  8         8  7.54
##  9         9 15.2 
## 10        10  6.91
## # ℹ 990 more rows
conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater") +
  shade_confidence_interval(endpoints = conf_int)

null_distribution
## Response: gincdif (factor)
## Explanatory: geo (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1  4.07
##  2         2  4.58
##  3         3  2.02
##  4         4  2.99
##  5         5 25.3 
##  6         6  4.81
##  7         7  4.09
##  8         8  7.54
##  9         9 15.2 
## 10        10  6.91
## # ℹ 990 more rows

Interpretation

The red line falls outside the confidence interval of the simulated null distribution. We can reject the null hypothesis and assume there to be a relation between the two variables.

Task 3

First, talk about a research question you would like to consider, which identifies a main relationship between an explanatory variable and an outcome variable. Using the infer package and following the 4 steps outlined in the tutorial and during the lecture, conduct and visualize the hypothesis testing for your variables based on your research question. After conducting the 4 steps, interpret all the output and discuss your takeaways.

Research question of interest

I want to research the relationship between how location relates with interest in politics in Finland. I will do this by having the explanatory variable = polintr and explanatory variable = geo.

finland_data <- ess %>%
  filter(cntry == "FI") %>% 
  mutate(geo = recode(as.character(domicil), 
                      '1' = "Urban", 
                      '2' = "Peri-Urban",
                      '3' = "Rural", 
                      '4' = "Rural", 
                      '5' = "Rural",
                      '7' = NA_character_,
                      '8' = NA_character_,
                      '9' = NA_character_),
polintr = recode(polintr,
                                  `1` = "Very interested",
                                  `2` = "Quite interested",
                                  `3` = "Hardly interested",
                                  `4` = "Not at all interested",
                                  `7` = NA_character_,
                                  `8` = NA_character_,
                                  `9` = NA_character_)
)
test_stat <- finland_data %>%
  specify(explanatory = geo,
          response = polintr) %>% # t
  hypothesize(null = "independence") %>%
  calculate(stat = "chisq") 
## Warning: Removed 28 rows containing missing values.
print(test_stat$stat)
## X-squared 
##  370.8845
null_distribution <- finland_data %>%
  specify(explanatory = geo,
          response = polintr) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard
  calculate(stat = "chisq")
## Warning: Removed 28 rows containing missing values.
p_val <- null_distribution %>% 
  get_pvalue(obs_stat = test_stat, direction = "greater") 
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step.
## See `?get_p_value()` for more information.
p_val
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0
null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater")
## Warning in min(diff(unique_loc)): no non-missing arguments to min; returning
## Inf

null_distribution
## Response: polintr (factor)
## Explanatory: geo (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1  3.71
##  2         2  5.23
##  3         3  4.03
##  4         4 15.5 
##  5         5  5.39
##  6         6  1.59
##  7         7  2.61
##  8         8  4.21
##  9         9  7.10
## 10        10 10.1 
## # ℹ 990 more rows
conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

null_distribution
## Response: polintr (factor)
## Explanatory: geo (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1  3.71
##  2         2  5.23
##  3         3  4.03
##  4         4 15.5 
##  5         5  5.39
##  6         6  1.59
##  7         7  2.61
##  8         8  4.21
##  9         9  7.10
## 10        10 10.1 
## # ℹ 990 more rows

Interpretation

There is a strong correlation between the two variables, as represented in the p-value being at 0 and the red line straying far away from the simulated null distribution.