Salomon_Kayla_Homework

Task 1

Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.

# Install any packages that are missing, then load all of them
packages <- c("tidyverse", "infer", "fst")
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages)
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"

france_data <- read.fst("france_data.fst")
france_data <- france_data %>%
  filter(cntry == "FR") %>%
  
  mutate(
    ID = case_when(
      lrscale %in% 0:3 ~ "Left",
      lrscale %in% 7:10 ~ "Right",
      lrscale %in% c(77, 88, 99) ~ NA_character_
    ), 
    
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      TRUE ~ "No BA"
    )
  ) %>%
  mutate(
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    educ.ba = as.character(educ.ba)
  )

unique(france_data$ID)

## [1] "Left"  NA      "Right"

unique(france_data$educ.ba)

## [1] "No BA"      "BA or more"

france_data <- france_data %>% filter(!is.na(ID))
france_data <- france_data %>% filter(!is.na(educ.ba))
unique(france_data$educ.ba)

## [1] "No BA"      "BA or more"

table(france_data$ID)

## 
##  Left Right 
##  4709  4302

tab_data <- france_data
mytab <- table(tab_data$lrscale, tab_data$educ.ba)
rsums <- rowSums(mytab)
csums <- colSums(mytab)
N <- sum(mytab)
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")

## Table of Expected Proportions:

print(ptab)

##            [,1]       [,2]
## [1,] 0.02631397 0.07855785
## [2,] 0.01389489 0.04148187
## [3,] 0.03533590 0.10549197
## [4,] 0.05557956 0.16592748
## [5,] 0.04870173 0.14539437
## [6,] 0.04059870 0.12120354
## [7,] 0.01138880 0.03400017
## [8,] 0.01910199 0.05702718

ftab <- N * ptab
cat("Table of Expected Frequencies:\n")

## Table of Expected Frequencies:

print(ftab)

##          [,1]      [,2]
## [1,] 237.1152  707.8848
## [2,] 125.2069  373.7931
## [3,] 318.4118  950.5882
## [4,] 500.8274 1495.1726
## [5,] 438.8513 1310.1487
## [6,] 365.8349 1092.1651
## [7,] 102.6245  306.3755
## [8,] 172.1281  513.8719

alpha <- 0.05 
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")

## Critical Value: 3.841

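The value 3.841 is the critical value, not the test statistic. It is the cutoff for significance at alpha = 0.05 with 1 degree of freedom, which is the correct df for a 2x2 table (Left/Right by BA/No BA). If Pearson’s chi-squared statistic exceeds the critical value, we reject the null hypothesis that political attitude and level of education are independent; if it falls below, we fail to reject. Note that mytab above was built from the eight remaining lrscale values (0-3 and 7-10) rather than the two-category ID variable, so the table is 8x2 and has (8 - 1) x (2 - 1) = 7 degrees of freedom, as chisq.test() reports below. The critical value for df = 7 at alpha = 0.05 is about 14.07.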
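As a quick check on how the critical value depends on the dimensions of the table, the sketch below computes it for a 2x2 and an 8x2 table. The helper name crit_value is hypothetical (it is not from the tutorial) and only wraps qchisq().

```r
# Hypothetical helper: critical value for a test of independence
# on an r x c contingency table at significance level alpha.
crit_value <- function(n_rows, n_cols, alpha = 0.05) {
  df <- (n_rows - 1) * (n_cols - 1)
  qchisq(alpha, df = df, lower.tail = FALSE)
}

crit_2x2 <- crit_value(2, 2)  # df = 1
crit_8x2 <- crit_value(8, 2)  # df = 7
round(c(crit_2x2, crit_8x2), 3)
```

The larger table needs a larger statistic to reach significance, because more cells contribute to the sum.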

test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")

## Pearson's X^2: 226.1399

Pearson’s chi-squared statistic (226.14) is much larger than the critical value of 3.841. The gap between the observed counts and the counts we would expect if the variables were independent is therefore too large to attribute to sampling variation at the 0.05 level. This shows there is a statistically significant association between political attitudes and level of education; the probability of being left vs. right varies across education levels.
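To make the long-form calculation concrete, here is a minimal sketch on a small table of made-up counts (the numbers are hypothetical, not ESS data). It builds the expected frequencies from the margins, sums the scaled squared differences, and compares the result with chisq.test().

```r
# Hypothetical 2x2 counts (rows: Left/Right, columns: BA or more/No BA)
toy <- matrix(c(30, 20, 10, 40), nrow = 2,
              dimnames = list(c("Left", "Right"), c("BA or more", "No BA")))

n <- sum(toy)
# Expected counts under independence: (row total * column total) / N
expected <- outer(rowSums(toy), colSums(toy)) / n
x2_long <- sum((toy - expected)^2 / expected)

# Built-in test without Yates' continuity correction
x2_builtin <- unname(chisq.test(toy, correct = FALSE)$statistic)
all.equal(x2_long, x2_builtin)
```

Setting correct = FALSE matters for 2x2 tables, since the default continuity correction would give a slightly smaller statistic than the long-form sum.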

p_val <- pchisq(test_stat, df = 1, lower.tail = FALSE)
cat("p-value:", round(p_val, 4), "\n")

## p-value: 0

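The p-value prints as 0 only because it is rounded to four decimal places; it is not exactly zero (chisq.test() below reports p-value < 2.2e-16). Because it is far below 0.05, we reject the null hypothesis that political attitudes and level of education are independent of each other. This p-value is strong evidence of an association between these two variables.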
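The sketch below illustrates why a very small p-value shows up as 0 after rounding, and one way to report it instead. It reuses the statistic printed above and assumes df = 7, the value chisq.test() reports for this table.

```r
x2 <- 226.1399                      # statistic reported above
p <- pchisq(x2, df = 7, lower.tail = FALSE)

round(p, 4)                         # rounding hides the value
p > 0                               # the p-value itself is positive
format.pval(p, eps = 1e-4)          # reports an upper bound instead of 0
```

Reporting a bound such as "p < 0.0001" avoids suggesting that the probability is literally zero.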

cat("Chi-squared test result:\n")

## Chi-squared test result:

print(chisq.test(mytab, correct = FALSE))

## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 226.14, df = 7, p-value < 2.2e-16

Task 2

Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.

See variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%22874a86e2-a0f6-40b4-aef7-51c8eed98a7d/1%22]

Provide interpretations of the output (consider the variable info).

Step 1: Calculate the test statistic of sample

france_data <- read.fst("france_data.fst")
france_data <- france_data %>%
  filter(cntry == "FR") %>%
  
  mutate(
    sbbsntx = case_when(
      sbbsntx == 1 ~"Agree strongly",
      sbbsntx == 2 ~"Agree",
      sbbsntx == 3 ~"Neither agree nor disagree",
      sbbsntx == 4 ~"Disagree",
      sbbsntx == 5 ~"Disagree strongly",
      sbbsntx %in% c(7,8,9) ~ NA_character_,
      TRUE ~ as.character(sbbsntx)
    ),
    
    domicil = case_when(
      domicil == 1 ~ "Big city",
      domicil == 2 ~ "Suburbs",
      domicil == 3 ~ "Town",
      domicil == 4 ~ "Village",
      domicil == 5 ~ "Farm", 
      domicil %in% c(7,8,9) ~ NA_character_, 
      TRUE ~ as.character(domicil)
    )
  )

test_stat <- france_data %>%
  specify(explanatory = domicil, 
          response = sbbsntx) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")

## Warning: Removed 14999 rows containing missing values.

print(test_stat$stat)

## X-squared 
##  36.01302

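The observed X-squared value of 36.01 summarizes how far the cross-tabulation of sbbsntx (agreement that social benefits and services cost businesses too much in taxes and charges) by domicil (type of area the respondent lives in) departs from the counts expected if the two variables were independent of each other. On its own the value does not establish significance; it has to be compared with the null distribution simulated in Step 2. Note that domicil is kept here in its five original categories rather than collapsed to urban and rural.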

Step 2: simulate the null distribution

null_distribution <- france_data %>%
  specify(explanatory = domicil,
          response = sbbsntx) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")

## Warning: Removed 14999 rows containing missing values.

Step 3: calculate the p-value of sample

p_val <- null_distribution %>% 
  get_pvalue(obs_stat = test_stat, direction = "two-sided") 
p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.004

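A p-value of 0.004 is well below 0.05 and represents strong evidence to reject the null hypothesis. This means there is a low probability of producing a statistic as extreme as 36.01 if there was no relationship between where respondents live (domicil) and their views on whether social benefits cost businesses too much (sbbsntx). The relationship between the two variables is statistically significant. Because a chi-squared test is right-tailed, direction = "greater" would be the more conventional choice than "two-sided" here.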
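The logic behind the permutation p-value can be sketched in base R without infer. The data below are made up for illustration (area, view, and the helper chisq_stat are hypothetical names, not ESS variables), and the p-value is taken in the right tail only.

```r
set.seed(1)  # make the permutations reproducible

# Hypothetical data: 50 urban and 50 rural respondents
area <- rep(c("Urban", "Rural"), each = 50)
view <- c(rep(c("Agree", "Neutral", "Disagree"), times = c(25, 15, 10)),
          rep(c("Agree", "Neutral", "Disagree"), times = c(10, 15, 25)))

# Pearson chi-squared statistic for a pair of categorical vectors
chisq_stat <- function(x, y) {
  unname(chisq.test(table(x, y), correct = FALSE)$statistic)
}

obs <- chisq_stat(area, view)

# Shuffling one variable breaks any association, which mimics the null
null_stats <- replicate(1000, chisq_stat(area, sample(view)))

# Right-tailed p-value: share of permuted statistics at least as large
p_perm <- mean(null_stats >= obs)
c(observed = obs, p_value = p_perm)
```

This is the same idea as generate(type = "permute") followed by get_pvalue(), written out by hand so each step is visible.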

Task 3

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.

Step 4: Visualize

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

null_distribution

## Response: sbbsntx (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1 12.7 
##  2         2 15.0 
##  3         3 12.2 
##  4         4 11.0 
##  5         5 18.6 
##  6         6  6.68
##  7         7 16.1 
##  8         8  7.17
##  9         9 25.4 
## 10        10 12.5 
## # ℹ 990 more rows

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

null_distribution

## Response: sbbsntx (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1 12.7 
##  2         2 15.0 
##  3         3 12.2 
##  4         4 11.0 
##  5         5 18.6 
##  6         6  6.68
##  7         7 16.1 
##  8         8  7.17
##  9         9 25.4 
## 10        10 12.5 
## # ℹ 990 more rows

In the plot, the red line marks our observed test statistic (36.01), the histogram shows the chi-squared statistics simulated under the null hypothesis of independence, and the shaded band marks the 95% percentile interval of those simulated statistics. The red line falls to the right of the band, so the observed statistic is larger than almost all of the values we would expect if there was no association between the two variables. This lack of overlap is consistent with the small p-value from Step 3, and we reject the null hypothesis.
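What the shaded band represents can be sketched numerically. The example below is an illustration under stated assumptions: the null statistics are drawn from a chi-squared distribution with 16 degrees of freedom (a 5x5 table) as a stand-in for the permutation results, which are not reproduced here; only the observed statistic is taken from the output above.

```r
set.seed(42)

# Stand-in null distribution (assumption: chi-squared with df = 16)
null_stats <- rchisq(1000, df = 16)

# 95% percentile interval: the middle 95% of the simulated statistics
interval <- quantile(null_stats, probs = c(0.025, 0.975))

obs <- 36.01302                     # observed statistic from Step 1
outside_band <- obs > unname(interval[2])
interval
outside_band
```

An observed statistic beyond the upper end of this interval is one that the null distribution rarely produces.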

Task 4

Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.

Variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%2283b08cf2-508e-49a0-9fc8-c3cc281290ec/4%22]

Provide interpretations of the output (consider the variable info).

Step 1:

france_data <- read.fst("france_data.fst")
france_data <- france_data %>%
  filter(cntry == "FR") %>%
  
  mutate(
    lrscale = case_when(
      lrscale %in% 0:3 ~ "Left",
      lrscale %in% 7:10 ~ "Right",
      lrscale %in% c(77, 88, 99) ~ NA_character_,
      TRUE ~ as.character(lrscale)
    ), 
    
    ccrdprs = case_when(
      ccrdprs %in% 0:5 ~ "Low responsibility",
      ccrdprs %in% 6:10 ~ "High responsibility",
      TRUE ~ as.character(ccrdprs)
    )
  )

infer_test <- france_data %>%
  specify(explanatory = lrscale, 
          response = ccrdprs) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")

## Warning: Removed 15368 rows containing missing values.

print(test_stat$stat)

## X-squared 
##  36.01302

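The value printed above, 36.01, is not the statistic for this task. Step 1 stores its result in infer_test, but the print call reads test_stat, which still holds the Task 2 statistic for sbbsntx and domicil. The chi-squared statistic for ccrdprs (feelings of personal responsibility to reduce climate change) and lrscale is the one stored in infer_test, and that is the value to print and interpret here. Note also that ccrdprs was collapsed into low (0-5) and high (6-10) responsibility rather than left on its 0-10 numeric scale, and that the TRUE ~ as.character(...) fallbacks keep moderates (4-6) and any remaining missing-value codes as categories of their own.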

Step 2:

null_distribution <- france_data %>%
  specify(explanatory = lrscale,
          response = ccrdprs) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")

## Warning: Removed 15368 rows containing missing values.

Step 3:

p_val <- null_distribution %>% 
  get_pvalue(obs_stat = test_stat, direction = "two-sided")
p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.002

A p-value of 0.002 is below the 0.05 threshold, which would be strong evidence to reject the null hypothesis of no relationship between feelings of personal responsibility for reducing climate change and political attitudes. However, get_pvalue() was given obs_stat = test_stat, which holds the Task 2 statistic rather than the one stored in infer_test, so this p-value compares the wrong observed value with the null distribution. The conclusion should be confirmed by rerunning Step 3 with obs_stat = infer_test.

Task 5

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

null_distribution

## Response: ccrdprs (factor)
## Explanatory: lrscale (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1 16.7 
##  2         2 12.0 
##  3         3 15.1 
##  4         4  9.36
##  5         5 15.6 
##  6         6 24.3 
##  7         7 13.1 
##  8         8  8.86
##  9         9 14.2 
## 10        10  6.84
## # ℹ 990 more rows

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

null_distribution

## Response: ccrdprs (factor)
## Explanatory: lrscale (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1 16.7 
##  2         2 12.0 
##  3         3 15.1 
##  4         4  9.36
##  5         5 15.6 
##  6         6 24.3 
##  7         7 13.1 
##  8         8  8.86
##  9         9 14.2 
## 10        10  6.84
## # ℹ 990 more rows

Salomon_Kayla_Homework_2

Kayla Salomon

2024-03-12
