packages <- c("tidyverse", "infer", "fst") # add any you need here
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
invisible(lapply(packages, library, character.only = TRUE)) # attach each package; invisible() suppresses the returned list
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ess <- read_fst("All-ESS-Data.fst")
Set up the calculations for an exploration of educ_level (recoded as BA / No BA) and brncntr (recoded as Yes/No, as shown in Tutorial 4) for Switzerland. Using the steps outlined in the tutorial, first generate the tables of expected proportions and expected frequencies. Then determine and interpret the critical value for the test of independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value, and discuss the main takeaways.
Switzerland_data <- ess %>%
  filter(cntry == "CH") %>%
  mutate(
    brncntr = case_when(
      brncntr == 1 ~ "Yes",                     # born in country
      brncntr == 2 ~ "No",                      # not born in country
      brncntr %in% c(7, 8, 9) ~ NA_character_,  # refusal / don't know / no answer
      TRUE ~ as.character(brncntr)
    ),
    edulvla = case_when(
      essround < 5 & edulvla == 55 ~ NA_real_,  # 55 ("other") treated as missing in rounds 1-4
      TRUE ~ edulvla
    ),
    edulvlb = case_when(
      essround >= 5 & edulvlb == 5555 ~ NA_real_, # 5555 ("other") treated as missing in rounds 5+
      TRUE ~ edulvlb
    ),
    educ_level = case_when(
      essround < 5 & edulvla == 5 ~ "BA",       # rounds 1-4: tertiary education completed
      essround >= 5 & edulvlb > 600 ~ "BA",     # rounds 5+: BA level or above
      TRUE ~ "No BA"                            # everyone else, including missing education codes
    )
  )
unique(Switzerland_data$educ_level)
## [1] "BA" "No BA"
unique(Switzerland_data$brncntr)
## [1] "No" "Yes" NA
Switzerland_data <- Switzerland_data %>%
  filter(!is.na(brncntr), !is.na(educ_level)) # drop rows missing either variable
unique(Switzerland_data$brncntr)
## [1] "No" "Yes"
table(Switzerland_data$brncntr)
##
## No Yes
## 3867 13056
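The split is easier to read as shares (an optional check using the table above):
prop.table(table(Switzerland_data$brncntr)) # ≈ 0.229 not born in Switzerland ("No") vs ≈ 0.771 born there ("Yes")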
mytab <- table(Switzerland_data$brncntr, Switzerland_data$educ_level) # observed frequencies
rsums <- rowSums(mytab)
csums <- colSums(mytab)
N <- sum(mytab)
pptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(pptab)
## [,1] [,2]
## [1,] 0.05206627 0.1764393
## [2,] 0.17578931 0.5957051
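To see what tcrossprod() is doing here: under independence, the expected proportion in each cell is the product of its row and column marginal shares, so pptab is simply the outer product of rsums/N and csums/N. A minimal equivalence check using the objects above:
pptab_check <- outer(rsums / N, csums / N) # expected proportion = (row share) * (column share)
all.equal(as.vector(pptab), as.vector(pptab_check)) # should return TRUE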
fqtab <- N * pptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(fqtab)
## [,1] [,2]
## [1,] 881.1175 2985.882
## [2,] 2974.8825 10081.118
alpha <- 0.05 # significance level
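# df = (rows - 1) * (cols - 1) = (2 - 1) * (2 - 1) = 1 for a 2x2 table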
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841
# Pearson's chi-squared statistic:
test_stat <- sum((mytab - fqtab)^2 / fqtab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 65.1218
# p-value (rounds to 0 at 4 decimal places; the exact value is ~7e-16, as chisq.test() below confirms):
p_val <- pchisq(test_stat, df = 1, lower.tail = FALSE)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = FALSE))
##
## Pearson's Chi-squared test
##
## data: mytab
## X-squared = 65.122, df = 1, p-value = 7.041e-16
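As a sanity check, the hand-computed statistic should match chisq.test() exactly once the continuity correction is turned off (a minimal check using the objects above):
all.equal(test_stat, unname(chisq.test(mytab, correct = FALSE)$statistic)) # should return TRUE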
Takeaways: First, I generated the tables of expected proportions and expected frequencies. Second, the critical value is 3.841: if the chi-squared test statistic exceeds 3.841, we reject the null hypothesis at the 0.05 significance level. Third, the Pearson’s chi-squared statistic computed by hand (65.12) matches the chisq.test() output and is far larger than the critical value, so we reject the null hypothesis. A chi-squared test of independence asks whether two categorical variables are related; here the p-value is essentially zero (about 7e-16, which rounds to 0 above), so we reject the null hypothesis that the two variables are independent. In other words, there is strong evidence of an association between education level and being born in the country in Switzerland.
Using the infer package and following the 4 steps outlined in the tutorial and during the lecture, conduct and visualize the hypothesis test for domicil (recoded as urban/rural; see Tutorial 4) and gincdif (government should reduce differences in income levels) for the Netherlands. For gincdif, you must recode the five categories (i.e., you cannot keep it as numeric).
See variable info here:
https://ess-search.nsd.no/en/variable/1ad9f42e-0c02-4b13-9a67-ee83d82ac576
Decide on the appropriate test given the table provided during lecture. After conducting the 4 steps, interpret all the output and discuss your takeaways.
# Step 1: calculate the observed chi-squared statistic
Netherlands_data <- ess %>%
filter(cntry == "NL")
Netherlands_data <- Netherlands_data %>%
mutate(geo = recode(as.character(domicil),
'1' = "Urban",
'2' = "Urban",
'3' = "Rural",
'4' = "Rural",
'5' = "Rural",
'7' = NA_character_,
'8' = NA_character_,
'9' = NA_character_))
Netherlands_data <- Netherlands_data %>%
mutate(
gincdif = case_when(
gincdif == 1 ~ "Agree strongly", # Recode 1 to "Agree strongly"
gincdif == 2 ~ "Agree", # Recode 2 to "Agree"
gincdif == 3 ~ "Neither agree nor disagree", # Recode 3 to "Neither agree nor disagree"
gincdif == 4 ~ "Disagree", # Recode 4 to "Disagree"
gincdif == 5 ~ "Disagree strongly", # Recode 5 to "Disagree strongly"
gincdif %in% c(7, 8, 9) ~ NA_character_, # Recode 7, 8, 9 to NA
TRUE ~ as.character(gincdif) # Keep any other values as they are
)
)
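Since gincdif is ordinal, one optional refinement (not required for the test itself; infer treats the variable as an unordered factor either way) is to inspect the counts with the levels in Likert order. The helper vector lvls below is just for display:
lvls <- c("Agree strongly", "Agree", "Neither agree nor disagree",
          "Disagree", "Disagree strongly")
table(factor(Netherlands_data$gincdif, levels = lvls), useNA = "ifany")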
test_stat <- Netherlands_data %>%
specify(explanatory = geo,
response = gincdif) %>%
hypothesize(null = "independence") %>%
calculate(stat = "Chisq")
## Warning: Removed 198 rows containing missing values.
print(test_stat$stat)
## X-squared
## 8.176048
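For context (an optional classical benchmark, not one of the four infer steps): with two geo categories and five gincdif categories, the theoretical degrees of freedom are (2 - 1) * (5 - 1) = 4, so the corresponding critical value at the 0.05 level is:
qchisq(0.05, df = 4, lower.tail = FALSE) # ≈ 9.488; the observed 8.18 falls below it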
# Step 2: generate the null distribution by permutation
null_distribution <- Netherlands_data %>%
specify(explanatory = geo,
response = gincdif) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard
calculate(stat = "Chisq")
## Warning: Removed 198 rows containing missing values.
# Step 3: compute the p-value
p_val <- null_distribution %>% # replace name here if you assigned something other than null_distribution above
get_pvalue(obs_stat = test_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.168
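The classical version of the same test is available as a cross-check on the permutation p-value (a sketch; infer::chisq_test() runs the theoretical chi-squared test on the same specification):
Netherlands_data %>%
  chisq_test(response = gincdif, explanatory = geo)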
# Step 4: visualize the null distribution and the observed statistic
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided")
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
null_distribution
## Response: gincdif (factor)
## Explanatory: geo (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 3.45
## 2 2 1.96
## 3 3 4.83
## 4 4 0.974
## 5 5 1.76
## 6 6 8.21
## 7 7 8.04
## 8 8 1.04
## 9 9 6.17
## 10 10 1.48
## # ℹ 990 more rows
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
Takeaways: In both plots, the null distribution is concentrated near zero with a long right tail, the characteristic right-skewed shape of a chi-squared distribution. The permutation p-value is 0.168, which is greater than 0.05, so we cannot reject the null hypothesis of independence. The conclusion is that in the Netherlands there is no evidence of an association between living in an urban versus a rural area and whether individuals believe the government should reduce differences in income levels.
First, state a research question you would like to consider, which identifies a main relationship between an explanatory variable and an outcome variable. Using the infer package and following the 4 steps outlined in the tutorial and during the lecture, conduct and visualize the hypothesis test for your variables based on your research question. After conducting the 4 steps, interpret all the output and discuss your takeaways.
Research question: in Norway, is how close people feel to a particular political party (prtdgcl) associated with living in an urban versus a rural area (domicil, recoded as geo)?
# Step 1: calculate the observed chi-squared statistic
Norway_data <- ess %>%
filter(cntry == "NO")
Norway_data <- Norway_data %>%
mutate(geo = recode(as.character(domicil),
'1' = "Urban",
'2' = "Urban",
'3' = "Rural",
'4' = "Rural",
'5' = "Rural",
'7' = NA_character_,
'8' = NA_character_,
'9' = NA_character_))
Norway_data <- Norway_data %>%
mutate(
prtdgcl = case_when(
prtdgcl == 1 ~ "Very close",                 # Recode 1 to "Very close"
prtdgcl == 2 ~ "Quite close",                # Recode 2 to "Quite close"
prtdgcl == 3 ~ "Not close",                  # Recode 3 to "Not close"
prtdgcl == 4 ~ "Not at all close",           # Recode 4 to "Not at all close"
prtdgcl %in% c(6, 7, 8, 9) ~ NA_character_,  # Recode 6, 7, 8, 9 to NA
TRUE ~ as.character(prtdgcl)                 # Keep any other values as they are
)
)
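Before running the test, it can help to eyeball the observed cross-tabulation (an optional sanity check; missing values are dropped by default):
with(Norway_data, table(geo, prtdgcl))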
test_stat <- Norway_data %>%
specify(explanatory = geo,
response = prtdgcl) %>%
hypothesize(null = "independence") %>%
calculate(stat = "Chisq")
## Warning: Removed 5951 rows containing missing values.
print(test_stat$stat)
## X-squared
## 12.95907
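For context (an optional classical benchmark): with two geo categories and four prtdgcl categories, the theoretical degrees of freedom are (2 - 1) * (4 - 1) = 3, giving a critical value of:
qchisq(0.05, df = 3, lower.tail = FALSE) # ≈ 7.815; the observed 12.96 exceeds it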
# Step 2: generate the null distribution by permutation
null_distribution <- Norway_data %>%
specify(explanatory = geo,
response = prtdgcl) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard
calculate(stat = "Chisq")
## Warning: Removed 5951 rows containing missing values.
# Step 3: compute the p-value
p_val <- null_distribution %>% # replace name here if you assigned something other than null_distribution above
get_pvalue(obs_stat = test_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.012
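As in the previous question, the classical test provides a cross-check on the permutation p-value (a sketch using infer::chisq_test()):
Norway_data %>%
  chisq_test(response = prtdgcl, explanatory = geo)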
# Step 4: visualize the null distribution and the observed statistic
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided")
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
null_distribution
## Response: prtdgcl (factor)
## Explanatory: geo (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 8.31
## 2 2 1.20
## 3 3 0.722
## 4 4 1.16
## 5 5 0.602
## 6 6 1.10
## 7 7 4.09
## 8 8 2.14
## 9 9 1.73
## 10 10 3.71
## # ℹ 990 more rows
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
Takeaways: The null distribution is again concentrated near zero with a long right tail (right-skewed), and the observed statistic sits in the far right of that distribution. The observed chi-squared value is about 12.96, and the permutation p-value is 0.012, which is below 0.05, so we reject the null hypothesis of independence. The conclusion is that in Norway there is evidence of an association between living in an urban versus a rural area and how close individuals feel to the political party they favour.