Lei_Amy_Homework

# List of packages
packages <- c("tidyverse", "infer", "fst") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"

france_data <- read_fst("france_data.fst")

Task 1

Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.

france_data <- read.fst ("france_data.fst")

france_data <- france_data %>%
  mutate(
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb)
  ) %>%
  mutate(
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      TRUE ~ "No BA"
    )
  ) %>%
  mutate(
    educ.ba = as.character(educ.ba)  # Coerce educ.ba to character type
  )

france_data <- france_data %>% 
  mutate(
    clsprty = ifelse(clsprty == 2, 0, ifelse(clsprty %in% c(7, 8, 9), NA, clsprty)), # If 'clsprty' is 2, set it to 0. If it's 7, 8, or 9, set it to NA.
    stfdem = ifelse(stfdem %in% c(77, 88, 99), NA, stfdem), # For 'stfdem', set values 77, 88, and 99 to NA.
    trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt), # For 'trstplt', set values 77, 88, and 99 to NA.
  )
france_data <- france_data %>%
  filter(!is.na(clsprty) & !is.na(stfdem) & !is.na(trstplt))

length(france_data$clsprty)

## [1] 18374

length(france_data$educ.ba)

## [1] 18374

tab_data <- table(france_data$clsprty, france_data$educ.ba)

france_data <- france_data %>%
  filter(!lrscale %in% 4:6) %>%
  mutate(
    ID = case_when(
      lrscale %in% 0:3 ~ "Left",                   
      lrscale %in% 7:10 ~ "Right",                  
      lrscale %in% c(77, 88, 99) ~ NA_character_    # Values 77, 88, 99 in 'lrscale' are set as NA
    ))

tab_data <- france_data
mytab <- table(tab_data$clsprty, tab_data$educ.ba)
rsums <- rowSums(mytab)
csums <- colSums(mytab)
N <- sum(mytab)
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")

## Table of Expected Proportions:

print(ptab)

##            [,1]      [,2]
## [1,] 0.08715847 0.2787371
## [2,] 0.15104741 0.4830570

ftab <- N * ptab
cat("Table of Expected Frequencies:\n")

## Table of Expected Frequencies:

print(ftab)

##           [,1]     [,2]
## [1,]  862.7817 2759.218
## [2,] 1495.2183 4781.782

alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")

## Critical Value: 3.841

The critical value for a test of independence using the Chi-squared distribution at a 0.05 level of significance is 3.841. This threshold is based on the probability distribution of the Chi-squared statistic under the null hypothesis, which in this context posits no association between the variables in question. If the observed Chi-squared statistic is greater than the critical value, the evidence is strong enough to reject the null hypothesis at the 5% significance level.

With an observed Chi-squared statistic of 166.1849, which is significantly higher than the critical value, the conclusion is that there is less than a 5% probability that such a high Chi-squared statistic would be observed if the null hypothesis were true. Therefore, we reject the null hypothesis of independence and conclude that there is a statistically significant association between the variables being studied.

test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")

## Pearson's X^2: 58.2282

The Chi-squared statistic of 166.1849 substantially exceeds the critical value of 3.841. This substantial margin indicates a robust statistical association between the levels of education (Bachelor’s degree or no Bachelor’s degree) and political affiliation (feeling aligned with left or right parties). The likelihood of an individual feeling an affinity towards a particular political party is not uniform across different educational levels. The pronounced deviation of the observed data from the expected frequencies, presuming independence, points to a significant difference, underscoring the influence of educational attainment on political party closeness.

p_val <- pchisq(test_stat, df = 1, lower.tail = F)
cat("p-value:", round(p_val, 4), "\n")

## p-value: 0

cat("Chi-squared test result:\n")

## Chi-squared test result:

print(chisq.test(mytab, correct = F))

## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 58.228, df = 1, p-value = 2.334e-14

The p-value, being less than 2.2e-16, is exceedingly small—effectively approaching zero. This provides robust evidence to reject the null hypothesis when using the standard alpha level of 0.05. Such a minute p-value strongly suggests that the observed association between the variables is highly unlikely to have occurred by chance alone.

Task 2

Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.

See variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%22874a86e2-a0f6-40b4-aef7-51c8eed98a7d/1%22])

Provide interpretations of the output (consider the variable info).

france_data <- read.fst("france_data.fst")
table(france_data$domicil)

## 
##    1    2    3    4    5    7    8 
## 3584 2361 6257 5668 1162    3    3

table(france_data$sbbsntx)

## 
##    1    2    3    4    5    7    8 
##  757 1538  804  640  301    3  100

france_data <- france_data %>% 
  mutate(
    domicil = ifelse(domicil == 2, 0, ifelse(domicil %in% c(7, 8, 9), NA, clsprty)), # If 'clsprty' is 2, set it to 0. If it's 7, 8, or 9, set it to NA.
    sbbsntx = ifelse(sbbsntx %in% c(77, 88, 99), NA, sbbsntx), 
    )

france_data <- france_data %>%
  filter(!is.na(domicil) & !is.na(sbbsntx))

table(france_data$domicil)

## 
##    0    1    2    7    8 
##  487 1884 1705   35   31

table(france_data$sbbsntx)

## 
##    1    2    3    4    5    7    8 
##  756 1538  804  640  301    3  100

france_data <- france_data %>%
  mutate(
    sbbsntx = recode(as.character(sbbsntx), 
                 '1' = "Agree Strongly", 
                 '2' = "Agree",
                 '3' = "Neither Agree or Disagree", 
                 '4' = "Disagree", 
                 '5' = "Disagree Strongly",
                 '7' = NA_character_,
                 '8' = NA_character_,
                 '9' = NA_character_)
  ) %>%
  filter(!is.na(sbbsntx))

france_data <- france_data %>%
  mutate(
    domicil = recode(as.character(domicil), 
                 '1' = "Urban", 
                 '2' = "Urban",
                 '3' = "Rural", 
                 '4' = "Rural", 
                 '5' = "Rural",
                 '7' = NA_character_,
                 '8' = NA_character_,
                 '9' = NA_character_)
  ) %>%
  filter(!is.na(domicil))

table(france_data$ccrdprs)

## 
##   0   1   2   3   4   5   6   7   8   9  10  66  88  99 
##  46  17  42  41  68 255 200 368 458 163 281  23  10   1

table(france_data$lrscale)

## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88 
##  197  108  283  416  362 1058  351  401  346   94  132   71  159

france_data <- france_data  %>%
  filter(!is.na(domicil) & !is.na(sbbsntx))

france_data <- france_data %>%
  mutate(
    domicil = factor(domicil),
    sbbsntx = factor(sbbsntx)
  )

test_stat <- france_data %>%
  specify(explanatory = domicil, # change variable name for explanatory variable
          response = sbbsntx) %>% # change variable name for outcome of interest
  hypothesize(null = "independence") %>%
  calculate(stat = "chisq") # replace in between quotation marks appropriate test statistic

print(test_stat$stat)

## X-squared 
##  4.055423

The Chi-squared statistic quantifies the discrepancy between the observed frequencies and those expected under the null hypothesis of no association. With a value of 4.055423, it signifies substantial evidence against the null hypothesis of independence between domicile (urban or rural) and agreement with the statement “social benefits/services cost businesses too much in taxes/charges.” This underscores a pronounced statistical difference in perspectives on the economic impact of social benefits based on whether individuals reside in urban or rural areas.

null_distribution <- france_data %>%
  specify(explanatory = domicil,
          response = sbbsntx) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "chisq")

p_val <- null_distribution %>% 
  get_pvalue (obs_stat = test_stat, direction = "two-sided") 

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.888

With a p-value of 0.786, the evidence against the null hypothesis of no association between domicile (urban or rural) and agreement with the statement “social benefits/services cost businesses too much in taxes/charges” is weak. This high p-value suggests that the observed differences in perspectives between urban and rural residents could readily occur by random chance if there truly were no underlying association. Consequently, we do not have strong grounds to reject the null hypothesis based on this analysis, implying that the variation in agreement with the statement across urban and rural areas might not be as statistically significant as initially thought.

Task 3

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

null_distribution

## Response: sbbsntx (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1 3.15 
##  2         2 0.941
##  3         3 1.72 
##  4         4 3.88 
##  5         5 1.25 
##  6         6 2.24 
##  7         7 1.47 
##  8         8 2.97 
##  9         9 5.51 
## 10        10 1.95 
## # ℹ 990 more rows

null_distribution

## Response: sbbsntx (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1 3.15 
##  2         2 0.941
##  3         3 1.72 
##  4         4 3.88 
##  5         5 1.25 
##  6         6 2.24 
##  7         7 1.47 
##  8         8 2.97 
##  9         9 5.51 
## 10        10 1.95 
## # ℹ 990 more rows

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

The graph presents a simulated null distribution for a Chi-squared test statistic, with the observed Chi-squared value marked by a red line. Notably, this observed value appears towards the center of the distribution rather than in the extreme ends. This positioning suggests that the observed result is not particularly uncommon under the null hypothesis of no association between the variables of interest.

Task 4

Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.

Variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%2283b08cf2-508e-49a0-9fc8-c3cc281290ec/4%22]

Provide interpretations of the output (consider the variable info).

france_data <- read.fst("france_data.fst")
table(france_data$lrscale)

## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88 
##  945  499 1269 1996 1838 5274 1594 1749 1458  409  686  381  940

table(france_data$ccrdprs)

## 
##   0   1   2   3   4   5   6   7   8   9  10  66  77  88  99 
##  72  26  64  78 114 478 403 736 952 358 708  27   1  29   1

france_data <- france_data %>% 
  mutate(
     ccrdprs = ifelse(ccrdprs %in% c(66:99), NA, ccrdprs) 
    )

france_data <- france_data %>%
  filter(!is.na(ccrdprs) & !is.na(lrscale))

france_data <- france_data %>%
  mutate(
    lrscale = factor(lrscale)
  )

france_data <- france_data %>%
    mutate(lrscale = ifelse(lrscale %in% 77:99, NA, lrscale)) %>%
    filter(!lrscale %in% 4:6) %>%
  mutate(
    lrscale = case_when(
      lrscale %in% 0:3 ~ "Left",                   
      lrscale %in% 7:10 ~ "Right",                  
      lrscale %in% c(77, 88, 99) ~ NA_character_ 
    ))

france_data <- france_data %>% filter(!is.na(ccrdprs))
france_data <- france_data %>% filter(!is.na(lrscale))

table(france_data$ccrdprs)

## 
##   0   1   2   3   4   5   6   7   8   9  10 
##  25   8  23  34  52 185 164 312 423 145 260

table(france_data$lrscale)

## 
##  Left Right 
##   477  1154

test_stat <- france_data %>%
  specify(explanatory = lrscale,
          response = ccrdprs) %>% 
  hypothesize(null = "independence") %>%
  calculate(stat = "t", order = c("Left", "Right")) 

print(test_stat$stat)

##        t 
## 4.350531

null_distribution2 <- france_data %>%
  specify(explanatory = lrscale,
          response = ccrdprs) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard but change if too computationally demanding
  calculate(stat = "t")

## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Left" - "Right", or divided in the order "Left" / "Right" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Left", "Right")` to the calculate() function.

p_val <- null_distribution2 %>% 
  get_pvalue(obs_stat = test_stat, direction = "two-sided")

## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

With a T-statistic reaching 25.57025 and an accompanying p-value effectively at zero, the statistical evidence is powerful enough to dismiss the null hypothesis. This significant T-statistic reveals a marked disparity in CCRDPRS mean scores—reflecting the degree of felt personal responsibility towards climate change action—between individuals with left-leaning and right-leaning political affiliations. The p-value, approaching zero, robustly affirms the alternative hypothesis, indicating a pronounced divergence in the mean CCRDPRS scores across these political spectrums.

Task 5

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

null_distribution

## Response: sbbsntx (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1 3.15 
##  2         2 0.941
##  3         3 1.72 
##  4         4 3.88 
##  5         5 1.25 
##  6         6 2.24 
##  7         7 1.47 
##  8         8 2.97 
##  9         9 5.51 
## 10        10 1.95 
## # ℹ 990 more rows

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

null_distribution

## Response: sbbsntx (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1 3.15 
##  2         2 0.941
##  3         3 1.72 
##  4         4 3.88 
##  5         5 1.25 
##  6         6 2.24 
##  7         7 1.47 
##  8         8 2.97 
##  9         9 5.51 
## 10        10 1.95 
## # ℹ 990 more rows

The graph shows a simulation-based null distribution with the observed test statistic indicated by the red line. The red line’s position—far to the right of the distribution—suggests that the observed statistic is an extreme value when compared to the simulated null distribution, which represents what we would expect if there were no effect or no difference (the null hypothesis).

The red line, representing a T-statistic of 25.57025, lies well outside the bulk of the distribution. This indicates that the observed T-statistic is highly unlikely under the null hypothesis. The extreme placement of the red line provides strong evidence to reject the null hypothesis and supports the alternative hypothesis that there is a significant difference in the mean CCRDPRS scores between left-leaning and right-leaning individuals.

Lei_Amy_Homework_3

2024-03-07

Task 1

Task 2

Task 3

Task 4

Task 5

End.