Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.
packages <- c("tidyverse", "infer", "fst", "modelsummary", "flextable")
# Install any packages that are not already present, then load them all
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages)
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'infer' was built under R version 4.3.3
## Warning: package 'flextable' was built under R version 4.3.3
##
## Attaching package: 'flextable'
##
## The following object is masked from 'package:purrr':
##
## compose
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "infer" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "fst" "infer" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
##
## [[4]]
## [1] "modelsummary" "fst" "infer" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "flextable" "modelsummary" "fst" "infer" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
ess <- read_fst("All-ESS-Data.fst")
ess <- read_fst("france_data.fst")
france_data <- read.fst("france_data.fst")
france_data <- france_data %>%
mutate(
lrscale = case_when(
lrscale %in% 0:3 ~ "Left",
lrscale %in% 7:10 ~ "Right",
lrscale %in% c(4, 5, 6, 77, 88, 99) ~ NA_character_
),
Gen = case_when(
yrbrn %in% 1900:1945 ~ "1",
yrbrn %in% 1946:1964 ~ "2",
yrbrn %in% 1965:1979 ~ "3",
yrbrn %in% 1980:1996 ~ "4",
TRUE ~ as.character(yrbrn)
),
Gen = factor(Gen, levels = c("1", "2", "3", "4"),
labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
)
table(france_data$lrscale)
##
## Left Right
## 4709 4302
france_data <- france_data %>% filter(!is.na(lrscale))
unique(france_data$lrscale)
## [1] "Left" "Right"
france_data <- ess %>%
filter(cntry == "FR") %>%
mutate(
clsprty = case_when(
clsprty == 1 ~ "Yes",
clsprty == 2 ~ "No",
clsprty %in% c(7, 8, 9) ~ NA_character_,
TRUE ~ as.character(clsprty)
),
trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt),
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
gndr == 9 ~ NA_character_,
TRUE ~ as.character(gndr)
),
educ.ba = case_when(
essround < 5 & edulvla == 5 ~ "BA or more",
essround >= 5 & edulvlb > 600 ~ "BA or more",
TRUE ~ "No BA"
)
) %>%
mutate(
edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
educ.ba = as.character(educ.ba) # Force educ.ba to be treated as character
)
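One caveat worth flagging: educ.ba is derived in the first mutate(), before the edulvla/edulvlb missing codes are recoded to NA in the second, so respondents with those codes end up in "No BA" rather than NA. A sketch of a variant that recodes the missing values first, applied to a hypothetical copy (france_check) so the printed results below are unaffected:
# Sketch: recode the missing codes to NA before deriving educ.ba, so that
# respondents with unknown education propagate to NA instead of "No BA".
france_check <- ess %>%
  filter(cntry == "FR") %>%
  mutate(
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      essround < 5 & is.na(edulvla) ~ NA_character_,
      essround >= 5 & is.na(edulvlb) ~ NA_character_,
      TRUE ~ "No BA"
    )
  )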
tab_dat <- france_data
mytab <- table(tab_dat$clsprty, tab_dat$educ.ba)
rsums <- rowSums(mytab)
csums <- colSums(mytab)
N <- sum(mytab)
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
## [,1] [,2]
## [1,] 0.1101248 0.3815066
## [2,] 0.1138739 0.3944946
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
## [,1] [,2]
## [1,] 2059.444 7134.556
## [2,] 2129.556 7377.444
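As an optional cross-check, the same expected counts under independence, E_ij = (row i total × column j total) / N, are returned by chisq.test():
# Optional cross-check: chisq.test() computes the same expected counts
# under independence; all.equal() should report TRUE up to rounding.
expected_check <- chisq.test(mytab, correct = FALSE)$expected
all.equal(unname(expected_check), ftab)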
alpha <- 0.05
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841
If the chi-squared test statistic exceeds 3.841, I reject the null hypothesis of independence at the 0.05 significance level; exceeding this critical value would indicate a statistically significant association between feeling close to a party and education level in my dataset.
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 166.1849
p_val <- pchisq(test_stat, df = 1, lower.tail = FALSE)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = FALSE))
##
## Pearson's Chi-squared test
##
## data: mytab
## X-squared = 166.18, df = 1, p-value < 2.2e-16
As noted in the tutorial markdown, "p-value < 2.2e-16" is the same p-value reported in scientific notation because it is extremely close to 0.
Since the p-value is far below the 0.05 threshold, I reject the null hypothesis that the two variables are independent. This is strong evidence of a statistically significant association between feeling close to a party and education level.
The chi-squared statistic of 166.18, which far exceeds the critical value of 3.841, supports the same conclusion; the statistic would have been much closer to 0 if the two variables were independent.
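The same decision rule can be written as a quick sketch using the objects computed above (the two checks agree by construction, since the p-value falls below alpha exactly when the statistic exceeds the critical value):
# Decision rule in code: reject H0 (independence) when the statistic
# exceeds the critical value, or equivalently when p-value < alpha.
c(reject_by_statistic = test_stat > c_val,
  reject_by_p_value = p_val < alpha)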
Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.
See variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%22874a86e2-a0f6-40b4-aef7-51c8eed98a7d/1%22]
Provide interpretations of the output (consider the variable info).
france_data <- ess %>%
filter(cntry == "FR") %>%
mutate(
domicil = case_when(
domicil == 1 ~ "Rural",
domicil == 2 ~ "Urban",
domicil %in% c(3, 4, 5, 7, 8) ~ NA_character_,
TRUE ~ as.character(domicil)
),
sbbsntx = case_when(
sbbsntx == 1 ~ "Agree strongly",
sbbsntx == 2 ~ "Agree",
sbbsntx == 3 ~ "Neither agree nor disagree",
sbbsntx == 4 ~ "Disagree",
sbbsntx == 5 ~ "Disagree strongly",
sbbsntx %in% c(7, 8) ~ NA_character_,
TRUE ~ as.character(sbbsntx)
)
)
table(france_data$domicil)
##
## Rural Urban
## 3584 2361
table(france_data$sbbsntx)
##
## Agree Agree strongly
## 1538 757
## Disagree Disagree strongly
## 640 301
## Neither agree nor disagree
## 804
unique(france_data$domicil)
## [1] NA "Urban" "Rural"
unique(france_data$sbbsntx)
## [1] NA "Agree strongly"
## [3] "Neither agree nor disagree" "Agree"
## [5] "Disagree strongly" "Disagree"
france_data <- france_data %>% filter(!is.na(domicil))
france_data <- france_data %>% filter(!is.na(sbbsntx))
unique(france_data$domicil)
## [1] "Rural" "Urban"
unique(france_data$sbbsntx)
## [1] "Neither agree nor disagree" "Agree"
## [3] "Agree strongly" "Disagree"
## [5] "Disagree strongly"
table(france_data$domicil)
##
## Rural Urban
## 684 479
table(france_data$sbbsntx)
##
## Agree Agree strongly
## 425 189
## Disagree Disagree strongly
## 220 102
## Neither agree nor disagree
## 227
## Not necessary, but renaming before table operations makes it easy to backtrack to the clean dataset
tab_dat2 <- france_data
# Create a crosstable and save it into the object mytab2
mytab2 <- table(tab_dat2$domicil, tab_dat2$sbbsntx)
# Calculate row-wise sums of counts in table
rsums2 <- rowSums(mytab2)
# Calculate column-wise sums of counts in table
csums2 <- colSums(mytab2)
# Get the total count of the table (N)
N2 <- sum(mytab2)
# Generate table of expected proportions
ptab2 <- tcrossprod(rsums2/N2, csums2/N2)
cat("Table of Expected Proportions2:\n2")
## Table of Expected Proportions2:
## 2
print(ptab2)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2149243 0.09557812 0.111255 0.05158184 0.11479488
## [2,] 0.1505099 0.06693263 0.077911 0.03612237 0.08038998
ftab2 <- N2 * ptab2
cat("Table of Expected Frequencies2:\n")
## Table of Expected Frequencies2:
print(ftab2)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 249.957 111.15735 129.38951 59.98968 133.50645
## [2,] 175.043 77.84265 90.61049 42.01032 93.49355
alpha2 <- 0.05
# df = (rows - 1) * (cols - 1) = (2 - 1) * (5 - 1) = 4 for this 2 x 5 table
c_val2 <- qchisq(alpha2, df = 4, lower.tail = FALSE)
cat("Critical Value2:", round(c_val2, 3), "\n")
## Critical Value2: 9.488
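Because the degrees of freedom for a test of independence are (rows - 1) * (cols - 1), deriving them from the table itself guards against hard-coding the wrong value; a minimal sketch:
# Derive df from the table dimensions rather than hard-coding it;
# (2 - 1) * (5 - 1) = 4 for the 2 x 5 table here.
df2 <- (nrow(mytab2) - 1) * (ncol(mytab2) - 1)
qchisq(alpha2, df = df2, lower.tail = FALSE)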
test_stat2 <- france_data %>%
specify(explanatory = domicil,
response = sbbsntx) %>%
hypothesize(null = "independence") %>%
calculate(stat = "Chisq")
print(test_stat2$stat)
## X-squared
## 5.87
null_distribution <- france_data %>%
specify(explanatory = domicil,
response = sbbsntx) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "Chisq")
p_val <- null_distribution %>%
get_p_value(obs_stat = test_stat2, direction = "two-sided")
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.372
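For comparison (optional), the theoretical right-tail probability of the observed statistic under the chi-square distribution with df = 4 can be computed directly. It is not identical to the two-sided permutation p-value above, which roughly doubles the right-tail proportion, but it tells the same non-significant story:
# Optional: theoretical right-tailed p-value for the observed statistic
pchisq(test_stat2$stat, df = 4, lower.tail = FALSE)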
As the chi-squared statistic of 5.87 does not exceed the critical value of 9.488, and the p-value of 0.372 exceeds the 0.05 threshold, I fail to reject the null hypothesis that the two variables are independent. In other words, there is little or no evidence of a statistically significant association between the view that social benefits and services cost businesses too much in taxes and charges (sbbsntx) and whether individuals in our sample live in urban or rural areas (domicil).
Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat2, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
The green band marks the middle 95% of the simulated null distribution (the percentile interval computed above). Because the observed test statistic (the red line) falls inside this band, a large share of the simulated statistics are at least as extreme as the observed one (the red-shaded regions), which is consistent with the non-significant result: little or no statistically significant association between the two variables in my dataset.
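The warning above notes that chi-square tests are conventionally right-tailed, since the statistic is non-negative and only large values signal dependence. A sketch of the right-tailed variant:
# Right-tailed p-value, the conventional choice for a chi-square statistic
null_distribution %>%
  get_p_value(obs_stat = test_stat2, direction = "greater")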
Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.
Variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%2283b08cf2-508e-49a0-9fc8-c3cc281290ec/4%22]
Provide interpretations of the output (consider the variable info).
table(france_data$ccrdprs)
##
## 0 1 2 3 4 5 6 7 8 9 10 66 88
## 9 6 6 11 17 52 69 104 131 53 87 4 3
france_data <- ess %>%
filter(cntry == "FR") %>%
mutate(
ccrdprs = case_when(
ccrdprs == 0 ~ "0 - Not at all",
ccrdprs == 1 ~ "1",
ccrdprs == 2 ~ "2",
ccrdprs == 3 ~ "3",
ccrdprs == 4 ~ "4",
ccrdprs == 5 ~ "5",
ccrdprs == 6 ~ "6",
ccrdprs == 7 ~ "7",
ccrdprs == 8 ~ "8",
ccrdprs == 9 ~ "9",
ccrdprs == 10 ~ "10 - A great deal",
ccrdprs %in% c(66, 77, 88, 99) ~ NA_character_,
TRUE ~ as.character(ccrdprs)
)
)
table(france_data$ccrdprs)
##
## 0 - Not at all 1 10 - A great deal 2
## 72 26 708 64
## 3 4 5 6
## 78 114 478 403
## 7 8 9
## 736 952 358
unique(france_data$ccrdprs)
## [1] NA "10 - A great deal" "8"
## [4] "2" "7" "6"
## [7] "5" "1" "9"
## [10] "0 - Not at all" "4" "3"
france_data <- france_data %>% filter(!is.na(ccrdprs))
unique(france_data$ccrdprs)
## [1] "10 - A great deal" "8" "2"
## [4] "7" "6" "5"
## [7] "1" "9" "0 - Not at all"
## [10] "4" "3"
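Because ccrdprs was recoded to character, tables sort its categories alphabetically ("10 - A great deal" lands right after "1"). An optional refinement, shown here without overwriting the variable so the printed tables above keep their order, is to tabulate a factor with explicit levels:
# Optional: a factor with explicit levels keeps the 0-10 scale order
# in tables instead of alphabetical character sorting.
ccrdprs_levels <- c("0 - Not at all", as.character(1:9), "10 - A great deal")
table(factor(france_data$ccrdprs, levels = ccrdprs_levels))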
france_data <- france_data %>%
mutate(
lrscale = case_when(
lrscale %in% 0:3 ~ "Left",
lrscale %in% 7:10 ~ "Right",
lrscale %in% c(4, 5, 6, 77, 88, 99) ~ NA_character_
),
Gen = case_when(
yrbrn %in% 1900:1945 ~ "1",
yrbrn %in% 1946:1964 ~ "2",
yrbrn %in% 1965:1979 ~ "3",
yrbrn %in% 1980:1996 ~ "4",
TRUE ~ as.character(yrbrn)
),
Gen = factor(Gen, levels = c("1", "2", "3", "4"),
labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
)
table(france_data$lrscale)
##
## Left Right
## 889 982
france_data <- france_data %>% filter(!is.na(lrscale))
unique(france_data$lrscale)
## [1] "Left" "Right"
## Not necessary, but renaming before table operations makes it easy to backtrack to the clean dataset
tab_dat3 <- france_data
# Create a crosstable and save it into the object mytab3
mytab3 <- table(tab_dat3$lrscale, tab_dat3$ccrdprs)
# Calculate row-wise sums of counts in table
rsums3 <- rowSums(mytab3)
# Calculate column-wise sums of counts in table
csums3 <- colSums(mytab3)
# Get the total count of the table (N)
N3 <- sum(mytab3)
# Generate table of expected proportions
ptab3 <- tcrossprod(rsums3/N3, csums3/N3)
cat("Table of Expected Proportions3:\n3")
## Table of Expected Proportions3:
## 3
print(ptab3)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.008380465 0.003047442 0.08507442 0.007110698 0.009650233 0.01447535
## [2,] 0.009257162 0.003366241 0.09397422 0.007854561 0.010659762 0.01598964
## [,7] [,8] [,9] [,10] [,11]
## [1,] 0.05206047 0.04393395 0.0873600 0.1198660 0.04418791
## [2,] 0.05750661 0.04852997 0.0964989 0.1324055 0.04881049
ftab3 <- N3 * ptab3
cat("Table of Expected Frequencies3:\n3")
## Table of Expected Frequencies3:
## 3
print(ftab3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 15.67985 5.701764 159.1742 13.30412 18.05559 27.08338 97.40513 82.20043
## [2,] 17.32015 6.298236 175.8258 14.69588 19.94441 29.91662 107.59487 90.79957
## [,9] [,10] [,11]
## [1,] 163.4506 224.2694 82.67557
## [2,] 180.5494 247.7306 91.32443
alpha3 <- 0.05
# df = (rows - 1) * (cols - 1) = (2 - 1) * (11 - 1) = 10 for this 2 x 11 table
c_val3 <- qchisq(alpha3, df = 10, lower.tail = FALSE)
cat("Critical Value3:", round(c_val3, 3), "\n")
## Critical Value3: 18.307
test_stat3 <- france_data %>%
specify(explanatory = lrscale,
response = ccrdprs) %>%
hypothesize(null = "independence") %>%
calculate(stat = "Chisq")
print(test_stat3$stat)
## X-squared
## 35.04425
null_distribution <- france_data %>%
specify(explanatory = lrscale,
response = ccrdprs) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "Chisq")
p_val <- null_distribution %>%
get_p_value(obs_stat = test_stat3, direction = "two-sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
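Consistent with the warning, the permutation p-value of 0 is an approximation artifact of using 1,000 reps. The theoretical right-tail probability for the observed statistic (with df = 10 for the 2 x 11 table) is small but nonzero, as a quick sketch shows:
# Theoretical right-tailed p-value for the observed statistic; this gives
# a tiny nonzero value where the permutation approximation reports 0.
pchisq(test_stat3$stat, df = 10, lower.tail = FALSE)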
As the chi-squared statistic of 35.04 far exceeds the critical value of 18.307, and the p-value of essentially 0 is below the 0.05 threshold, I reject the null hypothesis that the two variables are independent. I can therefore conclude that there is a statistically significant association between the extent to which individuals feel a personal responsibility to reduce climate change (ccrdprs) and their political orientation (lrscale).
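Statistical significance says nothing about the strength of the association, so a common optional follow-up is Cramér's V, defined as sqrt(X^2 / (N * (min(r, c) - 1))):
# Optional effect size: Cramér's V for the 2 x 11 table.
# Values near 0 indicate a weak association, values near 1 a strong one.
cramers_v <- sqrt(test_stat3$stat / (N3 * (min(nrow(mytab3), ncol(mytab3)) - 1)))
cramers_v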
Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat3, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
The green band marks the middle 95% of the simulated null distribution. The red-shaded region, which would cover simulated statistics at least as extreme as the observed one, is essentially empty here because the observed test statistic (the red line) lies far outside the 95% interval. This indicates a statistically significant association between the two variables in my dataset.
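A quick numeric check of that visual impression (a sketch, assuming the lower_ci/upper_ci column names that get_confidence_interval() returns by default):
# Compare the observed statistic with the upper bound of the 95% interval
# of the null distribution; the observed value should far exceed it.
c(observed = test_stat3$stat, upper_95 = conf_int$upper_ci)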