Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.
packages <- c("tidyverse", "infer", "fst", "modelsummary", "flextable")
# Install any packages that are not already present, then load them all
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages)
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'infer' was built under R version 4.3.3
## Warning: package 'flextable' was built under R version 4.3.3
##
## Attaching package: 'flextable'
##
## The following object is masked from 'package:purrr':
##
## compose
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "infer" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "fst" "infer" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
##
## [[4]]
## [1] "modelsummary" "fst" "infer" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "flextable" "modelsummary" "fst" "infer" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
ess <- read_fst("All-ESS-Data.fst")
ess <- read_fst("france_data.fst")
france_data <- read.fst("france_data.fst")
france_data <- france_data %>%
mutate(
lrscale = case_when(
lrscale %in% 0:3 ~ "Left",
lrscale %in% 7:10 ~ "Right",
lrscale %in% c(4, 5, 6, 77, 88, 99) ~ NA_character_
),
Gen = case_when(
yrbrn %in% 1900:1945 ~ "1",
yrbrn %in% 1946:1964 ~ "2",
yrbrn %in% 1965:1979 ~ "3",
yrbrn %in% 1980:1996 ~ "4",
TRUE ~ as.character(yrbrn)
),
Gen = factor(Gen, levels = c("1", "2", "3", "4"),
labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
)
table(france_data$lrscale)
##
## Left Right
## 4709 4302
france_data <- france_data %>% filter(!is.na(lrscale))
unique(france_data$lrscale)
## [1] "Left" "Right"
france_data <- ess %>%
filter(cntry == "FR") %>%
mutate(
clsprty = case_when(
clsprty == 1 ~ "Yes",
clsprty == 2 ~ "No",
clsprty %in% c(7, 8, 9) ~ NA_character_,
TRUE ~ as.character(clsprty)
),
trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt),
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
gndr == 9 ~ NA_character_,
TRUE ~ as.character(gndr)
),
educ.ba = case_when(
essround < 5 & edulvla == 5 ~ "BA or more",
essround >= 5 & edulvlb > 600 ~ "BA or more",
TRUE ~ "No BA"
)
) %>%
mutate(
edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
educ.ba = as.character(educ.ba) # Force educ.ba to be treated as character
)
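One caveat worth flagging: educ.ba is derived in the first mutate(), before the edulvla/edulvlb missing codes are recoded to NA in the second, so respondents with those codes end up in "No BA" rather than NA. A sketch of a variant that recodes the missing values first, applied to a hypothetical copy (france_check) so the printed results below are unaffected:
# Sketch: recode the missing codes to NA before deriving educ.ba, so that
# respondents with unknown education propagate to NA instead of "No BA".
france_check <- ess %>%
  filter(cntry == "FR") %>%
  mutate(
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      essround < 5 & is.na(edulvla) ~ NA_character_,
      essround >= 5 & is.na(edulvlb) ~ NA_character_,
      TRUE ~ "No BA"
    )
  )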
tab_dat <- france_data
mytab <- table(tab_dat$clsprty, tab_dat$educ.ba)
rsums <- rowSums(mytab)
csums <- colSums(mytab)
N <- sum(mytab)
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
## [,1] [,2]
## [1,] 0.1101248 0.3815066
## [2,] 0.1138739 0.3944946
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
## [,1] [,2]
## [1,] 2059.444 7134.556
## [2,] 2129.556 7377.444
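As an optional cross-check, the same expected counts under independence, E_ij = (row i total × column j total) / N, are returned by chisq.test():
# Optional cross-check: chisq.test() computes the same expected counts
# under independence; all.equal() should report TRUE up to rounding.
expected_check <- chisq.test(mytab, correct = FALSE)$expected
all.equal(unname(expected_check), ftab)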
alpha <- 0.05
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841
If the chi-squared test statistic exceeds 3.841, I reject the null hypothesis of independence at the 0.05 significance level; exceeding this critical value would indicate a statistically significant association between feeling close to a party and education level in my dataset.
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 166.1849
p_val <- pchisq(test_stat, df = 1, lower.tail = FALSE)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = FALSE))
##
## Pearson's Chi-squared test
##
## data: mytab
## X-squared = 166.18, df = 1, p-value < 2.2e-16
As noted in the tutorial markdown, "p-value < 2.2e-16" is the same p-value reported in scientific notation because it is extremely close to 0.
Since the p-value is far below the 0.05 threshold, I reject the null hypothesis that the two variables are independent. This is strong evidence of a statistically significant association between feeling close to a party and education level.
The chi-squared statistic of 166.18, which far exceeds the critical value of 3.841, supports the same conclusion; the statistic would have been much closer to 0 if the two variables were independent.
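The same decision rule can be written as a quick sketch using the objects computed above (the two checks agree by construction, since the p-value falls below alpha exactly when the statistic exceeds the critical value):
# Decision rule in code: reject H0 (independence) when the statistic
# exceeds the critical value, or equivalently when p-value < alpha.
c(reject_by_statistic = test_stat > c_val,
  reject_by_p_value = p_val < alpha)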
Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.
See variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%22874a86e2-a0f6-40b4-aef7-51c8eed98a7d/1%22]
Provide interpretations of the output (consider the variable info).
france_data <- ess %>%
filter(cntry == "FR") %>%
mutate(
domicil = case_when(
domicil == 1 ~ "Rural",
domicil == 2 ~ "Urban",
domicil %in% c(3, 4, 5, 7, 8) ~ NA_character_,
TRUE ~ as.character(domicil)
),
sbbsntx = case_when(
sbbsntx == 1 ~ "Agree strongly",
sbbsntx == 2 ~ "Agree",
sbbsntx == 3 ~ "Neither agree nor disagree",
sbbsntx == 4 ~ "Disagree",
sbbsntx == 5 ~ "Disagree strongly",
sbbsntx %in% c(7, 8) ~ NA_character_,
TRUE ~ as.character(sbbsntx)
)
)
table(france_data$domicil)
##
## Rural Urban
## 3584 2361
table(france_data$sbbsntx)
##
## Agree Agree strongly
## 1538 757
## Disagree Disagree strongly
## 640 301
## Neither agree nor disagree
## 804
unique(france_data$domicil)
## [1] NA "Urban" "Rural"
unique(france_data$sbbsntx)
## [1] NA "Agree strongly"
## [3] "Neither agree nor disagree" "Agree"
## [5] "Disagree strongly" "Disagree"
france_data <- france_data %>% filter(!is.na(domicil))
france_data <- france_data %>% filter(!is.na(sbbsntx))
unique(france_data$domicil)
## [1] "Rural" "Urban"
unique(france_data$sbbsntx)
## [1] "Neither agree nor disagree" "Agree"
## [3] "Agree strongly" "Disagree"
## [5] "Disagree strongly"
table(france_data$domicil)
##
## Rural Urban
## 684 479
table(france_data$sbbsntx)
##
## Agree Agree strongly
## 425 189
## Disagree Disagree strongly
## 220 102
## Neither agree nor disagree
## 227
## Not necessary, but renaming before table operations makes it easy to backtrack to the clean dataset
tab_dat2 <- france_data
# Create a crosstable and save it into the object mytab2
mytab2 <- table(tab_dat2$domicil, tab_dat2$sbbsntx)
# Calculate row-wise sums of counts in table
rsums2 <- rowSums(mytab2)
# Calculate column-wise sums of counts in table
csums2 <- colSums(mytab2)
# Get the total count of the table (N)
N2 <- sum(mytab2)
# Generate table of expected proportions
ptab2 <- tcrossprod(rsums2/N2, csums2/N2)
cat("Table of Expected Proportions2:\n2")
## Table of Expected Proportions2:
## 2
print(ptab2)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2149243 0.09557812 0.111255 0.05158184 0.11479488
## [2,] 0.1505099 0.06693263 0.077911 0.03612237 0.08038998
ftab2 <- N2 * ptab2
cat("Table of Expected Frequencies2:\n")
## Table of Expected Frequencies2:
print(ftab2)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 249.957 111.15735 129.38951 59.98968 133.50645
## [2,] 175.043 77.84265 90.61049 42.01032 93.49355
alpha2 <- 0.05
# df = (rows - 1) * (cols - 1) = (2 - 1) * (5 - 1) = 4 for this 2 x 5 table
c_val2 <- qchisq(alpha2, df = 4, lower.tail = FALSE)
cat("Critical Value2:", round(c_val2, 3), "\n")
## Critical Value2: 9.488
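Because the degrees of freedom for a test of independence are (rows - 1) * (cols - 1), deriving them from the table itself guards against hard-coding the wrong value; a minimal sketch:
# Derive df from the table dimensions rather than hard-coding it;
# (2 - 1) * (5 - 1) = 4 for the 2 x 5 table here.
df2 <- (nrow(mytab2) - 1) * (ncol(mytab2) - 1)
qchisq(alpha2, df = df2, lower.tail = FALSE)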
test_stat2 <- france_data %>%
specify(explanatory = domicil,
response = sbbsntx) %>%
hypothesize(null = "independence") %>%
calculate(stat = "Chisq")
print(test_stat2$stat)
## X-squared
## 5.87
null_distribution <- france_data %>%
specify(explanatory = domicil,
response = sbbsntx) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "Chisq")
p_val <- null_distribution %>%
get_p_value(obs_stat = test_stat2, direction = "two-sided")
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.372
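For comparison (optional), the theoretical right-tail probability of the observed statistic under the chi-square distribution with df = 4 can be computed directly. It is not identical to the two-sided permutation p-value above, which roughly doubles the right-tail proportion, but it tells the same non-significant story:
# Optional: theoretical right-tailed p-value for the observed statistic
pchisq(test_stat2$stat, df = 4, lower.tail = FALSE)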
As the chi-squared statistic of 5.87 does not exceed the critical value of 9.488, and the p-value of 0.372 exceeds the 0.05 threshold, I fail to reject the null hypothesis that the two variables are independent. In other words, there is little or no evidence of a statistically significant association between the view that social benefits and services cost businesses too much in taxes and charges (sbbsntx) and whether individuals in our sample live in urban or rural areas (domicil).
Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat2, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
The green band marks the middle 95% of the simulated null distribution (the percentile interval computed above). Because the observed test statistic (the red line) falls inside this band, a large share of the simulated statistics are at least as extreme as the observed one (the red-shaded regions), which is consistent with the non-significant result: little or no statistically significant association between the two variables in my dataset.
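The warning above notes that chi-square tests are conventionally right-tailed, since the statistic is non-negative and only large values signal dependence. A sketch of the right-tailed variant:
# Right-tailed p-value, the conventional choice for a chi-square statistic
null_distribution %>%
  get_p_value(obs_stat = test_stat2, direction = "greater")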
Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.
Variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%2283b08cf2-508e-49a0-9fc8-c3cc281290ec/4%22]
Provide interpretations of the output (consider the variable info).
table(france_data$ccrdprs)
##
## 0 1 2 3 4 5 6 7 8 9 10 66 88
## 9 6 6 11 17 52 69 104 131 53 87 4 3
france_data <- ess %>%
filter(cntry == "FR") %>%
mutate(
ccrdprs = case_when(
ccrdprs == 0 ~ "0 - Not at all",
ccrdprs == 1 ~ "1",
ccrdprs == 2 ~ "2",
ccrdprs == 3 ~ "3",
ccrdprs == 4 ~ "4",
ccrdprs == 5 ~ "5",
ccrdprs == 6 ~ "6",
ccrdprs == 7 ~ "7",
ccrdprs == 8 ~ "8",
ccrdprs == 9 ~ "9",
ccrdprs == 10 ~ "10 - A great deal",
ccrdprs %in% c(66, 77, 88, 99) ~ NA_character_,
TRUE ~ as.character(ccrdprs)
)
)
table(france_data$ccrdprs)
##
## 0 - Not at all 1 10 - A great deal 2
## 72 26 708 64
## 3 4 5 6
## 78 114 478 403
## 7 8 9
## 736 952 358
unique(france_data$ccrdprs)
## [1] NA "10 - A great deal" "8"
## [4] "2" "7" "6"
## [7] "5" "1" "9"
## [10] "0 - Not at all" "4" "3"
france_data <- france_data %>% filter(!is.na(ccrdprs))
unique(france_data$ccrdprs)
## [1] "10 - A great deal" "8" "2"
## [4] "7" "6" "5"
## [7] "1" "9" "0 - Not at all"
## [10] "4" "3"
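Because ccrdprs was recoded to character, tables sort its categories alphabetically ("10 - A great deal" lands right after "1"). An optional refinement, shown here without overwriting the variable so the printed tables above keep their order, is to tabulate a factor with explicit levels:
# Optional: a factor with explicit levels keeps the 0-10 scale order
# in tables instead of alphabetical character sorting.
ccrdprs_levels <- c("0 - Not at all", as.character(1:9), "10 - A great deal")
table(factor(france_data$ccrdprs, levels = ccrdprs_levels))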
france_data <- france_data %>%
mutate(
lrscale = case_when(
lrscale %in% 0:3 ~ "Left",
lrscale %in% 7:10 ~ "Right",
lrscale %in% c(4, 5, 6, 77, 88, 99) ~ NA_character_
),
Gen = case_when(
yrbrn %in% 1900:1945 ~ "1",
yrbrn %in% 1946:1964 ~ "2",
yrbrn %in% 1965:1979 ~ "3",
yrbrn %in% 1980:1996 ~ "4",
TRUE ~ as.character(yrbrn)
),
Gen = factor(Gen, levels = c("1", "2", "3", "4"),
labels = c("Interwar", "Baby Boomers", "Gen X", "Millennials"))
)
table(france_data$lrscale)
##
## Left Right
## 889 982
france_data <- france_data %>% filter(!is.na(lrscale))
unique(france_data$lrscale)
## [1] "Left" "Right"
## Not necessary, but renaming before table operations makes it easy to backtrack to the clean dataset
tab_dat3 <- france_data
# Create a crosstable and save it into the object mytab3
mytab3 <- table(tab_dat3$lrscale, tab_dat3$ccrdprs)
# Calculate row-wise sums of counts in table
rsums3 <- rowSums(mytab3)
# Calculate column-wise sums of counts in table
csums3 <- colSums(mytab3)
# Get the total count of the table (N)
N3 <- sum(mytab3)
# Generate table of expected proportions
ptab3 <- tcrossprod(rsums3/N3, csums3/N3)
cat("Table of Expected Proportions3:\n3")
## Table of Expected Proportions3:
## 3
print(ptab3)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.008380465 0.003047442 0.08507442 0.007110698 0.009650233 0.01447535
## [2,] 0.009257162 0.003366241 0.09397422 0.007854561 0.010659762 0.01598964
## [,7] [,8] [,9] [,10] [,11]
## [1,] 0.05206047 0.04393395 0.0873600 0.1198660 0.04418791
## [2,] 0.05750661 0.04852997 0.0964989 0.1324055 0.04881049
ftab3 <- N3 * ptab3
cat("Table of Expected Frequencies3:\n3")
## Table of Expected Frequencies3:
## 3
print(ftab3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 15.67985 5.701764 159.1742 13.30412 18.05559 27.08338 97.40513 82.20043
## [2,] 17.32015 6.298236 175.8258 14.69588 19.94441 29.91662 107.59487 90.79957
## [,9] [,10] [,11]
## [1,] 163.4506 224.2694 82.67557
## [2,] 180.5494 247.7306 91.32443
alpha3 <- 0.05
# df = (rows - 1) * (cols - 1) = (2 - 1) * (11 - 1) = 10 for this 2 x 11 table
c_val3 <- qchisq(alpha3, df = 10, lower.tail = FALSE)
cat("Critical Value3:", round(c_val3, 3), "\n")
## Critical Value3: 18.307
test_stat3 <- france_data %>%
specify(explanatory = lrscale,
response = ccrdprs) %>%
hypothesize(null = "independence") %>%
calculate(stat = "Chisq")
print(test_stat3$stat)
## X-squared
## 35.04425
null_distribution <- france_data %>%
specify(explanatory = lrscale,
response = ccrdprs) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "Chisq")
p_val <- null_distribution %>%
get_p_value(obs_stat = test_stat3, direction = "two-sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
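Consistent with the warning, the permutation p-value of 0 is an approximation artifact of using 1,000 reps. The theoretical right-tail probability for the observed statistic (with df = 10 for the 2 x 11 table) is small but nonzero, as a quick sketch shows:
# Theoretical right-tailed p-value for the observed statistic; this gives
# a tiny nonzero value where the permutation approximation reports 0.
pchisq(test_stat3$stat, df = 10, lower.tail = FALSE)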
As the chi-squared statistic of 35.04 far exceeds the critical value of 18.307, and the p-value of essentially 0 is below the 0.05 threshold, I reject the null hypothesis that the two variables are independent. I can therefore conclude that there is a statistically significant association between the extent to which individuals feel a personal responsibility to reduce climate change (ccrdprs) and their political orientation (lrscale).
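Statistical significance says nothing about the strength of the association, so a common optional follow-up is Cramér's V, defined as sqrt(X^2 / (N * (min(r, c) - 1))):
# Optional effect size: Cramér's V for the 2 x 11 table.
# Values near 0 indicate a weak association, values near 1 a strong one.
cramers_v <- sqrt(test_stat3$stat / (N3 * (min(nrow(mytab3), ncol(mytab3)) - 1)))
cramers_v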
Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat3, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
The green band marks the middle 95% of the simulated null distribution. The red-shaded region, which would cover simulated statistics at least as extreme as the observed one, is essentially empty here because the observed test statistic (the red line) lies far outside the 95% interval. This indicates a statistically significant association between the two variables in my dataset.
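A quick numeric check of that visual impression (a sketch, assuming the lower_ci/upper_ci column names that get_confidence_interval() returns by default):
# Compare the observed statistic with the upper bound of the 95% interval
# of the null distribution; the observed value should far exceed it.
c(observed = test_stat3$stat, upper_95 = conf_int$upper_ci)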