This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.
packages <- c("tidyverse", "infer", "fst") # add any you need here
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "infer" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "fst" "infer" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
rm(list=ls()); gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 1044146 55.8 2177524 116.3 NA 1274208 68.1
## Vcells 1809885 13.9 8388608 64.0 16384 2371298 18.1
ess <- read_fst("All-ESS-Data.fst")
Task 1 Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.
france_data <- ess %>%
# Filter for France
filter(cntry == "FR") %>%
# Recoding clsprty and cleaning trstplt (will use later)
mutate(
clsprty = case_when(
clsprty == 1 ~ "Yes", # Recode 1 to "Yes"
clsprty == 2 ~ "No", # Recode 2 to "No"
clsprty %in% c(7, 8, 9) ~ NA_character_, # Handle other specific cases where you want to set it as NA
TRUE ~ as.character(clsprty) # Keep other values as-is but ensure they are characters
),
trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt),
# Recoding gender
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
gndr == 9 ~ NA_character_,
TRUE ~ as.character(gndr)
),
# Recoding education with case_when for consistency
educ.ba = case_when(
essround < 5 & edulvla == 5 ~ "BA or more",
essround >= 5 & edulvlb > 600 ~ "BA or more",
TRUE ~ "No BA"
)
) %>%
# Ensure all NAs are handled uniformly
mutate(
edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
educ.ba = as.character(educ.ba) # Force educ.ba to be treated as character
)
# using unique to identify
unique(france_data$gndr)
## [1] "Female" "Male"
# does not return NA, but the following does
unique(france_data$clsprty)
## [1] "Yes" "No" NA
# here's how to make sure it's really cleaned
france_data <- france_data %>% filter(!is.na(clsprty))
france_data <- france_data %>% filter(!is.na(trstplt))
# double-checking + quick distribution
unique(france_data$clsprty)
## [1] "Yes" "No"
table(france_data$clsprty)
##
## No Yes
## 9098 9480
tab_dat <- france_data
# Create a crosstable and save it into the object mytab
mytab <- table(tab_dat$clsprty, tab_dat$educ.ba)
# Calculate row-wise sums of counts in table
rsums <- rowSums(mytab)
# Calculate column-wise sums of counts in table
csums <- colSums(mytab)
# Get the total count of the table (N)
N <- sum(mytab)
# Generate table of expected proportions
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
## [,1] [,2]
## [1,] 0.1099746 0.3797444
## [2,] 0.1145921 0.3956889
# Table of expected frequencies
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
## [,1] [,2]
## [1,] 2043.108 7054.892
## [2,] 2128.892 7351.108
# Critical Value for Independence:
alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841
# Pearson's chi-squared statistic:
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 160.4057
Our chi-squared result is significantly more than the 3.841 critical threshold that we determined. If we were to interpret it, it would suggest that feeling close to a party and education levels are statistically significantly correlated.referring to the idea that different educational attainment levels have different effects on how close a person feels to a party.
# p-value:
p_val <- pchisq(test_stat, df = 1, lower.tail = F)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
# Using chisq.test for validation:
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = F))
##
## Pearson's Chi-squared test
##
## data: mytab
## X-squared = 160.41, df = 1, p-value < 2.2e-16
We obtained a p-value that is extremely small and close to 0. For that reason we must reject the null hypothesis that the two variable, feeling close to a party and education levels are independent of each other. As we can see from the large chi-squared value (160.41) and the p-value, the hypothesis that the two variables we are looking at, feeling close to a party and education levels are independent of each other. With the results achieved, it shows that there is a statistically strong association between these two variables.
TASK 2: Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.
test_stat <- france_data %>%
specify(explanatory = domicil, # change variable name for explanatory variable
response = sbbsntx) %>% # change variable name for outcome of interest
hypothesize(null = "independence") %>%
calculate(stat = "slope") # replace in between quotation marks appropriate test statistic
## Warning: Removed 14528 rows containing missing values.
## Message: The independence null hypothesis does not inform calculation of the
## observed statistic (a slope) and will be ignored.
print(test_stat$stat)
## domicil
## -0.0768133
null_distribution <- france_data %>%
specify(explanatory = domicil,
response = sbbsntx) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard but change if too computationally demanding
calculate(stat = "slope")
## Warning: Removed 14528 rows containing missing values.
p_val <- null_distribution %>% # replace name here if you assigned something other than null_distribution above
get_pvalue(obs_stat = test_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
The P-value that we obtained is 0 which means that there is strong evidence to reject the null. Meaning that the two variables, social benefits/ services cost businesses too much tax and the respondent’s description are independent of each other.
Task 3: The P-value that we obtained is 0 which means that there is strong evidence to reject the null. Meaning that the two variables, social benefits/ services cost businesses too much tax and the respondent’s description are independent of each other.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
null_distribution
## Response: sbbsntx (numeric)
## Explanatory: domicil (numeric)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 0.0000266
## 2 2 0.00875
## 3 3 -0.0352
## 4 4 -0.0150
## 5 5 -0.00498
## 6 6 -0.0117
## 7 7 -0.0378
## 8 8 -0.00944
## 9 9 0.00114
## 10 10 -0.0354
## # ℹ 990 more rows
Task 4: Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.
france_data <- ess %>%
# Filter for France
filter(cntry == "FR") %>%
mutate(
lrscale = case_when(
lrscale %in% 0:3 ~ "Left",
lrscale %in% 7:10 ~ "Right",
lrscale %in% c(4,5,6) ~ NA_character_,
TRUE ~ NA_character_
),
ccrdprs = ifelse(ccrdprs %in% c(77, 88, 99), NA, ccrdprs),
)
test_stat <- france_data %>%
specify(explanatory = lrscale,
response = ccrdprs) %>%
hypothesize(null = "independence") %>%
calculate(stat = "F")
## Warning: Removed 17154 rows containing missing values.
print(test_stat$stat)
## [1] 2.627956
null_distribution <- france_data %>%
specify(explanatory = lrscale,
response = ccrdprs) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "F")
## Warning: Removed 17154 rows containing missing values.
p_val <- null_distribution %>%
get_pvalue(obs_stat = test_stat, direction = "greater")
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.116
Task 5: Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
## Warning: F usually corresponds to right-tailed tests. Proceed with caution.
null_distribution
## Response: ccrdprs (numeric)
## Explanatory: lrscale (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 0.000352
## 2 2 0.0212
## 3 3 1.37
## 4 4 0.954
## 5 5 0.0247
## 6 6 0.121
## 7 7 1.47
## 8 8 1.47
## 9 9 0.102
## 10 10 3.59
## # ℹ 990 more rows