Submit on Quercus using the Assignments tab, not on the discussion board. Make sure to provide both your markdown file and your RPubs link; if you do not submit both, you will be penalized 1 pt. out of the 2.5 total.
# Install any packages not already present, then attach them all quietly.
packages <- c("tidyverse", "infer", "fst")
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if (length(new_packages)) install.packages(new_packages)
invisible(lapply(packages, library, character.only = TRUE))
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# fst was attached above; load the ESS data.
ess <- read_fst("All-ESS-Data.fst")
Set up the calculations for an exploration of educ_level (recoded as BA / No BA) and brncntr (recoded as yes/no, as shown in Tutorial 4) for Switzerland. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Then determine and interpret the critical value for the test of independence. Finally, determine and interpret both the Pearson's chi-squared statistic and the p-value, and discuss the main takeaways.
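For reference, the code below implements the standard chi-squared construction: for an $r \times c$ table with row totals $r_i$, column totals $c_j$, and grand total $N$,

$$
E_{ij} = \frac{r_i \, c_j}{N}, \qquad X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad df = (r - 1)(c - 1),
$$

where $O_{ij}$ are the observed frequencies and $X^2$ is compared against the $\chi^2_{df}$ critical value.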
# Swiss respondents: recode country of birth, education level, and gender.
switzerland_data <- ess %>%
  filter(cntry == "CH") %>%
  mutate(
    # brncntr: 1 = born in country, 2 = not; 7/8/9 are missing-value codes
    born_in_country = recode(brncntr,
                             `1` = "Yes",
                             `2` = "No",
                             `7` = NA_character_,
                             `8` = NA_character_,
                             `9` = NA_character_),
    # education coding changed at round 5; set the "other" codes to NA
    edulvla = case_when(
      essround < 5 & edulvla == 55 ~ NA_real_,
      TRUE ~ edulvla
    ),
    edulvlb = case_when(
      essround >= 5 & edulvlb == 5555 ~ NA_real_,
      TRUE ~ edulvlb
    ),
    # collapse education into BA vs. No BA across the two coding schemes
    educ_level = case_when(
      essround < 5 & edulvla == 5 ~ "BA",
      essround >= 5 & edulvlb > 600 ~ "BA",
      TRUE ~ "No BA"
    ),
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      gndr == 9 ~ NA_character_,
      TRUE ~ as.character(gndr)
    )
  )
tab_dat <- switzerland_data
# NOTE: tabulating raw brncntr keeps the two observations with missing-value
# codes, so this is a 4 x 2 table rather than the intended 2 x 2;
# born_in_country would drop those codes (see the sketch after the test below).
mytab <- table(tab_dat$brncntr, tab_dat$educ_level)
rsums <- rowSums(mytab) # row totals
csums <- colSums(mytab) # column totals
N <- sum(mytab)         # grand total
# expected proportions under independence: outer product (rsums/N) x (csums/N)
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
## [,1] [,2]
## [1,] 1.757478e-01 5.956555e-01
## [2,] 5.205397e-02 1.764246e-01
## [3,] 1.346107e-05 4.562312e-05
## [4,] 1.346107e-05 4.562312e-05
# expected frequencies = N * expected proportions
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
## [,1] [,2]
## [1,] 2974.5309306 1.008147e+04
## [2,] 881.0134121 2.985987e+03
## [3,] 0.2278287 7.721713e-01
## [4,] 0.2278287 7.721713e-01
alpha <- 0.05 # significance level
# df = (rows - 1) * (cols - 1) = (4 - 1) * (2 - 1) = 3 for the table above
c_val <- qchisq(alpha, df = 3, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 7.815
The critical value at the 0.05 significance level with df = 3 is 7.815. If the Pearson's chi-squared statistic exceeds this threshold, we reject the null hypothesis that being born in Switzerland and education level are independent.
# Pearson's chi-squared: sum over cells of (observed - expected)^2 / expected
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 65.7174
The statistic, 65.72, far exceeds the critical value of 7.815, so we reject the null hypothesis of independence: there is a significant association between education level and being born in Switzerland. In other words, the proportion of respondents born in the country is not the same across education levels.
p_val <- pchisq(test_stat, df = 3, lower.tail = FALSE)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = FALSE))
## Warning in chisq.test(mytab, correct = F): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytab
## X-squared = 65.717, df = 3, p-value = 3.523e-14
The p-value is essentially 0 (3.523e-14), so we can reject the null hypothesis: there is a strong association between the two variables, as already indicated by the large chi-squared statistic. The warning that the approximation may be incorrect arises because the two rows for the raw missing-value codes in brncntr have expected frequencies well below 5; recoding them to NA (as born_in_country does) removes the problem.
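As a sketch of that cleaner specification (exact figures would differ slightly from the 4 x 2 output above, since the two observations with missing-value codes are dropped):

# Rerun on the recoded variable: a true 2x2 table with df = (2-1)*(2-1) = 1,
# so the conventional critical value of 3.841 would apply.
mytab2 <- table(tab_dat$born_in_country, tab_dat$educ_level)
chisq.test(mytab2, correct = FALSE)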
Using the infer package and following the 4 steps outlined in the tutorial and during the lecture, conduct and visualize the hypothesis testing for domicil (recoded as urban/rural; see Tutorial 4) and gincdif (Government should reduce differences in income levels) for the Netherlands. For gincdif, you must recode the five categories (i.e., it cannot be kept as numeric).
See variable info here:
https://ess-search.nsd.no/en/variable/1ad9f42e-0c02-4b13-9a67-ee83d82ac576
Make a decision about the appropriate test given the table provided during lecture. After conducting the 4 steps, interpret all of the output and discuss your takeaways.
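Decision: both geo (urban / peri-urban / rural) and gincdif (five ordered response categories) are categorical variables, so per the decision table from lecture the appropriate choice is the chi-squared test of independence.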
# Dutch respondents: recode domicil into urban/peri-urban/rural and gincdif
# into its five labelled categories (7/8/9 are missing-value codes).
netherlands_data <- ess %>%
  filter(cntry == "NL") %>%
  mutate(
    geo = recode(as.character(domicil),
                 '1' = "Urban",
                 '2' = "Peri-Urban",
                 '3' = "Rural",
                 '4' = "Rural",
                 '5' = "Rural",
                 '7' = NA_character_,
                 '8' = NA_character_,
                 '9' = NA_character_),
    gincdif = case_when(
      gincdif == 1 ~ "Agree strongly",
      gincdif == 2 ~ "Agree",
      gincdif == 3 ~ "Neither agree nor disagree",
      gincdif == 4 ~ "Disagree",
      gincdif == 5 ~ "Disagree strongly",
      gincdif %in% c(7, 8, 9) ~ NA_character_,
      TRUE ~ as.character(gincdif)
    )
  )
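A quick sanity check of the recodes (a sketch; output omitted here):

# Cross-tabulate the recoded variables to confirm the category labels and NAs.
netherlands_data %>% count(geo, gincdif)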
# Step 1: calculate the observed chi-squared statistic
test_stat <- netherlands_data %>%
  specify(explanatory = geo,
          response = gincdif) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "chisq")
## Warning: Removed 198 rows containing missing values.
print(test_stat$stat)
## X-squared
## 24.81544
# Step 2: generate the null distribution by permutation
null_distribution <- netherlands_data %>%
  specify(explanatory = geo,
          response = gincdif) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "chisq")
## Warning: Removed 198 rows containing missing values.
# Step 3: compute the simulated p-value (right tail, as appropriate for chisq)
p_val <- null_distribution %>%
  get_p_value(obs_stat = test_stat, direction = "greater")
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.003
# Step 4: visualize the null distribution with the observed statistic shaded
null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater")
null_distribution
## Response: gincdif (factor)
## Explanatory: geo (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 4.07
## 2 2 4.58
## 3 3 2.02
## 4 4 2.99
## 5 5 25.3
## 6 6 4.81
## 7 7 4.09
## 8 8 7.54
## 9 9 15.2
## 10 10 6.91
## # ℹ 990 more rows
# 95% percentile interval of the null distribution, shown alongside the p-value
conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater") +
  shade_confidence_interval(endpoints = conf_int)
The observed statistic (the red line) falls in the far right tail, outside the 95% interval of the simulated null distribution, and the simulated p-value (0.003) is well below 0.05. We can therefore reject the null hypothesis and conclude that urban/rural residence and support for government income redistribution are related in the Netherlands.
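As an optional cross-check (a sketch, not part of the 4-step workflow), the theoretical chi-squared test on the same two variables should produce a p-value close to the simulated 0.003:

# Theoretical counterpart to the permutation test above (output not shown).
chisq.test(table(netherlands_data$geo, netherlands_data$gincdif))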
First, discuss a research question you would like to consider, one that identifies a main relationship between an explanatory variable and an outcome variable. Using the infer package and following the 4 steps outlined in the tutorial and during the lecture, conduct and visualize the hypothesis testing for your variables based on your research question. After conducting the 4 steps, interpret all of the output and discuss your takeaways.
I want to research how location relates to interest in politics in Finland. The explanatory variable is geo (urban / peri-urban / rural) and the outcome variable is polintr (interest in politics).
# Finnish respondents: recode domicil (geo) and political interest (polintr);
# 7/8/9 are missing-value codes in both variables.
finland_data <- ess %>%
  filter(cntry == "FI") %>%
  mutate(
    geo = recode(as.character(domicil),
                 '1' = "Urban",
                 '2' = "Peri-Urban",
                 '3' = "Rural",
                 '4' = "Rural",
                 '5' = "Rural",
                 '7' = NA_character_,
                 '8' = NA_character_,
                 '9' = NA_character_),
    polintr = recode(polintr,
                     `1` = "Very interested",
                     `2` = "Quite interested",
                     `3` = "Hardly interested",
                     `4` = "Not at all interested",
                     `7` = NA_character_,
                     `8` = NA_character_,
                     `9` = NA_character_)
  )
# Step 1: calculate the observed chi-squared statistic
test_stat <- finland_data %>%
  specify(explanatory = geo,
          response = polintr) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "chisq")
## Warning: Removed 28 rows containing missing values.
print(test_stat$stat)
## X-squared
## 370.8845
# Step 2: generate the null distribution by permutation (reps = 1000 as standard)
null_distribution <- finland_data %>%
  specify(explanatory = geo,
          response = polintr) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "chisq")
## Warning: Removed 28 rows containing missing values.
# Step 3: compute the simulated p-value
p_val <- null_distribution %>%
  get_p_value(obs_stat = test_stat, direction = "greater")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step.
## See `?get_p_value()` for more information.
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
# Step 4: visualize the null distribution with the observed statistic shaded
null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater")
## Warning in min(diff(unique_loc)): no non-missing arguments to min; returning
## Inf
null_distribution
## Response: polintr (factor)
## Explanatory: geo (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 3.71
## 2 2 5.23
## 3 3 4.03
## 4 4 15.5
## 5 5 5.39
## 6 6 1.59
## 7 7 2.61
## 8 8 4.21
## 9 9 7.10
## 10 10 10.1
## # ℹ 990 more rows
# 95% percentile interval of the null distribution, shown alongside the p-value
conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater") +
  shade_confidence_interval(endpoints = conf_int)
The simulated p-value of 0 and the red line falling far to the right of the entire null distribution indicate a strong association between location and political interest in Finland. Since reps = 1000, a p-value of 0 only means that no permuted statistic reached the observed 370.88, so it is best reported as p < 0.001 (as the warning above notes); either way, we reject the null hypothesis of independence.
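Because a simulated p-value can never go below 1/reps, a theoretical tail probability offers a tighter bound as a cross-check (a sketch, assuming the chi-squared approximation holds; the 3 x 4 geo-by-polintr table has df = (3 - 1) * (4 - 1) = 6):

# Theoretical right-tail probability of the observed statistic (output not shown).
pchisq(370.8845, df = 6, lower.tail = FALSE)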