In this tutorial, we will learn how to use the infer package for hypothesis testing as well as do tables and calculations the “long way” for the Chi-Square test statistic. Let’s get started.
# List of packages
packages <- c("tidyverse", "infer", "fst") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'infer' was built under R version 4.3.3
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "infer" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "fst" "infer" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
ess <- read_fst("All-ESS-Data.fst")
Filtering to country, recoding and cleaning variables of interest
france_data <- ess %>%
# Filter for France
filter(cntry == "FR") %>%
# Recoding clsprty and cleaning trstplt (will use later)
mutate(
clsprty = case_when(
clsprty == 1 ~ "Yes", # Recode 1 to "Yes"
clsprty == 2 ~ "No", # Recode 2 to "No"
clsprty %in% c(7, 8, 9) ~ NA_character_, # Handle other specific cases where you want to set it as NA
TRUE ~ as.character(clsprty) # Keep other values as-is but ensure they are characters
),
trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt),
# Recoding gender
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
gndr == 9 ~ NA_character_,
TRUE ~ as.character(gndr)
),
# Recoding education with case_when for consistency
educ.ba = case_when(
essround < 5 & edulvla == 5 ~ "BA or more",
essround >= 5 & edulvlb > 600 ~ "BA or more",
TRUE ~ "No BA"
)
) %>%
# Ensure all NAs are handled uniformly
mutate(
edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
educ.ba = as.character(educ.ba) # Force educ.ba to be treated as character
)
Using combination of unique() and is.na to make sure there are no NAs before proceeding. Even after you thought you cleaned, and table() does not return NAs, there can still be NAs that introduce issues. One way to double check is the the unique() function. With unique(), you can identify all the unique values for a given variable, including NAs. Then, we will add a line using is.na to make sure it’s clean.
# using unique to identify
unique(france_data$gndr)
## [1] "Female" "Male"
# does not return NA, but the following does
unique(france_data$clsprty)
## [1] "Yes" "No" NA
# here's how to make sure it's really cleaned
france_data <- france_data %>% filter(!is.na(clsprty))
france_data <- france_data %>% filter(!is.na(trstplt))
# double-checking + quick distribution
unique(france_data$clsprty)
## [1] "Yes" "No"
table(france_data$clsprty)
##
## No Yes
## 9098 9480
In our case for education, recoded as having attained at least a BA vs. not, and the recoded clsprty as the corresponding categorical values of Yes and No.
Note: we will do calculations the longer way to show how they are generated. The following also assumes that we are working with a 2 x 2 contingency table with categorical variables.
## Not necessary, but always good to rename when doing operations so you can backtrack to clean dataset
tab_dat <- france_data
# Create a crosstable and save it into the object mytab
mytab <- table(tab_dat$clsprty, tab_dat$educ.ba)
# Calculate row-wise sums of counts in table
rsums <- rowSums(mytab)
# Calculate column-wise sums of counts in table
csums <- colSums(mytab)
# Get the total count of the table (N)
N <- sum(mytab)
# Generate table of expected proportions
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
## [,1] [,2]
## [1,] 0.1099746 0.3797444
## [2,] 0.1145921 0.3956889
# Table of expected frequencies
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
## [,1] [,2]
## [1,] 2043.108 7054.892
## [2,] 2128.892 7351.108
# Critical Value for Independence:
alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841
Interpreting the Critical Value: If the Chi-squared test statistic that you compute (from the chisq.test function) exceeds 3.841, then we reject the null hypothesis at the 0.05 significance level. This would mean that there is a statistically significant association between feeling close to a party and education level in your dataset.
# Pearson's chi-squared statistic:
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 160.4057
Comparison to the critical value: We previously calculated a critical value of 3.841. Our observed X-squared value is much larger than this critical value.
Interpretation: The difference between our observed data and what we’d expect under the assumption of independence is statistically significant. Thus, there is a statistically significant association between feeling close to a party and one’s educational level. This means that the probability of feeling close to a party is not the same across different education levels.
# p-value:
p_val <- pchisq(test_stat, df = 1, lower.tail = F)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
# Using chisq.test for validation:
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = F))
##
## Pearson's Chi-squared test
##
## data: mytab
## X-squared = 160.41, df = 1, p-value < 2.2e-16
p-value < 2.2e-16: This is another representation of the p-value in scientific notation because it’s extremely close to 0.
The chi-squared test tells us if two categorical variables are independent or if they are associated in some manner. In our test:
The extremely small p-value (essentially 0) means that we reject the null hypothesis that the two variables are independent of each other. In other words, there’s strong evidence to suggest that there’s a significant association between the two variables in mytab.
The large chi-squared statistic of 165.44 further reinforces this point. If the two variables were independent, we’d expect a chi-squared statistic closer to 0.
Now let’s follow the four steps outlined in the lecture to use the infer package for hypothesis testing
Important Note: to assign the appropriate test statistic, please refer to the summary table presented during thelecture.
test_stat <- france_data %>%
specify(explanatory = educ.ba, # change variable name for explanatory variable
response = trstplt) %>% # change variable name for outcome of interest
hypothesize(null = "independence") %>%
calculate(stat = "t") # replace in between quotation marks appropriate test statistic
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "BA or more" - "No BA", or divided in the order "BA or more" / "No BA"
## for ratio-based statistics. To specify this order yourself, supply `order =
## c("BA or more", "No BA")` to the calculate() function.
print(test_stat$stat)
## t
## 13.99345
Interpretation: The t-statistic measures the difference between the observed sample mean and the population mean (or another sample mean, in the case of two-sample tests) in units of standard error. The larger the magnitude of t (regardless of the sign), the greater the evidence against the null hypothesis.
A t-statistic of 14.1 is quite large, indicating a significant difference in trust in politicians between those with a BA and those without a BA. It suggests that the mean trust level in politicians is different for the two groups.
null_distribution <- france_data %>%
specify(explanatory = educ.ba,
response = trstplt) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard but change if too computationally demanding
calculate(stat = "t")
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "BA or more" - "No BA", or divided in the order "BA or more" / "No BA"
## for ratio-based statistics. To specify this order yourself, supply `order =
## c("BA or more", "No BA")` to the calculate() function.
p_val <- null_distribution %>% # replace name here if you assigned something other than null_distribution above
get_pvalue(obs_stat = test_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Really, it’s essentially a p-value of 0, not 0 – but, still, under the assumptions of frequentist hypothesis testing, this means there is strong evidence to reject the null.
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided")
null_distribution
## Response: trstplt (numeric)
## Explanatory: educ.ba (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 -1.40
## 2 2 1.05
## 3 3 0.144
## 4 4 0.0210
## 5 5 0.00489
## 6 6 0.192
## 7 7 -0.429
## 8 8 1.10
## 9 9 -0.0602
## 10 10 -0.779
## # ℹ 990 more rows
Visualizes the null distribution alongside the observsed test statistic (the red line), indicating here clear evidence, under assumptions of frequentist hypothesis testing, to reject the null. If the observed statistic would fall within the null distribution, you would see shaded red regions of the null distribution that are as (or more) extreme than our observed statistic.
We can also visualize with confidence intervals.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
null_distribution
## Response: trstplt (numeric)
## Explanatory: educ.ba (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 -1.40
## 2 2 1.05
## 3 3 0.144
## 4 4 0.0210
## 5 5 0.00489
## 6 6 0.192
## 7 7 -0.429
## 8 8 1.10
## 9 9 -0.0602
## 10 10 -0.779
## # ℹ 990 more rows
This one will showcae output and visualizations when we do not have evidence to reject the null.
This time using education as the outcome, gender as explanatory. In this case we use “Z” (or two sample Z-test) since both variables are two categories (again, refer to table in slides)
edu_stat <- france_data %>%
specify(explanatory = gndr, # change variable name for explanatory variable
response = educ.ba, # change variable name for outcome of interest
success = "BA or more") %>% # for this test need to set "success" as one of the categories
hypothesize(null = "independence") %>%
calculate(stat = "z") # replace in between quotation marks appropriate test statistic
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Female" - "Male", or divided in the order "Female" / "Male" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Female", "Male")` to the calculate() function.
print(edu_stat$stat)
## [1] -1.042447
null_edu <- france_data %>%
specify(explanatory = gndr, # change variable name for explanatory variable
response = educ.ba, # change variable name for outcome of interest
success = "BA or more") %>% # for this test need to set "success" as one of the categories
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard
calculate(stat = "z")
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Female" - "Male", or divided in the order "Female" / "Male" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Female", "Male")` to the calculate() function.
p_val <- null_edu %>% # replace name here if you assigned something other than null_distribution above
get_pvalue(obs_stat = edu_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.33
null_edu %>%
visualize() +
shade_p_value(obs_stat = edu_stat, direction = "two-sided")
null_edu
## Response: educ.ba (factor)
## Explanatory: gndr (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 1
## stat
## <dbl>
## 1 0.579
## 2 1.64
## 3 1.25
## 4 -1.64
## 5 -0.161
## 6 -0.479
## 7 -0.267
## 8 0.790
## 9 -0.0556
## 10 -0.232
## # ℹ 990 more rows
Note where the red line is set based on the test statistic and how we would not reject the null given the standard cutoff points at the tails.
It becomes even clearer with the confidence intervals.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_edu %>%
visualize() +
shade_p_value(obs_stat = edu_stat, direction = "two-sided") +
shade_confidence_interval(endpoints = conf_int)
null_edu
## Response: educ.ba (factor)
## Explanatory: gndr (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 1
## stat
## <dbl>
## 1 0.579
## 2 1.64
## 3 1.25
## 4 -1.64
## 5 -0.161
## 6 -0.479
## 7 -0.267
## 8 0.790
## 9 -0.0556
## 10 -0.232
## # ℹ 990 more rows
Instructions: Start a new R markdown for the homework and call it “Yourlastname_Firstname_Homework_2”.
Copy everything below from Task 1 to Task 5. Keep the task prompt and questions, and provide your code and answer underneath.
Remember: you need all the steps for your code to work, including loading your data – otherwise it will not knit.
To generate a new code box, click on the +C sign above. Underneath your code, provide your answer to the task question.
When you are done, click on “Knit” above, then “Knit to Html”. Wait for everything to compile. If you get an error like “Execution halted”, it means there are issues with your code you must fix. When all issues are fixed, it will prompt a new window. Then click on “Publish” in the top right, and then Rpubs (the first option) and follow the instructions to create your Rpubs account and get your Rpubs link for your document (i.e., html link as I provide for the tutorial).
Note: Make sure to provide both your markdown file and R pubs link. If you do not submit both, you will be penalized 2 pts. out of the 5 pts. total.
Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.
france_data <- read.fst("france_data.fst")
france_data <- france_data %>%
mutate(
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
TRUE ~ NA_character_ # Set anything that is not 1 or 2 to NA
),
lrscale = case_when(
lrscale %in% 0:3 ~ "Left", # Left-wing (0 to 3)
lrscale %in% 7:10 ~ "Right", # Right-wing (7 to 10)
TRUE ~ NA_character_ # Moderate (4, 5, 6) and special codes (77, 88, 99) set to NA
)
)
## Not necessary, but always good to rename when doing operations so you can backtrack to clean dataset
tab_dat <- france_data
# Calculate row-wise sums of counts in table
rsums <- rowSums(mytab)
# Calculate column-wise sums of counts in table
csums <- colSums(mytab)
# Get the total count of the table (N)
N <- sum(mytab)
# Generate table of expected proportions
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
## [,1] [,2]
## [1,] 0.1099746 0.3797444
## [2,] 0.1145921 0.3956889
# Table of expected frequencies
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
## [,1] [,2]
## [1,] 2043.108 7054.892
## [2,] 2128.892 7351.108
# Critical Value for Independence:
alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841
``` With a critical value of 3.841 one can come to the conclusion that researchers have collectedd sufficent data to reject the null hypothesis and conclude that there is in fact a sifigcant disrepancy between the sample frequencies and expected frequencies within the population.
Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.
See variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%22874a86e2-a0f6-40b4-aef7-51c8eed98a7d/1%22])
Provide interpretations of the output (consider the variable info).
france_data <- read.fst("france_data.fst")
table(france_data$domicil)
##
## 1 2 3 4 5 7 8
## 3584 2361 6257 5668 1162 3 3
table(france_data$sbbsntx)
##
## 1 2 3 4 5 7 8
## 757 1538 804 640 301 3 100
france_data <- france_data %>%
mutate(
sbbsntx = recode(as.character(sbbsntx),
'1' = "Agree Strongly",
'2' = "Agree",
'3' = "Neither agree nor disagree",
'4' = "Disagree",
'5' = "Disagree strongly",
'7' = NA_character_,
'8' = NA_character_,
'9' = NA_character_)
) %>%
filter(!is.na(sbbsntx))
france_data <- france_data %>%
mutate(
domicil = recode(as.character(domicil),
'1' = "Urban",
'2' = "Urban",
'3' = "Rural",
'4' = "Rural",
'5' = "Rural",
'7' = NA_character_,
'8' = NA_character_,
'9' = NA_character_)
) %>%
filter(!is.na(domicil))
test_stat <- france_data %>%
specify(explanatory = domicil, # change variable name for explanatory variable
response = sbbsntx) %>% # change variable name for outcome of interest
hypothesize(null = "independence") %>%
calculate(stat = "chisq") # replace in between quotation marks appropriate test statistic
print(test_stat$stat)
## X-squared
## 19.97187
null_distribution <- france_data %>%
specify(explanatory = domicil,
response = sbbsntx) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard but change if too computationally demanding
calculate(stat = "chisq")
p_val <- null_edu %>% # replace name here if you assigned something other than null_distribution above
get_pvalue(obs_stat = edu_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.33
The calculated value that was deduced for chi squares was that x lies at a value of 19.97187. When interpreting a statistic like Chi-squared, researchers can come to the conclusion that the observed differences are statistically significant.
The deduced a P value of 0.292 which is the probability that results are due to change. When the p value is smaller than the predefined alpha value, then researchers can deduce that the likelihood results were reached at random and are due to chance. The value 0.292 is larger than the standard critical value of 0.05 which means that researchers have failed to collect data that can reject the null hypothesis significantly provided the alpha value was set at 0.05.
## Task 3
Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.
`
```r
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided")
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "greater") +
shade_confidence_interval(endpoints = conf_int)
the statistic seems to fall just within the boundaries of the rejection region, being outside of the acceptance region represented by the shaded blue area. The fact that the statistic is outside of these boundries/cutoff points of acceptance means that researchers have suffiencet evidence to reject the null hypothesis as this sample produced a statistic that differs sigifcantly relative to the majority of the data. It is important to consider the fact that this sampling distribution fails to be normative and is unequally dispersed, which could potentially be affecting the position of this sample statistic. When sampling distributions fail to be normal, its importnt for researchers to consider increasing the amount of samples they have included in the distriubution, with N=30 being the general benchmark for the ideal amount of samples within a ditribution
Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.
``
Variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%2283b08cf2-508e-49a0-9fc8-c3cc281290ec/4%22]
Provide interpretations of the output (consider the variable info).
Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "two-sided")
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.
conf_int <- null_distribution %>%
get_confidence_interval(level = 0.95, type = "percentile")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = test_stat, direction = "greater") +
shade_confidence_interval(endpoints = conf_int)
This visualization demonstrates that the statistic lies outside of the acceptance region, being that researchers have reasonable evidence to reach the conclusion that the results that were found within this investigation are not by chance and researchers can in fact reject the null as there is sufficient evidence that has been proven to bne statistically significant. ## End.