In this tutorial, we will learn how to use the infer package for hypothesis testing, as well as how to build the tables and calculations for the Chi-Square test statistic the “long way”. Let’s get started.

Setting up environment

# List of packages
packages <- c("tidyverse", "infer", "fst") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'infer' was built under R version 4.3.3
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"
ess <- read_fst("All-ESS-Data.fst")

Filtering to country, recoding and cleaning variables of interest

france_data <- ess %>%
  # Filter for France
  filter(cntry == "FR") %>%
  
  # Recoding clsprty and cleaning trstplt (will use later)
  mutate(
    clsprty = case_when(
      clsprty == 1 ~ "Yes",  # Recode 1 to "Yes"
      clsprty == 2 ~ "No",   # Recode 2 to "No"
      clsprty %in% c(7, 8, 9) ~ NA_character_,  # Set the refusal/don't know/no answer codes to NA
      TRUE ~ as.character(clsprty)  # Keep other values as-is but ensure they are characters
    ),
    trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt),
    
    # Recoding gender
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      gndr == 9 ~ NA_character_,
      TRUE ~ as.character(gndr)
    ),

    # Recoding education with case_when for consistency
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      TRUE ~ "No BA"
    )
  ) %>%
  # Ensure all NAs are handled uniformly
  mutate(
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    educ.ba = as.character(educ.ba)  # Force educ.ba to be treated as character
  )

New skill

Using a combination of unique() and is.na() to make sure there are no NAs before proceeding. Even after you think you have cleaned a variable, and table() does not return NAs, there can still be NAs that introduce issues. One way to double-check is the unique() function. With unique(), you can identify all the unique values of a given variable, including NAs. Then, we add a line using is.na() to make sure the variable really is clean.

# using unique to identify
unique(france_data$gndr)
## [1] "Female" "Male"
# does not return NA, but the following does
unique(france_data$clsprty)
## [1] "Yes" "No"  NA
# here's how to make sure it's really cleaned
france_data <- france_data %>% filter(!is.na(clsprty))
france_data <- france_data %>% filter(!is.na(trstplt))

# double-checking + quick distribution
unique(france_data$clsprty)
## [1] "Yes" "No"
table(france_data$clsprty)
## 
##   No  Yes 
## 9098 9480
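
For a quicker scan, base R can count remaining NAs across several columns at once; a small sketch using only the variables recoded above:

# Count remaining NAs in the variables of interest (0 everywhere means clean)
colSums(is.na(france_data[, c("clsprty", "trstplt", "gndr", "educ.ba")]))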

Long form coding for Chi-Square

Calculating and interpreting critical value for independence

In our case, the two variables are education, recoded as having attained at least a BA vs. not, and the recoded clsprty, with the corresponding categorical values of Yes and No.

Note: we will do calculations the longer way to show how they are generated. The following also assumes that we are working with a 2 x 2 contingency table with categorical variables.

## Not necessary, but good practice to work on a renamed copy so you can backtrack to the clean dataset
tab_dat <- france_data

# Create a crosstable and save it into the object mytab
mytab <- table(tab_dat$clsprty, tab_dat$educ.ba)

# Calculate row-wise sums of counts in table 
rsums <- rowSums(mytab)

# Calculate column-wise sums of counts in table
csums <- colSums(mytab)

# Get the total count of the table (N)
N <- sum(mytab)

# Generate table of expected proportions
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
##           [,1]      [,2]
## [1,] 0.1099746 0.3797444
## [2,] 0.1145921 0.3956889
# Table of expected frequencies
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
##          [,1]     [,2]
## [1,] 2043.108 7054.892
## [2,] 2128.892 7351.108
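
As a sanity check, each expected frequency is just (row total × column total) / N; here is one cell verified by hand, using the row and column labels from the recoded variables:

# Expected count for the ("No", "BA or more") cell; should match ftab[1, 1] above
rsums["No"] * csums["BA or more"] / N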
# Critical Value for Independence:
alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841

Interpreting the Critical Value: If the Chi-squared test statistic that you compute (from the chisq.test function) exceeds 3.841, then we reject the null hypothesis at the 0.05 significance level. This would mean that there is a statistically significant association between feeling close to a party and education level in your dataset.
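The same qchisq() call generalizes to other significance levels and table sizes; for instance (these are standard chi-squared quantiles, not specific to our data):

# Critical values at other common settings
qchisq(0.01, df = 1, lower.tail = FALSE) # stricter alpha, still a 2 x 2 table (~6.63)
qchisq(0.05, df = 4, lower.tail = FALSE) # a 3 x 3 table: df = (3 - 1) * (3 - 1) = 4 (~9.49)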

Calculating and interpreting Pearson’s chi-squared statistic

# Pearson's chi-squared statistic:
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 160.4057

Comparison to the critical value: We previously calculated a critical value of 3.841. Our observed X-squared value is much larger than this critical value.

Interpretation: The difference between our observed data and what we’d expect under the assumption of independence is statistically significant. Thus, there is a statistically significant association between feeling close to a party and one’s educational level. This means that the probability of feeling close to a party is not the same across different education levels.
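To see which cells drive the result, you can inspect each cell’s contribution to the sum; this sketch uses only objects already created above:

# Per-cell contributions to Pearson's X^2
contrib <- (mytab - ftab)^2 / ftab
print(contrib)
sum(contrib) # reproduces the test statistic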

Calculating the p-value

# p-value:
p_val <- pchisq(test_stat, df = 1, lower.tail = FALSE)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
# Using chisq.test for validation:
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = FALSE))
## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 160.41, df = 1, p-value < 2.2e-16

p-value < 2.2e-16: This is the p-value in scientific notation; 2.2e-16 is R’s machine epsilon (the smallest number x such that 1 + x differs from 1), so R is simply reporting that the p-value is indistinguishable from 0.

The chi-squared test tells us if two categorical variables are independent or if they are associated in some manner. In our test:

The extremely small p-value (essentially 0) means that we reject the null hypothesis that the two variables are independent of each other. In other words, there’s strong evidence to suggest that there’s a significant association between the two variables in mytab.

The large chi-squared statistic of 160.41 further reinforces this point. If the two variables were independent, we’d expect a chi-squared statistic near its degrees of freedom (here, 1), i.e., close to 0.
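
A quick check of the degrees of freedom used above, via the general rule df = (rows - 1) * (columns - 1):

# Degrees of freedom for an r x c contingency table
(nrow(mytab) - 1) * (ncol(mytab) - 1) # 1 for our 2 x 2 table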

Hypothesis testing using infer

Now let’s follow the four steps outlined in the lecture to use the infer package for hypothesis testing.

Using Infer – Example 1

Step 1: Calculate the test statistic of your sample

Important Note: to choose the appropriate test statistic, please refer to the summary table presented during the lecture.

test_stat <- france_data %>%
  specify(explanatory = educ.ba, # change variable name for explanatory variable
          response = trstplt) %>% # change variable name for outcome of interest
  hypothesize(null = "independence") %>%
  calculate(stat = "t") # replace in between quotation marks appropriate test statistic
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "BA or more" - "No BA", or divided in the order "BA or more" / "No BA"
## for ratio-based statistics. To specify this order yourself, supply `order =
## c("BA or more", "No BA")` to the calculate() function.
print(test_stat$stat)
##        t 
## 13.99345

Interpretation: The t-statistic measures the difference between the observed sample mean and the population mean (or another sample mean, in the case of two-sample tests) in units of standard error. The larger the magnitude of t (regardless of the sign), the greater the evidence against the null hypothesis.

A t-statistic of about 14.0 is quite large, indicating a significant difference in trust in politicians between those with a BA and those without a BA. It suggests that the mean trust level in politicians differs between the two groups.
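
As a cross-check (an addition here, not part of the infer workflow), base R’s t.test() with the formula interface should give a comparable statistic, up to sign (which depends on group order) and the Welch degrees-of-freedom correction:

# Base-R cross-check of the two-sample t statistic
t.test(trstplt ~ educ.ba, data = france_data)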

Step 2: Simulate the null distribution

null_distribution <- france_data %>%
  specify(explanatory = educ.ba,
          response = trstplt) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard but change if too computationally demanding
  calculate(stat = "t")
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "BA or more" - "No BA", or divided in the order "BA or more" / "No BA"
## for ratio-based statistics. To specify this order yourself, supply `order =
## c("BA or more", "No BA")` to the calculate() function.

Step 3: Calculate the p-value of your sample

p_val <- null_distribution %>% # replace name here if you assigned something other than null_distribution above
  get_p_value(obs_stat = test_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Really, the p-value is essentially 0 rather than exactly 0: none of the 1,000 permuted statistics was as extreme as our observed one, so the approximation bottoms out at 0. Still, under the assumptions of frequentist hypothesis testing, this means there is strong evidence to reject the null.
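
If you want finer p-value resolution than 1/1,000, one option (at extra computational cost) is simply to raise reps; a sketch:

# More permutations give finer p-value resolution (but run slower)
null_distribution_10k <- france_data %>%
  specify(explanatory = educ.ba, response = trstplt) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 10000, type = "permute") %>%
  calculate(stat = "t")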

Step 4: Visualize

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

null_distribution
## Response: trstplt (numeric)
## Explanatory: educ.ba (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.579 
##  2         2  0.741 
##  3         3 -0.942 
##  4         4 -0.746 
##  5         5  0.916 
##  6         6  0.0294
##  7         7 -1.08  
##  8         8  1.31  
##  9         9  0.288 
## 10        10  1.26  
## # ℹ 990 more rows

This visualizes the null distribution alongside the observed test statistic (the red line), indicating clear evidence, under the assumptions of frequentist hypothesis testing, to reject the null. If the observed statistic fell within the null distribution, you would see shaded red regions covering the parts of the null distribution that are as (or more) extreme than the observed statistic.

We can also visualize with confidence intervals.

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

null_distribution
## Response: trstplt (numeric)
## Explanatory: educ.ba (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.579 
##  2         2  0.741 
##  3         3 -0.942 
##  4         4 -0.746 
##  5         5  0.916 
##  6         6  0.0294
##  7         7 -1.08  
##  8         8  1.31  
##  9         9  0.288 
## 10        10  1.26  
## # ℹ 990 more rows

Using infer – Example 2

This one will showcase the output and visualizations when we do not have evidence to reject the null.

Step 1: Calculate the test statistic of your sample

This time we use education as the outcome and gender as the explanatory variable. In this case we use “z” (the two-sample z-test), since both variables have two categories (again, refer to the table in the slides).

edu_stat <- france_data %>%
  specify(explanatory = gndr, # change variable name for explanatory variable
          response = educ.ba, # change variable name for outcome of interest
          success = "BA or more") %>% # for this test need to set "success" as one of the categories
  hypothesize(null = "independence") %>%
  calculate(stat = "z") # replace in between quotation marks appropriate test statistic
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Female" - "Male", or divided in the order "Female" / "Male" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Female", "Male")` to the calculate() function.
print(edu_stat$stat)
## [1] -1.042447
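
Interpretation: a z-statistic of about -1.04 is small in magnitude, so the observed gender difference in BA attainment is well within what chance alone would produce. As a cross-check (an addition here, not part of the original workflow): for a 2 x 2 table, the squared z should (approximately) equal the uncorrected chi-squared statistic, which base R’s prop.test() reports:

# Cross-check: z^2 should roughly match the uncorrected X-squared from prop.test()
prop.test(table(france_data$gndr, france_data$educ.ba), correct = FALSE)
(-1.042447)^2 # about 1.087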

Step 2: Simulate the null distribution

null_edu <- france_data %>%
  specify(explanatory = gndr, # change variable name for explanatory variable
          response = educ.ba, # change variable name for outcome of interest
          success = "BA or more") %>% # for this test need to set "success" as one of the categories
  hypothesize(null = "independence") %>%
    generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard
  calculate(stat = "z") 
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Female" - "Male", or divided in the order "Female" / "Male" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Female", "Male")` to the calculate() function.

Step 3: Calculate the p-value of your sample

p_val <- null_edu %>% # replace name here if you assigned something other than null_distribution above
  get_p_value(obs_stat = edu_stat, direction = "two-sided") # would only replace edu_stat if assigned another name in Step 1

p_val
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.314

Step 4: Visualize

null_edu %>%
  visualize() +
  shade_p_value(obs_stat = edu_stat, direction = "two-sided")

null_edu
## Response: educ.ba (factor)
## Explanatory: gndr (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 1
##      stat
##     <dbl>
##  1 -2.59 
##  2  1.60 
##  3  0.403
##  4 -0.479
##  5 -1.08 
##  6 -0.620
##  7 -1.11 
##  8 -1.47 
##  9  1.95 
## 10 -1.04 
## # ℹ 990 more rows

Note where the red line is set based on the test statistic and how we would not reject the null given the standard cutoff points at the tails.

It becomes even clearer with the confidence intervals.

conf_int <- null_edu %>% # note: use null_edu here, the null distribution for this example
  get_confidence_interval(level = 0.95, type = "percentile")


null_edu %>%
  visualize() +
  shade_p_value(obs_stat = edu_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

null_edu
## Response: educ.ba (factor)
## Explanatory: gndr (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 1
##      stat
##     <dbl>
##  1 -2.59 
##  2  1.60 
##  3  0.403
##  4 -0.479
##  5 -1.08 
##  6 -0.620
##  7 -1.11 
##  8 -1.47 
##  9  1.95 
## 10 -1.04 
## # ℹ 990 more rows

End of Tutorial

Homework 3 (5%): due by next lecture on March 12th

Instructions: Start a new R Markdown for the homework and call it “Yourlastname_Firstname_Homework_3”.

Copy everything below from Task 1 to Task 5. Keep the task prompt and questions, and provide your code and answer underneath.

Remember: you need all the steps for your code to work, including loading your data – otherwise it will not knit.

To generate a new code box, click on the +C sign above. Underneath your code, provide your answer to the task question.

When you are done, click on “Knit” above, then “Knit to HTML”. Wait for everything to compile. If you get an error like “Execution halted”, it means there are issues with your code that you must fix. Once all issues are fixed, a new window will open. Click on “Publish” in the top right, then RPubs (the first option), and follow the instructions to create your RPubs account and get the RPubs link for your document (i.e., an html link like the one I provide for the tutorial).

Note: Make sure to provide both your markdown file and RPubs link. If you do not submit both, you will be penalized 2 pts. out of the 5 pts. total.

Task 1

ess <- read_fst("france_data.fst")
france_data <- ess %>%
  mutate(
    ID = case_when(
      lrscale %in% 0:3 ~ "Left",                    # Values 0 to 3 in 'lrscale' are categorized as "Left"
      lrscale %in% 7:10 ~ "Right",                  # Values 7 to 10 in 'lrscale' are categorized as "Right"
      lrscale %in% 4:6 ~ NA_character_,                # Values 4 to 6 ("moderates") are omitted by setting them to NA
      lrscale %in% c(77, 88, 99) ~ NA_character_ 
    )
  )
table(france_data$ID)
## 
##  Left Right 
##  4709  4302
france_data <- france_data %>% # pipe from france_data (not ess) so the ID recode above is kept
  mutate(
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      TRUE ~ "No BA"
    ),
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    educ.ba = as.character(educ.ba)  # Force educ.ba to be treated as character
  )

educ_table <- table(france_data$educ.ba)

educ_table
## 
## BA or more      No BA 
##       4252      14786
unique(france_data$educ.ba)
## [1] "No BA"      "BA or more"
france_data <- france_data %>% filter(!is.na(lrscale))
france_data <- france_data %>% filter(!is.na(educ.ba))
unique(france_data$lrscale)
##  [1]  3  2  1  4  5  6 10  8 88  7  9  0 77
table(france_data$lrscale)
## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88 
##  945  499 1269 1996 1838 5274 1594 1749 1458  409  686  381  940
tab_dat <- france_data

# Create a crosstable and save it into the object mytab
# Note: lrscale still contains the 77/88 missing codes at this point, so the table has 13 rows
mytab <- table(tab_dat$lrscale, tab_dat$educ.ba)

# Calculate row-wise sums of counts in table 
rsums <- rowSums(mytab)

# Calculate column-wise sums of counts in table
csums <- colSums(mytab)

# Get the total count of the table (N)
N <- sum(mytab)

# Generate table of expected proportions
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
##              [,1]       [,2]
##  [1,] 0.011086193 0.03855137
##  [2,] 0.005853979 0.02035676
##  [3,] 0.014887173 0.05176899
##  [4,] 0.023415916 0.08142703
##  [5,] 0.021562351 0.07498140
##  [6,] 0.061871513 0.21515338
##  [7,] 0.018699885 0.06502740
##  [8,] 0.020518255 0.07135064
##  [9,] 0.017104411 0.05947926
## [10,] 0.004798151 0.01668520
## [11,] 0.008047755 0.02798544
## [12,] 0.004469671 0.01554294
## [13,] 0.011027535 0.03834740
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
##            [,1]      [,2]
##  [1,]  211.0589  733.9411
##  [2,]  111.4481  387.5519
##  [3,]  283.4220  985.5780
##  [4,]  445.7922 1550.2078
##  [5,]  410.5040 1427.4960
##  [6,] 1177.9099 4096.0901
##  [7,]  356.0084 1237.9916
##  [8,]  390.6265 1358.3735
##  [9,]  325.6338 1132.3662
## [10,]   91.3472  317.6528
## [11,]  153.2132  532.7868
## [12,]   85.0936  295.9064
## [13,]  209.9422  730.0578
alpha <- 0.05
# df = (rows - 1) * (cols - 1) = (13 - 1) * (2 - 1) = 12 for this table
c_val <- qchisq(alpha, df = 12, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 21.026
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 510.8805
p_val <- pchisq(test_stat, df = 12, lower.tail = FALSE) # df = 12, matching the 13 x 2 table
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = FALSE))
## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 510.88, df = 12, p-value < 2.2e-16

Based on these results, the p-value of essentially 0 indicates that there is an association between left-right placement and education. The chi-square statistic of 510.88 also far exceeds the critical value (21.026 at df = 12), pointing to the same conclusion: the two variables are not independent.

Task 2

Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.

See variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%22874a86e2-a0f6-40b4-aef7-51c8eed98a7d/1%22]

Provide interpretations of the output (consider the variable info).

table_domicil <- table(france_data$domicil)
print(table_domicil)
## 
##    1    2    3    4    5    7    8 
## 3584 2361 6257 5668 1162    3    3
table_sbbsntx <- table(france_data$sbbsntx)
print(table_sbbsntx)
## 
##    1    2    3    4    5    7    8 
##  757 1538  804  640  301    3  100
france_clean <- france_data %>%
  mutate(
    sbbsntx_cat = case_when( # a descriptive name, rather than reusing "ID"
      sbbsntx == 1 ~ "Agree Strongly",
      sbbsntx == 2 ~ "Agree",
      sbbsntx == 3 ~ "Neither agree nor disagree",
      sbbsntx == 4 ~ "Disagree",
      sbbsntx == 5 ~ "Disagree Strongly",
      sbbsntx %in% c(7, 8, 9) ~ NA_character_ # refusal/don't know/no answer codes are single digits for this item
    )
  )
table(france_clean$sbbsntx_cat)
## 
##                      Agree             Agree Strongly 
##                       1538                        757 
##                   Disagree          Disagree Strongly 
##                        640                        301 
## Neither agree nor disagree 
##                        804
france_clean <- france_clean %>% # keep piping from france_clean so the sbbsntx recode is not lost
  mutate(
    urb_rur = case_when(
      domicil %in% 1:2 ~ "Urban", # 2 set to urban rather than peri-urban, per the task
      domicil %in% 3:5 ~ "Rural",
      domicil %in% c(7, 8, 9) ~ NA_character_ # refusal/don't know/no answer codes
    )
  )
table(france_clean$urb_rur)
## 
## Rural Urban 
## 13087  5945
library(infer)
library(dplyr)

france_data <- france_data %>%
  filter(!is.na(domicil), !is.na(sbbsntx))

france_data <- france_data %>%
  mutate(
    domicil = factor(domicil),
    sbbsntx = factor(sbbsntx)
  )

test_stat <- france_data %>%
  specify(explanatory = domicil,
          response = sbbsntx) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "chisq")

print(test_stat$stat)
## X-squared 
##  49.95025
null_distribution <- france_data %>%
  specify(explanatory = domicil,
          response = sbbsntx) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "chisq")

p_val <- null_distribution %>%
  get_p_value(obs_stat = test_stat, direction = "two-sided")

p_val
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.074

The chi-square statistic of 49.95, with a permutation p-value of 0.074, means we do not reject the null hypothesis at the 0.05 significance level: the evidence for an association between domicil and sbbsntx is suggestive but not statistically significant.
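
As a cross-check (an addition here, not required by the task), the theoretical test on the same two factors should reproduce the X-squared of about 49.95, though its p-value can differ somewhat from the permutation one:

# Theoretical chi-squared test on the same table
chisq.test(table(france_data$domicil, france_data$sbbsntx))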

Task 3

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.
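
A minimal sketch of what this could look like, assuming null_distribution and test_stat from Task 2 are still in memory (direction = "right" is used here because the chi-squared statistic is right-tailed, which also avoids infer’s warning):

# Step 4 for the Task 2 variables: null distribution, observed stat, and CI
conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "right") +
  shade_confidence_interval(endpoints = conf_int)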

Task 4

Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.

Variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%2283b08cf2-508e-49a0-9fc8-c3cc281290ec/4%22]

Provide interpretations of the output (consider the variable info).
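
Before the visualization code below (which belongs to Step 4), here is a minimal sketch of what Steps 1 to 3 could look like for these variables. It assumes the ID recode from Task 1 (Left/Right) is still present on france_data, and it assumes ccrdprs uses the usual 77/88/99 missing-data codes; object names are illustrative:

# Step 1: observed t statistic for ccrdprs (numeric) by ID (Left/Right)
task4_data <- france_data %>%
  mutate(ccrdprs = ifelse(ccrdprs %in% c(77, 88, 99), NA, ccrdprs)) %>%
  filter(!is.na(ccrdprs), !is.na(ID))

task4_stat <- task4_data %>%
  specify(explanatory = ID, response = ccrdprs) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "t")

# Step 2: permutation null distribution
task4_null <- task4_data %>%
  specify(explanatory = ID, response = ccrdprs) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t")

# Step 3: p-value
task4_null %>%
  get_p_value(obs_stat = task4_stat, direction = "two-sided")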

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") # test_stat, to match the chisq null distribution (edu_stat belongs to the tutorial example)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

The red shaded areas of the chart mark the portions of the null distribution at least as extreme as the observed statistic. Based on the chart, the observed statistic falls within the null distribution, consistent with the p-value of 0.074 above.

Task 5

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)
## Warning: Chi-Square usually corresponds to right-tailed tests. Proceed with
## caution.

Based on this visualization, the observed statistic falls within the bounds of the null distribution (inside the shaded confidence interval), so we would not reject the null at the 0.05 level.

End.