Copy everything below from Task 1 to Task 5. Keep the task prompt and questions, and provide your code and answer underneath.

Remember: you need all the steps for your code to work, including loading your data – otherwise it will not knit.

To generate a new code box, click on the +C sign above. Underneath your code, provide your answer to the task question.

When you are done, click on “Knit” above, then “Knit to HTML”. Wait for everything to compile. If you get an error like “Execution halted”, there are issues with your code that you must fix. Once all issues are fixed, a new window will open.

Then click on “Publish” in the top right, select Rpubs (the first option), and follow the instructions to create your Rpubs account and get the Rpubs link for your document (i.e., an HTML link like the one I provide for the tutorial).

Note: Make sure to provide both your markdown file and your Rpubs link. If you do not submit both, you will be penalized 2 pts. out of the 5 pts. total.

Task 1

Prompt

Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square.

Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.

packages <- c("tidyverse", "infer", "fst") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"

Loading data

ess <- read_fst("/Users/jessie/All-ESS-Data.fst")
table(ess$lrscale)
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  16299  10633  24208  41241  41150 139328  40165  44145  34501  11695  18027 
##     77     88     99 
##   9428  57354   2381

Recoding lrscale (omitting “moderates”) and recoding educ.ba

tab_dat <- ess %>%
  filter(cntry == "FR") %>%
  mutate(
    ID = case_when(
      lrscale %in% 0:3 ~ "Left",   # values 0 to 3 are categorized as "Left"
      lrscale %in% 7:10 ~ "Right", # values 7 to 10 are categorized as "Right"
      # Moderates (4 to 6) and missing codes (77, 88, 99) are set to NA, so they drop out of the table
      lrscale %in% c(4, 5, 6, 77, 88, 99) ~ NA_character_
    ),
     educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      TRUE ~ "No BA"
    )) %>%
  # Ensure all NAs are handled uniformly
  mutate(
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    educ.ba = as.character(educ.ba)  # Force educ.ba to be treated as character
  )

mytab <- table(tab_dat$ID, tab_dat$educ.ba)

The long form coding for Chi-Square

# Calculate row-wise sums of counts in table 
rsums <- rowSums(mytab)

# Calculate column-wise sums of counts in table
csums <- colSums(mytab)

# Get the total count of the table (N)
N <- sum(mytab)

# Generate table of expected proportions
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")
## Table of Expected Proportions:
print(ptab)
##           [,1]      [,2]
## [1,] 0.1311243 0.3914592
## [2,] 0.1197912 0.3576253
# Table of expected frequencies
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")
## Table of Expected Frequencies:
print(ftab)
##          [,1]     [,2]
## [1,] 1181.561 3527.439
## [2,] 1079.439 3222.561
# Critical Value for Independence:
alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")
## Critical Value: 3.841

Interpretation: If the Chi-squared test statistic computed from our table exceeds 3.841, we reject the null hypothesis of independence at the 0.05 significance level. That would indicate a statistically significant association between left-right ideological placement and education level (BA or more vs. no BA) among French respondents.
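As a minimal sketch of where the expected frequencies and the critical value come from, the snippet below uses a small made-up 2 x 2 table. The counts and the names `toy`, `expected`, `df`, `crit`, and `builtin` are illustrative assumptions, not ESS results. It builds the expected counts from the margins, derives the degrees of freedom passed to `qchisq()`, and cross-checks the expected counts against `chisq.test()`.

```r
# Hypothetical 2 x 2 table of counts (illustrative numbers only, not ESS data)
toy <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(ID = c("Left", "Right"),
                              educ.ba = c("BA or more", "No BA")))

N <- sum(toy)

# Expected count in each cell under independence: row total * column total / N
expected <- outer(rowSums(toy), colSums(toy)) / N

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
df <- (nrow(toy) - 1) * (ncol(toy) - 1)

# Critical value at the 5% significance level
crit <- qchisq(0.05, df = df, lower.tail = FALSE)

print(expected)
cat("df:", df, " critical value:", round(crit, 3), "\n")

# Cross-check against the expected counts that chisq.test() computes
builtin <- chisq.test(toy, correct = FALSE)$expected
```

For any 2 x 2 table the degrees of freedom equal 1, which is why `df = 1` is used for the France table.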

# Pearson's chi-squared statistic:
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")
## Pearson's X^2: 100.8551

Comparison to the critical value: We previously calculated a critical value of 3.841. Our observed Chi-squared value is much larger than this critical value.

Interpretation: The difference between our observed data and what we’d expect under the assumption of independence is statistically significant. Thus, there is a statistically significant association between left-right ideological placement and educational level. This means that the proportion holding a BA or more is not the same among left-leaning and right-leaning respondents in France.
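To make the long-form statistic concrete, here is a small sketch on a made-up 2 x 2 table (the counts and the names `toy`, `contrib`, `x2_manual`, and `x2_builtin` are hypothetical). It shows each cell's contribution to Pearson's chi-squared and checks that the sum matches `chisq.test()` without the continuity correction.

```r
# Hypothetical 2 x 2 table of counts (illustrative numbers only, not ESS data)
toy <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# Each cell's contribution: (observed - expected)^2 / expected
contrib <- (toy - expected)^2 / expected
print(round(contrib, 3))

# Pearson's chi-squared is the sum of the cell contributions
x2_manual <- sum(contrib)

# Same statistic from chisq.test(), with the continuity correction turned off
x2_builtin <- unname(chisq.test(toy, correct = FALSE)$statistic)

cat("Manual:", round(x2_manual, 4), " chisq.test:", round(x2_builtin, 4), "\n")
```

Looking at `contrib` cell by cell is one way to see which combinations of categories depart most from independence.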

# p-value:
p_val <- pchisq(test_stat, df = 1, lower.tail = FALSE)
cat("p-value:", round(p_val, 4), "\n")
## p-value: 0
# Using chisq.test for validation:
cat("Chi-squared test result:\n")
## Chi-squared test result:
print(chisq.test(mytab, correct = FALSE))
## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 100.86, df = 1, p-value < 2.2e-16
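
The printed "p-value: 0" is a rounding effect rather than a true zero. As a hedged sketch, the snippet below recomputes the p-value from the reported statistic (100.8551, df = 1) and shows two other ways to display it; the variable names are illustrative.

```r
# Test statistic reported in the long-form calculation
test_stat <- 100.8551

# Upper-tail probability of a chi-squared distribution with 1 degree of freedom
p_val <- pchisq(test_stat, df = 1, lower.tail = FALSE)

# Rounding to 4 decimals hides a very small, but positive, p-value
rounded <- round(p_val, 4)
cat("Rounded to 4 decimals:", rounded, "\n")

# Scientific notation keeps the order of magnitude visible
cat("Scientific notation:  ", format(p_val, scientific = TRUE, digits = 3), "\n")

# format.pval() reports values below a chosen threshold as "less than" that threshold
p_label <- format.pval(p_val, eps = 0.001)
cat("Reporting style:      ", p_label, "\n")
```

Either display makes clear that the p-value is below 0.05, so the conclusion matches the `chisq.test()` output.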

Task 2

Prompt

Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.

See variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%22874a86e2-a0f6-40b4-aef7-51c8eed98a7d/1%22])

Provide interpretations of the output (consider the variable info).

Code

Step 1: Calculate the test statistic of your sample

sbbsntx detailed variable description: 1 Agree strongly; 2 Agree; 3 Neither agree nor disagree; 4 Disagree; 5 Disagree strongly; 7 Refusal; 8 Don’t know; 9 No answer

domicil detailed variable description (Domicile, respondent’s description): 1 A big city; 2 Suburbs or outskirts of big city; 3 Town or small city; 4 Country village; 5 Farm or home in countryside; 7 Refusal; 8 Don’t know; 9 No answer

Data pre-process

table(ess$sbbsntx)
## 
##     1     2     3     4     5     7     8     9 
##  6577 27773 26302 25689  5868   122  8707   101
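mytab2 <- tab_dat %>%
  filter(cntry == "FR") %>% # tab_dat is already restricted to France; kept for clarity
  mutate(
    # Set refusal / don't know / no answer codes to NA for both variables
    sbbsntx = ifelse(sbbsntx %in% c(7, 8, 9), NA, sbbsntx),
    domicil = ifelse(domicil %in% c(7, 8, 9), NA, domicil)
  ) %>%
  mutate( # Recode domicile into urban vs. rural
    domicil_ur = case_when(
      domicil < 3 ~ "urban",        # big city, suburbs or outskirts of big city
      domicil %in% 3:5 ~ "rural",   # town or small city, village, farm or countryside
      TRUE ~ NA_character_))        # missing values stay missing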
test_stat <- mytab2%>%
  specify(explanatory = domicil_ur, # change variable name for explanatory variable
          response = sbbsntx) %>% # change variable name for outcome of interest
  hypothesize(null = "independence") %>%
  calculate(stat = "t") # replace in between quotation marks appropriate test statistic
## Warning: Removed 14998 rows containing missing values.
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "rural" - "urban", or divided in the order "rural" / "urban" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("rural", "urban")` to the calculate() function.
print(test_stat$stat)
##         t 
## -4.138823

Interpretation: Here the t-statistic measures the difference between the mean sbbsntx score of rural respondents and that of urban respondents, in units of standard error. The larger the magnitude of t (regardless of the sign), the greater the evidence against the null hypothesis of independence.

A t-statistic of -4.138823 is large in magnitude, indicating a difference in mean sbbsntx scores between rural and urban respondents in France. Because the statistic is computed as rural minus urban and lower scores mean stronger agreement, the negative sign indicates that rural respondents agree more, on average, that social benefits and services cost businesses too much in taxes and charges.
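
As a rough illustration of what a two-group t-statistic measures, the sketch below computes it by hand on made-up scores and compares it with `t.test()`. The data and variable names are hypothetical, and it assumes the unpooled (Welch) form that `t.test()` uses by default; whether infer computes exactly this form internally is an assumption here.

```r
# Hypothetical 1-5 agreement scores for two groups (illustrative only, not ESS data)
rural <- c(2, 3, 2, 4, 1, 3, 2, 2)
urban <- c(3, 4, 3, 5, 2, 4, 3, 4)

# Standard error of the difference in means, without pooling the variances
se <- sqrt(var(rural) / length(rural) + var(urban) / length(urban))

# t = difference in group means (rural minus urban) divided by its standard error
t_manual <- (mean(rural) - mean(urban)) / se

# Same statistic from t.test(), which defaults to the unpooled (Welch) version
t_builtin <- unname(t.test(rural, urban)$statistic)

cat("Manual t:", round(t_manual, 4), " t.test t:", round(t_builtin, 4), "\n")
```

In this toy example the rural mean is lower than the urban mean, so the statistic is negative, mirroring the sign convention "rural" - "urban" in the warning message.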

Step 2: Simulate the null distribution

null_distribution <- mytab2 %>%
  specify(explanatory = domicil_ur,
          response = sbbsntx) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard but change if too computationally demanding
  calculate(stat = "t")
## Warning: Removed 14998 rows containing missing values.
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "rural" - "urban", or divided in the order "rural" / "urban" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("rural", "urban")` to the calculate() function.

Step 3: Calculate the p-value of your sample

p_val <- null_distribution %>% # replace name here if you assigned something other than null_distribution above
  get_pvalue(obs_stat = test_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

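Strictly speaking, the p-value is very close to 0 rather than exactly 0: with 1,000 permutations, none of the simulated statistics was as extreme as the observed one, so it is better reported as p < 0.001. Under the assumptions of frequentist hypothesis testing, this still means there is strong evidence to reject the null.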
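
The following base R sketch illustrates why a permutation p-value can come out as exactly 0 and how it might be reported. It uses simulated data and a simple difference in means, not the infer pipeline or ESS data, and every name in it (`diff_means`, `null_stats`, `p_raw`, `p_report`) is hypothetical. Note that infer's own two-sided p-value calculation may differ in detail from the simple rule used here.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# Simulated scores for two groups with clearly different means (not ESS data)
group <- rep(c("rural", "urban"), each = 50)
score <- c(rnorm(50, mean = 2.5), rnorm(50, mean = 3.5))

# Difference in group means, rural minus urban
diff_means <- function(y, g) mean(y[g == "rural"]) - mean(y[g == "urban"])

obs <- diff_means(score, group)

reps <- 1000
# Shuffle the group labels to simulate the null hypothesis of independence
null_stats <- replicate(reps, diff_means(score, sample(group)))

# Two-sided p-value: share of permuted statistics at least as extreme as observed
p_raw <- mean(abs(null_stats) >= abs(obs))

# With `reps` permutations the smallest nonzero p-value is 1 / reps,
# so a raw 0 is better reported as "p < 1 / reps"
p_report <- if (p_raw == 0) paste0("p < ", 1 / reps) else paste0("p = ", p_raw)
cat(p_report, "\n")
```

Increasing `reps` lowers the smallest p-value the simulation can resolve, at the cost of more computation.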

Task 3

Prompt

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

null_distribution
## Response: sbbsntx (numeric)
## Explanatory: domicil_ur (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.509 
##  2         2 -0.149 
##  3         3  0.0283
##  4         4  0.234 
##  5         5  0.353 
##  6         6 -0.329 
##  7         7  1.41  
##  8         8 -1.46  
##  9         9  1.36  
## 10        10  1.11  
## # ℹ 990 more rows
conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

Interpretation: In the figure, the red line marking the observed statistic lies far in the left tail of the simulated null distribution, well outside the shaded 95% interval, so we reject the null hypothesis of independence. The observed t-statistic of -4.14 (p < 0.05) indicates a statistically significant difference in mean sbbsntx scores between rural and urban respondents in France. Because the statistic is computed as rural minus urban and lower scores mean stronger agreement, rural respondents agree more, on average, that social benefits and services cost businesses too much in taxes and charges. This suggests that place of residence is associated with attitudes towards the cost of social benefits.
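
As a hedged sketch of what the shaded interval represents, the snippet below takes the middle 95% of a set of null statistics with `quantile()` and checks whether an observed value falls outside it. The null statistics are simulated from a standard normal as a stand-in for the permuted t-statistics (an assumption), and the names `null_stats`, `obs_t`, `bounds`, and `outside` are illustrative.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Stand-in for 1,000 permuted t-statistics under the null (assumption: roughly standard normal)
null_stats <- rnorm(1000)

# Observed statistic from Step 1, rounded
obs_t <- -4.14

# Percentile interval: the 2.5th and 97.5th percentiles of the null statistics
bounds <- unname(quantile(null_stats, probs = c(0.025, 0.975)))

# TRUE when the observed statistic lies outside the middle 95% of the null distribution
outside <- obs_t < bounds[1] || obs_t > bounds[2]

cat("95% interval of null statistics:", round(bounds, 2), "\n")
cat("Observed statistic outside interval:", outside, "\n")
```

An observed statistic outside this interval is the visual counterpart of a two-sided p-value below 0.05.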

Task 4

Prompt

Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.

Variable info here: https://ess.sikt.no/en/datafile/ffc43f48-e15a-4a1c-8813-47eda377c355/92?tab=1&elements=[%2283b08cf2-508e-49a0-9fc8-c3cc281290ec/4%22]

Provide interpretations of the output (consider the variable info).

mytab4<-tab_dat %>%
  filter(!is.na(ID))%>%
  mutate(ccrdprs=ifelse(ccrdprs%in% c(66,77,88,99), NA,ccrdprs))
test_stat <- mytab4%>%
  specify(explanatory = ID, # change variable name for explanatory variable
          response = ccrdprs) %>% # change variable name for outcome of interest
  hypothesize(null = "independence") %>%
  calculate(stat = "t") # replace in between quotation marks appropriate test statistic
## Warning: Removed 7140 rows containing missing values.
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Left" - "Right", or divided in the order "Left" / "Right" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Left", "Right")` to the calculate() function.
print(test_stat$stat)
##        t 
## 5.482311

Interpretation: The t-statistic of 5.482311 indicates a substantial difference between the mean ccrdprs scores for individuals identifying as left-leaning and right-leaning. With this t-value, which is well above any critical value at conventional significance levels (assuming a two-tailed test at α = 0.05), we confidently reject the null hypothesis. This rejection suggests that there exists a statistically significant difference in mean ccrdprs scores between left-leaning and right-leaning individuals.
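
The sign of the statistic depends on the order of subtraction noted in the warning ("Left" - "Right"). A minimal sketch with made-up 0-10 scores (the data and names are hypothetical, and base R's `t.test()` stands in for the infer pipeline) shows that swapping the order flips the sign but not the magnitude.

```r
# Hypothetical 0-10 scores for two groups (illustrative only, not ESS data)
left  <- c(7, 8, 6, 9, 7, 8)
right <- c(5, 6, 5, 7, 4, 6)

# t-statistic with Left as the first group (Left minus Right)
t_left_minus_right <- unname(t.test(left, right)$statistic)

# t-statistic with Right as the first group (Right minus Left)
t_right_minus_left <- unname(t.test(right, left)$statistic)

cat("Left - Right:", round(t_left_minus_right, 3), "\n")
cat("Right - Left:", round(t_right_minus_left, 3), "\n")
```

A positive value under the "Left" - "Right" order therefore means the Left group has the higher mean in the sample.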

null_distribution <- mytab4 %>%
  specify(explanatory = ID, # same explanatory variable as in Step 1
          response = ccrdprs) %>% # same response variable as in Step 1
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard but change if too computationally demanding
  calculate(stat = "t")
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "BA or more" - "No BA", or divided in the order "BA or more" / "No BA"
## for ratio-based statistics. To specify this order yourself, supply `order =
## c("BA or more", "No BA")` to the calculate() function.
p_val <- null_distribution %>% # replace name here if you assigned something other than null_distribution above
  get_pvalue(obs_stat = test_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Interpretation: The test yielded a t-statistic of 5.482311 and a reported p-value of 0, which, given 1,000 permutations, is better read as p < 0.001. With such a small p-value, we reject the null hypothesis of independence: there is strong evidence of a difference in mean ccrdprs scores between left-leaning and right-leaning respondents. Because the statistic is computed as Left minus Right, the positive sign indicates that left-leaning respondents report higher ccrdprs scores on average, that is, a stronger sense of personal responsibility to reduce climate change.

Task 5

Prompt

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

null_distribution
## Response: trstplt (numeric)
## Explanatory: educ.ba (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 -1.03 
##  2         2  0.577
##  3         3  1.60 
##  4         4  0.513
##  5         5  0.862
##  6         6  0.392
##  7         7  0.774
##  8         8  0.146
##  9         9 -0.642
## 10        10 -0.116
## # ℹ 990 more rows
conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

Interpretation: The observed statistic lies far out in the tail of the simulated null distribution, beyond the shaded 95% interval. This provides evidence against the null hypothesis and supports the alternative hypothesis that mean ccrdprs scores differ between left-leaning and right-leaning respondents. Overall, the visualization reinforces the statistical significance of the observed difference and highlights the importance of considering political orientation when examining feelings of personal responsibility to reduce climate change.