SetUp

#Installing & Applying Packages
packages <-c("tidyverse", "modelsummary", "flextable", "fst", "viridis", "knitr", "kableExtra", "rmarkdown", "ggridges", "questionr", "infer")
new_packages <-packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Version 2.0.0 of `modelsummary`, to be released soon, will introduce a
##   breaking change: The default table-drawing package will be `tinytable`
##   instead of `kableExtra`. All currently supported table-drawing packages
##   will continue to be supported for the foreseeable future, including
##   `kableExtra`, `gt`, `huxtable`, `flextable, and `DT`.
##   
##   You can always call the `config_modelsummary()` function to change the
##   default table-drawing package in persistent fashion. To try `tinytable`
##   now:
##   
##   config_modelsummary(factory_default = 'tinytable')
##   
##   To set the default back to `kableExtra`:
##   
##   config_modelsummary(factory_default = 'kableExtra')
## 
## 
## Attaching package: 'flextable'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     compose
## 
## 
## Loading required package: viridisLite
## 
## 
## Attaching package: 'kableExtra'
## 
## 
## The following objects are masked from 'package:flextable':
## 
##     as_image, footnote
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "modelsummary" "lubridate"    "forcats"      "stringr"      "dplyr"       
##  [6] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [11] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [16] "datasets"     "methods"      "base"        
## 
## [[3]]
##  [1] "flextable"    "modelsummary" "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "fst"          "flextable"    "modelsummary" "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "viridis"      "viridisLite"  "fst"          "flextable"    "modelsummary"
##  [6] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [11] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [16] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [21] "methods"      "base"        
## 
## [[6]]
##  [1] "knitr"        "viridis"      "viridisLite"  "fst"          "flextable"   
##  [6] "modelsummary" "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[7]]
##  [1] "kableExtra"   "knitr"        "viridis"      "viridisLite"  "fst"         
##  [6] "flextable"    "modelsummary" "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "rmarkdown"    "kableExtra"   "knitr"        "viridis"      "viridisLite" 
##  [6] "fst"          "flextable"    "modelsummary" "lubridate"    "forcats"     
## [11] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [16] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [21] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[9]]
##  [1] "ggridges"     "rmarkdown"    "kableExtra"   "knitr"        "viridis"     
##  [6] "viridisLite"  "fst"          "flextable"    "modelsummary" "lubridate"   
## [11] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [16] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [21] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [26] "base"        
## 
## [[10]]
##  [1] "questionr"    "ggridges"     "rmarkdown"    "kableExtra"   "knitr"       
##  [6] "viridis"      "viridisLite"  "fst"          "flextable"    "modelsummary"
## [11] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [16] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [21] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [26] "methods"      "base"        
## 
## [[11]]
##  [1] "infer"        "questionr"    "ggridges"     "rmarkdown"    "kableExtra"  
##  [6] "knitr"        "viridis"      "viridisLite"  "fst"          "flextable"   
## [11] "modelsummary" "lubridate"    "forcats"      "stringr"      "dplyr"       
## [16] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [21] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"

#Import dataset
rm(list=ls()); gc()

##           used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
## Ncells 1247649 66.7    2278070 121.7         NA  2278070 121.7
## Vcells 2139699 16.4    8388608  64.0      16384  3200680  24.5

ess <- read.fst("All-ESS-Data.fst")

Task 1

Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.

Recoding & cleaning education and lrscale

#Recoding & cleaning education and lrscale
france_data <- ess %>%
  filter(cntry == "FR") %>%
  mutate(
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      TRUE ~ "No BA"),
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    educ.ba = as.character(educ.ba)) %>%
  mutate(
    lrscale = case_when(
      lrscale %in% 0:3 ~ "Left",
      lrscale %in% 7:10 ~ "Right", 
      lrscale %in% c(4, 5, 6, 77, 88, 99) ~NA_character_ )) %>%
filter(!is.na(lrscale) & !is.na(educ.ba))
#Note: 4,5,6 set to NA because those values correspond to "Moderate"

#Checking variables and lengths before making contingency tables
table(france_data$lrscale)

## 
##  Left Right 
##  4709  4302

table(france_data$educ.ba)

## 
## BA or more      No BA 
##       2261       6750

length(france_data$lrscale)

## [1] 9011

length(france_data$educ.ba)

## [1] 9011

Calculations

#New dataset & contingency table
tab_dat <-france_data
mytab <-table(tab_dat$lrscale, tab_dat$educ.ba)
rsums <-rowSums(mytab)
csums <-colSums(mytab)
N<-sum(mytab)

#Table of expected proportions
ptab <-tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")

## Table of Expected Proportions:

print(ptab)

##           [,1]      [,2]
## [1,] 0.1311243 0.3914592
## [2,] 0.1197912 0.3576253

#Table of expected frequencies
ftab <-N * ptab
cat("Table of Expected Frequencies:\n")

## Table of Expected Frequencies:

print(ftab)

##          [,1]     [,2]
## [1,] 1181.561 3527.439
## [2,] 1079.439 3222.561

#Critical Value of Independence
alpha <-0.05
c_val <-qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")

## Critical Value: 3.841

Interpretation: The critical value of independe was calculated to be 3.841. If the Chi-squared test statistic computed below exceeds 3.841, then we reject the null hypothesis at the 0.05 significance level.

#Pearson's chi-squared statistic
test_stat <-sum((mytab-ftab)^2/ftab)
cat("Pearson's X^2:", round(test_stat,4),"\n")

## Pearson's X^2: 100.8551

#Calculating the p-value
p_val <-pchisq(test_stat, df=1, lower.tail = F)
cat("p-value", round(p_val, 4), "\n")

## p-value 0

cat("Chi-squared test result:\n")

## Chi-squared test result:

print(chisq.test(mytab,correct=F))

## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 100.86, df = 1, p-value < 2.2e-16

Interpretations: Our observed X-squared value is much larger than the critical value of 3.841, therefore the difference in the observed and expected data is statistically significant under the assumption of independence. Thus, there is a statistically significant association between one’s education level and whether they lie to the left or right on a political scale. If the two variables were independent we would expect a chi-square statistic closer to 0, rather than 100.8551. The p-value 2.2e-16 is very close to 0 meaning that we can reject the null hypothesis that the two variables of lrscale and educ.ba are independent of each other.

Takeaway:There is strong evidence to suggest that there’s a significant association between the two variables in mytab. This conclusion can be drawn by the very small p-value that is close to 0 and the very large X-squared value that is much greater than the critical value of 3.841, allowing us to reject the null hypothesis that the variables of lrscale and educ.ba are independent. In other words, an individuals placement, left or right, on a political scale is not independent from whether or not they have higher education.

Task 2

Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France. Provide interpretations of the output (consider the variable info).

Recoding and cleaning sbbsntx & domicil

#Recoding and cleaning sbbsntx & domicil
france_data <- ess %>%
  filter(cntry == "FR") %>%
  mutate(
    sbbsntx = case_when(
      sbbsntx == 1 ~ "Agree strongly", 
      sbbsntx == 2 ~ "Agree", 
      sbbsntx == 3 ~ "Neither agree nor disagree", 
      sbbsntx == 4 ~ "Disagree", 
      sbbsntx == 5 ~ "Disagree strongly", 
      sbbsntx %in% c(7,8,9) ~NA_character_)) %>%
  mutate(
    domicil = case_when(
      domicil == 1 ~ "Urban",
      domicil == 2 ~ "Urban",
      domicil == 3 ~ "Rural",
      domicil == 4 ~ "Rural",
      domicil == 5 ~ "Rural", 
      domicil %in% c(7,8,9) ~NA_character_ )) %>%
  filter(!is.na(domicil) & !is.na(sbbsntx))
#Checking variables
table(france_data$sbbsntx)

## 
##                      Agree             Agree strongly 
##                       1538                        756 
##                   Disagree          Disagree strongly 
##                        640                        301 
## Neither agree nor disagree 
##                        804

table(france_data$domicil)

## 
## Rural Urban 
##  2876  1163

Calculations

#Step 1: Calculating test statistic test_stat
test_stat <- france_data %>%
  specify(explanatory = domicil, 
          response = sbbsntx) %>%
  hypothesize(null="independence")%>%
  calculate(stat="chisq")
##Chisq used because sbbsntx is a categorical variable with 3 or more categories and domicil is a categorical variable with 2 categories. 
print(test_stat$stat)

## X-squared 
##  19.97187

#Step 2: Simulate the null distribution
null_distribution <-france_data %>%
  specify(explanatory = domicil, 
          response = sbbsntx) %>%
  hypothesize(null="independence")%>%
  generate(reps=1000, type="permute")%>%
  calculate(stat="chisq")

#Step 3: Calculate the p-value
p_val <-null_distribution %>%
  get_pvalue(obs_stat=test_stat, direction = "greater")

## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Task 3

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.

Step 4: Visualize

#Step 4: Visualize
null_distribution %>%
  visualize() + 
  shade_p_value(obs_stat = test_stat, direction = "greater")

## Warning in (function (mapping = NULL, data = NULL, stat = "identity", position = "identity", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Did you mean to use `annotate()`?

## Warning in min(diff(unique_loc)): no non-missing arguments to min; returning
## Inf

null_distribution

## Response: sbbsntx (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1  4.86 
##  2         2  4.10 
##  3         3  5.12 
##  4         4 10.4  
##  5         5  4.13 
##  6         6  0.770
##  7         7  6.07 
##  8         8  1.37 
##  9         9  0.718
## 10        10  2.48 
## # ℹ 990 more rows

#Note: "greater" used because computed a chisq test

#Visualize with confidence intervals
conf_int<-null_distribution %>%
  get_confidence_interval(level=0.95, type ="percentile")
null_distribution %>%
  visualize() + 
  shade_p_value(obs_stat = test_stat, direction = "greater") + 
  shade_confidence_interval(endpoints=conf_int)

## Warning in (function (mapping = NULL, data = NULL, stat = "identity", position = "identity", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Did you mean to use `annotate()`?

## Warning in min(diff(unique_loc)): no non-missing arguments to min; returning
## Inf

#Note: "greater" used because computed a chisq test

Interpretations: The observed test statistic is represented by the red line. Since the observed test statistic does not fall within the shaded region (the null distribution), we can reject the null hypothesis. The shaded blue region represents the range of values within which the true Chi-Square statistic is likely to fall with a certain level of confidence, 95% confidence interval. The observed test statistic does not fall within the confifence interval shading, indicating the observed association between the variables domicil and sbbsntx is unlikely to have occurred by chance alone. In other words, there is a statistically significant association between the variables based on the data.

Task 4

Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable. Provide interpretations of the output (consider the variable info).

Recoding & cleaning ccrdprs and lrscale

#Recoding & cleaning ccrdprs and lrscale
france_data <- ess %>%
  filter(cntry == "FR") %>%
  mutate(
    ccrdprs = ifelse(ccrdprs %in% c(66, 77, 88, 99), NA_integer_, ccrdprs))%>%
  mutate(
    lrscale = case_when(
      lrscale %in% 0:3 ~ "Left",
      lrscale %in% 7:10 ~ "Right", 
      lrscale %in% c(4, 5, 6, 77, 88, 99) ~NA_character_ )) %>%
filter(!is.na(lrscale) & !is.na(ccrdprs))
## Note: 4,5,6 set to NA because those values correspond to "Moderate"
#Checking variables
table(france_data$lrscale)

## 
##  Left Right 
##   889   982

table(france_data$ccrdprs)

## 
##   0   1   2   3   4   5   6   7   8   9  10 
##  33  12  28  38  57 205 173 344 472 174 335

Calculations

#Step 1: Calculating test statistic test_stat
cc_stat <- france_data %>%
  specify(explanatory = lrscale, 
          response = ccrdprs) %>%
  hypothesize(null="independence")%>%
  calculate(stat="t")

## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Left" - "Right", or divided in the order "Left" / "Right" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Left", "Right")` to the calculate() function.

##t-test used because lrscale is a dichotomous categorical explanatory variable and ccrdprs is a numeric response variable.  
print(cc_stat$stat)

##        t 
## 5.482311

#Step 2: Simulate the null distribution
null_cc <-france_data %>%
  specify(explanatory = lrscale, 
          response = ccrdprs) %>%
  hypothesize(null="independence")%>%
  generate(reps=1000, type="permute")%>%
  calculate(stat="t")

## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Left" - "Right", or divided in the order "Left" / "Right" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Left", "Right")` to the calculate() function.

#Step 3: Calculate the p-value
p_val <-null_cc %>%
  get_pvalue(obs_stat=cc_stat, direction = "two-sided")

## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Interpretations: With an observed t-statistic of approximately 5.48 and a reported p-value of 0, this suggests that the observed association between the lrscale and ccrdprs variables is statistically significant. The observed t statistic of 5.48 indicates the magnitude of difference between the two gorups in lrscale in regards to the numeric variable ccrdprs. The reported p-value of 0 suggests that we would reject the null hypothesis of independence between political orientation (lrscale) and feeling responsible to reduce climate chance (ccrdprs). In other words, the observed association is highly unlikely to have occurred by chance alone, providing strong evidence against the null hypothesis. There is a statistically significant association between where survey respondents political placement on a left-right scale and the extent they feel responsible to reduce climate change.

Task 5

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.

Step 4: Visualize

#Step 4: Visualize
null_cc %>%
  visualize() + 
  shade_p_value(obs_stat = cc_stat, direction = "two-sided")

## Warning in (function (mapping = NULL, data = NULL, stat = "identity", position = "identity", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Did you mean to use `annotate()`?

null_cc

## Response: ccrdprs (numeric)
## Explanatory: lrscale (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 -0.532
##  2         2 -0.852
##  3         3  1.50 
##  4         4 -1.19 
##  5         5 -0.511
##  6         6 -0.321
##  7         7 -1.34 
##  8         8  0.483
##  9         9 -0.428
## 10        10  0.378
## # ℹ 990 more rows

#Visualize with confidence intervals
conf_int<-null_distribution %>%
  get_confidence_interval(level=0.95, type ="percentile")
null_cc %>%
  visualize() + 
  shade_p_value(obs_stat = cc_stat, direction = "two-sided") + 
  shade_confidence_interval(endpoints=conf_int)

## Warning in (function (mapping = NULL, data = NULL, stat = "identity", position = "identity", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Did you mean to use `annotate()`?

Interpretations: The observed test statistic is represented by the red line. Since the observed test statistic does not fall within the shaded region (the null distribution), we can reject the null hypothesis. The shaded blue region represents the range of values within which the t-test statistic is likely to fall with a certain level of confidence, 95% confidence interval. The observed test statistic does not fall within the confidence interval shading, indicating the observed association between the variables lrscale and ccrdprs is unlikely to have occurred by chance alone. In other words, there is a statistically significant association between the variables based on the data.

Coletta_Mia_Homework_3

Mia Coletta

2024-03-09

SetUp

Task 1

Recoding & cleaning education and lrscale

Calculations

Interpretation: The critical value of independe was calculated to be 3.841. If the Chi-squared test statistic computed below exceeds 3.841, then we reject the null hypothesis at the 0.05 significance level.

Task 2

Recoding and cleaning sbbsntx & domicil

Calculations

Task 3

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 2, and interpret the output.

Step 4: Visualize

Task 4

Recoding & cleaning ccrdprs and lrscale

Calculations

Task 5

Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.

Step 4: Visualize