SOC202 HW7

Task 1

# List of packages
packages <- c("tidyverse", "infer", "fst") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"

ess <- read_fst("All-ESS-Data.fst")

switzerland_data <- ess %>%
  filter(cntry == "CH") %>%
  mutate(
    brncntr = case_when(
      brncntr == 1 ~ "Yes", 
      brncntr == 2 ~ "No",  
      brncntr %in% c(7, 8, 9) ~ NA_character_, 
      TRUE ~ as.character(brncntr) 
    ),
    edulvla = case_when(
      essround < 5 & edulvla == 55 ~ NA_real_,
      TRUE ~ edulvla
    ),
    edulvlb = case_when(
      essround >= 5 & edulvlb == 5555 ~ NA_real_,
      TRUE ~ edulvlb
    ),
    educ_level = case_when(
      essround < 5 & edulvla == 5 ~ "BA",
      essround >= 5 & edulvlb > 600 ~ "BA",
      TRUE ~ "No BA"
    )
  )

unique(switzerland_data$brncntr)

## [1] "No"  "Yes" NA

switzerland_data <- switzerland_data %>% filter(!is.na(brncntr))
switzerland_data <- switzerland_data %>% filter(!is.na(educ_level))

unique(switzerland_data$brncntr)

## [1] "No"  "Yes"

table(switzerland_data$brncntr)

## 
##    No   Yes 
##  3867 13056

tab_dat <- switzerland_data

mytab <- table(tab_dat$brncntr, tab_dat$educ_level)

rsums <- rowSums(mytab)

csums <- colSums(mytab)

N <- sum(mytab)

ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")

## Table of Expected Proportions:

print(ptab)

##            [,1]      [,2]
## [1,] 0.05206627 0.1764393
## [2,] 0.17578931 0.5957051

ftab <- N * ptab
cat("Table of Expected Frequencies:\n")

## Table of Expected Frequencies:

print(ftab)

##           [,1]      [,2]
## [1,]  881.1175  2985.882
## [2,] 2974.8825 10081.118

alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")

## Critical Value: 3.841

If the Chi-squared test statistic that is computed exceeds 3.841, then we reject the null hypothesis at the 0.05 significance level. This would mean that there is a statistically significant association between being born in country and education level in Switzerland.

test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")

## Pearson's X^2: 65.1218

Comparison to the critical value: We previously calculated a critical value of 3.841. Our observed X-squared value is much larger than this critical value.

Interpretation: The difference between our observed data and what we’d expect under the assumption of independence is statistically significant. Thus, there is a statistically significant association between being born in country and one’s educational level. This means that the probability of being born in country is not the same across different education levels.

p_val <- pchisq(test_stat, df = 1, lower.tail = F)
cat("p-value:", round(p_val, 4), "\n")

## p-value: 0

cat("Chi-squared test result:\n")

## Chi-squared test result:

print(chisq.test(mytab, correct = F))

## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 65.122, df = 1, p-value = 7.041e-16

The chi-squared test tells us if two categorical variables are independent or if they are associated in some manner. In our test: The extremely small p-value (essentially 0) means that we reject the null hypothesis that the two variables are independent of each other. In other words, there’s strong evidence to suggest that there’s a significant association between the two variables in mytab.

The large chi-squared statistic of 65.1218 further reinforces this point. If the two variables were independent, we’d expect a chi-squared statistic closer to 0.

Main takeaways: According to the critical value, Pearson’s chi-squared statistic, and p-value the data not only rejects the null hypothesis, that a relationship does not exist betweeen being born in country and education level in Switzerland, but supports the hypothesis that a relationship does exist.

Task 2

netherlands_data <- ess %>%
  filter(cntry == "NL") %>%
    mutate(
     domicil = case_when(
      domicil == 1 ~ "Urban", 
      domicil == 2 ~ "Urban",   
      domicil == 3 ~ "Rural",
      domicil == 4 ~ "Rural",
      domicil == 5 ~ "Rural",
      domicil %in% c(7, 8, 9) ~ NA_character_,  
      TRUE ~ as.character(domicil)  
      ),
     gincdif = case_when(
      gincdif == 1 ~ "Agree Strongly", 
      gincdif == 2 ~ "Agree",   
      gincdif == 3 ~ "Neither Agree nor Disagree",
      gincdif == 4 ~ "Disagree",
      gincdif == 5 ~ "Disagree Strongly",
      gincdif %in% c(7, 8, 9) ~ NA_character_,  
      TRUE ~ as.character(gincdif)  
    )
    )

netherlands_data <- netherlands_data %>% filter(!is.na(domicil))
netherlands_data <- netherlands_data %>% filter(!is.na(gincdif))

unique(netherlands_data$domicil)

## [1] "Urban" "Rural"

table(netherlands_data$domicil)

## 
## Rural Urban 
## 12574  5557

unique(netherlands_data$gincdif)

## [1] "Agree"                      "Disagree"                  
## [3] "Neither Agree nor Disagree" "Agree Strongly"            
## [5] "Disagree Strongly"

table(netherlands_data$gincdif)

## 
##                      Agree             Agree Strongly 
##                       7886                       2889 
##                   Disagree          Disagree Strongly 
##                       3676                        543 
## Neither Agree nor Disagree 
##                       3137

tab_dat <- netherlands_data

mytab <- table(tab_dat$domicil, tab_dat$gincdif)

rsums <- rowSums(mytab)

csums <- colSums(mytab)

N <- sum(mytab)

ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")

## Table of Expected Proportions:

print(ptab)

##           [,1]       [,2]       [,3]        [,4]       [,5]
## [1,] 0.3016385 0.11050387 0.14060651 0.020769678 0.11998984
## [2,] 0.1333072 0.04883649 0.06214016 0.009179028 0.05302875

ftab <- N * ptab
cat("Table of Expected Frequencies:\n")

## Table of Expected Frequencies:

print(ftab)

##          [,1]      [,2]     [,3]    [,4]      [,5]
## [1,] 5469.007 2003.5456 2549.337 376.575 2175.5357
## [2,] 2416.993  885.4544 1126.663 166.425  961.4643

alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")

## Critical Value: 3.841

If the Chi-squared test statistic that is computed exceeds 3.841, then we reject the null hypothesis at the 0.05 significance level. This would mean that there is a statistically significant association between domicile and belief that the government should reduce differences in income levels in the Netherlands.

test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")

## Pearson's X^2: 8.176

Comparison to the critical value: We previously calculated a critical value of 3.841. Our observed X-squared value is larger than this critical value.

Interpretation: The difference between our observed data and what we’d expect under the assumption of independence is statistically significant. Thus, there is a statistically significant association between domicile and belief that the government should reduce differences in income levels in the Netherlands. This means that the probability of belief that the government should reduce differences in income levels is not the same based on domicile.

p_val <- pchisq(test_stat, df = 1, lower.tail = F)
cat("p-value:", round(p_val, 4), "\n")

## p-value: 0.0042

cat("Chi-squared test result:\n")

## Chi-squared test result:

print(chisq.test(mytab, correct = F))

## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 8.176, df = 4, p-value = 0.08534

The chi-squared test tells us if two categorical variables are independent or if they are associated in some manner. In our test: The small p-value (essentially 0) means that we reject the null hypothesis that the two variables are independent of each other. In other words, there’s evidence to suggest that there’s a significant association between the two variables in mytab.

The chi-squared statistic of 8.176 further reinforces this point. If the two variables were independent, we’d expect a chi-squared statistic closer to 0.

test_stat <- netherlands_data %>%
  specify(explanatory = domicil, 
          response = gincdif) %>% 
  hypothesize(null = "independence") %>%
  calculate(stat = "chisq") 

print(test_stat$stat)

## X-squared 
##  8.176048

A chi-square test statistic of 8.176048 is large, indicating a significant difference between domicile and belief that the government should reduce differences in income levels in the Netherlands. It suggests that the mean level of belief that the government should reduce differences in income levels for the two groups is different.

null_distribution <- netherlands_data %>%
  specify(explanatory = domicil,
          response = gincdif) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "chisq")

p_val <- null_distribution %>% 
  get_pvalue(obs_stat = test_stat, direction = "greater") 

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.098

It’s close to a p-value of 0, not 0 – but, still, under the assumptions of frequentist hypothesis testing, this means there is evidence to reject the null.

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater")

null_distribution

## Response: gincdif (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1  3.25
##  2         2  4.24
##  3         3  3.20
##  4         4  7.91
##  5         5  5.74
##  6         6  2.54
##  7         7  2.72
##  8         8  8.71
##  9         9  2.35
## 10        10  5.23
## # ℹ 990 more rows

Note where the red line is set based on the test statistic and how we would not reject the null given the standard cutoff points at the tails.

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "greater") +
  shade_confidence_interval(endpoints = conf_int)

null_distribution

## Response: gincdif (factor)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1  3.25
##  2         2  4.24
##  3         3  3.20
##  4         4  7.91
##  5         5  5.74
##  6         6  2.54
##  7         7  2.72
##  8         8  8.71
##  9         9  2.35
## 10        10  5.23
## # ℹ 990 more rows

The CI supports the graph above.

Main takeaways: Although the critical value supported a rejection of the null hypothesis, Pearson’s chi-squared statistic, p-value, and the infer package of the data shows that test statistic and how we would not reject the null. Therefore, concluding that a significantly different relationship does not exist between domicile and belief that the government should reduce differences in income levels in the Netherlands.

Task 3

I want to see the relationship between what level of education (explanatory) and how happy people are (outcome) in Finland.

RQ: Does a relationship exist between the level of education and how happy people are in Finland?

finland_data <- ess %>%
  filter(cntry == "FI") %>%
  mutate(
    happy = ifelse(happy %in% c(77, 88, 99), NA, happy),
    edulvla = case_when(
      essround < 5 & edulvla == 55 ~ NA_real_,
      TRUE ~ edulvla
    ),
    edulvlb = case_when(
      essround >= 5 & edulvlb == 5555 ~ NA_real_,
      TRUE ~ edulvlb
    ),
    educ_level = case_when(
      essround < 5 & edulvla == 5 ~ "BA",
      essround >= 5 & edulvlb > 600 ~ "BA",
      TRUE ~ "No BA"
    )
  )

finland_data <- finland_data %>% filter(!is.na(happy))
finland_data <- finland_data %>% filter(!is.na(educ_level))

unique(finland_data$happy)

##  [1]  9  7  8  5 10  4  6  3  2  1  0

table(finland_data$happy)

## 
##    0    1    2    3    4    5    6    7    8    9   10 
##   24   26   80  166  231  628  790 2646 6855 6128 1928

unique(finland_data$educ_level)

## [1] "BA"    "No BA"

table(finland_data$educ_level)

## 
##    BA No BA 
##  5452 14050

tab_dat <- finland_data

mytab <- table(tab_dat$happy, tab_dat$educ_level)

rsums <- rowSums(mytab)

csums <- colSums(mytab)

N <- sum(mytab)

ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")

## Table of Expected Proportions:

print(ptab)

##               [,1]         [,2]
##  [1,] 0.0003440399 0.0008866031
##  [2,] 0.0003727099 0.0009604867
##  [3,] 0.0011467996 0.0029553438
##  [4,] 0.0023796092 0.0061323383
##  [5,] 0.0033113838 0.0085335552
##  [6,] 0.0090023768 0.0231994487
##  [7,] 0.0113246460 0.0291840198
##  [8,] 0.0379303965 0.0977479954
##  [9,] 0.0982663901 0.2532360199
## [10,] 0.0878448488 0.2263793334
## [11,] 0.0276378702 0.0712237850

ftab <- N * ptab
cat("Table of Expected Frequencies:\n")

## Table of Expected Frequencies:

print(ftab)

##              [,1]       [,2]
##  [1,]    6.709466   17.29053
##  [2,]    7.268588   18.73141
##  [3,]   22.364886   57.63511
##  [4,]   46.407138  119.59286
##  [5,]   64.578607  166.42139
##  [6,]  175.564352  452.43565
##  [7,]  220.853246  569.14675
##  [8,]  739.718593 1906.28141
##  [9,] 1916.391139 4938.60886
## [10,] 1713.150241 4414.84976
## [11,]  538.993744 1389.00626

alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")

## Critical Value: 3.841

test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")

## Pearson's X^2: 221.3878

Comparison to the critical value: We previously calculated a critical value of 3.841. Our observed X-squared value is much larger than this critical value.

Interpretation: The difference between our observed data and what we’d expect under the assumption of independence is statistically significant. Thus, there is a statistically significant association between feeling close to a party and one’s educational level. This means that the happiness levels is not the same across different education levels.

p_val <- pchisq(test_stat, df = 1, lower.tail = F)
cat("p-value:", round(p_val, 4), "\n")

## p-value: 0

cat("Chi-squared test result:\n")

## Chi-squared test result:

print(chisq.test(mytab, correct = F))

## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 221.39, df = 10, p-value < 2.2e-16

The large chi-squared statistic of 221.39 further reinforces this point. If the two variables were independent, we’d expect a chi-squared statistic closer to 0.

test_stat <- finland_data %>%
  specify(explanatory = educ_level, 
          response = happy) %>% 
  hypothesize(null = "independence") %>%
  calculate(stat = "t")

## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "BA" - "No BA", or divided in the order "BA" / "No BA" for ratio-based
## statistics. To specify this order yourself, supply `order = c("BA", "No BA")`
## to the calculate() function.

print(test_stat$stat)

##        t 
## 12.01526

Interpretation: The t-statistic measures the difference between the observed sample mean and the population mean (or another sample mean, in the case of two-sample tests) in units of standard error. The larger the magnitude of t (regardless of the sign), the greater the evidence against the null hypothesis.

A t-statistic of 12.01526 is quite large, indicating a significant difference in happiness between those with a BA and those without a BA. It suggests that the mean happiness is different for the two groups.

null_distribution <- finland_data %>%
  specify(explanatory = educ_level,
          response = happy) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t")

## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "BA" - "No BA", or divided in the order "BA" / "No BA" for ratio-based
## statistics. To specify this order yourself, supply `order = c("BA", "No BA")`
## to the calculate() function.

p_val <- null_distribution %>% 
  get_pvalue(obs_stat = test_stat, direction = "two-sided")

## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step.
## See `?get_p_value()` for more information.

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Really, it’s essentially a p-value of 0, not 0 – but, still, under the assumptions of frequentist hypothesis testing, this means there is strong evidence to reject the null.

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

null_distribution

## Response: happy (numeric)
## Explanatory: educ_level (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1  0.319
##  2         2 -1.87 
##  3         3  1.05 
##  4         4  0.834
##  5         5  0.758
##  6         6  1.41 
##  7         7 -0.157
##  8         8  0.694
##  9         9  0.955
## 10        10  0.567
## # ℹ 990 more rows

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

null_distribution

## Response: happy (numeric)
## Explanatory: educ_level (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1  0.319
##  2         2 -1.87 
##  3         3  1.05 
##  4         4  0.834
##  5         5  0.758
##  6         6  1.41 
##  7         7 -0.157
##  8         8  0.694
##  9         9  0.955
## 10        10  0.567
## # ℹ 990 more rows

Main takeaways: According to the critical value, Pearson’s chi-squared statistic, p-value, and the infer package of the data not only rejects the null hypothesis, that a relationship does not exist betweeen happiness and education level in Finland, but supports the hypothesis that a relationship does exist.

SOC202 HW7

Tim Liu

2023-11-13