Homework_7_Lyra

# List of packages
packages <- c("tidyverse", "infer", "fst") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"

ess <- read_fst("All-ESS-Data.fst")

Task 1:

switzerland_data <- ess %>%
 
  filter(cntry == "CH") %>%
  
 
  mutate(
    brncntr = case_when(
      brncntr == 1 ~ "Yes",  # Recode 1 to "Yes"
      brncntr == 2 ~ "No",   # Recode 2 to "No"
      brncntr %in% c(7, 8, 9) ~ NA_character_,  # Handle other specific cases where you want to set it as NA
      TRUE ~ as.character(brncntr)  # Keep other values as-is but ensure they are characters
    ),
    
    
    
    edulvla = case_when(
      essround < 5 & edulvla == 55 ~ NA_real_,
      TRUE ~ edulvla
    ),
    
    
    edulvlb = case_when(
      essround >= 5 & edulvlb == 5555 ~ NA_real_,
      TRUE ~ edulvlb
    ),
    
    
    educ_level = case_when(
      essround < 5 & edulvla == 5 ~ "BA",
      essround >= 5 & edulvlb > 600 ~ "BA",
      TRUE ~ "No BA"
    ),
    
  )

unique(switzerland_data$gndr)

## [1] 2 1 9

unique(switzerland_data$brncntr)

## [1] "No"  "Yes" NA

switzerland_data <- switzerland_data %>% filter(!is.na(brncntr))
switzerland_data <- switzerland_data %>% filter(!is.na(trstplt))
switzerland_data <- switzerland_data %>% filter(!is.na(educ_level))


unique(switzerland_data$brncntr)

## [1] "No"  "Yes"

table(switzerland_data$brncntr)

## 
##    No   Yes 
##  3867 13056

tab_dat <- switzerland_data


mytab <- table(tab_dat$brncntr, tab_dat$educ_level)


rsums <- rowSums(mytab)


csums <- colSums(mytab)


N <- sum(mytab)


ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")

## Table of Expected Proportions:

print(ptab)

##            [,1]      [,2]
## [1,] 0.05206627 0.1764393
## [2,] 0.17578931 0.5957051

ftab <- N * ptab
cat("Table of Expected Frequencies:\n")

## Table of Expected Frequencies:

print(ftab)

##           [,1]      [,2]
## [1,]  881.1175  2985.882
## [2,] 2974.8825 10081.118

alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")

## Critical Value: 3.841

Interpreting Critical Value: A critical value is a specific point on the distribution of the test statistic under the null hypothesis that serves as a threshold for rejecting the null hypothesis.The set critical value is 3.841. If the computed Chi-squared test statistic exceeds this critical value of 3.841, we can reject the null hypothesis at a significance level of 0.05, indicating a statistically significant correlation between our variables of education level and born in country within the data set.

test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")

## Pearson's X^2: 65.1218

Interpretation of Pearson’s chi-squared statistic: As previously discussed, our critical value is 3.841. And the Pearson’s chi-squared statistic that we got here as 65.1218 is significantly larger than the critical value. The observed data significantly deviates from what would be expected under the assumption of independence, indicating a statistically significant association between being born in a country and one’s educational level. Consequently, the likelihood of the variable of born in country varies across different levels of education.

P-value:

p_val <- pchisq(test_stat, df = 1, lower.tail = F)
cat("p-value:", round(p_val, 4), "\n")

## p-value: 0

cat("Chi-squared test result:\n")

## Chi-squared test result:

print(chisq.test(mytab, correct = F))

## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 65.122, df = 1, p-value = 7.041e-16

Interpretation of p-value: P-value helps us to determine whether or not to reject the null hypothesis. If the p-value is below the designated threshold which is commonly 0.05, we can confidently reject the null hypothesis. In this case, the p-value that we got is p-value = 7.041e-16, it is significantly close to zero. The exceedingly small p-value (most likely zero) indicates that there is substantial evidence to refute the notion that the two variables are independent from each other. Consequently, it strongly suggests a significant association between these variables studied within this data set.

Main takeaways: The main takeaway that I gained by completing this task was to understand the relationships between p-value, critical value, Pearson’s chi-squared test. As well as how they help us to understand and interpret our data set more efficiently and thoroughly. Moreover, understood what does those numbers mean and how to apply them into actual data set analysis.

Task 2:

netherlands_data <- ess %>%
  
  filter(cntry == "NL") %>%
  
  
  mutate(
    gincdif = case_when(
      
      gincdif %in% c(7, 8, 9) ~ NA_character_,  # Handle other specific cases where you want to set it as NA
      TRUE ~ as.character(gincdif)  # Keep other values as-is but ensure they are characters
    ),
   
    
   
    domicil = case_when(
      domicil %in% c(1, 2, 3) ~ "Urban",
      domicil %in% c(4, 5) ~ "Rural",
      domicil %in% c(7, 8, 9) ~ NA_character_,
      TRUE ~ as.character(domicil)
    )
  )

netherlands_data$gincdif <- as.numeric(netherlands_data$gincdif)

unique(netherlands_data$gincdif)

## [1]  2  4  3  1 NA  5

unique(netherlands_data$gincdif)

## [1]  2  4  3  1 NA  5

netherlands_data <- netherlands_data %>% filter(!is.na(gincdif))

netherlands_data <- netherlands_data %>% filter(!is.na(domicil))


unique(netherlands_data$gincdif)

## [1] 2 4 3 1 5

table(netherlands_data$gincdif)

## 
##    1    2    3    4    5 
## 2889 7886 3137 3676  543

table(netherlands_data$domicil)

## 
## Rural Urban 
##  7931 10200

class(netherlands_data$gincdif)

## [1] "numeric"

test_stat <- netherlands_data %>%
  specify(explanatory = domicil, 
          response = gincdif) %>% 
  hypothesize(null = "independence") %>%
  calculate(stat = "t")

## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Rural" - "Urban", or divided in the order "Rural" / "Urban" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Rural", "Urban")` to the calculate() function.

print(test_stat$stat)

##        t 
## 3.013806

Interpreting the t-statistic: The t-statistic demonstrates the discrepancy between the observed sample mean and either the population mean or another sample mean in terms of standard error units and through the presentation of a number. A larger absolute value of t, irrespective of its sign, provides stronger evidence against the null hypothesis. However, in this case, we have a t-statistic of 3.013806, which was not as significant as analyzed. Therefore, we do not have a strong evidence against the null hypothesis from this t-statistic.

null_distribution <- netherlands_data %>%
  specify(explanatory = domicil,
          response = gincdif) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "t")

## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Rural" - "Urban", or divided in the order "Rural" / "Urban" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Rural", "Urban")` to the calculate() function.

p_val <- null_distribution %>% 
  get_pvalue(obs_stat = test_stat, direction = "two-sided") 

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.004

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

null_distribution

## Response: gincdif (numeric)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 -0.767
##  2         2  1.67 
##  3         3 -1.00 
##  4         4  0.306
##  5         5 -0.810
##  6         6  0.306
##  7         7 -0.419
##  8         8 -0.781
##  9         9  0.696
## 10        10  0.682
## # ℹ 990 more rows

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

null_distribution

## Response: gincdif (numeric)
## Explanatory: domicil (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 -0.767
##  2         2  1.67 
##  3         3 -1.00 
##  4         4  0.306
##  5         5 -0.810
##  6         6  0.306
##  7         7 -0.419
##  8         8 -0.781
##  9         9  0.696
## 10        10  0.682
## # ℹ 990 more rows

Main takeaway: During the process of completing Homework 7, especially in Task 2, I have spent a lot of time to get started. I have run into an error code indicating “Error in calculate(): ! A t statistic is not well-defined for a multinomial categorical response variable (gincdif) and a multinomial categorical explanatory variable (domicil).” Here is where I found the table in the lecture on slide 48 useful. I have neglected the fact that t-statistic requires one numeric and one categorical variable. My variables at first were both categorical due to my arrangements. With the help of my TA, I was able to solve this problem by changing one of the variables to a numeric variable, with the code of ‘class(netherlands_data$gincdif)’. Through this experience I understood the essence of t-statistic deeper. Moreover, analyzes the significance of generalizing conclusions from the t-statistics that calculated was learned through this task.

Task 3:

Research Question: How does the time spent on electronic devices (netusoft) related to people’s human values [explanatory variable] on important to think new ideas and being creative (ipcrtiv) [outcome variable] in Germany?

germany_data <- ess %>%
 
  filter(cntry == "DE") %>%
  
 
  mutate(
    ipcrtiv = case_when(
      
      ipcrtiv %in% c(7, 8, 9) ~ NA_character_,  
      TRUE ~ as.character(ipcrtiv)  
    ),
   
    
    
    netusoft = case_when(
      netusoft %in% c(1, 2, 3) ~ "Sometimes",
      netusoft %in% c(4, 5) ~ "Often",
      netusoft %in% c(7, 8, 9) ~ NA_character_,
      TRUE ~ as.character(netusoft)
    )
  )

germany_data$ipcrtiv <- as.numeric(germany_data$ipcrtiv)

unique(germany_data$ipcrtiv)

## [1]  4  1  5  2  3  6 NA

unique(germany_data$ipcrtiv)

## [1]  4  1  5  2  3  6 NA

germany_data <- germany_data %>% filter(!is.na(ipcrtiv))

germany_data <- germany_data %>% filter(!is.na(netusoft))


unique(germany_data$ipcrtiv)

## [1] 3 2 1 4 6 5

table(germany_data$ipcrtiv)

## 
##    1    2    3    4    5    6 
## 1143 1926 1290  450  290   46

table(germany_data$netusoft)

## 
##     Often Sometimes 
##      3902      1243

class(germany_data$ipcrtiv)

## [1] "numeric"

test_stat <- germany_data %>%
  specify(explanatory = netusoft, 
          response = ipcrtiv) %>% 
  hypothesize(null = "independence") %>%
  calculate(stat = "t")

## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Often" - "Sometimes", or divided in the order "Often" / "Sometimes" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Often", "Sometimes")` to the calculate() function.

print(test_stat$stat)

##         t 
## -7.577223

null_distribution <- germany_data %>%
  specify(explanatory = netusoft,
          response = ipcrtiv) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "t")

## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Often" - "Sometimes", or divided in the order "Often" / "Sometimes" for
## ratio-based statistics. To specify this order yourself, supply `order =
## c("Often", "Sometimes")` to the calculate() function.

p_val <- null_distribution %>% 
  get_pvalue(obs_stat = test_stat, direction = "two-sided")

## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step.
## See `?get_p_value()` for more information.

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided")

null_distribution

## Response: ipcrtiv (numeric)
## Explanatory: netusoft (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate    stat
##        <int>   <dbl>
##  1         1  1.99  
##  2         2 -1.11  
##  3         3 -0.858 
##  4         4 -0.0116
##  5         5 -1.98  
##  6         6  0.471 
##  7         7  0.329 
##  8         8 -0.0403
##  9         9 -0.774 
## 10        10  1.59  
## # ℹ 990 more rows

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

null_distribution

## Response: ipcrtiv (numeric)
## Explanatory: netusoft (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate    stat
##        <int>   <dbl>
##  1         1  1.99  
##  2         2 -1.11  
##  3         3 -0.858 
##  4         4 -0.0116
##  5         5 -1.98  
##  6         6  0.471 
##  7         7  0.329 
##  8         8 -0.0403
##  9         9 -0.774 
## 10        10  1.59  
## # ℹ 990 more rows

Main takeaway: Under a comparison with the last task, the t-statistic that calculated using my research question was larger. Even though it had a negative sign on it, it is still significantly larger. Thus it suggested that the null hypothesis could be rejected.

Homework_7_Lyra_Dong

Lyra Dong

2023-11-09