Nagpal_Angela_Homework3

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

packages <- c("tidyverse", "infer", "fst") # add any you need here

new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "infer"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "infer"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"

rm(list=ls()); gc()

##           used (Mb) gc trigger  (Mb) limit (Mb) max used (Mb)
## Ncells 1044146 55.8    2177524 116.3         NA  1274208 68.1
## Vcells 1809885 13.9    8388608  64.0      16384  2371298 18.1

ess <- read_fst("All-ESS-Data.fst")

Task 1 Based on the recoding of lrscale for “left” and “right” and omitting “moderates” (see Tutorial 4), and educ.ba from this tutorial, do the long form coding for Chi-Square. Using the steps outlined in the tutorial, first generate the tables of expected proportions and frequencies. Determine and interpret the critical value for independence. Finally, determine and interpret both the Pearson’s chi-squared statistic and the p-value. Discuss main takeaways.

france_data <- ess %>%
  # Filter for France
  filter(cntry == "FR") %>%
  
  # Recoding clsprty and cleaning trstplt (will use later)
  mutate(
    clsprty = case_when(
      clsprty == 1 ~ "Yes",  # Recode 1 to "Yes"
      clsprty == 2 ~ "No",   # Recode 2 to "No"
      clsprty %in% c(7, 8, 9) ~ NA_character_,  # Handle other specific cases where you want to set it as NA
      TRUE ~ as.character(clsprty)  # Keep other values as-is but ensure they are characters
    ),
    trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt),
    
    # Recoding gender
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      gndr == 9 ~ NA_character_,
      TRUE ~ as.character(gndr)
    ),

    # Recoding education with case_when for consistency
    educ.ba = case_when(
      essround < 5 & edulvla == 5 ~ "BA or more",
      essround >= 5 & edulvlb > 600 ~ "BA or more",
      TRUE ~ "No BA"
    )
  ) %>%
  # Ensure all NAs are handled uniformly
  mutate(
    edulvla = ifelse(edulvla %in% c(77, 88), NA_integer_, edulvla),
    edulvlb = ifelse(edulvlb %in% c(5555, 7777, 8888), NA_integer_, edulvlb),
    educ.ba = as.character(educ.ba)  # Force educ.ba to be treated as character
  )

# using unique to identify
unique(france_data$gndr)

## [1] "Female" "Male"

# does not return NA, but the following does
unique(france_data$clsprty)

## [1] "Yes" "No"  NA

# here's how to make sure it's really cleaned
france_data <- france_data %>% filter(!is.na(clsprty))
france_data <- france_data %>% filter(!is.na(trstplt))

# double-checking + quick distribution
unique(france_data$clsprty)

## [1] "Yes" "No"

table(france_data$clsprty)

## 
##   No  Yes 
## 9098 9480

tab_dat <- france_data

# Create a crosstable and save it into the object mytab
mytab <- table(tab_dat$clsprty, tab_dat$educ.ba)

# Calculate row-wise sums of counts in table 
rsums <- rowSums(mytab)

# Calculate column-wise sums of counts in table
csums <- colSums(mytab)

# Get the total count of the table (N)
N <- sum(mytab)

# Generate table of expected proportions
ptab <- tcrossprod(rsums/N, csums/N)
cat("Table of Expected Proportions:\n")

## Table of Expected Proportions:

print(ptab)

##           [,1]      [,2]
## [1,] 0.1099746 0.3797444
## [2,] 0.1145921 0.3956889

# Table of expected frequencies
ftab <- N * ptab
cat("Table of Expected Frequencies:\n")

## Table of Expected Frequencies:

print(ftab)

##          [,1]     [,2]
## [1,] 2043.108 7054.892
## [2,] 2128.892 7351.108

# Critical Value for Independence:
alpha <- 0.05 # significance level
c_val <- qchisq(alpha, df = 1, lower.tail = FALSE)
cat("Critical Value:", round(c_val, 3), "\n")

## Critical Value: 3.841

Interpreting the Critical Value: If the Chi-sqaured test statistic we compute from the chisq.test function exceeds our critical value of 3.841, then we reject the null hypothesis at the 0.05 significance level.

# Pearson's chi-squared statistic:
test_stat <- sum((mytab - ftab)^2 / ftab)
cat("Pearson's X^2:", round(test_stat, 4), "\n")

## Pearson's X^2: 160.4057

Our chi-squared result is significantly more than the 3.841 critical threshold that we determined. If we were to interpret it, it would suggest that feeling close to a party and education levels are statistically significantly correlated.referring to the idea that different educational attainment levels have different effects on how close a person feels to a party.

# p-value:
p_val <- pchisq(test_stat, df = 1, lower.tail = F)
cat("p-value:", round(p_val, 4), "\n")

## p-value: 0

# Using chisq.test for validation:
cat("Chi-squared test result:\n")

## Chi-squared test result:

print(chisq.test(mytab, correct = F))

## 
##  Pearson's Chi-squared test
## 
## data:  mytab
## X-squared = 160.41, df = 1, p-value < 2.2e-16

We obtained a p-value that is extremely small and close to 0. For that reason we must reject the null hypothesis that the two variable, feeling close to a party and education levels are independent of each other. As we can see from the large chi-squared value (160.41) and the p-value, the hypothesis that the two variables we are looking at, feeling close to a party and education levels are independent of each other. With the results achieved, it shows that there is a statistically strong association between these two variables.

TASK 2: Do Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables sbbsntx (recoded to the corresponding 5 categories) as the response and domicil (recoded to urban and rural, setting 2 to urban instead of peri-urban) as the explanatory variable for the country of France.

test_stat <- france_data %>%
  specify(explanatory = domicil, # change variable name for explanatory variable
          response = sbbsntx) %>% # change variable name for outcome of interest
  hypothesize(null = "independence") %>%
  calculate(stat = "slope") # replace in between quotation marks appropriate test statistic

## Warning: Removed 14528 rows containing missing values.

## Message: The independence null hypothesis does not inform calculation of the
## observed statistic (a slope) and will be ignored.

print(test_stat$stat)

##    domicil 
## -0.0768133

null_distribution <- france_data %>%
  specify(explanatory = domicil,
          response = sbbsntx) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% # only line we are adding, use reps = 1000 as standard but change if too computationally demanding
  calculate(stat = "slope")

## Warning: Removed 14528 rows containing missing values.

p_val <- null_distribution %>% # replace name here if you assigned something other than null_distribution above
  get_pvalue(obs_stat = test_stat, direction = "two-sided") # would only replace test_stat if assigned another name in Step 1

## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

The P-value that we obtained is 0 which means that there is strong evidence to reject the null. Meaning that the two variables, social benefits/ services cost businesses too much tax and the respondent’s description are independent of each other.

Task 3: The P-value that we obtained is 0 which means that there is strong evidence to reject the null. Meaning that the two variables, social benefits/ services cost businesses too much tax and the respondent’s description are independent of each other.

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

null_distribution

## Response: sbbsntx (numeric)
## Explanatory: domicil (numeric)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate       stat
##        <int>      <dbl>
##  1         1  0.0000266
##  2         2  0.00875  
##  3         3 -0.0352   
##  4         4 -0.0150   
##  5         5 -0.00498  
##  6         6 -0.0117   
##  7         7 -0.0378   
##  8         8 -0.00944  
##  9         9  0.00114  
## 10        10 -0.0354   
## # ℹ 990 more rows

Task 4: Conduct Steps 1 to 3 using the infer package (as we did in the tutorial), for the variables ccrdprs (leaving it on the 0-10 numeric scale) as the response (or outcome) variable and lrscale (recoded as left and right, omitting “moderates” as we did in Task 1) as the explanatory variable.

france_data <- ess %>%
  # Filter for France
  filter(cntry == "FR") %>%
  
  
  mutate(
    lrscale = case_when(
      lrscale %in% 0:3 ~ "Left",  
      lrscale %in% 7:10 ~ "Right",
      lrscale %in% c(4,5,6) ~ NA_character_,  
      TRUE ~ NA_character_  
    ),
    ccrdprs = ifelse(ccrdprs %in% c(77, 88, 99), NA, ccrdprs),
  )

test_stat <- france_data %>%
  specify(explanatory = lrscale, 
          response = ccrdprs) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "F")

## Warning: Removed 17154 rows containing missing values.

print(test_stat$stat)

## [1] 2.627956

null_distribution <- france_data %>%
  specify(explanatory = lrscale,
          response = ccrdprs) %>% 
  hypothesize(null = "independence") %>%
    generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

## Warning: Removed 17154 rows containing missing values.

p_val <- null_distribution %>% 
  get_pvalue(obs_stat = test_stat, direction = "greater") 

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.116

The P-value I got is 0.112 which suggests that we should reject the null hypothesis that the two variables are independent of each other.

Task 5: Now do Step 4 (i.e., the null distribution visualization with confidence intervals) for the same variables and country as Task 4, and interpret the output.

conf_int <- null_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")


null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = test_stat, direction = "two-sided") +
  shade_confidence_interval(endpoints = conf_int)

## Warning: F usually corresponds to right-tailed tests. Proceed with caution.

null_distribution

## Response: ccrdprs (numeric)
## Explanatory: lrscale (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate     stat
##        <int>    <dbl>
##  1         1 0.000352
##  2         2 0.0212  
##  3         3 1.37    
##  4         4 0.954   
##  5         5 0.0247  
##  6         6 0.121   
##  7         7 1.47    
##  8         8 1.47    
##  9         9 0.102   
## 10        10 3.59    
## # ℹ 990 more rows

Nagpal_Angela_Homework3

2024-03-12

R Markdown

Including Plots

Interpreting the Critical Value: If the Chi-sqaured test statistic we compute from the chisq.test function exceeds our critical value of 3.841, then we reject the null hypothesis at the 0.05 significance level.

The P-value I got is 0.112 which suggests that we should reject the null hypothesis that the two variables are independent of each other.

The output shows that the red line falls outside of the distribution so we would have to reject the null.