Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.4

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.4

library(statsr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called gss.

load("gss.Rdata")

Part 1: Data

#Due to COVID-19, the 2020 GSS will conduct two separate studies,
#both using a mixed-mode web and telephone survey:
#(1) a panel study that follows up with the 2018 and 2016
#cross-section GSS respondents, and
#(2) a nationwide cross-section survey of the US adult population.

Part 2: Research question

#How does the proportion of people whose education level is over 11
#differ between 1972 and 2012? We estimate the difference at the 95% confidence level.

Part 3: Exploratory data analysis


library(tidyr)

## 
## Attaching package: 'tidyr'

## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack

#Independence conditions
#random sample: yes
#10% condition: met (each year's sample is less than 10% of the population)


#Hypotheses
#H0: the proportion of people whose education level is over 11 is the same in 1972 and 2012
#HA: the proportion of people whose education level is over 11 is different in 1972 and 2012



#First, let's build two tables of education level, one for 1972 and one for 2012

educ1972 <- gss %>% 
  filter(year==1972) %>% 
  select(year, educ) %>% 
  drop_na()

educ2012 <- gss %>% 
  filter(year==2012) %>% 
  select(year, educ) %>% 
  drop_na()


#Then, let's draw two plots to see the distributions

ggplot(educ1972,aes(x=educ))+
  geom_density()

ggplot(educ2012,aes(x=educ))+
  geom_density()

#sample size/skew check: count the respondents with educ > 11 in each year
educ1972 %>% 
  filter(educ>11) %>% 
  count()

##     n
## 1 967

educ2012 %>% 
  filter(educ>11) %>% 
  count()

##      n
## 1 1654

#proportion 1972
967/1608

## [1] 0.6013682

#proportion 2012
1654/1972

## [1] 0.8387424

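#So we can summarize the results in a table:
# year  successes  n     p-hat
# 1972  967        1608  0.6014
# 2012  1654       1972  0.8387

#Successes in 1972: 967   Failures in 1972: 1608 - 967 = 641
#Successes in 2012: 1654  Failures in 2012: 1972 - 1654 = 318
#All four counts are at least 10, so the success-failure condition is met.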
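
The success and failure counts above can also be checked in code. This is a minimal sketch in base R that uses only the counts printed above rather than the gss data; the variable names (successes, n, p_pool, and so on) are illustrative. Because the hypothesis test assumes equal proportions under H0, the sketch also checks the expected counts under the pooled proportion.

```r
# Counts taken from the output above (not recomputed from gss)
successes <- c(`1972` = 967, `2012` = 1654)
n <- c(`1972` = 1608, `2012` = 1972)
failures <- n - successes

# Sample proportions for each year
p_hat <- successes / n

# Success-failure condition on the observed counts
observed_ok <- all(successes >= 10, failures >= 10)

# Same check using the pooled proportion, as assumed under H0
p_pool <- sum(successes) / sum(n)
pooled_ok <- all(n * p_pool >= 10, n * (1 - p_pool) >= 10)

round(p_hat, 4)
failures
c(observed_ok = observed_ok, pooled_ok = pooled_ok)
```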

Part 4: Inference

#We need to create a new categorical variable to separate people whose education level is over 11 from those whose level is not

educ <- gss%>% 
  mutate(EducLev=ifelse(educ>11, "HighSchoolgrad", "NotHighSchoolgrad")) %>%
  filter(year %in% c(1972,2012)) %>% 
  select(year, educ,EducLev) %>% 
  drop_na()
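
The inference() call in this part warns that the numeric year variable is converted to categorical and suggests as.factor(). A minimal sketch of the same recoding step with year stored as a factor is shown here. It uses base R and a small made-up data frame (toy is hypothetical, not the GSS data), so it runs without gss.Rdata.

```r
# Hypothetical toy data standing in for gss (values are made up)
toy <- data.frame(
  year = c(1972, 1972, 1972, 2012, 2012, 2012),
  educ = c(8, 12, NA, 16, 11, 14)
)

# Drop rows with missing educ, then recode educ into two levels
toy <- toy[!is.na(toy$educ), ]
toy$EducLev <- ifelse(toy$educ > 11, "HighSchoolgrad", "NotHighSchoolgrad")

# Store year as a factor so it is treated as a grouping variable
toy$year <- as.factor(toy$year)

# Cross-tabulate year against the new categorical variable
table(toy$year, toy$EducLev)
```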


#so let's use the "inference" function to do the hypothesis test



inference(y = EducLev, x = year,
          data = educ,
          statistic = "proportion",
          type = "ht",
          method = "theoretical",
          success = "HighSchoolgrad",
          null = 0,
          alternative = "twosided")

## Warning: Explanatory variable was numerical, it has been converted
##               to categorical. In order to avoid this warning, first convert
##               your explanatory variable to a categorical variable using the
##               as.factor() function

## Response variable: categorical (2 levels, success: HighSchoolgrad)
## Explanatory variable: categorical (2 levels) 
## n_1972 = 1608, p_hat_1972 = 0.6014
## n_2012 = 1972, p_hat_2012 = 0.8387
## H0: p_1972 =  p_2012
## HA: p_1972 != p_2012
## z = -15.9525
## p_value = < 0.0001
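
The z statistic reported above can be reproduced by hand from the counts. This is a sketch under the usual assumption for a two-proportion test, namely that the standard error uses the pooled proportion under H0. It uses base R only, and the variable names are illustrative. prop.test() without continuity correction is included as a cross-check, since its chi-squared statistic equals the square of the pooled z.

```r
x <- c(967, 1654)   # respondents with educ > 11 in 1972 and 2012
n <- c(1608, 1972)  # sample sizes in 1972 and 2012

p_hat <- x / n
p_pool <- sum(x) / sum(n)  # pooled proportion under H0

# Pooled standard error of the difference in proportions
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# Test statistic (1972 minus 2012) and two-sided p-value
z <- (p_hat[1] - p_hat[2]) / se_pool
p_value <- 2 * pnorm(-abs(z))

# Cross-check: chi-squared from prop.test() is z squared
chi <- prop.test(x, n, correct = FALSE)

c(z = z, z_from_prop_test = -sqrt(unname(chi$statistic)))
p_value
```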

#Because the p-value is very small (< 0.0001), we reject the null hypothesis.
#The data provide strong evidence for HA: the proportion of people whose
#education level is over 11 in 2012 (0.8387 in the sample)
#is different from the proportion in 1972 (0.6014 in the sample).
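
The research question asks for a 95% confidence interval, so a rough sketch of that interval follows. It is computed in base R from the counts reported above rather than from the gss data. Unlike the hypothesis test, the interval uses the unpooled standard error, because no equality of proportions is assumed. The variable names are illustrative.

```r
x <- c(967, 1654)   # respondents with educ > 11 in 1972 and 2012
n <- c(1608, 1972)  # sample sizes in 1972 and 2012

p_hat <- x / n
diff <- p_hat[2] - p_hat[1]  # 2012 minus 1972

# Unpooled standard error of the difference
se <- sqrt(sum(p_hat * (1 - p_hat) / n))

# Critical value for 95% confidence
z_star <- qnorm(0.975)

ci <- diff + c(-1, 1) * z_star * se
round(c(estimate = diff, lower = ci[1], upper = ci[2]), 4)
```

An interval that lies entirely above 0 agrees with the test result: the proportion is higher in 2012 than in 1972.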