Statistical inference with the GSS data

Inferential Statistics - Coursera Project

Truc Vo

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

The observations were collected through multi-stage sampling, with quotas based on sex, age, and employment status. To reduce non-response bias, interviews were conducted after 3 PM on weekdays or on weekends and holidays. Because the study uses random sampling, the results are generalizable to the US population aged 18 and older. However, because it is an observational study with no random assignment, we cannot draw causal conclusions. Any sampling biases are likely to be small relative to the precision of the measuring instrument and the decisions that are to be made.

Part 2: Research question

America is a meritocracy, and many people believe that a quality education can improve one’s life. Since 1980, women have surpassed men in the number of bachelor’s degrees earned. With college-educated women now making up a higher share of the American workforce, my question explores whether sex and confidence in America's educational institutions are independent or associated.

In other words, does confidence in educational institutions vary by sex?

Part 3: Exploratory data analysis

This question involves two variables: “sex” and “coneduc” (confidence in education). We exclude observations marked “NA” in the GSS Codebook for either of these variables.

#filtering data
gss_na <- gss %>%
  filter(!is.na(sex), !is.na(coneduc)) %>%
  select(sex, coneduc)

#table of counts
tbl <- table(gss_na$sex, gss_na$coneduc)
tbl

##         
##          A Great Deal Only Some Hardly Any
##   Male           5087      9346       2415
##   Female         6605     11976       2793

The total number of male responses from our data is 16848. The total number of female responses from our data is 21374. The total number of responses from our data is 38222.

More women than men were sampled, so the raw counts are not directly comparable. If sex and confidence in education were independent, we would expect men and women to be spread across the three confidence levels in similar proportions. The observed counts suggest a small difference, but we cannot determine whether it is statistically significant without conducting a chi-square independence test.
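Since the comparison above is really about proportions within each sex, a short sketch of the row proportions may help. The `counts` matrix here is an assumption for illustration: it re-enters the counts printed in the table above by hand, so the block runs without `gss.Rdata`.

```r
# Observed counts copied by hand from the printed table (illustrative stand-in for `tbl`)
counts <- matrix(
  c(5087,  9346, 2415,
    6605, 11976, 2793),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex     = c("Male", "Female"),
    coneduc = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Share of each confidence level within each sex (each row sums to 1)
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

With these counts, roughly 14.3% of men and 13.1% of women answered "Hardly Any", which is the largest gap among the three levels.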

#visual portrayal of observed counts
ggplot(data = gss_na, aes(x = sex, fill = coneduc)) + geom_bar(position=position_dodge())

This bar plot offers a visual representation of the observed counts. Based on this plot, the distribution of responses appears relatively similar for both sexes. To determine whether sex and confidence in education are independent, or whether there is a statistically significant difference in confidence between the sexes, we need to conduct a chi-square independence test.

Part 4: Inference

Step 1: State the hypotheses

Null Hypothesis: Sex and confidence in educational institutions are independent. Confidence in educational institutions does not vary by sex.

Alternative Hypothesis: Sex and confidence in educational institutions are associated. Confidence in educational institutions varies by sex.

Step 2: Check conditions

Independence: We can assume independence within groups because the respondents were randomly selected. Also, 16,848 is less than 10% of the total male population in the US and 21,374 is less than 10% of the total female population in the US. Each respondent contributes to only one cell of the table.

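Sample Size: The chi-square test requires at least 5 expected cases in each cell. The smallest observed count in our table is 2,415, and the expected counts are likewise far above 5, so the sample size condition is satisfied.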
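The sample size condition can be checked directly on the expected counts. The sketch below assumes the same hand-entered `counts` matrix as a stand-in for `tbl`, and computes each expected count as (row total × column total) / grand total.

```r
# Observed counts copied by hand from the printed table (illustrative stand-in for `tbl`)
counts <- matrix(
  c(5087,  9346, 2415,
    6605, 11976, 2793),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex     = c("Male", "Female"),
    coneduc = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 1)

# Sample size condition: every expected count should be at least 5
all(expected >= 5)
```

The same matrix is also available from the test object as `chisq.test(counts)$expected`, which is a convenient cross-check.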

Step 3: State the method(s) to be used and why and how

As we are dealing with two categorical variables, we will be using the chi-square independence test. The chi-square independence test, like the chi-square goodness of fit test, evaluates how different observed counts are from the expected counts. As we are doing a chi-square test, there is no associated confidence interval.

We will reject the null hypothesis if our p-value is less than our significance level (alpha-value of .05).

Step 4: Perform inference

chisq.test(tbl)

## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 13.168, df = 2, p-value = 0.001382

The degrees of freedom are found by (number of rows - 1)(number of columns - 1). With 2 rows (sex) and 3 columns (confidence levels), this gives (2 - 1)(3 - 1) = 2.
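To make the test output less of a black box, the statistic and p-value can be reproduced by hand. This is a minimal sketch, again assuming a hand-entered `counts` matrix in place of `tbl`. It sums (observed - expected)^2 / expected over all cells and looks up the upper tail of the chi-square distribution with `pchisq`.

```r
# Observed counts copied by hand from the printed table (illustrative stand-in for `tbl`)
counts <- matrix(
  c(5087,  9346, 2415,
    6605, 11976, 2793),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex     = c("Male", "Female"),
    coneduc = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# p-value: upper-tail probability of the chi-square distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

c(statistic = x2, df = df, p_value = p_value)
```

These values should agree with the `chisq.test(tbl)` output above up to rounding. Most of the statistic comes from the "Hardly Any" column, where the observed and expected counts differ the most.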

Step 5: Interpret results

The test yields a p-value of approximately 0.001. Because the p-value is less than our significance level of 0.05, we reject the null hypothesis (sex and confidence in educational institutions are independent) in favor of the alternative hypothesis (sex and confidence in educational institutions are associated). The data provide convincing evidence that sex and confidence in educational institutions are associated. Because this is an observational study, the result shows an association, not a causal effect of sex on confidence.