Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(tidyverse)

Load data

load("gss.Rdata")

Part 1: Data

The data used in this project come from the GSS (General Social Survey), which NORC (National Opinion Research Center) at the University of Chicago has conducted since 1972. The GSS records contemporary American society, including social trends, constants in attitudes, behaviors, and attributes, as well as the structure and functioning of society. It covers topics such as national spending priorities, marijuana use, crime, intergroup relations, social and economic life, lifestyle, civil liberties, subjective well-being, and confidence in institutions.

The gss data set has 57061 observations and 114 columns. Because respondents were selected by random sampling, results can be generalized to the population of the country. Some bias may remain, however: the survey was not run in every year since 1972, and many entries are missing for particular years and columns. Because the survey is observational, with no random assignment, causal inferences cannot be drawn from the data.

Part 2: Research question

In the following analysis, I look at how U.S. white males’ views on LGBTQ changed between 2002 and 2012, measured by the proportion answering ‘Not Wrong At All’. Here ‘Not Wrong At All’ is defined as a success and ‘Always Wrong’ as a failure (for clarity, only these two “extreme” opinions are considered).

The variables I will be using are race, sex, year, homosex.

summary(gss$race)

## White Black Other 
## 46350  7926  2785

summary(gss$sex)

##   Male Female 
##  25146  31915

summary(gss$homosex)

##     Always Wrong Almst Always Wrg  Sometimes Wrong Not Wrong At All 
##            21601             1581             2243             7282 
##            Other             NA's 
##               82            24272

summary(gss$year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1972    1983    1993    1992    2002    2012

Part 3: Exploratory data analysis

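Filter the data to white males surveyed in 2002 or 2012 who answered ‘Always Wrong’ or ‘Not Wrong At All’. Note that the comparison year == c('2002', '2012') recycles the two-element vector across rows instead of testing membership, which triggers the warning below and keeps only part of each year’s rows; the counts that follow reflect the filter as written: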

W_Male <- gss %>% 
  filter(race == 'White', sex == 'Male', year == c('2002', '2012'), 
         !is.na(homosex), stringr::str_detect(homosex, 'Always Wrong|Not Wrong At All'))

## Warning in year == c("2002", "2012"): longer object length is not a multiple of
## shorter object length
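
The warning above arises because `==` compares element by element and recycles the shorter vector, whereas `%in%` tests membership. A minimal base-R sketch on a made-up `year` vector (hypothetical values, not GSS data) illustrates the difference:

```r
# Hypothetical toy vector, for illustration only (not taken from gss)
year <- c(2002, 2002, 2012, 2012, 1998, 2002)

# `==` recycles c(2002, 2012) as 2002, 2012, 2002, 2012, ... across positions,
# so a row matches only when its year lines up with the recycled value
recycled <- year == c(2002, 2012)

# `%in%` asks, for each row, whether its year is either 2002 or 2012
member <- year %in% c(2002, 2012)

sum(recycled)  # rows kept by the recycled comparison
sum(member)    # rows kept by the membership test
```

In the `filter()` call this would correspond to writing `year %in% c(2002, 2012)`. The counts reported in the rest of this section would then change, since they were produced with the original comparison.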

Summarize W_Male:

W_Male %>% 
  group_by(year) %>% 
  summarise(count = n())

## # A tibble: 2 x 2
##    year count
##   <int> <int>
## 1  2002   189
## 2  2012   172

W_Male %>% 
  group_by(homosex) %>% 
  summarise(count = n()) %>% 
  mutate(ratio = count / sum(count) * 100)  # percentage share of each answer

## # A tibble: 2 x 3
##   homosex          count ratio
##   <fct>            <int> <dbl>
## 1 Always Wrong       207  57.3
## 2 Not Wrong At All   154  42.7

Part 4: Inference

State hypothesis:

Null hypothesis: the proportion of U.S. white males answering ‘Not Wrong At All’ in 2002 is the same as in 2012: \[p_{2002} = p_{2012}\]

Alternative hypothesis: the proportion of U.S. white males answering ‘Not Wrong At All’ in 2002 is not the same as in 2012: \[p_{2002} \ne p_{2012}\]

Check Conditions:

Independence:

within groups: the 2002 and 2012 samples contain 189 and 172 respondents, each well under 10% of the corresponding population, and respondents were randomly sampled;
between groups: the two samples come from separate survey years and are not paired, so independence between groups also holds.

Sample size/skew: each year has more than 10 successes and 10 failures, so the sampling distribution of the difference in proportions is approximately normal.
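
For a hypothesis test, the success-failure condition is usually checked with the pooled proportion rather than the observed counts. The following is a sketch of that check in base R, using only the counts reported in Part 3 (189 and 172 respondents, 154 ‘Not Wrong At All’ answers overall); the variable names are illustrative:

```r
# Counts reported in Part 3
n <- c(`2002` = 189, `2012` = 172)   # respondents per year
successes_total <- 154               # "Not Wrong At All" across both years

# Pooled proportion under H0: p_2002 = p_2012
p_pool <- successes_total / sum(n)

# Expected successes and failures in each year under H0
expected <- rbind(success = n * p_pool,
                  failure = n * (1 - p_pool))
round(expected, 1)

# Success-failure condition: every cell should be at least 10
all(expected >= 10)
```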

Inference:

inference(y = homosex, x = year, data = W_Male, statistic = 'proportion', type = 'ht',
          method = 'theoretical', success = 'Not Wrong At All', null = 0, alternative = 'twosided')

## Warning: Explanatory variable was numerical, it has been converted
##               to categorical. In order to avoid this warning, first convert
##               your explanatory variable to a categorical variable using the
##               as.factor() function

## Warning: Ignoring null value since it's undefined for chi-square test of
## independence

## Response variable: categorical (2 levels, success: Not Wrong At All)
## Explanatory variable: categorical (5 levels) 
## n_2002 = 189, p_hat_2002 = 0.3915
## n_2012 = 172, p_hat_2012 = 0.4651
## H0: p_2002 =  p_2012
## HA: p_2002 != p_2012
## z = -1.4118
## p_value = 0.158
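
The z statistic and p-value above can be reproduced by hand. The sketch below assumes the success counts can be recovered by rounding n times p_hat from the output (the printed proportions are rounded), and cross-checks the result against base R's `prop.test()`:

```r
# Sample sizes and sample proportions from the inference() output
n     <- c(189, 172)
p_hat <- c(0.3915, 0.4651)

# Recover the success counts (the printed proportions are rounded)
x <- round(n * p_hat)          # 74 and 80

# Pooled proportion and standard error under H0
p_pool <- sum(x) / sum(n)
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# Test statistic and two-sided p-value
z       <- (x[1] / n[1] - x[2] / n[2]) / se
p_value <- 2 * pnorm(-abs(z))
c(z = z, p_value = p_value)

# Cross-check: the chi-square statistic from prop.test() equals z^2
chi <- prop.test(x, n, correct = FALSE)
unname(chi$statistic) - z^2    # should be ~0
```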

The p-value (0.158) is greater than \(\alpha\) (0.05), therefore we fail to reject the null hypothesis. The data do not provide convincing evidence that the proportion of U.S. white males answering ‘Not Wrong At All’ differed between 2002 and 2012.

Chi-Square test of independence

The following is a quick analysis of using Chi-Square test of independence:

State hypothesis:
- Null hypothesis: survey year and U.S. white males’ views on LGBTQ are independent
- Alternative hypothesis: survey year and U.S. white males’ views on LGBTQ are dependent

Check conditions: independence was addressed above, and every expected count in the table further down is well above 5.

Inference:

Filter data:

gss_W_Male <- gss %>% 
  filter(race == 'White', sex == 'Male', !is.na(homosex), 
         stringr::str_detect(homosex, 'Always Wrong|Not Wrong At All'))

Plot:

gss_W_Male %>% 
  ggplot(aes(x = year, fill = homosex)) +
  geom_bar() +
  labs(title = "U.S. white males' views on LGBTQ 1972 - 2012",
       x = 'Year',
       y = 'Count and Ratio',
       caption = paste0("Data Source: GSS"),
       fill = 'Views')

We can see that although the proportion of “Always Wrong” has always been high, it has decreased a lot since the 1970s, and by 2012 its share almost reaches the same level as “Not Wrong At All”.

chisq.test(gss_W_Male$year, gss_W_Male$homosex)$expected

##                gss_W_Male$homosex
## gss_W_Male$year Always Wrong Not Wrong At All
##            1973     367.9182        122.08175
##            1974     370.9217        123.07834
##            1976     363.4131        120.58687
##            1977     374.6759        124.32407
##            1980     379.1811        125.81895
##            1982     358.9080        119.09200
##            1984     330.3756        109.62443
##            1985     388.9421        129.05785
##            1987     349.1469        115.85309
##            1988     220.7509         73.24905
##            1989     256.7919         85.20808
##            1990     219.2492         72.75076
##            1991     244.7783         81.22174
##            1993     252.2868         83.71320
##            1994     470.7852        156.21481
##            1996     439.2493        145.75066
##            1998     392.6964        130.30358
##            2000     403.2084        133.79164
##            2002     261.2970         86.70296
##            2004     212.4916         70.50844
##            2006     391.1947        129.80529
##            2008     307.8500        102.15004
##            2010     288.3278         95.67223
##            2012     272.5598         90.44016
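
Each expected count is the row total times the column total divided by the grand total. As a small sketch, the same rule can be applied to the 2 x 2 table of 2002/2012 counts from the earlier sections (189 and 172 respondents; 74 and 80 ‘Not Wrong At All’ answers, recovered from the reported proportions); the object names are illustrative:

```r
# 2 x 2 table of the 2002/2012 counts from Part 3 and Part 4
# (189 and 172 respondents; 74 and 80 answered "Not Wrong At All")
tab <- matrix(c(115, 74,
                 92, 80),
              nrow = 2, byrow = TRUE,
              dimnames = list(year = c("2002", "2012"),
                              homosex = c("Always Wrong", "Not Wrong At All")))

# Expected count = row total * column total / grand total
expected_manual <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Same numbers as chisq.test() reports
expected_r <- chisq.test(tab, correct = FALSE)$expected
max(abs(expected_manual - expected_r))   # should be ~0
```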

Degrees of freedom: \[df = (R-1) \times (C-1)\]

(24-1)*(2-1)

## [1] 23

chisq.test(gss_W_Male$year, gss_W_Male$homosex)

## 
##  Pearson's Chi-squared test
## 
## data:  gss_W_Male$year and gss_W_Male$homosex
## X-squared = 636.53, df = 23, p-value < 2.2e-16
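
The reported p-value follows from the upper tail of the chi-square distribution. A short sketch using only the statistic and degrees of freedom printed above:

```r
# Values reported by chisq.test() above
x_squared <- 636.53
df        <- (24 - 1) * (2 - 1)   # 24 survey years, 2 response levels

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
p_value

# R prints very small p-values as "< 2.2e-16"
p_value < 2.2e-16
```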

The p-value is almost 0, therefore we reject the null hypothesis. Hence we have convincing evidence that U.S. white males’ views and the survey year are dependent.