Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(reshape)

Load data

load("gss.Rdata")

Part 1: Data

The General Social Survey (GSS) is a survey that has been collecting observational data every year since 1972. This particular file is a subset of that data that has been previously cleaned. Missing values have been already removed and factor variables have been defined.

Some potential sources of bias…
1) The missing values may have had valuable information as a group, and thus we may be introducing some confounding variables.
2) This study was collected online, in person, and on the phone. While this is an extensive effort, there still exist people who simply will not participate in surveys. Their absence from this study may also be a confounding factor.

Part 2: Research question

Considering only the last 5 years, does there appear to be a relationship between one’s opinion of how people “get ahead” and whether they believe the government should do more?

It seems logical that there may be a link between whether someone believes a person can change their own lot in life and whether the government should become more involved in solving our country’s problems. Limiting to the last 5 years in order to capture a relatively current picture.

Part 3: Exploratory data analysis

Filter to the years in question and remove NA’s…

df <- gss %>% filter(year>=2008 & year<=2012 & !is.na(helpnot) & !is.na(getahead))

See if, for the years captured, there appears to be any trend in opinions on how people “get ahead”. A strong change year-over-year is not desirable, as we’re trying to capture the “current” opinion.

ggplot(data=df,aes(x=as.factor(year))) +
  geom_bar() +
  facet_grid(.~getahead) +
  labs(x="Years",y="Counts")

There appears to be a very slight trend upwards for attributing success to hard work, but should be suitable for analysis.

See if, for the years captured, there appears to be any trend in opinions on whether people think the government should do more. A strong change, again, is not desireable.

ggplot(data=df,aes(x=as.factor(year))) +
  geom_bar() +
  facet_grid(.~helpnot) +
  labs(x="Years",y="Counts")

Not much of a change year-over-year, so we will proceed.

Before diving into the inference calculations, let’s take a look at the proportions of the sample accross the two variables. We’ll start with counts.

counts <- cast(df,helpnot~getahead,margins=c("grand_col","grand_row"))
counts

##              helpnot Hard Work Both Equally Luck Or Help (all)
## 1       Govt Do More       203           63           31   297
## 2    Agree With Both       507          165           84   756
## 3 Govt Does Too Much       244           62           31   337
## 4              (all)       954          290          146  1390

Below is the same table above, but displaying the proportions, instead of counts.

actProp <- counts[1:4,2:5]
actProp <- data.frame(sapply(actProp,FUN=function(x){round(x/max(x),3)}))
actProp <- cbind(counts[1],actProp)
colnames(actProp) <- c(colnames(actProp[1:4]),"All")
actProp

##              helpnot Hard.Work Both.Equally Luck.Or.Help   All
## 1       Govt Do More     0.213        0.217        0.212 0.214
## 2    Agree With Both     0.531        0.569        0.575 0.544
## 3 Govt Does Too Much     0.256        0.214        0.212 0.242
## 4              (all)     1.000        1.000        1.000 1.000

Part 4: Inference

Now that we have the actual proportions for the two variables, let’s get the expected proportions if the opinion on Govt Help were not influenced by opinion of hard work leading to success.

Hypothesis…

H0: Opinion on Govt Help is not influenced by opinion of hard work leading to success

HA: Opinion on Govt Help is influenced by opinion of hard work leading to success

Check Conditions

This data was collected as part of a random survey and includes less than 10% of the population. Thus, it is suitable for statistical inference, but cannot determine causality.

Method

We will use a Chi-Squared Independence test to see whether any trends are due to chance, or whether a trend is statistically significant.

First we’ll start by getting expected counts…

expCounts <- counts
for(i in 1:3){
  for(j in 2:4){
    expCounts[i,j] <- round((counts[i,5]*counts[4,j])/counts[4,5],0)
  }
}
#Make up for rounding error
expCounts[3,4] <- expCounts[3,4]+1
expCounts

##              helpnot Hard Work Both Equally Luck Or Help (all)
## 1       Govt Do More       204           62           31   297
## 2    Agree With Both       519          158           79   756
## 3 Govt Does Too Much       231           70           36   337
## 4              (all)       954          290          146  1390

Now we’ll show to expected proportions…

expProp <- expCounts[1:4,2:5]
expProp <- data.frame(sapply(expProp,FUN=function(x){round(x/max(x),3)}))
expProp <- cbind(expCounts[1],expProp)
colnames(expProp) <- c(colnames(expProp[1:4]),"All")
expProp

##              helpnot Hard.Work Both.Equally Luck.Or.Help   All
## 1       Govt Do More     0.214        0.214        0.212 0.214
## 2    Agree With Both     0.544        0.545        0.541 0.544
## 3 Govt Does Too Much     0.242        0.241        0.247 0.242
## 4              (all)     1.000        1.000        1.000 1.000

Let’s move to getting the Chi-Squared statistic…

chiSqStat <- 0
for(i in 1:3){
  for(j in 2:4){
    chiSqStat <- chiSqStat +    
      ((counts[i,j]-expCounts[i,j])^2)/expCounts[i,j]
  }
}
chiSqStat

## [1] 3.265402

The formula for degrees of freedom for the Chi-Squared Distribution is…
(# of Rows - 1) * (# of Columns - 1)

For our data, this comes to (3-1) * (3-1) = 4.

Let’s plot the Chi-Squared Distribution and see where our test statistic lands…

ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
        stat_function(fun = dchisq, args = list(df = 4)) +
        stat_function(fun = dchisq, 
                      args = list(df = 4),
                      xlim = c(chiSqStat,20),
                      geom = "area",
                      alpha=0.2) +
        geom_vline(xintercept = chiSqStat,col="red") +
        labs(x = "Chi-Squared",y="p")

The shaded area is the probability of a Chi-Squared statistic if there were nothing going on.

Calculate the probability under the Chi-Squared Distribution…

pchisq(chiSqStat,df=4,lower.tail=FALSE)

## [1] 0.5144327

Conclusion

At a 5% significance level, we fail to reject the null hypothesis that states one’s opinion on Govt Help is not correlated with one’s opinion of hard work/luck leading to success.

Statistical Inference with GSS data

Zack Burch

2/9/2018