library(ggplot2)
library(dplyr)
library(statsr)
library(reshape)load("gss.Rdata")The General Social Survey (GSS) is a survey that has been collecting observational data every year since 1972. This particular file is a subset of that data that has been previously cleaned. Missing values have been already removed and factor variables have been defined.
Some potential sources of bias…
1) The missing values may have had valuable information as a group, and thus we may be introducing some confounding variables.
2) This study was collected online, in person, and on the phone. While this is an extensive effort, there still exist people who simply will not participate in surveys. Their absence from this study may also be a confounding factor.
Considering only the last 5 years, does there appear to be a relationship between one’s opinion of how people “get ahead” and whether they believe the government should do more?
It seems logical that there may be a link between whether someone believes a person can change their own lot in life and whether the government should become more involved in solving our country’s problems. Limiting to the last 5 years in order to capture a relatively current picture.
Filter to the years in question and remove NA’s…
df <- gss %>% filter(year>=2008 & year<=2012 & !is.na(helpnot) & !is.na(getahead))See if, for the years captured, there appears to be any trend in opinions on how people “get ahead”. A strong change year-over-year is not desirable, as we’re trying to capture the “current” opinion.
ggplot(data=df,aes(x=as.factor(year))) +
geom_bar() +
facet_grid(.~getahead) +
labs(x="Years",y="Counts")There appears to be a very slight trend upwards for attributing success to hard work, but should be suitable for analysis.
See if, for the years captured, there appears to be any trend in opinions on whether people think the government should do more. A strong change, again, is not desireable.
ggplot(data=df,aes(x=as.factor(year))) +
geom_bar() +
facet_grid(.~helpnot) +
labs(x="Years",y="Counts")Not much of a change year-over-year, so we will proceed.
Before diving into the inference calculations, let’s take a look at the proportions of the sample accross the two variables. We’ll start with counts.
counts <- cast(df,helpnot~getahead,margins=c("grand_col","grand_row"))
counts## helpnot Hard Work Both Equally Luck Or Help (all)
## 1 Govt Do More 203 63 31 297
## 2 Agree With Both 507 165 84 756
## 3 Govt Does Too Much 244 62 31 337
## 4 (all) 954 290 146 1390
Below is the same table above, but displaying the proportions, instead of counts.
actProp <- counts[1:4,2:5]
actProp <- data.frame(sapply(actProp,FUN=function(x){round(x/max(x),3)}))
actProp <- cbind(counts[1],actProp)
colnames(actProp) <- c(colnames(actProp[1:4]),"All")
actProp## helpnot Hard.Work Both.Equally Luck.Or.Help All
## 1 Govt Do More 0.213 0.217 0.212 0.214
## 2 Agree With Both 0.531 0.569 0.575 0.544
## 3 Govt Does Too Much 0.256 0.214 0.212 0.242
## 4 (all) 1.000 1.000 1.000 1.000
Now that we have the actual proportions for the two variables, let’s get the expected proportions if the opinion on Govt Help were not influenced by opinion of hard work leading to success.
H0: Opinion on Govt Help is not influenced by opinion of hard work leading to success
HA: Opinion on Govt Help is influenced by opinion of hard work leading to success
This data was collected as part of a random survey and includes less than 10% of the population. Thus, it is suitable for statistical inference, but cannot determine causality.
We will use a Chi-Squared Independence test to see whether any trends are due to chance, or whether a trend is statistically significant.
First we’ll start by getting expected counts…
expCounts <- counts
for(i in 1:3){
for(j in 2:4){
expCounts[i,j] <- round((counts[i,5]*counts[4,j])/counts[4,5],0)
}
}
#Make up for rounding error
expCounts[3,4] <- expCounts[3,4]+1
expCounts## helpnot Hard Work Both Equally Luck Or Help (all)
## 1 Govt Do More 204 62 31 297
## 2 Agree With Both 519 158 79 756
## 3 Govt Does Too Much 231 70 36 337
## 4 (all) 954 290 146 1390
Now we’ll show to expected proportions…
expProp <- expCounts[1:4,2:5]
expProp <- data.frame(sapply(expProp,FUN=function(x){round(x/max(x),3)}))
expProp <- cbind(expCounts[1],expProp)
colnames(expProp) <- c(colnames(expProp[1:4]),"All")
expProp## helpnot Hard.Work Both.Equally Luck.Or.Help All
## 1 Govt Do More 0.214 0.214 0.212 0.214
## 2 Agree With Both 0.544 0.545 0.541 0.544
## 3 Govt Does Too Much 0.242 0.241 0.247 0.242
## 4 (all) 1.000 1.000 1.000 1.000
Let’s move to getting the Chi-Squared statistic…
chiSqStat <- 0
for(i in 1:3){
for(j in 2:4){
chiSqStat <- chiSqStat +
((counts[i,j]-expCounts[i,j])^2)/expCounts[i,j]
}
}
chiSqStat## [1] 3.265402
The formula for degrees of freedom for the Chi-Squared Distribution is…
(# of Rows - 1) * (# of Columns - 1)
For our data, this comes to (3-1) * (3-1) = 4.
Let’s plot the Chi-Squared Distribution and see where our test statistic lands…
ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
stat_function(fun = dchisq, args = list(df = 4)) +
stat_function(fun = dchisq,
args = list(df = 4),
xlim = c(chiSqStat,20),
geom = "area",
alpha=0.2) +
geom_vline(xintercept = chiSqStat,col="red") +
labs(x = "Chi-Squared",y="p")The shaded area is the probability of a Chi-Squared statistic if there were nothing going on.
Calculate the probability under the Chi-Squared Distribution…
pchisq(chiSqStat,df=4,lower.tail=FALSE)## [1] 0.5144327
At a 5% significance level, we fail to reject the null hypothesis that states one’s opinion on Govt Help is not correlated with one’s opinion of hard work/luck leading to success.