Data Analysis and Statistical Inference on GSS data set

Does people’s opinion of how to get ahead affect their education level and family income?

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.

In this report, the cases are respondents who completed the survey questions. The variables I used to answer the question are:

getahead: Opinion of how people get ahead. This is a categorical variable.

degree: Respondents’ highest degree. This is an ordinal categorical variable.

coninc: Total family income in constant dollars. This is a continuous numerical variable.

This is an observational study, because there is no random assignment. The population of interest is all residents of the United States. Since this is only an observational study, I don’t think its conclusions can be generalized to the whole population, though they might generalize to some subsections of it.

People’s opinions are often influenced by the environment they grew up in or the atmosphere of the industry they work in. These two factors are potential sources of bias that could limit the generalizability of the findings in this project. There are also many missing values in the getahead variable, which represents people’s opinion of how to get ahead, and the results could be influenced by these missing values.

Last but not least, because this is an observational study with no random assignment, no causal relationship can be established.

load(url("http://bit.ly/dasi_gss_data"))

DATA PROCESSING :

data <- gss[, c("getahead", "degree", "coninc")]
summary(data)

##          getahead                degree          coninc      
##  Hard Work   :23022   Lt High School:11822   Min.   :   383  
##  Both Equally: 7834   High School   :29287   1st Qu.: 18445  
##  Luck Or Help: 4085   Junior College: 3070   Median : 35602  
##  Other       :   36   Bachelor      : 8002   Mean   : 44503  
##  NA's        :22084   Graduate      : 3870   3rd Qu.: 59542  
##                       NA's          : 1010   Max.   :180386  
##                                              NA's   :5829

From the summary statistics, we can see that there are missing values for each of the three variables. This should certainly affect the results of the analysis. So before diving into the analysis and inference, we need to deal with the missing data.

The degree variable, the education level, seems to be the easiest one to deal with, because it has the fewest missing values, only 1010. From the summary statistics, we can see that most of the respondents hold a high school degree. So we can replace the missing values with High School directly.

The coninc variable stands for the family income of the respondents. We expect family income to increase with education level. So we will compute the median family income for each degree level and use it to fill in the NA’s associated with that degree.

The getahead variable is a little tricky. It represents people’s opinion of how to get ahead. There are numerous missing values in the data set (NA’s: 22084), almost as many as the first level (Hard Work: 23022). With this many missing values, they cannot simply be replaced with a single value. It is also the explanatory variable here, so we cannot fill in the NA’s based on the other two response variables, because that would build in a correlation between them. So I decided to drop the rows with a missing getahead and use the known data only. Fortunately, the remaining sample is still large (34977 respondents).

# deal with missing values
# 1. degree
data[is.na(data$degree), "degree"] <- "High School"
# 2. coninc
cnnc <- subset(data, !is.na(data$coninc))
agg.cnnc <- aggregate(cnnc$coninc, by=list(cnnc$degree), FUN=median)
colnames(agg.cnnc) <- c("degree", "coninc.median")
for(i in agg.cnnc$degree) {
        data[is.na(data$coninc)&data$degree == i, "coninc"] <- agg.cnnc[agg.cnnc$degree == i, "coninc.median"]
}
# 3. get ahead
gtahd <- subset(data, !is.na(data$getahead))
summary(gtahd)

##          getahead                degree          coninc      
##  Hard Work   :23022   Lt High School: 7319   Min.   :   383  
##  Both Equally: 7834   High School   :18600   1st Qu.: 18519  
##  Luck Or Help: 4085   Junior College: 1844   Median : 34859  
##  Other       :   36   Bachelor      : 4813   Mean   : 43233  
##                       Graduate      : 2401   3rd Qu.: 56213  
##                                              Max.   :180386
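The group-median fill in step 2 can also be written without an explicit loop. The sketch below is only an illustration on a small made-up data frame: `toy`, `group.median` and `filled` are hypothetical names, and the values are not GSS data. It uses base R's `ave()` to line up each row with the median of its own degree group.

```r
# Toy data standing in for the GSS columns (values are made up for illustration)
toy <- data.frame(
  degree = factor(c("High School", "High School", "High School",
                    "Bachelor", "Bachelor", "Bachelor")),
  coninc = c(20000, 30000, NA, 50000, NA, 70000)
)

# Median of the observed incomes within each degree level,
# repeated so that it lines up row by row with the data
group.median <- ave(toy$coninc, toy$degree,
                    FUN = function(x) median(x, na.rm = TRUE))

# Replace only the missing incomes with their group's median
filled <- ifelse(is.na(toy$coninc), group.median, toy$coninc)
print(filled)
```

Whichever form is used, filling with a group median shrinks the spread of coninc within each degree level, which is worth keeping in mind when reading the later tests.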

EXPLORATORY DATA ANALYSIS :

par(mfrow = c(1, 3))
barplot(summary(gtahd$getahead), col = "green", main = "Respondents' Opinion of How to Get Ahead",
        xlab = "Opinion Options", ylab = "Counts of Respondents", ylim = c(0, 25000))
barplot(summary(gtahd$degree), col = "blue", main = "Respondents' Highest Degree",
        xlab = "Highest Degree", ylab = "Counts of Respondents", ylim = c(0, 20000))
hist(gtahd$coninc, col = "red", main = "Respondents' Family income", xlab = "Family Income in Dollars")

The three plots above are just another way to interpret the summary statistics. Now we can see clearly that:

Most of the respondents believe hard work is the most important way to get ahead. Most of the respondents hold a high school diploma. Family income is right skewed, which is understandable: the higher the family income, the fewer the people who earn it.

par(mfrow = c(1, 3))
# degree vs. getahead
plot(gtahd$degree ~ gtahd$getahead, main = "Highest Degree vs. Opinion Options",
     xlab = "Opinion Options", ylab = "Highest Degree")
# family income vs. getahead
plot(gtahd$coninc ~ gtahd$getahead, col = "green", main = "Family Income vs. Opinion Options",
     xlab = "Opinion Options", ylab = "Family Income in Dollars")
# family income vs. degree
plot(gtahd$coninc ~ gtahd$degree, col = "blue", main = "Family Income vs. Highest Degree",
     xlab = "Highest Degree", ylab = "Family Income in Dollars")

The three plots above show the relationship between each pair of variables.

The first one is degree vs. getahead. These are two categorical variables with multiple levels, so a spineplot is generated. To find out whether people’s opinion is related to their degree, we need to compare the distribution of degrees across the opinion groups, using a Chi-square test of independence.

The second one is family income (coninc) vs. getahead. From the boxplots we can see that the median family incomes of the different opinion groups are quite close to each other, but there are many outliers in each group, so we will focus on the mean instead of the median, using ANOVA to see whether people’s opinion has an impact on average family income.

The third one is family income coninc vs. degree. From the boxplots we can tell that the median of family income increases along with the education level.

INFERENCE :

# degree vs. getahead, ordinal regression
library(MASS, warn.conflicts = FALSE)
model1 <- polr(degree ~ getahead, data = gtahd)
sm1 <- summary(model1)

## 
## Re-fitting to get Hessian

sm1

## Call:
## polr(formula = degree ~ getahead, data = gtahd)
## 
## Coefficients:
##                         Value Std. Error t value
## getaheadBoth Equally  0.26153    0.02482 10.5373
## getaheadLuck Or Help -0.08546    0.03221 -2.6534
## getaheadOther         0.12232    0.32933  0.3714
## 
## Intercepts:
##                            Value    Std. Error t value 
## Lt High School|High School  -1.2852   0.0149   -86.2781
## High School|Junior College   1.1023   0.0143    76.8796
## Junior College|Bachelor      1.3994   0.0152    91.8430
## Bachelor|Graduate            2.6620   0.0225   118.1166
## 
## Residual Deviance: 89064.44 
## AIC: 89078.44

From the summary results we can see that the coefficient of Luck Or Help is negative, and the coefficients of Both Equally and Other are positive. This tells us that, compared with those who believe Hard Work is most important, the odds of holding a higher education degree are lower for the people who believe in pure Luck Or Help from others, but higher for the people who believe the two are Both Equally important or who have Other opinions. Note that the Other estimate has a large standard error (0.33) and a t value of only 0.37, so it is very imprecise.
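To make the model output more concrete, the following sketch turns the intercepts into predicted probabilities for the reference group (Hard Work). It assumes the usual proportional-odds form that `polr` fits, logit P(degree <= k) = zeta_k - eta, with eta = 0 for the reference level. The names `zeta`, `eta`, `cum.p` and `p.hard.work` are introduced here for illustration only.

```r
# Intercepts (cut points) copied from the polr summary above
zeta <- c("Lt High School|High School" = -1.2852,
          "High School|Junior College" =  1.1023,
          "Junior College|Bachelor"    =  1.3994,
          "Bachelor|Graduate"          =  2.6620)

# Linear predictor for the reference group (Hard Work) is 0
eta <- 0

# Cumulative probabilities P(degree <= k); successive differences give
# the probability of each individual degree level
cum.p <- plogis(zeta - eta)
p.hard.work <- diff(c(0, cum.p, 1))
names(p.hard.work) <- c("Lt High School", "High School", "Junior College",
                        "Bachelor", "Graduate")
print(round(p.hard.work, 3))

# Observed share of "Lt High School" among Hard Work respondents,
# using the counts from the contingency table in this report
print(round(4930 / 23022, 3))
```

Setting `eta` to one of the coefficients (for example 0.26153 for Both Equally) gives the predicted distribution for that group instead; a positive `eta` shifts probability toward the higher degrees.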

# Luck Or Help
round(exp(sm1$coefficients[2, 1]), 2)

## [1] 0.92

round(exp(confint(model1, "getaheadLuck Or Help", 0.95)), 2)

## Waiting for profiling to be done...

## 
## Re-fitting to get Hessian

##  2.5 % 97.5 % 
##   0.86   0.98

The odds of holding a higher education degree for the people who believe in pure Luck Or Help from others are 0.92 times the odds for those who believe Hard Work is most important, with a 95% confidence interval of [0.86, 0.98].

# Both Equally
round(exp(sm1$coefficients[1, 1]), 2)

## [1] 1.3

round(exp(confint(model1, "getaheadBoth Equally", 0.95)), 2)

## Waiting for profiling to be done...

## 
## Re-fitting to get Hessian

##  2.5 % 97.5 % 
##   1.24   1.36

The odds of holding a higher education degree for the people who believe the two are Both Equally important are 1.3 times the odds for those who believe Hard Work is most important, with a 95% confidence interval of [1.24, 1.36].

# Other
round(exp(sm1$coefficients[3, 1]), 2)

## [1] 1.13

round(exp(confint(model1, "getaheadOther", 0.95)), 2)

## Waiting for profiling to be done...

## 
## Re-fitting to get Hessian

##  2.5 % 97.5 % 
##   0.59   2.15

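The odds of holding a higher education degree for the people who have Other opinions are 1.13 times the odds for those who believe Hard Work is most important, with a 95% confidence interval of [0.59, 2.15]. Because this interval contains 1, the difference between the Other group and the Hard Work group is not statistically significant.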
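The three odds ratios can be reproduced by hand from the coefficient table. The sketch below is an approximation, not the method used above: `confint` on a `polr` fit profiles the likelihood, whereas this uses the simpler Wald form exp(estimate -/+ z * SE), so the last digit of an interval may differ slightly. The names `est`, `se`, `odds.ratios` and `contains.one` are illustrative.

```r
# Coefficients and standard errors copied from the polr summary above
est <- c("Both Equally" = 0.26153, "Luck Or Help" = -0.08546, "Other" = 0.12232)
se  <- c("Both Equally" = 0.02482, "Luck Or Help" =  0.03221, "Other" = 0.32933)

# Odds ratio = exp(coefficient); Wald 95% interval = exp(coefficient -/+ z * SE)
z <- qnorm(0.975)
odds.ratios <- cbind(OR    = exp(est),
                     lower = exp(est - z * se),
                     upper = exp(est + z * se))
print(round(odds.ratios, 2))

# An interval that contains 1 means the odds ratio is not
# distinguishable from "no difference" at the 5% level
contains.one <- odds.ratios[, "lower"] < 1 & odds.ratios[, "upper"] > 1
print(contains.one)
```

Collecting the three comparisons in one table also makes it easier to see that only the Other interval straddles 1.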

Besides the regression analysis, we can also run a Chi-square test of independence on these two categorical variables.

table(gtahd$getahead, gtahd$degree)

##               
##                Lt High School High School Junior College Bachelor Graduate
##   Hard Work              4930       12390           1221     3018     1463
##   Both Equally           1429        4025            425     1243      712
##   Luck Or Help            951        2169            198      544      223
##   Other                     9          16              0        8        3

chisq.test(gtahd$getahead, gtahd$degree)

## Warning in chisq.test(gtahd$getahead, gtahd$degree): Chi-squared
## approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  gtahd$getahead and gtahd$degree
## X-squared = 163.32, df = 12, p-value < 2.2e-16

In the Chi-square test, the null hypothesis is that people’s opinion of how to get ahead and their degree are independent (opinion has no effect on degree); the alternative hypothesis is that the two are associated.

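From the results above, we can see that the p-value is extremely small, which suggests we should reject the null hypothesis. The warning appears because the Other row contains only 36 respondents, so some of its expected counts fall below 5 and the Chi-square approximation is less reliable for those cells.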
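The expected counts behind the test can be inspected directly. The sketch below rebuilds the contingency table from the printed counts and looks at which cells are sparse. Rerunning the test without the Other row is one possible follow-up and is an assumption of this sketch, not part of the original analysis; `counts`, `test` and `test.no.other` are illustrative names.

```r
# Counts copied from the contingency table above
counts <- matrix(
  c(4930, 12390, 1221, 3018, 1463,
    1429,  4025,  425, 1243,  712,
     951,  2169,  198,  544,  223,
       9,    16,    0,    8,    3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    getahead = c("Hard Work", "Both Equally", "Luck Or Help", "Other"),
    degree   = c("Lt High School", "High School", "Junior College",
                 "Bachelor", "Graduate")
  )
)

# Same test as above, run on the table; the warning is suppressed here
# because the point is to inspect what triggers it
test <- suppressWarnings(chisq.test(counts))

# Expected counts under independence: (row total * column total) / n
print(round(test$expected, 1))

# Cells with an expected count below 5
print(which(test$expected < 5, arr.ind = TRUE))

# Possible follow-up (an assumption, not part of the original analysis):
# drop the sparse "Other" row and rerun the test on the 3 x 5 table
test.no.other <- chisq.test(counts[rownames(counts) != "Other", ])
print(test.no.other$parameter)  # degrees of freedom
```

With the Other row removed, every expected count is large, so the approximation concern no longer applies to that version of the test.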

family income vs. getahead

model2 <- lm(coninc ~ getahead, data = gtahd)
model2$call

## lm(formula = coninc ~ getahead, data = gtahd)

par(mfrow = c(1, 3))
qqnorm(model2$residuals,col = "green", pch = 20)
qqline(model2$residuals, col = "red", lty = 5)
hist(model2$residuals, col = "green", main = "Histogram of Model 2 Residuals")
boxcox(model2, lambda = seq(0.2, 0.4, 0.1))

Before looking at the summary results and drawing any conclusion, we need to check the model’s residuals to make sure they are approximately normally distributed. However, from the first two plots shown above, the residuals are clearly right skewed, which means the response variable needs a transformation.

The third plot above shows the Box-Cox method, which suggests lambda = 0.3. So a power transformation is applied to the response variable coninc: the transformed value y equals the original value raised to the power of 0.3.
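For readers who want to see how a lambda can be read off numerically rather than from the plot, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: `sim`, `resp` and `group` are made-up names, and the response is log-normal, so the method should point to a lambda near 0 (a log transform) rather than the 0.3 found for coninc.

```r
library(MASS)  # provides boxcox()

set.seed(1)
# Simulated right-skewed response (NOT GSS data): log-normal values,
# for which the log transform (lambda = 0) is the natural choice
sim <- data.frame(group = gl(2, 1000))
sim$resp <- rlnorm(2000, meanlog = 10, sdlog = 1)

fit <- lm(resp ~ group, data = sim)

# Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)
best.lambda <- bc$x[which.max(bc$y)]
print(best.lambda)

# Apply the transformation: log for lambda near 0, a power otherwise
transformed <- if (abs(best.lambda) < 0.05) log(sim$resp) else sim$resp^best.lambda
```

A finer and wider grid than the three-point `seq(0.2, 0.4, 0.1)` makes the location of the maximum easier to check.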

y <- gtahd$coninc^0.3
model3 <- lm(y ~ getahead, data = gtahd)
model3$call

## lm(formula = y ~ getahead, data = gtahd)

par(mfrow = c(1, 2))
qqnorm(model3$residuals,col = "blue", pch = 20)
qqline(model3$residuals, col = "red", lty = 5)
hist(model3$residuals, col = "blue", main = "Histogram of Model 3 Residuals")

sm3 <- summary(model3)
sm3

## 
## Call:
## lm(formula = y ~ getahead, data = gtahd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4353  -3.9688   0.0153   3.6159  15.5295 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          23.03596    0.03799 606.334  < 2e-16 ***
## getaheadBoth Equally  0.35536    0.07540   4.713 2.45e-06 ***
## getaheadLuck Or Help -0.82038    0.09787  -8.383  < 2e-16 ***
## getaheadOther         0.55258    0.96151   0.575    0.565    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.765 on 34973 degrees of freedom
## Multiple R-squared:  0.003209,   Adjusted R-squared:  0.003123 
## F-statistic: 37.53 on 3 and 34973 DF,  p-value: < 2.2e-16

anova(model3)

## Analysis of Variance Table
## 
## Response: y
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## getahead      3    3741 1247.00  37.526 < 2.2e-16 ***
## Residuals 34973 1162156   33.23                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In ANOVA, the null hypothesis is that people’s opinion of how to get ahead has no effect on their family income, that is, the mean (transformed) family income is the same in every opinion group; the alternative hypothesis is that at least one group mean differs.

From the results above, we can see that the p-value is extremely small, which suggests we should reject the null hypothesis.
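Because model3 is fitted on the 0.3-power scale, its coefficients are hard to read in dollars. The sketch below back-transforms the fitted group values by raising them to the power 1/0.3. This gives a typical income per group rather than the arithmetic mean, since the mean of a transformed variable does not back-transform exactly. The Other group is left out because its estimate is too imprecise to interpret. `intercept`, `shift`, `fitted.power` and `fitted.dollars` are illustrative names.

```r
# Coefficients copied from the model3 summary above (0.3-power scale)
intercept <- 23.03596          # fitted value for Hard Work (reference level)
shift <- c("Hard Work"    =  0,
           "Both Equally" =  0.35536,
           "Luck Or Help" = -0.82038)

# Fitted group values on the transformed scale
fitted.power <- intercept + shift

# Undo y = coninc^0.3 by raising to the power 1/0.3
fitted.dollars <- fitted.power^(1 / 0.3)
print(round(fitted.dollars))
```

The ordering of the groups is the same on both scales, because the power transformation is monotone.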

CONCLUSION :

The question raised at the beginning of the report seems quite simple. However, the conclusion may be different from what we had in mind.

People’s opinion of how to get ahead is associated with both education level and family income, although, as noted at the beginning, no causal relationship can be established. Although the majority of people believe hard work is most important for getting ahead, those who think luck or help from others is as important as hard work actually have higher odds of holding a higher degree. Believing only in luck or help, however, is associated with lower odds. A similar conclusion can be established for family income. Although the median family incomes of the different opinion groups are quite close to each other, the people who think luck or help is as important as hard work have higher family income on average than those who believe in hard work only, while those who believe only in luck or help have lower family income on average.

Shiivong Kapil Birla

April, 2015