Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.


Part 2: Research question

Is there a relationship between a person’s education and their family income when they were 16?

I want to explore how self-reported family income at age 16 is related to the person’s highest degree obtained.


Part 3: Exploratory data analysis

To address the research question, we need to examine the relationship between highest degree obtained and self-reported family income at age 16. Searching the codebook for the GSS data, we find the degree variable, which lists the highest degree attained in categories from Lt High School to Graduate. Self-reported family income at age 16 is recorded in the incom16 variable, which ranges from Far Below Average to Far Above Average.

We first build a frequency table, then a relative frequency table, to see the proportion of each degree level within each income group.

data <- gss %>% select(degree, incom16)
table(data)
##                 incom16
## degree           Far Below Average Below Average Average Above Average
##   Lt High School              1507          2994    4408           583
##   High School                 1531          5229   12062          3145
##   Junior College               140           480    1124           415
##   Bachelor                     220          1108    2777          1635
##   Graduate                     158           610    1250           749
##                 incom16
## degree           Far Above Average Lived In Institution
##   Lt High School               131                    5
##   High School                  313                    3
##   Junior College                35                    0
##   Bachelor                     191                    2
##   Graduate                     111                    0
prop.table(table(data), 2)
##                 incom16
## degree           Far Below Average Below Average    Average Above Average
##   Lt High School        0.42379078    0.28730448 0.20387586    0.08932128
##   High School           0.43053993    0.50177526 0.55788354    0.48184465
##   Junior College        0.03937008    0.04606084 0.05198649    0.06358204
##   Bachelor              0.06186727    0.10632377 0.12843994    0.25049793
##   Graduate              0.04443195    0.05853565 0.05781416    0.11475410
##                 incom16
## degree           Far Above Average Lived In Institution
##   Lt High School        0.16773367           0.50000000
##   High School           0.40076825           0.30000000
##   Junior College        0.04481434           0.00000000
##   Bachelor              0.24455826           0.20000000
##   Graduate              0.14212548           0.00000000
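
The second argument of prop.table() decides which totals the counts are divided by: 2 divides by column totals, so each income column sums to 1, while 1 would divide by row totals. A minimal sketch on a made-up 2x2 table (the counts and labels below are illustrative only, not GSS data):

```r
# Toy 2x2 count table (made-up numbers, not GSS data).
counts <- matrix(c(10, 30,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(degree = c("No degree", "Degree"),
                                 income = c("Low", "High")))

by_column <- prop.table(counts, 2)  # divide by column totals
by_row    <- prop.table(counts, 1)  # divide by row totals

print(by_column)
print(colSums(by_column))  # every column sums to 1
print(rowSums(by_row))     # every row sums to 1
```

Column proportions are the right choice here because the question is how degree levels are distributed within each income group.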

We can see that Far Above Average has the greatest proportion of graduates, while Far Below Average and Lived In Institution have the lowest proportions of graduates and the highest proportions of Lt High School respondents. There are very few observations (10 in total) with an income level of “Lived In Institution”, so we remove them to get a cleaner analysis of the remaining levels.

data <- data %>% filter(incom16 != "Lived In Institution")
data <- droplevels(data)
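
Filtering rows does not by itself remove a category from a factor's level set, which is why droplevels() follows the filter. The sketch below illustrates this on a small hypothetical data frame; base-R subsetting stands in for dplyr's filter() so that the example runs without extra packages.

```r
# Toy data frame (hypothetical values, not the GSS data).
toy <- data.frame(
  incom16 = factor(c("Average", "Lived In Institution",
                     "Average", "Above Average"))
)

# Base-R subsetting stands in for dplyr::filter() here.
kept <- toy[toy$incom16 != "Lived In Institution", , drop = FALSE]

print(levels(kept$incom16))  # the removed category is still a level
print(table(kept$incom16))   # and appears as a zero-count cell

cleaned <- droplevels(kept)
print(levels(cleaned$incom16))  # the unused level is gone
```

Without this step, table() would keep an empty Lived In Institution column and the plot legend or axis could still show the category.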

We plot the relative frequencies above for a clearer picture.

temp <- as.data.frame(prop.table(table(data), 2))
ggplot() + geom_bar(aes(y = Freq, x = incom16, fill = degree), data = temp,
                    stat="identity") + coord_flip() +
    labs(x ="Family Income Level at 16", y = "Relative Frequency" , fill = "Highest Degree")

We can see the proportion of Lt High School generally decreasing as we move to higher income levels (apart from an uptick at Far Above Average) and the proportion of graduates generally increasing. There seems to be a positive association between highest degree and income level.


Part 4: Inference

We define our hypotheses as:

H0 : Highest Degree and Family Income level at age 16 are independent of each other

HA : Highest Degree and Family Income level at age 16 are not independent of each other and are related in some way

Since we are dealing with two categorical variables that each have more than two levels, we will use the chi-square test of independence to see if there’s a relationship between the two variables.

Conditions for the chi-square test:

  1. Independence: The GSS uses random sampling, and the total number of observations is less than 10% of the population. Each case also contributes to only one cell.

  2. Sample size: Each cell must have an expected count of at least five. With Lived In Institution removed, the smallest observed cell count is 35, and the expected counts are likewise well above five.

chisq.test(data$degree, data$incom16)
## 
##  Pearson's Chi-squared test
## 
## data:  data$degree and data$incom16
## X-squared = 2847.1, df = 16, p-value < 2.2e-16
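
As a cross-check on the sample-size condition and on the output above, the expected counts and the test statistic can be recomputed by hand. This is a sketch that assumes the counts printed in the frequency table earlier (with the Lived In Institution column left out) are exactly what chisq.test() received; the object names are illustrative.

```r
# Observed counts copied from the frequency table above,
# with the "Lived In Institution" column left out.
observed <- matrix(
  c(1507, 2994,  4408,  583, 131,
    1531, 5229, 12062, 3145, 313,
     140,  480,  1124,  415,  35,
     220, 1108,  2777, 1635, 191,
     158,  610,  1250,  749, 111),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree  = c("Lt High School", "High School", "Junior College",
                "Bachelor", "Graduate"),
    incom16 = c("Far Below Average", "Below Average", "Average",
                "Above Average", "Far Above Average")
  )
)

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(min(expected))  # smallest expected count; the condition needs >= 5

# Pearson statistic, degrees of freedom, and upper-tail p-value.
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

The degrees of freedom follow from the table shape, (5 - 1) x (5 - 1) = 16, and the statistic should agree with the X-squared value reported above up to rounding.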

The p-value is very close to zero, so we reject the null hypothesis and conclude that there is an association between highest degree obtained and family income at age 16. However, since this is an observational study with no random assignment, we can’t infer causation from it. A confidence interval is not applicable here, because a chi-square test of independence on a 5x5 contingency table does not estimate a single population parameter.

Our exploratory data analysis suggested a positive association between family income at age 16 and highest degree obtained, and the chi-square test shows that the two are not independent. Because the chi-square test says nothing about direction, a further analysis testing the correlation between the two ordered variables can be done in addition to this one.
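
One possible form for that follow-up is Spearman's rank correlation, which treats both variables as ordered. The sketch below is an assumption about how such an analysis might look, not part of the original study: it rebuilds respondent-level data from the counts in the frequency table above and codes each category by its rank.

```r
# Observed counts from the frequency table above
# ("Lived In Institution" left out).
observed <- matrix(
  c(1507, 2994,  4408,  583, 131,
    1531, 5229, 12062, 3145, 313,
     140,  480,  1124,  415,  35,
     220, 1108,  2777, 1635, 191,
     158,  610,  1250,  749, 111),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree  = c("Lt High School", "High School", "Junior College",
                "Bachelor", "Graduate"),
    incom16 = c("Far Below Average", "Below Average", "Average",
                "Above Average", "Far Above Average")
  )
)

# One row per cell, then one element per respondent, with each
# ordered category coded as its rank (1 = lowest, 5 = highest).
cells   <- as.data.frame(as.table(observed))
degree  <- rep(as.integer(cells$degree),  times = cells$Freq)
incom16 <- rep(as.integer(cells$incom16), times = cells$Freq)

# Spearman's rank correlation; exact = FALSE because of the many ties.
result <- cor.test(degree, incom16, method = "spearman", exact = FALSE)
print(result$estimate)
print(result$p.value)
```

A positive estimate would support the direction seen in the plot, though like the chi-square test it describes association only, not causation.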