Is there a relationship between highest education attained by US residents and their family income?

1 Introduction

This project studies the relationship between the highest education attained by United States residents and their family income in constant dollars, using General Social Survey (GSS) data.

This study is important for students as well as policy makers. It helps policy makers develop and implement policies that provide access to education. It also helps students understand the long-term benefits of investing in a good education and its impact on the distribution of income and on their future quality of life.

The study also encourages discussion of the role of education in wealth distribution, with the aim of bringing positive changes to society.

2 Data

2.1 Data collection

The study uses General Social Survey (GSS) data for the year 2012. The data are taken from the cumulative data file for surveys conducted between 1972 and 2012. Note that not all respondents answered all questions in all years.

2.2 Cases

There are a total of 57,061 cases and 114 variables in the cumulative dataset. The survey data are available at http://bit.ly/dasi_gss_data. The codebook lists all variables, the values they take, and the survey questions associated with them: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html

2.3 Variables

Variable-1: ‘DEGREE’- Respondent's Highest Degree, which is a categorical variable
Variable-2: ‘CONINC’- Total Family Income in constant dollars, which is a continuous numerical variable

2.4 Study

This is an observational study, not an experiment, since the data came from a survey and not from an experiment with test and control groups. Hence it can establish correlation, not causation.

2.5 Scope of inference

Generalizability: The population of interest is the entire US population. Respondents are sampled randomly in the survey. Hence, findings from the study can be generalized to the entire US population.

Causality: These data cannot be used to establish causal links between the variables of interest since this is an observational study and not an experiment.

2.6 Potential sources of bias that might prevent generalizability

There may be response bias and convenience bias since the data are collected with a survey methodology. These biases must be taken into account when drawing conclusions from this study.

2.7 Data Citation

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1

3 Exploratory data analysis

3.1 Data Preparation

Two key variables, degree and income, are extracted from the dataset for the year 2012:
1. ‘DEGREE’- Respondent's Highest Degree
2. ‘CONINC’- Total Family Income in constant dollars

Loading the Data set using R

load(url("http://bit.ly/dasi_gss_data"))

# Create a subset with education and income for the year 2012
Edu_income_study_data <- subset(gss, year == 2012, select = c(degree, coninc))

# Rename the column names
colnames(Edu_income_study_data) <- c("Highest_Degree","Family_Income")
# Count the observations
nrow(Edu_income_study_data)

## [1] 1974

3.2 Summary Statistics- Univariate Analysis

table(Edu_income_study_data$Highest_Degree)

## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##            280            976            151            354            205

barplot(table(Edu_income_study_data$Highest_Degree),las=2, 
        main="Highest Degree",cex.axis= 1, cex.names=0.75)

# Bar Plot for Education Vs Frequency

hist(Edu_income_study_data$Family_Income, main="Family Income in constant USD"
     , xlab="USD",cex.axis= 1)

# Histogram for highest family income in constant currency

Figure 1: Summary Statistics for Degree & Family Income

3.2.1 Highest Degree

Highest degree is a categorical variable. We summarize it with a frequency table, a relative frequency table, and a bar plot.

# Frequency table (counts) for Highest_Degree
table(Edu_income_study_data$Highest_Degree)

## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##            280            976            151            354            205

# Relative frequency table for Highest_Degree
prop.table(table(Edu_income_study_data$Highest_Degree))

## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##      0.1424212      0.4964395      0.0768057      0.1800610      0.1042726

We can see that nearly 50% of the respondents report high school as their highest degree.
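As a quick check on the proportions above, the following sketch recomputes them from the printed counts. The vector name degree_counts is introduced here for illustration only. The counts sum to 1966 rather than 1974, which suggests that the respondents without a recorded degree are left out of the table.

```r
# Counts copied from the frequency table printed above
degree_counts <- c("Lt High School" = 280, "High School" = 976,
                   "Junior College" = 151, "Bachelor" = 354,
                   "Graduate" = 205)

# Number of respondents with a recorded degree
n_with_degree <- sum(degree_counts)

# Relative frequencies, computed the same way prop.table() does
degree_props <- degree_counts / n_with_degree

# Percentages rounded to one decimal place
round(100 * degree_props, 1)
```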

3.2.2 Highest Family Income

Family income in constant USD is a continuous numerical variable. We summarize it with the mean, range and quantiles, and with a histogram.

Summary

summary(Edu_income_study_data$Family_Income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   16280   34470   48380   63200  178700     216

We can see that the distribution is right skewed and unimodal, with the middle 50% of observations in the 16,300-63,200 USD (constant dollar) range and a maximum value of about 179,000 USD. There are some clear outliers in the upper tail of the distribution. There are 216 observations with missing income values. Removing every row with a missing value in either variable leaves 1752 observations, slightly fewer than 1974 - 216 = 1758 because a few rows are also missing the degree. The sample size remains large enough for the study.

# Filter out rows with missing values (NAs)
Edu_income_study_data <-
  Edu_income_study_data[complete.cases(Edu_income_study_data), ]
# Count observations after the filter
nrow(Edu_income_study_data)

## [1] 1752

# Frequency table for Highest_Degree after the filter
table(Edu_income_study_data$Highest_Degree)

## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##            230            881            132            324            185
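The outliers mentioned for family income can be made concrete with the conventional 1.5 x IQR rule, which is also the default whisker length in R's boxplot(). The sketch below is only an illustration that reuses the rounded quartiles printed by summary() before the filter, so the fence is approximate. The variable names are introduced here and are not part of the original analysis.

```r
# Quartiles and maximum copied from the summary() output above (rounded)
q1 <- 16280
q3 <- 63200
max_income <- 178700

# Interquartile range and the conventional upper fence (Q3 + 1.5 * IQR)
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# TRUE means at least one observation lies beyond the upper fence
max_income > upper_fence
```

Under these assumptions, any family income above roughly 133,600 USD would be flagged as an outlier.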

3.2.3 Relationship between family income and highest degree

boxplot(Family_Income ~ Highest_Degree, data = Edu_income_study_data,
        main = "Family Income by Highest Degree", xlab = "Highest Degree",
        ylab = "USD", cex.axis = 0.75)

# Boxplot for relationship between Education and Highest Family income

Finally, we explore the relationship between family income and highest degree. We can see that a positive association exists, but the wider interquartile range in the college groups and the presence of outliers in the high school and less than high school groups mean that the relationship is not strong and that family income could also be associated with other variables.

3.2.4 Additional Data Exploration to see the correlation between Education and other Variables like Workplace and Economic Concerns

4.1 Inference

4.1.1 Defining Null & Alternative Hypotheses

Null hypothesis: The mean family income is the same in every group defined by highest educational level; in other words, income level is not associated with highest educational level. $H_0: \mu_{LHS} = \mu_{HS} = \mu_{JC} = \mu_{B} = \mu_{G}$

Alternative hypothesis: The mean family income differs for at least one group; in other words, income level is associated with highest educational level. $H_A$: the average income in constant dollars ($\mu_i$) varies across some (or all) groups.

4.1.2 Testing our Hypothesis using Analysis of Variance (ANOVA) method

The one-way analysis of variance (ANOVA) is used to determine whether there are any significant differences between the means of three or more independent (unrelated) groups. It compares the means of the groups we are interested in and determines whether any of those means differ significantly from each other. This approach suits our study since we are considering one categorical variable (education) and one numerical variable (income).

Specifically, it tests the null hypothesis $H_0: \mu_{LHS} = \mu_{HS} = \mu_{JC} = \mu_{B} = \mu_{G}$

If, however, the one-way ANOVA returns a significant result, we reject the null hypothesis in favor of the alternative hypothesis ($H_A$), which is that at least two group means differ from each other.
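To make the mechanics of the test concrete, here is a minimal sketch that computes the F statistic by hand and compares it with R's built-in anova(). The data frame toy and its values are hypothetical and unrelated to the GSS data. The F statistic is the between-group mean square divided by the within-group mean square.

```r
# Hypothetical toy data: three groups of four incomes (in thousands)
toy <- data.frame(
  income = c(20, 25, 30, 25,  40, 45, 50, 45,  70, 80, 90, 80),
  group  = factor(rep(c("A", "B", "C"), each = 4))
)

grand_mean  <- mean(toy$income)
group_means <- tapply(toy$income, toy$group, mean)
group_sizes <- tapply(toy$income, toy$group, length)

# Between-group sum of squares: group means against the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: observations against their own group mean
ss_within <- sum((toy$income - ave(toy$income, toy$group))^2)

# Degrees of freedom: k - 1 between groups and n - k within groups
df_between <- nlevels(toy$group) - 1
df_within  <- nrow(toy) - nlevels(toy$group)

# F statistic: ratio of the two mean squares
f_manual <- (ss_between / df_between) / (ss_within / df_within)

# The same statistic as reported by anova() on a linear model
f_builtin <- anova(lm(income ~ group, data = toy))[1, "F value"]
c(manual = f_manual, builtin = f_builtin)
```

A large F value means the group means are spread far apart relative to the variation inside each group.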

4.1.2.1 Key Assumptions made in the test

The following three key assumptions are made:

Independence of observations: The GSS data are a random sample of the American population, and the sample is definitely less than 10% of the population, so the observations can be considered independent.

Normality: The dependent variable should be approximately normally distributed in each group compared in the ANOVA. We check this with a normal Q-Q plot for each group. The plots show some deviation from normality in every group.

# Create a plot grid for 5 graphs in a row
par(mfrow = c(1, 5))
# Draw a normal Q-Q plot of family income for each education level
degrees <- c("Lt High School", "High School", "Junior College",
             "Bachelor", "Graduate")
for (d in degrees) {
  income <- Edu_income_study_data$Family_Income[
    Edu_income_study_data$Highest_Degree == d]
  qqnorm(income, main = d)
  qqline(income)
}

Figure 2: Checking for Data Normality

Constant variance: The third condition is that variability is roughly constant across groups. The box plots in Section 3.2.3 show that the total range and the interquartile range differ across groups, with the lowest variability in the less than high school group and the highest variability in the graduate group. The conditions on normality and constant variance are therefore not fully met. We still use ANOVA for our hypothesis test, but we note the resulting uncertainty in the results.
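The visual check of constant variance can be complemented with a numerical summary of spread per group. The sketch below is an illustration on hypothetical data. The data frame toy, its values, and the name sd_ratio are assumptions made here, not part of the original analysis.

```r
# Hypothetical toy data standing in for income by education level
toy <- data.frame(
  income = c(10, 12, 14, 16,  20, 30, 40, 50,  30, 60, 90, 120),
  degree = factor(rep(c("Lt High School", "Bachelor", "Graduate"), each = 4),
                  levels = c("Lt High School", "Bachelor", "Graduate"))
)

# Standard deviation and interquartile range of income within each group
group_sd  <- tapply(toy$income, toy$degree, sd)
group_iqr <- tapply(toy$income, toy$degree, IQR)

# Ratio of the largest to the smallest group standard deviation;
# values far above 1 point to unequal variability across groups
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio
```

Applied to the study data, the same tapply() calls would use Family_Income and Highest_Degree in place of the toy columns.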

4.1.2.2 Computation for ANOVA

# ANOVA for the mean income grouped by degree
anova(lm(Family_Income ~ Highest_Degree, data=Edu_income_study_data ))

## Analysis of Variance Table
## 
## Response: Family_Income
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Highest_Degree    4 8.2832e+11 2.0708e+11  120.52 < 2.2e-16 ***
## Residuals      1747 3.0017e+12 1.7182e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of the results: The ANOVA table above shows an F statistic of 120.52, which is the ratio of the variation between groups to the variation within groups, and a p-value of approximately zero (below 2.2e-16). This means that the probability of observing an F value of 120.52 or higher, if the null hypothesis were true, is very low.

Hence we can reject the null hypothesis and say that the average income in constant dollars varies across some (or all) groups in a statistically significant way.
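The numbers in the ANOVA table can be reproduced from its sums of squares and degrees of freedom. The sketch below copies the rounded values from the printed table, so results agree only up to rounding. The variable names are introduced here for illustration.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_group <- 8.2832e11   # Sum Sq for Highest_Degree
ss_resid <- 3.0017e12   # Sum Sq for Residuals
df_group <- 4
df_resid <- 1747

# Mean squares and their ratio, the F statistic
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid
f_stat   <- ms_group / ms_resid

# Upper-tail probability of the F distribution under the null hypothesis
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)

# Share of the total variation in income accounted for by degree
r_squared <- ss_group / (ss_group + ss_resid)

c(F = f_stat, p = p_value, R2 = r_squared)
```

The last quantity suggests that degree accounts for roughly a fifth of the variation in family income, which is consistent with the earlier observation that the relationship is real but not strong.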

To identify which groups differ, we run pairwise t tests and apply a Bonferroni correction, which multiplies each p-value by the number of comparisons. With this correction, the difference between two means has to be larger to reject the null hypothesis.

# Pairwise t test for the mean income grouped by degree
# With Bonferroni correction
pairwise.t.test(Edu_income_study_data$Family_Income,
                Edu_income_study_data$Highest_Degree,
                p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  Edu_income_study_data$Family_Income and Edu_income_study_data$Highest_Degree 
## 
##                Lt High School High School Junior College Bachelor
## High School    1.4e-06        -           -              -       
## Junior College 3.2e-07        0.2140      -              -       
## Bachelor       < 2e-16        < 2e-16     2.3e-10        -       
## Graduate       < 2e-16        < 2e-16     < 2e-16        0.0011  
## 
## P value adjustment method: bonferroni
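The Bonferroni adjustment itself is simple arithmetic and can be sketched as follows. The raw p-values in raw_p are hypothetical. The first two were chosen so that their adjusted values match 0.2140 and 0.0011 in the table above, but the unadjusted values from the actual tests are not reported in this study.

```r
# Number of pairwise comparisons among 5 education groups
n_groups      <- 5
n_comparisons <- choose(n_groups, 2)

# Hypothetical raw (unadjusted) p-values
raw_p <- c(0.0214, 0.00011, 0.2)

# Bonferroni multiplies each raw p-value by the number of comparisons,
# capping the result at 1
adjusted_p <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)

# Equivalent manual calculation
manual_p <- pmin(1, raw_p * n_comparisons)

adjusted_p
```

Under this assumption, a raw p-value near 0.0214 would fall below 0.05 on its own but not after the correction, which illustrates why the correction makes each pairwise test more conservative.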

Interpretation of the results: For nine of the ten group pairs the p-value is lower than the significance level of 0.05, so the null hypotheses are rejected: the difference between the means of each of these nine pairs is statistically significant. The null hypothesis is not rejected for the pair High School-Junior College. The difference between the means of this pair is not statistically significant and could be due to chance. Since ANOVA is the only method applicable to this comparison, there are no results from other methods to compare against.

5 Final Plots and Summary

Plot-1:

## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##            230            881            132            324            185

Plot-2

Plot-3

6 Reflection

The study establishes a positive correlation between education level and family income in constant dollars for United States residents.

This finding can be generalized to all United States residents, since we used GSS survey data for 2012, which are based on a random sample. We grouped family income in constant dollars by the highest degree earned by the interviewees (less than high school, high school, junior college, bachelor's and graduate), and by visually exploring the data we noticed a positive correlation between the two variables.

We used ANOVA and pairwise comparisons to test our hypothesis that there is a significant difference in the mean incomes of the groups. Our analysis shows that there is a significant difference, the only exception being the comparison between high school degree and junior college degree.

We observed some outliers during the data exploration stage, which may indicate that other variables are also associated with income. Some of the conditions for the statistical inference methods used were not fully met, so we have to be cautious in interpreting the results.

Further analysis could address these shortcomings by examining the impact of other variables with more sophisticated techniques. It would also be interesting to repeat the analysis for each survey year and compare the results.

7 References

7.1 Data reference

General Social Survey Cumulative File, 1972-2012 Coursera Extract. Modified for Data Analysis and Statistical Inference course (Duke University). R dataset could be downloaded at http://bit.ly/dasi_gss_data. Original data: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1 Persistent URL: http://doi.org/10.3886/ICPSR34802.v1

7.2 Other references

General Social Survey (GSS) FAQ. URL: http://publicdata.norc.org:41000/gssbeta/faqs.html. Accessed 03/30/2014
Diez, David M., Christopher D. Barr, and Mine Çetinkaya-Rundel (2012). Comparing many means with ANOVA. In OpenIntro Statistics, Second Edition. URL: http://www.openintro.org/stat/textbook.php.