Is there a relationship between highest education attained by US residents and their family income?

1 Introduction

This project studies the relationship between the highest education level attained by United States residents and their family income in constant dollars, using General Social Survey (GSS) data. It also offers some insights into the relationship between education level and attitudes towards life and job satisfaction.

This study is relevant to students as well as policy makers. It can help policy makers develop and implement policies that widen access to education, and it can help students understand the long-term benefits of investing in a good education and its impact on the income needed to lead a quality life in the future.

2 Data

2.1 Data collection

The study uses General Social Survey (GSS) data for the year 2012. The data come from the cumulative data file for surveys conducted between 1972 and 2012. Note that not all respondents answered all questions in all years.

2.2 Cases

There are a total of 57,061 cases and 114 variables in this dataset. The codebook lists all variables, the values they take, and the survey questions associated with them.

Survey data: http://bit.ly/dasi_gss_data

Detailed codebook: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html

2.3 Variables

The following key variables are extracted from the dataset for the year 2012:

- degree: respondent's highest degree
- coninc: total family income in constant dollars
- Background: age, sex, race, marital status
- Attitude: passion to get ahead (getahead), financial satisfaction (satfin), job satisfaction (satjob)
- joblose: is likely to lose job
- jobfind: could respondent find an equally good job
- satjob: job or housework satisfaction
- jobpromo: chance of advancement
- jobmeans: work important and feel accomplishment
- satfin: financial satisfaction

2.4 Study

This is an observational study, not an experiment: the data come from a survey rather than from an experiment with test and control groups. Hence it can establish correlation but not causation.

2.5 Scope of inference

Generalizability: The population of interest is the entire US population. Respondents are sampled randomly during the survey, so findings from the study can be generalized to the entire US population.

Causality: These data cannot be used to establish causal links between the variables of interest since this is an observational study and not an experiment.

2.6 Potential sources of bias that might prevent generalizability

There may be response bias and convenience bias since the data are collected through a survey. These biases must be taken into account when drawing conclusions from this study.

2.7 Data Citation

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1

3 Exploratory data analysis

3.1 Data Preparation

The data set is loaded in R and restricted to the 2012 survey year. The number of observations is:

## [1] 1974

3.2 Summary Statistics - Univariate Analysis

## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##            280            976            151            354            205

3.2.1 Highest Degree

Highest degree is a categorical variable. We summarize it with a frequency table, a table of relative frequencies, and a bar plot.

## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##            280            976            151            354            205

## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##      0.1424212      0.4964395      0.0768057      0.1800610      0.1042726

We can see that high school is the highest degree for nearly 50% of observations.
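The relative frequencies above can be reproduced from the printed counts. The following is a minimal sketch in Python rather than the R used for the report; the variable names are illustrative. Note that the counts sum to 1966, so the proportions are computed over respondents with a recorded degree.

```python
# Counts copied from the frequency table above (2012 respondents by highest degree).
degree_counts = {
    "Lt High School": 280,
    "High School": 976,
    "Junior College": 151,
    "Bachelor": 354,
    "Graduate": 205,
}

# Number of respondents with a recorded degree.
total = sum(degree_counts.values())

# Relative frequency of each category = count / total.
degree_props = {level: count / total for level, count in degree_counts.items()}

for level, prop in degree_props.items():
    print(f"{level:<15} {degree_counts[level]:>5} {prop:.7f}")
```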

3.2.2 Highest Family Income

Family income in constant USD is a continuous numerical variable. We summarize it with mean, range and quantiles, and with a histogram

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   16280   34470   48380   63200  178700     216

We can see that the distribution is right skewed and unimodal, with 50% of observations in the 16,300-63,200 USD (constant dollar) range and a maximum value of about 179,000 USD. There are some clear outliers in the upper tail of the distribution. There are 216 observations with missing income values. Filtering out these, together with the few observations that have no recorded degree, brings the number of observations to 1752. The sample remains large enough for the study.

## [1] 1752

## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##            230            881            132            324            185
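The skew and outlier remarks in this section can be checked against the printed income summary. The sketch below is in Python rather than R and uses the conventional 1.5 x IQR rule as the outlier cutoff, which is an assumption since the report does not say which rule was applied. It also does the sample-size bookkeeping from the printed counts.

```python
# Values copied from the income summary above (constant USD).
q1, median, mean, q3, maximum = 16280, 34470, 48380, 63200, 178700

iqr = q3 - q1                    # spread of the middle 50% of incomes
upper_fence = q3 + 1.5 * iqr     # assumed cutoff: the 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr

right_skewed = mean > median                 # upper tail pulls the mean above the median
has_upper_outliers = maximum > upper_fence   # the maximum lies beyond the upper fence

# Sample-size bookkeeping from the printed outputs.
n_total = 1974                               # observations before filtering
n_missing_income = 216                       # NA's reported in the income summary
n_with_income = n_total - n_missing_income
n_complete = sum([230, 881, 132, 324, 185])  # degree counts after filtering

print(f"IQR = {iqr}, upper fence = {upper_fence}")
print(f"right skewed: {right_skewed}, upper outliers: {has_upper_outliers}")
print(f"with income: {n_with_income}, complete cases: {n_complete}")
```

The gap between the 1758 observations with an income value and the 1752 that remain is consistent with a handful of respondents having an income but no recorded degree.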

3.2.3 Relationship among family income and highest degree

A box plot is used to summarize the relationship between a numerical and a categorical variable.

Finally, we explore the relationship between family income and highest degree. We can see that a positive association exists, but the wider interquartile range in the college groups and the presence of outliers in the high school and less than high school groups mean that the relationship is not strong and that family income could be associated with other variables. We explore additional variables in the section below.

3.2.4 Additional data exploration: correlation between education and other variables such as attitude towards life and job satisfaction

Box Plot for Education, Gender and Income

The box plot shows the relationship between income and education by gender. We can observe that, on average, people with higher education tend to have more income.

We observe that the median income of males is higher than the median income of females. However, the overall distribution of income looks similar.

## df_gss_background$sex: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     383   28720   51700   62320   76600  178700 
## ------------------------------------------------------------ 
## df_gss_background$sex: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1532   22020   42130   53800   63200  178700

Box Plot for Education and Income

Relationship between Income, Age and Education

It is interesting to observe that there is some pattern in the relationship between age and income. However, the strength of the relationship varies between different education levels and also by gender.

Box Plot for Attitude to get ahead and Income

We can observe that people who weigh both hard work and luck tend to have more variation in income, although no clear relationship is observed.

Plot for Job satisfaction and Income

Plot for Ability to find a job and Education

4.1 Inference

4.1.1 Defining Null & Alternative Hypotheses

Null hypothesis: The mean family income is the same across all groups defined by highest educational level; in other words, highest educational level is not associated with income level. H0 : Mu(LHS) = Mu(HS) = Mu(JC) = Mu(B) = Mu(G)

Alternative hypothesis: The mean family income differs between at least two of the groups defined by highest educational level; in other words, education level is associated with income level. HA : the average income in constant dollars, Mu(i), varies across some (or all) groups

4.1.2 Testing our Hypothesis using Analysis of Variance (ANOVA) method

The one-way analysis of variance (ANOVA) is used to determine whether there are any significant differences between the means of three or more independent (unrelated) groups. It compares the means of the groups we are interested in and determines whether any of those means are significantly different from each other. This is an appropriate approach since we are considering one categorical variable (education) and one numerical variable (income).

Specifically, it tests the null hypothesis: H0 : Mu(LHS) = Mu(HS) = Mu(JC) = Mu(B) = Mu(G)

If the one-way ANOVA returns a significant result, we reject the null hypothesis in favor of the alternative hypothesis (HA), which is that at least two group means are significantly different from each other.
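To make the test statistic concrete, here is a minimal sketch of the one-way ANOVA F computation in plain Python. The function name and the toy groups are hypothetical and are not taken from the GSS data; the report itself uses R's built-in ANOVA.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy data: three small groups with clearly different means.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f"F = {f_stat:.2f} on {df_b} and {df_w} degrees of freedom")
```

A large F means the group means vary much more than would be expected from the variation within groups, which is the evidence against H0.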

4.1.2.1 Key Assumptions made in the test

The following three key assumptions are made:

1. Independence of observations: GSS data consist of a random sample of the American population, and the sample is definitely less than 10% of the population, so the observations can be considered independent.

2. Normality: The dependent variable should be approximately normally distributed in each group being compared in the ANOVA. We check the normality of the distribution for each group, and we can see some deviation from normality in each group.

Figure 2: Checking for Data Normality

3. Constant variance: The third condition is that the variability is constant across groups. The box plots in Figure 1 show that the total range and the interquartile range differ across groups, with the lowest variability in the less than high school group and the highest variability in the graduate group.

The conditions on normality and constant variance are therefore not fully met. We use ANOVA in our hypothesis test, but we report the uncertainty in the results.

4.1.2.2 Computation for ANOVA

## Analysis of Variance Table
## 
## Response: Family_Income
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Highest_Degree    4 8.2832e+11 2.0708e+11  120.52 < 2.2e-16 ***
## Residuals      1747 3.0017e+12 1.7182e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of the Results: The ANOVA table above shows an F statistic of 120.52, which is the ratio of the variation between the groups to the variation within the groups, and a p-value of approximately zero. This means that the probability of observing an F value of 121 or higher, if the null hypothesis were true, is very low.

Hence we can reject the null hypothesis and say that the average income in constant dollars varies across some (or all) groups in a statistically significant way.
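The F value can be re-derived from the sums of squares and degrees of freedom printed in the ANOVA table. This is a sketch in Python using only the printed (rounded) values. The last quantity, the share of total variation attributed to degree, is not stated in the report and is included here only as a hedged illustration of effect size.

```python
# Values copied from the ANOVA table above.
ss_between, df_between = 8.2832e11, 4     # Highest_Degree row
ss_within, df_within = 3.0017e12, 1747    # Residuals row

ms_between = ss_between / df_between      # mean square between groups
ms_within = ss_within / df_within         # mean square within groups
f_stat = ms_between / ms_within           # F = MS between / MS within

# Observations used in the test: total degrees of freedom plus one.
n_obs = df_between + df_within + 1

# Share of total variation attributed to degree (not reported in the study).
share_explained = ss_between / (ss_between + ss_within)

print(f"F = {f_stat:.2f}, n = {n_obs}, share explained = {share_explained:.3f}")
```

On these rounded inputs, degree accounts for roughly a fifth of the total variation in family income, which fits the earlier remark that the relationship is present but not strong.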

To identify which groups differ, we compare the groups pairwise with t tests and apply a Bonferroni correction, in which each p-value is multiplied by the number of comparisons. With this correction, the difference between two means has to be larger for the null hypothesis to be rejected.

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  Edu_income_study_data$Family_Income and Edu_income_study_data$Highest_Degree 
## 
##                Lt High School High School Junior College Bachelor
## High School    1.4e-06        -           -              -       
## Junior College 3.2e-07        0.2140      -              -       
## Bachelor       < 2e-16        < 2e-16     2.3e-10        -       
## Graduate       < 2e-16        < 2e-16     < 2e-16        0.0011  
## 
## P value adjustment method: bonferroni

Interpretation of the Results: We can see that for nine of the ten group pairs the p-value is lower than the significance level of 0.05, so the null hypotheses are rejected: the difference in means for these nine pairs is statistically significant. The null hypothesis is not rejected for the pair high school-junior college. The difference in means for this pair is not statistically significant and could plausibly be due to chance. Since ANOVA is the only method applicable here, there are no results from other methods to compare against.
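The pair counts in this interpretation can be checked from the printed table. The sketch below is in Python rather than R; the entries shown as "< 2e-16" are stored as the upper bound 2e-16, which is an assumption made only so they can be compared numerically.

```python
from math import comb

levels = ["Lt High School", "High School", "Junior College", "Bachelor", "Graduate"]
n_comparisons = comb(len(levels), 2)   # number of distinct pairs of groups

# Bonferroni-adjusted p-values copied from the table above.
# Entries printed as "< 2e-16" are stored as the upper bound 2e-16 (assumption).
adjusted_p = {
    ("Lt High School", "High School"): 1.4e-06,
    ("Lt High School", "Junior College"): 3.2e-07,
    ("Lt High School", "Bachelor"): 2e-16,
    ("Lt High School", "Graduate"): 2e-16,
    ("High School", "Junior College"): 0.2140,
    ("High School", "Bachelor"): 2e-16,
    ("High School", "Graduate"): 2e-16,
    ("Junior College", "Bachelor"): 2.3e-10,
    ("Junior College", "Graduate"): 2e-16,
    ("Bachelor", "Graduate"): 0.0011,
}

alpha = 0.05
significant = [pair for pair, p in adjusted_p.items() if p < alpha]
not_significant = [pair for pair, p in adjusted_p.items() if p >= alpha]

# The correction multiplies each raw p-value by the number of comparisons
# (capped at 1), so an uncapped adjusted value can be divided back.
raw_hs_jc = adjusted_p[("High School", "Junior College")] / n_comparisons

print(f"{len(significant)} of {n_comparisons} pairs significant at {alpha}")
print(f"not significant: {not_significant}")
print(f"implied unadjusted p for High School vs Junior College: {raw_hs_jc:.4f}")
```

The implied unadjusted p-value for high school vs junior college is about 0.021, so this is the one pair whose conclusion depends on applying the correction.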

5 Final Plots and Summary

Plot-1:

Notice that at the lower education levels (less than high school, high school, and junior college), the median income is under $50,000 annually, while the median incomes for bachelor and graduate degrees are near $100,000. We do see some correlation between education and income.

The median incomes of males and females differ consistently at education levels below graduate. This is interesting: does it show that gender inequality is less pronounced at the highest education levels? Or is it because of the widely held belief that women are more likely to be family focused, thus sacrificing higher education?

Notice that income gaps (median vs. quantiles) are much larger at higher education levels. This suggests that people with higher education levels have more opportunity to maximize their income.

Plot-2:

Notice that at the lowest education levels (less than high school and high school), age or work experience does not yield more income. For junior college and bachelor's degrees there is a slight uptick in annual income with age, while graduates' income increases with age or experience.

While we see a strong correlation between age and income for male graduates, the strength of the relationship is medium to low for female graduates. This may be due to female graduates leaving the workforce or taking less strenuous jobs while juggling work and personal commitments, or making family commitments a top priority.

Also, for graduates, the opportunity to maximize income is higher with increasing age and experience.

It is interesting to see the attitude towards life: most people around the age group of 35 to 45 think that hard work is important to get ahead in life. We do not see any strong relationship between education and attitude, or by gender.

Plot-3:

Let's take a look at softer aspects such as job satisfaction vs. education. Notice that male and female graduates and bachelor's degree holders who are very satisfied or moderately satisfied with their job have incomes distributed over a wider range, with a median income of around 100K for males and 75K for females.

Also, more educated groups such as graduates have a wider distribution of income than other groups. However, the IQR is very narrow for the lower education groups; perhaps the flexibility to change to jobs of interest adds to the satisfaction level of individuals.

We also observed some outliers during the data exploration stage, which may indicate a strong correlation of other variables with income and satisfaction. Some of the conditions for the statistical inference methods used were not fully met, so we have to be cautious in interpreting the results.

6 Reflection

The study establishes a positive correlation between education level and family income in constant dollars for United States residents.

The study can be generalized to all United States residents since we used GSS survey data for 2012. We grouped family income in constant dollars by the highest degree earned by the interviewees (less than high school, high school, junior college, bachelor's and graduate), and by visually exploring the data we noticed a positive correlation between the two variables.

We used the ANOVA method and pairwise comparisons to test our hypothesis that there is a significant difference in the mean incomes of the groups. Our analysis shows that there is a significant difference, the only exception being between the high school and junior college groups.

From the above analysis we do see a correlation between education and economic success. Education, especially high-quality education, represents social mobility and opportunities, which can be interpreted as higher future income, social status or simply satisfaction with new knowledge.

7 References

7.1 Data reference

General Social Survey Cumulative File, 1972-2012 Coursera Extract. Modified for Data Analysis and Statistical Inference course (Duke University). R dataset could be downloaded at http://bit.ly/dasi_gss_data. Original data: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1 Persistent URL: http://doi.org/10.3886/ICPSR34802.v1

7.2 Other references

General Social Survey (GSS) FAQ. URL: http://publicdata.norc.org:41000/gssbeta/faqs.html. Accessed 03/30/2014.

Comparing many means with ANOVA. In Diez, David M., Christopher D. Barr, and Mine Çetinkaya-Rundel (2012), OpenIntro Statistics, Second Edition. URL: http://www.openintro.org/stat/textbook.php.