This project studies the relationship between the highest education level attained by United States residents and their family income in constant dollars, using General Social Survey (GSS) data. It also offers some insights into the relationship between education level, attitude towards life, and job satisfaction.
This study is important for students as well as policy makers. It helps policy makers develop and implement policies that provide access to education. It also helps students understand the long-term benefits of investing in a good education and its impact on income, and therefore on their quality of life in the future.
The study uses General Social Survey (GSS) data for the year 2012. The data come from a cumulative data file for surveys conducted between 1972 and 2012; note that not all respondents answered all questions in all years.
There are a total of 57,061 cases and 114 variables in this dataset. The survey data are available at http://bit.ly/dasi_gss_data. The codebook lists all variables, the values they take, and the survey questions associated with them. The detailed codebook for the dataset is at https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html
The following key variables are extracted from the dataset for the year 2012:
- degree: respondent's highest degree
- coninc: total family income in constant dollars
- Background: 'age', 'sex', 'race', 'marital status'
- Attitude: view on how to get ahead ('getahead'), financial satisfaction ('satfin'), job satisfaction ('satjob')
- joblose: is likely to lose job
- jobfind: could R find equally good job
- satjob: job or housework satisfaction
- jobpromo: chance of advancement
- jobmeans: work important and feel accomplishment
- satfin: financial satisfaction
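The extraction step can be sketched as follows. This is a minimal illustration on a made-up data frame, not the actual GSS file: the object names `gss_toy`, `key_vars` and `gss_2012` are hypothetical, the values are invented, and the income column is assumed to be named `coninc` as in the GSS codebook.

```r
# Toy stand-in for the GSS extract (the real file has 57,061 rows and 114 columns).
# All values below are invented for illustration.
gss_toy <- data.frame(
  year   = c(2010, 2012, 2012, 2012, 2008),
  degree = c("High School", "Bachelor", "Graduate", "High School", "Bachelor"),
  coninc = c(30000, 60000, 90000, 25000, 55000),
  sex    = c("Male", "Female", "Female", "Male", "Male"),
  satjob = c("Very Satisfied", "Mod. Satisfied", "Very Satisfied",
             "A Little Dissat", "Mod. Satisfied"),
  other  = 1:5
)

# Keep only the 2012 survey year and the variables of interest.
key_vars <- c("degree", "coninc", "sex", "satjob")
gss_2012 <- gss_toy[gss_toy$year == 2012, key_vars]

print(gss_2012)
```

With the real data, the same row and column indexing would be applied to the full data frame after loading it.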
This is an observational study, not an experiment, since the data came from a survey rather than from an experiment with test and control groups. Hence it can establish correlation, not causation.
Generalizability: The population of interest is the entire US population. Respondents are sampled randomly for the survey. Hence, findings from the study can be generalized to the entire US population.
Causality: These data cannot be used to establish causal links between the variables of interest since this is an observational study and not an experiment.
There may be response bias and convenience bias since the data were collected using a survey methodology. These biases must be taken into account when drawing conclusions from this study.
Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1
Loading the Data set using R
## [1] 1974
##
## Lt High School    High School Junior College       Bachelor       Graduate
##            280            976            151            354            205
Highest degree is a categorical variable. A categorical variable is summarized with a frequency table, a table of proportions, and a bar plot.
##
## Lt High School    High School Junior College       Bachelor       Graduate
##            280            976            151            354            205
##
## Lt High School    High School Junior College       Bachelor       Graduate
##      0.1424212      0.4964395      0.0768057      0.1800610      0.1042726
We can see that high school is the highest degree for nearly 50% of observations.
Family income in constant USD is a continuous numerical variable. We summarize it with the mean, range, and quartiles, and with a histogram.
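As a quick check, the proportions above can be recomputed from the printed counts. This is a small sketch that types the counts in by hand rather than reading them from the data; the names `counts` and `props` are ours.

```r
# Counts copied from the frequency table above.
counts <- c("Lt High School" = 280, "High School" = 976,
            "Junior College" = 151, "Bachelor" = 354, "Graduate" = 205)

# Relative frequencies: each count divided by the total of non-missing degrees.
props <- prop.table(counts)

print(sum(counts))
print(round(props, 7))
```

The counts sum to 1,966, which is slightly less than the 1,974 rows for 2012, so the proportions are taken over respondents with a recorded degree. Calling `barplot(counts)` on the same vector would draw the corresponding bar plot.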
Summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     383   16280   34470   48380   63200  178700     216
We can see that the distribution is right skewed and unimodal, with 50% of observations in the 16,300-63,200 USD (constant dollar) range and a maximum value of about 179,000 USD. There are some clear outliers in the upper tail of the distribution. There are 216 observations with missing income values. Filtering out these and other incomplete records brings the number of observations to 1,752. The sample remains large enough for the study.
## [1] 1752
##
## Lt High School    High School Junior College       Bachelor       Graduate
##            230            881            132            324            185
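The filtering step can be sketched as below. This is an illustration on a made-up data frame; we assume, without confirmation from the original code, that rows were dropped when either income or degree was missing. The names `toy`, `n_missing_income` and `complete` are hypothetical.

```r
# Invented example rows; NA marks a missing answer.
toy <- data.frame(
  degree = c("High School", NA, "Bachelor", "Graduate", "High School"),
  income = c(34000, 52000, NA, 98000, 21000)
)

# Count the missing incomes, then keep only rows where both variables are present.
n_missing_income <- sum(is.na(toy$income))
complete <- toy[complete.cases(toy[, c("degree", "income")]), ]

print(n_missing_income)
print(nrow(complete))
```

In this toy example one row is dropped for a missing income and one for a missing degree, which is the kind of overlap that would explain a final count below 1,974 minus 216.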
A box plot is used to summarize the relationship between a numerical and a categorical variable.
Finally, we explore the relationship between family income and highest degree. We can see that a positive association exists, but the wider interquartile range in the college groups and the presence of outliers in the high school and less than high school groups mean that the relationship is not strong and that family income could be associated with other variables. We explore additional variables in the section below.
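The quantities a box plot displays per group (median and interquartile range) can also be computed directly. The sketch below uses simulated incomes, not the GSS data; the group centers passed to `rlnorm` are arbitrary assumptions chosen only to produce right-skewed values, and the names `sim`, `group_median` and `group_iqr` are ours.

```r
set.seed(1)
degree_levels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")

# Simulated right-skewed incomes, 50 per group (illustrative values only).
sim <- data.frame(
  degree = factor(rep(degree_levels, each = 50), levels = degree_levels),
  income = rlnorm(250,
                  meanlog = rep(log(c(20000, 35000, 40000, 65000, 85000)), each = 50),
                  sdlog = 0.6)
)

# Median and interquartile range of income within each degree group.
group_median <- tapply(sim$income, sim$degree, median)
group_iqr    <- tapply(sim$income, sim$degree, IQR)

print(round(cbind(median = group_median, IQR = group_iqr)))
```

Calling `boxplot(income ~ degree, data = sim)` on the same data frame would draw the side-by-side box plots.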
and other Variables like attitude towards life and job satisfaction
The box plot shows the relationship between income and education by gender. We can observe that, on average, people with higher education tend to have more income.
We observe that the median income of males is higher than the median income of females. However, the overall distribution of income looks similar.
## df_gss_background$sex: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     383   28720   51700   62320   76600  178700
## ------------------------------------------------------------
## df_gss_background$sex: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1532   22020   42130   53800   63200  178700
It is interesting to observe that there is some pattern in the relationship between age and income. However, the strength of the relationship varies between education levels and also by gender.
We can observe that people who give weight to both hard work and luck tend to have more variation in income, although no clear relationship is observed.
Null hypothesis: There is no difference between the mean incomes of the groups defined by highest educational level. In other words, highest educational level is not associated with income level. H0 : Mu(LHS) = Mu(HS) = Mu(JC) = Mu(B) = Mu(G)
Alternative hypothesis: There is a difference between the mean incomes of the groups defined by highest educational level. In other words, education level is associated with income level. HA : the average income in constant dollars, Mu(i), varies across some (or all) groups
The one-way analysis of variance (ANOVA) is used to determine whether there are any significant differences between the means of three or more independent (unrelated) groups. The one-way ANOVA compares the means of the groups we are interested in and determines whether any of those means are significantly different from each other. This is an appropriate approach since we are considering one categorical variable with more than two levels (education) and one numerical variable (income).
Specifically, it tests the null hypothesis: H0 : Mu(LHS) = Mu(HS) = Mu(JC) = Mu(B) = Mu(G)
If, however, the one-way ANOVA returns a significant result, we reject the null hypothesis in favor of the alternative hypothesis (HA), which is that at least 2 group means are significantly different from each other.
The following 3 key assumptions are made: 1. Independence of observations. The GSS data consist of a random sample of the American population, and the sample is definitely less than 10% of the population, so the observations can be considered independent. 2. Approximate normality of income within each group, checked in Figure 2. 3. Roughly equal variability of income across groups.
Figure 2: Checking for Data Normality
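Besides a visual check, the usual ANOVA conditions can be examined numerically. The sketch below is one possible approach on simulated data, not the check used in this study: it compares group standard deviations and runs a Shapiro-Wilk normality test per group. The data, the group means and the names `sim`, `group_sd`, `sd_ratio` and `shapiro_p` are all assumptions made for illustration.

```r
set.seed(7)
groups <- c("Lt High School", "High School", "Junior College",
            "Bachelor", "Graduate")

# Simulated incomes: normal within each group, with arbitrary group means.
sim <- data.frame(
  degree = factor(rep(groups, each = 40), levels = groups),
  income = rnorm(200, mean = rep(c(25, 40, 45, 70, 90) * 1000, each = 40),
                 sd = 15000)
)

# Condition: roughly equal spread. Compare the largest and smallest group SD.
group_sd <- tapply(sim$income, sim$degree, sd)
sd_ratio <- max(group_sd) / min(group_sd)

# Condition: approximate normality. Shapiro-Wilk p-value for each group.
shapiro_p <- sapply(split(sim$income, sim$degree),
                    function(x) shapiro.test(x)$p.value)

print(round(group_sd))
print(round(sd_ratio, 2))
print(round(shapiro_p, 3))
```

A small p-value from `shapiro.test` for a group would suggest a departure from normality, and `qqnorm()` with `qqline()` on each group gives the graphical counterpart. With strongly right-skewed income data, such checks may well flag violations.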
## Analysis of Variance Table
##
## Response: Family_Income
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)
## Highest_Degree    4 8.2832e+11 2.0708e+11  120.52 < 2.2e-16 ***
## Residuals      1747 3.0017e+12 1.7182e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of the Results: The ANOVA table above shows an F statistic of 120.52, which is the ratio of the variation between the groups to the variation within the groups, and a p-value of approximately zero. This means that the probability of observing an F value of 120.52 or higher, if the null hypothesis were true, is very low.
Hence we reject the null hypothesis and conclude that the average income in constant dollars varies across some (or all) groups in a statistically significant way.
To find out which groups differ, we compare them pairwise and apply a Bonferroni correction to the p-values, which are multiplied by the number of comparisons. With this correction, the difference between the means has to be larger to reject the null hypothesis.
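The F statistic and p-value can be reproduced from the sums of squares and degrees of freedom printed in the ANOVA table. This is a short arithmetic sketch using the rounded values from the table, so the result agrees with the reported figure only up to rounding; the variable names are ours.

```r
# Values copied from the ANOVA table above.
ss_between <- 8.2832e11; df_between <- 4      # Highest_Degree row
ss_within  <- 3.0017e12; df_within  <- 1747   # Residuals row

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group variability.
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution with (4, 1747) degrees of freedom.
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value < 2.2e-16)
```

The degrees of freedom also match the data: 5 groups give 5 - 1 = 4, and 1,752 observations give 1,752 - 5 = 1,747.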
##
## Pairwise comparisons using t tests with pooled SD
##
## data: Edu_income_study_data$Family_Income and Edu_income_study_data$Highest_Degree
##
##                Lt High School High School Junior College Bachelor
## High School    1.4e-06        -           -              -
## Junior College 3.2e-07        0.2140      -              -
## Bachelor       < 2e-16        < 2e-16     2.3e-10        -
## Graduate       < 2e-16        < 2e-16     < 2e-16        0.0011
##
## P value adjustment method: bonferroni
Interpretation of the Results: We can see that for nine of the ten group pairs the p-value is lower than the significance level of 0.05, so the null hypotheses are rejected: the difference between the means of each of these nine pairs is statistically significant. The null hypothesis is not rejected for the pair High School-Junior College. The difference between the means of this pair is not statistically significant and could be due to chance. Since we are using ANOVA, no other method is applicable, so there are no other results to compare against.
Notice that at the lower education levels (Lt High School, High School, and Junior College) the median income is under $50,000 annually, while the Bachelor and Graduate median incomes are near $100,000. We do see some correlation between education and income.
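The Bonferroni adjustment used above can be illustrated with a few lines of R. This is a sketch under stated assumptions: the raw p-values in `raw_p` are invented, and the second part runs `pairwise.t.test` on simulated data with hypothetical groups A, B and C rather than on the GSS data.

```r
# With 5 education groups there are choose(5, 2) pairwise comparisons.
n_groups <- 5
n_comparisons <- choose(n_groups, 2)

# Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1.
raw_p <- c(0.001, 0.004, 0.03)   # invented raw p-values
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)
print(n_comparisons)
print(adj_p)

# The same adjustment applied inside pairwise t tests on simulated data.
set.seed(42)
g <- factor(rep(c("A", "B", "C"), each = 20))
x <- rnorm(60, mean = rep(c(10, 10.5, 14), each = 20))
res <- pairwise.t.test(x, g, p.adjust.method = "bonferroni")
print(res$p.value)
```

Note that a raw p-value of 0.03 would be significant at the 0.05 level on its own, but after multiplying by 10 comparisons it is not. This is why the corrected test requires a larger difference between means.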
The median incomes of males and females are consistently different at education levels below Graduate. This is interesting: does it show that at the highest education level gender inequality is less visible? Or is it because of the widely held belief that women are more likely to be family focused, thus sacrificing higher education?
Notice that the income gaps (median vs. quartiles) are much larger at higher education levels. This suggests that people with higher education levels have more opportunity to maximize their income.
Notice that at the lower education levels (Lt High School and High School) age or work experience does not yield more income. For Junior College and Bachelor there is a slight uptick in annual income with age, while Graduate income increases with age or experience.
While we see a strong correlation between age and income for male graduates, the strength of the relationship is medium to low for female graduates. This may be due to female graduates dropping out of the workforce or taking less strenuous jobs while juggling work and personal commitments or making family commitments a top priority.
Also, for graduates the opportunity to maximize income is higher with increasing age and experience.
It is interesting to see that, in terms of attitude towards life, most people around the age of 35 to 45 think that hard work is important to get ahead in life. We do not see any strong relationship between attitude and education or gender.
Let's take a look at softer aspects such as job satisfaction vs. education. Notice that male and female Graduates and Bachelors who are very satisfied or moderately satisfied with their job have incomes distributed over a wider range, with a median income of around 100K for males and 75K for females.
Also, more educated groups such as graduates have a wider distribution of income compared to other groups. However, the IQR is very narrow for the lower education groups; it may be that the flexibility to change to jobs of interest adds to the satisfaction level of individuals.
We also observed some outliers during the data exploration stage, which may indicate that other variables are correlated with income and satisfaction. Some of the conditions for the statistical inference methods used were not fully met, so we have to be cautious in interpreting the results.
The study establishes a positive correlation between education level and family income in constant dollars for United States residents.
The study can be generalized to all United States residents since we used GSS survey data for 2012. We grouped family income in constant dollars by the highest degree earned by the interviewees (less than high school, high school, junior college, bachelor's, and graduate), and by visually exploring the data we noticed a positive correlation between the two variables.
We used the ANOVA method and pairwise comparisons to test our hypothesis that there is a significant difference in the mean incomes of the groups. Our analysis shows that there is a significant difference, the only exception being between the high school and junior college degrees.
From the above analysis we do see a correlation between education and economic success. Education, especially high-quality education, represents social mobility and opportunities, which can be interpreted as higher future income, social status, or simply satisfaction with new knowledge.
General Social Survey Cumulative File, 1972-2012 Coursera Extract. Modified for Data Analysis and Statistical Inference course (Duke University). R dataset could be downloaded at http://bit.ly/dasi_gss_data. Original data: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1 Persistent URL: http://doi.org/10.3886/ICPSR34802.v1