This project studies the relationship between the highest EDUCATION attained by United States residents and their family INCOME in constant dollars, using General Social Survey (GSS) data.
This study is important for students as well as policy makers. It helps policy makers develop and implement policies that provide access to education. It also helps students understand the long-term benefits of investing in a good education and its impact on the distribution of income and on future quality of life.
The study also contributes to the discussion around the role of education in wealth distribution, which can help bring positive changes in society.
The study uses General Social Survey (GSS) data for the year 2012. The data come from the cumulative GSS file covering surveys conducted between 1972 and 2012. Not all respondents answered all questions in all years.
There are a total of 57,061 cases and 114 variables in this dataset. Here is the link to the survey data: http://bit.ly/dasi_gss_data. The codebook lists all variables, the values they take, and the survey questions associated with them. Link to the detailed codebook for the data set: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html
Variable-1: ‘DEGREE’- Respondent's Highest Degree, which is a categorical variable
Variable-2: ‘CONINC’- Total Family Income in constant dollars, which is a continuous numerical variable
This is an observational study, not an experiment, since the data came from a survey rather than from an experiment with test and control groups. Hence it can establish correlation, not causation.
Generalizability: The population of interest is the entire US population. Respondents are sampled randomly for the survey. Hence, findings from the study can be generalized to the entire US population.
Causality: These data cannot be used to establish causal links between the variables of interest since this is an observational study and not an experiment.
There may be response bias and convenience bias since the data were collected using a survey methodology. These biases must be accounted for when drawing conclusions from this study.
Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1
Two key variables are extracted from the dataset for the year 2012:
1. ‘DEGREE’- Respondent's Highest Degree
2. ‘CONINC’- Total Family Income in constant dollars
Loading the Data set using R
load(url("http://bit.ly/dasi_gss_data"))
# Create a Subset for Education and Income for the year 2012
Edu_income_study_data <- subset(gss, year == 2012, select = c(degree, coninc))
# Rename the column names
colnames(Edu_income_study_data) <- c("Highest_Degree","Family_Income")
# Count the observations
nrow(Edu_income_study_data)
## [1] 1974
table(Edu_income_study_data$Highest_Degree)
##
## Lt High School    High School Junior College       Bachelor       Graduate
##            280            976            151            354            205
barplot(table(Edu_income_study_data$Highest_Degree),las=2,
main="Highest Degree",cex.axis= 1, cex.names=0.75)
# Bar Plot for Education Vs Frequency
hist(Edu_income_study_data$Family_Income, main="Family Income in constant USD",
xlab="USD", cex.axis=1)
# Histogram of family income in constant dollars
Highest Degree is a categorical variable. A categorical variable is summarized with a frequency table, a relative frequency table and a bar plot.
# Frequency table for Highest Degree
table(Edu_income_study_data$Highest_Degree)
##
## Lt High School    High School Junior College       Bachelor       Graduate
##            280            976            151            354            205
# Relative frequency table for Highest Degree
prop.table(table(Edu_income_study_data$Highest_Degree))
##
## Lt High School    High School Junior College       Bachelor       Graduate
##      0.1424212      0.4964395      0.0768057      0.1800610      0.1042726
We can see that high school is the highest degree for nearly 50% of the observations.
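The shares above can be reproduced directly from the printed counts. The following is a minimal sketch that re-enters those counts by hand instead of downloading the data. The names `degree_counts`, `degree_share` and `missing_degree` are illustrative and do not appear in the original analysis.

```r
# Counts copied from the table() output above (2012 respondents with a recorded degree)
degree_counts <- c("Lt High School" = 280, "High School" = 976,
                   "Junior College" = 151, "Bachelor" = 354, "Graduate" = 205)

# Relative frequencies, computed the same way prop.table() does
degree_share <- degree_counts / sum(degree_counts)

# Percentages rounded to one decimal place
print(round(100 * degree_share, 1))

# table() drops missing values by default, so the gap between the
# 1974 rows of the subset and the counted respondents is the number
# of respondents without a recorded degree
missing_degree <- 1974 - sum(degree_counts)
```

Since `table()` ignores missing values unless told otherwise, the denominator here is the number of respondents with a recorded degree, not the full 1974 rows.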
Family income in constant USD is a continuous numerical variable. We summarize it with mean, range and quantiles, and with a histogram
Summary
summary(Edu_income_study_data$Family_Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     383   16280   34470   48380   63200  178700     216
We can see that the distribution is right skewed and unimodal, with 50% of observations in the 16,300-63,200 USD (constant dollar) range and a maximum value of about 179,000 USD. There are some clear outliers in the upper tail of the distribution. There are 216 observations with missing income values, and 8 with a missing degree. Filtering out every row with a missing value brings the number of observations to 1752. The sample size remains large enough for the study.
# Filter NAs
Edu_income_study_data <- Edu_income_study_data[complete.cases(Edu_income_study_data), ]
# Count observations after the filter
nrow(Edu_income_study_data)
## [1] 1752
# Frequency table for Highest Degree after the filter
table(Edu_income_study_data$Highest_Degree)
##
## Lt High School    High School Junior College       Bachelor       Graduate
##            230            881            132            324            185
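The spread of income and the row counts reported above can be cross-checked with a few lines of arithmetic. This sketch re-enters the printed figures by hand, and all variable names are illustrative. Because `summary()` prints rounded values, the outlier fence is approximate.

```r
# Values copied from the summary() output above (rounded by summary())
q1 <- 16280
q3 <- 63200
max_income <- 178700

iqr <- q3 - q1                  # spread of the middle 50% of incomes
upper_fence <- q3 + 1.5 * iqr   # usual boxplot rule for flagging high outliers
has_upper_outliers <- max_income > upper_fence

# Row accounting for the complete.cases() filter
rows_before <- 1974
rows_after <- sum(c(230, 881, 132, 324, 185))  # post-filter group counts
rows_dropped <- rows_before - rows_after
```

The maximum lies well above the fence, which agrees with the outliers noted in the upper tail. The number of dropped rows exceeds 216 because `complete.cases()` also removes rows with a missing degree.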
# cex.names is a barplot() argument, not a boxplot() one, so it is dropped here
boxplot(Family_Income ~ Highest_Degree, data = Edu_income_study_data,
main="Family Income by Highest Degree", xlab="Highest Degree", ylab="USD",
cex.axis= 0.75)
# Boxplot of family income by highest degree
Finally, we explore the relationship between family income and highest degree. We can see that there is a positive association, but the wider interquartile range in the college groups and the presence of outliers in the high school and less than high school groups mean that the relationship is not strong and that family income could be associated with other variables, such as workplace and economic concerns.
Null Hypothesis: The mean family income is the same across all groups defined by highest educational level. In other words, highest educational level is not associated with income level. H0 : Mu(LHS) = Mu(HS) = Mu(JC) = Mu(B) = Mu(G)
Alternative Hypothesis: The mean family income differs between at least two of the groups defined by highest educational level. In other words, educational level is associated with income level. HA : the average income in constant dollars, Mu(i), differs for at least one pair of groups
The one-way analysis of variance (ANOVA) is used to determine whether there are any significant differences between the means of three or more independent (unrelated) groups. It compares the means of the groups we are interested in and determines whether at least one of them differs significantly from the others. It is an appropriate approach here since we are considering one categorical variable (EDUCATION) and one numerical variable (INCOME).
Specifically, it tests the null hypothesis: H0 : Mu(LHS) = Mu(HS) = Mu(JC) = Mu(B) = Mu(G)
If, however, the one-way ANOVA returns a significant result, we reject the null hypothesis in favor of the alternative hypothesis (HA), which is that at least two group means differ significantly from each other.
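To make the mechanics of the test concrete, the F statistic can be computed by hand and compared with R's built-in ANOVA table. This is a minimal sketch on a small made-up data set. The data frame `toy` and its values are hypothetical and have nothing to do with the GSS sample.

```r
# Hypothetical toy data: three groups of four observations each
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  value = c(10, 12, 11, 13,  14, 15, 13, 16,  20, 19, 21, 22)
)

grand_mean  <- mean(toy$value)
group_means <- tapply(toy$value, toy$group, mean)
group_sizes <- tapply(toy$value, toy$group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((toy$value - group_means[as.character(toy$group)])^2)

# Degrees of freedom: groups minus one, observations minus groups
df_between <- nlevels(toy$group) - 1
df_within  <- nrow(toy) - nlevels(toy$group)

# F statistic: mean square between groups over mean square within groups
f_manual <- (ss_between / df_between) / (ss_within / df_within)

# The same statistic taken from R's ANOVA table
f_builtin <- anova(lm(value ~ group, data = toy))[["F value"]][1]
```

A large F value means the group means are spread far apart relative to the variation inside each group, which is the evidence against H0.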
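The following three key assumptions are made: 1. Independence of observations: the GSS is a random sample of the American population, and the sample is well below 10% of that population, so the observations can be considered independent. 2. Approximate normality of income within each group, examined with the normal Q-Q plots below. 3. Roughly equal variability across groups, which can be judged from the side-by-side box plots above.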
# Create a plot grid for 5 graphs in a row
par(mfrow = c(1, 5))
# Plot a normal Q-Q graph for each group based on educational level
degrees <- c("Lt High School", "High School", "Junior College",
             "Bachelor", "Graduate")
for (d in degrees) {
  # Family income of the respondents in this degree group
  group_income <- Edu_income_study_data$Family_Income[
    Edu_income_study_data$Highest_Degree == d]
  qqnorm(group_income, main = d)
  qqline(group_income)
}
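The Q-Q plots are a visual check. A numerical complement is sketched below on simulated data, because the original report prints no such figures. The data frame `sim`, its group labels and the log-normal parameters are all hypothetical stand-ins for right-skewed incomes.

```r
set.seed(42)  # reproducible hypothetical data, not the GSS sample

# Simulated right-skewed "incomes" for three made-up groups
sim <- data.frame(
  group  = factor(rep(c("g1", "g2", "g3"), each = 200)),
  income = c(rlnorm(200, meanlog = 10.0, sdlog = 0.8),
             rlnorm(200, meanlog = 10.4, sdlog = 0.8),
             rlnorm(200, meanlog = 10.8, sdlog = 0.8))
)

# Constant-variance condition: compare the group standard deviations
group_sd <- tapply(sim$income, sim$group, sd)
sd_ratio <- max(group_sd) / min(group_sd)

# Normality condition: Shapiro-Wilk p-value for each group
shapiro_p <- tapply(sim$income, sim$group,
                    function(x) shapiro.test(x)$p.value)
```

Applied to the real data, the same two calls would use `Family_Income` and `Highest_Degree`. Small Shapiro-Wilk p-values or a large ratio of standard deviations would point to the same departures from the conditions that skewed Q-Q plots show.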
# ANOVA for the mean income grouped by degree
anova(lm(Family_Income ~ Highest_Degree, data=Edu_income_study_data ))
## Analysis of Variance Table
##
## Response: Family_Income
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)
## Highest_Degree    4 8.2832e+11 2.0708e+11  120.52 < 2.2e-16 ***
## Residuals      1747 3.0017e+12 1.7182e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of the Results: The ANOVA table above shows an F statistic of 120.52, which is the ratio of the variation between the groups to the variation within the groups. The p-value is approximately zero (below 2.2e-16). This means that the probability of observing an F value of 120.52 or higher, if the null hypothesis were true, is very low.
Hence we can reject the null hypothesis and say that the average income in constant dollars varies across some (or all) groups in a statistically significant way.
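The entries of the ANOVA table are linked by simple arithmetic, so the F statistic can be rebuilt from the printed sums of squares. The sketch below does so with values copied by hand from the table, using illustrative variable names. The last line computes eta squared, a measure that the original analysis does not report and that is added here only as an illustration.

```r
# Values copied from the ANOVA table above
ss_group <- 8.2832e11; df_group <- 4      # Highest_Degree row
ss_resid <- 3.0017e12; df_resid <- 1747   # Residuals row

# Mean squares are sums of squares divided by their degrees of freedom
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid
f_stat   <- ms_group / ms_resid

# Upper-tail probability of the F distribution under the null hypothesis
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)

# Share of the income variance accounted for by degree (eta squared)
eta_sq <- ss_group / (ss_group + ss_resid)
```

Under these inputs degree accounts for roughly a fifth of the variance in family income. This is consistent with the earlier observation that the association is clear but not strong.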
Because the ANOVA result is significant, we compare the groups pair by pair with t tests. We apply a Bonferroni correction, which multiplies each p-value by the number of comparisons. With this correction, the difference between two means has to be larger for the null hypothesis to be rejected.
# Pairwise t test for the mean income grouped by degree
# With Bonferroni correction
pairwise.t.test(Edu_income_study_data$Family_Income,
Edu_income_study_data$Highest_Degree,
p.adj="bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: Edu_income_study_data$Family_Income and Edu_income_study_data$Highest_Degree
##
##                Lt High School High School Junior College Bachelor
## High School    1.4e-06        -           -              -
## Junior College 3.2e-07        0.2140      -              -
## Bachelor       < 2e-16        < 2e-16     2.3e-10        -
## Graduate       < 2e-16        < 2e-16     < 2e-16        0.0011
##
## P value adjustment method: bonferroni
Interpretation of the Results: We can see that for nine group pairs the p-value is lower than the significance level of 0.05, so the null hypotheses are rejected: the difference between the means of each of these nine pairs is statistically significant. The null hypothesis is not rejected for the pair High School-Junior College: the difference between the means of this pair is not statistically significant and may be due to chance. Since ANOVA is the only inference method applicable here, there are no results from another method to compare with.
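The effect of the Bonferroni correction can be traced with a short calculation. This is a sketch under the assumption that the adjustment was applied across all ten pairwise comparisons, which is how `pairwise.t.test` handles its p-values by default. The vector `raw_p` holds made-up values used only to show the mechanism.

```r
n_groups <- 5
n_comparisons <- choose(n_groups, 2)   # number of distinct group pairs

# Bonferroni: multiply each raw p-value by the number of tests, capped at 1
raw_p <- c(0.0005, 0.0214, 0.2)        # hypothetical raw p-values
manual_adj  <- pmin(1, raw_p * n_comparisons)
builtin_adj <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)

# Back out the unadjusted p-value for High School vs Junior College
adj_hs_jc <- 0.2140                    # adjusted value from the output above
raw_hs_jc <- adj_hs_jc / n_comparisons

# Equivalent view: compare raw p-values with a stricter threshold
alpha_per_test <- 0.05 / n_comparisons
```

Under that assumption the unadjusted p-value for High School versus Junior College would fall below 0.05 but above the per-test threshold. It is therefore the correction for multiple comparisons that keeps this pair from being declared significant.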
The study finds a positive association between education level and family income in constant dollars for United States residents.
The study can be generalized to all United States residents since we used GSS survey data for 2012. We grouped family income in constant dollars by the highest degree earned by the respondents (less than high school, high school, junior college, bachelor’s and graduate), and by visually exploring the data we noticed a positive association between the two variables.
We used ANOVA and pairwise comparisons to test the hypothesis that there is a significant difference in the mean incomes of the groups. Our analysis shows that there is a significant difference, the only exception being between the high school and junior college groups.
We observed some outliers during the data exploration stage, which may indicate that other variables are also associated with income. Some of the conditions for the statistical inference methods used were not fully met, so we have to be cautious in interpreting the results.
Further analysis using more sophisticated techniques could address these shortcomings and examine the impact of other variables. It would also be interesting to repeat the analysis for each survey year and compare the results.
General Social Survey Cumulative File, 1972-2012 Coursera Extract. Modified for the Data Analysis and Statistical Inference course (Duke University). The R dataset can be downloaded at http://bit.ly/dasi_gss_data. Original data: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut / Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1 Persistent URL: http://doi.org/10.3886/ICPSR34802.v1