Educational attainment rates for 25- to 29-year-olds increased at all levels between 2000 and 2017. During this time, the percentage who had completed high school increased from 88 to 92 percent, the percentage with an associate’s or higher degree increased from 38 to 46 percent, the percentage with a bachelor’s or higher degree increased from 29 to 36 percent, and the percentage with a master’s or higher degree increased from 5 to 9 percent.
Is Education and/or area of study predictive of future earnings? Is it possible to model the relationship between education and salary?
These statistics are generated by matching university transcript data with a national database of jobs. The PSEO are made possible through data sharing partnerships between universities, university systems, State Departments of Education, State Labor Market Information offices, and the U.S. Census Bureau. PSEO data are currently only available for post-graduate institutions whose transcript data has been made available to Census Bureau through a data-sharing agreement. This is an observational study. Each case represents a student who received a degree or certificate from n Colorado or Texas educational institution.
There is 8367 observations in the given data set.
library(psych)
library(ggplot2)
library(xtable)
# Load data from Longitudinal Employer-Household Dynamics (LEHD) program website at the U.S. Census Bureau.
data_source <- 'https://lehd.ces.census.gov/data/pseo/graduate_earnings_all.csv'
data <- read.csv(data_source)
# There are a lot of columns in these data frames. We will select only relevant columns.
data <- read.csv(data_source)
data <- data[,c('deglevl','deglevl_code', 'ciptitle', 'p25_earnings', 'p50_earnings', 'p75_earnings')]
# Since LEHD reports the 25th, 50th and 75th percentile, we will use the mean of
# these two as a proxy of the median (i.e center of the distribution).
data$Salary <- (data$p25_earnings + data$p50_earnings + data$p75_earnings)/3
#We will rename the column names to be more descriptive
data <- data[,c('deglevl','deglevl_code', 'ciptitle', 'Salary')]
names(data)<-c('Degree','DegreeCode', 'Major','Salary')
data$Salary <- as.numeric(data$Salary)
data <- data[which(data$Salary > 0),]
data <- data[which((data$Degree=='Associates'|
data$Degree=='Baccalaureate'|
data$Degree=='Masters'|
data$Degree=='Doctoral - Research/Scholarship'|
data$Degree=='Doctoral - Professional Practice')) , ]
data$Degree <- factor(data$Degree)
degreeLabelLong <- c("Associates",
"Baccalaureate",
"Masters",
"Doctoral - Research/Scholarship",
"Doctoral - Professional Practice")
degreeLabels <- c( "Assoc",
"Bacc",
"Masters",
"Dr.Research",
"Dr.Practice")
data$Degree <- factor(data$Degree,
levels= degreeLabelLong,
labels=degreeLabels)
data$Degree <- factor(data$Degree)
datax <- data[which(data$Major != 'PETROLEUM ENGINEERING'),]
Below table displays summary salary data. The table clearly shows that there is a clear difference in median/mean and salary range.
Assoc = data[which(data$Degree=='Assoc'),]
Bacc = data[which(data$Degree=='Bacc'),]
Masters = data[which(data$Degree=='Masters'),]
Dr.Research = data[which(data$Degree=='Dr.Research'),]
Dr.Practice = data[which(data$Degree=='Dr.Practice'),]
dataSummary <- rbind(Assoc = summary(Assoc$Salary), Bacc = summary(Bacc$Salary), Masters = summary(Masters$Salary),
Dr.Practice = summary(Dr.Practice$Salary), Dr.Research = summary(Dr.Research$Salary))
kable(dataSummary, escape = FALSE)%>% kable_styling(bootstrap_options = c("striped", "hover",'condensed') )
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
---|---|---|---|---|---|---|
Assoc | 19903.00 | 33297.67 | 42051.33 | 43242.50 | 51817.00 | 107224.0 |
Bacc | 18319.33 | 35767.50 | 45357.67 | 48875.38 | 57182.83 | 329249.0 |
Masters | 24482.00 | 49222.50 | 59357.67 | 64814.41 | 76997.00 | 171280.7 |
Dr.Practice | 50521.33 | 65636.00 | 94736.33 | 106951.90 | 130962.50 | 295536.7 |
Dr.Research | 38687.67 | 63094.08 | 78024.17 | 82360.57 | 95356.17 | 174588.0 |
Box plot also shows the difference in the mean/median and salary range for different degrees. It also shows that there are some outliers especially for Baccalaureate and Doctoral - Professional Practice degrees.
ggplot(data, aes(x = Degree, y = Salary, fill = Degree)) +
geom_boxplot() +scale_fill_brewer(palette="Dark2")+ scale_y_continuous(labels = scales::comma)
Conditions for inference:
The sample size is not more than 10% of the population.
rowsSummary <- data.frame(table(data$Degree))
names(rowsSummary)<-c('Degree','Number of Observations')
kable(rowsSummary, escape = FALSE)%>% kable_styling(bootstrap_options = c("striped", "hover",'condensed') )
Degree | Number of Observations |
---|---|
Assoc | 837 |
Bacc | 5814 |
Masters | 1231 |
Dr.Research | 382 |
Dr.Practice | 103 |
Histograms and QQ plots show that the distribution is not nearly normal. The distribution is skewed. Since our sample size is significantly more than 30, skewness should not be a concern.
Assume data was collected randomly.
Since we are comparing more than three means, we will use ANOVA to analyze the differences among group means in our sample.
Based on the plots, variance among groups are not equal. The sample size is not the same for all the groups.
Based on the histogram, distribution is skewed but our sample size significantly more than 30.
No one's salary is reported under multiple degrees.
\[ Hypothesis \\ H_0 : All~means~are~equal\\ H_A : All~~means~~are~~not~~equal\\ \]
salaryModel <- lm(data$Salary ~ data$Degree, data = data)
anova(salaryModel)
## Analysis of Variance Table
##
## Response: data$Salary
## Df Sum Sq Mean Sq F value Pr(>F)
## data$Degree 4 9.7945e+11 2.4486e+11 578.06 < 2.2e-16 ***
## Residuals 8362 3.5421e+12 4.2359e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on ANOVA results, the p-value is less than our significance level, so we reject the null hypothesis. We could say with 95% confidence that the mean salaries are different for different degrees.
summary(salaryModel)
##
## Call:
## lm(formula = data$Salary ~ data$Degree, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56431 -13520 -3614 9214 280374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43242.5 711.4 60.785 < 2e-16 ***
## data$DegreeBacc 5632.9 760.9 7.403 1.46e-13 ***
## data$DegreeMasters 21571.9 922.1 23.395 < 2e-16 ***
## data$DegreeDr.Research 39118.1 1270.8 30.782 < 2e-16 ***
## data$DegreeDr.Practice 63709.4 2149.1 29.645 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20580 on 8362 degrees of freedom
## Multiple R-squared: 0.2166, Adjusted R-squared: 0.2162
## F-statistic: 578.1 on 4 and 8362 DF, p-value: < 2.2e-16
\[ \begin{aligned} Salary = 43242.5 + (5632.9 \times Bacc)+ (21571.9 \times Masters) + (39118.1\times Dr.Resarch) + (63709.4 \times Dr.Prractice) \end{aligned} \]
qq normal plot shows the distribution of residuals is right skewed.
Residuals variability is not constant. There are alot of points above 50,000 but not below -50,000.
hist(salaryModel$residuals)
qqnorm(salaryModel$residuals)
qqline(salaryModel$residuals)
plot(salaryModel$residuals)
abline(h = 0, lty = 3)
Adjusted R-squared: 0.2162.
This model explains only 22% variability. This might not be a good model to predict salary. I have created another model that includes the major, that is a better model with an Adjusted R-squared: 0.5616.
salaryMajorModel <- lm(data$Salary ~ data$Degree+data$Major, data = data)
summary(salaryMajorModel)$adj.r.squared
## [1] 0.561587
There is a difference in mean salaries of degrees. Higher the degree, the higher the salary. I wouldn’t have any problems with generalizing these findings to anyone who received a degree from Texas or Colorado. One of the concerns I have is that, some of the degrees offered in these universities are not offered in other parts of the country. For example, Petroleum engineering is not a major offered by universities throughout the US. In this data set, salaries of Petroleum engineer is an influential point. This is a concern I have about generalizing the results to the entire US population. The model with degree alone explains only 22% of the variation.The model with area of study/ major is a far better predictor with an Adjusted R-squared value of .56.