DATA 606 Data Project

Part 1 - Introduction

Educational attainment rates for 25- to 29-year-olds increased at all levels between 2000 and 2017. During this time, the percentage who had completed high school increased from 88 to 92 percent, the percentage with an associate’s or higher degree increased from 38 to 46 percent, the percentage with a bachelor’s or higher degree increased from 29 to 36 percent, and the percentage with a master’s or higher degree increased from 5 to 9 percent.

Research Question

Is Education and/or area of study predictive of future earnings? Is it possible to model the relationship between education and salary?

Part 2 - Data

These statistics are generated by matching university transcript data with a national database of jobs. The PSEO are made possible through data sharing partnerships between universities, university systems, State Departments of Education, State Labor Market Information offices, and the U.S. Census Bureau. PSEO data are currently only available for post-graduate institutions whose transcript data has been made available to Census Bureau through a data-sharing agreement. This is an observational study. Each case represents a student who received a degree or certificate from n Colorado or Texas educational institution.

There is 8367 observations in the given data set.

library(psych)
library(ggplot2)
library(xtable)
# Load data from Longitudinal Employer-Household Dynamics (LEHD) program website at the U.S. Census Bureau.
 
data_source <- 'https://lehd.ces.census.gov/data/pseo/graduate_earnings_all.csv'
data <- read.csv(data_source)

# There are a lot of columns in these data frames. We will select only relevant columns. 
data <- read.csv(data_source)
data <- data[,c('deglevl','deglevl_code', 'ciptitle', 'p25_earnings', 'p50_earnings', 'p75_earnings')]

# Since LEHD reports the 25th, 50th and 75th percentile, we will use the mean of 
# these two as a proxy of the median (i.e center of the distribution).
data$Salary <- (data$p25_earnings + data$p50_earnings + data$p75_earnings)/3

#We will rename the column names to be more descriptive
data <- data[,c('deglevl','deglevl_code', 'ciptitle', 'Salary')]
names(data)<-c('Degree','DegreeCode', 'Major','Salary')

data$Salary <- as.numeric(data$Salary)

data <- data[which(data$Salary > 0),]

data <- data[which((data$Degree=='Associates'|
                  data$Degree=='Baccalaureate'|
                  data$Degree=='Masters'|
                  data$Degree=='Doctoral - Research/Scholarship'|
                  data$Degree=='Doctoral - Professional Practice')) , ]

data$Degree <- factor(data$Degree)

degreeLabelLong <-  c("Associates", 
                      "Baccalaureate", 
                      "Masters",
                      "Doctoral - Research/Scholarship", 
                      "Doctoral - Professional Practice")

degreeLabels <- c( "Assoc", 
                      "Bacc", 
                      "Masters",
                      "Dr.Research", 
                      "Dr.Practice")


data$Degree <- factor(data$Degree, 
                        levels= degreeLabelLong, 
                        labels=degreeLabels)
data$Degree <- factor(data$Degree)
datax <- data[which(data$Major != 'PETROLEUM ENGINEERING'),]

Part 3 - Exploratory data analysis

Below table displays summary salary data. The table clearly shows that there is a clear difference in median/mean and salary range.

Assoc = data[which(data$Degree=='Assoc'),]
Bacc = data[which(data$Degree=='Bacc'),]
Masters = data[which(data$Degree=='Masters'),]
Dr.Research = data[which(data$Degree=='Dr.Research'),]
Dr.Practice = data[which(data$Degree=='Dr.Practice'),]
dataSummary <- rbind(Assoc = summary(Assoc$Salary), Bacc = summary(Bacc$Salary), Masters = summary(Masters$Salary),
      Dr.Practice = summary(Dr.Practice$Salary), Dr.Research = summary(Dr.Research$Salary))
 
kable(dataSummary, escape = FALSE)%>% kable_styling(bootstrap_options = c("striped", "hover",'condensed') )

	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Assoc	19903.00	33297.67	42051.33	43242.50	51817.00	107224.0
Bacc	18319.33	35767.50	45357.67	48875.38	57182.83	329249.0
Masters	24482.00	49222.50	59357.67	64814.41	76997.00	171280.7
Dr.Practice	50521.33	65636.00	94736.33	106951.90	130962.50	295536.7
Dr.Research	38687.67	63094.08	78024.17	82360.57	95356.17	174588.0

Box plot also shows the difference in the mean/median and salary range for different degrees. It also shows that there are some outliers especially for Baccalaureate and Doctoral - Professional Practice degrees.

ggplot(data, aes(x = Degree, y = Salary, fill = Degree)) +
  geom_boxplot() +scale_fill_brewer(palette="Dark2")+ scale_y_continuous(labels = scales::comma)

Part 4 - Inference

Conditions for inference:

Independent

The sample size is not  more than 10% of the population.

rowsSummary <- data.frame(table(data$Degree))
names(rowsSummary)<-c('Degree','Number of Observations')
kable(rowsSummary, escape = FALSE)%>% kable_styling(bootstrap_options = c("striped", "hover",'condensed') )

Degree	Number of Observations
Assoc	837
Bacc	5814
Masters	1231
Dr.Research	382
Dr.Practice	103

Normal

Histograms and QQ plots show that the distribution is not nearly normal. The distribution is skewed. Since our sample size is significantly more than 30, skewness should not be a concern.

Random

Assume data was collected randomly.

Since we are comparing more than three means, we will use ANOVA to analyze the differences among group means in our sample.

Condtions for ANOVA

Equal variances among groups

Based on the plots, variance among groups are not equal. The sample size is not the same for all the groups.

Normal distribution within each group

Based on the histogram, distribution is skewed but our sample size significantly more than 30.

Observations are independent of all other.

No one's salary is reported under multiple degrees.

\[ Hypothesis \\ H_0 : All~means~are~equal\\ H_A : All~~means~~are~~not~~equal\\ \]

salaryModel <- lm(data$Salary ~ data$Degree, data = data)

anova(salaryModel)

## Analysis of Variance Table
## 
## Response: data$Salary
##               Df     Sum Sq    Mean Sq F value    Pr(>F)    
## data$Degree    4 9.7945e+11 2.4486e+11  578.06 < 2.2e-16 ***
## Residuals   8362 3.5421e+12 4.2359e+08                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on ANOVA results, the p-value is less than our significance level, so we reject the null hypothesis. We could say with 95% confidence that the mean salaries are different for different degrees.

Model

summary(salaryModel)

## 
## Call:
## lm(formula = data$Salary ~ data$Degree, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56431 -13520  -3614   9214 280374 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             43242.5      711.4  60.785  < 2e-16 ***
## data$DegreeBacc          5632.9      760.9   7.403 1.46e-13 ***
## data$DegreeMasters      21571.9      922.1  23.395  < 2e-16 ***
## data$DegreeDr.Research  39118.1     1270.8  30.782  < 2e-16 ***
## data$DegreeDr.Practice  63709.4     2149.1  29.645  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20580 on 8362 degrees of freedom
## Multiple R-squared:  0.2166, Adjusted R-squared:  0.2162 
## F-statistic: 578.1 on 4 and 8362 DF,  p-value: < 2.2e-16

\[ \begin{aligned} Salary = 43242.5 + (5632.9 \times Bacc)+ (21571.9 \times Masters) + (39118.1\times Dr.Resarch) + (63709.4 \times Dr.Prractice) \end{aligned} \]

Model Diagnostic plots

Nearly normal distirbution

qq normal plot shows the distribution of residuals is right skewed.

Constant variability

Residuals variability is not constant. There are alot of points above 50,000 but not below -50,000.

hist(salaryModel$residuals)

qqnorm(salaryModel$residuals)
qqline(salaryModel$residuals)

plot(salaryModel$residuals)
abline(h = 0, lty = 3)

Adjusted R-squared:  0.2162. 
This model explains only 22% variability. This might not be a good model to predict  salary. I have created another model that includes the major, that is a better model with an Adjusted R-squared: 0.5616.

salaryMajorModel <- lm(data$Salary ~ data$Degree+data$Major, data = data)

summary(salaryMajorModel)$adj.r.squared

## [1] 0.561587

Part 5 - Conclusion

There is a difference in mean salaries of degrees. Higher the degree, the higher the salary. I wouldn’t have any problems with generalizing these findings to anyone who received a degree from Texas or Colorado. One of the concerns I have is that, some of the degrees offered in these universities are not offered in other parts of the country. For example, Petroleum engineering is not a major offered by universities throughout the US. In this data set, salaries of Petroleum engineer is an influential point. This is a concern I have about generalizing the results to the entire US population. The model with degree alone explains only 22% of the variation.The model with area of study/ major is a far better predictor with an Adjusted R-squared value of .56.

References

https://nces.ed.gov/programs/coe/indicator_caa.asp

https://lehd.ces.census.gov/data/pseo_beta.html