In the United States, a college education has come to be viewed as a means of progressing economically in life. The Bureau of Labor Statistics finds that full-time workers with just a high school diploma had median weekly earnings of $718, while those with a bachelor’s degree had median weekly earnings of $1,189 (bls.gov, 2020). College is important, but it has also become increasingly expensive. For example, in 1989 the cost of a 4-year degree (adjusted for inflation) averaged $52,892, while in 2016 the cost of the same degree had risen to $104,480 on average (educationdata.org, 2020). Given the increasing income inequality in the United States, these rising costs act as a barrier to under-privileged and working-class students moving into the middle class and beyond.
The following research attempts to gain an understanding of the factors that produce increasing college costs. The data for this study come from the College Scorecard and span the years 2017-2018 (collegescorecard.ed.gov, 2020). The variable of interest (the dependent variable) in this analysis is the average annual total cost of attending college.
The units of analysis for this study are 7,112 colleges and universities in the United States. The data for this research can be found here: https://collegescorecard.ed.gov/data/. The independent variables (factors) used to try to explain college costs are: adm_rate, sat_avg, first_gen, median_hh_inc, poverty_rate, unemp_rate, age_entry, female, ugds, and ugds_white.
The dependent variable is costt4_a (the average annual total cost of attendance, including tuition and fees, books and supplies). Various plotting techniques and correlation analysis were used in both exploratory data analysis and variable selection. Additionally, Bayesian model averaging was used in variable selection. Variable selection was carried out to produce the most parsimonious final model(s) possible. Parsimonious models are not only simpler to interpret; they also reduce the confounding of effects.
The exploratory data analysis of total cost versus average SAT suggests that two populations of colleges exist. The two populations were split from each other and regression analysis was performed on both. All of the independent variables were found to be statistically significant in both regression models, with one exception: adm_rate in the private school model.
The exploratory analysis begins with a correlogram showing robust correlation coefficients (Mosteller & Tukey, 1977) for each pair of variables in the data set.
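#Load the packages used in this analysis
library(tidyverse)    # ggplot2, dplyr, tidyr
library(BAS)          # bas.lm
library(ggstatsplot)  # ggcorrmat
library(patchwork)    # p1 / p2 plot composition
library(GGally)       # ggpairs
#Check correlations between variables for columns 1 to 11 of ColSco
set.seed(123) # for reproducibility
ggcorrmat(data = ColSco[, 1:11],
          type = "robust", # correlation method
          p.adjust.method = "holm", # p-value adjustment method for multiple comparisons
          matrix.type = "upper", # type of visualization matrix
          colors = c("#B2182B", "white", "#4000FF"),
          title = "Correlation Matrix")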
Figure 1: Correlogram of all variables
Figure 1 shows that median_hh_inc and sat_avg are moderately positively correlated with costt4_a, while first_gen is moderately negatively correlated with costt4_a. Also note that median_hh_inc, poverty_rate, and unemp_rate are all moderately to highly inter-correlated. This makes sense in that they are all, through different means, measuring the same thing: economic well-being. This inter-correlation suggests possible multicollinearity issues.
#Exploratory Variable selection using Bayesian Model Averaging
model1 <- bas.lm(costt4_a ~ .,
data = ColSco[, 1:11],
prior = "ZS-null",
modelprior = uniform(), initprobs = "eplogp",
force.heredity = FALSE, pivot = TRUE)
## Warning in bas.lm(costt4_a ~ ., data = ColSco[, 1:11], prior = "ZS-null", :
## dropping 5891 rows due to missing data
image(model1, rotate = FALSE)
Figure 2: Bayesian Variable Selection
The Bayesian model averaging analysis suggests, based on log posterior odds (7.08), that the best model is one that excludes the independent variables age_entry, poverty_rate, and unemp_rate. The variables poverty_rate and unemp_rate are probably excluded because of their multicollinearity with median_hh_inc. Put simply, if median_hh_inc is in the model, then poverty_rate and unemp_rate add very little new information.
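One common way to quantify the kind of multicollinearity suspected here is the variance inflation factor (VIF). The sketch below is not part of the original analysis: it uses simulated data, and the variable names are borrowed from the report purely for illustration. It computes each VIF by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the others.

```r
# Simulated data only: three correlated proxies for economic well-being.
# The names mirror the report's variables, but every value is made up.
set.seed(42)
n <- 500
wellbeing <- rnorm(n)  # hypothetical latent factor driving all three
sim <- data.frame(
  median_hh_inc = 60000 + 15000 * wellbeing + rnorm(n, sd = 5000),
  poverty_rate  = 12 - 4 * wellbeing + rnorm(n, sd = 1.5),
  unemp_rate    = 5 - 1.2 * wellbeing + rnorm(n, sd = 0.6)
)

# VIF for predictor v is 1 / (1 - R^2), where R^2 comes from
# regressing v on all of the other predictors.
vif <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

Commonly cited rules of thumb treat VIF values above roughly 5 or 10 as a sign that a predictor is largely redundant with the others; applying the same calculation to the real ColSco columns would show whether that holds here.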
Based on the correlogram and Bayesian model averaging, the variables female, poverty_rate, unemp_rate, age_entry, ugds, and ugds_white are dropped from further consideration.
#Based on the correlations and the variable selection method,
#the following variables will be dropped (deleted) from
#the data set:
ColSco2 = ColSco
ColSco2$female <-NULL
ColSco2$poverty_rate <-NULL
ColSco2$unemp_rate <-NULL
ColSco2$age_entry <-NULL
ColSco2$ugds <-NULL
ColSco2$ugds_white <-NULL
The remaining variables will be more closely examined using a scatter plot matrix.
#Remove all observations with missing values
ColSco3 = drop_na(ColSco2)
#Check correlations and distributions for columns 1 to 5 of ColSco3
ggpairs(ColSco3[, 1:5])
Figure 3: A scatter plot matrix
In Figure 3, the density plot for costt4_a is bimodal (has two humps), and in the scatter plot of costt4_a against sat_avg the observations can be seen diverging (splitting) from one another. These visuals suggest that there are two different populations of colleges and universities. To explore this divergence, a simple bivariate regression is run in which the dependent variable is costt4_a and the independent variable is sat_avg, and the residuals from this model are used to split the observations (colleges) into those below the regression line (negative residuals) and those above it (positive residuals). A random sample of ten observations from each group provides some insight into the differences between the two. The sample reveals that most of the colleges below the line are public schools, while most of those above it are private schools. Public and private institutions are qualitatively different types of colleges and hence two different populations.
p1 = ggplot(ColSco3, aes(x = sat_avg, y = costt4_a)) +
geom_point()+
geom_smooth(method='lm')
#p1
#Using regression to split two populations (?) of colleges
model2 = lm(costt4_a ~ sat_avg, data = ColSco3)
#extract the residuals
e = residuals(model2)
#add a new dummy variable (1s and 0s) to ColSco3
#called abovebelow: 1 = above the regression line, 0 = below
ColSco3$abovebelow = (1 * (e > 0))
#Random sample of the two types of colleges
#Note a seed is not set so results will vary
ColSco3 %>% group_by(abovebelow) %>% select(instnm, costt4_a, sat_avg) %>%
slice_sample(n = 10) %>%
rename(college = instnm, cost = costt4_a, sat = sat_avg) %>%
knitr::kable()
## Adding missing grouping variables: `abovebelow`
| abovebelow | college | cost | sat |
|---|---|---|---|
| 0 | Saginaw Valley State University | 20061 | 1101 |
| 0 | Kutztown University of Pennsylvania | 25274 | 1056 |
| 0 | Fayetteville State University | 15463 | 955 |
| 0 | University of South Carolina-Upstate | 22583 | 1010 |
| 0 | Fort Lewis College | 25064 | 1123 |
| 0 | University of Minnesota-Duluth | 23913 | 1185 |
| 0 | University of Illinois at Springfield | 23966 | 1159 |
| 0 | Marshall University | 18315 | 1099 |
| 0 | University of Wisconsin-Whitewater | 17480 | 1105 |
| 0 | Pennsylvania State University-Penn State Mont Alto | 27352 | 1072 |
| 1 | Walsh University | 40677 | 1140 |
| 1 | Indiana Wesleyan University-Marion | 36871 | 1138 |
| 1 | Arizona Christian University | 37675 | 956 |
| 1 | Heidelberg University | 41566 | 1088 |
| 1 | Stanford University | 66696 | 1489 |
| 1 | Eastern University | 44446 | 1106 |
| 1 | Washington Adventist University | 32240 | 957 |
| 1 | Aurora University | 33781 | 1062 |
| 1 | Nebraska Christian College of Hope International University | 28800 | 989 |
| 1 | St. Mary’s University | 40087 | 1139 |
#Convert abovebelow into a factor, labeling colleges below the
#regression line (0) "Public" and those above it (1) "Private"
ColSco3$PublicPrivate = as.factor(ifelse(ColSco3$abovebelow == 0, "Public", "Private"))
#Scatter plot of costt4_a versus sat_avg, faceted by the
#two groups of colleges
p2 = ggplot(ColSco3, aes(x = sat_avg, y = costt4_a)) +
geom_point() +
geom_smooth(method='lm') +
facet_wrap(vars(PublicPrivate))
p1 / p2
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Figure 4: A scatter plot of the entire population of colleges and its two sub-populations
Figure 4 clearly shows 1) the divergence of the two populations of colleges and universities and 2) the two different trajectories they take.
These two populations of colleges and universities are analyzed using regression analysis. To aid in interpretation, all the independent variables are centered on their means.
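Why centering aids interpretation can be seen in a small toy example. The sketch below uses simulated data (the names and numbers are hypothetical, not taken from the College Scorecard) to show that mean-centering a predictor leaves its slope unchanged and turns the intercept into the mean of the dependent variable, that is, the predicted cost for a college with average values on every predictor.

```r
# Simulated data only; the names and numbers are hypothetical.
set.seed(1)
n <- 200
sat_avg <- rnorm(n, mean = 1100, sd = 120)
cost <- 20000 + 12 * (sat_avg - 1100) + rnorm(n, sd = 4000)
toy <- data.frame(cost, sat_avg)

raw      <- lm(cost ~ sat_avg, data = toy)
centered <- lm(cost ~ scale(sat_avg, scale = FALSE), data = toy)

# The slope is the same in both fits.
c(raw = unname(coef(raw)[2]), centered = unname(coef(centered)[2]))

# The centered intercept equals the mean cost, whereas the raw
# intercept is the (meaningless) predicted cost at sat_avg = 0.
c(centered_intercept = unname(coef(centered)[1]),
  mean_cost          = mean(toy$cost),
  raw_intercept      = unname(coef(raw)[1]))
```

This is the same scale(x, scale = FALSE) idiom used in the models that follow, so their intercepts can be read as the expected cost of a college with average values on all four predictors.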
#Regression analysis
#Prevent presenting regression results in scientific notation
options(scipen = 10, digits = 4)
#Public school model
Public = ColSco3[ColSco3$abovebelow == 0, ]
summary(Pub <- lm(costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg,scale = FALSE) +
scale(first_gen, scale = FALSE) + scale(median_hh_inc, scale = FALSE),
data = Public))
##
## Call:
## lm(formula = costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg,
## scale = FALSE) + scale(first_gen, scale = FALSE) + scale(median_hh_inc,
## scale = FALSE), data = Public)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11302 -2888 -638 2359 14645
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 22779.8925 185.8750 122.55
## scale(adm_rate, scale = FALSE) -2369.3496 1091.5440 -2.17
## scale(sat_avg, scale = FALSE) 11.5773 2.3313 4.97
## scale(first_gen, scale = FALSE) -11554.1619 2561.5926 -4.51
## scale(median_hh_inc, scale = FALSE) 0.1131 0.0186 6.08
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## scale(adm_rate, scale = FALSE) 0.03 *
## scale(sat_avg, scale = FALSE) 0.0000009001 ***
## scale(first_gen, scale = FALSE) 0.0000078269 ***
## scale(median_hh_inc, scale = FALSE) 0.0000000021 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4500 on 581 degrees of freedom
## Multiple R-squared: 0.275, Adjusted R-squared: 0.27
## F-statistic: 55.2 on 4 and 581 DF, p-value: <2e-16
#confidence intervals
round(confint(Pub), 3)
## 2.5 % 97.5 %
## (Intercept) 22414.824 23144.96
## scale(adm_rate, scale = FALSE) -4513.202 -225.50
## scale(sat_avg, scale = FALSE) 6.998 16.16
## scale(first_gen, scale = FALSE) -16585.272 -6523.05
## scale(median_hh_inc, scale = FALSE) 0.077 0.15
#Private school model
Private = ColSco3[ColSco3$abovebelow == 1, ]
summary(Pri <- lm(costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg,scale = FALSE) +
scale(first_gen, scale = FALSE) + scale(median_hh_inc, scale = FALSE),
data = Private))
##
## Call:
## lm(formula = costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg,
## scale = FALSE) + scale(first_gen, scale = FALSE) + scale(median_hh_inc,
## scale = FALSE), data = Private)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12638 -3746 54 3538 20394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46956.2591 202.4264 231.97 <2e-16 ***
## scale(adm_rate, scale = FALSE) -2126.8036 1133.0148 -1.88 0.061 .
## scale(sat_avg, scale = FALSE) 45.2844 2.4570 18.43 <2e-16 ***
## scale(first_gen, scale = FALSE) -6794.3086 3054.5789 -2.22 0.026 *
## scale(median_hh_inc, scale = FALSE) 0.3320 0.0253 13.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5180 on 651 degrees of freedom
## Multiple R-squared: 0.773, Adjusted R-squared: 0.772
## F-statistic: 555 on 4 and 651 DF, p-value: <2e-16
#confidence intervals
round(confint(Pri), 3)
## 2.5 % 97.5 %
## (Intercept) 46558.772 47353.747
## scale(adm_rate, scale = FALSE) -4351.608 98.001
## scale(sat_avg, scale = FALSE) 40.460 50.109
## scale(first_gen, scale = FALSE) -12792.325 -796.293
## scale(median_hh_inc, scale = FALSE) 0.282 0.382
In the two regression models all the independent variables are statistically significant except adm_rate in the Private school model.
The public school model accounts for 27.5 percent of the variation in costt4_a. The ranges below are 95% confidence intervals, holding the other independent variables in the model constant. Because adm_rate and first_gen appear to be recorded as proportions (0 to 1) in the College Scorecard data, their coefficients describe a change across the full 0-to-1 range and are restated here per percentage point. A one-percentage-point increase in admission rate (adm_rate) is associated with a $2.25 to $45.13 decrease in college cost; a one-point increase in average student body SAT score (sat_avg) is associated with a $7.00 to $16.16 increase in college cost; a one-percentage-point increase in first-generation students (first_gen) is associated with a $65.23 to $165.85 decrease in college cost; and a $1 increase in students’ median household income (median_hh_inc) is associated with a $0.08 to $0.15 increase in college cost.
The private school model accounts for 77.3 percent of the variation in costt4_a. Again reading the 95% confidence intervals with the other independent variables held constant: admission rate (adm_rate) is not statistically significant at the 0.05 level; a one-point increase in average student body SAT score (sat_avg) is associated with a $40.46 to $50.11 increase in college cost; a one-percentage-point increase in first-generation students (first_gen) is associated with a $7.96 to $127.92 decrease in college cost; and a $1 increase in students’ median household income (median_hh_inc) is associated with a $0.28 to $0.38 increase in college cost.
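The coefficients for adm_rate and first_gen are easy to misread if those variables are stored as proportions rather than percentages. The sketch below assumes they run from 0 to 1 (an assumption about the data, not something stated in the output) and rescales the confidence bounds printed by confint() to a one-percentage-point change.

```r
# 95% confidence bounds copied from the confint() output above.
ci <- rbind(
  public_adm_rate   = c(-4513.202,  -225.50),
  public_first_gen  = c(-16585.272, -6523.05),
  private_first_gen = c(-12792.325, -796.293)
)
colnames(ci) <- c("lower", "upper")

# Assumption: adm_rate and first_gen are proportions (0 to 1), so a
# one-percentage-point change is a change of 0.01 in the predictor.
ci_per_point <- ci * 0.01
round(ci_per_point, 2)
```

Under that assumption, the unscaled bounds describe the difference between a hypothetical college with a value of 0 and one with a value of 1, which is why they look so large next to typical annual costs.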
The fact that adm_rate and first_gen have negative signs while sat_avg and median_hh_inc have positive signs in these models suggests that the more exclusive a college is (the harder it is to get into), the higher its total cost of attendance. In economic terms, exclusivity is a measure of scarcity, and scarce items command higher prices in the marketplace. The two models also reflect the different missions of these two types of colleges and universities. Public schools’ primary mission is to educate the children of the citizens of their states. These schools, to varying degrees, appeal to a mass consumer market. Private schools sell themselves as premium products for a select market. Of course these are generalizations, but the difference in regression coefficients for sat_avg ($11.58 vs $45.28) and median_hh_inc ($0.11 vs $0.33) suggests that the cost of attending private schools is more strongly associated with students’ SAT scores and the incomes of the homes they come from than is the cost of attending public schools.
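The comparison of coefficients across the two models can be made a little more concrete with an approximate z statistic for the difference between two independently estimated coefficients. This check is not part of the original analysis; the sketch simply reuses the estimates and standard errors printed in the two summary() tables. Because the two groups were themselves defined by the sign of a residual from a cost regression, the result is best read as descriptive rather than as a strict hypothesis test.

```r
# Estimates and standard errors copied from the two summary() tables above.
est_pub <- c(sat_avg = 11.5773, median_hh_inc = 0.1131)
se_pub  <- c(sat_avg = 2.3313,  median_hh_inc = 0.0186)
est_pri <- c(sat_avg = 45.2844, median_hh_inc = 0.3320)
se_pri  <- c(sat_avg = 2.4570,  median_hh_inc = 0.0253)

# Approximate z statistic for the difference between two coefficients,
# treating the two samples as independent.
z <- (est_pri - est_pub) / sqrt(se_pub^2 + se_pri^2)
p <- 2 * pnorm(-abs(z))  # two-sided p-value under a normal approximation
round(z, 2)
```

Both differences come out many standard errors from zero (z of roughly 10 for sat_avg and roughly 7 for median_hh_inc), which is consistent with the reading that the private school slopes are steeper.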
Colleges and universities are important portals for moving students into the middle class and beyond, but access to these portals comes at a cost. This raises the question: “What determines the cost of colleges and universities?” The analysis performed here suggests that admission rates, the average SAT scores of student bodies, the average household income of students’ families, and the percentage of first-generation students are the primary factors associated with the cost of college. Taken as a whole, these factors suggest that college costs are a function of college exclusivity. Simply put, the harder an institution of higher learning is to get into, the greater, in general, will be its cost.
While this association between exclusivity and cost is seen for all colleges and universities, the relationship is particularly strong for private schools as opposed to public schools. This is probably a reflection of the differing missions of the two types of institution. Private schools are literally businesses that sell special types of educational experiences, e.g., specialized education (Embry-Riddle, Gallaudet), religiously infused education (Liberty University, Notre Dame), small liberal arts education (Pomona, Emory), elite education (Yale, University of Chicago), and Black experience education (Howard, Morehouse). Despite neo-liberal calls for public colleges and universities to be run more like businesses, the primary mission of these institutions is to provide access to higher learning to the citizens of the states in which they reside, the citizens whose taxes, in part, fund these institutions. Access is the very opposite of exclusivity, and as a result the associations between cost and admission rates, SAT scores, first-generation students, and household incomes are weaker in public schools than in private schools.
1. R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
2. Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
3. Clyde, M. (2020). BAS: Bayesian Variable Selection and Model Averaging using Bayesian Adaptive Sampling. R package version 1.5.5.
4. Patil, I. (2018). ggstatsplot: ‘ggplot2’ Based Plots with Statistical Details. CRAN. doi: 10.5281/zenodo.2074621. https://CRAN.R-project.org/package=ggstatsplot
5. Pedersen, T. L. (2020). patchwork: The Composer of Plots. R package version 1.0.1. https://CRAN.R-project.org/package=patchwork
6. Schloerke, B., Cook, D., Larmarange, J., Briatte, F., Marbach, M., Thoen, E., Elberg, A., & Crowley, J. (2020). GGally: Extension to ‘ggplot2’. R package version 2.0.0. https://CRAN.R-project.org/package=GGally
7. Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: A second course in statistics. Addison-Wesley.