Factors Influencing the Cost of College

Introduction
Methods
Results
- Exploring the data
- Analyzing and interpreting the data
Discussion
References

Introduction

In the United States, a college education has come to be viewed as a means of economically progressing in life. The Bureau of Labor Statistics finds that full-time workers with just a high school diploma had median weekly earnings of $718 while those with a bachelor’s degree had median weekly earnings of $1,189 (bls.gov, 2020). College is important, but it has also become increasingly expensive. For example, in 1989 the cost of a 4-year degree (adjusted for inflation) averaged $52,892, while in 2016 the cost of the same degree rose to $104,480 on average (educationdata.org, 2020). Given the increasing income inequality in the United States, these rising costs act as a barrier to under-privileged and working-class students moving into the middle class and beyond.

The following research attempts to gain an understanding of the factors that produce increasing college costs. The data for this study come from the College Scorecard and span the years 2017-2018 (collegescorecard.ed.gov, 2020). The variable of interest (the dependent variable) in this analysis is the average annual total cost of attending college.

Methods

The units of analysis for this study are 7,112 colleges and universities in the United States. The data for this research can be found here: https://collegescorecard.ed.gov/data/. The independent variables (factors) used to try to explain college costs are:

adm_rate - Admission rate
sat_avg - Average SAT equivalent score of students admitted
ugds - Enrollment of undergraduate certificate/degree-seeking students (size)
ugds_white - share of enrollment of undergraduate degree-seeking students who are white
avgfacsal - Average faculty salary per month
age_entry - Average age of entry
female - Share of female students
first_gen - Share of first-generation students
median_hh_inc - Median household income
poverty_rate - Poverty rate, via Census data
unemp_rate - Unemployment rate, via Census data

The dependent variable is costt4_a (the average annual total cost of attendance, including tuition and fees, books and supplies). Various plotting techniques and correlation analysis were used in both exploratory data analysis and variable selection. Additionally, Bayesian model averaging was used in variable selection. Variable selection was carried out to produce the most parsimonious final model(s) possible. Parsimonious models are not only simpler to interpret, they also reduce the confounding of effects.

The exploratory data analysis of total cost versus average SAT suggests that two populations of colleges exist. The two populations were split from each other and regression analysis was performed on each. All of the independent variables were found to be statistically significant in both regression models, with the exception of adm_rate in the private school model.

Results

Exploring the data

The exploratory analysis begins with a correlogram showing robust correlation coefficients (Mosteller & Tukey, 1977) for each pair of variables in the data set.

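#Load the packages used throughout this analysis (see References)
library(tidyverse)
library(ggstatsplot)
library(BAS)
library(GGally)
library(patchwork)

#Check correlations between variables for columns 1 to 11 of ColSco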
set.seed(123)  # for reproducibility
ggcorrmat(data = ColSco[, 1:11],
  type = "robust",                    # correlation method
  p.adjust.method = "holm",           # p-value adjustment method for multiple comparisons
  matrix.type = "upper",              # type of visualization matrix
  colors = c("#B2182B", "white", "#4000FF"),
  title = "Correlation Matrix")

Figure 1: Correlogram of all variables

Figure 1 shows that median_hh_inc and sat_avg are moderately positively correlated with costt4_a, while first_gen is moderately negatively correlated with costt4_a. Also note that median_hh_inc, poverty_rate, and unemp_rate are all moderately to highly inter-correlated. This makes sense in that they are all, through different means, measuring the same thing – i.e., economic well-being. This inter-correlation suggests possible multicollinearity issues.
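One common way to quantify the kind of multicollinearity suspected here is the variance inflation factor (VIF). The sketch below is not part of the original analysis. It uses simulated data (the column names only mimic the three economic indicators, and the latent wellbeing variable and the helper vif_by_hand are hypothetical) and computes each VIF by hand in base R as 1 / (1 - R²), where R² comes from regressing one predictor on the others.

```r
# Minimal sketch on SIMULATED data; the values are made up for illustration
# and are not taken from the College Scorecard.
set.seed(123)
n <- 500
wellbeing <- rnorm(n)  # hypothetical latent "economic well-being"
sim <- data.frame(
  median_hh_inc = 60000 + 15000 * wellbeing + rnorm(n, sd = 5000),
  poverty_rate  = 12 - 4 * wellbeing + rnorm(n, sd = 1.5),
  unemp_rate    = 5 - 1.2 * wellbeing + rnorm(n, sd = 0.6)
)

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j is obtained by
# regressing predictor j on all of the other predictors.
vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_by_hand(sim)
print(round(vifs, 2))
```

A VIF close to 1 means a predictor shares little information with the others. Larger values mean that its coefficient would be estimated less precisely if all of the predictors were kept in the same model.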

#Exploratory Variable selection using Bayesian Model Averaging 
model1 <- bas.lm(costt4_a ~ .,
                   data = ColSco[, 1:11],
                   prior = "ZS-null",
                   modelprior = uniform(), initprobs = "eplogp",
                   force.heredity = FALSE, pivot = TRUE)

## Warning in bas.lm(costt4_a ~ ., data = ColSco[, 1:11], prior = "ZS-null", :
## dropping 5891 rows due to missing data

image(model1, rotate = FALSE)

Figure 2: Bayesian Variable Selection

The Bayesian model averaging analysis suggests, based on log posterior odds (7.08), that the best model is one that excludes the independent variables age_entry, poverty_rate, and unemp_rate. The variables poverty_rate and unemp_rate are probably excluded because of their multicollinearity with median_hh_inc. Put simply, if median_hh_inc is in the model, then poverty_rate and unemp_rate add very little new information.

Based on the correlogram and Bayesian model averaging, the variables female, poverty_rate, unemp_rate, age_entry, ugds, and ugds_white will be dropped from further consideration.

#Based on correlations and the variable selection method
#the following variables will be dropped (deleted) from
#the data set:
ColSco2 = ColSco
ColSco2$female <- NULL
ColSco2$poverty_rate <- NULL
ColSco2$unemp_rate <- NULL
ColSco2$age_entry <- NULL
ColSco2$ugds <- NULL
ColSco2$ugds_white <- NULL

The remaining variables will be more closely examined using a scatter plot matrix.

#Remove all observations with missing values
ColSco3 = drop_na(ColSco2)

#Check correlations and distributions for columns 1 to 5 of ColSco3
ggpairs(ColSco3[, 1:5])

Figure 3: A scatter plot matrix

In Figure 3 the density plot for costt4_a is bi-modal (has two humps), and in the scatter plot of costt4_a and sat_avg the observations can be seen to diverge (split) from one another. These visuals suggest that there exist two different populations of colleges and universities. To explore this divergence, a simple bivariate regression is run in which the dependent variable is costt4_a and the independent variable is sat_avg. The residuals from this model are used to split the observations (colleges) into those below the regression line (negative residuals) and those above it (positive residuals). A random sample of ten observations from below and ten from above the regression line provides some insight into the differences between these two diverging groups. The sample reveals that most of the below-the-line observations are public schools while most of the above-the-line observations are private schools. Public and private institutions are qualitatively different types of colleges and hence two different populations.

p1 = ggplot(ColSco3, aes(x = sat_avg, y = costt4_a)) + 
    geom_point()+
    geom_smooth(method='lm')

#p1

#Use a regression to split the two (possible) populations of colleges
model2 = lm(costt4_a ~ sat_avg, data = ColSco3)

#Extract the residuals
e = residuals(model2)


#add a new dummy variable (1s and 0s) to ColSco3 
#called abovebelow
ColSco3$abovebelow = (1 * (e > 0))

#Random sample of the two types of colleges
#Note a seed is not set so results will vary
ColSco3 %>% group_by(abovebelow) %>% select(instnm, costt4_a, sat_avg) %>% 
  slice_sample(n = 10) %>% 
  rename(college = instnm, cost = costt4_a, sat = sat_avg) %>%
  knitr::kable()

## Adding missing grouping variables: `abovebelow`

abovebelow	college	cost	sat
0	Saginaw Valley State University	20061	1101
0	Kutztown University of Pennsylvania	25274	1056
0	Fayetteville State University	15463	955
0	University of South Carolina-Upstate	22583	1010
0	Fort Lewis College	25064	1123
0	University of Minnesota-Duluth	23913	1185
0	University of Illinois at Springfield	23966	1159
0	Marshall University	18315	1099
0	University of Wisconsin-Whitewater	17480	1105
0	Pennsylvania State University-Penn State Mont Alto	27352	1072
1	Walsh University	40677	1140
1	Indiana Wesleyan University-Marion	36871	1138
1	Arizona Christian University	37675	956
1	Heidelberg University	41566	1088
1	Stanford University	66696	1489
1	Eastern University	44446	1106
1	Washington Adventist University	32240	957
1	Aurora University	33781	1062
1	Nebraska Christian College of Hope International University	28800	989
1	St. Mary’s University	40087	1139

#Convert abovebelow into a factor, labeling colleges below the
#regression line (0) "Public" and those above it (1) "Private"
ColSco3$PublicPrivate = as.factor(ifelse(ColSco3$abovebelow == 0, "Public", "Private"))

#Scatter plot of costt4_a versus sat_avg for all colleges (p1),
#and then separately for the two groups (p2)
p2 = ggplot(ColSco3, aes(x = sat_avg, y = costt4_a)) + 
      geom_point() +
      geom_smooth(method='lm') +
      facet_wrap(vars(PublicPrivate))

p1 / p2

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

Figure 4: A scatter plot of the entire population of colleges and its two sub-populations

Figure 4 clearly shows 1) the divergence of the two populations of colleges and universities and 2) the two different trajectories they take.

Analyzing and interpreting the data

These two populations of colleges and universities are analyzed using regression analysis. To aid in interpretation, all the independent variables are centered on their means.
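Why centering aids interpretation can be seen in a small sketch. The example below uses simulated data (the variables sat and cost are hypothetical stand-ins, not the Scorecard columns) and compares an uncentered fit with a fit that centers the predictor using scale(x, scale = FALSE). The slope is unchanged, while the intercept becomes the expected cost at the average value of the predictor, which for ordinary least squares equals the mean of the response.

```r
# Minimal sketch on SIMULATED data; not the College Scorecard values.
set.seed(42)
n <- 200
sat  <- rnorm(n, mean = 1100, sd = 120)         # hypothetical average SAT scores
cost <- 5000 + 20 * sat + rnorm(n, sd = 3000)   # hypothetical annual cost

raw      <- lm(cost ~ sat)                        # uncentered predictor
centered <- lm(cost ~ scale(sat, scale = FALSE))  # predictor centered on its mean

slope_raw          <- unname(coef(raw)[2])
slope_centered     <- unname(coef(centered)[2])
intercept_raw      <- unname(coef(raw)[1])
intercept_centered <- unname(coef(centered)[1])

# The slopes agree; only the meaning of the intercept changes.
print(c(raw = slope_raw, centered = slope_centered))

# Uncentered intercept: predicted cost when sat = 0 (far outside the data).
# Centered intercept: predicted cost at the mean of sat.
print(c(raw = intercept_raw, centered = intercept_centered, mean_cost = mean(cost)))
```

Read this way, the intercepts reported in the models that follow are the expected costs for a college that is average on every predictor within its group.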

#Regression analysis
#Prevent presenting regression results in scientific notation
options(scipen = 10, digits = 4)

#Public school model
Public = ColSco3[ColSco3$abovebelow == 0, ]
summary(Pub <- lm(costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg,scale = FALSE) + 
            scale(first_gen, scale = FALSE) + scale(median_hh_inc, scale = FALSE),  
            data = Public))

## 
## Call:
## lm(formula = costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg, 
##     scale = FALSE) + scale(first_gen, scale = FALSE) + scale(median_hh_inc, 
##     scale = FALSE), data = Public)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11302  -2888   -638   2359  14645 
## 
## Coefficients:
##                                        Estimate  Std. Error t value
## (Intercept)                          22779.8925    185.8750  122.55
## scale(adm_rate, scale = FALSE)       -2369.3496   1091.5440   -2.17
## scale(sat_avg, scale = FALSE)           11.5773      2.3313    4.97
## scale(first_gen, scale = FALSE)     -11554.1619   2561.5926   -4.51
## scale(median_hh_inc, scale = FALSE)      0.1131      0.0186    6.08
##                                         Pr(>|t|)    
## (Intercept)                              < 2e-16 ***
## scale(adm_rate, scale = FALSE)              0.03 *  
## scale(sat_avg, scale = FALSE)       0.0000009001 ***
## scale(first_gen, scale = FALSE)     0.0000078269 ***
## scale(median_hh_inc, scale = FALSE) 0.0000000021 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4500 on 581 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.27 
## F-statistic: 55.2 on 4 and 581 DF,  p-value: <2e-16

#confidence intervals
round(confint(Pub), 3)

##                                          2.5 %   97.5 %
## (Intercept)                          22414.824 23144.96
## scale(adm_rate, scale = FALSE)       -4513.202  -225.50
## scale(sat_avg, scale = FALSE)            6.998    16.16
## scale(first_gen, scale = FALSE)     -16585.272 -6523.05
## scale(median_hh_inc, scale = FALSE)      0.077     0.15

#Private school model
Private = ColSco3[ColSco3$abovebelow == 1, ]
summary(Pri <- lm(costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg,scale = FALSE) + 
              scale(first_gen, scale = FALSE) + scale(median_hh_inc, scale = FALSE),  
            data = Private))

## 
## Call:
## lm(formula = costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg, 
##     scale = FALSE) + scale(first_gen, scale = FALSE) + scale(median_hh_inc, 
##     scale = FALSE), data = Private)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12638  -3746     54   3538  20394 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         46956.2591   202.4264  231.97   <2e-16 ***
## scale(adm_rate, scale = FALSE)      -2126.8036  1133.0148   -1.88    0.061 .  
## scale(sat_avg, scale = FALSE)          45.2844     2.4570   18.43   <2e-16 ***
## scale(first_gen, scale = FALSE)     -6794.3086  3054.5789   -2.22    0.026 *  
## scale(median_hh_inc, scale = FALSE)     0.3320     0.0253   13.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5180 on 651 degrees of freedom
## Multiple R-squared:  0.773,  Adjusted R-squared:  0.772 
## F-statistic:  555 on 4 and 651 DF,  p-value: <2e-16

#confidence intervals
round(confint(Pri), 3)

##                                          2.5 %    97.5 %
## (Intercept)                          46558.772 47353.747
## scale(adm_rate, scale = FALSE)       -4351.608    98.001
## scale(sat_avg, scale = FALSE)           40.460    50.109
## scale(first_gen, scale = FALSE)     -12792.325  -796.293
## scale(median_hh_inc, scale = FALSE)      0.282     0.382

In the two regression models all the independent variables are statistically significant at the 0.05 level except adm_rate in the private school model (p = 0.061).

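The public school model accounts for 27.5 percent of the variation in costt4_a. The ranges below are 95% confidence intervals, controlling for the other independent variables in the model. Because adm_rate and first_gen are shares (proportions between 0 and 1), their coefficients describe a change from 0 to 1, so they are divided by 100 here to express a one percentage point change. A one percentage point increase in admission rate (adm_rate) is associated with a $2.25 to $45.13 decrease in college cost; a one-point increase in the average SAT score of the student body (sat_avg) is associated with a $7.00 to $16.16 increase in college cost; a one percentage point increase in first-generation students (first_gen) is associated with a $65.23 to $165.85 decrease in college cost; and a $1 increase in students’ median household income is associated with a $0.08 to $0.15 increase in college cost.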

The private school model accounts for 77.3 percent of the variation in costt4_a. Controlling for the other independent variables in the model, and expressing the effect of first_gen (a share between 0 and 1) per one percentage point: admission rate (adm_rate) is not statistically significant at the 0.05 level (its confidence interval includes zero); a one-point increase in the average SAT score of the student body (sat_avg) is associated with a $40.46 to $50.11 increase in college cost; a one percentage point increase in first-generation students (first_gen) is associated with a $7.96 to $127.92 decrease in college cost; and a $1 increase in students’ median household income is associated with a $0.28 to $0.38 increase in college cost.
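The scale of a predictor matters when reading these intervals. The sketch below assumes that adm_rate and first_gen are stored as proportions between 0 and 1 (as the word "share" in the variable list suggests), so that a coefficient describes a move from 0 to 1. Under that assumption, dividing the confidence limits by 100 gives the change in cost associated with one percentage point. The limits are copied from the confint() output shown earlier; the helper per_point is hypothetical.

```r
# 95% confidence limits copied from the confint() output for the two models
ci_adm_rate <- rbind(
  Public  = c(lower = -4513.202, upper = -225.50),
  Private = c(lower = -4351.608, upper = 98.001)
)
ci_first_gen <- rbind(
  Public  = c(lower = -16585.272, upper = -6523.05),
  Private = c(lower = -12792.325, upper = -796.293)
)

# ASSUMPTION: the predictors are proportions (0 to 1), so one percentage
# point is 0.01 units and the effect per point is the coefficient / 100.
per_point <- function(ci) ci / 100

adm_pp <- per_point(ci_adm_rate)
fg_pp  <- per_point(ci_first_gen)

print(adm_pp)  # dollars per one percentage point of admission rate
print(fg_pp)   # dollars per one percentage point of first-generation share
```

If the variables were instead stored as percentages (0 to 100), the coefficients would already be per-point effects and no rescaling would be needed.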

The fact that adm_rate and first_gen have negative signs while sat_avg and median_hh_inc have positive signs in these models suggests that the more exclusive a college (the harder it is to get into), the higher its total cost of attendance. In economic terms, exclusivity is a measure of scarcity, and scarce items command higher prices in the marketplace. The two models also reflect the different missions of these two types of colleges and universities. Public schools’ primary mission is to educate the children of the citizens of their states. These schools, to varying degrees, appeal to a mass consumer market. Private schools sell themselves as premium products for a select market. Of course these are generalizations, but the difference in regression coefficients for sat_avg ($11.58 vs $45.28) and median_hh_inc ($0.11 vs $0.33) suggests that the cost of attending private schools is more strongly associated with students’ SAT scores and the incomes of the homes they come from than is the cost of attending public schools.

Discussion

Colleges and universities are important portals for moving students into the middle class and beyond, but access to these portals comes at a cost. This raises the question: “What determines the cost of colleges and universities?” The analysis performed here suggests that the primary factors associated with the cost of college are an institution’s admission rate, the average SAT score of its student body, the average household income of its students’ families, and the percentage of its student body who are first-generation students. Taken as a whole, these factors suggest that college costs are a function of college exclusivity. Simply put, the harder an institution of higher learning is to get into, the greater, in general, its cost will be.

While this association between exclusivity and cost is seen for all colleges and universities, the relationship is particularly strong for private schools as opposed to public schools. This is probably a reflection of the differing missions of the two types of institutions. Private schools are literally businesses that sell special types of educational experiences – e.g., specialized education (Embry-Riddle, Gallaudet), religiously infused education (Liberty University, Notre Dame), small liberal arts education (Pomona, Emory), elite education (Yale, University of Chicago), Black experience education (Howard, Morehouse). Despite neo-liberal calls for public colleges and universities to be run more like businesses, the primary mission of these institutions is to provide access to higher learning to the citizens of the states in which they reside; the citizens whose taxes, in part, fund these institutions. Access is the very opposite of exclusivity, and as a result the associations between cost and admission rates, SAT scores, first-generation students, and household incomes are weaker in public schools than in private schools.

References

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Clyde, M. (2020). BAS: Bayesian Variable Selection and Model Averaging using Bayesian Adaptive Sampling. R package version 1.5.5.
Patil, I. (2018). ggstatsplot: ‘ggplot2’ Based Plots with Statistical Details. CRAN. doi: 10.5281/zenodo.2074621. https://CRAN.R-project.org/package=ggstatsplot
Pedersen, T. L. (2020). patchwork: The Composer of Plots. R package version 1.0.1. https://CRAN.R-project.org/package=patchwork
Schloerke, B., Cook, D., Larmarange, J., Briatte, F., Marbach, M., Thoen, E., Elberg, A., & Crowley, J. (2020). GGally: Extension to ‘ggplot2’. R package version 2.0.0. https://CRAN.R-project.org/package=GGally
Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: a second course in statistics.