Introduction

In the United States, a college education has come to be viewed as a means of progressing economically in life. The Bureau of Labor Statistics finds that full-time workers with just a high school diploma had median weekly earnings of $718, while those with a bachelor’s degree had median weekly earnings of $1,189 (bls.gov, 2020). College is important, but it has also become increasingly expensive. For example, in 1989 the cost of a 4-year degree (adjusted for inflation) averaged $52,892, while in 2016 the cost of the same degree had risen to $104,480 on average (educationdata.org, 2020). Given the increasing income inequality in the United States, these rising costs act as a barrier to under-privileged and working-class students moving into the middle class and beyond.

The following research attempts to gain an understanding of the factors that produce increasing college costs. The data for this study come from the College Scorecard and span the years 2017-2018 (collegescorecard.ed.gov, 2020). The variable of interest (the dependent variable) in this analysis is the average annual total cost of attending college.

Methods

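The units of analysis for this study are 7,112 colleges and universities in the United States. The data for this research can be found here: https://collegescorecard.ed.gov/data/. The independent variables (factors) used to try to explain college costs, as named in the analysis below, are: adm_rate (admission rate), sat_avg (average SAT score of the student body), first_gen (share of first-generation students), median_hh_inc (median household income of students’ families), poverty_rate, unemp_rate, age_entry, female, ugds, and ugds_white.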

The dependent variable is costt4_a (the average annual total cost of attendance, including tuition and fees, books and supplies). Various plotting techniques and correlation analysis were used in both exploratory data analysis and variable selection. Additionally, Bayesian model averaging was used in variable selection. Variable selection was carried out to produce the most parsimonious final model(s) possible. Parsimonious models are not only simpler to interpret, they also reduce the confounding of effects.

The exploratory data analysis of total cost versus average SAT suggests that two populations of colleges exist. The two populations were separated and a regression analysis was performed on each. All of the independent variables were statistically significant in both regression models, with one exception: admission rate in the private school model.

Results

Exploring the data

The exploratory analysis begins with a correlogram showing robust correlation coefficients (Mosteller & Tukey, 1977) for each pair of variables in the data set.

#Check correlations between variables for columns 1 to 11 of ColSco
library(ggstatsplot)  # provides ggcorrmat()
set.seed(123)  # for reproducibility
ggcorrmat(data = ColSco[, 1:11],
  type = "robust",                    # correlation method
  p.adjust.method = "holm",           # p-value adjustment method for multiple comparisons
  matrix.type = "upper",              # type of visualization matrix
  colors = c("#B2182B", "white", "#4000FF"),
  title = "Correlation Matrix")

Figure 1: Correlogram of all variables

Figure 1 shows that median_hh_inc and sat_avg are moderately positively correlated with costt4_a, while first_gen is moderately negatively correlated with costt4_a. Also note that median_hh_inc, poverty_rate, and unemp_rate are all moderately to highly inter-correlated. This makes sense because they all measure, through different means, the same thing: economic well-being. This inter-correlation suggests possible multicollinearity issues.
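
One common way to gauge the multicollinearity concern raised above is the variance inflation factor (VIF). The sketch below is only an illustration: it uses simulated stand-ins for the three economic measures (the numbers are made up and are not the College Scorecard values), and the helper vif_one is a hypothetical name introduced here, not part of the original analysis.

```r
# Simulated stand-ins for three inter-correlated economic measures.
# These values are invented for illustration; they are NOT the ColSco data.
set.seed(123)
n <- 500
well_being    <- rnorm(n)  # hypothetical shared factor behind all three measures
median_hh_inc <- 60000 + 15000 * well_being + rnorm(n, sd = 5000)
poverty_rate  <- 12 - 4 * well_being + rnorm(n, sd = 1.5)
unemp_rate    <- 5 - 1.2 * well_being + rnorm(n, sd = 0.8)
econ <- data.frame(median_hh_inc, poverty_rate, unemp_rate)

# VIF for one predictor: regress it on the other predictors,
# then compute 1 / (1 - R^2) from that auxiliary regression.
vif_one <- function(var, data) {
  others <- setdiff(names(data), var)
  fit <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(econ), vif_one, data = econ)
round(vifs, 2)
```

A VIF near 1 means a predictor shares little information with the others; larger values mean that much of its variation is already explained by the remaining predictors, which is the situation described for median_hh_inc, poverty_rate, and unemp_rate.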

#Exploratory variable selection using Bayesian Model Averaging
library(BAS)  # provides bas.lm()
model1 <- bas.lm(costt4_a ~ .,
                   data = ColSco[, 1:11],
                   prior = "ZS-null",
                   modelprior = uniform(), initprobs = "eplogp",
                   force.heredity = FALSE, pivot = TRUE)
## Warning in bas.lm(costt4_a ~ ., data = ColSco[, 1:11], prior = "ZS-null", :
## dropping 5891 rows due to missing data
image(model1, rotate = FALSE)

Figure 2: Bayesian Variable Selection

The Bayesian model averaging analysis suggests, based on log posterior odds (7.08), that the best model is one that excludes the independent variables age_entry, poverty_rate, and unemp_rate. The variables poverty_rate and unemp_rate are probably excluded because of their multicollinearity with median_hh_inc. Simply put, if median_hh_inc is in the model, then poverty_rate and unemp_rate add very little new information.

Based on the correlogram and Bayesian model averaging, the variables female, poverty_rate, unemp_rate, age_entry, ugds, and ugds_white will be dropped from further consideration.

#Based on correlations and the variable selection method
#the following variables will be dropped (deleted) from
#the data set:
ColSco2 = ColSco
ColSco2$female <-NULL
ColSco2$poverty_rate <-NULL
ColSco2$unemp_rate <-NULL
ColSco2$age_entry <-NULL
ColSco2$ugds <-NULL
ColSco2$ugds_white <-NULL

The remaining variables will be more closely examined using a scatter plot matrix.

#Remove all observations with missing values
library(tidyr)   # provides drop_na()
ColSco3 = drop_na(ColSco2)

#Check correlations and distributions for columns 1 to 5 of ColSco3
library(GGally)  # provides ggpairs()
ggpairs(ColSco3[, 1:5])

Figure 3: A scatter plot matrix

In Figure 3 the density plot for costt4_a is bimodal (it has two humps), and in the scatter plot of costt4_a against sat_avg the observations diverge (split) into two bands. These visuals suggest that there are two different populations of colleges and universities. To explore this divergence, a simple bivariate regression is run in which the dependent variable is costt4_a and the independent variable is sat_avg. The residuals from this model are used to split the observations (colleges) into those below the regression line (negative residuals) and those above it (positive residuals). A random sample of ten observations from each group provides some insight into the differences between the two groups: most of the below observations are public schools, while most of the above observations are private schools. Public and private institutions are qualitatively different types of colleges and hence two different populations. In the code that follows, the below group is therefore labeled Public and the above group Private; these labels are a proxy based on the sign of the residual rather than on each institution’s recorded control type.

library(ggplot2)
p1 = ggplot(ColSco3, aes(x = sat_avg, y = costt4_a)) + 
    geom_point()+
    geom_smooth(method='lm')

#p1

#Using regression to split the two apparent populations of colleges
model2 = lm(costt4_a ~ sat_avg, data = ColSco3)

#extract the residuals
e = residuals(model2)


#add a new dummy variable (1s and 0s) to ColSco3 
#called abovebelow
ColSco3$abovebelow = (1 * (e > 0))

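#Random sample of the two types of colleges
#Note: a seed is not set, so the sampled rows will vary between runs
library(dplyr)  # provides %>%, group_by(), select(), slice_sample(), rename()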
ColSco3 %>% group_by(abovebelow) %>% select(instnm, costt4_a, sat_avg) %>% 
  slice_sample(n = 10) %>% 
  rename(college = instnm, cost = costt4_a, sat = sat_avg) %>%
  knitr::kable()
## Adding missing grouping variables: `abovebelow`
| abovebelow | college | cost | sat |
|---|---|---|---|
| 0 | Saginaw Valley State University | 20061 | 1101 |
| 0 | Kutztown University of Pennsylvania | 25274 | 1056 |
| 0 | Fayetteville State University | 15463 | 955 |
| 0 | University of South Carolina-Upstate | 22583 | 1010 |
| 0 | Fort Lewis College | 25064 | 1123 |
| 0 | University of Minnesota-Duluth | 23913 | 1185 |
| 0 | University of Illinois at Springfield | 23966 | 1159 |
| 0 | Marshall University | 18315 | 1099 |
| 0 | University of Wisconsin-Whitewater | 17480 | 1105 |
| 0 | Pennsylvania State University-Penn State Mont Alto | 27352 | 1072 |
| 1 | Walsh University | 40677 | 1140 |
| 1 | Indiana Wesleyan University-Marion | 36871 | 1138 |
| 1 | Arizona Christian University | 37675 | 956 |
| 1 | Heidelberg University | 41566 | 1088 |
| 1 | Stanford University | 66696 | 1489 |
| 1 | Eastern University | 44446 | 1106 |
| 1 | Washington Adventist University | 32240 | 957 |
| 1 | Aurora University | 33781 | 1062 |
| 1 | Nebraska Christian College of Hope International University | 28800 | 989 |
| 1 | St. Mary’s University | 40087 | 1139 |
#Convert abovebelow into a factor: colleges below the line
#(abovebelow == 0) are labeled "Public", those above "Private"
ColSco3$PublicPrivate =  as.factor(ifelse(ColSco3$abovebelow == 0, "Public", "Private"))

#Scatter plot of costt4_a versus sat_avg, faceted by the
#two groups (p1 above shows the entire data set)
p2 = ggplot(ColSco3, aes(x = sat_avg, y = costt4_a)) + 
      geom_point() +
      geom_smooth(method='lm') +
      facet_wrap(vars(PublicPrivate))

#Stack the two plots vertically (the / operator comes from patchwork)
library(patchwork)
p1 / p2
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Figure 4: A scatter plot of the entire population of colleges and its two sub-populations

Figure 4 clearly shows 1) the divergence of the two populations of colleges and universities and 2) the two different trajectories they take.

Analyzing and interpreting the data

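These two populations of colleges and universities are analyzed using regression analysis. To aid in interpretation, all the independent variables are centered on their means, so each intercept is the expected cost for a college that is average on every independent variable.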
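
The effect of mean-centering can be checked on a small simulated example. The sketch below uses made-up values (not the College Scorecard data) and borrows the names sat_avg and costt4_a purely for readability. It illustrates that centering leaves the slope unchanged while turning the intercept into the mean of the dependent variable.

```r
# Simulated data for illustration only; these are NOT the ColSco values.
set.seed(123)
n <- 200
sat_avg  <- rnorm(n, mean = 1100, sd = 120)
costt4_a <- 5000 + 20 * sat_avg + rnorm(n, sd = 4000)

# Same model fitted with the raw and with the mean-centered predictor
raw      <- lm(costt4_a ~ sat_avg)
centered <- lm(costt4_a ~ scale(sat_avg, scale = FALSE))

coef(raw)        # intercept = predicted cost at sat_avg = 0 (not meaningful)
coef(centered)   # intercept = predicted cost at the average sat_avg
mean(costt4_a)   # equals the centered intercept
```

With the raw predictor the intercept describes a college whose average SAT score is zero, which does not exist; with the centered predictor it describes a college at the sample average, which is why the centered form is easier to interpret.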

#Regression analysis
#Prevent presenting regression results in scientific notation
options(scipen = 10, digits = 4)

#Public school model
Public = ColSco3[ColSco3$abovebelow == 0, ]
summary(Pub <- lm(costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg,scale = FALSE) + 
            scale(first_gen, scale = FALSE) + scale(median_hh_inc, scale = FALSE),  
            data = Public))
## 
## Call:
## lm(formula = costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg, 
##     scale = FALSE) + scale(first_gen, scale = FALSE) + scale(median_hh_inc, 
##     scale = FALSE), data = Public)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11302  -2888   -638   2359  14645 
## 
## Coefficients:
##                                        Estimate  Std. Error t value
## (Intercept)                          22779.8925    185.8750  122.55
## scale(adm_rate, scale = FALSE)       -2369.3496   1091.5440   -2.17
## scale(sat_avg, scale = FALSE)           11.5773      2.3313    4.97
## scale(first_gen, scale = FALSE)     -11554.1619   2561.5926   -4.51
## scale(median_hh_inc, scale = FALSE)      0.1131      0.0186    6.08
##                                         Pr(>|t|)    
## (Intercept)                              < 2e-16 ***
## scale(adm_rate, scale = FALSE)              0.03 *  
## scale(sat_avg, scale = FALSE)       0.0000009001 ***
## scale(first_gen, scale = FALSE)     0.0000078269 ***
## scale(median_hh_inc, scale = FALSE) 0.0000000021 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4500 on 581 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.27 
## F-statistic: 55.2 on 4 and 581 DF,  p-value: <2e-16
#confidence intervals
round(confint(Pub), 3)
##                                          2.5 %   97.5 %
## (Intercept)                          22414.824 23144.96
## scale(adm_rate, scale = FALSE)       -4513.202  -225.50
## scale(sat_avg, scale = FALSE)            6.998    16.16
## scale(first_gen, scale = FALSE)     -16585.272 -6523.05
## scale(median_hh_inc, scale = FALSE)      0.077     0.15
#Private school model
Private = ColSco3[ColSco3$abovebelow == 1, ]
summary(Pri <- lm(costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg,scale = FALSE) + 
              scale(first_gen, scale = FALSE) + scale(median_hh_inc, scale = FALSE),  
            data = Private))
## 
## Call:
## lm(formula = costt4_a ~ scale(adm_rate, scale = FALSE) + scale(sat_avg, 
##     scale = FALSE) + scale(first_gen, scale = FALSE) + scale(median_hh_inc, 
##     scale = FALSE), data = Private)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12638  -3746     54   3538  20394 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         46956.2591   202.4264  231.97   <2e-16 ***
## scale(adm_rate, scale = FALSE)      -2126.8036  1133.0148   -1.88    0.061 .  
## scale(sat_avg, scale = FALSE)          45.2844     2.4570   18.43   <2e-16 ***
## scale(first_gen, scale = FALSE)     -6794.3086  3054.5789   -2.22    0.026 *  
## scale(median_hh_inc, scale = FALSE)     0.3320     0.0253   13.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5180 on 651 degrees of freedom
## Multiple R-squared:  0.773,  Adjusted R-squared:  0.772 
## F-statistic:  555 on 4 and 651 DF,  p-value: <2e-16
#confidence intervals
round(confint(Pri), 3)
##                                          2.5 %    97.5 %
## (Intercept)                          46558.772 47353.747
## scale(adm_rate, scale = FALSE)       -4351.608    98.001
## scale(sat_avg, scale = FALSE)           40.460    50.109
## scale(first_gen, scale = FALSE)     -12792.325  -796.293
## scale(median_hh_inc, scale = FALSE)      0.282     0.382

In the two regression models, all the independent variables are statistically significant at the 0.05 level except adm_rate in the private school model (p = 0.061).

The public school model accounts for 27.5 percent of the variation in costt4_a. The ranges below are 95% confidence intervals. Controlling for the other independent variables in the model: a one-unit increase in admission rate (adm_rate) is associated with a $225.50 to $4,513.20 decrease in college cost; a one-point increase in the average SAT score of the student body (sat_avg) is associated with a $7.00 to $16.16 increase in college cost; a one-unit increase in the share of first-generation students (first_gen) is associated with a $6,523.05 to $16,585.27 decrease in college cost; and a $1 increase in students’ median household income (median_hh_inc) is associated with a $0.08 to $0.15 increase in college cost. Note that adm_rate and first_gen appear to be recorded as proportions, so one unit spans the full range from 0% to 100%; a one-percentage-point increase corresponds to one hundredth of these amounts (roughly $2 to $45 for adm_rate and $65 to $166 for first_gen).

The private school model accounts for 77.3 percent of the variation in costt4_a. Controlling for the other independent variables in the model: admission rate (adm_rate) has no statistically significant association with college cost; a one-point increase in the average SAT score of the student body (sat_avg) is associated with a $40.46 to $50.11 increase in college cost; a one-unit increase in the share of first-generation students (first_gen) is associated with a $796.29 to $12,792.32 decrease in college cost (roughly $8 to $128 per percentage point, under the same proportion coding); and a $1 increase in students’ median household income (median_hh_inc) is associated with a $0.28 to $0.38 increase in college cost.

The fact that adm_rate and first_gen have negative signs while sat_avg and median_hh_inc have positive signs in these models suggests that the more exclusive a college is (the harder it is to get into), the higher its total cost of attendance. In economic terms, exclusivity is a measure of scarcity, and scarce items command higher prices in the marketplace. The two models also reflect the different missions of these two types of colleges and universities. The primary mission of public schools is to educate the children of the citizens of their states. These schools, to varying degrees, appeal to a mass consumer market. Private schools sell themselves as premium products for a select market. Of course these are generalizations, but the difference in regression coefficients for sat_avg ($11.58 vs $45.28) and median_hh_inc ($0.11 vs $0.33) suggests that the cost of attending private schools is more strongly associated with students’ SAT scores and the incomes of the homes they come from than is the case for public schools.
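
The coefficients for adm_rate and first_gen are easier to read per percentage point. The arithmetic below is a sketch that assumes both variables are coded as proportions between 0 and 1 (the usual College Scorecard coding, though the original text does not state it), so a one-percentage-point change is a change of 0.01 units. The point estimates are taken from the regression output above.

```latex
\begin{align*}
\text{adm\_rate (public):}   \quad & -2369.35 \times 0.01 \approx -23.69 \\
\text{first\_gen (public):}  \quad & -11554.16 \times 0.01 \approx -115.54 \\
\text{adm\_rate (private):}  \quad & -2126.80 \times 0.01 \approx -21.27 \\
\text{first\_gen (private):} \quad & -6794.31 \times 0.01 \approx -67.94
\end{align*}
```

Under this assumption, a one-percentage-point rise in the share of first-generation students goes with a cost that is about $116 lower at public schools and about $68 lower at private schools, holding the other variables fixed. The private school adm_rate figure should be read with caution because that coefficient is not statistically significant at the 0.05 level.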

Discussion

Colleges and universities are important portals for moving students into the middle class and beyond, but access to these portals comes at a cost. This raises the question: “What determines the cost of colleges and universities?” The analysis performed here suggests that the primary factors associated with college cost are an institution’s admission rate, the average SAT score of its student body, the median household income of its students’ families, and the percentage of its student body made up of first-generation students. Taken as a whole, these factors suggest that college costs are a function of college exclusivity. Simply put, the harder an institution of higher learning is to get into, the greater in general will be its cost.

While this association between exclusivity and cost is seen for all colleges and universities, it is particularly strong for private schools as opposed to public schools. This is probably a reflection of the differing missions of the two types of institution. Private schools are literally businesses that sell special types of educational experiences, e.g., specialized education (Embry-Riddle, Gallaudet), religiously infused education (Liberty University, Notre Dame), small liberal arts education (Pomona, Emory), elite education (Yale, University of Chicago), and Black experience education (Howard, Morehouse). Despite neo-liberal calls for public colleges and universities to be run more like businesses, the primary mission of these institutions is to provide access to higher learning to the citizens of the states in which they reside, the citizens whose taxes, in part, fund these institutions. Access is the very opposite of exclusivity, and as a result the associations between cost and admission rates, SAT scores, first-generation students, and household incomes are weaker in public schools than in private schools.

References