Abstract
Graduation rates are of great interest to colleges, universities, ranking bodies, and students. The factors that contribute to graduation rates are hard to determine. This presentation presents a simple multiple linear regression model that adequately predicts graduation rate (Test RMSE: 13.04; r-squared: 0.4251) from a diversity of other data from the 1995 US News and World Report rankings data.Note: This was my final project for Predictive Analytics (LIS 4805) at the University of South Florida School of Information.
College and university graduation rates are of great interest to institutions, ranking bodies, and students alike. Clearly, colleges and universities wish to maximize graduation rates and students wish to attend schools with high graduation rates. In this project, I will be using a dataset of information about colleges and universities to develop a model for predicting graduation rates.
# Reading and munging data
data <- read.csv("College.csv")
data$X <- NULL #removing the universities' names
data$Private <- as.factor(ifelse(data$Private == "Yes", 1, 0))
str(data)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : int 1660 2186 1428 417 193 587 353 1899 1038 582 ...
## $ Accept : int 1232 1924 1097 349 146 479 340 1720 839 498 ...
## $ Enroll : int 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : int 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : int 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: int 2885 2683 1036 510 249 678 416 1594 973 799 ...
## $ P.Undergrad: int 537 1227 99 63 869 41 230 32 306 78 ...
## $ Outstate : int 7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
## $ Room.Board : int 3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
## $ Books : int 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : int 2200 1500 1165 875 1500 675 1500 850 500 1800 ...
## $ PhD : int 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : int 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: int 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : int 7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
## $ Grad.Rate : int 60 56 54 59 15 55 63 73 80 52 ...
From here, we can learn more about the data, which consists of 777 observations of 18 variables from the 1995 issue of US News and World Report:
| Variable | Meaning |
|---|---|
| Private | Whether the school is public (0) or private (1) |
| Apps | The number of applications the school received |
| Accept | The number of applications the school accepted |
| Enroll | The number of applications that enrolled |
| Top10perc | The percent of new students in the top 10% of their high school class |
| Top25perc | The percent of new students in the top 25% of their high school class |
| F.Undergrad | The number of full-time undergraduates |
| P.Undergrad | The number of part-time undergraduates |
| Outstate | The out-of-state tuition, in dollars |
| Room.Board | The cost of room and board, in dollars |
| Books | Estimated book spending, in dollars |
| Personal | Estimated personal spending, dollars |
| PhD | Percent of faculty with PhDs |
| Terminal | The percent of faculty with terminal degrees |
| S.F. Ratio | The number of students for every member of faculty |
| perc.alumni | The percent of alumni who donate back to the university |
| Expend | Per student instructional expenditures, in dollars |
| Grad.Rate | Graduation rate |
I will be using multiple linear regression to resolve this prediction. Before doing so, I will need to divide the data into a training set and a test set so that I can effectively evaluate our model.
sz <- sample(1:nrow(data), size=round(0.8*nrow(data)))
training_set <- data[sz,]
testing_set <- data[-sz,]
#80% of the original data is used as the training set, while the rest 20% is used as test set
set.seed(1337)
From here, we can generate a base model. We can use domain knowledge to generate the base model, and I have chosen to explain graduation rate by Top25perc, F.Undergrad, and P.Undergrad. The logic being that the demographics of a university is what ultimately decides the graduation rate.
#a base model, graduation rate explained by top 25%, and full- and part-time undergrad
model1 <- lm(Grad.Rate ~ Top25perc+F.Undergrad+P.Undergrad, data = training_set)
summary(model1)
##
## Call:
## lm(formula = Grad.Rate ~ Top25perc + F.Undergrad + P.Undergrad,
## data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.289 -9.835 0.074 9.211 59.174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.8007814 1.7648006 24.819 < 2e-16 ***
## Top25perc 0.4373761 0.0303079 14.431 < 2e-16 ***
## F.Undergrad -0.0002571 0.0001439 -1.787 0.0744 .
## P.Undergrad -0.0019477 0.0004475 -4.352 1.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.4 on 618 degrees of freedom
## Multiple R-squared: 0.3055, Adjusted R-squared: 0.3021
## F-statistic: 90.61 on 3 and 618 DF, p-value: < 2.2e-16
The base model had an adjusted r-squared value of 0.2861, which works for our purposes. This means that approximately 28% of our dependent variable (graduation rate) can be explained by the independent variables provided (Top25perc, F.Undergrad, and P.Undergrad).
I will be using backward selection to determine what variables will be going into our final model. Backward selection works by starting with using every variable as an independent variable and removing variables to maximize p.
# full and null models
m_full <- lm(Grad.Rate ~ ., data = training_set)
m_null <- lm(Grad.Rate ~ 1, data = training_set)
# backward selection
step(m_full, trace = F, scope = list(lower=formula(m_null), upper=formula(m_full)),
direction = 'backward')
step(m_full, direction="backward")
Due to their length, the results of the backward selection are not displayed, but it resulted in a model consisting of 11 variables: Private, Apps, Top25perc, P.Undergrad, Outstate, Room.Board, Personal, PhD, Terminal, perc.alumni, Expend.
Let’s see how a model with those variables performs!
m_fit <- lm(Grad.Rate ~ Private+Apps+Top25perc+P.Undergrad+Outstate+Room.Board+Personal+PhD+Terminal+perc.alumni+Expend, data = training_set)
summary(m_fit)
##
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Top25perc + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + Terminal + perc.alumni +
## Expend, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.459 -7.174 -0.626 7.254 53.760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5160003 3.7019519 9.324 < 2e-16 ***
## Private1 2.1053891 1.8232808 1.155 0.248655
## Apps 0.0007983 0.0001691 4.721 2.92e-06 ***
## Top25perc 0.1770614 0.0365739 4.841 1.64e-06 ***
## P.Undergrad -0.0014467 0.0003855 -3.753 0.000192 ***
## Outstate 0.0013208 0.0002536 5.208 2.61e-07 ***
## Room.Board 0.0016408 0.0006479 2.532 0.011582 *
## Personal -0.0013638 0.0008799 -1.550 0.121687
## PhD 0.0938764 0.0614783 1.527 0.127283
## Terminal -0.1212500 0.0677360 -1.790 0.073944 .
## perc.alumni 0.2659408 0.0538758 4.936 1.03e-06 ***
## Expend -0.0003860 0.0001345 -2.871 0.004236 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.65 on 610 degrees of freedom
## Multiple R-squared: 0.4704, Adjusted R-squared: 0.4609
## F-statistic: 49.26 on 11 and 610 DF, p-value: < 2.2e-16
We found an adjusted r-squared value of 0.4333, which is a great improvement over our base model (0.2861)!
We ought to further evaluate our model, but our data is very limited. Cross-validation is a way for us to evaluate our model on limited data. We can split our data into k groups and measure the efficacy of training the model on all groups but one. We will need to compare these analyses between both the base model and the fitted model, with both the training and test sets.
#Cross Validation - Base Model
cv_base <- caret::train(Grad.Rate ~ Top25perc+F.Undergrad+P.Undergrad, training_set,
method="lm",
trControl = trainControl(
method="cv", number = 10,
verboseIter = TRUE
)
)
cv_base # RMSE 14.56
#Test set performance
cv_baset <- train(Grad.Rate ~ Top25perc+F.Undergrad+P.Undergrad, testing_set,
method="lm",
trControl = trainControl(
method="cv", number = 10,
verboseIter = TRUE
)
)
cv_baset #14.67
#Cross Validation - Fitted Model
cv <-train(Grad.Rate ~ Private+Apps+Top25perc+P.Undergrad+Outstate+Room.Board+Personal+PhD+Terminal+perc.alumni+Expend, training_set,
method="lm",
trControl = trainControl(
method="cv", number = 10,
verboseIter = TRUE
)
)
cv #RMSE: 12.86
#Test set performance
cv1 <-train(Grad.Rate ~ Private+Apps+Top25perc+P.Undergrad+Outstate+Room.Board+Personal+PhD+Terminal+perc.alumni+Expend, testing_set,
method="lm",
trControl = trainControl(
method="cv", number = 10,
verboseIter = TRUE
)
)
cv1
#RMSE: 13.04
| Model | Train RMSE | Test RMSE |
|---|---|---|
| Base | 14.56 | 14.67 |
| Fitted | 12.86 | 13.04 |
Our fitted model has a slightly smaller root mean squared error (RMSE) than the base model and the RMSEs for the training set and the test set are similar, validating the fitted model.
The fitted model has an adjusted r-squared value of 0.4251, meaning that the 42.51% of the variation in graduation rates can be explained by the model. This is about twice that of the base model (r^2 == 0.248).
While this model is not perfect and could benefit from more sophisticated methods, it would provide benefit and insight for organizations seeking to learn more about the factors that determine university graduation rates.