Note: This was my final project for Predictive Analytics (LIS 4805) at the University of South Florida School of Information.

College and university graduation rates are of great interest to institutions, ranking bodies, and students alike. Clearly, colleges and universities wish to maximize graduation rates and students wish to attend schools with high graduation rates. In this project, I will be using a dataset of information about colleges and universities to develop a model for predicting graduation rates.

# Reading and munging data
data <- read.csv("College.csv")
data$X <- NULL #removing the universities' names
data$Private <- as.factor(ifelse(data$Private == "Yes", 1, 0))
str(data)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : int  1660 2186 1428 417 193 587 353 1899 1038 582 ...
##  $ Accept     : int  1232 1924 1097 349 146 479 340 1720 839 498 ...
##  $ Enroll     : int  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : int  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : int  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: int  2885 2683 1036 510 249 678 416 1594 973 799 ...
##  $ P.Undergrad: int  537 1227 99 63 869 41 230 32 306 78 ...
##  $ Outstate   : int  7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
##  $ Room.Board : int  3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
##  $ Books      : int  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : int  2200 1500 1165 875 1500 675 1500 850 500 1800 ...
##  $ PhD        : int  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : int  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: int  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : int  7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
##  $ Grad.Rate  : int  60 56 54 59 15 55 63 73 80 52 ...

From here, we can learn more about the data, which consists of 777 observations of 18 variables from the 1995 issue of US News and World Report:

Variable Meaning
Private Whether the school is public (0) or private (1)
Apps The number of applications the school received
Accept The number of applications the school accepted
Enroll The number of applications that enrolled
Top10perc The percent of new students in the top 10% of their high school class
Top25perc The percent of new students in the top 25% of their high school class
F.Undergrad The number of full-time undergraduates
P.Undergrad The number of part-time undergraduates
Outstate The out-of-state tuition, in dollars
Room.Board The cost of room and board, in dollars
Books Estimated book spending, in dollars
Personal Estimated personal spending, dollars
PhD Percent of faculty with PhDs
Terminal The percent of faculty with terminal degrees
S.F. Ratio The number of students for every member of faculty
perc.alumni The percent of alumni who donate back to the university
Expend Per student instructional expenditures, in dollars
Grad.Rate Graduation rate

I will be using multiple linear regression to resolve this prediction. Before doing so, I will need to divide the data into a training set and a test set so that I can effectively evaluate our model.

sz <- sample(1:nrow(data), size=round(0.8*nrow(data)))
training_set <- data[sz,]
testing_set <- data[-sz,]
#80% of the original data is used as the training set, while the rest 20% is used as test set

set.seed(1337)

From here, we can generate a base model. We can use domain knowledge to generate the base model, and I have chosen to explain graduation rate by Top25perc, F.Undergrad, and P.Undergrad. The logic being that the demographics of a university is what ultimately decides the graduation rate.

#a base model, graduation rate explained by top 25%, and full- and part-time undergrad
model1 <- lm(Grad.Rate ~ Top25perc+F.Undergrad+P.Undergrad, data = training_set)
summary(model1)
## 
## Call:
## lm(formula = Grad.Rate ~ Top25perc + F.Undergrad + P.Undergrad, 
##     data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.289  -9.835   0.074   9.211  59.174 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 43.8007814  1.7648006  24.819  < 2e-16 ***
## Top25perc    0.4373761  0.0303079  14.431  < 2e-16 ***
## F.Undergrad -0.0002571  0.0001439  -1.787   0.0744 .  
## P.Undergrad -0.0019477  0.0004475  -4.352 1.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.4 on 618 degrees of freedom
## Multiple R-squared:  0.3055, Adjusted R-squared:  0.3021 
## F-statistic: 90.61 on 3 and 618 DF,  p-value: < 2.2e-16

The base model had an adjusted r-squared value of 0.2861, which works for our purposes. This means that approximately 28% of our dependent variable (graduation rate) can be explained by the independent variables provided (Top25perc, F.Undergrad, and P.Undergrad).

I will be using backward selection to determine what variables will be going into our final model. Backward selection works by starting with using every variable as an independent variable and removing variables to maximize p.

# full and null models
m_full <- lm(Grad.Rate ~ ., data = training_set)  
m_null <- lm(Grad.Rate ~ 1, data = training_set)

# backward selection
step(m_full, trace = F, scope = list(lower=formula(m_null), upper=formula(m_full)),
     direction = 'backward')
step(m_full, direction="backward")

Due to their length, the results of the backward selection are not displayed, but it resulted in a model consisting of 11 variables: Private, Apps, Top25perc, P.Undergrad, Outstate, Room.Board, Personal, PhD, Terminal, perc.alumni, Expend.

Let’s see how a model with those variables performs!

m_fit <- lm(Grad.Rate ~ Private+Apps+Top25perc+P.Undergrad+Outstate+Room.Board+Personal+PhD+Terminal+perc.alumni+Expend, data = training_set)
summary(m_fit)
## 
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Top25perc + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + Terminal + perc.alumni + 
##     Expend, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.459  -7.174  -0.626   7.254  53.760 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.5160003  3.7019519   9.324  < 2e-16 ***
## Private1     2.1053891  1.8232808   1.155 0.248655    
## Apps         0.0007983  0.0001691   4.721 2.92e-06 ***
## Top25perc    0.1770614  0.0365739   4.841 1.64e-06 ***
## P.Undergrad -0.0014467  0.0003855  -3.753 0.000192 ***
## Outstate     0.0013208  0.0002536   5.208 2.61e-07 ***
## Room.Board   0.0016408  0.0006479   2.532 0.011582 *  
## Personal    -0.0013638  0.0008799  -1.550 0.121687    
## PhD          0.0938764  0.0614783   1.527 0.127283    
## Terminal    -0.1212500  0.0677360  -1.790 0.073944 .  
## perc.alumni  0.2659408  0.0538758   4.936 1.03e-06 ***
## Expend      -0.0003860  0.0001345  -2.871 0.004236 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.65 on 610 degrees of freedom
## Multiple R-squared:  0.4704, Adjusted R-squared:  0.4609 
## F-statistic: 49.26 on 11 and 610 DF,  p-value: < 2.2e-16

We found an adjusted r-squared value of 0.4333, which is a great improvement over our base model (0.2861)!

We ought to further evaluate our model, but our data is very limited. Cross-validation is a way for us to evaluate our model on limited data. We can split our data into k groups and measure the efficacy of training the model on all groups but one. We will need to compare these analyses between both the base model and the fitted model, with both the training and test sets.

#Cross Validation - Base Model
cv_base <- caret::train(Grad.Rate ~ Top25perc+F.Undergrad+P.Undergrad, training_set,
           method="lm",
           trControl = trainControl(
             method="cv", number = 10,
             verboseIter = TRUE
           )
)
cv_base # RMSE 14.56

#Test set performance
cv_baset <- train(Grad.Rate ~ Top25perc+F.Undergrad+P.Undergrad, testing_set,
                  method="lm",
                  trControl = trainControl(
                    method="cv", number = 10,
                    verboseIter = TRUE
                  )
)
cv_baset #14.67

#Cross Validation - Fitted Model
cv <-train(Grad.Rate ~ Private+Apps+Top25perc+P.Undergrad+Outstate+Room.Board+Personal+PhD+Terminal+perc.alumni+Expend, training_set,
              method="lm",
              trControl = trainControl(
                method="cv", number = 10,
                verboseIter = TRUE
              )
)

cv #RMSE: 12.86

#Test set performance
cv1 <-train(Grad.Rate ~ Private+Apps+Top25perc+P.Undergrad+Outstate+Room.Board+Personal+PhD+Terminal+perc.alumni+Expend, testing_set,
           method="lm",
           trControl = trainControl(
             method="cv", number = 10,
             verboseIter = TRUE
           )
)
cv1
#RMSE: 13.04
Model Train RMSE Test RMSE
Base 14.56 14.67
Fitted 12.86 13.04

Our fitted model has a slightly smaller root mean squared error (RMSE) than the base model and the RMSEs for the training set and the test set are similar, validating the fitted model.

The fitted model has an adjusted r-squared value of 0.4251, meaning that the 42.51% of the variation in graduation rates can be explained by the model. This is about twice that of the base model (r^2 == 0.248).

While this model is not perfect and could benefit from more sophisticated methods, it would provide benefit and insight for organizations seeking to learn more about the factors that determine university graduation rates.