Final Project: College Data and Graduation Rate
Benjamin Oh
2023-07-20
Research Question:

What factors influence a college’s graduation rate, and can we use machine learning models to predict whether a college is likely to have a high or low graduation rate?

Loading the Data and Summary

The data comes from the CSV file provided on Canvas.

# Load the necessary packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(e1071)
library(rpart)
library(rpart.plot)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine

# Load the dataset
college_data <- read.csv("/Users/bensmac/Documents/Stanford/Final/College.csv")
summary(college_data)
##       X               Private               Apps           Accept     
##  Length:777         Length:777         Min.   :   81   Min.   :   72  
##  Class :character   Class :character   1st Qu.:  776   1st Qu.:  604  
##  Mode  :character   Mode  :character   Median : 1558   Median : 1110  
##                                        Mean   : 3002   Mean   : 2019  
##                                        3rd Qu.: 3624   3rd Qu.: 2424  
##                                        Max.   :48094   Max.   :26330  
##      Enroll       Top10perc       Top25perc      F.Undergrad   
##  Min.   :  35   Min.   : 1.00   Min.   :  9.0   Min.   :  139  
##  1st Qu.: 242   1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992  
##  Median : 434   Median :23.00   Median : 54.0   Median : 1707  
##  Mean   : 780   Mean   :27.56   Mean   : 55.8   Mean   : 3700  
##  3rd Qu.: 902   3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005  
##  Max.   :6392   Max.   :96.00   Max.   :100.0   Max.   :31643  
##   P.Undergrad         Outstate       Room.Board       Books       
##  Min.   :    1.0   Min.   : 2340   Min.   :1780   Min.   :  96.0  
##  1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0  
##  Median :  353.0   Median : 9990   Median :4200   Median : 500.0  
##  Mean   :  855.3   Mean   :10441   Mean   :4358   Mean   : 549.4  
##  3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0  
##  Max.   :21836.0   Max.   :21700   Max.   :8124   Max.   :2340.0  
##     Personal         PhD            Terminal       S.F.Ratio    
##  Min.   : 250   Min.   :  8.00   Min.   : 24.0   Min.   : 2.50  
##  1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50  
##  Median :1200   Median : 75.00   Median : 82.0   Median :13.60  
##  Mean   :1341   Mean   : 72.66   Mean   : 79.7   Mean   :14.09  
##  3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50  
##  Max.   :6800   Max.   :103.00   Max.   :100.0   Max.   :39.80  
##   perc.alumni        Expend        Grad.Rate     
##  Min.   : 0.00   Min.   : 3186   Min.   : 10.00  
##  1st Qu.:13.00   1st Qu.: 6751   1st Qu.: 53.00  
##  Median :21.00   Median : 8377   Median : 65.00  
##  Mean   :22.74   Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:31.00   3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :64.00   Max.   :56233   Max.   :118.00
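Removing the “Private” Column and Missing Values

The “Private” column caused problems when constructing the model because it stores yes/no values as text, so I decided not to use it in my model. The first column, “X”, is also dropped because it is a text column rather than a numeric predictor.

# Remove the first column ('X'); 'Private' is dropped further below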
college_data <- college_data[-1]

# Explore the structure of the dataset
str(college_data)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Apps       : int  1660 2186 1428 417 193 587 353 1899 1038 582 ...
##  $ Accept     : int  1232 1924 1097 349 146 479 340 1720 839 498 ...
##  $ Enroll     : int  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : int  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : int  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: int  2885 2683 1036 510 249 678 416 1594 973 799 ...
##  $ P.Undergrad: int  537 1227 99 63 869 41 230 32 306 78 ...
##  $ Outstate   : int  7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
##  $ Room.Board : int  3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
##  $ Books      : int  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : int  2200 1500 1165 875 1500 675 1500 850 500 1800 ...
##  $ PhD        : int  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : int  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: int  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : int  7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
##  $ Grad.Rate  : int  60 56 54 59 15 55 63 73 80 52 ...
# Check for missing values
sum(is.na(college_data))
## [1] 0
# Remove any rows with missing values
college_data <- college_data[complete.cases(college_data),]

# Drop irrelevant columns that won't be used for modeling
college_data <- college_data %>%
  select(-Private)
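
Dropping columns by position is easy to get wrong: in the run above, `college_data[-1]` removed the first column, `X`, while `Private` was only removed later by `select(-Private)`. The following is a minimal sketch of dropping columns by name in base R. The `toy` data frame and its college names are made up for illustration and are not taken from College.csv.

```r
# Toy data frame standing in for the college data (college names are made up)
toy <- data.frame(
  X = c("College A", "College B", "College C"),
  Private = c("Yes", "No", "Yes"),
  Apps = c(1660, 2186, 1428),
  Grad.Rate = c(60, 56, 54)
)

# Drop columns by name so the result does not depend on column order
drop_cols <- c("X", "Private")
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

print(names(toy_clean))
```

If `Private` should stay in the model instead, converting it with `factor()` is one option, since both `rpart` and `randomForest` accept factor predictors.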
Data Exploration and Visualization

We use the ggplot2 package to create two scatter plots with “Graduation Rate” on the x-axis and several other columns on the y-axis. The first plots “Apps,” “Accept,” “Enroll,” “Top10perc,” and “Top25perc”; the second plots “F.Undergrad,” “P.Undergrad,” “Outstate,” “Room.Board,” and “Books.” These columns represent different characteristics of colleges that might be related to their graduation rates.

The scatter plot provides a visual representation of how each of these factors correlates with the “Graduation Rate.” Points in the plot show the relationship between “Graduation Rate” and the respective column. By observing the scatter plot, we can gain insights into the potential influence of these factors on graduation rates.

# Load the necessary packages
library(ggplot2)

# Create a ggplot comparing Grad.Rate with other columns
ggplot(college_data, aes(x = Grad.Rate)) +
  geom_point(aes(y = Apps, color = "Apps"), size = 3) +
  geom_point(aes(y = Accept, color = "Accept"), size = 3) +
  geom_point(aes(y = Enroll, color = "Enroll"), size = 3) +
  geom_point(aes(y = Top10perc, color = "Top10perc"), size = 3) +
  geom_point(aes(y = Top25perc, color = "Top25perc"), size = 3) +
  labs(title = "Comparison of Graduation Rate with Other Columns",
       x = "Graduation Rate",
       y = "Other Columns") +
  scale_color_manual(values = c("Apps" = "blue", "Accept" = "red", "Enroll" = "green",
                                "Top10perc" = "purple", "Top25perc" = "orange")) +
  theme_minimal()

ggplot(college_data, aes(x = Grad.Rate)) +
  geom_point(aes(y = F.Undergrad, color = "F.Undergrad"), size = 3) +
  geom_point(aes(y = P.Undergrad, color = "P.Undergrad"), size = 3) +
  geom_point(aes(y = Outstate, color = "Outstate"), size = 3) +
  geom_point(aes(y = Room.Board, color = "Room.Board"), size = 3) +
  geom_point(aes(y = Books, color = "Books"), size = 3) +

  labs(title = "Comparison of Graduation Rate with Other Columns",
       x = "Graduation Rate",
       y = "Other Columns") +
  scale_color_manual(values = c("F.Undergrad" = "blue", "P.Undergrad" = "red", "Outstate" = "green",
                                "Room.Board" = "purple", "Books" = "orange")) +
  theme_minimal()
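
Because the plotted columns are on very different scales (`Apps` reaches 48,094 while `Top10perc` stays below 100), the overlaid points can be hard to compare by eye. As a complement, the sketch below computes the Pearson correlation of each predictor with `Grad.Rate`. It uses a small simulated data frame, and the relationship built into it is an assumption for illustration, so the numbers it prints say nothing about the real dataset.

```r
set.seed(1)
n <- 200

# Simulated predictors with ranges loosely matching the summary output
toy <- data.frame(
  Top10perc = runif(n, 1, 96),
  Apps = round(runif(n, 81, 48094))
)

# Simulated response: depends on Top10perc plus noise (illustrative assumption)
toy$Grad.Rate <- 40 + 0.4 * toy$Top10perc + rnorm(n, sd = 10)

# Correlation of every predictor column with Grad.Rate
predictors <- setdiff(names(toy), "Grad.Rate")
cors <- sapply(toy[predictors], function(col) cor(col, toy$Grad.Rate))

print(sort(cors, decreasing = TRUE))
```

On the real data, the same `sapply` call could be run over all numeric columns of `college_data` to rank predictors before modeling.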

Defining “Hard to Graduate”

I define a college as “hard to graduate” from if its graduation rate is below 60%, and I add a binary column that records this.

# Create a new binary column 'Hard_to_Graduate' indicating if the college has a graduation rate below 60%
college_data$Hard_to_Graduate <- ifelse(college_data$Grad.Rate < 60, 1, 0)

summary(college_data)
##       Apps           Accept          Enroll       Top10perc       Top25perc    
##  Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00   Min.   :  9.0  
##  1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00   1st Qu.: 41.0  
##  Median : 1558   Median : 1110   Median : 434   Median :23.00   Median : 54.0  
##  Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56   Mean   : 55.8  
##  3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00   3rd Qu.: 69.0  
##  Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00   Max.   :100.0  
##   F.Undergrad     P.Undergrad         Outstate       Room.Board  
##  Min.   :  139   Min.   :    1.0   Min.   : 2340   Min.   :1780  
##  1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597  
##  Median : 1707   Median :  353.0   Median : 9990   Median :4200  
##  Mean   : 3700   Mean   :  855.3   Mean   :10441   Mean   :4358  
##  3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050  
##  Max.   :31643   Max.   :21836.0   Max.   :21700   Max.   :8124  
##      Books           Personal         PhD            Terminal    
##  Min.   :  96.0   Min.   : 250   Min.   :  8.00   Min.   : 24.0  
##  1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0  
##  Median : 500.0   Median :1200   Median : 75.00   Median : 82.0  
##  Mean   : 549.4   Mean   :1341   Mean   : 72.66   Mean   : 79.7  
##  3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0  
##  Max.   :2340.0   Max.   :6800   Max.   :103.00   Max.   :100.0  
##    S.F.Ratio      perc.alumni        Expend        Grad.Rate     
##  Min.   : 2.50   Min.   : 0.00   Min.   : 3186   Min.   : 10.00  
##  1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751   1st Qu.: 53.00  
##  Median :13.60   Median :21.00   Median : 8377   Median : 65.00  
##  Mean   :14.09   Mean   :22.74   Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :39.80   Max.   :64.00   Max.   :56233   Max.   :118.00  
##  Hard_to_Graduate
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3732  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
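
The mean of `Hard_to_Graduate` in the summary above (0.3732) is the share of colleges labeled 1, so roughly 37% of colleges fall below the 60% threshold. It follows that a model that always predicts 0 would be right about 63% of the time on the full dataset, which is a useful baseline for the accuracies reported later. The sketch below shows the same check on the first ten `Grad.Rate` values from the `str()` output; in this small sample the more common class happens to be 1.

```r
# First ten Grad.Rate values as shown in the str() output
grad_rate <- c(60, 56, 54, 59, 15, 55, 63, 73, 80, 52)

# Same rule as in the report: 1 if the rate is below 60, else 0
hard <- ifelse(grad_rate < 60, 1, 0)

# Share of each class
class_share <- prop.table(table(hard))
print(class_share)

# Accuracy of always predicting the more common class
baseline_acc <- max(class_share)
print(baseline_acc)
```

Note that a rate of exactly 60 is not counted as hard to graduate, because the comparison is strict.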
Building a Decision Tree Model for Predicting College Graduation Rates

Splitting the Data:

We split the college data into training and test sets using a random seed of 365 and a train-test ratio of 80:20. The training set is used to build the decision tree model, and the test set is used to evaluate its performance.

Building the Decision Tree Model:

We use the rpart function to build the decision tree model. The target variable is Hard_to_Graduate, and the predictor variables include various college attributes such as the number of applications, the number of accepted applications, enrollment, faculty statistics, financial factors, and alumni percentages.

Visualizing the Decision Tree:

We use the rpart.plot function from the rpart.plot package to create a visually interpretable plot of the decision tree. The plot shows the decision-making process of the model at each node, with different colors representing different decision outcomes (hard to graduate or not hard to graduate).

Predictions and Evaluation:

We use the trained decision tree model to predict the Hard_to_Graduate values for the test set. These predictions are then compared with the actual values using a confusion matrix. The accuracy of the decision tree model is calculated as the ratio of the correctly predicted cases to the total number of cases in the test set. The accuracy score indicates how well the model can classify colleges as “hard to graduate” or “not hard to graduate.”

# Split the data into training and test sets
set.seed(365)
train_id <- sample(nrow(college_data), size = 0.8 * nrow(college_data))
train_set <- college_data[train_id,]
test_set <- college_data[-train_id,]

# Fit a decision tree model
model_tree <- rpart(Hard_to_Graduate ~ Apps + Accept + Enroll +
                      Top10perc + Top25perc + F.Undergrad + P.Undergrad + Outstate + Room.Board + Books + Personal + PhD +
                      Terminal + S.F.Ratio + perc.alumni + Expend, data = train_set, method = "class")

# Plot the decision tree with text annotations
rpart.plot(model_tree, digits = 2, fallen.leaves = TRUE, under = TRUE, box.col = c("red", "blue"), cex = 1,
           shadow.col = "gray", shadow.offset = 0.02, split.cex = 0.8, faclen = 0, varlen = 0, branch = 0.5)

# Predict on the test set
predictions <- predict(model_tree, test_set, type = "class")
# Confusion matrix to evaluate the model's performance
confusion_matrix <- table(predictions, test_set$Hard_to_Graduate)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy (Decision Tree):", accuracy))
## [1] "Accuracy (Decision Tree): 0.730769230769231"
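
Accuracy alone hides which kind of error the tree makes. The sketch below derives precision and recall for the "hard to graduate" class (label 1) from a confusion matrix. The `actual` and `predicted` vectors are invented for illustration, and fixing the factor levels keeps the table 2x2 even if one class is never predicted.

```r
# Invented labels for ten colleges (not model output)
actual    <- factor(c(1, 0, 0, 1, 1, 0, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 1, 0, 0, 1, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are actual values
cm <- table(predicted = predicted, actual = actual)
print(cm)

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["1", "1"] / sum(cm["1", ])  # of predicted 1s, share that are truly 1
recall    <- cm["1", "1"] / sum(cm[, "1"])  # of actual 1s, share that were found

print(c(accuracy = accuracy, precision = precision, recall = recall))
```

The same three lines could be applied to `confusion_matrix` from the report, provided both `predictions` and `test_set$Hard_to_Graduate` are factors with levels 0 and 1.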
Fit a Random Forest Model for Predicting College Graduation Rates

We fit a random forest model to predict Hard_to_Graduate from the same college characteristics used for the decision tree, such as the number of applications received, the number of accepted applications, enrollment numbers, faculty characteristics, and other factors.

We then generate a Variable Importance Plot to visualize the relative importance of each feature. The plot shows which variables the model relies on most.

Finally, we predict on the test set using the random forest model and calculate its accuracy from a confusion matrix. The accuracy metric is meant to show how well the model predicts whether a college is hard to graduate from or not.

A random forest is an ensemble method: it combines many decision trees to make predictions, which typically makes it more robust than a single decision tree. A reliable model of this kind could be valuable for educational institutions and policymakers in understanding college performance and student outcomes.

# Fit a random forest model
model_rf <- randomForest(Hard_to_Graduate ~ Apps + Accept + Enroll +
                           Top10perc + Top25perc + F.Undergrad + P.Undergrad + Outstate + Room.Board + Books + Personal + PhD +
                           Terminal + S.F.Ratio + perc.alumni + Expend, data = train_set)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values.  Are you sure you want to do regression?
# Variable Importance Plot for Random Forest
var_importance <- as.data.frame(model_rf$importance)
rownames(var_importance) <- c("Apps", "Accept", "Enroll", "Top10perc", "Top25perc", "F.Undergrad",
                              "P.Undergrad", "Outstate", "Room.Board", "Books", "Personal", "PhD",
                              "Terminal", "S.F.Ratio", "perc.alumni", "Expend")

varImpPlot(model_rf, main = "Variable Importance Plot (Random Forest)", col = "darkblue", lwd = 2, type = 2)

# Predict on the test set using random forest
predictions_rf <- predict(model_rf, test_set, type = "class")

# Confusion matrix to evaluate random forest model's performance
confusion_matrix_rf <- table(predictions_rf, test_set$Hard_to_Graduate)
accuracy_rf <- sum(diag(confusion_matrix_rf)) / sum(confusion_matrix_rf)
print(paste("Accuracy (Random Forest):", accuracy_rf))
## [1] "Accuracy (Random Forest): 0.00641025641025641"
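
The warning above indicates that `randomForest` treated the numeric 0/1 response as a regression target. In that mode `predict()` returns continuous values, so the table of predictions against labels is not a 2x2 confusion matrix and its diagonal does not count correct classifications; the reported 0.0064 equals 1/156, where 156 is the size of the test set. The following is a minimal sketch of a classification setup, assuming the response is converted with `factor()`. It runs on simulated data (`toy`, `x1`, `x2`, and `label` are hypothetical names), so its accuracy says nothing about the college dataset.

```r
library(randomForest)

set.seed(365)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Simulated binary label that depends on x1 (illustrative only)
toy$label <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 0, 1, 0), levels = c(0, 1))

# 80:20 split, as in the report
train_id  <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train_set <- toy[train_id, ]
test_set  <- toy[-train_id, ]

# A factor response makes randomForest() fit a classification forest
rf <- randomForest(label ~ x1 + x2, data = train_set)

# Predictions are now class labels, so the table is a 2x2 confusion matrix
pred <- predict(rf, test_set)
cm <- table(predicted = pred, actual = test_set$label)
accuracy <- sum(diag(cm)) / sum(cm)

print(cm)
print(accuracy)
```

Applied to the report's data, the equivalent change would be to convert `Hard_to_Graduate` to a factor before splitting, so that both models are trained and scored on the same kind of response.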
Conclusion

In this study, we investigated the factors influencing a college’s graduation rate and explored the use of machine learning models for predicting whether a college is likely to have a high or low graduation rate.

Using a dataset of 777 colleges, we considered predictors including the number of applications, accepted applications, enrollment, faculty qualifications, student-to-faculty ratio, alumni engagement, and financial resources. The variable importance plot from the Random Forest indicates which of these the model relied on most, although importance in a predictive model does not by itself show that a factor causes higher or lower graduation rates.

We applied two machine learning algorithms, a Decision Tree and a Random Forest. The Decision Tree is transparent, lets us visualize the decision-making process, and reached an accuracy of about 73% on the test set. The Random Forest’s reported accuracy (about 0.6%) is not comparable: the response was passed as a numeric 0/1 column, so the model was fit as a regression (see the warning in its output), and its continuous predictions do not line up with the class labels in the confusion matrix. The model would need to be refit with the response as a factor before concluding which of the two predicts better.

Predictive models of this kind can help colleges assess their graduation rates and identify areas for improvement, for example by targeting strategies for student success and retention.

Overall, our study underscores the importance of data-driven approaches in higher education. By leveraging machine learning models, colleges can gain valuable insights into the factors affecting graduation rates and make informed decisions to optimize student outcomes and foster a supportive and conducive learning environment.