What factors influence a college’s graduation rate, and can we use machine learning models to predict whether a college is likely to have a high or low graduation rate?
The data come from the College.csv file referenced on Canvas.
# Load the necessary packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(e1071)
library(rpart)
library(rpart.plot)
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# Load the dataset
college_data <- read.csv("/Users/bensmac/Documents/Stanford/Final/College.csv")
summary(college_data)
## X Private Apps Accept
## Length:777 Length:777 Min. : 81 Min. : 72
## Class :character Class :character 1st Qu.: 776 1st Qu.: 604
## Mode :character Mode :character Median : 1558 Median : 1110
## Mean : 3002 Mean : 2019
## 3rd Qu.: 3624 3rd Qu.: 2424
## Max. :48094 Max. :26330
## Enroll Top10perc Top25perc F.Undergrad
## Min. : 35 Min. : 1.00 Min. : 9.0 Min. : 139
## 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992
## Median : 434 Median :23.00 Median : 54.0 Median : 1707
## Mean : 780 Mean :27.56 Mean : 55.8 Mean : 3700
## 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005
## Max. :6392 Max. :96.00 Max. :100.0 Max. :31643
## P.Undergrad Outstate Room.Board Books
## Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0
## 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0
## Median : 353.0 Median : 9990 Median :4200 Median : 500.0
## Mean : 855.3 Mean :10441 Mean :4358 Mean : 549.4
## 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0
## Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0
## Personal PhD Terminal S.F.Ratio
## Min. : 250 Min. : 8.00 Min. : 24.0 Min. : 2.50
## 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50
## Median :1200 Median : 75.00 Median : 82.0 Median :13.60
## Mean :1341 Mean : 72.66 Mean : 79.7 Mean :14.09
## 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50
## Max. :6800 Max. :103.00 Max. :100.0 Max. :39.80
## perc.alumni Expend Grad.Rate
## Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :21.00 Median : 8377 Median : 65.00
## Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :64.00 Max. :56233 Max. :118.00
The “Private” column caused problems when constructing the model because it stores a yes/no value as text. I decided not to use “Private” in my model.
# Drop the first column ('X', the college names); 'Private' is still present here
college_data <- college_data[-1]
# Explore the structure of the dataset
str(college_data)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Apps : int 1660 2186 1428 417 193 587 353 1899 1038 582 ...
## $ Accept : int 1232 1924 1097 349 146 479 340 1720 839 498 ...
## $ Enroll : int 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : int 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : int 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: int 2885 2683 1036 510 249 678 416 1594 973 799 ...
## $ P.Undergrad: int 537 1227 99 63 869 41 230 32 306 78 ...
## $ Outstate : int 7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
## $ Room.Board : int 3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
## $ Books : int 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : int 2200 1500 1165 875 1500 675 1500 850 500 1800 ...
## $ PhD : int 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : int 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: int 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : int 7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
## $ Grad.Rate : int 60 56 54 59 15 55 63 73 80 52 ...
# Check for missing values
sum(is.na(college_data))
## [1] 0
# Remove any rows with missing values (none here, so this leaves the data unchanged)
college_data <- college_data[complete.cases(college_data), ]
# Drop the 'Private' column, which is not used for modeling
college_data <- college_data %>%
  select(-Private)
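Dropping “Private” is one option. Another, sketched here on the assumption that the column holds only the strings "Yes" and "No" (as the str() output suggests), is to convert it to a factor so that a tree model can split on it. The toy data frame and its values are invented for illustration and are not taken from College.csv.

```r
library(rpart)  # ships with standard R installations

# Hypothetical toy data standing in for college_data (values are invented)
toy <- data.frame(
  Private = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"),
  Outstate = c(12000, 6000, 15000, 5500, 13000, 7000, 14000, 6500),
  Hard_to_Graduate = c(0, 1, 0, 1, 0, 1, 0, 1)
)

# Convert the character column to a factor with an explicit level order
toy$Private <- factor(toy$Private, levels = c("No", "Yes"))

# rpart accepts factor predictors directly. minsplit is lowered and
# cross-validation is switched off (xval = 0) only because the toy data is tiny.
fit <- rpart(Hard_to_Graduate ~ Private + Outstate, data = toy,
             method = "class",
             control = rpart.control(minsplit = 2, xval = 0))

str(toy$Private)
```

With the real data, the same factor() call would be applied to college_data$Private before the train/test split, and Private would be added to the model formula.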
We use the ggplot2 package to create scatter plots with “Graduation Rate” on the x-axis and several other columns on the y-axis. The first plot covers “Apps,” “Accept,” “Enroll,” “Top10perc,” and “Top25perc”; the second covers “F.Undergrad,” “P.Undergrad,” “Outstate,” “Room.Board,” and “Books.” These columns describe characteristics of colleges that might be related to their graduation rates.
Each point shows one college’s graduation rate against the value of the respective column, so the plots give a first visual impression of how each factor relates to “Graduation Rate.” Because the columns share one y-axis, counts such as “Apps” (up to 48,094) dominate the scale and percentages such as “Top10perc” (at most 96) are compressed near zero, so the plots are best read as a rough overview.
# Load the necessary packages
library(ggplot2)
# Scatter plot of Grad.Rate against the application, enrollment, and class-rank columns
ggplot(college_data, aes(x = Grad.Rate)) +
  geom_point(aes(y = Apps, color = "Apps"), size = 3) +
  geom_point(aes(y = Accept, color = "Accept"), size = 3) +
  geom_point(aes(y = Enroll, color = "Enroll"), size = 3) +
  geom_point(aes(y = Top10perc, color = "Top10perc"), size = 3) +
  geom_point(aes(y = Top25perc, color = "Top25perc"), size = 3) +
  labs(title = "Comparison of Graduation Rate with Other Columns",
       x = "Graduation Rate",
       y = "Other Columns") +
  scale_color_manual(values = c("Apps" = "blue", "Accept" = "red", "Enroll" = "green",
                                "Top10perc" = "purple", "Top25perc" = "orange")) +
  theme_minimal()
# Scatter plot of Grad.Rate against the student-body size and cost columns
ggplot(college_data, aes(x = Grad.Rate)) +
  geom_point(aes(y = F.Undergrad, color = "F.Undergrad"), size = 3) +
  geom_point(aes(y = P.Undergrad, color = "P.Undergrad"), size = 3) +
  geom_point(aes(y = Outstate, color = "Outstate"), size = 3) +
  geom_point(aes(y = Room.Board, color = "Room.Board"), size = 3) +
  geom_point(aes(y = Books, color = "Books"), size = 3) +
  labs(title = "Comparison of Graduation Rate with Other Columns",
       x = "Graduation Rate",
       y = "Other Columns") +
  scale_color_manual(values = c("F.Undergrad" = "blue", "P.Undergrad" = "red", "Outstate" = "green",
                                "Room.Board" = "purple", "Books" = "orange")) +
  theme_minimal()
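When columns with very different ranges share a single y-axis, the small-valued ones are hard to read. One possible alternative, sketched here on invented toy data rather than on College.csv, is to reshape the data to long format and draw one panel per column with its own y-axis. The object names (toy, vars, long, p) are placeholders chosen for this example.

```r
library(ggplot2)

# Hypothetical toy data standing in for college_data (values are invented)
set.seed(1)
toy <- data.frame(
  Grad.Rate = round(runif(50, 30, 100)),
  Apps      = round(runif(50, 100, 20000)),
  Top10perc = round(runif(50, 1, 90)),
  Books     = round(runif(50, 100, 1000))
)

# Reshape to long format with base R: one row per (college, variable) pair
vars <- c("Apps", "Top10perc", "Books")
long <- data.frame(
  Grad.Rate = rep(toy$Grad.Rate, times = length(vars)),
  variable  = rep(vars, each = nrow(toy)),
  value     = unlist(toy[vars], use.names = FALSE)
)

# One panel per variable, each with its own y-axis range
p <- ggplot(long, aes(x = Grad.Rate, y = value)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "Graduation Rate", y = "Value") +
  theme_minimal()
print(p)
```

The scales = "free_y" argument is what lets each panel choose its own range; with the default fixed scales the panels would show the same compression as a single shared axis.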
I defined a college as “hard to graduate” from if its graduation rate is below 60%, and added an extra column for this label.
# Create a new binary column 'Hard_to_Graduate' indicating if the college has a graduation rate below 60%
college_data$Hard_to_Graduate <- ifelse(college_data$Grad.Rate < 60, 1, 0)
summary(college_data)
## Apps Accept Enroll Top10perc Top25perc
## Min. : 81 Min. : 72 Min. : 35 Min. : 1.00 Min. : 9.0
## 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0
## Median : 1558 Median : 1110 Median : 434 Median :23.00 Median : 54.0
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56 Mean : 55.8
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00 Max. :100.0
## F.Undergrad P.Undergrad Outstate Room.Board
## Min. : 139 Min. : 1.0 Min. : 2340 Min. :1780
## 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597
## Median : 1707 Median : 353.0 Median : 9990 Median :4200
## Mean : 3700 Mean : 855.3 Mean :10441 Mean :4358
## 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050
## Max. :31643 Max. :21836.0 Max. :21700 Max. :8124
## Books Personal PhD Terminal
## Min. : 96.0 Min. : 250 Min. : 8.00 Min. : 24.0
## 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0
## Median : 500.0 Median :1200 Median : 75.00 Median : 82.0
## Mean : 549.4 Mean :1341 Mean : 72.66 Mean : 79.7
## 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0
## Max. :2340.0 Max. :6800 Max. :103.00 Max. :100.0
## S.F.Ratio perc.alumni Expend Grad.Rate
## Min. : 2.50 Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :13.60 Median :21.00 Median : 8377 Median : 65.00
## Mean :14.09 Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :39.80 Max. :64.00 Max. :56233 Max. :118.00
## Hard_to_Graduate
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3732
## 3rd Qu.:1.0000
## Max. :1.0000
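The mean of 0.3732 for Hard_to_Graduate means that about 37% of colleges carry the label 1, so a rule that always predicts 0 would already be correct for about 63% of the full data set. That figure is a useful reference point for any model accuracy. The sketch below shows how such a baseline can be computed; the graduation rates in it are invented for illustration.

```r
# Hypothetical graduation rates (invented values, not taken from College.csv)
grad_rate <- c(45, 72, 58, 90, 65, 30, 81, 59, 60, 77)

# Same rule as above: 1 if the rate is below 60, otherwise 0
hard <- ifelse(grad_rate < 60, 1, 0)

# Share of each class
class_share <- prop.table(table(hard))
print(class_share)

# Accuracy obtained by always predicting the more common class
baseline_accuracy <- max(class_share)
print(baseline_accuracy)
```

Note that a rate of exactly 60 is labelled 0, because the rule uses a strict "less than" comparison.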
Splitting the Data:
We split the college data into training and test sets using a random seed of 365 and a train-test ratio of 80:20. The training set is used to build the decision tree model, and the test set is used to evaluate its performance.
Building the Decision Tree Model:
We use the rpart function to build the decision tree model. The target variable is Hard_to_Graduate, and the predictor variables are the remaining college attributes other than Grad.Rate: application, acceptance, and enrollment counts, class-rank percentages, faculty statistics, financial factors, and the alumni percentage.
Visualizing the Decision Tree:
We use the rpart.plot function from the rpart.plot package to create a visually interpretable plot of the decision tree. The plot shows the split rule at each node, with box colors used to distinguish the two outcomes (hard to graduate or not hard to graduate).
Predictions and Evaluation:
We use the trained decision tree model to predict the Hard_to_Graduate values for the test set. These predictions are then compared with the actual values using a confusion matrix. The accuracy of the decision tree model is calculated as the ratio of the correctly predicted cases to the total number of cases in the test set. The accuracy score indicates how well the model can classify colleges as “hard to graduate” or “not hard to graduate.”
# Load the required libraries
library(rpart.plot)
# Split the data into training and test sets
set.seed(365)
train_id <- sample(nrow(college_data), size = 0.8 * nrow(college_data))
train_set <- college_data[train_id,]
test_set <- college_data[-train_id,]
# Fit a decision tree model
model_tree <- rpart(Hard_to_Graduate ~ Apps + Accept + Enroll +
                      Top10perc + Top25perc + F.Undergrad + P.Undergrad + Outstate + Room.Board + Books + Personal + PhD +
                      Terminal + S.F.Ratio + perc.alumni + Expend, data = train_set, method = "class")
# Plot the decision tree with text annotations
# box.col is indexed by each node's predicted class so that the color follows the
# outcome (red = not hard to graduate, blue = hard to graduate)
rpart.plot(model_tree, digits = 2, fallen.leaves = TRUE, under = TRUE,
           box.col = c("red", "blue")[model_tree$frame$yval], cex = 1,
           shadow.col = "gray", shadow.offset = 0.02, split.cex = 0.8, faclen = 0, varlen = 0, branch = 0.5)
# Predict on the test set
predictions <- predict(model_tree, test_set, type = "class")
# Confusion matrix to evaluate the model's performance
confusion_matrix <- table(predictions, test_set$Hard_to_Graduate)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy (Decision Tree):", accuracy))
## [1] "Accuracy (Decision Tree): 0.730769230769231"
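Accuracy alone hides which kind of error the model makes. As a rough sketch, the helper below (evaluate_binary, a hypothetical name that is not part of any package) forces predictions and labels onto the same 0/1 levels so that the table is always 2 x 2, then reports accuracy together with sensitivity and specificity. The example labels are invented.

```r
# Compute accuracy, sensitivity, and specificity for 0/1 labels.
# evaluate_binary is a hypothetical helper, not part of any package.
evaluate_binary <- function(predicted, actual) {
  # Force both vectors onto the same levels so the table is always 2 x 2
  predicted <- factor(predicted, levels = c(0, 1))
  actual    <- factor(actual, levels = c(0, 1))
  cm <- table(Predicted = predicted, Actual = actual)
  list(
    confusion_matrix = cm,
    accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm[, "1"]),  # share of true 1s that were found
    specificity = cm["0", "0"] / sum(cm[, "0"])   # share of true 0s that were found
  )
}

# Invented example labels
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0)
predicted <- c(1, 0, 1, 0, 0, 1, 0, 0)
result <- evaluate_binary(predicted, actual)
print(result$confusion_matrix)
print(result$accuracy)
```

Applied to the objects above, the call would be evaluate_binary(predictions, test_set$Hard_to_Graduate). Values other than 0 and 1 become NA inside factor() and are dropped from the table, so the helper assumes class labels rather than continuous scores.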
We fit a random forest model to predict the Hard_to_Graduate label from the same college characteristics used for the decision tree, such as the number of applications received, the number of applicants accepted, enrollment numbers, and faculty characteristics. A random forest is an ensemble method: it combines many decision trees, which usually makes its predictions more robust than those of a single tree. Such predictions could be useful to educational institutions and policymakers interested in college performance and student outcomes.
We then generate a variable importance plot to visualize the relative importance of each feature, predict on the test set, and compute an accuracy from a confusion matrix in the same way as for the decision tree.
One caveat applies to the run shown here. Because Hard_to_Graduate is stored as a numeric 0/1 column, randomForest fits a regression forest, as the warning in the output indicates. Its predictions are therefore continuous scores rather than class labels, so the table of predictions against labels is not a 2 x 2 confusion matrix and the printed accuracy is not a meaningful measure of performance.
# Fit a random forest model
# Note: the response is numeric 0/1, so this call fits a regression forest
# (see the warning in the output)
model_rf <- randomForest(Hard_to_Graduate ~ Apps + Accept + Enroll +
                           Top10perc + Top25perc + F.Undergrad + P.Undergrad + Outstate + Room.Board + Books + Personal + PhD +
                           Terminal + S.F.Ratio + perc.alumni + Expend, data = train_set)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
# Variable Importance Plot for Random Forest
var_importance <- as.data.frame(model_rf$importance)
rownames(var_importance) <- c("Apps", "Accept", "Enroll", "Top10perc", "Top25perc", "F.Undergrad",
"P.Undergrad", "Outstate", "Room.Board", "Books", "Personal", "PhD",
"Terminal", "S.F.Ratio", "perc.alumni", "Expend")
varImpPlot(model_rf, main = "Variable Importance Plot (Random Forest)", col = "darkblue", lwd = 2, type = 2)
# Predict on the test set using random forest
# (for a regression forest these are numeric scores, not class labels)
predictions_rf <- predict(model_rf, test_set, type = "class")
# Cross-tabulate predictions against the true labels; with numeric predictions the
# table is not square, so the diagonal-based value below is not a valid accuracy
confusion_matrix_rf <- table(predictions_rf, test_set$Hard_to_Graduate)
accuracy_rf <- sum(diag(confusion_matrix_rf)) / sum(confusion_matrix_rf)
print(paste("Accuracy (Random Forest):", accuracy_rf))
## [1] "Accuracy (Random Forest): 0.00641025641025641"
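The warning and the near-zero accuracy above both follow from the response being numeric, which makes randomForest fit a regression forest. A minimal sketch of the classification setup is shown below: the response is converted to a factor before fitting, so the predictions are class labels and the confusion matrix is 2 x 2. The toy data, the rule that generates its labels, and the object names (toy, rf_fit, rf_pred, rf_cm, rf_acc) are all invented for illustration and do not reproduce College.csv.

```r
library(randomForest)

# Hypothetical toy data standing in for train_set/test_set (values are invented)
set.seed(365)
n <- 200
toy <- data.frame(
  Outstate  = runif(n, 2000, 22000),
  Top10perc = runif(n, 1, 96)
)
# Invented rule plus noise, only so the forest has some signal to learn
score <- toy$Outstate + 100 * toy$Top10perc + rnorm(n, sd = 3000)
toy$Hard_to_Graduate <- ifelse(score < 14000, 1, 0)

# Key step: store the response as a factor so randomForest does classification
toy$Hard_to_Graduate <- factor(toy$Hard_to_Graduate, levels = c(0, 1))

train_id  <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_id, ]
toy_test  <- toy[-train_id, ]

rf_fit <- randomForest(Hard_to_Graduate ~ Outstate + Top10perc,
                       data = toy_train, ntree = 200)

# Predictions are now factor labels, so the confusion matrix is 2 x 2
rf_pred <- predict(rf_fit, toy_test)
rf_cm   <- table(Predicted = rf_pred, Actual = toy_test$Hard_to_Graduate)
rf_acc  <- sum(diag(rf_cm)) / sum(rf_cm)
print(rf_cm)
print(rf_acc)
```

On the original objects, the analogous change would be to convert Hard_to_Graduate to a factor in both train_set and test_set before calling randomForest; the accuracy would then need to be recomputed. An alternative that keeps the regression forest is to threshold its numeric predictions (for example at 0.5) before building the confusion matrix.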
In this study, we investigated the factors influencing a college’s graduation rate and explored the use of machine learning models for predicting whether a college is likely to have a high or low graduation rate.
Using a dataset of 777 colleges, we considered a range of candidate factors: the numbers of applications, accepted applicants, and enrolled students, faculty qualifications, student-to-faculty ratio, alumni engagement, and financial resources. Understanding which of these variables matter most can give colleges useful insight into how to improve their graduation rates and the overall educational experience.
We applied two machine learning algorithms, a decision tree and a random forest, to classify colleges as hard to graduate from (graduation rate below 60%) or not. The decision tree is transparent and lets us visualize the decision-making process; it reached an accuracy of about 0.73 on the test set. The random forest run reported a value of about 0.006, but that figure is not a valid accuracy: the forest was fit as a regression model (see the warning in its output), so its continuous predictions cannot be compared with the 0/1 labels through a confusion matrix. The results shown therefore do not establish that the random forest is more accurate than the decision tree; it would need to be refit as a classifier before the two models can be compared.
With further validation, models of this kind could help colleges assess their graduation rates and identify areas for improvement. Predictive models could then support targeted strategies to improve student success and retention, leading to higher graduation rates.
Overall, our study underscores the importance of data-driven approaches in higher education. By leveraging machine learning models, colleges can gain valuable insights into the factors affecting graduation rates and make informed decisions to optimize student outcomes and foster a supportive and conducive learning environment.