Analysis of Graduate Admissions using Decision Trees

Graduate admissions are a big deal: we sit exams such as the GRE and TOEFL, write an SOP and collect LORs, all to make it to our dream college. I went through the same process, which is why this dataset interested me.

The dataset is taken from Kaggle. It contains seven parameters that are considered important when applying to Masters programs, along with the response variable, Chance of Admit:

  1. GRE Score (out of 340)
  2. TOEFL Score (out of 120)
  3. University Rating (out of 5)
  4. Statement of Purpose strength (out of 5)
  5. Letter of Recommendation strength (out of 5)
  6. Undergraduate GPA (out of 10)
  7. Research Experience (either 0 or 1)
  8. Chance of Admit (ranging from 0 to 1)

Libraries used

library(tidyverse)
library(dplyr)
library(leaps)
library(corrgram)
library(rpart)
library(rpart.plot)
library(rattle)

Importing the data and splitting the data

admission_data <- read.csv("Admission_Predict.csv")
head(admission_data)
##   Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1          1       337         118                 4 4.5 4.5 9.65        1
## 2          2       324         107                 4 4.0 4.5 8.87        1
## 3          3       316         104                 3 3.0 3.5 8.00        1
## 4          4       322         110                 3 3.5 2.5 8.67        1
## 5          5       314         103                 2 2.0 3.0 8.21        0
## 6          6       330         115                 5 4.5 3.0 9.34        1
##   Chance.of.Admit
## 1            0.92
## 2            0.76
## 3            0.72
## 4            0.80
## 5            0.65
## 6            0.90
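# Hold out 20% of the rows for testing; the split is random, so results vary between runs unless a seed is set
sample_index <- sample(nrow(admission_data), nrow(admission_data)*0.80)
admission_train <- admission_data[sample_index,]
admission_test <- admission_data[-sample_index,]
# Note: this conversion runs after the split, so it affects only admission_data; the train and test sets keep Research as numeric
admission_data$Research <- as.factor(admission_data$Research)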
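
Because sample() draws rows at random, the split above changes on every run. Here is a minimal sketch of a reproducible variant. It assumes an arbitrary seed (42 is for illustration only and is not the value behind the results in this post) and uses a small stand-in data frame with made-up values in place of the CSV. It also converts Research before splitting, so that the train and test sets inherit the factor:

```r
# Stand-in data (hypothetical values); replace with the data frame read from the CSV
toy_data <- data.frame(
  CGPA     = seq(7, 9.7, length.out = 10),
  Research = rep(c(0, 1), 5)
)

# Convert before splitting so both subsets carry the factor column
toy_data$Research <- as.factor(toy_data$Research)

set.seed(42)  # arbitrary seed, chosen only for illustration
n <- nrow(toy_data)
train_index <- sample(n, floor(n * 0.80))

toy_train <- toy_data[train_index, ]
toy_test  <- toy_data[-train_index, ]

nrow(toy_train)  # 8
nrow(toy_test)   # 2
```

With a fixed seed, rerunning the script yields the same train and test rows, which makes the MSE figures further down repeatable.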

Exploratory Data Analysis

Plots

# GRE Score vs. Chance of Admit, faceted by University Rating
ggplot(admission_data, aes(x = GRE.Score, y = Chance.of.Admit))+
  geom_point()+
  facet_grid(University.Rating ~ .)+
  ylab("Chance of Admit")+
  xlab("GRE Score")

# TOEFL Score vs. Chance of Admit, faceted by University Rating
ggplot(admission_data, aes(x = TOEFL.Score, y = Chance.of.Admit))+
  geom_point()+
  facet_grid(University.Rating ~ .)+
  ylab("Chance of Admit")+
  xlab("TOEFL Score")

# CGPA vs. Chance of Admit, faceted by University Rating
ggplot(admission_data, aes(x = CGPA, y = Chance.of.Admit))+
  geom_point()+
  facet_grid(University.Rating ~ .)+
  ylab("Chance of Admit")+
  xlab("CGPA")

# SOP vs. Chance of Admit, faceted by University Rating
ggplot(admission_data, aes(x = SOP, y = Chance.of.Admit))+
  geom_point()+
  facet_grid(University.Rating ~ .)+
  ylab("Chance of Admit")+
  xlab("SOP")

# LOR vs. Chance of Admit, faceted by University Rating
ggplot(admission_data, aes(x = LOR, y = Chance.of.Admit))+
  geom_point()+
  facet_grid(University.Rating ~ .)+
  ylab("Chance of Admit")+
  xlab("LOR")

Correlation Check

corrgram(admission_data, lower.panel = panel.fill, upper.panel = panel.cor)

Insights

  1. There seems to be a high correlation between GRE Score and TOEFL Score.
  2. SOP and LOR scores seem to be highly correlated with the University Rating.
  3. The higher the CGPA, the higher the chance of getting an admit.
  4. Likewise, the higher the GRE and TOEFL scores, the higher the chance of getting an admit.
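
These visual impressions can also be checked numerically. The sketch below is only an illustration. It uses just the six rows printed by head(admission_data) earlier, which is far too few to draw conclusions from, and the name first_rows is introduced here for the example. On the full data, the same cor() call could be applied to the numeric columns of admission_data.

```r
# First six rows as printed by head(admission_data) above
first_rows <- data.frame(
  GRE.Score       = c(337, 324, 316, 322, 314, 330),
  TOEFL.Score     = c(118, 107, 104, 110, 103, 115),
  CGPA            = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90)
)

# Pearson correlation matrix, rounded for readability
cor_matrix <- round(cor(first_rows), 2)
print(cor_matrix)

# Single pair: GRE Score vs. TOEFL Score
gre_toefl <- cor(first_rows$GRE.Score, first_rows$TOEFL.Score)
gre_toefl
```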

Decision Tree -1

cp = 0.01 (the rpart default)

reg_tree <- rpart(Chance.of.Admit ~  GRE.Score + TOEFL.Score + University.Rating + SOP + LOR+ CGPA + Research, data = admission_train )
summary(reg_tree)
## Call:
## rpart(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research, data = admission_train)
##   n= 400 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.53001196      0 1.0000000 1.0036629 0.06419074
## 2 0.15375933      1 0.4699880 0.5097235 0.03983339
## 3 0.04238622      2 0.3162287 0.3642947 0.03156608
## 4 0.02551447      3 0.2738425 0.3162808 0.02759593
## 5 0.02106897      4 0.2483280 0.3036896 0.02802083
## 6 0.01392306      5 0.2272590 0.2870013 0.02696670
## 7 0.01000000      6 0.2133360 0.2647919 0.02510073
## 
## Variable importance
##              CGPA         GRE.Score       TOEFL.Score University.Rating 
##                32                17                17                13 
##               SOP               LOR 
##                11                 9 
## 
## Node number 1: 400 observations,    complexity param=0.530012
##   mean=0.721025, MSE=0.0196347 
##   left son=2 (267 obs) right son=3 (133 obs)
##   Primary splits:
##       CGPA              < 8.845 to the left,  improve=0.5300120, (0 missing)
##       GRE.Score         < 319.5 to the left,  improve=0.4953457, (0 missing)
##       TOEFL.Score       < 108.5 to the left,  improve=0.4511530, (0 missing)
##       University.Rating < 3.5   to the left,  improve=0.3481329, (0 missing)
##       SOP               < 3.75  to the left,  improve=0.3176795, (0 missing)
##   Surrogate splits:
##       TOEFL.Score       < 110.5 to the left,  agree=0.872, adj=0.617, (0 split)
##       GRE.Score         < 320.5 to the left,  agree=0.862, adj=0.586, (0 split)
##       University.Rating < 3.5   to the left,  agree=0.830, adj=0.489, (0 split)
##       SOP               < 3.75  to the left,  agree=0.802, adj=0.406, (0 split)
##       LOR               < 4.25  to the left,  agree=0.785, adj=0.353, (0 split)
## 
## Node number 2: 267 observations,    complexity param=0.1537593
##   mean=0.6490262, MSE=0.01132939 
##   left son=4 (84 obs) right son=5 (183 obs)
##   Primary splits:
##       CGPA        < 8.035 to the left,  improve=0.3992160, (0 missing)
##       GRE.Score   < 304.5 to the left,  improve=0.2994343, (0 missing)
##       TOEFL.Score < 101.5 to the left,  improve=0.2552973, (0 missing)
##       LOR         < 2.25  to the left,  improve=0.1444468, (0 missing)
##       SOP         < 2.25  to the left,  improve=0.1404090, (0 missing)
##   Surrogate splits:
##       GRE.Score         < 304.5 to the left,  agree=0.794, adj=0.345, (0 split)
##       SOP               < 2.25  to the left,  agree=0.768, adj=0.262, (0 split)
##       TOEFL.Score       < 99.5  to the left,  agree=0.764, adj=0.250, (0 split)
##       University.Rating < 1.5   to the left,  agree=0.757, adj=0.226, (0 split)
##       LOR               < 2.25  to the left,  agree=0.753, adj=0.214, (0 split)
## 
## Node number 3: 133 observations,    complexity param=0.04238622
##   mean=0.8655639, MSE=0.005009644 
##   left son=6 (76 obs) right son=7 (57 obs)
##   Primary splits:
##       CGPA              < 9.225 to the left,  improve=0.4996322, (0 missing)
##       GRE.Score         < 329.5 to the left,  improve=0.4177667, (0 missing)
##       TOEFL.Score       < 112.5 to the left,  improve=0.3161520, (0 missing)
##       SOP               < 3.75  to the left,  improve=0.2839685, (0 missing)
##       University.Rating < 4.5   to the left,  improve=0.2572263, (0 missing)
##   Surrogate splits:
##       GRE.Score         < 329.5 to the left,  agree=0.835, adj=0.614, (0 split)
##       TOEFL.Score       < 112.5 to the left,  agree=0.759, adj=0.439, (0 split)
##       University.Rating < 4.5   to the left,  agree=0.722, adj=0.351, (0 split)
##       SOP               < 4.25  to the left,  agree=0.632, adj=0.140, (0 split)
##       LOR               < 4.75  to the left,  agree=0.617, adj=0.105, (0 split)
## 
## Node number 4: 84 observations,    complexity param=0.02551447
##   mean=0.5497619, MSE=0.009945181 
##   left son=8 (32 obs) right son=9 (52 obs)
##   Primary splits:
##       CGPA        < 7.665 to the left,  improve=0.23987150, (0 missing)
##       TOEFL.Score < 101.5 to the left,  improve=0.23648020, (0 missing)
##       GRE.Score   < 306.5 to the left,  improve=0.22499170, (0 missing)
##       LOR         < 2.75  to the left,  improve=0.13981850, (0 missing)
##       SOP         < 2.25  to the left,  improve=0.08819698, (0 missing)
##   Surrogate splits:
##       TOEFL.Score       < 99.5  to the left,  agree=0.726, adj=0.281, (0 split)
##       GRE.Score         < 295.5 to the left,  agree=0.702, adj=0.219, (0 split)
##       SOP               < 2.25  to the left,  agree=0.679, adj=0.156, (0 split)
##       University.Rating < 1.5   to the left,  agree=0.643, adj=0.063, (0 split)
##       LOR               < 1.75  to the left,  agree=0.631, adj=0.031, (0 split)
## 
## Node number 5: 183 observations,    complexity param=0.02106897
##   mean=0.6945902, MSE=0.005365816 
##   left son=10 (126 obs) right son=11 (57 obs)
##   Primary splits:
##       CGPA              < 8.63  to the left,  improve=0.16851580, (0 missing)
##       GRE.Score         < 318.5 to the left,  improve=0.12221760, (0 missing)
##       TOEFL.Score       < 108.5 to the left,  improve=0.10294450, (0 missing)
##       Research          < 0.5   to the left,  improve=0.09845903, (0 missing)
##       University.Rating < 2.5   to the left,  improve=0.05379296, (0 missing)
##   Surrogate splits:
##       TOEFL.Score       < 108.5 to the left,  agree=0.743, adj=0.175, (0 split)
##       GRE.Score         < 318.5 to the left,  agree=0.738, adj=0.158, (0 split)
##       University.Rating < 4.5   to the left,  agree=0.716, adj=0.088, (0 split)
##       SOP               < 4.75  to the left,  agree=0.699, adj=0.035, (0 split)
## 
## Node number 6: 76 observations
##   mean=0.8222368, MSE=0.003822628 
## 
## Node number 7: 57 observations
##   mean=0.9233333, MSE=0.0007520468 
## 
## Node number 8: 32 observations,    complexity param=0.01392306
##   mean=0.4875, MSE=0.008325 
##   left son=16 (24 obs) right son=17 (8 obs)
##   Primary splits:
##       GRE.Score   < 304.5 to the left,  improve=0.41047300, (0 missing)
##       TOEFL.Score < 101.5 to the left,  improve=0.26244000, (0 missing)
##       LOR         < 2.75  to the left,  improve=0.10358550, (0 missing)
##       CGPA        < 7.62  to the left,  improve=0.06390500, (0 missing)
##       SOP         < 1.75  to the right, improve=0.01458926, (0 missing)
##   Surrogate splits:
##       TOEFL.Score < 100.5 to the left,  agree=0.906, adj=0.625, (0 split)
## 
## Node number 9: 52 observations
##   mean=0.5880769, MSE=0.007088609 
## 
## Node number 10: 126 observations
##   mean=0.6743651, MSE=0.004753168 
## 
## Node number 11: 57 observations
##   mean=0.7392982, MSE=0.003817051 
## 
## Node number 16: 24 observations
##   mean=0.45375, MSE=0.004206771 
## 
## Node number 17: 8 observations
##   mean=0.58875, MSE=0.007010937
plotcp(reg_tree)  

The plot above suggests that the cross-validated error improves only marginally beyond 4-5 splits, so a deeper tree is not required.

Prediction and Out-of-Sample MSE

predict_tree <- predict(reg_tree, admission_test[,-1])
mse_tree <- mean((predict_tree- admission_test$Chance.of.Admit)^2)
mse_tree
## [1] 0.004438454

The out-of-sample MSE turns out to be quite low, at about 0.0044.
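
To put that number in context, it helps to convert it to the scale of the response. The sketch below copies two values from the output above and derives an RMSE and a rough share of variance explained. Note that the baseline is the root-node MSE from the training data, so the second figure is only an approximation, and the variable names are introduced here for the example.

```r
# Values copied from the output above
mse_tree_reported <- 0.004438454   # out-of-sample MSE of the tree
root_mse_train    <- 0.0196347     # MSE at node 1 (training data, no splits)

# RMSE is on the same 0-to-1 scale as Chance of Admit
rmse_tree <- sqrt(mse_tree_reported)

# Rough share of variance explained, using the training root MSE as baseline
approx_r2 <- 1 - mse_tree_reported / root_mse_train

round(rmse_tree, 3)   # 0.067
round(approx_r2, 2)   # 0.77
```

In other words, a typical prediction is off by about 0.07 on the 0-to-1 admit scale.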

#List of most important variables 
reg_tree$variable.importance
##              CGPA         GRE.Score       TOEFL.Score University.Rating 
##          6.069015          3.241888          3.168088          2.451372 
##               SOP               LOR 
##          2.090216          1.771089

CGPA, GRE Score and TOEFL Score turn out to be the three most important variables in determining the chance of getting an admit.

#Plot the tree 

fancyRpartPlot(reg_tree)
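
The plotted tree can also be read as a set of nested rules. As a sketch, the function below hand-codes the split points and leaf means from the summary(reg_tree) output above. predict_by_rules is a hypothetical helper written for this illustration and is not part of rpart. In practice, predict() does this work.

```r
# Hand-coded copy of the splits and leaf means from summary(reg_tree) above.
# predict_by_rules is a hypothetical helper, not part of rpart.
predict_by_rules <- function(cgpa, gre) {
  if (cgpa < 8.845) {
    if (cgpa < 8.035) {
      if (cgpa < 7.665) {
        if (gre < 304.5) 0.45375 else 0.58875      # nodes 16 and 17
      } else {
        0.5880769                                   # node 9
      }
    } else {
      if (cgpa < 8.63) 0.6743651 else 0.7392982    # nodes 10 and 11
    }
  } else {
    if (cgpa < 9.225) 0.8222368 else 0.9233333     # nodes 6 and 7
  }
}

# First applicant from head(admission_data): CGPA 9.65, GRE 337
predict_by_rules(cgpa = 9.65, gre = 337)   # 0.9233333
# Fifth applicant: CGPA 8.21, GRE 314
predict_by_rules(cgpa = 8.21, gre = 314)   # 0.6743651
```

Only CGPA and GRE Score appear in the rules. The other variables contribute to the importance scores through surrogate splits but are never used as a primary split.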

Decision Tree -2

cp = 0.021

printcp(reg_tree)
## 
## Regression tree:
## rpart(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research, data = admission_train)
## 
## Variables actually used in tree construction:
## [1] CGPA      GRE.Score
## 
## Root node error: 7.8539/400 = 0.019635
## 
## n= 400 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.530012      0   1.00000 1.00366 0.064191
## 2 0.153759      1   0.46999 0.50972 0.039833
## 3 0.042386      2   0.31623 0.36429 0.031566
## 4 0.025514      3   0.27384 0.31628 0.027596
## 5 0.021069      4   0.24833 0.30369 0.028021
## 6 0.013923      5   0.22726 0.28700 0.026967
## 7 0.010000      6   0.21334 0.26479 0.025101

Based on the values above, I chose cp = 0.021, which prunes away only the last split (the one with CP 0.0139); as you will see, this won’t make much of an impact on the final MSE.

reg_tree_prune <- prune(reg_tree, cp= 0.021 )
fancyRpartPlot(reg_tree_prune)
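
One common heuristic for picking cp from this table is the one-standard-error rule: take the simplest tree whose cross-validated error is within one standard error of the minimum. This is not necessarily how the value above was chosen. The sketch below applies the rule to the CP table copied from the printcp(reg_tree) output, and the names cp_table, threshold and chosen are introduced here for the example. With a fitted model, the same table is available as reg_tree$cptable.

```r
# CP table copied from printcp(reg_tree) above
cp_table <- data.frame(
  CP     = c(0.530012, 0.153759, 0.042386, 0.025514, 0.021069, 0.013923, 0.010000),
  nsplit = 0:6,
  xerror = c(1.00366, 0.50972, 0.36429, 0.31628, 0.30369, 0.28700, 0.26479),
  xstd   = c(0.064191, 0.039833, 0.031566, 0.027596, 0.028021, 0.026967, 0.025101)
)

# Lowest cross-validated error and its one-standard-error threshold
best_row  <- which.min(cp_table$xerror)
threshold <- cp_table$xerror[best_row] + cp_table$xstd[best_row]

# Simplest tree (fewest splits) whose xerror is within the threshold
chosen <- cp_table[which(cp_table$xerror <= threshold)[1], ]
chosen$nsplit   # 5
```

On these numbers the rule selects the five-split row, which appears consistent with pruning at cp = 0.021.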

Predictions and Out-of-Sample MSE

predict_tree_2 <- predict(reg_tree_prune, admission_test[,-1])
mse_tree_2 <- mean((predict_tree_2- admission_test$Chance.of.Admit)^2)
mse_tree_2
## [1] 0.00436336