As we all know, graduate admissions are a big deal: applicants sit exams such as the GRE and TOEFL, write a statement of purpose (SOP), and collect letters of recommendation (LORs) to make it into their dream college. I went through the same process, which is why this dataset interested me.
The dataset is taken from Kaggle. It contains seven parameters that are considered important when applying to Masters programs, plus the response variable, Chance of Admit:
1. GRE Score (out of 340)
2. TOEFL Score (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose strength (out of 5)
5. Letter of Recommendation strength (out of 5)
6. Undergraduate GPA (out of 10)
7. Research Experience (either 0 or 1)
8. Chance of Admit (ranging from 0 to 1)
library(tidyverse)
library(dplyr)
library(leaps)
library(corrgram)
library(rpart)
library(rpart.plot)
library(rattle)
admission_data <- read.csv("Admission_Predict.csv")
head(admission_data)
## Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1 1 337 118 4 4.5 4.5 9.65 1
## 2 2 324 107 4 4.0 4.5 8.87 1
## 3 3 316 104 3 3.0 3.5 8.00 1
## 4 4 322 110 3 3.5 2.5 8.67 1
## 5 5 314 103 2 2.0 3.0 8.21 0
## 6 6 330 115 5 4.5 3.0 9.34 1
## Chance.of.Admit
## 1 0.92
## 2 0.76
## 3 0.72
## 4 0.80
## 5 0.65
## 6 0.90
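# Hold out 20% of the rows as a test set. No seed is set, so the split (and the numbers below) change on every run.
sample_index <- sample(nrow(admission_data), nrow(admission_data)*0.80)
admission_train <- admission_data[sample_index,]
admission_test <- admission_data[-sample_index,]
# This conversion runs after the split, so it affects only admission_data (used for the plots), not the train/test sets.
admission_data$Research <- as.factor(admission_data$Research)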
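Because `sample()` draws a different index on every run, the figures reported in this post will not reproduce exactly. The following is a minimal sketch of a seeded split. The seed value 42 and the small synthetic `toy` data frame are assumptions for illustration, not part of the original analysis. It also converts `Research` to a factor before splitting, which the code above does not do.

```r
# Hypothetical stand-in for admission_data (synthetic values, illustration only)
toy <- data.frame(
  CGPA = seq(7, 9.9, length.out = 50),
  Research = rep(c(0, 1), 25)
)

# Convert Research to a factor BEFORE splitting so both subsets inherit it
toy$Research <- as.factor(toy$Research)

set.seed(42)  # assumed seed; any fixed integer makes the split repeatable
idx <- sample(nrow(toy), floor(nrow(toy) * 0.80))
toy_train <- toy[idx, ]
toy_test <- toy[-idx, ]

# Re-seeding reproduces exactly the same index
set.seed(42)
idx_again <- sample(nrow(toy), floor(nrow(toy) * 0.80))
identical(idx, idx_again)
```

With a fixed seed, the training and test sets, and therefore the tree and its MSE, stay the same from one run to the next.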
# GRE Score vs. Chance of Admit, faceted by University Rating
ggplot(admission_data, aes(x = GRE.Score, y = Chance.of.Admit)) +
  geom_point() +
  facet_grid(University.Rating ~ .) +
  ylab("Chance of Admit") +
  xlab("GRE Score")
# TOEFL Score
ggplot(admission_data, aes(x = TOEFL.Score, y = Chance.of.Admit)) +
  geom_point() +
  facet_grid(University.Rating ~ .) +
  ylab("Chance of Admit") +
  xlab("TOEFL Score")
# CGPA
ggplot(admission_data, aes(x = CGPA, y = Chance.of.Admit)) +
  geom_point() +
  facet_grid(University.Rating ~ .) +
  ylab("Chance of Admit") +
  xlab("CGPA")
# SOP
ggplot(admission_data, aes(x = SOP, y = Chance.of.Admit)) +
  geom_point() +
  facet_grid(University.Rating ~ .) +
  ylab("Chance of Admit") +
  xlab("SOP")
# LOR
ggplot(admission_data, aes(x = LOR, y = Chance.of.Admit)) +
  geom_point() +
  facet_grid(University.Rating ~ .) +
  ylab("Chance of Admit") +
  xlab("LOR")
corrgram(admission_data, lower.panel = panel.fill, upper.panel = panel.cor)
reg_tree <- rpart(Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + SOP + LOR+ CGPA + Research, data = admission_train )
summary(reg_tree)
## Call:
## rpart(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research, data = admission_train)
## n= 400
##
## CP nsplit rel error xerror xstd
## 1 0.53001196 0 1.0000000 1.0036629 0.06419074
## 2 0.15375933 1 0.4699880 0.5097235 0.03983339
## 3 0.04238622 2 0.3162287 0.3642947 0.03156608
## 4 0.02551447 3 0.2738425 0.3162808 0.02759593
## 5 0.02106897 4 0.2483280 0.3036896 0.02802083
## 6 0.01392306 5 0.2272590 0.2870013 0.02696670
## 7 0.01000000 6 0.2133360 0.2647919 0.02510073
##
## Variable importance
## CGPA GRE.Score TOEFL.Score University.Rating
## 32 17 17 13
## SOP LOR
## 11 9
##
## Node number 1: 400 observations, complexity param=0.530012
## mean=0.721025, MSE=0.0196347
## left son=2 (267 obs) right son=3 (133 obs)
## Primary splits:
## CGPA < 8.845 to the left, improve=0.5300120, (0 missing)
## GRE.Score < 319.5 to the left, improve=0.4953457, (0 missing)
## TOEFL.Score < 108.5 to the left, improve=0.4511530, (0 missing)
## University.Rating < 3.5 to the left, improve=0.3481329, (0 missing)
## SOP < 3.75 to the left, improve=0.3176795, (0 missing)
## Surrogate splits:
## TOEFL.Score < 110.5 to the left, agree=0.872, adj=0.617, (0 split)
## GRE.Score < 320.5 to the left, agree=0.862, adj=0.586, (0 split)
## University.Rating < 3.5 to the left, agree=0.830, adj=0.489, (0 split)
## SOP < 3.75 to the left, agree=0.802, adj=0.406, (0 split)
## LOR < 4.25 to the left, agree=0.785, adj=0.353, (0 split)
##
## Node number 2: 267 observations, complexity param=0.1537593
## mean=0.6490262, MSE=0.01132939
## left son=4 (84 obs) right son=5 (183 obs)
## Primary splits:
## CGPA < 8.035 to the left, improve=0.3992160, (0 missing)
## GRE.Score < 304.5 to the left, improve=0.2994343, (0 missing)
## TOEFL.Score < 101.5 to the left, improve=0.2552973, (0 missing)
## LOR < 2.25 to the left, improve=0.1444468, (0 missing)
## SOP < 2.25 to the left, improve=0.1404090, (0 missing)
## Surrogate splits:
## GRE.Score < 304.5 to the left, agree=0.794, adj=0.345, (0 split)
## SOP < 2.25 to the left, agree=0.768, adj=0.262, (0 split)
## TOEFL.Score < 99.5 to the left, agree=0.764, adj=0.250, (0 split)
## University.Rating < 1.5 to the left, agree=0.757, adj=0.226, (0 split)
## LOR < 2.25 to the left, agree=0.753, adj=0.214, (0 split)
##
## Node number 3: 133 observations, complexity param=0.04238622
## mean=0.8655639, MSE=0.005009644
## left son=6 (76 obs) right son=7 (57 obs)
## Primary splits:
## CGPA < 9.225 to the left, improve=0.4996322, (0 missing)
## GRE.Score < 329.5 to the left, improve=0.4177667, (0 missing)
## TOEFL.Score < 112.5 to the left, improve=0.3161520, (0 missing)
## SOP < 3.75 to the left, improve=0.2839685, (0 missing)
## University.Rating < 4.5 to the left, improve=0.2572263, (0 missing)
## Surrogate splits:
## GRE.Score < 329.5 to the left, agree=0.835, adj=0.614, (0 split)
## TOEFL.Score < 112.5 to the left, agree=0.759, adj=0.439, (0 split)
## University.Rating < 4.5 to the left, agree=0.722, adj=0.351, (0 split)
## SOP < 4.25 to the left, agree=0.632, adj=0.140, (0 split)
## LOR < 4.75 to the left, agree=0.617, adj=0.105, (0 split)
##
## Node number 4: 84 observations, complexity param=0.02551447
## mean=0.5497619, MSE=0.009945181
## left son=8 (32 obs) right son=9 (52 obs)
## Primary splits:
## CGPA < 7.665 to the left, improve=0.23987150, (0 missing)
## TOEFL.Score < 101.5 to the left, improve=0.23648020, (0 missing)
## GRE.Score < 306.5 to the left, improve=0.22499170, (0 missing)
## LOR < 2.75 to the left, improve=0.13981850, (0 missing)
## SOP < 2.25 to the left, improve=0.08819698, (0 missing)
## Surrogate splits:
## TOEFL.Score < 99.5 to the left, agree=0.726, adj=0.281, (0 split)
## GRE.Score < 295.5 to the left, agree=0.702, adj=0.219, (0 split)
## SOP < 2.25 to the left, agree=0.679, adj=0.156, (0 split)
## University.Rating < 1.5 to the left, agree=0.643, adj=0.063, (0 split)
## LOR < 1.75 to the left, agree=0.631, adj=0.031, (0 split)
##
## Node number 5: 183 observations, complexity param=0.02106897
## mean=0.6945902, MSE=0.005365816
## left son=10 (126 obs) right son=11 (57 obs)
## Primary splits:
## CGPA < 8.63 to the left, improve=0.16851580, (0 missing)
## GRE.Score < 318.5 to the left, improve=0.12221760, (0 missing)
## TOEFL.Score < 108.5 to the left, improve=0.10294450, (0 missing)
## Research < 0.5 to the left, improve=0.09845903, (0 missing)
## University.Rating < 2.5 to the left, improve=0.05379296, (0 missing)
## Surrogate splits:
## TOEFL.Score < 108.5 to the left, agree=0.743, adj=0.175, (0 split)
## GRE.Score < 318.5 to the left, agree=0.738, adj=0.158, (0 split)
## University.Rating < 4.5 to the left, agree=0.716, adj=0.088, (0 split)
## SOP < 4.75 to the left, agree=0.699, adj=0.035, (0 split)
##
## Node number 6: 76 observations
## mean=0.8222368, MSE=0.003822628
##
## Node number 7: 57 observations
## mean=0.9233333, MSE=0.0007520468
##
## Node number 8: 32 observations, complexity param=0.01392306
## mean=0.4875, MSE=0.008325
## left son=16 (24 obs) right son=17 (8 obs)
## Primary splits:
## GRE.Score < 304.5 to the left, improve=0.41047300, (0 missing)
## TOEFL.Score < 101.5 to the left, improve=0.26244000, (0 missing)
## LOR < 2.75 to the left, improve=0.10358550, (0 missing)
## CGPA < 7.62 to the left, improve=0.06390500, (0 missing)
## SOP < 1.75 to the right, improve=0.01458926, (0 missing)
## Surrogate splits:
## TOEFL.Score < 100.5 to the left, agree=0.906, adj=0.625, (0 split)
##
## Node number 9: 52 observations
## mean=0.5880769, MSE=0.007088609
##
## Node number 10: 126 observations
## mean=0.6743651, MSE=0.004753168
##
## Node number 11: 57 observations
## mean=0.7392982, MSE=0.003817051
##
## Node number 16: 24 observations
## mean=0.45375, MSE=0.004206771
##
## Node number 17: 8 observations
## mean=0.58875, MSE=0.007010937
plotcp(reg_tree)
From the plot above, the cross-validated error levels off after about 4-5 splits, so additional splits add little.
predict_tree <- predict(reg_tree, admission_test[,-1])
mse_tree <- mean((predict_tree- admission_test$Chance.of.Admit)^2)
mse_tree
## [1] 0.004438454
The test MSE for the regression tree turns out to be pretty low, about 0.0044.
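To put the MSE in context, it helps to express it on the scale of the response. The sketch below uses only numbers printed above: the test MSE and the root-node MSE of the training data (0.0196347). Using the training-set root MSE as the baseline for a model that always predicts the mean is an assumption, since the test-set variance is not shown.

```r
mse_tree_reported <- 0.004438454  # test MSE printed above
root_mse_train <- 0.0196347       # root-node MSE from summary(reg_tree)

# Typical prediction error in units of Chance of Admit
rmse_tree <- sqrt(mse_tree_reported)

# Share of the baseline (predict-the-mean) error that remains
rel_mse <- mse_tree_reported / root_mse_train

round(rmse_tree, 3)
round(rel_mse, 2)
```

On these numbers the predictions are off by roughly 0.07 on the 0 to 1 scale, and the tree leaves a bit under a quarter of the baseline error unexplained.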
#List of most important variables
reg_tree$variable.importance
## CGPA GRE.Score TOEFL.Score University.Rating
## 6.069015 3.241888 3.168088 2.451372
## SOP LOR
## 2.090216 1.771089
CGPA, GRE Score, and TOEFL Score turn out to be the three most important variables in deciding the chance of admission.
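The raw importance values are easier to compare as shares of the total. As a small sketch, rescaling the values printed above so they sum to 100 gives the same figures as the "Variable importance" block in the `summary(reg_tree)` output. The vector below is copied by hand from that output rather than read from the fitted object.

```r
# Importance values copied from reg_tree$variable.importance above
importance <- c(CGPA = 6.069015, GRE.Score = 3.241888, TOEFL.Score = 3.168088,
                University.Rating = 2.451372, SOP = 2.090216, LOR = 1.771089)

# Rescale so the values sum to 100, then round to whole percentages
importance_pct <- round(100 * importance / sum(importance))
importance_pct
```

CGPA alone accounts for about a third of the total, roughly as much as GRE Score and TOEFL Score combined. `Research` does not appear in the importance vector at all.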
#Plot the tree
fancyRpartPlot(reg_tree)
printcp(reg_tree)
##
## Regression tree:
## rpart(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research, data = admission_train)
##
## Variables actually used in tree construction:
## [1] CGPA GRE.Score
##
## Root node error: 7.8539/400 = 0.019635
##
## n= 400
##
## CP nsplit rel error xerror xstd
## 1 0.530012 0 1.00000 1.00366 0.064191
## 2 0.153759 1 0.46999 0.50972 0.039833
## 3 0.042386 2 0.31623 0.36429 0.031566
## 4 0.025514 3 0.27384 0.31628 0.027596
## 5 0.021069 4 0.24833 0.30369 0.028021
## 6 0.013923 5 0.22726 0.28700 0.026967
## 7 0.010000 6 0.21334 0.26479 0.025101
Based on the values above, I prune the tree at cp = 0.021, which removes only the last of the six splits, although this won’t make much of an impact on the final MSE, as you will see.
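One common heuristic for reading a CP table is the one-standard-error rule: take the smallest tree whose cross-validated error is within one standard error of the minimum. The sketch below applies it to the table printed above, copied by hand into a data frame. This is an illustration of the heuristic, not necessarily how the value 0.021 was chosen. With a fitted model the same table is available as `reg_tree$cptable`.

```r
# CP table copied from the printcp(reg_tree) output above
cp_table <- data.frame(
  CP     = c(0.530012, 0.153759, 0.042386, 0.025514, 0.021069, 0.013923, 0.010000),
  nsplit = 0:6,
  xerror = c(1.00366, 0.50972, 0.36429, 0.31628, 0.30369, 0.28700, 0.26479),
  xstd   = c(0.064191, 0.039833, 0.031566, 0.027596, 0.028021, 0.026967, 0.025101)
)

# Row with the lowest cross-validated error
best <- which.min(cp_table$xerror)

# Threshold: minimum error plus one standard error
threshold <- cp_table$xerror[best] + cp_table$xstd[best]

# Smallest tree (first row) whose error is at or below the threshold
chosen <- min(which(cp_table$xerror <= threshold))
cp_table[chosen, c("CP", "nsplit")]
```

On these numbers the rule selects the five-split tree. That tree corresponds to any cp between 0.013923 and 0.021069, so pruning at 0.021 lands on the same tree.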
reg_tree_prune <- prune(reg_tree, cp= 0.021 )
fancyRpartPlot(reg_tree_prune)
predict_tree_2 <- predict(reg_tree_prune, admission_test[,-1])
mse_tree_2 <- mean((predict_tree_2- admission_test$Chance.of.Admit)^2)
mse_tree_2
## [1] 0.00436336
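To quantify how little pruning changes the result, the two test MSEs printed above can be compared directly. This is a small arithmetic sketch using the reported values. Since both come from a single unseeded split, a difference this small may well be within run-to-run noise.

```r
mse_full <- 0.004438454    # mse_tree, full tree
mse_pruned <- 0.00436336   # mse_tree_2, pruned tree

# Absolute and relative reduction in test MSE after pruning
abs_diff <- mse_full - mse_pruned
pct_change <- 100 * abs_diff / mse_full
round(pct_change, 1)
```

The pruned tree's test MSE is a little under 2% lower than the full tree's, so the simpler tree loses nothing on this split.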