Graduate Admission Analytics

Introduction

Applying to graduate school can be a tedious process. Most candidates prepare their credentials with very little knowledge of how the process works, and even the most qualified and confident applicants worry about getting in. Unfortunately, graduate school admission statistics tend to be harder to find than undergraduate acceptance rates. In this project, our main focus is predicting the probability that a student is admitted to graduate school based on the factors described below.

Data

This dataset contains criteria for determining postgraduate admissions from an Indian perspective. It was created by Mohan S. Acharya and can be found on Kaggle. The dataset is inspired by the UCLA Graduate Dataset and aims to help students shortlist universities based on their application materials. The predicted output gives them a fair idea of their chances at a particular university. The dataset contains several parameters that are considered important when applying to Master's programs.

Insights on Data

The dataset contains the following variables. The first six are predictors and the last one, Chance of Admit, is the response:

- GRE Scores (out of 340)
- TOEFL Scores (out of 120)
- University Rating (out of 5)
- Statement of Purpose and Letter of Recommendation Strength (each out of 5)
- Undergraduate GPA (out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1)

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

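# Read the Kaggle CSV; empty strings are treated as missing values
fpath <- "Admission_Predict_Ver1.1.csv"
adm <- read.csv(fpath, sep = ",", na.strings = "")
# Drop the row identifier, which carries no predictive information
adm <- subset(adm, select = -Serial.No.)
# Shorter column names for the remaining eight variables
names(adm) <- c("gre", "toefl", "uni_rating",
                "sop", "lor", "gpa", "research", "admit")
# Preview the first rows instead of printing the whole data frame
head(adm)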

summary(adm)

##       gre            toefl         uni_rating         sop       
##  Min.   :290.0   Min.   : 92.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000   1st Qu.:2.500  
##  Median :317.0   Median :107.0   Median :3.000   Median :3.500  
##  Mean   :316.5   Mean   :107.2   Mean   :3.114   Mean   :3.374  
##  3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :340.0   Max.   :120.0   Max.   :5.000   Max.   :5.000  
##       lor             gpa           research        admit       
##  Min.   :1.000   Min.   :6.800   Min.   :0.00   Min.   :0.3400  
##  1st Qu.:3.000   1st Qu.:8.127   1st Qu.:0.00   1st Qu.:0.6300  
##  Median :3.500   Median :8.560   Median :1.00   Median :0.7200  
##  Mean   :3.484   Mean   :8.576   Mean   :0.56   Mean   :0.7217  
##  3rd Qu.:4.000   3rd Qu.:9.040   3rd Qu.:1.00   3rd Qu.:0.8200  
##  Max.   :5.000   Max.   :9.920   Max.   :1.00   Max.   :0.9700

# attach(adm) is not needed: every call below passes the data frame explicitly

Questions

It is understandable that the most critical factors for getting into grad school, particularly for PhD programs, depend on some or all of the parameters mentioned earlier. Out of curiosity, we plan to investigate the following questions:

What percentage of applicants are local and international students? Who gets in first?
Is research experience an important factor in getting into grad school? The answer might seem obvious because professors love to see research experience, especially in their field of interest. However, a fresh graduate might not have a publication but might have written a compelling Statement of Purpose (SOP) demonstrating their willingness to delve deep into the area.
Does the time period of application influence the chance of admission?
Does the size of the applicant pool in a particular admission cycle affect the overall acceptance rate?

Visualization

Admission Chance Distribution

# Histogram of admission chance; bars are shaded by their count
ggplot(adm, aes(admit)) +
  geom_histogram(aes(fill = after_stat(count)), bins = 40)

Chance of admission by school rating

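# One box per university rating; the rating is treated as a category for both x and fill
ggplot(adm, aes(factor(uni_rating), admit)) +
  geom_boxplot(aes(fill = factor(uni_rating)))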

GRE vs Research

ggplot(adm, aes(gre, color=factor(research)))+
  geom_density(alpha=0.5)+ggtitle("GRE vs Research Distribution")

3D Graph for chance of admission vs CGPA, TOEFL and GRE

#install.packages("plotly")
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

plot_ly(adm, x=~gre, y=~toefl, z=~gpa, color = ~admit, 
        type="scatter3d", mode="markers") %>% 
  layout(scene = list(xaxis = list(title = "GRE Score"),
                      yaxis = list(title = "TOEFL Score"),
                      zaxis = list(title = "CGPA")))

From the 3D graph, we can see that the chance of admission is higher when the CGPA, TOEFL, and GRE scores are higher.

Correlation between variables

library(corrplot)

## corrplot 0.92 loaded

corrplot(cor(adm), method = "number")

The more extreme the correlation coefficient (the closer to -1 or 1), the stronger the relationship. The positive correlation implies that the two variables under consideration vary in the same direction, i.e., if a variable increases the other one increases and if one decreases the other one decreases as well.
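To make the coefficient concrete, the sketch below computes Pearson's r by hand on a small made-up sample and compares it with `cor()`. The vectors `gre_toy` and `gpa_toy` and the helper `pearson_r` are illustrative assumptions, not values or functions from the dataset or analysis above.

```r
# Toy values for illustration only (not taken from the admissions data)
gre_toy <- c(300, 310, 320, 330, 340)
gpa_toy <- c(7.5, 8.0, 8.4, 9.1, 9.6)

# Pearson's r: summed cross-deviations divided by the
# square root of the product of the summed squared deviations
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(gre_toy, gpa_toy)
r_builtin <- cor(gre_toy, gpa_toy)

# The hand-rolled value should match cor() up to floating-point error
all.equal(r_manual, r_builtin)
```

Because both toy variables rise together, the coefficient is close to 1. Reversing one of the vectors would flip its sign.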

Modeling

Data Preparation - Splitting Data

library(caTools)
set.seed(1)
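# Logical vector marking roughly 80% of rows for training
# (named split_flag to avoid shadowing base::sample)
split_flag <- sample.split(adm$admit, SplitRatio = 0.80)
train_data <- subset(adm, split_flag == TRUE)
test_data  <- subset(adm, split_flag == FALSE)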
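For readers without caTools, the idea of an 80/20 split can be sketched in base R. This is a simplified illustration on a made-up data frame (`toy` is hypothetical); it draws row indices purely at random, whereas `sample.split` also tries to preserve the relative ratios of the labels it is given, so the two approaches will not select the same rows.

```r
set.seed(1)

# Hypothetical 10-row data frame standing in for the admissions data
toy <- data.frame(id = 1:10,
                  admit = seq(0.50, 0.95, length.out = 10))

# Number of training rows for an 80% split
n_train <- floor(0.80 * nrow(toy))

# Draw training row indices without replacement
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two parts partition the original rows with no overlap
c(train = nrow(toy_train), test = nrow(toy_test))
```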

Random Forest

#install.packages("randomForest")
library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

rf_model <- randomForest(admit ~., data = train_data, importance=TRUE)
rf_model

## 
## Call:
##  randomForest(formula = admit ~ ., data = train_data, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.00396254
##                     % Var explained: 80.42

importance(rf_model)

##             %IncMSE IncNodePurity
## gre        22.38588     1.8175344
## toefl      19.49463     1.2343398
## uni_rating 12.79388     0.5147697
## sop        18.18586     0.6988552
## lor        15.43865     0.4103867
## gpa        34.26080     2.8197241
## research   18.24598     0.2316092
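The importance table is easier to read when ranked. The sketch below re-enters the values printed above into a small data frame and sorts them by %IncMSE; the data frame and its column names (`imp`, `inc_mse`, `inc_node_purity`) are assumptions made for illustration, not objects returned by randomForest.

```r
# Values copied from the importance(rf_model) output above
imp <- data.frame(
  variable = c("gre", "toefl", "uni_rating", "sop", "lor", "gpa", "research"),
  inc_mse = c(22.38588, 19.49463, 12.79388, 18.18586,
              15.43865, 34.26080, 18.24598),
  inc_node_purity = c(1.8175344, 1.2343398, 0.5147697, 0.6988552,
                      0.4103867, 2.8197241, 0.2316092)
)

# Rank predictors from most to least important by %IncMSE
imp_sorted <- imp[order(imp$inc_mse, decreasing = TRUE), ]
imp_sorted
```

On both measures GPA ranks first, followed by GRE and TOEFL. Research experience is fourth by %IncMSE but last by IncNodePurity, so the two measures do not fully agree on the weaker predictors.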

Multiple Linear Regression

Next, we fit a multiple linear regression with admit as the target variable and all other variables as predictors.

#install.packages("modelr")
library(modelr)
model1 <- lm(admit ~., data=train_data)
summary(model1)

## 
## Call:
## lm(formula = admit ~ ., data = train_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.260232 -0.024739  0.009725  0.034469  0.159733 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.2300630  0.1173089 -10.486 < 0.0000000000000002 ***
## gre          0.0015020  0.0005698   2.636             0.008720 ** 
## toefl        0.0024615  0.0010010   2.459             0.014361 *  
## uni_rating   0.0066631  0.0044006   1.514             0.130798    
## sop         -0.0003637  0.0051481  -0.071             0.943717    
## lor          0.0184058  0.0046711   3.940            0.0000964 ***
## gpa          0.1296950  0.0114186  11.358 < 0.0000000000000002 ***
## research     0.0253166  0.0076079   3.328             0.000959 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06097 on 391 degrees of freedom
## Multiple R-squared:   0.82,  Adjusted R-squared:  0.8167 
## F-statistic: 254.4 on 7 and 391 DF,  p-value: < 0.00000000000000022

mae(model1, data = train_data)

## [1] 0.04346215
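The mean absolute error (MAE) is the average absolute difference between observed and predicted values. Note that the figure above is computed on the training data, so it likely understates the error on unseen applicants. The sketch below shows the calculation on made-up vectors; `mae_manual`, `actual_toy`, and `predicted_toy` are hypothetical names used only for illustration.

```r
# MAE: mean of the absolute prediction errors
mae_manual <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Toy values for illustration only (not taken from the admissions data)
actual_toy    <- c(0.90, 0.70, 0.50)
predicted_toy <- c(0.80, 0.75, 0.55)

# Absolute errors are 0.10, 0.05 and 0.05, so the MAE is their mean
mae_manual(actual_toy, predicted_toy)
```

With the objects defined earlier, `mae(model1, data = test_data)` would give the corresponding figure on the held-out rows.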

Analysis and Discussion

Interpretations

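The p-values for SOP (0.944) and University Rating (0.131) are above 0.05, so these two coefficients are not statistically significant at the 5% level. This does not show that they have no influence on admission; it means the model cannot distinguish their effect from zero once the other predictors are accounted for. GPA, by contrast, has the largest t value (11.358) and is highly significant, as are LOR and research experience. GRE and TOEFL are also significant at the 5% level.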
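Each coefficient in the regression summary is the expected change in admit for a one-unit increase in that predictor, holding the others fixed. As a worked example, the sketch below plugs the rounded estimates from the summary into the linear equation for a hypothetical applicant. The applicant's values are made up for illustration; only the coefficients come from the output above.

```r
# Coefficient estimates copied from summary(model1)
coefs <- c(intercept  = -1.2300630,
           gre        =  0.0015020,
           toefl      =  0.0024615,
           uni_rating =  0.0066631,
           sop        = -0.0003637,
           lor        =  0.0184058,
           gpa        =  0.1296950,
           research   =  0.0253166)

# Hypothetical applicant (the intercept term is multiplied by 1)
applicant <- c(intercept = 1, gre = 320, toefl = 110, uni_rating = 3,
               sop = 3.5, lor = 3.5, gpa = 8.5, research = 1)

# Linear prediction: intercept plus the sum of coefficient * value
pred <- sum(coefs * applicant[names(coefs)])
pred

# Expected gain in admit from raising GPA by half a point, all else equal
gain_half_gpa_point <- unname(coefs["gpa"]) * 0.5
gain_half_gpa_point
```

For this applicant the predicted chance of admission is about 0.73, and half a GPA point is worth roughly 0.065. Since a linear model is not bounded, predictions for extreme profiles can fall outside the 0 to 1 range.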