Applying to graduate school can be a tedious process. Most candidates prepare their credentials with very little knowledge of how the process works, and even the most qualified and confident applicants worry about getting in. Unfortunately, graduate school admission statistics tend to be harder to find than undergraduate acceptance rates. In this project, our main focus is predicting the probability that a student is admitted to graduate school based on the factors listed below.
This dataset contains criteria used to determine postgraduate admissions from an Indian perspective. It was created by Mohan S. Acharya and can be found on Kaggle. The dataset is inspired by the UCLA Graduate Dataset and aims to help students get shortlisted by universities based on their application materials. The predicted output gives them a fair idea of their chances at a particular university. The dataset contains several parameters that are considered important when applying to Masters programs.
The dataset includes the following variables:
- GRE Scores (out of 340)
- TOEFL Scores (out of 120)
- University Rating (out of 5)
- Statement of Purpose and Letter of Recommendation Strength (out of 5)
- Undergraduate GPA (out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1), the target variable
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
fpath <- "Admission_Predict_Ver1.1.csv"
adm <- read.csv(fpath, sep=",", na.strings="")
adm <- subset(adm, select = -Serial.No. )
names(adm) <- c("gre", "toefl","uni_rating",
"sop","lor","gpa","research","admit")
adm
summary(adm)
## gre toefl uni_rating sop
## Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
## 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
## Median :317.0 Median :107.0 Median :3.000 Median :3.500
## Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
## 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
## lor gpa research admit
## Min. :1.000 Min. :6.800 Min. :0.00 Min. :0.3400
## 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00 1st Qu.:0.6300
## Median :3.500 Median :8.560 Median :1.00 Median :0.7200
## Mean :3.484 Mean :8.576 Mean :0.56 Mean :0.7217
## 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00 3rd Qu.:0.8200
## Max. :5.000 Max. :9.920 Max. :1.00 Max. :0.9700
attach(adm)
It is understandable that admission to graduate school, particularly to PhD programs, depends on some or all of the parameters mentioned earlier. Out of curiosity, we plan to investigate the following questions:
What percentage of applicants are local and international students? Who gets in first?
Is research experience an important factor in getting into grad school? The answer might seem obvious, because professors love to see research experience, especially in their field of interest. However, a fresh graduate might not have a publication yet but might have written a compelling Statement of Purpose (SOP) demonstrating their willingness to delve deep into the area.
Does the time period of the application influence the chances of admission?
Does the size of the applicant pool in a particular admission cycle reflect the overall acceptance rate?
ggplot(adm, aes(admit)) +
  geom_histogram(aes(fill = ..count..), bins = 40)
ggplot(adm, aes(factor(uni_rating), admit)) +
  geom_boxplot(aes(fill = uni_rating))
ggplot(adm, aes(gre, color = factor(research))) +
  geom_density(alpha = 0.5) +
  ggtitle("GRE vs Research Distribution")
#install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(adm, x=~gre, y=~toefl, z=~gpa, color = ~admit,
type="scatter3d", mode="markers") %>%
layout(scene = list(xaxis = list(title = "GRE Score"),
yaxis = list(title = "TOEFL Score"),
zaxis = list(title = "CGPA")))
The 3D plot shows that the chance of admission tends to be higher when the CGPA, TOEFL and GRE scores are higher.
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(adm), method = "number")
The more extreme the correlation coefficient (the closer to -1 or 1), the stronger the linear relationship. A positive correlation means that the two variables under consideration vary in the same direction: when one increases, the other tends to increase, and when one decreases, the other tends to decrease as well.
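To make the sign and magnitude of the coefficient concrete, here is a minimal sketch that computes the Pearson correlation by hand and compares it with `cor()`. The data is synthetic and the helper `pearson()` is a hypothetical name introduced only for this illustration; it is not part of the admissions analysis.

```r
# Synthetic data, for illustration only (not the admissions dataset)
set.seed(42)
x     <- rnorm(100)
y_pos <-  2 * x + rnorm(100, sd = 0.5)  # moves in the same direction as x
y_neg <- -2 * x + rnorm(100, sd = 0.5)  # moves in the opposite direction

# Pearson correlation: covariance scaled by both standard deviations
# (hypothetical helper, equivalent to cor(a, b) with default settings)
pearson <- function(a, b) {
  sum((a - mean(a)) * (b - mean(b))) /
    sqrt(sum((a - mean(a))^2) * sum((b - mean(b))^2))
}

r_pos <- pearson(x, y_pos)  # positive, close to 1
r_neg <- pearson(x, y_neg)  # negative, close to -1

print(c(by_hand = r_pos, builtin = cor(x, y_pos)))
print(c(by_hand = r_neg, builtin = cor(x, y_neg)))
```

Both hand-computed values should agree with `cor()`, and the sign reflects whether the two variables move together or in opposite directions.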
library(caTools)
set.seed(1)
sample=sample.split(adm$admit,SplitRatio = 0.80)
train_data=subset(adm,sample==TRUE)
test_data=subset(adm,sample==FALSE)
#install.packages("randomForest")
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf_model <- randomForest(admit ~., data = train_data, importance=TRUE)
rf_model
##
## Call:
## randomForest(formula = admit ~ ., data = train_data, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.00396254
## % Var explained: 80.42
importance(rf_model)
## %IncMSE IncNodePurity
## gre 22.38588 1.8175344
## toefl 19.49463 1.2343398
## uni_rating 12.79388 0.5147697
## sop 18.18586 0.6988552
## lor 15.43865 0.4103867
## gpa 34.26080 2.8197241
## research 18.24598 0.2316092
Next, let’s fit a linear regression model using admit as the target variable.
#install.packages("modelr")
library(modelr)
model1 <- lm(admit ~., data=train_data)
summary(model1)
##
## Call:
## lm(formula = admit ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.260232 -0.024739 0.009725 0.034469 0.159733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2300630 0.1173089 -10.486 < 0.0000000000000002 ***
## gre 0.0015020 0.0005698 2.636 0.008720 **
## toefl 0.0024615 0.0010010 2.459 0.014361 *
## uni_rating 0.0066631 0.0044006 1.514 0.130798
## sop -0.0003637 0.0051481 -0.071 0.943717
## lor 0.0184058 0.0046711 3.940 0.0000964 ***
## gpa 0.1296950 0.0114186 11.358 < 0.0000000000000002 ***
## research 0.0253166 0.0076079 3.328 0.000959 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06097 on 391 degrees of freedom
## Multiple R-squared: 0.82, Adjusted R-squared: 0.8167
## F-statistic: 254.4 on 7 and 391 DF, p-value: < 0.00000000000000022
mae(model1, data = train_data)
## [1] 0.04346215
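The MAE above is computed on the training data, so it may understate the error on applicants the model has not seen. As a minimal sketch of the idea, the block below fits a linear model on synthetic data, holds out 20% of the rows, and computes the mean absolute error by hand on both parts. The data frame `toy` and the helper `mae_by_hand()` are hypothetical names used only for this illustration.

```r
# Synthetic data, for illustration only (not the admissions dataset)
set.seed(1)
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 0.5 + 0.3 * toy$x1 - 0.2 * toy$x2 + rnorm(n, sd = 0.1)

# 80/20 split using random row indices
train_idx <- sample(seq_len(n), size = 0.8 * n)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- lm(y ~ ., data = toy_train)

# MAE = mean(|observed - predicted|)
# (hypothetical helper; `target` is the name of the response column)
mae_by_hand <- function(model, data, target) {
  mean(abs(data[[target]] - predict(model, newdata = data)))
}

train_mae <- mae_by_hand(fit, toy_train, "y")
test_mae  <- mae_by_hand(fit, toy_test, "y")
print(c(train = train_mae, test = test_mae))
```

With the objects from this analysis, the analogous held-out check would presumably be `mae(model1, data = test_data)`, which evaluates the model on the 20% of rows set aside by `sample.split()`.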