When applying to graduate schools, students are eager to know how likely they are to be accepted. This project aims to create a model that predicts the chance of admission into a graduate program based on a student’s undergraduate academic performance and qualifications. The dataset I chose for my final project is the Graduate Admissions 2 dataset. It is available on Kaggle (https://www.kaggle.com/mohansacharya/graduate-admissions) and is inspired by the UCLA graduate admissions dataset. It contains several parameters, such as GRE scores, undergraduate GPA, and research experience, that are considered important during the application process.
As I am looking into applying to graduate school myself, I chose this dataset to explore how chance of admission is impacted by some of these other parameters. In order to answer this question, I will start with some exploratory data analysis to visualize any interesting patterns and uncover how different variables are related to each other before constructing a regression model.
The dataset was mostly clean to begin with, though some pre-processing was required. I created some new variables based on the existing ones and removed the ones I didn’t need for my analysis. I also changed some numeric variables to factors and assigned levels to them for readability before plotting. Each of these steps is described in more detail in the comments below as I perform it. Although the dataset consists primarily of quantitative variables, some of them can also be converted to factors. For example, the research experience column contains either 0 or 1, which can be converted to True/False or Yes/No, as desired. The variables contained in the dataset are as follows:
GRE Scores (ranging from 290 to 340)
TOEFL Scores (ranging from 92 to 120)
(Undergraduate) University Rating (ranging from 1 to 5)
Statement of Purpose and Letter of Recommendation Strength (ranging from 1 to 5)
Undergraduate GPA (ranging from 6.80 to 9.92)
Research Experience (either 0 or 1)
Chance of Admit (ranging from 0 to 1)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(IRdisplay)
library(ggthemes)
setwd("~/Desktop/DATA110")
df <- read.csv("archive/Admission_Predict_Ver1.1.csv")
summary(df)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:125.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :250.5 Median :317.0 Median :107.0 Median :3.000
## Mean :250.5 Mean :316.5 Mean :107.2 Mean :3.114
## 3rd Qu.:375.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :500.0 Max. :340.0 Max. :120.0 Max. :5.000
## SOP LOR CGPA Research
## Min. :1.000 Min. :1.000 Min. :6.800 Min. :0.00
## 1st Qu.:2.500 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00
## Median :3.500 Median :3.500 Median :8.560 Median :1.00
## Mean :3.374 Mean :3.484 Mean :8.576 Mean :0.56
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00
## Max. :5.000 Max. :5.000 Max. :9.920 Max. :1.00
## Chance.of.Admit
## Min. :0.3400
## 1st Qu.:0.6300
## Median :0.7200
## Mean :0.7217
## 3rd Qu.:0.8200
## Max. :0.9700
str(df)
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
Let’s start with eliminating any spaces and creating variable names that are easier to work with. The dataset contains a cumulative GPA column with values out of 10 so we will create a new variable called GPA based on a 4.0 scale. Let’s then create a second column to represent probability of admission as a percentage. Lastly, for readability and ease of understanding we will change both research and university ranking to factors and assign levels to them.
names(df) <- c("SerialNo", "GRE", "TOEFL", "UniversityRanking", #changing column names
"StatementOfPurpose", "LetterOfRecommendation",
"cGPA", "Research", "ChanceOfAdmittance")
df <- df%>%
#column for gpa based on 4.0 scale
mutate(GPA = (cGPA/10)*4) %>%
#column for chance of admission as a %
mutate(ChanceOfAdmission = ChanceOfAdmittance *100)
df$Research[df$Research == 0] = "No"
df$Research[df$Research == 1] = "Yes"
df$Research <- factor(df$Research, levels=c("Yes", "No")) #changing Research to a factor
#saving copy of dataframe for analysis later
num_df <- df
df$UniversityRanking[df$UniversityRanking == 5] = "Highest"
df$UniversityRanking[df$UniversityRanking == 4] = "Higher"
df$UniversityRanking[df$UniversityRanking == 3] = "Average"
df$UniversityRanking[df$UniversityRanking == 2] = "Lower"
df$UniversityRanking[df$UniversityRanking == 1] = "Lowest"
#changing University Ranking to a factor
df$UniversityRanking <- factor(df$UniversityRanking,
levels=c("Lowest", "Lower", "Average", "Higher", "Highest"))
head(df)
## SerialNo GRE TOEFL UniversityRanking StatementOfPurpose
## 1 1 337 118 Higher 4.5
## 2 2 324 107 Higher 4.0
## 3 3 316 104 Average 3.0
## 4 4 322 110 Average 3.5
## 5 5 314 103 Lower 2.0
## 6 6 330 115 Highest 4.5
## LetterOfRecommendation cGPA Research ChanceOfAdmittance GPA
## 1 4.5 9.65 Yes 0.92 3.860
## 2 4.5 8.87 Yes 0.76 3.548
## 3 3.5 8.00 Yes 0.72 3.200
## 4 2.5 8.67 Yes 0.80 3.468
## 5 3.0 8.21 No 0.65 3.284
## 6 3.0 9.34 Yes 0.90 3.736
## ChanceOfAdmission
## 1 92
## 2 76
## 3 72
## 4 80
## 5 65
## 6 90
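As an aside, the recoding above can also be sketched in a single step per column. This is a minimal illustration, not a replacement for the code above: it assumes two toy vectors (`research` and `rating`, hypothetical names) holding the values from the first six rows, and uses the `labels` argument of `factor()` to map codes to names directly.

```r
# Toy vectors standing in for the Research and University.Rating columns
# (values taken from the first six rows of the str(df) output).
research <- c(1, 1, 1, 1, 0, 1)
rating   <- c(4, 4, 3, 3, 2, 5)

# factor() maps each level to its label in one call, so the column never
# passes through an intermediate character state.
research_f <- factor(research, levels = c(1, 0), labels = c("Yes", "No"))
rating_f   <- factor(rating, levels = 1:5,
                     labels = c("Lowest", "Lower", "Average", "Higher", "Highest"))

table(research_f)
levels(rating_f)
```

Listing all five levels explicitly keeps the ordering stable even if one rating happens to be absent from a subset of the data.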
Let’s start by visualizing the relationship between Chance of Admission and the other quantitative variables in the dataset.
library(gridExtra)
#scatterplot to show Chance of Admission by GRE
scatter1 = df%>%
ggplot(mapping = aes(x = GRE, y = ChanceOfAdmission)) +
geom_point( color = '#e2ae6c') +
geom_smooth(method = lm) + #creating regression line
ggtitle('Chance of Admission by GRE Score') + #providing plot title
theme_bw() + #changing default theme
xlab('GRE Score') + #assigning x-axis label
ylab('Chance (%)') + #assigning y-axis label
theme(text=element_text(size=10, family="Times New Roman")) #changing default font
#scatterplot to show Chance of Admission by TOEFL
scatter2 = df %>%
ggplot(mapping = aes(x = TOEFL, y = ChanceOfAdmission)) +
geom_point( color = '#a37c82') +
geom_smooth(method = lm) +
ggtitle('Chance of Admission by TOEFL Score') +
theme_bw() +
xlab('TOEFL Score') +
ylab('Chance (%)') +
theme(text=element_text(size=10, family="Times New Roman"))
#scatterplot to show Chance of Admission by GPA
scatter3 = df %>%
ggplot(mapping = aes(x = GPA, y = ChanceOfAdmission)) +
geom_point( color = '#6e304b') +
geom_smooth(method = lm) +
ggtitle('Chance of Admission by GPA on a 4.0 Scale') +
theme_bw() +
xlab('GPA') +
ylab('Chance (%)') +
theme(text=element_text(size=10, family="Times New Roman"))
#scatterplot to show Chance of Admission by SerialNo
scatter4 = df %>%
ggplot(mapping = aes(x = SerialNo, y = ChanceOfAdmission)) +
geom_point( color = '#22161c') +
geom_smooth(method = lm) +
ggtitle('Chance of Admission by Serial Number') +
labs(caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
theme_bw() +
xlab('Serial No') +
ylab('Chance (%)') +
theme(text=element_text(size=10, family="Times New Roman"))
grid.arrange(scatter1, scatter2, scatter3, scatter4) #creating a scatterplot grid
Notice that the scatterplots with fitted regression lines show a positive correlation between chance of admission (the y variable) and all but one of the x variables. Chance of admission is most strongly related to GPA, GRE and TOEFL scores. We will omit Serial No because it appears to have no relationship with the chance of admission.
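The visual impression from the scatterplots can be backed up numerically. The sketch below is only illustrative: it assumes a small data frame (`sample_df`, a hypothetical name) built from the six rows printed by `head(df)`, which is far too few to estimate anything but enough to show the mechanics. On the full data the same call would be applied to `df`.

```r
# First six rows of the dataset, copied from the head(df) output above.
# Six rows are only enough to show the mechanics, not to estimate anything.
sample_df <- data.frame(
  GRE    = c(337, 324, 316, 322, 314, 330),
  TOEFL  = c(118, 107, 104, 110, 103, 115),
  GPA    = c(3.860, 3.548, 3.200, 3.468, 3.284, 3.736),
  Chance = c(92, 76, 72, 80, 65, 90)
)

# Pearson correlation of every predictor with Chance, sorted strongest first.
predictors <- setdiff(names(sample_df), "Chance")
cor_with_chance <- sapply(sample_df[predictors], cor, y = sample_df$Chance)
sort(cor_with_chance, decreasing = TRUE)
```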
df <- df[-c(1)] #dropping SerialNo
head(df)
## GRE TOEFL UniversityRanking StatementOfPurpose LetterOfRecommendation cGPA
## 1 337 118 Higher 4.5 4.5 9.65
## 2 324 107 Higher 4.0 4.5 8.87
## 3 316 104 Average 3.0 3.5 8.00
## 4 322 110 Average 3.5 2.5 8.67
## 5 314 103 Lower 2.0 3.0 8.21
## 6 330 115 Highest 4.5 3.0 9.34
## Research ChanceOfAdmittance GPA ChanceOfAdmission
## 1 Yes 0.92 3.860 92
## 2 Yes 0.76 3.548 76
## 3 Yes 0.72 3.200 72
## 4 Yes 0.80 3.468 80
## 5 No 0.65 3.284 65
## 6 Yes 0.90 3.736 90
Similarly, let’s visualize the relationship between Chance of Admission and the factor variables in the dataset. We will treat Statement of Purpose and Letter of Recommendation as factors for this operation.
fctr_df <- df
#changing statement of purpose to a factor
fctr_df$StatementOfPurpose = as.factor(fctr_df$StatementOfPurpose)
#changing letter of recommendation to a factor
fctr_df$LetterOfRecommendation = as.factor(fctr_df$LetterOfRecommendation)
#boxplot to show chance of admission by university ranking
boxplot5 = fctr_df%>%
ggplot(mapping = aes(x = UniversityRanking, y = ChanceOfAdmission)) + #percentage column, to match the 'Chance (%)' label
geom_boxplot( color = '#e2ae6c') + #custom color
ggtitle('Chance of Admission \nby Undergrad University Ranking') + #providing plot title
theme_bw() + #changing default theme
xlab('University Ranking') + #assigning x-axis label
ylab('Chance (%)') + #assigning y-axis label
theme(text=element_text(size=10, family="Times New Roman")) #changing default font
#boxplot to show chance of admission by statement of purpose
boxplot6 = fctr_df %>%
ggplot(mapping = aes(x = StatementOfPurpose, y = ChanceOfAdmission)) + #percentage column, to match the 'Chance (%)' label
geom_boxplot( color = '#a37c82') +
ggtitle('Chance of Admission \nby Statement of Purpose') +
theme_bw() +
xlab('Statement of Purpose (strength)') +
ylab('Chance (%)') +
theme(text=element_text(size=10, family="Times New Roman"))
#boxplot to show chance of admission by letter of recommendation
boxplot7 = fctr_df %>%
ggplot(mapping = aes(x = LetterOfRecommendation, y = ChanceOfAdmission)) + #percentage column, to match the 'Chance (%)' label
geom_boxplot( color = '#6e304b') +
ggtitle('Chance of Admission \nby Letter of Recommendation') +
theme_bw() +
xlab('Letter of Recommendation (strength)') +
ylab('Chance (%)') +
theme(text=element_text(size=10, family="Times New Roman"))
#boxplot to show chance of admission by research
boxplot8 = fctr_df %>%
ggplot(mapping = aes(x = Research, y = ChanceOfAdmission)) + #percentage column, to match the 'Chance (%)' label
geom_boxplot( color = '#22161c') +
ggtitle('Chance of Admission \nby Research Experience') +
labs(caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
theme_bw() +
xlab('Research Experience') +
ylab('Chance (%)') +
theme(text=element_text(size=10, family="Times New Roman"))
grid.arrange(boxplot5, boxplot6, boxplot7, boxplot8) #creating a boxplot grid
Once again, at a glance there appears to be a relationship between Chance of Admission and each of the factor variables. We will keep this in mind and try to learn more about them as well as GRE scores and GPA from the previous plot to develop a better understanding of each of these variables. As TOEFL isn’t universally applicable to all students we won’t be focusing on it as much.
I am starting with a donut plot to see what percent of students have undergraduate research experience.
donut9 <- df %>%
group_by(Research) %>% #grouping by research
summarize(counts = n(), percentage = n()/nrow(df)) %>% #calculating count and %
ggplot(mapping = aes(x=2, y=percentage, fill=Research)) +
geom_col(color = "#f2f1ef") + #creating pie chart
coord_polar("y", start=1) +
geom_text(aes(label = paste0(round(percentage*100), "%")), #formatting how text will appear
position = position_stack(vjust = 0.1), color = "#f2f1ef") +
theme(panel.background = element_blank(), #customizing both axes
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
plot.title = element_text(hjust = 0.5, size = 18)) +
ggtitle("Student Participation in Undergraduate Research") + #providing plot title
labs(caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
scale_fill_manual(values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) + #providing custom colors
xlim(0.5, 2.5) + #specifying size of the plot
theme(text=element_text(size=12, family="Times New Roman")) #assigning custom font
donut9
We see that more students are involved in undergraduate research (56%) than are not (44%).
Next, we will plot how University Ranking plays a role in student research experience.
barplot10<- df %>%
ggplot(mapping = aes(x = UniversityRanking, fill = Research)) +
geom_bar(position = "dodge", color = "#f2f1ef") + #creating bar graph
labs(title = "Distribution of Students by University Ranking", #assigning titles
subtitle = "Grouped by Undergraduate Research Experience",
caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
theme_bw() + #changing default theme
ylab("Number of Students") + #assigning label to y-axis
xlab("University Ranking (undergrad) ") + #assigning label to x-axis
theme(legend.title = element_blank()) + #removing title legend
scale_fill_manual(name = "Research", #providing labels and colors
labels= c("Research Experience", "No Research Experience"),
values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) +
theme(text=element_text(size=12, family="Times New Roman")) #providing custom font
barplot10
We see that a majority of students at below-average-ranking universities don’t participate in undergraduate research. At average-ranking universities the two groups are closer in size. At above-average-ranking universities the distribution is reversed: more students participate in research than don’t.
Next, let’s explore how this research experience impacts a student’s Chance of Admission to graduate schools.
histogram11 <- df %>%
ggplot(aes(ChanceOfAdmission, fill = Research)) +
geom_histogram(bins = 50, boundary = 0, color = "#f2f1ef") + #creating a histogram
labs(title = "Chance of Admission to Graduate School", #providing titles
subtitle = "Grouped by Undergraduate Research Experience",
caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
xlab("Chance (%)") + #providing label for x-axis
ylab("Frequency of Students") + #providing label for y-axis
facet_grid(Research ~ .) + #faceting based on research
theme_bw() + #changing default theme
scale_fill_manual(name = "Research", #providing labels and colors
labels= c("Research Experience", "No Research Experience"),
values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) +
theme(text=element_text(size=12, family="Times New Roman"), #providing custom font
strip.background = element_blank(),
strip.text = element_blank())
ggplotly(histogram11) #generating plot in plotly
It appears from the plot that, on average, students who participate in research have a much higher chance of acceptance. Very few students who don’t participate have an 80% or higher chance, while a large number of those who do have an 80% or higher chance of admission. Zooming in on the plotly chart gives the exact number of students in each bucket and helps show this trend more clearly. The pattern could arise because only students with a certain minimum GPA are permitted to get involved in research, and those with a higher GPA are also more likely to be admitted to graduate school, so it’s important to keep in mind that correlation does not necessarily imply causation.
Let’s explore how research experience is related to GRE scores and GPA.
densityplot12 <- df %>%
ggplot(mapping = aes(GRE,fill=Research))+ #creating density plot
geom_density(size=1, alpha = 0.7)+
ggtitle("GRE scores by Research Distribution") + #providing plot title
theme_bw() + #changing default theme
scale_fill_manual(name = "Research", #providing labels and colors
labels= c("Research Experience", "No Research Experience"),
values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) +
theme(legend.title = element_blank(), legend.position = "none") #customizing legend
densityplot13 <- df %>%
ggplot(mapping = aes(GPA,fill=factor(Research)))+
geom_density(size=1, alpha = 0.7)+
ggtitle("GPA by Research Experience") +
labs(caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
theme_bw() +
scale_fill_manual(name = "Research",
labels= c("Research Experience", "No Research Experience"),
values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) +
theme(legend.title = element_blank(), legend.position = "bottom")
grid.arrange(densityplot12, densityplot13)
We see from the density plots above that students with research experience are also more likely to have a higher GPA and a higher GRE score. Again, it could be due to the fact that only students with a certain minimum GPA are permitted to get involved in research and those with higher GPA also tend to do well on their GRE exams, so it doesn’t necessarily mean that doing research helps improve GPA or GRE scores. In other words, this may be another case of correlation does not imply causation.
Let’s also explore the distribution of GRE scores and GPA individually.
par(mfrow = c(2,2)) #formatting plot display
boxplot14 <- boxplot(df$GRE,col="#6e304b", #creating boxplot for GRE
horizontal=TRUE,xlab="GRE",main="Boxplot for GRE")
boxplot15 <- boxplot(df$GPA,col="#e2ae6c", #creating boxplot for GPA
horizontal=TRUE,xlab="GPA",main="Boxplot for GPA")
Notice from the above plots that the median GRE score is about 317, and you would need to score roughly 325 or higher to be in the top 25%. Similarly, the median GPA is around 3.4, and you would need a GPA above about 3.6 to be in the top 25%.
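The GPA figures quoted above can be cross-checked against the `summary(df)` output rather than read off the boxplot. The short sketch below takes the CGPA median (8.56) and third quartile (9.04) from that output and applies the same 10-point to 4.0 conversion used when the GPA column was created. The vector name `cgpa_quartiles` is only illustrative.

```r
# Median and third quartile of CGPA (10-point scale), read from summary(df).
cgpa_quartiles <- c(median = 8.56, q3 = 9.04)

# Same conversion used when the GPA column was created: (cGPA / 10) * 4.
gpa_quartiles <- cgpa_quartiles / 10 * 4
round(gpa_quartiles, 2)

# On the full data frame the equivalent call would be:
# quantile(df$GPA, probs = c(0.50, 0.75))
```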
Next, we will look into University Ranking and how each ranking impacts a student’s Chances Of Admission.
boxplot16 <- df %>%
ggplot(mapping = aes(x = UniversityRanking, #creating boxplot
y = ChanceOfAdmission,
fill = UniversityRanking)) +
geom_boxplot(color = "#2e294e", #custom attributes
show.legend = FALSE,
size = 0.6,
outlier.size = 1) +
labs(title = "Chance of Admission to Graduate School", #providing plot titles
subtitle = "by Undergraduate Institution Ranking",
caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
xlab("University Ranking (undergrad) ") +
ylab("Chance (%) ") + #providing label for y-axis
theme_bw() + #changing default theme
theme(strip.background = element_blank(), #customizing labels
strip.text.x = element_blank(),
legend.position = "top") +
scale_fill_manual(name = "University Ranking", #providing labels and colors
values=c("#22161c", "#6e304b", "#a37c82", "#e2ae6c", "#eae2b7")) +
theme(text=element_text(size=12, family="Times New Roman")) + #providing custom font
coord_flip() #flipping coordinates
boxplot16
Notice from the boxplots above that the chance of admission tends to increase with university ranking. The difference is substantial: the median chance of admission from the lowest-ranking universities is just below 60%, whereas the median chance of admission from the highest-ranking universities is over 90%. Inevitably, there are exceptions to this trend, but on average the higher ranked the university, the greater the chance that a student will be accepted into graduate school. Once again, it is important to keep in mind that correlation does not imply causation and that simply going to a high-ranking university doesn’t guarantee admission.
The last plot we will look at shows average GRE scores, average GPA and the average Chance of Admission grouped by University Ranking.
bargraph17 <- df %>%
group_by(UniversityRanking) %>% #grouping by university ranking
summarize(avgGRE = mean(GRE)) %>% #average GRE score column
ggplot(mapping = aes(x = UniversityRanking, #creating barplot
y = avgGRE,
fill = UniversityRanking)) +
geom_col(width = 0.5) + #changing width of each bar
theme_bw() + #changing default theme
labs(caption = "by undergrad University Ranking ") +
ylab('Average GRE') + #assigning y-axis label
theme(text=element_text(size=10, family="Times New Roman")) + #providing custom font
theme(legend.title = element_blank(), #formatting legend and labels
legend.position = "none",
axis.title.y = element_blank()) +
scale_fill_manual(values = #providing custom colors
c("#22161c", "#6e304b", "#a37c82", "#e2ae6c", "#eae2b7")) +
coord_flip() #flipping coordinates
bargraph18 <- df %>%
group_by(UniversityRanking) %>% #grouping by university ranking
summarize(avgGPA = mean(GPA)) %>% #average GPA column
ggplot(mapping = aes(x = UniversityRanking, #creating barplot
y = avgGPA,
fill = UniversityRanking)) +
geom_col(width = 0.5) +
theme_bw() +
labs(caption = "by undergrad University Ranking ") +
ylab('Average GPA') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(),
legend.position = "none",
axis.title.y = element_blank()) +
scale_fill_manual(values =
c("#22161c", "#6e304b", "#a37c82", "#e2ae6c", "#eae2b7")) +
coord_flip()
bargraph19 <- df %>%
group_by(UniversityRanking) %>% #grouping by university ranking
summarize(avgChance = mean(ChanceOfAdmission)) %>% #average chance of admission column
ggplot(mapping = aes(x = UniversityRanking, #creating barplot
y = avgChance,
fill = UniversityRanking)) +
geom_col(width = 0.5) +
theme_bw() +
labs(caption = "by undergrad University Ranking ") +
ylab('Average Chance (%)') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(),
legend.position = "none",
axis.title.y = element_blank()) +
scale_fill_manual(values =
c("#22161c", "#6e304b", "#a37c82", "#e2ae6c", "#eae2b7")) +
coord_flip()
grid.arrange(bargraph17, bargraph18, bargraph19, ncol = 3) #creating barplot grid
We can see that the difference in average GRE scores between lower- and higher-ranking universities is very small, although the average does increase progressively with ranking. The same holds for average GPA and average chance of admission, but the trend is much more pronounced for average chance of admission.
In the next section, we will create visualizations in highcharter that involve some of the other variables that we haven’t looked at before.
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(hablar)
##
## Attaching package: 'hablar'
## The following object is masked from 'package:dplyr':
##
## na_if
library(RColorBrewer)
In this section we will take a closer look at three variables that we have only plotted briefly so far: TOEFL, statement of purpose and letter of recommendation. As TOEFL isn’t applicable to all students we didn’t explore it further in the previous plots, but we will create one visualization so that it is not completely excluded.
In highcharter, let’s plot TOEFL and GRE scores grouped by Statement of Purpose and see if we can gather any new information.
highcharter20 <- highchart() %>% #creating highcharter plot
hc_add_series(data = fctr_df,
type = "scatter",
hcaes(x = GRE,
y = TOEFL,
group = StatementOfPurpose)) %>%
hc_chart(style = list(fontFamily = "Times New Roman", #assigning custom font
fontWeight = "bold")) %>%
hc_xAxis(title = list(text="GRE")) %>% #assigning x-axis label
hc_yAxis(title = list(text="TOEFL")) %>% #assigning y-axis label
hc_colors(c("#03071e", "#01497c", "#718355" , "#774936", "#b07d62" , #assigning custom colors
"#ffc300", "#ee9b00", "#ca6702" , "#ae2012", "#d90429")) %>%
hc_title( #providing plot title
text = "GRE and TOEFL Scores by Statement of Purpose (Strength)",
margin = 30,
align = "center",
style = list(color = "black", useHTML = TRUE)) %>%
hc_tooltip(shared = TRUE,
borderColor = "black",
pointFormat = "GRE: {point.GRE} <br> TOEFL: {point.TOEFL}")
highcharter20
It appears that lower TOEFL scores correspond to lower GRE scores while higher TOEFL scores correspond to higher GRE scores, and that both tend to rise with the strength of the statement of purpose. Although there are some exceptions, the overall trend is upward.
Next, we will see how strength of Letter of Recommendation plays into the admission process and plot Minimum GPA and Highest Chance of Admission for each level in the variable.
cols_df <- df %>%
select(GPA, ChanceOfAdmission, LetterOfRecommendation) %>% #selecting three columns
group_by(LetterOfRecommendation) %>% #grouping by letter of rec
summarize(minGPA = min(GPA), maxChanceOfAdmission = max(ChanceOfAdmission)) #creating two new columns
highcharter21 <- highchart() %>% #creating highcharter plot
hc_yAxis_multiples( #providing labels for both y-axes
list(title = list(text = "Minimum GPA")),
list(title = list(text = "Highest Chance of Admission"), opposite = TRUE)
) %>%
hc_xAxis(title = list(text="Letter of Recommendation (Strength)")) %>% #providing label for x-axis
hc_add_series(data = cols_df$minGPA, #customizing first y-axis
name = "Minimum GPA on a 4.0 Scale",
type = "column",
yAxis = 0) %>%
hc_add_series(data = cols_df$maxChanceOfAdmission, #customizing second y-axis
name = "Highest Chance of Admission",
type = "line",
yAxis = 1) %>%
hc_xAxis(categories = cols_df$LetterOfRecommendation, #customizing x-axis
tickInterval = 1) %>%
hc_title( #providing plot title
text = "Minimum GPA and Highest Chance of Admission
by Letters of Recommendation",
margin = 30, #formatting plot title
align = "center",
style = list(color = "black", useHTML = TRUE)
) %>%
hc_colors(c("#e2ae6c","#6e304b")) %>% #providing custom colors
hc_chart(style = list(fontFamily = "Times New Roman", #providing custom font
fontWeight = "bold"))
highcharter21
It appears that there is no obvious trend between minimum GPA and different levels of the x-axis so it is not the case that higher GPA guarantees a stronger letter of recommendation or vice versa. The chances of admission do show a different trend, however. There is a slight decrease going from 1.5 to 2 on the x-axis but the chances of admission generally seem to increase progressively from levels 1 through 3.5 before plateauing. This tells us that chance of admission may be improved based on a strong letter of recommendation.
Now that we have spent some time exploring our data, let’s construct a regression model.
In this section, we will construct a multiple linear regression model to predict chance of admission. There are four assumptions associated with a linear regression model which can be tested using diagnostic plots. The assumptions are as follows:
Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of the residuals is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, Y is normally distributed.
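Besides diagnostic plots, some of these assumptions can be probed numerically. The sketch below is a rough illustration under stated assumptions: it uses R’s built-in `mtcars` data as a stand-in for the admissions data, checks normality with a Shapiro-Wilk test on the residuals, and uses the correlation between fitted values and absolute residuals as an informal (not formal) indicator of non-constant variance. The names `fit_demo`, `normality` and `spread_trend` are hypothetical.

```r
# mtcars ships with R; it stands in for the admissions data here.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit_demo)

# Normality: Shapiro-Wilk test on the residuals (a small p-value suggests
# the residuals depart from a normal distribution).
normality <- shapiro.test(res)
normality$p.value

# Homoscedasticity, informally: correlation between the fitted values and the
# absolute residuals. A value far from zero hints at non-constant variance.
spread_trend <- cor(fitted(fit_demo), abs(res))
spread_trend
```

Neither number replaces looking at the plots, but they give a quick second opinion when a plot is ambiguous.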
library(corrplot)
## corrplot 0.84 loaded
#selecting columns to use in correlation plot
corr_df <- df %>%
select(GRE, StatementOfPurpose, LetterOfRecommendation,
GPA, Research, ChanceOfAdmission )
df_Numeric_Variable <- select_if(corr_df, is.numeric) #selecting numeric variables
#matrix is symmetric about the principal diagonal, so only the lower triangle is plotted
corr <- cor(df_Numeric_Variable)
#providing custom labels and plotting the matrix
corrplot(corr,method = "number",
type = "lower", tl.cex=0.9, cl.cex = 0.6, tl.col="black")
The correlation coefficient is a value between -1 and 1, inclusive, that indicates the strength and direction of a linear relationship. Values close to +/- 1 represent a strong correlation, with the sign matching the slope of the linear trend; values around +/- 0.5 represent a moderate correlation; and values close to zero indicate little or no linear correlation.
We can see the correlation between each pair of variables in the matrix above. It appears that GPA plays the most important role in admission, with a correlation of 0.88 with chance of admission, followed by GRE score, which is also highly correlated with chance of admission at 0.81.
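To make the coefficient itself less of a black box, the following sketch computes Pearson’s r by hand and compares it with `cor()`. It assumes the six GRE and chance-of-admission values printed by `head(df)` earlier, so the resulting number will not match the 0.81 reported for the full dataset.

```r
gre    <- c(337, 324, 316, 322, 314, 330)
chance <- c(92, 76, 72, 80, 65, 90)

# Pearson's r: sum of the products of the deviations divided by the square
# root of the product of their sums of squares.
dx <- gre - mean(gre)
dy <- chance - mean(chance)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Should agree with the built-in up to floating-point error.
all.equal(r_manual, cor(gre, chance))
```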
With multiple regression, there are several strategies for comparing variable inputs into a model. We will use backward elimination and start with all possible predictor variables with our response variable. Let’s use GRE score, GPA, University Ranking, Statement of Purpose, Letter of Recommendation, and Research and perform a model fit with all predictors. TOEFL score is not included because it is not likely to be relevant for all applicants. We are now ready to create our model.
fit1 <- lm(ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking + #creating model using lm function
StatementOfPurpose + LetterOfRecommendation, data = num_df)
summary(fit1) #printing summary statistics
##
## Call:
## lm(formula = ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking +
## StatementOfPurpose + LetterOfRecommendation, data = num_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.0056 -2.4926 0.8537 3.4852 16.3426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -126.93551 10.79066 -11.763 < 2e-16 ***
## GRE 0.25938 0.04502 5.761 1.47e-08 ***
## GPA 31.77303 2.34946 13.524 < 2e-16 ***
## ResearchNo -2.33025 0.66591 -3.499 0.000509 ***
## UniversityRanking 0.70032 0.38221 1.832 0.067514 .
## StatementOfPurpose 0.29858 0.45833 0.651 0.515060
## LetterOfRecommendation 1.68170 0.41760 4.027 6.54e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.054 on 493 degrees of freedom
## Multiple R-squared: 0.8182, Adjusted R-squared: 0.816
## F-statistic: 369.9 on 6 and 493 DF, p-value: < 2.2e-16
par(mfrow = c(2,2)) #formatting plot display
plot(fit1) #printing diagnostic plots
The p-values for GRE, GPA, Research and Letter of Recommendation are each marked with three asterisks, which suggests these variables are meaningful in explaining the linear change in chance of admission, but we also need to look at the adjusted R-squared value. At 0.816, it indicates that about 82% of the variation in chance of admission is explained by the model, which is pretty good. In other words, only about 18% of the variation is left unexplained.
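The same quantities can be pulled out of the summary object programmatically instead of being read from the printed table. This is a minimal sketch on the built-in `mtcars` data (an assumption made so the example is self-contained); with the admissions model, `fit1` would take the place of `fit_demo`.

```r
fit_demo <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_summary <- summary(fit_demo)

# Coefficient table: one row per term, p-values in the "Pr(>|t|)" column.
p_values <- fit_summary$coefficients[, "Pr(>|t|)"]
round(p_values, 4)

# Terms (excluding the intercept) significant at the 0.001 level,
# i.e. those that would be printed with three asterisks.
names(p_values)[p_values < 0.001 & names(p_values) != "(Intercept)"]

# Adjusted R-squared: share of variance explained, penalized for model size.
fit_summary$adj.r.squared
```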
We usually pay great attention to regression results, such as p-values, R-squared or adjusted R-squared that tell us how well a model represents given data but that’s not the whole picture. We should also look at diagnostic plots to not only check if the linear regression assumptions are met but to improve our model in an exploratory way.
In the Residuals vs Fitted plot, the residuals are spread fairly evenly around a horizontal line with no distinct pattern, which suggests there are no unmodeled non-linear relationships. The Normal Q-Q plot shows whether the residuals are normally distributed. Although the residuals don’t follow the straight line exactly, they don’t deviate severely either. Let’s look at the next plot while keeping in mind that #10, #66 and #93 might be potential problems. The Scale-Location plot checks the assumption of equal variance (homoscedasticity). It’s a good sign if the red line is horizontal with the points spread randomly around it. Here the line is roughly horizontal and the points are evenly spread for x values below 70, but less so for values above 70. Finally, in the Residuals vs Leverage plot we watch for outlying values in the upper-right or lower-right corner, since cases there can be influential on the regression line. The plot identified #92 as an influential observation. Let’s see how our model changes if we exclude the 92nd case from the analysis.
outlier_df <- num_df[-c(92), ] #removing outlier
#creating model using lm function
fit2 <- lm(ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking +
             StatementOfPurpose + LetterOfRecommendation, data = outlier_df)
summary(fit2) #printing summary statistics
##
## Call:
## lm(formula = ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking +
## StatementOfPurpose + LetterOfRecommendation, data = outlier_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.0732 -2.4749 0.8039 3.4624 16.1944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -124.4560 10.7440 -11.584 < 2e-16 ***
## GRE 0.2561 0.0447 5.730 1.75e-08 ***
## GPA 31.1691 2.3412 13.314 < 2e-16 ***
## ResearchNo -2.2852 0.6611 -3.457 0.000594 ***
## UniversityRanking 0.6858 0.3794 1.808 0.071274 .
## StatementOfPurpose 0.5135 0.4609 1.114 0.265741
## LetterOfRecommendation 1.6711 0.4145 4.032 6.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.009 on 492 degrees of freedom
## Multiple R-squared: 0.8192, Adjusted R-squared: 0.817
## F-statistic: 371.5 on 6 and 492 DF, p-value: < 2.2e-16
par(mfrow = c(2,2)) #formatting plot display
plot(fit2) #printing diagnostic plots
We don’t notice any significant improvement in the diagnostic plots, except that the Residuals vs Leverage plot now identifies a new influential observation, #96. Instead of removing this value and testing our model again like we did before, we return to our summary statistics.
Based on the summary information, the only variables that don’t appear to be as significant as the others are statement of purpose (p = 0.27) and university ranking (p = 0.07). We drop statement of purpose since it has the highest p-value and re-run the model on the full dataset (num_df), which includes observation #92 again.
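The influence that the Residuals vs Leverage plot shows visually can also be checked numerically with Cook's distance. The sketch below uses simulated data rather than the admissions dataset. The data frame `toy_df`, its columns, and the planted outlier are all hypothetical, and the 4/n cutoff is a common rule of thumb, not a formal test.

```r
set.seed(42)  # simulated data, NOT the admissions dataset

# Hypothetical stand-in data: one predictor, one response
n <- 100
toy_df <- data.frame(x = rnorm(n))
toy_df$y <- 2 + 3 * toy_df$x + rnorm(n)

# Plant one high-leverage point that sits far from the trend
toy_df$x[100] <- 6
toy_df$y[100] <- -10

toy_fit <- lm(y ~ x, data = toy_df)

# Cook's distance combines residual size and leverage for each observation
cooks_d <- cooks.distance(toy_fit)

# Rule of thumb (an assumption): flag observations with D > 4 / n
flagged <- which(cooks_d > 4 / n)
flagged

# Compare coefficients with and without the most influential observation
worst <- unname(which.max(cooks_d))
refit <- lm(y ~ x, data = toy_df[-worst, ])
rbind(all_rows = coef(toy_fit), without_worst = coef(refit))
```

If the coefficients barely move after dropping the flagged row, removing it is unlikely to change the conclusions.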
#creating model using lm function
fit3 <- lm(ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking +
             LetterOfRecommendation, data = num_df)
summary(fit3) #printing summary statistics
##
## Call:
## lm(formula = ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking +
## LetterOfRecommendation, data = num_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.9197 -2.5216 0.8216 3.5563 16.4160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -127.77743 10.70675 -11.934 < 2e-16 ***
## GRE 0.25978 0.04499 5.774 1.37e-08 ***
## GPA 32.10837 2.29104 14.015 < 2e-16 ***
## ResearchNo -2.33705 0.66544 -3.512 0.000485 ***
## UniversityRanking 0.79488 0.35337 2.249 0.024926 *
## LetterOfRecommendation 1.76304 0.39827 4.427 1.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.05 on 494 degrees of freedom
## Multiple R-squared: 0.8181, Adjusted R-squared: 0.8162
## F-statistic: 444.3 on 5 and 494 DF, p-value: < 2.2e-16
par(mfrow = c(2,2)) #formatting plot display
plot(fit3) #printing diagnostic plots
The diagnostic plots look approximately the same as before, with an improvement in the Residuals vs Leverage plot, where we no longer observe values that might be influential against our regression model. The R-squared and adjusted R-squared values both stayed at about 82%.
Every predictor has three asterisks next to its p-value except university ranking, which has a single asterisk and a p-value of about 0.02. It is still significant at the 0.05 level, but the weaker evidence suggests that it may not be as meaningful as the others. We remove university ranking and run the model one last time.
#creating model using lm function
fit4 <- lm(ChanceOfAdmission ~ GRE + GPA + Research +
             LetterOfRecommendation, data = num_df)
summary(fit4) #printing summary statistics
##
## Call:
## lm(formula = ChanceOfAdmission ~ GRE + GPA + Research + LetterOfRecommendation,
## data = num_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.9225 -2.4510 0.6664 3.5002 16.6388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -134.89444 10.27044 -13.134 < 2e-16 ***
## GRE 0.27137 0.04488 6.047 2.91e-09 ***
## GPA 33.58340 2.20418 15.236 < 2e-16 ***
## ResearchNo -2.42989 0.66687 -3.644 0.000297 ***
## LetterOfRecommendation 2.02221 0.38280 5.283 1.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.075 on 495 degrees of freedom
## Multiple R-squared: 0.8162, Adjusted R-squared: 0.8147
## F-statistic: 549.6 on 4 and 495 DF, p-value: < 2.2e-16
par(mfrow = c(2,2)) #formatting plot display
plot(fit4) #printing diagnostic plots
Our diagnostic plots still look about the same, while our adjusted R-squared dropped slightly to 81%, though it may be that we were over-fitting before and getting a higher value. GRE, GPA, Research and Letter of Recommendation all have three asterisks next to their p-values, which suggests that they are all meaningful variables. So, we select the simplest (most parsimonious) model, which predicts Chance of Admission from GRE, GPA, Research, and Letter of Recommendation. We found earlier that chance of admission is highly correlated with GPA and GRE, so it makes sense that these two variables fit in the model.
The requirements for graduate school are generally the same across programs, though some programs focus more on one component than another, and which component that is varies from school to school. The key question that I wanted to research was to what extent these five factors (academics, statement, letters, research experience, and university ranking) predict your chances of getting admitted into grad school.
According to an article published on usnews.com, grad schools want to see that you have a proven track record of success in your field, as the purpose of graduate school is to develop expertise in a specific academic subject (Kowarski, 2019). GPA and standardized tests are common admissions factors that graduate schools use to determine an applicant’s potential for academic success. The most common standardized test for admissions is the GRE; how important standardized test scores are to a grad school, however, generally depends on the field of study (Muniz, 2017). I also learned that your potential grad schools want to know if you are a good fit for their program, so beyond your academic qualifications, your letters of recommendation might be the most important factor of your application (Muniz, 2017). Both these sources also listed passion for your field as an obvious component because graduate programs are usually hyper-focused on a particular academic discipline, and one easy way to express your passion is through your personal statement (Kowarski, 2019). University ranking was not considered as important, and my linear regression model corroborated this. So, it appears that the specific qualities that grad schools look for in potential students are factors like good academic standing, relevant work experience, a strong personal statement and strong letters of recommendation, as they can all highlight a deep and prolonged interest in your area of study.
One thing that surprised me concerned the statement of purpose, or personal statement. I have known it to be a key component of a graduate school application, which was also confirmed in my background research. The linear regression model I created, however, deemed the variable insignificant when predicting chance of admission. I figured these results might be due to the fact that I was treating the strength of the statement of purpose as a numeric variable and not a factor, so I changed it to a factor and reran my model, which still gave me the same results. It may be that there isn’t a linear relationship between this predictor and the outcome, or that the data were systematically biased when they were collected.
Another thing that surprised me came from comparing levels within a factor variable: in this dataset, more undergraduate students are involved in undergraduate research than are not. I didn’t know this to be the case, and it is something I would have liked to look into more. We see that a majority of students at above-average-ranking universities participate in research, while about the same number at below-average-ranking universities do not. The two groups probably cancel each other out, leaving a slightly greater overall number of students doing undergraduate research.
In my second project I wrote that I wished to use highcharter to create visualizations and attempt multiple linear regression with diagnostic plots. I was able to achieve both in this project, so there isn’t anything in particular that I wish I could have included or done differently. Having developed a better understanding of linear regression through this project, however, I would like to attempt logistic regression as well as explore other machine learning models in future projects.
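The choice between a larger and a smaller model can also be checked with a partial F-test on the nested fits, rather than by comparing adjusted R-squared alone. The following is a minimal sketch on simulated data, not the admissions dataset. The data frame `sim_df`, its column names, and the coefficients used to generate it are all hypothetical.

```r
set.seed(1)  # simulated data, NOT the admissions dataset

n <- 200
sim_df <- data.frame(
  gre = rnorm(n, mean = 316, sd = 11),
  gpa = rnorm(n, mean = 3.4, sd = 0.25),
  sop = sample(1:5, n, replace = TRUE)  # generated with no effect on the response
)
# Hypothetical data-generating process: only gre and gpa matter
sim_df$chance <- -100 + 0.3 * sim_df$gre + 25 * sim_df$gpa + rnorm(n, sd = 6)

full_fit    <- lm(chance ~ gre + gpa + sop, data = sim_df)
reduced_fit <- lm(chance ~ gre + gpa, data = sim_df)

# Partial F-test: does dropping sop significantly worsen the fit?
model_cmp <- anova(reduced_fit, full_fit)
model_cmp

# Information criteria also penalise extra parameters; lower is better
AIC(reduced_fit, full_fit)
```

A large p-value in the `anova` table would support keeping the smaller model, which is the same reasoning as dropping a term with a high p-value in the coefficient table.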
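The difference between treating a rating as numeric and as a factor can be illustrated as follows. This is a sketch on simulated data with hypothetical names (`sim_df`, `gpa`, `sop`, `chance`), not a reproduction of the model above: the numeric version estimates one slope for the rating, while the factor version estimates a separate offset for each level above the baseline.

```r
set.seed(7)  # simulated data, NOT the admissions dataset

n <- 200
sim_df <- data.frame(
  gpa = rnorm(n, mean = 3.4, sd = 0.25),
  sop = rep(1:5, times = n / 5)  # rating from 1 to 5, every level present
)
# Hypothetical data-generating process in which sop has no effect
sim_df$chance <- 10 + 18 * sim_df$gpa + rnorm(n, sd = 6)

numeric_fit <- lm(chance ~ gpa + sop, data = sim_df)          # one slope for sop
factor_fit  <- lm(chance ~ gpa + factor(sop), data = sim_df)  # one offset per level

length(coef(numeric_fit))  # intercept + gpa + 1 slope
length(coef(factor_fit))   # intercept + gpa + 4 level offsets

# The numeric model is nested in the factor model, so an F-test can compare them
level_cmp <- anova(numeric_fit, factor_fit)
level_cmp
```

If neither coding shows an effect, the result does not depend on how the rating was encoded.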
Kowarski, Ilana. “Is Graduate School Right For You?” U.S. News & World Report, U.S. News & World Report, 15 Feb. 2019, www.usnews.com/education/best-graduate-schools/articles/2019-02-15/what-graduate-school-is-and-who-should-consider-attending.
Acharya, Mohan S., Asfia Armaan, and Aneeta S. Antony. “A Comparison of Regression Models for Prediction of Graduate Admissions.” IEEE International Conference on Computational Intelligence in Data Science, 2019.
Muniz, Hannah. “Grad School Requirements: What You Need for Admission.” Online GRE Prep Blog by PrepScholar, 20 Mar. 2017, www.prepscholar.com/gre/blog/grad-school-requirements/.