INTRODUCTION

Abstract

When applying to graduate schools, students are eager to know how likely they are to be accepted. This project aims to create a model that predicts the chance of admission into a graduate program based on a student’s undergraduate academic performance and qualifications. The dataset I chose for my final project is called the Graduate Admissions 2 dataset. It is available on Kaggle (https://www.kaggle.com/mohansacharya/graduate-admissions) and is inspired by the UCLA graduate admissions dataset. It contains several parameters, such as GRE scores, undergraduate GPA, and research experience, which are considered important during the application process.

As I am looking into applying to graduate school myself, I chose this dataset to explore how chance of admission is impacted by some of these other parameters. In order to answer this question, I will start with some exploratory data analysis to visualize any interesting patterns and uncover how different variables are related to each other before constructing a regression model.

More about the dataset

The dataset was mostly clean to begin with, though some pre-processing was required. I created some new variables based on the existing ones and removed the ones I didn’t need for my analysis. I also had to change some numeric variables to factors and assign levels to them for readability before plotting. Each of these steps is described in the code comments below as I perform it. Though the dataset primarily consists of quantitative variables, some of them can also be converted and used as factors. For example, the research experience column contains either 0 or 1, which can be converted to True/False or Yes/No, as desired. The variables contained in the dataset are as follows:

GRE Scores (ranging from 290 to 340)
TOEFL Scores (ranging from 92 to 120)
(Undergraduate) University Rating (ranging from 1 to 5)
Statement of Purpose and Letter of Recommendation Strength (ranging from 1 to 5)
Undergraduate GPA (ranging from 6.80 to 9.92)
Research Experience (either 0 or 1)
Chance of Admit (ranging from 0 to 1)

Loading necessary libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(IRdisplay)
library(ggthemes)

Setting current working directory

setwd("~/Desktop/DATA110")

Reading in the dataset

df <- read.csv("archive/Admission_Predict_Ver1.1.csv")

Printing summary statistics

summary(df)
##    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
##  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000    
##  1st Qu.:125.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000    
##  Median :250.5   Median :317.0   Median :107.0   Median :3.000    
##  Mean   :250.5   Mean   :316.5   Mean   :107.2   Mean   :3.114    
##  3rd Qu.:375.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000    
##  Max.   :500.0   Max.   :340.0   Max.   :120.0   Max.   :5.000    
##       SOP             LOR             CGPA          Research   
##  Min.   :1.000   Min.   :1.000   Min.   :6.800   Min.   :0.00  
##  1st Qu.:2.500   1st Qu.:3.000   1st Qu.:8.127   1st Qu.:0.00  
##  Median :3.500   Median :3.500   Median :8.560   Median :1.00  
##  Mean   :3.374   Mean   :3.484   Mean   :8.576   Mean   :0.56  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:9.040   3rd Qu.:1.00  
##  Max.   :5.000   Max.   :5.000   Max.   :9.920   Max.   :1.00  
##  Chance.of.Admit 
##  Min.   :0.3400  
##  1st Qu.:0.6300  
##  Median :0.7200  
##  Mean   :0.7217  
##  3rd Qu.:0.8200  
##  Max.   :0.9700

Printing structure of the dataset

str(df)
## 'data.frame':    500 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

DATA CLEANING

Let’s start with eliminating any spaces and creating variable names that are easier to work with. The dataset contains a cumulative GPA column with values out of 10 so we will create a new variable called GPA based on a 4.0 scale. Let’s then create a second column to represent probability of admission as a percentage. Lastly, for readability and ease of understanding we will change both research and university ranking to factors and assign levels to them.

Formatting and preprocessing

names(df) <- c("SerialNo", "GRE", "TOEFL", "UniversityRanking",                 #changing column names 
               "StatementOfPurpose", "LetterOfRecommendation", 
               "cGPA", "Research", "ChanceOfAdmittance")                        

df <- df%>%   
  #column for gpa based on 4.0 scale
  mutate(GPA = (cGPA/10)*4) %>%                                                 
  #column for chance of admission as a %
  mutate(ChanceOfAdmission = ChanceOfAdmittance *100)                           

df$Research[df$Research == 0] = "No"
df$Research[df$Research == 1] = "Yes"

df$Research <- factor(df$Research, levels=c("Yes", "No"))                       #changing Research to a factor

#saving copy of dataframe for analysis later
num_df <- df                                                                    

df$UniversityRanking[df$UniversityRanking == 5] = "Highest"
df$UniversityRanking[df$UniversityRanking == 4] = "Higher"
df$UniversityRanking[df$UniversityRanking == 3] = "Average"
df$UniversityRanking[df$UniversityRanking == 2] = "Lower"
df$UniversityRanking[df$UniversityRanking == 1] = "Lowest"

#changing University Ranking to a factor
df$UniversityRanking <- factor(df$UniversityRanking, 
  levels=c("Lowest", "Lower", "Average", "Higher", "Highest"))                  
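As an aside, the recode-then-factor steps above could probably be collapsed into a single call, since `factor()` accepts both `levels` and `labels`. The following is only a sketch under that assumption: `toy` is a hypothetical stand-in built from the first six `Research` and `University.Rating` values shown in the `str(df)` output, not the full dataset.

```r
# Hypothetical toy data: first six rows of Research and University.Rating
# as printed by str(df) earlier in this report
toy <- data.frame(
  Research          = c(1, 1, 1, 1, 0, 1),
  UniversityRanking = c(4, 4, 3, 3, 2, 5)
)

# levels = the raw codes, labels = the readable names, in matching order
toy$Research <- factor(toy$Research, levels = c(1, 0), labels = c("Yes", "No"))

toy$UniversityRanking <- factor(
  toy$UniversityRanking,
  levels = 1:5,
  labels = c("Lowest", "Lower", "Average", "Higher", "Highest")
)

str(toy)
```

Mapping codes to labels in one step avoids the intermediate state where a numeric column temporarily holds a mix of numbers and strings.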

Printing the dataset

head(df)
##   SerialNo GRE TOEFL UniversityRanking StatementOfPurpose
## 1        1 337   118            Higher                4.5
## 2        2 324   107            Higher                4.0
## 3        3 316   104           Average                3.0
## 4        4 322   110           Average                3.5
## 5        5 314   103             Lower                2.0
## 6        6 330   115           Highest                4.5
##   LetterOfRecommendation cGPA Research ChanceOfAdmittance   GPA
## 1                    4.5 9.65      Yes               0.92 3.860
## 2                    4.5 8.87      Yes               0.76 3.548
## 3                    3.5 8.00      Yes               0.72 3.200
## 4                    2.5 8.67      Yes               0.80 3.468
## 5                    3.0 8.21       No               0.65 3.284
## 6                    3.0 9.34      Yes               0.90 3.736
##   ChanceOfAdmission
## 1                92
## 2                76
## 3                72
## 4                80
## 5                65
## 6                90

EXPLORATORY DATA ANALYSIS

Let’s start by visualizing the relationship between Chance of Admission and the other quantitative variables in the dataset.

Plots with quantitative variables

library(gridExtra)

#scatterplot to show Chance of Admission by GRE
scatter1 = df%>%                                                                   
  ggplot(mapping = aes(x = GRE, y = ChanceOfAdmission)) +                       
  geom_point( color = '#e2ae6c') +                         
  geom_smooth(method = lm) +                                                    #creating regression line
  ggtitle('Chance of Admission by GRE Score') +                                 #providing plot title
  theme_bw() +                                                                  #changing default theme
  xlab('GRE Score') +                                                           #assigning x-axis label
  ylab('Chance (%)') +                                                          #assigning y-axis label
  theme(text=element_text(size=10,  family="Times New Roman"))                  #changing default font

#scatterplot to show Chance of Admission by TOEFL
scatter2 = df %>%
  ggplot(mapping = aes(x = TOEFL, y = ChanceOfAdmission)) +                     
  geom_point( color = '#a37c82') +
  geom_smooth(method = lm) +
  ggtitle('Chance of Admission by TOEFL Score') +
  theme_bw() +
  xlab('TOEFL Score') +
  ylab('Chance (%)') +
  theme(text=element_text(size=10,  family="Times New Roman")) 

#scatterplot to show Chance of Admission by GPA
scatter3 = df %>%
  ggplot(mapping = aes(x = GPA, y = ChanceOfAdmission)) +                       
  geom_point( color = '#6e304b') +
  geom_smooth(method = lm) +
  ggtitle('Chance of Admission by GPA on a 4.0 Scale') +
  theme_bw() +
  xlab('GPA') +
  ylab('Chance (%)') +
  theme(text=element_text(size=10,  family="Times New Roman")) 

#scatterplot to show Chance of Admission by SerialNo
scatter4 = df %>%
  ggplot(mapping = aes(x = SerialNo, y = ChanceOfAdmission)) +                 
  geom_point( color = '#22161c') +
  geom_smooth(method = lm) +
  ggtitle('Chance of Admission by Serial Number') +
  labs(caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
  theme_bw() +
  xlab('Serial No') +
  ylab('Chance (%)') +
  theme(text=element_text(size=10,  family="Times New Roman")) 

grid.arrange(scatter1, scatter2, scatter3, scatter4)                            #creating a scatterplot grid

Notice that the scatterplots with fitted regression lines show a positive correlation between Chance of Admission (the y variable) and all but one of the x variables. GPA, GRE and TOEFL scores each show a strong positive relationship with Chance of Admission. We will omit Serial No because it appears to have no effect on the chances of admission.

df <- df[-c(1)]                                                                 #dropping SerialNo
head(df)
##   GRE TOEFL UniversityRanking StatementOfPurpose LetterOfRecommendation cGPA
## 1 337   118            Higher                4.5                    4.5 9.65
## 2 324   107            Higher                4.0                    4.5 8.87
## 3 316   104           Average                3.0                    3.5 8.00
## 4 322   110           Average                3.5                    2.5 8.67
## 5 314   103             Lower                2.0                    3.0 8.21
## 6 330   115           Highest                4.5                    3.0 9.34
##   Research ChanceOfAdmittance   GPA ChanceOfAdmission
## 1      Yes               0.92 3.860                92
## 2      Yes               0.76 3.548                76
## 3      Yes               0.72 3.200                72
## 4      Yes               0.80 3.468                80
## 5       No               0.65 3.284                65
## 6      Yes               0.90 3.736                90
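The visual impression from the scatterplots can also be checked numerically with `cor()`. The sketch below is illustrative only: `sample_df` is a hypothetical name holding just the six rows printed by `head(df)`, so the coefficients it produces say nothing reliable about the full 500-row dataset.

```r
# Six rows copied from the head(df) output above (illustration only)
sample_df <- data.frame(
  GRE               = c(337, 324, 316, 322, 314, 330),
  TOEFL             = c(118, 107, 104, 110, 103, 115),
  cGPA              = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34),
  ChanceOfAdmission = c(92, 76, 72, 80, 65, 90)
)

# Pearson correlation of every column with ChanceOfAdmission
cors <- cor(sample_df)[, "ChanceOfAdmission"]
round(cors, 2)
```

The same call applied to the numeric columns of the full data frame would give the coefficients that the regression lines in the plots summarize visually.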

Similarly, let’s visualize the relationship between Chance of Admission and the factor variables in the dataset. We will treat Statement of Purpose and Letter of Recommendation as factors for this purpose.

Plots with factor variables

fctr_df <- df
#changing statement of purpose to a factor
fctr_df$StatementOfPurpose = as.factor(fctr_df$StatementOfPurpose)  
#changing letter of recommendation to a factor
fctr_df$LetterOfRecommendation = as.factor(fctr_df$LetterOfRecommendation)      

#boxplot to show chance of admission by university ranking
boxplot5 = fctr_df%>%
  ggplot(mapping = aes(x = UniversityRanking, y = ChanceOfAdmission)) +         #percent column, to match the 'Chance (%)' label
  geom_boxplot( color = '#e2ae6c') +                                            #custom color
  ggtitle('Chance of Admission \nby Undergrad University Ranking') +            #providing plot title
  theme_bw() +                                                                  #changing default theme
  xlab('University Ranking') +                                                  #assigning x-axis label
  ylab('Chance (%)') +                                                          #assigning y-axis label
  theme(text=element_text(size=10,  family="Times New Roman"))                  #changing default font

#boxplot to show chance of admission by statement of purpose
boxplot6 = fctr_df %>%
  ggplot(mapping = aes(x = StatementOfPurpose, y = ChanceOfAdmission)) +        #percent column, to match the 'Chance (%)' label
  geom_boxplot( color = '#a37c82') +
  ggtitle('Chance of Admission \nby Statement of Purpose') +
  theme_bw() +
  xlab('Statement of Purpose (strength)') +
  ylab('Chance (%)') +
  theme(text=element_text(size=10,  family="Times New Roman"))

#boxplot to show chance of admission by letter of recommendation
boxplot7 = fctr_df %>%
  ggplot(mapping = aes(x = LetterOfRecommendation, y = ChanceOfAdmission)) +    #percent column, to match the 'Chance (%)' label
  geom_boxplot( color = '#6e304b') +
  ggtitle('Chance of Admission \nby Letter of Recommendation') +
  theme_bw() +
  xlab('Letter of Recommendation (strength)') +
  ylab('Chance (%)') +
  theme(text=element_text(size=10,  family="Times New Roman"))

#boxplot to show chance of admission by research
boxplot8 = fctr_df %>%
  ggplot(mapping = aes(x = Research, y = ChanceOfAdmission)) +                  #percent column, to match the 'Chance (%)' label
  geom_boxplot( color = '#22161c') +
  ggtitle('Chance of Admission \nby Research Experience') +
  labs(caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
  theme_bw() +
  xlab('Research Experience') +
  ylab('Chance (%)') +
  theme(text=element_text(size=10,  family="Times New Roman"))

grid.arrange(boxplot5, boxplot6, boxplot7, boxplot8)                            #creating a boxplot grid

Once again, at a glance there appears to be a relationship between Chance of Admission and each of the factor variables. We will keep this in mind and try to learn more about them as well as GRE scores and GPA from the previous plot to develop a better understanding of each of these variables. As TOEFL isn’t universally applicable to all students we won’t be focusing on it as much.

I am starting with a donut plot to see what percent of students have undergraduate research experience.

Donut plot to compare factors within a variable

donut9 <- df %>%
  group_by(Research) %>%                                                        #grouping by research
  summarize(counts = n(), percentage = n()/nrow(df)) %>%                        #calculating count and % 
  ggplot(mapping = aes(x=2, y=percentage, fill=Research)) +                   
  geom_col(color = "#f2f1ef") +                                                 #creating pie chart
  coord_polar("y", start=1) +
  geom_text(aes(label = paste0(round(percentage*100), "%")),                    #formatting how text will appear
        position = position_stack(vjust = 0.1), color = "#f2f1ef") +
  theme(panel.background = element_blank(),                                     #customizing both axes
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 18)) +
  ggtitle("Student Participation in Undergraduate Research") +                  #providing plot title
  labs(caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
  scale_fill_manual(values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) +          #providing custom colors
  xlim(0.5, 2.5) +                                                              #specifying size of the plot 
  theme(text=element_text(size=12,  family="Times New Roman"))                  #assigning custom font

donut9

We see that more students have undergraduate research experience (56%) than do not (44%).
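The two percentages can be reproduced without a plot using `table()` and `prop.table()`. This is a minimal sketch: the counts of 280 and 220 are an assumption inferred from the 500 observations and the mean of 0.56 reported for `Research` in the `summary(df)` output, and `research` is a hypothetical stand-in for the real column.

```r
# Assumed counts: 0.56 * 500 = 280 students with research, 220 without
research <- factor(rep(c("Yes", "No"), times = c(280, 220)),
                   levels = c("Yes", "No"))

# Share of each level, as a proportion of all students
shares <- prop.table(table(research))

# Rounded percentages, as shown on the donut plot labels
round(shares * 100)
```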

Next, we will plot how University Ranking plays a role in student research experience.

Plotting Research Experience by undergrad University Ranking

barplot10<- df %>%
  ggplot(mapping = aes(x = UniversityRanking, fill = Research)) +
  geom_bar(position = "dodge", color = "#f2f1ef") +                             #creating bar graph 
  labs(title = "Distribution of Students by University Ranking",                #assigning titles
       subtitle = "Grouped by Undergraduate Research Experience",
       caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
  theme_bw() +                                                                  #changing default theme
  ylab("Number of Students") +                                                  #assigning label to y-axis
  xlab("University Ranking (undergrad) ") +                                     #assigning label to x-axis
  theme(legend.title = element_blank()) +                                       #removing title legend
  scale_fill_manual(name = "Research",                                          #providing labels and colors
              labels= c("Research Experience", "No Research Experience"), 
              values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) +
   theme(text=element_text(size=12,  family="Times New Roman"))                 #providing custom font
  
barplot10

We see that a majority of students at below-average-ranking universities don’t participate in undergraduate research. At average-ranking universities, students who are not involved in research still outnumber those who are, but the difference is smaller. At above-average-ranking universities the distribution is reversed: more students participate in research than don’t.

Next, let’s explore how this research experience impacts a student’s Chance of Admission to graduate schools.

Plotting Chance of Admission by Research Experience

histogram11 <- df %>%
  ggplot(aes(ChanceOfAdmission, fill = Research)) +     
  geom_histogram(bins = 50, boundary = 0, color = "#f2f1ef") +                  #creating a histogram
  labs(title = "Chance of Admission to Graduate School",                        #providing titles
       subtitle = "Grouped by Undergraduate Research Experience",
       caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
  xlab("Chance (%)") +                                                          #providing label for x-axis
  ylab("Frequency of Students") +                                               #providing label for y-axis
  facet_grid(Research ~ .) +                                                    #faceting based on research
  theme_bw() +                                                                  #changing default theme
  scale_fill_manual(name = "Research",                                          #providing labels and colors
                    labels= c("Research Experience", "No Research Experience"), 
                    values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) +          
  theme(text=element_text(size=12,  family="Times New Roman"),                  #providing custom font
        strip.background = element_blank(), 
        strip.text = element_blank())
  
ggplotly(histogram11)                                                           #generating plot in plotly

The plot suggests that, on average, students who participate in research have a much higher chance of acceptance. Very few students without research experience have an 80% or higher chance, while a large number of those with research experience do. Zooming in on the plotly chart gives the exact number of students in each bin and makes this trend clearer. One possible explanation is that only students with a certain minimum GPA are permitted to get involved in research, and students with a higher GPA are also more likely to be admitted to graduate school, so it is important to keep in mind that correlation does not necessarily imply causation.

Let’s explore how research experience is related to GRE scores and GPA.

Plotting GRE scores and GPA by Research Experience

densityplot12 <- df %>%
  ggplot(mapping = aes(GRE,fill=Research))+                                     #creating density plot
  geom_density(size=1, alpha = 0.7)+
  ggtitle("GRE Scores by Research Experience") +                                #providing plot title
  theme_bw() +                                                                  #changing default theme
  scale_fill_manual(name = "Research",                                          #providing labels and colors
                    labels= c("Research Experience", "No Research Experience"), 
                    values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) +
  theme(legend.title = element_blank(), legend.position = "none")               #customizing legend

densityplot13 <- df %>%
  ggplot(mapping = aes(GPA,fill=factor(Research)))+                               
  geom_density(size=1, alpha = 0.7)+                                            
  ggtitle("GPA by Research Experience") +   
  labs(caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
  theme_bw() +                                                                  
  scale_fill_manual(name = "Research",                                          
                    labels= c("Research Experience", "No Research Experience"), 
                    values = c("Yes" = "#6e304b", "No" = "#e2ae6c")) +
  theme(legend.title = element_blank(), legend.position = "bottom")             

grid.arrange(densityplot12, densityplot13)

We see from the density plots above that students with research experience are also more likely to have a higher GPA and a higher GRE score. Again, this could be because only students with a certain minimum GPA are permitted to get involved in research, and those with a higher GPA also tend to do well on the GRE, so it doesn’t necessarily mean that doing research helps improve GPA or GRE scores. In other words, this may be another case where correlation does not imply causation.

Let’s also explore the distribution of GRE scores and GPA individually.

Plotting distribution of GRE scores and GPA

par(mfrow = c(2,2))                                                             #formatting plot display
boxplot14 <- boxplot(df$GRE,col="#6e304b",                                      #creating boxplot for GRE
        horizontal=TRUE,xlab="GRE",main="Boxplot for GRE")                  


boxplot15 <- boxplot(df$GPA,col="#e2ae6c",                                      #creating boxplot for GPA
        horizontal=TRUE,xlab="GPA",main="Boxplot for GPA")

Notice from the above plots that the median GRE score is around 317, and you would need to score roughly 325 or higher to be in the top 25%. Similarly, the median GPA is around 3.4, and you would need a GPA above 3.6 to be in the top 25%.
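The GPA figures follow directly from the `summary(df)` output once the 10-point CGPA values are rescaled with the same formula used in the preprocessing step, `(cGPA/10)*4`. A small worked sketch, where `to_four_point` is a hypothetical helper name introduced only for this illustration:

```r
# Same conversion as the mutate() step: 10-point CGPA to a 4.0 scale
to_four_point <- function(cgpa) (cgpa / 10) * 4

# Median and third quartile of CGPA, copied from the summary(df) output
cgpa_quartiles <- c(median = 8.560, q3 = 9.040)

# Rescaled values that the GPA boxplot displays
gpa_quartiles <- to_four_point(cgpa_quartiles)
gpa_quartiles
```

The median comes out just above 3.4 and the third quartile just above 3.6, which is where the boxplot places the centre line and the upper edge of the box.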

Next, we will look into University Ranking and how each ranking impacts a student’s Chances Of Admission.

Plotting Chance of Admission by undergrad University Ranking

boxplot16 <- df %>%                                              
  ggplot(mapping = aes(x = UniversityRanking,                                   #creating boxplot
                       y = ChanceOfAdmission, 
                       fill = UniversityRanking)) + 
  geom_boxplot(color = "#2e294e",                                               #custom attributes
               show.legend = FALSE, 
               size = 0.6, 
               outlier.size = 1) +
  labs(title = "Chance of Admission to Graduate School",                        #providing plot titles
       subtitle = "by Undergraduate Institution Ranking",      
       caption = "www.kaggle.com/mohansacharya/graduate-admissions") +
  xlab("University Ranking (undergrad) ") +   
  ylab("Chance (%) ") +                                                         #providing label for y-axis
  theme_bw() +                                                                  #changing default theme
  theme(strip.background = element_blank(),                                     #customizing labels
        strip.text.x = element_blank(),
        legend.position = "top") +
  scale_fill_manual(name = "University Ranking",                                #providing labels and colors
        values=c("#22161c", "#6e304b", "#a37c82", "#e2ae6c", "#eae2b7")) +
  theme(text=element_text(size=12,  family="Times New Roman")) +                #providing custom font
  coord_flip()                                                                  #flipping coordinates

boxplot16

Notice from the boxplots above that the chance of admission tends to increase with university ranking. The difference is substantial: the median chance of admission for students from the lowest-ranking universities is just below 60%, whereas the median for students from the highest-ranking universities is over 90%. Inevitably, there are exceptions to this trend, but on average, the higher ranked the university, the greater the chance that a student will get accepted into graduate school. Once again, it is important to keep in mind that correlation does not imply causation, and simply going to a high-ranking university doesn’t guarantee admission.

The last plot we will look at shows average GRE scores, average GPA and the average Chance of Admission grouped by University Ranking.

Plotting average of different variables by undergrad University Ranking

bargraph17 <- df %>%
  group_by(UniversityRanking) %>%                                               #grouping by university ranking
  summarize(avgGRE = mean(GRE)) %>%                                             #average GRE score column
  ggplot(mapping = aes(x = UniversityRanking,                                   #creating barplot
                       y =  avgGRE, 
                       fill = UniversityRanking)) +
  geom_col(width = 0.5) +                                                       #changing width of each bar
  theme_bw() +                                                                  #changing default theme
  labs(caption = "by undergrad University Ranking    ") +
  ylab('Average GRE') +                                                         #assigning y-axis label
  theme(text=element_text(size=10,  family="Times New Roman")) +                #providing custom font
  theme(legend.title = element_blank(),                                         #formatting legend and labels
        legend.position = "none",  
        axis.title.y = element_blank()) + 
  scale_fill_manual(values =                                                    #providing custom colors
        c("#22161c", "#6e304b", "#a37c82", "#e2ae6c", "#eae2b7")) +
  coord_flip()                                                                  #flipping coordinates

bargraph18 <- df %>%
  group_by(UniversityRanking) %>%                                               #grouping by university ranking
  summarize(avgGPA = mean(GPA)) %>%                                             #average GPA column
  ggplot(mapping = aes(x = UniversityRanking,                                   #creating barplot
                       y =  avgGPA, 
                       fill = UniversityRanking)) +
  geom_col(width = 0.5) +
  theme_bw() +  
  labs(caption = "by undergrad University Ranking    ") +
  ylab('Average GPA') +
  theme(text=element_text(size=10,  family="Times New Roman")) +                
  theme(legend.title = element_blank(), 
        legend.position = "none", 
        axis.title.y = element_blank()) +
  scale_fill_manual(values = 
        c("#22161c", "#6e304b", "#a37c82", "#e2ae6c", "#eae2b7")) +
  coord_flip() 

bargraph19 <- df %>%
  group_by(UniversityRanking) %>%                                               #grouping by university ranking
  summarize(avgChance = mean(ChanceOfAdmission)) %>%                            #average chance of admission column
  ggplot(mapping = aes(x = UniversityRanking,                                   #creating barplot
                       y =  avgChance, 
                       fill = UniversityRanking)) +
  geom_col(width = 0.5) +                                                       
  theme_bw() +        
  labs(caption = "by undergrad University Ranking       ") +
  ylab('Average Chance (%)') +                                         
  theme(text=element_text(size=10,  family="Times New Roman")) +                
  theme(legend.title = element_blank(),                                         
        legend.position = "none", 
        axis.title.y = element_blank()) +
  scale_fill_manual(values =                                                    
        c("#22161c", "#6e304b", "#a37c82", "#e2ae6c", "#eae2b7")) +
  coord_flip()                                                                  

grid.arrange(bargraph17, bargraph18, bargraph19, ncol = 3)                      #creating barplot grid

We can see that the difference in average GRE scores between lower-ranking and higher-ranking universities is very small, although the average does increase progressively with ranking. The same is true of average GPA and average chance of admission, but the trend is much more pronounced for average chance of admission.
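The three group averages behind these bar charts can also be computed in one table, which makes the size of each trend easier to compare. The sketch below uses base R's `aggregate()` and is illustrative only: `sample_df` is a hypothetical name holding the six rows printed by `head(df)`, so the averages are not those of the full dataset.

```r
# Six rows copied from the head(df) output (illustration only)
sample_df <- data.frame(
  UniversityRanking = c("Higher", "Higher", "Average", "Average", "Lower", "Highest"),
  GRE               = c(337, 324, 316, 322, 314, 330),
  GPA               = c(3.860, 3.548, 3.200, 3.468, 3.284, 3.736),
  ChanceOfAdmission = c(92, 76, 72, 80, 65, 90)
)

# Mean of all three variables per ranking in a single call
avg_by_rank <- aggregate(
  cbind(GRE, GPA, ChanceOfAdmission) ~ UniversityRanking,
  data = sample_df,
  FUN  = mean
)

avg_by_rank
```

Having the averages side by side in one data frame also means a single summary could feed all three plots instead of repeating the grouping step for each.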

In the next section, we will create visualizations in highcharter that involve some of the other variables that we haven’t looked at before.

Loading necessary libraries

library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(hablar)
## 
## Attaching package: 'hablar'
## The following object is masked from 'package:dplyr':
## 
##     na_if
library(RColorBrewer)

In this section we will take a closer look at three variables that we have only plotted briefly so far: TOEFL, statement of purpose and letter of recommendation. As TOEFL isn’t applicable to all students, we didn’t explore it further in the previous plots, but we will create one visualization so that it is not left out entirely.

In highcharter, let’s plot TOEFL and GRE scores grouped by Statement of Purpose and see if we can gather any new information.

Plotting TOEFL and GRE scores by Statement of Purpose

highcharter20 <- highchart() %>%                                                #creating highcharter plot
   hc_add_series(data = fctr_df,
                 type = "scatter",
                   hcaes(x = GRE, 
                   y = TOEFL, 
                   group = StatementOfPurpose)) %>%
  hc_chart(style = list(fontFamily = "Times New Roman",                         #assigning custom font
                        fontWeight = "bold")) %>%
  hc_xAxis(title = list(text="GRE")) %>%                                        #assigning x-axis label
  hc_yAxis(title = list(text="TOEFL")) %>%                                      #assigning y-axis label
  hc_colors(c("#03071e", "#01497c", "#718355" , "#774936", "#b07d62" ,          #assigning custom colors
              "#ffc300", "#ee9b00", "#ca6702" , "#ae2012", "#d90429")) %>%
  hc_title(                                                                     #providing plot title
    text = "GRE and TOEFL Scores by Statement of Purpose (Strength)",
    margin = 30,
    align = "center",
    style = list(color = "black", useHTML = TRUE)) %>%
  hc_tooltip(shared = TRUE,
             borderColor = "black",
             pointFormat = "GRE: {point.GRE} <br> TOEFL: {point.TOEFL}")

highcharter20

It appears that lower TOEFL scores correspond to lower GRE scores and higher TOEFL scores correspond to higher GRE scores, and that the points follow an upward trajectory across the statement of purpose levels. Although there are some exceptions, the results generally follow the trend that those scoring low on TOEFL also tend to have a low GRE score, and vice versa.

Next, we will see how strength of Letter of Recommendation plays into the admission process and plot Minimum GPA and Highest Chance of Admission for each level in the variable.

Plotting GPA and Chance of Admission by Letter of Recommendation

cols_df <- df %>%
  select(GPA, ChanceOfAdmission, LetterOfRecommendation) %>%                    #selecting three columns
  group_by(LetterOfRecommendation) %>%                                          #grouping by letter of rec
  summarize(minGPA = min(GPA), maxChanceOfAdmission = max(ChanceOfAdmission))   #creating two new columns 

highcharter21 <- highchart() %>%                                                #creating highcharter plot
  hc_yAxis_multiples(                                                           #providing labels for both y-axes
    list(title = list(text = "Minimum GPA")),
    list(title = list(text = "Highest Chance of Admission"), opposite = TRUE)
    ) %>%
  hc_xAxis(title = list(text="Letter of Recommendation (Strength)")) %>%        #providing label for x-axis
  hc_add_series(data = cols_df$minGPA,                                          #customizing first y-axis
                name = "Minimum GPA on a 4.0 Scale",
                type = "column",
                yAxis = 0) %>%
  hc_add_series(data = cols_df$maxChanceOfAdmission,                            #customizing second y-axis
                name = "Highest Chance of Admission",
                type = "line",
                yAxis = 1) %>%
  hc_xAxis(categories = cols_df$LetterOfRecommendation,                         #customizing x-axis
           tickInterval = 1) %>%
  hc_title(                                                                     #providing plot title
    text = "Minimum GPA and Highest Chance of Admission 
    by Letters of Recommendation",
    margin = 30,                                                                #formatting plot title
    align = "center",
    style = list(color = "black", useHTML = TRUE)
    ) %>% 
    hc_colors(c("#e2ae6c","#6e304b")) %>%                                       #providing custom colors
    hc_chart(style = list(fontFamily = "Times New Roman",                       #providing custom font
                        fontWeight = "bold")) 

highcharter21

There is no obvious trend between minimum GPA and the letter of recommendation levels on the x-axis, so a higher GPA does not guarantee a stronger letter of recommendation, or vice versa. The highest chance of admission shows a different pattern, however. There is a slight decrease going from 1.5 to 2 on the x-axis, but it generally increases from levels 1 through 3.5 before plateauing. This suggests that a strong letter of recommendation may improve the chance of admission.

Now that we have spent some time exploring our data, let’s construct a regression model.

MULTIPLE LINEAR REGRESSION

In this section, we will construct a multiple linear regression model to predict chance of admission. There are four assumptions associated with a linear regression model which can be tested using diagnostic plots. The assumptions are as follows:

Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of the residuals is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, Y is normally distributed.
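As a rough sketch of how two of these assumptions might be checked numerically alongside the diagnostic plots, the code below fits a small model and inspects its residuals. The data frame `toy_df` and its columns are simulated stand-ins invented for illustration, not the admissions data, and the half-split comparison of residual spread is only an informal check, not a formal test.

```r
# Simulated stand-in data (hypothetical; not the admissions dataset)
set.seed(1)
toy_df <- data.frame(GRE = round(runif(200, 290, 340)))
toy_df$Chance <- -200 + 0.85 * toy_df$GRE + rnorm(200, sd = 5)

toy_fit <- lm(Chance ~ GRE, data = toy_df)
toy_res <- residuals(toy_fit)

# Normality: Shapiro-Wilk test on the residuals
# (a large p-value means no evidence against normality)
normality_p <- shapiro.test(toy_res)$p.value

# Homoscedasticity (informal): compare residual spread for the lower
# and upper halves of the fitted values; a ratio near 1 is reassuring
lower_half <- fitted(toy_fit) <= median(fitted(toy_fit))
spread_ratio <- sd(toy_res[lower_half]) / sd(toy_res[!lower_half])

normality_p
spread_ratio
```

With a fitted model such as `fit1`, the same two lines on `residuals(fit1)` and `fitted(fit1)` would complement, not replace, the plots produced by `plot(fit1)`.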

library(corrplot)
## corrplot 0.84 loaded
#selecting columns to use in correlation plot
corr_df <- df %>%                                                               
  select(GRE, StatementOfPurpose, LetterOfRecommendation,                      
         GPA, Research, ChanceOfAdmission )
df_Numeric_Variable <- select_if(corr_df, is.numeric)                           #selecting numeric variables
#correlation matrix is symmetric about the principal diagonal, so only the lower triangle is plotted
corr <- cor(df_Numeric_Variable) 
#providing custom labels and plotting the matrix
corrplot(corr,method = "number",                                                
         type = "lower", tl.cex=0.9, cl.cex = 0.6, tl.col="black")

The correlation coefficient is a value between -1 and 1, inclusive, and tells how strong or weak the linear correlation is. Values close to +/- 1 represent a strong correlation, where the sign is determined by the direction of the linear slope; values close to +/- 0.5 represent a moderate correlation; and values close to zero indicate little or no linear correlation.
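To make the definition concrete, here is a minimal sketch that computes Pearson's correlation coefficient by hand and compares it with base R's `cor()`. The two short vectors are made-up values for illustration only and do not come from the dataset.

```r
# Hypothetical paired scores, made up for illustration
gre <- c(300, 310, 315, 320, 330, 335)
gpa <- c(3.0, 3.2, 3.1, 3.5, 3.7, 3.9)

# Pearson's r: sum of cross-deviations divided by the square root of
# the product of the two sums of squared deviations
r_manual <- sum((gre - mean(gre)) * (gpa - mean(gpa))) /
  sqrt(sum((gre - mean(gre))^2) * sum((gpa - mean(gpa))^2))

# cor() uses method = "pearson" by default
r_builtin <- cor(gre, gpa)

all.equal(r_manual, r_builtin)
```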

We can see the correlations between the variables in the correlation matrix above. GPA appears to play the most important role in admission, with a correlation of 0.88 with chance of admission, followed by GRE score, which is also highly correlated with chance of admission at 0.81.

With multiple regression, there are several strategies for choosing which predictors to keep in a model. We will use backward elimination: start with all candidate predictors of the response variable and remove the least significant one at each step. We begin with GRE score, GPA, University Ranking, Statement of Purpose, Letter of Recommendation, and Research, and fit a model with all of these predictors. TOEFL score is not included because it is not likely to be relevant for all applicants. We are now ready to create our model.

First Model

fit1 <- lm(ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking +       #creating model using lm function
             StatementOfPurpose + LetterOfRecommendation, data = num_df)
summary(fit1)                                                                   #printing summary statistics
## 
## Call:
## lm(formula = ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking + 
##     StatementOfPurpose + LetterOfRecommendation, data = num_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.0056  -2.4926   0.8537   3.4852  16.3426 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -126.93551   10.79066 -11.763  < 2e-16 ***
## GRE                       0.25938    0.04502   5.761 1.47e-08 ***
## GPA                      31.77303    2.34946  13.524  < 2e-16 ***
## ResearchNo               -2.33025    0.66591  -3.499 0.000509 ***
## UniversityRanking         0.70032    0.38221   1.832 0.067514 .  
## StatementOfPurpose        0.29858    0.45833   0.651 0.515060    
## LetterOfRecommendation    1.68170    0.41760   4.027 6.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.054 on 493 degrees of freedom
## Multiple R-squared:  0.8182, Adjusted R-squared:  0.816 
## F-statistic: 369.9 on 6 and 493 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))                                                             #formatting plot display
plot(fit1)                                                                      #printing diagnostic plots

The p-values for GRE, GPA, Research and Letter of Recommendation are each marked with three asterisks, which suggests that they are meaningful variables for explaining the linear change in chance of admission, but we also need to look at the adjusted R-squared value. It indicates that about 82% of the variation in the observations is explained by the model, which is pretty good. In other words, only about 18% of the variation in the data is not explained by this model.
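As a small worked check, the adjusted R-squared can be reproduced from the other numbers in the summary output. The sketch below uses the standard adjustment formula; the sample size of 500 is inferred from the 493 residual degrees of freedom plus six predictors and the intercept.

```r
r2 <- 0.8182   # Multiple R-squared reported for fit1
p  <- 6        # number of predictors in fit1
n  <- 493 + p + 1   # residual df + predictors + intercept = 500 observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)
```

The result rounds to 0.816, which matches the value printed by `summary(fit1)`.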

We usually pay great attention to regression results, such as p-values, R-squared or adjusted R-squared that tell us how well a model represents given data but that’s not the whole picture. We should also look at diagnostic plots to not only check if the linear regression assumptions are met but to improve our model in an exploratory way.

In this case, the Residuals vs Fitted plot shows residuals spread roughly equally around a horizontal line without distinct patterns, which is a good indication that there are no non-linear relationships. The Normal Q-Q plot shows whether the residuals are normally distributed. Though the residuals don’t exactly follow a straight line, they also don’t deviate severely. Let’s look at the next plot while keeping in mind that observations #10, #66 and #93 might be a potential problem. The Scale-Location plot checks the assumption of equal variance (homoscedasticity). It’s a good sign if the red line is horizontal with the points spread randomly around it. Here the red line is roughly horizontal and the points are spread fairly evenly about it for fitted values less than 70, but less so for values greater than 70. Finally, in the Residuals vs Leverage plot we watch for outlying values in the upper right or lower right corner, as those are the places where cases can be influential against a regression line. The plot identified #92 as an influential observation. Let’s see how the model changes if we exclude the 92nd case from the analysis.
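The Residuals vs Leverage plot is built around Cook's distance, which can also be inspected directly. The following is a minimal sketch on simulated data with one deliberately extreme point; the variables `x` and `y` are hypothetical, and the `4 / n` cutoff is a common rule of thumb, not a threshold used in this analysis.

```r
# Simulated stand-in data with one deliberately extreme point (hypothetical)
set.seed(42)
x <- c(runif(50, 0, 10), 30)
y <- 2 + 3 * x + rnorm(51)
y[51] <- 0                      # far from the trend and far from the other x values
toy_fit <- lm(y ~ x)

# Cook's distance measures how much the fit changes when a case is left out
cook <- cooks.distance(toy_fit)
most_influential <- unname(which.max(cook))
most_influential

# Flag cases above the rule-of-thumb cutoff of 4 / n
flagged <- which(cook > 4 / length(cook))
```

Applied to a model such as `fit1`, `which.max(cooks.distance(fit1))` would give the row that the plot labels as most influential.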

Second Model

outlier_df <- num_df[-c(92), ]                                                  #removing outlier
fit2 <- lm(ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking +       #creating model using lm function
             StatementOfPurpose + LetterOfRecommendation, data = outlier_df)
summary(fit2)                                                                   #printing summary statistics
## 
## Call:
## lm(formula = ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking + 
##     StatementOfPurpose + LetterOfRecommendation, data = outlier_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.0732  -2.4749   0.8039   3.4624  16.1944 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -124.4560    10.7440 -11.584  < 2e-16 ***
## GRE                       0.2561     0.0447   5.730 1.75e-08 ***
## GPA                      31.1691     2.3412  13.314  < 2e-16 ***
## ResearchNo               -2.2852     0.6611  -3.457 0.000594 ***
## UniversityRanking         0.6858     0.3794   1.808 0.071274 .  
## StatementOfPurpose        0.5135     0.4609   1.114 0.265741    
## LetterOfRecommendation    1.6711     0.4145   4.032 6.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.009 on 492 degrees of freedom
## Multiple R-squared:  0.8192, Adjusted R-squared:  0.817 
## F-statistic: 371.5 on 6 and 492 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))                                                             #formatting plot display
plot(fit2)                                                                      #printing diagnostic plots

We don’t notice any significant improvement in the diagnostic plots, except that the Residuals vs Leverage plot now identifies a new value, #96, as an influential observation. Instead of removing this value and testing our model again as we did before, we will return to our summary statistics.

Based on the summary information, the only variables that don’t appear to be as significant as the others are statement of purpose and university ranking. We drop statement of purpose since it has the highest p-value and re-run the model on the full dataset (num_df), which still includes observation #92.
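A single backward-elimination step of this kind can be sketched with base R's `drop1()`, which tests the removal of each term in turn. The data below are simulated stand-ins (the `SOP` column is generated with no effect on the outcome), so this illustrates the mechanics and is not a rerun of the analysis.

```r
# Simulated stand-in data (hypothetical); SOP has no effect by construction
set.seed(7)
n <- 200
toy_df <- data.frame(GRE = rnorm(n, 316, 11),
                     GPA = rnorm(n, 3.4, 0.25),
                     SOP = sample(1:5, n, replace = TRUE))
toy_df$Chance <- -60 + 0.2 * toy_df$GRE + 20 * toy_df$GPA + rnorm(n, sd = 5)

full_fit <- lm(Chance ~ GRE + GPA + SOP, data = toy_df)

# F-test for dropping each term; the first row is "<none>" (no term dropped)
step_tab <- drop1(full_fit, test = "F")
p_vals <- step_tab[["Pr(>F)"]][-1]
names(p_vals) <- rownames(step_tab)[-1]

# Remove the term with the largest p-value and refit
weakest <- names(which.max(p_vals))
reduced_fit <- update(full_fit, as.formula(paste(". ~ . -", weakest)))
weakest
```

Repeating the step until every remaining term is significant gives the same procedure as the manual refits in this section.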

Third Model

fit3 <- lm(ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking +       #creating model using lm function
             LetterOfRecommendation, data = num_df)
summary(fit3)                                                                   #printing summary statistics
## 
## Call:
## lm(formula = ChanceOfAdmission ~ GRE + GPA + Research + UniversityRanking + 
##     LetterOfRecommendation, data = num_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.9197  -2.5216   0.8216   3.5563  16.4160 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -127.77743   10.70675 -11.934  < 2e-16 ***
## GRE                       0.25978    0.04499   5.774 1.37e-08 ***
## GPA                      32.10837    2.29104  14.015  < 2e-16 ***
## ResearchNo               -2.33705    0.66544  -3.512 0.000485 ***
## UniversityRanking         0.79488    0.35337   2.249 0.024926 *  
## LetterOfRecommendation    1.76304    0.39827   4.427 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.05 on 494 degrees of freedom
## Multiple R-squared:  0.8181, Adjusted R-squared:  0.8162 
## F-statistic: 444.3 on 5 and 494 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))                                                             #formatting plot display
plot(fit3)                                                                      #printing diagnostic plots

The diagnostic plots look approximately the same as before, with an improvement in the Residuals vs Leverage plot, where we no longer observe values that might be influential against our regression model. The R-squared and adjusted R-squared values both stayed at about 82%.

The p-value for each of the predictors is marked with three asterisks except for university ranking, which has a p-value of about 0.02. It is still significant at the 0.05 level, but this suggests that it may not be as meaningful as the others. We remove university ranking and run the model one last time.

Final Model

fit4 <- lm(ChanceOfAdmission ~ GRE + GPA + Research +                           #creating model using lm function
             LetterOfRecommendation, data = num_df)
summary(fit4)                                                                   #printing summary statistics
## 
## Call:
## lm(formula = ChanceOfAdmission ~ GRE + GPA + Research + LetterOfRecommendation, 
##     data = num_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.9225  -2.4510   0.6664   3.5002  16.6388 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -134.89444   10.27044 -13.134  < 2e-16 ***
## GRE                       0.27137    0.04488   6.047 2.91e-09 ***
## GPA                      33.58340    2.20418  15.236  < 2e-16 ***
## ResearchNo               -2.42989    0.66687  -3.644 0.000297 ***
## LetterOfRecommendation    2.02221    0.38280   5.283 1.91e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.075 on 495 degrees of freedom
## Multiple R-squared:  0.8162, Adjusted R-squared:  0.8147 
## F-statistic: 549.6 on 4 and 495 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))                                                             #formatting plot display
plot(fit4)                                                                      #printing diagnostic plots

Our diagnostic plots still look about the same, while our adjusted R-squared dropped slightly to 81%, though it may be that we were over-fitting before and getting a higher value. The p-values for GRE, GPA, Research and Letter of Recommendation are all marked with three asterisks, which suggests that they are all meaningful variables. So, we select the simplest (most parsimonious) model, with GRE, GPA, Research, and Letter of Recommendation as predictors of Chance of Admission. We found earlier that chance of admission is highly correlated with GPA and GRE, so it makes sense that these two variables fit in the model.
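To illustrate how the final model turns inputs into a prediction, the sketch below applies the coefficients printed by `summary(fit4)` to one applicant. The applicant's values are hypothetical, and the code assumes that GPA is on the 4.0 scale used in the plots and that chance of admission is expressed as a percentage, as the size of the residuals suggests.

```r
# Coefficients copied from the summary(fit4) output
coefs <- c(intercept = -134.89444, GRE = 0.27137, GPA = 33.58340,
           ResearchNo = -2.42989, LetterOfRecommendation = 2.02221)

# Hypothetical applicant (assumed values, not a row from the dataset)
applicant <- c(intercept = 1, GRE = 320, GPA = 3.5,
               ResearchNo = 0,                # 0 = has research experience, 1 = none
               LetterOfRecommendation = 4)

# Linear prediction: intercept plus each coefficient times its input
predicted_chance <- sum(coefs * applicant[names(coefs)])
round(predicted_chance, 1)
```

Under these assumptions the predicted chance comes out at about 77.6. With the fitted object available, `predict(fit4, newdata = ...)` would do the same calculation without copying coefficients by hand.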

CONCLUSION

Background Research

The requirements for graduate school are generally the same across many programs, though some programs focus more on one component than another, and which component that is varies from school to school. The key question that I wanted to research was to what extent these five factors (academics, statement, letters, research experience, and university ranking) predict your chances of admission into grad school.

According to an article published on usnews.com, grad schools want to see that you have a proven track record of success in your field, as the purpose of graduate school is to develop expertise in a specific academic subject (Kowarski, 2019). GPA and standardized tests are common admissions factors that graduate schools use to determine an applicant’s potential for academic success. The most common standardized test for admissions is the GRE, although how important standardized test scores are to a grad school generally depends on the field of study (Muniz, 2017). I also learned that grad schools want to know whether you are a good fit for their program, so beyond your academic qualifications, your letters of recommendation might be the most important factor of your application (Muniz, 2017). Both sources also listed passion for your field as an obvious component, because graduate programs are usually hyper-focused on a particular academic discipline, and one easy way to express your passion is through your personal statement (Kowarski, 2019). University ranking was not considered as important, and my linear regression model corroborated this finding. So, it appears that the specific qualities that grad schools look for in potential students are factors like good academic standing, relevant work experience, a strong personal statement and strong letters of recommendation, as they can all highlight a deep and prolonged interest in your area of study.

PATTERNS AND SURPRISES

Something that surprised me concerned the statement of purpose, or personal statement, which I have known to be a key component of a graduate school application, as my background research also confirmed. The linear regression model I created, however, deemed the variable insignificant when predicting chance of admission. I figured these results might be due to the fact that I was treating the strength of statement of purpose as a numeric variable and not a factor, so I changed it to a factor and reran my model, which still gave me the same results. It may be that there isn’t a linear relationship between the predictors and the outcome, or that the data were systematically biased when they were collected.
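The numeric-versus-factor comparison described above can be sketched as follows. The data are simulated stand-ins in which `SOP` has no effect, so the code only shows how the two encodings differ and how they might be compared; it is not the model that was actually rerun.

```r
# Simulated stand-in data (hypothetical); SOP takes values 1 to 5 in steps of 0.5
set.seed(3)
n <- 300
toy_df <- data.frame(GPA = rnorm(n, 3.4, 0.25),
                     SOP = sample(seq(1, 5, by = 0.5), n, replace = TRUE))
toy_df$Chance <- 5 + 20 * toy_df$GPA + rnorm(n, sd = 5)

# Numeric encoding: a single slope for SOP
fit_numeric <- lm(Chance ~ GPA + SOP, data = toy_df)

# Factor encoding: one coefficient per SOP level, relative to the baseline level
fit_factor <- lm(Chance ~ GPA + factor(SOP), data = toy_df)

# The numeric model is nested in the factor model, so an F-test can compare them
cmp <- anova(fit_numeric, fit_factor)
cmp
```

A large p-value in the comparison would suggest that the extra flexibility of the factor encoding adds little over a single slope.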

Another thing that surprised me, when comparing levels within a factor variable, was that in this dataset more undergraduate students are involved in undergraduate research than are not. I didn’t know this to be the case, and it is something I would have liked to look into more. We see that a majority of students at above-average-ranking universities participate in research, while about the same number of students at below-average-ranking universities do not, which probably cancels out the effect and leaves a slightly greater overall number of students doing undergraduate research.

In my second project I wrote that I wished to use highcharter to create visualizations and to attempt multiple linear regression with diagnostic plots. I was able to achieve both in this project, so there isn’t anything in particular that I wish I could have included or done differently. Having developed a better understanding of linear regression through this project, however, I would like to attempt logistic regression as well as explore other machine learning models in future projects.

Sources:

Acharya, Mohan S., Asfia Armaan, and Aneeta S. Antony. “A Comparison of Regression Models for Prediction of Graduate Admissions.” IEEE International Conference on Computational Intelligence in Data Science, 2019.

Kowarski, Ilana. “Is Graduate School Right For You?” U.S. News & World Report, 15 Feb. 2019, www.usnews.com/education/best-graduate-schools/articles/2019-02-15/what-graduate-school-is-and-who-should-consider-attending.

Muniz, Hannah. “Grad School Requirements: What You Need for Admission.” Online GRE Prep Blog by PrepScholar, 20 Mar. 2017, www.prepscholar.com/gre/blog/grad-school-requirements/.