In the process of applying to graduate school, an applicant's various indicators affect the probability of admission and determine the applicant's relative position among the candidates. Based on graduate admission data from Kaggle, this paper ranks the admission indicators by importance and grades how excellent each applicant is. Using dimensionality reduction, it concludes that GPA, GRE scores and TOEFL scores have the greatest impact on the admission probability; using hierarchical clustering, it confirms that applicants can indeed be divided into two categories; and, from the first and second principal components, it obtains a discriminant criterion with a low error rate.
The dataset contains several parameters that are considered important when applying to UCLA Masters Programs. It is available at: https://www.kaggle.com/mohansacharya/graduate-admissions
The parameters included are: GRE Score (out of 340), TOEFL Score (out of 120), University Rating (1-5), Statement of Purpose strength (SOP, 1-5), Letter of Recommendation strength (LOR, 1-5), undergraduate GPA (out of 10), Research experience (0 or 1), and Chance of Admit (0-1).
First, we take a look at the distribution of different variables and test their normality.
# Load the packages used throughout the analysis
library(ggplot2)   # plots
library(reshape2)  # melt() for the correlation heatmap
library(NbClust)   # choosing the number of clusters
library(MASS)      # lda()
# Import the data file and drop the serial-number column
data <- read.csv('Admission_Predict.csv', header = TRUE)
data <- data[, -1]
names(data) <- c('GRE', 'TOF', 'UNI', 'SOP', 'LOR', 'GPA', 'RES', 'COA')
summary(data)
## GRE TOF UNI SOP
## Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
## 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
## Median :317.0 Median :107.0 Median :3.000 Median :3.500
## Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
## 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
## LOR GPA RES COA
## Min. :1.000 Min. :6.800 Min. :0.00 Min. :0.3400
## 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00 1st Qu.:0.6300
## Median :3.500 Median :8.560 Median :1.00 Median :0.7200
## Mean :3.484 Mean :8.576 Mean :0.56 Mean :0.7217
## 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00 3rd Qu.:0.8200
## Max. :5.000 Max. :9.920 Max. :1.00 Max. :0.9700
# Normality tests (Shapiro-Wilk)
shapiro.test(scale(data[,'COA']))
##
## Shapiro-Wilk normality test
##
## data: scale(data[, "COA"])
## W = 0.98025, p-value = 2.654e-06
shapiro.test(scale(data[,'TOF']))
##
## Shapiro-Wilk normality test
##
## data: scale(data[, "TOF"])
## W = 0.98583, p-value = 8.719e-05
shapiro.test(scale(data[,'GRE']))
##
## Shapiro-Wilk normality test
##
## data: scale(data[, "GRE"])
## W = 0.98574, p-value = 8.202e-05
shapiro.test(scale(data[,'GPA']))
##
## Shapiro-Wilk normality test
##
## data: scale(data[, "GPA"])
## W = 0.99221, p-value = 0.01028
All four p-values are below 0.05, so the hypothesis of normality is rejected for each indicator. In the subsequent analysis, therefore, none of these items is treated as normally distributed.
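The same check can be run over all four indicators in one pass; a minimal sketch (standardizing with scale() does not change the Shapiro-Wilk p-value, so the raw columns can be used directly):
# Shapiro-Wilk p-values for the four continuous indicators
sapply(c('COA', 'TOF', 'GRE', 'GPA'), function(v) shapiro.test(data[[v]])$p.value)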
# COA-GRE
p1 <- ggplot(data, mapping = aes(x = COA, y = GRE)) +
  geom_point(pch = 17, color = "blue", size = 2) +
  geom_smooth(method = "lm", color = "red", linetype = 2) +
  labs(title = "GRE Score")
p1
## `geom_smooth()` using formula 'y ~ x'
# COA-UNI
datax <- data
datax$UNI <- factor(datax$UNI, levels = c(1,2,3,4,5), labels = c("1","2","3","4","5"))
ggplot(datax, aes(x = UNI, y = COA)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5)) +
  theme(legend.position = "none")
# COA-SOP
datax$SOP <- factor(datax$SOP, levels = c(1,1.5,2,2.5,3,3.5,4,4.5,5), labels = c("1","1.5","2","2.5","3","3.5","4","4.5","5"))
ggplot(datax, aes(x = SOP, y = COA)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5)) +
  theme(legend.position = "none")
# COA-LOR
datax$LOR <- factor(datax$LOR, levels = c(1,1.5,2,2.5,3,3.5,4,4.5,5), labels = c("1","1.5","2","2.5","3","3.5","4","4.5","5"))
ggplot(datax, aes(x = LOR, y = COA)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5)) +
  theme(legend.position = "none")
# COA-RES
datax$RES <- factor(datax$RES, levels = c(0,1), labels = c("0","1"))
ggplot(datax, aes(x = RES, y = COA)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5)) +
  theme(legend.position = "none")
In general, the boxplots suggest that all indicators are positively associated with COA.
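To quantify this visual impression, each indicator's correlation with COA can be computed directly (a quick sketch, not part of the original output):
# Pearson correlation of every indicator with COA, sorted in decreasing order
sort(cor(data[, -8], data$COA)[, 1], decreasing = TRUE)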
#Heatmap between variables
data<-read.csv("Admission_Predict.csv",header=TRUE)
data<-data[,-1]
names(data)<-c('GRE','TOF','UNI','SOP','LOR','GPA','RES','COA')
mydata<-data[,-8]
cormat <- round(cor(mydata),2)
# Keep only the upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
upper_tri <- get_upper_tri(cormat)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
# Reorder the correlation matrix by hierarchical clustering on 1 - correlation
# (this helper is called below; a standard definition is assumed)
reorder_cormat <- function(cormat){
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  cormat[hc$order, hc$order]
}
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1, title.position = "top", title.hjust = 0.5))
It can be seen that the three score indicators GRE, GPA and TOF are closely correlated with one another, forming a distinct group that deserves particular attention in the subsequent analysis. RES, which represents research experience, is only weakly correlated with the other six variables, meaning this indicator is relatively independent.
This part attempts to answer the first question: which application indicators matter most? Principal component analysis can describe the relationships among the variables in fewer dimensions and rank the importance of the indicators; factor analysis can uncover the latent factors behind the indicators; and regression analysis can quantify the impact of each indicator on COA.
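As an illustration of the regression idea (a sketch only; this model is not part of the original pipeline), one could regress COA on all seven indicators:
# Linear regression of COA on all seven indicators (illustrative sketch)
fit_lm <- lm(COA ~ GRE + TOF + UNI + SOP + LOR + GPA + RES, data = data)
summary(fit_lm)  # the coefficients quantify each indicator's marginal association with COA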
data_1 <- data[,-8]
# PCA with 7 variables (on the correlation matrix)
test.pr <- princomp(data_1, cor = TRUE)
summary(test.pr, loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.1740179 0.8612498 0.74941834 0.6167376 0.51349449
## Proportion of Variance 0.6751934 0.1059645 0.08023255 0.0543379 0.03766808
## Cumulative Proportion 0.6751934 0.7811579 0.86139044 0.9157283 0.95339642
## Comp.6 Comp.7
## Standard deviation 0.42223111 0.38463740
## Proportion of Variance 0.02546844 0.02113513
## Cumulative Proportion 0.97886487 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## GRE 0.404 0.275 0.363 0.145 0.251 0.738
## TOF 0.401 0.111 0.461 0.127 -0.727 -0.263
## UNI 0.383 -0.250 -0.642 0.607
## SOP 0.385 -0.343 -0.173 -0.327 -0.763 0.123
## LOR 0.347 -0.426 -0.465 0.649 0.211
## GPA 0.421 0.241 0.137 0.627 -0.593
## RES 0.289 0.742 -0.586 -0.105
screeplot(test.pr, type = "lines", main = "PCA_Variance")
With all 7 items in the principal component analysis, the proportion of variance explained by the leading components is not high. One possible reason is that principal component analysis is best suited to data whose variables are strongly correlated. Guided by the correlation matrix, we remove RES and LOR, the two items least correlated with the rest, and repeat the analysis on the remaining five items. The result is as follows:
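The choice of items to drop can be checked numerically; a minimal sketch ranking each variable by its average absolute correlation with the other six:
# Average absolute correlation of each indicator with the other six
avg_abs_cor <- (rowSums(abs(cor(data_1))) - 1) / (ncol(data_1) - 1)
sort(avg_abs_cor)  # RES and LOR are expected to rank lowest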
# PCA without LOR & RES
data_2 <- data_1[, -7]  # drop RES
data_2 <- data_2[, -5]  # drop LOR
test.pr <- princomp(data_2, cor = TRUE)
summary(test.pr, loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.9662492 0.7216676 0.52272571 0.42879985 0.39490322
## Proportion of Variance 0.7732272 0.1041608 0.05464843 0.03677386 0.03118971
## Cumulative Proportion 0.7732272 0.8773880 0.93203643 0.96881029 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## GRE 0.453 0.464 0.279 0.705
## TOF 0.457 0.384 -0.765 -0.242
## UNI 0.428 -0.528 0.732
## SOP 0.425 -0.572 -0.672 0.187
## GPA 0.471 0.177 0.575 -0.639
screeplot(test.pr, type = "lines", main = "PCA_Variance")
With 5 items, the proportion of variance explained rises markedly: the first principal component explains 77.3% of the variance, and the first two together explain 87.7%. In the first principal component the loadings of the 5 items are nearly uniform; the second principal component is dominated by UNI and SOP, whose large negative loadings contrast with the test-score items. These two principal components are used to reduce the dimensionality of the data.
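The principal component scores used later can be taken directly from the fitted object rather than recomputed from hand-copied loadings; a short sketch:
# First two principal component scores for every applicant
pc_scores <- predict(test.pr)[, 1:2]
head(pc_scores)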
In the PCA, this paper obtained the first and second principal components but no meaningful interpretation of them. In this part, factor analysis is used to provide better interpretability. Using the maximum likelihood method, all 7 items are analyzed with three factors and then with two, both without and with varimax rotation. The results are as follows:
# Factor analysis (MLE), 3 factors, no rotation
fit <- factanal(data_1, factors = 3, rotation = "none")
fit
##
## Call:
## factanal(x = data_1, factors = 3, rotation = "none")
##
## Uniquenesses:
## GRE TOF UNI SOP LOR GPA RES
## 0.125 0.199 0.320 0.224 0.427 0.162 0.005
##
## Loadings:
## Factor1 Factor2 Factor3
## GRE 0.670 0.601 -0.255
## TOF 0.721 0.506 -0.157
## UNI 0.632 0.461 0.262
## SOP 0.660 0.443 0.379
## LOR 0.553 0.402 0.325
## GPA 0.738 0.541
## RES 0.996
##
## Factor1 Factor2 Factor3
## SS loadings 2.658 2.473 0.408
## Proportion Var 0.380 0.353 0.058
## Cumulative Var 0.380 0.733 0.791
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 5.14 on 3 degrees of freedom.
## The p-value is 0.162
# 2 factors, no rotation
fit2 <- factanal(data_1, factors = 2, rotation = "none")
fit2
##
## Call:
## factanal(x = data_1, factors = 2, rotation = "none")
##
## Uniquenesses:
## GRE TOF UNI SOP LOR GPA RES
## 0.098 0.220 0.319 0.228 0.427 0.169 0.668
##
## Loadings:
## Factor1 Factor2
## GRE 0.917 -0.249
## TOF 0.878
## UNI 0.772 0.293
## SOP 0.778 0.408
## LOR 0.670 0.351
## GPA 0.911
## RES 0.570
##
## Factor1 Factor2
## SS loadings 4.416 0.455
## Proportion Var 0.631 0.065
## Cumulative Var 0.631 0.696
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 18.31 on 8 degrees of freedom.
## The p-value is 0.019
# 2 factors, varimax rotation
fit3 <- factanal(data_1, factors = 2, rotation = "varimax")
fit3
##
## Call:
## factanal(x = data_1, factors = 2, rotation = "varimax")
##
## Uniquenesses:
## GRE TOF UNI SOP LOR GPA RES
## 0.098 0.220 0.319 0.228 0.427 0.169 0.668
##
## Loadings:
## Factor1 Factor2
## GRE 0.884 0.349
## TOF 0.761 0.449
## UNI 0.443 0.696
## SOP 0.379 0.792
## LOR 0.328 0.682
## GPA 0.715 0.566
## RES 0.510 0.269
##
## Factor1 Factor2
## SS loadings 2.578 2.294
## Proportion Var 0.368 0.328
## Cumulative Var 0.368 0.696
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 18.31 on 8 degrees of freedom.
## The p-value is 0.019
For factor 1, the loadings of the three items GRE/TOF/GPA are higher than those of the other items, so factor 1 can be interpreted as a “learning factor”: it is closely related to the applicant’s learning ability. UNI/SOP/LOR/RES, by contrast, bear only a limited relation to learning ability, so their loadings on factor 1 are naturally small. This pattern is more pronounced after varimax rotation.
Therefore, we reach a conclusion: the three score indicators GRE/TOF/GPA can be interpreted as a ‘learning’ factor that influences the admission result, while the remaining indicators reflect research ability, external environment, academic contacts, and so on, which the learning factor does not fully capture. Real life readily supplies examples: students with excellent academic performance do not necessarily have smooth sailing in scientific research.
This section aims to answer the second question: what makes a good applicant? The dataset includes an item that reflects the probability of admission, COA. An intuitive idea is that applicants with a high probability of admission are excellent, but the overall mean COA is already 0.72, so what counts as a “high” probability?
We cluster the applicants on all 7 items using hierarchical clustering with Euclidean distance and average linkage, determine the optimal number of clusters (which turns out to be 2), and then plot cluster membership against COA. The result is as follows:
# Clustering
data_scale <- scale(data_1)
dist.r <- dist(data_scale, method = "euclidean")
hc.r <- hclust(dist.r, method = "average")
# plot(hc.r, hang = -1, labels = NULL)
nc <- NbClust(data_scale, distance = "euclidean", min.nc = 2, max.nc = 15, method = "average")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
c <- cutree(hc.r, k = 2)
A <- data.frame(NUM = 1:500, COA = data$COA, TREE = c)
TRE <- factor(A$TREE, levels = c(1, 2), labels = c("1", "2"))
ggplot(data = A, aes(x = NUM, y = COA, color = TRE)) + geom_point(size = 2)
data_tre <- cbind(data[, 1:8], group = TRE)
# Mean of each numeric indicator within cluster 1
m <- colMeans(data_tre[data_tre$group == 1, 1:8])
summary(data_tre[which(data_tre$group == 1), ])
## GRE TOF UNI SOP
## Min. :301.0 Min. : 98.0 Min. :2.000 Min. :2.000
## 1st Qu.:320.0 1st Qu.:108.0 1st Qu.:3.000 1st Qu.:3.500
## Median :324.0 Median :111.0 Median :4.000 Median :4.000
## Mean :324.6 Mean :111.3 Mean :3.867 Mean :4.024
## 3rd Qu.:330.0 3rd Qu.:115.0 3rd Qu.:5.000 3rd Qu.:4.500
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
## LOR GPA RES COA group
## Min. :1.500 Min. :7.890 Min. :0.0000 Min. :0.4600 1:249
## 1st Qu.:3.500 1st Qu.:8.720 1st Qu.:1.0000 1st Qu.:0.7600 2: 0
## Median :4.000 Median :9.040 Median :1.0000 Median :0.8200
## Mean :4.008 Mean :9.001 Mean :0.9237 Mean :0.8228
## 3rd Qu.:4.500 3rd Qu.:9.280 3rd Qu.:1.0000 3rd Qu.:0.9000
## Max. :5.000 Max. :9.920 Max. :1.0000 Max. :0.9700
From the figure, applicants divide into two categories: excellent applicants (cluster 1, 249 people, 49.8%) and ordinary applicants (cluster 2, 251 people, 50.2%). The COA of excellent applicants is concentrated above 0.8 (median 0.82). We can therefore take 0.8 as a cut-off for COA: only when an applicant’s admission probability reaches about 0.8 can the application be considered to have a high chance of success.
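The cut-off can be checked against the clustering directly; a quick cross-tabulation sketch:
# How well does the COA >= 0.8 threshold align with the two clusters?
table(cluster = TRE, high_COA = data$COA >= 0.8)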
Obviously, judging an applicant’s excellence only by whether COA exceeds 0.8 is not rigorous. Using the classification obtained from the clustering step, this paper applies linear discriminant analysis (LDA) to derive a criterion separating outstanding applicants from ordinary ones. For the LDA, we project each applicant onto the first and second principal components from the PCA above and discriminate on the principal component scores. The classification result is shown in the following figure:
# LDA on the first two principal component scores
data_s <- data.frame(data_scale)
PC1 <- 0.453*data_s$GRE + 0.457*data_s$TOF + 0.428*data_s$UNI + 0.425*data_s$SOP + 0.471*data_s$GPA
PC2 <- 0.464*data_s$GRE + 0.384*data_s$TOF - 0.528*data_s$UNI - 0.572*data_s$SOP + 0.177*data_s$GPA
PC <- data.frame(PC1 = PC1, PC2 = PC2, TRE = TRE)
L <- lda(TRE ~ PC1 + PC2, data = PC)
p <- ggplot(data = PC, aes(x = PC1, y = PC2, color = TRE)) + geom_point() + theme(legend.position = 'bottom')
p + geom_abline(intercept = 0, slope = 2.986028)  # LDA decision boundary
LPredict <- predict(L)
newGroup <- LPredict$class
tab <- table(TRE, newGroup)
erro <- 1 - sum(diag(prop.table(tab)))  # apparent error rate
The discriminant function is y = -0.8228 PC1 + 0.2756 PC2. The apparent error rate (APER) of this rule is 6.2%, a good result for a classifier built on labels from cluster analysis.
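To classify a new applicant with this rule, standardize the scores with the training means and standard deviations, compute the two principal component scores, and call predict(); a sketch with hypothetical values:
# Hypothetical new applicant (GRE 325, TOEFL 112, UNI 4, SOP 4, GPA 9.1 -- made-up values)
ctr <- attr(data_scale, "scaled:center")
scl <- attr(data_scale, "scaled:scale")
new_raw <- c(GRE = 325, TOF = 112, UNI = 4, SOP = 4, GPA = 9.1)
z <- (new_raw - ctr[names(new_raw)]) / scl[names(new_raw)]
new_pc <- data.frame(PC1 = sum(c(0.453, 0.457, 0.428, 0.425, 0.471) * z),
                     PC2 = sum(c(0.464, 0.384, -0.528, -0.572, 0.177) * z))
predict(L, new_pc)$class  # "1" = excellent, "2" = ordinary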
Among the 7 application indicators, undergraduate GPA, GRE score and TOEFL score have the greatest impact on an applicant’s admission probability and are the key indicators for admission.
By cluster analysis, applicants can be divided into two categories, excellent applicants and ordinary applicants. Using the dimensionality reduction from PCA, new applicants can then be classified by the following criterion: compute y = −0.8228 ∗ PC1 + 0.2756 ∗ PC2; if y < 0 (equivalently, if PC2 < 2.986 ∗ PC1, the boundary plotted above), the applicant is classified as excellent.
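The criterion can be packaged as a small helper (a sketch; the zero cut-off follows from the boundary PC2 = 2.986 PC1 plotted above, under the assumption that negative discriminant scores correspond to the excellent cluster):
# Decision rule from the discriminant function y = -0.8228*PC1 + 0.2756*PC2
is_excellent <- function(PC1, PC2) {
  y <- -0.8228 * PC1 + 0.2756 * PC2
  y < 0  # TRUE = excellent applicant, FALSE = ordinary
}
is_excellent(PC1 = 1.5, PC2 = 0.2)  # a strong profile -> TRUE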