In the process of applying to graduate school, an applicant's various indicators affect the probability of admission and determine the applicant's relative position among the candidates. Based on graduate admission data from Kaggle, this paper ranks the admission indicators by importance and grades how excellent each applicant is. Using dimensionality reduction, it concludes that GPA, GRE scores and TOEFL scores have the greatest impact on the admission probability; using hierarchical clustering, it confirms that applicants can indeed be divided into two categories; and, from the first and second principal components, it obtains a discriminant criterion with a low error rate.
The dataset contains several parameters that are considered important when applying to UCLA Masters Programs. It is available at: https://www.kaggle.com/mohansacharya/graduate-admissions
The parameters included are: GRE Score (out of 340), TOEFL Score (out of 120), University Rating (1-5), Statement of Purpose strength (SOP, 1-5), Letter of Recommendation strength (LOR, 1-5), undergraduate GPA (out of 10), Research experience (0 or 1), and Chance of Admit (0-1).
First, we take a look at the distribution of different variables and test their normality.
# Load the packages used throughout the analysis
library(ggplot2)   # plots
library(reshape2)  # melt() for the correlation heatmap
library(NbClust)   # choosing the number of clusters
library(MASS)      # lda()
# Import the data file and drop the serial-number column
data <- read.csv('Admission_Predict.csv', header = TRUE)
data <- data[, -1]
names(data) <- c('GRE', 'TOF', 'UNI', 'SOP', 'LOR', 'GPA', 'RES', 'COA')
summary(data)
## GRE TOF UNI SOP
## Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
## 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
## Median :317.0 Median :107.0 Median :3.000 Median :3.500
## Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
## 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
## LOR GPA RES COA
## Min. :1.000 Min. :6.800 Min. :0.00 Min. :0.3400
## 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00 1st Qu.:0.6300
## Median :3.500 Median :8.560 Median :1.00 Median :0.7200
## Mean :3.484 Mean :8.576 Mean :0.56 Mean :0.7217
## 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00 3rd Qu.:0.8200
## Max. :5.000 Max. :9.920 Max. :1.00 Max. :0.9700
# Normality tests (Shapiro-Wilk)
shapiro.test(scale(data[,'COA']))
##
## Shapiro-Wilk normality test
##
## data: scale(data[, "COA"])
## W = 0.98025, p-value = 2.654e-06
shapiro.test(scale(data[,'TOF']))
##
## Shapiro-Wilk normality test
##
## data: scale(data[, "TOF"])
## W = 0.98583, p-value = 8.719e-05
shapiro.test(scale(data[,'GRE']))
##
## Shapiro-Wilk normality test
##
## data: scale(data[, "GRE"])
## W = 0.98574, p-value = 8.202e-05
shapiro.test(scale(data[,'GPA']))
##
## Shapiro-Wilk normality test
##
## data: scale(data[, "GPA"])
## W = 0.99221, p-value = 0.01028
All four p-values are below 0.05, so the hypothesis of normality is rejected for each indicator. In the subsequent analysis, therefore, none of these items is treated as normally distributed.
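The same check can be run over all four indicators in one pass; a minimal sketch (standardizing with scale() does not change the Shapiro-Wilk p-value, so the raw columns can be used directly):
# Shapiro-Wilk p-values for the four continuous indicators
sapply(c('COA', 'TOF', 'GRE', 'GPA'), function(v) shapiro.test(data[[v]])$p.value)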
# COA-GRE
p1 <- ggplot(data, mapping = aes(x = COA, y = GRE)) +
  geom_point(pch = 17, color = "blue", size = 2) +
  geom_smooth(method = "lm", color = "red", linetype = 2) +
  labs(title = "GRE Score")
p1
## `geom_smooth()` using formula 'y ~ x'
# COA-UNI
datax <- data
datax$UNI <- factor(datax$UNI, levels = c(1,2,3,4,5), labels = c("1","2","3","4","5"))
ggplot(datax, aes(x = UNI, y = COA)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5)) +
  theme(legend.position = "none")
# COA-SOP
datax$SOP <- factor(datax$SOP, levels = c(1,1.5,2,2.5,3,3.5,4,4.5,5), labels = c("1","1.5","2","2.5","3","3.5","4","4.5","5"))
ggplot(datax, aes(x = SOP, y = COA)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5)) +
  theme(legend.position = "none")
# COA-LOR
datax$LOR <- factor(datax$LOR, levels = c(1,1.5,2,2.5,3,3.5,4,4.5,5), labels = c("1","1.5","2","2.5","3","3.5","4","4.5","5"))
ggplot(datax, aes(x = LOR, y = COA)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5)) +
  theme(legend.position = "none")
# COA-RES
datax$RES <- factor(datax$RES, levels = c(0,1), labels = c("0","1"))
ggplot(datax, aes(x = RES, y = COA)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5)) +
  theme(legend.position = "none")
In general, the boxplots suggest that all indicators are positively associated with COA.
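To quantify this visual impression, each indicator's correlation with COA can be computed directly (a quick sketch, not part of the original output):
# Pearson correlation of every indicator with COA, sorted in decreasing order
sort(cor(data[, -8], data$COA)[, 1], decreasing = TRUE)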
#Heatmap between variables
data<-read.csv("Admission_Predict.csv",header=TRUE)
data<-data[,-1]
names(data)<-c('GRE','TOF','UNI','SOP','LOR','GPA','RES','COA')
mydata<-data[,-8]
cormat <- round(cor(mydata),2)
# Keep only the upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
upper_tri <- get_upper_tri(cormat)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
# Reorder the correlation matrix by hierarchical clustering on 1 - correlation
# (this helper is called below; a standard definition is assumed)
reorder_cormat <- function(cormat){
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  cormat[hc$order, hc$order]
}
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1, title.position = "top", title.hjust = 0.5))
It can be seen that the three score indicators GRE, GPA and TOF are closely correlated with one another, forming a distinct group that deserves particular attention in the subsequent analysis. RES, which represents research experience, is only weakly correlated with the other six variables, meaning this indicator is relatively independent.
This part attempts to answer the first question: which application indicators matter most? Principal component analysis can describe the relationships among the variables in fewer dimensions and rank the importance of the indicators; factor analysis can uncover the latent factors behind the indicators; and regression analysis can quantify the impact of each indicator on COA.
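As an illustration of the regression idea (a sketch only; this model is not part of the original pipeline), one could regress COA on all seven indicators:
# Linear regression of COA on all seven indicators (illustrative sketch)
fit_lm <- lm(COA ~ GRE + TOF + UNI + SOP + LOR + GPA + RES, data = data)
summary(fit_lm)  # the coefficients quantify each indicator's marginal association with COA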
data_1 <- data[,-8]
# PCA with 7 variables (on the correlation matrix)
test.pr <- princomp(data_1, cor = TRUE)
summary(test.pr, loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.1740179 0.8612498 0.74941834 0.6167376 0.51349449
## Proportion of Variance 0.6751934 0.1059645 0.08023255 0.0543379 0.03766808
## Cumulative Proportion 0.6751934 0.7811579 0.86139044 0.9157283 0.95339642
## Comp.6 Comp.7
## Standard deviation 0.42223111 0.38463740
## Proportion of Variance 0.02546844 0.02113513
## Cumulative Proportion 0.97886487 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## GRE 0.404 0.275 0.363 0.145 0.251 0.738
## TOF 0.401 0.111 0.461 0.127 -0.727 -0.263
## UNI 0.383 -0.250 -0.642 0.607
## SOP 0.385 -0.343 -0.173 -0.327 -0.763 0.123
## LOR 0.347 -0.426 -0.465 0.649 0.211
## GPA 0.421 0.241 0.137 0.627 -0.593
## RES 0.289 0.742 -0.586 -0.105
screeplot(test.pr, type = "lines", main = "PCA_Variance")
With all 7 items in the principal component analysis, the proportion of variance explained by the leading components is not high. One possible reason is that principal component analysis is best suited to data whose variables are strongly correlated. Guided by the correlation matrix, we remove RES and LOR, the two items least correlated with the rest, and repeat the analysis on the remaining five items. The result is as follows:
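The choice of items to drop can be checked numerically; a minimal sketch ranking each variable by its average absolute correlation with the other six:
# Average absolute correlation of each indicator with the other six
avg_abs_cor <- (rowSums(abs(cor(data_1))) - 1) / (ncol(data_1) - 1)
sort(avg_abs_cor)  # RES and LOR are expected to rank lowest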
# PCA without LOR & RES
data_2 <- data_1[, -7]  # drop RES
data_2 <- data_2[, -5]  # drop LOR
test.pr <- princomp(data_2, cor = TRUE)
summary(test.pr, loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.9662492 0.7216676 0.52272571 0.42879985 0.39490322
## Proportion of Variance 0.7732272 0.1041608 0.05464843 0.03677386 0.03118971
## Cumulative Proportion 0.7732272 0.8773880 0.93203643 0.96881029 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## GRE 0.453 0.464 0.279 0.705
## TOF 0.457 0.384 -0.765 -0.242
## UNI 0.428 -0.528 0.732
## SOP 0.425 -0.572 -0.672 0.187
## GPA 0.471 0.177 0.575 -0.639
screeplot(test.pr, type = "lines", main = "PCA_Variance")
With 5 items, the proportion of variance explained rises markedly: the first principal component explains 77.3% of the variance, and the first two together explain 87.7%. In the first principal component the loadings of the 5 items are nearly uniform; the second principal component is dominated by UNI and SOP, whose large negative loadings contrast with the test-score items. These two principal components are used to reduce the dimensionality of the data.
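The principal component scores used later can be taken directly from the fitted object rather than recomputed from hand-copied loadings; a short sketch:
# First two principal component scores for every applicant
pc_scores <- predict(test.pr)[, 1:2]
head(pc_scores)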
In the PCA, this paper obtained the first and second principal components but no meaningful interpretation of them. In this part, factor analysis is used to provide better interpretability. Using the maximum likelihood method, all 7 items are analyzed with three factors and then with two, both without and with varimax rotation. The results are as follows:
# Factor analysis (MLE), 3 factors, no rotation
fit <- factanal(data_1, factors = 3, rotation = "none")
fit
##
## Call:
## factanal(x = data_1, factors = 3, rotation = "none")
##
## Uniquenesses:
## GRE TOF UNI SOP LOR GPA RES
## 0.125 0.199 0.320 0.224 0.427 0.162 0.005
##
## Loadings:
## Factor1 Factor2 Factor3
## GRE 0.670 0.601 -0.255
## TOF 0.721 0.506 -0.157
## UNI 0.632 0.461 0.262
## SOP 0.660 0.443 0.379
## LOR 0.553 0.402 0.325
## GPA 0.738 0.541
## RES 0.996
##
## Factor1 Factor2 Factor3
## SS loadings 2.658 2.473 0.408
## Proportion Var 0.380 0.353 0.058
## Cumulative Var 0.380 0.733 0.791
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 5.14 on 3 degrees of freedom.
## The p-value is 0.162
# 2 factors, no rotation
fit2 <- factanal(data_1, factors = 2, rotation = "none")
fit2
##
## Call:
## factanal(x = data_1, factors = 2, rotation = "none")
##
## Uniquenesses:
## GRE TOF UNI SOP LOR GPA RES
## 0.098 0.220 0.319 0.228 0.427 0.169 0.668
##
## Loadings:
## Factor1 Factor2
## GRE 0.917 -0.249
## TOF 0.878
## UNI 0.772 0.293
## SOP 0.778 0.408
## LOR 0.670 0.351
## GPA 0.911
## RES 0.570
##
## Factor1 Factor2
## SS loadings 4.416 0.455
## Proportion Var 0.631 0.065
## Cumulative Var 0.631 0.696
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 18.31 on 8 degrees of freedom.
## The p-value is 0.019
# 2 factors, varimax rotation
fit3 <- factanal(data_1, factors = 2, rotation = "varimax")
fit3
##
## Call:
## factanal(x = data_1, factors = 2, rotation = "varimax")
##
## Uniquenesses:
## GRE TOF UNI SOP LOR GPA RES
## 0.098 0.220 0.319 0.228 0.427 0.169 0.668
##
## Loadings:
## Factor1 Factor2
## GRE 0.884 0.349
## TOF 0.761 0.449
## UNI 0.443 0.696
## SOP 0.379 0.792
## LOR 0.328 0.682
## GPA 0.715 0.566
## RES 0.510 0.269
##
## Factor1 Factor2
## SS loadings 2.578 2.294
## Proportion Var 0.368 0.328
## Cumulative Var 0.368 0.696
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 18.31 on 8 degrees of freedom.
## The p-value is 0.019
For factor 1, the loadings of the three items GRE/TOF/GPA are higher than those of the other items, so factor 1 can be interpreted as a “learning factor”: it is closely related to the applicant’s learning ability. UNI/SOP/LOR/RES, by contrast, bear only a limited relation to learning ability, so their loadings on factor 1 are naturally small. This pattern is more pronounced after varimax rotation.
Therefore, we reach a conclusion: the three score indicators GRE/TOF/GPA can be interpreted as a ‘learning’ factor that influences the admission result, while the remaining indicators reflect research ability, external environment, academic contacts, and so on, which the learning factor does not fully capture. Real life readily supplies examples: students with excellent academic performance do not necessarily have smooth sailing in scientific research.
This section aims to answer the second question: what makes a good applicant? The dataset includes an item that reflects the probability of admission, COA. An intuitive idea is that applicants with a high probability of admission are excellent, but the overall mean COA is already 0.72, so what counts as a “high” probability?
We cluster the applicants on all 7 items using hierarchical clustering with Euclidean distance and average linkage, determine the optimal number of clusters (which turns out to be 2), and then plot cluster membership against COA. The result is as follows:
# Clustering
data_scale <- scale(data_1)
dist.r <- dist(data_scale, method = "euclidean")
hc.r <- hclust(dist.r, method = "average")
# plot(hc.r, hang = -1, labels = NULL)
nc <- NbClust(data_scale, distance = "euclidean", min.nc = 2, max.nc = 15, method = "average")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
c <- cutree(hc.r, k = 2)
A <- data.frame(NUM = 1:500, COA = data$COA, TREE = c)
TRE <- factor(A$TREE, levels = c(1, 2), labels = c("1", "2"))
ggplot(data = A, aes(x = NUM, y = COA, color = TRE)) + geom_point(size = 2)
data_tre <- cbind(data[, 1:8], group = TRE)
# Mean of each numeric indicator within cluster 1
m <- colMeans(data_tre[data_tre$group == 1, 1:8])
summary(data_tre[which(data_tre$group == 1), ])
## GRE TOF UNI SOP
## Min. :301.0 Min. : 98.0 Min. :2.000 Min. :2.000
## 1st Qu.:320.0 1st Qu.:108.0 1st Qu.:3.000 1st Qu.:3.500
## Median :324.0 Median :111.0 Median :4.000 Median :4.000
## Mean :324.6 Mean :111.3 Mean :3.867 Mean :4.024
## 3rd Qu.:330.0 3rd Qu.:115.0 3rd Qu.:5.000 3rd Qu.:4.500
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
## LOR GPA RES COA group
## Min. :1.500 Min. :7.890 Min. :0.0000 Min. :0.4600 1:249
## 1st Qu.:3.500 1st Qu.:8.720 1st Qu.:1.0000 1st Qu.:0.7600 2: 0
## Median :4.000 Median :9.040 Median :1.0000 Median :0.8200
## Mean :4.008 Mean :9.001 Mean :0.9237 Mean :0.8228
## 3rd Qu.:4.500 3rd Qu.:9.280 3rd Qu.:1.0000 3rd Qu.:0.9000
## Max. :5.000 Max. :9.920 Max. :1.0000 Max. :0.9700
From the figure, applicants divide into two categories: excellent applicants (cluster 1, 249 people, 49.8%) and ordinary applicants (cluster 2, 251 people, 50.2%). The COA of excellent applicants is concentrated above 0.8 (median 0.82). We can therefore take 0.8 as a cut-off for COA: only when an applicant’s admission probability reaches about 0.8 can the application be considered to have a high chance of success.
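The cut-off can be checked against the clustering directly; a quick cross-tabulation sketch:
# How well does the COA >= 0.8 threshold align with the two clusters?
table(cluster = TRE, high_COA = data$COA >= 0.8)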
Obviously, judging an applicant’s excellence only by whether COA exceeds 0.8 is not rigorous. Using the classification obtained from the clustering step, this paper applies linear discriminant analysis (LDA) to derive a criterion separating outstanding applicants from ordinary ones. For the LDA, we project each applicant onto the first and second principal components from the PCA above and discriminate on the principal component scores. The classification result is shown in the following figure:
# LDA on the first two principal component scores
data_s <- data.frame(data_scale)
PC1 <- 0.453*data_s$GRE + 0.457*data_s$TOF + 0.428*data_s$UNI + 0.425*data_s$SOP + 0.471*data_s$GPA
PC2 <- 0.464*data_s$GRE + 0.384*data_s$TOF - 0.528*data_s$UNI - 0.572*data_s$SOP + 0.177*data_s$GPA
PC <- data.frame(PC1 = PC1, PC2 = PC2, TRE = TRE)
L <- lda(TRE ~ PC1 + PC2, data = PC)
p <- ggplot(data = PC, aes(x = PC1, y = PC2, color = TRE)) + geom_point() + theme(legend.position = 'bottom')
p + geom_abline(intercept = 0, slope = 2.986028)  # LDA decision boundary
LPredict <- predict(L)
newGroup <- LPredict$class
tab <- table(TRE, newGroup)
erro <- 1 - sum(diag(prop.table(tab)))  # apparent error rate
The discriminant function is y = -0.8228 PC1 + 0.2756 PC2. The apparent error rate (APER) of this rule is 6.2%, a good result for a classifier built on labels from cluster analysis.
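To classify a new applicant with this rule, standardize the scores with the training means and standard deviations, compute the two principal component scores, and call predict(); a sketch with hypothetical values:
# Hypothetical new applicant (GRE 325, TOEFL 112, UNI 4, SOP 4, GPA 9.1 -- made-up values)
ctr <- attr(data_scale, "scaled:center")
scl <- attr(data_scale, "scaled:scale")
new_raw <- c(GRE = 325, TOF = 112, UNI = 4, SOP = 4, GPA = 9.1)
z <- (new_raw - ctr[names(new_raw)]) / scl[names(new_raw)]
new_pc <- data.frame(PC1 = sum(c(0.453, 0.457, 0.428, 0.425, 0.471) * z),
                     PC2 = sum(c(0.464, 0.384, -0.528, -0.572, 0.177) * z))
predict(L, new_pc)$class  # "1" = excellent, "2" = ordinary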
Among the 7 application indicators, undergraduate GPA, GRE score and TOEFL score have the greatest impact on an applicant’s admission probability and are the key indicators for admission.
By cluster analysis, applicants can be divided into two categories, excellent applicants and ordinary applicants. Using the dimensionality reduction from PCA, new applicants can then be classified by the following criterion: compute y = −0.8228 ∗ PC1 + 0.2756 ∗ PC2; if y < 0 (equivalently, if PC2 < 2.986 ∗ PC1, the boundary plotted above), the applicant is classified as excellent.
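The criterion can be packaged as a small helper (a sketch; the zero cut-off follows from the boundary PC2 = 2.986 PC1 plotted above, under the assumption that negative discriminant scores correspond to the excellent cluster):
# Decision rule from the discriminant function y = -0.8228*PC1 + 0.2756*PC2
is_excellent <- function(PC1, PC2) {
  y <- -0.8228 * PC1 + 0.2756 * PC2
  y < 0  # TRUE = excellent applicant, FALSE = ordinary
}
is_excellent(PC1 = 1.5, PC2 = 0.2)  # a strong profile -> TRUE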