Hi, welcome to my Learning by Building (LBB) project on regression models. In this project, I will try to predict a student’s chance of admission to a master’s program. I found the dataset on kaggle.com

1 Load Library

Here are the libraries I will use in this LBB

library(tidyverse)
library(GGally)
library(ggthemes)
library(lmtest)
library(car)
library(corrplot)
library(MLmetrics)

options(scipen = 9999)

2 Load Dataset

Let’s load the data for this project

admission <- read.csv("data input/Admission_Predict_Ver1.1.csv")
admission

The data consists of 9 columns, which are:

  • Serial.No. : Serial number of the student

  • GRE.Score : Graduate Record Examination score

  • TOEFL.Score : Score on the standardized test used to measure the English-language ability of non-native speakers wishing to enroll in English-speaking universities

  • University.Rating : Rating of the university, from 1 to 5

  • SOP : Statement of Purpose

  • LOR : Letter of Recommendation

  • CGPA : Undergraduate GPA

  • Research : Whether the applicant has done research as an undergraduate (1) or not (0)

  • Chance.of.Admit : The variable to be predicted. The score given to

3 Exploratory Data Analysis

Now, let’s check whether each column already has an appropriate data type

str(admission)
## 'data.frame':    500 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

Every column already has an appropriate data type. Now, let’s check whether there are any missing values in the data

colSums(is.na(admission))
##        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
##                 0                 0                 0                 0 
##               SOP               LOR              CGPA          Research 
##                 0                 0                 0                 0 
##   Chance.of.Admit 
##                 0

Our data has no missing values

Now, let’s drop the Serial.No. column, since it only holds the row number

admission <- admission %>% 
  select(-Serial.No.)
admission

Let’s check the summary for the dataset we have

summary(admission)
##    GRE.Score      TOEFL.Score    University.Rating      SOP       
##  Min.   :290.0   Min.   : 92.0   Min.   :1.000     Min.   :1.000  
##  1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000     1st Qu.:2.500  
##  Median :317.0   Median :107.0   Median :3.000     Median :3.500  
##  Mean   :316.5   Mean   :107.2   Mean   :3.114     Mean   :3.374  
##  3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000     3rd Qu.:4.000  
##  Max.   :340.0   Max.   :120.0   Max.   :5.000     Max.   :5.000  
##       LOR             CGPA          Research    Chance.of.Admit 
##  Min.   :1.000   Min.   :6.800   Min.   :0.00   Min.   :0.3400  
##  1st Qu.:3.000   1st Qu.:8.127   1st Qu.:0.00   1st Qu.:0.6300  
##  Median :3.500   Median :8.560   Median :1.00   Median :0.7200  
##  Mean   :3.484   Mean   :8.576   Mean   :0.56   Mean   :0.7217  
##  3rd Qu.:4.000   3rd Qu.:9.040   3rd Qu.:1.00   3rd Qu.:0.8200  
##  Max.   :5.000   Max.   :9.920   Max.   :1.00   Max.   :0.9700

As the summary shows, every variable stays within a plausible range, although a summary alone cannot rule out outliers. Chance.of.Admit ranges from 34% to 97%

Now, let’s check the correlation between our variables. In the chunk below, I sort them from the most to the least correlated with Chance.of.Admit
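A numeric check is a useful complement to reading the summary. The sketch below counts, per column, the values that fall outside the usual 1.5 × IQR boxplot whiskers using base R's boxplot.stats(). The helper name count_outliers and the toy data frame are hypothetical and only serve as an illustration; on the real data you would call count_outliers(admission).

```r
# Count values outside the 1.5 * IQR boxplot whiskers in each numeric column.
# `count_outliers` is a hypothetical helper name used only for this sketch.
count_outliers <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]
  sapply(numeric_cols, function(col) length(boxplot.stats(col)$out))
}

# Toy data (illustrative only): column `a` contains one extreme value
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 100),
  b = c(1, 2, 3, 4, 5, 6)
)

outlier_counts <- count_outliers(toy)
outlier_counts
```

A non-zero count does not mean the rows are wrong; it only flags values worth a second look before modelling.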

numericVars <- which(sapply(admission, is.numeric))

all_numVar <- admission[, numericVars]
cor_numVar <- cor(all_numVar, use = "pairwise.complete.obs")

# Sort on decreasing correlations with Admission Probability
cor_sorted <- as.matrix(sort(cor_numVar[, "Chance.of.Admit"], decreasing = TRUE))

# Selecting high correlations
Cor_High <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.25)))
cor_numVar <- cor_numVar[Cor_High, Cor_High]

corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")

As we can see, most variables are strongly correlated with Chance.of.Admit, and the strongest one is CGPA

Now let’s plot the relationship between Chance.of.Admit and CGPA

ggplot(admission, aes(x = CGPA, y = Chance.of.Admit)) + geom_point(col = "orchid") +
  labs(x = "CGPA", y = "Chance of Admit") +
  labs(title = "College GPA vs Chance of Admit") +
  geom_smooth(method = "lm", se = FALSE, col = "red") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.major.y = element_blank()
  )

From the graph, we can see that the higher your CGPA is, the better your chance of being admitted.

4 Linear Regression

We have seen how strongly CGPA relates to the target. Now, let’s build a model of Chance.of.Admit with CGPA as the only predictor

model_lm1 <- lm(Chance.of.Admit ~ CGPA, admission)
summary(model_lm1)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.276592 -0.028169  0.006619  0.038483  0.176961 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -1.04434    0.04230  -24.69 <0.0000000000000002 ***
## CGPA         0.20592    0.00492   41.85 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06647 on 498 degrees of freedom
## Multiple R-squared:  0.7787, Adjusted R-squared:  0.7782 
## F-statistic:  1752 on 1 and 498 DF,  p-value: < 0.00000000000000022

Using only the most correlated predictor, the Adjusted R-squared we get is 0.7782. Now, let’s see what happens if we put all predictors into the model
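To make the coefficients of model_lm1 concrete, here is a small worked example that plugs them into the regression equation by hand. The coefficients are copied from the summary output above; the CGPA of 9.0 is a hypothetical applicant chosen purely for illustration.

```r
# Coefficients copied from the summary(model_lm1) output above
intercept  <- -1.04434
slope_cgpa <-  0.20592

# Hypothetical applicant with a CGPA of 9.0 (illustrative value)
cgpa <- 9.0

# Chance.of.Admit = intercept + slope * CGPA
pred <- intercept + slope_cgpa * cgpa
round(pred, 3)
# [1] 0.809
```

In words: each extra CGPA point adds about 0.206 to the predicted chance of admission. The negative intercept has no practical meaning, since a CGPA of 0 lies far outside the observed range of 6.8 to 9.92.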

model_lm2 <- lm(Chance.of.Admit ~ ., admission)
summary(model_lm2)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.266657 -0.023327  0.009191  0.033714  0.156818 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.2757251  0.1042962 -12.232 < 0.0000000000000002 ***
## GRE.Score          0.0018585  0.0005023   3.700             0.000240 ***
## TOEFL.Score        0.0027780  0.0008724   3.184             0.001544 ** 
## University.Rating  0.0059414  0.0038019   1.563             0.118753    
## SOP                0.0015861  0.0045627   0.348             0.728263    
## LOR                0.0168587  0.0041379   4.074            0.0000538 ***
## CGPA               0.1183851  0.0097051  12.198 < 0.0000000000000002 ***
## Research           0.0243075  0.0066057   3.680             0.000259 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05999 on 492 degrees of freedom
## Multiple R-squared:  0.8219, Adjusted R-squared:  0.8194 
## F-statistic: 324.4 on 7 and 492 DF,  p-value: < 0.00000000000000022

The Adjusted R-squared using all predictors is higher, 0.8194 compared with 0.7782, although the improvement is modest

Now, let’s check what Adjusted R-squared we get with a stepwise model. In this case, I will use backward elimination

model_lm3 <- step(model_lm2, direction = "backward")
## Start:  AIC=-2805.71
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - SOP                1   0.00043 1.7708 -2807.6
## <none>                           1.7704 -2805.7
## - University.Rating  1   0.00879 1.7792 -2805.2
## - TOEFL.Score        1   0.03648 1.8069 -2797.5
## - Research           1   0.04872 1.8191 -2794.1
## - GRE.Score          1   0.04926 1.8196 -2794.0
## - LOR                1   0.05973 1.8301 -2791.1
## - CGPA               1   0.53542 2.3058 -2675.6
## 
## Step:  AIC=-2807.59
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## <none>                           1.7708 -2807.6
## - University.Rating  1   0.01190 1.7827 -2806.2
## - TOEFL.Score        1   0.03760 1.8084 -2799.1
## - Research           1   0.04893 1.8197 -2796.0
## - GRE.Score          1   0.04901 1.8198 -2795.9
## - LOR                1   0.06892 1.8397 -2790.5
## - CGPA               1   0.55954 2.3304 -2672.3
summary(model_lm3)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research, data = admission)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26617 -0.02321  0.00946  0.03345  0.15713 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.2800138  0.1034717 -12.371 < 0.0000000000000002 ***
## GRE.Score          0.0018528  0.0005016   3.694             0.000246 ***
## TOEFL.Score        0.0028072  0.0008676   3.236             0.001295 ** 
## University.Rating  0.0064279  0.0035318   1.820             0.069363 .  
## LOR                0.0172873  0.0039464   4.380            0.0000145 ***
## CGPA               0.1189994  0.0095344  12.481 < 0.0000000000000002 ***
## Research           0.0243538  0.0065985   3.691             0.000248 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05993 on 493 degrees of freedom
## Multiple R-squared:  0.8219, Adjusted R-squared:  0.8197 
## F-statistic: 379.1 on 6 and 493 DF,  p-value: < 0.00000000000000022
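The AIC values printed by step() above can be reproduced by hand, which helps to see why SOP was dropped. For a linear model, the AIC used by step() is n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients including the intercept. The sketch below uses the rounded RSS values from the step() output, so the last digit can differ slightly; aic_lm is a hypothetical helper name.

```r
# AIC as used by step() for lm objects: n * log(RSS / n) + 2 * k
# `aic_lm` is a hypothetical helper name used only for this sketch.
aic_lm <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 500  # number of observations

# Full model: 7 predictors + intercept = 8 coefficients, RSS = 1.7704
aic_full <- aic_lm(rss = 1.7704, n = n, k = 8)

# Model without SOP: 6 predictors + intercept = 7 coefficients, RSS = 1.7708
aic_no_sop <- aic_lm(rss = 1.7708, n = n, k = 7)

round(c(full = aic_full, without_SOP = aic_no_sop), 1)
#        full without_SOP 
#     -2805.7     -2807.6
```

Dropping SOP barely changes the RSS but saves one parameter, so the AIC goes down. Removing any other predictor would raise the AIC, which is why the backward search stops there.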

From the model, we can see that the predictors kept by the backward elimination are GRE.Score, TOEFL.Score, University.Rating, LOR, CGPA, and Research; only SOP was removed. The Adjusted R-squared from this model is a bit higher, at 0.8197 compared with 0.8194. The difference is small, but it shows that we can eliminate a predictor that is not important without losing any explanatory power.
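Both summaries above report the same Multiple R-squared (0.8219), yet the Adjusted R-squared differs. The reason is the penalty for the number of predictors p in the formula 1 - (1 - R²)(n - 1) / (n - p - 1). A quick check with the rounded values from the output (adj_r2 is a hypothetical helper name):

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
# `adj_r2` is a hypothetical helper name used only for this sketch.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_full <- adj_r2(r2 = 0.8219, n = 500, p = 7)  # model_lm2: all 7 predictors
adj_step <- adj_r2(r2 = 0.8219, n = 500, p = 6)  # model_lm3: SOP removed

round(c(model_lm2 = adj_full, model_lm3 = adj_step), 4)
# model_lm2 model_lm3 
#    0.8194    0.8197
```

With the fit essentially unchanged, the model with fewer predictors is penalised less, which accounts for the small gain.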

5 Model Prediction & Error

Now, let’s create the predictions for each model and evaluate them so we can decide which model to use

5.1 Model Prediction

First, let’s create the prediction for each model

admission$pred_lm1 <- predict(model_lm1, data.frame(CGPA = admission$CGPA))
admission$pred_lm2 <- predict(model_lm2, admission)
# predict() only picks the columns the model needs, so the full data frame can be passed
admission$pred_lm3 <- predict(model_lm3, admission)

5.2 Model Error

Now, let’s evaluate the error for each model. I will use 2 kinds of error metrics

5.2.1 Mean Squared Error (MSE)

MSE(y_pred = admission$pred_lm1, y_true = admission$Chance.of.Admit)
## [1] 0.00440057
MSE(y_pred = admission$pred_lm2, y_true = admission$Chance.of.Admit)
## [1] 0.003540751
MSE(y_pred = admission$pred_lm3, y_true = admission$Chance.of.Admit)
## [1] 0.003541621

5.2.2 Mean Absolute Percentage Error (MAPE)

MAPE(y_pred = admission$pred_lm1, y_true = admission$Chance.of.Admit)
## [1] 0.07678185
MAPE(y_pred = admission$pred_lm2, y_true = admission$Chance.of.Admit)
## [1] 0.06853769
MAPE(y_pred = admission$pred_lm3, y_true = admission$Chance.of.Admit)
## [1] 0.06851627

From the results above, model_lm2 has the smallest MSE, while model_lm3 has a marginally smaller MAPE. The two are practically tied, and both clearly beat model_lm1. The smaller the error, the closer the predictions are to the actual values. So, we will just use model_lm2 and let’s evaluate the model!
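For readers who want to see what the two metrics compute, here is a minimal base-R sketch of both formulas. The three actual and predicted values are made up for illustration and are not taken from the dataset.

```r
# Made-up actual and predicted values (illustrative only)
y_true <- c(0.50, 0.80, 0.40)
y_pred <- c(0.55, 0.72, 0.44)

# Mean Squared Error: average of the squared differences
mse_by_hand <- mean((y_true - y_pred)^2)

# Mean Absolute Percentage Error: average absolute error relative to the actual value
mape_by_hand <- mean(abs((y_true - y_pred) / y_true))

c(MSE = mse_by_hand, MAPE = mape_by_hand)
#    MSE   MAPE 
# 0.0035 0.1000
```

A MAPE of 0.1 means the predictions are off by 10% of the actual value on average. MSE is expressed in squared units of the target, so it punishes large misses more heavily.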

6 Model Evaluation

6.1 Normality

Now, let’s check the normality based on model_lm2 residuals

hist(model_lm2$residuals, 
     main="Histogram of Residual", 
     xlab="Residuals", 
     border="blue", 
     col="lightyellow")

Now, let’s check the normality with shapiro.test

shapiro.test(model_lm2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_lm2$residuals
## W = 0.92549, p-value = 0.000000000000004824

The p-value we get from the test is lower than 0.05, which means the residuals are not normally distributed
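A normal Q-Q plot is a common visual companion to the Shapiro-Wilk test. The sketch below uses simulated residuals (not the model's) to show how the test and the plot react to a left-skewed sample; for the real model you would pass model_lm2$residuals to the same functions.

```r
set.seed(42)

# Simulated stand-ins for residuals (illustrative only)
normal_resid <- rnorm(500)   # roughly normal
skewed_resid <- -rexp(500)   # long left tail

# Shapiro-Wilk: a small p-value rejects the hypothesis of normality
p_normal <- shapiro.test(normal_resid)$p.value
p_skewed <- shapiro.test(skewed_resid)$p.value
c(normal = p_normal, skewed = p_skewed)

# Q-Q plot: points bending away from the line indicate non-normal tails
qqnorm(skewed_resid, main = "Normal Q-Q plot of skewed residuals")
qqline(skewed_resid, col = "red")
```

With a sample as large as 500, the Shapiro-Wilk test picks up even mild departures from normality, so the plot helps judge how severe the departure actually is.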

6.2 Heteroscedasticity

plot(admission$CGPA, model_lm2$residuals,
     xlab= "CGPA",
     ylab = "Residuals")
abline(h = 0, col = "red")

bptest(model_lm2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_lm2
## BP = 30.516, df = 7, p-value = 0.00007634

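The p-value we get from the test is lower than 0.05, so we reject the null hypothesis of constant error variance. This means heteroscedasticity is present and the homoscedasticity assumption is not met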
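To see what the studentized Breusch-Pagan test does under the hood, the sketch below computes it by hand on simulated data: regress the squared residuals on the predictors, then compare n × R² of that auxiliary regression with a chi-squared distribution. The data is simulated so that the noise grows with x; all variable names are hypothetical.

```r
set.seed(123)

# Simulated data whose error spread grows with x (illustrative only)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictors
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n * R-squared of the auxiliary model
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

The null hypothesis is constant variance, so a small p-value points to heteroscedasticity, which is the situation this simulation was built to produce.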

6.3 Variance Inflation Factor (Multicollinearity)

vif(model_lm2)
##         GRE.Score       TOEFL.Score University.Rating               SOP 
##          4.464249          3.904213          2.621036          2.835210 
##               LOR              CGPA          Research 
##          2.033555          4.777992          1.494008

All values are below 10, so there is no serious multicollinearity between the predictors
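The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on the built-in mtcars data, used here only as a stand-in; vif_by_hand is a hypothetical helper name. On the real data, the predictor names of model_lm2 would be passed instead.

```r
# VIF of each predictor: 1 / (1 - R^2) from regressing it on the other predictors.
# `vif_by_hand` is a hypothetical helper name used only for this sketch.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Stand-in example on the built-in mtcars dataset
vifs <- vif_by_hand(mtcars, c("wt", "hp", "disp"))
vifs
```

A VIF of 1 means the predictor is uncorrelated with the others; the larger the value, the more its coefficient's variance is inflated by overlap with the other predictors.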

7 Conclusion

model_lm2 gives an Adjusted R-squared of 0.8194, an MSE of 0.003540751 and a MAPE of 0.06853769, which shows that the predictors together explain most of the variation in Chance.of.Admit, although SOP and University.Rating are not individually significant. The p-value of the model’s F-test is far below 0.05, which means the model as a whole is statistically significant. However, the residuals are neither normally distributed nor homoscedastic, and the errors were measured on the same data used to fit the model, so the model should be used with caution on new data. It might give less accurate results for Chance.of.Admit on the newest data we have.