Introduction

Hi Everyone!

Welcome to my first data science project. The Graduate Admission project is based on a problem faced by students applying to postgraduate programs: when choosing a university, many factors can influence their chances of admission, yet applicants usually have to judge those chances from their own previous exam experience. In this project, we therefore use machine learning (regression modeling) to find out which factors have a significant effect on the chance of being admitted to a postgraduate program.

Here are the steps we'll follow:

1. Library

library(GGally) # correlation heatmap
library(tidyverse) # data wrangling
library(MLmetrics) # MAPE value
library(lmtest) # Breusch-Pagan test
library(car) # multicollinearity (VIF)
library(inspectdf) # data inspection
library(performance) # compare model performance

2. Input Data

graduate <- read.csv("Admission_Predict.csv") # store in the `graduate` object

3. Data Understanding

head(graduate)
tail(graduate)
dim(graduate)
## [1] 400   9
names(graduate)
## [1] "Serial.No."        "GRE.Score"         "TOEFL.Score"      
## [4] "University.Rating" "SOP"               "LOR"              
## [7] "CGPA"              "Research"          "Chance.of.Admit"

Variable Descriptions

GRE.Score : The Graduate Record Examination (GRE) is a standardized assessment test required when applying to many postgraduate programs

TOEFL.Score : TOEFL Score

University.Rating : University Rating

SOP : Statement of Purpose, a short essay describing an applicant's educational background, achievements, and future goals

LOR : Letter of Recommendation

CGPA : College Grade Point Average, the cumulative grade obtained across all semesters taken so far

Research : Research experience (1 if the applicant has done research, 0 otherwise)

Chance.of.Admit : The chance of being admitted to the university (our target variable)

summary(graduate)
##    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
##  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000    
##  1st Qu.:100.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000    
##  Median :200.5   Median :317.0   Median :107.0   Median :3.000    
##  Mean   :200.5   Mean   :316.8   Mean   :107.4   Mean   :3.087    
##  3rd Qu.:300.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000    
##  Max.   :400.0   Max.   :340.0   Max.   :120.0   Max.   :5.000    
##       SOP           LOR             CGPA          Research     
##  Min.   :1.0   Min.   :1.000   Min.   :6.800   Min.   :0.0000  
##  1st Qu.:2.5   1st Qu.:3.000   1st Qu.:8.170   1st Qu.:0.0000  
##  Median :3.5   Median :3.500   Median :8.610   Median :1.0000  
##  Mean   :3.4   Mean   :3.453   Mean   :8.599   Mean   :0.5475  
##  3rd Qu.:4.0   3rd Qu.:4.000   3rd Qu.:9.062   3rd Qu.:1.0000  
##  Max.   :5.0   Max.   :5.000   Max.   :9.920   Max.   :1.0000  
##  Chance.of.Admit 
##  Min.   :0.3400  
##  1st Qu.:0.6400  
##  Median :0.7300  
##  Mean   :0.7244  
##  3rd Qu.:0.8300  
##  Max.   :0.9700

Insights

  1. There are 400 observations and 9 variables in our data, with Chance.of.Admit as the target class.

  2. Serial.No. is a unique ID, so we can remove this variable.

  3. The minimum GRE.Score is 290 and the maximum is 340. The mean GRE.Score across all applicants is 316.8.

  4. The minimum TOEFL.Score is 92 and the maximum is 120. The mean TOEFL.Score across all applicants is 107.4.

  5. Research only takes the values 0 or 1, indicating whether an applicant already has research experience (1) or not (0).

4. Data Wrangling

# change the data type

graduate <- graduate %>% 
  mutate(Research = as.factor(Research))
# check for missing values

colSums(is.na(graduate))
##        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
##                 0                 0                 0                 0 
##               SOP               LOR              CGPA          Research 
##                 0                 0                 0                 0 
##   Chance.of.Admit 
##                 0

There are no missing values in our dataset.

5. Exploratory Data Analysis

In the next step, we explore the data to see what insights we can draw from it.

Data Distribution

To visualize the distribution of all variables, we use boxplots. The results show outliers in three variables: CGPA, Chance.of.Admit, and LOR.
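The plotting code was not included in the original write-up; a minimal sketch that reproduces this kind of boxplot (scaling the numeric variables so they share one axis) could look like:

# sketch: boxplots of the scaled numeric variables to spot outliers
graduate %>%
  select(-Serial.No., -Research) %>%
  scale() %>%
  as.data.frame() %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  coord_flip()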

Correlation

Before building the model, we want to test whether each independent variable has a linear relationship with the dependent variable.

Of the predictors, only Serial.No. has no linear correlation with the target class, which makes sense since it contains nothing but unique IDs. All the other predictors are positively correlated with the target, and CGPA has by far the strongest positive correlation.
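The correlation plot itself was not shown in the text; since GGally was loaded for a heatmap, a minimal sketch of it could be:

# sketch: pairwise correlation heatmap (ggcorr drops non-numeric columns
# such as the Research factor, with a warning)
ggcorr(graduate, label = TRUE, label_size = 3)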

Given this insight, we can remove the Serial.No. variable to simplify the next analysis steps.

graduate_clean <- graduate %>% select(-Serial.No.)

6. Building Linear Regression Model

Now that we know how the predictors correlate with the dependent variable, we start by fitting a simple linear regression model with CGPA as the only predictor.

model_lm <- lm(formula = Chance.of.Admit~CGPA, graduate_clean)

summary(model_lm)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = graduate_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.274575 -0.030084  0.009443  0.041954  0.180734 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -1.07151    0.05034  -21.29 <0.0000000000000002 ***
## CGPA         0.20885    0.00584   35.76 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06957 on 398 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.762 
## F-statistic:  1279 on 1 and 398 DF,  p-value: < 0.00000000000000022

Based on the summary of model_lm, the predictor CGPA is statistically significant for the target class (Chance.of.Admit), as indicated by its p-value < 0.05. The predictor explains about 76% of the variance in the target (R-squared = 0.7626); the rest is explained by other factors. The CGPA coefficient of about 0.21 means that the higher an applicant's CGPA, the higher their predicted chance of admission (Chance.of.Admit).
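To make the interpretation concrete, we can predict the admission chance of a hypothetical applicant (the CGPA value below is made up for illustration):

# sketch: predicted chance of admission for a hypothetical applicant with CGPA 9.0
predict(model_lm, newdata = data.frame(CGPA = 9.0))
# roughly -1.072 + 0.209 * 9.0 = 0.81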

Is model_lm enough on its own? What about the other predictors? Next, we use all the predictors to see whether anything changes.

Multiple Linear Regression

model <- lm(formula = Chance.of.Admit~., graduate_clean)

summary(model)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = graduate_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26259 -0.02103  0.01005  0.03628  0.15928 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.2594325  0.1247307 -10.097 < 0.0000000000000002 ***
## GRE.Score          0.0017374  0.0005979   2.906              0.00387 ** 
## TOEFL.Score        0.0029196  0.0010895   2.680              0.00768 ** 
## University.Rating  0.0057167  0.0047704   1.198              0.23150    
## SOP               -0.0033052  0.0055616  -0.594              0.55267    
## LOR                0.0223531  0.0055415   4.034             0.000066 ***
## CGPA               0.1189395  0.0122194   9.734 < 0.0000000000000002 ***
## Research1          0.0245251  0.0079598   3.081              0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared:  0.8035, Adjusted R-squared:    0.8 
## F-statistic: 228.9 on 7 and 392 DF,  p-value: < 0.00000000000000022

Based on the summary of model, there are several findings. All predictors are significant for the target class, except the variables University.Rating and SOP. In addition, all the significant coefficients are positive: the higher those scores, the more likely an applicant is to pass the postgraduate admission program.

Comparing model with model_lm, the R-squared increases by about 4 percentage points after adding six variables. Does the higher R-squared alone mean this model is better? We should also check the Adjusted R-squared, which penalizes useless predictors; compared to the previous model, the Adjusted R-squared of model has also increased. Still, we want to know which predictors best explain the variance of the target class, so in the next stage we perform feature selection to obtain the best predictors.

Backward Step-wise Regression

To keep only the predictors that matter, we apply backward stepwise regression to the multiple linear regression model.

model_back <- step(model, direction = "backward", trace = F)

summary(model_back)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = graduate_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263542 -0.023297  0.009879  0.038078  0.159897 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.2984636  0.1172905 -11.070 < 0.0000000000000002 ***
## GRE.Score    0.0017820  0.0005955   2.992              0.00294 ** 
## TOEFL.Score  0.0030320  0.0010651   2.847              0.00465 ** 
## LOR          0.0227762  0.0048039   4.741           0.00000297 ***
## CGPA         0.1210042  0.0117349  10.312 < 0.0000000000000002 ***
## Research1    0.0245769  0.0079203   3.103              0.00205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared:  0.8027, Adjusted R-squared:  0.8002 
## F-statistic: 320.6 on 5 and 394 DF,  p-value: < 0.00000000000000022

The stepwise model is evaluated using the AIC (Akaike Information Criterion), which estimates how much information a model loses; a lower AIC is better.

Backward stepwise regression starts from the model with all predictors, then repeatedly removes predictor variables until the model with the smallest AIC is obtained.
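We can inspect that criterion directly by comparing the AIC of the full model and the reduced model (a quick sketch):

# sketch: lower AIC = less information loss
AIC(model, model_back)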

Model Interpretation:

Compared with the previous model, the difference in Adjusted R-squared is small, and the same holds for R-squared. However, the summary of model_back shows clearly which variables contribute to explaining the variance of the target class (Chance.of.Admit).

We have now built three linear regression models: model_lm, model, and model_back. For each of them, the R-squared and Adjusted R-squared values are close to each other. Since a large gap between the two metrics can indicate multicollinearity, this is a first hint that our data has none; we will test this properly in the assumption checks below.
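Since the performance package was loaded for exactly this purpose, a side-by-side comparison of the three models is a one-liner (a sketch):

# sketch: compare AIC, R2, adjusted R2, and RMSE of the three models
compare_performance(model_lm, model, model_back)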

7. Assumptions of Linear Regression

Before moving on to prediction on test data, we must make sure our model satisfies the assumptions of linear regression. There are four assumptions, discussed below. If the model fails to meet them, we simply cannot rely on it.

a. Linearity

The linearity assumption expects every predictor variable to have a linear relationship with the target variable. Checking linearity matters because a linear regression model can only capture linear patterns in the data, so every predictor used must relate linearly to the target.
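The diagnostic plot discussed below was not included in the text; a minimal sketch to reproduce a residuals-versus-fitted plot:

# sketch: residuals vs fitted values; for a linear model we want no clear pattern
plot(model_back$fitted.values, model_back$residuals)
abline(h = 0, col = "red")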

What we can infer from the graph above is that the residuals scatter around 0 across fitted values of roughly 0.5 to 0.8.

b. Normality

When building a linear regression model, we expect the errors (residuals) to be normally distributed, i.e. most errors clustered around 0.
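The histogram referred to below was not shown in the text; a minimal sketch to draw it:

# sketch: distribution of the residuals of model_back
hist(model_back$residuals, breaks = 30)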

The distribution is skewed, with a long tail of negative residuals. This is a strong signal of non-normal residuals. To test this formally, we use the Shapiro-Wilk test.

shapiro.test(model_back$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_back$residuals
## W = 0.92193, p-value = 0.0000000000001443

The p-value given by the Shapiro-Wilk normality test is less than the alpha of 0.05. Therefore, the residuals of the model are not normally distributed.

c. Homoscedasticity

We expect the errors/residuals of the model to have a variance that forms no pattern: they should be spread randomly.

bptest(model_back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_back
## BP = 22.428, df = 5, p-value = 0.0004341

The Breusch-Pagan test gives a p-value less than alpha (0.05). Therefore, the model shows heteroscedasticity.

d. Multicollinearity

In our earlier analysis we argued that the small difference between R-squared and Adjusted R-squared suggests no multicollinearity. In this section we verify that hypothesis with the vif() function.

library(car)

vif(model_back)
##   GRE.Score TOEFL.Score         LOR        CGPA    Research 
##    4.585053    4.104255    1.829491    4.808767    1.530007

From the result, we can say there is no multicollinearity in our data: the VIF value for every predictor is less than 10.

Having checked the assumptions of our linear regression model, we find that only 2 of the 4 assumptions are satisfied (Linearity and Multicollinearity). For the remaining two, we will attempt further analysis in the next step.

8. Data Transformation

We already know that our model does not yet meet the linear regression assumption tests. How can we handle this problem? Fortunately, this article (reference 2) shows how. It states that “If a log transformation is applied to the dependent variable only, this is equivalent to assuming that it grows (or decays) exponentially as a function of the independent variables. Beside, if we use a log transformation applied to both the dependent variable and the independent variables, this is equivalent to assuming that the effects of the independent variables are multiplicative rather than additive in their original units”.

So, in this step we choose to log-transform the dependent variable (Chance.of.Admit).

graduate_log <- graduate_clean %>%
  mutate(Chance.of.Admit = log10(Chance.of.Admit))


head(graduate_log)
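Note that log10 is invertible: predictions from any model fitted on graduate_log are on the log10 scale and must be back-transformed with 10^x before being read as admission chances. A quick sanity check (a sketch):

# sketch: 10^x recovers the original scale of the target
head(10^graduate_log$Chance.of.Admit)
head(graduate_clean$Chance.of.Admit) # should match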

9. Modeling and Evaluation

After transforming the dependent variable, the next step is to build and evaluate the model.

Modeling

# Model fitting

model_grad_log <- lm(formula = Chance.of.Admit~., graduate_log)

summary(model_grad_log)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = graduate_log)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.218581 -0.017131  0.007131  0.027996  0.110971 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.4182330  0.0925664 -15.321 < 0.0000000000000002 ***
## GRE.Score          0.0010423  0.0004437   2.349               0.0193 *  
## TOEFL.Score        0.0018417  0.0008086   2.278               0.0233 *  
## University.Rating  0.0007993  0.0035403   0.226               0.8215    
## SOP               -0.0037254  0.0041275  -0.903               0.3673    
## LOR                0.0166036  0.0041125   4.037             0.000065 ***
## CGPA               0.0797876  0.0090684   8.798 < 0.0000000000000002 ***
## Research1          0.0139027  0.0059072   2.354               0.0191 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04734 on 392 degrees of freedom
## Multiple R-squared:  0.7455, Adjusted R-squared:  0.741 
## F-statistic: 164.1 on 7 and 392 DF,  p-value: < 0.00000000000000022

It can be seen that University.Rating and SOP are again the predictors that are not statistically significant for the target (Chance.of.Admit).

Backward Step-wise Regression

model_grad_log_back <- step(model_grad_log,
                            direction = "backward",
                            trace = F)

summary(model_grad_log_back)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = graduate_log)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.223705 -0.016372  0.007085  0.027803  0.110051 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.4108842  0.0869699 -16.223 < 0.0000000000000002 ***
## GRE.Score    0.0010723  0.0004416   2.428               0.0156 *  
## TOEFL.Score  0.0017438  0.0007898   2.208               0.0278 *  
## LOR          0.0150532  0.0035621   4.226            0.0000296 ***
## CGPA         0.0785130  0.0087013   9.023 < 0.0000000000000002 ***
## Research1    0.0134695  0.0058728   2.294               0.0223 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04726 on 394 degrees of freedom
## Multiple R-squared:  0.745,  Adjusted R-squared:  0.7417 
## F-statistic: 230.2 on 5 and 394 DF,  p-value: < 0.00000000000000022

From this model, the Adjusted R-squared is worse than that of the previous model built on the original data. In addition, the set of predictors that contribute to explaining the variance of the target class is the same as in the model we created in the early modeling step.
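MLmetrics was loaded at the start for MAPE; as a rough in-sample comparison of the two stepwise models (a sketch only, since no train/test split was made in this project):

# sketch: in-sample MAPE of the original-scale model
MAPE(y_pred = predict(model_back, graduate_clean),
     y_true = graduate_clean$Chance.of.Admit)

# sketch: for the log model, back-transform predictions before computing MAPE
MAPE(y_pred = 10^predict(model_grad_log_back, graduate_log),
     y_true = graduate_clean$Chance.of.Admit)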

Evaluation

Before evaluating the model on test data, we should first check it against the linear regression assumptions again. Since we previously found that only 2 of the 4 assumptions were met, we re-test the remaining two (Normality and Homoscedasticity).

Assumption Checking

a. Normality

The residual distribution is still skewed after the transformation.

shapiro.test(model_grad_log_back$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_grad_log_back$residuals
## W = 0.89568, p-value = 0.0000000000000006789

Because the p-value is still less than 0.05 (alpha), the residuals of our model are still not normally distributed, even though we have transformed the data.

b. Homoscedasticity

bptest(model_grad_log_back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_grad_log_back
## BP = 35.288, df = 5, p-value = 0.000001318

Using the Breusch-Pagan test, the p-value is less than 0.05 (alpha), indicating that the model still violates the homoscedasticity assumption.

Based on the results of the linear regression assumption tests, we conclude that this linear regression model still cannot be used to predict the test data. In future work, it is worth trying other algorithms or performing deeper pre-processing steps in order to obtain a better model that fulfills the linear regression assumptions, so that we can find out which factors affect a person's chance of passing the postgraduate admission program.

10. Conclusions

From this series of analysis steps, the conclusions are as follows.

  • Based on the data exploration step, all predictors (except the Serial.No. ID) have a linear relationship with the dependent variable (Chance.of.Admit). The predictor with the highest correlation is CGPA.

  • Based on the summary of model, out of all the predictor variables, University.Rating and SOP have no significant effect on the chance of passing the postgraduate admission program.

  • To choose the best features for the linear regression model, the backward stepwise regression method was used. The summary of model_back shows that 5 predictors contribute to explaining the variance of the target class.

  • Even though we succeeded in building a linear regression model, a closer look at the assumption tests shows that the model meets only 2 of the 4 linear regression assumptions. Therefore, in this project, a log transformation of the dependent variable was carried out.

Overall, comparing the initial model with the transformed model, nothing improved. Further pre-processing is therefore needed, and I think we should also try machine learning algorithms other than the linear regression model.

11. Alternative

A suggestion for the future is to try a random forest regression model to predict the chance of someone passing the postgraduate admission program and to find out roughly which features affect it.
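A minimal sketch of that suggestion, assuming the randomForest package is installed (this was not part of the original analysis):

library(randomForest) # assumed installed; not loaded in the original project

# sketch: fit a random forest regressor and inspect variable importance
set.seed(123)
rf_grad <- randomForest(Chance.of.Admit ~ ., data = graduate_clean, importance = TRUE)
varImpPlot(rf_grad)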

12. References

  1. Mohan S Acharya, Asfia Armaan, Aneeta S Antony : A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019

  2. https://people.duke.edu/~rnau/testing.htm#:~:text=How%20to%20fix%3A%20consider,example%20on%20this%20web%20site

  3. https://www.r-bloggers.com/2020/05/step-by-step-guide-on-how-to-build-linear-regression-in-r-with-code/#:~:text=A%20large%20difference%20between%20the%20R%2DSquared%20and%20Adjusted%20R%2Dsquared%20is%20not%20appreciated%20and%20generally%20indicates%20that%20multicollinearity%20exists%20within%20the%20data.