Introduction

We will learn to use linear regression model using Admission_Predict_Ver1.1 dataset. We want to know the relationship among variables, especially between the Chance.of.Admit with other variables. You can download the data here .

Data Preparation

1. Load the required package

library(GGally)
library(MLmetrics)
library(tidyverse)
library(dplyr)
library(lmtest)
library(car)

2. Load the dataset.

First we gonna read the dataset and then we will split this dataset for train and test with portion 80% : 20%.

graduate <- read.csv("Admission_Predict_Ver1.1.csv")
head(graduate, 5)
str(graduate)
## 'data.frame':    500 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

The data for model has 500 rows and 9 columns. Serial.No is a unique identifier for each row, so we can ignore it. Out target variable is the Chance.of.Admit, which signifies the admission for graduate students. We will use other variable too. The dataset contains several parameters which are considered important during the application for Masters Programs. The parameters included are :

  • GRE Scores ( out of 340 )
  • TOEFL Scores ( out of 120 )
  • University Rating ( out of 5 )
  • Statement of Purpose and Letter of Recommendation Strength (out of 5)
  • Undergraduate GPA ( out of 10 )
  • Research Experience ( either 0 or 1 )
  • Chance of Admit ( ranging from 0 to 1 )

Before we go further, first we need to make sure that our data is clean and will be useful. The step will taken is: 1. Changing categoric data to factor data type - Because it is stated that variable Research is either 0 or 1, we will change this column to factor data type 2. Check missing value is exist in the dataset. 3. Remove unused column from our dataset.

# 1. Changing categoric data to factor data type
graduate$Research <- as.factor(graduate$Research)
# 2. Check missing value is exist in the dataset.
colSums(is.na(graduate))
##        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
##                 0                 0                 0                 0 
##               SOP               LOR              CGPA          Research 
##                 0                 0                 0                 0 
##   Chance.of.Admit 
##                 0
## [1] "Serial.No."        "GRE.Score"         "TOEFL.Score"      
## [4] "University.Rating" "SOP"               "LOR"              
## [7] "CGPA"              "Research"          "Chance.of.Admit"
## [1] "GRE.Score"         "TOEFL.Score"       "University.Rating"
## [4] "SOP"               "LOR"               "CGPA"             
## [7] "Research"          "Chance.of.Admit"

Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the data variables, see if there are any pattern that can indicate any kind of correlation between variables. For this we will use package from GGally to find-out the correlation.

ggcorr(data = graduate, label = T, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(data = graduate, label = T, label_size = 2.9, hjust = 1, :
## data in column(s) 'Research' are not numeric and were ignored

The graphic shows that several variable has strong correlation with the Chance.of.Admit variables.

Train-Test Split

From this test it is shown our dataset have 0 missing value for each column. Then we continue to do exploration in our dataset.

Now we can split the data set into train and test with proportion 70% and 30%.

set.seed(50)

index <- sample(nrow(graduate), nrow(graduate) * 0.7)
graduate_train <- graduate[index, ]
graduate_test <- graduate[-index, ]

Modeling

Linear Regression

Now we will try to model the linear regression using Chance.of.Admit as the target variable.

1. Simple Linear Regression

Simple linear regression is a model with one prediction. And the predictor we will use is CGPA.

model_sl <- lm(formula = Chance.of.Admit ~ CGPA, data = graduate_train)
summary(model_sl)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = graduate_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.276747 -0.029088  0.006964  0.040095  0.171052 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.045207   0.049482  -21.12   <2e-16 ***
## CGPA         0.206041   0.005743   35.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06639 on 348 degrees of freedom
## Multiple R-squared:  0.7872, Adjusted R-squared:  0.7866 
## F-statistic:  1287 on 1 and 348 DF,  p-value: < 2.2e-16

Linear regression is one of the model with high interpretability, so now we gonna interpret this simple linear regression model. - Based on Intercept: When the CGPA is 0, the Chance.of.Admit is -1.068162, means the chance to be admitted to graduate is less than 0 or easily we can say the student will not enter the graduate studies in university.

  • Based on Coefficient or Slope: When CGPA goes up 1 value, then Chance.of.Admit increase around 0.21.

  • Based on P-value: CGPA is a significant predictor and has a linear influence.

  • Based on R-squared: R-squred values is 0.7825; the predictor that is used is good enough to explain the target variable.

2. Multiple Linear Regression

model_all <- lm(formula = Chance.of.Admit ~ ., data = graduate_train)
summary(model_all)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = graduate_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.262634 -0.023984  0.008514  0.033414  0.157793 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.2274554  0.1226521 -10.008  < 2e-16 ***
## GRE.Score          0.0018796  0.0005992   3.137 0.001857 ** 
## TOEFL.Score        0.0024807  0.0010399   2.386 0.017596 *  
## University.Rating  0.0072153  0.0045202   1.596 0.111361    
## SOP                0.0004345  0.0056815   0.076 0.939081    
## LOR                0.0177012  0.0050341   3.516 0.000497 ***
## CGPA               0.1149754  0.0121363   9.474  < 2e-16 ***
## Research1          0.0292794  0.0081277   3.602 0.000362 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05999 on 342 degrees of freedom
## Multiple R-squared:  0.8293, Adjusted R-squared:  0.8258 
## F-statistic: 237.3 on 7 and 342 DF,  p-value: < 2.2e-16

The summary of model_all model shows a lot of information. - Based on the Intercept or constant, the chance to be admit in graduate is decreasing -1.1917371 for all predictor when set to 0.

  • Based on P-value, we are looking at least they have 1 star or < 0.05. Because this value shows the significance level of the variable toward the model. If the value is below 0.05, than we can safely assume that the variable has significant effect toward the model (meaning that the estimated coefficient are no different than 0), and vice versa. Thus, we can made a simpler model by removing variables that has p-value > 0.05, since they don’t have significant effect toward our model. The estimate value shows the coefficient of each variable.

    So here, GRE.Score, TOEFL.Score, University.Rating, LOR, CGPA and Research are a good predictor.

  • Based on Adjusted R-squared the values is 0.8173 which is close to 1. It is representing the goodness of fit in our model.

Model Evaluation

1. Comparing based on R-squared value

summary(model_sl)$r.squared #one predictor
## [1] 0.7872048
summary(model_all)$adj.r.squared #all predictor
## [1] 0.8257972

Comparison Based on R-Squared with 1 predictor Model and all Predictor Model Comparing two model we got new information that based R-Squared, model with all predictor has higher R-squared value.

Model R-squared value
M. 1 Predictor 0.7824766
M. All Predictor 0.817276

2. Comparing Based on RMSE

pred_sl <- predict(object = model_sl, newdata = graduate_test)
pred_all <- predict(object = model_all, newdata = graduate_test)
RMSE(y_pred = pred_sl, y_true = graduate_test$Chance.of.Admit)
## [1] 0.06664615
RMSE(y_pred = pred_all, y_true = graduate_test$Chance.of.Admit)
## [1] 0.060229
range(graduate_test$Chance.of.Admit)
## [1] 0.36 0.97

Which model is better? From the Goodness of fit and error, the model_all model (which uses all predictors) is better than the model_sl model (which only uses 1 predictor)

Model Improvement

Step-wise Regression for Feature Selection

In the Step-wise regression technique, we start fitting the model with each individual predictor and see which one has the lowest p-value. Then pick that variable and then fit the model using two variable one which we already selected in the previous step and taking one by one all remaining ones.

We gonna used Stepwise regression using backward elimination.

model_backward <- step(object = model_all,
                       direction = "backward",
                       trace = F)
summary(model_backward)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research, data = graduate_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.262506 -0.023863  0.008652  0.033447  0.157893 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.228750   0.121303 -10.130  < 2e-16 ***
## GRE.Score          0.001878   0.000598   3.140 0.001834 ** 
## TOEFL.Score        0.002488   0.001035   2.404 0.016728 *  
## University.Rating  0.007346   0.004177   1.759 0.079527 .  
## LOR                0.017829   0.004743   3.759 0.000201 ***
## CGPA               0.115170   0.011848   9.721  < 2e-16 ***
## Research1          0.029272   0.008115   3.607 0.000356 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0599 on 343 degrees of freedom
## Multiple R-squared:  0.8293, Adjusted R-squared:  0.8263 
## F-statistic: 277.7 on 6 and 343 DF,  p-value: < 2.2e-16

This step-wise regression method will produce an optimum formula based on the lowest AIC value, where the lower the AIC value, the smaller the uncaught observation value.

If we compared with the initial model that only, we got summary like this:
Model R-squared value
M. 1 Predictor 0.7824766
M. All Predictor 0.817276
M. backward 0.8263

From here we can see that with selected features based on step-wise regression using backward step giving higher R-square value compared to previous model. The selected predictor used are GRE.Score, TOEFL.Score, University.Rating, LOR, CGPA, Research.

Prediction and RMSE

#Predict
pred_backward <- predict(object = model_backward, newdata = graduate_test)

#RMSE
RMSE(y_pred = pred_backward, y_true = graduate_test$Chance.of.Admit)
## [1] 0.06023782

Assumption

1. Normality of Residuals

The linear regression model is expected to produce errors that are normally distributed. With that, more errors cluster around the number of zero.

1.a. Visualize histogram residual

hist(model_backward$residuals)

1.b. Statistics test with shapiro.test()

shapiro.test(model_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.9158, p-value = 4.321e-13

Shapiro-Wilk hypothesis test:

  • H0: error distributed normal
  • H1: error NOT distributed normal We are looking H0 alpha 0.05.

Since we get p-value = 4.321e-13 which is less than 0.05, our error (residual) based on this model_backward is not distributed normally.

2. Homoscedasticity of Residuals

2.a. Visualize the scatter plot: fitted.values vs residuals

  • fitted.values is prediction value of data training
  • residuals adalah error value
# scatter plot
plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")

This pattern is more likely of Fan shape pattern or qualified as Heteroscedasticity.

2.b. Statistics test with bptest()

bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 24.048, df = 6, p-value = 0.0005117

Breusch-Pagan hypothesis test:

  • H0: error distributed constant/ homoscedasticity
  • H1: error distributed NOT constant/ heteroscedasticity We are looking H0 alpha 0.05.

Again we got p-value = 0.0005117 which is less than 0.05. Means this error distributed NOT constant/ heteroscedasticity.

3. No Multicollinearity

vif(model_backward)
##         GRE.Score       TOEFL.Score University.Rating               LOR 
##          4.696144          4.148376          2.281882          1.896098 
##              CGPA          Research 
##          5.229900          1.567856

Based on this assumption we are looking with VIF < 10. Here our model_backward fulfilling the assumption, because all the correlation predictor value is less than 10.

Summary

For our model_backward

  • Based on R-squared and RMSE, this model shows a better model.
  • Based on Assumption, two Assumptions is fulfilled which is Linearity and No Multicollinearity. While other two Assumptions is NOT fulfilled which is Normality of Residuals and Homoscedasticity of Residuals.

Note, It is suggested to:

  • if Normality of Residuals not fulfilled the assumption;
  1. Handle outliers by discarding 2.Perform data transformation on the target variable before it is used in modeling, for example with the log() or sqrt() functions
  2. Use other, more complex models (assumption-free models)
  • if Homoscedasticity of Residuals not fulfilled the assumption;
  1. Perform data transformation on target or predictor variables before use in modeling
  2. Use other, more complex models (assumption-free models)

This model need improvement for choosing the predictor.