1 . Intro

1.1 About Data

We’ll be doing some exploratory analysis and prediction for a numeric target variable. The data, provided at https://www.kaggle.com/mohansacharya/graduate-admissions?select=Admission_Predict.csv, records 400 observations with 8 informative variables (columns), 7 of which might affect the value of the Chance.of.Admit column. We’ll predict this variable and identify which other variables contribute to the target’s value.

1.2 Steps

Before we start, take a look at the target list below as a sneak peek and a guide to predicting a numeric variable using Regression Models:
  • Target variable : Chance.of.Admit
  • Predictor(s) : all informative columns, besides the target

Step by step of the Regression Models workflow:
  1. Data Preparation : drop uninformative columns; check NAs, outliers, and correlations; adjust data types; etc.
  2. Model Creation and Analysis
  3. Model Evaluation : Performance & Assumptions
  4. Conclusion

2 . Data Preparation

Library

library(dplyr)      #wrangling
library(GGally)     #correlation
library(MLmetrics)  #error check
library(lmtest)     #heteroscedasticity
library(car)        #multicollinearity/vif

Input

First of all, use read.csv() from base R to read graduate_admission.csv, which has been saved in a local folder. Assign it to grad.

grad <- read.csv("data_input/graduate_admission.csv")
head(grad)

Missing Value Check

anyNA(grad)
## [1] FALSE

Select Informative Variables

Drop variables that don’t provide/explain any information about the target variable, in our case: Serial.No. To wrangle the data, we can use %>% from library(dplyr).

grad <- grad %>% 
  select(-Serial.No.)
head(grad)
The remaining data consists of variables with these ranges:

  1. GRE Score : out of 340
  2. TOEFL Score : out of 120
  3. University Rating : out of 5
  4. Statement of Purpose (SOP) : out of 5
  5. Letter of Recommendation (LOR) : out of 5
  6. Undergraduate GPA (CGPA) : out of 10
  7. Research Experience : either 0 or 1, where 0 means no research experience
  8. Chance of Admit : ranging from 0 to 1
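
These ranges are easy to sanity-check on our own copy of the data; a minimal sketch using base R:

# quick check: the min/max of each column should fall within the ranges listed above
summary(grad)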

Adjust Data Type

We want to see the data type of each variable using str() or glimpse() from library(dplyr), then do some transformation using mutate().

str(grad)
## 'data.frame':    400 obs. of  8 variables:
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
# change data type into categorical
grad <- grad %>% 
  mutate(Research = as.factor(Research))
str(grad)
## 'data.frame':    400 obs. of  8 variables:
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

Correlation Check

# library(GGally)
ggcorr(grad, label = TRUE, hjust = 1, label_size = 3.5, layout.exp = 2)

According to the plot, we can say that the variables are all positively correlated with one another, mostly moderately to strongly. In particular, Chance.of.Admit shows strong positive correlations with CGPA (0.9), TOEFL.Score (0.8), and GRE.Score (0.8).
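
To read the exact correlation of each variable with the target instead of eyeballing the plot, a small sketch (Research is a factor at this point, so we exclude it):

# correlations with the target, strongest first
sort(cor(select(grad, -Research))[, "Chance.of.Admit"], decreasing = TRUE)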

3 . Modeling

3.1 Cross Validation (train - test)

Cross validation is known as the step of splitting our data into train data and test data, with a ratio that depends on the business demand. As the names suggest, we use the train data to train our models and the test data to confirm their performance.

RNGkind(sample.kind = "Rounding")
set.seed(111)
# randomly sample 75% of the row indices for training; the rest form the test set
idx <- sample(nrow(grad), nrow(grad) * 0.75)
grad_train <- grad[idx, ]
grad_test <- grad[-idx, ]

3.2 Linear Regression Models

names(grad)
## [1] "GRE.Score"         "TOEFL.Score"       "University.Rating"
## [4] "SOP"               "LOR"               "CGPA"             
## [7] "Research"          "Chance.of.Admit"

Model 1 : Strongest Correlation Variable

Note, we are going to build more than one model to predict the test data later. For the first model, we can use the predictor with the strongest correlation to our target: CGPA.

model1 <- lm(Chance.of.Admit ~ CGPA, grad_train)
summary(model1)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = grad_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.273007 -0.029511  0.009633  0.040942  0.183168 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.085506   0.059870  -18.13   <2e-16 ***
## CGPA         0.210292   0.006958   30.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07019 on 298 degrees of freedom
## Multiple R-squared:  0.754,  Adjusted R-squared:  0.7532 
## F-statistic: 913.4 on 1 and 298 DF,  p-value: < 2.2e-16

As summarised, CGPA shows a significant effect in our model (p-value < 0.05, significance code ***). In numbers, every 1-point increase in CGPA contributes an increase of 0.210292 in Chance.of.Admit. We can also interpret the adjusted R-squared value for this model, 0.7532: model1 explains 75.32% of the variance in Chance.of.Admit.
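
As an illustration (the CGPA value of 9.0 is made up for this example), we can plug a number into model1:

# predicted chance for a hypothetical applicant with CGPA 9.0:
# roughly -1.085506 + 0.210292 * 9 = 0.807
predict(model1, newdata = data.frame(CGPA = 9))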

Model 2 : Strong Correlation Variables

model2 <- lm(Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, grad_train)
summary(model2)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, 
##     data = grad_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.287689 -0.020286  0.008645  0.039000  0.134669 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.5616272  0.1198263 -13.032  < 2e-16 ***
## CGPA         0.1503685  0.0130713  11.504  < 2e-16 ***
## TOEFL.Score  0.0032844  0.0012966   2.533  0.01182 *  
## GRE.Score    0.0020168  0.0006634   3.040  0.00258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06726 on 296 degrees of freedom
## Multiple R-squared:  0.7756, Adjusted R-squared:  0.7733 
## F-statistic:   341 on 3 and 296 DF,  p-value: < 2.2e-16

As declared, the adjusted R-squared of model2 is 0.7733. It means the top 3 most strongly correlated variables (CGPA, TOEFL.Score, GRE.Score) explain 77.33% of the variance in Chance.of.Admit, a slightly better adjusted R-squared than model1’s. Can we say it is the best model so far? Not yet. Let’s create one last model using the step() function.

Model 3 : Stepwise

# start from the full model with all predictors, then eliminate backward
lm_3 <- lm(Chance.of.Admit ~ ., grad_train)
step(lm_3, direction = "backward")
## Start:  AIC=-1632.94
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - SOP                1  0.002812 1.2332 -1634.2
## - University.Rating  1  0.003744 1.2341 -1634.0
## <none>                           1.2304 -1632.9
## - GRE.Score          1  0.022827 1.2532 -1629.4
## - Research           1  0.023518 1.2539 -1629.3
## - TOEFL.Score        1  0.024838 1.2552 -1628.9
## - LOR                1  0.053216 1.2836 -1622.2
## - CGPA               1  0.313747 1.5441 -1566.8
## 
## Step:  AIC=-1634.25
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - University.Rating  1  0.002023 1.2352 -1635.8
## <none>                           1.2332 -1634.2
## - Research           1  0.022949 1.2561 -1630.7
## - TOEFL.Score        1  0.023474 1.2567 -1630.6
## - GRE.Score          1  0.024095 1.2573 -1630.5
## - LOR                1  0.052415 1.2856 -1623.8
## - CGPA               1  0.310955 1.5441 -1568.8
## 
## Step:  AIC=-1635.76
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     1.2352 -1635.8
## - Research     1   0.02428 1.2595 -1631.9
## - GRE.Score    1   0.02491 1.2601 -1631.8
## - TOEFL.Score  1   0.02638 1.2616 -1631.4
## - LOR          1   0.06644 1.3016 -1622.0
## - CGPA         1   0.34753 1.5827 -1563.4
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = grad_train)
## 
## Coefficients:
## (Intercept)    GRE.Score  TOEFL.Score          LOR         CGPA    Research1  
##   -1.295564     0.001622     0.003133     0.022585     0.125403     0.021996

The step() function offers “the most effective” regression model recommendation, the one with the lowest AIC (Akaike Information Criterion). Note: we additionally drop the categorical variable (Research) so that the final model uses numeric predictors only.
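
As a side note, step() returns the refitted recommended model, so we could capture it programmatically instead of copying the formula by hand; a small sketch (trace = 0 just silences the step-by-step log):

# capture the recommendation silently and inspect its formula
best_step <- step(lm_3, direction = "backward", trace = 0)
formula(best_step)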

model3 <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA, data = grad_train)
summary(model3)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA, data = grad_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.276348 -0.020703  0.009176  0.036326  0.147764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.4323082  0.1201939 -11.917  < 2e-16 ***
## GRE.Score    0.0020700  0.0006446   3.211  0.00147 ** 
## TOEFL.Score  0.0030853  0.0012604   2.448  0.01495 *  
## LOR          0.0244871  0.0056694   4.319 2.14e-05 ***
## CGPA         0.1259759  0.0138974   9.065  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06534 on 295 degrees of freedom
## Multiple R-squared:  0.789,  Adjusted R-squared:  0.7861 
## F-statistic: 275.7 on 4 and 295 DF,  p-value: < 2.2e-16

model3 has a slightly higher adjusted R-squared: 0.7861. One interesting point from summary(model3) is that LOR turns out to be a significant predictor for our model. So, let’s try creating model4 with only CGPA and LOR as predictors.

[Extra] Model 4 : 2 Predictors

model4 <- lm(Chance.of.Admit ~ LOR + CGPA, grad_train)
summary(model4)
## 
## Call:
## lm(formula = Chance.of.Admit ~ LOR + CGPA, data = grad_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26601 -0.02496  0.01068  0.04066  0.19093 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.951596   0.066385 -14.334  < 2e-16 ***
## LOR          0.024918   0.005922   4.208 3.42e-05 ***
## CGPA         0.184664   0.009107  20.276  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0683 on 297 degrees of freedom
## Multiple R-squared:  0.7678, Adjusted R-squared:  0.7663 
## F-statistic: 491.1 on 2 and 297 DF,  p-value: < 2.2e-16

Adj-R Squared Comparison

summary(model1)$adj #1 pred
## [1] 0.753173
summary(model2)$adj #3 pred
## [1] 0.7733309
summary(model3)$adj #4 pred
## [1] 0.7860898
summary(model4)$adj #2 pred
## [1] 0.7662739

Considering the adjusted R-squared values and the number of predictor variables of each model, our top 2 models are model2 and model3.
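
Adjusted R-squared isn’t the only yardstick. As a cross-check, we could compare the same candidates by AIC, the criterion step() minimizes, where lower is better:

# AIC comparison across all four candidate models; lower is better
AIC(model1, model2, model3, model4)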

4 . Model Evaluation

4.1 Performance

The next step is to evaluate our models by predicting the test data using predict(). Don’t forget we only use model2 and model3 to predict the test data. In this phase, we’re still looking for the best model. Which one performs better, with the least error: model2 or model3?

# predict the target variable in the test data
predm2 <- predict(model2, newdata = grad_test)
predm3 <- predict(model3, newdata = grad_test)

Evaluation by MAPE

# MAPE of model2 on data test
MAPE(predm2, grad_test$Chance.of.Admit)
## [1] 0.07661793
# MAPE of model2 on data train (predictions made on the train set itself)
MAPE(predict(model2, grad_train), grad_train$Chance.of.Admit)
# MAPE of model3 on data test
MAPE(predm3, grad_test$Chance.of.Admit)
## [1] 0.07523643
# MAPE of model3 on data train
MAPE(predict(model3, grad_train), grad_train$Chance.of.Admit)

The smaller the MAPE value, the better our model performs at predicting the data. Comparing the train and test MAPE also guards against overfitting: a test error far above the train error would be a warning sign.
For this case, model3 is slightly better than model2, with a test MAPE of 0.07523643, i.e. an average error of about 7.5%.
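
For reference, MAPE is just the mean absolute error relative to the true values, so it’s easy to hand-roll; a minimal sketch equivalent to MLmetrics::MAPE():

# mean(|prediction - actual| / actual), here for model3 on the test set
mean(abs(predm3 - grad_test$Chance.of.Admit) / grad_test$Chance.of.Admit)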

4.2 Assumptions

Linear regression models come with 4 assumptions (a quick visual check is sketched right after this list, followed by formal tests):
  1. Linearity
    Linear regressions fit best on data where a linear relationship exists between the predictors and the target.
  2. Normality
    Linear regression expects a normal distribution of the residuals.
  3. Heteroscedasticity
    This is the condition where the variance of the model’s errors varies unequally across the range of the target. We want our model to have a constant variance of residuals, a condition known as homoscedasticity.
  4. Multicollinearity
    Simple/multiple regression models assume that the independent variables are not highly correlated with each other.
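
Before the formal tests, base R’s plot() method for lm objects gives quick visual diagnostics; a sketch:

# residuals vs fitted values (linearity, homoscedasticity) and a Q-Q plot (normality)
plot(model3, which = 1:2)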


Now, we’ll check our model3 against the 4 assumptions:

Linearity

  • using cor.test()
    H0 : Not Linear
    H1 : Linear
    p-value < 0.05, reject H0
    Target : Linear / reject H0
cor.test(grad_train$GRE.Score, grad_train$Chance.of.Admit)
## 
##  Pearson's product-moment correlation
## 
## data:  grad_train$GRE.Score and grad_train$Chance.of.Admit
## t = 21.987, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7391407 0.8261988
## sample estimates:
##       cor 
## 0.7865468
cor.test(grad_train$TOEFL.Score, grad_train$Chance.of.Admit)
## 
##  Pearson's product-moment correlation
## 
## data:  grad_train$TOEFL.Score and grad_train$Chance.of.Admit
## t = 21.654, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7336523 0.8223550
## sample estimates:
##       cor 
## 0.7819308
cor.test(grad_train$LOR, grad_train$Chance.of.Admit)
## 
##  Pearson's product-moment correlation
## 
## data:  grad_train$LOR and grad_train$Chance.of.Admit
## t = 15.504, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6003691 0.7264558
## sample estimates:
##       cor 
## 0.6681827
cor.test(grad_train$CGPA, grad_train$Chance.of.Admit)
## 
##  Pearson's product-moment correlation
## 
## data:  grad_train$CGPA and grad_train$Chance.of.Admit
## t = 30.222, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8374355 0.8936943
## sample estimates:
##       cor 
## 0.8683309

Each test has a p-value below the significance level of 0.05, therefore we reject the null hypothesis: each predictor has a linear relationship with the target.
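
The four separate calls above could also be condensed into one line; a small sketch pulling just the p-values:

# p-value of each predictor's correlation with the target
sapply(c("GRE.Score", "TOEFL.Score", "LOR", "CGPA"),
       function(v) cor.test(grad_train[[v]], grad_train$Chance.of.Admit)$p.value)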

Normality of Residuals

  • using shapiro.test()
    H0 : Residuals normally distributed
    H1 : Residuals not normally distributed
    p-value < 0.05, reject H0
    Target : Residuals normally distributed/ fail to reject H0
shapiro.test(model3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model3$residuals
## W = 0.90795, p-value = 1.415e-12

With a p-value far below 0.05, the null hypothesis is rejected: the residuals are not normally distributed.

Heteroscedasticity

  • using bptest() from library(lmtest)
    H0 : Homoscedasticity
    H1 : Heteroscedasticity
    p-value < 0.05, reject H0
    Target : Homoscedasticity/ fail to reject H0
bptest(model3)
## 
##  studentized Breusch-Pagan test
## 
## data:  model3
## BP = 13.033, df = 4, p-value = 0.01112

With a p-value of 0.01112, below 0.05, the null hypothesis is rejected: heteroscedasticity is present in our model.

Multicollinearity

  • using vif() from library(car)
    VIF (Variance Inflation Factor) measures how much the multicollinearity among the predictors inflates the variance of each estimated coefficient in our model.
vif(model3)
##   GRE.Score TOEFL.Score         LOR        CGPA 
##    3.856675    3.868141    1.811634    4.602878

The result shows that all predictors have VIF values below 10, indicating there is no serious multicollinearity.
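
For intuition, a predictor’s VIF equals 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on the remaining ones; a sketch for CGPA, which should reproduce vif(model3)["CGPA"]:

# regress CGPA on the other three predictors of model3
r2_cgpa <- summary(lm(CGPA ~ GRE.Score + TOEFL.Score + LOR, data = grad_train))$r.squared
1 / (1 - r2_cgpa)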

5 . Conclusion


Even though our model doesn’t fulfill all 4 assumptions of Regression Models, whether to accept or tune the current model depends on the business case. For now, we will hold on to it because it’s still tolerable: model3 fulfills two of the four assumptions, linearity and no multicollinearity.
Furthermore, model3, with predictors GRE.Score, TOEFL.Score, LOR, and CGPA, performs quite well. First, from the adjusted R-squared value, 78.61% of the variance in Chance.of.Admit is explained by those predictor variables. Second, the model’s accuracy in predicting the target in the test data is reflected in its low error: a MAPE of about 7.5% on the test set.