library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.6.2

## -- Attaching packages -------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.6.2

## Warning: package 'tibble' was built under R version 3.6.2

## Warning: package 'tidyr' was built under R version 3.6.2

## Warning: package 'readr' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

## Warning: package 'dplyr' was built under R version 3.6.2

## Warning: package 'stringr' was built under R version 3.6.2

## Warning: package 'forcats' was built under R version 3.6.2

## -- Conflicts ----------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(plotly)

## Warning: package 'plotly' was built under R version 3.6.2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(GGally)

## Warning: package 'GGally' was built under R version 3.6.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

library(ggplot2)
library(readr)
library(dplyr)
library(funModeling)

## Warning: package 'funModeling' was built under R version 3.6.2

## Loading required package: Hmisc

## Warning: package 'Hmisc' was built under R version 3.6.2

## Loading required package: lattice

## Loading required package: survival

## Warning: package 'survival' was built under R version 3.6.2

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:plotly':
## 
##     subplot

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## funModeling v.1.9.3 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai

## 
## Attaching package: 'funModeling'

## The following object is masked from 'package:GGally':
## 
##     range01

library(lmtest)

## Warning: package 'lmtest' was built under R version 3.6.2

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.6.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(car)

## Warning: package 'car' was built under R version 3.6.2

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

library(MLmetrics)

## Warning: package 'MLmetrics' was built under R version 3.6.2

## 
## Attaching package: 'MLmetrics'

## The following object is masked from 'package:base':
## 
##     Recall

1 Background

This is a learn-by-building project that predicts students' chances of admission to a master's program at a university from several academic performance measures, using multiple linear regression fitted by ordinary least squares (OLS).

2 Source of Dataset

The analysis uses the following dataset:

Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science, 2019

https://www.kaggle.com/mohansacharya/graduate-admissions#Admission_Predict.csv

3 Importing Dataset

admission <- read.csv("Admission_Predict.csv")
str(admission)

## 'data.frame':    400 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

3.1 Context

The dataset was created to predict graduate admissions from an Indian perspective.

3.2 Content

The dataset contains several parameters that are considered important when applying to master's programs.

The parameters included are as follows:
1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose / SOP (out of 5)
5. Letter of Recommendation Strength / LOR (out of 5)
6. Undergraduate GPA / CGPA (out of 10)
7. Research Experience (either 0 or 1)
8. Chance of Admit (ranging from 0 to 1)

4 Inspecting Dataset

admission %>% 
  is.na() %>% 
  colSums()

##        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
##                 0                 0                 0                 0 
##               SOP               LOR              CGPA          Research 
##                 0                 0                 0                 0 
##   Chance.of.Admit 
##                 0

No column of the dataset contains NA (missing) values.
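As a minimal sketch of what this check does, the toy data frame below (hypothetical values, not the admissions data) contains two missing entries. `is.na()` turns the data frame into a logical matrix and `colSums()` counts the TRUE entries per column.

```r
# Toy data frame (hypothetical values) with two missing entries
toy <- data.frame(
  GRE.Score = c(337, NA, 316),
  CGPA      = c(9.65, 8.87, NA),
  Research  = c(1, 1, 0)
)

# is.na() gives a logical matrix; colSums() counts TRUE values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows without any missing value
complete_rows <- sum(complete.cases(toy))
complete_rows
```

A result of all zeros, as in the admissions data, means no imputation or row removal is needed before modelling.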

admission_new <- admission %>% 
   select(-Serial.No.)
admission_new

The Serial.No. column is only a row identifier, so it is excluded from the predictor variables.

admission_new$Research <- as.factor(admission_new$Research)
admission_new$University.Rating <- as.factor(admission_new$University.Rating)
str(admission_new)

## 'data.frame':    400 obs. of  8 variables:
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: Factor w/ 5 levels "1","2","3","4",..: 4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

The variables University.Rating and Research are converted from integer to factor.

5 Solving Business Problem

The business problem is to predict students' chances of admission to a master's program at a university from several academic performance measures. The model is developed using the following variables:

Target variable: Chance.of.Admit
Predictor variables: GRE.Score, TOEFL.Score, University.Rating, Research, SOP, LOR, CGPA

6 Exploratory Data Analysis

6.1 Checking correlation

ggcorr(admission_new, label = T)

## Warning in ggcorr(admission_new, label = T): data in column(s)
## 'University.Rating', 'Research' are not numeric and were ignored

The target variable Chance.of.Admit has its highest correlation with the predictor CGPA. University.Rating and Research are factors, so ggcorr() leaves them out of the correlation matrix.
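The same ranking can be read off numerically with base R's `cor()`. The sketch below uses simulated stand-in data (the column values and the relationship between them are assumptions made for illustration, not the admissions data), keeps only the numeric columns as ggcorr() does, and sorts the predictors by their Pearson correlation with the target.

```r
# Simulated stand-in data (hypothetical; not the admissions data)
set.seed(1)
n <- 100
toy <- data.frame(
  CGPA      = rnorm(n, mean = 8.5, sd = 0.6),
  GRE.Score = rnorm(n, mean = 316, sd = 11),
  Research  = factor(sample(0:1, n, replace = TRUE))
)
toy$Chance.of.Admit <- 0.1 * toy$CGPA + rnorm(n, sd = 0.05)

# Keep numeric columns only, mirroring how ggcorr() skips factors
numeric_cols <- toy[sapply(toy, is.numeric)]

# Pearson correlation of every numeric predictor with the target
cors <- cor(numeric_cols)[, "Chance.of.Admit"]
cors <- cors[names(cors) != "Chance.of.Admit"]
sort(cors, decreasing = TRUE)
```

On the real data, replacing `toy` with `admission_new` should reproduce the column of the ggcorr() plot that belongs to Chance.of.Admit.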

6.2 Checking outlier

ggplot(data = admission_new, aes(x = GRE.Score, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

ggplot(data = admission_new, aes(x = TOEFL.Score, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

ggplot(data = admission_new, aes(x = SOP, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

ggplot(data = admission_new, aes(x = LOR, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

ggplot(data = admission_new, aes(x = CGPA, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

plot_num(admission_new)

There are outliers in the predictor variables: GRE.Score, TOEFL.Score, SOP, LOR, and CGPA.

7 Preparing Train and Test Dataset

set.seed(417)
for_train <- sample(nrow(admission_new), nrow(admission_new)*0.8)
admission.train <- admission_new[for_train, ]
admission.test <- admission_new[-for_train, ]
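The split draws 80% of the row indices at random without replacement and keeps the remaining 20% for testing. A quick sanity check, sketched here on a stand-in data frame with the same 400 rows (the `toy` object and its `id` column are hypothetical), confirms the sizes and that no row ends up in both sets.

```r
# Stand-in data frame with the same number of rows as the admissions data
toy <- data.frame(id = 1:400)

set.seed(417)
for_train <- sample(nrow(toy), nrow(toy) * 0.8)  # 320 row indices, no replacement
toy.train <- toy[for_train, , drop = FALSE]
toy.test  <- toy[-for_train, , drop = FALSE]

nrow(toy.train)                               # 320
nrow(toy.test)                                # 80
length(intersect(toy.train$id, toy.test$id))  # 0: the two sets are disjoint
```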

8 Developing Model

The model for the business problem is a multiple linear regression fitted by ordinary least squares (OLS).

8.1 Target and predictor variables

model_admission.none <- lm(formula = Chance.of.Admit ~ 1, data = admission.train)
model_admission.none

## 
## Call:
## lm(formula = Chance.of.Admit ~ 1, data = admission.train)
## 
## Coefficients:
## (Intercept)  
##       0.727

model_admission.all <- lm(formula = Chance.of.Admit ~ ., data = admission.train)
model_admission.all

## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission.train)
## 
## Coefficients:
##        (Intercept)           GRE.Score         TOEFL.Score  University.Rating2  
##         -1.1070932           0.0013514           0.0033989          -0.0231128  
## University.Rating3  University.Rating4  University.Rating5                 SOP  
##         -0.0185791          -0.0144997           0.0045554           0.0004671  
##                LOR                CGPA           Research1  
##          0.0224159           0.1115688           0.0310666

summary(model_admission.all)

## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission.train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.256292 -0.022879  0.009247  0.037961  0.155317 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.1070932  0.1452922  -7.620 3.14e-13 ***
## GRE.Score           0.0013514  0.0006607   2.046 0.041647 *  
## TOEFL.Score         0.0033989  0.0012075   2.815 0.005193 ** 
## University.Rating2 -0.0231128  0.0173197  -1.334 0.183030    
## University.Rating3 -0.0185791  0.0187624  -0.990 0.322838    
## University.Rating4 -0.0144997  0.0226167  -0.641 0.521928    
## University.Rating5  0.0045554  0.0249937   0.182 0.855495    
## SOP                 0.0004671  0.0063046   0.074 0.940992    
## LOR                 0.0224159  0.0060540   3.703 0.000253 ***
## CGPA                0.1115688  0.0134894   8.271 4.01e-15 ***
## Research1           0.0310666  0.0087716   3.542 0.000459 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06294 on 309 degrees of freedom
## Multiple R-squared:  0.7919, Adjusted R-squared:  0.7852 
## F-statistic: 117.6 on 10 and 309 DF,  p-value: < 2.2e-16

The model will be developed using the five predictors that are significant at the 5% level: GRE.Score, TOEFL.Score, LOR, CGPA, and Research.

8.2 Feature selection using backward method

step(object = model_admission.all, direction = "backward", trace = 0)

## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admission.train)
## 
## Coefficients:
## (Intercept)    GRE.Score  TOEFL.Score          LOR         CGPA    Research1  
##   -1.183449     0.001430     0.003394     0.024024     0.115386     0.031796

model_admnew <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
    CGPA + Research, data = admission.train)
summary(model_admnew)

## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admission.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25935 -0.02378  0.01047  0.03685  0.15385 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.1834495  0.1318372  -8.977  < 2e-16 ***
## GRE.Score    0.0014298  0.0006564   2.178 0.030123 *  
## TOEFL.Score  0.0033943  0.0011799   2.877 0.004291 ** 
## LOR          0.0240235  0.0053163   4.519 8.82e-06 ***
## CGPA         0.1153857  0.0129089   8.938  < 2e-16 ***
## Research1    0.0317956  0.0087479   3.635 0.000325 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06304 on 314 degrees of freedom
## Multiple R-squared:  0.7879, Adjusted R-squared:  0.7845 
## F-statistic: 233.3 on 5 and 314 DF,  p-value: < 2.2e-16
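Note that step() compares models by AIC rather than by p-values; in this analysis it happens to keep exactly the five predictors that are significant in the full model. The sketch below shows the mechanism on the built-in mtcars data, used here only as a stand-in (the choice of predictors is an assumption for illustration).

```r
# Built-in mtcars data used as a stand-in for the admissions data
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination; trace = 0 suppresses the step-by-step log
reduced_model <- step(full_model, direction = "backward", trace = 0)

# step() only drops a term when doing so lowers the AIC,
# so the result never has a higher AIC than the starting model
AIC(full_model)
AIC(reduced_model)
formula(reduced_model)
```

Setting trace = 1 instead prints the AIC of every candidate model at each step, which makes it easier to see why a term was dropped.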

9 Testing Model

9.1 Check normality

hist(model_admnew$residuals)

shapiro.test(x = model_admnew$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_admnew$residuals
## W = 0.91188, p-value = 9.63e-13

Shapiro-Wilk hypothesis test (desired outcome: p-value > alpha):

H0: the errors/residuals are normally distributed

H1: the errors/residuals are not normally distributed

The p-value (9.63e-13) is below 0.05, so H0 is rejected in favor of H1. The assumption of normally distributed residuals is not fulfilled.
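A normal Q-Q plot is a common visual complement to the Shapiro-Wilk test, because it shows where the residuals depart from normality (for example in one tail). The sketch below uses simulated data with deliberately skewed errors (an assumption for illustration); on the real model, `residuals(model_admnew)` would take the place of `residuals(fit)`.

```r
# Simulated example (hypothetical data): a linear model with skewed errors
set.seed(42)
x <- runif(200)
y <- 1 + 2 * x + (rexp(200) - 1)   # errors are exponential, so not normal
fit <- lm(y ~ x)

# Q-Q plot: points far from the reference line indicate non-normal residuals
qqnorm(residuals(fit))
qqline(residuals(fit), col = "red")

# Shapiro-Wilk test on the same residuals
sw <- shapiro.test(residuals(fit))
sw$p.value
```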

9.2 Check heteroscedasticity

plot(model_admnew$fitted.values, model_admnew$residuals)
abline(h = 0, col = "red")

bptest(model_admnew)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_admnew
## BP = 20.18, df = 5, p-value = 0.001156

Breusch-Pagan hypothesis test (desired outcome: p-value > alpha):

H0: the errors/residuals have constant variance (homoscedasticity)

H1: the errors/residuals do not have constant variance or form a pattern (heteroscedasticity)

The p-value (0.001156) is below 0.05, so H0 is rejected in favor of H1. The assumption of no heteroscedasticity is not fulfilled.
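To make the test less of a black box, the studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by the number of observations. The sketch below does this in base R on simulated data whose error spread grows with x (an assumption for illustration). By the definition of the studentized variant it should agree with bptest() at its default settings, although that has not been checked here against the admissions model.

```r
# Simulated example (hypothetical data) whose error spread grows with x
set.seed(7)
n <- 300
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictors
aux <- lm(residuals(fit)^2 ~ x)

# Studentized statistic: n * R^2, chi-squared with one df per predictor
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_pval <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(BP = bp_stat, df = bp_df, p.value = bp_pval)
```

The degrees of freedom equal the number of slope coefficients, which is why the output above reports df = 5 for the five-predictor admissions model.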

9.3 Check multicollinearity

vif(model_admnew)

##   GRE.Score TOEFL.Score         LOR        CGPA    Research 
##    4.298824    3.883285    1.686968    4.372284    1.536764

All VIF values are below 10, a common rule-of-thumb threshold, so there is no serious multicollinearity among the predictor variables. The assumption of no multicollinearity is fulfilled.
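The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal base R sketch on the built-in mtcars data (a stand-in; the three predictors are chosen only for illustration) makes the definition concrete.

```r
# Built-in mtcars data used as a stand-in for the admissions data
predictors <- c("wt", "hp", "disp")

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
manual_vif
```

A VIF of about 4.3, as for GRE.Score and CGPA above, therefore means that roughly 77% of that predictor's variance is explained by the other predictors (1 - 1/4.3).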

10 Predicting Target & Error

pred_test <- predict(object = model_admnew, newdata = admission.test)
pred_test

##         9        14        21        23        24        26        33        38 
## 0.5536933 0.6524331 0.6172316 0.9273448 0.9571719 0.9576066 0.9248813 0.5499511 
##        39        41        55        59        63        64        65        70 
## 0.5090429 0.6572572 0.6574909 0.4459953 0.6576437 0.7147760 0.7459446 0.8606959 
##        73        83        84        92        94        99       102       109 
## 0.8945954 0.8512208 0.8872059 0.5412483 0.5892766 0.9012349 0.6280558 0.9177880 
##       111       116       118       120       126       129       139       151 
## 0.6697687 0.8025761 0.5050392 0.7729939 0.6880304 0.8167068 0.8279762 0.8970347 
##       152       161       167       171       173       174       181       185 
## 0.9076792 0.5715330 0.6758528 0.6490552 0.8557075 0.8668470 0.6121191 0.6842700 
##       189       200       203       204       206       237       244       249 
## 0.8760507 0.7380793 0.9933761 0.9921935 0.5291519 0.8593892 0.8151424 0.8045434 
##       262       265       271       272       274       276       277       283 
## 0.6452235 0.7572455 0.6628111 0.5248959 0.5907581 0.7880450 0.9008108 0.7525658 
##       292       293       297       326       327       331       333       337 
## 0.5478452 0.5520606 0.7074586 0.8589228 0.5569347 0.7707612 0.6558994 0.7203326 
##       340       342       347       353       360       369       377       379 
## 0.7701296 0.7826989 0.5100544 0.6351003 0.6486009 0.5121967 0.4724204 0.5251845 
##       381       382       386       388       389       390       395       398 
## 0.7778565 0.7453340 0.9776712 0.6306108 0.5170776 0.7424197 0.8566256 0.9124234

MSE on the training data

MSE(y_pred = model_admnew$fitted.values, y_true = admission.train$Chance.of.Admit)

## [1] 0.003899291

MSE on the test data

MSE(y_pred = pred_test, y_true = admission.test$Chance.of.Admit)

## [1] 0.004521008
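Because MSE is in squared units, its square root (RMSE) is easier to interpret: it is on the same 0 to 1 scale as Chance.of.Admit. The sketch below takes the square root of the two MSE values reported above and also shows a base R equivalent of the MSE calculation on two toy predictions (the toy vectors and the helper name `mse` are hypothetical).

```r
# MSE values reported above
mse_train <- 0.003899291
mse_test  <- 0.004521008

# RMSE is on the same 0-1 scale as Chance.of.Admit
rmse_train <- sqrt(mse_train)
rmse_test  <- sqrt(mse_test)
round(c(train = rmse_train, test = rmse_test), 4)

# Base R equivalent of the MSE calculation, shown on toy vectors
mse <- function(y_pred, y_true) mean((y_true - y_pred)^2)
mse(y_pred = c(0.70, 0.80), y_true = c(0.65, 0.90))  # 0.00625
```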

11 Summary

The model model_admnew, fitted on the training data, has an adjusted R-squared of 0.7845 and an MSE of 0.003899291.

The assumption tests for model_admnew give the following results:
1. Normality test: p-value = 9.63e-13, so the assumption of normally distributed errors is not fulfilled.
2. Heteroscedasticity test: p-value = 0.001156, so the assumption of no heteroscedasticity is not fulfilled.
3. Multicollinearity test: all VIF values < 10, so the assumption of no multicollinearity is fulfilled.

The predictions on the test data have an MSE of 0.004521008, only slightly higher than the training MSE of 0.003899291, which suggests that the model does not overfit and still predicts unseen data well.

CGPA is the most significant predictor (p-value < 2e-16) and has the highest correlation with Chance.of.Admit. According to the model, students with a high CGPA combined with high GRE and TOEFL scores, strong letters of recommendation, and research experience are the most likely to be admitted to a master's program.

Multiple Linear Regression Analysis - Graduate Admission Dataset

Laurensius Wiwiek Winarta

2020-02-24