Libraries used in this project: tidyverse (ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, forcats), plotly, GGally, funModeling (which also attaches Hmisc, lattice, survival, and Formula), lmtest (with zoo), car (with carData), and MLmetrics.
Background
This is a learn-by-building project that predicts a student's chance of admission to a university Masters program based on several academic performance measures, using the Ordinary Least Squares (OLS) multiple linear regression method.
Importing Dataset
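A minimal sketch of the import step (the CSV file name and the object name admission are assumptions, since the code chunk itself is not shown):

admission <- read.csv("Admission_Predict.csv")  # file name assumed
str(admission)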
## 'data.frame': 400 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
Context
This dataset was created for predicting graduate admissions from an Indian perspective.
Content
The dataset contains several parameters that are considered important when applying for Masters programs.
The parameters included are as follows:
1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose / SOP (out of 5)
5. Letter of Recommendation Strength / LOR (out of 5)
6. Undergraduate GPA (out of 10)
7. Research Experience (either 0 or 1)
8. Chance of Admit (ranging from 0 to 1)
Inspecting Dataset
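The missing-value counts below can be reproduced with a simple column-wise check (admission is the assumed object name from the import step):

colSums(is.na(admission))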
## Serial.No. GRE.Score TOEFL.Score University.Rating
## 0 0 0 0
## SOP LOR CGPA Research
## 0 0 0 0
## Chance.of.Admit
## 0
There are no NA or missing values in any column of the dataset.
The variable Serial.No. is excluded from the predictor variables.
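A hedged sketch of this preparation step: drop the index column and convert University.Rating and Research to factors (the object name admission_new is taken from the ggcorr() warning later in the document; the dplyr verbs are an assumption):

admission_new <- admission %>%
  select(-Serial.No.) %>%                                # exclude the serial number
  mutate(University.Rating = as.factor(University.Rating),
         Research = as.factor(Research))                 # integer -> factor
str(admission_new)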
## 'data.frame': 400 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: Factor w/ 5 levels "1","2","3","4",..: 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
The variables University.Rating and Research are converted from integer to factor type.
Solving Business Problem
The model addresses the business problem of predicting a student's chance of admission to a university Masters program based on several academic performance measures. It is developed using the following variables:
Target variable: Chance.of.Admit
Predictor variables: GRE.Score, TOEFL.Score, University.Rating, Research, SOP, LOR, CGPA
Exploratory Data Analysis
Checking correlation
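The correlation matrix plot is produced with ggcorr() from GGally; the call is visible in the warning below:

ggcorr(admission_new, label = T)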
## Warning in ggcorr(admission_new, label = T): data in column(s)
## 'University.Rating', 'Research' are not numeric and were ignored

The target variable Chance.of.Admit has the highest correlation with the CGPA predictor variable.
Checking outlier
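The source does not show the plotting code; a minimal sketch that draws one boxplot per numeric variable (the reshaping approach is an assumption):

admission_new %>%
  select_if(is.numeric) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free")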
(Boxplots of the numeric predictor variables)
There are outliers in the predictor variables: GRE.Score, TOEFL.Score, SOP, LOR, and CGPA.
Preparing Train and Test Dataset
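The splitting code is not shown in the source; a minimal sketch of a reproducible 80/20 split into admission.train and admission.test (the 80% proportion matches the 320 training rows implied by the model output below; the seed value is an assumption):

set.seed(100)  # seed value assumed
idx <- sample(nrow(admission_new), round(0.8 * nrow(admission_new)))
admission.train <- admission_new[idx, ]
admission.test  <- admission_new[-idx, ]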
Developing Model
The model is developed using multiple linear regression with the Ordinary Least Squares (OLS) method.
Target and predictor variables
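A sketch of the two models whose output appears below: an intercept-only baseline and a full model with all predictors (the object names model_base and model_all are assumptions; the formulas and admission.train come from the printed calls):

model_base <- lm(Chance.of.Admit ~ 1, data = admission.train)   # intercept-only model
model_all  <- lm(Chance.of.Admit ~ ., data = admission.train)   # all predictors
model_base
model_all
summary(model_all)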
##
## Call:
## lm(formula = Chance.of.Admit ~ 1, data = admission.train)
##
## Coefficients:
## (Intercept)
## 0.727
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission.train)
##
## Coefficients:
## (Intercept) GRE.Score TOEFL.Score University.Rating2
## -1.1070932 0.0013514 0.0033989 -0.0231128
## University.Rating3 University.Rating4 University.Rating5 SOP
## -0.0185791 -0.0144997 0.0045554 0.0004671
## LOR CGPA Research1
## 0.0224159 0.1115688 0.0310666
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.256292 -0.022879 0.009247 0.037961 0.155317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1070932 0.1452922 -7.620 3.14e-13 ***
## GRE.Score 0.0013514 0.0006607 2.046 0.041647 *
## TOEFL.Score 0.0033989 0.0012075 2.815 0.005193 **
## University.Rating2 -0.0231128 0.0173197 -1.334 0.183030
## University.Rating3 -0.0185791 0.0187624 -0.990 0.322838
## University.Rating4 -0.0144997 0.0226167 -0.641 0.521928
## University.Rating5 0.0045554 0.0249937 0.182 0.855495
## SOP 0.0004671 0.0063046 0.074 0.940992
## LOR 0.0224159 0.0060540 3.703 0.000253 ***
## CGPA 0.1115688 0.0134894 8.271 4.01e-15 ***
## Research1 0.0310666 0.0087716 3.542 0.000459 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06294 on 309 degrees of freedom
## Multiple R-squared: 0.7919, Adjusted R-squared: 0.7852
## F-statistic: 117.6 on 10 and 309 DF, p-value: < 2.2e-16
The model is then refined using the five most significant predictor variables, i.e. GRE.Score, TOEFL.Score, LOR, CGPA, and Research.
Feature selection using backward method
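A sketch of backward elimination with step(), starting from the full model; the reduced model is stored as model_admnew, the name used in the assumption tests later (using step() is consistent with the heading, but the exact call is an assumption):

model_admnew <- step(model_all, direction = "backward", trace = FALSE)
model_admnew
summary(model_admnew)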
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = admission.train)
##
## Coefficients:
## (Intercept) GRE.Score TOEFL.Score LOR CGPA Research1
## -1.183449 0.001430 0.003394 0.024024 0.115386 0.031796
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = admission.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25935 -0.02378 0.01047 0.03685 0.15385
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1834495 0.1318372 -8.977 < 2e-16 ***
## GRE.Score 0.0014298 0.0006564 2.178 0.030123 *
## TOEFL.Score 0.0033943 0.0011799 2.877 0.004291 **
## LOR 0.0240235 0.0053163 4.519 8.82e-06 ***
## CGPA 0.1153857 0.0129089 8.938 < 2e-16 ***
## Research1 0.0317956 0.0087479 3.635 0.000325 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06304 on 314 degrees of freedom
## Multiple R-squared: 0.7879, Adjusted R-squared: 0.7845
## F-statistic: 233.3 on 5 and 314 DF, p-value: < 2.2e-16
Testing Model
Check normality
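The Shapiro-Wilk test on the residuals can be reproduced as follows (the histogram for a quick visual check is an assumption):

hist(model_admnew$residuals)            # visual check of the residual distribution
shapiro.test(model_admnew$residuals)    # formal normality test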

##
## Shapiro-Wilk normality test
##
## data: model_admnew$residuals
## W = 0.91188, p-value = 9.63e-13
Shapiro-Wilk hypothesis test (expectation: p-value > alpha):
H0: the errors/residuals are normally distributed
H1: the errors/residuals are not normally distributed
Since the p-value < 0.05, H0 is rejected and H1 is accepted. The assumption of normally distributed errors/residuals is not fulfilled.
Check heteroscedasticity
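The studentized Breusch-Pagan test below comes from bptest() in the lmtest package:

bptest(model_admnew)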

##
## studentized Breusch-Pagan test
##
## data: model_admnew
## BP = 20.18, df = 5, p-value = 0.001156
Breusch-Pagan hypothesis test (expectation: p-value > alpha):
H0: the error/residual variance is constant (homoscedasticity)
H1: the error/residual variance is not constant, i.e. it forms a pattern (heteroscedasticity)
Since the p-value < 0.05, H0 is rejected and H1 is accepted. The assumption of homoscedasticity (no heteroscedasticity) is not fulfilled.
Check multicollinearity
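The variance inflation factors below come from vif() in the car package:

vif(model_admnew)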
## GRE.Score TOEFL.Score LOR CGPA Research
## 4.298824 3.883285 1.686968 4.372284 1.536764
All VIF values are below 10, meaning there is no multicollinearity among the predictor variables. The assumption of no multicollinearity is fulfilled.
Predicting Target & Error
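A sketch of the prediction and error computation (the object name pred_test is an assumption; MSE() is from the MLmetrics package loaded above, though a plain mean of squared errors gives the same result):

pred_test <- predict(model_admnew, newdata = admission.test)
pred_test

MSE(y_pred = model_admnew$fitted.values, y_true = admission.train$Chance.of.Admit)  # train MSE
MSE(y_pred = pred_test, y_true = admission.test$Chance.of.Admit)                    # test MSE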
## 9 14 21 23 24 26 33 38
## 0.5536933 0.6524331 0.6172316 0.9273448 0.9571719 0.9576066 0.9248813 0.5499511
## 39 41 55 59 63 64 65 70
## 0.5090429 0.6572572 0.6574909 0.4459953 0.6576437 0.7147760 0.7459446 0.8606959
## 73 83 84 92 94 99 102 109
## 0.8945954 0.8512208 0.8872059 0.5412483 0.5892766 0.9012349 0.6280558 0.9177880
## 111 116 118 120 126 129 139 151
## 0.6697687 0.8025761 0.5050392 0.7729939 0.6880304 0.8167068 0.8279762 0.8970347
## 152 161 167 171 173 174 181 185
## 0.9076792 0.5715330 0.6758528 0.6490552 0.8557075 0.8668470 0.6121191 0.6842700
## 189 200 203 204 206 237 244 249
## 0.8760507 0.7380793 0.9933761 0.9921935 0.5291519 0.8593892 0.8151424 0.8045434
## 262 265 271 272 274 276 277 283
## 0.6452235 0.7572455 0.6628111 0.5248959 0.5907581 0.7880450 0.9008108 0.7525658
## 292 293 297 326 327 331 333 337
## 0.5478452 0.5520606 0.7074586 0.8589228 0.5569347 0.7707612 0.6558994 0.7203326
## 340 342 347 353 360 369 377 379
## 0.7701296 0.7826989 0.5100544 0.6351003 0.6486009 0.5121967 0.4724204 0.5251845
## 381 382 386 388 389 390 395 398
## 0.7778565 0.7453340 0.9776712 0.6306108 0.5170776 0.7424197 0.8566256 0.9124234
MSE from the training data
## [1] 0.003899291
MSE from the test data
## [1] 0.004521008
Summary
The model model_admnew, fitted on the training data, has an adjusted R-squared of 0.7845 and an MSE of 0.003899291.
The assumption tests on model model_admnew give the following results:
1. The normality test gives p-value = 9.63e-13, so the assumption of normally distributed errors is not fulfilled.
2. The heteroscedasticity test gives p-value = 0.001156, so the assumption of homoscedasticity is not fulfilled.
3. The multicollinearity test gives VIF < 10 for all predictors, so the assumption of no multicollinearity is fulfilled.
The predictions on the test data give an MSE of 0.004521008, only slightly higher than the training MSE of 0.003899291, indicating that the model still generalizes reasonably well to unseen data.
CGPA has the strongest association with the target (p-value < 2e-16); students with a high GPA who also have high GRE and TOEFL scores, a strong letter of recommendation, and research experience are the most likely to be admitted to the Masters program.