Business Problem
Context
This dataset was created for predicting graduate admissions from an Indian perspective.
Content
The dataset contains several parameters that are considered important when applying to Masters programs. The parameters included are:
GRE Scores (out of 340)
TOEFL Scores (out of 120)
University Rating (out of 5)
Statement of Purpose and Letter of Recommendation Strength (out of 5)
Undergraduate GPA (out of 10)
Research Experience (either 0 or 1)
Chance of Admit (ranging from 0 to 1)
Acknowledgements
This dataset is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya.
Inspiration
This dataset was built with the purpose of helping students in shortlisting universities with their profiles. The predicted output gives them a fair idea about their chances for a particular university.
The Goal
The goal of the modeling is to predict the probability of a student being admitted to the university.
Preparation
Load the libraries
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)
library(inspectdf)
options(scipen = 100, max.print = 1e+06)
Read the dataset
univ_data <- read.csv("archive/Admission_Predict.csv")
rmarkdown::paged_table(head(univ_data))
Checking dataset column types
str(univ_data)
## 'data.frame': 400 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
Because Research and University.Rating are categorical variables, we are going to remove them. We will also not use Serial.No., because it is just the student ID and carries no information about the chance of being admitted.
Check na values
colSums(is.na(univ_data))
## Serial.No. GRE.Score TOEFL.Score University.Rating
## 0 0 0 0
## SOP LOR CGPA Research
## 0 0 0 0
## Chance.of.Admit
## 0
The data has no NA values, so we can move on to the next step.
Summary of data
summary(univ_data)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:100.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :200.5 Median :317.0 Median :107.0 Median :3.000
## Mean :200.5 Mean :316.8 Mean :107.4 Mean :3.087
## 3rd Qu.:300.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :400.0 Max. :340.0 Max. :120.0 Max. :5.000
## SOP LOR CGPA Research
## Min. :1.0 Min. :1.000 Min. :6.800 Min. :0.0000
## 1st Qu.:2.5 1st Qu.:3.000 1st Qu.:8.170 1st Qu.:0.0000
## Median :3.5 Median :3.500 Median :8.610 Median :1.0000
## Mean :3.4 Mean :3.453 Mean :8.599 Mean :0.5475
## 3rd Qu.:4.0 3rd Qu.:4.000 3rd Qu.:9.062 3rd Qu.:1.0000
## Max. :5.0 Max. :5.000 Max. :9.920 Max. :1.0000
## Chance.of.Admit
## Min. :0.3400
## 1st Qu.:0.6400
## Median :0.7300
## Mean :0.7244
## 3rd Qu.:0.8300
## Max. :0.9700
We now remove the Research, University.Rating, and Serial.No. columns.
univ_data_clean <- univ_data %>% select(-c(Research, University.Rating, Serial.No.))
Exploratory Data Analysis
Plot the Pearson correlation between features
ggcorr(univ_data_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
Inspect the distribution of the data
univ_data_clean %>%
inspect_num() %>%
show_plot()
Almost all of the feature distributions look approximately normal.
Plot chance of admit with other feature
plot(univ_data_clean$GRE.Score, univ_data_clean$Chance.of.Admit)
plot(univ_data_clean$TOEFL.Score, univ_data_clean$Chance.of.Admit)
plot(univ_data_clean$SOP, univ_data_clean$Chance.of.Admit)
plot(univ_data_clean$LOR, univ_data_clean$Chance.of.Admit)
plot(univ_data_clean$CGPA, univ_data_clean$Chance.of.Admit)
Each feature looks roughly linear with respect to Chance of Admit, and there are no obvious outliers in the data.
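As an optional alternative (a sketch; its output is not shown here), the same five scatter plots can be drawn in a single faceted ggplot by pivoting the predictors into long format:
univ_data_clean %>%
  pivot_longer(cols = -Chance.of.Admit, names_to = "feature", values_to = "value") %>% # reshape predictors to long format
  ggplot(aes(x = value, y = Chance.of.Admit)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ feature, scales = "free_x") + # one panel per predictor
  theme_minimal()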
Split data
RNGkind(sample.kind = "Rounding") # make set.seed() results reproducible between R 3.x and 4.x
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417) # lock the random numbers produced by sample()
univ_split <- sample(nrow(univ_data_clean), nrow(univ_data_clean)*0.80) # 80% of the data
univ_train <- univ_data_clean[univ_split, ] # 80% training data
univ_test <- univ_data_clean[-univ_split, ] # 20% test data
We split the data into training and test sets with an 80:20 proportion.
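As a side note (a sketch, not the split used in this analysis), caret also provides createDataPartition(), which samples so that the distribution of the target stays similar between train and test; idx, univ_train_alt, and univ_test_alt are hypothetical names:
idx <- createDataPartition(univ_data_clean$Chance.of.Admit, p = 0.8, list = FALSE) # balanced 80% sample on the target
univ_train_alt <- univ_data_clean[idx, ]  # alternative 80% training set
univ_test_alt  <- univ_data_clean[-idx, ] # alternative 20% test set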
Plot train and test data distribution
univ_train %>%
inspect_num() %>%
show_plot()
univ_test %>%
inspect_num() %>%
show_plot()
After splitting, the train and test distributions still look approximately normal.
Modeling
Make linear regression model
model_rm <- lm(formula = Chance.of.Admit ~ ., data = univ_train)
summary(model_rm)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = univ_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28266 -0.02316 0.00914 0.03661 0.16555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5480609 0.1233053 -12.555 < 0.0000000000000002 ***
## GRE.Score 0.0027677 0.0006627 4.176 0.0000384 ***
## TOEFL.Score 0.0028498 0.0012089 2.357 0.019 *
## SOP -0.0005262 0.0057864 -0.091 0.928
## LOR 0.0246474 0.0061795 3.989 0.0000828 ***
## CGPA 0.1170762 0.0134516 8.704 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06499 on 314 degrees of freedom
## Multiple R-squared: 0.8019, Adjusted R-squared: 0.7987
## F-statistic: 254.2 on 5 and 314 DF, p-value: < 0.00000000000000022
- Regression model (a worked prediction example follows this list):
chance of admit = -1.5480609 + 0.0027677 * gre + 0.0028498 * toefl - 0.0005262 * sop + 0.0246474 * lor + 0.1170762 * cgpa
- Intercept: when gre, toefl, sop, lor, and cgpa are all 0, the predicted chance of admit is -1.55. This has no practical interpretation, since a probability cannot be negative and no applicant has all-zero scores.
- Coefficient slopes: gre, toefl, lor, and cgpa are positively related to chance of admit, while sop has a small (and not significant) negative coefficient.
- Effect of the predictor variables on the target
- H0: gre, toefl, lor, cgpa have no effect on chance of admit
- H1: gre, toefl, lor, cgpa have an effect on chance of admit
Reject H0 if the p-value < alpha (5%). From the output, the p-values for these predictors are < 0.05, so we conclude that gre, toefl, lor, and cgpa significantly affect chance of admit.
- Goodness of fit
Multiple R-squared = 0.8019, so the predictors explain 80.19% of the variance in chance of admit.
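To make the equation concrete, here is a worked example for a hypothetical applicant (GRE = 320, TOEFL = 110, SOP = 4, LOR = 4, CGPA = 9; these values are made up, not taken from the dataset):
new_applicant <- data.frame(GRE.Score = 320, TOEFL.Score = 110, SOP = 4, LOR = 4, CGPA = 9) # hypothetical applicant
predict(model_rm, newdata = new_applicant) # predicted chance of admit
# by hand: -1.5480609 + 0.0027677*320 + 0.0028498*110 - 0.0005262*4 + 0.0246474*4 + 0.1170762*9, which is about 0.80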
Evaluation
Make predictions and compute the MSE, RMSE, and MAE
predicted_univ <- predict(object = model_rm, newdata = univ_test)
MSE(y_pred = predicted_univ, y_true = univ_test$Chance.of.Admit)
## [1] 0.003962463
RMSE(y_pred = predicted_univ, y_true = univ_test$Chance.of.Admit)
## [1] 0.0629481
MAE(y_pred = predicted_univ, y_true = univ_test$Chance.of.Admit)
## [1] 0.04393572
The model has MSE, RMSE, and MAE values of 0.003962463, 0.0629481, and 0.04393572 respectively.
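For reference, the same three metrics can be reproduced directly from the prediction errors (a minimal sketch of the formulas MLmetrics implements; err is just a helper name):
err <- univ_test$Chance.of.Admit - predicted_univ # prediction errors on the test set
mean(err^2)        # MSE: mean squared error
sqrt(mean(err^2))  # RMSE: root mean squared error
mean(abs(err))     # MAE: mean absolute error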
Plot distribution of residual
hist(model_rm$residuals)
The residual distribution looks good: it is centered around 0 and resembles a normal distribution.
Plot predicted value and actual value
univ_test$predicted <- predicted_univ
univ_admition <- univ_test %>%
pivot_longer(cols = c(predicted, Chance.of.Admit), names_to = "var", values_to = "value")
ggplot(data = univ_admition, aes(x = GRE.Score, y = value)) +
geom_point(aes(col = var)) +
labs(title = "Chance Of admit VS GRE Score",
col = NULL,
x = "GRE Score",
y = "Chance Of admit") +
theme_minimal()
Most predicted values are close to the actual values, except for a few predictions in extreme positions such as around (315, 0.4), where there are no actual values nearby; this shows the model does not fully capture the variance of the data.
Model Improvement: Backward Elimination
Using backward elimination
backward <- step(object = model_rm, trace = 0)
summary(backward)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA, data = univ_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.282824 -0.023104 0.009397 0.036541 0.165416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5457992 0.1205809 -12.820 < 0.0000000000000002 ***
## GRE.Score 0.0027709 0.0006608 4.193 0.0000358 ***
## TOEFL.Score 0.0028322 0.0011914 2.377 0.018 *
## LOR 0.0243889 0.0054785 4.452 0.0000118 ***
## CGPA 0.1168115 0.0131122 8.909 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06489 on 315 degrees of freedom
## Multiple R-squared: 0.8019, Adjusted R-squared: 0.7994
## F-statistic: 318.7 on 4 and 315 DF, p-value: < 0.00000000000000022
The model did not improve at all; it still has essentially the same R-squared. Next we use k-fold cross-validation to try to improve the model.
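As an optional sanity check (a sketch; output not shown), step() selects by AIC, so the two models can also be compared on that criterion directly even though their R-squared values are the same:
AIC(model_rm, backward) # lower AIC indicates the preferred model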
Using k-fold cross validation
set.seed(125)
train_control <- trainControl(method = "cv",
number = 10)
model_rm_2 <- train( Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA,
method="lm",
data=univ_train,
trControl = train_control)
print(model_rm_2)
## Linear Regression
##
## 320 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 287, 288, 288, 287, 289, 288, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.06493734 0.8048766 0.04720543
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The model improved slightly, from an R-squared of 0.8019 to a cross-validated R-squared of 0.8049.
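The fold-by-fold results behind that number are stored by caret and can be inspected directly (a sketch; output not shown):
model_rm_2$resample                    # RMSE, Rsquared, and MAE for each of the 10 folds
summary(model_rm_2$resample$Rsquared)  # spread of R-squared across the folds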
Make new predictions and compute the MSE, RMSE, and MAE
predicted_univ_1 <- predict(object = model_rm_2, newdata = univ_test)
MSE(y_pred = predicted_univ_1, y_true = univ_test$Chance.of.Admit)
## [1] 0.003961284
RMSE(y_pred = predicted_univ_1, y_true = univ_test$Chance.of.Admit)
## [1] 0.06293873
MAE(y_pred = predicted_univ_1, y_true = univ_test$Chance.of.Admit)
## [1] 0.04391287
Plot the model prediction along with confidence interval
univ_test$predicted2 <- predicted_univ_1
univ_admition2 <- univ_test %>%
pivot_longer(cols = c(predicted2, Chance.of.Admit), names_to = "var", values_to = "value")
ggplot(data = univ_admition2, aes(x = GRE.Score, y = value)) +
geom_point(aes(col = var)) +
labs(title = "Chance Of admit VS GRE Score",
col = NULL,
x = "GRE Score",
y = "Chance Of admit") +
geom_smooth(method = "lm", level = 0.95) +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
The model seems slightly better because most points in the graph are close to the predicted values, but it still struggles with the variance of the data.
Check Assumptions
1. No-Multicolinearity
vif(model_rm_2$finalModel)
## GRE.Score TOEFL.Score LOR CGPA
## 4.429674 3.968426 1.854264 4.729297
The model does not suffer from multicollinearity because all VIF values are below 10.
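As an illustration of what VIF measures (a sketch, assuming the same training data): regress one predictor on the others and apply VIF = 1 / (1 - R^2); the result should match the vif() value for CGPA above. r2_cgpa is just a helper name.
r2_cgpa <- summary(lm(CGPA ~ GRE.Score + TOEFL.Score + LOR, data = univ_train))$r.squared # R^2 of CGPA regressed on the other predictors
1 / (1 - r2_cgpa) # manual VIF for CGPA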
2. Linearity
ggcorr(univ_data_clean, label = T)
All predictors show a roughly linear relationship with chance of admit.
3. Normality Error
hist(resid(model_rm_2))
The errors look approximately normally distributed, with most residuals concentrated around 0.
4. Shapiro-Wilk Test
Shapiro-Wilk hypotheses: - H0: the errors/residuals are normally distributed - H1: the errors/residuals are not normally distributed
shapiro.test(resid(model_rm_2))
##
## Shapiro-Wilk normality test
##
## data: resid(model_rm_2)
## W = 0.92778, p-value = 0.00000000002496
Because the p-value = 0.00000000002496 < 0.05, we reject H0: the residuals are not normally distributed, so the model does not fulfill the normality assumption.
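A Q-Q plot is another way to see this departure from normality (a sketch; the plot is not part of the original output):
qqnorm(resid(model_rm_2)) # residual quantiles vs theoretical normal quantiles
qqline(resid(model_rm_2)) # reference line; points far from it indicate non-normality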
5. Homoscedasticity
Breusch-Pagan hypotheses: - H0: homoscedasticity - H1: heteroscedasticity
bptest(model_rm_2$finalModel)
##
## studentized Breusch-Pagan test
##
## data: model_rm_2$finalModel
## BP = 13.134, df = 4, p-value = 0.01064
Because the p-value = 0.01064 < 0.05, we reject H0: the model shows heteroscedasticity, so the homoscedasticity assumption is not satisfied.
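A residuals-versus-fitted plot is a visual counterpart to the Breusch-Pagan test (a sketch; the plot is not shown here); a funnel or fan shape would confirm the heteroscedasticity the test detected:
plot(fitted(model_rm_2$finalModel), resid(model_rm_2$finalModel), xlab = "Fitted values", ylab = "Residuals") # look for non-constant spread
abline(h = 0, lty = 2) # reference line at zero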
Conclusion
The model has an R-squared of about 0.804 and is able to predict the admission probability on the test data with an MAE of about 0.0439. The model is therefore reasonably capable of profiling a student's chance of admission given new data, although the normality and homoscedasticity checks suggest there is still room for improvement.