1 Greetings

Welcome to my Rmd. I created this Rmd to improve my understanding of linear regression models for machine learning using the R language.

2 Brief explanation about the data

The dataset contains several parameters that are considered important when applying for a Master's program.

Column insights:

1. GRE Score: Graduate Record Examination score (out of 340)
2. TOEFL Score: The TOEFL (Test of English as a Foreign Language) is a standardized test that measures a test-taker's mastery of the English language (out of 120)
3. University Rating: the rating of the university; the higher the better (out of 5)
4. Statement of Purpose and Letter of Recommendation Strength: a good letter of recommendation should convince the reader that the candidate will be successful in the program they are applying for, so the higher the score the better (out of 5)
5. Undergraduate GPA (CGPA in the dataset): the undergraduate GPA based on all coursework completed for the bachelor's degree (out of 10)
6. Research Experience: indicates whether the candidate has research experience (0 = no, 1 = yes)
7. Chance of Admit: the candidate's chance of being accepted into a Master's program (ranging from 0 to 1)

You may download the dataset from Kaggle: https://www.kaggle.com/mohansacharya/graduate-admissions

3 Business Question

Build a regression model to predict the chance of admission (Chance.of.Admit).

4 Data Preparation

4.1 Import necessary libraries

library(tidyverse)  # data wrangling and visualization (dplyr, ggplot2, ...)
library(caret)      # ML helpers, including the RMSE() function
library(plotly)     # interactive plots
library(data.table) # fast data manipulation
library(GGally)     # ggcorr() correlation plot
library(tidymodels) # modeling framework
library(car)        # vif() and durbinWatsonTest()
library(scales)     # axis and number formatting
library(MASS)       # stepAIC() for stepwise model selection
library(lmtest)     # bptest() Breusch-Pagan test

4.2 Read the dataset.

admission <- read.csv("Admission_Predict.csv")
head(admission)

5 Exploratory Data Analysis

5.1 Check the correlation between columns.

The GGally package provides a function called ggcorr() that visualizes the correlation between every pair of numeric columns. It helps determine how strongly each column is related to the target variable, which in this case is Chance.of.Admit.

ggcorr(admission, label = T)

From the ggcorr() visualization, almost all of the columns have a strong positive correlation with Chance.of.Admit.
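To complement the plot, the exact correlation of each column with Chance.of.Admit can be printed as well; a minimal sketch using base R's cor():

# Correlation of every column with the target, sorted from strongest to weakest
sort(cor(admission)[, "Chance.of.Admit"], decreasing = TRUE)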

5.2 Check missing values

Let's check whether there are any missing values.

colSums(is.na(admission))
##        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
##                 0                 0                 0                 0 
##               SOP               LOR              CGPA          Research 
##                 0                 0                 0                 0 
##   Chance.of.Admit 
##                 0

Good! There are no missing values.

5.3 Check data types

glimpse(admission)
## Rows: 400
## Columns: 9
## $ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
## $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
## $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
## $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
## $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
## $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~

5.4 Drop unnecessary columns

Column Serial.No. is just a unique identifier for each applicant, so it can be dropped.

# dplyr::select is used explicitly because MASS also provides a select() function
admission_new <- admission %>% 
  dplyr::select(-Serial.No.)

head(admission_new)

6 Build Model

6.1 Model Fitting

As a starting point, a simple linear regression model can be built with CGPA as the only predictor, since this variable has the highest positive correlation with the target variable Chance.of.Admit.

model <- lm(formula = Chance.of.Admit ~ CGPA, #formula = target variable ~ predictor
            data = admission_new) #data set

summary(model)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission_new)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.274575 -0.030084  0.009443  0.041954  0.180734 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.07151    0.05034  -21.29   <2e-16 ***
## CGPA         0.20885    0.00584   35.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06957 on 398 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.762 
## F-statistic:  1279 on 1 and 398 DF,  p-value: < 2.2e-16

The stepAIC() function from the MASS package (similar to base R's step()) shows which columns (predictors) have the most influence by comparing the AIC values of candidate models; the lower the AIC, the better. First, fit a model that uses all columns as predictors, then run stepAIC() on it.

model_all <- lm(formula = Chance.of.Admit ~ ., 
            data = admission_new) 

summary(model_all)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26259 -0.02103  0.01005  0.03628  0.15928 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.2594325  0.1247307 -10.097  < 2e-16 ***
## GRE.Score          0.0017374  0.0005979   2.906  0.00387 ** 
## TOEFL.Score        0.0029196  0.0010895   2.680  0.00768 ** 
## University.Rating  0.0057167  0.0047704   1.198  0.23150    
## SOP               -0.0033052  0.0055616  -0.594  0.55267    
## LOR                0.0223531  0.0055415   4.034  6.6e-05 ***
## CGPA               0.1189395  0.0122194   9.734  < 2e-16 ***
## Research           0.0245251  0.0079598   3.081  0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared:  0.8035, Adjusted R-squared:    0.8 
## F-statistic: 228.9 on 7 and 392 DF,  p-value: < 2.2e-16
stepAIC(model_all, direction = "both")
## Start:  AIC=-2193.9
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - SOP                1   0.00144 1.5962 -2195.5
## - University.Rating  1   0.00584 1.6006 -2194.4
## <none>                           1.5948 -2193.9
## - TOEFL.Score        1   0.02921 1.6240 -2188.6
## - GRE.Score          1   0.03435 1.6291 -2187.4
## - Research           1   0.03862 1.6334 -2186.3
## - LOR                1   0.06620 1.6609 -2179.6
## - CGPA               1   0.38544 1.9802 -2109.3
## 
## Step:  AIC=-2195.54
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - University.Rating  1   0.00464 1.6008 -2196.4
## <none>                           1.5962 -2195.5
## + SOP                1   0.00144 1.5948 -2193.9
## - TOEFL.Score        1   0.02806 1.6242 -2190.6
## - GRE.Score          1   0.03565 1.6318 -2188.7
## - Research           1   0.03769 1.6339 -2188.2
## - LOR                1   0.06983 1.6660 -2180.4
## - CGPA               1   0.38660 1.9828 -2110.8
## 
## Step:  AIC=-2196.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## <none>                           1.6008 -2196.4
## + University.Rating  1   0.00464 1.5962 -2195.5
## + SOP                1   0.00024 1.6006 -2194.4
## - TOEFL.Score        1   0.03292 1.6338 -2190.2
## - GRE.Score          1   0.03638 1.6372 -2189.4
## - Research           1   0.03912 1.6400 -2188.7
## - LOR                1   0.09133 1.6922 -2176.2
## - CGPA               1   0.43201 2.0328 -2102.8
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admission_new)
## 
## Coefficients:
## (Intercept)    GRE.Score  TOEFL.Score          LOR         CGPA     Research  
##   -1.298464     0.001782     0.003032     0.022776     0.121004     0.024577

The model with the lowest AIC found by stepAIC() keeps the predictors GRE.Score, TOEFL.Score, LOR, CGPA and Research. Now that we know which columns are the most influential, assign those predictors to a new model.

model_aic <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research, data = admission_new)
summary(model_aic)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admission_new)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263542 -0.023297  0.009879  0.038078  0.159897 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.2984636  0.1172905 -11.070  < 2e-16 ***
## GRE.Score    0.0017820  0.0005955   2.992  0.00294 ** 
## TOEFL.Score  0.0030320  0.0010651   2.847  0.00465 ** 
## LOR          0.0227762  0.0048039   4.741 2.97e-06 ***
## CGPA         0.1210042  0.0117349  10.312  < 2e-16 ***
## Research     0.0245769  0.0079203   3.103  0.00205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared:  0.8027, Adjusted R-squared:  0.8002 
## F-statistic: 320.6 on 5 and 394 DF,  p-value: < 2.2e-16

7 Prediction

admission_new$prediction <- predict(model_aic, admission_new)
head(admission_new)
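Beyond the training data, the fitted model can also score a new applicant. The minimal sketch below uses a hypothetical applicant profile (the values are made up for illustration, not taken from the dataset):

# Hypothetical applicant; all values are illustrative only
new_applicant <- data.frame(GRE.Score = 320,
                            TOEFL.Score = 110,
                            LOR = 4.0,
                            CGPA = 9.0,
                            Research = 1)

predict(model_aic, newdata = new_applicant)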

8 Evaluation

8.1 Model Prediction Evaluation

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells how concentrated the data is around the line of best fit. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results.

RMSE of model_aic (the model with predictors selected using AIC)

RMSE(pred = admission_new$prediction, obs = admission_new$Chance.of.Admit)
## [1] 0.06326207
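As a sanity check, the same value can be reproduced by hand, since RMSE is simply the square root of the mean squared difference between predictions and observations:

# Manual RMSE: sqrt of the mean squared residual (matches caret's RMSE())
sqrt(mean((admission_new$prediction - admission_new$Chance.of.Admit)^2))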

The RMSE of model_aic can be interpreted as follows: on average, the predictions deviate from the actual Chance.of.Admit by about 0.063. To check whether model_aic is good enough, let's compare it with the simple model that uses only CGPA as its predictor.

RMSE of model (the simple model with CGPA as the only predictor)

# Overwrite the prediction column with the CGPA-only model's predictions
admission_new$prediction <- predict(model, admission_new)

RMSE(pred = admission_new$prediction, obs = admission_new$Chance.of.Admit)
## [1] 0.0693927

model_aic (the model with predictors selected using AIC) produces a slightly lower RMSE than the simple CGPA-only model; since a smaller RMSE means a better fit, model_aic is the better of the two.
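For completeness, the RMSE of model_all (the model with all columns as predictors) can be computed the same way; a minimal sketch that stores the predictions in a separate vector so the prediction column is not overwritten again:

# RMSE of the full model, for comparison
pred_all <- predict(model_all, admission_new)
RMSE(pred = pred_all, obs = admission_new$Chance.of.Admit)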

8.2 Assumptions Checking

As a statistical (parametric) model, linear regression has several assumptions that need to be fulfilled so that the interpretation obtained is not biased. These assumptions only need to be fulfilled if the purpose of building the linear regression model is interpretation, i.e. seeing the effect of each predictor on the target variable.

1. Linearity

The first assumption of linear regression is that there is a linear relationship between the independent variable, x, and the dependent variable, y.

The easiest way to detect whether this assumption is met is to create a scatter plot of x vs. y, which lets us visually check whether there is a linear relationship between the two variables; if the points could plausibly fall along a straight line, the assumption is met (see the sketch below). Another common diagnostic is a plot of residuals against fitted values: for a linear model, the residuals should scatter randomly around zero without a clear pattern.
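A minimal sketch of the x vs. y check, using CGPA (the strongest single predictor) against Chance.of.Admit:

# Scatter plot of CGPA against Chance.of.Admit with a linear trend line
admission_new %>% 
  ggplot(aes(x = CGPA, y = Chance.of.Admit)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)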

# Collect the residuals and fitted values of model_aic into a data frame for plotting
plot <- data.frame(residual = model_aic$residuals, 
                  fitted = model_aic$fitted.values)

plot %>%
  ggplot(aes(fitted, residual)) + 
  geom_point() + 
  geom_smooth() + 
  geom_hline(aes(yintercept = 0)) + 
  theme(panel.grid = element_blank(), 
       panel.background = element_blank())

The residuals in the plot above show a visible pattern rather than a random scatter around zero, so the model may not be linear enough.

2. Normality Test

The second assumption of linear regression is that the residuals follow a normal distribution. This can easily be checked using the Shapiro-Wilk normality test.

shapiro.test(model_aic$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_aic$residuals
## W = 0.92193, p-value = 1.443e-13

With a p-value < 0.05, the null hypothesis of normality is rejected: the residuals do not follow a normal distribution.
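A visual check can complement the test; a minimal sketch plotting the distribution of the residuals:

# Histogram of the residuals of model_aic
hist(model_aic$residuals, breaks = 30,
     main = "Residuals of model_aic", xlab = "Residual")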

3. Autocorrelation

The next assumption of linear regression is that the residuals are independent. This is mostly relevant when working with time series data. Ideally, we don’t want there to be a pattern among consecutive residuals. For example, residuals shouldn’t steadily grow larger as time goes on.

Autocorrelation can be detected using the Durbin-Watson test, with the null hypothesis that there is no autocorrelation.

durbinWatsonTest(model_aic)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.6245931     0.7499111       0
##  Alternative hypothesis: rho != 0

The result shows that the null hypothesis is rejected, meaning the residuals are autocorrelated.

4. Heteroscedasticity

The next assumption of linear regression is that the residuals have constant variance at every level of x. This is known as homoscedasticity. When this is not the case, the residuals are said to suffer from heteroscedasticity.

When heteroscedasticity is present in a regression analysis, the results of the analysis become hard to trust. Specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model doesn’t pick up on this. This makes it much more likely for a regression model to declare that a term in the model is statistically significant, when in fact it is not.

The simplest way to detect heteroscedasticity is by creating a fitted value vs. residual plot.

plot %>% 
  ggplot(aes(fitted, residual)) + 
  geom_point() + 
  theme_light() + 
  geom_hline(aes(yintercept = 0))

A second way to detect heteroscedasticity is the Breusch-Pagan test, whose null hypothesis is that there is no heteroscedasticity (i.e. the residuals have constant variance).

bptest(model_aic)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_aic
## BP = 22.428, df = 5, p-value = 0.0004341

With a p-value < 0.05, the null hypothesis is rejected: heteroscedasticity is present in the model.

5. Multicollinearity

Multicollinearity occurs when the predictor variables in the model have a strong relationship with each other. It can be detected with the VIF (Variance Inflation Factor), a measure of how much the variance of a coefficient is inflated due to multicollinearity.

When a VIF value is greater than 10, multicollinearity is present. If this happens, remove one of the variables with VIF > 10 from the model.

vif(model_aic)
##   GRE.Score TOEFL.Score         LOR        CGPA    Research 
##    4.585053    4.104255    1.829491    4.808767    1.530007

9 Conclusion

Variables that are useful to explain the variance in Chance.of.Admit are GRE.Score, TOEFL.Score, LOR, CGPA and Research. The adjusted R-squared of the model is high: the selected predictors explain about 80.02% of the variance in Chance.of.Admit. The accuracy of the model in predicting Chance.of.Admit is measured with RMSE; model_aic achieves an RMSE of about 0.063. All VIF values are below 10, so there is no multicollinearity, but the normality, autocorrelation and heteroscedasticity assumptions are violated, so the coefficient estimates should be interpreted with caution.