Load Libraries

library(tidyverse)
library(ggplot2)
library(GGally)
library(leaps)
library(MLmetrics)
library(lmtest)
library(car)

Explanation

Hi!! Welcome to my LBB. In this LBB I’m going to use the University Admission Predict dataset and make predictions using linear regression. Enjoy!!

Input Data

data <- read_csv("Admission_Predict.csv")

Data Inspection

head(data)
#> # A tibble: 6 x 9
#>   `Serial No.` `GRE Score` `TOEFL Score` `University Rat…   SOP   LOR  CGPA
#>          <dbl>       <dbl>         <dbl>            <dbl> <dbl> <dbl> <dbl>
#> 1            1         337           118                4   4.5   4.5  9.65
#> 2            2         324           107                4   4     4.5  8.87
#> 3            3         316           104                3   3     3.5  8   
#> 4            4         322           110                3   3.5   2.5  8.67
#> 5            5         314           103                2   2     3    8.21
#> 6            6         330           115                5   4.5   3    9.34
#> # … with 2 more variables: Research <dbl>, `Chance of Admit` <dbl>
dim(data)
#> [1] 400   9
names(data)
#> [1] "Serial No."        "GRE Score"         "TOEFL Score"      
#> [4] "University Rating" "SOP"               "LOR"              
#> [7] "CGPA"              "Research"          "Chance of Admit"

Data Cleansing & Coercion

glimpse(data)
#> Rows: 400
#> Columns: 9
#> $ `Serial No.`        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
#> $ `GRE Score`         <dbl> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,…
#> $ `TOEFL Score`       <dbl> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,…
#> $ `University Rating` <dbl> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3…
#> $ SOP                 <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,…
#> $ LOR                 <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,…
#> $ CGPA                <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8…
#> $ Research            <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0…
#> $ `Chance of Admit`   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0…

We’re going to remove the Serial No. column because it is only a row identifier and we will not use it in the model.

data <- data %>% select(-1)

Let’s check whether there are any missing values.

colSums(is.na(data))
#>         GRE Score       TOEFL Score University Rating               SOP 
#>                 0                 0                 0                 0 
#>               LOR              CGPA          Research   Chance of Admit 
#>                 0                 0                 0                 0

Great!! There are no missing values.

Next, rename the columns that contain spaces so they are easier to use in model formulas.

data_clean<- data %>% rename(GRE.Score = "GRE Score",
                TOEFL.Score = "TOEFL Score",
                University.Rating = "University Rating",
                Chance.Admit = "Chance of Admit")

Data Exploration

glimpse(data_clean)
#> Rows: 400
#> Columns: 8
#> $ GRE.Score         <dbl> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 3…
#> $ TOEFL.Score       <dbl> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 1…
#> $ University.Rating <dbl> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, …
#> $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3…
#> $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4…
#> $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.0…
#> $ Research          <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
#> $ Chance.Admit      <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.5…

Variable Definitions :

  • GRE Score : GRE Score that a student has achieved
  • TOEFL Score : TOEFL Score that a student has achieved
  • University Rating : The rating of the university (values from 1 to 5 appear in the data above)
  • SOP : The strength of the Statement of Purpose, given as a numeric rating (for example 4.5), not just its presence
  • LOR : The strength of the Letter of Recommendation, given as a numeric rating (for example 4.5), not just its presence
  • CGPA : Cumulative Grade Point Average across the whole curriculum
  • Research : Whether the student has research experience (1) or not (0)
  • Chance.Admit : The probability (between 0 and 1) of a university accepting the student

Now we should check the correlation between the variables.

ggcorr(data_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The plot above shows that every variable is positively correlated with Chance.Admit, and that CGPA has the strongest correlation compared to the other variables. Note that a correlation shows association, not proof of influence.
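To read the same information as numbers instead of a plot, the correlations with the target can be pulled out of the correlation matrix and sorted. This is a minimal sketch on a small synthetic data frame (`toy` and its values are made up for illustration, not the admission data); on the real data you would pass `data_clean` instead.

```r
# Synthetic stand-in for data_clean (hypothetical values, NOT the real dataset)
set.seed(1)
n <- 200
cgpa <- rnorm(n, mean = 8.5, sd = 0.6)
toy <- data.frame(
  CGPA         = cgpa,
  GRE.Score    = 316 + 10 * (cgpa - 8.5) + rnorm(n, sd = 8),
  Chance.Admit = 0.2 * cgpa - 1 + rnorm(n, sd = 0.06)
)

# Pearson correlation of every column with the target
cor_with_target <- cor(toy)[, "Chance.Admit"]

# Drop the target's correlation with itself and sort strongest first
cor_with_target <- cor_with_target[names(cor_with_target) != "Chance.Admit"]
cor_sorted <- sort(cor_with_target, decreasing = TRUE)
cor_sorted
```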

We can now proceed to build our regression model. The target variable will be Chance.Admit.

Cross Validation

set.seed(123)
samplesize <- round(0.7 * nrow(data_clean), 0)
index <- sample(seq_len(nrow(data_clean)), size = samplesize)

data_train <- data_clean[index, ]
data_test <- data_clean[-index, ]
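The split above is a single 70/30 hold-out split. If you want an error estimate that depends less on one particular split, a k-fold variation is one option. The sketch below is only an illustration under assumptions: it uses a synthetic data frame (`toy`), a hypothetical choice of `k = 5`, and a one-predictor model, none of which come from the analysis in this post.

```r
# Synthetic stand-in data (hypothetical, NOT the admission dataset)
set.seed(123)
n <- 200
toy <- data.frame(CGPA = rnorm(n, mean = 8.5, sd = 0.6))
toy$Chance.Admit <- 0.2 * toy$CGPA - 1 + rnorm(n, sd = 0.06)

k <- 5
# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other k - 1 folds, compute RMSE on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train <- toy[fold_id != i, ]
  test  <- toy[fold_id == i, ]
  fit   <- lm(Chance.Admit ~ CGPA, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$Chance.Admit - pred)^2))
})

fold_rmse        # one RMSE per held-out fold
mean(fold_rmse)  # cross-validated RMSE estimate
```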

Data Modelling

Because CGPA has the strongest correlation with the target, let’s first fit a simple linear regression of Chance.Admit on CGPA and check its summary.

model_lm <- lm(Chance.Admit~CGPA, data_train)
model_lm
#> 
#> Call:
#> lm(formula = Chance.Admit ~ CGPA, data = data_train)
#> 
#> Coefficients:
#> (Intercept)         CGPA  
#>     -1.0881       0.2106
summary(model_lm)
#> 
#> Call:
#> lm(formula = Chance.Admit ~ CGPA, data = data_train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.272649 -0.026863  0.008908  0.036815  0.183681 
#> 
#> Coefficients:
#>              Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept) -1.088088   0.060619  -17.95 <0.0000000000000002 ***
#> CGPA         0.210551   0.007001   30.08 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06824 on 278 degrees of freedom
#> Multiple R-squared:  0.7649, Adjusted R-squared:  0.7641 
#> F-statistic: 904.5 on 1 and 278 DF,  p-value: < 0.00000000000000022
plot(data_train$CGPA, data_train$Chance.Admit)
abline(a = model_lm$coefficients[1], b = model_lm$coefficients[2], col = "red")

The model has a Multiple R-squared of 0.7649 and an Adjusted R-squared of 0.7641, so CGPA alone explains about 76% of the variance in Chance.Admit on the training data.
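For reference, both R-squared figures can be reproduced from the residuals. The sketch below uses a synthetic fit (the data and the names `x`, `y`, `fit` are made up for illustration) and then plugs the numbers from the summary above (R-squared 0.7649, 280 training rows, 1 predictor) into the adjusted R-squared formula.

```r
# Synthetic example fit (hypothetical data)
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)    # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
p   <- length(coef(fit)) - 1    # number of predictors, excluding the intercept

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Should agree with the values reported by summary()
c(r2 = r2, adj_r2 = adj_r2)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)

# Same formula with the numbers from the CGPA model above
adj_r2_cgpa <- 1 - (1 - 0.7649) * (280 - 1) / (280 - 1 - 1)
round(adj_r2_cgpa, 4)
```

The adjustment is tiny here because there is only one predictor; it matters more when many predictors are added.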

Now let us look at the second model, where Chance.Admit is regressed on all of the predictors available in the data.

model_lm2 <- lm(Chance.Admit~.,data_train)
summary(model_lm2)
#> 
#> Call:
#> lm(formula = Chance.Admit ~ ., data = data_train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.265746 -0.018221  0.009616  0.034890  0.163523 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)       -1.3410212  0.1488864  -9.007 < 0.0000000000000002 ***
#> GRE.Score          0.0022366  0.0006834   3.273              0.00120 ** 
#> TOEFL.Score        0.0024506  0.0012111   2.023              0.04400 *  
#> University.Rating  0.0082489  0.0057323   1.439              0.15130    
#> SOP               -0.0073628  0.0069045  -1.066              0.28720    
#> LOR                0.0222056  0.0068602   3.237              0.00136 ** 
#> CGPA               0.1167550  0.0144553   8.077   0.0000000000000217 ***
#> Research           0.0209651  0.0095761   2.189              0.02942 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06225 on 272 degrees of freedom
#> Multiple R-squared:  0.8086, Adjusted R-squared:  0.8037 
#> F-statistic: 164.2 on 7 and 272 DF,  p-value: < 0.00000000000000022

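The model has a Multiple R-squared of 0.8086 and an Adjusted R-squared of 0.8037, so it explains about 80% of the variance. University.Rating and SOP are not significant at the 0.05 level.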

Now we proceed with a stepwise method to see which formula is most suitable for our model. With direction = "both", step() starts from the full model and at each step may drop a predictor or add back one that was dropped, choosing the move that gives the lowest AIC. It stops when no move lowers the AIC any further.

step(model_lm2,direction = "both")
#> Start:  AIC=-1547.05
#> Chance.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     SOP + LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> - SOP                1  0.004406 1.0583 -1547.9
#> <none>                           1.0539 -1547.0
#> - University.Rating  1  0.008023 1.0619 -1546.9
#> - TOEFL.Score        1  0.015864 1.0697 -1544.9
#> - Research           1  0.018571 1.0724 -1544.2
#> - LOR                1  0.040595 1.0945 -1538.5
#> - GRE.Score          1  0.041497 1.0954 -1538.2
#> - CGPA               1  0.252763 1.3066 -1488.8
#> 
#> Step:  AIC=-1547.88
#> Chance.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> - University.Rating  1  0.005001 1.0633 -1548.6
#> <none>                           1.0583 -1547.9
#> + SOP                1  0.004406 1.0539 -1547.0
#> - TOEFL.Score        1  0.014632 1.0729 -1546.0
#> - Research           1  0.017194 1.0755 -1545.4
#> - LOR                1  0.036362 1.0946 -1540.4
#> - GRE.Score          1  0.044368 1.1026 -1538.4
#> - CGPA               1  0.248848 1.3071 -1490.8
#> 
#> Step:  AIC=-1548.56
#> Chance.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> <none>                           1.0633 -1548.6
#> + University.Rating  1  0.005001 1.0583 -1547.9
#> + SOP                1  0.001383 1.0619 -1546.9
#> - Research           1  0.017141 1.0804 -1546.1
#> - TOEFL.Score        1  0.018814 1.0821 -1545.7
#> - GRE.Score          1  0.045557 1.1088 -1538.8
#> - LOR                1  0.050913 1.1142 -1537.5
#> - CGPA               1  0.284433 1.3477 -1484.2
#> 
#> Call:
#> lm(formula = Chance.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + 
#>     Research, data = data_train)
#> 
#> Coefficients:
#> (Intercept)    GRE.Score  TOEFL.Score          LOR         CGPA     Research  
#>   -1.397213     0.002332     0.002610     0.021675     0.118082     0.020070
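The AIC values printed by step() can be checked by hand. For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients including the intercept. The helper name `step_aic` below is made up for this sketch; the RSS values are copied from the output above, with n = 280 training rows.

```r
# AIC as used by step() for lm fits (helper name is hypothetical)
step_aic <- function(rss, n, edf) n * log(rss / n) + 2 * edf

# Full model: 7 predictors + intercept, RSS taken from the first <none> row
aic_full  <- step_aic(rss = 1.0539, n = 280, edf = 8)

# Final model: 5 predictors + intercept, RSS taken from the last <none> row
aic_final <- step_aic(rss = 1.0633, n = 280, edf = 6)

c(full = aic_full, final = aic_final)

# Cross-check the formula against extractAIC() on a small synthetic fit
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)
same <- all.equal(extractAIC(fit)[2],
                  step_aic(sum(residuals(fit)^2), nrow(d), 2))
same
```

Up to rounding of the printed RSS, the two hand-computed values agree with the start and final AIC shown in the trace. The final model has a slightly larger RSS but a lower AIC because it pays a smaller penalty for having two fewer coefficients.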

Now we refit the formula selected above and call the model “model_both”.

model_both <- lm(formula = Chance.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + 
    Research, data = data_train)
summary(model_both)
#> 
#> Call:
#> lm(formula = Chance.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + 
#>     Research, data = data_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.26845 -0.02046  0.01074  0.03471  0.16417 
#> 
#> Coefficients:
#>               Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) -1.3972133  0.1386860 -10.075 < 0.0000000000000002 ***
#> GRE.Score    0.0023321  0.0006806   3.426             0.000706 ***
#> TOEFL.Score  0.0026100  0.0011854   2.202             0.028508 *  
#> LOR          0.0216748  0.0059840   3.622             0.000348 ***
#> CGPA         0.1180823  0.0137925   8.561 0.000000000000000814 ***
#> Research     0.0200696  0.0095492   2.102             0.036491 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06229 on 274 degrees of freedom
#> Multiple R-squared:  0.8069, Adjusted R-squared:  0.8034 
#> F-statistic:   229 on 5 and 274 DF,  p-value: < 0.00000000000000022

There is only a slight change in adjusted R-squared from the previous model (0.8034 versus 0.8037) even though this model uses two fewer predictors, so we’re going to use this model instead.
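Another way to judge whether dropping predictors costs anything is a partial F-test between the nested models, which anova() performs when given two lm fits. This is an optional check that was not part of the original analysis; the sketch uses synthetic data with two deliberately useless predictors (`noise1`, `noise2`, both made-up names).

```r
# Synthetic data (hypothetical): y depends on x1 and x2 only
set.seed(7)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
toy$y <- 1 + 0.5 * toy$x1 + 0.3 * toy$x2 + rnorm(n, sd = 0.5)

full    <- lm(y ~ x1 + x2 + noise1 + noise2, data = toy)
reduced <- lm(y ~ x1 + x2, data = toy)

# H0: the coefficients of the dropped predictors are all zero.
# A large p-value means the reduced model is not significantly worse.
cmp <- anova(reduced, full)
cmp
```

On the admission data the equivalent call would be anova(model_both, model_lm2).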

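plot(regsubsets(Chance.Admit ~ ., data = data_train), scale = "adjr2")

The plot above shows the best subset of predictors for each model size, ranked by adjusted R-squared. Each row is one model, and a shaded cell means that predictor is included in it.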

Model Evaluation

Performance

After building the model, we now evaluate it: we predict on the test data and then check the model assumptions.

predict.both <- predict(object = model_both, newdata = data_test %>%  select(-Chance.Admit))

RMSE (root mean squared error) is often preferred over MAE (mean absolute error) because it squares the difference between the actual values and the predicted values, meaning that predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package, which we loaded at the start.
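To see the penalty on large errors concretely, here is a small sketch with hand-written `rmse` and `mae` helpers (these are illustrative definitions, not the MLmetrics functions) and two made-up sets of predictions that have the same total absolute error.

```r
# Illustrative helper definitions (not the MLmetrics implementations)
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))
mae  <- function(y_true, y_pred) mean(abs(y_true - y_pred))

y_true     <- c(0.5, 0.5, 0.5, 0.5)
pred_even  <- y_true + c(0.1, 0.1, 0.1, 0.1)  # four small errors
pred_spike <- y_true + c(0, 0, 0, 0.4)        # one large error, same total

# MAE cannot tell the two apart
c(even = mae(y_true, pred_even), spike = mae(y_true, pred_spike))

# RMSE is larger for the predictions with one big miss
c(even = rmse(y_true, pred_even), spike = rmse(y_true, pred_spike))
```

Both sets have an MAE of 0.1, but the RMSE doubles from 0.1 to 0.2 when the error is concentrated in a single prediction.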

RMSE(y_pred = model_both$fitted.values, y_true = data_train$Chance.Admit)*100
#> [1] 6.162306
RMSE(y_pred = predict.both, y_true = data_test$Chance.Admit)*100
#> [1] 6.72332

The error on the training dataset (6.16) is lower than on the test dataset (6.72), suggesting that our model may be slightly overfit. Because the difference is small, we can continue and check the assumptions of our model.

Assumptions

1. Linearity

plot(model_both, 1)

2. Normality of Residuals

hist(model_both$residuals)

shapiro.test(model_both$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_both$residuals
#> W = 0.90714, p-value = 0.000000000003954

The null hypothesis is that the residuals follow a normal distribution. With a p-value < 0.05, we reject the null hypothesis and conclude that our residuals do not follow a normal distribution.
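To get a feel for how the test behaves, the sketch below runs shapiro.test() on two synthetic samples (both made up for illustration): one drawn from a normal distribution and one that is left-skewed, similar in shape to residuals with a long negative tail.

```r
set.seed(10)
normal_resid <- rnorm(300)        # symmetric, bell-shaped sample
skewed_resid <- 1 - rexp(300)     # left-skewed sample with a long negative tail

# p-value for the normal sample (H0: the sample is normally distributed)
p_normal <- shapiro.test(normal_resid)$p.value

# p-value for the skewed sample
p_skewed <- shapiro.test(skewed_resid)$p.value

c(normal = p_normal, skewed = p_skewed)
```

The skewed sample is rejected decisively. Keep in mind that with a few hundred observations the test is sensitive to even moderate departures, so it helps to look at the histogram or a Q-Q plot (qqnorm() and qqline()) alongside the p-value.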

3. No-Multicollinearity

vif(model_both)
#>   GRE.Score TOEFL.Score         LOR        CGPA    Research 
#>    4.367008    3.709384    2.013825    4.657325    1.628436
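For intuition, a VIF is 1 / (1 - R²) from regressing one predictor on all the others. The sketch below computes it by hand on synthetic predictors (`x1`, `x2`, `x3` are made-up names, with `x2` deliberately built to correlate with `x1`); it is an illustration of the formula, not the car implementation.

```r
# Synthetic predictors (hypothetical data)
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
vif_by_hand
```

A value near 1 means the predictor is almost unrelated to the others; the correlated pair gets noticeably larger values.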

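All VIF values are below 10, so there is no serious multicollinearity among the predictors. This does not mean the predictors are uncorrelated: GRE.Score and CGPA have VIF values above 4, which indicates a moderate overlap with the other predictors.

4. No-Heteroscedasticity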

bptest(model_both)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_both
#> BP = 12.483, df = 5, p-value = 0.02873

Here the p-value (0.02873) is less than 0.05, so we reject the null hypothesis of constant residual variance: heteroscedasticity is present.
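The studentized Breusch-Pagan statistic can be sketched by hand: regress the squared residuals on the same predictors and multiply the R-squared of that auxiliary regression by the number of rows. The example below uses synthetic data in which the error spread grows with `x` (all names and values are made up for illustration), and it also recomputes the p-value for the statistic reported above.

```r
# Synthetic data (hypothetical) with error variance that grows with x
set.seed(5)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(residuals(fit)^2 ~ x)

bp_stat <- n * summary(aux)$r.squared     # LM statistic
bp_df   <- length(coef(aux)) - 1          # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)

# p-value for the statistic reported above (BP = 12.483, df = 5)
p_reported <- pchisq(12.483, df = 5, lower.tail = FALSE)
p_reported
```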

plot(model_both, 3)

The graph above shows a “trumpet-like” pattern, which indicates that the spread of the residuals changes with the fitted values. In other words, the residual variance is not homogeneous, which agrees with the Breusch-Pagan test result.

Conclusion

In conclusion, model_both has an adjusted R-squared of about 80% and an RMSE of about 6.7 on the test data (after multiplying by 100, as above), which is a better fit than our first model (model_lm, adjusted R-squared of about 76%). Unfortunately, the assumption checks needed for a linear regression model show that, although the model fits well, it does not meet all of the assumptions.

The checks show heteroscedasticity and residuals that are not normally distributed, while the VIF values (all below 10) do not indicate serious multicollinearity. This means the standard errors and p-values of the model may be unreliable, so this data may not be well suited to a plain linear regression model.

However, if we do rely on this model, Chance.Admit has a positive relationship with CGPA: the higher the CGPA a student achieves, the higher the predicted chance of being admitted to the university. This concludes our linear regression model.