Hi! Welcome to my LBB. In this LBB I'm going to use the University Admission Prediction dataset and make a prediction using linear regression. Enjoy!
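First, we load the data and take a look at it. Below is a minimal sketch of the setup; the file name Admission_Predict.csv and the object name admission are assumptions, as the original code chunk is not shown.

library(tidyverse)

# Read the dataset; the file name is an assumption
admission <- read_csv("Admission_Predict.csv")

head(admission)    # first six rows
dim(admission)     # number of rows and columns
names(admission)   # column names
glimpse(admission) # compact overview of the structure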
#> # A tibble: 6 x 9
#>   `Serial No.` `GRE Score` `TOEFL Score` `University Rat…   SOP   LOR  CGPA
#>          <dbl>       <dbl>         <dbl>            <dbl> <dbl> <dbl> <dbl>
#> 1            1         337           118                4   4.5   4.5  9.65
#> 2            2         324           107                4   4     4.5  8.87
#> 3            3         316           104                3   3     3.5  8
#> 4            4         322           110                3   3.5   2.5  8.67
#> 5            5         314           103                2   2     3    8.21
#> 6            6         330           115                5   4.5   3    9.34
#> # … with 2 more variables: Research <dbl>, `Chance of Admit` <dbl>
#> [1] 400 9
#> [1] "Serial No." "GRE Score" "TOEFL Score"
#> [4] "University Rating" "SOP" "LOR"
#> [7] "CGPA" "Research" "Chance of Admit"
#> Rows: 400
#> Columns: 9
#> $ `Serial No.` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
#> $ `GRE Score` <dbl> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,…
#> $ `TOEFL Score` <dbl> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,…
#> $ `University Rating` <dbl> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3…
#> $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,…
#> $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,…
#> $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8…
#> $ Research <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0…
#> $ `Chance of Admit` <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0…
We're going to remove the Serial No. column because we're not going to use it.
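A one-liner like the following would do it (assuming the data frame is called admission, as above):

admission <- admission %>% select(-`Serial No.`)  # drop the index column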
Let's check if there are any missing values.
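Something like this produces the per-column counts below:

colSums(is.na(admission))  # number of missing values per column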
#>         GRE Score       TOEFL Score University Rating               SOP
#>                 0                 0                 0                 0
#>               LOR              CGPA          Research   Chance of Admit
#>                 0                 0                 0                 0
Great, there are no missing values!
Next, we rename the columns to remove the spaces:
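One way to get the dot-separated names shown below; the exact approach is an assumption:

admission <- admission %>%
  rename(
    GRE.Score         = `GRE Score`,
    TOEFL.Score       = `TOEFL Score`,
    University.Rating = `University Rating`,
    Chance.Admit      = `Chance of Admit`
  )
glimpse(admission)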
#> Rows: 400
#> Columns: 8
#> $ GRE.Score <dbl> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 3…
#> $ TOEFL.Score <dbl> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 1…
#> $ University.Rating <dbl> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, …
#> $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3…
#> $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4…
#> $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.0…
#> $ Research <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
#> $ Chance.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.5…
Variable Definitions:

- GRE.Score: GRE score (out of 340)
- TOEFL.Score: TOEFL score (out of 120)
- University.Rating: university rating (1 to 5)
- SOP: statement of purpose strength (1 to 5)
- LOR: letter of recommendation strength (1 to 5)
- CGPA: undergraduate GPA (out of 10)
- Research: research experience (0 = no, 1 = yes)
- Chance.Admit: chance of admission (ranging from 0 to 1)
Now we should check the correlation between the variables.
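The correlation plot referenced below can be drawn, for example, with ggcorr() from the GGally package; the exact plotting call used in the original is an assumption.

library(GGally)
ggcorr(admission, label = TRUE)  # pairwise correlation heatmap with labels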
The plot above shows that every predictor has a positive correlation with Chance.Admit, and that CGPA has the strongest correlation of all the variables.
We can now proceed to build our regression model. The target variable will be Chance.Admit.
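The model summaries below use a data_train object with 280 rows (278 residual degrees of freedom plus 2 estimated coefficients), which corresponds to a 70/30 split of the 400 rows. A sketch of such a split; the seed value is an assumption:

set.seed(123)  # assumed seed; any fixed seed makes the split reproducible
idx <- sample(nrow(admission), size = 0.7 * nrow(admission))
data_train <- admission[idx, ]
data_test  <- admission[-idx, ]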
Because CGPA plays a major role, let's start with a simple linear regression of Chance.Admit on CGPA and check its summary.
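The call shown in the output is:

model_lm <- lm(Chance.Admit ~ CGPA, data = data_train)
model_lm           # print the coefficients
summary(model_lm)  # full summary with tests and R-squared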
#>
#> Call:
#> lm(formula = Chance.Admit ~ CGPA, data = data_train)
#>
#> Coefficients:
#> (Intercept) CGPA
#> -1.0881 0.2106
#>
#> Call:
#> lm(formula = Chance.Admit ~ CGPA, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.272649 -0.026863 0.008908 0.036815 0.183681
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.088088 0.060619 -17.95 <0.0000000000000002 ***
#> CGPA 0.210551 0.007001 30.08 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06824 on 278 degrees of freedom
#> Multiple R-squared: 0.7649, Adjusted R-squared: 0.7641
#> F-statistic: 904.5 on 1 and 278 DF, p-value: < 0.00000000000000022
plot(data_train$CGPA, data_train$Chance.Admit)
abline(a = model_lm$coefficients[1], b = model_lm$coefficients[2], col = "red")

The model shows an acceptable fit, with Multiple and Adjusted R-squared of about 76%.
Now let us look at a second model where Chance.Admit is regressed on all of the predictors available in the data.
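Based on the call in the output, the full model is fit with the `.` shorthand for all predictors; the object name model_all is an assumption:

model_all <- lm(Chance.Admit ~ ., data = data_train)
summary(model_all)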
#>
#> Call:
#> lm(formula = Chance.Admit ~ ., data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.265746 -0.018221 0.009616 0.034890 0.163523
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3410212 0.1488864 -9.007 < 0.0000000000000002 ***
#> GRE.Score 0.0022366 0.0006834 3.273 0.00120 **
#> TOEFL.Score 0.0024506 0.0012111 2.023 0.04400 *
#> University.Rating 0.0082489 0.0057323 1.439 0.15130
#> SOP -0.0073628 0.0069045 -1.066 0.28720
#> LOR 0.0222056 0.0068602 3.237 0.00136 **
#> CGPA 0.1167550 0.0144553 8.077 0.0000000000000217 ***
#> Research 0.0209651 0.0095761 2.189 0.02942 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06225 on 272 degrees of freedom
#> Multiple R-squared: 0.8086, Adjusted R-squared: 0.8037
#> F-statistic: 164.2 on 7 and 272 DF, p-value: < 0.00000000000000022
This model shows a Multiple R-squared of about 81% and an Adjusted R-squared of about 80%, an improvement over the CGPA-only model.
Now we proceed with a stepwise regression to see which formula is most suitable for our model. It searches in both directions, removing and re-adding predictors, and keeps the formula with the lowest AIC value.
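The trace below, which both drops (-) and re-adds (+) terms, matches step() with direction = "both"; starting from the model_all object above is an assumption:

step(model_all, direction = "both")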
#> Start: AIC=-1547.05
#> Chance.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> SOP + LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> - SOP 1 0.004406 1.0583 -1547.9
#> <none> 1.0539 -1547.0
#> - University.Rating 1 0.008023 1.0619 -1546.9
#> - TOEFL.Score 1 0.015864 1.0697 -1544.9
#> - Research 1 0.018571 1.0724 -1544.2
#> - LOR 1 0.040595 1.0945 -1538.5
#> - GRE.Score 1 0.041497 1.0954 -1538.2
#> - CGPA 1 0.252763 1.3066 -1488.8
#>
#> Step: AIC=-1547.88
#> Chance.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> - University.Rating 1 0.005001 1.0633 -1548.6
#> <none> 1.0583 -1547.9
#> + SOP 1 0.004406 1.0539 -1547.0
#> - TOEFL.Score 1 0.014632 1.0729 -1546.0
#> - Research 1 0.017194 1.0755 -1545.4
#> - LOR 1 0.036362 1.0946 -1540.4
#> - GRE.Score 1 0.044368 1.1026 -1538.4
#> - CGPA 1 0.248848 1.3071 -1490.8
#>
#> Step: AIC=-1548.56
#> Chance.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> <none> 1.0633 -1548.6
#> + University.Rating 1 0.005001 1.0583 -1547.9
#> + SOP 1 0.001383 1.0619 -1546.9
#> - Research 1 0.017141 1.0804 -1546.1
#> - TOEFL.Score 1 0.018814 1.0821 -1545.7
#> - GRE.Score 1 0.045557 1.1088 -1538.8
#> - LOR 1 0.050913 1.1142 -1537.5
#> - CGPA 1 0.284433 1.3477 -1484.2
#>
#> Call:
#> lm(formula = Chance.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA +
#> Research, data = data_train)
#>
#> Coefficients:
#> (Intercept) GRE.Score TOEFL.Score LOR CGPA Research
#> -1.397213 0.002332 0.002610 0.021675 0.118082 0.020070
Now we take the formula from above and save the model as "model_both".
model_both <- lm(formula = Chance.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA +
Research, data = data_train)
summary(model_both)
#>
#> Call:
#> lm(formula = Chance.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA +
#> Research, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.26845 -0.02046 0.01074 0.03471 0.16417
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3972133 0.1386860 -10.075 < 0.0000000000000002 ***
#> GRE.Score 0.0023321 0.0006806 3.426 0.000706 ***
#> TOEFL.Score 0.0026100 0.0011854 2.202 0.028508 *
#> LOR 0.0216748 0.0059840 3.622 0.000348 ***
#> CGPA 0.1180823 0.0137925 8.561 0.000000000000000814 ***
#> Research 0.0200696 0.0095492 2.102 0.036491 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06229 on 274 degrees of freedom
#> Multiple R-squared: 0.8069, Adjusted R-squared: 0.8034
#> F-statistic: 229 on 5 and 274 DF, p-value: < 0.00000000000000022
There is only a slight change in Adjusted R-squared compared to the previous model (0.8034 vs 0.8037) while using fewer predictors, so we're going to use this model instead.
The plot above shows the correlation between the variables.
After building the model, we continue to evaluate it: we make predictions and check the regression assumptions.
RMSE (root mean squared error) is often preferred over MAE (mean absolute error) because RMSE squares the differences between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the caret package.
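A sketch of the evaluation; the data_test object from the split above is an assumption, and multiplying by 100 to express RMSE as a percentage is inferred from the magnitude of the values below:

library(caret)

pred_train <- predict(model_both, newdata = data_train)
pred_test  <- predict(model_both, newdata = data_test)

RMSE(pred_train, data_train$Chance.Admit) * 100  # training error
RMSE(pred_test,  data_test$Chance.Admit) * 100   # test error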
#> [1] 6.162306
#> [1] 6.72332
The error on the training dataset is lower than on the test dataset, suggesting that our model may be slightly overfit. Because the difference is small, we can continue with the assumption checks on our model.
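First, normality of the residuals, tested with a Shapiro-Wilk test:

shapiro.test(model_both$residuals)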
#>
#> Shapiro-Wilk normality test
#>
#> data: model_both$residuals
#> W = 0.90714, p-value = 0.000000000003954
The null hypothesis is that the residuals follow a normal distribution. With a p-value < 0.05, the null hypothesis is rejected: our residuals do not follow a normal distribution.
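Next, multicollinearity, checked with variance inflation factors; producing the values below with vif() from the car package is an assumption:

library(car)
vif(model_both)  # VIF per predictor; values above 10 signal multicollinearity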
#> GRE.Score TOEFL.Score LOR CGPA Research
#> 4.367008 3.709384 2.013825 4.657325 1.628436
All the VIF values are below 10, so the predictors are sufficiently independent and not strongly correlated with one another: there is no serious multicollinearity.

### 4. No-Heteroscedasticity
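bptest() from the lmtest package runs the studentized Breusch-Pagan test shown below:

library(lmtest)
bptest(model_both)  # H0: residual variance is constant (homoscedastic)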
#>
#> studentized Breusch-Pagan test
#>
#> data: model_both
#> BP = 12.483, df = 5, p-value = 0.02873
Here we can see that the p-value is less than 0.05; therefore, heteroscedasticity is present.
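A quick way to visualize this is to plot the residuals against the fitted values (a sketch):

plot(model_both$fitted.values, model_both$residuals)
abline(h = 0, col = "red")  # residuals should scatter evenly around zero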
The plot above shows a "trumpet-like" pattern, which indicates that the residual variance is not constant (heteroscedastic): the residuals are not scattered randomly around zero.
In conclusion, model_both has a higher R-squared of about 80% and a test RMSE of about 6.7%, which means model_both provides a better fit than our first model (model_lm). Unfortunately, the assumption checks required for a linear regression model reveal problems.
The checks show that heteroscedasticity is present and the residuals are not normally distributed, although multicollinearity, at least, is not an issue. Because these assumptions are violated, this model may not be appropriate as a final regression model.
However, if the model is taken at face value, Chance.Admit has a positive correlation with CGPA: the higher the CGPA a university student achieves, the higher their chance of being admitted into the university. This concludes our linear regression model.