library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
library(funModeling)
library(MASS)
library(dplyr)
Regression analysis is a supervised machine learning method used to predict the value of a numerical variable from its predictor variables. The regression model is among the most theoretically understood models in statistics and often leads to highly interpretable models. We will use a linear regression model to predict a student's chance of admission based on their profile, which can help them shortlist universities. The predicted output gives applicants a fair idea of their chances at a particular university.
The dataset can be found at https://www.kaggle.com/mohansacharya/graduate-admissions
Before building the model, we need to preprocess the data and check the raw data for any undesirable formats.
admission <- read.csv("Admission_Predict_Ver1.1.csv")
rmarkdown::paged_table(admission)
names(admission)
#> [1] "Serial.No." "GRE.Score" "TOEFL.Score"
#> [4] "University.Rating" "SOP" "LOR"
#> [7] "CGPA" "Research" "Chance.of.Admit"
The dataset includes the following information:
Serial.No. : Student Number
GRE.Score : GRE Score (out of 340)
TOEFL.Score : TOEFL Score (out of 120)
University.Rating : University Rating (out of 5)
SOP : Statement of Purpose (out of 5)
LOR : Letter of Recommendation Strength (out of 5)
CGPA : Undergraduate GPA (out of 10)
Research : Research Experience (either 0 or 1)
Chance.of.Admit : Chance of Admit (ranging from 0 to 1)
colSums(is.na(admission))
#> Serial.No. GRE.Score TOEFL.Score University.Rating
#> 0 0 0 0
#> SOP LOR CGPA Research
#> 0 0 0 0
#> Chance.of.Admit
#> 0
anyNA(admission)
#> [1] FALSE
We see that the data is clean and has no NA values. We do not need the Serial.No. variable since it is only a row identifier and carries no information about admission, so we remove it.
admission_new <- admission %>%
  dplyr::select(-Serial.No.)
rmarkdown::paged_table(admission_new)
ggcorr(admission_new, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
Every predictor variable has a strongly positive correlation with the response variable (Chance.of.Admit), and CGPA correlates with the chance of admission most strongly of all.
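To back up the correlation plot with numbers, we can also extract each variable's correlation with the response directly; this is a minimal sketch, not part of the original code:
# Correlation of each variable with the response, sorted from strongest to weakest
sort(cor(admission_new)[, "Chance.of.Admit"], decreasing = TRUE)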
boxplot(admission_new$Chance.of.Admit)
boxplot(admission_new$TOEFL.Score)
boxplot(admission_new$GRE.Score)
boxplot(admission_new %>%
  dplyr::select(-c(Chance.of.Admit, GRE.Score, TOEFL.Score)))
plot_num(admission_new)
Before we build the model, we split the data into training and testing datasets. The training dataset is used to fit the linear regression model, while the testing dataset serves as a comparison to check whether the model overfits and cannot predict new data it has not seen during training. We use 80% of the data for training and the remaining 20% for testing.
set.seed(11)
index <- sample(nrow(admission_new), nrow(admission_new)*0.8)
admission.train <- admission_new[index, ]
admission.val <- admission_new[-index, ]
First, we build the model with all predictor variables.
model_admission <- lm(formula = Chance.of.Admit ~ ., data = admission.train)
summary(model_admission)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ ., data = admission.train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.268039 -0.022574 0.007516 0.033044 0.162165
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3696887 0.1145370 -11.958 < 2e-16 ***
#> GRE.Score 0.0021181 0.0005539 3.824 0.000153 ***
#> TOEFL.Score 0.0028105 0.0009724 2.890 0.004063 **
#> University.Rating 0.0040452 0.0040932 0.988 0.323628
#> SOP -0.0009614 0.0049034 -0.196 0.844664
#> LOR 0.0191943 0.0046186 4.156 3.98e-05 ***
#> CGPA 0.1201964 0.0110339 10.893 < 2e-16 ***
#> Research 0.0225662 0.0074723 3.020 0.002693 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.05877 on 392 degrees of freedom
#> Multiple R-squared: 0.8359, Adjusted R-squared: 0.833
#> F-statistic: 285.4 on 7 and 392 DF, p-value: < 2.2e-16
This model gives an adjusted R-squared of 83.3%. Next, we try to build a model automatically using stepwise regression with backward elimination.
MASS::stepAIC(model_admission, direction = "backward")
#> Start: AIC=-2259.41
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> SOP + LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> - SOP 1 0.00013 1.3540 -2261.4
#> - University.Rating 1 0.00337 1.3572 -2260.4
#> <none> 1.3538 -2259.4
#> - TOEFL.Score 1 0.02885 1.3827 -2253.0
#> - Research 1 0.03150 1.3853 -2252.2
#> - GRE.Score 1 0.05050 1.4043 -2246.8
#> - LOR 1 0.05965 1.4135 -2244.2
#> - CGPA 1 0.40983 1.7637 -2155.6
#>
#> Step: AIC=-2261.37
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> - University.Rating 1 0.00330 1.3573 -2262.4
#> <none> 1.3540 -2261.4
#> - TOEFL.Score 1 0.02872 1.3827 -2255.0
#> - Research 1 0.03154 1.3855 -2254.2
#> - GRE.Score 1 0.05047 1.4044 -2248.7
#> - LOR 1 0.06420 1.4182 -2244.8
#> - CGPA 1 0.42270 1.7767 -2154.7
#>
#> Step: AIC=-2262.4
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> <none> 1.3573 -2262.4
#> - TOEFL.Score 1 0.03189 1.3892 -2255.1
#> - Research 1 0.03401 1.3913 -2254.5
#> - GRE.Score 1 0.05057 1.4078 -2249.8
#> - LOR 1 0.07682 1.4341 -2242.4
#> - CGPA 1 0.47328 1.8305 -2144.7
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
#> CGPA + Research, data = admission.train)
#>
#> Coefficients:
#> (Intercept) GRE.Score TOEFL.Score LOR CGPA Research
#> -1.395850 0.002120 0.002923 0.020003 0.122494 0.023326
With stepwise regression, the resulting model contains the variables GRE.Score, TOEFL.Score, LOR, CGPA, and Research. We build a new model with this set of predictors.
new_model_admission <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research, data = admission.train)
summary(new_model_admission)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
#> CGPA + Research, data = admission.train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.267891 -0.023946 0.007039 0.032910 0.162565
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3958502 0.1087265 -12.838 < 2e-16 ***
#> GRE.Score 0.0021196 0.0005532 3.831 0.000148 ***
#> TOEFL.Score 0.0029226 0.0009606 3.042 0.002503 **
#> LOR 0.0200029 0.0042358 4.722 3.25e-06 ***
#> CGPA 0.1224944 0.0104507 11.721 < 2e-16 ***
#> Research 0.0233259 0.0074233 3.142 0.001803 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.05869 on 394 degrees of freedom
#> Multiple R-squared: 0.8355, Adjusted R-squared: 0.8334
#> F-statistic: 400.3 on 5 and 394 DF, p-value: < 2.2e-16
The adjusted R-squared is slightly better at 83.34%, so we will use this model and test its performance on the testing dataset.
To evaluate our model, we use the RMSE (Root Mean Square Error), defined as:
\(RMSE = \sqrt{\frac{\sum_{i=1}^{n}(Predicted_i - Actual_i)^2}{n}}\)
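The formula is straightforward to compute by hand as a sanity check on caret's RMSE(); the helper below (rmse_manual is our own name, not from the original code) is a minimal sketch of it:
# Manual RMSE: square the errors, average them, take the square root
rmse_manual <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2))
}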
lm_pred <- predict(new_model_admission, newdata = admission.val %>% dplyr::select(-Chance.of.Admit))
# RMSE of train dataset
RMSE(admission.train$Chance.of.Admit, new_model_admission$fitted.values)
#> [1] 0.05825091
RMSE(admission.val$Chance.of.Admit, lm_pred)
#> [1] 0.06536872
The error on the testing dataset is slightly higher than on the training dataset, so there is a chance that the model overfits.
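For a more stable read on out-of-sample error than a single train/test split, one option is k-fold cross-validation. The sketch below uses the already-loaded caret package; it is an illustrative check, not part of the original analysis:
# 5-fold cross-validation of the reduced model; cv_model$results reports the cross-validated RMSE and R-squared
set.seed(11)
cv_model <- caret::train(
  Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research,
  data = admission.train,
  method = "lm",
  trControl = caret::trainControl(method = "cv", number = 5)
)
cv_model$results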
Linear regression is a parametric model and therefore comes with several assumptions. A linear regression that violates these assumptions may be misleading or produce biased estimators.
linear <- data.frame(residual = new_model_admission$residuals, fitted = new_model_admission$fitted.values)
linear %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) +
  theme(panel.grid = element_blank(), panel.background = element_blank())
Ideally, the residual plot shows no fitted pattern; that is, the blue smoothing line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model. In our model there is no clear pattern in the residual plot, which suggests that we can assume a linear relationship between the predictors and the outcome variable.
\(H_0\): the normality of residual is fulfilled
\(H_1\): the normality of residual is not fulfilled
shapiro.test(new_model_admission$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: new_model_admission$residuals
#> W = 0.93312, p-value = 2.085e-12
The null hypothesis states that the residuals follow a normal distribution. With a p-value < 0.05, we reject the null hypothesis: our residuals do not follow a normal distribution.
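A quick visual complement to the Shapiro-Wilk test is a normal Q-Q plot of the residuals; this is a minimal base-graphics sketch, not part of the original code:
# Points should lie close to the reference line if the residuals are normally distributed
qqnorm(new_model_admission$residuals)
qqline(new_model_admission$residuals, col = "red")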
\(H_0\): residuals have no-autocorrelation
\(H_1\): residuals have autocorrelation
Box.test(new_model_admission$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: new_model_admission$residuals
#> X-squared = 0.30395, df = 1, p-value = 0.5814
The p-value is higher than 0.05, so we fail to reject the null hypothesis: the residuals show no autocorrelation.
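As an alternative check (not part of the original analysis), the Durbin-Watson test from the already-loaded lmtest package examines the same assumption:
# Durbin-Watson test; a statistic near 2 indicates no first-order autocorrelation in the residuals
lmtest::dwtest(new_model_admission)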
\(H_0\): Variation of residual is constant (Homoscedasticity)
\(H_1\): Variation of residual is not constant (Heteroscedasticity)
bptest(new_model_admission)
#>
#> studentized Breusch-Pagan test
#>
#> data: new_model_admission
#> BP = 23.14, df = 5, p-value = 0.0003174
With a p-value below 0.05, we reject the null hypothesis and conclude that our model has a heteroscedasticity problem.
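One common way to keep the coefficient estimates but correct the inference under heteroscedasticity is to use heteroscedasticity-consistent (robust) standard errors. The sketch below assumes the sandwich package is installed; it is an extra dependency not loaded in the original chunk:
# Coefficient tests with HC1 robust standard errors (sandwich is an added dependency)
lmtest::coeftest(new_model_admission, vcov. = sandwich::vcovHC(new_model_admission, type = "HC1"))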
Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
vif(new_model_admission)
#> GRE.Score TOEFL.Score LOR CGPA Research
#> 4.690604 3.965071 1.764345 4.751085 1.563673
Our model does not satisfy several of the assumptions, namely normality of residuals and homoscedasticity. Several methods may be applied to fix these problems.
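One such method is to transform the response variable. The sketch below uses the Box-Cox procedure from the already-loaded MASS package to suggest a power transformation; it is illustrative only, and the chosen lambda and the model_bc object are not part of the original analysis:
# Box-Cox profile: the lambda that maximizes the log-likelihood suggests a response transformation
bc <- MASS::boxcox(new_model_admission, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
# Refit the model on the transformed response and re-check the assumptions afterwards
model_bc <- lm(I(Chance.of.Admit^lambda) ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research,
               data = admission.train)
summary(model_bc)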