We will learn to build a linear regression model using the Admission_Predict_Ver1.1 dataset. We want to understand the relationships among the variables, especially between Chance.of.Admit and the other variables. You can download the data here.
library(GGally)
library(MLmetrics)
library(tidyverse)
library(dplyr)
library(lmtest)
library(car)
First we will read the dataset, and later we will split it into train and test sets with a 70% : 30% proportion.
graduate <- read.csv("Admission_Predict_Ver1.1.csv")
head(graduate, 5)
str(graduate)
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
The data for the model has 500 rows and 9 columns. Serial.No. is a unique identifier for each row, so we can ignore it. Our target variable is Chance.of.Admit, which signifies the chance of admission for graduate students. We will use the other variables as predictors. The dataset contains several parameters which are considered important during the application for Masters Programs: GRE.Score, TOEFL.Score, University.Rating, SOP, LOR, CGPA and Research.
Before we go further, we first need to make sure that our data is clean and usable. The steps to be taken are:
1. Change categorical data to the factor data type. Since the variable Research is either 0 or 1, we will change this column to a factor.
2. Check whether any missing values exist in the dataset.
3. Remove unused columns from the dataset.
# 1. Change categorical data to the factor data type
graduate$Research <- as.factor(graduate$Research)

# 2. Check whether any missing values exist in the dataset
colSums(is.na(graduate))
## Serial.No. GRE.Score TOEFL.Score University.Rating
## 0 0 0 0
## SOP LOR CGPA Research
## 0 0 0 0
## Chance.of.Admit
## 0
# 3. Remove the unused Serial.No. column (removal call reconstructed; dplyr's select() is one way to do it)
names(graduate)
## [1] "Serial.No." "GRE.Score" "TOEFL.Score"
## [4] "University.Rating" "SOP" "LOR"
## [7] "CGPA" "Research" "Chance.of.Admit"
graduate <- select(graduate, -Serial.No.)
names(graduate)
## [1] "GRE.Score" "TOEFL.Score" "University.Rating"
## [4] "SOP" "LOR" "CGPA"
## [7] "Research" "Chance.of.Admit"
Exploratory data analysis is a phase where we explore the data variables and check whether there are any patterns that could indicate correlation between variables. For this we will use the ggcorr() function from the GGally package to find out the correlations.
ggcorr(data = graduate, label = T, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(data = graduate, label = T, label_size = 2.9, hjust = 1, :
## data in column(s) 'Research' are not numeric and were ignored
The graphic shows that several variables have a strong correlation with the Chance.of.Admit variable.
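To quantify what the plot shows, we can also compute the correlation of each numeric variable with Chance.of.Admit directly with cor(); this is a small additional check, not part of the original code.
# correlation of every numeric column with Chance.of.Admit (Research is a factor, so it is excluded)
num_cols <- sapply(graduate, is.numeric)
cor(graduate[, num_cols])["Chance.of.Admit", ]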
Our earlier check showed that the dataset has no missing values in any column, so we can keep all rows. Now we can split the dataset into train and test sets with a 70% : 30% proportion.
set.seed(50)
index <- sample(nrow(graduate), nrow(graduate) * 0.7)
graduate_train <- graduate[index, ]
graduate_test <- graduate[-index, ]
Now we will try to build a linear regression model using Chance.of.Admit as the target variable.
Simple linear regression is a model with one predictor, and the predictor we will use is CGPA.
model_sl <- lm(formula = Chance.of.Admit ~ CGPA, data = graduate_train)
summary(model_sl)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = graduate_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.276747 -0.029088 0.006964 0.040095 0.171052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.045207 0.049482 -21.12 <2e-16 ***
## CGPA 0.206041 0.005743 35.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06639 on 348 degrees of freedom
## Multiple R-squared: 0.7872, Adjusted R-squared: 0.7866
## F-statistic: 1287 on 1 and 348 DF, p-value: < 2.2e-16
Linear regression is a model with high interpretability, so now we will interpret this simple linear regression model.
- Based on the Intercept: when CGPA is 0, Chance.of.Admit is about -1.045, meaning the chance to be admitted is less than 0; put simply, such a student would not enter graduate studies at the university.
- Based on the Coefficient or Slope: when CGPA goes up by 1, Chance.of.Admit increases by around 0.21 (see the quick numeric check below).
- Based on the P-value: CGPA is a significant predictor and has a linear influence.
- Based on R-squared: the R-squared value is 0.7872; the predictor used is good enough to explain the target variable.
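As a quick sanity check on these interpretations, we can compute a prediction by hand from the fitted coefficients and compare it with predict(). The CGPA value of 9 is just an illustrative example, not from the original analysis.
# manual prediction: intercept + slope * CGPA
coef(model_sl)[1] + coef(model_sl)[2] * 9   # about 0.81 with the estimates above
# should match the built-in predict()
predict(model_sl, newdata = data.frame(CGPA = 9))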
model_all <- lm(formula = Chance.of.Admit ~ ., data = graduate_train)
summary(model_all)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = graduate_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.262634 -0.023984 0.008514 0.033414 0.157793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2274554 0.1226521 -10.008 < 2e-16 ***
## GRE.Score 0.0018796 0.0005992 3.137 0.001857 **
## TOEFL.Score 0.0024807 0.0010399 2.386 0.017596 *
## University.Rating 0.0072153 0.0045202 1.596 0.111361
## SOP 0.0004345 0.0056815 0.076 0.939081
## LOR 0.0177012 0.0050341 3.516 0.000497 ***
## CGPA 0.1149754 0.0121363 9.474 < 2e-16 ***
## Research1 0.0292794 0.0081277 3.602 0.000362 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05999 on 342 degrees of freedom
## Multiple R-squared: 0.8293, Adjusted R-squared: 0.8258
## F-statistic: 237.3 on 7 and 342 DF, p-value: < 2.2e-16
The summary of the model_all model shows a lot of information.
- Based on the Intercept or constant: when all predictors are set to 0, the predicted chance to be admitted is about -1.227, i.e. effectively no chance.
- Based on the P-values: we look for predictors with at least one star, i.e. p < 0.05, because this value shows the significance of the variable in the model. If the p-value is below 0.05, we can safely assume that the variable has a significant effect on the model (meaning the estimated coefficient is significantly different from 0), and vice versa. Thus, we could build a simpler model by removing variables with p-value > 0.05, since they do not have a significant effect. The Estimate column shows the coefficient of each variable. Here GRE.Score, TOEFL.Score, LOR, CGPA and Research are good predictors, while SOP and University.Rating are not significant at the 0.05 level (a programmatic check follows below).
- Based on the Adjusted R-squared: the value is 0.8258, which is close to 1 and represents the goodness of fit of our model.
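If we want to pull the significant predictors out of the summary programmatically instead of reading the stars by eye, one possible sketch (an addition, not part of the original workflow) is:
# coefficient table from the model summary
coefs <- summary(model_all)$coefficients
# names of predictors (excluding the intercept) with p-value < 0.05
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05 & rownames(coefs) != "(Intercept)"]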
summary(model_sl)$r.squared # one predictor
## [1] 0.7872048
summary(model_all)$adj.r.squared # all predictors
## [1] 0.8257972
Comparison based on R-squared between the 1-predictor model and the all-predictor model: comparing the two models, we see that the model with all predictors has the higher value (adjusted R-squared is used for the multi-predictor model).

| Model | R-squared value |
|---|---|
| M. 1 Predictor | 0.7872048 |
| M. All Predictor | 0.8257972 |
pred_sl <- predict(object = model_sl, newdata = graduate_test)
pred_all <- predict(object = model_all, newdata = graduate_test)
RMSE(y_pred = pred_sl, y_true = graduate_test$Chance.of.Admit)
## [1] 0.06664615
RMSE(y_pred = pred_all, y_true = graduate_test$Chance.of.Admit)
## [1] 0.060229
range(graduate_test$Chance.of.Admit)
## [1] 0.36 0.97
Which model is better? Based on the goodness of fit and the error, the model_all model (which uses all predictors) is better than the model_sl model (which uses only 1 predictor). The RMSE can also be judged against the range of the target printed above (see the quick calculation below).
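One simple way to put the RMSE in context (an extra calculation, not from the original analysis) is to compare it with the spread of the target in the test set:
# RMSE of the all-predictor model relative to the range of the target
rmse_all <- RMSE(y_pred = pred_all, y_true = graduate_test$Chance.of.Admit)
rmse_all / diff(range(graduate_test$Chance.of.Admit))   # roughly 0.06 / 0.61, i.e. about 10% of the range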
Step-wise regression adds or removes predictors one at a time and keeps the change that improves the model the most at each step, judged by a criterion such as AIC. We will use step-wise regression with backward elimination: we start from the model with all predictors and repeatedly drop the predictor whose removal lowers the AIC the most, until no further removal improves the AIC.
model_backward <- step(object = model_all,
direction = "backward",
trace = F)
summary(model_backward)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research, data = graduate_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.262506 -0.023863 0.008652 0.033447 0.157893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.228750 0.121303 -10.130 < 2e-16 ***
## GRE.Score 0.001878 0.000598 3.140 0.001834 **
## TOEFL.Score 0.002488 0.001035 2.404 0.016728 *
## University.Rating 0.007346 0.004177 1.759 0.079527 .
## LOR 0.017829 0.004743 3.759 0.000201 ***
## CGPA 0.115170 0.011848 9.721 < 2e-16 ***
## Research1 0.029272 0.008115 3.607 0.000356 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0599 on 343 degrees of freedom
## Multiple R-squared: 0.8293, Adjusted R-squared: 0.8263
## F-statistic: 277.7 on 6 and 343 DF, p-value: < 2.2e-16
This step-wise regression method produces an optimum formula based on the lowest AIC value, where a lower AIC indicates a better trade-off between goodness of fit and model complexity.
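To see the AIC values that step() is optimizing, we can compare the three models directly; this comparison is added here for illustration and was not in the original code.
# lower AIC indicates a better trade-off between fit and complexity
AIC(model_sl, model_all, model_backward)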
If we compare with the previous models, we get a summary like this:

| Model | R-squared value |
|---|---|
| M. 1 Predictor | 0.7872048 |
| M. All Predictor | 0.8257972 |
| M. backward | 0.8263 |
From here we can see that the features selected by step-wise backward elimination give a slightly higher adjusted R-squared than the previous models. The selected predictors are GRE.Score, TOEFL.Score, University.Rating, LOR, CGPA and Research.
Prediction and RMSE
#Predict
pred_backward <- predict(object = model_backward, newdata = graduate_test)
#RMSE
RMSE(y_pred = pred_backward, y_true = graduate_test$Chance.of.Admit)
## [1] 0.06023782
The linear regression model is expected to produce errors (residuals) that are normally distributed, meaning most of the errors cluster around zero.
hist(model_backward$residuals)
shapiro.test(model_backward$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.9158, p-value = 4.321e-13
Shapiro-Wilk hypothesis test:
- H0: the residuals are normally distributed
- H1: the residuals are not normally distributed
Since we get p-value = 4.321e-13, which is less than 0.05, we reject H0: the errors (residuals) of model_backward are not normally distributed.
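Besides the histogram and the Shapiro-Wilk test, a Q-Q plot is another quick visual check of normality; this is a complementary sketch rather than part of the original analysis.
# Q-Q plot of the residuals: points that drift away from the line indicate non-normality
qqnorm(model_backward$residuals)
qqline(model_backward$residuals, col = "red")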
fitted.values vs residuals: fitted.values are the predicted values for the training data, and residuals are the error values.
# scatter plot
plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")
The scatter plot shows a fan-shaped pattern, which indicates heteroscedasticity.
bptest(model_backward)
##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 24.048, df = 6, p-value = 0.0005117
Breusch-Pagan hypothesis test:
- H0: the residual variance is constant (homoscedasticity)
- H1: the residual variance is not constant (heteroscedasticity)
Again we get p-value = 0.0005117, which is less than 0.05, so we reject H0: the errors are NOT spread with constant variance, i.e. we have heteroscedasticity.
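When heteroscedasticity is present, one common workaround (not used in the original analysis) is to report heteroscedasticity-consistent standard errors, for example with coeftest() from lmtest combined with the sandwich package; this sketch assumes sandwich is installed.
library(sandwich)
# coefficient tests with heteroscedasticity-consistent (HC1) standard errors
coeftest(model_backward, vcov = vcovHC(model_backward, type = "HC1"))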
vif(model_backward)
## GRE.Score TOEFL.Score University.Rating LOR
## 4.696144 4.148376 2.281882 1.896098
## CGPA Research
## 5.229900 1.567856
For this assumption we look for VIF values below 10. Our model_backward fulfills the assumption, because the VIF of every predictor is less than 10, so there is no severe multicollinearity.
For our model_backward, two assumptions are fulfilled: Linearity and No Multicollinearity. The other two assumptions are NOT fulfilled:
- Normality of Residuals does not fulfill the assumption;
- Homoscedasticity of Residuals does not fulfill the assumption.
This model still needs improvement, for example in the choice of predictors or the form of the model.
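As one possible direction for improvement (a sketch, not a fix prescribed by the original analysis), since Chance.of.Admit lies strictly between 0 and 1 we could try a logit transform of the target, refit the backward formula, and re-check the residual assumptions:
# logit-transform the target (its values are strictly between 0 and 1 in this data)
graduate_train$logit_admit <- log(graduate_train$Chance.of.Admit / (1 - graduate_train$Chance.of.Admit))

model_logit <- lm(logit_admit ~ GRE.Score + TOEFL.Score + University.Rating + LOR + CGPA + Research,
                  data = graduate_train)

# re-check normality and homoscedasticity of the residuals
shapiro.test(model_logit$residuals)
bptest(model_logit)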