This report has a purpose to demonstrate the implementation of Ordinary Least Square (OLS) Regression to predict and model the probability of Indian student candidates to be admitted in a Masters Program. The dataset is inspired from UCLA Graduate Program data set and owned by Mohan S Acharya (Please refer the data set from this link). A brief description of the variables are given as below:
In this demonstration, the target variable that is used is the chance_of_admit which consist of the chance or probability of candidates to be admitted to the program. Meanwhile, several variables apart from the target would be used for the predictors of the model. To model the OLS Regression, two methods are used including the manual methods (based on the experience of the writer), and the stepwise regression (both, forward, and backward directions) to conduct the feature selection.
The flow of this report consists of 7 sections including Introductions, Data Preparations, Exploratory Data Analysis (EDA), Modelling, Predictions and Errors, Assumption Testing, and Model Improvement. Briefly, the data preparations consist of the preparation of R packages and the data set to be used. Next, EDA consist of the exploratory analysis to identify the correlations among the target variables and predicted variables. In the Modelling and Predictions and Errors section, the created models are explained along the calculations of errors within each models to select the appropriate model with the least error rate and highest R-squared rate. Lastly, Assumptions of Linear Regression are conducted for the models that has been selected and to make several improvements to the model.
First off, the library would be loaded. The list of the used library is given as below:
library(tidyverse)
library(caret)
library(car)
library(GGally)
library(scales)
library(lmtest)
library(performance)Next, the data set would be imported. The readr package from tidyverse would be used here to load the data. Moreover, adjustment of the research variable is done by converting the type to factor (due to the content which consist of 0 and 1). Note that the column names are transformed to ease the Data Pre-processing.
admission_df <- read_csv("data_input/Admission_Predict_Ver1.1.csv")
names(admission_df) <- c("id", "gre_score", "toefl_score",
"university_rating", "sop", "lor", "cgpa",
"research", "chance_of_admit")
admission_df <- admission_df %>%
mutate(research = as.factor(research))
admission_dfAfter the adjustment, it is recommended to explore the missing and duplicated value within the data set to assure a cleaner data. The missing values for each columns are given below:
colSums(is.na(admission_df))## id gre_score toefl_score university_rating
## 0 0 0 0
## sop lor cgpa research
## 0 0 0 0
## chance_of_admit
## 0
It seems that there are no missing value within the data set. Next, the duplicated values are explored by using anyDuplicated function to attain the overall duplicated values accross the columns. The duplicated values for the data set are given below:
anyDuplicated(admission_df)## [1] 0
It is seen that there are no duplicated values within the data set. As the focus is the target and predictor variables, the id column could be ommited.
admission_df <- admission_df %>%
select(-id)In this section, the data set is explored by calling the function summary to explore the quantiles, min, max and central values of each variables. the summary of the data set is given below:
summary(admission_df)## gre_score toefl_score university_rating sop
## Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
## 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
## Median :317.0 Median :107.0 Median :3.000 Median :3.500
## Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
## 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
## lor cgpa research chance_of_admit
## Min. :1.000 Min. :6.800 0:220 Min. :0.3400
## 1st Qu.:3.000 1st Qu.:8.127 1:280 1st Qu.:0.6300
## Median :3.500 Median :8.560 Median :0.7200
## Mean :3.484 Mean :8.576 Mean :0.7217
## 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:0.8200
## Max. :5.000 Max. :9.920 Max. :0.9700
Based on the result, it is shown that the range of each variables are matched with the data set description above (Introduction). It is also shown that the range of target variable (chance_of_admit) spans between 0.34 to 0.97.
For further exploration, the correlation for target variable to the predictors are explained by using ggcorr function from GGally package. The correlation chart is given below:
ggcorr(admission_df, label = T, label_size = 5, hjust = 1, layout.exp = 2)The chart shows the strengths of each correlations. As seen on the graph, the correlations of target variables (chance_of_admit) to its predictors have strong correlations (above 0.5). Nevertheless, It is recommended that each correlation of target variable to its predictors to be further examined. In this section, the investigation of correlations is conducted by using Pearson correlation coefficient test. Note that this step is optional to be included in the EDA phase as it also could be performed in the assumptions testing phase. The hypotheses of the test are given below:
H0: Correlation equals to 0 (non-linear)
H1: Correlation not equals to 0 (linear)
By calling cor.test function, the calculation for each correlation test are performed. Below are the result:
xgpa <- cor.test(admission_df$chance_of_admit, admission_df$cgpa)[3]
xlor <- cor.test(admission_df$chance_of_admit, admission_df$lor)[3]
xsop <- cor.test(admission_df$chance_of_admit, admission_df$sop)[3]
xrating <- cor.test(admission_df$chance_of_admit, admission_df$university_rating)[3]
xtoefl <- cor.test(admission_df$chance_of_admit, admission_df$toefl_score)[3]
xgre <- cor.test(admission_df$chance_of_admit, admission_df$gre_score)[3]
p_value <- rbind(xgpa, xlor, xsop, xrating, xtoefl, xgre)It is shown that the the given result of p-values for each correlations are below alpha (or 0.05) which conclude the null hypothesis to be rejected. This means that each of the correlations shows a linear relationships among the target and each predictors which would answer one of the assumptions of OLS Regression model (further explained in the Assumption Testing section).
In this section, the OLS modelling processes are performed. The steps that are conducted include separation of train and test data, formulation of manual modelling, and stepwise regression methods.
In this step, the data set is divided into two subsets which are Train Data and Test data. This step would be mandatory for any methods of modelling to evaluate the model performance that has been built. There are not any formal guidelines in conducting this step as each of the case study and the methods of the models would require different methods of splitting. However, the rule of thumb to split the data is by having a greater proportion of Train Data as oppose to the Test Data. In this case, the split that is performed is by having 75% of the total records in the dataset to the train data, while the remains would be allocated to the test data. Both set.seed and sample functions would be used in this process which could be seen as follow:
set.seed(318)
samplesize <- round(0.75 * nrow(admission_df), 0)
index <- sample(nrow(admission_df), size = samplesize)
data_train <- admission_df[index, ]
data_test <- admission_df[-index, ]
nrow(data_train)## [1] 375
nrow(data_test)## [1] 125
In this section, the selection of features would be applied manually referring to the experience of the writer. The variables that are used within the model includes the GRE Scores, GPA scores from the previous degree, TOEFL Scores, Statement of Purpose (or Essay), and Letter of Recommendations. To establish the model, lm function is used by passing the formula of the model and the train data into the function.
manual_lm <- lm(chance_of_admit ~ gre_score + cgpa + toefl_score + sop + lor, data_train)Next, the model would be interpreted by calling the summary function of the model. To interpret the summary of the model, below is the general equation of linear regression model:
\[\hat{y} = \beta_0 + \beta_1*x1 + ...\]
The explanation of each elements in the equation would be described as below:
summary(manual_lm)##
## Call:
## lm(formula = chance_of_admit ~ gre_score + cgpa + toefl_score +
## sop + lor, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.277625 -0.025659 0.009358 0.038457 0.157752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3634386 0.1155177 -11.803 < 0.0000000000000002 ***
## gre_score 0.0019124 0.0005823 3.284 0.00112 **
## cgpa 0.1298502 0.0120928 10.738 < 0.0000000000000002 ***
## toefl_score 0.0024801 0.0010161 2.441 0.01512 *
## sop 0.0072953 0.0051959 1.404 0.16114
## lor 0.0210838 0.0049369 4.271 0.0000248 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06381 on 369 degrees of freedom
## Multiple R-squared: 0.7931, Adjusted R-squared: 0.7903
## F-statistic: 282.8 on 5 and 369 DF, p-value: < 0.00000000000000022
To briefly interpret the summary of the model, we could focus on the first column (Estimate), the code of significance as shown on the left of the Coefficients table, and the adjusted R-squared values of the model. In the first column, the intercept along with the slopes of each predictors are shown. It is also described that among several predictors, sop is less significant in comparisson to other predictors. It would be optional to disregard this variable in the tuning of the model (depends on the change of R-squared value and the error rates). The adjusted R-squared value also seems high as the 79 % the data of the models are explainable. In summary, the equation for manual_lm model is described as below
chance_of_admit = -1.3634386 + 0.0019124 x gre_score + 0.1298502 x cgpa + 0.0024801 x toefl_score + 0.0072953 x sop + 0.0210838 x lor
The stepwise regression is one of the method to select the features (or predictors to be used) of the regression model. The method uses the optimization of Akaike Information Criterion (AIC) that determines which of the models to be used by having the lowest value of it. In this section, the Stepwise Regression is performed by three different directions (forward, backward, and both) to compare the differences of each directions. To initialize, lm_none and lm_all objects are made to determine the scope of stepwise function step. Below are the result and summary of stepwise regression by both direction:
lm_none <- lm(chance_of_admit ~ 1, data_train)
lm_all <- lm(chance_of_admit ~ ., data_train)
stepwise_both <- step(lm_none, scope = list(lower = lm_none, upper = lm_all), direction = "both")## Start: AIC=-1477.13
## chance_of_admit ~ 1
##
## Df Sum of Sq RSS AIC
## + cgpa 1 5.5057 1.7557 -2007.5
## + gre_score 1 4.4326 2.8288 -1828.7
## + toefl_score 1 4.1611 3.1003 -1794.3
## + sop 1 3.4633 3.7980 -1718.2
## + university_rating 1 3.2953 3.9660 -1701.9
## + lor 1 2.9969 4.2644 -1674.7
## + research 1 1.9601 5.3012 -1593.1
## <none> 7.2614 -1477.1
##
## Step: AIC=-2007.53
## chance_of_admit ~ cgpa
##
## Df Sum of Sq RSS AIC
## + lor 1 0.1172 1.6385 -2031.4
## + research 1 0.1099 1.6458 -2029.8
## + gre_score 1 0.1006 1.6550 -2027.7
## + toefl_score 1 0.0976 1.6580 -2027.0
## + university_rating 1 0.0970 1.6587 -2026.8
## + sop 1 0.0618 1.6938 -2019.0
## <none> 1.7557 -2007.5
## - cgpa 1 5.5057 7.2614 -1477.1
##
## Step: AIC=-2031.43
## chance_of_admit ~ cgpa + lor
##
## Df Sum of Sq RSS AIC
## + gre_score 1 0.09891 1.5396 -2052.8
## + research 1 0.09002 1.5485 -2050.6
## + toefl_score 1 0.08407 1.5544 -2049.2
## + university_rating 1 0.04791 1.5906 -2040.6
## + sop 1 0.01755 1.6209 -2033.5
## <none> 1.6385 -2031.4
## - lor 1 0.11718 1.7557 -2007.5
## - cgpa 1 2.62594 4.2644 -1674.7
##
## Step: AIC=-2052.78
## chance_of_admit ~ cgpa + lor + gre_score
##
## Df Sum of Sq RSS AIC
## + research 1 0.04665 1.4929 -2062.3
## + toefl_score 1 0.02893 1.5106 -2057.9
## + university_rating 1 0.02668 1.5129 -2057.3
## + sop 1 0.01270 1.5269 -2053.9
## <none> 1.5396 -2052.8
## - gre_score 1 0.09891 1.6385 -2031.4
## - lor 1 0.11548 1.6550 -2027.7
## - cgpa 1 0.71950 2.2591 -1911.0
##
## Step: AIC=-2062.32
## chance_of_admit ~ cgpa + lor + gre_score + research
##
## Df Sum of Sq RSS AIC
## + toefl_score 1 0.03126 1.4617 -2068.3
## + university_rating 1 0.02170 1.4712 -2065.8
## + sop 1 0.01076 1.4822 -2063.0
## <none> 1.4929 -2062.3
## - research 1 0.04665 1.5396 -2052.8
## - gre_score 1 0.05553 1.5485 -2050.6
## - lor 1 0.10049 1.5934 -2039.9
## - cgpa 1 0.71215 2.2051 -1918.1
##
## Step: AIC=-2068.26
## chance_of_admit ~ cgpa + lor + gre_score + research + toefl_score
##
## Df Sum of Sq RSS AIC
## + university_rating 1 0.01464 1.4470 -2070.0
## <none> 1.4617 -2068.3
## + sop 1 0.00629 1.4554 -2067.9
## - gre_score 1 0.01904 1.4807 -2065.4
## - toefl_score 1 0.03126 1.4929 -2062.3
## - research 1 0.04897 1.5106 -2057.9
## - lor 1 0.09184 1.5535 -2047.4
## - cgpa 1 0.54381 2.0055 -1951.6
##
## Step: AIC=-2070.03
## chance_of_admit ~ cgpa + lor + gre_score + research + toefl_score +
## university_rating
##
## Df Sum of Sq RSS AIC
## <none> 1.4470 -2070.0
## + sop 1 0.00166 1.4454 -2068.5
## - university_rating 1 0.01464 1.4617 -2068.3
## - gre_score 1 0.01646 1.4635 -2067.8
## - toefl_score 1 0.02420 1.4712 -2065.8
## - research 1 0.04437 1.4914 -2060.7
## - lor 1 0.06515 1.5122 -2055.5
## - cgpa 1 0.49860 1.9456 -1961.0
In comparision, the backward direction differs from the both ones by eliminating each feature from lm_all which as seen as follow:
stepwise_bwd <- step(lm_all, direction = "backward")## Start: AIC=-2068.46
## chance_of_admit ~ gre_score + toefl_score + university_rating +
## sop + lor + cgpa + research
##
## Df Sum of Sq RSS AIC
## - sop 1 0.00166 1.4470 -2070.0
## <none> 1.4454 -2068.5
## - university_rating 1 0.01001 1.4554 -2067.9
## - gre_score 1 0.01691 1.4623 -2066.1
## - toefl_score 1 0.02282 1.4682 -2064.6
## - research 1 0.04400 1.4894 -2059.2
## - lor 1 0.05498 1.5003 -2056.5
## - cgpa 1 0.45360 1.8990 -1968.1
##
## Step: AIC=-2070.03
## chance_of_admit ~ gre_score + toefl_score + university_rating +
## lor + cgpa + research
##
## Df Sum of Sq RSS AIC
## <none> 1.4470 -2070.0
## - university_rating 1 0.01464 1.4617 -2068.3
## - gre_score 1 0.01646 1.4635 -2067.8
## - toefl_score 1 0.02420 1.4712 -2065.8
## - research 1 0.04437 1.4914 -2060.7
## - lor 1 0.06515 1.5122 -2055.5
## - cgpa 1 0.49860 1.9456 -1961.0
The forward direction would also differ as the direction goes from lm_none to lm_all
stepwise_fwd <- step(lm_none, scope = list(lower = lm_none, upper = lm_all), direction = "forward")## Start: AIC=-1477.13
## chance_of_admit ~ 1
##
## Df Sum of Sq RSS AIC
## + cgpa 1 5.5057 1.7557 -2007.5
## + gre_score 1 4.4326 2.8288 -1828.7
## + toefl_score 1 4.1611 3.1003 -1794.3
## + sop 1 3.4633 3.7980 -1718.2
## + university_rating 1 3.2953 3.9660 -1701.9
## + lor 1 2.9969 4.2644 -1674.7
## + research 1 1.9601 5.3012 -1593.1
## <none> 7.2614 -1477.1
##
## Step: AIC=-2007.53
## chance_of_admit ~ cgpa
##
## Df Sum of Sq RSS AIC
## + lor 1 0.117183 1.6385 -2031.4
## + research 1 0.109891 1.6458 -2029.8
## + gre_score 1 0.100612 1.6550 -2027.7
## + toefl_score 1 0.097620 1.6580 -2027.0
## + university_rating 1 0.096985 1.6587 -2026.8
## + sop 1 0.061820 1.6938 -2019.0
## <none> 1.7557 -2007.5
##
## Step: AIC=-2031.43
## chance_of_admit ~ cgpa + lor
##
## Df Sum of Sq RSS AIC
## + gre_score 1 0.098907 1.5396 -2052.8
## + research 1 0.090019 1.5485 -2050.6
## + toefl_score 1 0.084067 1.5544 -2049.2
## + university_rating 1 0.047906 1.5906 -2040.6
## + sop 1 0.017551 1.6209 -2033.5
## <none> 1.6385 -2031.4
##
## Step: AIC=-2052.78
## chance_of_admit ~ cgpa + lor + gre_score
##
## Df Sum of Sq RSS AIC
## + research 1 0.046645 1.4929 -2062.3
## + toefl_score 1 0.028934 1.5106 -2057.9
## + university_rating 1 0.026680 1.5129 -2057.3
## + sop 1 0.012699 1.5269 -2053.9
## <none> 1.5396 -2052.8
##
## Step: AIC=-2062.32
## chance_of_admit ~ cgpa + lor + gre_score + research
##
## Df Sum of Sq RSS AIC
## + toefl_score 1 0.031260 1.4617 -2068.3
## + university_rating 1 0.021700 1.4712 -2065.8
## + sop 1 0.010756 1.4822 -2063.0
## <none> 1.4929 -2062.3
##
## Step: AIC=-2068.26
## chance_of_admit ~ cgpa + lor + gre_score + research + toefl_score
##
## Df Sum of Sq RSS AIC
## + university_rating 1 0.014643 1.4470 -2070.0
## <none> 1.4617 -2068.3
## + sop 1 0.006293 1.4554 -2067.9
##
## Step: AIC=-2070.03
## chance_of_admit ~ cgpa + lor + gre_score + research + toefl_score +
## university_rating
##
## Df Sum of Sq RSS AIC
## <none> 1.4470 -2070.0
## + sop 1 0.0016611 1.4454 -2068.5
By the each models above, it is observed that the features of each directions are similar (only differed from the sequence) which is also further explained in the next section. Thus, any directions of the stepwise could be choose in this case to compare with the manual model. The summary of the stepwise model is as below:
summary(stepwise_both)##
## Call:
## lm(formula = chance_of_admit ~ cgpa + lor + gre_score + research +
## toefl_score + university_rating, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.257232 -0.024547 0.009304 0.033994 0.163434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1561424 0.1261732 -9.163 < 0.0000000000000002 ***
## cgpa 0.1298942 0.0115352 11.261 < 0.0000000000000002 ***
## lor 0.0193331 0.0047498 4.070 0.0000575 ***
## gre_score 0.0012244 0.0005984 2.046 0.041452 *
## research1 0.0262669 0.0078199 3.359 0.000864 ***
## toefl_score 0.0024844 0.0010014 2.481 0.013549 *
## university_rating 0.0081611 0.0042291 1.930 0.054404 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06271 on 368 degrees of freedom
## Multiple R-squared: 0.8007, Adjusted R-squared: 0.7975
## F-statistic: 246.4 on 6 and 368 DF, p-value: < 0.00000000000000022
Having this summary, the model could be summarized as the equation below:
chance_of_admit = -1.1561424 + 0.1298942 x cgpa + 0.0193331 x lor + 0.0012244 x gre_score + 0.0262669 x research1 + 0.0024844 x toefl_score + 0.0081611 x university_rating
Likewise, the performance of each model would be seen from the respective values of Adjusted R-Squared values. Due to the multiple predictors are used in this models, Adjusted R-Squared would be preferably used instead of Multiple R-Squared as certain biases occured in the calculation of Multiple R-squared when it comes to the multiple linear regression. The equation of Adjusted R-Squared would be described as below:
\[\bar{R^2} = 1(1-R^2)\frac{n-1}{n-p-1}\]
Based on the summaries of each models above, it is shown that stepwise regression method has a greater value of Adjusted R-Squared value in comparison to the manual models.
In this section, the built models are then compared by their error rates from their respective results of their predictions from both of the training data and the test data. This step would be required to see whether the models are fit to the data set. To test the error rates, the Root Mean Square Error (RMSE) are used in this case. The implementation of RMSE would be preferrable in comparison to Mean Absolute Error (MAE) as the errors of each predictions would be squared and then rooted which would resulted in having a greater penalties for the higher residuals. The equation to calculate RMSE would be described as below:
\[RMSE = \sqrt{\frac{\Sigma(prediction_i - actual_i)^2}{n}}\]
To calculate the RMSE in this study, RMSE function would be used by the caret package. Below is the calculation of predicted values along with the RMSE calculations of train and test data for manual model:
pred_manual_lm <- predict(manual_lm, newdata = data_test %>% select(-chance_of_admit))
#train
paste("RMSE Value for Train Data:", RMSE(manual_lm$fitted.values, data_train$chance_of_admit))## [1] "RMSE Value for Train Data: 0.0633005307721381"
#test
paste("RMSE Value for Test Data:", RMSE(pred_manual_lm, data_test$chance_of_admit))## [1] "RMSE Value for Test Data: 0.0523712410669489"
The step would be repeated for the stepwise regression model. Below are the calculations of predicted values by the stepwise model along with the error rates of train and test data:
pred_stepwise <- predict(stepwise_both, newdata = data_test %>% select(-chance_of_admit))
#train
paste("RMSE Value for Train Data:", RMSE(stepwise_both$fitted.values, data_train$chance_of_admit))## [1] "RMSE Value for Train Data: 0.0621186436179306"
#test
paste("RMSE Value for Test Data:",RMSE(pred_stepwise, data_test$chance_of_admit))## [1] "RMSE Value for Test Data: 0.0522589904242547"
Based on the calculations above, it is evident that there are slight differences of error rates for each models. The manual model has RMSE value of 0.063 for the train data calculation. Meanwhile the stepwise regression model has 0.062 for the train data calculation. Moreover the slight difference of RMSE values for the train and test data calculation towards each of the models show that each models does not seem to be overfit or underfit due to its slight differences. Thus, both models could be chosen for this case. To go forward, stepwise model would be choosen for the case due to the lower error rates and the higher value of Adjusted R-Squared. To efficiently compare each of the models, below are shown the comparison of each models by using compare_performance from performance package:
compare_performance(manual_lm, stepwise_both, stepwise_bwd, stepwise_fwd)After choosing the model, it would then be tested for the assumptions of the linear regression. This step would be required in establishing an OLS Regression model as the model itself would only use a linear methods for predicting the target variable. In other words, the statistical model would only be formed in a straight line. Meanwhile, most of the cases in data science would not have a linear correlations between the target and predictors.
Several assumption tests would be performed include the linearity, heteroscedasticity, normality, autocorrelation, multicolinearity, and influence identification.
The linearity test mainly focused on the correlation of target to each of its predictors. In the Exploratory Data Analysis section, the Pearson Correlation Coefficient Test has been conducted and it is seen that each of correlations formed a linear correlations (due to alpha < 0.05). The linearity would also be shown from the residuals vs. fitted values as seen as below:
plot(stepwise_both, 1)Based on the plot above, the trend line of the plot (red line) is likely to be closed to the 0 line. This means that the residuals would show the linearity of the target and each predictor variables. Thus, the first assumption has been answered.
The Heteroscedasticity Test is mainly focused on identifying patterns within the residuals. The patterns within the residuals are likely ocurred as the variance of the residuals are non-constant. To identify the patterns of the residuals, a plot of scale-location vs. root-standardized residuals would be used.
plot(stepwise_both, 3)Based on the plot, it is shown that the residuals on the left side of the grid (scale location 0.5 to 0.7) show more dispersed pattern than the right side of the grid. This would likely show the heteroscedasticity of the model, while the homoscedasticity (no pattern within residuals) would be desired. To investigate further, a Breusch-Pagan Test would be used. The hypotheses of the test would be described as follow:
H0: The error terms has no pattern (Homoscedasticity)
H1: The error has pattern (Heteroscedasticity)
To perform the test, a function bptest would be used by the package lmtest
bptest(stepwise_both)##
## studentized Breusch-Pagan test
##
## data: stepwise_both
## BP = 23.553, df = 6, p-value = 0.0006309
It is clear now that there is a pattern within the residuals as the p-value of the test is below 0.05 which would reject the null hypothesis. By this means, a transformation to either a target variables or predictor variables might be preferable to fit this assumption. The transformation would be explained in the Model Improvement section.
The normality test is mainly focused on the distribution of the residuals. The desired output for this test is the normal distribution of the residuals. To identify, a histogram and Shapiro-Wilk test would be used.
hist(stepwise_both$residuals)Based on the plot above, there are certain values of residuals that are deviate a little far away from the central points (bins -0.3 and -0.2) which made the distribution skewed. This would also seen from the Shapiro-Wilk Test as performed below. The hypotheses for the test are as follow:
H0: The residuals are normality distributed
H1: The residals are not normality distributed
shapiro.test(stepwise_both$residuals)##
## Shapiro-Wilk normality test
##
## data: stepwise_both$residuals
## W = 0.92706, p-value = 0.000000000001484
Due to p-value < 0.05, it is concluded that the residuals of the models are not normally distributed. This could be handled by variable transformation or removing certain outliers in the train data. Despite the removed outliers could be the answer, an influence check would be recommended beforehand which would be described in the next sub-section.
Autocorrelation test is used to identify whether there are any correlation among error-terms or residuals. The effect of this correlation would be the reduction of the confidence interval when predicting the target variable. For example, A prediction of a model with a confidence interval of 95% may in reality have a lower probability than 0.95 of the true value of the containing parameter. The test that would be used is Durbin Watson Test with the hypotheses as follow:
H0: No Autocorrelation presented
H1: An Autocorrelation is presented
durbinWatsonTest(stepwise_both)## lag Autocorrelation D-W Statistic p-value
## 1 -0.06768163 2.091591 0.378
## Alternative hypothesis: rho != 0
Based on the test, the p-value is above 0.05 which has meaning that there are no autocorrelation presented in the residuals.
Multicolinearity test is performed to identify whethere there are any correlation between predictor variables. To test the multicolinearity of the model, a Variance Inflation Factor (VIF) test would be performed. The rule of thumb of this test is whether the value of VIF for each independent variables/predictor exceed 5 or 10.
vif(stepwise_both)## cgpa lor gre_score research
## 4.247711 1.812956 4.114458 1.436964
## toefl_score university_rating
## 3.343981 2.160270
Based on the test result, it is shown that each of the VIF values does not exceed 5 or 10. Thus, it could be concluded that there are no multicolinearity among the predictors.
To check the influence of the outliers, Cook’s Distance could identify whether certain outliers have an influence towards the linear models that are built. The rule of thumb to identify outliers by their Cook’s Distance is by having a cutoff level of the Cook’s Distance above 0.5 or 1.5. If the outliers are below this cutoff level, it is recommended not to perform any deletion of the outliers. This is due to the reduction of information if these outliers are deleted.
par(mfrow = c(1,2))
plot(stepwise_both, 5)
plot(stepwise_both, 4)The plots shows several outliers within the residuals which are observation number 100, 193, and 270. However, the Cook’s distance of these outliers does not seem to exceed the cutoff level. Thus, this outliers should not be deleted.
In the Model Improvement phase, the built model are then tuned to satisfy several assumptions that has not been satisfied in the previous section. Based on the model evaluation, there are two assumptions that has not met including heteroscedasticity and normality test. To satisfy these assumptions, a transformation of the model could be performed by transforming the scale of either the predictors or the target variable. In this case, the target model is transformed by using the box-cox transformation which the equation of transformation is seen as below:
\[\frac{y^\lambda - 1}{\lambda}\]
Note that this equation should be used if the lambda is not equal to 0 (as this would make the transformation uses the natural log). To determine the lambda of the transformation, boxcox function is used from MASS library.
MASS::boxcox(stepwise_both, plotit = T, lambda = seq(1, 2, 0.1))Based on the plot above, the lambda that would be used is 1.9 as it meets the confidence interval line of 95%. The transformation then would be performed as below:
model_tuned <- lm((((chance_of_admit^1.9) - 1)/1.9) ~ cgpa + lor + gre_score + research + toefl_score +
university_rating, data = data_train)
summary(model_tuned)##
## Call:
## lm(formula = (((chance_of_admit^1.9) - 1)/1.9) ~ cgpa + lor +
## gre_score + research + toefl_score + university_rating, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.167619 -0.020530 0.007452 0.025932 0.110108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.6153434 0.0846802 -19.076 < 0.0000000000000002 ***
## cgpa 0.0944725 0.0077418 12.203 < 0.0000000000000002 ***
## lor 0.0119563 0.0031878 3.751 0.000205 ***
## gre_score 0.0008731 0.0004016 2.174 0.030351 *
## research1 0.0196593 0.0052483 3.746 0.000209 ***
## toefl_score 0.0019932 0.0006721 2.966 0.003216 **
## university_rating 0.0085358 0.0028383 3.007 0.002817 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04209 on 368 degrees of freedom
## Multiple R-squared: 0.8307, Adjusted R-squared: 0.8279
## F-statistic: 300.8 on 6 and 368 DF, p-value: < 0.00000000000000022
It is evident now that the adjusted R-squared has increased due to the transformation. Nevertheless, it is still encouraged to assess the error rates of the transformed model. The error rates of the tuned model are described as below:
pred_tuned <- predict(model_tuned, newdata = data_test %>% select(-chance_of_admit))
#train
paste("RMSE Value for Train Data:", RMSE(model_tuned$fitted.values, data_train$chance_of_admit))## [1] "RMSE Value for Train Data: 0.957688731797463"
#test
paste("RMSE Value for Test Data:", RMSE(pred_tuned, data_test$chance_of_admit))## [1] "RMSE Value for Test Data: 0.967013369224131"
It seems that the error rates has increased from 0.06 to 0.96 (raised approx. 90 percent). This would be considered as high as the target variable range from 0 to 1 (the probability of chance to admit in the program). To further investigate, several assumption testing would be performed to the tuned model.
plot(model_tuned, 1)The linearity assumption is still satisfied as the red line of residuals vs. fitted values seems to fit to the zero line.
plot(model_tuned, 3)The plot of root-standardized residuals vs. fitted values would also still show a certain pattern from 0.7 - 0.9 of the fitted value but was slightly better than the model without tuning
bptest(model_tuned)##
## studentized Breusch-Pagan test
##
## data: model_tuned
## BP = 12.484, df = 6, p-value = 0.052
Likewise, the p-value of the Breusch-Pagan test shows the homoscedasticity of the residual patterns which made the heteroscedasticity assumption for the tuned model satisfied.
hist(model_tuned$residuals)In the normality test, however, the residual distributions are still skewed. Potentially, this is due to a certain pattern in the data set that could not be described by the linear regression model.
shapiro.test(model_tuned$residuals)##
## Shapiro-Wilk normality test
##
## data: model_tuned$residuals
## W = 0.94103, p-value = 0.00000000004743
Likewise, the Shapiro-Wilk test result shows that the distribution is not normal or skewed (due to p-value < 0.05). Nevertheless, the model is still usable with this condition but with a certain highlight that under certain data patterns would likely make a large error.
durbinWatsonTest(model_tuned)## lag Autocorrelation D-W Statistic p-value
## 1 -0.06807809 2.085291 0.446
## Alternative hypothesis: rho != 0
The Durbin Watson test of the tuned model would also shown that there are no autocorrelation.
vif(model_tuned)## cgpa lor gre_score research
## 4.247711 1.812956 4.114458 1.436964
## toefl_score university_rating
## 3.343981 2.160270
VIF test would also shown that there are no multicoliniearity among the independent variables for the tuned models.
par(mfrow = c(1,2))
plot(model_tuned, 5)
plot(model_tuned, 4)The influence check by the Cook’s Distance would also shows that there are no outliers would influence the tuned model.
In conclusion, The OLS Regression is one of the methods to model and predicts the target variables based on several predictors. One of the methods to select features of the models (in this case, the predictors) is by using Stepwise Regression which is optimized by having the lowest value of Akaika Information Criterion (AIC). Based on the error rates and the adjusted R-squared values, it is also concluded that the stepwise regression would be the best fit for the Graduate Admission case study.
Due to several assumptions, including the normality and heteroscedasticity, the chosen model would require to be transformed. The transformation that is utilized is box-cox transformation by finding the appropriate value of lambda. As the result of the transformation, the heteroscedasticity is satisfied while the normality assumption is not changed. Nevertheless, the model could still be utilized under certain note that a certain pattern in the data would likely generate a larger error or residuals. It would also recommended to try to deploy other methods of machine learning such as decision tree or random forest which do not have any assumptions in their deployment.