Introduction

This report has a purpose to demonstrate the implementation of Ordinary Least Square (OLS) Regression to predict and model the probability of Indian student candidates to be admitted in a Masters Program. The dataset is inspired from UCLA Graduate Program data set and owned by Mohan S Acharya (Please refer the data set from this link). A brief description of the variables are given as below:

  • GRE Scores ( out of 340 )
  • TOEFL Scores ( out of 120 )
  • University Rating ( out of 5 )
  • Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
  • Undergraduate GPA ( out of 10 )
  • Research Experience ( either 0 or 1 )
  • Chance of Admit ( ranging from 0 to 1 )

In this demonstration, the target variable that is used is the chance_of_admit which consist of the chance or probability of candidates to be admitted to the program. Meanwhile, several variables apart from the target would be used for the predictors of the model. To model the OLS Regression, two methods are used including the manual methods (based on the experience of the writer), and the stepwise regression (both, forward, and backward directions) to conduct the feature selection.

The flow of this report consists of 7 sections including Introductions, Data Preparations, Exploratory Data Analysis (EDA), Modelling, Predictions and Errors, Assumption Testing, and Model Improvement. Briefly, the data preparations consist of the preparation of R packages and the data set to be used. Next, EDA consist of the exploratory analysis to identify the correlations among the target variables and predicted variables. In the Modelling and Predictions and Errors section, the created models are explained along the calculations of errors within each models to select the appropriate model with the least error rate and highest R-squared rate. Lastly, Assumptions of Linear Regression are conducted for the models that has been selected and to make several improvements to the model.

Data Preparation

First off, the library would be loaded. The list of the used library is given as below:

library(tidyverse)
library(caret)
library(car)
library(GGally)
library(scales)
library(lmtest)
library(performance)

Next, the data set would be imported. The readr package from tidyverse would be used here to load the data. Moreover, adjustment of the research variable is done by converting the type to factor (due to the content which consist of 0 and 1). Note that the column names are transformed to ease the Data Pre-processing.

admission_df <- read_csv("data_input/Admission_Predict_Ver1.1.csv") 

names(admission_df) <- c("id", "gre_score", "toefl_score", 
                         "university_rating", "sop", "lor", "cgpa", 
                         "research", "chance_of_admit")

admission_df <- admission_df %>% 
  mutate(research = as.factor(research))

admission_df

After the adjustment, it is recommended to explore the missing and duplicated value within the data set to assure a cleaner data. The missing values for each columns are given below:

colSums(is.na(admission_df))
##                id         gre_score       toefl_score university_rating 
##                 0                 0                 0                 0 
##               sop               lor              cgpa          research 
##                 0                 0                 0                 0 
##   chance_of_admit 
##                 0

It seems that there are no missing value within the data set. Next, the duplicated values are explored by using anyDuplicated function to attain the overall duplicated values accross the columns. The duplicated values for the data set are given below:

anyDuplicated(admission_df)
## [1] 0

It is seen that there are no duplicated values within the data set. As the focus is the target and predictor variables, the id column could be ommited.

admission_df <- admission_df %>% 
  select(-id)

Exploratory Data Analysis

In this section, the data set is explored by calling the function summary to explore the quantiles, min, max and central values of each variables. the summary of the data set is given below:

summary(admission_df)
##    gre_score      toefl_score    university_rating      sop       
##  Min.   :290.0   Min.   : 92.0   Min.   :1.000     Min.   :1.000  
##  1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000     1st Qu.:2.500  
##  Median :317.0   Median :107.0   Median :3.000     Median :3.500  
##  Mean   :316.5   Mean   :107.2   Mean   :3.114     Mean   :3.374  
##  3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000     3rd Qu.:4.000  
##  Max.   :340.0   Max.   :120.0   Max.   :5.000     Max.   :5.000  
##       lor             cgpa       research chance_of_admit 
##  Min.   :1.000   Min.   :6.800   0:220    Min.   :0.3400  
##  1st Qu.:3.000   1st Qu.:8.127   1:280    1st Qu.:0.6300  
##  Median :3.500   Median :8.560            Median :0.7200  
##  Mean   :3.484   Mean   :8.576            Mean   :0.7217  
##  3rd Qu.:4.000   3rd Qu.:9.040            3rd Qu.:0.8200  
##  Max.   :5.000   Max.   :9.920            Max.   :0.9700

Based on the result, it is shown that the range of each variables are matched with the data set description above (Introduction). It is also shown that the range of target variable (chance_of_admit) spans between 0.34 to 0.97.

For further exploration, the correlation for target variable to the predictors are explained by using ggcorr function from GGally package. The correlation chart is given below:

ggcorr(admission_df, label = T, label_size = 5, hjust = 1, layout.exp = 2)

The chart shows the strengths of each correlations. As seen on the graph, the correlations of target variables (chance_of_admit) to its predictors have strong correlations (above 0.5). Nevertheless, It is recommended that each correlation of target variable to its predictors to be further examined. In this section, the investigation of correlations is conducted by using Pearson correlation coefficient test. Note that this step is optional to be included in the EDA phase as it also could be performed in the assumptions testing phase. The hypotheses of the test are given below:

H0: Correlation equals to 0 (non-linear)
H1: Correlation not equals to 0 (linear)

By calling cor.test function, the calculation for each correlation test are performed. Below are the result:

xgpa <- cor.test(admission_df$chance_of_admit, admission_df$cgpa)[3]
xlor <- cor.test(admission_df$chance_of_admit, admission_df$lor)[3]
xsop <- cor.test(admission_df$chance_of_admit, admission_df$sop)[3]
xrating <- cor.test(admission_df$chance_of_admit, admission_df$university_rating)[3]
xtoefl <- cor.test(admission_df$chance_of_admit, admission_df$toefl_score)[3]
xgre <- cor.test(admission_df$chance_of_admit, admission_df$gre_score)[3]

p_value <- rbind(xgpa, xlor, xsop, xrating, xtoefl, xgre)

It is shown that the the given result of p-values for each correlations are below alpha (or 0.05) which conclude the null hypothesis to be rejected. This means that each of the correlations shows a linear relationships among the target and each predictors which would answer one of the assumptions of OLS Regression model (further explained in the Assumption Testing section).

Modelling

In this section, the OLS modelling processes are performed. The steps that are conducted include separation of train and test data, formulation of manual modelling, and stepwise regression methods.

Holdout: Train-Test Split

In this step, the data set is divided into two subsets which are Train Data and Test data. This step would be mandatory for any methods of modelling to evaluate the model performance that has been built. There are not any formal guidelines in conducting this step as each of the case study and the methods of the models would require different methods of splitting. However, the rule of thumb to split the data is by having a greater proportion of Train Data as oppose to the Test Data. In this case, the split that is performed is by having 75% of the total records in the dataset to the train data, while the remains would be allocated to the test data. Both set.seed and sample functions would be used in this process which could be seen as follow:

set.seed(318)

samplesize <- round(0.75 * nrow(admission_df), 0)
index <- sample(nrow(admission_df), size = samplesize)

data_train <- admission_df[index, ]
data_test <- admission_df[-index, ]

nrow(data_train)
## [1] 375
nrow(data_test)
## [1] 125

Manual Modelling

In this section, the selection of features would be applied manually referring to the experience of the writer. The variables that are used within the model includes the GRE Scores, GPA scores from the previous degree, TOEFL Scores, Statement of Purpose (or Essay), and Letter of Recommendations. To establish the model, lm function is used by passing the formula of the model and the train data into the function.

manual_lm <- lm(chance_of_admit ~ gre_score + cgpa + toefl_score + sop + lor, data_train)

Next, the model would be interpreted by calling the summary function of the model. To interpret the summary of the model, below is the general equation of linear regression model:

\[\hat{y} = \beta_0 + \beta_1*x1 + ...\]

The explanation of each elements in the equation would be described as below:

  • y hat : a predicted value or the target variable by the linear regression model
  • beta 0 : an intercept (or constant) that is generated from the model
  • beta 1 : slope coefficients of predictor variable 1
  • x 1 : values of predictor variable 1
summary(manual_lm)
## 
## Call:
## lm(formula = chance_of_admit ~ gre_score + cgpa + toefl_score + 
##     sop + lor, data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.277625 -0.025659  0.009358  0.038457  0.157752 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.3634386  0.1155177 -11.803 < 0.0000000000000002 ***
## gre_score    0.0019124  0.0005823   3.284              0.00112 ** 
## cgpa         0.1298502  0.0120928  10.738 < 0.0000000000000002 ***
## toefl_score  0.0024801  0.0010161   2.441              0.01512 *  
## sop          0.0072953  0.0051959   1.404              0.16114    
## lor          0.0210838  0.0049369   4.271            0.0000248 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06381 on 369 degrees of freedom
## Multiple R-squared:  0.7931, Adjusted R-squared:  0.7903 
## F-statistic: 282.8 on 5 and 369 DF,  p-value: < 0.00000000000000022

To briefly interpret the summary of the model, we could focus on the first column (Estimate), the code of significance as shown on the left of the Coefficients table, and the adjusted R-squared values of the model. In the first column, the intercept along with the slopes of each predictors are shown. It is also described that among several predictors, sop is less significant in comparisson to other predictors. It would be optional to disregard this variable in the tuning of the model (depends on the change of R-squared value and the error rates). The adjusted R-squared value also seems high as the 79 % the data of the models are explainable. In summary, the equation for manual_lm model is described as below

chance_of_admit = -1.3634386 + 0.0019124 x gre_score + 0.1298502 x cgpa + 0.0024801 x toefl_score + 0.0072953 x sop + 0.0210838 x lor

Stepwise Regression

The stepwise regression is one of the method to select the features (or predictors to be used) of the regression model. The method uses the optimization of Akaike Information Criterion (AIC) that determines which of the models to be used by having the lowest value of it. In this section, the Stepwise Regression is performed by three different directions (forward, backward, and both) to compare the differences of each directions. To initialize, lm_none and lm_all objects are made to determine the scope of stepwise function step. Below are the result and summary of stepwise regression by both direction:

lm_none <- lm(chance_of_admit ~ 1, data_train)
lm_all <- lm(chance_of_admit ~ ., data_train)

stepwise_both <- step(lm_none, scope = list(lower = lm_none, upper = lm_all), direction = "both")
## Start:  AIC=-1477.13
## chance_of_admit ~ 1
## 
##                     Df Sum of Sq    RSS     AIC
## + cgpa               1    5.5057 1.7557 -2007.5
## + gre_score          1    4.4326 2.8288 -1828.7
## + toefl_score        1    4.1611 3.1003 -1794.3
## + sop                1    3.4633 3.7980 -1718.2
## + university_rating  1    3.2953 3.9660 -1701.9
## + lor                1    2.9969 4.2644 -1674.7
## + research           1    1.9601 5.3012 -1593.1
## <none>                           7.2614 -1477.1
## 
## Step:  AIC=-2007.53
## chance_of_admit ~ cgpa
## 
##                     Df Sum of Sq    RSS     AIC
## + lor                1    0.1172 1.6385 -2031.4
## + research           1    0.1099 1.6458 -2029.8
## + gre_score          1    0.1006 1.6550 -2027.7
## + toefl_score        1    0.0976 1.6580 -2027.0
## + university_rating  1    0.0970 1.6587 -2026.8
## + sop                1    0.0618 1.6938 -2019.0
## <none>                           1.7557 -2007.5
## - cgpa               1    5.5057 7.2614 -1477.1
## 
## Step:  AIC=-2031.43
## chance_of_admit ~ cgpa + lor
## 
##                     Df Sum of Sq    RSS     AIC
## + gre_score          1   0.09891 1.5396 -2052.8
## + research           1   0.09002 1.5485 -2050.6
## + toefl_score        1   0.08407 1.5544 -2049.2
## + university_rating  1   0.04791 1.5906 -2040.6
## + sop                1   0.01755 1.6209 -2033.5
## <none>                           1.6385 -2031.4
## - lor                1   0.11718 1.7557 -2007.5
## - cgpa               1   2.62594 4.2644 -1674.7
## 
## Step:  AIC=-2052.78
## chance_of_admit ~ cgpa + lor + gre_score
## 
##                     Df Sum of Sq    RSS     AIC
## + research           1   0.04665 1.4929 -2062.3
## + toefl_score        1   0.02893 1.5106 -2057.9
## + university_rating  1   0.02668 1.5129 -2057.3
## + sop                1   0.01270 1.5269 -2053.9
## <none>                           1.5396 -2052.8
## - gre_score          1   0.09891 1.6385 -2031.4
## - lor                1   0.11548 1.6550 -2027.7
## - cgpa               1   0.71950 2.2591 -1911.0
## 
## Step:  AIC=-2062.32
## chance_of_admit ~ cgpa + lor + gre_score + research
## 
##                     Df Sum of Sq    RSS     AIC
## + toefl_score        1   0.03126 1.4617 -2068.3
## + university_rating  1   0.02170 1.4712 -2065.8
## + sop                1   0.01076 1.4822 -2063.0
## <none>                           1.4929 -2062.3
## - research           1   0.04665 1.5396 -2052.8
## - gre_score          1   0.05553 1.5485 -2050.6
## - lor                1   0.10049 1.5934 -2039.9
## - cgpa               1   0.71215 2.2051 -1918.1
## 
## Step:  AIC=-2068.26
## chance_of_admit ~ cgpa + lor + gre_score + research + toefl_score
## 
##                     Df Sum of Sq    RSS     AIC
## + university_rating  1   0.01464 1.4470 -2070.0
## <none>                           1.4617 -2068.3
## + sop                1   0.00629 1.4554 -2067.9
## - gre_score          1   0.01904 1.4807 -2065.4
## - toefl_score        1   0.03126 1.4929 -2062.3
## - research           1   0.04897 1.5106 -2057.9
## - lor                1   0.09184 1.5535 -2047.4
## - cgpa               1   0.54381 2.0055 -1951.6
## 
## Step:  AIC=-2070.03
## chance_of_admit ~ cgpa + lor + gre_score + research + toefl_score + 
##     university_rating
## 
##                     Df Sum of Sq    RSS     AIC
## <none>                           1.4470 -2070.0
## + sop                1   0.00166 1.4454 -2068.5
## - university_rating  1   0.01464 1.4617 -2068.3
## - gre_score          1   0.01646 1.4635 -2067.8
## - toefl_score        1   0.02420 1.4712 -2065.8
## - research           1   0.04437 1.4914 -2060.7
## - lor                1   0.06515 1.5122 -2055.5
## - cgpa               1   0.49860 1.9456 -1961.0

In comparision, the backward direction differs from the both ones by eliminating each feature from lm_all which as seen as follow:

stepwise_bwd <- step(lm_all, direction = "backward")
## Start:  AIC=-2068.46
## chance_of_admit ~ gre_score + toefl_score + university_rating + 
##     sop + lor + cgpa + research
## 
##                     Df Sum of Sq    RSS     AIC
## - sop                1   0.00166 1.4470 -2070.0
## <none>                           1.4454 -2068.5
## - university_rating  1   0.01001 1.4554 -2067.9
## - gre_score          1   0.01691 1.4623 -2066.1
## - toefl_score        1   0.02282 1.4682 -2064.6
## - research           1   0.04400 1.4894 -2059.2
## - lor                1   0.05498 1.5003 -2056.5
## - cgpa               1   0.45360 1.8990 -1968.1
## 
## Step:  AIC=-2070.03
## chance_of_admit ~ gre_score + toefl_score + university_rating + 
##     lor + cgpa + research
## 
##                     Df Sum of Sq    RSS     AIC
## <none>                           1.4470 -2070.0
## - university_rating  1   0.01464 1.4617 -2068.3
## - gre_score          1   0.01646 1.4635 -2067.8
## - toefl_score        1   0.02420 1.4712 -2065.8
## - research           1   0.04437 1.4914 -2060.7
## - lor                1   0.06515 1.5122 -2055.5
## - cgpa               1   0.49860 1.9456 -1961.0

The forward direction would also differ as the direction goes from lm_none to lm_all

stepwise_fwd <- step(lm_none, scope = list(lower = lm_none, upper = lm_all), direction = "forward")
## Start:  AIC=-1477.13
## chance_of_admit ~ 1
## 
##                     Df Sum of Sq    RSS     AIC
## + cgpa               1    5.5057 1.7557 -2007.5
## + gre_score          1    4.4326 2.8288 -1828.7
## + toefl_score        1    4.1611 3.1003 -1794.3
## + sop                1    3.4633 3.7980 -1718.2
## + university_rating  1    3.2953 3.9660 -1701.9
## + lor                1    2.9969 4.2644 -1674.7
## + research           1    1.9601 5.3012 -1593.1
## <none>                           7.2614 -1477.1
## 
## Step:  AIC=-2007.53
## chance_of_admit ~ cgpa
## 
##                     Df Sum of Sq    RSS     AIC
## + lor                1  0.117183 1.6385 -2031.4
## + research           1  0.109891 1.6458 -2029.8
## + gre_score          1  0.100612 1.6550 -2027.7
## + toefl_score        1  0.097620 1.6580 -2027.0
## + university_rating  1  0.096985 1.6587 -2026.8
## + sop                1  0.061820 1.6938 -2019.0
## <none>                           1.7557 -2007.5
## 
## Step:  AIC=-2031.43
## chance_of_admit ~ cgpa + lor
## 
##                     Df Sum of Sq    RSS     AIC
## + gre_score          1  0.098907 1.5396 -2052.8
## + research           1  0.090019 1.5485 -2050.6
## + toefl_score        1  0.084067 1.5544 -2049.2
## + university_rating  1  0.047906 1.5906 -2040.6
## + sop                1  0.017551 1.6209 -2033.5
## <none>                           1.6385 -2031.4
## 
## Step:  AIC=-2052.78
## chance_of_admit ~ cgpa + lor + gre_score
## 
##                     Df Sum of Sq    RSS     AIC
## + research           1  0.046645 1.4929 -2062.3
## + toefl_score        1  0.028934 1.5106 -2057.9
## + university_rating  1  0.026680 1.5129 -2057.3
## + sop                1  0.012699 1.5269 -2053.9
## <none>                           1.5396 -2052.8
## 
## Step:  AIC=-2062.32
## chance_of_admit ~ cgpa + lor + gre_score + research
## 
##                     Df Sum of Sq    RSS     AIC
## + toefl_score        1  0.031260 1.4617 -2068.3
## + university_rating  1  0.021700 1.4712 -2065.8
## + sop                1  0.010756 1.4822 -2063.0
## <none>                           1.4929 -2062.3
## 
## Step:  AIC=-2068.26
## chance_of_admit ~ cgpa + lor + gre_score + research + toefl_score
## 
##                     Df Sum of Sq    RSS     AIC
## + university_rating  1  0.014643 1.4470 -2070.0
## <none>                           1.4617 -2068.3
## + sop                1  0.006293 1.4554 -2067.9
## 
## Step:  AIC=-2070.03
## chance_of_admit ~ cgpa + lor + gre_score + research + toefl_score + 
##     university_rating
## 
##        Df Sum of Sq    RSS     AIC
## <none>              1.4470 -2070.0
## + sop   1 0.0016611 1.4454 -2068.5

By the each models above, it is observed that the features of each directions are similar (only differed from the sequence) which is also further explained in the next section. Thus, any directions of the stepwise could be choose in this case to compare with the manual model. The summary of the stepwise model is as below:

summary(stepwise_both)
## 
## Call:
## lm(formula = chance_of_admit ~ cgpa + lor + gre_score + research + 
##     toefl_score + university_rating, data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.257232 -0.024547  0.009304  0.033994  0.163434 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.1561424  0.1261732  -9.163 < 0.0000000000000002 ***
## cgpa               0.1298942  0.0115352  11.261 < 0.0000000000000002 ***
## lor                0.0193331  0.0047498   4.070            0.0000575 ***
## gre_score          0.0012244  0.0005984   2.046             0.041452 *  
## research1          0.0262669  0.0078199   3.359             0.000864 ***
## toefl_score        0.0024844  0.0010014   2.481             0.013549 *  
## university_rating  0.0081611  0.0042291   1.930             0.054404 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06271 on 368 degrees of freedom
## Multiple R-squared:  0.8007, Adjusted R-squared:  0.7975 
## F-statistic: 246.4 on 6 and 368 DF,  p-value: < 0.00000000000000022

Having this summary, the model could be summarized as the equation below:

chance_of_admit = -1.1561424 + 0.1298942 x cgpa + 0.0193331 x lor + 0.0012244 x gre_score + 0.0262669 x research1 + 0.0024844 x toefl_score + 0.0081611 x university_rating

Likewise, the performance of each model would be seen from the respective values of Adjusted R-Squared values. Due to the multiple predictors are used in this models, Adjusted R-Squared would be preferably used instead of Multiple R-Squared as certain biases occured in the calculation of Multiple R-squared when it comes to the multiple linear regression. The equation of Adjusted R-Squared would be described as below:

\[\bar{R^2} = 1(1-R^2)\frac{n-1}{n-p-1}\]

Based on the summaries of each models above, it is shown that stepwise regression method has a greater value of Adjusted R-Squared value in comparison to the manual models.

Predictions and Errors

In this section, the built models are then compared by their error rates from their respective results of their predictions from both of the training data and the test data. This step would be required to see whether the models are fit to the data set. To test the error rates, the Root Mean Square Error (RMSE) are used in this case. The implementation of RMSE would be preferrable in comparison to Mean Absolute Error (MAE) as the errors of each predictions would be squared and then rooted which would resulted in having a greater penalties for the higher residuals. The equation to calculate RMSE would be described as below:

\[RMSE = \sqrt{\frac{\Sigma(prediction_i - actual_i)^2}{n}}\]

To calculate the RMSE in this study, RMSE function would be used by the caret package. Below is the calculation of predicted values along with the RMSE calculations of train and test data for manual model:

pred_manual_lm <- predict(manual_lm, newdata = data_test %>% select(-chance_of_admit))

#train
paste("RMSE Value for Train Data:", RMSE(manual_lm$fitted.values, data_train$chance_of_admit))
## [1] "RMSE Value for Train Data: 0.0633005307721381"
#test
paste("RMSE Value for Test Data:", RMSE(pred_manual_lm, data_test$chance_of_admit))
## [1] "RMSE Value for Test Data: 0.0523712410669489"

The step would be repeated for the stepwise regression model. Below are the calculations of predicted values by the stepwise model along with the error rates of train and test data:

pred_stepwise <- predict(stepwise_both, newdata = data_test %>% select(-chance_of_admit))

#train
paste("RMSE Value for Train Data:", RMSE(stepwise_both$fitted.values, data_train$chance_of_admit))
## [1] "RMSE Value for Train Data: 0.0621186436179306"
#test
paste("RMSE Value for Test Data:",RMSE(pred_stepwise, data_test$chance_of_admit))
## [1] "RMSE Value for Test Data: 0.0522589904242547"

Based on the calculations above, it is evident that there are slight differences of error rates for each models. The manual model has RMSE value of 0.063 for the train data calculation. Meanwhile the stepwise regression model has 0.062 for the train data calculation. Moreover the slight difference of RMSE values for the train and test data calculation towards each of the models show that each models does not seem to be overfit or underfit due to its slight differences. Thus, both models could be chosen for this case. To go forward, stepwise model would be choosen for the case due to the lower error rates and the higher value of Adjusted R-Squared. To efficiently compare each of the models, below are shown the comparison of each models by using compare_performance from performance package:

compare_performance(manual_lm, stepwise_both, stepwise_bwd, stepwise_fwd)

Assumption Testing

After choosing the model, it would then be tested for the assumptions of the linear regression. This step would be required in establishing an OLS Regression model as the model itself would only use a linear methods for predicting the target variable. In other words, the statistical model would only be formed in a straight line. Meanwhile, most of the cases in data science would not have a linear correlations between the target and predictors.

Several assumption tests would be performed include the linearity, heteroscedasticity, normality, autocorrelation, multicolinearity, and influence identification.

Linearity

The linearity test mainly focused on the correlation of target to each of its predictors. In the Exploratory Data Analysis section, the Pearson Correlation Coefficient Test has been conducted and it is seen that each of correlations formed a linear correlations (due to alpha < 0.05). The linearity would also be shown from the residuals vs. fitted values as seen as below:

plot(stepwise_both, 1)

Based on the plot above, the trend line of the plot (red line) is likely to be closed to the 0 line. This means that the residuals would show the linearity of the target and each predictor variables. Thus, the first assumption has been answered.

Heteroscedasticity

The Heteroscedasticity Test is mainly focused on identifying patterns within the residuals. The patterns within the residuals are likely ocurred as the variance of the residuals are non-constant. To identify the patterns of the residuals, a plot of scale-location vs. root-standardized residuals would be used.

plot(stepwise_both, 3)

Based on the plot, it is shown that the residuals on the left side of the grid (scale location 0.5 to 0.7) show more dispersed pattern than the right side of the grid. This would likely show the heteroscedasticity of the model, while the homoscedasticity (no pattern within residuals) would be desired. To investigate further, a Breusch-Pagan Test would be used. The hypotheses of the test would be described as follow:

H0: The error terms has no pattern (Homoscedasticity)
H1: The error has pattern (Heteroscedasticity)

To perform the test, a function bptest would be used by the package lmtest

bptest(stepwise_both)
## 
##  studentized Breusch-Pagan test
## 
## data:  stepwise_both
## BP = 23.553, df = 6, p-value = 0.0006309

It is clear now that there is a pattern within the residuals as the p-value of the test is below 0.05 which would reject the null hypothesis. By this means, a transformation to either a target variables or predictor variables might be preferable to fit this assumption. The transformation would be explained in the Model Improvement section.

Normality

The normality test is mainly focused on the distribution of the residuals. The desired output for this test is the normal distribution of the residuals. To identify, a histogram and Shapiro-Wilk test would be used.

hist(stepwise_both$residuals)

Based on the plot above, there are certain values of residuals that are deviate a little far away from the central points (bins -0.3 and -0.2) which made the distribution skewed. This would also seen from the Shapiro-Wilk Test as performed below. The hypotheses for the test are as follow:

H0: The residuals are normality distributed
H1: The residals are not normality distributed

shapiro.test(stepwise_both$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  stepwise_both$residuals
## W = 0.92706, p-value = 0.000000000001484

Due to p-value < 0.05, it is concluded that the residuals of the models are not normally distributed. This could be handled by variable transformation or removing certain outliers in the train data. Despite the removed outliers could be the answer, an influence check would be recommended beforehand which would be described in the next sub-section.

Autocorrelation

Autocorrelation test is used to identify whether there are any correlation among error-terms or residuals. The effect of this correlation would be the reduction of the confidence interval when predicting the target variable. For example, A prediction of a model with a confidence interval of 95% may in reality have a lower probability than 0.95 of the true value of the containing parameter. The test that would be used is Durbin Watson Test with the hypotheses as follow:

H0: No Autocorrelation presented
H1: An Autocorrelation is presented

durbinWatsonTest(stepwise_both)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.06768163      2.091591   0.378
##  Alternative hypothesis: rho != 0

Based on the test, the p-value is above 0.05 which has meaning that there are no autocorrelation presented in the residuals.

Multicolinearity

Multicolinearity test is performed to identify whethere there are any correlation between predictor variables. To test the multicolinearity of the model, a Variance Inflation Factor (VIF) test would be performed. The rule of thumb of this test is whether the value of VIF for each independent variables/predictor exceed 5 or 10.

vif(stepwise_both)
##              cgpa               lor         gre_score          research 
##          4.247711          1.812956          4.114458          1.436964 
##       toefl_score university_rating 
##          3.343981          2.160270

Based on the test result, it is shown that each of the VIF values does not exceed 5 or 10. Thus, it could be concluded that there are no multicolinearity among the predictors.

Checking Influence

To check the influence of the outliers, Cook’s Distance could identify whether certain outliers have an influence towards the linear models that are built. The rule of thumb to identify outliers by their Cook’s Distance is by having a cutoff level of the Cook’s Distance above 0.5 or 1.5. If the outliers are below this cutoff level, it is recommended not to perform any deletion of the outliers. This is due to the reduction of information if these outliers are deleted.

par(mfrow = c(1,2))
plot(stepwise_both, 5)
plot(stepwise_both, 4)

The plots shows several outliers within the residuals which are observation number 100, 193, and 270. However, the Cook’s distance of these outliers does not seem to exceed the cutoff level. Thus, this outliers should not be deleted.

Model Improvement

In the Model Improvement phase, the built model are then tuned to satisfy several assumptions that has not been satisfied in the previous section. Based on the model evaluation, there are two assumptions that has not met including heteroscedasticity and normality test. To satisfy these assumptions, a transformation of the model could be performed by transforming the scale of either the predictors or the target variable. In this case, the target model is transformed by using the box-cox transformation which the equation of transformation is seen as below:

\[\frac{y^\lambda - 1}{\lambda}\]

Note that this equation should be used if the lambda is not equal to 0 (as this would make the transformation uses the natural log). To determine the lambda of the transformation, boxcox function is used from MASS library.

MASS::boxcox(stepwise_both, plotit = T, lambda = seq(1, 2, 0.1))

Based on the plot above, the lambda that would be used is 1.9 as it meets the confidence interval line of 95%. The transformation then would be performed as below:

model_tuned <- lm((((chance_of_admit^1.9) - 1)/1.9) ~ cgpa + lor + gre_score + research + toefl_score + 
    university_rating, data = data_train)

summary(model_tuned)
## 
## Call:
## lm(formula = (((chance_of_admit^1.9) - 1)/1.9) ~ cgpa + lor + 
##     gre_score + research + toefl_score + university_rating, data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.167619 -0.020530  0.007452  0.025932  0.110108 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.6153434  0.0846802 -19.076 < 0.0000000000000002 ***
## cgpa               0.0944725  0.0077418  12.203 < 0.0000000000000002 ***
## lor                0.0119563  0.0031878   3.751             0.000205 ***
## gre_score          0.0008731  0.0004016   2.174             0.030351 *  
## research1          0.0196593  0.0052483   3.746             0.000209 ***
## toefl_score        0.0019932  0.0006721   2.966             0.003216 ** 
## university_rating  0.0085358  0.0028383   3.007             0.002817 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04209 on 368 degrees of freedom
## Multiple R-squared:  0.8307, Adjusted R-squared:  0.8279 
## F-statistic: 300.8 on 6 and 368 DF,  p-value: < 0.00000000000000022

It is evident now that the adjusted R-squared has increased due to the transformation. Nevertheless, it is still encouraged to assess the error rates of the transformed model. The error rates of the tuned model are described as below:

pred_tuned <- predict(model_tuned, newdata = data_test %>% select(-chance_of_admit))

#train
paste("RMSE Value for Train Data:", RMSE(model_tuned$fitted.values, data_train$chance_of_admit))
## [1] "RMSE Value for Train Data: 0.957688731797463"
#test
paste("RMSE Value for Test Data:", RMSE(pred_tuned, data_test$chance_of_admit))
## [1] "RMSE Value for Test Data: 0.967013369224131"

It seems that the error rates has increased from 0.06 to 0.96 (raised approx. 90 percent). This would be considered as high as the target variable range from 0 to 1 (the probability of chance to admit in the program). To further investigate, several assumption testing would be performed to the tuned model.

plot(model_tuned, 1)

The linearity assumption is still satisfied as the red line of residuals vs. fitted values seems to fit to the zero line.

plot(model_tuned, 3)

The plot of root-standardized residuals vs. fitted values would also still show a certain pattern from 0.7 - 0.9 of the fitted value but was slightly better than the model without tuning

bptest(model_tuned)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_tuned
## BP = 12.484, df = 6, p-value = 0.052

Likewise, the p-value of the Breusch-Pagan test shows the homoscedasticity of the residual patterns which made the heteroscedasticity assumption for the tuned model satisfied.

hist(model_tuned$residuals)

In the normality test, however, the residual distributions are still skewed. Potentially, this is due to a certain pattern in the data set that could not be described by the linear regression model.

shapiro.test(model_tuned$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_tuned$residuals
## W = 0.94103, p-value = 0.00000000004743

Likewise, the Shapiro-Wilk test result shows that the distribution is not normal or skewed (due to p-value < 0.05). Nevertheless, the model is still usable with this condition but with a certain highlight that under certain data patterns would likely make a large error.

durbinWatsonTest(model_tuned)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.06807809      2.085291   0.446
##  Alternative hypothesis: rho != 0

The Durbin Watson test of the tuned model would also shown that there are no autocorrelation.

vif(model_tuned)
##              cgpa               lor         gre_score          research 
##          4.247711          1.812956          4.114458          1.436964 
##       toefl_score university_rating 
##          3.343981          2.160270

VIF test would also shown that there are no multicoliniearity among the independent variables for the tuned models.

par(mfrow = c(1,2))
plot(model_tuned, 5)
plot(model_tuned, 4)

The influence check by the Cook’s Distance would also shows that there are no outliers would influence the tuned model.

Conclusion

In conclusion, The OLS Regression is one of the methods to model and predicts the target variables based on several predictors. One of the methods to select features of the models (in this case, the predictors) is by using Stepwise Regression which is optimized by having the lowest value of Akaika Information Criterion (AIC). Based on the error rates and the adjusted R-squared values, it is also concluded that the stepwise regression would be the best fit for the Graduate Admission case study.

Due to several assumptions, including the normality and heteroscedasticity, the chosen model would require to be transformed. The transformation that is utilized is box-cox transformation by finding the appropriate value of lambda. As the result of the transformation, the heteroscedasticity is satisfied while the normality assumption is not changed. Nevertheless, the model could still be utilized under certain note that a certain pattern in the data would likely generate a larger error or residuals. It would also recommended to try to deploy other methods of machine learning such as decision tree or random forest which do not have any assumptions in their deployment.

Reference

  1. Aora, Ridhi. Answering question in “What are the ways to deal with auto-correlation problems in Multiple Regression Analysis?”. Accessed September 23, 2019, via https://www.researchgate.net/post/What_are_the_ways_to_deal_with_auto-correlation_problems_in_Multiple_Regression_Analysis.
  2. David Dalpiaz. “Transformation”. Accessed September 1, 2021, via https://daviddalpiaz.github.io/appliedstats/transformations.html
  3. Gujarati, Damodar and Dawn C. Porter. 2009. Basic Econometrics . New York: McGraw-Hill.
  4. James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. 2013. An Introduction to Statistical Learning: with Applications in R . New York: Springer.
  5. Mohan S Acharya, Asfia Armaan, Aneeta S Antony : A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019
  6. Thieme, Christian. “Identifying Outliers in Linear Regression — Cook’s Distance: What Is Cook’s Distance and How Can It Help You Identify and Remove Outliers From Your Dataset?”. Accessed September 1, 2021, https://towardsdatascience.com/identifying-outliers-in-linear-regression-cooks-distance-9e212e9136a.