A simple modeling with college admissions data obtained from https://www.kaggle.com/mohansacharya/graduate-admissions
Objective :
Loading the required packages for this simple modeling
library(dplyr)
library(MLmetrics)
library(readr)
library(assertive)
library(bigassertr)
library(ggplot2)
library(car)
library(lmtest)
library(ggcorrplot)
library(tidyr)
library(ggpubr)Loading the dataset
df_admission<-read_csv("Admission_Predict_Ver1.1.csv")
df_admission<-df_admission %>% select(-`Serial No.`)
head(df_admission)str(df_admission)## tibble [500 x 8] (S3: tbl_df/tbl/data.frame)
## $ GRE Score : num [1:500] 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL Score : num [1:500] 118 107 104 110 103 115 109 101 102 108 ...
## $ University Rating: num [1:500] 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num [1:500] 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num [1:500] 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num [1:500] 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : num [1:500] 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance of Admit : num [1:500] 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
The dataset contains several parameters which are considered important during the application for Masters Programs.The parameters included are :
Checking if there is any missing data
colSums(is.na(df_admission))## GRE Score TOEFL Score University Rating SOP
## 0 0 0 0
## LOR CGPA Research Chance of Admit
## 0 0 0 0
Checking if all values are in range
assert_all_are_in_closed_range(df_admission$`GRE Score`,lower=0,upper=340)
assert_all_are_in_closed_range(df_admission$`TOEFL Score`,lower=0,upper=120)
assert_all_are_in_closed_range(df_admission$`University Rating`,lower=0,upper=5)
assert_all_are_in_closed_range(df_admission$SOP,lower=0,upper=5)
assert_all_are_in_closed_range(df_admission$CGPA,lower=0,upper=10)
assert_all_are_in_closed_range(df_admission$LOR,lower=0,upper=5)
assert_all_are_in_closed_range(df_admission$`Chance of Admit`,lower=0,upper=1)No errors found! Now it’s time to check the binary scaled variable
assert_01(df_admission$Research)No errors found as well, now it’s time to do some exploratory data analysis.
Correlation<-round(cor(df_admission),1)
ggcorrplot(Correlation,type="lower",lab=TRUE) +
scale_fill_gradient2(breaks=c(0, 1), limit=c(0, 1),low="white",mid="blue",high="red")+labs(title="Correlation Between Variables",fill="Correlation")df_admission %>%
gather(key=Variable,value = Value) %>%
ggplot(aes(x=Value))+geom_histogram()+facet_wrap(~Variable,nrow = 2,ncol=4,scales="free")+theme_minimal()a<-df_admission %>%
ggplot(aes(x=`GRE Score`,y=`Chance of Admit`))+geom_point()+geom_smooth()
b<-df_admission %>%
ggplot(aes(x=`TOEFL Score`,y=`Chance of Admit`))+geom_point()+geom_smooth()
c<-df_admission %>%
ggplot(aes(x=`University Rating`,y=`Chance of Admit`))+geom_point()+geom_smooth()
d<-df_admission %>%
ggplot(aes(x=SOP,y=`Chance of Admit`))+geom_point()+geom_smooth()
e<-df_admission %>%
ggplot(aes(x=LOR,y=`Chance of Admit`))+geom_point()+geom_smooth()
f<-df_admission %>%
ggplot(aes(x=CGPA,y=`Chance of Admit`))+geom_point()+geom_smooth()
ggarrange(a,b,c,d,e,f + rremove("x.text"),
ncol = 2, nrow = 3)To decrease the probability model overfit and bias, we will first split the dataset into training and testing data. We are going to split the data into 70% train set and 30% test set.
n <- nrow(df_admission)
n_data<- round(n * 0.7,0)
index <- sample(seq_len(nrow(df_admission)), size = n_data)
set.seed(111)
df_admission_train <- df_admission[index, ]
df_admission_test <- df_admission[-index, ]Now we want to see which of the predicted variables impacts the full model the most. So, we will start off by scaling the whole dataset
df_scaled<-df_admission%>% mutate_if(is.numeric,scale)
model_scaled<-lm(`Chance of Admit`~.,df_scaled)Based on the scaled model, we will the construct a bar plot to compare values between coefficients
plot_coeffs <- function(mlr_model) {
coeffs <- coefficients(mlr_model)
mp <- barplot(coeffs[order(coeffs,decreasing = T)], col="#3F97D0", xaxt='n', main="Regression Coefficients")
lablist <- names(coeffs)
text(mp, par("usr")[3], labels = lablist, srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=0.6)
}
plot_coeffs(model_scaled)The plot above shows the coefficient from largest to smallest of the model.
From the plot above, we see that the most impacful variables to the graduate admissions in terms of coefficients are GRE Score and TOEFL Score, in which we will try to do some modeling with those variables.
model_1<-lm(`Chance of Admit`~`GRE Score`+`TOEFL Score`,df_admission_train)
summary(model_1)##
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score`,
## data = df_admission_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32337 -0.04167 0.01267 0.04695 0.16062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.252122 0.117375 -19.187 < 2e-16 ***
## `GRE Score` 0.006779 0.000634 10.691 < 2e-16 ***
## `TOEFL Score` 0.007740 0.001190 6.504 2.73e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07516 on 347 degrees of freedom
## Multiple R-squared: 0.7221, Adjusted R-squared: 0.7205
## F-statistic: 450.8 on 2 and 347 DF, p-value: < 2.2e-16
Now, we will try to examine the possibility of an interaction between two variables
model_2<-lm(`Chance of Admit`~`GRE Score`+`TOEFL Score`+ `GRE Score`:`TOEFL Score`,df_admission_train)
summary(model_2)##
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` +
## `GRE Score`:`TOEFL Score`, data = df_admission_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32252 -0.04183 0.01083 0.04663 0.15937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.702e+00 1.808e+00 -0.941 0.347
## `GRE Score` 5.047e-03 5.711e-03 0.884 0.377
## `TOEFL Score` 2.578e-03 1.697e-02 0.152 0.879
## `GRE Score`:`TOEFL Score` 1.621e-05 5.315e-05 0.305 0.761
##
## Residual standard error: 0.07525 on 346 degrees of freedom
## Multiple R-squared: 0.7222, Adjusted R-squared: 0.7198
## F-statistic: 299.8 on 3 and 346 DF, p-value: < 2.2e-16
We see from the results above that a interactive relationship does not help either, the variables becomes insignificant and the adjusted R-Squared does not improve at all, so we stick to the first model which is :
\[ Model\ 1\\ Chance\ of\ Admit=-2.201218+ 0.006339\ GRE\ Score + 0.008537\ TOEFL\ Score \] And, we will call this Model 1
Now we are going to start modeling with creating a null and full model to create a stepwise regression
model_null<-lm(`Chance of Admit`~1,df_admission_train)
model_full<-lm(`Chance of Admit`~.,df_admission_train)
model_stepwise<-step(model_full,scope=list(lower=model_null,upper=model_full),direction = "both")## Start: AIC=-1965.47
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.00294 1.2202 -1966.6
## - `University Rating` 1 0.00357 1.2208 -1966.4
## <none> 1.2172 -1965.5
## - `TOEFL Score` 1 0.01918 1.2364 -1962.0
## - Research 1 0.02208 1.2393 -1961.2
## - `GRE Score` 1 0.04314 1.2604 -1955.3
## - LOR 1 0.05684 1.2741 -1951.5
## - CGPA 1 0.31742 1.5347 -1886.4
##
## Step: AIC=-1966.63
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` +
## LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - `University Rating` 1 0.00678 1.2270 -1966.7
## <none> 1.2202 -1966.6
## + SOP 1 0.00294 1.2172 -1965.5
## - `TOEFL Score` 1 0.02087 1.2411 -1962.7
## - Research 1 0.02253 1.2427 -1962.2
## - `GRE Score` 1 0.04247 1.2627 -1956.7
## - LOR 1 0.07259 1.2928 -1948.4
## - CGPA 1 0.34771 1.5679 -1880.9
##
## Step: AIC=-1966.69
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + LOR + CGPA +
## Research
##
## Df Sum of Sq RSS AIC
## <none> 1.2270 -1966.7
## + `University Rating` 1 0.00678 1.2202 -1966.6
## + SOP 1 0.00615 1.2208 -1966.4
## - Research 1 0.02390 1.2509 -1961.9
## - `TOEFL Score` 1 0.02668 1.2536 -1961.2
## - `GRE Score` 1 0.04189 1.2689 -1956.9
## - LOR 1 0.09495 1.3219 -1942.6
## - CGPA 1 0.38651 1.6135 -1872.8
Now we examine the model
summary(model_stepwise)##
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` +
## LOR + CGPA + Research, data = df_admission_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.266334 -0.023151 0.008303 0.035222 0.164953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3710322 0.1195284 -11.470 < 2e-16 ***
## `GRE Score` 0.0021409 0.0006247 3.427 0.000684 ***
## `TOEFL Score` 0.0027758 0.0010149 2.735 0.006561 **
## LOR 0.0237213 0.0045975 5.160 4.19e-07 ***
## CGPA 0.1191731 0.0114481 10.410 < 2e-16 ***
## Research 0.0201557 0.0077858 2.589 0.010040 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05972 on 344 degrees of freedom
## Multiple R-squared: 0.826, Adjusted R-squared: 0.8235
## F-statistic: 326.7 on 5 and 344 DF, p-value: < 2.2e-16
Now, we want to compare model 1 and model stepwise. First we generate predictions from the model
predicted1<-predict(model_1,df_admission_test)
predicted2<-predict(model_stepwise,df_admission_test)Now, to make sure that the predictions are still in bound of the variable restriction, which is from 0 to 1
predicted1<-ifelse(predicted1>1,1,predicted1)
predicted2<-ifelse(predicted1>1,1,predicted2)errors <- matrix(c(rmse_1,mae_1,mape_1,rmse_2,mae_2,mape_2),ncol=2,byrow=FALSE)
colnames(errors) <- c("Model 1","model stepwise")
rownames(errors) <- c("RMSE","MAE","MAPE")
error_table <- as.table(errors)
error_table## Model 1 model stepwise
## RMSE 0.08144377 0.06130972
## MAE 0.05947997 0.04339546
## MAPE 0.09383281 0.06757797
By the stepwise regression, it is found that only one variable which is the university ranking is omitted from the model, resulting in a multiple R-squared of 80.32%. We will Call this model stepwise :
\[ Chance\ of\ Admit = -1.371\ + 0.002\ GRE \ Score + 0.003\ TOEFL \ Score + 0.015\ LOR + 0.122\ CGPA + 0.0272 \ Research \]
We can see that in terms of the error performance, model stepwise outperforms model 1 in every aspect of the error metrics, and thus we can conclude that the stepwise regression has been successful in finding a better model than model 1.
Now that we have that settled, we are going to check if our model is in any way shape or form overfitted or underfitted.
First, we are going to create a prediction using the train data
predicted3<-predict(model_stepwise,df_admission_train)Now, we are going to calculate the error measures
Creating a table to compare the errors
errors2 <- matrix(c(rmse_3,mae_3,mape_3,rmse_2,mae_2,mape_2),ncol=2,byrow=FALSE)
colnames(errors2) <- c("Train Set ","Test Set")
rownames(errors2) <- c("RMSE","MAE","MAPE")
error_table2 <- as.table(errors2)
error_table2## Train Set Test Set
## RMSE 0.05920826 0.06130972
## MAE 0.04306291 0.04339546
## MAPE 0.06937679 0.06757797
Now, we see that the errors don’t differ as much, so we can conclude that our model is not over or under fitted.
After we have come into terms that model stepwise, the model resulted from the stepwise regression is the final model, it is time to evaluate and check if the underlying assumptions for a multiple linear regression model is fulfilled by model stepwise. In this part, we are going to use an alpha of 5%, which indicates a 5% significance.
First, we are going to check if our residuals in fact fulfills the normality assumption of the residuals of the model. We are going to use the shapiro-wilk test to check the normality of the residuals.
shapiro.test(model_stepwise$residuals)##
## Shapiro-Wilk normality test
##
## data: model_stepwise$residuals
## W = 0.94215, p-value = 1.888e-10
From the results above, we see that our p-value is less than of 5% which, indicates that null hypothesis is rejected. Thus, our model’s residuals is not normally distributed
Now, we will try plotting the residuals
hist(model_stepwise$residuals)From the plot above, we can see that the residuals of the model is in fact skewed to the left and not normally distributed.
Next, we are going to check if any auto-correlations exist between the predictor variables, we are using the variance inflation factor, in which a value greater than 10 indicates that auto-correlation in fact exists.
vif(model_stepwise)## `GRE Score` `TOEFL Score` LOR CGPA Research
## 5.082656 3.807112 1.733881 4.666194 1.469552
No autocorrelation exist! Now, we move on to homoscedasticity.
homoscedasticity is a test used to check the homogeneity of variance within the predictors. In this step, we are using the breusch-pagan test.
bptest(model_stepwise)##
## studentized Breusch-Pagan test
##
## data: model_stepwise
## BP = 25.882, df = 5, p-value = 9.408e-05
Since, the p-value is less than 5% and thus we reject null hypothesis. So, we can conclude that heteroscedasticity is in fact present in out model.
plot(model_stepwise,1)The plot shows that there is a fairly linear relationship between the fitted line and the residual value
In conclusion, we can conclude that while the stepwise regression managed to generate fairly low errors and good R-squared, the linear model assumptions which is normality and heteroscedasticity fails to adhere, and thus might be an indicator that our data is not really suitable for linear regression.
Now, we are going to try if Box-Cox transformation will work on our model.
lambda_try <- boxCox(model_stepwise, lambda = seq(-10, 10,by=0.1))lambda <- lambda_try$x[ which.max(lambda_try$y) ]Now, we transform the variable
df_admission_new <- df_admission %>%
mutate(`Chance of Admit` = (`Chance of Admit`^lambda - 1) / lambda)Then, we split into training and testing set
n <- nrow(df_admission_new)
n_data<- round(n * 0.7,0)
index <- sample(seq_len(nrow(df_admission_new)), size = n_data)
set.seed(111)
df_admission_new_train <- df_admission_new[index, ]
df_admission_new_test <- df_admission_new[-index, ]Now, we start the modeling process
model_4 <- lm(`Chance of Admit` ~ ., df_admission_new_train)
model_transformed <- step(model_4, trace = 0)Then, we examine the model
summary(model_transformed)##
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` +
## `University Rating` + LOR + CGPA + Research, data = df_admission_new_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.158650 -0.019352 0.006652 0.024205 0.094424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5705279 0.0789858 -19.884 < 2e-16 ***
## `GRE Score` 0.0013436 0.0003889 3.455 0.000619 ***
## `TOEFL Score` 0.0018498 0.0006681 2.769 0.005930 **
## `University Rating` 0.0084454 0.0027896 3.027 0.002653 **
## LOR 0.0082005 0.0029445 2.785 0.005649 **
## CGPA 0.0766884 0.0070841 10.825 < 2e-16 ***
## Research 0.0162931 0.0049671 3.280 0.001144 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03845 on 343 degrees of freedom
## Multiple R-squared: 0.8368, Adjusted R-squared: 0.834
## F-statistic: 293.2 on 6 and 343 DF, p-value: < 2.2e-16
With the model, we are able to get a higher Adjusted R-squared than model stepwise by about 3%
Now, we examine the errors
errors4 <- matrix(c(rmse_4,mae_4,mape_4),ncol=1,
byrow=FALSE)
colnames(errors4) <- c("Model Transformed")
rownames(errors4) <- c("RMSE","MAE","MAPE")
error_table4 <- as.table(errors4)
error_table4## Model Transformed
## RMSE 0.03496419
## MAE 0.02610167
## MAPE 0.15422195
# Normality
shapiro.test(model_transformed$residuals)##
## Shapiro-Wilk normality test
##
## data: model_transformed$residuals
## W = 0.94716, p-value = 7.297e-10
From the p-value obtained, we come to the conclusion of rejecting null hypothesis, which means that the residuals of the model, is not normally distributed even our box-cox transformation.
hist(model_transformed$residuals)From the plot of histogram using residuals, we are able to see that the histogram is skewed to the left, thus indicating a non-normal distribution of the residuals.
Next, we are going to check if any auto-correlations exist between the predictor variables, we are using the variance inflation factor, in which a value greater than 10 indicates that auto-correlation in fact exists.
vif(model_transformed)## `GRE Score` `TOEFL Score` `University Rating` LOR
## 4.572280 3.892519 2.317527 1.744561
## CGPA Research
## 4.447588 1.437115
From inspecting the variance inflation factor, we are able to conclude that our model does not contain any autocorrelation as none of the variance inflation factor exceeds 10.
bptest(model_transformed)##
## studentized Breusch-Pagan test
##
## data: model_transformed
## BP = 8.6446, df = 6, p-value = 0.1946
From the p-value, it exceeds 5% which we can conclude that we are going to accept null hypothesis of the breusch-pagan test, which means that our model is homoscedastic.
plot(model_transformed,1)From the plot above, we are able to see that the fitted values and residuals has a fairly linear relationship.
While transforming the model was able to result the fulfillment of the homoscedastic assumption, it is not able to get the model to adhere to the normality assumption.
predicted4<-predict(model_transformed,df_admission_new_test)To make sure the predicted chance of admittance is still in bound.
predicted4<-ifelse(predicted1>1,1,predicted4)predicted5<-predict(model_transformed,df_admission_new_test)To make sure the predicted chance of admittance is still in bound.
predicted5<-ifelse(predicted1>1,1,predicted5)errors5 <- matrix(c(rmse_4,mae_4,mape_4,rmse_5,mae_5,mape_5),ncol=2,byrow=FALSE)
colnames(errors5) <- c("Train Set ","Test Set")
rownames(errors5) <- c("RMSE","MAE","MAPE")
error_table5 <- as.table(errors5)
error_table5## Train Set Test Set
## RMSE 0.03482717 0.03482717
## MAE 0.02585899 0.02585899
## MAPE 0.15220554 0.15220554
As seen from the matrix above, the model is in fact not underfitted/overfitted as the values of the error metrics don’t differ as much.
Overall, there are three resulting models that we have created and improved :
Now, we will compare the errors
errors3 <- matrix(c(rmse_2,mae_2,mape_2,rmse_4,mae_4,mape_4),ncol=2,
byrow=FALSE)
colnames(errors3) <- c("Model Stepwise","Model Transformed")
rownames(errors3) <- c("RMSE","MAE","MAPE")
error_table3 <- as.table(errors3)
error_table3## Model Stepwise Model Transformed
## RMSE 0.06130972 0.03482717
## MAE 0.04339546 0.02585899
## MAPE 0.06757797 0.15220554
Now, we will compare the r-squared
r<- matrix(c(summary(model_stepwise)$adj.r.squared,summary(model_transformed)$adj.r.squared),ncol=2,
byrow=FALSE)
colnames(r) <- c("Model Stepwise","Model Transformed")
rownames(r) <- c("Rsquared")
rsq_table <- as.table(r)
rsq_table## Model Stepwise Model Transformed
## Rsquared 0.8235030 0.8339871
Over all, the box-cox transformation definitely helped in boosting the model’s performance in terms of RMSE, MAE and also adjusted r-squared.
However, there is a trade-off in which the MAPE of the model is increased. According to Lewis, C. D in the book Industrial and Business Forecasting Methods, a MAPE that is less than 10% goes into the category of highly accurate forecasting, while 10-20% goes into the category of good forecasting. Which means that while there is an increase of performance in some areas, there is a slight decay in the model’s forecast performance in terms of MAPE.
Furthermore, even with the Box-Cox transformation the residual of the model is not normally distributed, which violates the linear regression assumption. In conclusion, more methods to predict the chance of graduate admittance needs to be explored.