We’ll do some exploratory analysis and prediction for a numeric target variable. The data, provided at https://www.kaggle.com/mohansacharya/graduate-admissions?select=Admission_Predict.csv, records 400 observations, and 8 of its variables (columns) might affect the value of the target column, Chance.of.Admit. We’ll predict this variable and identify which of the other variables contribute to its value.
library(dplyr) #wrangling
library(GGally) #correlation
library(MLmetrics) #error check
library(lmtest) #heteroscedasticity
library(car) #multicollinearity/vif
First of all, use read.csv() from base R to read graduate_admission.csv, which has been saved in a local folder, and assign it to grad.
grad <- read.csv("data_input/graduate_admission.csv")
head(grad)
anyNA(grad)
## [1] FALSE
Drop variables that don’t provide or explain any information about the target variable; in our case: Serial.No. To wrangle the data, we can use %>% from library(dplyr).
grad <- grad %>%
select(-Serial.No.)
head(grad)
GRE Score: out of 340
TOEFL Score: out of 120
University Rating: out of 5
SOP (Statement of Purpose): out of 5
LOR (Letter of Recommendation): out of 5
CGPA (Undergraduate GPA): out of 10
Research: research experience, either 0 or 1 (we assume 0 means no research experience)
Chance of Admit: ranging from 0 to 1
We want to see the data type of each variable by using str() or glimpse() from library(dplyr), then do some transformation using mutate().
str(grad)
## 'data.frame': 400 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
# change data type into categorical
grad <- grad %>%
mutate(Research = as.factor(Research))
str(grad)
## 'data.frame': 400 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
# library(GGally)
ggcorr(grad, label = TRUE, hjust = 1, label_size = 3.5, layout.exp = 2)
According to the plot, each variable has a positive correlation with every other variable, mostly moderate to strong. In particular, Chance.of.Admit shows strong positive correlations with CGPA (0.9), TOEFL.Score (0.8), and GRE.Score (0.8).
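If we want the exact numbers behind the plot, a minimal sketch using base R prints the correlation matrix of the numeric columns (Research is excluded because it is now a factor):
# numeric correlation matrix as a cross-check of the ggcorr() plot
round(cor(grad[sapply(grad, is.numeric)]), 2)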
Cross validation is known here as the step of splitting our data into a training set and a test set, with a ratio that depends on the business demand. As the names suggest, we use the training set to train our models and the test set to confirm their performance.
RNGkind(sample.kind = "Rounding")
set.seed(111)
idx <- sample(nrow(grad), nrow(grad) * 0.75)
grad_train <- grad[idx, ]
grad_test <- grad[-idx, ]
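As a quick sanity check of the split (a minimal sketch), we can count the rows in each set; 75% of the 400 observations should end up in grad_train:
# expect 300 rows in grad_train and 100 rows in grad_test (75% / 25% of 400)
c(train = nrow(grad_train), test = nrow(grad_test))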
names(grad)
## [1] "GRE.Score" "TOEFL.Score" "University.Rating"
## [4] "SOP" "LOR" "CGPA"
## [7] "Research" "Chance.of.Admit"
Note that we are going to build more than one model to predict the test data later. For the first model, we use the predictor with the strongest correlation to our target: CGPA.
model1 <- lm(Chance.of.Admit ~ CGPA, grad_train)
summary(model1)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = grad_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.273007 -0.029511 0.009633 0.040942 0.183168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.085506 0.059870 -18.13 <2e-16 ***
## CGPA 0.210292 0.006958 30.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07019 on 298 degrees of freedom
## Multiple R-squared: 0.754, Adjusted R-squared: 0.7532
## F-statistic: 913.4 on 1 and 298 DF, p-value: < 2.2e-16
As summarised, CGPA shows a significant effect in our model (p-value < 0.05, significance code ***). Numerically, every 1-point increase in CGPA increases the predicted Chance.of.Admit by 0.210292. The adjusted R-squared for this model is 0.7532, which means that model1 can explain 75.32% of the variance in Chance.of.Admit.
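To make this interpretation concrete, here is a minimal sketch that plugs a hypothetical CGPA of 9 (an illustrative value, not taken from the data) into the fitted equation of model1:
# Chance.of.Admit = -1.085506 + 0.210292 * CGPA, using the coefficients above
-1.085506 + 0.210292 * 9   # about 0.807
# the same prediction via predict()
predict(model1, newdata = data.frame(CGPA = 9))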
model2 <- lm(Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, grad_train)
summary(model2)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score,
## data = grad_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.287689 -0.020286 0.008645 0.039000 0.134669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5616272 0.1198263 -13.032 < 2e-16 ***
## CGPA 0.1503685 0.0130713 11.504 < 2e-16 ***
## TOEFL.Score 0.0032844 0.0012966 2.533 0.01182 *
## GRE.Score 0.0020168 0.0006634 3.040 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06726 on 296 degrees of freedom
## Multiple R-squared: 0.7756, Adjusted R-squared: 0.7733
## F-statistic: 341 on 3 and 296 DF, p-value: < 2.2e-16
As shown, the adjusted R-squared of model2 is 0.7733: the top 3 most strongly correlated variables (CGPA, TOEFL.Score, GRE.Score) explain 77.33% of the variance in Chance.of.Admit, a slight improvement over model1. Can we say it is the best model so far? Not yet. Let’s create one last model by using the step() function.
lm_3 <- lm(Chance.of.Admit ~ ., grad_train)
step(lm_3, direction = "backward")
## Start: AIC=-1632.94
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.002812 1.2332 -1634.2
## - University.Rating 1 0.003744 1.2341 -1634.0
## <none> 1.2304 -1632.9
## - GRE.Score 1 0.022827 1.2532 -1629.4
## - Research 1 0.023518 1.2539 -1629.3
## - TOEFL.Score 1 0.024838 1.2552 -1628.9
## - LOR 1 0.053216 1.2836 -1622.2
## - CGPA 1 0.313747 1.5441 -1566.8
##
## Step: AIC=-1634.25
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - University.Rating 1 0.002023 1.2352 -1635.8
## <none> 1.2332 -1634.2
## - Research 1 0.022949 1.2561 -1630.7
## - TOEFL.Score 1 0.023474 1.2567 -1630.6
## - GRE.Score 1 0.024095 1.2573 -1630.5
## - LOR 1 0.052415 1.2856 -1623.8
## - CGPA 1 0.310955 1.5441 -1568.8
##
## Step: AIC=-1635.76
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## <none> 1.2352 -1635.8
## - Research 1 0.02428 1.2595 -1631.9
## - GRE.Score 1 0.02491 1.2601 -1631.8
## - TOEFL.Score 1 0.02638 1.2616 -1631.4
## - LOR 1 0.06644 1.3016 -1622.0
## - CGPA 1 0.34753 1.5827 -1563.4
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = grad_train)
##
## Coefficients:
## (Intercept) GRE.Score TOEFL.Score LOR CGPA Research1
## -1.295564 0.001622 0.003133 0.022585 0.125403 0.021996
The step() function recommends the regression model with the lowest AIC (Akaike Information Criterion). Note: here we drop the categorical variable (Research) that step() kept, so that model3 uses only numeric predictors.
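Instead of copying the selected formula by hand, the result of step() can also be stored directly; this is a minimal sketch where model_step is a hypothetical object name and trace = 0 only suppresses the step-by-step log:
# store the backward-selected model; its formula still includes Research
model_step <- step(lm_3, direction = "backward", trace = 0)
formula(model_step)   # Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research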
model3 <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA, data = grad_train)
summary(model3)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA, data = grad_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.276348 -0.020703 0.009176 0.036326 0.147764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4323082 0.1201939 -11.917 < 2e-16 ***
## GRE.Score 0.0020700 0.0006446 3.211 0.00147 **
## TOEFL.Score 0.0030853 0.0012604 2.448 0.01495 *
## LOR 0.0244871 0.0056694 4.319 2.14e-05 ***
## CGPA 0.1259759 0.0138974 9.065 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06534 on 295 degrees of freedom
## Multiple R-squared: 0.789, Adjusted R-squared: 0.7861
## F-statistic: 275.7 on 4 and 295 DF, p-value: < 2.2e-16
model3 has a slightly higher adjusted R-squared: 0.7861. One interesting point from summary(model3) is that LOR turns out to be a significant predictor. So, let’s try to create model4 with CGPA and LOR as the predictors.
model4 <- lm(Chance.of.Admit ~ LOR + CGPA, grad_train)
summary(model4)
##
## Call:
## lm(formula = Chance.of.Admit ~ LOR + CGPA, data = grad_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26601 -0.02496 0.01068 0.04066 0.19093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.951596 0.066385 -14.334 < 2e-16 ***
## LOR 0.024918 0.005922 4.208 3.42e-05 ***
## CGPA 0.184664 0.009107 20.276 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0683 on 297 degrees of freedom
## Multiple R-squared: 0.7678, Adjusted R-squared: 0.7663
## F-statistic: 491.1 on 2 and 297 DF, p-value: < 2.2e-16
summary(model1)$adj #1 pred
## [1] 0.753173
summary(model2)$adj #3 pred
## [1] 0.7733309
summary(model3)$adj #5 pred
## [1] 0.7860898
summary(model4)$adj #2 pred
## [1] 0.7662739
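For convenience, the same adjusted R-squared values can be collected into one named vector (a minimal sketch built from the four models above):
# adjusted R-squared of all four candidate models in one place
sapply(list(model1 = model1, model2 = model2, model3 = model3, model4 = model4),
       function(m) summary(m)$adj.r.squared)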
Considering the adjusted R-squared values and the number of predictor variables of each model, our top 2 models are model2 and model3.
The next step is to evaluate our models by predicting the test data using predict(). Don’t forget that we only use model2 and model3 to predict the test data; in this phase, we’re still looking for the best model. Which one performs better with the least error, model2 or model3?
# predict the target variable in the test data with each candidate model
predm2 <- predict(model2, newdata = grad_test)
predm3 <- predict(model3, newdata = grad_test)
# MAPE of predm2 to data test
MAPE(predm2, grad_test$Chance.of.Admit)
## [1] 0.07661793
# MAPE of predm2 compared with the training targets (note: predm2 are test-set predictions)
MAPE(predm2, grad_train$Chance.of.Admit)
## [1] 0.2509074
# MAPE of predm3 to data test
MAPE(predm3, grad_test$Chance.of.Admit)
## [1] 0.07523643
# MAPE of predm3 compared with the training targets (note: predm3 are test-set predictions)
MAPE(predm3, grad_train$Chance.of.Admit)
## [1] 0.2496666
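Note that predm2 and predm3 contain predictions for the test set, so comparing them with grad_train$Chance.of.Admit does not give a true training error. A minimal sketch of how the training-set MAPE could be computed instead (its values will differ from the numbers printed above):
# MAPE on the training set: predict on grad_train and compare with its own targets
MAPE(predict(model2, newdata = grad_train), grad_train$Chance.of.Admit)
MAPE(predict(model3, newdata = grad_train), grad_train$Chance.of.Admit)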
The smaller the MAPE value, the better our model predicts the data. Ideally, the test error should also be close to the model’s error on the training data, which indicates the model generalizes well.
For this case, model3 is slightly better than model2, with a test MAPE of 0.07523643, i.e. its predictions are off by roughly 7.5% on average.
Finally, let’s check model3 against the four classical assumptions of a linear regression model: linearity, normality of residuals, homoscedasticity, and no multicollinearity.
Linearity, tested with cor.test() (we want to reject H0, i.e. conclude that the true correlation is not 0):
cor.test(grad_train$GRE.Score, grad_train$Chance.of.Admit)
##
## Pearson's product-moment correlation
##
## data: grad_train$GRE.Score and grad_train$Chance.of.Admit
## t = 21.987, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7391407 0.8261988
## sample estimates:
## cor
## 0.7865468
cor.test(grad_train$TOEFL.Score, grad_train$Chance.of.Admit)
##
## Pearson's product-moment correlation
##
## data: grad_train$TOEFL.Score and grad_train$Chance.of.Admit
## t = 21.654, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7336523 0.8223550
## sample estimates:
## cor
## 0.7819308
cor.test(grad_train$LOR, grad_train$Chance.of.Admit)
##
## Pearson's product-moment correlation
##
## data: grad_train$LOR and grad_train$Chance.of.Admit
## t = 15.504, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6003691 0.7264558
## sample estimates:
## cor
## 0.6681827
cor.test(grad_train$CGPA, grad_train$Chance.of.Admit)
##
## Pearson's product-moment correlation
##
## data: grad_train$CGPA and grad_train$Chance.of.Admit
## t = 30.222, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8374355 0.8936943
## sample estimates:
## cor
## 0.8683309
All of these tests have p-values below the 0.05 significance level, so we reject the null hypothesis: each predictor has a significant correlation with the target, which satisfies the linearity assumption.
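The four tests above can also be run in one pass; this is a minimal sketch that only extracts the p-values for the predictors used in model3:
# p-values of cor.test() against Chance.of.Admit for each predictor
preds <- c("GRE.Score", "TOEFL.Score", "LOR", "CGPA")
sapply(preds, function(p) cor.test(grad_train[[p]], grad_train$Chance.of.Admit)$p.value)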
Normality of residuals, tested with shapiro.test() (we want to fail to reject H0, i.e. the residuals are normally distributed):
shapiro.test(model3$residuals)
##
## Shapiro-Wilk normality test
##
## data: model3$residuals
## W = 0.90795, p-value = 1.415e-12
We conclude that the null hypothesis is rejected (the p-value is far below 0.05), which means the residuals are not normally distributed.
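A histogram of the residuals gives a quick visual impression of the same result (a minimal sketch using base R graphics):
# visual check of the residual distribution of model3
hist(model3$residuals, breaks = 30, main = "Residuals of model3", xlab = "residual")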
Homoscedasticity, tested with bptest() from library(lmtest) (we want to fail to reject H0, i.e. the error variance is constant):
bptest(model3)
##
## studentized Breusch-Pagan test
##
## data: model3
## BP = 13.033, df = 4, p-value = 0.01112
The null hypothesis is rejected (p-value 0.01112 < 0.05), which means heteroscedasticity is present in our model.
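A plot of residuals against fitted values shows the same issue visually (a minimal sketch using base R graphics); a funnel-shaped spread suggests non-constant variance:
# residuals vs fitted values of model3
plot(model3$fitted.values, model3$residuals, xlab = "fitted values", ylab = "residuals")
abline(h = 0, col = "red")   # reference line at zero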
No multicollinearity, checked with vif() from library(car):
vif(model3)
## GRE.Score TOEFL.Score LOR CGPA
## 3.856675 3.868141 1.811634 4.602878
The result shows all predictors have VIF values below 10, which indicates there is no multicollinearity.
Even though our model doesn’t fulfill all 4 assumptions of a regression model, whether to accept or tune the current model depends on the business case. For now, we will keep it because it is still tolerable: model3 fulfills two of the four assumptions, linearity and no multicollinearity.
Furthermore, model3, with predictors GRE.Score, TOEFL.Score, LOR, and CGPA, performs quite well. First, looking at the adjusted R-squared, 78.6% of the variance in Chance.of.Admit is explained by those predictor variables. Second, the model’s accuracy in predicting the target in the test data is reflected in a low error: its MAPE on the test set is about 0.075, meaning the predictions are off by roughly 7.5% on average.
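As a closing illustration, here is a minimal sketch of scoring a single hypothetical applicant with model3 (the input values are made up for illustration, not taken from the dataset):
# hypothetical new applicant
new_applicant <- data.frame(GRE.Score = 320, TOEFL.Score = 110, LOR = 4, CGPA = 9)
predict(model3, newdata = new_applicant)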