1.1 Main Objective
In this report, our final goal is to answer the following questions:
- Which factors significantly affect an applicant’s admission chance?
- Are those significant variables good enough to describe an applicant’s admission chance?
This is a Learning By Building project for Regression Model (RM). The data used for this analysis is Graduate Admission for Master Programs. In this data, several columns are considered for an applicant’s admission chance (stated as chance of admit). This analysis is separated into 6 parts.
Let’s get started!
In this part, we’re going to read our data and prepare it into “clean data”.
First, let’s read our data and assign it to graduate.
# read the data
graduate <- read.csv("data_input/Admission_Predict_Ver1.1.csv")
graduate
The data set contains 500 rows and 9 columns.
- GRE Scores
- TOEFL Scores
- University Rating
- Statement of Purpose
- Letter of Recommendation: stated as recommendation strength
- Undergraduate GPA
- Research Experience: 1 if the applicant has research experience, 0 if the applicant has none
- Chance of Admit
In this analysis, we will use Chance of Admit as our target variable.
Check whether any column in our dataset has a mismatched data type.
# check data types for each column
str(graduate)
#> 'data.frame': 500 obs. of 9 variables:
#> $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
#> $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
#> $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
#> $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
#> $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
#> $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
#> $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
#> $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
The Research column is better treated as categorical, since it only records 1 if the applicant has done research and 0 if the applicant has not.
library(dplyr)
graduate <- graduate %>%
  mutate(Research = as.factor(Research))
str(graduate)
#> 'data.frame': 500 obs. of 9 variables:
#> $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
#> $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
#> $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
#> $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
#> $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
#> $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
#> $ Research : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
#> $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
Our dataset columns are now in the proper types.
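As a small illustration of why the conversion matters, the sketch below uses a hypothetical toy data frame (not the admissions file) to show how R treats a 0/1 column once it is a factor: the model matrix gets a single dummy column for level "1", which is why the regression summaries later report a coefficient named Research1.

```r
# Hypothetical toy data, not the admissions file
toy <- data.frame(
  Research = c(0, 1, 1, 0, 1),
  Chance   = c(0.50, 0.80, 0.90, 0.55, 0.85)
)

# Convert the 0/1 indicator into a categorical variable
toy$Research <- as.factor(toy$Research)
research_levels <- levels(toy$Research)
research_levels

# The design matrix gets one dummy column for level "1";
# level "0" is the baseline and is absorbed into the intercept
design_cols <- colnames(model.matrix(~ Research, data = toy))
design_cols
```

The coefficient on that dummy column is then read as the expected difference in the target between applicants with and without research experience, holding the other predictors fixed.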
# check missing value
colSums(is.na(graduate))
#>        Serial.No.         GRE.Score       TOEFL.Score University.Rating
#>                 0                 0                 0                 0
#>               SOP               LOR              CGPA          Research
#>                 0                 0                 0                 0
#>   Chance.of.Admit
#>                 0
Good! Our data set has no missing values, so we can proceed to the next step.
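For comparison, here is what the same check looks like when values are missing. This is a minimal sketch on a hypothetical toy data frame with NA values inserted on purpose; the column names are borrowed from the data set only for readability.

```r
# Hypothetical toy data with deliberately missing values
toy <- data.frame(
  GRE.Score = c(337, NA, 316, 322),
  CGPA      = c(9.65, 8.87, NA, NA)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Row positions that contain at least one missing value
incomplete_rows <- which(!complete.cases(toy))
incomplete_rows
```

A non-zero count in any column would mean the rows listed in incomplete_rows need to be imputed or dropped before modelling.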
In this section, we’re going to drop the column that we won’t use. Our data has a Serial.No. column, and we’re not going to use it.
# drop Serial.No. column
graduate_clean <-
  graduate %>%
  select(-Serial.No.)
head(graduate_clean)
Now that we have clean data, we’re going to explore it. We can observe the data in general using summary(), and then look at the correlation between columns.
# data summary
summary(graduate_clean)
#> GRE.Score TOEFL.Score University.Rating SOP
#> Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
#> 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
#> Median :317.0 Median :107.0 Median :3.000 Median :3.500
#> Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
#> 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
#> LOR CGPA Research Chance.of.Admit
#> Min. :1.000 Min. :6.800 0:220 Min. :0.3400
#> 1st Qu.:3.000 1st Qu.:8.127 1:280 1st Qu.:0.6300
#> Median :3.500 Median :8.560 Median :0.7200
#> Mean :3.484 Mean :8.576 Mean :0.7217
#> 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:0.8200
#> Max. :5.000 Max. :9.920 Max. :0.9700
Insight:
From the boxplot, we can see there are outliers in the LOR and Chance of Admit columns.
## Variable Correlation
library(GGally)
ggcorr(graduate_clean, cex = 4, label = TRUE)
Assumption:
- CGPA has the strongest relation with the target variable.
- The Research column is not numeric, so we can’t see its correlation with our target variable (Chance.of.Admit) here.
To answer our main objective, we’ll build a regression model.
Before making our model, we have to split our data into train and test data. We will fit our model on the train data and check its performance on the test data. We’re going to split our data in an 80/20 proportion (80% of the data as train, and 20% as test).
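For intuition, an 80/20 split can be sketched in base R with sample(). This is only an illustration on hypothetical toy data: unlike the stratified split used in this report, the plain random split below does not balance the target distribution between the two sets.

```r
set.seed(100)

# Hypothetical toy data standing in for the cleaned data set
toy <- data.frame(id = 1:50, y = runif(50))

# Draw 80% of the row positions at random for training
n_train   <- floor(0.8 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]   # the remaining 20%

c(train = nrow(toy_train), test = nrow(toy_test))
```

Every row lands in exactly one of the two sets, so the test data stays unseen while the model is fitted.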
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(rsample)
# indexing the data
index_graduate <- initial_split(data = graduate_clean, prop = 0.8, strata = "Chance.of.Admit")
# splitting data
train <- training(index_graduate)
test <- testing(index_graduate)
First, let’s make a regression model with all columns as predictor variables, and one model that has no predictor at all.
# regression model with no predictor variable
model_none <- lm(formula = Chance.of.Admit ~ 1, data = train)
summary(model_none)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ 1, data = train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.3806 -0.0906 -0.0006 0.0994 0.2494
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.720602 0.007065 102 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1411 on 398 degrees of freedom
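An intercept-only model simply predicts the mean of the target for everyone, so its intercept equals the sample mean and its residual standard error equals the sample standard deviation. The sketch below checks this on R's built-in mtcars data, which is an assumption made only to keep the example self-contained.

```r
# Intercept-only model on a built-in data set (illustration only)
fit_null <- lm(mpg ~ 1, data = mtcars)

intercept   <- unname(coef(fit_null)[1])
target_mean <- mean(mtcars$mpg)

resid_se  <- summary(fit_null)$sigma
target_sd <- sd(mtcars$mpg)

c(intercept = intercept, mean = target_mean)
c(resid_se = resid_se, sd = target_sd)
```

Read this way, the intercept of 0.7206 above is the average chance of admit in the train data, and 0.1411 is its standard deviation: the baseline any model with predictors has to beat.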
# regression model with all column as predictor variable
model_all <- lm(formula = Chance.of.Admit ~ ., data = train)
summary(model_all)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ ., data = train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.27069 -0.02389 0.01083 0.03394 0.16295
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3463885 0.1204243 -11.180 < 0.0000000000000002 ***
#> GRE.Score 0.0023265 0.0005749 4.047 0.0000626 ***
#> TOEFL.Score 0.0022165 0.0009997 2.217 0.0272 *
#> University.Rating 0.0068044 0.0043921 1.549 0.1221
#> SOP 0.0006159 0.0054084 0.114 0.9094
#> LOR 0.0200814 0.0046805 4.290 0.0000225 ***
#> CGPA 0.1155127 0.0112072 10.307 < 0.0000000000000002 ***
#> Research1 0.0185453 0.0075253 2.464 0.0142 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06173 on 391 degrees of freedom
#> Multiple R-squared: 0.812, Adjusted R-squared: 0.8086
#> F-statistic: 241.3 on 7 and 391 DF, p-value: < 0.00000000000000022
From this base model, we can get this information:
GRE.Score, LOR, and CGPA have the most significant influence in our model (p < 0.001). TOEFL.Score and Research are significant at the 5% level, while University.Rating and SOP are not.
We’re going to improve our base model with the step-wise method, which removes predictors from or adds predictors to the model. The step-wise method can run in 3 directions: backward, forward, and both. We’ll try all 3 and choose our best model.
model_backward <- step(object = model_all, direction = "backward")
#> Start: AIC=-2214.5
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> SOP + LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> - SOP 1 0.00005 1.4900 -2216.5
#> <none> 1.4900 -2214.5
#> - University.Rating 1 0.00915 1.4991 -2214.1
#> - TOEFL.Score 1 0.01873 1.5087 -2211.5
#> - Research 1 0.02314 1.5131 -2210.3
#> - GRE.Score 1 0.06240 1.5523 -2200.1
#> - LOR 1 0.07015 1.5601 -2198.1
#> - CGPA 1 0.40482 1.8948 -2120.6
#>
#> Step: AIC=-2216.48
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> <none> 1.4900 -2216.5
#> - University.Rating 1 0.01118 1.5012 -2215.5
#> - TOEFL.Score 1 0.01892 1.5089 -2213.4
#> - Research 1 0.02318 1.5132 -2212.3
#> - GRE.Score 1 0.06235 1.5523 -2202.1
#> - LOR 1 0.07755 1.5675 -2198.2
#> - CGPA 1 0.43108 1.9211 -2117.1
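The AIC values in this trace can be reproduced by hand. step() ranks candidates with extractAIC(), which for a linear model uses n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below plugs in the rounded RSS from the trace; n = 399 is inferred from the residual degrees of freedom reported in the model summaries (391 plus 8 coefficients in the full model).

```r
n   <- 399      # training rows, inferred from the residual df
rss <- 1.4900   # RSS as printed in the trace (rounded)

# Full model: intercept + 7 predictors
aic_full <- n * log(rss / n) + 2 * 8

# After dropping SOP: intercept + 6 predictors, RSS unchanged to 4 decimals
aic_selected <- n * log(rss / n) + 2 * 7

round(c(full = aic_full, selected = aic_selected), 2)
```

Because removing SOP leaves the RSS essentially unchanged, the only effect is one fewer parameter to penalise, which lowers the AIC by about 2 and is why SOP is dropped first.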
summary(model_backward)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> LOR + CGPA + Research, data = train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.27051 -0.02375 0.01121 0.03409 0.16309
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3483036 0.1190943 -11.321 < 0.0000000000000002 ***
#> GRE.Score 0.0023251 0.0005741 4.050 0.00006169 ***
#> TOEFL.Score 0.0022235 0.0009966 2.231 0.0262 *
#> University.Rating 0.0069893 0.0040760 1.715 0.0872 .
#> LOR 0.0202338 0.0044795 4.517 0.00000831 ***
#> CGPA 0.1158148 0.0108751 10.650 < 0.0000000000000002 ***
#> Research1 0.0185593 0.0075148 2.470 0.0139 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06165 on 392 degrees of freedom
#> Multiple R-squared: 0.812, Adjusted R-squared: 0.8091
#> F-statistic: 282.2 on 6 and 392 DF, p-value: < 0.00000000000000022
model_forward <- step(object = model_none,
                      direction = "forward",
                      scope = list(lower = model_none, upper = model_all),
                      trace = 0)
summary(model_forward)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research +
#> TOEFL.Score + University.Rating, data = train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.27051 -0.02375 0.01121 0.03409 0.16309
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3483036 0.1190943 -11.321 < 0.0000000000000002 ***
#> CGPA 0.1158148 0.0108751 10.650 < 0.0000000000000002 ***
#> GRE.Score 0.0023251 0.0005741 4.050 0.00006169 ***
#> LOR 0.0202338 0.0044795 4.517 0.00000831 ***
#> Research1 0.0185593 0.0075148 2.470 0.0139 *
#> TOEFL.Score 0.0022235 0.0009966 2.231 0.0262 *
#> University.Rating 0.0069893 0.0040760 1.715 0.0872 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06165 on 392 degrees of freedom
#> Multiple R-squared: 0.812, Adjusted R-squared: 0.8091
#> F-statistic: 282.2 on 6 and 392 DF, p-value: < 0.00000000000000022
model_both <- step(object = model_all, direction = "both")
#> Start: AIC=-2214.5
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> SOP + LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> - SOP 1 0.00005 1.4900 -2216.5
#> <none> 1.4900 -2214.5
#> - University.Rating 1 0.00915 1.4991 -2214.1
#> - TOEFL.Score 1 0.01873 1.5087 -2211.5
#> - Research 1 0.02314 1.5131 -2210.3
#> - GRE.Score 1 0.06240 1.5523 -2200.1
#> - LOR 1 0.07015 1.5601 -2198.1
#> - CGPA 1 0.40482 1.8948 -2120.6
#>
#> Step: AIC=-2216.48
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> <none> 1.4900 -2216.5
#> - University.Rating 1 0.01118 1.5012 -2215.5
#> + SOP 1 0.00005 1.4900 -2214.5
#> - TOEFL.Score 1 0.01892 1.5089 -2213.4
#> - Research 1 0.02318 1.5132 -2212.3
#> - GRE.Score 1 0.06235 1.5523 -2202.1
#> - LOR 1 0.07755 1.5675 -2198.2
#> - CGPA 1 0.43108 1.9211 -2117.1
summary(model_both)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> LOR + CGPA + Research, data = train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.27051 -0.02375 0.01121 0.03409 0.16309
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3483036 0.1190943 -11.321 < 0.0000000000000002 ***
#> GRE.Score 0.0023251 0.0005741 4.050 0.00006169 ***
#> TOEFL.Score 0.0022235 0.0009966 2.231 0.0262 *
#> University.Rating 0.0069893 0.0040760 1.715 0.0872 .
#> LOR 0.0202338 0.0044795 4.517 0.00000831 ***
#> CGPA 0.1158148 0.0108751 10.650 < 0.0000000000000002 ***
#> Research1 0.0185593 0.0075148 2.470 0.0139 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06165 on 392 degrees of freedom
#> Multiple R-squared: 0.812, Adjusted R-squared: 0.8091
#> F-statistic: 282.2 on 6 and 392 DF, p-value: < 0.00000000000000022
After making 3 regression models, let’s compare them and choose which model is the best.
library(performance)
compare_performance(model_backward, model_forward, model_both)
Every model has the same AIC and adjusted R-squared, which indicates that all three procedures selected the same model. For further analysis we’re going to use model_both.
Notes: Since every model is the same (same AIC and adjusted R-squared), we can use whichever model we prefer.
The chosen model has an adjusted R-squared of 0.809 (80.9%), which means our regression model explains 80.9% of the variance in our target variable (admission chance); the rest is due to variables that are not in the model.
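The adjusted R-squared can be recovered from the multiple R-squared in the summary output. A minimal sketch, assuming the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1) with n = 399 training rows (392 residual degrees of freedom plus 7 coefficients) and p = 6 predictors:

```r
r2 <- 0.812   # Multiple R-squared from the summary (rounded)
n  <- 399     # training rows, inferred from the residual df
p  <- 6       # predictors in the selected model

# Penalise R-squared for the number of predictors used
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

The penalty is small here because the sample is large relative to the number of predictors, so the adjusted value sits only slightly below 0.812.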
There are several ways to evaluate our model.
In a regression model, several assumptions must be fulfilled before we can conclude that our model is a “good regression model”.
In the linearity test, the target variable must have a linear relation with each predictor variable. In our case, we use 6 predictor variables, and we have to check the relation of each one with the target variable.
model_both$call
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
#> LOR + CGPA + Research, data = train)
ggcorr(train, cex = 3, label = TRUE)
Every numeric predictor we use in our model has a strong relation with our target variable. Let’s check University.Rating, whose correlation value is 0.7.
cor.test(train$University.Rating, train$Chance.of.Admit)
#>
#> Pearson's product-moment correlation
#>
#> data: train$University.Rating and train$Chance.of.Admit
#> t = 18.795, df = 397, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.6304810 0.7348535
#> sample estimates:
#> cor
#> 0.6861828
The p-value from cor.test is less than 5%, which indicates that University.Rating has a linear relation with Chance.of.Admit.
In the normality of residuals test, we want the residuals (errors) to follow a normal distribution. For this test, we can use shapiro.test().
shapiro.test(model_both$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_both$residuals
#> W = 0.92285, p-value = 0.0000000000001865
check_normality(model_both)
#> Warning: Non-normality of residuals detected (p < .001).
The p-value for the normality test is 1.865e-13, which is very small (< 5%). Since the null hypothesis is that the residuals are normally distributed, and our p-value is < 5%, we have to reject the null hypothesis (the residuals are not normally distributed).
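To see how the test behaves, the sketch below runs shapiro.test() on two simulated samples: one drawn from a normal distribution and one drawn from a strongly right-skewed exponential distribution. Both samples are hypothetical and only illustrate how to read the p-value.

```r
set.seed(1)

normal_like <- rnorm(200)   # simulated, roughly bell-shaped
skewed      <- rexp(200)    # simulated, strongly right-skewed

p_normal <- shapiro.test(normal_like)$p.value
p_skewed <- shapiro.test(skewed)$p.value

# A p-value below 0.05 rejects the null hypothesis of normality
c(normal_like = p_normal, skewed = p_skewed)
```

The skewed sample is rejected decisively, which is the same situation our model residuals are in.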
Recommendation:
In the homoscedasticity test, we want the residuals to have constant variance, with no specific pattern.
library(lmtest)
bptest(model_both)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_both
#> BP = 23.162, df = 6, p-value = 0.0007439
check_heteroscedasticity(model_both)
#> Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
From the Breusch-Pagan test, we got a p-value of 0.0007439 (< 5%). We can conclude that heteroscedasticity is present in our model (the residual variance is not constant).
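The idea behind the studentized Breusch-Pagan statistic can be sketched with base R alone: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by the number of observations. The example uses the built-in mtcars data with two arbitrarily chosen predictors, purely as an assumption to keep it self-contained.

```r
# Illustrative model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- transform(mtcars, sq_res = residuals(fit)^2)
aux      <- lm(sq_res ~ wt + hp, data = aux_data)

n       <- nrow(mtcars)
bp_stat <- n * summary(aux)$r.squared   # studentized (Koenker) statistic
bp_df   <- 2                            # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

If the predictors explain a noticeable share of the squared residuals, the statistic grows and the p-value shrinks, which is what happened with df = 6 in our model.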
library(car)
vif(model_both)
#>         GRE.Score       TOEFL.Score University.Rating               LOR
#>          4.197792          3.826486          2.279753          1.890667
#>              CGPA          Research
#>          4.423021          1.459647
Each predictor variable has VIF < 10, so we can conclude there is no multicollinearity in our model.
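For a predictor with a single coefficient, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on the built-in mtcars data; the three predictors are an arbitrary choice for illustration.

```r
predictors <- c("wt", "hp", "disp")

# Regress each predictor on the others and convert R-squared to a VIF
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux    <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

round(vif_manual, 2)
```

A VIF of 1 means the predictor is unrelated to the others, while a value near 10 means about 90% of its variance is already explained by them.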
We have done the assumption tests for our model. To sum up our tests:
Here are recommendations to improve our model:
There are several error metrics we can use to evaluate our model; in this evaluation we’re going to use RMSE.
Let’s try making predictions on the test data using predict().
pred <- predict(model_both, newdata = test)
head(pred)
#>         1         3         4        11        12        16
#> 0.9527978 0.6545252 0.7391786 0.7363406 0.8387032 0.6480498
The output from predict() is the predicted value of our target variable (chance of admit) for each row of the test data.
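A prediction from a linear model is just the intercept plus each coefficient multiplied by the applicant's value. The sketch below reproduces the first prediction by hand, using the rounded coefficients printed in the model_both summary and the values of applicant 1 from the str() output (GRE 337, TOEFL 118, University Rating 4, LOR 4.5, CGPA 9.65, Research 1).

```r
# Coefficients as printed in the model_both summary (rounded)
coefs <- c(
  intercept         = -1.3483036,
  GRE.Score         =  0.0023251,
  TOEFL.Score       =  0.0022235,
  University.Rating =  0.0069893,
  LOR               =  0.0202338,
  CGPA              =  0.1158148,
  Research1         =  0.0185593
)

# Applicant 1: a leading 1 for the intercept, then the predictor values
applicant <- c(1, 337, 118, 4, 4.5, 9.65, 1)

pred_manual <- sum(coefs * applicant)
round(pred_manual, 4)
```

The result agrees with the first value returned by predict() up to the rounding of the printed coefficients.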
For model error evaluation, we will use RMSE. We will compute the RMSE of the predictions on the test data and of the fitted values on the train data.
# RMSE using test data
library(MLmetrics)
RMSE(y_pred = pred, y_true = test$Chance.of.Admit)
#> [1] 0.05337702
Let’s check RMSE from train data
# RMSE using train data
RMSE(y_pred = model_both$fitted.values, y_true = train$Chance.of.Admit)
#> [1] 0.06110918
The RMSE on the test data is slightly lower than the RMSE on the train data (0.0534 < 0.0611).
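RMSE is the square root of the mean squared difference between actual and predicted values. A minimal base R sketch, where rmse() is a hypothetical helper name and the predicted values are made up for illustration:

```r
# Hypothetical helper: root mean squared error
rmse <- function(y_true, y_pred) {
  sqrt(mean((y_true - y_pred)^2))
}

y_true <- c(0.92, 0.76, 0.72)   # example target values
y_pred <- c(0.95, 0.70, 0.72)   # made-up predictions

rmse_toy <- rmse(y_true, y_pred)
rmse_toy
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones, and the result stays in the same unit as the target.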
Is our error big or small? To answer this, we have to compare our error value with our data range or compare it with our data distribution.
# check original data range for chance of admit
summary(graduate_clean$Chance.of.Admit)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>  0.3400  0.6300  0.7200  0.7217  0.8200  0.9700
In our original data, the minimum chance of admit is 0.34 and the maximum is 0.97, a range of 0.63. Our larger error is 0.0611. Compared to that range, even our larger RMSE is small.
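One way to make this comparison concrete is to divide each RMSE by the range of the target. The sketch below uses the RMSE values and the minimum and maximum reported above; dividing by the range is one common normalisation, chosen here as an assumption.

```r
rmse_train <- 0.06110918
rmse_test  <- 0.05337702

# Range of Chance.of.Admit in the original data
target_range <- 0.97 - 0.34

nrmse_train <- rmse_train / target_range
nrmse_test  <- rmse_test / target_range

round(c(train = nrmse_train, test = nrmse_test), 3)
```

Both errors come out below 10% of the range the target can take.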
To sum up our analysis, we return to our main objective and answer the following questions:
GRE Score, TOEFL Score, Letter of Recommendation, Undergraduate GPA (CGPA), and Research Experience significantly affect an applicant’s admission chance at the 5% level. University Rating is also kept by the step-wise selection, but it is only significant at the 10% level (p = 0.0872).
Here are recommendations that may improve our model: