1 General Brief

This is a Learning By Building project for Regression Model (RM). The data used for this analysis is Graduate Admission for Master Programs. The data contains several columns that are considered when estimating an applicant’s admission chance (stated as chance of admit). This analysis is separated into 7 parts:

  1. General brief
  2. Pre-processing Data
  3. Data exploration
  4. Modelling
  5. Model evaluation
  6. Conclusion
  7. Reference

Let’s get started!

1.1 Main Objective

In this report, our final goal is to answer the following questions:

  1. What factors significantly affect an applicant’s admission chance?
  2. Are those significant variables good enough to describe an applicant’s admission chance?

2 Pre-processing Data

In this part, we’re going to read our data, and prepare it into “clean data”.

2.1 Read The Data

First, let’s read our data and assign it to graduate.

# read the data
graduate <- read.csv("data_input/Admission_Predict_Ver1.1.csv")
graduate

The data set contains 500 rows and 9 columns: a serial number (Serial No.) plus the 8 columns below.

  1. GRE Scores
  2. TOEFL Scores
  3. University Rating
  4. Statement of Purpose
  5. Letter of Recommendation : Stated in recommendation strength
  6. Undergraduate GPA
  7. Research Experience : 1 if the applicant has research experience, and 0 if the applicant has no research experience
  8. Chance of Admit

In this analysis, we will use Chance of Admit as our target variable.

2.2 Check and Data Cleansing

2.2.1 Data Type

Check whether there are any mismatched data types in our dataset.

# check data types for each column
str(graduate)
#> 'data.frame':    500 obs. of  9 variables:
#>  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
#>  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
#>  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
#>  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
#>  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
#>  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
#>  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
#>  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

For the Research column, it is more appropriate to change it into a categorical (factor) type, since it contains 1 if the applicant has done research and 0 if the applicant hasn’t done research yet.

library(dplyr)
graduate <- graduate %>% 
  mutate(Research = as.factor(Research))

str(graduate)
#> 'data.frame':    500 obs. of  9 variables:
#>  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
#>  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
#>  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
#>  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
#>  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
#>  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
#>  $ Research         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
#>  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

Our dataset columns are now in the proper types.

2.2.2 Missing Value

# check missing value
colSums(is.na(graduate))
#>        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
#>                 0                 0                 0                 0 
#>               SOP               LOR              CGPA          Research 
#>                 0                 0                 0                 0 
#>   Chance.of.Admit 
#>                 0

Good! Our data set has no missing values. We can proceed to the next step.

2.2.3 Redundant Information

In this section, we’re going to drop the columns that we’re not going to use. Our data has a Serial.No. column, which is only a row identifier, so we drop it.

# drop Serial.No. column
graduate_clean <-
  graduate %>%
  select(-Serial.No.)
head(graduate_clean)

3 Data Exploration

Now that we have clean data, we’re going to explore it. We can observe our data in general by using summary(), and then look at the correlation between columns.

3.1 Data Summary

# data summary
summary(graduate_clean)
#>    GRE.Score      TOEFL.Score    University.Rating      SOP       
#>  Min.   :290.0   Min.   : 92.0   Min.   :1.000     Min.   :1.000  
#>  1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000     1st Qu.:2.500  
#>  Median :317.0   Median :107.0   Median :3.000     Median :3.500  
#>  Mean   :316.5   Mean   :107.2   Mean   :3.114     Mean   :3.374  
#>  3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000     3rd Qu.:4.000  
#>  Max.   :340.0   Max.   :120.0   Max.   :5.000     Max.   :5.000  
#>       LOR             CGPA       Research Chance.of.Admit 
#>  Min.   :1.000   Min.   :6.800   0:220    Min.   :0.3400  
#>  1st Qu.:3.000   1st Qu.:8.127   1:280    1st Qu.:0.6300  
#>  Median :3.500   Median :8.560            Median :0.7200  
#>  Mean   :3.484   Mean   :8.576            Mean   :0.7217  
#>  3rd Qu.:4.000   3rd Qu.:9.040            3rd Qu.:0.8200  
#>  Max.   :5.000   Max.   :9.920            Max.   :0.9700

Insight:

  • From 500 observations, the average chance of admit is around 0.72.
  • 280 applicants (56%) have done research, and 220 (44%) have no research experience.

From the boxplot we can see that there are outliers in the LOR and Chance of Admit columns.
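
The boxplot itself is not reproduced here, but the 1.5 × IQR rule that boxplots commonly use can be checked against the quartiles printed in the summary. The sketch below is only illustrative: the helper name iqr_fences is hypothetical, and the quartiles are copied from the summary output (boxplot() works with hinges, which can differ slightly from these quartiles).

```r
# Hypothetical helper: lower and upper fences of the 1.5 * IQR rule
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles copied from summary(graduate_clean)
fences_lor    <- iqr_fences(q1 = 3.00, q3 = 4.00)
fences_chance <- iqr_fences(q1 = 0.63, q3 = 0.82)

# Minimum values copied from the same summary
min_lor    <- 1.00
min_chance <- 0.34

# TRUE means the minimum lies below the lower fence (flagged as an outlier)
c(LOR = min_lor < fences_lor[["lower"]],
  Chance.of.Admit = min_chance < fences_chance[["lower"]])
```

With the real data loaded, boxplot(graduate_clean$LOR) and boxplot(graduate_clean$Chance.of.Admit) would draw the plots themselves.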

3.2 Variable Correlation

library(GGally)
ggcorr(graduate_clean, cex = 4, label = TRUE)

Assumption:

  • Every column has a positive relation with the target variable.
  • CGPA has the strongest relation with the target variable.
  • Since the Research column is not numeric, we can’t see its correlation with our target variable (Chance.of.Admit) here.
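
One way to still get a correlation for a two-level factor such as Research is to convert it back to 0/1 and compute Pearson’s r (the point-biserial correlation). This is a minimal sketch on toy vectors, not on the admission data, and it is not part of the original analysis.

```r
# Toy data (not from the admission file): a 0/1 factor and a numeric target
research <- factor(c(0, 0, 1, 1, 0, 1, 1, 0))
chance   <- c(0.45, 0.50, 0.80, 0.76, 0.62, 0.90, 0.71, 0.55)

# Convert the factor labels back to 0/1 before computing Pearson's r
research_num <- as.numeric(as.character(research))
r_pb <- cor(research_num, chance)
r_pb
```

With the real data, the same idea would be cor(as.numeric(as.character(graduate_clean$Research)), graduate_clean$Chance.of.Admit).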

4 Modelling

To answer our main objective, we’ll build a regression model.

4.1 Cross Validation

Before making our model, we have to split our data into train and test data. We will create our model with train data, and we’ll check model performance with test data. We’re going to split our data into 80/20 proportion (80% of the data as train, and 20% will be treated as test).

RNGkind(sample.kind = "Rounding")
set.seed(100)
library(rsample)

# indexing the data
index_graduate <- initial_split(data = graduate_clean, prop = 0.8, strata = "Chance.of.Admit")

# splitting data
train <- training(index_graduate)
test <- testing(index_graduate)

4.2 Regression Model

First, let’s make a regression model with all columns as predictor variables, and one model that has no predictors at all.

# regression model with no predictor variable
model_none <- lm(formula = Chance.of.Admit ~ 1, data = train)
summary(model_none)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ 1, data = train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.3806 -0.0906 -0.0006  0.0994  0.2494 
#> 
#> Coefficients:
#>             Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept) 0.720602   0.007065     102 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1411 on 398 degrees of freedom

# regression model with all column as predictor variable
model_all <- lm(formula = Chance.of.Admit ~ ., data = train)
summary(model_all)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ ., data = train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.27069 -0.02389  0.01083  0.03394  0.16295 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)       -1.3463885  0.1204243 -11.180 < 0.0000000000000002 ***
#> GRE.Score          0.0023265  0.0005749   4.047            0.0000626 ***
#> TOEFL.Score        0.0022165  0.0009997   2.217               0.0272 *  
#> University.Rating  0.0068044  0.0043921   1.549               0.1221    
#> SOP                0.0006159  0.0054084   0.114               0.9094    
#> LOR                0.0200814  0.0046805   4.290            0.0000225 ***
#> CGPA               0.1155127  0.0112072  10.307 < 0.0000000000000002 ***
#> Research1          0.0185453  0.0075253   2.464               0.0142 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06173 on 391 degrees of freedom
#> Multiple R-squared:  0.812,  Adjusted R-squared:  0.8086 
#> F-statistic: 241.3 on 7 and 391 DF,  p-value: < 0.00000000000000022

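From this base model, we get the following information:

  • GRE.Score, LOR, and CGPA are highly significant (p < 0.001). TOEFL.Score and Research are significant at the 5% level, while University.Rating and SOP are not.
  • Adjusted R squared for the base model is 0.8086.
  • Holding the other predictors constant, an applicant with research experience has a predicted admission chance about 0.0185 higher than one without.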

4.3 Feature Selection

We’re going to improve our base model using step-wise selection, which eliminates predictors from or adds predictors to a model. The step-wise method can run in 3 directions: backward, forward, and both. We’ll try all 3 and choose our best model.

4.3.1 Backward Model

model_backward <- step(object = model_all, direction = "backward")
#> Start:  AIC=-2214.5
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     SOP + LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> - SOP                1   0.00005 1.4900 -2216.5
#> <none>                           1.4900 -2214.5
#> - University.Rating  1   0.00915 1.4991 -2214.1
#> - TOEFL.Score        1   0.01873 1.5087 -2211.5
#> - Research           1   0.02314 1.5131 -2210.3
#> - GRE.Score          1   0.06240 1.5523 -2200.1
#> - LOR                1   0.07015 1.5601 -2198.1
#> - CGPA               1   0.40482 1.8948 -2120.6
#> 
#> Step:  AIC=-2216.48
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> <none>                           1.4900 -2216.5
#> - University.Rating  1   0.01118 1.5012 -2215.5
#> - TOEFL.Score        1   0.01892 1.5089 -2213.4
#> - Research           1   0.02318 1.5132 -2212.3
#> - GRE.Score          1   0.06235 1.5523 -2202.1
#> - LOR                1   0.07755 1.5675 -2198.2
#> - CGPA               1   0.43108 1.9211 -2117.1
summary(model_backward)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     LOR + CGPA + Research, data = train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.27051 -0.02375  0.01121  0.03409  0.16309 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)       -1.3483036  0.1190943 -11.321 < 0.0000000000000002 ***
#> GRE.Score          0.0023251  0.0005741   4.050           0.00006169 ***
#> TOEFL.Score        0.0022235  0.0009966   2.231               0.0262 *  
#> University.Rating  0.0069893  0.0040760   1.715               0.0872 .  
#> LOR                0.0202338  0.0044795   4.517           0.00000831 ***
#> CGPA               0.1158148  0.0108751  10.650 < 0.0000000000000002 ***
#> Research1          0.0185593  0.0075148   2.470               0.0139 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06165 on 392 degrees of freedom
#> Multiple R-squared:  0.812,  Adjusted R-squared:  0.8091 
#> F-statistic: 282.2 on 6 and 392 DF,  p-value: < 0.00000000000000022
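
The AIC values in the step() trace above can be reproduced by hand. For a linear model, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below first checks that formula on the built-in mtcars data, then plugs in the rounded RSS from the trace. The helper name step_aic is hypothetical.

```r
# AIC as reported by step() for a linear model: n * log(RSS / n) + 2 * edf
step_aic <- function(rss, n, edf) n * log(rss / n) + 2 * edf

# Check the formula against extractAIC() on a built-in data set (mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)
by_hand  <- step_aic(rss = sum(residuals(fit)^2),
                     n = nrow(mtcars),
                     edf = length(coef(fit)))
built_in <- extractAIC(fit)[2]

# Same formula with the rounded values printed in the trace:
# 399 training rows, RSS = 1.4900, 8 coefficients (full model) or 7 (without SOP)
aic_full   <- step_aic(rss = 1.4900, n = 399, edf = 8)
aic_no_sop <- step_aic(rss = 1.4900, n = 399, edf = 7)
c(by_hand = by_hand, built_in = built_in, full = aic_full, no_sop = aic_no_sop)
```

Dropping SOP barely changes the RSS but saves one coefficient, so the AIC falls by almost exactly 2. That is why SOP is the only variable removed.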

4.3.2 Forward Model

model_forward <- step(object = model_none, 
                      direction = "forward", 
                      scope = list(lower = model_none, upper = model_all),
                      trace = 0)
summary(model_forward)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research + 
#>     TOEFL.Score + University.Rating, data = train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.27051 -0.02375  0.01121  0.03409  0.16309 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)       -1.3483036  0.1190943 -11.321 < 0.0000000000000002 ***
#> CGPA               0.1158148  0.0108751  10.650 < 0.0000000000000002 ***
#> GRE.Score          0.0023251  0.0005741   4.050           0.00006169 ***
#> LOR                0.0202338  0.0044795   4.517           0.00000831 ***
#> Research1          0.0185593  0.0075148   2.470               0.0139 *  
#> TOEFL.Score        0.0022235  0.0009966   2.231               0.0262 *  
#> University.Rating  0.0069893  0.0040760   1.715               0.0872 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06165 on 392 degrees of freedom
#> Multiple R-squared:  0.812,  Adjusted R-squared:  0.8091 
#> F-statistic: 282.2 on 6 and 392 DF,  p-value: < 0.00000000000000022

4.3.3 Both Direction Model

model_both <- step(object = model_all, direction = "both")
#> Start:  AIC=-2214.5
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     SOP + LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> - SOP                1   0.00005 1.4900 -2216.5
#> <none>                           1.4900 -2214.5
#> - University.Rating  1   0.00915 1.4991 -2214.1
#> - TOEFL.Score        1   0.01873 1.5087 -2211.5
#> - Research           1   0.02314 1.5131 -2210.3
#> - GRE.Score          1   0.06240 1.5523 -2200.1
#> - LOR                1   0.07015 1.5601 -2198.1
#> - CGPA               1   0.40482 1.8948 -2120.6
#> 
#> Step:  AIC=-2216.48
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> <none>                           1.4900 -2216.5
#> - University.Rating  1   0.01118 1.5012 -2215.5
#> + SOP                1   0.00005 1.4900 -2214.5
#> - TOEFL.Score        1   0.01892 1.5089 -2213.4
#> - Research           1   0.02318 1.5132 -2212.3
#> - GRE.Score          1   0.06235 1.5523 -2202.1
#> - LOR                1   0.07755 1.5675 -2198.2
#> - CGPA               1   0.43108 1.9211 -2117.1
summary(model_both)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     LOR + CGPA + Research, data = train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.27051 -0.02375  0.01121  0.03409  0.16309 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)       -1.3483036  0.1190943 -11.321 < 0.0000000000000002 ***
#> GRE.Score          0.0023251  0.0005741   4.050           0.00006169 ***
#> TOEFL.Score        0.0022235  0.0009966   2.231               0.0262 *  
#> University.Rating  0.0069893  0.0040760   1.715               0.0872 .  
#> LOR                0.0202338  0.0044795   4.517           0.00000831 ***
#> CGPA               0.1158148  0.0108751  10.650 < 0.0000000000000002 ***
#> Research1          0.0185593  0.0075148   2.470               0.0139 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06165 on 392 degrees of freedom
#> Multiple R-squared:  0.812,  Adjusted R-squared:  0.8091 
#> F-statistic: 282.2 on 6 and 392 DF,  p-value: < 0.00000000000000022

4.3.4 Model Comparison

After making 3 regression models, let’s summarize them and choose which model is the best.

library(performance)
compare_performance(model_backward, model_forward, model_both)

All three models have the same AIC and adjusted R squared, which indicates that they are the same model: step-wise selection arrived at the same six predictors in every direction. For further analysis we’re going to use model_both.

Notes: Since every model has the same AIC and adjusted R squared, we can use whichever model we prefer.

The chosen model has an adjusted R squared of 0.809 (80.9%). This means that our regression model can explain about 80.9% of the variance in our target variable (admission chance); the rest is left to variables that are not in the model.
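
Adjusted R squared penalizes R squared for the number of predictors, which is why it is the fairer number when models of different sizes are compared. As a rough check, the reported value can be recomputed from the rounded R squared in the summary output. The helper name adj_r2 is hypothetical.

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Rounded values from the model summary: R2 = 0.812, 399 rows, 6 predictors
adj_r2(r2 = 0.812, n = 399, p = 6)  # close to the reported 0.8091

# Cross-check the formula against summary() on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)
all.equal(adj_r2(s$r.squared, n = nrow(mtcars), p = 2), s$adj.r.squared)
```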

5 Model Evaluation

There are several ways to evaluate our model.

5.1 Assumptions

In a regression model, several assumptions must be fulfilled before we can conclude that our model is a “good regression model”.

5.1.1 Linearity

In the linearity test, the target variable must have a linear relation with each predictor variable. In our case, we use 6 predictor variables, and we have to check the relation of each predictor variable with the target variable.

model_both$call
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     LOR + CGPA + Research, data = train)
ggcorr(train, cex = 3, label = TRUE)

Every numeric predictor we use in our model has a strong relation with our target variable. Let’s test the correlation for University.Rating, whose correlation value is about 0.7.

cor.test(train$University.Rating, train$Chance.of.Admit)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  train$University.Rating and train$Chance.of.Admit
#> t = 18.795, df = 397, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.6304810 0.7348535
#> sample estimates:
#>       cor 
#> 0.6861828

The p-value from cor.test is less than 5%, which indicates a significant linear correlation between University.Rating and Chance.of.Admit.
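
To see where the test statistic comes from, the t value can be recomputed from the printed correlation and degrees of freedom using t = r * sqrt(df) / sqrt(1 - r^2). This is a small arithmetic sketch using only numbers copied from the cor.test output.

```r
# t statistic of Pearson's r: t = r * sqrt(df) / sqrt(1 - r^2), with df = n - 2
r  <- 0.6861828   # sample correlation printed by cor.test()
df <- 397         # degrees of freedom printed by cor.test()

t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
c(t = t_stat, p = p_value)
```

A significant correlation shows a linear association, but it does not rule out curvature. A residuals versus fitted plot, for example plot(model_both, which = 1), is a common complementary check.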

5.1.2 Normality of Residuals Test

In the normality of residuals test, we want our residuals (errors) to be normally distributed. For this test, we can use shapiro.test().

shapiro.test(model_both$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_both$residuals
#> W = 0.92285, p-value = 0.0000000000001865
check_normality(model_both)
#> Warning: Non-normality of residuals detected (p < .001).

The p-value for the normality test is 1.865e-13, which is very small (< 5%). Since our null hypothesis is that the residuals are normally distributed, and our p-value is < 5%, we reject the null hypothesis (the residuals are not normally distributed).

Recommendation:

  • Transforming our data
  • Add more sample data
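
As one possible transformation (an assumption on our side, not something tested in this report), the target is a proportion strictly between 0 and 1, so a logit transform is a natural candidate. The sketch below only shows the transform and its inverse on toy values; whether it actually improves residual normality would still need to be checked with shapiro.test().

```r
# Toy values inside the observed range of Chance.of.Admit (0.34 to 0.97)
chance <- c(0.34, 0.50, 0.72, 0.90, 0.97)

# Logit transform maps (0, 1) onto the whole real line
chance_logit <- qlogis(chance)       # log(p / (1 - p))

# Inverse transform brings values back to the 0-1 scale
chance_back <- plogis(chance_logit)  # 1 / (1 + exp(-x))
all.equal(chance, chance_back)
```

With the real data, a model on the transformed target could be fitted as lm(qlogis(Chance.of.Admit) ~ ., data = train), and its predictions converted back with plogis().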

5.1.3 Homoscedasticity of Residuals

For homoscedasticity of residuals, we want our residuals to have constant variance, with no specific pattern.

library(lmtest)
bptest(model_both)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_both
#> BP = 23.162, df = 6, p-value = 0.0007439
check_heteroscedasticity(model_both)
#> Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

From Breusch-Pagan test, we got p-value 0.0007439 < 5%. We can conclude that heteroscedasticity is present in our model (non-constant pattern).
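
For intuition, the studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the same predictors, then multiply the R squared of that auxiliary regression by the number of rows. The result is compared with a chi-squared distribution whose degrees of freedom equal the number of predictors. The sketch uses the built-in mtcars data rather than the admission data.

```r
# Studentized (Koenker) Breusch-Pagan statistic computed by hand on mtcars
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
sq_res <- residuals(fit)^2
aux    <- lm(sq_res ~ wt + hp, data = mtcars)

bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1   # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value means the spread of the residuals changes with the predictors, which is the situation reported for model_both.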

5.1.4 Multicollinearity

library(car)

vif(model_both)
#>         GRE.Score       TOEFL.Score University.Rating               LOR 
#>          4.197792          3.826486          2.279753          1.890667 
#>              CGPA          Research 
#>          4.423021          1.459647

Every predictor variable has VIF < 10, so we can conclude there is no multicollinearity in our model.
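
The VIF of a numeric predictor is 1 / (1 - R2_j), where R2_j is the R squared obtained by regressing that predictor on all the other predictors. A minimal sketch on the built-in mtcars data follows; the helper name vif_by_hand is hypothetical and it only handles numeric predictors.

```r
# VIF of predictor j = 1 / (1 - R2_j), where R2_j comes from regressing
# predictor j on all the other predictors
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(mtcars, c("wt", "hp", "disp"))
```

A VIF near 1 means the predictor is almost unrelated to the others. The larger the value, the more of its variation is already explained by the other predictors.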

5.1.5 Summary

We have run the assumption tests for our model. To sum up:

  • Pass linearity test
  • Fail normality of residuals
  • Fail homoscedasticity
  • Pass multicollinearity test (no multicollinearity detected)

Here are recommendations to improve our model:

  • Scale or transform our data first.
  • Add more sample data into our data set.

5.2 Error Value

There are several error metrics we can use to evaluate our model; in this evaluation we’re going to use RMSE.

5.2.1 Prediction

Let’s make predictions on the test data using predict().

pred <- predict(model_both, newdata = test)
head(pred)
#>         1         3         4        11        12        16 
#> 0.9527978 0.6545252 0.7391786 0.7363406 0.8387032 0.6480498

The output of predict() is the predicted value of our target variable (chance of admit).
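
To make clear what predict() does for a linear model, the first prediction can be rebuilt by hand from the coefficients printed in summary(model_both) and the first applicant’s values shown in the str() output. The vector names below are only labels chosen for this sketch.

```r
# Coefficients copied from summary(model_both)
coefs <- c(intercept = -1.3483036, gre = 0.0023251, toefl = 0.0022235,
           rating = 0.0069893, lor = 0.0202338, cgpa = 0.1158148,
           research1 = 0.0185593)

# First applicant, as shown in the str() output (Research = 1)
applicant <- c(intercept = 1, gre = 337, toefl = 118, rating = 4,
               lor = 4.5, cgpa = 9.65, research1 = 1)

# Linear predictor: intercept + sum of coefficient * value
pred_by_hand <- sum(coefs * applicant)
pred_by_hand  # about 0.9528, in line with the first value of head(pred)

# The same applicant without research experience
pred_by_hand - coefs[["research1"]]
```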

5.2.2 RMSE

For model error evaluation, we will use RMSE. We will evaluate RMSE on the predictions for the test data and on the fitted values for the train data.

# RMSE using test data
library(MLmetrics)
RMSE(y_pred = pred, y_true = test$Chance.of.Admit)
#> [1] 0.05337702

Let’s check RMSE on the train data.

# RMSE using train data
RMSE(y_pred = model_both$fitted.values, y_true = train$Chance.of.Admit)
#> [1] 0.06110918

RMSE using test data is slightly lower than RMSE using train data (0.0534 < 0.0611), so there is no sign of overfitting.
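
RMSE is the square root of the mean squared difference between the true and the predicted values. A base R version, shown here on toy vectors rather than the admission data, could look like this. The function name rmse is our own, not the MLmetrics one.

```r
# RMSE: square root of the mean squared difference between truth and prediction
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

# Toy vectors (not from the admission data)
y_true <- c(0.92, 0.72, 0.80, 0.45)
y_pred <- c(0.95, 0.65, 0.74, 0.50)
rmse(y_true, y_pred)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.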

Is our error big or small? To answer this, we have to compare our error value with our data range or compare it with our data distribution.

# check original data range for chance of admit
summary(graduate_clean$Chance.of.Admit)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.3400  0.6300  0.7200  0.7217  0.8200  0.9700

In our original data, the minimum chance of admit is 0.34 and the maximum is 0.97. The larger of our two RMSE values is 0.0611, which is small compared to this range (about 10% of it).
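
The comparison can be made explicit with a little arithmetic. The sketch below uses only numbers copied from the outputs in this report, and additionally compares the train RMSE with the residual standard error of model_none (the intercept-only model) as a simple baseline.

```r
# Values copied from the outputs in this report
rmse_test   <- 0.05337702
rmse_train  <- 0.06110918
y_min       <- 0.34
y_max       <- 0.97
baseline_se <- 0.1411   # residual standard error of model_none (intercept only)

# RMSE as a share of the target's range
share_of_range <- c(test = rmse_test, train = rmse_train) / (y_max - y_min)

# Train RMSE relative to the error of the intercept-only baseline
share_of_baseline <- rmse_train / baseline_se
round(c(share_of_range, baseline = share_of_baseline), 3)
```

Both RMSE values are roughly 8 to 10% of the range, and the train error is less than half of what the intercept-only model leaves unexplained.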

6 Conclusion

To sum up our analysis, we return to our main objective and answer the following questions:

  1. What factors significantly affect an applicant’s admission chance?
  • From our analysis we can conclude that GRE Score, TOEFL Score, Letter of Recommendation, CGPA, and Research Experience significantly affect an applicant’s admission chance at the 5% level. University Rating is kept in the final model by step-wise selection, but it is only significant at the 10% level (p = 0.0872).
  2. Are those significant variables good enough to describe an applicant’s chance of admit?
  • From our analysis, a regression model with the six predictors discussed in number 1 explains about 80.9% of the variance in admission chance (adjusted R squared), with an RMSE of 0.0534 on the testing data. However, our model still violates two linear regression assumptions (normality of residuals and homoscedasticity).

Here are recommendations that may improve our model:

  • Transform / scale the data.
  • Collect more sample data if possible.
  • Try other models and compare them.

7 Reference

  • The dataset is provided by Team Algoritma.