Linear Regression in Predicting the Factors That Affect the Chances of Admission of Prospective Students to Their Dream College


Linear regression is an algorithm that models the relationship between a dependent variable and one or more independent variables, and it can be used to make forecasts based on historical data. The main goal is to make the right decision based on the prediction results and the existing data. In this article, we will learn how to apply linear regression to predict the factors that affect the chances of admission of prospective students to their dream college.
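
Before working with the admission data, here is a minimal sketch of the idea in R, using made-up toy values (all numbers and names below are purely illustrative):

# Minimal sketch: fit a line to toy data and forecast a new point
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
toy_model <- lm(y ~ x)                  # model y as a linear function of x
predict(toy_model, data.frame(x = 6))   # prediction for an unseen value of x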

1. Load Data

data_admission<-read.csv("Admission_Predict.csv")
head(data_admission)
  Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
1          1       337         118                 4 4.5 4.5 9.65        1
2          2       324         107                 4 4.0 4.5 8.87        1
3          3       316         104                 3 3.0 3.5 8.00        1
4          4       322         110                 3 3.5 2.5 8.67        1
5          5       314         103                 2 2.0 3.0 8.21        0
6          6       330         115                 5 4.5 3.0 9.34        1
  Chance.of.Admit
1            0.92
2            0.76
3            0.72
4            0.80
5            0.65
6            0.90

The following is a description of each column:
1. Serial.No.: Registrant serial number.
2. GRE.Score: Graduate Record Examination (GRE) score.
3. TOEFL.Score: Test of English as a Foreign Language (TOEFL) score.
4. University.Rating: University rating (score 1-5).
5. SOP: Statement of Purpose strength (out of 5).
6. LOR: Letter of Recommendation strength (out of 5).
7. CGPA: Undergraduate GPA (out of 10).
8. Research: Research experience (either 0 or 1).
9. Chance.of.Admit: Chance of admission (ranging from 0 to 1).

2. Data Wrangling

Data wrangling is performed so that the data format matches the objectives of the analysis.

a. Check Data Types

library(dplyr)
glimpse(data_admission)
Rows: 400
Columns: 9
$ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
$ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
$ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
$ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
$ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
$ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
$ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
$ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
$ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~

Some columns still have the wrong data type, so we have to change them: University.Rating and Research are categorical, so we convert them to factors, and we drop Serial.No. because it is only a row identifier.

data_admission <- data_admission %>%
  mutate(University.Rating = as.factor(University.Rating),  # ordinal rating, treat as factor
         Research = as.factor(Research)) %>%                # binary category, treat as factor
  select(-1)                                                # drop the Serial.No. identifier
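
To verify the conversion, we can glimpse the data again; University.Rating and Research should now appear as <fct>, and the Serial.No. column should be gone:

glimpse(data_admission)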

b. Check Missing Values

anyNA(data_admission)
[1] FALSE

There are no missing values, so we continue to the next process.
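
Had anyNA() returned TRUE, a per-column count would show where the gaps are. A minimal sketch in base R:

colSums(is.na(data_admission))  # number of missing values per column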

c. Check Outliers

boxplot(data_admission)

3. Exploratory Data Analysis

Check data correlation

library(GGally)
ggcorr(data_admission, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The correlation plot shows that all variables have a positive correlation with Chance.of.Admit, and CGPA has the highest positive correlation compared to the other factors.
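
To read the exact numbers rather than the plot, we can compute the correlation of each numeric column with the target. A small sketch (the factor columns are excluded because cor() only accepts numeric input):

num_cols <- sapply(data_admission, is.numeric)        # keep only numeric columns
cor(data_admission[, num_cols])["Chance.of.Admit", ]  # correlation with the target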

Check the data distribution

boxplot(data_admission)

Based on the boxplot visualization, a few outliers were found in the LOR, CGPA, and Chance.of.Admit columns. Because there are only a few of them, the data can still be tolerated and analyzed further.
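
To quantify "a few", we can count the points outside the standard 1.5 x IQR boxplot fences. A sketch over the numeric columns:

# Count observations beyond the boxplot whiskers for each numeric column
sapply(data_admission[, sapply(data_admission, is.numeric)], function(x) {
  q <- quantile(x, c(0.25, 0.75))                          # first and third quartiles
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))   # points outside the fences
})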

4. Modelling

Before we build a linear regression model, the first step is to split the data in two: the train data and the test data. The train data is used to train the algorithm, while the test data is used to find out how well the previously trained algorithm performs.

set.seed(1000)  # make the random split reproducible
samplesize <- round(0.8 * nrow(data_admission), 0)                  # 80% of the rows
index <- sample(seq_len(nrow(data_admission)), size = samplesize)   # random row indices

train_admission <- data_admission[index, ]   # 80% training data
test_admission <- data_admission[-index, ]   # remaining 20% test data
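
A quick sanity check on the split sizes (with 400 rows, we expect 320 train and 80 test):

nrow(train_admission)
nrow(test_admission)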

Now we will build the linear regression model using Chance.of.Admit as the target variable.

# Linear Regression Model
set.seed(1000)
Model.admission <- lm(Chance.of.Admit ~ ., data = train_admission)
summary(Model.admission)

Call:
lm(formula = Chance.of.Admit ~ ., data = train_admission)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.267054 -0.022439  0.008427  0.034591  0.151357 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -1.4265321  0.1419140 -10.052  < 2e-16 ***
GRE.Score           0.0023336  0.0006581   3.546 0.000452 ***
TOEFL.Score         0.0033881  0.0012064   2.808 0.005297 ** 
University.Rating2 -0.0118163  0.0168387  -0.702 0.483374    
University.Rating3 -0.0087542  0.0182096  -0.481 0.631037    
University.Rating4 -0.0137625  0.0217701  -0.632 0.527739    
University.Rating5  0.0041168  0.0241594   0.170 0.864807    
SOP                -0.0065361  0.0060452  -1.081 0.280451    
LOR                 0.0259379  0.0060116   4.315 2.15e-05 ***
CGPA                0.1136913  0.0135063   8.418 1.45e-15 ***
Research1           0.0210593  0.0088155   2.389 0.017499 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06301 on 309 degrees of freedom
Multiple R-squared:  0.8126,    Adjusted R-squared:  0.8065 
F-statistic:   134 on 10 and 309 DF,  p-value: < 2.2e-16

Insights from the model summary:

  1. The number of coefficients is greater than the number of predictor variables. This happens because the two categorical variables, University.Rating and Research, are automatically converted into dummy variables; for example, the Research variable becomes the dummy variable Research1.
  2. There are five significant predictor variables: GRE.Score, TOEFL.Score, LOR, CGPA, and Research1.
  3. Some of the coefficients can be interpreted as follows:
  • SOP has a negative slope/coefficient and is statistically insignificant (p-value higher than 0.05).
  • For every 1-unit increase in the Graduate Record Examination (GRE) score, the chance of a prospective student being accepted increases by about 0.0023, holding the other predictors constant.
  • For every 1-unit decrease in the Test of English as a Foreign Language (TOEFL) score, the chance of acceptance decreases by about 0.0034, holding the other predictors constant.


5. Model Improvement

In the summary of the previous model, there are several predictor variables that do not significantly affect the model. Therefore, feature selection is carried out to remove the insignificant variables. There are three feature-selection methods: backward elimination, forward selection, and stepwise selection. All three methods add or remove variables based on the AIC (Akaike Information Criterion).
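
For reference, the AIC trades goodness of fit against model complexity, and a lower AIC is better: \[ AIC = 2k - 2\ln(L) \] where k is the number of estimated parameters and L is the maximized likelihood of the model.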

a. Feature selection

# Backward Elimination
model.backward <- step(object = Model.admission, direction = "backward", trace = TRUE)
Start:  AIC=-1758.45
Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    SOP + LOR + CGPA + Research

                    Df Sum of Sq    RSS     AIC
- University.Rating  4  0.011058 1.2378 -1763.6
- SOP                1  0.004641 1.2314 -1759.2
<none>                           1.2268 -1758.5
- Research           1  0.022657 1.2494 -1754.6
- TOEFL.Score        1  0.031312 1.2581 -1752.4
- GRE.Score          1  0.049922 1.2767 -1747.7
- LOR                1  0.073909 1.3007 -1741.7
- CGPA               1  0.281316 1.5081 -1694.4

Step:  AIC=-1763.58
Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP + LOR + CGPA + 
    Research

              Df Sum of Sq    RSS     AIC
- SOP          1  0.004520 1.2424 -1764.4
<none>                     1.2378 -1763.6
- Research     1  0.022242 1.2601 -1759.9
- TOEFL.Score  1  0.031952 1.2698 -1757.4
- GRE.Score    1  0.052117 1.2900 -1752.4
- LOR          1  0.076145 1.3140 -1746.5
- CGPA         1  0.302978 1.5408 -1695.5

Step:  AIC=-1764.42
Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research

              Df Sum of Sq    RSS     AIC
<none>                     1.2424 -1764.4
- Research     1  0.020560 1.2629 -1761.2
- TOEFL.Score  1  0.028639 1.2710 -1759.1
- GRE.Score    1  0.054826 1.2972 -1752.6
- LOR          1  0.075221 1.3176 -1747.6
- CGPA         1  0.300658 1.5430 -1697.1
# Forward Selection
Model.admission.none <- lm(Chance.of.Admit ~ 1, data = train_admission)  # intercept-only model
model.forward <- step(object = Model.admission.none, direction = "forward",
                      scope = list(lower = Model.admission.none, upper = Model.admission))
Start:  AIC=-1242.62
Chance.of.Admit ~ 1

                    Df Sum of Sq    RSS     AIC
+ CGPA               1    5.0085 1.5378 -1704.2
+ GRE.Score          1    4.3258 2.2204 -1586.6
+ TOEFL.Score        1    4.1820 2.3642 -1566.5
+ University.Rating  4    3.2471 3.2991 -1453.9
+ LOR                1    2.9227 3.6235 -1429.9
+ SOP                1    2.7967 3.7496 -1418.9
+ Research           1    1.9254 4.6208 -1352.1
<none>                           6.5462 -1242.6

Step:  AIC=-1704.15
Chance.of.Admit ~ CGPA

                    Df Sum of Sq    RSS     AIC
+ GRE.Score          1  0.164873 1.3729 -1738.4
+ TOEFL.Score        1  0.112203 1.4256 -1726.4
+ Research           1  0.079162 1.4586 -1719.1
+ LOR                1  0.078991 1.4588 -1719.0
+ SOP                1  0.013663 1.5241 -1705.0
+ University.Rating  4  0.039961 1.4978 -1704.6
<none>                           1.5378 -1704.2

Step:  AIC=-1738.44
Chance.of.Admit ~ CGPA + GRE.Score

                    Df Sum of Sq    RSS     AIC
+ LOR                1  0.082384 1.2905 -1756.2
+ TOEFL.Score        1  0.028627 1.3443 -1743.2
+ Research           1  0.025532 1.3474 -1742.5
+ SOP                1  0.011242 1.3617 -1739.1
<none>                           1.3729 -1738.4
+ University.Rating  4  0.023064 1.3498 -1735.9

Step:  AIC=-1756.25
Chance.of.Admit ~ CGPA + GRE.Score + LOR

                    Df Sum of Sq    RSS     AIC
+ TOEFL.Score        1 0.0276002 1.2629 -1761.2
+ Research           1 0.0195209 1.2710 -1759.1
<none>                           1.2905 -1756.2
+ SOP                1 0.0004878 1.2900 -1754.4
+ University.Rating  4 0.0108369 1.2797 -1751.0

Step:  AIC=-1761.16
Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score

                    Df Sum of Sq    RSS     AIC
+ Research           1 0.0205601 1.2424 -1764.4
<none>                           1.2629 -1761.2
+ SOP                1 0.0028381 1.2601 -1759.9
+ University.Rating  4 0.0101449 1.2528 -1755.8

Step:  AIC=-1764.42
Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + Research

                    Df Sum of Sq    RSS     AIC
<none>                           1.2424 -1764.4
+ SOP                1 0.0045196 1.2378 -1763.6
+ University.Rating  4 0.0109367 1.2314 -1759.2
# Stepwise Selection
model.stepwise <- step(object = Model.admission.none, direction = "both",
                       scope = list(lower = Model.admission.none, upper = Model.admission))
Start:  AIC=-1242.62
Chance.of.Admit ~ 1

                    Df Sum of Sq    RSS     AIC
+ CGPA               1    5.0085 1.5378 -1704.2
+ GRE.Score          1    4.3258 2.2204 -1586.6
+ TOEFL.Score        1    4.1820 2.3642 -1566.5
+ University.Rating  4    3.2471 3.2991 -1453.9
+ LOR                1    2.9227 3.6235 -1429.9
+ SOP                1    2.7967 3.7496 -1418.9
+ Research           1    1.9254 4.6208 -1352.1
<none>                           6.5462 -1242.6

Step:  AIC=-1704.15
Chance.of.Admit ~ CGPA

                    Df Sum of Sq    RSS     AIC
+ GRE.Score          1    0.1649 1.3729 -1738.4
+ TOEFL.Score        1    0.1122 1.4256 -1726.4
+ Research           1    0.0792 1.4586 -1719.1
+ LOR                1    0.0790 1.4588 -1719.0
+ SOP                1    0.0137 1.5241 -1705.0
+ University.Rating  4    0.0400 1.4978 -1704.6
<none>                           1.5378 -1704.2
- CGPA               1    5.0085 6.5462 -1242.6

Step:  AIC=-1738.44
Chance.of.Admit ~ CGPA + GRE.Score

                    Df Sum of Sq    RSS     AIC
+ LOR                1   0.08238 1.2905 -1756.2
+ TOEFL.Score        1   0.02863 1.3443 -1743.2
+ Research           1   0.02553 1.3474 -1742.5
+ SOP                1   0.01124 1.3617 -1739.1
<none>                           1.3729 -1738.4
+ University.Rating  4   0.02306 1.3498 -1735.9
- GRE.Score          1   0.16487 1.5378 -1704.2
- CGPA               1   0.84750 2.2204 -1586.6

Step:  AIC=-1756.25
Chance.of.Admit ~ CGPA + GRE.Score + LOR

                    Df Sum of Sq    RSS     AIC
+ TOEFL.Score        1   0.02760 1.2629 -1761.2
+ Research           1   0.01952 1.2710 -1759.1
<none>                           1.2905 -1756.2
+ SOP                1   0.00049 1.2900 -1754.4
+ University.Rating  4   0.01084 1.2797 -1751.0
- LOR                1   0.08238 1.3729 -1738.4
- GRE.Score          1   0.16827 1.4588 -1719.0
- CGPA               1   0.46835 1.7589 -1659.2

Step:  AIC=-1761.16
Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score

                    Df Sum of Sq    RSS     AIC
+ Research           1  0.020560 1.2424 -1764.4
<none>                           1.2629 -1761.2
+ SOP                1  0.002838 1.2601 -1759.9
- TOEFL.Score        1  0.027600 1.2905 -1756.2
+ University.Rating  4  0.010145 1.2528 -1755.8
- LOR                1  0.081357 1.3443 -1743.2
- GRE.Score          1  0.084207 1.3471 -1742.5
- CGPA               1  0.307246 1.5702 -1693.5

Step:  AIC=-1764.42
Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + Research

                    Df Sum of Sq    RSS     AIC
<none>                           1.2424 -1764.4
+ SOP                1  0.004520 1.2378 -1763.6
- Research           1  0.020560 1.2629 -1761.2
+ University.Rating  4  0.010937 1.2314 -1759.2
- TOEFL.Score        1  0.028639 1.2710 -1759.1
- GRE.Score          1  0.054826 1.2972 -1752.6
- LOR                1  0.075221 1.3176 -1747.6
- CGPA               1  0.300658 1.5430 -1697.1

b. Comparison of adjusted R-squared values

summary(model.backward)$adj.r.squared
[1] 0.8071952
summary(model.forward)$adj.r.squared
[1] 0.8071952
summary(model.stepwise)$adj.r.squared
[1] 0.8071952
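
All three methods ended at the same formula (Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + Research), so identical adjusted R-squared values are expected. As a cross-check, their AIC values should also match (note that AIC() differs from the extractAIC() values printed by step() by an additive constant):

AIC(model.backward)
AIC(model.forward)
AIC(model.stepwise)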

c. Tuning the new model

The best model based on the largest adjusted R-squared is model.backward, which we now rebuild explicitly as the final model.

final.model <- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research,
                  data = train_admission)
summary(final.model)

Call:
lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
    CGPA + Research, data = train_admission)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27060 -0.02341  0.00944  0.03709  0.15021 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.439059   0.130429 -11.033  < 2e-16 ***
GRE.Score    0.002420   0.000650   3.722 0.000234 ***
TOEFL.Score  0.003154   0.001172   2.690 0.007518 ** 
LOR          0.023203   0.005321   4.360 1.76e-05 ***
CGPA         0.112553   0.012912   8.717  < 2e-16 ***
Research1    0.019911   0.008735   2.280 0.023304 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0629 on 314 degrees of freedom
Multiple R-squared:  0.8102,    Adjusted R-squared:  0.8072 
F-statistic: 268.1 on 5 and 314 DF,  p-value: < 2.2e-16

6. Model Evaluation

a. Model prediction

library(dplyr)
# Predict the chance of admission for the unseen test data
predict.admission <- predict(object = final.model,
                             newdata = test_admission %>% select(-Chance.of.Admit))
head(predict.admission)
        1        11        17        34        38        43 
0.9589020 0.7397319 0.7141813 0.9363171 0.5422415 0.6787007 

b. Model Performance

The performance of our model (how well our model predicts the target variable) can be calculated using the root mean squared error: \[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \] RMSE is the square-root form of MSE. Because of the square root, its interpretation is on roughly the same scale as MAE. RMSE is preferable when we are more concerned about very large errors.

#RMSE Of Test Data
library(MLmetrics)
RMSE(y_pred = predict.admission,y_true = test_admission$Chance.of.Admit)
[1] 0.06760164
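
As a cross-check, the same value can be computed by hand, directly from the formula above:

sqrt(mean((predict.admission - test_admission$Chance.of.Admit)^2))  # equals the RMSE above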

Based on the RMSE value above, it can be concluded that the prediction accuracy is very good: the target ranges from 0 to 1, while the typical prediction error is only about 0.07.

Computing MAE on the test data:
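
For reference, MAE is simply the average absolute error: \[ MAE = \frac{1}{n} \sum |\hat y - y| \]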

MAE <- mean(abs(predict.admission - test_admission$Chance.of.Admit))  # mean absolute error
MAE
[1] 0.04774232

c. Assumptions

Linearity

library(ggplot2)  # also attached earlier via GGally

test.linearity <- data.frame(residual = final.model$residuals, fitted = final.model$fitted.values)

test.linearity %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
    geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())

There is little to no discernible pattern in the residual plot, so we can conclude that our model satisfies the linearity assumption.

Normality Test

hist(final.model$residuals)

shapiro.test(final.model$residuals)

    Shapiro-Wilk normality test

data:  final.model$residuals
W = 0.92383, p-value = 1.067e-11

The normality-of-residuals assumption is not met because the p-value < alpha (0.05). If the normality assumption is not met, the significance test results and the standard errors of the intercept and of each predictor's slope are biased, i.e. they do not reflect the true values. When the residuals are not normally distributed, we can apply a transformation to the target variable.
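
As one possible remedy (a sketch only, not part of the original analysis): since Chance.of.Admit is bounded between 0 and 1, a logit transformation of the target is a common choice. The model is refit on the transformed target and predictions are mapped back to the 0-1 scale.

# Sketch: refit on a logit-transformed target (assumes all values lie strictly between 0 and 1)
train_logit <- train_admission %>%
  mutate(logit.admit = log(Chance.of.Admit / (1 - Chance.of.Admit))) %>%
  select(-Chance.of.Admit)
model.logit <- lm(logit.admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research,
                  data = train_logit)
# Map predictions back to the original probability scale
pred.logit <- predict(model.logit, newdata = test_admission)
pred.prob <- exp(pred.logit) / (1 + exp(pred.logit))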

Multicollinearity

library(car)
vif(final.model)
  GRE.Score TOEFL.Score         LOR        CGPA    Research 
   4.385634    4.057766    1.827869    4.869063    1.518504 

All VIF values are well below 10 (the common rule-of-thumb threshold), so there is no multicollinearity among the predictors.