Greetings fellow readers! My name is Gissella Nadya and this is my Regression Model - Learning By Building Project for Algoritma Academy. Now, who doesn’t want to be accepted to a graduate school in the United States? Well, I, for one, would like to continue my education there! There are many things to prepare, and as data analysts we are going to explore a US graduate school admission dataset and find out which parameters are essential for being accepted as a graduate student in the US.
For this project, the dataset is provided through this link.
Let’s get started!
# import libs
library(tidyverse)
library(lubridate)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(plotly)
library(performance)

admission <- read_csv("Admission_Predict_Ver1.1.csv")

admission <- admission %>%
  janitor::clean_names()

head(admission, n = 5)

glimpse(admission)
#> Rows: 500
#> Columns: 9
#> $ serial_no <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
#> $ gre_score <dbl> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 3…
#> $ toefl_score <dbl> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 1…
#> $ university_rating <dbl> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, …
#> $ sop <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3…
#> $ lor <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4…
#> $ cgpa <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.0…
#> $ research <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
#> $ chance_of_admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.5…
- serial_no = Serial number
- gre_score = GRE score (out of 340)
- toefl_score = TOEFL score (out of 120)
- university_rating = The rating of the student’s previous (undergraduate) university (out of 5)
- sop = Statement of Purpose strength (out of 5)
- lor = Letter of Recommendation strength (out of 5)
- cgpa = Undergraduate Cumulative GPA (out of 10)
- research = Research experience (0 = no, 1 = yes)
- chance_of_admit = Chance of admission (ranging from 0 to 1)

For the sop and lor variables, the dataset owner does not give a thorough definition. Since US graduate schools require applicants to submit these documents, we are going to assume that the numbers from 1 to 5 rate the quality of the documents the graduate school received.
We are not going to use the serial number, as unique identifiers are not useful for building models.
admission <- admission %>%
  select(-serial_no)

admission <- admission %>%
  mutate(research = as.factor(research))

colSums(is.na(admission))
#> gre_score toefl_score university_rating sop
#> 0 0 0 0
#> lor cgpa research chance_of_admit
#> 0 0 0 0
ggcorr(admission,
       label = T,
       label_size = 4,
       hjust = 1,
       layout.exp = 1)

Our target (dependent) variable is chance_of_admit. Most of the predictors have a STRONG positive correlation with the target variable.
ggplot(gather(admission %>%
                select_if(is.numeric)), aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~key, scales = 'free_x')

Here we can see that the observations, i.e. the US graduate school candidates, come from different backgrounds and have different score profiles.
summary(admission)
#> gre_score toefl_score university_rating sop
#> Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
#> 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
#> Median :317.0 Median :107.0 Median :3.000 Median :3.500
#> Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
#> 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
#> lor cgpa research chance_of_admit
#> Min. :1.000 Min. :6.800 0:220 Min. :0.3400
#> 1st Qu.:3.000 1st Qu.:8.127 1:280 1st Qu.:0.6300
#> Median :3.500 Median :8.560 Median :0.7200
#> Mean :3.484 Mean :8.576 Mean :0.7217
#> 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:0.8200
#> Max. :5.000 Max. :9.920 Max. :0.9700
Previously, we saw that gre_score, toefl_score, and cgpa have the highest correlation with our target variable, chance_of_admit. Here we would like to visualize it.
plot_ly(data = admission,
        x = ~ gre_score,
        y = ~ toefl_score,
        z = ~ cgpa,
        color = ~ chance_of_admit,
        # type = "scatter3d",
        colors = c("black", "#BF3124", "#F27D52", "#F2C46D", "#800080"), # high to low
        hoverinfo = 'text',
        text = ~ paste("</br> GRE Score: ", gre_score,
                       "</br> TOEFL Score: ", toefl_score,
                       "</br> CGPA: ", cgpa,
                       "</br> Chance of Admit: ", chance_of_admit)
        # alpha = 2,
        ) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = "GRE Score"),
                      yaxis = list(title = "TOEFL Score"),
                      zaxis = list(title = "CGPA")))

RNGkind(sample.kind = "Rounding")
set.seed(159)
index <- sample(nrow(admission),
nrow(admission) * 0.8)
admission_train <- admission[index, ]
admission_test <- admission[-index, ]

admission_train

As we have seen in the correlation plot, the predictor with the strongest correlation to the target variable is cgpa, so we are going to start with a simple linear regression model.
coa_cgpa <- lm(formula = chance_of_admit ~ cgpa,
data = admission_train)
summary(coa_cgpa)
#>
#> Call:
#> lm(formula = chance_of_admit ~ cgpa, data = admission_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.278889 -0.028357 0.006111 0.040275 0.172544
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.011654 0.048002 -21.07 <0.0000000000000002 ***
#> cgpa 0.202389 0.005587 36.22 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06591 on 398 degrees of freedom
#> Multiple R-squared: 0.7673, Adjusted R-squared: 0.7667
#> F-statistic: 1312 on 1 and 398 DF, p-value: < 0.00000000000000022
\[ \widehat{\text{Chance of Admit}} = -1.011654 + 0.202389 \times \text{cgpa} \]
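As a quick illustration (a small sketch; the CGPA of 9.0 is just a hypothetical value, not an observation from the dataset), we can let predict() evaluate the fitted line:

# hypothetical applicant with an undergraduate CGPA of 9.0 (illustrative value only)
predict(coa_cgpa, newdata = data.frame(cgpa = 9.0))
# this simply evaluates -1.011654 + 0.202389 * 9.0, i.e. roughly 0.81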
cor.test(admission_train$cgpa,
admission_train$chance_of_admit)
#>
#> Pearson's product-moment correlation
#>
#> data: admission_train$cgpa and admission_train$chance_of_admit
#> t = 36.223, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.8509759 0.8969542
#> sample estimates:
#> cor
#> 0.8759396
p-value < 0.00000000000000022
p-value < alpha (0.05), so we reject H0 and accept H1:
cgpa has a significant linear correlation with chance_of_admit.
Baseline model: build the model using ALL predictors.
all_admission <- lm(formula = chance_of_admit ~ . , # target variable with ALL
data = admission_train)
summary(all_admission)
#>
#> Call:
#> lm(formula = chance_of_admit ~ ., data = admission_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.267256 -0.024179 0.007702 0.034469 0.152604
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3036161 0.1179768 -11.050 < 0.0000000000000002 ***
#> gre_score 0.0021416 0.0005560 3.852 0.000137 ***
#> toefl_score 0.0022874 0.0009484 2.412 0.016328 *
#> university_rating 0.0030528 0.0042301 0.722 0.470912
#> sop 0.0022870 0.0051065 0.448 0.654499
#> lor 0.0186909 0.0046137 4.051 0.0000615 ***
#> cgpa 0.1173084 0.0104464 11.230 < 0.0000000000000002 ***
#> research1 0.0265714 0.0071137 3.735 0.000215 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.05898 on 392 degrees of freedom
#> Multiple R-squared: 0.8164, Adjusted R-squared: 0.8132
#> F-statistic: 249.1 on 7 and 392 DF, p-value: < 0.00000000000000022
Insights:

- Every predictor has a positive coefficient towards chance_of_admit, and the largest coefficient belongs to cgpa (0.1173084).
- university_rating and sop do not have a significant relationship with our target variable.
- The Adjusted R-squared shows that the model explains 81.32% of the variation in chance_of_admit, meanwhile the other 18.68% is explained by variables outside the model.

df_all <- data.frame(coefficient = all_admission$coefficients,
variables = names(all_admission$coefficients))
rownames(df_all) <- NULL
plot_multi <- df_all %>%
ggplot(mapping = aes(x = variables, y = coefficient, col = variables)) +
geom_point(shape = 19, size = 5) +
ylim(c(0, 0.15)) +
theme(legend.position = "none") +
labs(title = "The Visualization of Multiple Linear Regression Coefficient",
subtitle = "Using the All Variables (all_admission)")
plot_multi %>% ggplotly()

We use stepwise regression to choose the best combination of predictors.
none_admission <- lm(formula = chance_of_admit ~ 1,
                     data = admission_train)

backward_admission <- step(object = all_admission,
                           direction = "backward",
                           trace = F)

Formula = chance_of_admit ~ gre_score + toefl_score + lor + cgpa + research

forward_admission <- step(object = none_admission,
                          direction = "forward",
                          scope = list(lower = none_admission,
                                       upper = all_admission),
                          trace = F)

Formula = chance_of_admit ~ cgpa + gre_score + lor + research + toefl_score

both1_admission <- step(object = all_admission,
                        direction = "both",
                        scope = list(lower = none_admission,
                                     upper = all_admission),
                        trace = F)

Formula = chance_of_admit ~ gre_score + toefl_score + lor + cgpa + research

both2_admission <- step(object = none_admission,
                        direction = "both",
                        scope = list(lower = none_admission,
                                     upper = all_admission),
                        trace = F)

Formula = chance_of_admit ~ cgpa + gre_score + lor + research + toefl_score
Comparing the Adjusted R-squared values shows that the models are almost identical: the stepwise models explain 0.8135837, or about 81%, of the variation in chance_of_admit, while the all_admission model scores slightly lower at 0.8131597 (also about 81%).
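A small sketch of that comparison (assuming the model objects above are still in memory), pulling the Adjusted R-squared out of each model's summary:

# compare the adjusted R-squared of the candidate models
sapply(list(all      = all_admission,
            backward = backward_admission,
            forward  = forward_admission,
            both1    = both1_admission,
            both2    = both2_admission),
       function(model) summary(model)$adj.r.squared)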
For prediction performance, we can evaluate the models using RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), or MAPE (Mean Absolute Percentage Error), as sketched below. Here, the all_admission model has the lowest RMSE and MAE of all the candidate models, so we are going to use it.
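A minimal sketch of that comparison on the test set, using the MLmetrics functions loaded earlier (shown here for two of the candidate models; the same pattern applies to the rest):

# predict on the test set with each candidate model and compare error metrics
pred_all      <- predict(all_admission, newdata = admission_test)
pred_backward <- predict(backward_admission, newdata = admission_test)

RMSE(y_pred = pred_all, y_true = admission_test$chance_of_admit)
MAE(y_pred = pred_all, y_true = admission_test$chance_of_admit)
MAPE(y_pred = pred_all, y_true = admission_test$chance_of_admit)

RMSE(y_pred = pred_backward, y_true = admission_test$chance_of_admit)
MAE(y_pred = pred_backward, y_true = admission_test$chance_of_admit)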
dfastest <- data.frame(fitted_value = all_admission$fitted.values,
                       residuals = all_admission$residuals)

The scatter plot of residuals against predicted (fitted) values is useful for checking the linearity and homoscedasticity assumptions. If the plot shows a specific or regular pattern, the relationship between the target variable and the predictors is assumed to be non-linear; no pattern in the curve is a sign of linearity between the selected features and the target variable.

plot(all_admission, which = 1)

From the ggcorr() plot earlier, we have already seen that all the variables have a strong positive correlation with the target variable. In the residual plot, the red line is roughly flat, but the spread of the residuals seems to increase as the fitted values increase.
The normality assumption means that the residuals (errors) of the linear regression model should be normally distributed, because we expect the residuals to be centred around zero. There are several ways to check whether our model passes the normality assumption.
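As a quick sanity check before looking at the shape of the distribution (a small sketch on the residuals of all_admission), we can confirm that they are indeed centred around zero:

# the mean of OLS residuals should be (numerically) zero
mean(all_admission$residuals)
summary(all_admission$residuals)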
dfastest %>%
  ggplot(aes(x = residuals)) +
  geom_histogram(fill = "#69b3a2",  # fill colour of the bars
                 color = "#e9ecef", # outline colour of the bars
                 alpha = 5) +
  labs(title = "Histogram for Normality Test")

The Kolmogorov-Smirnov (K-S) test is a test of the equality of two distributions. The one-sample K-S test is used to test the normality of a selected continuous variable, while the two-sample K-S test is used to test whether two samples come from the same distribution.
ks.test(all_admission$residuals,
        "pnorm",
        mean = mean(all_admission$residuals),
        sd = sd(all_admission$residuals))
#>
#> One-sample Kolmogorov-Smirnov test
#>
#> data: all_admission$residuals
#> D = 0.10145, p-value = 0.0005311
#> alternative hypothesis: two-sided
D = 0.10145 | p-value = 0.0005311
p-value < alpha (0.05), meaning we reject the null hypothesis.
According to the Kolmogorov-Smirnov test, the errors are not normally distributed.
Density plot and Q-Q plot can be used to check normality visually.
Density plot:
QQ plot:
The density plot provides a visual judgment of whether the distribution is bell-shaped.
library(ggpubr)

ggdensity(all_admission$residuals, fill = "grey", title = "Chance_of_admit") +
  # scale_x_continuous(limits = c(-0.3, 0.3)) +
  stat_overlay_normal_density(color = "red", linetype = "dashed")

The density of the residuals looks skewed compared to the overlaid normal curve.
QQ plot (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution. A 45-degree reference line is also plotted. In a QQ plot, each observation is plotted as a single dot. If the data are normal, the dots should form a straight line.
ggpubr::ggqqplot(all_admission$residuals) +
  labs(title = "Normality using the QQ-Plot")

Several dots fall outside the reference line, meaning the residuals deviate from the theoretical normal quantiles. According to the QQ plot, the errors are NOT normally distributed.
We may use the Shapiro-Wilk Normality test and look at its p-value.
shapiro.test(all_admission$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: all_admission$residuals
#> W = 0.92715, p-value = 0.0000000000004841
\(H_0\) : Residuals are normally distributed
\(H_1\) : Residuals are NOT normally distributed
According to the Shapiro-Wilk test (p-value < 0.05), we reject H0: the errors/residuals are NOT normally distributed.
As previously stated, the plot used for checking the linearity assumption and the homoscedasticity assumption is the same.
plot_homosc <- dfastest %>%
  ggplot(aes(fitted_value, residuals)) +
  geom_point() +
  theme_light() +
  geom_hline(aes(yintercept = 0), col = "red")

plot_homosc %>% ggplotly()

We can also use the Breusch-Pagan test to check the assumption of Homoscedasticity by using the function bptest() in the library lmtest. Here, we would like to achieve Homoscedasticity.
library(lmtest)
bptest(all_admission)
#>
#> studentized Breusch-Pagan test
#>
#> data: all_admission
#> BP = 20.842, df = 7, p-value = 0.00401
\(H_0\) : Variation of residual is constant (Homoscedasticity)
\(H_1\) : Variation of residual is not constant (Heteroscedasticity)
p-value = 0.00401
p-value < 0.05, therefore we reject H0:
the variation of the residuals is not constant (heteroscedasticity).
The Variance Inflation Factor (VIF) is a statistical tool that can be used to assess the multicollinearity assumption. It measures the effect of multicollinearity among the predictors in our model. We may use the vif() function from the car (Companion to Applied Regression) library. All variables must have a VIF below 10 to pass the no-multicollinearity assumption.
vif(all_admission)
#> gre_score toefl_score university_rating sop
#> 4.168858 3.546784 2.514230 2.799301
#> lor cgpa research
#> 2.027913 4.365244 1.438676
There is no multicollinearity among the predictors because all VIF values are below 10.
To conclude, out of the four assumption tests, the model only passed the no-multicollinearity and linearity tests. Therefore, we need to improve our model. Because regression models are sensitive to outliers, we are going to redo our data wrangling.
Previously, when we used the plot() function to check linearity, R already flagged 3 outlying observations in our data. We can also use the outlierTest() function to check this. After that, we are going to remove these observations.
outlierTest(all_admission)
#> rstudent unadjusted p-value Bonferroni p
#> 5 -4.682108 0.0000039222 0.0015689
#> 78 -4.013394 0.0000717440 0.0286970
#> 140 -3.972432 0.0000846990 0.0338800
admission_no_out <- admission[-c(5, 78, 140), ]

To resolve the heteroscedasticity problem, we may use the Box-Cox transformation, a mathematical transformation of the target variable that makes its distribution approximately normal.
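For reference (this is the standard one-parameter form of the transformation, not something reported by the dataset), the Box-Cox transformation of a positive variable \(y\) is:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln(y), & \lambda = 0
\end{cases}
\]

caret::BoxCoxTrans() estimates the \(\lambda\) that brings the transformed chance_of_admit closest to normal, and predict() then applies that transformation to the column.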
bc_trans <- caret::BoxCoxTrans(admission_no_out$chance_of_admit)
admission_no_out <- cbind(admission_no_out, coa_new = predict(bc_trans, admission_no_out$chance_of_admit))
admission_no_out <- admission_no_out %>%
select(-chance_of_admit)
admission_no_out

ggcorr(admission_no_out,
       label = T,
       label_size = 4,
       hjust = 1,
       layout.exp = 1)

We have previously seen that university_rating and sop have no significant coefficients, so we are going to remove them to try to improve the model.
admission_no_out <- admission_no_out %>%
  select(-c(university_rating, sop))

RNGkind(sample.kind = "Rounding")
set.seed(159)
index <- sample(nrow(admission_no_out),
                nrow(admission_no_out) * 0.8)

admission_no_out_train <- admission_no_out[index, ]
admission_no_out_test <- admission_no_out[-index, ]

all_no_out_admission <- lm(formula = coa_new ~ . , # target variable with ALL predictors
                           data = admission_no_out_train)
summary(all_no_out_admission)
#>
#> Call:
#> lm(formula = coa_new ~ ., data = admission_no_out_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.209806 -0.019814 0.007221 0.029631 0.126079
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.0056565 0.0914883 -21.923 < 0.0000000000000002 ***
#> gre_score 0.0015362 0.0004624 3.322 0.000978 ***
#> toefl_score 0.0026035 0.0007971 3.266 0.001186 **
#> lor 0.0136055 0.0035044 3.882 0.000121 ***
#> cgpa 0.1083541 0.0089360 12.126 < 0.0000000000000002 ***
#> research1 0.0201701 0.0061708 3.269 0.001176 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.04954 on 391 degrees of freedom
#> Multiple R-squared: 0.8346, Adjusted R-squared: 0.8325
#> F-statistic: 394.7 on 5 and 391 DF, p-value: < 0.00000000000000022
# data prediction
new_all <- predict(object = all_no_out_admission, newdata = admission_no_out_test)

# MAE for the train data
MAE(y_pred = all_no_out_admission$fitted.values, y_true = admission_no_out_train$coa_new)
#> [1] 0.03621813

# MAE for the test data
MAE(y_pred = new_all, y_true = admission_no_out_test$coa_new)
#> [1] 0.02932204
Now our new model has an Adjusted R-squared of 0.8325, or 83.25%, which is higher than the previous model. The MAE is 0.02932204 on the test data and 0.03621813 on the train data; since the test error is even lower than the train error, this indicates that our model is underfit rather than overfit.
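A small complementary check (a sketch using the MLmetrics RMSE() function on the same train and test objects, mirroring the MAE comparison above):

# RMSE for the train data
RMSE(y_pred = all_no_out_admission$fitted.values, y_true = admission_no_out_train$coa_new)

# RMSE for the test data
RMSE(y_pred = new_all, y_true = admission_no_out_test$coa_new)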
plot(all_no_out_admission, which = 1)

newassump <- data.frame(fitted_value = all_no_out_admission$fitted.values,
                        residuals = all_no_out_admission$residuals)

new_linearity <- newassump %>%
  ggplot(aes(x = fitted_value, y = residuals)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 0, col = "red") + # reference line at zero
  theme(legend.position = "none") +
  ylim(c(-0.3, 0.3))

new_linearity %>% ggplotly()

There is little to no discernible pattern in the residual plot, so we can conclude that our model is linear.
ggpubr::ggqqplot(all_no_out_admission$residuals) +
  labs(title = "Normality using the QQ-Plot")

There are still dots that fall outside the reference line, meaning the residuals still deviate from normality. According to the QQ plot, the errors are NOT normally distributed.
shapiro.test(all_no_out_admission$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: all_no_out_admission$residuals
#> W = 0.94354, p-value = 0.00000000003755
\(H_0\) : Residuals are normally distributed
\(H_1\) : Residuals are NOT normally distributed
According to the Shapiro-Wilk test (p-value < 0.05), we reject H0: the errors/residuals are NOT normally distributed.
bptest(all_no_out_admission)
#>
#> studentized Breusch-Pagan test
#>
#> data: all_no_out_admission
#> BP = 10.9, df = 5, p-value = 0.05341
\(H_0\) : Variation of residual is constant (Homoscedasticity)
\(H_1\) : Variation of residual is not constant (Heteroscedasticity)
p-value = 0.05341
p-value > 0.05, therefore we fail to reject H0.
According to the Breusch-Pagan test, the variation of the residuals is constant (homoscedasticity).
vif(all_no_out_admission)
#> gre_score toefl_score lor cgpa research
#> 4.585801 3.911981 1.735044 4.870249 1.518496
There is no multicollinearity among the predictors because all VIF values are below 10.
To conclude, our first model explains 81.31% of the variation in the target, while our improved model explains 83.25%. Although the latest model explains more, it turned out to be underfit. After the improvement, the only assumption the model still fails is normality. There are several reasons why residuals may not be normally distributed, one of them being that there is not enough data for the model to learn from. To resolve this, we can collect more data to feed the model so that it can learn better and produce a better model.
The important variables for predicting a person's chance of being admitted to US graduate school are their CGPA, previous research experience, Letter of Recommendation, GRE score, and TOEFL score (in that order). Based on this dataset, we suggest that students applying to US graduate schools focus on their CGPA first, as it has the highest coefficient with chance of admit. We also found two variables with no significant coefficient for our target variable: university rating and statement of purpose. Although the undergraduate university's rating and the statement of purpose are required parts of a graduate school application, leaving these two out of the new model actually improved its performance, so they can be set aside for a moment while focusing on the five variables mentioned above.
Thank you very much for taking the time to read my Machine Learning - Regression Model report. I hope it is useful for you, readers, and let's prepare for our grad school applications! Meanwhile, on to the next projects…! -gn