Greetings fellow readers! My name is Gissella Nadya and this is my Regression Model - Learning By Building Project for Algoritma Academy. Now, who doesn’t want to be accepted to a graduate school in the United States? Well, I, for one, would like to continue my education there! There are many things to prepare, and as data analysts we are going to explore a US graduate school admission dataset and find out which parameters are essential for being accepted as a graduate student in the US.
For this project, the dataset is provided through this link.
Let’s get started!
# import libs
library(tidyverse)
library(lubridate)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(plotly)
library(performance)

admission <- read_csv("Admission_Predict_Ver1.1.csv")

admission <- admission %>%
  janitor::clean_names()

head(admission, n = 5)

glimpse(admission)
#> Rows: 500
#> Columns: 9
#> $ serial_no <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
#> $ gre_score <dbl> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 3…
#> $ toefl_score <dbl> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 1…
#> $ university_rating <dbl> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, …
#> $ sop <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3…
#> $ lor <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4…
#> $ cgpa <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.0…
#> $ research <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
#> $ chance_of_admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.5…
- serial_no = Serial number
- gre_score = GRE score (out of 340)
- toefl_score = TOEFL score (out of 120)
- university_rating = The rating of the student’s previous (undergraduate) university (out of 5)
- sop = Statement of Purpose strength (out of 5)
- lor = Letter of Recommendation strength (out of 5)
- cgpa = Undergraduate Cumulative GPA (out of 10)
- research = Research experience (0 = no, 1 = yes)
- chance_of_admit = Chance of admission (ranging from 0 to 1)

For the sop and lor variables, the dataset owner does not give a thorough definition. Since US graduate schools require applicants to submit these documents, we are going to assume that the numbers from 1 to 5 rate the quality of the documents the graduate school received.
We are not going to use the serial number, as unique identifiers are not useful for building models.
admission <- admission %>%
  select(-serial_no)

admission <- admission %>%
  mutate(research = as.factor(research))

colSums(is.na(admission))
#> gre_score toefl_score university_rating sop
#> 0 0 0 0
#> lor cgpa research chance_of_admit
#> 0 0 0 0
ggcorr(admission,
       label = T,
       label_size = 4,
       hjust = 1,
       layout.exp = 1)

Our target (dependent) variable is chance_of_admit. Most of the predictors have a STRONG positive correlation with the target variable.
ggplot(gather(admission %>%
                select_if(is.numeric)), aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~key, scales = 'free_x')

Here we can see that the observations, i.e. the US graduate school candidates, come from different backgrounds and have different score profiles.
summary(admission)
#> gre_score toefl_score university_rating sop
#> Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
#> 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
#> Median :317.0 Median :107.0 Median :3.000 Median :3.500
#> Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
#> 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
#> lor cgpa research chance_of_admit
#> Min. :1.000 Min. :6.800 0:220 Min. :0.3400
#> 1st Qu.:3.000 1st Qu.:8.127 1:280 1st Qu.:0.6300
#> Median :3.500 Median :8.560 Median :0.7200
#> Mean :3.484 Mean :8.576 Mean :0.7217
#> 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:0.8200
#> Max. :5.000 Max. :9.920 Max. :0.9700
Previously, we saw that gre_score, toefl_score, and cgpa have the highest correlation with our target variable, chance_of_admit. Here we would like to visualize it.
plot_ly(data = admission,
        x = ~ gre_score,
        y = ~ toefl_score,
        z = ~ cgpa,
        color = ~ chance_of_admit,
        # type = "scatter3d",
        colors = c("black", "#BF3124", "#F27D52", "#F2C46D", "#800080"), # high to low
        hoverinfo = 'text',
        text = ~ paste("</br> GRE Score: ", gre_score,
                       "</br> TOEFL Score: ", toefl_score,
                       "</br> CGPA: ", cgpa,
                       "</br> Chance of Admit: ", chance_of_admit)
        # alpha = 2,
        ) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = "GRE Score"),
                      yaxis = list(title = "TOEFL Score"),
                      zaxis = list(title = "CGPA")))

RNGkind(sample.kind = "Rounding")
set.seed(159)
index <- sample(nrow(admission),
nrow(admission) * 0.8)
admission_train <- admission[index, ]
admission_test <- admission[-index, ]

admission_train

As we have seen in the correlation plot, the predictor with the strongest correlation to the target variable is cgpa, so we are going to start with a simple linear regression model.
coa_cgpa <- lm(formula = chance_of_admit ~ cgpa,
data = admission_train)
summary(coa_cgpa)
#>
#> Call:
#> lm(formula = chance_of_admit ~ cgpa, data = admission_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.278889 -0.028357 0.006111 0.040275 0.172544
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.011654 0.048002 -21.07 <0.0000000000000002 ***
#> cgpa 0.202389 0.005587 36.22 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06591 on 398 degrees of freedom
#> Multiple R-squared: 0.7673, Adjusted R-squared: 0.7667
#> F-statistic: 1312 on 1 and 398 DF, p-value: < 0.00000000000000022
\[ \widehat{\text{Chance of Admit}} = -1.011654 + 0.202389 \times \text{cgpa} \]
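As a quick illustration (a small sketch; the CGPA of 9.0 is just a hypothetical value, not an observation from the dataset), we can let predict() evaluate the fitted line:

# hypothetical applicant with an undergraduate CGPA of 9.0 (illustrative value only)
predict(coa_cgpa, newdata = data.frame(cgpa = 9.0))
# this simply evaluates -1.011654 + 0.202389 * 9.0, i.e. roughly 0.81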
cor.test(admission_train$cgpa,
admission_train$chance_of_admit)
#>
#> Pearson's product-moment correlation
#>
#> data: admission_train$cgpa and admission_train$chance_of_admit
#> t = 36.223, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.8509759 0.8969542
#> sample estimates:
#> cor
#> 0.8759396
p-value < 0.00000000000000022
p-value < alpha (0.05), so we reject H0 and accept H1:
cgpa has a significant linear correlation with chance_of_admit.
Baseline model: build the model using ALL predictors.
all_admission <- lm(formula = chance_of_admit ~ . , # target variable with ALL
data = admission_train)
summary(all_admission)
#>
#> Call:
#> lm(formula = chance_of_admit ~ ., data = admission_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.267256 -0.024179 0.007702 0.034469 0.152604
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3036161 0.1179768 -11.050 < 0.0000000000000002 ***
#> gre_score 0.0021416 0.0005560 3.852 0.000137 ***
#> toefl_score 0.0022874 0.0009484 2.412 0.016328 *
#> university_rating 0.0030528 0.0042301 0.722 0.470912
#> sop 0.0022870 0.0051065 0.448 0.654499
#> lor 0.0186909 0.0046137 4.051 0.0000615 ***
#> cgpa 0.1173084 0.0104464 11.230 < 0.0000000000000002 ***
#> research1 0.0265714 0.0071137 3.735 0.000215 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.05898 on 392 degrees of freedom
#> Multiple R-squared: 0.8164, Adjusted R-squared: 0.8132
#> F-statistic: 249.1 on 7 and 392 DF, p-value: < 0.00000000000000022
Insights:

- Every predictor has a positive coefficient towards chance_of_admit, and the largest coefficient belongs to cgpa (0.1173084).
- university_rating and sop do not have a significant relationship with our target variable.
- The Adjusted R-squared shows that the model explains 81.32% of the variation in chance_of_admit, meanwhile the other 18.68% is explained by variables outside the model.

df_all <- data.frame(coefficient = all_admission$coefficients,
variables = names(all_admission$coefficients))
rownames(df_all) <- NULL
plot_multi <- df_all %>%
ggplot(mapping = aes(x = variables, y = coefficient, col = variables)) +
geom_point(shape = 19, size = 5) +
ylim(c(0, 0.15)) +
theme(legend.position = "none") +
labs(title = "The Visualization of Multiple Linear Regression Coefficient",
subtitle = "Using the All Variables (all_admission)")
plot_multi %>% ggplotly()

We use stepwise regression to choose the best combination of predictors.
none_admission <- lm(formula = chance_of_admit ~ 1,
                     data = admission_train)

backward_admission <- step(object = all_admission,
                           direction = "backward",
                           trace = F)

Formula = chance_of_admit ~ gre_score + toefl_score + lor + cgpa + research

forward_admission <- step(object = none_admission,
                          direction = "forward",
                          scope = list(lower = none_admission,
                                       upper = all_admission),
                          trace = F)

Formula = chance_of_admit ~ cgpa + gre_score + lor + research + toefl_score

both1_admission <- step(object = all_admission,
                        direction = "both",
                        scope = list(lower = none_admission,
                                     upper = all_admission),
                        trace = F)

Formula = chance_of_admit ~ gre_score + toefl_score + lor + cgpa + research

both2_admission <- step(object = none_admission,
                        direction = "both",
                        scope = list(lower = none_admission,
                                     upper = all_admission),
                        trace = F)

Formula = chance_of_admit ~ cgpa + gre_score + lor + research + toefl_score
Comparing the Adjusted R-squared values shows that the models are almost identical: the stepwise models explain 0.8135837, or about 81%, of the variation in chance_of_admit, while the all_admission model scores slightly lower at 0.8131597 (also about 81%).
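A small sketch of that comparison (assuming the model objects above are still in memory), pulling the Adjusted R-squared out of each model's summary:

# compare the adjusted R-squared of the candidate models
sapply(list(all      = all_admission,
            backward = backward_admission,
            forward  = forward_admission,
            both1    = both1_admission,
            both2    = both2_admission),
       function(model) summary(model)$adj.r.squared)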
For prediction performance, we can evaluate the models using RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), or MAPE (Mean Absolute Percentage Error), as sketched below. Here, the all_admission model has the lowest RMSE and MAE of all the candidate models, so we are going to use it.
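A minimal sketch of that comparison on the test set, using the MLmetrics functions loaded earlier (shown here for two of the candidate models; the same pattern applies to the rest):

# predict on the test set with each candidate model and compare error metrics
pred_all      <- predict(all_admission, newdata = admission_test)
pred_backward <- predict(backward_admission, newdata = admission_test)

RMSE(y_pred = pred_all, y_true = admission_test$chance_of_admit)
MAE(y_pred = pred_all, y_true = admission_test$chance_of_admit)
MAPE(y_pred = pred_all, y_true = admission_test$chance_of_admit)

RMSE(y_pred = pred_backward, y_true = admission_test$chance_of_admit)
MAE(y_pred = pred_backward, y_true = admission_test$chance_of_admit)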
dfastest <- data.frame(fitted_value = all_admission$fitted.values,
                       residuals = all_admission$residuals)

The scatter plot of residuals against predicted (fitted) values is useful for checking the linearity and homoscedasticity assumptions. If the plot shows a specific or regular pattern, the relationship between the target variable and the predictors is assumed to be non-linear; no pattern in the curve is a sign of linearity between the selected features and the target variable.

plot(all_admission, which = 1)

From the ggcorr() plot earlier, we have already seen that all the variables have a strong positive correlation with the target variable. In the residual plot, the red line is roughly flat, but the spread of the residuals seems to increase as the fitted values increase.
The normality assumption means that the residuals (errors) of the linear regression model should be normally distributed, because we expect the residuals to be centred around zero. There are several ways to check whether our model passes the normality assumption.
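As a quick sanity check before looking at the shape of the distribution (a small sketch on the residuals of all_admission), we can confirm that they are indeed centred around zero:

# the mean of OLS residuals should be (numerically) zero
mean(all_admission$residuals)
summary(all_admission$residuals)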
dfastest %>%
  ggplot(aes(x = residuals)) +
  geom_histogram(fill = "#69b3a2",  # fill colour of the bars
                 color = "#e9ecef", # outline colour of the bars
                 alpha = 5) +
  labs(title = "Histogram for Normality Test")

The Kolmogorov-Smirnov (K-S) test is a test of the equality of two distributions. The one-sample K-S test is used to test the normality of a selected continuous variable, while the two-sample K-S test is used to test whether two samples come from the same distribution.
ks.test(all_admission$residuals,
        "pnorm",
        mean = mean(all_admission$residuals),
        sd = sd(all_admission$residuals))
#>
#> One-sample Kolmogorov-Smirnov test
#>
#> data: all_admission$residuals
#> D = 0.10145, p-value = 0.0005311
#> alternative hypothesis: two-sided
D = 0.10145 | p-value = 0.0005311
p-value < alpha (0.05), meaning we reject the null hypothesis.
According to the Kolmogorov-Smirnov test, the errors are not normally distributed.
Density plot and Q-Q plot can be used to check normality visually.
Density plot:
QQ plot:
The density plot provides a visual judgment of whether the distribution is bell-shaped.
library(ggpubr)

ggdensity(all_admission$residuals, fill = "grey", title = "Chance_of_admit") +
  # scale_x_continuous(limits = c(-0.3, 0.3)) +
  stat_overlay_normal_density(color = "red", linetype = "dashed")

The density of the residuals looks skewed compared to the overlaid normal curve.
QQ plot (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution. A 45-degree reference line is also plotted. In a QQ plot, each observation is plotted as a single dot. If the data are normal, the dots should form a straight line.
ggpubr::ggqqplot(all_admission$residuals) +
  labs(title = "Normality using the QQ-Plot")

Several dots fall outside the reference line, meaning the residuals deviate from the theoretical normal quantiles. According to the QQ plot, the errors are NOT normally distributed.
We may use the Shapiro-Wilk Normality test and look at its p-value.
shapiro.test(all_admission$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: all_admission$residuals
#> W = 0.92715, p-value = 0.0000000000004841
\(H_0\) : Residuals are normally distributed
\(H_1\) : Residuals are NOT normally distributed
According to the Shapiro-Wilk test (p-value < 0.05), we reject H0: the errors/residuals are NOT normally distributed.
As previously stated, the plot used for checking the linearity assumption and the homoscedasticity assumption is the same.
plot_homosc <- dfastest %>%
  ggplot(aes(fitted_value, residuals)) +
  geom_point() +
  theme_light() +
  geom_hline(aes(yintercept = 0), col = "red")

plot_homosc %>% ggplotly()

We can also use the Breusch-Pagan test to check the assumption of Homoscedasticity by using the function bptest() in the library lmtest. Here, we would like to achieve Homoscedasticity.
library(lmtest)
bptest(all_admission)
#>
#> studentized Breusch-Pagan test
#>
#> data: all_admission
#> BP = 20.842, df = 7, p-value = 0.00401
\(H_0\) : Variation of residual is constant (Homoscedasticity)
\(H_1\) : Variation of residual is not constant (Heteroscedasticity)
p-value = 0.00401
p-value < 0.05, therefore we reject H0:
the variation of the residuals is not constant (heteroscedasticity).
The Variance Inflation Factor (VIF) is a statistical tool that can be used to assess the multicollinearity assumption. It measures the effect of multicollinearity among the predictors in our model. We may use the vif() function from the car (Companion to Applied Regression) library. All variables must have a VIF below 10 to pass the no-multicollinearity assumption.
vif(all_admission)
#> gre_score toefl_score university_rating sop
#> 4.168858 3.546784 2.514230 2.799301
#> lor cgpa research
#> 2.027913 4.365244 1.438676
There is no multicollinearity among the predictors because all VIF values are below 10.
To conclude, out of the four assumption tests, the model only passed the no-multicollinearity and linearity tests. Therefore, we need to improve our model. Because regression models are sensitive to outliers, we are going to redo our data wrangling.
Previously, when we used the plot() function to check linearity, R already flagged 3 outlying observations in our data. We can also use the outlierTest() function to check this. After that, we are going to remove these observations.
outlierTest(all_admission)
#> rstudent unadjusted p-value Bonferroni p
#> 5 -4.682108 0.0000039222 0.0015689
#> 78 -4.013394 0.0000717440 0.0286970
#> 140 -3.972432 0.0000846990 0.0338800
admission_no_out <- admission[-c(5, 78, 140), ]

To resolve the heteroscedasticity problem, we may use the Box-Cox transformation, a mathematical transformation of the target variable that makes its distribution approximately normal.
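For reference (this is the standard one-parameter form of the transformation, not something reported by the dataset), the Box-Cox transformation of a positive variable \(y\) is:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln(y), & \lambda = 0
\end{cases}
\]

caret::BoxCoxTrans() estimates the \(\lambda\) that brings the transformed chance_of_admit closest to normal, and predict() then applies that transformation to the column.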
bc_trans <- caret::BoxCoxTrans(admission_no_out$chance_of_admit)
admission_no_out <- cbind(admission_no_out, coa_new = predict(bc_trans, admission_no_out$chance_of_admit))
admission_no_out <- admission_no_out %>%
select(-chance_of_admit)
admission_no_out

ggcorr(admission_no_out,
       label = T,
       label_size = 4,
       hjust = 1,
       layout.exp = 1)

We have previously seen that university_rating and sop have no significant coefficients, so we are going to remove them to try to improve the model.
admission_no_out <- admission_no_out %>%
  select(-c(university_rating, sop))

RNGkind(sample.kind = "Rounding")
set.seed(159)
index <- sample(nrow(admission_no_out),
                nrow(admission_no_out) * 0.8)

admission_no_out_train <- admission_no_out[index, ]
admission_no_out_test <- admission_no_out[-index, ]

all_no_out_admission <- lm(formula = coa_new ~ . , # target variable with ALL predictors
                           data = admission_no_out_train)
summary(all_no_out_admission)
#>
#> Call:
#> lm(formula = coa_new ~ ., data = admission_no_out_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.209806 -0.019814 0.007221 0.029631 0.126079
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.0056565 0.0914883 -21.923 < 0.0000000000000002 ***
#> gre_score 0.0015362 0.0004624 3.322 0.000978 ***
#> toefl_score 0.0026035 0.0007971 3.266 0.001186 **
#> lor 0.0136055 0.0035044 3.882 0.000121 ***
#> cgpa 0.1083541 0.0089360 12.126 < 0.0000000000000002 ***
#> research1 0.0201701 0.0061708 3.269 0.001176 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.04954 on 391 degrees of freedom
#> Multiple R-squared: 0.8346, Adjusted R-squared: 0.8325
#> F-statistic: 394.7 on 5 and 391 DF, p-value: < 0.00000000000000022
# data prediction
new_all <- predict(object = all_no_out_admission, newdata = admission_no_out_test)

# MAE for the train data
MAE(y_pred = all_no_out_admission$fitted.values, y_true = admission_no_out_train$coa_new)
#> [1] 0.03621813

# MAE for the test data
MAE(y_pred = new_all, y_true = admission_no_out_test$coa_new)
#> [1] 0.02932204
Now our new model has an Adjusted R-squared of 0.8325, or 83.25%, which is higher than the previous model. The MAE is 0.02932204 on the test data and 0.03621813 on the train data; since the test error is even lower than the train error, this indicates that our model is underfit rather than overfit.
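A small complementary check (a sketch using the MLmetrics RMSE() function on the same train and test objects, mirroring the MAE comparison above):

# RMSE for the train data
RMSE(y_pred = all_no_out_admission$fitted.values, y_true = admission_no_out_train$coa_new)

# RMSE for the test data
RMSE(y_pred = new_all, y_true = admission_no_out_test$coa_new)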
plot(all_no_out_admission, which = 1)

newassump <- data.frame(fitted_value = all_no_out_admission$fitted.values,
                        residuals = all_no_out_admission$residuals)

new_linearity <- newassump %>%
  ggplot(aes(x = fitted_value, y = residuals)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 0, col = "red") + # reference line at zero
  theme(legend.position = "none") +
  ylim(c(-0.3, 0.3))

new_linearity %>% ggplotly()

There is little to no discernible pattern in the residual plot, so we can conclude that our model is linear.
ggpubr::ggqqplot(all_no_out_admission$residuals) +
  labs(title = "Normality using the QQ-Plot")

There are still dots that fall outside the reference line, meaning the residuals still deviate from normality. According to the QQ plot, the errors are NOT normally distributed.
shapiro.test(all_no_out_admission$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: all_no_out_admission$residuals
#> W = 0.94354, p-value = 0.00000000003755
\(H_0\) : Residuals are normally distributed
\(H_1\) : Residuals are NOT normally distributed
According to the Shapiro-Wilk test (p-value < 0.05), we reject H0: the errors/residuals are NOT normally distributed.
bptest(all_no_out_admission)
#>
#> studentized Breusch-Pagan test
#>
#> data: all_no_out_admission
#> BP = 10.9, df = 5, p-value = 0.05341
\(H_0\) : Variation of residual is constant (Homoscedasticity)
\(H_1\) : Variation of residual is not constant (Heteroscedasticity)
p-value = 0.05341
p-value > 0.05, therefore we fail to reject H0.
According to the Breusch-Pagan test, the variation of the residuals is constant (homoscedasticity).
vif(all_no_out_admission)
#> gre_score toefl_score lor cgpa research
#> 4.585801 3.911981 1.735044 4.870249 1.518496
There is no multicollinearity among the predictors because all VIF values are below 10.
To conclude, our first model explains 81.31% of the variation in the target, while our improved model explains 83.25%. Although the latest model explains more, it turned out to be underfit. After the improvement, the only assumption the model still fails is normality. There are several reasons why residuals may not be normally distributed, one of them being that there is not enough data for the model to learn from. To resolve this, we can collect more data to feed the model so that it can learn better and produce a better model.
The important variables for predicting a person's chance of being admitted to US graduate school are their CGPA, previous research experience, Letter of Recommendation, GRE score, and TOEFL score (in that order). Based on this dataset, we suggest that students applying to US graduate schools focus on their CGPA first, as it has the highest coefficient with chance of admit. We also found two variables with no significant coefficient for our target variable: university rating and statement of purpose. Although the undergraduate university's rating and the statement of purpose are required parts of a graduate school application, leaving these two out of the new model actually improved its performance, so they can be set aside for a moment while focusing on the five variables mentioned above.
Thank you very much for taking the time to read my Machine Learning - Regression Model report. I hope it is useful for you, readers, and let's prepare for our grad school applications! Meanwhile, on to the next projects…! -gn