Hi Everyone!
Welcome to my first data science project. The Graduate Admission project is based on a problem graduate students face when they apply to a postgraduate program. Choosing a university is a dilemma, because many factors may influence the outcome, and in practice applicants still judge those factors from previous exam experience. In this project we therefore use machine learning modeling to find out which factors have a significant effect on the chance of being admitted to a postgraduate program.
Here are the steps we follow:
library(GGally) #heatmap
library(tidyverse) #data wrangling
library(MLmetrics) # MAPE value
library(lmtest) # tests for linear regression models (bptest)
library(car) # multicolinearity
library(inspectdf)
library(performance) # compare performance
graduate <- read.csv("Admission_Predict.csv") # store in the object `graduate`
head(graduate)
tail(graduate)
dim(graduate)
## [1] 400 9
names(graduate)
## [1] "Serial.No." "GRE.Score" "TOEFL.Score"
## [4] "University.Rating" "SOP" "LOR"
## [7] "CGPA" "Research" "Chance.of.Admit"
GRE.Score : Score on the Graduate Record Examination, a standardized test required when applying to many postgraduate programs
TOEFL.Score : TOEFL score
University.Rating : University rating
SOP : Statement of Purpose, a short essay that describes the applicant's educational background, achievements, and future goals
LOR : Letter of Recommendation
CGPA : Cumulative Grade Point Average, the score obtained across all semesters taken
Research : Research experience, whether the applicant has it or not
Chance.of.Admit : Chance of being admitted to the university
summary(graduate)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:100.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :200.5 Median :317.0 Median :107.0 Median :3.000
## Mean :200.5 Mean :316.8 Mean :107.4 Mean :3.087
## 3rd Qu.:300.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :400.0 Max. :340.0 Max. :120.0 Max. :5.000
## SOP LOR CGPA Research
## Min. :1.0 Min. :1.000 Min. :6.800 Min. :0.0000
## 1st Qu.:2.5 1st Qu.:3.000 1st Qu.:8.170 1st Qu.:0.0000
## Median :3.5 Median :3.500 Median :8.610 Median :1.0000
## Mean :3.4 Mean :3.453 Mean :8.599 Mean :0.5475
## 3rd Qu.:4.0 3rd Qu.:4.000 3rd Qu.:9.062 3rd Qu.:1.0000
## Max. :5.0 Max. :5.000 Max. :9.920 Max. :1.0000
## Chance.of.Admit
## Min. :0.3400
## 1st Qu.:0.6400
## Median :0.7300
## Mean :0.7244
## 3rd Qu.:0.8300
## Max. :0.9700
Insights
There are 400 observations and 9 variables in our data; the target variable is Chance.of.Admit.
Serial.No. is just a unique ID, so we can remove it.
The minimum GRE.Score is 290 and the maximum is 340. The mean GRE.Score across all applicants is 316.8.
The minimum TOEFL.Score is 92 and the maximum is 120. The mean TOEFL.Score across all applicants is 107.4.
Research only takes the values 0 and 1: whether the applicant already has research experience (1) or not (0).
# change the data type
graduate <- graduate %>%
mutate(Research = as.factor(Research))
# check missing value
colSums(is.na(graduate))
## Serial.No. GRE.Score TOEFL.Score University.Rating
## 0 0 0 0
## SOP LOR CGPA Research
## 0 0 0 0
## Chance.of.Admit
## 0
There are no missing values in our dataset.
Next, we do some data exploration to dig deeper into the insights our data holds.
To visualize the distribution of every variable, we used boxplots. The results show outliers in 3 variables: CGPA, Chance.of.Admit, and LOR.
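The plotting code for the boxplots isn't shown here. As a minimal sketch, assuming the default 1.5 x IQR whisker rule that `boxplot()` uses, base R's `boxplot.stats()` can list the values that would be drawn as outlier points. The vector `toy_scores` is made-up illustration data, not the admission dataset.

```r
# Made-up illustration data (hypothetical, not from Admission_Predict.csv)
toy_scores <- c(3.0, 3.5, 3.5, 4.0, 4.0, 4.5, 4.5, 5.0, 1.0)

# boxplot.stats() returns the points lying beyond the whiskers in $out
flagged <- boxplot.stats(toy_scores)$out
print(flagged)
```

On the real data, a call such as `boxplot.stats(graduate$LOR)$out` would presumably list the LOR values flagged in the plot.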
Before building the model, we want to check whether each independent variable has a linear relationship with the dependent variable.
Of the 7 predictors, only Serial.No. has no linear correlation with our target variable, most likely because it contains nothing but unique IDs. The rest all correlate positively with the target, and CGPA has the strongest positive correlation of them all.
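The correlation plot itself isn't reproduced here. As a rough sketch of the same check with base R's `cor()`, the simulated data frame `toy` below (all column names are hypothetical) contains an ID column and one informative predictor, to show why an ID ends up uncorrelated with the target.

```r
set.seed(42)  # reproducible toy data; every column below is simulated
n <- 200
toy <- data.frame(
  id    = seq_len(n),                       # row identifier only
  score = rnorm(n, mean = 8.5, sd = 0.6)    # informative predictor
)
toy$target <- 0.2 * toy$score + rnorm(n, sd = 0.05)

# Pearson correlation of every column with the target
cors <- cor(toy)[, "target"]
print(round(cors, 2))
```

The same idea applied to the numeric columns of `graduate` should give the numbers behind the heatmap.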
Based on the insights found earlier, we can remove the Serial.No. variable to simplify the next analysis steps.
graduate_clean <- graduate %>% select(-Serial.No.)
Now that we know how each predictor correlates with the dependent variable, we try a linear regression model with CGPA as the only predictor.
model_lm <- lm(formula = Chance.of.Admit~CGPA, graduate_clean)
summary(model_lm)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = graduate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.274575 -0.030084 0.009443 0.041954 0.180734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.07151 0.05034 -21.29 <0.0000000000000002 ***
## CGPA 0.20885 0.00584 35.76 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06957 on 398 degrees of freedom
## Multiple R-squared: 0.7626, Adjusted R-squared: 0.762
## F-statistic: 1279 on 1 and 398 DF, p-value: < 0.00000000000000022
Based on the summary of model_lm, the predictor CGPA is statistically significant for the target (Chance.of.Admit), as indicated by its p-value < 0.05. In addition, the predictor explains about 76% of the variance of the target (R-squared = 0.7626); the rest is explained by other factors. The CGPA coefficient of about 0.21 means that each additional CGPA point is associated with an increase of roughly 0.21 in the predicted Chance.of.Admit.
After these findings from model_lm, is it enough to use this model alone? What about the other predictors? Next, we use all the predictors to see whether anything changes.
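To make the coefficient concrete, here is a small worked example that plugs the reported estimates into the fitted line. The two CGPA values are hypothetical applicants chosen only for illustration.

```r
# Coefficients copied from the summary(model_lm) output
intercept  <- -1.07151
slope_cgpa <- 0.20885

# Hypothetical applicants (CGPA values chosen only for illustration)
cgpa <- c(8.0, 9.0)
predicted <- intercept + slope_cgpa * cgpa
print(predicted)

# The gap between the two predictions equals the slope: one extra CGPA point
print(diff(predicted))
```

With the fitted object at hand, `predict(model_lm, data.frame(CGPA = c(8, 9)))` should give the same values up to rounding of the coefficients.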
model <- lm(formula = Chance.of.Admit~., graduate_clean)
summary(model)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = graduate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26259 -0.02103 0.01005 0.03628 0.15928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2594325 0.1247307 -10.097 < 0.0000000000000002 ***
## GRE.Score 0.0017374 0.0005979 2.906 0.00387 **
## TOEFL.Score 0.0029196 0.0010895 2.680 0.00768 **
## University.Rating 0.0057167 0.0047704 1.198 0.23150
## SOP -0.0033052 0.0055616 -0.594 0.55267
## LOR 0.0223531 0.0055415 4.034 0.000066 ***
## CGPA 0.1189395 0.0122194 9.734 < 0.0000000000000002 ***
## Research1 0.0245251 0.0079598 3.081 0.00221 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.8
## F-statistic: 228.9 on 7 and 392 DF, p-value: < 0.00000000000000022
Based on the summary of model, there are several findings. All predictors are significant for the target except University.Rating and SOP. In addition, every significant predictor has a positive coefficient: the higher the score, the more likely the applicant is to be admitted to the postgraduate program. The only negative coefficient, SOP, is not significant.
Comparing model with model_lm, R-squared increases with the addition of 6 variables, by about 4 percentage points (from 0.7626 to 0.8035). Does that increase alone mean this model is good? Not necessarily: R-squared never decreases when predictors are added, so we need to check adjusted R-squared as well. Adjusted R-squared also increases (from 0.762 to 0.8). However, we still need to know which predictors are best at explaining the variance of the target. So in the next stage we do feature selection to find the best predictors.
To keep only the predictors that contribute, we apply backward stepwise regression to the multiple linear regression model.
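As a quick check of how adjusted R-squared penalizes extra predictors, the standard formula can be applied to the numbers reported in the two summaries. The helper `adj_r2` is a name introduced here for illustration only.

```r
# Adjusted R-squared: penalizes R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400  # observations, from dim(graduate)

# R-squared values and predictor counts as reported in the two summaries
adj_lm   <- adj_r2(0.7626, n, p = 1)  # model_lm: CGPA only
adj_full <- adj_r2(0.8035, n, p = 7)  # model: all seven predictors
print(round(c(model_lm = adj_lm, model = adj_full), 4))
```

Both values agree with the adjusted R-squared printed by `summary()`, up to the rounding of the reported R-squared.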
model_back <- step(model, direction = "backward", trace = F)
summary(model_back)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = graduate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.263542 -0.023297 0.009879 0.038078 0.159897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2984636 0.1172905 -11.070 < 0.0000000000000002 ***
## GRE.Score 0.0017820 0.0005955 2.992 0.00294 **
## TOEFL.Score 0.0030320 0.0010651 2.847 0.00465 **
## LOR 0.0227762 0.0048039 4.741 0.00000297 ***
## CGPA 0.1210042 0.0117349 10.312 < 0.0000000000000002 ***
## Research1 0.0245769 0.0079203 3.103 0.00205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared: 0.8027, Adjusted R-squared: 0.8002
## F-statistic: 320.6 on 5 and 394 DF, p-value: < 0.00000000000000022
The stepwise procedure evaluates each candidate model with the AIC (Akaike Information Criterion), which estimates the information lost by a model. The lower the AIC, the better.
The backward stepwise method starts from the model with all predictors, then drops predictors one at a time until the model with the smallest AIC is obtained.
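To see the mechanism in isolation, here is a minimal sketch on simulated data, where the column `noise` is deliberately unrelated to the response. All variable names are hypothetical, and whether `noise` is actually dropped depends on the simulated sample.

```r
set.seed(1)  # simulated data; variable names are hypothetical
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)  # `noise` plays no role

full    <- lm(y ~ ., data = sim)
reduced <- step(full, direction = "backward", trace = 0)

# Lower AIC is preferred; backward selection never ends with a higher AIC
print(c(full = AIC(full), reduced = AIC(reduced)))
print(formula(reduced))
```

Setting `trace = 1` instead prints every elimination step together with its AIC, which helps when you want to see why a predictor was removed.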
Model Interpretation:
Compared with the previous model, adjusted R-squared barely changes (0.8002 versus 0.8), and neither does R-squared (0.8027 versus 0.8035). However, the model_back summary shows which variables contribute to explaining the variance of the target (Chance.of.Admit).
We have now built 3 linear regression models: model_lm, model, and model_back. In each of them, R-squared and adjusted R-squared are close to each other. Following the rule of thumb in the third reference, we take this as a first hint that there is no multicollinearity in our data. This is only a rough heuristic, so we analyze it in more detail in the linear regression assumption tests.
Before moving on to prediction with test data, we make sure our model satisfies the assumptions of linear regression. There are four assumptions, described below. If the model fails to meet them, we simply cannot rely on it.
The linearity assumption expects every predictor to have a linear relationship with the target variable. Checking it matters because a linear regression model only captures linear patterns in the data, so the predictors we use must relate linearly to the target.
What we can infer from the graph above is that the residuals are spread around 0, with most fitted values between 0.5 and 0.8.
We expect the errors of a linear regression model to be normally distributed, which means most errors cluster around 0.
The residuals are skewed, with a long left tail (they range from about -0.26 to 0.16). This is a strong signal of a non-normal residual distribution. To check this statistically, we use the Shapiro-Wilk test.
shapiro.test(model_back$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_back$residuals
## W = 0.92193, p-value = 0.0000000000001443
The p-value given by the Shapiro-Wilk normality test is less than alpha (0.05). Therefore, the residuals of the model are not normally distributed.
We expect the model's errors/residuals to have a variance that doesn't form a pattern, so they must spread randomly.
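To get a feel for how the test reacts, here is a small sketch comparing a symmetric sample with a skewed one. Both samples are simulated and only stand in for residuals; they are not taken from our model.

```r
set.seed(7)  # simulated residual-like samples, for illustration only
symmetric <- rnorm(400)       # drawn from a normal distribution
skewed    <- 1 - rexp(400)    # mean near 0, with a long left tail

# Null hypothesis of the test: the sample comes from a normal distribution
p_symmetric <- shapiro.test(symmetric)$p.value
p_skewed    <- shapiro.test(skewed)$p.value
print(c(symmetric = p_symmetric, skewed = p_skewed))
```

A tiny p-value for the skewed sample rejects normality. Looking at `hist(skewed)` or `qqnorm(skewed)` alongside the test helps to see what kind of departure is driving the result.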
bptest(model_back)
##
## studentized Breusch-Pagan test
##
## data: model_back
## BP = 22.428, df = 5, p-value = 0.0004341
The Breusch-Pagan test gives a p-value less than alpha (0.05). Therefore, the model shows heteroscedasticity.
In our previous analysis, we stated that a small difference between R-squared and adjusted R-squared hints at no multicollinearity. In this section we verify that hypothesis with the vif() function.
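For intuition about what the test measures, here is a hand computation of the studentized (Koenker) version of the statistic on simulated data, using base R only. To the best of my understanding this is the variant that `bptest()` reports by default, but treat the sketch as an illustration rather than a replacement for the package function.

```r
set.seed(3)  # simulated data whose error spread grows with x
n <- 400
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor(s)
aux <- lm(resid(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared   # n times R-squared of the auxiliary fit
bp_df   <- length(coef(aux)) - 1        # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p.value = bp_p))
```

If the squared residuals can be predicted from the predictors, the error variance is not constant, which is exactly what a small p-value indicates.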
library(car)
vif(model_back)
## GRE.Score TOEFL.Score LOR CGPA Research
## 4.585053 4.104255 1.829491 4.808767 1.530007
From the result, we can say that multicollinearity doesn't exist in our data: the VIF of every predictor is less than 10.
After checking the assumptions for our linear regression model, we found that only 2 of the 4 assumptions are met (Linearity and No Multicollinearity). For the remaining two (Normality and Homoscedasticity), we do a further analysis in the next step.
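The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch of this computation on simulated predictors follows; the names `x1`, `x2`, `x3` are hypothetical, and for a model with only single-column terms the result should agree with `car::vif()`.

```r
set.seed(11)  # simulated predictors; x2 is built to overlap with x1
n  <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # strongly correlated with x1
x3 <- rnorm(n)                  # independent of the others
preds <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

The correlated pair gets clearly higher values than the independent predictor, whose VIF stays close to 1.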
We already know that our model doesn't yet meet all the linear regression assumptions. So how can we handle this problem? Luckily, this article (the second reference) shows a way to fix it. It states that if a log transformation is applied to the dependent variable only, this is equivalent to assuming that it grows (or decays) exponentially as a function of the independent variables. If a log transformation is applied to both the dependent variable and the independent variables, this is equivalent to assuming that the effects of the independent variables are multiplicative rather than additive in their original units.
So, in this step we choose to log-transform our dependent variable (Chance.of.Admit), using base 10.
graduate_log <- graduate_clean %>%
mutate(Chance.of.Admit = log10(Chance.of.Admit))
head(graduate_log)
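One practical consequence of this transformation: a model fitted on `graduate_log` predicts log10(Chance.of.Admit), so predictions have to be raised to the power of 10 to get back to the original 0 to 1 scale. A minimal sketch of the round trip, using the minimum, median, and maximum chance from the summary as example values:

```r
# Example chances on the original scale (min, median, max from summary())
chance <- c(0.34, 0.73, 0.97)

log_chance <- log10(chance)   # the scale the transformed model is fitted on
back       <- 10^log_chance   # undo the transformation

print(round(log_chance, 3))
print(isTRUE(all.equal(back, chance)))
```

Because every chance is below 1, all transformed values are negative, which is why the fitted values of the log model are negative too.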
After transforming our dependent variable, the next step is to build the model and evaluate it.
# Model fitting
model_grad_log <- lm(formula = Chance.of.Admit~., graduate_log)
summary(model_grad_log)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = graduate_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.218581 -0.017131 0.007131 0.027996 0.110971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4182330 0.0925664 -15.321 < 0.0000000000000002 ***
## GRE.Score 0.0010423 0.0004437 2.349 0.0193 *
## TOEFL.Score 0.0018417 0.0008086 2.278 0.0233 *
## University.Rating 0.0007993 0.0035403 0.226 0.8215
## SOP -0.0037254 0.0041275 -0.903 0.3673
## LOR 0.0166036 0.0041125 4.037 0.000065 ***
## CGPA 0.0797876 0.0090684 8.798 < 0.0000000000000002 ***
## Research1 0.0139027 0.0059072 2.354 0.0191 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04734 on 392 degrees of freedom
## Multiple R-squared: 0.7455, Adjusted R-squared: 0.741
## F-statistic: 164.1 on 7 and 392 DF, p-value: < 0.00000000000000022
It can be seen that University.Rating and SOP are again the predictors that are not statistically significant for the target (Chance.of.Admit).
model_grad_log_back <- step(model_grad_log,
direction = "backward",
trace = F)
summary(model_grad_log_back)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = graduate_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.223705 -0.016372 0.007085 0.027803 0.110051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4108842 0.0869699 -16.223 < 0.0000000000000002 ***
## GRE.Score 0.0010723 0.0004416 2.428 0.0156 *
## TOEFL.Score 0.0017438 0.0007898 2.208 0.0278 *
## LOR 0.0150532 0.0035621 4.226 0.0000296 ***
## CGPA 0.0785130 0.0087013 9.023 < 0.0000000000000002 ***
## Research1 0.0134695 0.0058728 2.294 0.0223 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04726 on 394 degrees of freedom
## Multiple R-squared: 0.745, Adjusted R-squared: 0.7417
## F-statistic: 230.2 on 5 and 394 DF, p-value: < 0.00000000000000022
The adjusted R-squared of this model (0.7417) is lower than that of the previous model fitted on the original data (0.8002). The two are measured on different scales of the target, so the comparison is only indicative. In addition, the predictors that contribute to explaining the variance of the target are still the same as in the model we built earlier.
Before evaluating the model on test data, it is better to check it against the linear regression assumptions first. Since we already know that 2 of the 4 assumptions are met, we only re-check the remaining two (Normality and Homoscedasticity).
The residuals are still skewed, with a long left tail, after the transformation.
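Coefficients of the log model are also read differently: they act multiplicatively on the original scale. As a small worked example, assuming the base-10 log used in this project, the CGPA coefficient can be converted into a percentage change per CGPA point.

```r
# CGPA coefficient copied from summary(model_grad_log_back)
b_cgpa <- 0.0785130

# On the original scale, one extra CGPA point multiplies the prediction by 10^b
factor_per_point <- 10^b_cgpa
pct_change <- (factor_per_point - 1) * 100
print(round(c(factor = factor_per_point, percent = pct_change), 2))
```

So, holding the other predictors fixed, one more CGPA point is associated with a predicted chance that is roughly a fifth higher in relative terms.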
shapiro.test(model_grad_log_back$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_grad_log_back$residuals
## W = 0.89568, p-value = 0.0000000000000006789
Because the p-value is still less than alpha (0.05), the residuals are still not normally distributed even though we have transformed the data.
bptest(model_grad_log_back)
##
## studentized Breusch-Pagan test
##
## data: model_grad_log_back
## BP = 35.288, df = 5, p-value = 0.000001318
With the Breusch-Pagan test, a p-value less than alpha (0.05) indicates that the model violates the homoscedasticity assumption.
Based on the results of the linear regression assumption tests, we conclude that the linear regression model still can't be used to predict the test data. In the future, we should try other algorithms or perform deeper pre-processing in order to obtain a better model that fulfills the linear regression assumptions. That way we can find out which factors affect a person's chances of being admitted to a postgraduate program.
From this series of analysis steps, we reached the following conclusions.
Based on the data exploration step, all the predictors have a linear relationship with the dependent variable (Chance.of.Admit). The predictor with the highest correlation is CGPA.
Based on the summary of model, of all the predictor variables, University.Rating and SOP have no significant effect on the chances of being admitted to the postgraduate program.
To choose the best features for the linear regression model, we used backward stepwise regression. The summary of model_back shows that 5 predictors contribute to explaining the variance of the target.
Even though we succeeded in building a linear regression model, a closer look at the assumption tests shows that it meets only 2 of the 4 linear regression assumptions. Therefore, in this project, we applied a log transformation to the dependent variable.
Overall, if we compare the initial model with the transformed model, the assumption checks do not change at all: normality and homoscedasticity are still violated. So a further pre-processing stage is necessary. I also think we need to try machine learning algorithms other than linear regression.
A suggestion for the future is to try a random forest regression model to predict the chance of admission and find out roughly which features affect it.
1. Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science, 2019
2. https://people.duke.edu/~rnau/testing.htm#:~:text=How%20to%20fix%3A%20consider,example%20on%20this%20web%20site
3. https://www.r-bloggers.com/2020/05/step-by-step-guide-on-how-to-build-linear-regression-in-r-with-code/#:~:text=A%20large%20difference%20between%20the%20R%2DSquared%20and%20Adjusted%20R%2Dsquared%20is%20not%20appreciated%20and%20generally%20indicates%20that%20multicollinearity%20exists%20within%20the%20data.