Hi, welcome to my Learning by Building (LBB) for Regression Model. In this project, I will try to analyze and predict Graduate Admissions for applicants to a Master's program. I found the dataset on kaggle.com.
1 Load Library
Here are the libraries I will use in this LBB.
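The exact set of packages loaded in the original chunk is not shown; below is a sketch inferred from the functions used later in this document (corrplot.mixed, ggplot, and the Breusch-Pagan test).
library(corrplot)  # correlation plot (corrplot.mixed)
library(ggplot2)   # scatter plot of CGPA vs Chance of Admit
library(lmtest)    # Breusch-Pagan test (bptest)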
2 Load Dataset
Let's load the data for this project.
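A minimal sketch of reading the data; the CSV file name here is an assumption based on the Kaggle Graduate Admissions dataset.
# Read the admissions data (file name is a placeholder)
admission <- read.csv("Admission_Predict_Ver1.1.csv")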
The data consists of 9 columns, which are:
Serial.No. : serial number of the student
GRE.Score : Graduate Record Examination score
TOEFL.Score : score on the standardized test used to measure the English-language ability of non-native speakers wishing to enroll in English-speaking universities
University.Rating : rating of the university (1 to 5)
SOP : Statement of Purpose
LOR : Letter of Recommendation
CGPA : undergraduate GPA
Research : whether the student has done research during their undergraduate degree or not
Chance.of.Admit : the probability of being admitted (between 0 and 1); this is the variable to be predicted
3 Exploratory Data Analysis
Now, let's check whether each column of our dataset already has an appropriate data type.
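The structure below is produced with str():
# Inspect the structure and column types of the data
str(admission)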
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
The data types are already appropriate. Now, let's check whether there are any missing values in the data.
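A common way to check, consistent with the output below, is to count NA values per column:
# Count missing values in each column
colSums(is.na(admission))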
## Serial.No. GRE.Score TOEFL.Score University.Rating
## 0 0 0 0
## SOP LOR CGPA Research
## 0 0 0 0
## Chance.of.Admit
## 0
Our data has no missing values.
Now, let's remove the Serial.No. column, since it only gives us the row number.
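A minimal sketch of dropping the column (the original chunk is not shown; any base-R or dplyr approach works here):
# Drop the Serial.No. column and keep all other variables
admission <- admission[, names(admission) != "Serial.No."]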
Let’s check the summary for the dataset we have
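The output below comes from summary():
# Five-number summary and mean for each remaining column
summary(admission)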
## GRE.Score TOEFL.Score University.Rating SOP
## Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
## 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
## Median :317.0 Median :107.0 Median :3.000 Median :3.500
## Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
## 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
## LOR CGPA Research Chance.of.Admit
## Min. :1.000 Min. :6.800 Min. :0.00 Min. :0.3400
## 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00 1st Qu.:0.6300
## Median :3.500 Median :8.560 Median :1.00 Median :0.7200
## Mean :3.484 Mean :8.576 Mean :0.56 Mean :0.7217
## 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00 3rd Qu.:0.8200
## Max. :5.000 Max. :9.920 Max. :1.00 Max. :0.9700
As we can see, the summary does not suggest any extreme outliers in our dataset, and Chance.of.Admit ranges from 0.34 to 0.97 (34% to 97%).
Now, let's check the correlation between our variables. In the chunk below, I sort the variables from the most correlated with Chance.of.Admit to the least correlated.
numericVars <- which(sapply(admission, is.numeric))
all_numVar <- admission[, numericVars]
cor_numVar <- cor(all_numVar, use = "pairwise.complete.obs")
# Sort on decreasing correlations with Admission Probability
cor_sorted <- as.matrix(sort(cor_numVar[, "Chance.of.Admit"], decreasing = TRUE))
# Selecting high correlations
Cor_High <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.25)))
cor_numVar <- cor_numVar[Cor_High, Cor_High]
corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")
As we can see, most of the variables have a fairly strong correlation with Chance.of.Admit, and the strongest one is CGPA.
Now let's look more closely at the relationship between Chance.of.Admit and CGPA.
ggplot(admission, aes(x = CGPA, y = Chance.of.Admit)) +
  geom_point(col = "orchid") +
  geom_smooth(method = "lm", se = FALSE, col = "red") +
  labs(title = "College GPA vs Chance of Admit", x = "CGPA", y = "Chance of Admit") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid = element_blank()
  )
From the graph, we can see that the higher your GPA is, the better your chance of being admitted.
4 Linear Regression
Now, let's create a linear regression model of Chance.of.Admit as a function of CGPA, the most strongly correlated predictor.
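A sketch of fitting this model; the name model_lm1 matches the object used later in the prediction section.
# Simple linear regression: Chance.of.Admit explained by CGPA only
model_lm1 <- lm(Chance.of.Admit ~ CGPA, data = admission)
summary(model_lm1)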
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.276592 -0.028169 0.006619 0.038483 0.176961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.04434 0.04230 -24.69 <0.0000000000000002 ***
## CGPA 0.20592 0.00492 41.85 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06647 on 498 degrees of freedom
## Multiple R-squared: 0.7787, Adjusted R-squared: 0.7782
## F-statistic: 1752 on 1 and 498 DF, p-value: < 0.00000000000000022
Using only the single most correlated predictor, the Adjusted R-squared we get is 0.7782. Now, let's see what we get if we put all predictors into the model.
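A sketch of the all-predictor model; the name model_lm2 matches the object used later.
# Linear regression using every remaining column as a predictor
model_lm2 <- lm(Chance.of.Admit ~ ., data = admission)
summary(model_lm2)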
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.266657 -0.023327 0.009191 0.033714 0.156818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2757251 0.1042962 -12.232 < 0.0000000000000002 ***
## GRE.Score 0.0018585 0.0005023 3.700 0.000240 ***
## TOEFL.Score 0.0027780 0.0008724 3.184 0.001544 **
## University.Rating 0.0059414 0.0038019 1.563 0.118753
## SOP 0.0015861 0.0045627 0.348 0.728263
## LOR 0.0168587 0.0041379 4.074 0.0000538 ***
## CGPA 0.1183851 0.0097051 12.198 < 0.0000000000000002 ***
## Research 0.0243075 0.0066057 3.680 0.000259 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05999 on 492 degrees of freedom
## Multiple R-squared: 0.8219, Adjusted R-squared: 0.8194
## F-statistic: 324.4 on 7 and 492 DF, p-value: < 0.00000000000000022
The Adjusted R-squared using all predictors is 0.8194, which is higher, although the difference is not dramatic.
Now, let's check what Adjusted R-squared we get if we use stepwise selection. In this case, I will use backward elimination.
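A sketch of the backward elimination, assuming it starts from the full model and is stored as model_lm3 (the name used later).
# Backward stepwise selection based on AIC, starting from the full model
model_lm3 <- step(model_lm2, direction = "backward")
summary(model_lm3)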
## Start: AIC=-2805.71
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.00043 1.7708 -2807.6
## <none> 1.7704 -2805.7
## - University.Rating 1 0.00879 1.7792 -2805.2
## - TOEFL.Score 1 0.03648 1.8069 -2797.5
## - Research 1 0.04872 1.8191 -2794.1
## - GRE.Score 1 0.04926 1.8196 -2794.0
## - LOR 1 0.05973 1.8301 -2791.1
## - CGPA 1 0.53542 2.3058 -2675.6
##
## Step: AIC=-2807.59
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## <none> 1.7708 -2807.6
## - University.Rating 1 0.01190 1.7827 -2806.2
## - TOEFL.Score 1 0.03760 1.8084 -2799.1
## - Research 1 0.04893 1.8197 -2796.0
## - GRE.Score 1 0.04901 1.8198 -2795.9
## - LOR 1 0.06892 1.8397 -2790.5
## - CGPA 1 0.55954 2.3304 -2672.3
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research, data = admission)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26617 -0.02321 0.00946 0.03345 0.15713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2800138 0.1034717 -12.371 < 0.0000000000000002 ***
## GRE.Score 0.0018528 0.0005016 3.694 0.000246 ***
## TOEFL.Score 0.0028072 0.0008676 3.236 0.001295 **
## University.Rating 0.0064279 0.0035318 1.820 0.069363 .
## LOR 0.0172873 0.0039464 4.380 0.0000145 ***
## CGPA 0.1189994 0.0095344 12.481 < 0.0000000000000002 ***
## Research 0.0243538 0.0065985 3.691 0.000248 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05993 on 493 degrees of freedom
## Multiple R-squared: 0.8219, Adjusted R-squared: 0.8197
## F-statistic: 379.1 on 6 and 493 DF, p-value: < 0.00000000000000022
From the model, we can see that the predictors retained by backward elimination are GRE.Score, TOEFL.Score, University.Rating, LOR, CGPA, and Research, and the Adjusted R-squared from this model is slightly higher, at 0.8197. The improvement is not large, but it means we can eliminate a predictor that is not very important and still get a slightly better result.
5 Model Prediction & Error
Now, let's compute the predictions for each model and evaluate their errors so we can decide which model we will use.
5.1 Model Prediction
First, let’s create the prediction for each model
# Predict on the full data frame; predict() only uses the columns each model needs
admission$pred_lm1 <- predict(model_lm1, admission)
admission$pred_lm2 <- predict(model_lm2, admission)
admission$pred_lm3 <- predict(model_lm3, admission)
5.2 Model Error
Now, let's evaluate the error for each model. I will use two kinds of error metrics.
5.2.1 Mean Squared Error (MSE)
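The three values below are presumably the MSE of model_lm1, model_lm2, and model_lm3 in that order; a base-R sketch of computing them:
# Mean Squared Error: average squared difference between actual and predicted values
mean((admission$Chance.of.Admit - admission$pred_lm1)^2)
mean((admission$Chance.of.Admit - admission$pred_lm2)^2)
mean((admission$Chance.of.Admit - admission$pred_lm3)^2)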
## [1] 0.00440057
## [1] 0.003540751
## [1] 0.003541621
5.2.2 Mean Absolute Percentage Error (MAPE)
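Again presumably one value per model, in the same order; a base-R sketch:
# Mean Absolute Percentage Error: average of |actual - predicted| / actual
mean(abs((admission$Chance.of.Admit - admission$pred_lm1) / admission$Chance.of.Admit))
mean(abs((admission$Chance.of.Admit - admission$pred_lm2) / admission$Chance.of.Admit))
mean(abs((admission$Chance.of.Admit - admission$pred_lm3) / admission$Chance.of.Admit))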
## [1] 0.07678185
## [1] 0.06853769
## [1] 0.06851627
From the results above, model_lm2 and model_lm3 perform almost identically, with model_lm2 giving the smallest MSE. The smaller the error, the closer the predictions are to the actual values. So we will use model_lm2, and let's evaluate the model!
6 Model Evaluation
6.1 Normality
Now, let's check the normality of the model_lm2 residuals.
hist(model_lm2$residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals",
     border = "blue",
     col = "lightyellow")
Now, let's check the normality formally with shapiro.test.
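The output below comes from running the test on the model residuals:
# Shapiro-Wilk test: H0 = the residuals are normally distributed
shapiro.test(model_lm2$residuals)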
##
## Shapiro-Wilk normality test
##
## data: model_lm2$residuals
## W = 0.92549, p-value = 0.000000000000004824
The p-value we get from the test is lower than 0.05, which means the residuals are not normally distributed.
6.2 Heteroscedasticity
plot(admission$CGPA, model_lm2$residuals,
     xlab = "CGPA",
     ylab = "Residuals")
abline(h = 0, col = "red")
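The Breusch-Pagan output below is presumably produced with bptest() from the lmtest package:
# Breusch-Pagan test: H0 = the residual variance is constant (homoscedastic)
bptest(model_lm2)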
##
## studentized Breusch-Pagan test
##
## data: model_lm2
## BP = 30.516, df = 7, p-value = 0.00007634
The p-value we get from the test is lower than 0.05, which means we reject the assumption of constant variance: the residuals show heteroscedasticity.
7 Conclusion
The model_lm2 gives the best result, with an Adjusted R-squared of 0.8194, an MSE of 0.003540751, and a MAPE of 0.06853769, which indicates that the predictors together have a strong relationship with the probability of being accepted. However, the p-values from the normality and heteroscedasticity tests are below 0.05, which means the model's assumptions are violated, so we should be careful when applying the model to new data; it might give less reliable estimates of Chance.of.Admit for new observations.