Business Problem
Context
This dataset was created for predicting graduate admissions from an Indian perspective.
Content
The dataset contains several parameters that are considered important when applying to Masters programs. The parameters included are:
GRE Scores (out of 340)
TOEFL Scores (out of 120)
University Rating (out of 5)
Statement of Purpose and Letter of Recommendation Strength (out of 5)
Undergraduate GPA (out of 10)
Research Experience (either 0 or 1)
Chance of Admit (ranging from 0 to 1)
Acknowledgements
This dataset is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya.
Inspiration
This dataset was built with the purpose of helping students in shortlisting universities with their profiles. The predicted output gives them a fair idea about their chances for a particular university.
The Goal
The goal of the modeling is to predict the probability of a student being admitted to the university.
Preparation
Load the libraries
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)
library(inspectdf)
options(scipen = 100, max.print = 1e+06)
Read the dataset
univ_data <- read.csv("archive/Admission_Predict.csv")
rmarkdown::paged_table(head(univ_data))
Checking dataset column types
str(univ_data)
## 'data.frame': 400 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
Because Research and University.Rating are categorical variables, we are going to remove them. We will also not use Serial.No., because it is just the student ID and carries no information about the chance of being admitted.
Check na values
colSums(is.na(univ_data))
## Serial.No. GRE.Score TOEFL.Score University.Rating
## 0 0 0 0
## SOP LOR CGPA Research
## 0 0 0 0
## Chance.of.Admit
## 0
The data has no NA values, so we can move on to the next step.
Summary of data
summary(univ_data)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:100.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :200.5 Median :317.0 Median :107.0 Median :3.000
## Mean :200.5 Mean :316.8 Mean :107.4 Mean :3.087
## 3rd Qu.:300.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :400.0 Max. :340.0 Max. :120.0 Max. :5.000
## SOP LOR CGPA Research
## Min. :1.0 Min. :1.000 Min. :6.800 Min. :0.0000
## 1st Qu.:2.5 1st Qu.:3.000 1st Qu.:8.170 1st Qu.:0.0000
## Median :3.5 Median :3.500 Median :8.610 Median :1.0000
## Mean :3.4 Mean :3.453 Mean :8.599 Mean :0.5475
## 3rd Qu.:4.0 3rd Qu.:4.000 3rd Qu.:9.062 3rd Qu.:1.0000
## Max. :5.0 Max. :5.000 Max. :9.920 Max. :1.0000
## Chance.of.Admit
## Min. :0.3400
## 1st Qu.:0.6400
## Median :0.7300
## Mean :0.7244
## 3rd Qu.:0.8300
## Max. :0.9700
We now remove the Research, University.Rating, and Serial.No. columns.
univ_data_clean <- univ_data %>% select(-c(Research, University.Rating, Serial.No.))
Exploratory Data Analysis
Plot the Pearson correlation between features
ggcorr(univ_data_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
Inspect the distribution of the data
univ_data_clean %>%
inspect_num() %>%
show_plot()
Almost all of the feature distributions look approximately normal.
Plot chance of admit with other feature
plot(univ_data_clean$GRE.Score, univ_data_clean$Chance.of.Admit)
plot(univ_data_clean$TOEFL.Score, univ_data_clean$Chance.of.Admit)
plot(univ_data_clean$SOP, univ_data_clean$Chance.of.Admit)
plot(univ_data_clean$LOR, univ_data_clean$Chance.of.Admit)
plot(univ_data_clean$CGPA, univ_data_clean$Chance.of.Admit)
Each feature looks roughly linear with respect to Chance of Admit, and there are no obvious outliers in the data.
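As an optional alternative (a sketch; its output is not shown here), the same five scatter plots can be drawn in a single faceted ggplot by pivoting the predictors into long format:
univ_data_clean %>%
  pivot_longer(cols = -Chance.of.Admit, names_to = "feature", values_to = "value") %>% # reshape predictors to long format
  ggplot(aes(x = value, y = Chance.of.Admit)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ feature, scales = "free_x") + # one panel per predictor
  theme_minimal()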
Split data
RNGkind(sample.kind = "Rounding") # make set.seed() results reproducible between R 3.x and 4.x
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417) # lock the random numbers produced by sample()
univ_split <- sample(nrow(univ_data_clean), nrow(univ_data_clean)*0.80) # 80% of the data
univ_train <- univ_data_clean[univ_split, ] # 80% training data
univ_test <- univ_data_clean[-univ_split, ] # 20% test data
We split the data into training and test sets with an 80:20 proportion.
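As a side note (a sketch, not the split used in this analysis), caret also provides createDataPartition(), which samples so that the distribution of the target stays similar between train and test; idx, univ_train_alt, and univ_test_alt are hypothetical names:
idx <- createDataPartition(univ_data_clean$Chance.of.Admit, p = 0.8, list = FALSE) # balanced 80% sample on the target
univ_train_alt <- univ_data_clean[idx, ]  # alternative 80% training set
univ_test_alt  <- univ_data_clean[-idx, ] # alternative 20% test set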
Plot train and test data distribution
univ_train %>%
inspect_num() %>%
show_plot()
univ_test %>%
inspect_num() %>%
show_plot()
After splitting, the train and test distributions still look approximately normal.
Modeling
Make linear regression model
model_rm <- lm(formula = Chance.of.Admit ~ ., data = univ_train)
summary(model_rm)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = univ_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28266 -0.02316 0.00914 0.03661 0.16555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5480609 0.1233053 -12.555 < 0.0000000000000002 ***
## GRE.Score 0.0027677 0.0006627 4.176 0.0000384 ***
## TOEFL.Score 0.0028498 0.0012089 2.357 0.019 *
## SOP -0.0005262 0.0057864 -0.091 0.928
## LOR 0.0246474 0.0061795 3.989 0.0000828 ***
## CGPA 0.1170762 0.0134516 8.704 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06499 on 314 degrees of freedom
## Multiple R-squared: 0.8019, Adjusted R-squared: 0.7987
## F-statistic: 254.2 on 5 and 314 DF, p-value: < 0.00000000000000022
- Regression model (a worked prediction example follows this list):
chance of admit = -1.5480609 + 0.0027677 * gre + 0.0028498 * toefl - 0.0005262 * sop + 0.0246474 * lor + 0.1170762 * cgpa
- Intercept: when gre, toefl, sop, lor, and cgpa are all 0, the predicted chance of admit is -1.55. This has no practical interpretation, since a probability cannot be negative and no applicant has all-zero scores.
- Coefficient slopes: gre, toefl, lor, and cgpa are positively related to chance of admit, while sop has a small (and not significant) negative coefficient.
- Effect of the predictor variables on the target
- H0: gre, toefl, lor, cgpa have no effect on chance of admit
- H1: gre, toefl, lor, cgpa have an effect on chance of admit
Reject H0 if the p-value < alpha (5%). From the output, the p-values for these predictors are < 0.05, so we conclude that gre, toefl, lor, and cgpa significantly affect chance of admit.
- Goodness of fit
Multiple R-squared = 0.8019, so the predictors explain 80.19% of the variance in chance of admit.
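To make the equation concrete, here is a worked example for a hypothetical applicant (GRE = 320, TOEFL = 110, SOP = 4, LOR = 4, CGPA = 9; these values are made up, not taken from the dataset):
new_applicant <- data.frame(GRE.Score = 320, TOEFL.Score = 110, SOP = 4, LOR = 4, CGPA = 9) # hypothetical applicant
predict(model_rm, newdata = new_applicant) # predicted chance of admit
# by hand: -1.5480609 + 0.0027677*320 + 0.0028498*110 - 0.0005262*4 + 0.0246474*4 + 0.1170762*9, which is about 0.80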
Evaluation
Make predictions and compute the MSE, RMSE, and MAE
predicted_univ <- predict(object = model_rm, newdata = univ_test)
MSE(y_pred = predicted_univ, y_true = univ_test$Chance.of.Admit)
## [1] 0.003962463
RMSE(y_pred = predicted_univ, y_true = univ_test$Chance.of.Admit)
## [1] 0.0629481
MAE(y_pred = predicted_univ, y_true = univ_test$Chance.of.Admit)
## [1] 0.04393572
The model has MSE, RMSE, and MAE values of 0.003962463, 0.0629481, and 0.04393572 respectively.
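For reference, the same three metrics can be reproduced directly from the prediction errors (a minimal sketch of the formulas MLmetrics implements; err is just a helper name):
err <- univ_test$Chance.of.Admit - predicted_univ # prediction errors on the test set
mean(err^2)        # MSE: mean squared error
sqrt(mean(err^2))  # RMSE: root mean squared error
mean(abs(err))     # MAE: mean absolute error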
Plot distribution of residual
hist(model_rm$residuals)
The residual distribution looks good: it is centered around 0 and resembles a normal distribution.
Plot predicted value and actual value
univ_test$predicted <- predicted_univ
univ_admition <- univ_test %>%
pivot_longer(cols = c(predicted, Chance.of.Admit), names_to = "var", values_to = "value")
ggplot(data = univ_admition, aes(x = GRE.Score, y = value)) +
geom_point(aes(col = var)) +
labs(title = "Chance Of admit VS GRE Score",
col = NULL,
x = "GRE Score",
y = "Chance Of admit") +
theme_minimal()
Most predicted values are close to the actual values, except for a few predictions in extreme positions such as around (315, 0.4), where there are no actual values nearby; this shows the model does not fully capture the variance of the data.
Model Improvement: Backward Elimination
Using backward elimination
backward <- step(object = model_rm, trace = 0)
summary(backward)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA, data = univ_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.282824 -0.023104 0.009397 0.036541 0.165416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5457992 0.1205809 -12.820 < 0.0000000000000002 ***
## GRE.Score 0.0027709 0.0006608 4.193 0.0000358 ***
## TOEFL.Score 0.0028322 0.0011914 2.377 0.018 *
## LOR 0.0243889 0.0054785 4.452 0.0000118 ***
## CGPA 0.1168115 0.0131122 8.909 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06489 on 315 degrees of freedom
## Multiple R-squared: 0.8019, Adjusted R-squared: 0.7994
## F-statistic: 318.7 on 4 and 315 DF, p-value: < 0.00000000000000022
The model did not improve at all; it still has essentially the same R-squared. Next we use k-fold cross-validation to try to improve the model.
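As an optional sanity check (a sketch; output not shown), step() selects by AIC, so the two models can also be compared on that criterion directly even though their R-squared values are the same:
AIC(model_rm, backward) # lower AIC indicates the preferred model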
Using k-fold cross validation
set.seed(125)
train_control <- trainControl(method = "cv",
number = 10)
model_rm_2 <- train( Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA,
method="lm",
data=univ_train,
trControl = train_control)
print(model_rm_2)
## Linear Regression
##
## 320 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 287, 288, 288, 287, 289, 288, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.06493734 0.8048766 0.04720543
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The model improved slightly, from an R-squared of 0.8019 to a cross-validated R-squared of 0.8049.
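The fold-by-fold results behind that number are stored by caret and can be inspected directly (a sketch; output not shown):
model_rm_2$resample                    # RMSE, Rsquared, and MAE for each of the 10 folds
summary(model_rm_2$resample$Rsquared)  # spread of R-squared across the folds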
Make new predictions and compute the MSE, RMSE, and MAE
predicted_univ_1 <- predict(object = model_rm_2, newdata = univ_test)
MSE(y_pred = predicted_univ_1, y_true = univ_test$Chance.of.Admit)
## [1] 0.003961284
RMSE(y_pred = predicted_univ_1, y_true = univ_test$Chance.of.Admit)
## [1] 0.06293873
MAE(y_pred = predicted_univ_1, y_true = univ_test$Chance.of.Admit)
## [1] 0.04391287
Plot the model prediction along with confidence interval
univ_test$predicted2 <- predicted_univ_1
univ_admition2 <- univ_test %>%
pivot_longer(cols = c(predicted2, Chance.of.Admit), names_to = "var", values_to = "value")
ggplot(data = univ_admition2, aes(x = GRE.Score, y = value)) +
geom_point(aes(col = var)) +
labs(title = "Chance Of admit VS GRE Score",
col = NULL,
x = "GRE Score",
y = "Chance Of admit") +
geom_smooth(method = "lm", level = 0.95) +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
The model seems slightly better because most points in the graph are close to the predicted values, but it still struggles with the variance of the data.
Check Assumptions
1. No-Multicolinearity
vif(model_rm_2$finalModel)
## GRE.Score TOEFL.Score LOR CGPA
## 4.429674 3.968426 1.854264 4.729297
The model does not suffer from multicollinearity because all VIF values are below 10.
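As an illustration of what VIF measures (a sketch, assuming the same training data): regress one predictor on the others and apply VIF = 1 / (1 - R^2); the result should match the vif() value for CGPA above. r2_cgpa is just a helper name.
r2_cgpa <- summary(lm(CGPA ~ GRE.Score + TOEFL.Score + LOR, data = univ_train))$r.squared # R^2 of CGPA regressed on the other predictors
1 / (1 - r2_cgpa) # manual VIF for CGPA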
2. Linearity
ggcorr(univ_data_clean, label = T)
All predictors show a roughly linear relationship with chance of admit.
3. Normality Error
hist(resid(model_rm_2))
The errors look approximately normally distributed, with most residuals concentrated around 0.
4. Shapiro-Wilk Test
Shapiro-Wilk hypotheses: - H0: the errors/residuals are normally distributed - H1: the errors/residuals are not normally distributed
shapiro.test(resid(model_rm_2))
##
## Shapiro-Wilk normality test
##
## data: resid(model_rm_2)
## W = 0.92778, p-value = 0.00000000002496
Because the p-value = 0.00000000002496 < 0.05, we reject H0: the residuals are not normally distributed, so the model does not fulfill the normality assumption.
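A Q-Q plot is another way to see this departure from normality (a sketch; the plot is not part of the original output):
qqnorm(resid(model_rm_2)) # residual quantiles vs theoretical normal quantiles
qqline(resid(model_rm_2)) # reference line; points far from it indicate non-normality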
5. Homoscedasticity
Breusch-Pagan hypotheses: - H0: homoscedasticity - H1: heteroscedasticity
bptest(model_rm_2$finalModel)
##
## studentized Breusch-Pagan test
##
## data: model_rm_2$finalModel
## BP = 13.134, df = 4, p-value = 0.01064
Because the p-value = 0.01064 < 0.05, we reject H0: the model shows heteroscedasticity, so the homoscedasticity assumption is not satisfied.
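A residuals-versus-fitted plot is a visual counterpart to the Breusch-Pagan test (a sketch; the plot is not shown here); a funnel or fan shape would confirm the heteroscedasticity the test detected:
plot(fitted(model_rm_2$finalModel), resid(model_rm_2$finalModel), xlab = "Fitted values", ylab = "Residuals") # look for non-constant spread
abline(h = 0, lty = 2) # reference line at zero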
Conclusion
The model has an R-squared of about 0.804 and is able to predict the admission probability on the test data with an MAE of about 0.0439. The model is therefore reasonably capable of profiling a student's chance of admission given new data, although the normality and homoscedasticity checks suggest there is still room for improvement.