Context: This dataset was created to predict graduate admissions from an Indian perspective.
Objective: What are the chances of a student getting an admission into grad school?
Content: The dataset contains several parameters which are considered important during the application for Masters programs. The parameters included are:

| Sr. No. | Field Name | Range |
| --- | --- | --- |
| 1 | GRE Scores | out of 340 |
| 2 | TOEFL Scores | out of 120 |
| 3 | University Rating | out of 5 |
| 4 | SOP and LOR Strength | out of 5 |
| 5 | Undergraduate GPA | out of 10 |
| 6 | Research Experience | either 0 or 1 |
| 7 | Chance of Admit | ranging from 0 to 1 |
Acknowledgements: This dataset is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya.
Inspiration: This dataset was built to help students shortlist universities based on their profiles. The predicted output gives them a fair idea of their chances at a particular university.
Citation: Please cite the following if you are interested in using the dataset: Mohan S Acharya, Asfia Armaan and Aneeta S Antony, A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science, 2019.
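The analysis below relies on functions from several R packages, but the original setup chunk is not shown. The following is an assumed list of the libraries needed to run the code; in particular, the uppercase VIF() used later is assumed to come from the regclass package.

# Assumed setup chunk: packages used by the code in this analysis
library(psych)         # pairs.panels(), multi.hist()
library(caTools)       # sample.split()
library(leaps)         # leaps(), regsubsets()
library(regclass)      # VIF() (assumed source of the uppercase VIF below)
library(MASS)          # boxcox()
library(caret)         # trainControl(), train(), RMSE()
library(rpart)         # rpart(), prune(), printcp(), plotcp()
library(rpart.plot)    # rpart.plot()
library(randomForest)  # randomForest(), importance(), varImpPlot()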
# Make sure the CSV is copied to working directory
admitdata <- read.csv("Admission_Predict_Ver1.1.csv")
# Summarizing the dimensions and variable characteristics of the initial dataset
dim(admitdata)
## [1] 500 9
summary(admitdata)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:125.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :250.5 Median :317.0 Median :107.0 Median :3.000
## Mean :250.5 Mean :316.5 Mean :107.2 Mean :3.114
## 3rd Qu.:375.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :500.0 Max. :340.0 Max. :120.0 Max. :5.000
## SOP LOR CGPA Research
## Min. :1.000 Min. :1.000 Min. :6.800 Min. :0.00
## 1st Qu.:2.500 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00
## Median :3.500 Median :3.500 Median :8.560 Median :1.00
## Mean :3.374 Mean :3.484 Mean :8.576 Mean :0.56
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00
## Max. :5.000 Max. :5.000 Max. :9.920 Max. :1.00
## Chance.of.Admit
## Min. :0.3400
## 1st Qu.:0.6300
## Median :0.7200
## Mean :0.7217
## 3rd Qu.:0.8200
## Max. :0.9700
It can be observed that all the columns are of a numerical data type. However, the Research column takes only two possible values (0 and 1).
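A quick tabulation (a check we add here; it is not part of the original output) confirms the binary nature of this column:

# Confirming that Research takes only the values 0 and 1
table(admitdata$Research)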
#Eliminating Duplicate Records
admitdata_1 <- unique(admitdata)
#Checking for the null values
sum(is.na(admitdata_1))
## [1] 0
#Dropping the Serial No. column and renaming columns
admitdata_2 = admitdata_1[c(-1)]
names(admitdata_2)[4] <- "SOP.Rating"
names(admitdata_2)[5] <- "LOR.Rating"
names(admitdata_2)[7] <- "Research.Experience"
head(admitdata_2)
## GRE.Score TOEFL.Score University.Rating SOP.Rating LOR.Rating CGPA
## 1 337 118 4 4.5 4.5 9.65
## 2 324 107 4 4.0 4.5 8.87
## 3 316 104 3 3.0 3.5 8.00
## 4 322 110 3 3.5 2.5 8.67
## 5 314 103 2 2.0 3.0 8.21
## 6 330 115 5 4.5 3.0 9.34
## Research.Experience Chance.of.Admit
## 1 1 0.92
## 2 1 0.76
## 3 1 0.72
## 4 1 0.80
## 5 0 0.65
## 6 1 0.90
There are no duplicate records or null values.
We have dropped the ‘Serial No.’ column from the dataset as we won't be needing it for our analysis. For uniformity, we have also renamed the column headers ‘SOP’, ‘LOR’ and ‘Research’ to ‘SOP.Rating’, ‘LOR.Rating’ and ‘Research.Experience’ respectively.
The dataset is now clean and ready for analysis.
# Summarizing the dimensions and variable characteristics of the cleaned dataset
str(admitdata_2)
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating : int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP.Rating : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR.Rating : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research.Experience: int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
dim(admitdata_2)
## [1] 500 8
summary(admitdata_2)
## GRE.Score TOEFL.Score University.Rating SOP.Rating
## Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
## 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
## Median :317.0 Median :107.0 Median :3.000 Median :3.500
## Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
## 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
## LOR.Rating CGPA Research.Experience Chance.of.Admit
## Min. :1.000 Min. :6.800 Min. :0.00 Min. :0.3400
## 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00 1st Qu.:0.6300
## Median :3.500 Median :8.560 Median :1.00 Median :0.7200
## Mean :3.484 Mean :8.576 Mean :0.56 Mean :0.7217
## 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00 3rd Qu.:0.8200
## Max. :5.000 Max. :9.920 Max. :1.00 Max. :0.9700
# Correlation Plots, Histograms and Boxplots
pairs.panels(admitdata_2,gap = 0)
multi.hist(admitdata_2[,sapply(admitdata_2, is.numeric)])
par(mfrow=c(2,2)) # arrange the box-and-whisker plots in a 2 x 2 grid
boxplot(admitdata_2$GRE.Score, main="GRE Score")
boxplot(admitdata_2$TOEFL.Score, main ="TOEFL Score")
boxplot(admitdata_2$University.Rating, main="University Rating")
boxplot(admitdata_2$SOP.Rating, main ="SOP Rating")
boxplot(admitdata_2$LOR.Rating, main ="LOR Rating")
boxplot(admitdata_2$CGPA, main="CGPA")
boxplot(admitdata_2$Chance.of.Admit, main="Chance of Admit")
We see a high correlation between ‘Chance.of.Admit’ and the three score variables (‘CGPA’, ‘GRE.Score’ and ‘TOEFL.Score’); these score variables are also strongly correlated with one another.
We can see from the histograms that, except for Research Experience (which is a binary response), all the variables are more or less normally distributed around their means. So we do not need to apply any kind of transformation to the columns at this stage.
From the boxplots, there are a few outliers in ‘LOR.Rating’ and ‘Chance.of.Admit’. However, since our dataset is small, removing these outliers might significantly change its composition, so we decided to leave them as they are.
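If desired, the points flagged by the 1.5 × IQR boxplot rule can be counted directly; a small sketch using base R's boxplot.stats():

# Counting the points the boxplots flag as outliers
length(boxplot.stats(admitdata_2$LOR.Rating)$out)
length(boxplot.stats(admitdata_2$Chance.of.Admit)$out)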
# split data into train and test set
set.seed(123) # set random seed
split = sample.split(admitdata_2$Chance.of.Admit, SplitRatio = 0.607)
train_set = subset(admitdata_2, split == TRUE) # training set
test_set = subset(admitdata_2, split == FALSE) # test set
Upon splitting the dataset with this random seed, the training set has 300 observations and the test set has 200 observations.
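These sizes can be verified directly:

# Quick check of the split sizes (expected: 300 and 200)
nrow(train_set)
nrow(test_set)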
# Multiple Linear regression model with all the variables
model_1 = lm(Chance.of.Admit ~ ., data = train_set)
summary(model_1)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.234236 -0.024014 0.007303 0.034962 0.154220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2858625 0.1390077 -9.250 < 2e-16 ***
## GRE.Score 0.0020697 0.0006739 3.071 0.00233 **
## TOEFL.Score 0.0024059 0.0012241 1.966 0.05030 .
## University.Rating 0.0063646 0.0049472 1.287 0.19928
## SOP.Rating 0.0073533 0.0059144 1.243 0.21476
## LOR.Rating 0.0134878 0.0052862 2.552 0.01124 *
## CGPA 0.1151881 0.0125123 9.206 < 2e-16 ***
## Research.Experience 0.0249928 0.0085414 2.926 0.00370 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0606 on 292 degrees of freedom
## Multiple R-squared: 0.82, Adjusted R-squared: 0.8157
## F-statistic: 190 on 7 and 292 DF, p-value: < 2.2e-16
mean(summary(model_1)$residuals^2)
## [1] 0.003573885
We observe that ‘GRE Score’, ‘LOR Rating’, ‘CGPA’ and ‘Research Experience’ are significant, while ‘TOEFL Score’, ‘University Rating’ and ‘SOP Rating’ are not.
The adjusted R-squared is 0.816, which is quite high, and the F-statistic has a p-value below 0.05, suggesting that the model is able to predict the dependent variable.
However, let us perform a variable selection process to identify the significant covariates.
# Adjusted R-squared method (the response, Chance.of.Admit, is column 8)
model_2 <- leaps(x = train_set[, -8], y = train_set[, 8], method = "adjr2")
model_2
## $which
## 1 2 3 4 5 6 7
## 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 1 FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## 1 FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## 2 TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## 2 TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## 2 FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## 2 FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## 3 TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE FALSE FALSE TRUE FALSE TRUE
## 3 TRUE TRUE FALSE FALSE TRUE FALSE FALSE
## 3 TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE FALSE TRUE FALSE TRUE FALSE FALSE
## 3 TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 3 TRUE FALSE TRUE FALSE FALSE FALSE TRUE
## 3 TRUE FALSE FALSE TRUE FALSE FALSE TRUE
## 3 TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## 4 TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## 4 TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## 4 TRUE FALSE TRUE FALSE TRUE TRUE FALSE
## 4 TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## 4 TRUE FALSE FALSE TRUE TRUE FALSE TRUE
## 4 TRUE TRUE FALSE FALSE FALSE TRUE TRUE
## 4 TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## 4 TRUE FALSE FALSE TRUE FALSE TRUE TRUE
## 4 TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## 4 TRUE TRUE FALSE FALSE TRUE FALSE TRUE
## 5 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## 5 TRUE FALSE TRUE FALSE TRUE TRUE TRUE
## 5 TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## 5 TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## 5 TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## 5 TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## 5 TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## 5 TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## 5 TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## 5 TRUE TRUE FALSE TRUE FALSE TRUE TRUE
## 6 TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 6 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 6 TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## 6 FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3" "4"
## [6] "5" "6" "7"
##
## $size
## [1] 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6
## [39] 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 8
##
## $adjr2
## [1] 0.7066050 0.6748175 0.6617018 0.4161640 0.4008062 0.3066930 0.2992672
## [8] 0.7641087 0.7634878 0.7378060 0.7157976 0.7157957 0.7138432 0.7109018
## [15] 0.7069969 0.6827763 0.6802347 0.7817418 0.7752558 0.7749289 0.7635554
## [22] 0.7634160 0.7633530 0.7630372 0.7628187 0.7627430 0.7428243 0.7870539
## [29] 0.7812901 0.7810118 0.7810037 0.7751674 0.7745738 0.7745611 0.7745504
## [36] 0.7743073 0.7741681 0.7873610 0.7865704 0.7864696 0.7805534 0.7805494
## [43] 0.7802670 0.7744166 0.7744068 0.7739163 0.7738963 0.7866803 0.7866677
## [50] 0.7858894 0.7798054 0.7736581 0.7731969 0.7286635 0.7859641
cbind(model_2$which, adjRsq = model_2$adjr2)
## 1 2 3 4 5 6 7 adjRsq
## 1 1 0 0 0 0 0 0 0.7066050
## 1 0 0 0 0 1 0 0 0.6748175
## 1 0 0 0 0 0 0 1 0.6617018
## 1 0 1 0 0 0 0 0 0.4161640
## 1 0 0 1 0 0 0 0 0.4008062
## 1 0 0 0 0 0 1 0 0.3066930
## 1 0 0 0 1 0 0 0 0.2992672
## 2 1 0 0 0 1 0 0 0.7641087
## 2 1 0 0 0 0 0 1 0.7634878
## 2 1 0 0 0 0 1 0 0.7378060
## 2 1 1 0 0 0 0 0 0.7157976
## 2 1 0 1 0 0 0 0 0.7157957
## 2 1 0 0 1 0 0 0 0.7138432
## 2 0 0 0 0 1 0 1 0.7109018
## 2 0 0 0 0 1 1 0 0.7069969
## 2 0 1 0 0 1 0 0 0.6827763
## 2 0 0 0 0 0 1 1 0.6802347
## 3 1 0 0 0 1 1 0 0.7817418
## 3 1 0 0 0 0 1 1 0.7752558
## 3 1 0 0 0 1 0 1 0.7749289
## 3 1 1 0 0 1 0 0 0.7635554
## 3 1 0 0 1 1 0 0 0.7634160
## 3 1 0 1 0 1 0 0 0.7633530
## 3 1 1 0 0 0 0 1 0.7630372
## 3 1 0 1 0 0 0 1 0.7628187
## 3 1 0 0 1 0 0 1 0.7627430
## 3 1 0 1 0 0 1 0 0.7428243
## 4 1 0 0 0 1 1 1 0.7870539
## 4 1 0 0 1 1 1 0 0.7812901
## 4 1 0 1 0 1 1 0 0.7810118
## 4 1 1 0 0 1 1 0 0.7810037
## 4 1 0 0 1 1 0 1 0.7751674
## 4 1 1 0 0 0 1 1 0.7745738
## 4 1 0 1 0 0 1 1 0.7745611
## 4 1 0 0 1 0 1 1 0.7745504
## 4 1 0 1 0 1 0 1 0.7743073
## 4 1 1 0 0 1 0 1 0.7741681
## 5 1 0 0 1 1 1 1 0.7873610
## 5 1 0 1 0 1 1 1 0.7865704
## 5 1 1 0 0 1 1 1 0.7864696
## 5 1 0 1 1 1 1 0 0.7805534
## 5 1 1 0 1 1 1 0 0.7805494
## 5 1 1 1 0 1 1 0 0.7802670
## 5 1 1 0 1 1 0 1 0.7744166
## 5 1 0 1 1 1 0 1 0.7744068
## 5 1 0 1 1 0 1 1 0.7739163
## 5 1 1 0 1 0 1 1 0.7738963
## 6 1 1 0 1 1 1 1 0.7866803
## 6 1 0 1 1 1 1 1 0.7866677
## 6 1 1 1 0 1 1 1 0.7858894
## 6 1 1 1 1 1 1 0 0.7798054
## 6 1 1 1 1 1 0 1 0.7736581
## 6 1 1 1 1 0 1 1 0.7731969
## 6 0 1 1 1 1 1 1 0.7286635
## 7 1 1 1 1 1 1 1 0.7859641
According to adjusted R-squared, 1 0 0 0 1 1 1, 1 0 0 1 1 1 1, 1 0 1 0 1 1 1, 1 1 0 1 1 1 1 or 1 0 1 1 1 1 1 should be the selection of variables in our model, as these correspond to the highest values of around 0.787. Since most of the variables are common across these five options, we need to further evaluate the significance of ‘TOEFL Score’, ‘University Rating’ and ‘SOP Rating’ to arrive at our final model. It is also important to note that the full selection 1 1 1 1 1 1 1 is not preferred under this method.
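Rather than scanning the table by eye, the best-scoring subset can be extracted programmatically from the leaps object; a minimal sketch:

# Locate the subset with the highest adjusted R-squared
best <- which.max(model_2$adjr2)
model_2$which[best, ]  # which predictors are included
model_2$adjr2[best]    # the corresponding adjusted R-squared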
# Best Subset Selection
model_3 = regsubsets(Chance.of.Admit ~., data = train_set, nbest = 1, nvmax = 7)
summary(model_3)
## Subset selection object
## Call: regsubsets.formula(Chance.of.Admit ~ ., data = train_set, nbest = 1,
## nvmax = 7)
## 7 Variables (and intercept)
## Forced in Forced out
## GRE.Score FALSE FALSE
## TOEFL.Score FALSE FALSE
## University.Rating FALSE FALSE
## SOP.Rating FALSE FALSE
## LOR.Rating FALSE FALSE
## CGPA FALSE FALSE
## Research.Experience FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## GRE.Score TOEFL.Score University.Rating SOP.Rating LOR.Rating CGPA
## 1 ( 1 ) " " " " " " " " " " "*"
## 2 ( 1 ) "*" " " " " " " " " "*"
## 3 ( 1 ) "*" " " " " " " "*" "*"
## 4 ( 1 ) "*" " " " " " " "*" "*"
## 5 ( 1 ) "*" "*" " " " " "*" "*"
## 6 ( 1 ) "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*"
## Research.Experience
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
## 6 ( 1 ) "*"
## 7 ( 1 ) "*"
plot(model_3, scale = "bic")
According to best subset selection, the order of influence is ‘TOEFL Score’ > ‘University Rating’ > ‘SOP Rating’, which is the order in which they enter the model as the subset size grows.
Upon comparing both of these results, we arrive at the conclusion that 1 1 0 1 1 1 1 is the best linear regression model for this dataset; in other words, all variables except ‘University Rating’ are statistically significant in predicting the chance of admission.
model_4 = lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP.Rating + LOR.Rating + CGPA + Research.Experience, data = train_set)
VIF(model_4)
## GRE.Score TOEFL.Score SOP.Rating LOR.Rating
## 4.634416 4.138157 2.401266 2.000310
## CGPA Research.Experience
## 4.461789 1.460461
summary(model_4)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP.Rating +
## LOR.Rating + CGPA + Research.Experience, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23825 -0.02440 0.00748 0.03595 0.15247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3264822 0.1355257 -9.788 < 2e-16 ***
## GRE.Score 0.0020715 0.0006747 3.070 0.00234 **
## TOEFL.Score 0.0026579 0.0012096 2.197 0.02878 *
## SOP.Rating 0.0098109 0.0056036 1.751 0.08102 .
## LOR.Rating 0.0139703 0.0052788 2.646 0.00857 **
## CGPA 0.1177838 0.0123623 9.528 < 2e-16 ***
## Research.Experience 0.0260941 0.0085079 3.067 0.00236 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06066 on 293 degrees of freedom
## Multiple R-squared: 0.819, Adjusted R-squared: 0.8153
## F-statistic: 220.9 on 6 and 293 DF, p-value: < 2.2e-16
MSE <- mean(summary(model_4)$residuals^2)
RMSE <- sqrt(MSE)
RMSE
## [1] 0.05995116
As we know, a large VIF value (typically above 5, and certainly above 10) is used as an indicator of multicollinearity. Since the VIF values for none of our variables fall in that range, there is no evidence of multicollinearity in our model.
Further, the adjusted R-squared value is 0.815 and the F-statistic has a p-value less than 0.05. This implies that the model can successfully predict the chance of admission and is able to explain 81.5% of the variation in the data set.
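For intuition, the VIF of a predictor equals 1 / (1 - R^2) from regressing that predictor on the other predictors in the model. A sketch of this cross-check for GRE.Score (the result should be close to the 4.63 reported above):

# VIF for GRE.Score computed from first principles
r2_gre <- summary(lm(GRE.Score ~ TOEFL.Score + SOP.Rating + LOR.Rating +
                       CGPA + Research.Experience, data = train_set))$r.squared
1 / (1 - r2_gre)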
par(mfrow = c(1,2))
# generate QQ plot
qqnorm(model_4$residuals,main = "Model for predicting admission")
qqline(model_4$residuals)
# generate Scatter Plot
plot(model_4$fitted.values,model_4$residuals,pch = 20)
abline(h=0,col="grey")
From the graphs, we observe that the Q-Q plot is not ideal and that the residuals in the scatter plot are not evenly distributed. Therefore, this dataset does not completely satisfy the normality, linearity and equal-variance assumptions. We tried applying a transformation to see if it changes anything.
cox = boxcox(model_4) #Estimate the lambda value by using box-cox function
(lambda <- cox$x[which.max(cox$y)]) #Extract the Lambda value
## [1] 2
powerTransform <- function(y, lambda1, lambda2 = NULL, method = "boxcox") {
  boxcoxTrans <- function(x, lam1, lam2 = NULL) {
    # if lambda2 is NULL it defaults to zero, giving the one-parameter transformation
    lam2 <- ifelse(is.null(lam2), 0, lam2)
    if (lam1 == 0L) {
      log(x + lam2)
    } else {
      ((x + lam2)^lam1 - 1) / lam1
    }
  }
  switch(method
         , boxcox = boxcoxTrans(y, lambda1, lambda2)
         , tukey = y^lambda1
  )
}
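Since the estimated lambda is 2, the transformation applied below is simply (y^2 - 1) / 2; a quick sanity check of the helper (our addition, not in the original):

# With lambda = 2, powerTransform() should reduce to (y^2 - 1) / 2
y_check <- admitdata_2$Chance.of.Admit
all.equal(powerTransform(y_check, lambda), (y_check^2 - 1) / 2)  # expect TRUE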
# re-run with Box-Cox Transformation
model_5 <- lm(powerTransform(Chance.of.Admit, lambda) ~ GRE.Score + TOEFL.Score + SOP.Rating + LOR.Rating + CGPA + Research.Experience, data= admitdata_2 )
summary(model_5)
##
## Call:
## lm(formula = powerTransform(Chance.of.Admit, lambda) ~ GRE.Score +
## TOEFL.Score + SOP.Rating + LOR.Rating + CGPA + Research.Experience,
## data = admitdata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.166633 -0.020230 0.006802 0.025374 0.102388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.6897792 0.0663567 -25.465 < 2e-16 ***
## GRE.Score 0.0013511 0.0003280 4.120 4.45e-05 ***
## TOEFL.Score 0.0023187 0.0005681 4.082 5.22e-05 ***
## SOP.Rating 0.0046980 0.0027731 1.694 0.0909 .
## LOR.Rating 0.0111603 0.0026776 4.168 3.63e-05 ***
## CGPA 0.0838282 0.0062791 13.350 < 2e-16 ***
## Research.Experience 0.0185187 0.0043114 4.295 2.10e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03921 on 493 degrees of freedom
## Multiple R-squared: 0.8472, Adjusted R-squared: 0.8454
## F-statistic: 455.7 on 6 and 493 DF, p-value: < 2.2e-16
par(mfrow = c(1,2))
# generate QQ plot
qqnorm(model_5$residuals,main = "Model for predicting admission")
qqline(model_5$residuals)
# generate Scatter Plot
plot(model_5$fitted.values,model_5$residuals,pch = 20)
abline(h=0,col="grey")
From the output, we can observe that although the R-squared value increased with model_5 (note that model_5 was fitted on the full dataset rather than the training set, so its R-squared is not directly comparable), the transformation did not have any noticeable effect on the Q-Q plot or the residual plot. This may be because of the presence of outliers or the small size of the data set. Hence, we decided to proceed with the previous linear regression model (model_4) for validation.
# K Fold Cross Validation
set.seed(234) # Random Seed
train.control <- trainControl(method = "cv", number = 5)
# Train the model
cv_result <- train(Chance.of.Admit ~ ., data = train_set, method = "lm", trControl = train.control)
# Summarize the results
print(cv_result)
## Linear Regression
##
## 300 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 239, 242, 240, 240, 239
## Resampling results:
##
## RMSE Rsquared MAE
## 0.06128595 0.8138116 0.04432828
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
For 5-fold cross-validation, the training dataset is divided into 5 parts of equal sample size. Each part serves once as the test sample while the remaining four serve as the training sample, so the training/testing procedure is performed 5 times. The CV score is usually the average of the out-of-sample performance metric across the 5 iterations.
The observed RMSE and R-Squared values match the summary results of model_4 which indicates that the model is consistent across different subsets of data within the small training dataset.
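To see how much the metric varies across folds, caret keeps the per-fold results in the resample element of the trained object:

# Per-fold RMSE, R-squared and MAE
cv_result$resample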
# Test Set Validation
testset_output <- predict(model_4, newdata = test_set)
testset_rmse <- RMSE(testset_output, test_set$Chance.of.Admit)
testset_rmse
## [1] 0.05957447
testset_rsq <- cor(testset_output, test_set$Chance.of.Admit)^2
testset_rsq
## [1] 0.8226028
The RMSE and R-squared values for the test data set are almost the same as those for the training data set, which implies that our model is equally accurate across the two data sets. Therefore, we can conclude that our linear regression model is statistically significant in using the list of independent variables to predict the chance of admission, with an average in-sample (cross-validated) RMSE of 0.0613 and an out-of-sample RMSE of 0.0596. This particular model accounts for an average of 81.4% of the variation in the in-sample data and 82.3% of the variation in the out-of-sample data.
#Comparing the predicted values against observed values
newadmit <- data.frame(GRE.Score = 337,TOEFL.Score = 118,SOP.Rating = 4, LOR.Rating = 4.5,Research.Experience = 1, CGPA = 9.65)
print(paste0("Observed chance of admission: ",0.92))
## [1] "Observed chance of admission: 0.92"
predicted <- predict(model_4, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.95005635855602"
newadmit <- data.frame(GRE.Score = 308,TOEFL.Score = 101,SOP.Rating = 3, LOR.Rating = 4, Research.Experience = 0, CGPA = 7.9)
print(paste0("Observed chance of admission: ",0.64))
## [1] "Observed chance of admission: 0.64"
predicted <- predict(model_4, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.595787276517959"
We observe that the chance-of-admission values predicted by our model are within ±0.05 of the observed values. This concludes the regression analysis of the data.
# Creating the tree
admission.rpart <- rpart(formula = Chance.of.Admit ~. , data = train_set, method = "anova",cp=-1)
rpart.plot(admission.rpart,type = 3, fallen.leaves =TRUE)
printcp(admission.rpart)
##
## Regression tree:
## rpart(formula = Chance.of.Admit ~ ., data = train_set, method = "anova",
## cp = -1)
##
## Variables actually used in tree construction:
## [1] CGPA GRE.Score LOR.Rating
## [4] Research.Experience SOP.Rating TOEFL.Score
## [7] University.Rating
##
## Root node error: 5.9559/300 = 0.019853
##
## n= 300
##
## CP nsplit rel error xerror xstd
## 1 0.55952944 0 1.00000 1.00838 0.072154
## 2 0.12589985 1 0.44047 0.44666 0.036601
## 3 0.06773845 2 0.31457 0.35162 0.033699
## 4 0.02630450 3 0.24683 0.28920 0.033570
## 5 0.01113962 4 0.22053 0.27848 0.032913
## 6 0.01093104 5 0.20939 0.28125 0.032149
## 7 0.01005214 6 0.19846 0.26979 0.030584
## 8 0.00547058 7 0.18840 0.26180 0.029381
## 9 0.00424329 8 0.18293 0.27379 0.030594
## 10 0.00421874 9 0.17869 0.27098 0.029913
## 11 0.00420918 10 0.17447 0.27098 0.029913
## 12 0.00364285 11 0.17026 0.27309 0.030757
## 13 0.00343006 12 0.16662 0.27004 0.030887
## 14 0.00313253 13 0.16319 0.26881 0.031364
## 15 0.00257098 14 0.16006 0.26849 0.031296
## 16 0.00242682 15 0.15749 0.26955 0.031310
## 17 0.00229248 16 0.15506 0.26665 0.031299
## 18 0.00151305 17 0.15277 0.26323 0.031044
## 19 0.00145472 18 0.15125 0.26986 0.031034
## 20 0.00140451 19 0.14980 0.26986 0.031034
## 21 0.00105812 20 0.14840 0.27134 0.031177
## 22 0.00036676 21 0.14734 0.27110 0.031161
## 23 0.00029766 22 0.14697 0.27031 0.031145
## 24 0.00025146 23 0.14667 0.27036 0.031144
## 25 -1.00000000 24 0.14642 0.27036 0.031144
# Validating against the training data set
predict.rpart.train <- predict(admission.rpart, train_set)
mse.tree1 <- mean((predict.rpart.train- train_set$Chance.of.Admit)^2)
rmse1 <- sqrt(mse.tree1)
rmse1
## [1] 0.05391567
#Validating against the testing data set
predict.rpart.test <- predict(admission.rpart, test_set)
mse.tree2 <- mean((predict.rpart.test- test_set$Chance.of.Admit)^2)
rmse2 <- sqrt(mse.tree2)
rmse2
## [1] 0.06906934
#Comparing the predicted values against observed values
newadmit <- data.frame(GRE.Score = 337,TOEFL.Score = 118, University.Rating = 4, SOP.Rating = 4, LOR.Rating = 4.5,Research.Experience = 1, CGPA = 9.65)
print(paste0("Observed chance of admission: ",0.92))
## [1] "Observed chance of admission: 0.92"
predicted <- predict(admission.rpart, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.934285714285714"
newadmit <- data.frame(GRE.Score = 308,TOEFL.Score = 101, University.Rating = 2, SOP.Rating = 3, LOR.Rating = 4, Research.Experience = 0, CGPA = 7.9)
print(paste0("Observed chance of admission: ",0.64))
## [1] "Observed chance of admission: 0.64"
predicted <- predict(admission.rpart, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.528333333333333"
Since the response variable is continuous, we created regression trees instead of classification trees. The tree is trained on all variables and is fully grown until it reaches its maximum depth of 8. The RMSE values for the training and testing datasets are 0.0539 and 0.0691 respectively.
# Pruning the tree
min_cp <- admission.rpart$cptable[which.min(admission.rpart$cptable[,"xerror"]),"CP"]
admission.pruned <- prune (admission.rpart,cp=min_cp)
rpart.plot(admission.pruned,type = 3, fallen.leaves =TRUE)
plotcp(admission.pruned)
# Validating against the training data set
predict.pruned.train <- predict(admission.pruned, train_set)
mse.tree3 <- mean((predict.pruned.train- train_set$Chance.of.Admit)^2)
rmse3 <- sqrt(mse.tree3)
rmse3
## [1] 0.06115884
#Validating against the testing data set
predict.pruned.test <- predict(admission.pruned, test_set)
mse.tree4 <- mean((predict.pruned.test- test_set$Chance.of.Admit)^2)
rmse4 <- sqrt(mse.tree4)
rmse4
## [1] 0.07311019
#Comparing the predicted values against observed values
newadmit <- data.frame(GRE.Score = 337,TOEFL.Score = 118, University.Rating = 4, SOP.Rating = 4, LOR.Rating = 4.5,Research.Experience = 1, CGPA = 9.65)
print(paste0("Observed chance of admission: ",0.92))
## [1] "Observed chance of admission: 0.92"
predicted <- predict(admission.pruned, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.927674418604651"
newadmit <- data.frame(GRE.Score = 308,TOEFL.Score = 101, University.Rating = 2, SOP.Rating = 3, LOR.Rating = 4, Research.Experience = 0, CGPA = 7.9)
print(paste0("Observed chance of admission: ",0.64))
## [1] "Observed chance of admission: 0.64"
predicted <- predict(admission.pruned, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.501944444444444"
This is to evaluate whether pruning the above tree provides better results. The complexity parameter cp used here is the one associated with the smallest cross-validated error; this is indicated by the dotted line in the graph.
The rpart tree is the input to the pruning function, and the output tree has a reduced depth. It can also be observed that the RMSE values for the training and testing datasets increase slightly compared to the original tree. This may be because the original, fully grown tree overfits the data relative to the pruned tree.
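A common alternative to pruning at the minimum-xerror cp is the 1-SE rule: choose the simplest tree whose cross-validated error is within one standard error of the minimum. A sketch (our addition; admission.pruned.1se is a hypothetical object not used elsewhere):

# 1-SE rule: smallest tree with xerror within one SE of the minimum
cp_tab <- admission.rpart$cptable
thresh <- min(cp_tab[, "xerror"]) + cp_tab[which.min(cp_tab[, "xerror"]), "xstd"]
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= thresh)[1], "CP"]
admission.pruned.1se <- prune(admission.rpart, cp = cp_1se)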
# Creating the forest
set.seed(234)
j <- 500 # Hyper-parameter 1 - number of trees (bootstrap replicas of the dataset)
k <- round(sqrt(7)) # Hyper-parameter 2 - number of variables randomly considered at each split
admission.forest <- randomForest(Chance.of.Admit ~., data=train_set, mtry=k, ntree=j)
print(admission.forest) # view results
##
## Call:
## randomForest(formula = Chance.of.Admit ~ ., data = train_set, mtry = k, ntree = j)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.004060214
## % Var explained: 79.55
plot(admission.forest)
importance(admission.forest) # importance of each predictor
## IncNodePurity
## GRE.Score 1.39063240
## TOEFL.Score 0.86814593
## University.Rating 0.28933488
## SOP.Rating 0.32537368
## LOR.Rating 0.19127949
## CGPA 2.62237251
## Research.Experience 0.08358798
varImpPlot(admission.forest)
# Validating against the training data set
predict.forest.train <- predict(admission.forest, train_set)
mse.tree5 <- mean((predict.forest.train- train_set$Chance.of.Admit)^2)
rmse5 <- sqrt(mse.tree5)
rmse5
## [1] 0.03058209
#Validating against the testing data set
predict.forest.test <- predict(admission.forest, test_set)
mse.tree6 <- mean((predict.forest.test- test_set$Chance.of.Admit)^2)
rmse6 <- sqrt(mse.tree6)
rmse6
## [1] 0.06303027
#Comparing the predicted values against observed values
newadmit <- data.frame(GRE.Score = 337,TOEFL.Score = 118, University.Rating = 4, SOP.Rating = 4, LOR.Rating = 4.5,Research.Experience = 1, CGPA = 9.65)
print(paste0("Observed chance of admission: ",0.92))
## [1] "Observed chance of admission: 0.92"
predicted <- predict(admission.forest, newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.926307000000001"
newadmit <- data.frame(GRE.Score = 308,TOEFL.Score = 101, University.Rating = 2, SOP.Rating = 3, LOR.Rating = 4, Research.Experience = 0, CGPA = 7.9)
print(paste0("Observed chance of admission: ",0.64))
## [1] "Observed chance of admission: 0.64"
predicted <- predict(admission.forest, newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.556253333333334"
As seen in the plot, the error decreases as the number of trees increases. Hyper-parameter 2, used when growing the bootstrapped trees, ensures that only 3 randomly selected variables are considered at every split, which decorrelates the trees and reduces the variance of the ensemble. Due to this, the random forest has a lower test RMSE than both the fully grown decision tree and the pruned decision tree. This approach also helps in determining the relative importance of each variable in the prediction.
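The mtry value here comes from the sqrt(p) heuristic; if one wanted to tune it, randomForest provides tuneRF(), which searches around a starting value. A sketch (not run as part of the original analysis; tuned_mtry is a hypothetical object):

# Optional: search for a better mtry around the sqrt(p) starting point
set.seed(234)
tuned_mtry <- tuneRF(x = train_set[, -8], y = train_set$Chance.of.Admit,
                     mtryStart = k, ntreeTry = j, stepFactor = 1.5, improve = 0.01)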
Comparison:
Since the response is continuous rather than categorical, the RMSE value is our primary metric for comparing the prediction accuracy of these models.
From the tree outputs, it is clear that the RMSE values across both the in-sample and out-of-sample data sets follow Random Forest < CART < Pruned Tree.
However, upon further comparison, linear regression seems to fit the data better than Random Forest, which is rather unusual. This anomaly can be attributed to the small size of the dataset and the presence of many independent variables.
While both models give a comparable view of variable importance, the linear regression model accounts for a greater percentage of variation in the training data (81.4%) than Random Forest (79.5%). As summarized in the table below (assembled from the RMSE values reported above), it also has the lowest RMSE for predicting new (out-of-sample) data.

| Model | Train RMSE | Test RMSE |
| --- | --- | --- |
| Linear Regression (model_4) | 0.0600 | 0.0596 |
| CART (fully grown) | 0.0539 | 0.0691 |
| Pruned Tree | 0.0612 | 0.0731 |
| Random Forest | 0.0306 | 0.0630 |
We also compared this result with the IEEE paper published by the owner of this dataset, and our observations matched the conclusion presented in that paper.
Conclusion:
Therefore, we can conclude that linear regression emerges as the champion model for supervised learning on this dataset, while Random Forest emerges as the challenger model.
The final fitted model is:
Chance.of.Admit = -1.3264822 + 0.0020715 * (GRE.Score) + 0.0026579 * (TOEFL.Score) + 0.0098109 * (SOP.Rating) + 0.0139703 * (LOR.Rating) + 0.1177838 * (CGPA) + 0.0260941 * (Research.Experience)
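For convenience, this equation can be wrapped in a standalone helper; predict_admit below is a hypothetical function we add here, with the coefficients copied from the model_4 summary above:

# The final model as a plain R function
predict_admit <- function(gre, toefl, sop, lor, cgpa, research) {
  -1.3264822 + 0.0020715 * gre + 0.0026579 * toefl + 0.0098109 * sop +
    0.0139703 * lor + 0.1177838 * cgpa + 0.0260941 * research
}
predict_admit(gre = 337, toefl = 118, sop = 4, lor = 4.5,
              cgpa = 9.65, research = 1)  # ~0.950, matching the earlier prediction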