Chapter 1: Introduction to the Dataset

Team Members

  • Ruthvik Chidambaram
  • Harsh Dave
  • Neha Ganshani
  • Tanmay Jog
  • Vinishruth Mocherla


Context: This dataset was created for predicting Graduate Admissions from an Indian perspective.

Objective: What are the chances of a student getting admitted to grad school?

Content: The dataset contains several parameters that are considered important when applying to Master's programs. The parameters included are:

Sr. No.   Field Name             Range
1         GRE Score              out of 340
2         TOEFL Score            out of 120
3         University Rating      out of 5
4         SOP and LOR Strength   out of 5
5         Undergraduate GPA      out of 10
6         Research Experience    either 0 or 1
7         Chance of Admit        0 to 1


Acknowledgements: This dataset is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya.

Inspiration: This dataset was built with the purpose of helping students in shortlisting universities with their profiles. The predicted output gives them a fair idea about their chances for a particular university.

Citation: Please cite the following if you are interested in using the dataset: Mohan S Acharya, Asfia Armaan, Aneeta S Antony, "A Comparison of Regression Models for Prediction of Graduate Admissions," IEEE International Conference on Computational Intelligence in Data Science, 2019.

Chapter 2: Exploratory Data Analysis

Importing the Data
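
The analysis in this and the following chapters uses functions from several add-on packages. The original library calls are not shown, so the preamble below is our best reconstruction of what needs to be loaded:

library(psych)         # pairs.panels(), multi.hist()
library(caTools)       # sample.split()
library(leaps)         # leaps(), regsubsets()
library(MASS)          # boxcox()
library(caret)         # train(), trainControl(), RMSE()
library(rpart)         # rpart()
library(rpart.plot)    # rpart.plot()
library(randomForest)  # randomForest()
library(regclass)      # VIF(); assumed source -- car::vif() is an equivalent alternative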

# Make sure the CSV is copied to working directory
admitdata <- read.csv("Admission_Predict_Ver1.1.csv")

# Summarizing the dimensions and variable characteristics of the initial dataset
dim(admitdata)
## [1] 500   9
summary(admitdata)
##    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
##  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000    
##  1st Qu.:125.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000    
##  Median :250.5   Median :317.0   Median :107.0   Median :3.000    
##  Mean   :250.5   Mean   :316.5   Mean   :107.2   Mean   :3.114    
##  3rd Qu.:375.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000    
##  Max.   :500.0   Max.   :340.0   Max.   :120.0   Max.   :5.000    
##       SOP             LOR             CGPA          Research   
##  Min.   :1.000   Min.   :1.000   Min.   :6.800   Min.   :0.00  
##  1st Qu.:2.500   1st Qu.:3.000   1st Qu.:8.127   1st Qu.:0.00  
##  Median :3.500   Median :3.500   Median :8.560   Median :1.00  
##  Mean   :3.374   Mean   :3.484   Mean   :8.576   Mean   :0.56  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:9.040   3rd Qu.:1.00  
##  Max.   :5.000   Max.   :5.000   Max.   :9.920   Max.   :1.00  
##  Chance.of.Admit 
##  Min.   :0.3400  
##  1st Qu.:0.6300  
##  Median :0.7200  
##  Mean   :0.7217  
##  3rd Qu.:0.8200  
##  Max.   :0.9700

It can be observed that all the columns are of numeric data type. However, the Research column contains only two possible values (0 and 1), making it effectively a binary indicator.
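
This can be verified with a quick frequency count (not part of the original output; with a mean of 0.56 over 500 rows, we would expect 220 zeros and 280 ones):

# Frequency of each value in the Research column
table(admitdata$Research)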

Data Cleaning

#Eliminating Duplicate Records
admitdata_1 <- unique(admitdata)

#Checking for the null values
sum(is.na(admitdata_1))
## [1] 0
#Dropping the 'Serial No.' column and renaming columns
admitdata_2 = admitdata_1[c(-1)]
names(admitdata_2)[4] <- "SOP.Rating"
names(admitdata_2)[5] <- "LOR.Rating"
names(admitdata_2)[7] <- "Research.Experience"
head(admitdata_2)
##   GRE.Score TOEFL.Score University.Rating SOP.Rating LOR.Rating CGPA
## 1       337         118                 4        4.5        4.5 9.65
## 2       324         107                 4        4.0        4.5 8.87
## 3       316         104                 3        3.0        3.5 8.00
## 4       322         110                 3        3.5        2.5 8.67
## 5       314         103                 2        2.0        3.0 8.21
## 6       330         115                 5        4.5        3.0 9.34
##   Research.Experience Chance.of.Admit
## 1                   1            0.92
## 2                   1            0.76
## 3                   1            0.72
## 4                   1            0.80
## 5                   0            0.65
## 6                   1            0.90

There are no duplicate records or null values.

We have dropped the ‘Serial No.’ column from the dataset as we won't be needing it for our analysis. For uniformity, we have also renamed the column headers ‘SOP’, ‘LOR’ and ‘Research’ to ‘SOP.Rating’, ‘LOR.Rating’ and ‘Research.Experience’ respectively.

Now, the dataset is clean and is ready for analysis.

Data Visualization

# Summarizing the dimensions and variable characteristics of the cleaned dataset
str(admitdata_2)
## 'data.frame':    500 obs. of  8 variables:
##  $ GRE.Score          : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score        : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating  : int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP.Rating         : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR.Rating         : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA               : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research.Experience: int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit    : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
dim(admitdata_2)
## [1] 500   8
summary(admitdata_2)
##    GRE.Score      TOEFL.Score    University.Rating   SOP.Rating   
##  Min.   :290.0   Min.   : 92.0   Min.   :1.000     Min.   :1.000  
##  1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000     1st Qu.:2.500  
##  Median :317.0   Median :107.0   Median :3.000     Median :3.500  
##  Mean   :316.5   Mean   :107.2   Mean   :3.114     Mean   :3.374  
##  3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000     3rd Qu.:4.000  
##  Max.   :340.0   Max.   :120.0   Max.   :5.000     Max.   :5.000  
##    LOR.Rating         CGPA       Research.Experience Chance.of.Admit 
##  Min.   :1.000   Min.   :6.800   Min.   :0.00        Min.   :0.3400  
##  1st Qu.:3.000   1st Qu.:8.127   1st Qu.:0.00        1st Qu.:0.6300  
##  Median :3.500   Median :8.560   Median :1.00        Median :0.7200  
##  Mean   :3.484   Mean   :8.576   Mean   :0.56        Mean   :0.7217  
##  3rd Qu.:4.000   3rd Qu.:9.040   3rd Qu.:1.00        3rd Qu.:0.8200  
##  Max.   :5.000   Max.   :9.920   Max.   :1.00        Max.   :0.9700
# Correlation Plots, Histograms and Boxplots
pairs.panels(admitdata_2,gap = 0)

multi.hist(admitdata_2[,sapply(admitdata_2, is.numeric)])

par(mfrow=c(2,2)) # combine the box and whiskers plots
boxplot(admitdata_2$GRE.Score, main="GRE Score")
boxplot(admitdata_2$TOEFL.Score, main ="TOEFL Score")
boxplot(admitdata_2$University.Rating, main="University Rating")
boxplot(admitdata_2$SOP.Rating, main ="SOP Rating")

boxplot(admitdata_2$LOR.Rating, main ="LOR Rating")
boxplot(admitdata_2$CGPA, main="CGPA")
boxplot(admitdata_2$Chance.of.Admit, main="Chance of Admit")

We see a high correlation between:

  • ‘Chance of Admit’ & ‘CGPA’
  • ‘Chance of Admit’ & ‘GRE score’
  • ‘Chance of Admit’ & ‘TOEFL score’
  • ‘GRE Score’ & ‘TOEFL Score’
  • ‘GRE score’ & ‘CGPA’
  • ‘TOEFL score’ & ‘CGPA’

We can see from the histograms that, except for Research Experience (a binary variable), all the values are roughly normally distributed around their means. So we do not need to apply any kind of transformation to any of the columns at this stage.

From the boxplots, we see a few outliers in ‘LOR Rating’ and ‘Chance of Admit’. However, since our dataset is small, removing these outliers might significantly change its composition. Hence, we decided to leave them as they are.
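
The pairwise relationships highlighted above can also be confirmed numerically; a quick check along these lines (not part of the original output):

# Pairwise correlations of all (numeric) columns, rounded for readability
round(cor(admitdata_2), 2)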

Chapter 3: Linear Regression Analysis

Data Preparation

# split data into train and test set
set.seed(123)  # set random seed
split = sample.split(admitdata_2$Chance.of.Admit, SplitRatio = 0.607)
train_set = subset(admitdata_2, split == TRUE) # training set
test_set = subset(admitdata_2, split == FALSE) # test set

Splitting the dataset with this random seed yields a training set of 300 observations and a test set of 200 observations.
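
A quick sanity check of the split sizes (illustrative, not part of the original output):

# Confirm the train/test dimensions
dim(train_set) # expect 300 x 8
dim(test_set)  # expect 200 x 8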

Full Model

# Multiple Linear regression model with all the variables
model_1 = lm(Chance.of.Admit ~ ., data = train_set)
summary(model_1)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = train_set)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.234236 -0.024014  0.007303  0.034962  0.154220 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.2858625  0.1390077  -9.250  < 2e-16 ***
## GRE.Score            0.0020697  0.0006739   3.071  0.00233 ** 
## TOEFL.Score          0.0024059  0.0012241   1.966  0.05030 .  
## University.Rating    0.0063646  0.0049472   1.287  0.19928    
## SOP.Rating           0.0073533  0.0059144   1.243  0.21476    
## LOR.Rating           0.0134878  0.0052862   2.552  0.01124 *  
## CGPA                 0.1151881  0.0125123   9.206  < 2e-16 ***
## Research.Experience  0.0249928  0.0085414   2.926  0.00370 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0606 on 292 degrees of freedom
## Multiple R-squared:   0.82,  Adjusted R-squared:  0.8157 
## F-statistic:   190 on 7 and 292 DF,  p-value: < 2.2e-16
mean(summary(model_1)$residuals^2)
## [1] 0.003573885

We observe that ‘GRE Score’, ‘LOR Rating’, ‘CGPA’ and ‘Research Experience’ are significant at the 0.05 level, while ‘TOEFL Score’, ‘University Rating’ and ‘SOP Rating’ are not.

In addition, the adjusted R-squared value is 0.816, which is fairly high, and the F-statistic's p-value is well below 0.05, suggesting that the model as a whole is able to predict the dependent variable.

Nevertheless, let's perform a variable selection process to identify the significant covariates.

Variable selection

# Adjusted R-squared method (response Chance.of.Admit is column 8; predictors are columns 1-7)
model_2 <- leaps(x = train_set[, -8], y = train_set[, 8], method = "adjr2")
model_2
## $which
##       1     2     3     4     5     6     7
## 1  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## 1 FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## 1 FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## 2  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
## 2  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
## 2  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## 2  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## 2  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
## 2  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## 2 FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
## 2 FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
## 3  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE
## 3  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
## 3  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
## 3  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
## 3  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
## 3  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
## 3  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
## 3  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
## 4  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## 4  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE
## 4  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
## 4  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
## 4  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE
## 4  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## 4  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## 4  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
## 4  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
## 4  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
## 5  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## 5  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 5  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## 5  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
## 5  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## 5  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
## 5  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
## 5  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
## 5  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## 6  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## 6  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## 6  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## 6 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 7  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 
## $label
## [1] "(Intercept)" "1"           "2"           "3"           "4"          
## [6] "5"           "6"           "7"          
## 
## $size
##  [1] 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6
## [39] 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 8
## 
## $adjr2
##  [1] 0.7066050 0.6748175 0.6617018 0.4161640 0.4008062 0.3066930 0.2992672
##  [8] 0.7641087 0.7634878 0.7378060 0.7157976 0.7157957 0.7138432 0.7109018
## [15] 0.7069969 0.6827763 0.6802347 0.7817418 0.7752558 0.7749289 0.7635554
## [22] 0.7634160 0.7633530 0.7630372 0.7628187 0.7627430 0.7428243 0.7870539
## [29] 0.7812901 0.7810118 0.7810037 0.7751674 0.7745738 0.7745611 0.7745504
## [36] 0.7743073 0.7741681 0.7873610 0.7865704 0.7864696 0.7805534 0.7805494
## [43] 0.7802670 0.7744166 0.7744068 0.7739163 0.7738963 0.7866803 0.7866677
## [50] 0.7858894 0.7798054 0.7736581 0.7731969 0.7286635 0.7859641
cbind(model_2$which, adjRsq = model_2$adjr2)
##   1 2 3 4 5 6 7    adjRsq
## 1 1 0 0 0 0 0 0 0.7066050
## 1 0 0 0 0 1 0 0 0.6748175
## 1 0 0 0 0 0 0 1 0.6617018
## 1 0 1 0 0 0 0 0 0.4161640
## 1 0 0 1 0 0 0 0 0.4008062
## 1 0 0 0 0 0 1 0 0.3066930
## 1 0 0 0 1 0 0 0 0.2992672
## 2 1 0 0 0 1 0 0 0.7641087
## 2 1 0 0 0 0 0 1 0.7634878
## 2 1 0 0 0 0 1 0 0.7378060
## 2 1 1 0 0 0 0 0 0.7157976
## 2 1 0 1 0 0 0 0 0.7157957
## 2 1 0 0 1 0 0 0 0.7138432
## 2 0 0 0 0 1 0 1 0.7109018
## 2 0 0 0 0 1 1 0 0.7069969
## 2 0 1 0 0 1 0 0 0.6827763
## 2 0 0 0 0 0 1 1 0.6802347
## 3 1 0 0 0 1 1 0 0.7817418
## 3 1 0 0 0 0 1 1 0.7752558
## 3 1 0 0 0 1 0 1 0.7749289
## 3 1 1 0 0 1 0 0 0.7635554
## 3 1 0 0 1 1 0 0 0.7634160
## 3 1 0 1 0 1 0 0 0.7633530
## 3 1 1 0 0 0 0 1 0.7630372
## 3 1 0 1 0 0 0 1 0.7628187
## 3 1 0 0 1 0 0 1 0.7627430
## 3 1 0 1 0 0 1 0 0.7428243
## 4 1 0 0 0 1 1 1 0.7870539
## 4 1 0 0 1 1 1 0 0.7812901
## 4 1 0 1 0 1 1 0 0.7810118
## 4 1 1 0 0 1 1 0 0.7810037
## 4 1 0 0 1 1 0 1 0.7751674
## 4 1 1 0 0 0 1 1 0.7745738
## 4 1 0 1 0 0 1 1 0.7745611
## 4 1 0 0 1 0 1 1 0.7745504
## 4 1 0 1 0 1 0 1 0.7743073
## 4 1 1 0 0 1 0 1 0.7741681
## 5 1 0 0 1 1 1 1 0.7873610
## 5 1 0 1 0 1 1 1 0.7865704
## 5 1 1 0 0 1 1 1 0.7864696
## 5 1 0 1 1 1 1 0 0.7805534
## 5 1 1 0 1 1 1 0 0.7805494
## 5 1 1 1 0 1 1 0 0.7802670
## 5 1 1 0 1 1 0 1 0.7744166
## 5 1 0 1 1 1 0 1 0.7744068
## 5 1 0 1 1 0 1 1 0.7739163
## 5 1 1 0 1 0 1 1 0.7738963
## 6 1 1 0 1 1 1 1 0.7866803
## 6 1 0 1 1 1 1 1 0.7866677
## 6 1 1 1 0 1 1 1 0.7858894
## 6 1 1 1 1 1 1 0 0.7798054
## 6 1 1 1 1 1 0 1 0.7736581
## 6 1 1 1 1 0 1 1 0.7731969
## 6 0 1 1 1 1 1 1 0.7286635
## 7 1 1 1 1 1 1 1 0.7859641

According to adjusted R-squared, the selections 1 0 0 0 1 1 1, 1 0 0 1 1 1 1, 1 0 1 0 1 1 1, 1 1 0 1 1 1 1 and 1 0 1 1 1 1 1 are the leading candidates, all corresponding to adjusted R-squared values near the maximum of 0.787. Since most of the variables are common across these 5 options, we need to further evaluate the significance of ‘TOEFL Score’, ‘University Rating’ and ‘SOP Rating’ to arrive at our final model. It is also important to note that the full selection 1 1 1 1 1 1 1 is not preferred by this method.
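
For readability, the column indices in these selection vectors map to the predictors in the order they appear in train_set; a small helper (not in the original) makes the mapping explicit:

# Column index -> predictor: 1 = GRE.Score, 2 = TOEFL.Score, ..., 7 = Research.Experience
names(train_set)[1:7]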

# Best Subset Selection
model_3 = regsubsets(Chance.of.Admit ~., data = train_set, nbest = 1, nvmax = 7)
summary(model_3)
## Subset selection object
## Call: regsubsets.formula(Chance.of.Admit ~ ., data = train_set, nbest = 1, 
##     nvmax = 7)
## 7 Variables  (and intercept)
##                     Forced in Forced out
## GRE.Score               FALSE      FALSE
## TOEFL.Score             FALSE      FALSE
## University.Rating       FALSE      FALSE
## SOP.Rating              FALSE      FALSE
## LOR.Rating              FALSE      FALSE
## CGPA                    FALSE      FALSE
## Research.Experience     FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
##          GRE.Score TOEFL.Score University.Rating SOP.Rating LOR.Rating CGPA
## 1  ( 1 ) " "       " "         " "               " "        " "        "*" 
## 2  ( 1 ) "*"       " "         " "               " "        " "        "*" 
## 3  ( 1 ) "*"       " "         " "               " "        "*"        "*" 
## 4  ( 1 ) "*"       " "         " "               " "        "*"        "*" 
## 5  ( 1 ) "*"       "*"         " "               " "        "*"        "*" 
## 6  ( 1 ) "*"       "*"         "*"               " "        "*"        "*" 
## 7  ( 1 ) "*"       "*"         "*"               "*"        "*"        "*" 
##          Research.Experience
## 1  ( 1 ) " "                
## 2  ( 1 ) " "                
## 3  ( 1 ) " "                
## 4  ( 1 ) "*"                
## 5  ( 1 ) "*"                
## 6  ( 1 ) "*"                
## 7  ( 1 ) "*"
plot(model_3, scale = "bic")

According to best subset selection, ‘TOEFL Score’ enters the model before ‘University Rating’, which in turn enters before ‘SOP Rating’; that is, their influence ranks ‘TOEFL Score’ > ‘University Rating’ > ‘SOP Rating’.

Upon comparing both these results, we can conclude that 1 1 0 1 1 1 1 is the best linear regression model for this dataset; in other words, all variables except ‘University Rating’ are statistically significant in predicting the chance of admission.

Final Model

model_4 = lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP.Rating + LOR.Rating + CGPA + Research.Experience, data = train_set) 

VIF(model_4)
##           GRE.Score         TOEFL.Score          SOP.Rating          LOR.Rating 
##            4.634416            4.138157            2.401266            2.000310 
##                CGPA Research.Experience 
##            4.461789            1.460461
summary(model_4) 
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP.Rating + 
##     LOR.Rating + CGPA + Research.Experience, data = train_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23825 -0.02440  0.00748  0.03595  0.15247 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.3264822  0.1355257  -9.788  < 2e-16 ***
## GRE.Score            0.0020715  0.0006747   3.070  0.00234 ** 
## TOEFL.Score          0.0026579  0.0012096   2.197  0.02878 *  
## SOP.Rating           0.0098109  0.0056036   1.751  0.08102 .  
## LOR.Rating           0.0139703  0.0052788   2.646  0.00857 ** 
## CGPA                 0.1177838  0.0123623   9.528  < 2e-16 ***
## Research.Experience  0.0260941  0.0085079   3.067  0.00236 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06066 on 293 degrees of freedom
## Multiple R-squared:  0.819,  Adjusted R-squared:  0.8153 
## F-statistic: 220.9 on 6 and 293 DF,  p-value: < 2.2e-16
MSE <- mean(summary(model_4)$residuals^2)
RMSE <- sqrt(MSE)
RMSE
## [1] 0.05995116

A large VIF value (typically above 5, and especially above 10) is used as an indicator of multicollinearity. Since the VIF values for none of our variables fall in that range, there is no evidence of multicollinearity in our model.
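
For reference, the VIF of a covariate is 1 / (1 - R²) from regressing that covariate on the remaining predictors. A hand computation for ‘GRE Score’ (an illustrative check, not in the original analysis):

# VIF for GRE.Score by hand; should match VIF(model_4)["GRE.Score"] (~4.63)
r2_gre <- summary(lm(GRE.Score ~ TOEFL.Score + SOP.Rating + LOR.Rating +
                       CGPA + Research.Experience, data = train_set))$r.squared
1 / (1 - r2_gre)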

Further, the adjusted R-squared value is 0.815, and the F-statistic's p-value is below 0.05. This implies that the model can successfully predict the chance of admission and explains 81.5% of the variation in the training data.

Model Adequacy Check

par(mfrow = c(1,2))
# generate QQ plot
qqnorm(model_4$residuals,main = "Model for predicting admission")
qqline(model_4$residuals)

# generate Scatter Plot
plot(model_4$fitted.values,model_4$residuals,pch = 20)
abline(h=0,col="grey") 

From the graphs, we observe that the QQ plot deviates from the reference line and that the residuals in the scatterplot are not evenly distributed. Therefore, this dataset does not completely satisfy the normality, linearity and equal-variance assumptions. We tried applying a transformation to see if it improves the fit.
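
A formal test can supplement the visual diagnostics; for example, a Shapiro-Wilk test on the residuals (an illustrative addition, not in the original analysis):

# Shapiro-Wilk test; a small p-value indicates departure from normality
shapiro.test(model_4$residuals)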

Box-Cox Transformation

cox = boxcox(model_4) #Estimate the lambda value by using box-cox function

(lambda <- cox$x[which.max(cox$y)]) #Extract the Lambda value
## [1] 2
powerTransform <- function(y, lambda1, lambda2 = NULL, method = "boxcox") {
  boxcoxTrans <- function(x, lam1, lam2 = NULL) {

    # if we set lambda2 to zero, it becomes the one-parameter transformation
    lam2 <- ifelse(is.null(lam2), 0, lam2)

    if (lam1 == 0L) {
      log(x + lam2)
    } else {
      (((x + lam2)^lam1) - 1) / lam1
    }
  }

  switch(method
         , boxcox = boxcoxTrans(y, lambda1, lambda2)
         , tukey = y^lambda1
  )
}
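
With lambda = 2, the one-parameter Box-Cox transform reduces to (y^2 - 1) / 2; a quick spot check of the helper (illustrative):

powerTransform(0.92, 2) # (0.92^2 - 1) / 2 = -0.0768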

# re-run with Box-Cox Transformation
model_5 <- lm(powerTransform(Chance.of.Admit, lambda) ~ GRE.Score + TOEFL.Score + SOP.Rating + LOR.Rating + CGPA + Research.Experience, data= admitdata_2 )

summary(model_5)
## 
## Call:
## lm(formula = powerTransform(Chance.of.Admit, lambda) ~ GRE.Score + 
##     TOEFL.Score + SOP.Rating + LOR.Rating + CGPA + Research.Experience, 
##     data = admitdata_2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.166633 -0.020230  0.006802  0.025374  0.102388 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.6897792  0.0663567 -25.465  < 2e-16 ***
## GRE.Score            0.0013511  0.0003280   4.120 4.45e-05 ***
## TOEFL.Score          0.0023187  0.0005681   4.082 5.22e-05 ***
## SOP.Rating           0.0046980  0.0027731   1.694   0.0909 .  
## LOR.Rating           0.0111603  0.0026776   4.168 3.63e-05 ***
## CGPA                 0.0838282  0.0062791  13.350  < 2e-16 ***
## Research.Experience  0.0185187  0.0043114   4.295 2.10e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03921 on 493 degrees of freedom
## Multiple R-squared:  0.8472, Adjusted R-squared:  0.8454 
## F-statistic: 455.7 on 6 and 493 DF,  p-value: < 2.2e-16
par(mfrow = c(1,2))
# generate QQ plot
qqnorm(model_5$residuals,main = "Model for predicting admission")
qqline(model_5$residuals)

# generate Scatter Plot
plot(model_5$fitted.values,model_5$residuals,pch = 20)
abline(h=0,col="grey") 

From the output, we can observe that although the R-squared value increased with model_5 (note that model_5 was fit on the full dataset rather than the training set, so the values are not directly comparable), the transformation did not noticeably improve the QQ plot or the residual plot. This may be because of the presence of outliers or the small size of the dataset. Hence, we decided to proceed with the untransformed linear regression model (model_4) for validation.

Model Validation

# K Fold Cross Validation
set.seed(234) # Random Seed
train.control <- trainControl(method = "cv", number = 5)
# Train the model
cv_result <- train(Chance.of.Admit ~ ., data = train_set, method = "lm", trControl = train.control)
# Summarize the results
print(cv_result)
## Linear Regression 
## 
## 300 samples
##   7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 239, 242, 240, 240, 239 
## Resampling results:
## 
##   RMSE        Rsquared   MAE       
##   0.06128595  0.8138116  0.04432828
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

For 5-fold cross-validation, the training dataset is divided into 5 parts of equal sample size. Each part serves once as the validation sample while the remaining parts serve as the training sample, and this procedure is repeated 5 times. The CV score is the average of the out-of-sample performance metric across the 5 iterations.
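
For illustration, a minimal manual version of the same procedure (caret's train() does this internally):

# Manual 5-fold CV sketch for the full linear model
set.seed(234)
folds <- sample(rep(1:5, length.out = nrow(train_set)))
cv_rmse <- sapply(1:5, function(k) {
  fit  <- lm(Chance.of.Admit ~ ., data = train_set[folds != k, ])
  pred <- predict(fit, newdata = train_set[folds == k, ])
  sqrt(mean((pred - train_set$Chance.of.Admit[folds == k])^2))
})
mean(cv_rmse) # out-of-fold RMSE averaged over the 5 folds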

The observed RMSE and R-squared values closely match the summary results of model_4, which indicates that the model is consistent across different subsets of the small training dataset.

# Test Set Validation
testset_output <- predict(model_4, newdata = test_set)
testset_rmse <- RMSE(testset_output,  test_set$Chance.of.Admit)
testset_rmse
## [1] 0.05957447
testset_rsq <- cor(testset_output, test_set$Chance.of.Admit)^2
testset_rsq
## [1] 0.8226028

The RMSE and R-squared values for the test dataset are almost the same as those for the training dataset, which implies that our model is equally accurate across the two. Therefore, we can conclude that our linear regression model is statistically significant in using the independent variables to predict the chance of admission, with an average in-sample (cross-validated) RMSE of 0.0613 and an out-of-sample RMSE of 0.0596. The model accounts for 81.4% of the variation in the in-sample data and 82.3% of the variation in the out-of-sample data.

Model Accuracy

#Comparing the predicted values against observed values
newadmit <- data.frame(GRE.Score = 337,TOEFL.Score = 118,SOP.Rating = 4, LOR.Rating = 4.5,Research.Experience = 1, CGPA = 9.65)
print(paste0("Observed chance of admission: ",0.92))
## [1] "Observed chance of admission: 0.92"
predicted <- predict(model_4, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.95005635855602"
newadmit <- data.frame(GRE.Score = 308,TOEFL.Score = 101,SOP.Rating = 3, LOR.Rating = 4, Research.Experience = 0, CGPA = 7.9)
print(paste0("Observed chance of admission: ",0.64))
## [1] "Observed chance of admission: 0.64"
predicted <- predict(model_4, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.595787276517959"

We observe that the chances of admission predicted by our model are within about ±0.05 of the observed values in both cases. This concludes the regression analysis of the data.

Chapter 4: Decision Tree Analysis

Regression Tree

# Creating the tree
admission.rpart <- rpart(formula = Chance.of.Admit ~. , data = train_set, method = "anova",cp=-1)
rpart.plot(admission.rpart,type = 3, fallen.leaves =TRUE)

printcp(admission.rpart)
## 
## Regression tree:
## rpart(formula = Chance.of.Admit ~ ., data = train_set, method = "anova", 
##     cp = -1)
## 
## Variables actually used in tree construction:
## [1] CGPA                GRE.Score           LOR.Rating         
## [4] Research.Experience SOP.Rating          TOEFL.Score        
## [7] University.Rating  
## 
## Root node error: 5.9559/300 = 0.019853
## 
## n= 300 
## 
##             CP nsplit rel error  xerror     xstd
## 1   0.55952944      0   1.00000 1.00838 0.072154
## 2   0.12589985      1   0.44047 0.44666 0.036601
## 3   0.06773845      2   0.31457 0.35162 0.033699
## 4   0.02630450      3   0.24683 0.28920 0.033570
## 5   0.01113962      4   0.22053 0.27848 0.032913
## 6   0.01093104      5   0.20939 0.28125 0.032149
## 7   0.01005214      6   0.19846 0.26979 0.030584
## 8   0.00547058      7   0.18840 0.26180 0.029381
## 9   0.00424329      8   0.18293 0.27379 0.030594
## 10  0.00421874      9   0.17869 0.27098 0.029913
## 11  0.00420918     10   0.17447 0.27098 0.029913
## 12  0.00364285     11   0.17026 0.27309 0.030757
## 13  0.00343006     12   0.16662 0.27004 0.030887
## 14  0.00313253     13   0.16319 0.26881 0.031364
## 15  0.00257098     14   0.16006 0.26849 0.031296
## 16  0.00242682     15   0.15749 0.26955 0.031310
## 17  0.00229248     16   0.15506 0.26665 0.031299
## 18  0.00151305     17   0.15277 0.26323 0.031044
## 19  0.00145472     18   0.15125 0.26986 0.031034
## 20  0.00140451     19   0.14980 0.26986 0.031034
## 21  0.00105812     20   0.14840 0.27134 0.031177
## 22  0.00036676     21   0.14734 0.27110 0.031161
## 23  0.00029766     22   0.14697 0.27031 0.031145
## 24  0.00025146     23   0.14667 0.27036 0.031144
## 25 -1.00000000     24   0.14642 0.27036 0.031144
# Validating against the training data set
predict.rpart.train <- predict(admission.rpart, train_set)
mse.tree1 <- mean((predict.rpart.train- train_set$Chance.of.Admit)^2)
rmse1 <- sqrt(mse.tree1)
rmse1
## [1] 0.05391567
#Validating against the testing data set
predict.rpart.test <- predict(admission.rpart, test_set)
mse.tree2 <- mean((predict.rpart.test- test_set$Chance.of.Admit)^2)
rmse2 <- sqrt(mse.tree2)
rmse2
## [1] 0.06906934
#Comparing the predicted values against observed values
newadmit <- data.frame(GRE.Score = 337,TOEFL.Score = 118, University.Rating = 4, SOP.Rating = 4, LOR.Rating = 4.5,Research.Experience = 1, CGPA = 9.65)
print(paste0("Observed chance of admission: ",0.92))
## [1] "Observed chance of admission: 0.92"
predicted <- predict(admission.rpart, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.934285714285714"
newadmit <- data.frame(GRE.Score = 308,TOEFL.Score = 101, University.Rating = 2, SOP.Rating = 3, LOR.Rating = 4, Research.Experience = 0, CGPA = 7.9)
print(paste0("Observed chance of admission: ",0.64))
## [1] "Observed chance of admission: 0.64"
predicted <- predict(admission.rpart, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.528333333333333"

As most of our data is quantitative, we created regression trees instead of classification trees. The tree is trained on all variables and, with cp = -1, is fully grown until it reaches its maximum depth of 8. The RMSE values for the training and testing datasets are 0.0539 and 0.0691 respectively.
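
The quoted maximum depth can be checked from the tree's node numbers: under rpart's numbering, a node numbered n sits at depth floor(log2(n)), with the root at depth 0 (a sketch, not in the original):

# Depth of the deepest node in the fully grown tree
nodes <- as.numeric(rownames(admission.rpart$frame))
max(floor(log2(nodes)))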

Pruning

# Pruning the tree
min_cp <- admission.rpart$cptable[which.min(admission.rpart$cptable[,"xerror"]),"CP"]
admission.pruned <- prune (admission.rpart,cp=min_cp)
rpart.plot(admission.pruned,type = 3, fallen.leaves =TRUE)

plotcp(admission.pruned)

# Validating against the training data set
predict.pruned.train <- predict(admission.pruned, train_set)
mse.tree3 <- mean((predict.pruned.train- train_set$Chance.of.Admit)^2)
rmse3 <- sqrt(mse.tree3)
rmse3
## [1] 0.06115884
#Validating against the testing data set
predict.pruned.test <- predict(admission.pruned, test_set)
mse.tree4 <- mean((predict.pruned.test- test_set$Chance.of.Admit)^2)
rmse4 <- sqrt(mse.tree4)
rmse4
## [1] 0.07311019
#Comparing the predicted values against observed values
newadmit <- data.frame(GRE.Score = 337,TOEFL.Score = 118, University.Rating = 4, SOP.Rating = 4, LOR.Rating = 4.5,Research.Experience = 1, CGPA = 9.65)
print(paste0("Observed chance of admission: ",0.92))
## [1] "Observed chance of admission: 0.92"
predicted <- predict(admission.pruned, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.927674418604651"
newadmit <- data.frame(GRE.Score = 308,TOEFL.Score = 101, University.Rating = 2, SOP.Rating = 3, LOR.Rating = 4, Research.Experience = 0, CGPA = 7.9)
print(paste0("Observed chance of admission: ",0.64))
## [1] "Observed chance of admission: 0.64"
predicted <- predict(admission.pruned, newdata = newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.501944444444444"

This is to evaluate whether pruning the above tree provides better results. The complexity parameter cp used here is the one associated with the lowest cross-validated error, indicated by the dotted line in the graph.

The rpart tree is the input to this pruning function, and the output tree has a reduced depth. It can also be observed that the RMSE values for both the training and testing datasets increase slightly compared to the original tree. The training increase is expected, since the fully grown tree fits (and likely overfits) the training data more closely than the pruned tree; in this case, pruning did not improve out-of-sample accuracy either.

Random Forest

# Creating the forest
set.seed(234)
j <- 500 # Hyper Parameter 1 - Number of replicas of dataset
k <- round(sqrt(7)) # Hyper Parameter 2 - Number of random variables considered for split at every step
admission.forest <- randomForest(Chance.of.Admit ~.,   data=train_set, mtry=k, ntree=j)
print(admission.forest) # view results
## 
## Call:
##  randomForest(formula = Chance.of.Admit ~ ., data = train_set,      mtry = k, ntree = j) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.004060214
##                     % Var explained: 79.55
plot(admission.forest)

importance(admission.forest) # importance of each predictor
##                     IncNodePurity
## GRE.Score              1.39063240
## TOEFL.Score            0.86814593
## University.Rating      0.28933488
## SOP.Rating             0.32537368
## LOR.Rating             0.19127949
## CGPA                   2.62237251
## Research.Experience    0.08358798
varImpPlot(admission.forest)

# Validating against the training data set
predict.forest.train <- predict(admission.forest, train_set)
mse.tree5 <- mean((predict.forest.train- train_set$Chance.of.Admit)^2)
rmse5 <- sqrt(mse.tree5)
rmse5
## [1] 0.03058209
#Validating against the testing data set
predict.forest.test <- predict(admission.forest, test_set)
mse.tree6 <- mean((predict.forest.test- test_set$Chance.of.Admit)^2)
rmse6 <- sqrt(mse.tree6)
rmse6
## [1] 0.06303027
#Comparing the predicted values against observed values
newadmit <- data.frame(GRE.Score = 337,TOEFL.Score = 118, University.Rating = 4, SOP.Rating = 4, LOR.Rating = 4.5,Research.Experience = 1, CGPA = 9.65)
print(paste0("Observed chance of admission: ",0.92))
## [1] "Observed chance of admission: 0.92"
predicted <- predict(admission.forest, newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.926307000000001"
newadmit <- data.frame(GRE.Score = 308,TOEFL.Score = 101, University.Rating = 2, SOP.Rating = 3, LOR.Rating = 4, Research.Experience = 0, CGPA = 7.9)
print(paste0("Observed chance of admission: ",0.64))
## [1] "Observed chance of admission: 0.64"
predicted <- predict(admission.forest, newadmit)
print(paste0("Predicted chance of admission: ",(predicted)))
## [1] "Predicted chance of admission: 0.556253333333334"

As seen in the plot, the error decreases as the number of trees increases. Hyper parameter 2 ensures that 3 randomly chosen variables are considered for the split at every step of each bootstrapped tree, which decorrelates the trees and reduces the variance of the ensemble. As a result, the random forest achieves a lower RMSE than both the fully grown and the pruned decision tree. This approach also helps in determining the relative importance of each variable in decision making.
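
If desired, mtry could be tuned rather than fixed at sqrt(p); one simple sketch comparing out-of-bag error across candidate values (illustrative, not part of the original analysis):

# Compare OOB MSE for a few mtry values
set.seed(234)
oob_mse <- sapply(2:5, function(m) {
  rf <- randomForest(Chance.of.Admit ~ ., data = train_set, mtry = m, ntree = 500)
  tail(rf$mse, 1) # OOB MSE after all 500 trees
})
names(oob_mse) <- paste0("mtry=", 2:5)
oob_mse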

Chapter 5: Comparison and Conclusion

Comparison:

Since the response variable is continuous rather than categorical, RMSE is our primary metric for comparing the prediction accuracy of these models.

From the tree outputs, it is clear that the RMSE values, on both the in-sample and out-of-sample datasets, rank Random Forest < CART < Pruned Tree.

However, upon further comparison, linear regression fits the data better than the random forest, which is rather unusual. This anomaly can be attributed to the small size of the dataset and the presence of many independent variables.

While both models convey the relative importance of the variables equally well, the linear regression model accounts for a greater percentage of the variation in the training data (81.4%) than the random forest (79.5%). As reported in the model outputs above, it also has a lower RMSE when predicting new (out-of-sample) data.

We also compared this result with the IEEE paper published by the owner of this dataset, and our observations matched the paper's conclusion.

Conclusion:

Therefore, we can conclude that linear regression emerges as the champion model for this supervised learning task, while the random forest emerges as the challenger model for this dataset.

Chance.of.Admit = -1.3264822 + 0.0020715 * (GRE.Score) + 0.0026579 * (TOEFL.Score) + 0.0098109 * (SOP.Rating) + 0.0139703 * (LOR.Rating) + 0.1177838 * (CGPA) + 0.0260941 * (Research.Experience)
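
As a sanity check, applying this equation to the first profile from the accuracy check above reproduces the model's prediction:

-1.3264822 + 0.0020715*337 + 0.0026579*118 + 0.0098109*4 +
  0.0139703*4.5 + 0.1177838*9.65 + 0.0260941*1 # = 0.9501, matching predict(model_4)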