Brief introduction

Goals

In this project, we use the Admission_Predict dataset (in CSV format) to predict a student's chance of being admitted by a university based on several academic performance measurements. To obtain the most accurate result, we work through several steps, such as data preprocessing, a t-test, feature selection, cross-validation, and model selection, to train a linear regression model, make predictions, and measure its performance.

Dataset information

This dataset was created for the prediction of graduate admissions from an Indian perspective. The dataset "Admission_Predict.csv" can be found by clicking the link here. The dataset cited here is owned by Mohan S Acharya, Asfia Armaan, and Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science, 2019.

Content:

The dataset contains several parameters:

  1. GRE Scores (out of 340)
  2. TOEFL Scores (out of 120)
  3. University Rating (out of 5)
  4. Statement of Purpose and Letter of Recommendation Strength (out of 5)
  5. Undergraduate GPA (out of 10)
  6. Research Experience (either 0 or 1)
  7. Chance of Admit (ranging from 0 to 1)

The last column is the response variable: a continuous value between 0 and 1 indicating the chance of getting admission.

Get Data

First, we need to load our admission dataset into R.
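
A minimal sketch of the loading step, assuming the file Admission_Predict.csv sits in the working directory (the filename and path are assumptions):

# read the CSV file into a data frame; the path is an assumption
admission_data <- read.csv("Admission_Predict.csv", header = TRUE)
head(admission_data)  # preview the first few rows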

Explore Data

# produces a numerical summary of each variable in our dataset
summary(admission_data)
##    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
##  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000    
##  1st Qu.:125.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000    
##  Median :250.5   Median :317.0   Median :107.0   Median :3.000    
##  Mean   :250.5   Mean   :316.5   Mean   :107.2   Mean   :3.114    
##  3rd Qu.:375.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000    
##  Max.   :500.0   Max.   :340.0   Max.   :120.0   Max.   :5.000    
##       SOP             LOR             CGPA          Research   
##  Min.   :1.000   Min.   :1.000   Min.   :6.800   Min.   :0.00  
##  1st Qu.:2.500   1st Qu.:3.000   1st Qu.:8.127   1st Qu.:0.00  
##  Median :3.500   Median :3.500   Median :8.560   Median :1.00  
##  Mean   :3.374   Mean   :3.484   Mean   :8.576   Mean   :0.56  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:9.040   3rd Qu.:1.00  
##  Max.   :5.000   Max.   :5.000   Max.   :9.920   Max.   :1.00  
##  Chance.of.Admit 
##  Min.   :0.3400  
##  1st Qu.:0.6300  
##  Median :0.7200  
##  Mean   :0.7217  
##  3rd Qu.:0.8200  
##  Max.   :0.9700

From the summary of the data, we can see that all of our columns are of numerical type. However, the Research column takes only two possible values (0 and 1), so it is more convenient to convert it to a factor.

# convert the Research column from numerical to factor data type and assign labels to each possible value
admission_data$Research <- factor(admission_data$Research, levels = c(0,1), labels = c("No","Yes"))
str(admission_data)  # check the structure of dataset
## 'data.frame':    500 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

Check for missing values

Then, before fitting a model on our dataset, we need to check the data for missing or empty values and handle them, because we cannot make predictions from unavailable values, and they could severely affect the accuracy of the results.

# checking and counting numbers of missing values
colSums(is.na(admission_data))
##        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
##                 0                 0                 0                 0 
##               SOP               LOR              CGPA          Research 
##                 0                 0                 0                 0 
##   Chance.of.Admit 
##                 0
# Checking for empty values
colSums(admission_data=='')
##        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
##                 0                 0                 0                 0 
##               SOP               LOR              CGPA          Research 
##                 0                 0                 0                 0 
##   Chance.of.Admit 
##                 0

As the checks show, there are no missing values in any column of our dataset, so the data is complete and ready for further analysis.

Distributions:

We should then analyse the distribution of our data. First, we plot a histogram of the response variable Y to check visually whether it seems normally distributed.
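
A minimal sketch of such a histogram (the bin count is an arbitrary assumption):

# histogram of the response variable
hist(admission_data$Chance.of.Admit, breaks = 20,
     main = "Histogram of Chance of Admit", xlab = "Chance of Admit")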

# we can also do qqplot to check the normality of our data
qqnorm(admission_data$Chance.of.Admit)
qqline(admission_data$Chance.of.Admit, col = 2)

We can see from the histogram that the distribution of our response variable looks roughly normal, and the Q-Q line is roughly straight, so the response is approximately normally distributed.

Next, we can plot our predictor columns to check their distributions.
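
One way to draw these, looping over the numerical predictors (the panel layout and column list are assumptions):

# histograms of the numerical predictor columns
num_cols <- c("GRE.Score", "TOEFL.Score", "University.Rating", "SOP", "LOR", "CGPA")
par(mfrow = c(2, 3))  # 2x3 grid of panels
for (col in num_cols) {
  hist(admission_data[[col]], main = col, xlab = col)
}
par(mfrow = c(1, 1))  # reset the plotting layout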

According to the histograms, the distributions of these numerical columns appear approximately normal, so we do not need to transform them at this stage.

Relationships:

In the next step, we need to look at correlations between numerical columns.

pairs(admission_data[-1])  # scatterplot matrix of all columns except Serial.No.

# use cor.test to test each correlation against the null hypothesis that the population correlation is zero
cor.test(~ CGPA + Chance.of.Admit, data = admission_data)
## 
##  Pearson's product-moment correlation
## 
## data:  CGPA and Chance.of.Admit
## t = 41.855, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8613745 0.9004286
## sample estimates:
##       cor 
## 0.8824126
cor.test(~ GRE.Score + Chance.of.Admit, data = admission_data)
## 
##  Pearson's product-moment correlation
## 
## data:  GRE.Score and Chance.of.Admit
## t = 30.862, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7779406 0.8384601
## sample estimates:
##       cor 
## 0.8103506
cor.test(~ TOEFL.Score + Chance.of.Admit, data = admission_data)
## 
##  Pearson's product-moment correlation
## 
## data:  TOEFL.Score and Chance.of.Admit
## t = 28.972, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7571359 0.8227603
## sample estimates:
##       cor 
## 0.7922276

As the correlation plot shows, all six of the other variables appear to have a linear relationship with the response. The relationships between the response and CGPA, GRE score, and TOEFL score are the most pronounced.

# test correlations between predictors
cor.test(~ CGPA + GRE.Score, data = admission_data)
## 
##  Pearson's product-moment correlation
## 
## data:  CGPA and GRE.Score
## t = 32.686, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7958222 0.8518743
## sample estimates:
##      cor 
## 0.825878
cor.test(~ CGPA + TOEFL.Score, data = admission_data)
## 
##  Pearson's product-moment correlation
## 
## data:  CGPA and TOEFL.Score
## t = 30.887, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7781969 0.8386529
## sample estimates:
##       cor 
## 0.8105735
cor.test(~ TOEFL.Score + GRE.Score, data = admission_data)
## 
##  Pearson's product-moment correlation
## 
## data:  TOEFL.Score and GRE.Score
## t = 32.852, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7973476 0.8530152
## sample estimates:
##       cor 
## 0.8272004

We can also see that CGPA is strongly correlated with both GRE score and TOEFL score, and GRE score and TOEFL score also appear strongly correlated with each other.

Now we should plot and examine the relationships between the response and the predictor columns.
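
The warnings below come from the loess smoother, so the plots were presumably loess-smoothed scatter plots; a sketch under that assumption:

# loess-smoothed scatter plots of the response against each numerical predictor
num_cols <- c("GRE.Score", "TOEFL.Score", "University.Rating", "SOP", "LOR", "CGPA")
par(mfrow = c(2, 3))
for (col in num_cols) {
  scatter.smooth(admission_data[[col]], admission_data$Chance.of.Admit,
                 xlab = col, ylab = "Chance of Admit")
}
par(mfrow = c(1, 1))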

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : There are other near singularities as well. 1
## (the same four loess warnings are repeated for each of the remaining panels)

We can see from these plots that GRE score, SOP, CGPA, TOEFL score, and Research might be good predictors of the chance of admission.

Issues:

  1. According to the plots, LOR, SOP, and University Rating are non-continuous numerical data.

To solve this, we can convert them into a categorical data type.

admission_data$SOP <- cut(admission_data$SOP, 3, labels = c("low","med","high"))
admission_data$LOR <- cut(admission_data$LOR, 3, labels = c("low","med","high"))
admission_data$University.Rating <- cut(admission_data$University.Rating, 3, labels = c("low","med","high"))
str(admission_data)
## 'data.frame':    500 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: Factor w/ 3 levels "low","med","high": 3 3 2 2 1 3 2 1 1 2 ...
##  $ SOP              : Factor w/ 3 levels "low","med","high": 3 3 2 2 1 3 2 2 1 2 ...
##  $ LOR              : Factor w/ 3 levels "low","med","high": 3 3 2 2 2 2 3 3 1 2 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

We have transformed these three variables into a categorical type. Now we re-plot them using boxplots.
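
A sketch of those boxplots (the panel layout is an assumption):

# boxplots of the response against each of the three new factor columns
par(mfrow = c(1, 3))
boxplot(Chance.of.Admit ~ University.Rating, data = admission_data, main = "University Rating")
boxplot(Chance.of.Admit ~ SOP, data = admission_data, main = "SOP")
boxplot(Chance.of.Admit ~ LOR, data = admission_data, main = "LOR")
par(mfrow = c(1, 1))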

Now the relationship between these three variables and the response variable has become clearer.

  2. There are numerous outliers in the GRE score and TOEFL score plots. To solve this, we can transform the GRE and TOEFL score columns by taking the log of the original values.
admission_data$GRE.Score <- log(admission_data$GRE.Score)
admission_data$TOEFL.Score <- log(admission_data$TOEFL.Score)

We then plot the transformed results.
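
For example (the choice of boxplots is an assumption):

# re-check the transformed score columns for outliers
par(mfrow = c(1, 2))
boxplot(admission_data$GRE.Score, main = "log(GRE.Score)")
boxplot(admission_data$TOEFL.Score, main = "log(TOEFL.Score)")
par(mfrow = c(1, 1))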

T-test

We can now perform a hypothesis test. First, we split our dataset into two groups based on the value of the Research column, i.e. students with research experience and students without.

# split the dataset into two groups based on the Research column and store them separately
student_with_research <- admission_data[which(admission_data$Research=="Yes"),]
student_without_research <- admission_data[which(admission_data$Research=="No"),]
# calculate mean of two groups
mean_with_research <- mean(student_with_research$Chance.of.Admit)
mean_without_research <- mean(student_without_research$Chance.of.Admit)

We can see that the means of the two groups appear different. We then use t.test to examine whether the two groups of students have different mean admission chances.

# test whether the chance of admission depends on research experience
t.test(Chance.of.Admit ~ Research, mu = 0, alternative = "two.sided", conf.level = 0.95, data = admission_data)
## 
##  Welch Two Sample t-test
## 
## data:  Chance.of.Admit by Research
## t = -14.707, df = 487.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1757700 -0.1343404
## sample estimates:
##  mean in group No mean in group Yes 
##         0.6349091         0.7899643

We can conclude from the t-test result that the probability of seeing data like ours if there were no difference in admission chance between students with and without research experience is below 2.2e-16, far smaller than any conventional significance level (e.g. 0.05). So there is a difference in the chance of admission between the two groups, and students with research experience have a higher mean chance of admission.

We can view this relationship through a box plot.
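
A sketch of that box plot:

# admission chance for students with and without research experience
boxplot(Chance.of.Admit ~ Research, data = admission_data,
        xlab = "Research experience", ylab = "Chance of Admit")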

Prepare Data for analysis

The Serial.No. column is just the sequential number of each data entry, which is irrelevant to model fitting and not helpful for prediction, so we should remove it from our dataset.

library(dplyr)  # load dplyr for select()
## Warning: package 'dplyr' was built under R version 3.6.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# remove redundant column
admission_data <- select(admission_data, - Serial.No.)  # drop the Serial.No. column

Then we split the data 50:50 into training and test rows, so that we can test our model's results and obtain performance measurements.

# split data into train and test set
set.seed(1)  # set random seed
train <- sample(1:nrow(admission_data), floor(nrow(admission_data)/2) )  # choose rows to be placed in the training set
test_set <- admission_data[-train,]  # test set
train_set <- admission_data[train,] # training set

Now we have a training set of 250 rows and a test set with the same number of rows.

Perform Linear Regression

After preprocessing the dataset, we can now fit a linear model. To cycle through all possible combinations of predictors and generate the set of best-performing models, we use regsubsets() from the leaps package.

library(leaps)  # load leaps for regsubsets()
## Warning: package 'leaps' was built under R version 3.6.1
# use regsubsets to select the best predictor subset of each size, with method = 'exhaustive'
regsubsets.out <- regsubsets(Chance.of.Admit ~ .,
                              data = admission_data,
                              nbest = 1,  # number of subsets of each size to record
                              nvmax = NULL,  # maximum size of subsets to examine
                              force.in = NULL, force.out = NULL,
                              method = 'exhaustive')
# convert the summary to a data frame and show the selection results
as.data.frame(summary(regsubsets.out)$outmat)
##           GRE.Score TOEFL.Score University.Ratingmed University.Ratinghigh
## 1  ( 1 )                                                                  
## 2  ( 1 )          *                                                       
## 3  ( 1 )                      *                                           
## 4  ( 1 )                      *                                           
## 5  ( 1 )          *           *                                           
## 6  ( 1 )          *           *                                           
## 7  ( 1 )          *           *                                          *
## 8  ( 1 )          *           *                                          *
## 9  ( 1 )          *           *                                          *
## 10  ( 1 )         *           *                    *                     *
##           SOPmed SOPhigh LORmed LORhigh CGPA ResearchYes
## 1  ( 1 )                                   *            
## 2  ( 1 )                                   *            
## 3  ( 1 )                                   *           *
## 4  ( 1 )                              *    *           *
## 5  ( 1 )                              *    *           *
## 6  ( 1 )                      *       *    *           *
## 7  ( 1 )                      *       *    *           *
## 8  ( 1 )               *      *       *    *           *
## 9  ( 1 )       *       *      *       *    *           *
## 10  ( 1 )      *       *      *       *    *           *

Then we use the adjusted R-squared value to compare the performance of each predictor subset.

# create plot of results, it includes fit performance information
plot(regsubsets.out, scale='adjr2', main='Adjusted Rsq')

We can see from the plot that the single-predictor model (CGPA alone) has the smallest adjusted R-squared value, and the value improves as predictors are added.

Model Selection

Next, we select the optimal model based on these predictor-selection results.

  1. First, we build a linear model with CGPA as the single predictor.

lm.single <- lm(Chance.of.Admit ~ CGPA, data = train_set)

  2. For the second linear model, we choose CGPA, GRE.Score, and TOEFL.Score as the predictors.
lm.multi <- lm(Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, data = train_set)
  3. Finally, we fit a model that uses all predictors.
lm.full <- lm(Chance.of.Admit ~ ., data = train_set)

Now we have fitted three models and obtained some descriptive information about them. To choose the optimal one, we first compare their performance on the training data; the chosen model is then evaluated on the test dataset.

library(texreg)  # for side-by-side regression tables
library(caret)   # for RMSE() and cross-validation utilities
## Warning: package 'texreg' was built under R version 3.6.1
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
## Warning: package 'caret' was built under R version 3.6.1
## Loading required package: lattice
## Loading required package: ggplot2
# calculate the rmse and r2 of each model
model_single_rmse <- RMSE(lm.single$fitted.values, obs = train_set$Chance.of.Admit)
model_multi_rmse <- RMSE(lm.multi$fitted.values, obs = train_set$Chance.of.Admit)
model_full_rmse <- RMSE(lm.full$fitted.values, obs = train_set$Chance.of.Admit)
model_rmse_table <- data.frame(model_single_rmse, model_multi_rmse, model_full_rmse)
model_rmse_table
##   model_single_rmse model_multi_rmse model_full_rmse
## 1        0.06762249       0.06186354      0.05885514
# calculate the rsq value of trained model
model_single_rsq <- cor(lm.single$fitted.values, train_set$Chance.of.Admit)^2
model_multi_rsq <- cor(lm.multi$fitted.values, train_set$Chance.of.Admit)^2
model_full_rsq <- cor(lm.full$fitted.values, train_set$Chance.of.Admit)^2
model_rsq_table <- data.frame(model_single_rsq, model_multi_rsq, model_full_rsq)
model_rsq_table
##   model_single_rsq model_multi_rsq model_full_rsq
## 1        0.7811558       0.8168435      0.8342241
# we can also use screenreg to show information about our models and choose the optimal one
screenreg(list(lm.single, lm.multi, lm.full))
## 
## =========================================================
##                        Model 1     Model 2     Model 3   
## ---------------------------------------------------------
## (Intercept)             -1.06 ***   -7.27 ***   -5.92 ***
##                         (0.06)      (1.01)      (1.05)   
## CGPA                     0.21 ***    0.14 ***    0.12 ***
##                         (0.01)      (0.01)      (0.01)   
## TOEFL.Score                          0.35 **     0.30 *  
##                                     (0.13)      (0.13)   
## GRE.Score                            0.90 ***    0.72 ** 
##                                     (0.22)      (0.22)   
## University.Ratingmed                             0.00    
##                                                 (0.01)   
## University.Ratinghigh                            0.03    
##                                                 (0.01)   
## SOPmed                                           0.00    
##                                                 (0.01)   
## SOPhigh                                         -0.01    
##                                                 (0.02)   
## LORmed                                           0.00    
##                                                 (0.01)   
## LORhigh                                          0.03    
##                                                 (0.01)   
## ResearchYes                                      0.02 ** 
##                                                 (0.01)   
## ---------------------------------------------------------
## R^2                      0.78        0.82        0.83    
## Adj. R^2                 0.78        0.81        0.83    
## Num. obs.              250         250         250       
## RMSE                     0.07        0.06        0.06    
## =========================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

We can see from the results that lm.full, the model using all predictors, has the largest R^2 as well as the smallest RMSE, so we choose it as our optimal model. Reading the coefficients from the table above (the med-level coefficients are approximately zero, and the GRE and TOEFL columns hold the log-transformed scores), the fitted model is approximately \[ \hat{y} = -5.92 + 0.12 \cdot CGPA + 0.30 \cdot TOEFL.Score + 0.72 \cdot GRE.Score + 0.03 \cdot I(University.Rating = high) - 0.01 \cdot I(SOP = high) + 0.03 \cdot I(LOR = high) + 0.02 \cdot I(Research = Yes) \]

Cross Validation

We use the 10-fold cross-validation method. k-fold cross-validation evaluates model performance on different subsets of the training data and then calculates the average prediction error.

# first, we define the training control
set.seed(2) 
train.control <- trainControl(method = "cv", number = 10)
# Train the model
cv.lm.full <- train(Chance.of.Admit ~ ., data = train_set, method = "lm",
               trControl = train.control)
# Summarize the results
print(cv.lm.full)
## Linear Regression 
## 
## 250 samples
##   7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 225, 225, 226, 224, 224, 224, ... 
## Resampling results:
## 
##   RMSE        Rsquared  MAE       
##   0.06180049  0.823202  0.04460971
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Evaluation

After choosing the model with the best performance measures, we can make predictions using our trained model and store them as a new column of the test dataset.

Prediction evaluation results using the all-predictor model lm.full:

# making prediction
test_set$full_model_output <- predict(lm.full, newdata = test_set)

Now we calculate the RMSE and R^2 values on the test set.

# calculate the RMSE on the test dataset
testset_rmse <- RMSE(test_set$full_model_output, obs = test_set$Chance.of.Admit)
testset_rmse
## [1] 0.06219085
# calculate the R2 value on the test dataset
testset_rsq <- cor(test_set$full_model_output, test_set$Chance.of.Admit)^2
testset_rsq
## [1] 0.7983604

Evaluation result: the RMSE on the test dataset is 0.062, a little higher than the training RMSE of 0.059, and the test R^2 is 0.798, a little lower than the training R^2 of 0.834.

Perform Decision Tree fitting

In this section, we will build regression tree models. We first load the tree package, which provides functionality for classification and regression trees:
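
library(tree)  # provides tree(), cv.tree() and prune.tree()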

Pruning method

First, we grow a regression tree using all predictors; we will then consider pruning it.

set.seed(2)  # set random seed
# fit the tree model with training data
tree.admission <- tree(Chance.of.Admit ~ ., admission_data, subset = train)
summary(tree.admission)
## 
## Regression tree:
## tree(formula = Chance.of.Admit ~ ., data = admission_data, subset = train)
## Variables actually used in tree construction:
## [1] "CGPA"        "TOEFL.Score" "GRE.Score"  
## Number of terminal nodes:  8 
## Residual mean deviance:  0.003881 = 0.9393 / 242 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.23030 -0.03227  0.00234  0.00000  0.03757  0.20630
# plot the trees
plot(tree.admission)
text(tree.admission, pretty = 0)

Then we should consider whether pruning the tree might lead to improved results. We use cross-validation with cv.tree() to find the level of pruning that minimizes the deviance.

set.seed(2)
cv.admission <- cv.tree(tree.admission)
names(cv.admission)
## [1] "size"   "dev"    "k"      "method"
cv.admission  # reports, for each tree considered, the number of terminal nodes, the corresponding error rate, and the cost-complexity parameter k
## $size
## [1] 8 7 6 5 4 3 2 1
## 
## $dev
## [1] 1.294149 1.298796 1.298796 1.435518 1.549878 1.857434 2.411289 5.236328
## 
## $k
## [1]       -Inf 0.05250431 0.05312685 0.08429825 0.11200025 0.32733751
## [7] 0.58160702 3.07363644
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
# we plot the deviance as a function of both size of tree (number of nodes), and tree penalty k.
par(mfrow = c(1, 2))
plot(cv.admission$size, cv.admission$dev, type = "b")
plot(cv.admission$k, cv.admission$dev, type = "b")

As we can see from the plots, the deviance is lowest at tree size = 8 and k = 0.05. We now examine the best-performing tree sizes.

cv.admission$size <- cv.admission$size[which(cv.admission$size != 1)]  # exclude tree size = 1
# store the tree sizes to evaluate
n_vals <- cv.admission$size
n_vals
## [1] 8 7 6 5 4 3 2
# data frame to hold the RMSE for each tree size
results <- data.frame(size = rep(0, length(n_vals)),
                      rmse = rep(0, length(n_vals)))

We use a for loop to run over each tree size, calling prune.tree() to create the corresponding pruned tree and predict() to measure its performance on the test dataset.

par(mfrow=c(1,2))
for (i in n_vals) {
  results$size[i-1] <- i
  prune.admission <- prune.tree(tree.admission, best = i, method = "deviance")
  plot(prune.admission)
  text(prune.admission, pretty = 0)
  tree.pred <- predict(prune.admission, test_set)  # make predictions on the test set
  prune_rmse <- RMSE(tree.pred, obs = test_set$Chance.of.Admit)  # compare predictions to the test responses
  results$rmse[i-1]  <- prune_rmse
}

##   size      rmse
## 1    2 0.1868216
## 2    3 0.1989816
## 3    4 0.2004313
## 4    5 0.2006124
## 5    6 0.2008099
## 6    7 0.2009120
## 7    8 0.2009812

We can see from the RMSE results that the pruned tree of size 2 has the lowest RMSE.

Bagging method

In this section, we will build a bagged decision tree model.

library(randomForest)  # for bagging and random forests
## Warning: package 'randomForest' was built under R version 3.6.1
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
# there are 7 predictors we can use to predict the chance of admit.
set.seed(2)
p <- 7
bag.admission <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = p, importance = TRUE, ntree=20)
bag.admission  # look at the output
## 
## Call:
##  randomForest(formula = Chance.of.Admit ~ ., data = admission_data,      mtry = p, importance = TRUE, ntree = 20, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 20
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 0.005243639
##                     % Var explained: 74.91

Next, we explore how the out-of-bag (OOB) error improves with the number of trees used; the number of trees is controlled by the ntree argument. We now plot the OOB MSE against the number of trees.

rf_tree <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = p, ntree = 500, do.trace = 100)
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  100 | 0.004451    21.30 |
##  200 | 0.004438    21.24 |
##  300 | 0.004454    21.32 |
##  400 | 0.004439    21.24 |
##  500 | 0.004399    21.05 |
# rf_tree$mse stores the MSE calculated on the Out-Of-Bag  
plot(rf_tree$mse, xlab="N Trees", ylab="MSE")

We set ntree = 500 and draw the plot. We can see that the MSE decreases gradually as the number of trees grows from 0 to about 150, and remains approximately stable from 150 to 500. We therefore keep ntree = 500 and rebuild the optimal bagged decision tree.

set.seed(2)
p <- 7
bag.admission.optimal <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = p, importance = TRUE, ntree = 500)
bag.admission.optimal  # look at the output
## 
## Call:
##  randomForest(formula = Chance.of.Admit ~ ., data = admission_data,      mtry = p, importance = TRUE, ntree = 500, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 0.004459445
##                     % Var explained: 78.66

Finally, we should evaluate performance on the test dataset and see how well the model fits the data.

# make prediction
set.seed(2)
bag.optimal.pred <- predict(bag.admission.optimal, newdata = test_set)
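
A sketch of the fit plot referred to below (the styling is an assumption):

# predicted vs. actual values on the test set, with the ideal y = x line
plot(test_set$Chance.of.Admit, bag.optimal.pred,
     xlab = "Actual Chance of Admit", ylab = "Predicted Chance of Admit")
abline(0, 1, col = 2)  # points near this line indicate a good fit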

We can see that our data points are evenly distributed around the straight line, indicating that our model fits the data well.

# calculate the RMSE on the test dataset
bag_test_rmse <- RMSE(bag.optimal.pred, obs = test_set$Chance.of.Admit)
bag_test_rmse
## [1] 0.06484412

Random Forest method

In this section, we use the random forest method. Bagging is a special case of a random forest with m = p, so we only need to change the number of variables tried at each split via the mtry argument.

# we used all 7 variables above, but we can first limit the number of predictors tried at each split to mtry = 4
set.seed(2)
rf.admission <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = 4, importance = TRUE)  # set mtry=4
rf.admission
## 
## Call:
##  randomForest(formula = Chance.of.Admit ~ ., data = admission_data,      mtry = 4, importance = TRUE, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.004163098
##                     % Var explained: 80.08
# rf.admission$mse stores the MSE calculated on the out-of-bag samples
plot(rf.admission$mse, xlab="N Trees", ylab="MSE")

Change mtry to other values and redraw the plot.

  • m=2
set.seed(2)
rf.admission <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = 2, importance = TRUE)  # set mtry=2
rf.admission
## 
## Call:
##  randomForest(formula = Chance.of.Admit ~ ., data = admission_data,      mtry = 2, importance = TRUE, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.003985625
##                     % Var explained: 80.93
  • m=3
set.seed(2)
rf.admission <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = 3, importance = TRUE)  # set mtry=3
rf.admission
## 
## Call:
##  randomForest(formula = Chance.of.Admit ~ ., data = admission_data,      mtry = 3, importance = TRUE, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.004043383
##                     % Var explained: 80.65

We can look at the variable importance measures using the importance() function.

importance(rf.admission)
##                     %IncMSE IncNodePurity
## GRE.Score         25.225787    1.58196204
## TOEFL.Score       13.178529    0.74814184
## University.Rating  7.027358    0.20980123
## SOP               14.596085    0.22722982
## LOR                7.129761    0.09959360
## CGPA              32.973516    2.12004148
## Research          10.335855    0.07400329

Here we can see that two measures of variable importance are reported:

  1. the mean decrease in prediction accuracy on the out-of-bag samples when a given variable is excluded from the model;
  2. the total decrease in node impurity that results from splits over that variable, averaged over all trees.

We can draw a plot of these importance measures.

# Plots of these importance measures
varImpPlot(rf.admission)

We can see that CGPA, GRE.Score, and TOEFL.Score are the three most important variables.

We now build the final random forest using mtry = 3 and ntree = 500 (the OOB MSE for mtry = 2 was very close).

set.seed(2)
rf.admission.optimal <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = 3, ntree=500, importance = TRUE)  # set mtry=3
rf.admission.optimal
## 
## Call:
##  randomForest(formula = Chance.of.Admit ~ ., data = admission_data,      mtry = 3, ntree = 500, importance = TRUE, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.004043383
##                     % Var explained: 80.65

Evaluate performance on the test dataset.

# make prediction
set.seed(2)
rf.optimal.pred <- predict(rf.admission.optimal, newdata = test_set)
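
Again, a sketch of the corresponding fit plot:

# predicted vs. actual values for the random forest on the test set
plot(test_set$Chance.of.Admit, rf.optimal.pred,
     xlab = "Actual Chance of Admit", ylab = "Predicted Chance of Admit")
abline(0, 1, col = 2)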

Again, the data points are evenly distributed around the straight line, indicating that the model fits the data well.

# calculate the RMSE on the test dataset
rf_test_rmse <- RMSE(rf.optimal.pred, obs = test_set$Chance.of.Admit)
rf_test_rmse
## [1] 0.06265062

Summary

In this report, we used the admission dataset to predict a student's chance of being admitted by a university based on several academic performance measurements. We first went through the data preprocessing step, handling missing and empty values and converting data types. We then built optimal linear regression and decision tree models through model selection and cross-validation. Finally, we made predictions with the resulting models and evaluated them, using RMSE as the indicator.

The table below compares model performance across methods (test-set RMSE):

##   linear_testset_rmse rf_test_rmse bag_test_rmse prune_rmse
## 1          0.06219085   0.06265062    0.06484412  0.1868216

Conclusion: as the RMSE table shows, the linear model has relatively higher accuracy and a lower error rate than the tree models, but in terms of interpretability and clarity, the decision tree model has advantages over the linear model.