In this project, we will be using the Admission_predict dataset in csv format to predict the chances of students getting admission by a university based on several academic performance measurement. To yield the most accurate result, we will be going through several steps such as data preprocessing, t-test, feature selection, cross validation, model selection, etc. to train a linear regression model, make prediction and measure its performance.
This dataset is created for prediction of Graduate Admissions from an Indian perspective. We can find the dataset “Admission_Predict.csv” by clicking the link here. Dataset cited here is owned by Mohan S Acharya, Asfia Armaan, Aneeta S Antony : A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019
Content:
The dataset contains several parameters:
The last column is the response variable, which is a continuous value between 0 and 1, indicating the chances of getting admission
First, we need to load our admission dataset into R.
# produces a numerical summary of each variable in our dataset
summary(admission_data)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:125.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :250.5 Median :317.0 Median :107.0 Median :3.000
## Mean :250.5 Mean :316.5 Mean :107.2 Mean :3.114
## 3rd Qu.:375.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :500.0 Max. :340.0 Max. :120.0 Max. :5.000
## SOP LOR CGPA Research
## Min. :1.000 Min. :1.000 Min. :6.800 Min. :0.00
## 1st Qu.:2.500 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00
## Median :3.500 Median :3.500 Median :8.560 Median :1.00
## Mean :3.374 Mean :3.484 Mean :8.576 Mean :0.56
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00
## Max. :5.000 Max. :5.000 Max. :9.920 Max. :1.00
## Chance.of.Admit
## Min. :0.3400
## 1st Qu.:0.6300
## Median :0.7200
## Mean :0.7217
## 3rd Qu.:0.8200
## Max. :0.9700
From the summary of data, we can find that all of our columns are of numerical data type. But within the Research column, we can see that there are only 2 possible value (0 and 1). In this case, it will be more convenient for us to convert it to factor type.
# convert the Research column from numerical to factor data type and assign labels to each possible value
admission_data$Research <- factor(admission_data$Research, levels = c(0,1), labels = c("No","Yes"))
str(admission_data) # check the structure of dataset
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
Then, before we fit model using our dataset, we need to check data entry for missing values or empty values and handle them, because we can not make prediction using values that are not available, for it could severely influence the accuracy of results.
# checking and counting numbers of missing values
colSums(is.na(admission_data))
## Serial.No. GRE.Score TOEFL.Score University.Rating
## 0 0 0 0
## SOP LOR CGPA Research
## 0 0 0 0
## Chance.of.Admit
## 0
# Checking for empty values
colSums(admission_data=='')
## Serial.No. GRE.Score TOEFL.Score University.Rating
## 0 0 0 0
## SOP LOR CGPA Research
## 0 0 0 0
## Chance.of.Admit
## 0
we can see from the checking result, there are no missing values in each columns of our dataset. So now our dataset is complete and ready to be used to do further analysis.
We should then analyse the distribution of our data. First we plot the Histogram of the response variable Y to check visually if it seems normally distributed.
# we can also do qqplot to check the normality of our data
qqnorm(admission_data$Chance.of.Admit)
qqline(admission_data$Chance.of.Admit, col = 2)
We can see from the histogram that the distribution of our response variable looks normally distributed. And the qqline is roughly straight, so it is approximately normal distributed.
Next, we can plot data of our predictor columns to check their distribution.
According to our histogram, we can see the data distribution of these numerical columns appear to be approximately normal. So we do not need to transform them to be more normal now.
In the next step, we need to look at correlations between numerical columns.
pairs(admission_data[-7])
# we could use test to test correlation value against the null hypothesis that the population correlation is actually zero
cor.test(~ CGPA + Chance.of.Admit, data=admission_data, )
##
## Pearson's product-moment correlation
##
## data: CGPA and Chance.of.Admit
## t = 41.855, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8613745 0.9004286
## sample estimates:
## cor
## 0.8824126
cor.test(~ GRE.Score + Chance.of.Admit, data=admission_data, )
##
## Pearson's product-moment correlation
##
## data: GRE.Score and Chance.of.Admit
## t = 30.862, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7779406 0.8384601
## sample estimates:
## cor
## 0.8103506
cor.test(~ TOEFL.Score + Chance.of.Admit, data=admission_data, )
##
## Pearson's product-moment correlation
##
## data: TOEFL.Score and Chance.of.Admit
## t = 28.972, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7571359 0.8227603
## sample estimates:
## cor
## 0.7922276
we can see from the correlation plot, all the other 6 variables seem to have linear relationship with the response variable. The relationship between the response and the CGPA, GRE score and TOEFL score are the most obvious.
# test correlations between preditors
cor.test(~ CGPA + GRE.Score, data=admission_data, )
##
## Pearson's product-moment correlation
##
## data: CGPA and GRE.Score
## t = 32.686, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7958222 0.8518743
## sample estimates:
## cor
## 0.825878
cor.test(~ CGPA + TOEFL.Score, data=admission_data, )
##
## Pearson's product-moment correlation
##
## data: CGPA and TOEFL.Score
## t = 30.887, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7781969 0.8386529
## sample estimates:
## cor
## 0.8105735
cor.test(~ TOEFL.Score + GRE.Score, data=admission_data, )
##
## Pearson's product-moment correlation
##
## data: TOEFL.Score and GRE.Score
## t = 32.852, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7973476 0.8530152
## sample estimates:
## cor
## 0.8272004
We can also find that there are pretty strong correlation between CGPA & GRE and with TOEFL score. GRE score and TOEFL score is also appears to be strongly correlated.
Now we should plot and look at relationships between response and predictor columns
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : There are other near singularities as well. 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : There are other near singularities as well. 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : There are other near singularities as well. 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : There are other near singularities as well. 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## FALSE, : There are other near singularities as well. 1
we can see from the result, The GRE score and SOP, CGPA, TOEFL score and Research might be good predictors for the chance of admission
issues:
To solve this, we can convert them into categorical data type
admission_data$SOP <- cut(admission_data$SOP, 3, labels = c("low","med","high"))
admission_data$LOR <- cut(admission_data$LOR, 3, labels = c("low","med","high"))
admission_data$University.Rating <- cut(admission_data$University.Rating, 3, labels = c("low","med","high"))
str(admission_data)
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: Factor w/ 3 levels "low","med","high": 3 3 2 2 1 3 2 1 1 2 ...
## $ SOP : Factor w/ 3 levels "low","med","high": 3 3 2 2 1 3 2 2 1 2 ...
## $ LOR : Factor w/ 3 levels "low","med","high": 3 3 2 2 2 2 3 3 1 2 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
we have transformed these 3 variables into categorical type. Now re-plot these variables using boxplot
now, the relationship between these 3 variables and the reponse variable has became more clear
admission_data$GRE.Score <- log(admission_data$GRE.Score)
admission_data$TOEFL.Score <- log(admission_data$TOEFL.Score)
and plot the transformed results
We can now do a hypothesis test. First we split our dataset into 2 groups based on the value of the Research column i.e. students with research experience and students without research experience
# split dataset and store seperately based on research column
student_with_research <- admission_data[which(admission_data$Research=="Yes"),]
student_without_research <- admission_data[which(admission_data$Research=="No"),]
# calculate mean of two groups
mean_with_research <- mean(student_with_research$Chance.of.Admit)
mean_without_research <- mean(student_without_research$Chance.of.Admit)
We can see that the mean between two groups seems different. Then we use t.test to examine whether these two groups of students data have different mean value of admission chance
# test whether the hypothesis that chance of admission is dependant on research.
t.test(Chance.of.Admit ~ Research, mu = 0, alt = "two.sided", conf = 0.95, data=admission_data)
##
## Welch Two Sample t-test
##
## data: Chance.of.Admit by Research
## t = -14.707, df = 487.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1757700 -0.1343404
## sample estimates:
## mean in group No mean in group Yes
## 0.6349091 0.7899643
we can conclude from the t-test result that: the probability of seeing our data that there is no difference in chance of admission between students with or without research experience, is 2.2e-16, which is much smaller than 0.02. So there is a difference in chance of admission between students with or without research experience. Students with research experience has higher mean value of chance of admission.
we can view the relationship through box plot
For the column Serial.No. It is just the sequential number of single data entry, which is irrelevant to our model fitting and is not helpful in making prediction, so we should remove this column from out dataset
## Warning: package 'dplyr' was built under R version 3.6.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# remove redundant column
admission_data <- select(admission_data, - Serial.No.) # drop the Serial.No. column
Then we can split data into 50:50 test and training rows, so that we can test the results our model and get performance measurement.
# split data into train and test set
set.seed(1) # set random seed
train <- sample(1:nrow(admission_data), floor(nrow(admission_data)/2) ) # choose rows to be placed in the training set
test_set <- admission_data[-train,] # test set
train_set <- admission_data[train,] # taining set
Now we have training set of 250 rows and test dataset of same rows.
After the preprocessing of dataset, now we can fit a linear model. To cycle through all possible combinations of predictors and generate set of best performing models, we will use the regsubset.
## Warning: package 'leaps' was built under R version 3.6.1
# using regsubsets to select predictors subset, set method = exhaustive
regsubsets.out <- regsubsets(Chance.of.Admit ~ .,
data = admission_data,
nbest = 1, # number of subsets of each size to record
nvmax = NULL, # maximum size of subsets to examine
force.in = NULL, force.out = NULL,
method = 'exhaustive')
# converts this to dataframe and show the choosing result
as.data.frame(summary(regsubsets.out)$outmat)
## GRE.Score TOEFL.Score University.Ratingmed University.Ratinghigh
## 1 ( 1 )
## 2 ( 1 ) *
## 3 ( 1 ) *
## 4 ( 1 ) *
## 5 ( 1 ) * *
## 6 ( 1 ) * *
## 7 ( 1 ) * * *
## 8 ( 1 ) * * *
## 9 ( 1 ) * * *
## 10 ( 1 ) * * * *
## SOPmed SOPhigh LORmed LORhigh CGPA ResearchYes
## 1 ( 1 ) *
## 2 ( 1 ) *
## 3 ( 1 ) * *
## 4 ( 1 ) * * *
## 5 ( 1 ) * * *
## 6 ( 1 ) * * * *
## 7 ( 1 ) * * * *
## 8 ( 1 ) * * * * *
## 9 ( 1 ) * * * * * *
## 10 ( 1 ) * * * * * *
Then we use Adjusted Rsq value to estimate performance of each predictors.
# create plot of results, it includes fit performance information
plot(regsubsets.out, scale='adjr2', main='Adjusted Rsq')
we can see from the plot that CGPA has the smallest adjr2 value as a predictor.
Next, we should select the optimum model base on the slection results of predictors. 1. We we build a linear model with CGPA as the single predictor.
lm.single <- lm(Chance.of.Admit ~ CGPA, data = train_set)
lm.multi <- lm(Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, data = train_set)
lm.full <- lm(Chance.of.Admit ~ ., data = train_set)
Now that we have generated three models and get some discriptive information of them. To choose the optimal one out of them, we need to evaluate performance of chosen model on our test dataset.
## Warning: package 'texreg' was built under R version 3.6.1
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
## Warning: package 'caret' was built under R version 3.6.1
## Loading required package: lattice
## Loading required package: ggplot2
# calculate the rmse and r2 of each model
model_single_rmse <- RMSE(lm.single$fitted.values, obs = train_set$Chance.of.Admit)
model_multi_rmse <- RMSE(lm.multi$fitted.values, obs = train_set$Chance.of.Admit)
model_full_rmse <- RMSE(lm.full$fitted.values, obs = train_set$Chance.of.Admit)
model_rmse_table <- data.frame(model_single_rmse, model_multi_rmse, model_full_rmse)
model_rmse_table
## model_single_rmse model_multi_rmse model_full_rmse
## 1 0.06762249 0.06186354 0.05885514
# calculate the rsq value of trained model
model_single_rsq <- cor(lm.single$fitted.values, train_set$Chance.of.Admit)^2
model_multi_rsq <- cor(lm.multi$fitted.values, train_set$Chance.of.Admit)^2
model_full_rsq <- cor(lm.full$fitted.values, train_set$Chance.of.Admit)^2
model_rsq_table <- data.frame(model_single_rsq, model_multi_rsq, model_full_rsq)
model_rsq_table
## model_single_rsq model_multi_rsq model_full_rsq
## 1 0.7811558 0.8168435 0.8342241
# we can also use screenreg to show imformation about our model and choose the optimal one
screenreg(list(lm.single, lm.multi, lm.full))
##
## =========================================================
## Model 1 Model 2 Model 3
## ---------------------------------------------------------
## (Intercept) -1.06 *** -7.27 *** -5.92 ***
## (0.06) (1.01) (1.05)
## CGPA 0.21 *** 0.14 *** 0.12 ***
## (0.01) (0.01) (0.01)
## TOEFL.Score 0.35 ** 0.30 *
## (0.13) (0.13)
## GRE.Score 0.90 *** 0.72 **
## (0.22) (0.22)
## University.Ratingmed 0.00
## (0.01)
## University.Ratinghigh 0.03
## (0.01)
## SOPmed 0.00
## (0.01)
## SOPhigh -0.01
## (0.02)
## LORmed 0.00
## (0.01)
## LORhigh 0.03
## (0.01)
## ResearchYes 0.02 **
## (0.01)
## ---------------------------------------------------------
## R^2 0.78 0.82 0.83
## Adj. R^2 0.78 0.81 0.83
## Num. obs. 250 250 250
## RMSE 0.07 0.06 0.06
## =========================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
We can see from the result :that model, lm.full using all predictors has the largest R^2 value as well as the smallest RMSE value, thus, we choose this model as our optimal one. And the formula for our linear model is \[ y = -9.01 +0.12*CGPA + 0.30*TOEFL.Score+4.13*GRE.Score+University.Rating*0.03-0.01*SOP+0.03*LOR+0.02*Research\]
We use 10-fold cross validation method.k-fold cross-validation method evaluates the model performance on different subset of the training data and then calculate the average prediction error rate
# First we Define training control,
set.seed(2)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
cv.lm.full <- train(Chance.of.Admit ~ ., data = train_set, method = "lm",
trControl = train.control)
# Summarize the results
print(cv.lm.full)
## Linear Regression
##
## 250 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 225, 225, 226, 224, 224, 224, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.06180049 0.823202 0.04460971
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
After choosing the model with the best performance measure, we can make predictions using our trained model and store in the test dataset
prediction evaluation result using all predictor model lm.full:
# making prediction
test_set$full_model_output <- predict(lm.full, newdata = test_set)
now we should calculate the RMSE and R2 value of our testset
# calculate the RMSE on the test dataset
testset_rmse <- RMSE(test_set$full_model_output, obs = test_set$Chance.of.Admit)
testset_rmse
## [1] 0.06219085
# calculate the R2 value on the test dataset
testset_rsq <- cor(test_set$full_model_output, test_set$Chance.of.Admit)^2
testset_rsq
## [1] 0.7983604
Evaluation result: It can be seen from above that our RMSE value on test dataset is 0.062 a little higher compared with 0.059, the model RMSE, and the rsq value is 0.798, which are little lower than that of model rsq, which is 0.834.
In this section, we will build regression tree models. load the tree package, its provides functionality for classiffication and regression trees
First, we can build a decision tree using pruning method, using all predictors
set.seed(2) # set random seed
# fit the tree model with training data
tree.admission <- tree(Chance.of.Admit ~ ., admission_data, subset = train)
summary(tree.admission)
##
## Regression tree:
## tree(formula = Chance.of.Admit ~ ., data = admission_data, subset = train)
## Variables actually used in tree construction:
## [1] "CGPA" "TOEFL.Score" "GRE.Score"
## Number of terminal nodes: 8
## Residual mean deviance: 0.003881 = 0.9393 / 242
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.23030 -0.03227 0.00234 0.00000 0.03757 0.20630
# plot the trees
plot(tree.admission)
text(tree.admission, pretty = 0)
Then we should consider whether pruning the tree might lead to improved results. We will use cross-validation with cv.tree() to see the best level of pruning to minimize the deviance
set.seed(2)
cv.admission <- cv.tree(tree.admission)
names(cv.admission)
## [1] "size" "dev" "k" "method"
cv.admission # this reports the number of terminal nodes of each tree considered as well as the corresponding error rate and the value of the cost-complexity parameter used k
## $size
## [1] 8 7 6 5 4 3 2 1
##
## $dev
## [1] 1.294149 1.298796 1.298796 1.435518 1.549878 1.857434 2.411289 5.236328
##
## $k
## [1] -Inf 0.05250431 0.05312685 0.08429825 0.11200025 0.32733751
## [7] 0.58160702 3.07363644
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
# we plot the deviance as a function of both size of tree (number of nodes), and tree penalty k.
par(mfrow = c(1, 2))
plot(cv.admission$size, cv.admission$dev, type = "b")
plot(cv.admission$k, cv.admission$dev, type = "b")
As we can see from the plot, the deviance seems to have lowest value with tree size=8 and k=0.05 Examine the best performing tree.
cv.admission$size <- cv.admission$size[which(cv.admission$size!=1)] # we should exclude the tee size=1
# used to store the predicted value
n_vals <- cv.admission$size
n_vals
## [1] 8 7 6 5 4 3 2
results <- 0
results <- data.frame(size = rep(0,length(n_vals)),
rmse = rep(0,length(n_vals))
)
we will use for loop to run over each tree size and use prune.tree to create the corresponding decision tree. And also predict to measure the performance on the test data set in terms of deviance
par(mfrow=c(1,2))
for (i in n_vals) {
results$size[i-1] <- i
prune.admission <- prune.tree(tree.admission, best = i, method = "deviance")
plot(prune.admission)
text(prune.admission, pretty = 0)
tree.pred <- predict(prune.admission, test_set) # make prediciton
prune_rmse <- RMSE(tree.pred, obs = train_set$Chance.of.Admit)
results$rmse[i-1] <- prune_rmse
}
## size rmse
## 1 2 0.1868216
## 2 3 0.1989816
## 3 4 0.2004313
## 4 5 0.2006124
## 5 6 0.2008099
## 6 7 0.2009120
## 7 8 0.2009812
we can see from the rmse results that model with ntree = 2 has the lowest rmse.
In this section, we will build an bagged decision tree model.
## Warning: package 'randomForest' was built under R version 3.6.1
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# there are 7 predictors we can use to predict the chance of admit.
set.seed(2)
p <- 7
bag.admission <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = p, importance = TRUE, ntree=20)
bag.admission # look at the output
##
## Call:
## randomForest(formula = Chance.of.Admit ~ ., data = admission_data, mtry = p, importance = TRUE, ntree = 20, subset = train)
## Type of random forest: regression
## Number of trees: 20
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 0.005243639
## % Var explained: 74.91
Next, we would explore how cross validated error rate improves with number of trees used. We can change the number of trees using the ntree argument Now we do the plot of OOB performance measured with mse against number of trees
rf_tree = randomForest(Chance.of.Admit ~ ., data = admission_data , subset = train, mtry= p, ntree = 500, do.trace=100)
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 100 | 0.004451 21.30 |
## 200 | 0.004438 21.24 |
## 300 | 0.004454 21.32 |
## 400 | 0.004439 21.24 |
## 500 | 0.004399 21.05 |
# rf_tree$mse stores the MSE calculated on the Out-Of-Bag
plot(rf_tree$mse, xlab="N Trees", ylab="MSE")
We set ntee = 300 at first. Then draw the plot. We can see from the plot that, with number of trees increasing from 0 to 150, MSE decreases gradually. But MSE remains approximately stable as N trees growing from 150 to 500. Therefore, we choose number of trees to be 500, and then re-build the optimal bagged decision tree.
set.seed(2)
p <- 7
bag.admission.optimal <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = p, importance = TRUE, ntree = 500)
bag.admission.optimal # look at the output
##
## Call:
## randomForest(formula = Chance.of.Admit ~ ., data = admission_data, mtry = p, importance = TRUE, ntree = 500, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 0.004459445
## % Var explained: 78.66
Finally, we should evaluate performance on the test dataset, see how well it fit the data.
# make prediction
set.seed(2)
bag.optimal.pred <- predict(bag.admission.optimal, newdata = test_set)
we can see that out datapoint is evenly distributed around the straight line, indicating that our model well fit the data.
# calculate the RMSE on the test dataset
bag_test_rmse <- RMSE(bag.optimal.pred, obs = test_set$Chance.of.Admit)
bag_test_rmse
## [1] 0.06484412
In this section, we will use random forest method. Bagging is a special case of a random forest with m = p. So we can also change the number of variables using the mtry argument
# We used all 7 variables above, but we can first limit the number of predictors used for each tree using mtry = 5 to build a random forest model.
set.seed(2)
rf.admission <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = 4, importance = TRUE) # set mtry=4
rf.admission
##
## Call:
## randomForest(formula = Chance.of.Admit ~ ., data = admission_data, mtry = 4, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 0.004163098
## % Var explained: 80.08
# rf_tree$mse stores the MSE calculated on the Out-Of-Bag
plot(rf.admission$mse, xlab="N Trees", ylab="MSE")
change the mtry to oter values and redraw the plot.
set.seed(2)
rf.admission <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = 2, importance = TRUE) # set mtry=2
rf.admission
##
## Call:
## randomForest(formula = Chance.of.Admit ~ ., data = admission_data, mtry = 2, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.003985625
## % Var explained: 80.93
set.seed(2)
rf.admission <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = 3, importance = TRUE) # set mtry=3
rf.admission
##
## Call:
## randomForest(formula = Chance.of.Admit ~ ., data = admission_data, mtry = 3, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.004043383
## % Var explained: 80.65
we can look at the variable importance measure using the importance() function.
importance(rf.admission)
## %IncMSE IncNodePurity
## GRE.Score 25.225787 1.58196204
## TOEFL.Score 13.178529 0.74814184
## University.Rating 7.027358 0.20980123
## SOP 14.596085 0.22722982
## LOR 7.129761 0.09959360
## CGPA 32.973516 2.12004148
## Research 10.335855 0.07400329
Here we can see that two measures of variable importance are reported 1. based upon the mean decrease of accuracy in predictions on the out of bag samples when a given variable is excluded from the model. 2. measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees.
We can draw a plot of importance of variables
# Plots of these importance measures
varImpPlot(rf.admission)
we can see that the CGPA, GRE.Score and TOEFL.Score are three variables of most importance.
Built the optimal random forest decision tree. using mtry=2 and ntree=500
set.seed(2)
rf.admission.optimal <- randomForest(Chance.of.Admit ~ ., data = admission_data, subset = train, mtry = 3, ntree=500, importance = TRUE) # set mtry=3
rf.admission.optimal
##
## Call:
## randomForest(formula = Chance.of.Admit ~ ., data = admission_data, mtry = 3, ntree = 500, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.004043383
## % Var explained: 80.65
Evaluate performance on the test dataset.
# make prediction
set.seed(2)
rf.optimal.pred <- predict(rf.admission.optimal, newdata = test_set)
we can see that out datapoint is evenly distributed around the straight line, indicating that our model well fit the data.
# calculate the RMSE on the test dataset
rf_test_rmse <- RMSE(rf.optimal.pred, obs = test_set$Chance.of.Admit)
rf_test_rmse
## [1] 0.06265062
In this report, we use the admission dataset to predict the chances of a student getting admission by a university according to several different academic performance measurement data. We first went through the data processing step, in which we handle the missing value and empty value as well as converting data type to. Then we build optimal linear regression model and decision Tree model through model selection and cross validation approach. At last, we make predictions based on our generated model and did corresponding evaluation using RMSE as the indicator.
We can see the table of model performance across the methods(using RMSE) below:
## linear_testset_rmse rf_test_rmse bag_test_rmse prune_rmse
## 1 0.06219085 0.06265062 0.06484412 0.1868216
Conclusion: As it can be seen from the RMSE result table, linear model has reletively higher accuracy and lower error rate compared with tree models, but as for interpretability and clarity, the decision tree model has more advantages over linear model.