#loading necessary packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.1
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'tidyr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(psych)
## Warning: package 'psych' was built under R version 3.5.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# install.packages("GGally")
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(ggplot2) #already attached via the tidyverse, so this line is redundant but harmless
# install.packages("ggExtra")
library(ggExtra)
# install.packages("class")
library(class)
## Warning: package 'class' was built under R version 3.5.2
library(gmodels)
# install.packages("caret")
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# install.packages("fastDummies")
library(fastDummies)
# install.packages("C50")
library(C50)
////PHASE ONE : BUSINESS UNDERSTANDING
This phase focuses on understanding the project from a business perspective and on developing a plan to achieve its goals.
//Determining the business objectives
This phase uncovers the primary business objective. I am a graduate student at Northeastern University. When I was applying to universities from India, I often wanted to know the acceptance rates of universities based on SoP (Statement of Purpose), LoR (Letter of Recommendation), GRE score, and TOEFL score. To help students estimate their chances of admission at different universities in the US based on these criteria, I decided to apply the Machine Learning and Data Mining concepts I learned in the DA5030 class.
1. To use Machine Learning and Data Mining techniques to predict the admission chances of students with different test scores and criteria
//Situation assessment
This phase covers what data is available to meet the primary business goal and lists potential risks and solutions to those risks. The dataset contains information about test scores, CGPA, and university rating; the question arises who rates the universities and what the criteria for rating a university are. The SoP and LoR ratings may be biased, since they are based on the applicant's own knowledge and self-evaluation, which is why they are treated as 'not important' variables. The chance of admit covers different universities across the USA.
//Determining the data mining goals
The data mining goals of this project are to predict the admission chances of new students and to improve the prediction accuracy of different machine learning models.
//Producing a project plan
This project will have the following steps:
1. Explore the dataset using different visualization techniques
2. Transform the data if needed
3. Split the dataset into training and validation sets
4. Use machine learning techniques like KNN, decision trees, and multiple regression to predict the chance of admit
5. Compare the accuracy of the different machine learning models
6. Build a stacked ensemble method to improve model performance
7. Visualize the predictions for better understanding
////PHASE TWO : DATA UNDERSTANDING
This phase consists of collecting the data, exploring it, identifying data quality problems, discovering insights into the data, and detecting interesting subsets to form hypotheses.
//Collecting the data
The data was collected from Kaggle (https://www.kaggle.com/mohansacharya/graduate-admissions). The dataset was originally inspired by the UCLA graduate dataset. There was no need for additional data sources, as the dataset obtained was sufficient to predict the chance of admit using different machine learning algorithms.
//Describe the data
This describes the quantity of the data: the number of records, the features, and the format. The dataset obtained was in CSV format. It had 500 observations with 9 variables:
a) Serial number (numerical)
b) GRE Score (numerical, out of 340)
c) TOEFL Score (numerical, out of 120)
d) University Rating (numerical, out of 5)
e) SoP Rating (numerical, out of 5)
f) LoR Rating (numerical, out of 5)
g) CGPA (numerical, out of 10)
h) Research (categorical, 0 or 1)
i) Chance of Admit (probability, 0 to 1)
#loading the dataset into R environment
grad_admit <- read.csv("/Users/jess/Downloads/graduate-admissions/Admission_Predict_Ver1.1.csv")
str(grad_admit)
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
#keeping an untouched copy of the original data for the regression models later
OG_data <- read.csv("/Users/jess/Downloads/graduate-admissions/Admission_Predict_Ver1.1.csv")
//Exploring the data
Visualizing and reporting the data is part of this phase. I created many graphs to explore the dataset, to find outliers, and to check whether the variables are normally distributed.
#histograms to check the normal distribution
hist(grad_admit$GRE.Score, xlab = "GRE score")
hist(grad_admit$TOEFL.Score, xlab = "TOEFL score")
hist(grad_admit$University.Rating, xlab = "University rating")
hist(grad_admit$SOP, xlab = "SOP rating")
hist(grad_admit$LOR, xlab = "LOR rating")
hist(grad_admit$CGPA, xlab = "CGPA")
hist(grad_admit$Chance.of.Admit , xlab = "chance of admit")
#boxplot to detect outliers if there are any
boxplot(grad_admit$GRE.Score, ylab = "GRE score")
boxplot(grad_admit$TOEFL.Score, ylab = "TOEFL score")
boxplot(grad_admit$University.Rating, ylab ="University rating" )
boxplot(grad_admit$SOP, ylab = "SOP rating")
boxplot(grad_admit$LOR, ylab = "LOR rating")
boxplot(grad_admit$CGPA, ylab = "CGPA")
boxplot(grad_admit$Chance.of.Admit, ylab = "Chance of admit")
summary(grad_admit$Chance.of.Admit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3400 0.6300 0.7200 0.7217 0.8200 0.9700
summary(grad_admit$LOR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.500 3.484 4.000 5.000
#scatterplot for all scores and chance of admit
pairs.panels(grad_admit)
important_param <- grad_admit %>%
select(GRE.Score, TOEFL.Score, CGPA, Chance.of.Admit)
pairs.panels(important_param)
#gre vs chance of admit, with a linear regression line added
p <- ggplot(grad_admit, aes(x = GRE.Score, y = Chance.of.Admit)) +
geom_point(shape = 1) +
geom_smooth(method = lm, color = "red", se = TRUE) +
xlab("GRE score") +
ylab("Chance of admit")
ggMarginal(p, type="histogram")
#research facet plot
ggplot(data = grad_admit) +
geom_point(mapping = aes(x = GRE.Score, y = TOEFL.Score, color = Chance.of.Admit)) +
facet_wrap(~ Research)+
xlab("GRE Score")+
ylab("TOEFL Score")
//Verifying the data quality
The quality of the dataset is good because it doesn't have any missing values. Like any other dataset it has potentially biased attributes, such as the SoP and LoR ratings, but there are no missing values. None of the variables conflict with common sense. It has 500 observations of 9 different variables, one of which is what we are going to predict using machine learning techniques.
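#a quick sanity check for the no-missing-values claim (a minimal added sketch; output omitted)
sum(is.na(grad_admit))      #total number of NA values; expected to be 0
colSums(is.na(grad_admit))  #NA count per variable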
////PHASE THREE: DATA PREPARATION
This phase covers constructing the dataset used in the analyses, transforming the dataset if needed, and integrating and cleansing the data if needed.
//Selection of data
In this phase I have to select the variables that align with my data mining goals and eliminate variables that are not useful. In the graduate admission dataset, I don't need the "Serial number" variable because it provides no useful information or relevance to the dataset. For building the models I used only a few important variables: "GRE Score", "TOEFL Score", and "CGPA". The other variables are less important because they do not correlate as highly with the outcome.
//Clean data
In order to clean the data, I deleted the "Serial number" variable from the dataset. Categorical variables like "Research" were stored as a different data type; I converted "Research" to a factor variable, which is more useful ("University Rating" is handled with dummy variables later).
//Construct data
In this stage, I created dummy variables for the categorical variables in the dataset and produced a new column called "chance_low_high" based on the "Chance of Admit" variable available in the actual dataset. This variable will help with the data mining goals proposed earlier.
//Integrate data
This dataset doesn't require integrating data from different datasets, since the data obtained is itself clean and complete.
//Format data
Some variable types were different, and in order to use them I changed them to factor variables. I imputed the outliers of different variables by replacing them with either the mode or the median of that variable. Outliers are usually defined as values more than 3 standard deviations away from the mean; here I found the outliers from the boxplots created in the data exploration phase.
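#a minimal sketch of the 3-standard-deviation rule described above; find_outliers_3sd is a
#helper written here only for illustration (the boxplots above were the method actually used)
find_outliers_3sd <- function(x) {
m <- mean(x)
s <- sd(x)
x[x < m - 3 * s | x > m + 3 * s] #values outside mean +/- 3 SD
}
find_outliers_3sd(grad_admit$LOR)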
########################################Data Imputation################################
#there is one outlier each in LOR and Chance of Admit: the LOR outlier is 1 and the Chance of Admit outlier is 0.34.
#to impute the LOR outlier, I will replace it with the mode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
modeLOR <- getmode(grad_admit$LOR)
print(modeLOR)
## [1] 3
#The mode of LOR is 3 hence replacing the outlier with 3
grad_admit$LOR[grad_admit$LOR ==1] <- 3
#replacing the Chance of Admit outlier with the median, which is 0.72
grad_admit$Chance.of.Admit[grad_admit$Chance.of.Admit==0.34] <- 0.72
#converting "Research" to a categorical variable
grad_admit$Research[grad_admit$Research ==1] <- "Yes"
grad_admit$Research[grad_admit$Research ==0] <- "No"
grad_admit$Research <- as.factor(grad_admit$Research) #2 is yes, 1 is no
#creating a new categorical variable using the existing variable
grad_admit$chance_low_high[grad_admit$Chance.of.Admit <= 0.50 ] <- "low"
grad_admit$chance_low_high[grad_admit$Chance.of.Admit > 0.50 ] <- "high"
grad_admit$chance_low_high <- as.factor(grad_admit$chance_low_high)
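#a quick added check on the balance of the derived class: roughly 92% of students fall in the
#"high" group, so the accuracies reported later should be judged against this baseline
prop.table(table(grad_admit$chance_low_high))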
#removing unwanted columns
grad_admit$Serial.No. <- NULL
grad_admit$Chance.of.Admit <- NULL
str(grad_admit)
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
## $ chance_low_high : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 2 2 ...
////PHASE FOUR: MODELLING
//Selecting the modeling technique
In order to meet the data mining goals, I have decided to use three modeling techniques:
1. K-Nearest Neighbors using the caret and class packages
2. Multiple Regression
3. Decision tree classification using the C50 package
A stacked ensemble combining several models is built afterwards to improve performance.
//Generate test design
In order to test the models created by the various machine learning techniques, I split the dataset in two: one part for training the model and one for validating it. Depending on the model, between 65% and 80% of the data is used for training and the rest for validation.
//Build the model
I built three different machine learning models on the training datasets created in the previous step.
//Assess the model
The constructed models are assessed by applying them to the validation dataset. This gives the accuracy of each model; if changes are needed or I want to improve a model, I build a new model on the training set and evaluate it on the validation dataset.
##KNN
#creating a new dataset
grad_num_df <- grad_admit %>% select(GRE.Score, TOEFL.Score, CGPA, chance_low_high)
#normalizing the data
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
#passing the function
grad_num_df_n <- as.data.frame(lapply(grad_num_df[,1:3], normalize))
#splitting the dataset randomly
## 75% of the sample size
smp_size <- floor(0.75 * nrow(grad_num_df_n))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(grad_num_df_n)), size = smp_size)
grad_num_df_train <- grad_num_df_n[train_ind, ]
grad_num_df_test <- grad_num_df_n[-train_ind, ]
#creating labels
grad_num_df_train_labels <- grad_num_df[train_ind, 4, drop = TRUE]
grad_num_df_test_labels <- grad_num_df[-train_ind,4, drop = TRUE]
#implementing KNN
grad_admit_test_pred <- knn(train = grad_num_df_train, test = grad_num_df_test,
cl = grad_num_df_train_labels , k = 3)
#evaluating model performance
CrossTable(x = grad_num_df_test_labels, y = grad_admit_test_pred,
prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 125
##
##
## | grad_admit_test_pred
## grad_num_df_test_labels | high | low | Row Total |
## ------------------------|-----------|-----------|-----------|
## high | 112 | 2 | 114 |
## | 0.982 | 0.018 | 0.912 |
## | 0.949 | 0.286 | |
## | 0.896 | 0.016 | |
## ------------------------|-----------|-----------|-----------|
## low | 6 | 5 | 11 |
## | 0.545 | 0.455 | 0.088 |
## | 0.051 | 0.714 | |
## | 0.048 | 0.040 | |
## ------------------------|-----------|-----------|-----------|
## Column Total | 118 | 7 | 125 |
## | 0.944 | 0.056 | |
## ------------------------|-----------|-----------|-----------|
##
##
# overall, 8 students were misclassified. The model correctly classified 112 students as having a high chance of admission and 5 students as having a low chance. It incorrectly classified 2 students with an actual high chance as low (false negatives, treating 'high' as the positive class) and 6 students with an actual low chance as high (false positives).
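#a minimal added check: the overall accuracy computed directly from the predictions
mean(grad_admit_test_pred == grad_num_df_test_labels) #(112 + 5) / 125 = 0.936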
#KNN using caret package with different K values
set.seed(30)
new <- createDataPartition(y = grad_admit$chance_low_high,p = 0.65,list = FALSE) #creating train and test dataset
grad_train <-grad_admit[new,]
grad_test <- grad_admit[-new,]
con <- trainControl(method = "repeatedcv", number = 2, repeats = 5)
knn_model <- train(chance_low_high ~ ., data = grad_train,
method = "knn", trControl = con, preProcess = c("center","scale"))
#distinct object names avoid masking base predict() and class::knn()
knn_pred <- predict(knn_model, newdata = grad_test)
head(knn_pred)
## [1] high high high high high high
## Levels: high low
confusionMatrix(knn_pred, grad_test$chance_low_high)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 160 9
## low 2 3
##
## Accuracy : 0.9368
## 95% CI : (0.8897, 0.968)
## No Information Rate : 0.931
## P-Value [Acc > NIR] : 0.45749
##
## Kappa : 0.3256
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9877
## Specificity : 0.2500
## Pos Pred Value : 0.9467
## Neg Pred Value : 0.6000
## Prevalence : 0.9310
## Detection Rate : 0.9195
## Detection Prevalence : 0.9713
## Balanced Accuracy : 0.6188
##
## 'Positive' Class : high
##
#using the caret package we get an accuracy of about 94%, though 9 actual 'low' cases were predicted as 'high' and 2 actual 'high' cases as 'low'; note the no-information rate is already 93%.
#############different k-values##################################
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 2)
set.seed(3333)
knn_fit <- train(chance_low_high ~., data = grad_train, method = "knn",trControl=trctrl,preProcess = c("center", "scale"),tuneLength = 10)
knn_fit
## k-Nearest Neighbors
##
## 326 samples
## 7 predictor
## 2 classes: 'high', 'low'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 2 times)
## Summary of sample sizes: 293, 294, 294, 293, 294, 293, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9237494 0.189349840
## 7 0.9343165 0.234594025
## 9 0.9327986 0.226484248
## 11 0.9265987 0.108147794
## 13 0.9235183 0.021465201
## 15 0.9220477 -0.002380952
## 17 0.9235628 0.000000000
## 19 0.9235628 0.000000000
## 21 0.9235628 0.000000000
## 23 0.9235628 0.000000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
df <- knn_fit$results
ggplot(data = df) +
geom_line(mapping = aes(x = k, y = Accuracy)) +
xlab("K values") +
ylab("Accuracy")
#Plotting accuracy against different values of k shows that the highest accuracy is at k = 7.
grad_admit_test_pred5 <- knn(train = grad_num_df_train, test = grad_num_df_test,
cl = grad_num_df_train_labels , k = 7)
#evaluating model performance
CrossTable(x = grad_num_df_test_labels, y = grad_admit_test_pred5,
prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 125
##
##
## | grad_admit_test_pred5
## grad_num_df_test_labels | high | low | Row Total |
## ------------------------|-----------|-----------|-----------|
## high | 114 | 0 | 114 |
## | 1.000 | 0.000 | 0.912 |
## | 0.927 | 0.000 | |
## | 0.912 | 0.000 | |
## ------------------------|-----------|-----------|-----------|
## low | 9 | 2 | 11 |
## | 0.818 | 0.182 | 0.088 |
## | 0.073 | 1.000 | |
## | 0.072 | 0.016 | |
## ------------------------|-----------|-----------|-----------|
## Column Total | 123 | 2 | 125 |
## | 0.984 | 0.016 | |
## ------------------------|-----------|-----------|-----------|
##
##
#with k = 7, all 114 high-chance students are classified correctly, but 9 of the 11 low-chance students are missed, a side effect of the strong class imbalance
#Multiple regression
#creating dummy variables for the categorical variables in this dataset
str(OG_data)
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
OG_data <- dummy_cols(OG_data, select_columns = "University.Rating")
OG_data <- dummy_cols(OG_data, select_columns = "Research")
#####splitting the dataset into training and testing#########
set.seed(111)
split_ind <- sample(seq_len(nrow(OG_data)), size = smp_size) #reusing the 75% sample size (375) computed earlier
grad_MR_train <- OG_data[split_ind, ]
grad_MR_test <- OG_data[-split_ind, ]
#created a model using training dataset
mul_model <- glm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_1 + University.Rating_2 + University.Rating_3 + University.Rating_4 + University.Rating_5 +CGPA + Research_1 , data = grad_MR_train)
summary(mul_model)
##
## Call:
## glm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_1 +
## University.Rating_2 + University.Rating_3 + University.Rating_4 +
## University.Rating_5 + CGPA + Research_1, data = grad_MR_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.23191 -0.02386 0.00558 0.03555 0.14719
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.270944 0.125403 -10.135 < 2e-16 ***
## GRE.Score 0.001862 0.000566 3.291 0.00110 **
## TOEFL.Score 0.002496 0.001030 2.423 0.01589 *
## University.Rating_1 -0.028555 0.017457 -1.636 0.10275
## University.Rating_2 -0.038639 0.012981 -2.977 0.00311 **
## University.Rating_3 -0.022683 0.011347 -1.999 0.04635 *
## University.Rating_4 -0.021533 0.010943 -1.968 0.04984 *
## University.Rating_5 NA NA NA NA
## CGPA 0.133703 0.010271 13.018 < 2e-16 ***
## Research_1 0.024011 0.007538 3.185 0.00157 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.003581965)
##
## Null deviance: 7.2812 on 374 degrees of freedom
## Residual deviance: 1.3110 on 366 degrees of freedom
## AIC: -1036.8
##
## Number of Fisher Scoring iterations: 2
#as we can see, there are several insignificant variables, which can be removed by the backfitting (backward elimination) process.
#################model 2 created by backfitting#####################################
mul_model2 <- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_2 + University.Rating_3 +CGPA , data = grad_MR_train)
summary(mul_model2)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_2 +
## University.Rating_3 + CGPA, data = grad_MR_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.251814 -0.024133 0.004559 0.037784 0.149652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4988889 0.1082410 -13.848 < 2e-16 ***
## GRE.Score 0.0023334 0.0005537 4.214 3.16e-05 ***
## TOEFL.Score 0.0025505 0.0010328 2.469 0.014 *
## University.Rating_2 -0.0226375 0.0088578 -2.556 0.011 *
## University.Rating_3 -0.0059376 0.0077214 -0.769 0.442
## CGPA 0.1419755 0.0098325 14.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0608 on 369 degrees of freedom
## Multiple R-squared: 0.8127, Adjusted R-squared: 0.8101
## F-statistic: 320.2 on 5 and 369 DF, p-value: < 2.2e-16
#################model 3 created by backfitting####################################
mul_model3<- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_2 +CGPA , data = grad_MR_train)
summary(mul_model3)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_2 +
## CGPA, data = grad_MR_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.255469 -0.024757 0.005077 0.037076 0.150162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5169906 0.1055919 -14.367 < 2e-16 ***
## GRE.Score 0.0023434 0.0005533 4.235 2.88e-05 ***
## TOEFL.Score 0.0026104 0.0010293 2.536 0.0116 *
## University.Rating_2 -0.0193948 0.0077854 -2.491 0.0132 *
## CGPA 0.1426537 0.0097875 14.575 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06076 on 370 degrees of freedom
## Multiple R-squared: 0.8124, Adjusted R-squared: 0.8103
## F-statistic: 400.5 on 4 and 370 DF, p-value: < 2.2e-16
#this model retains only significant variables; it was obtained by the backfitting process
#as we can see from the coefficients, we have statistically significant variables with p-values less than 0.05. Multiple R-squared measures how well the model explains the values of the dependent variable; the closer it is to 1, the better. We have about 0.81, so the model explains roughly 81 percent of the variation in the dependent variable, which is pretty good.
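#as a sanity check (an added sketch), the multiple R-squared can be reproduced from the
#residuals as 1 - SSE/SST; y_train is a helper name introduced here
y_train <- grad_MR_train$Chance.of.Admit
1 - sum(residuals(mul_model3)^2) / sum((y_train - mean(y_train))^2) #should match the 0.8124 above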
######################predicting the model############################
predMR <- predict(mul_model3, grad_MR_test[c("GRE.Score", "TOEFL.Score", "University.Rating_2", "CGPA")])
head(predMR)
## 2 10 11 12 18 19
## 0.7869347 0.7486852 0.7196205 0.8229516 0.6484984 0.7707196
head(grad_MR_test$Chance.of.Admit)
## [1] 0.76 0.45 0.52 0.84 0.65 0.63
###############calculating the root mean squared error for the prediction####
Sq_err <- (grad_MR_test$Chance.of.Admit - predMR)^ 2
#calculating the mean
avg_sq_err <- mean(Sq_err)
#taking square root of it
rmse <- sqrt(avg_sq_err)
rmse
## [1] 0.06712385
##the root mean squared error is very low; hence the model I created is good
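#added note: caret provides the same computation as a one-liner
RMSE(predMR, grad_MR_test$Chance.of.Admit) #should match the manual rmse above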
#Decision tree classification with C5.0
set.seed(123)
train_sample <- sample(500,100)
str(train_sample)
## int [1:100] 144 394 204 439 467 23 261 440 272 225 ...
#splitting the dataset randomly
grad_NB_test<- grad_admit[train_sample,]
grad_NB_train <- grad_admit[-train_sample,]
#the class proportions are similar across the train and test sets
prop.table(table(grad_NB_test$chance_low_high))
##
## high low
## 0.95 0.05
prop.table(table(grad_NB_train$chance_low_high))
##
## high low
## 0.92 0.08
head(grad_NB_train)
# grad_NB_test <- grad_NB_test[,-8]
# grad_NB_train <- grad_NB_train[,-8]
#creating a C5.0 decision tree model
nb_model <- C5.0(grad_NB_train[-8], grad_NB_train$chance_low_high)
nb_model
##
## Call:
## C5.0.default(x = grad_NB_train[-8], y = grad_NB_train$chance_low_high)
##
## Classification Tree
## Number of samples: 400
## Number of predictors: 7
##
## Tree size: 5
##
## Non-standard options: attempt to group attributes
summary(nb_model)
##
## Call:
## C5.0.default(x = grad_NB_train[-8], y = grad_NB_train$chance_low_high)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Apr 25 11:06:41 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 400 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## CGPA > 8.02: high (322/4)
## CGPA <= 8.02:
## :...TOEFL.Score <= 97: low (12/2)
## TOEFL.Score > 97:
## :...CGPA > 7.6: high (53/10)
## CGPA <= 7.6:
## :...GRE.Score <= 301: low (7)
## GRE.Score > 301: high (6/1)
##
##
## Evaluation on training data (400 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 5 17( 4.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 366 2 (a): class high
## 15 17 (b): class low
##
##
## Attribute usage:
##
## 100.00% CGPA
## 19.50% TOEFL.Score
## 3.25% GRE.Score
##
##
## Time: 0.0 secs
#The first lines of the model's tree can be read as: if CGPA is greater than 8.02, the chance of an admit is high; otherwise, if CGPA is at most 8.02 and the TOEFL score is at most 97, the chance is low. The (322/4) means that 322 cases reached that leaf and 4 of them were incorrectly classified. The error rate for this model is 4.2%: 2 actual "high" cases were misclassified as low, while 15 actual "low" cases were misclassified as high.
#evaluating the model performance using testing dataset
nb_pred <- predict(nb_model, grad_NB_test)
CrossTable(grad_NB_test$chance_low_high, nb_pred, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c("actual chance", "predicted chance"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted chance
## actual chance | high | low | Row Total |
## --------------|-----------|-----------|-----------|
## high | 91 | 4 | 95 |
## | 0.910 | 0.040 | |
## --------------|-----------|-----------|-----------|
## low | 2 | 3 | 5 |
## | 0.020 | 0.030 | |
## --------------|-----------|-----------|-----------|
## Column Total | 93 | 7 | 100 |
## --------------|-----------|-----------|-----------|
##
##
#the model correctly predicted that 91 students have a high chance of getting an admit and 3 students have a low chance, giving 94% accuracy and a 6% error rate; that is not a bad model, but we can try to boost the accuracy of decision trees
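#an added one-liner: the accuracy stated above, computed directly
mean(nb_pred == grad_NB_test$chance_low_high) #(91 + 3) / 100 = 0.94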
########################improving model performance##################################
#Let us see if we can boost the algorithm with the trials parameter, which indicates the number of separate decision trees to use in the boosted team. It sets an upper limit: the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy. We are starting with 10.
nb_boost <- C5.0(grad_NB_train[-8], grad_NB_train$chance_low_high, trials = 10)
nb_boost
##
## Call:
## C5.0.default(x = grad_NB_train[-8], y =
## grad_NB_train$chance_low_high, trials = 10)
##
## Classification Tree
## Number of samples: 400
## Number of predictors: 7
##
## Number of boosting iterations: 10
## Average tree size: 4.9
##
## Non-standard options: attempt to group attributes
summary(nb_boost)
##
## Call:
## C5.0.default(x = grad_NB_train[-8], y =
## grad_NB_train$chance_low_high, trials = 10)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Apr 25 11:06:41 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 400 cases (8 attributes) from undefined.data
##
## ----- Trial 0: -----
##
## Decision tree:
##
## CGPA > 8.02: high (322/4)
## CGPA <= 8.02:
## :...TOEFL.Score <= 97: low (12/2)
## TOEFL.Score > 97:
## :...CGPA > 7.6: high (53/10)
## CGPA <= 7.6:
## :...GRE.Score <= 301: low (7)
## GRE.Score > 301: high (6/1)
##
## ----- Trial 1: -----
##
## Decision tree:
##
## CGPA > 8.3: high (203.5/6.4)
## CGPA <= 8.3:
## :...SOP <= 2: low (80/19.8)
## SOP > 2: high (116.5/42.1)
##
## ----- Trial 2: -----
##
## Decision tree:
##
## CGPA > 8.6: high (119.6)
## CGPA <= 8.6:
## :...LOR > 3.5: high (29)
## LOR <= 3.5:
## :...CGPA <= 8: low (137/54.1)
## CGPA > 8:
## :...CGPA <= 8.57: high (105.7/26.5)
## CGPA > 8.57: low (8.7/0.6)
##
## ----- Trial 3: -----
##
## Decision tree:
##
## LOR > 3.5: high (88.1)
## LOR <= 3.5:
## :...CGPA > 8.6: high (32.8)
## CGPA <= 8.6:
## :...SOP <= 2.5: high (164.3/45.9)
## SOP > 2.5:
## :...University.Rating <= 3: low (103.9/43.7)
## University.Rating > 3: high (10.9)
##
## ----- Trial 4: -----
##
## Decision tree:
##
## CGPA > 8.3: high (128.1/5.4)
## CGPA <= 8.3:
## :...LOR > 3: high (46.9/7.1)
## LOR <= 3:
## :...SOP > 2: high (115.4/39.6)
## SOP <= 2:
## :...TOEFL.Score <= 103: high (87.5/37.4)
## TOEFL.Score > 103: low (22.1/1.7)
##
## ----- Trial 5: -----
##
## Decision tree:
##
## CGPA > 8.2: high (137.7/13.2)
## CGPA <= 8.2:
## :...GRE.Score > 309: high (75.4/24)
## GRE.Score <= 309:
## :...TOEFL.Score <= 95: low (15.8)
## TOEFL.Score > 95:
## :...University.Rating <= 1: high (34.4/13.3)
## University.Rating > 1: low (136.6/44.1)
##
## ----- Trial 6: -----
##
## Decision tree:
##
## LOR > 3: high (109/9)
## LOR <= 3:
## :...SOP <= 2: low (125.9/43.8)
## SOP > 2:
## :...TOEFL.Score <= 101: low (79.8/34.9)
## TOEFL.Score > 101: high (85.3/14.5)
##
## ----- Trial 7: -----
##
## Decision tree:
##
## CGPA > 8.3: high (85.2)
## CGPA <= 8.3:
## :...GRE.Score <= 296: low (29.2/7.9)
## GRE.Score > 296:
## :...LOR > 3: high (43.9/7.1)
## LOR <= 3:
## :...GRE.Score <= 301: high (68.8/18.2)
## GRE.Score > 301:
## :...TOEFL.Score <= 101: low (85.7/37)
## TOEFL.Score > 101: high (86.2/29.9)
##
## ----- Trial 8: -----
##
## Decision tree:
##
## CGPA > 8.3: high (71.6)
## CGPA <= 8.3:
## :...LOR > 3: high (34.3/3.3)
## LOR <= 3:
## :...TOEFL.Score <= 97: low (38.2/12.5)
## TOEFL.Score > 97:
## :...Research = Yes: low (61.4/29.1)
## Research = No:
## :...LOR <= 1.5: low (11.5/3.1)
## LOR > 1.5:
## :...SOP <= 3.5: high (167.8/48.8)
## SOP > 3.5: low (13.1/3.1)
##
## ----- Trial 9: -----
##
## Decision tree:
##
## CGPA > 8.02: high (155.3/4.9)
## CGPA <= 8.02:
## :...CGPA > 7.66: high (120.1/36.9)
## CGPA <= 7.66:
## :...GRE.Score <= 300: low (51.6/7.3)
## GRE.Score > 300: high (67.9/28.3)
##
##
## Evaluation on training data (400 cases):
##
## Trial Decision Tree
## ----- ----------------
## Size Errors
##
## 0 5 17( 4.2%)
## 1 3 38( 9.5%)
## 2 5 43(10.8%)
## 3 5 79(19.8%)
## 4 5 29( 7.2%)
## 5 5 47(11.8%)
## 6 4 58(14.5%)
## 7 6 32( 8.0%)
## 8 7 39( 9.8%)
## 9 4 19( 4.8%)
## boost 17( 4.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 367 1 (a): class high
## 16 16 (b): class low
##
##
## Attribute usage:
##
## 100.00% LOR
## 100.00% CGPA
## 54.25% SOP
## 46.25% TOEFL.Score
## 35.00% GRE.Score
## 31.50% University.Rating
## 22.50% Research
##
##
## Time: 0.0 secs
nb_pred_boost <- predict(nb_boost, grad_NB_test)
CrossTable(grad_NB_test$chance_low_high, nb_pred_boost, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c("actual chance", "predicted chance"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted chance
## actual chance | high | low | Row Total |
## --------------|-----------|-----------|-----------|
## high | 92 | 3 | 95 |
## | 0.920 | 0.030 | |
## --------------|-----------|-----------|-----------|
## low | 3 | 2 | 5 |
## | 0.030 | 0.020 | |
## --------------|-----------|-----------|-----------|
## Column Total | 95 | 5 | 100 |
## --------------|-----------|-----------|-----------|
##
##
#The boosted model obtained 94% accuracy, which is the same as before; it is still a good model
#Building Ensemble Methods
set.seed(1)
str(grad_admit)
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
## $ chance_low_high : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 2 2 ...
#Splitting the dataset into two parts based on the outcome: 75% and 25%
index <- createDataPartition(grad_admit$chance_low_high, p=0.75, list=FALSE)
trainSet_ensemble <- grad_admit[ index,]
testSet_ensemble <- grad_admit[-index,]
#Defining the training controls for multiple models
fitControl <- trainControl(
method = "cv",
number = 5,
savePredictions = 'final',
classProbs = T)
#Defining the predictors and outcome
predictors<-c("GRE.Score", "TOEFL.Score", "CGPA")
outcomeName<-'chance_low_high'
###################################Random Forest#####################################
#Training the random forest model
model_rf<-train(trainSet_ensemble[,predictors],trainSet_ensemble[,outcomeName],method='rf',trControl=fitControl,tuneLength=3)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
#Predicting using random forest model
testSet_ensemble$pred_rf<-predict(object = model_rf,testSet_ensemble[,predictors])
#Checking the accuracy of the random forest model
confusionMatrix(testSet_ensemble$chance_low_high,testSet_ensemble$pred_rf) #the accuracy is about 92%
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 113 2
## low 8 1
##
## Accuracy : 0.9194
## 95% CI : (0.8567, 0.9606)
## No Information Rate : 0.9758
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.1353
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.9339
## Specificity : 0.3333
## Pos Pred Value : 0.9826
## Neg Pred Value : 0.1111
## Prevalence : 0.9758
## Detection Rate : 0.9113
## Detection Prevalence : 0.9274
## Balanced Accuracy : 0.6336
##
## 'Positive' Class : high
##
#################################KNN##################################################
#Training the knn model
model_knn<-train(trainSet_ensemble[,predictors],trainSet_ensemble[,outcomeName],method='knn',trControl=fitControl,tuneLength=3)
#Predicting using knn model
testSet_ensemble$pred_knn<-predict(object = model_knn,testSet_ensemble[,predictors])
#Checking the accuracy of the knn model
confusionMatrix(testSet_ensemble$chance_low_high,testSet_ensemble$pred_knn) #the accuracy is about 92%
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 113 2
## low 8 1
##
## Accuracy : 0.9194
## 95% CI : (0.8567, 0.9606)
## No Information Rate : 0.9758
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.1353
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.9339
## Specificity : 0.3333
## Pos Pred Value : 0.9826
## Neg Pred Value : 0.1111
## Prevalence : 0.9758
## Detection Rate : 0.9113
## Detection Prevalence : 0.9274
## Balanced Accuracy : 0.6336
##
## 'Positive' Class : high
##
#################################Logistic regression###################################
#Training the Logistic regression model
model_lr<-train(trainSet_ensemble[,predictors],trainSet_ensemble[,outcomeName],method='glm',trControl=fitControl,tuneLength=3)
#Predicting using the logistic regression model
testSet_ensemble$pred_lr<-predict(object = model_lr,testSet_ensemble[,predictors])
#Checking the accuracy of the Logistic regression model
confusionMatrix(testSet_ensemble$chance_low_high,testSet_ensemble$pred_lr) #the accuracy is about 94%, slightly higher than the previous two models
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 115 0
## low 8 1
##
## Accuracy : 0.9355
## 95% CI : (0.8768, 0.9717)
## No Information Rate : 0.9919
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.1882
## Mcnemar's Test P-Value : 0.01333
##
## Sensitivity : 0.9350
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.1111
## Prevalence : 0.9919
## Detection Rate : 0.9274
## Detection Prevalence : 0.9274
## Balanced Accuracy : 0.9675
##
## 'Positive' Class : high
##
#averaging the predictions from each model. Since we are predicting whether the chance of admit is high or low, we average the predicted class probabilities
#Predicting the probabilities
testSet_ensemble$pred_rf_prob<-predict(object=model_rf,testSet_ensemble[,predictors],type='prob')
testSet_ensemble$pred_knn_prob<-predict(object = model_knn,testSet_ensemble[,predictors],type='prob')
testSet_ensemble$pred_lr_prob<-predict(object = model_lr,testSet_ensemble[,predictors],type='prob')
#Taking average of predictions
testSet_ensemble$pred_avg<-(testSet_ensemble$pred_rf_prob$high+testSet_ensemble$pred_knn_prob$high+testSet_ensemble$pred_lr_prob$high)/3
#Splitting into binary classes at 0.5
testSet_ensemble$pred_avg<-as.factor(ifelse(testSet_ensemble$pred_avg>0.5,'high','low')) #lowercase labels for consistency with the factor levels
#Implementing majority voting by assigning each observation the class predicted by a majority of the models
testSet_ensemble$pred_majority<-as.factor(ifelse(testSet_ensemble$pred_rf=='high' & testSet_ensemble$pred_knn=='high','high',ifelse(testSet_ensemble$pred_rf=='high' & testSet_ensemble$pred_lr=='high','high',ifelse(testSet_ensemble$pred_knn=='high' & testSet_ensemble$pred_lr=='high','high','low'))))
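#an equivalent, arguably clearer majority vote (an added sketch; high_votes and pred_majority2
#are new names introduced here): count the 'high' votes across the three models
high_votes <- (testSet_ensemble$pred_rf == 'high') +
(testSet_ensemble$pred_knn == 'high') +
(testSet_ensemble$pred_lr == 'high')
testSet_ensemble$pred_majority2 <- as.factor(ifelse(high_votes >= 2, 'high', 'low'))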
#Taking a weighted average of predictions; weights are generally higher for more accurate models, so the logistic regression model (the most accurate of the three) gets the largest weight, and the weights sum to 1
testSet_ensemble$pred_weighted_avg<-(testSet_ensemble$pred_rf_prob$high*0.25)+(testSet_ensemble$pred_knn_prob$high*0.25)+(testSet_ensemble$pred_lr_prob$high*0.5)
#Splitting into binary classes at 0.5
testSet_ensemble$pred_weighted_avg<-as.factor(ifelse(testSet_ensemble$pred_weighted_avg>0.5,'high','low'))
#########################Boosting using GBM############################################
#Gradient Boosting, aka GBM, is a powerful algorithm widely used in real-world applications. It constructs one tree at a time and primarily reduces bias (and to some extent variance). The weakness of the model is that it tends to overfit and is harder to tune.
#collecting the out-of-fold predicted probabilities from each base layer model on the training data
trainSet_ensemble$OOF_pred_rf<-model_rf$pred$high[order(model_rf$pred$rowIndex)]
trainSet_ensemble$OOF_pred_knn<-model_knn$pred$high[order(model_knn$pred$rowIndex)]
trainSet_ensemble$OOF_pred_lr<-model_lr$pred$high[order(model_lr$pred$rowIndex)]
#Predicting probabilities for the test data
testSet_ensemble$OOF_pred_rf<-predict(model_rf,testSet_ensemble[predictors],type='prob')$high
testSet_ensemble$OOF_pred_knn<-predict(model_knn,testSet_ensemble[predictors],type='prob')$high
testSet_ensemble$OOF_pred_lr<-predict(model_lr,testSet_ensemble[predictors],type='prob')$high
#Predictors for top layer models
predictors_top<-c('OOF_pred_rf','OOF_pred_knn','OOF_pred_lr')
#GBM as the top layer model, trained on the out-of-fold predictions
model_gbm<-
train(trainSet_ensemble[,predictors_top],trainSet_ensemble[,outcomeName],method='gbm',trControl=fitControl,tuneLength=3)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4463 nan 0.1000 0.0205
## 2 0.4218 nan 0.1000 0.0116
## 3 0.4017 nan 0.1000 0.0103
## 4 0.3818 nan 0.1000 0.0097
## 5 0.3706 nan 0.1000 0.0061
## 6 0.3592 nan 0.1000 0.0042
## 7 0.3505 nan 0.1000 0.0028
## 8 0.3420 nan 0.1000 0.0018
## 9 0.3356 nan 0.1000 0.0027
## 10 0.3295 nan 0.1000 0.0015
## 20 0.2838 nan 0.1000 0.0007
## 40 0.2527 nan 0.1000 -0.0013
## 60 0.2352 nan 0.1000 0.0001
## 80 0.2273 nan 0.1000 -0.0003
## 100 0.2239 nan 0.1000 -0.0018
## 120 0.2154 nan 0.1000 -0.0011
## 140 0.2079 nan 0.1000 -0.0005
## 150 0.2060 nan 0.1000 -0.0003
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4272 nan 0.1000 0.0355
## 2 0.4054 nan 0.1000 0.0119
## 3 0.3856 nan 0.1000 0.0064
## 4 0.3702 nan 0.1000 0.0021
## 5 0.3548 nan 0.1000 0.0060
## 6 0.3405 nan 0.1000 0.0053
## 7 0.3303 nan 0.1000 0.0057
## 8 0.3195 nan 0.1000 0.0021
## 9 0.3128 nan 0.1000 0.0011
## 10 0.3069 nan 0.1000 0.0001
## 20 0.2562 nan 0.1000 -0.0022
## 40 0.2211 nan 0.1000 0.0000
## 60 0.1946 nan 0.1000 -0.0025
## 80 0.1839 nan 0.1000 -0.0027
## 100 0.1677 nan 0.1000 -0.0019
## 120 0.1576 nan 0.1000 -0.0010
## 140 0.1483 nan 0.1000 -0.0011
## 150 0.1434 nan 0.1000 -0.0014
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4600 nan 0.1000 0.0319
## 2 0.4171 nan 0.1000 0.0180
## 3 0.3810 nan 0.1000 0.0154
## 4 0.3638 nan 0.1000 0.0017
## 5 0.3467 nan 0.1000 0.0046
## 6 0.3264 nan 0.1000 0.0025
## 7 0.3123 nan 0.1000 0.0053
## 8 0.3035 nan 0.1000 0.0037
## 9 0.2930 nan 0.1000 -0.0008
## 10 0.2855 nan 0.1000 -0.0005
## 20 0.2407 nan 0.1000 0.0002
## 40 0.1997 nan 0.1000 -0.0016
## 60 0.1745 nan 0.1000 -0.0006
## 80 0.1527 nan 0.1000 -0.0014
## 100 0.1366 nan 0.1000 -0.0021
## 120 0.1255 nan 0.1000 -0.0012
## 140 0.1149 nan 0.1000 -0.0006
## 150 0.1087 nan 0.1000 -0.0004
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4338 nan 0.1000 0.0215
## 2 0.3981 nan 0.1000 0.0085
## 3 0.3803 nan 0.1000 0.0080
## 4 0.3600 nan 0.1000 0.0076
## 5 0.3463 nan 0.1000 0.0058
## 6 0.3291 nan 0.1000 0.0045
## 7 0.3178 nan 0.1000 0.0033
## 8 0.3074 nan 0.1000 0.0051
## 9 0.3029 nan 0.1000 0.0008
## 10 0.2971 nan 0.1000 0.0000
## 20 0.2648 nan 0.1000 0.0015
## 40 0.2470 nan 0.1000 -0.0014
## 60 0.2383 nan 0.1000 -0.0012
## 80 0.2238 nan 0.1000 -0.0015
## 100 0.2151 nan 0.1000 -0.0020
## 120 0.2087 nan 0.1000 -0.0001
## 140 0.2056 nan 0.1000 -0.0001
## 150 0.2005 nan 0.1000 -0.0009
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4426 nan 0.1000 0.0477
## 2 0.4049 nan 0.1000 0.0171
## 3 0.3752 nan 0.1000 0.0153
## 4 0.3592 nan 0.1000 0.0046
## 5 0.3408 nan 0.1000 0.0066
## 6 0.3298 nan 0.1000 0.0053
## 7 0.3185 nan 0.1000 0.0057
## 8 0.3069 nan 0.1000 0.0059
## 9 0.3065 nan 0.1000 -0.0018
## 10 0.2986 nan 0.1000 0.0035
## 20 0.2605 nan 0.1000 -0.0021
## 40 0.2276 nan 0.1000 -0.0023
## 60 0.2081 nan 0.1000 0.0007
## 80 0.1932 nan 0.1000 -0.0015
## 100 0.1826 nan 0.1000 -0.0010
## 120 0.1738 nan 0.1000 -0.0020
## 140 0.1658 nan 0.1000 -0.0014
## 150 0.1625 nan 0.1000 -0.0005
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4245 nan 0.1000 0.0266
## 2 0.3826 nan 0.1000 0.0146
## 3 0.3586 nan 0.1000 0.0071
## 4 0.3444 nan 0.1000 0.0043
## 5 0.3277 nan 0.1000 0.0051
## 6 0.3119 nan 0.1000 0.0057
## 7 0.3025 nan 0.1000 0.0014
## 8 0.2933 nan 0.1000 0.0020
## 9 0.2836 nan 0.1000 0.0016
## 10 0.2751 nan 0.1000 0.0018
## 20 0.2349 nan 0.1000 -0.0007
## 40 0.2017 nan 0.1000 -0.0017
## 60 0.1834 nan 0.1000 -0.0019
## 80 0.1739 nan 0.1000 -0.0019
## 100 0.1612 nan 0.1000 -0.0012
## 120 0.1465 nan 0.1000 -0.0014
## 140 0.1333 nan 0.1000 -0.0006
## 150 0.1278 nan 0.1000 -0.0007
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4468 nan 0.1000 0.0454
## 2 0.4038 nan 0.1000 0.0115
## 3 0.3715 nan 0.1000 0.0091
## 4 0.3575 nan 0.1000 0.0088
## 5 0.3375 nan 0.1000 0.0054
## 6 0.3272 nan 0.1000 0.0057
## 7 0.3150 nan 0.1000 0.0059
## 8 0.3050 nan 0.1000 0.0013
## 9 0.2961 nan 0.1000 0.0012
## 10 0.2888 nan 0.1000 0.0027
## 20 0.2512 nan 0.1000 -0.0020
## 40 0.2282 nan 0.1000 -0.0005
## 60 0.2214 nan 0.1000 -0.0010
## 80 0.2118 nan 0.1000 -0.0014
## 100 0.2047 nan 0.1000 0.0001
## 120 0.1982 nan 0.1000 -0.0005
## 140 0.1934 nan 0.1000 -0.0008
## 150 0.1920 nan 0.1000 -0.0009
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4277 nan 0.1000 0.0358
## 2 0.4006 nan 0.1000 0.0101
## 3 0.3678 nan 0.1000 0.0149
## 4 0.3446 nan 0.1000 0.0097
## 5 0.3284 nan 0.1000 0.0085
## 6 0.3147 nan 0.1000 0.0026
## 7 0.3011 nan 0.1000 0.0051
## 8 0.2922 nan 0.1000 0.0034
## 9 0.2874 nan 0.1000 0.0015
## 10 0.2809 nan 0.1000 0.0033
## 20 0.2406 nan 0.1000 0.0010
## 40 0.2082 nan 0.1000 -0.0006
## 60 0.1952 nan 0.1000 -0.0025
## 80 0.1833 nan 0.1000 -0.0018
## 100 0.1727 nan 0.1000 -0.0018
## 120 0.1632 nan 0.1000 -0.0014
## 140 0.1523 nan 0.1000 -0.0008
## 150 0.1495 nan 0.1000 -0.0005
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4257 nan 0.1000 0.0397
## 2 0.3840 nan 0.1000 0.0111
## 3 0.3613 nan 0.1000 0.0121
## 4 0.3456 nan 0.1000 0.0051
## 5 0.3228 nan 0.1000 0.0070
## 6 0.3086 nan 0.1000 0.0041
## 7 0.3001 nan 0.1000 0.0029
## 8 0.2886 nan 0.1000 0.0017
## 9 0.2811 nan 0.1000 0.0026
## 10 0.2703 nan 0.1000 0.0009
## 20 0.2141 nan 0.1000 -0.0012
## 40 0.1860 nan 0.1000 -0.0020
## 60 0.1633 nan 0.1000 -0.0027
## 80 0.1517 nan 0.1000 -0.0011
## 100 0.1370 nan 0.1000 -0.0011
## 120 0.1175 nan 0.1000 -0.0016
## 140 0.1094 nan 0.1000 -0.0013
## 150 0.1026 nan 0.1000 -0.0011
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4703 nan 0.1000 0.0323
## 2 0.4404 nan 0.1000 0.0071
## 3 0.4246 nan 0.1000 0.0054
## 4 0.4080 nan 0.1000 0.0089
## 5 0.3957 nan 0.1000 0.0041
## 6 0.3829 nan 0.1000 0.0039
## 7 0.3708 nan 0.1000 0.0048
## 8 0.3610 nan 0.1000 0.0035
## 9 0.3564 nan 0.1000 0.0010
## 10 0.3519 nan 0.1000 -0.0003
## 20 0.3192 nan 0.1000 0.0002
## 40 0.3019 nan 0.1000 -0.0002
## 60 0.2874 nan 0.1000 0.0002
## 80 0.2740 nan 0.1000 -0.0015
## 100 0.2636 nan 0.1000 -0.0004
## 120 0.2551 nan 0.1000 -0.0009
## 140 0.2503 nan 0.1000 -0.0012
## 150 0.2479 nan 0.1000 -0.0005
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4827 nan 0.1000 0.0270
## 2 0.4411 nan 0.1000 0.0155
## 3 0.4181 nan 0.1000 0.0104
## 4 0.3982 nan 0.1000 0.0070
## 5 0.3757 nan 0.1000 0.0043
## 6 0.3605 nan 0.1000 0.0034
## 7 0.3504 nan 0.1000 0.0013
## 8 0.3381 nan 0.1000 0.0047
## 9 0.3362 nan 0.1000 -0.0021
## 10 0.3289 nan 0.1000 0.0014
## 20 0.2931 nan 0.1000 -0.0017
## 40 0.2702 nan 0.1000 -0.0004
## 60 0.2486 nan 0.1000 -0.0013
## 80 0.2349 nan 0.1000 -0.0023
## 100 0.2256 nan 0.1000 -0.0017
## 120 0.2104 nan 0.1000 -0.0010
## 140 0.2018 nan 0.1000 -0.0015
## 150 0.2002 nan 0.1000 -0.0005
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4893 nan 0.1000 0.0273
## 2 0.4364 nan 0.1000 0.0190
## 3 0.4153 nan 0.1000 0.0097
## 4 0.3919 nan 0.1000 0.0106
## 5 0.3782 nan 0.1000 0.0038
## 6 0.3637 nan 0.1000 0.0038
## 7 0.3548 nan 0.1000 0.0017
## 8 0.3470 nan 0.1000 -0.0002
## 9 0.3403 nan 0.1000 0.0005
## 10 0.3323 nan 0.1000 0.0010
## 20 0.2810 nan 0.1000 -0.0022
## 40 0.2427 nan 0.1000 -0.0046
## 60 0.2191 nan 0.1000 -0.0032
## 80 0.2037 nan 0.1000 -0.0015
## 100 0.1921 nan 0.1000 -0.0027
## 120 0.1802 nan 0.1000 -0.0008
## 140 0.1662 nan 0.1000 -0.0007
## 150 0.1609 nan 0.1000 -0.0016
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4802 nan 0.1000 0.0321
## 2 0.4459 nan 0.1000 0.0092
## 3 0.4238 nan 0.1000 0.0062
## 4 0.4052 nan 0.1000 0.0089
## 5 0.3950 nan 0.1000 0.0035
## 6 0.3799 nan 0.1000 0.0058
## 7 0.3696 nan 0.1000 0.0038
## 8 0.3640 nan 0.1000 0.0010
## 9 0.3653 nan 0.1000 -0.0029
## 10 0.3590 nan 0.1000 0.0007
## 20 0.3287 nan 0.1000 -0.0011
## 40 0.3126 nan 0.1000 -0.0008
## 60 0.2950 nan 0.1000 -0.0002
## 80 0.2851 nan 0.1000 -0.0018
## 100 0.2773 nan 0.1000 -0.0005
## 120 0.2704 nan 0.1000 0.0000
## 140 0.2634 nan 0.1000 -0.0007
## 150 0.2612 nan 0.1000 -0.0019
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4859 nan 0.1000 0.0291
## 2 0.4451 nan 0.1000 0.0172
## 3 0.4258 nan 0.1000 0.0049
## 4 0.4051 nan 0.1000 0.0094
## 5 0.3903 nan 0.1000 0.0067
## 6 0.3778 nan 0.1000 0.0067
## 7 0.3728 nan 0.1000 -0.0008
## 8 0.3556 nan 0.1000 0.0079
## 9 0.3434 nan 0.1000 0.0035
## 10 0.3375 nan 0.1000 0.0010
## 20 0.3046 nan 0.1000 -0.0013
## 40 0.2674 nan 0.1000 -0.0012
## 60 0.2520 nan 0.1000 -0.0009
## 80 0.2386 nan 0.1000 -0.0014
## 100 0.2288 nan 0.1000 -0.0008
## 120 0.2213 nan 0.1000 -0.0019
## 140 0.2105 nan 0.1000 -0.0006
## 150 0.2095 nan 0.1000 -0.0016
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4718 nan 0.1000 0.0230
## 2 0.4350 nan 0.1000 0.0126
## 3 0.4118 nan 0.1000 0.0102
## 4 0.3953 nan 0.1000 0.0066
## 5 0.3733 nan 0.1000 0.0054
## 6 0.3622 nan 0.1000 0.0053
## 7 0.3510 nan 0.1000 0.0038
## 8 0.3443 nan 0.1000 0.0020
## 9 0.3327 nan 0.1000 0.0007
## 10 0.3230 nan 0.1000 0.0000
## 20 0.2850 nan 0.1000 -0.0010
## 40 0.2520 nan 0.1000 -0.0017
## 60 0.2334 nan 0.1000 -0.0012
## 80 0.2200 nan 0.1000 -0.0006
## 100 0.2057 nan 0.1000 -0.0022
## 120 0.1962 nan 0.1000 -0.0021
## 140 0.1821 nan 0.1000 -0.0011
## 150 0.1754 nan 0.1000 -0.0014
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4694 nan 0.1000 0.0273
## 2 0.4467 nan 0.1000 0.0075
## 3 0.4198 nan 0.1000 0.0120
## 4 0.3955 nan 0.1000 0.0061
## 5 0.3731 nan 0.1000 0.0084
## 6 0.3656 nan 0.1000 0.0033
## 7 0.3562 nan 0.1000 0.0042
## 8 0.3511 nan 0.1000 0.0010
## 9 0.3424 nan 0.1000 0.0039
## 10 0.3329 nan 0.1000 0.0042
## 20 0.3100 nan 0.1000 -0.0023
## 40 0.2758 nan 0.1000 -0.0011
## 50 0.2690 nan 0.1000 -0.0007
#Logistic regression as an alternative top layer model
model_glm<-
train(trainSet_ensemble[,predictors_top],trainSet_ensemble[,outcomeName],method='glm',trControl=fitControl,tuneLength=3)
#############################Stacked ensemble method################################
#predict using GBM top layer model
testSet_ensemble$gbm_stacked<-predict(model_gbm,testSet_ensemble[,predictors_top])
#predict using the logistic regression top layer model
testSet_ensemble$glm_stacked<-predict(model_glm,testSet_ensemble[,predictors_top])
#Since the dataset is fairly clean and neat, all the predictions were higher than usual. The above code displays how a stacked ensemble works. Ensemble methods are very useful when the dataset is big and messy; they combine two or more algorithms to produce more accurate results. There are different types, but the code above implements the stacked ensemble method: stacking multiple layers of machine learning algorithms on top of one another, where each model passes its predictions to the model in the layer above, and the top-layer model makes the final decision based on the inputs given to it.
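#an added check: the stacked top-layer predictions can be scored the same way as the base models
confusionMatrix(testSet_ensemble$chance_low_high, testSet_ensemble$gbm_stacked)
confusionMatrix(testSet_ensemble$chance_low_high, testSet_ensemble$glm_stacked)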
////PHASE FIVE: EVALUATION
//Evaluate results
To evaluate the models built, I take a few factors into account, such as accuracy, RMSE, p-value, and R-squared. I tuned each model until it reached its highest accuracy.
#comparing the models by their error rates for a better understanding (RMSE for the regression model, classification error rate for the others)
errors_rate=data.frame(model_name=c("Decision_Tree", "Multiple_linear_regression", "KNN") , Error_rate=c(0.34, 0.060, 0.065))
ggplot(errors_rate, aes(x=model_name, y=Error_rate)) + geom_bar(stat = "identity")
#comparing the three machine learning techniques, multiple linear regression has the lowest error rate of all. Hence the best machine learning model for this dataset is multiple regression.
//Review process
This phase generally consists of cross-verifying the algorithm and checking that it runs as expected. I re-ran the code from scratch to check for errors and gave my insights on every model and every piece of code written above.
//Determining the next step
The next steps for this project are:
1. Cross-verify the code
2. Deploy the project
////PHASE SIX : DEPLOYMENT
///Plan deployment
I have planned to deploy this project on my GitHub and RPubs. This will help students who aspire to do a master's degree in the US; they can use this project to gauge their chance of admission at a university.
///Plan monitoring and maintenance
The project will be updated every three years with new datasets. Some variables, like GRE scores, have changed in the past: each GRE section used to be scored out of 800, but the current combined scale is out of 340. The project will be updated if there are any such changes in the factors affecting the chance of admission.
///Produce final report
The final report for the project will consist of:
1. A PDF file describing the steps and data mining concepts, along with the code and appropriate comments
2. A brief PowerPoint presentation describing the steps and actions taken in the project
3. An access link for people who want to take a look at it
References
1. Grolemund, Garrett, and Hadley Wickham. R for Data Science. Accessed April 24, 2019. https://r4ds.had.co.nz/.
2. "How to Build Ensemble Models in Machine Learning? (With Code in R)." Analytics Vidhya (blog), February 15, 2017. https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/.
3. Lantz, Brett. Machine Learning with R. Second edition. Packt Publishing.