#loading necessary packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.1
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'tidyr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(psych)
## Warning: package 'psych' was built under R version 3.5.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# install.packages("GGally")
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(ggplot2) #already attached via the tidyverse, so this line is redundant but harmless
# install.packages("ggExtra")
library(ggExtra)
# install.packages("class")
library(class)
## Warning: package 'class' was built under R version 3.5.2
library(gmodels)
# install.packages("caret")
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# install.packages("fastDummies")
library(fastDummies)
# install.packages("C50")
library(C50)
////PHASE ONE : BUSINESS UNDERSTANDING
This phase focuses on understanding the project from a business perspective and on developing a plan to achieve its goals.
//Determining the business objectives
This phase uncovers the primary business objective. I am a graduate student at Northeastern University. When I was applying to universities from India, I often wanted to know the acceptance rates of universities based on SoP (Statement of Purpose), LoR (Letter of Recommendation), GRE score, and TOEFL score. To help students estimate their chances of admission at different universities in the US based on these criteria, I decided to apply the Machine Learning and Data Mining concepts I learned in the DA5030 class.
1. To use Machine Learning and Data Mining techniques to predict the admission chances of students with different test scores and criteria
//Situation assessment
This phase covers what data is available to meet the primary business goal and lists potential risks and solutions to those risks. The dataset contains information about test scores, CGPA, and university rating; the question arises who rates the universities and what the criteria for rating a university are. The SoP and LoR ratings may be biased, since they are based on the applicant's own knowledge and self-evaluation, which is why they are treated as 'not important' variables. The chance of admit covers different universities across the USA.
//Determining the data mining goals
The data mining goals of this project are to predict the admission chances of new students and to improve the prediction accuracy of different machine learning models.
//Producing a project plan
This project will have the following steps:
1. Explore the dataset using different visualization techniques
2. Transform the data if needed
3. Split the dataset into training and validation sets
4. Use machine learning techniques like KNN, decision trees, and multiple regression to predict the chance of admit
5. Compare the accuracy of the different machine learning models
6. Build a stacked ensemble method to improve model performance
7. Visualize the predictions for better understanding
////PHASE TWO : DATA UNDERSTANDING
This phase consists of collecting the data, exploring it, identifying data quality problems, discovering insights into the data, and detecting interesting subsets to form hypotheses.
//Collecting the data
The data was collected from Kaggle (https://www.kaggle.com/mohansacharya/graduate-admissions). The dataset was originally inspired by the UCLA graduate dataset. There was no need for additional data sources, as the dataset obtained was sufficient to predict the chance of admit using different machine learning algorithms.
//Describe the data
This describes the quantity of the data: the number of records, the features, and the format. The dataset obtained was in CSV format. It had 500 observations with 9 variables:
a) Serial number (numerical)
b) GRE Score (numerical, out of 340)
c) TOEFL Score (numerical, out of 120)
d) University Rating (numerical, out of 5)
e) SoP Rating (numerical, out of 5)
f) LoR Rating (numerical, out of 5)
g) CGPA (numerical, out of 10)
h) Research (categorical, 0 or 1)
i) Chance of Admit (probability, 0 to 1)
#loading the dataset into R environment
grad_admit <- read.csv("/Users/jess/Downloads/graduate-admissions/Admission_Predict_Ver1.1.csv")
str(grad_admit)
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
#keeping an untouched copy of the original data for the regression models later
OG_data <- read.csv("/Users/jess/Downloads/graduate-admissions/Admission_Predict_Ver1.1.csv")
//Exploring the data
Visualizing and reporting the data is part of this phase. I created many graphs to explore the dataset, to find outliers, and to check whether the variables are normally distributed.
#histograms to check the normal distribution
hist(grad_admit$GRE.Score, xlab = "GRE score")
hist(grad_admit$TOEFL.Score, xlab = "TOEFL score")
hist(grad_admit$University.Rating, xlab = "University rating")
hist(grad_admit$SOP, xlab = "SOP rating")
hist(grad_admit$LOR, xlab = "LOR rating")
hist(grad_admit$CGPA, xlab = "CGPA")
hist(grad_admit$Chance.of.Admit , xlab = "chance of admit")
#boxplot to detect outliers if there are any
boxplot(grad_admit$GRE.Score, ylab = "GRE score")
boxplot(grad_admit$TOEFL.Score, ylab = "TOEFL score")
boxplot(grad_admit$University.Rating, ylab ="University rating" )
boxplot(grad_admit$SOP, ylab = "SOP rating")
boxplot(grad_admit$LOR, ylab = "LOR rating")
boxplot(grad_admit$CGPA, ylab = "CGPA")
boxplot(grad_admit$Chance.of.Admit, ylab = "Chance of admit")
summary(grad_admit$Chance.of.Admit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3400 0.6300 0.7200 0.7217 0.8200 0.9700
summary(grad_admit$LOR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.500 3.484 4.000 5.000
#scatterplot for all scores and chance of admit
pairs.panels(grad_admit)
important_param <- grad_admit %>%
select(GRE.Score, TOEFL.Score, CGPA, Chance.of.Admit)
pairs.panels(important_param)
#gre vs chance of admit, with a linear regression line added
p <- ggplot(grad_admit, aes(x = GRE.Score, y = Chance.of.Admit)) +
geom_point(shape = 1) +
geom_smooth(method = lm, color = "red", se = TRUE) +
xlab("GRE score") +
ylab("Chance of admit")
ggMarginal(p, type="histogram")
#research facet plot
ggplot(data = grad_admit) +
geom_point(mapping = aes(x = GRE.Score, y = TOEFL.Score, color = Chance.of.Admit)) +
facet_wrap(~ Research)+
xlab("GRE Score")+
ylab("TOEFL Score")
//Verifying the data quality
The quality of the dataset is good because it doesn't have any missing values. Like any other dataset it has potentially biased attributes, such as the SoP and LoR ratings, but there are no missing values. None of the variables conflict with common sense. It has 500 observations of 9 different variables, one of which is what we are going to predict using machine learning techniques.
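#a quick sanity check for the no-missing-values claim (a minimal added sketch; output omitted)
sum(is.na(grad_admit))      #total number of NA values; expected to be 0
colSums(is.na(grad_admit))  #NA count per variable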
////PHASE THREE: DATA PREPARATION
This phase covers constructing the dataset used in the analyses, transforming the dataset if needed, and integrating and cleansing the data if needed.
//Selection of data
In this phase I have to select the variables that align with my data mining goals and eliminate variables that are not useful. In the graduate admission dataset, I don't need the "Serial number" variable because it provides no useful information or relevance to the dataset. For building the models I used only a few important variables: "GRE Score", "TOEFL Score", and "CGPA". The other variables are less important because they do not correlate as highly with the outcome.
//Clean data
In order to clean the data, I deleted the "Serial number" variable from the dataset. Categorical variables like "Research" were stored as a different data type; I converted "Research" to a factor variable, which is more useful ("University Rating" is handled with dummy variables later).
//Construct data
In this stage, I created dummy variables for the categorical variables in the dataset and produced a new column called "chance_low_high" based on the "Chance of Admit" variable available in the actual dataset. This variable will help with the data mining goals proposed earlier.
//Integrate data
This dataset doesn't require integrating data from different datasets, since the data obtained is itself clean and complete.
//Format data
Some variable types were different, and in order to use them I changed them to factor variables. I imputed the outliers of different variables by replacing them with either the mode or the median of that variable. Outliers are usually defined as values more than 3 standard deviations away from the mean; here I found the outliers from the boxplots created in the data exploration phase.
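#a minimal sketch of the 3-standard-deviation rule described above; find_outliers_3sd is a
#helper written here only for illustration (the boxplots above were the method actually used)
find_outliers_3sd <- function(x) {
m <- mean(x)
s <- sd(x)
x[x < m - 3 * s | x > m + 3 * s] #values outside mean +/- 3 SD
}
find_outliers_3sd(grad_admit$LOR)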
########################################Data Imputation################################
#there is one outlier each in LOR and Chance of Admit: the LOR outlier is 1 and the Chance of Admit outlier is 0.34.
#to impute the LOR outlier, I will replace it with the mode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
modeLOR <- getmode(grad_admit$LOR)
print(modeLOR)
## [1] 3
#The mode of LOR is 3 hence replacing the outlier with 3
grad_admit$LOR[grad_admit$LOR ==1] <- 3
#replacing the Chance of Admit outlier with the median, which is 0.72
grad_admit$Chance.of.Admit[grad_admit$Chance.of.Admit==0.34] <- 0.72
#converting "Research" to a categorical variable
grad_admit$Research[grad_admit$Research ==1] <- "Yes"
grad_admit$Research[grad_admit$Research ==0] <- "No"
grad_admit$Research <- as.factor(grad_admit$Research) #2 is yes, 1 is no
#creating a new categorical variable using the existing variable
grad_admit$chance_low_high[grad_admit$Chance.of.Admit <= 0.50 ] <- "low"
grad_admit$chance_low_high[grad_admit$Chance.of.Admit > 0.50 ] <- "high"
grad_admit$chance_low_high <- as.factor(grad_admit$chance_low_high)
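#a quick added check on the balance of the derived class: roughly 92% of students fall in the
#"high" group, so the accuracies reported later should be judged against this baseline
prop.table(table(grad_admit$chance_low_high))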
#removing unwanted columns
grad_admit$Serial.No. <- NULL
grad_admit$Chance.of.Admit <- NULL
str(grad_admit)
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
## $ chance_low_high : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 2 2 ...
////PHASE FOUR: MODELLING
//Selecting the modeling technique
In order to meet the data mining goals, I have decided to use three modeling techniques:
1. K-Nearest Neighbors using the caret and class packages
2. Multiple Regression
3. Decision tree classification using the C50 package
A stacked ensemble combining several models is built afterwards to improve performance.
//Generate test design
In order to test the models created by the various machine learning techniques, I split the dataset in two: one part for training the model and one for validating it. Depending on the model, between 65% and 80% of the data is used for training and the rest for validation.
//Build the model
I built three different machine learning models on the training datasets created in the previous step.
//Assess the model
The constructed models are assessed by applying them to the validation dataset. This gives the accuracy of each model; if changes are needed or I want to improve a model, I build a new model on the training set and evaluate it on the validation dataset.
##KNN
#creating a new dataset
grad_num_df <- grad_admit %>% select(GRE.Score, TOEFL.Score, CGPA, chance_low_high)
#normalizing the data
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
#passing the function
grad_num_df_n <- as.data.frame(lapply(grad_num_df[,1:3], normalize))
#splitting the dataset randomly
## 75% of the sample size
smp_size <- floor(0.75 * nrow(grad_num_df_n))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(grad_num_df_n)), size = smp_size)
grad_num_df_train <- grad_num_df_n[train_ind, ]
grad_num_df_test <- grad_num_df_n[-train_ind, ]
#creating labels
grad_num_df_train_labels <- grad_num_df[train_ind, 4, drop = TRUE]
grad_num_df_test_labels <- grad_num_df[-train_ind,4, drop = TRUE]
#implementing KNN
grad_admit_test_pred <- knn(train = grad_num_df_train, test = grad_num_df_test,
cl = grad_num_df_train_labels , k = 3)
#evaluating model performance
CrossTable(x = grad_num_df_test_labels, y = grad_admit_test_pred,
prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 125
##
##
## | grad_admit_test_pred
## grad_num_df_test_labels | high | low | Row Total |
## ------------------------|-----------|-----------|-----------|
## high | 112 | 2 | 114 |
## | 0.982 | 0.018 | 0.912 |
## | 0.949 | 0.286 | |
## | 0.896 | 0.016 | |
## ------------------------|-----------|-----------|-----------|
## low | 6 | 5 | 11 |
## | 0.545 | 0.455 | 0.088 |
## | 0.051 | 0.714 | |
## | 0.048 | 0.040 | |
## ------------------------|-----------|-----------|-----------|
## Column Total | 118 | 7 | 125 |
## | 0.944 | 0.056 | |
## ------------------------|-----------|-----------|-----------|
##
##
# overall, 8 students were misclassified. The model correctly classified 112 students as having a high chance of admission and 5 students as having a low chance. It incorrectly classified 2 students with an actual high chance as low (false negatives, treating 'high' as the positive class) and 6 students with an actual low chance as high (false positives).
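#a minimal added check: the overall accuracy computed directly from the predictions
mean(grad_admit_test_pred == grad_num_df_test_labels) #(112 + 5) / 125 = 0.936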
#KNN using caret package with different K values
set.seed(30)
new <- createDataPartition(y = grad_admit$chance_low_high,p = 0.65,list = FALSE) #creating train and test dataset
grad_train <-grad_admit[new,]
grad_test <- grad_admit[-new,]
con <- trainControl(method = "repeatedcv", number = 2, repeats = 5)
knn_model <- train(chance_low_high ~ ., data = grad_train,
method = "knn", trControl = con, preProcess = c("center","scale"))
#distinct object names avoid masking base predict() and class::knn()
knn_pred <- predict(knn_model, newdata = grad_test)
head(knn_pred)
## [1] high high high high high high
## Levels: high low
confusionMatrix(knn_pred, grad_test$chance_low_high)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 160 9
## low 2 3
##
## Accuracy : 0.9368
## 95% CI : (0.8897, 0.968)
## No Information Rate : 0.931
## P-Value [Acc > NIR] : 0.45749
##
## Kappa : 0.3256
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9877
## Specificity : 0.2500
## Pos Pred Value : 0.9467
## Neg Pred Value : 0.6000
## Prevalence : 0.9310
## Detection Rate : 0.9195
## Detection Prevalence : 0.9713
## Balanced Accuracy : 0.6188
##
## 'Positive' Class : high
##
#using the caret package we get an accuracy of about 94%, though 9 actual 'low' cases were predicted as 'high' and 2 actual 'high' cases as 'low'; note the no-information rate is already 93%.
#############different k-values##################################
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 2)
set.seed(3333)
knn_fit <- train(chance_low_high ~., data = grad_train, method = "knn",trControl=trctrl,preProcess = c("center", "scale"),tuneLength = 10)
knn_fit
## k-Nearest Neighbors
##
## 326 samples
## 7 predictor
## 2 classes: 'high', 'low'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 2 times)
## Summary of sample sizes: 293, 294, 294, 293, 294, 293, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9237494 0.189349840
## 7 0.9343165 0.234594025
## 9 0.9327986 0.226484248
## 11 0.9265987 0.108147794
## 13 0.9235183 0.021465201
## 15 0.9220477 -0.002380952
## 17 0.9235628 0.000000000
## 19 0.9235628 0.000000000
## 21 0.9235628 0.000000000
## 23 0.9235628 0.000000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
df <- knn_fit$results
ggplot(data = df) +
geom_line(mapping = aes(x = k, y = Accuracy)) +
xlab("K values") +
ylab("Accuracy")
#Plotting accuracy against different values of k shows that the highest accuracy is at k = 7.
grad_admit_test_pred5 <- knn(train = grad_num_df_train, test = grad_num_df_test,
cl = grad_num_df_train_labels , k = 7)
#evaluating model performance
CrossTable(x = grad_num_df_test_labels, y = grad_admit_test_pred5,
prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 125
##
##
## | grad_admit_test_pred5
## grad_num_df_test_labels | high | low | Row Total |
## ------------------------|-----------|-----------|-----------|
## high | 114 | 0 | 114 |
## | 1.000 | 0.000 | 0.912 |
## | 0.927 | 0.000 | |
## | 0.912 | 0.000 | |
## ------------------------|-----------|-----------|-----------|
## low | 9 | 2 | 11 |
## | 0.818 | 0.182 | 0.088 |
## | 0.073 | 1.000 | |
## | 0.072 | 0.016 | |
## ------------------------|-----------|-----------|-----------|
## Column Total | 123 | 2 | 125 |
## | 0.984 | 0.016 | |
## ------------------------|-----------|-----------|-----------|
##
##
#with k = 7, all 114 high-chance students are classified correctly, but 9 of the 11 low-chance students are missed, a side effect of the strong class imbalance
#Multiple regression
#creating dummy variables for the categorical variables in this dataset
str(OG_data)
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
OG_data <- dummy_cols(OG_data, select_columns = "University.Rating")
OG_data <- dummy_cols(OG_data, select_columns = "Research")
#####splitting the dataset into training and testing#########
set.seed(111)
split_ind <- sample(seq_len(nrow(OG_data)), size = smp_size) #reusing the 75% sample size (375) computed earlier
grad_MR_train <- OG_data[split_ind, ]
grad_MR_test <- OG_data[-split_ind, ]
#created a model using training dataset
mul_model <- glm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_1 + University.Rating_2 + University.Rating_3 + University.Rating_4 + University.Rating_5 +CGPA + Research_1 , data = grad_MR_train)
summary(mul_model)
##
## Call:
## glm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_1 +
## University.Rating_2 + University.Rating_3 + University.Rating_4 +
## University.Rating_5 + CGPA + Research_1, data = grad_MR_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.23191 -0.02386 0.00558 0.03555 0.14719
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.270944 0.125403 -10.135 < 2e-16 ***
## GRE.Score 0.001862 0.000566 3.291 0.00110 **
## TOEFL.Score 0.002496 0.001030 2.423 0.01589 *
## University.Rating_1 -0.028555 0.017457 -1.636 0.10275
## University.Rating_2 -0.038639 0.012981 -2.977 0.00311 **
## University.Rating_3 -0.022683 0.011347 -1.999 0.04635 *
## University.Rating_4 -0.021533 0.010943 -1.968 0.04984 *
## University.Rating_5 NA NA NA NA
## CGPA 0.133703 0.010271 13.018 < 2e-16 ***
## Research_1 0.024011 0.007538 3.185 0.00157 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.003581965)
##
## Null deviance: 7.2812 on 374 degrees of freedom
## Residual deviance: 1.3110 on 366 degrees of freedom
## AIC: -1036.8
##
## Number of Fisher Scoring iterations: 2
#as we can see, there are several insignificant variables, which can be removed by the backfitting (backward elimination) process.
#################model 2 created by backfitting#####################################
mul_model2 <- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_2 + University.Rating_3 +CGPA , data = grad_MR_train)
summary(mul_model2)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_2 +
## University.Rating_3 + CGPA, data = grad_MR_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.251814 -0.024133 0.004559 0.037784 0.149652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4988889 0.1082410 -13.848 < 2e-16 ***
## GRE.Score 0.0023334 0.0005537 4.214 3.16e-05 ***
## TOEFL.Score 0.0025505 0.0010328 2.469 0.014 *
## University.Rating_2 -0.0226375 0.0088578 -2.556 0.011 *
## University.Rating_3 -0.0059376 0.0077214 -0.769 0.442
## CGPA 0.1419755 0.0098325 14.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0608 on 369 degrees of freedom
## Multiple R-squared: 0.8127, Adjusted R-squared: 0.8101
## F-statistic: 320.2 on 5 and 369 DF, p-value: < 2.2e-16
#################model 3 created by backfitting####################################
mul_model3<- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_2 +CGPA , data = grad_MR_train)
summary(mul_model3)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating_2 +
## CGPA, data = grad_MR_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.255469 -0.024757 0.005077 0.037076 0.150162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5169906 0.1055919 -14.367 < 2e-16 ***
## GRE.Score 0.0023434 0.0005533 4.235 2.88e-05 ***
## TOEFL.Score 0.0026104 0.0010293 2.536 0.0116 *
## University.Rating_2 -0.0193948 0.0077854 -2.491 0.0132 *
## CGPA 0.1426537 0.0097875 14.575 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06076 on 370 degrees of freedom
## Multiple R-squared: 0.8124, Adjusted R-squared: 0.8103
## F-statistic: 400.5 on 4 and 370 DF, p-value: < 2.2e-16
#this model retains only significant variables; it was obtained by the backfitting process
#as we can see from the coefficients, we have statistically significant variables with p-values less than 0.05. Multiple R-squared measures how well the model explains the values of the dependent variable; the closer it is to 1, the better. We have about 0.81, so the model explains roughly 81 percent of the variation in the dependent variable, which is pretty good.
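#as a sanity check (an added sketch), the multiple R-squared can be reproduced from the
#residuals as 1 - SSE/SST; y_train is a helper name introduced here
y_train <- grad_MR_train$Chance.of.Admit
1 - sum(residuals(mul_model3)^2) / sum((y_train - mean(y_train))^2) #should match the 0.8124 above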
######################predicting the model############################
predMR <- predict(mul_model3, grad_MR_test[c("GRE.Score", "TOEFL.Score", "University.Rating_2", "CGPA")])
head(predMR)
## 2 10 11 12 18 19
## 0.7869347 0.7486852 0.7196205 0.8229516 0.6484984 0.7707196
head(grad_MR_test$Chance.of.Admit)
## [1] 0.76 0.45 0.52 0.84 0.65 0.63
###############calculating the root mean squared error for the prediction####
Sq_err <- (grad_MR_test$Chance.of.Admit - predMR)^ 2
#calculating the mean
avg_sq_err <- mean(Sq_err)
#taking square root of it
rmse <- sqrt(avg_sq_err)
rmse
## [1] 0.06712385
##the root mean squared error is very low; hence the model I created is good
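#added note: caret provides the same computation as a one-liner
RMSE(predMR, grad_MR_test$Chance.of.Admit) #should match the manual rmse above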
#Decision tree classification with C5.0
set.seed(123)
train_sample <- sample(500,100)
str(train_sample)
## int [1:100] 144 394 204 439 467 23 261 440 272 225 ...
#splitting the dataset randomly
grad_NB_test<- grad_admit[train_sample,]
grad_NB_train <- grad_admit[-train_sample,]
#the class proportions are similar across the train and test sets
prop.table(table(grad_NB_test$chance_low_high))
##
## high low
## 0.95 0.05
prop.table(table(grad_NB_train$chance_low_high))
##
## high low
## 0.92 0.08
head(grad_NB_train)
# grad_NB_test <- grad_NB_test[,-8]
# grad_NB_train <- grad_NB_train[,-8]
#creating a C5.0 decision tree model
nb_model <- C5.0(grad_NB_train[-8], grad_NB_train$chance_low_high)
nb_model
##
## Call:
## C5.0.default(x = grad_NB_train[-8], y = grad_NB_train$chance_low_high)
##
## Classification Tree
## Number of samples: 400
## Number of predictors: 7
##
## Tree size: 5
##
## Non-standard options: attempt to group attributes
summary(nb_model)
##
## Call:
## C5.0.default(x = grad_NB_train[-8], y = grad_NB_train$chance_low_high)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Apr 25 11:06:41 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 400 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## CGPA > 8.02: high (322/4)
## CGPA <= 8.02:
## :...TOEFL.Score <= 97: low (12/2)
## TOEFL.Score > 97:
## :...CGPA > 7.6: high (53/10)
## CGPA <= 7.6:
## :...GRE.Score <= 301: low (7)
## GRE.Score > 301: high (6/1)
##
##
## Evaluation on training data (400 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 5 17( 4.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 366 2 (a): class high
## 15 17 (b): class low
##
##
## Attribute usage:
##
## 100.00% CGPA
## 19.50% TOEFL.Score
## 3.25% GRE.Score
##
##
## Time: 0.0 secs
#The first lines of the model's tree can be read as: if CGPA is greater than 8.02, the chance of an admit is high; otherwise, if CGPA is at most 8.02 and the TOEFL score is at most 97, the chance is low. The (322/4) means that 322 cases reached that leaf and 4 of them were incorrectly classified. The error rate for this model is 4.2%: 2 actual "high" cases were misclassified as low, while 15 actual "low" cases were misclassified as high.
#evaluating the model performance using testing dataset
nb_pred <- predict(nb_model, grad_NB_test)
CrossTable(grad_NB_test$chance_low_high, nb_pred, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c("actual chance", "predicted chance"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted chance
## actual chance | high | low | Row Total |
## --------------|-----------|-----------|-----------|
## high | 91 | 4 | 95 |
## | 0.910 | 0.040 | |
## --------------|-----------|-----------|-----------|
## low | 2 | 3 | 5 |
## | 0.020 | 0.030 | |
## --------------|-----------|-----------|-----------|
## Column Total | 93 | 7 | 100 |
## --------------|-----------|-----------|-----------|
##
##
#the model correctly predicted that 91 students have a high chance of getting an admit and 3 students have a low chance, giving 94% accuracy and a 6% error rate; that is not a bad model, but we can try to boost the accuracy of decision trees
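#an added one-liner: the accuracy stated above, computed directly
mean(nb_pred == grad_NB_test$chance_low_high) #(91 + 3) / 100 = 0.94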
########################improving model performance##################################
#Let us see if we can boost the algorithm with the trials parameter, which indicates the number of separate decision trees to use in the boosted team. It sets an upper limit: the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy. We are starting with 10.
nb_boost <- C5.0(grad_NB_train[-8], grad_NB_train$chance_low_high, trials = 10)
nb_boost
##
## Call:
## C5.0.default(x = grad_NB_train[-8], y =
## grad_NB_train$chance_low_high, trials = 10)
##
## Classification Tree
## Number of samples: 400
## Number of predictors: 7
##
## Number of boosting iterations: 10
## Average tree size: 4.9
##
## Non-standard options: attempt to group attributes
summary(nb_boost)
##
## Call:
## C5.0.default(x = grad_NB_train[-8], y =
## grad_NB_train$chance_low_high, trials = 10)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Apr 25 11:06:41 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 400 cases (8 attributes) from undefined.data
##
## ----- Trial 0: -----
##
## Decision tree:
##
## CGPA > 8.02: high (322/4)
## CGPA <= 8.02:
## :...TOEFL.Score <= 97: low (12/2)
## TOEFL.Score > 97:
## :...CGPA > 7.6: high (53/10)
## CGPA <= 7.6:
## :...GRE.Score <= 301: low (7)
## GRE.Score > 301: high (6/1)
##
## ----- Trial 1: -----
##
## Decision tree:
##
## CGPA > 8.3: high (203.5/6.4)
## CGPA <= 8.3:
## :...SOP <= 2: low (80/19.8)
## SOP > 2: high (116.5/42.1)
##
## ----- Trial 2: -----
##
## Decision tree:
##
## CGPA > 8.6: high (119.6)
## CGPA <= 8.6:
## :...LOR > 3.5: high (29)
## LOR <= 3.5:
## :...CGPA <= 8: low (137/54.1)
## CGPA > 8:
## :...CGPA <= 8.57: high (105.7/26.5)
## CGPA > 8.57: low (8.7/0.6)
##
## ----- Trial 3: -----
##
## Decision tree:
##
## LOR > 3.5: high (88.1)
## LOR <= 3.5:
## :...CGPA > 8.6: high (32.8)
## CGPA <= 8.6:
## :...SOP <= 2.5: high (164.3/45.9)
## SOP > 2.5:
## :...University.Rating <= 3: low (103.9/43.7)
## University.Rating > 3: high (10.9)
##
## ----- Trial 4: -----
##
## Decision tree:
##
## CGPA > 8.3: high (128.1/5.4)
## CGPA <= 8.3:
## :...LOR > 3: high (46.9/7.1)
## LOR <= 3:
## :...SOP > 2: high (115.4/39.6)
## SOP <= 2:
## :...TOEFL.Score <= 103: high (87.5/37.4)
## TOEFL.Score > 103: low (22.1/1.7)
##
## ----- Trial 5: -----
##
## Decision tree:
##
## CGPA > 8.2: high (137.7/13.2)
## CGPA <= 8.2:
## :...GRE.Score > 309: high (75.4/24)
## GRE.Score <= 309:
## :...TOEFL.Score <= 95: low (15.8)
## TOEFL.Score > 95:
## :...University.Rating <= 1: high (34.4/13.3)
## University.Rating > 1: low (136.6/44.1)
##
## ----- Trial 6: -----
##
## Decision tree:
##
## LOR > 3: high (109/9)
## LOR <= 3:
## :...SOP <= 2: low (125.9/43.8)
## SOP > 2:
## :...TOEFL.Score <= 101: low (79.8/34.9)
## TOEFL.Score > 101: high (85.3/14.5)
##
## ----- Trial 7: -----
##
## Decision tree:
##
## CGPA > 8.3: high (85.2)
## CGPA <= 8.3:
## :...GRE.Score <= 296: low (29.2/7.9)
## GRE.Score > 296:
## :...LOR > 3: high (43.9/7.1)
## LOR <= 3:
## :...GRE.Score <= 301: high (68.8/18.2)
## GRE.Score > 301:
## :...TOEFL.Score <= 101: low (85.7/37)
## TOEFL.Score > 101: high (86.2/29.9)
##
## ----- Trial 8: -----
##
## Decision tree:
##
## CGPA > 8.3: high (71.6)
## CGPA <= 8.3:
## :...LOR > 3: high (34.3/3.3)
## LOR <= 3:
## :...TOEFL.Score <= 97: low (38.2/12.5)
## TOEFL.Score > 97:
## :...Research = Yes: low (61.4/29.1)
## Research = No:
## :...LOR <= 1.5: low (11.5/3.1)
## LOR > 1.5:
## :...SOP <= 3.5: high (167.8/48.8)
## SOP > 3.5: low (13.1/3.1)
##
## ----- Trial 9: -----
##
## Decision tree:
##
## CGPA > 8.02: high (155.3/4.9)
## CGPA <= 8.02:
## :...CGPA > 7.66: high (120.1/36.9)
## CGPA <= 7.66:
## :...GRE.Score <= 300: low (51.6/7.3)
## GRE.Score > 300: high (67.9/28.3)
##
##
## Evaluation on training data (400 cases):
##
## Trial Decision Tree
## ----- ----------------
## Size Errors
##
## 0 5 17( 4.2%)
## 1 3 38( 9.5%)
## 2 5 43(10.8%)
## 3 5 79(19.8%)
## 4 5 29( 7.2%)
## 5 5 47(11.8%)
## 6 4 58(14.5%)
## 7 6 32( 8.0%)
## 8 7 39( 9.8%)
## 9 4 19( 4.8%)
## boost 17( 4.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 367 1 (a): class high
## 16 16 (b): class low
##
##
## Attribute usage:
##
## 100.00% LOR
## 100.00% CGPA
## 54.25% SOP
## 46.25% TOEFL.Score
## 35.00% GRE.Score
## 31.50% University.Rating
## 22.50% Research
##
##
## Time: 0.0 secs
nb_pred_boost <- predict(nb_boost, grad_NB_test)
CrossTable(grad_NB_test$chance_low_high, nb_pred_boost, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c("actual chance", "predicted chance"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted chance
## actual chance | high | low | Row Total |
## --------------|-----------|-----------|-----------|
## high | 92 | 3 | 95 |
## | 0.920 | 0.030 | |
## --------------|-----------|-----------|-----------|
## low | 3 | 2 | 5 |
## | 0.030 | 0.020 | |
## --------------|-----------|-----------|-----------|
## Column Total | 95 | 5 | 100 |
## --------------|-----------|-----------|-----------|
##
##
#The boosted model obtained 94% accuracy, which is the same as before; it is still a good model
#Building Ensemble Methods
set.seed(1)
str(grad_admit)
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
## $ chance_low_high : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 2 2 ...
#Splitting the dataset into two parts based on the outcome: 75% and 25%
index <- createDataPartition(grad_admit$chance_low_high, p=0.75, list=FALSE)
trainSet_ensemble <- grad_admit[ index,]
testSet_ensemble <- grad_admit[-index,]
#Defining the training controls for multiple models
fitControl <- trainControl(
method = "cv",
number = 5,
savePredictions = 'final',
classProbs = T)
#Defining the predictors and outcome
predictors<-c("GRE.Score", "TOEFL.Score", "CGPA")
outcomeName<-'chance_low_high'
###################################Random Forest#####################################
#Training the random forest model
model_rf<-train(trainSet_ensemble[,predictors],trainSet_ensemble[,outcomeName],method='rf',trControl=fitControl,tuneLength=3)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
#Predicting using random forest model
testSet_ensemble$pred_rf<-predict(object = model_rf,testSet_ensemble[,predictors])
#Checking the accuracy of the random forest model
confusionMatrix(testSet_ensemble$chance_low_high,testSet_ensemble$pred_rf) #the accuracy is about 92%
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 113 2
## low 8 1
##
## Accuracy : 0.9194
## 95% CI : (0.8567, 0.9606)
## No Information Rate : 0.9758
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.1353
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.9339
## Specificity : 0.3333
## Pos Pred Value : 0.9826
## Neg Pred Value : 0.1111
## Prevalence : 0.9758
## Detection Rate : 0.9113
## Detection Prevalence : 0.9274
## Balanced Accuracy : 0.6336
##
## 'Positive' Class : high
##
#################################KNN##################################################
#Training the knn model
model_knn<-train(trainSet_ensemble[,predictors],trainSet_ensemble[,outcomeName],method='knn',trControl=fitControl,tuneLength=3)
#Predicting using knn model
testSet_ensemble$pred_knn<-predict(object = model_knn,testSet_ensemble[,predictors])
#Checking the accuracy of the knn model
confusionMatrix(testSet_ensemble$chance_low_high,testSet_ensemble$pred_knn) #the accuracy is about 92%
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 113 2
## low 8 1
##
## Accuracy : 0.9194
## 95% CI : (0.8567, 0.9606)
## No Information Rate : 0.9758
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.1353
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.9339
## Specificity : 0.3333
## Pos Pred Value : 0.9826
## Neg Pred Value : 0.1111
## Prevalence : 0.9758
## Detection Rate : 0.9113
## Detection Prevalence : 0.9274
## Balanced Accuracy : 0.6336
##
## 'Positive' Class : high
##
#################################Logistic regression###################################
#Training the Logistic regression model
model_lr<-train(trainSet_ensemble[,predictors],trainSet_ensemble[,outcomeName],method='glm',trControl=fitControl,tuneLength=3)
#Predicting using the logistic regression model
testSet_ensemble$pred_lr<-predict(object = model_lr,testSet_ensemble[,predictors])
#Checking the accuracy of the Logistic regression model
confusionMatrix(testSet_ensemble$chance_low_high,testSet_ensemble$pred_lr) #the accuracy is about 94%, slightly higher than the previous two models
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 115 0
## low 8 1
##
## Accuracy : 0.9355
## 95% CI : (0.8768, 0.9717)
## No Information Rate : 0.9919
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.1882
## Mcnemar's Test P-Value : 0.01333
##
## Sensitivity : 0.9350
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.1111
## Prevalence : 0.9919
## Detection Rate : 0.9274
## Detection Prevalence : 0.9274
## Balanced Accuracy : 0.9675
##
## 'Positive' Class : high
##
#averaging the predictions from each model. Since we are predicting whether the chance of admit is high or low, we average the predicted class probabilities
#Predicting the probabilities
testSet_ensemble$pred_rf_prob<-predict(object=model_rf,testSet_ensemble[,predictors],type='prob')
testSet_ensemble$pred_knn_prob<-predict(object = model_knn,testSet_ensemble[,predictors],type='prob')
testSet_ensemble$pred_lr_prob<-predict(object = model_lr,testSet_ensemble[,predictors],type='prob')
#Taking average of predictions
testSet_ensemble$pred_avg<-(testSet_ensemble$pred_rf_prob$high+testSet_ensemble$pred_knn_prob$high+testSet_ensemble$pred_lr_prob$high)/3
#Splitting into binary classes at 0.5
testSet_ensemble$pred_avg<-as.factor(ifelse(testSet_ensemble$pred_avg>0.5,'high','low')) #lowercase labels for consistency with the factor levels
#Implementing majority voting by assigning each observation the class predicted by a majority of the models
testSet_ensemble$pred_majority<-as.factor(ifelse(testSet_ensemble$pred_rf=='high' & testSet_ensemble$pred_knn=='high','high',ifelse(testSet_ensemble$pred_rf=='high' & testSet_ensemble$pred_lr=='high','high',ifelse(testSet_ensemble$pred_knn=='high' & testSet_ensemble$pred_lr=='high','high','low'))))
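#an equivalent, arguably clearer majority vote (an added sketch; high_votes and pred_majority2
#are new names introduced here): count the 'high' votes across the three models
high_votes <- (testSet_ensemble$pred_rf == 'high') +
(testSet_ensemble$pred_knn == 'high') +
(testSet_ensemble$pred_lr == 'high')
testSet_ensemble$pred_majority2 <- as.factor(ifelse(high_votes >= 2, 'high', 'low'))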
#Taking a weighted average of predictions; weights are generally higher for more accurate models, so the logistic regression model (the most accurate of the three) gets the largest weight, and the weights sum to 1
testSet_ensemble$pred_weighted_avg<-(testSet_ensemble$pred_rf_prob$high*0.25)+(testSet_ensemble$pred_knn_prob$high*0.25)+(testSet_ensemble$pred_lr_prob$high*0.5)
#Splitting into binary classes at 0.5
testSet_ensemble$pred_weighted_avg<-as.factor(ifelse(testSet_ensemble$pred_weighted_avg>0.5,'high','low'))
#########################Boosting using GBM############################################
#Gradient Boosting, aka GBM, is a powerful algorithm widely used in real-world applications. It constructs one tree at a time and primarily reduces bias (and to some extent variance). The weakness of the model is that it tends to overfit and is harder to tune.
#collecting the out-of-fold predicted probabilities from each base layer model on the training data
trainSet_ensemble$OOF_pred_rf<-model_rf$pred$high[order(model_rf$pred$rowIndex)]
trainSet_ensemble$OOF_pred_knn<-model_knn$pred$high[order(model_knn$pred$rowIndex)]
trainSet_ensemble$OOF_pred_lr<-model_lr$pred$high[order(model_lr$pred$rowIndex)]
#Predicting probabilities for the test data
testSet_ensemble$OOF_pred_rf<-predict(model_rf,testSet_ensemble[predictors],type='prob')$high
testSet_ensemble$OOF_pred_knn<-predict(model_knn,testSet_ensemble[predictors],type='prob')$high
testSet_ensemble$OOF_pred_lr<-predict(model_lr,testSet_ensemble[predictors],type='prob')$high
#Predictors for top layer models
predictors_top<-c('OOF_pred_rf','OOF_pred_knn','OOF_pred_lr')
#GBM as the top layer model, trained on the out-of-fold predictions
model_gbm<-
train(trainSet_ensemble[,predictors_top],trainSet_ensemble[,outcomeName],method='gbm',trControl=fitControl,tuneLength=3)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4463 nan 0.1000 0.0205
## 2 0.4218 nan 0.1000 0.0116
## 3 0.4017 nan 0.1000 0.0103
## 4 0.3818 nan 0.1000 0.0097
## 5 0.3706 nan 0.1000 0.0061
## 6 0.3592 nan 0.1000 0.0042
## 7 0.3505 nan 0.1000 0.0028
## 8 0.3420 nan 0.1000 0.0018
## 9 0.3356 nan 0.1000 0.0027
## 10 0.3295 nan 0.1000 0.0015
## 20 0.2838 nan 0.1000 0.0007
## 40 0.2527 nan 0.1000 -0.0013
## 60 0.2352 nan 0.1000 0.0001
## 80 0.2273 nan 0.1000 -0.0003
## 100 0.2239 nan 0.1000 -0.0018
## 120 0.2154 nan 0.1000 -0.0011
## 140 0.2079 nan 0.1000 -0.0005
## 150 0.2060 nan 0.1000 -0.0003
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4272 nan 0.1000 0.0355
## 2 0.4054 nan 0.1000 0.0119
## 3 0.3856 nan 0.1000 0.0064
## 4 0.3702 nan 0.1000 0.0021
## 5 0.3548 nan 0.1000 0.0060
## 6 0.3405 nan 0.1000 0.0053
## 7 0.3303 nan 0.1000 0.0057
## 8 0.3195 nan 0.1000 0.0021
## 9 0.3128 nan 0.1000 0.0011
## 10 0.3069 nan 0.1000 0.0001
## 20 0.2562 nan 0.1000 -0.0022
## 40 0.2211 nan 0.1000 0.0000
## 60 0.1946 nan 0.1000 -0.0025
## 80 0.1839 nan 0.1000 -0.0027
## 100 0.1677 nan 0.1000 -0.0019
## 120 0.1576 nan 0.1000 -0.0010
## 140 0.1483 nan 0.1000 -0.0011
## 150 0.1434 nan 0.1000 -0.0014
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4600 nan 0.1000 0.0319
## 2 0.4171 nan 0.1000 0.0180
## 3 0.3810 nan 0.1000 0.0154
## 4 0.3638 nan 0.1000 0.0017
## 5 0.3467 nan 0.1000 0.0046
## 6 0.3264 nan 0.1000 0.0025
## 7 0.3123 nan 0.1000 0.0053
## 8 0.3035 nan 0.1000 0.0037
## 9 0.2930 nan 0.1000 -0.0008
## 10 0.2855 nan 0.1000 -0.0005
## 20 0.2407 nan 0.1000 0.0002
## 40 0.1997 nan 0.1000 -0.0016
## 60 0.1745 nan 0.1000 -0.0006
## 80 0.1527 nan 0.1000 -0.0014
## 100 0.1366 nan 0.1000 -0.0021
## 120 0.1255 nan 0.1000 -0.0012
## 140 0.1149 nan 0.1000 -0.0006
## 150 0.1087 nan 0.1000 -0.0004
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4338 nan 0.1000 0.0215
## 2 0.3981 nan 0.1000 0.0085
## 3 0.3803 nan 0.1000 0.0080
## 4 0.3600 nan 0.1000 0.0076
## 5 0.3463 nan 0.1000 0.0058
## 6 0.3291 nan 0.1000 0.0045
## 7 0.3178 nan 0.1000 0.0033
## 8 0.3074 nan 0.1000 0.0051
## 9 0.3029 nan 0.1000 0.0008
## 10 0.2971 nan 0.1000 0.0000
## 20 0.2648 nan 0.1000 0.0015
## 40 0.2470 nan 0.1000 -0.0014
## 60 0.2383 nan 0.1000 -0.0012
## 80 0.2238 nan 0.1000 -0.0015
## 100 0.2151 nan 0.1000 -0.0020
## 120 0.2087 nan 0.1000 -0.0001
## 140 0.2056 nan 0.1000 -0.0001
## 150 0.2005 nan 0.1000 -0.0009
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4426 nan 0.1000 0.0477
## 2 0.4049 nan 0.1000 0.0171
## 3 0.3752 nan 0.1000 0.0153
## 4 0.3592 nan 0.1000 0.0046
## 5 0.3408 nan 0.1000 0.0066
## 6 0.3298 nan 0.1000 0.0053
## 7 0.3185 nan 0.1000 0.0057
## 8 0.3069 nan 0.1000 0.0059
## 9 0.3065 nan 0.1000 -0.0018
## 10 0.2986 nan 0.1000 0.0035
## 20 0.2605 nan 0.1000 -0.0021
## 40 0.2276 nan 0.1000 -0.0023
## 60 0.2081 nan 0.1000 0.0007
## 80 0.1932 nan 0.1000 -0.0015
## 100 0.1826 nan 0.1000 -0.0010
## 120 0.1738 nan 0.1000 -0.0020
## 140 0.1658 nan 0.1000 -0.0014
## 150 0.1625 nan 0.1000 -0.0005
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4245 nan 0.1000 0.0266
## 2 0.3826 nan 0.1000 0.0146
## 3 0.3586 nan 0.1000 0.0071
## 4 0.3444 nan 0.1000 0.0043
## 5 0.3277 nan 0.1000 0.0051
## 6 0.3119 nan 0.1000 0.0057
## 7 0.3025 nan 0.1000 0.0014
## 8 0.2933 nan 0.1000 0.0020
## 9 0.2836 nan 0.1000 0.0016
## 10 0.2751 nan 0.1000 0.0018
## 20 0.2349 nan 0.1000 -0.0007
## 40 0.2017 nan 0.1000 -0.0017
## 60 0.1834 nan 0.1000 -0.0019
## 80 0.1739 nan 0.1000 -0.0019
## 100 0.1612 nan 0.1000 -0.0012
## 120 0.1465 nan 0.1000 -0.0014
## 140 0.1333 nan 0.1000 -0.0006
## 150 0.1278 nan 0.1000 -0.0007
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4468 nan 0.1000 0.0454
## 2 0.4038 nan 0.1000 0.0115
## 3 0.3715 nan 0.1000 0.0091
## 4 0.3575 nan 0.1000 0.0088
## 5 0.3375 nan 0.1000 0.0054
## 6 0.3272 nan 0.1000 0.0057
## 7 0.3150 nan 0.1000 0.0059
## 8 0.3050 nan 0.1000 0.0013
## 9 0.2961 nan 0.1000 0.0012
## 10 0.2888 nan 0.1000 0.0027
## 20 0.2512 nan 0.1000 -0.0020
## 40 0.2282 nan 0.1000 -0.0005
## 60 0.2214 nan 0.1000 -0.0010
## 80 0.2118 nan 0.1000 -0.0014
## 100 0.2047 nan 0.1000 0.0001
## 120 0.1982 nan 0.1000 -0.0005
## 140 0.1934 nan 0.1000 -0.0008
## 150 0.1920 nan 0.1000 -0.0009
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4277 nan 0.1000 0.0358
## 2 0.4006 nan 0.1000 0.0101
## 3 0.3678 nan 0.1000 0.0149
## 4 0.3446 nan 0.1000 0.0097
## 5 0.3284 nan 0.1000 0.0085
## 6 0.3147 nan 0.1000 0.0026
## 7 0.3011 nan 0.1000 0.0051
## 8 0.2922 nan 0.1000 0.0034
## 9 0.2874 nan 0.1000 0.0015
## 10 0.2809 nan 0.1000 0.0033
## 20 0.2406 nan 0.1000 0.0010
## 40 0.2082 nan 0.1000 -0.0006
## 60 0.1952 nan 0.1000 -0.0025
## 80 0.1833 nan 0.1000 -0.0018
## 100 0.1727 nan 0.1000 -0.0018
## 120 0.1632 nan 0.1000 -0.0014
## 140 0.1523 nan 0.1000 -0.0008
## 150 0.1495 nan 0.1000 -0.0005
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4257 nan 0.1000 0.0397
## 2 0.3840 nan 0.1000 0.0111
## 3 0.3613 nan 0.1000 0.0121
## 4 0.3456 nan 0.1000 0.0051
## 5 0.3228 nan 0.1000 0.0070
## 6 0.3086 nan 0.1000 0.0041
## 7 0.3001 nan 0.1000 0.0029
## 8 0.2886 nan 0.1000 0.0017
## 9 0.2811 nan 0.1000 0.0026
## 10 0.2703 nan 0.1000 0.0009
## 20 0.2141 nan 0.1000 -0.0012
## 40 0.1860 nan 0.1000 -0.0020
## 60 0.1633 nan 0.1000 -0.0027
## 80 0.1517 nan 0.1000 -0.0011
## 100 0.1370 nan 0.1000 -0.0011
## 120 0.1175 nan 0.1000 -0.0016
## 140 0.1094 nan 0.1000 -0.0013
## 150 0.1026 nan 0.1000 -0.0011
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4703 nan 0.1000 0.0323
## 2 0.4404 nan 0.1000 0.0071
## 3 0.4246 nan 0.1000 0.0054
## 4 0.4080 nan 0.1000 0.0089
## 5 0.3957 nan 0.1000 0.0041
## 6 0.3829 nan 0.1000 0.0039
## 7 0.3708 nan 0.1000 0.0048
## 8 0.3610 nan 0.1000 0.0035
## 9 0.3564 nan 0.1000 0.0010
## 10 0.3519 nan 0.1000 -0.0003
## 20 0.3192 nan 0.1000 0.0002
## 40 0.3019 nan 0.1000 -0.0002
## 60 0.2874 nan 0.1000 0.0002
## 80 0.2740 nan 0.1000 -0.0015
## 100 0.2636 nan 0.1000 -0.0004
## 120 0.2551 nan 0.1000 -0.0009
## 140 0.2503 nan 0.1000 -0.0012
## 150 0.2479 nan 0.1000 -0.0005
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4827 nan 0.1000 0.0270
## 2 0.4411 nan 0.1000 0.0155
## 3 0.4181 nan 0.1000 0.0104
## 4 0.3982 nan 0.1000 0.0070
## 5 0.3757 nan 0.1000 0.0043
## 6 0.3605 nan 0.1000 0.0034
## 7 0.3504 nan 0.1000 0.0013
## 8 0.3381 nan 0.1000 0.0047
## 9 0.3362 nan 0.1000 -0.0021
## 10 0.3289 nan 0.1000 0.0014
## 20 0.2931 nan 0.1000 -0.0017
## 40 0.2702 nan 0.1000 -0.0004
## 60 0.2486 nan 0.1000 -0.0013
## 80 0.2349 nan 0.1000 -0.0023
## 100 0.2256 nan 0.1000 -0.0017
## 120 0.2104 nan 0.1000 -0.0010
## 140 0.2018 nan 0.1000 -0.0015
## 150 0.2002 nan 0.1000 -0.0005
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4893 nan 0.1000 0.0273
## 2 0.4364 nan 0.1000 0.0190
## 3 0.4153 nan 0.1000 0.0097
## 4 0.3919 nan 0.1000 0.0106
## 5 0.3782 nan 0.1000 0.0038
## 6 0.3637 nan 0.1000 0.0038
## 7 0.3548 nan 0.1000 0.0017
## 8 0.3470 nan 0.1000 -0.0002
## 9 0.3403 nan 0.1000 0.0005
## 10 0.3323 nan 0.1000 0.0010
## 20 0.2810 nan 0.1000 -0.0022
## 40 0.2427 nan 0.1000 -0.0046
## 60 0.2191 nan 0.1000 -0.0032
## 80 0.2037 nan 0.1000 -0.0015
## 100 0.1921 nan 0.1000 -0.0027
## 120 0.1802 nan 0.1000 -0.0008
## 140 0.1662 nan 0.1000 -0.0007
## 150 0.1609 nan 0.1000 -0.0016
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4802 nan 0.1000 0.0321
## 2 0.4459 nan 0.1000 0.0092
## 3 0.4238 nan 0.1000 0.0062
## 4 0.4052 nan 0.1000 0.0089
## 5 0.3950 nan 0.1000 0.0035
## 6 0.3799 nan 0.1000 0.0058
## 7 0.3696 nan 0.1000 0.0038
## 8 0.3640 nan 0.1000 0.0010
## 9 0.3653 nan 0.1000 -0.0029
## 10 0.3590 nan 0.1000 0.0007
## 20 0.3287 nan 0.1000 -0.0011
## 40 0.3126 nan 0.1000 -0.0008
## 60 0.2950 nan 0.1000 -0.0002
## 80 0.2851 nan 0.1000 -0.0018
## 100 0.2773 nan 0.1000 -0.0005
## 120 0.2704 nan 0.1000 0.0000
## 140 0.2634 nan 0.1000 -0.0007
## 150 0.2612 nan 0.1000 -0.0019
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4859 nan 0.1000 0.0291
## 2 0.4451 nan 0.1000 0.0172
## 3 0.4258 nan 0.1000 0.0049
## 4 0.4051 nan 0.1000 0.0094
## 5 0.3903 nan 0.1000 0.0067
## 6 0.3778 nan 0.1000 0.0067
## 7 0.3728 nan 0.1000 -0.0008
## 8 0.3556 nan 0.1000 0.0079
## 9 0.3434 nan 0.1000 0.0035
## 10 0.3375 nan 0.1000 0.0010
## 20 0.3046 nan 0.1000 -0.0013
## 40 0.2674 nan 0.1000 -0.0012
## 60 0.2520 nan 0.1000 -0.0009
## 80 0.2386 nan 0.1000 -0.0014
## 100 0.2288 nan 0.1000 -0.0008
## 120 0.2213 nan 0.1000 -0.0019
## 140 0.2105 nan 0.1000 -0.0006
## 150 0.2095 nan 0.1000 -0.0016
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4718 nan 0.1000 0.0230
## 2 0.4350 nan 0.1000 0.0126
## 3 0.4118 nan 0.1000 0.0102
## 4 0.3953 nan 0.1000 0.0066
## 5 0.3733 nan 0.1000 0.0054
## 6 0.3622 nan 0.1000 0.0053
## 7 0.3510 nan 0.1000 0.0038
## 8 0.3443 nan 0.1000 0.0020
## 9 0.3327 nan 0.1000 0.0007
## 10 0.3230 nan 0.1000 0.0000
## 20 0.2850 nan 0.1000 -0.0010
## 40 0.2520 nan 0.1000 -0.0017
## 60 0.2334 nan 0.1000 -0.0012
## 80 0.2200 nan 0.1000 -0.0006
## 100 0.2057 nan 0.1000 -0.0022
## 120 0.1962 nan 0.1000 -0.0021
## 140 0.1821 nan 0.1000 -0.0011
## 150 0.1754 nan 0.1000 -0.0014
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.4694 nan 0.1000 0.0273
## 2 0.4467 nan 0.1000 0.0075
## 3 0.4198 nan 0.1000 0.0120
## 4 0.3955 nan 0.1000 0.0061
## 5 0.3731 nan 0.1000 0.0084
## 6 0.3656 nan 0.1000 0.0033
## 7 0.3562 nan 0.1000 0.0042
## 8 0.3511 nan 0.1000 0.0010
## 9 0.3424 nan 0.1000 0.0039
## 10 0.3329 nan 0.1000 0.0042
## 20 0.3100 nan 0.1000 -0.0023
## 40 0.2758 nan 0.1000 -0.0011
## 50 0.2690 nan 0.1000 -0.0007
#Logistic regression as an alternative top layer model
model_glm<-
train(trainSet_ensemble[,predictors_top],trainSet_ensemble[,outcomeName],method='glm',trControl=fitControl,tuneLength=3)
#############################Stacked ensemble method################################
#predict using GBM top layer model
testSet_ensemble$gbm_stacked<-predict(model_gbm,testSet_ensemble[,predictors_top])
#predict using the logistic regression top layer model
testSet_ensemble$glm_stacked<-predict(model_glm,testSet_ensemble[,predictors_top])
#Since the dataset is fairly clean and neat, all the predictions were higher than usual. The above code displays how a stacked ensemble works. Ensemble methods are very useful when the dataset is big and messy; they combine two or more algorithms to produce more accurate results. There are different types, but the code above implements the stacked ensemble method: stacking multiple layers of machine learning algorithms on top of one another, where each model passes its predictions to the model in the layer above, and the top-layer model makes the final decision based on the inputs given to it.
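#an added check: the stacked top-layer predictions can be scored the same way as the base models
confusionMatrix(testSet_ensemble$chance_low_high, testSet_ensemble$gbm_stacked)
confusionMatrix(testSet_ensemble$chance_low_high, testSet_ensemble$glm_stacked)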
////PHASE FIVE: EVALUATION
//Evaluate results
To evaluate the models built, I take a few factors into account, such as accuracy, RMSE, p-value, and R-squared. I tuned each model until it reached its highest accuracy.
#comparing the models by their error rates for a better understanding (RMSE for the regression model, classification error rate for the others)
errors_rate=data.frame(model_name=c("Decision_Tree", "Multiple_linear_regression", "KNN") , Error_rate=c(0.34, 0.060, 0.065))
ggplot(errors_rate, aes(x=model_name, y=Error_rate)) + geom_bar(stat = "identity")
#comparing the three machine learning techniques, multiple linear regression has the lowest error rate of all. Hence the best machine learning model for this dataset is multiple regression.
//Review process
This phase generally consists of cross-verifying the algorithm and checking that it runs as expected. I re-ran the code from scratch to check for errors and gave my insights on every model and every piece of code written above.
//Determining the next step
The next steps for this project are:
1. Cross-verify the code
2. Deploy the project
////PHASE SIX : DEPLOYMENT
///Plan deployment
I have planned to deploy this project on my GitHub and RPubs. This will help students who aspire to do a master's degree in the US; they can use this project to gauge their chance of admission at a university.
///Plan monitoring and maintenance
The project will be updated every three years with new datasets. Some variables, like GRE scores, have changed in the past: each GRE section used to be scored out of 800, but the current combined scale is out of 340. The project will be updated if there are any such changes in the factors affecting the chance of admission.
///Produce final report
The final report for the project will consist of:
1. A PDF file describing the steps and data mining concepts, along with the code and appropriate comments
2. A brief PowerPoint presentation describing the steps and actions taken in the project
3. An access link for people who want to take a look at it
References
1. Grolemund, Garrett, and Hadley Wickham. R for Data Science. Accessed April 24, 2019. https://r4ds.had.co.nz/.
2. "How to Build Ensemble Models in Machine Learning? (With Code in R)." Analytics Vidhya (blog), February 15, 2017. https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/.
3. Lantz, Brett. Machine Learning with R. Second edition. Packt Publishing.