library(readxl)
## Warning: package 'readxl' was built under R version 3.5.3
data <- read.csv('F:/Machine Learning/Data Science/Machine Learning/Naive_Bayes/binary.csv')
str(data)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
#convert integer into factor
data$rank <- as.factor(data$rank)
data$admit <- as.factor(data$admit)
str(data)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
#cross-tabulation to check frequencies
xtabs(~admit+rank, data)
## rank
## admit 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
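To read these counts as admission rates within each rank, the same cross-tabulation can be turned into column-wise proportions (a small sketch using base R's prop.table):
#admission rate within each rank (column proportions of the cross-tabulation)
prop.table(xtabs(~admit+rank, data), margin = 2)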
#Visualization of correlation
library(psych)
## Warning: package 'psych' was built under R version 3.5.3
#exclude the target variable (the first column) before plotting the correlation panel
pairs.panels(data[-1])
The plot above shows that the relationship between GRE and GPA is not strong: the correlation is only 0.38. One thing to remember when developing a Naive Bayes model is that the independent variables should not be highly correlated, because the model assumes they are conditionally independent.
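As a quick numeric cross-check of the value reported by pairs.panels (a minimal base-R sketch):
#Pearson correlation between the two numeric predictors
cor(data$gre, data$gpa)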
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#boxplot
data %>% ggplot(aes(admit, gre, fill=admit)) +
geom_boxplot() +
ggtitle("Boxplot of GRE by admit")
data %>% ggplot(aes(admit, gpa, fill=admit)) +
geom_boxplot() +
ggtitle("Boxplot of GPA by admit")
The box plots show a considerable amount of overlap between the two classes. Even so, students with higher GRE and GPA scores clearly have a better chance of being admitted.
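To back the visual impression up with numbers, the group-wise means can be computed directly (a minimal sketch using dplyr, which is loaded above):
#mean GRE and GPA by admission status
data %>%
  group_by(admit) %>%
  summarise(mean_gre = mean(gre), mean_gpa = mean(gpa))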
#Density plot to check overlap
data %>% ggplot(aes(gre,fill=admit))+
geom_density(alpha=.6)+
ggtitle("Density Plot GRE vs admit")
data %>% ggplot(aes(gpa, fill=admit))+
geom_density(alpha=.5)+
ggtitle("Density Plot GPA vs admit")
In the GRE density plot, the distribution for admit=1 is shifted to the right: admitted students tend to have higher GRE scores than the non-admitted students, whose distribution sits further to the left.
The density plots also make the overlap between the two classes clearer. There is certainly scope to build a model, but because of this overlap it is unlikely to be 100% accurate.
#data partition
set.seed(2498)
ind <- sample(2, nrow(data), replace = TRUE, prob=c(.8,.2))
train <- data[ind==1,]
test <- data[ind==2,]
dim(train)
## [1] 322 4
dim(test)
## [1] 78 4
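Since the split is random, it is worth checking that both partitions retain roughly the same class balance as the full data (a short base-R check):
#class balance in the training and test partitions
prop.table(table(train$admit))
prop.table(table(test$admit))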
#Naive Bayes Model
library(e1071)
## Warning: package 'e1071' was built under R version 3.5.3
nb.model <- naiveBayes(admit ~ ., data=train)
nb.model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.6832298 0.3167702
##
## Conditional probabilities:
## gre
## Y [,1] [,2]
## 0 567.4545 116.9587
## 1 619.8039 102.4258
##
## gpa
## Y [,1] [,2]
## 0 3.346318 0.3899954
## 1 3.489118 0.3702317
##
## rank
## Y 1 2 3 4
## 0 0.10000000 0.34545455 0.35000000 0.20454545
## 1 0.29411765 0.42156863 0.19607843 0.08823529
In the training data, about 68.3% of students belong to the category admit=0, meaning they were not admitted, and 31.7% belong to admit=1. In other words, 31.7% of the students in the training data were admitted to the program.
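These a-priori probabilities are simply the class proportions in the training data; the fitted e1071 object stores the underlying class counts in its apriori component (a small sketch against the model object):
#class counts stored in the model and the corresponding priors
nb.model$apriori
nb.model$apriori / sum(nb.model$apriori)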
#Predict train data
p <- predict(nb.model, train, type = "raw")
p.class <- predict(nb.model, train, type = "class")
p.class <- ifelse(p.class == 1, 1, 0)  #convert the predicted factor to a numeric 0/1 label
#comparing predicted probabilities and predicted class vs actual class
head(cbind(p, p.class, train))
## 0 1 p.class admit gre gpa rank
## 2 0.6686904 0.3313096 0 1 660 3.67 3
## 3 0.2015633 0.7984367 1 1 800 4.00 1
## 4 0.8172556 0.1827444 0 1 640 3.19 4
## 6 0.6094223 0.3905777 0 1 760 3.00 2
## 7 0.5443692 0.4556308 0 1 560 2.98 1
## 9 0.8128942 0.1871058 0 1 540 3.39 3
According to the model, the 2nd applicant has a 67% probability of not being admitted and a 33% chance of being admitted, but in reality the student was admitted, so this is a misclassification error. This student had a GRE score of 660 and a GPA of 3.67 but came from a rank 3 (lower-ranked) university, which, together with the higher prior for non-admission, pulled the prediction towards admit=0.
The 3rd applicant has a 20% probability of not being admitted and an 80% chance of being admitted, so the model predicts admission; in reality the student was indeed admitted, so this is a correct classification. This student has a high GRE score (800), a high GPA (4.00), and comes from a rank 1 university.
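To see where these posterior probabilities come from, the prediction for the 3rd applicant can be reproduced by hand from the model's stored parameters: e1071 fits a Gaussian density for each numeric predictor (the two columns of each conditional table are the class-wise mean and standard deviation) and a simple probability table for rank. A sketch follows; the result should closely match the predict() output above.
#reproduce the posterior for the 3rd applicant (gre = 800, gpa = 4.00, rank = 1)
prior <- nb.model$apriori / sum(nb.model$apriori)
lik <- sapply(c("0", "1"), function(cl) {
  dnorm(800, nb.model$tables$gre[cl, 1], nb.model$tables$gre[cl, 2]) *   #P(gre | class)
  dnorm(4.00, nb.model$tables$gpa[cl, 1], nb.model$tables$gpa[cl, 2]) *  #P(gpa | class)
  nb.model$tables$rank[cl, "1"]                                          #P(rank = 1 | class)
})
#normalise prior * likelihood to obtain the posterior probabilities
prior * lik / sum(prior * lik)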
#Confusion matrix - train data
(tab<-table(train$admit, p.class))
## p.class
## 0 1
## 0 190 30
## 1 70 32
#accuracy
sum(diag(tab))/sum(tab)
## [1] 0.689441
#misclassification error on train data
1-sum(diag(tab))/sum(tab)
## [1] 0.310559
So the misclassification error is about 31% on the training data.
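Accuracy alone hides where the errors occur: most of the misclassifications are admitted students (admit=1) being predicted as not admitted. The per-class rates can be read off the same confusion matrix (a small base-R sketch):
#class-wise accuracy and error on the train data (rows of tab are the actual classes)
diag(tab) / rowSums(tab)
1 - diag(tab) / rowSums(tab)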
#Predict test data
p1.class <- predict(nb.model, test, type = "class")
p1 <- predict(nb.model, test, type="raw")
p1.class <- ifelse(p1.class == 1, 1, 0)  #convert the predicted factor to a numeric 0/1 label
#comparing predicted probabilities and predicted class vs actual class
head(cbind(p1, p1.class, test))
## 0 1 p1.class admit gre gpa rank
## 1 0.9201543 0.07984572 0 0 380 3.61 3
## 5 0.9158862 0.08411378 0 0 520 2.93 4
## 8 0.8849241 0.11507586 0 0 400 3.08 2
## 14 0.6050627 0.39493734 0 0 700 3.08 2
## 17 0.6511811 0.34881886 0 0 780 3.87 4
## 21 0.8754905 0.12450953 0 0 500 3.17 3
#Confusion matrix - test data
(tab1<-table(test$admit, p1.class))
## p1.class
## 0 1
## 0 49 4
## 1 17 8
#accuracy
sum(diag(tab1))/sum(tab1)
## [1] 0.7307692
#misclassification error on test data
1-sum(diag(tab1))/sum(tab1)
## [1] 0.2692308
So the misclassification error is about 27% on the test data, slightly lower than on the training data.
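Since the raw posterior probabilities for the test set are already stored in p1, the model can also be assessed independently of the 0.5 cut-off with an ROC curve (a sketch assuming the pROC package is installed):
library(pROC)
#ROC curve and AUC on the test data, using the posterior probability of admit = 1
roc.test <- roc(response = test$admit, predictor = p1[, "1"])
plot(roc.test)
auc(roc.test)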