In machine learning, naïve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. A naive Bayes model predicts membership probabilities for each class, i.e., the probability that a given record or data point belongs to a particular class, and the class with the highest posterior probability is taken as the most likely class.
The data set contains the admission records of 400 students: their GRE score, GPA, the rank of their school, and whether or not they were admitted.
We will use the following libraries:
library(naivebayes)
## naivebayes 0.9.7 loaded
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(corrplot)
## corrplot 0.84 loaded
library(RColorBrewer)
Let's load the data as 'data' and examine it.
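The load step itself does not appear in the knitted output. A minimal sketch, assuming the admissions data sits in a local CSV file named 'binary.csv' (the file name and location are assumptions; read_csv is implied by the spec attribute shown by str(data) below):
library(readr)
data <- read_csv("binary.csv") # yields the spec_tbl_df seen in str(data)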
summary(data)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 400 obs. of 4 variables:
## $ admit: num 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : num 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : num 3 3 1 4 4 2 1 2 3 2 ...
## - attr(*, "spec")=
## .. cols(
## .. admit = col_double(),
## .. gre = col_double(),
## .. gpa = col_double(),
## .. rank = col_double()
## .. )
Since we are using a Naive Bayes classification model, we must make sure that the independent variables are not highly correlated.
M <- cor(data)
corrplot(M, type = "upper", order = "hclust", method = "number", col = brewer.pal(n = 8, name = "RdYlBu"))
Here, the two numeric variables are gre and gpa, and the correlation between them is not very strong.
We observe that the variables 'admit' and 'rank' are stored as numbers while they should be factors, so let's convert them.
data$admit <- as.factor(data$admit)
data$rank <- as.factor(data$rank)
We can visually compare the gre distributions for the two classes of the 'admit' variable by building a boxplot:
ggplot(data, aes(x = admit, y = gre, fill = admit)) +
  geom_boxplot() +
  ggtitle("comparison of distribution of admit")
A significant amount of overlap is observed between the distributions of the admitted and not-admitted students, but the overall average gre is higher for the students who were admitted. This can be illustrated more clearly for both gre and gpa with density plots.
ggplot(data, aes(x = gre, fill = admit))+
geom_density(alpha = 0.8, colour = "black")+
ggtitle("Density Plot GRE")
ggplot(data, aes(x = gpa, fill = admit))+
geom_density(alpha = 0.8, colour = "black")+
ggtitle("Density Plot GPA")
The distribution for the admitted students, shown in blue, is shifted to the right; comparatively, they had higher gre/gpa scores than the students who were not admitted.
Hence there is some potential to build a classification model, but its accuracy may be limited since the two classes overlap.
To move ahead with building a model, we will partition the data into train and test samples:
set.seed(1234)
ind <- sample(2, nrow(data), replace = T, prob = c(0.8, 0.2))
train <- data[ind ==1,]
test <- data[ind == 2,]
The data has been divided into 325 observations of training data and 75 observations of test data.
Naive Bayes Model
The naive Bayes classifier is based on Bayes' theorem, which states that P(A | B) = P(B | A) * P(A) / P(B). The 'naive' part is not an independence assumption about A and B, but the assumption that the features are conditionally independent of one another given the class.
Hence, the probability that a student is admitted given that the student comes from a rank 1 school is:
P(admit = 1 | rank = 1) = P(rank = 1 | admit = 1) * P(admit = 1) / P(rank = 1)
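Using the a priori probabilities and the rank table reported by the fitted model below, this posterior can be computed by hand. A small sketch (all numbers are taken from the model output that follows):
prior_0 <- 0.6861538; prior_1 <- 0.3138462 # a priori class probabilities
lik_0 <- 0.10313901 # P(rank = 1 | admit = 0), from the model's rank table
lik_1 <- 0.24509804 # P(rank = 1 | admit = 1)
# Bayes' theorem: P(admit = 1 | rank = 1), with P(rank = 1) as the normaliser
prior_1 * lik_1 / (prior_0 * lik_0 + prior_1 * lik_1) # = 25/48, about 0.521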
#Naive Bayes Model
model <- naive_bayes(admit ~., data = train)
model
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = admit ~ ., data = train)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## 0 1
## 0.6861538 0.3138462
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: gre (Gaussian)
## ---------------------------------------------------------------------------------
##
## gre 0 1
## mean 578.6547 622.9412
## sd 116.3250 110.9240
##
## ---------------------------------------------------------------------------------
## ::: gpa (Gaussian)
## ---------------------------------------------------------------------------------
##
## gpa 0 1
## mean 3.3552466 3.5336275
## sd 0.3714542 0.3457057
##
## ---------------------------------------------------------------------------------
## ::: rank (Categorical)
## ---------------------------------------------------------------------------------
##
## rank 0 1
## 1 0.10313901 0.24509804
## 2 0.36771300 0.42156863
## 3 0.33183857 0.24509804
## 4 0.19730942 0.08823529
##
## ---------------------------------------------------------------------------------
Analysis: In the training data, 0.6861538 or 68.6% of the observations belong to admit category '0' (not admitted) and 0.3138462 or 31.4% belong to category '1' (admitted).
For numeric predictors, the model stores the class-conditional mean and standard deviation (a Gaussian assumption), whereas for categorical predictors it stores the probability of each level within each class.
Here, the first row of the rank table,
rank        0          1
1  0.10313901 0.24509804
can be interpreted as P(rank = 1 | admit = 0) = 0.103 and P(rank = 1 | admit = 1) = 0.245, and so on.
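These quantities can also be pulled out of the fitted object directly; in naivebayes 0.9.x the prior and the conditional tables are stored on the model object (a quick sketch):
model$prior # a priori class probabilities
model$tables$rank # P(rank | admit), the table interpreted above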
p <- predict(model, train, type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
head(cbind(p, train))
## 0 1 admit gre gpa rank
## 1 0.8449088 0.1550912 0 380 3.61 3
## 2 0.6214983 0.3785017 1 660 3.67 3
## 3 0.2082304 0.7917696 1 800 4.00 1
## 4 0.8501030 0.1498970 1 640 3.19 4
## 5 0.6917580 0.3082420 1 760 3.00 2
## 6 0.6720365 0.3279635 1 560 2.98 1
Here, the probability that the first student is unsuccessful in getting admitted is 0.8449088, about 84%. We can also see that this student's gre and gpa scores were low and that the student came from a lower-ranked school. The second student, by contrast, was actually admitted (admit = 1) even though the model assigns only a 0.38 probability to admission, so this case would be misclassified at a 0.5 cutoff.
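That first row can be reproduced by hand from the Gaussian and categorical tables printed above. A minimal sketch for student 1 (gre = 380, gpa = 3.61, rank = 3), plugging in the reported means, standard deviations, and rank probabilities:
# class-conditional likelihoods under the Gaussian/categorical assumptions
like_0 <- dnorm(380, 578.6547, 116.3250) * dnorm(3.61, 3.3552466, 0.3714542) * 0.33183857
like_1 <- dnorm(380, 622.9412, 110.9240) * dnorm(3.61, 3.5336275, 0.3457057) * 0.24509804
post <- c(0.6861538, 0.3138462) * c(like_0, like_1) # multiply by the priors
post / sum(post) # normalise: about (0.845, 0.155), matching row 1 above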
A confusion matrix will give us a better look.
#Confusion matrix for train data
p1 <- predict(model, train)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
(tab1 <- table(p1, train$admit))
##
## p1 0 1
## 0 196 69
## 1 27 33
From the above table we can see that there are some instances of misclassification. Let's calculate the misclassification rate:
1 - sum(diag(tab1))/sum(tab1)
## [1] 0.2953846
The misclassification rate is about 30%.
Now let's look at the test data:
p2 <- predict(model, test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
(tab1 <- table(p2, test$admit))
##
## p2 0 1
## 0 47 21
## 1 3 4
1 - sum(diag(tab1))/sum(tab1)
## [1] 0.32
Here, the misclassification rate is about 32%.
Next, let's check whether the numeric independent variables are normally distributed; we will use the Shapiro-Wilk normality test.
shapiro.test(data$gre)
##
## Shapiro-Wilk normality test
##
## data: data$gre
## W = 0.9859, p-value = 0.0006268
shapiro.test(data$gpa)
##
## Shapiro-Wilk normality test
##
## data: data$gpa
## W = 0.97736, p-value = 6.68e-06
Both p-values are below the 0.05 threshold, hence the null hypothesis of normality is rejected and we conclude that the data are not normally distributed.
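As a quick visual complement to the test, Q-Q plots make the departure from normality visible (a small sketch using base graphics; points drifting away from the line indicate non-normal tails):
par(mfrow = c(1, 2))
qqnorm(data$gre, main = "Q-Q plot: gre"); qqline(data$gre)
qqnorm(data$gpa, main = "Q-Q plot: gpa"); qqline(data$gpa)
par(mfrow = c(1, 1))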
Hence, to improve model performance, we can replace the Gaussian assumption with kernel density estimation:
model2 <- naive_bayes(admit ~., data = train, usekernel = T)
model2
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = admit ~ ., data = train, usekernel = T)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## 0 1
## 0.6861538 0.3138462
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: gre::0 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (223 obs.); Bandwidth 'bw' = 35.5
##
## x y
## Min. :193.5 Min. :6.010e-07
## 1st Qu.:371.7 1st Qu.:2.924e-04
## Median :550.0 Median :1.291e-03
## Mean :550.0 Mean :1.401e-03
## 3rd Qu.:728.3 3rd Qu.:2.405e-03
## Max. :906.5 Max. :3.199e-03
##
## ---------------------------------------------------------------------------------
## ::: gre::1 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (102 obs.); Bandwidth 'bw' = 39.59
##
## x y
## Min. :181.2 Min. :1.145e-06
## 1st Qu.:365.6 1st Qu.:2.007e-04
## Median :550.0 Median :1.129e-03
## Mean :550.0 Mean :1.354e-03
## 3rd Qu.:734.4 3rd Qu.:2.375e-03
## Max. :918.8 Max. :3.465e-03
##
## ---------------------------------------------------------------------------------
## ::: gpa::0 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (223 obs.); Bandwidth 'bw' = 0.1134
##
## x y
## Min. :2.080 Min. :0.0002229
## 1st Qu.:2.645 1st Qu.:0.0924939
## Median :3.210 Median :0.4521795
## Mean :3.210 Mean :0.4419689
## 3rd Qu.:3.775 3rd Qu.:0.6603271
## Max. :4.340 Max. :1.1433285
##
## ---------------------------------------------------------------------------------
## ::: gpa::1 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (102 obs.); Bandwidth 'bw' = 0.1234
##
## x y
## Min. :2.25 Min. :0.0005231
## 1st Qu.:2.78 1st Qu.:0.0800747
## Median :3.31 Median :0.4801891
## Mean :3.31 Mean :0.4710851
## 3rd Qu.:3.84 3rd Qu.:0.8626207
## Max. :4.37 Max. :1.0595464
##
## ---------------------------------------------------------------------------------
## ::: rank (Categorical)
## ---------------------------------------------------------------------------------
##
## rank 0 1
## 1 0.10313901 0.24509804
## 2 0.36771300 0.42156863
## 3 0.33183857 0.24509804
## 4 0.19730942 0.08823529
##
## ---------------------------------------------------------------------------------
p3 <- predict(model2, train)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
(tab3 <- table(p3, train$admit))
##
## p3 0 1
## 0 203 69
## 1 20 33
1- sum(diag(tab3))/sum(tab3)
## [1] 0.2738462
Hence, the training misclassification rate has come down from the earlier 29.5% to 27.4%, a noticeable improvement.
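Note that the kernel model has only been scored on the training data here. For a fair comparison with the 32% test error above, it should also be evaluated on the held-out test set; a minimal sketch (the resulting rate is not part of the original output):
p4 <- predict(model2, test)
(tab4 <- table(p4, test$admit))
1 - sum(diag(tab4))/sum(tab4) # test-set misclassification rate for model2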
Handling Class Imbalance Problem: Improving Predictive Model Performance
Let's look at a proportion table of the admit variable:
(prop <- prop.table(table(data$admit)))
##
## 0 1
## 0.6825 0.3175
barplot(prop, col = rainbow(2), ylim = c(0, 0.7), main = "Class Distribution")
About two-thirds of our data belong to the not-admitted category and one-third to the admitted category, so there is a substantial difference between the amounts of data available for the two classes: a class imbalance.
Whenever we build a predictive model from imbalanced data, the results will be dominated by the majority class: accuracy will be better when predicting class 0 than class 1. But we may be more interested in accurately predicting the students who were admitted than those who were not. The following steps lessen the impact of this class imbalance.
set.seed(111)
#Data Partition
ind_bal <- sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
train1 <- data[ind_bal==1,]
test1 <- data[ind_bal ==2,]
str(train1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 285 obs. of 4 variables:
## $ admit: Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 2 1 1 ...
## $ gre : num 380 800 640 520 760 560 400 540 700 800 ...
## $ gpa : num 3.61 4 3.19 2.93 3 2.98 3.08 3.39 3.92 4 ...
## $ rank : Factor w/ 4 levels "1","2","3","4": 3 1 4 4 2 1 2 3 2 4 ...
Here, the training data set has 285 observations and the test data set has 115.
#Training dataset for predictive models
table(train1$admit)
##
## 0 1
## 193 92
We see the same imbalance that was present in the original data set.
Here, we will use a random forest as the predictive model.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf_train <- randomForest(admit~., data= train1)
#Model evaluation with test data
caret::confusionMatrix(predict(rf_train, test1), test1$admit, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 77 27
## 1 3 8
##
## Accuracy : 0.7391
## 95% CI : (0.649, 0.8166)
## No Information Rate : 0.6957
## P-Value [Acc > NIR] : 0.1815
##
## Kappa : 0.2367
##
## Mcnemar's Test P-Value : 2.679e-05
##
## Sensitivity : 0.22857
## Specificity : 0.96250
## Pos Pred Value : 0.72727
## Neg Pred Value : 0.74038
## Prevalence : 0.30435
## Detection Rate : 0.06957
## Detection Prevalence : 0.09565
## Balanced Accuracy : 0.59554
##
## 'Positive' Class : 1
##
Analysis: Reference is the actual value and Prediction is the model's output. 77 applicants have been correctly classified as not admitted (category '0') and 8 applicants have been correctly identified as admitted (category '1'). The off-diagonal numbers are the misclassifications.
Accuracy is (TP + TN) / total; here it is 73.9%, and the 95% CI gives the plausible range around that value.
The No Information Rate is the proportion of the largest observed class (here, category '0'). If we built no model at all and simply classified every applicant as rejected, i.e., belonging to category '0', we would be right 69.6% of the time.
If we are mostly interested in predicting the positive instances (category '1'), this model does not perform well, since sensitivity is only 0.229, or 22.9%.
Sensitivity is how often the model correctly predicts category '1': TP / (TP + FN).
Specificity is how often the model correctly predicts category '0': TN / (TN + FP). It is very high, about 96.25%.
This huge gap is caused by the class imbalance in the data, which also makes the raw accuracy misleading.
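These definitions can be checked directly against the confusion matrix above. A small sketch recomputing the headline numbers from the four cell counts (TN = 77, FN = 27, FP = 3, TP = 8):
TP <- 8; FN <- 27; TN <- 77; FP <- 3
(TP + TN) / (TP + TN + FP + FN) # accuracy, about 0.7391
TP / (TP + FN) # sensitivity, about 0.2286
TN / (TN + FP) # specificity, 0.9625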
One way to rectify this is oversampling.
Oversampling for better sensitivity
library(ROSE) #Random Over Sampling Examples
## Loaded ROSE 0.0-3
over <- ovun.sample(admit ~.,data = train1, method = "over", N= 386)$data
table(over$admit)
##
## 0 1
## 193 193
summary(over)
## admit gre gpa rank
## 0:193 Min. :220.0 Min. :2.260 1: 90
## 1:193 1st Qu.:520.0 1st Qu.:3.143 2:144
## Median :580.0 Median :3.450 3: 92
## Mean :592.3 Mean :3.435 4: 60
## 3rd Qu.:680.0 3rd Qu.:3.740
## Max. :800.0 Max. :4.000
Here we have oversampled category '1' so that both categories have an equal number of samples (193 each). This works by resampling: some of the data points from class 1 are randomly repeated until the classes are balanced.
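The same effect can be reproduced in base R by sampling the minority-class rows with replacement until the classes match; a sketch equivalent in spirit to ovun.sample above:
minority <- train1[train1$admit == "1", ] # the 92 admitted students
extra <- minority[sample(nrow(minority), 193 - nrow(minority), replace = TRUE), ]
over_manual <- rbind(train1, extra) # now 193 observations in each class
table(over_manual$admit)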
#Building a random forest from the over data
rf_over <- randomForest(admit ~., data = over)
confusionMatrix(predict(rf_over, test1), test1$admit, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 66 19
## 1 14 16
##
## Accuracy : 0.713
## 95% CI : (0.6212, 0.7935)
## No Information Rate : 0.6957
## P-Value [Acc > NIR] : 0.3853
##
## Kappa : 0.294
##
## Mcnemar's Test P-Value : 0.4862
##
## Sensitivity : 0.4571
## Specificity : 0.8250
## Pos Pred Value : 0.5333
## Neg Pred Value : 0.7765
## Prevalence : 0.3043
## Detection Rate : 0.1391
## Detection Prevalence : 0.2609
## Balanced Accuracy : 0.6411
##
## 'Positive' Class : 1
##
Analysis: Overall accuracy has dropped slightly to 71.3%, compared with 73.91% for the first model. Sensitivity has improved to 45.7% (from 22.9%) while specificity has come down to 82.5%. If our aim is to predict the successful applicants, the oversampled model is the better one.
Let's try undersampling:
table(train1$admit)
##
## 0 1
## 193 92
under <- ovun.sample(admit ~., data = train1, method = "under", N= 184)$data
table(under$admit)
##
## 0 1
## 92 92
Category '0' has the larger number of instances, so we undersample it, keeping only enough observations for a balanced sample of 184 instances in total (92 * 2).
The classes are now balanced, but there is information loss, since we have discarded a substantial number of category '0' instances.
Let's build a model on the undersampled data:
under_rf <- randomForest(admit ~., data = under)
confusionMatrix(predict(under_rf, test1), test1$admit, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 50 13
## 1 30 22
##
## Accuracy : 0.6261
## 95% CI : (0.531, 0.7145)
## No Information Rate : 0.6957
## P-Value [Acc > NIR] : 0.95549
##
## Kappa : 0.2231
##
## Mcnemar's Test P-Value : 0.01469
##
## Sensitivity : 0.6286
## Specificity : 0.6250
## Pos Pred Value : 0.4231
## Neg Pred Value : 0.7937
## Prevalence : 0.3043
## Detection Rate : 0.1913
## Detection Prevalence : 0.4522
## Balanced Accuracy : 0.6268
##
## 'Positive' Class : 1
##
Analysis: Sensitivity has improved further to 62.9%, whereas specificity, i.e., prediction of category '0', has dropped to 62.5% compared with the oversampled model.
Finally, let's try under- and oversampling together:
both <- ovun.sample(admit ~., data = train1, method = "both", p =0.5, seed = 222, N = 285)$data
table(both$admit)
##
## 0 1
## 134 151
both_rf <- randomForest(admit~., data = both)
confusionMatrix(predict(both_rf, test1), test1$admit, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 51 13
## 1 29 22
##
## Accuracy : 0.6348
## 95% CI : (0.5399, 0.7226)
## No Information Rate : 0.6957
## P-Value [Acc > NIR] : 0.93385
##
## Kappa : 0.2358
##
## Mcnemar's Test P-Value : 0.02064
##
## Sensitivity : 0.6286
## Specificity : 0.6375
## Pos Pred Value : 0.4314
## Neg Pred Value : 0.7969
## Prevalence : 0.3043
## Detection Rate : 0.1913
## Detection Prevalence : 0.4435
## Balanced Accuracy : 0.6330
##
## 'Positive' Class : 1
##
Analysis: Sensitivity is 62.9%, the same as the undersampled model, with slightly better accuracy and specificity. Hence this can be a good model for predicting category '1'.
#Synthetic data with ROSE
rose <- ROSE(admit~., data = train1, N= 500)$data
table(rose$admit)
##
## 0 1
## 247 253
summary(rose)
## admit gre gpa rank
## 0:247 Min. :113.1 Min. :2.241 1:114
## 1:253 1st Qu.:514.0 1st Qu.:3.143 2:184
## Median :602.1 Median :3.435 3:123
## Mean :599.6 Mean :3.423 4: 79
## 3rd Qu.:684.0 3rd Qu.:3.723
## Max. :920.3 Max. :4.526
The maximum gre in the original data was 800 and the maximum gpa was 4, but since ROSE generates synthetic data, both now exceed those limits (and the minimum gre, 113.1, is below the observed minimum of 220). A gpa greater than 4 is impossible, so we can add constraints so that the synthetic values do not go beyond their valid ranges.
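One simple way to enforce these constraints is to clip the synthetic values back to their valid ranges (a sketch; it is not applied before the model below, so the printed results come from the unclipped data):
# clip synthetic values to plausible ranges (old GRE scale: 200-800; GPA <= 4)
rose$gre <- pmin(pmax(rose$gre, 200), 800)
rose$gpa <- pmin(rose$gpa, 4)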
Let's build a random forest on the rose data:
rose_rf <- randomForest(admit~., data = rose)
confusionMatrix(predict(rose_rf, test1), test1$admit, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 48 14
## 1 32 21
##
## Accuracy : 0.6
## 95% CI : (0.5045, 0.6902)
## No Information Rate : 0.6957
## P-Value [Acc > NIR] : 0.98878
##
## Kappa : 0.1747
##
## Mcnemar's Test P-Value : 0.01219
##
## Sensitivity : 0.6000
## Specificity : 0.6000
## Pos Pred Value : 0.3962
## Neg Pred Value : 0.7742
## Prevalence : 0.3043
## Detection Rate : 0.1826
## Detection Prevalence : 0.4609
## Balanced Accuracy : 0.6000
##
## 'Positive' Class : 1
##
Analysis: Sensitivity is 60%, with specificity and overall accuracy also at 60%; on this test set, the synthetic ROSE sample does not improve on the combined over/undersampling model.