Introduction

The data used for Lab Session 3: Introductory Naive Bayesian models is the Breast Cancer Wisconsin (Original) dataset from the UCI Machine Learning Repository (the URL appears in the download code below).

The data has 11 attributes: a sample ID, nine independent variables, and one dependent class variable (initially coded as an integer). All variable values are integers. The attributes briefly are as follows:

#    Variable Name                  Values
1.   Sample code number             id number
2.   Clump Thickness                1 - 10
3.   Uniformity of Cell Size        1 - 10
4.   Uniformity of Cell Shape       1 - 10
5.   Marginal Adhesion              1 - 10
6.   Single Epithelial Cell Size    1 - 10
7.   Bare Nuclei                    1 - 10
8.   Bland Chromatin                1 - 10
9.   Normal Nucleoli                1 - 10
10.  Mitoses                        1 - 10
11.  Class                          2 = benign, 4 = malignant

File I/O

The dataset was fetched via curl, parsed, and saved to the data frame df. The columns were then named per the UCI data dictionary and the ID column was dropped.

fileURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
download.file(fileURL, destfile="breast-cancer-wisconsin.data", method="curl")
df <- read.table("breast-cancer-wisconsin.data", na.strings = "?", sep=",")
# Columns are named appropriately and the ID column is trimmed.
df <- df[ , -1]
ds <- df
names(ds) <- c("ClumpThickness",
               "UniformityCellSize",
               "UniformityCellShape",
               "MarginalAdhesion",
               "SingleEpithelialCellSize",
               "BareNuclei",
               "BlandChromatin",
               "NormalNucleoli",
               "Mitoses",
               "Class")

Data Exploration & Preparation

The target variable is then recoded into binary labels (0 = benign, 1 = malignant) and converted into a factor.

ds$Class <- factor(ds$Class, levels = c(2,4), labels = c("0", "1"))

The class distribution for the target variable shows a fortunate majority of benign outcomes:

prop.table(table(ds$Class))
## 
##         0         1 
## 0.6552217 0.3447783
library(ggplot2)
library(viridis)

ggplot(ds, aes(Class)) +
  geom_bar(aes(fill = Class)) +
  scale_x_discrete("Benign or Malignant", labels = waiver()) +
  scale_fill_viridis(discrete = TRUE)

Distribution & Normality

We first take a look at the distribution of the independent variables with a density plot visualization:

library(DataExplorer)

plot_density(ds)

At first glance, most of the variables are right-skewed and thus not Gaussian in nature. The Shapiro-Wilk test was then used as a formal test of normality (applicable here, as R's shapiro.test supports sample sizes between 3 and 5000, and this dataset has 699 observations).

shapiro.test(ds$ClumpThickness)
## 
##  Shapiro-Wilk normality test
## 
## data:  ds$ClumpThickness
## W = 0.9022, p-value < 2.2e-16

Including the example above, every independent variable yielded a p-value < 0.05, so the null hypothesis of normality is rejected for all of them.
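
The same test can be applied to all nine predictors in one pass; a minimal sketch (shapiro.test tolerates the NAs still present in BareNuclei, as only the non-missing values are tested):

# Apply the Shapiro-Wilk test to every predictor and collect the p-values.
sapply(ds[ , 1:9], function(x) shapiro.test(x)$p.value)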

Correlation

library(ggcorrplot)

corrTable <- cor(ds[, 1:9])
ggcorrplot(corrTable,
           outline.col = "white",
           ggtheme = ggplot2::theme_gray,
           lab = TRUE
          )

Correlation analysis among the independent variables showed very high correlation between UniformityCellShape and UniformityCellSize. Generally, both variables exhibited correlation values > 0.6 with the other independent variables, save for Mitoses.
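
Note that BareNuclei still contains missing values at this stage, so the default cor() call returns NA for every pair involving it. A sketch of how those coefficients could be recovered using pairwise-complete observations:

# cor() defaults to use = "everything", so any NA in BareNuclei propagates
# into its entire row and column of the correlation matrix.
corrTable_pw <- cor(ds[ , 1:9], use = "pairwise.complete.obs")
round(corrTable_pw["BareNuclei", ], 2)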

Missing Data Imputation

plot_missing(ds)

table(is.na(ds$BareNuclei))
## 
## FALSE  TRUE 
##   683    16

Missing data checks show that 16 observations (2.29%) of the variable “BareNuclei” are missing. While it is tempting and expedient to ignore these values entirely and proceed, we shall conduct imputation for academic purposes.

Imputation Methods

The most immediate option was simple mean imputation, but that method has several well-documented drawbacks, chiefly that it introduces bias into multivariate estimates (correlation and regression coefficients) and deflates the variance and standard errors of the imputed variable.
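
For contrast, a sketch of what the rejected approach would look like; it replaces every gap with the same constant, which is exactly what deflates the variance of BareNuclei and distorts its correlations:

# Simple mean imputation, shown for illustration only.
mean_imputed <- ds
mean_imputed$BareNuclei[is.na(mean_imputed$BareNuclei)] <-
  mean(ds$BareNuclei, na.rm = TRUE)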

Thus an alternative approach, Predictive Mean Matching (PMM), was implemented via the mice package.

library(mice)

mice_imputes <- mice(ds, m=5, method = "pmm")

The code executes m = 5 imputations, creating 5 candidate imputed values for each missing BareNuclei entry:

mice_imputes$imp["BareNuclei"]
## $BareNuclei
##     1 2 3  4 5
## 24  5 5 2 10 8
## 41  5 5 4  4 1
## 140 5 1 2  1 1
## 146 1 5 1  1 1
## 159 1 1 1  1 2
## 165 3 2 1  1 1
## 236 2 1 1  1 2
## 250 1 1 1  1 1
## 276 1 4 1  1 5
## 293 3 2 1  6 5
## 295 1 1 1  1 1
## 298 1 1 1  3 1
## 316 1 1 2  1 1
## 322 1 3 1  1 1
## 412 1 3 1  1 1
## 618 1 1 1  1 1

For the purpose of this exercise, we impute the missing values from the fifth dataset created above:

imputed_ds <- complete(mice_imputes,5)
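
A quick sanity check that the completed dataset contains no remaining missing values:

# Should return 0 after imputation.
sum(is.na(imputed_ds))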

Feature Scaling Check

For Naive Bayes, feature scaling is effectively handled within the algorithm by design: each feature’s class-conditional distribution is estimated independently, so scaling does not need to be done manually in this exercise.

Furthermore, the values of every predictor lie on the same 1 - 10 scale, so the features already share a common range.
Note that methods that rely on Euclidean distances, e.g. k-NN and k-means, are sensitive to feature scale and do require normalization.
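
A quick confirmation of that common range; a sketch:

# Each column gives the min and max of one predictor; all should lie in 1 - 10.
sapply(imputed_ds[ , 1:9], range)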

Data Sampling

To ensure the class distribution of the training dataset matches that of the full sample, stratified sampling was used. To do this, we use the caret package, specifically the createDataPartition function, to conduct random sampling while preserving the class distribution of the target variable.

Specifically, the function creates an index of observations whose selection reflects the initial distribution of the chosen variable, in this case Class.

library(caret)

set.seed(1234)
trainIndex <- createDataPartition(imputed_ds$Class, p = .7, list = FALSE, times = 1)

# Create Train and Test Sets:
ds_Train <- imputed_ds[trainIndex,]
ds_Test <- imputed_ds[-trainIndex,]

The class distributions were checked after the sampling was executed and compared to the initial distribution:

 prop.table(table(ds_Test$Class))
## 
##         0         1 
## 0.6555024 0.3444976
 prop.table(table(ds_Train$Class))
## 
##        0        1 
## 0.655102 0.344898
 prop.table(table(imputed_ds$Class))
## 
##         0         1 
## 0.6552217 0.3447783

The distribution of the target variable in the training and test datasets is similar to that of the initial dataset. Stratified sampling has thus produced more representative training and test sets.


Building the model

The naivebayes package has multiple implementations of the Naive Bayes classifier. Here we explored two variations: the standard naive_bayes and the specialized gaussian_naive_bayes.

Gaussian Naive Bayes

The Gaussian Naive Bayes model assumes that all class-conditional distributions are Gaussian and independent, and requires the predictors to be supplied as a numeric matrix. The docs state that this implementation is more efficient at the cost of flexibility.

library(naivebayes)

ds_Train_matrix <- as.matrix(ds_Train[ , -10])

classifier <- gaussian_naive_bayes(x = ds_Train_matrix, 
                                   y = ds_Train$Class)

The classifier was then used to generate predictions on the training data, and its accuracy was computed:

y_pred_train <- predict(classifier, newdata = ds_Train_matrix)


cm = table(ds_Train$Class, y_pred_train)
(
accuracy = sum(diag(cm))/sum(cm)
)
## [1] 0.955102
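
Since this confusion-matrix accuracy computation recurs throughout the exercise, it could be wrapped in a small helper (cm_accuracy is a hypothetical name, not part of any package):

# Hypothetical convenience wrapper around the accuracy computation above.
cm_accuracy <- function(actual, predicted) {
  cm <- table(actual, predicted)
  sum(diag(cm)) / sum(cm)
}
cm_accuracy(ds_Train$Class, y_pred_train)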

Note that the accuracy metric is higher than in Prof. Mandava’s example, which used completely random sampling and the e1071 package’s Naive Bayes implementation and achieved an accuracy of 0.9525773.

I believe the comparison to be valid as the set.seed value used was the same, but I could be mistaken.

An explanation for the increase in accuracy could be the more representative split obtained through stratified sampling.

Note the model’s assumption that all the predictor variables are Gaussian and independent. As seen from the density and correlation plots above, neither assumption holds particularly well here.

Standard Naive Bayes

Out of curiosity, the package’s standard Naive Bayes classifier (which is less efficient but more flexible) was also used.

classifier2 <- naive_bayes(x = ds_Train[ ,-10], y = ds_Train$Class)

y_pred_train2 <- predict(classifier2, newdata = ds_Train[ , -10])

cm2 = table(ds_Train$Class, y_pred_train2)
(
accuracy2 = sum(diag(cm2))/sum(cm2)
)
## [1] 0.955102

There is no difference in accuracy between the two implementations. This is expected: with numeric predictors and the default usekernel = FALSE, naive_bayes also fits Gaussian densities, making the two models equivalent.
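
A quick check of this equivalence; a sketch (assuming the equivalence holds, the two prediction vectors should agree element-for-element):

# naive_bayes fits Gaussian densities to numeric predictors by default,
# so its predictions should match those of gaussian_naive_bayes.
all(as.character(y_pred_train) == as.character(y_pred_train2))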

Validating the Classifier

The classifier is validated using the test data:

ds_Test_matrix <- as.matrix(ds_Test[ , -10])
y_pred_test <- predict(classifier, newdata = ds_Test_matrix)

cm3 = table(ds_Test$Class, y_pred_test)

(
accuracy3 = sum(diag(cm3))/sum(cm3)
)
## [1] 0.9617225

Variations

Laplacian Smoothing

Laplacian smoothing is applied to prevent issues with feature values unseen in the training data (and, as a side effect, to regularize the probability estimates): because Naive Bayes multiplies the class-conditional probabilities together, a single zero probability would drive the entire posterior product to zero. Laplacian smoothing adds 1 (a pseudocount) to every count so that no estimated probability is ever exactly zero.
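
A toy illustration with hypothetical level counts of one categorical feature within one class: without smoothing, level a would receive probability zero; with laplace = 1, every level gets a small positive probability:

# Hypothetical counts, for illustration only.
counts  <- c(a = 0, b = 3, c = 7)
laplace <- 1
# Smoothed estimate: (count + laplace) / (n + laplace * number_of_levels)
(counts + laplace) / (sum(counts) + laplace * length(counts))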

The previous code was modified slightly to include the argument for laplace.

classifier_lap <- naive_bayes(x = ds_Train[ ,-10], y = ds_Train$Class, laplace = 1)

y_pred_train_lap <- predict(classifier_lap, newdata = ds_Train[ ,-10] )
cm_lap = table(ds_Train$Class, y_pred_train_lap)
(
accuracy_lap = sum(diag(cm_lap))/sum(cm_lap)
)
## [1] 0.955102
y_pred_test_lap <- predict(classifier_lap, newdata = ds_Test[ , -10])
cm_test_lap = table(ds_Test$Class, y_pred_test_lap)
(
accuracy_test_lap = sum(diag(cm_test_lap))/sum(cm_test_lap)
)
## [1] 0.9617225

There is no difference in accuracy after applying Laplacian smoothing. This is because the technique smooths the counts of categorical features; since all the predictors here are numeric and modeled with continuous densities, the laplace argument has no effect.

Kernel Selection

According to the docs, kernel density estimation can be used to obtain a non-parametric estimate of each class-conditional probability density, as opposed to the Gaussian estimate, which assumes the underlying data is normally distributed.

classifier_kde <- naive_bayes(x = ds_Train[ ,-10], y = ds_Train$Class, usekernel = TRUE)

y_pred_train_kde <- predict(classifier_kde, newdata = ds_Train[ ,-10] )
cm_kde = table(ds_Train$Class, y_pred_train_kde)
(
accuracy_kde = sum(diag(cm_kde))/sum(cm_kde)
)
## [1] 0.9571429
y_pred_test_kde <- predict(classifier_kde, newdata = ds_Test[ , -10])
cm_test_kde = table(ds_Test$Class, y_pred_test_kde)
(
accuracy_test_kde = sum(diag(cm_test_kde))/sum(cm_test_kde)
)
## [1] 0.9808612

The obtained accuracy from using the KDE is higher for both the training and test datasets. The non-parametric density estimates adapt to the skewed, non-Gaussian distributions of the predictors, hence the better accuracy.
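
The fitted class-conditional densities can also be inspected visually via the plot method provided by the naivebayes package; a sketch, assuming the which argument accepts a variable name:

# Plot the kernel-estimated class-conditional densities for one predictor.
plot(classifier_kde, which = "ClumpThickness")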

Comparison of Results

Method                               Training    Test
Simple Naive Bayes                   0.955102    0.9617225
Naive Bayes with Laplace Smoothing   0.955102    0.9617225
Naive Bayes with Kernel Selection    0.9571429   0.9808612

The implementation using kernel density estimation exhibits the highest accuracy scores among the three methods tested, because its non-parametric density estimates match the non-Gaussian distributions of the predictors. Laplacian smoothing has no effect because it applies only to categorical features, whereas all of the predictor variables here are numeric.

Therefore, it is important to choose a variation of the predictive model that matches the nature of the given dataset.

Questions

  1. The correlation plot reveals that the variable BareNuclei produces no correlation values (NA) with the other variables. Why?
  2. For imputation using Predictive Mean Matching, which of the m imputed datasets should be used to fill in the missing values?
  3. With regard to naming conventions, when is it acceptable to name variables with a “.”, e.g. lm.fit, as opposed to lm_fit, etc.?
  4. Is it necessary to know the raw probabilities of the predictions? (A sketch follows below.)
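
Regarding question 4, the naivebayes predict methods can return posterior probabilities instead of class labels; a sketch:

# type = "prob" returns each class's posterior probability per observation.
head(predict(classifier, newdata = ds_Test_matrix, type = "prob"))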
 

A work by Arnold Cheong

tp057228@mail.apu.edu.my