We will use the DescTools, caret and e1071 packages. If you do not already have the e1071 package installed, install it first using the install.packages() function.
install.packages("e1071")
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(e1071)
In this lesson, we use the DirectMarketing.csv file, which contains information on 1000 customers of a business that sells products by mail-order catalog. The marketing manager of the company wants to use historical data about the customers (demographic and past marketing information) to predict whether a customer will spend (Amount) at or below average (Below_Avg) or above average (Above_Avg). The company prioritizes correctly predicting whether a customer will purchase more than the average customer, because it has a limited advertising budget and needs to make strategic marketing decisions.
Variables include:
Age: whether the customer is young, middle-aged, or old (Young, Middle, Old)
Gender: whether the customer is male (Male) or female (Female)
OwnHome: whether the customer owns (Own) or rents (Rent) their home
Married: whether the customer is married (Married) or single (Single)
Location: whether the customer lives close to a store (Close) or not (Far)
Salary: the customer's yearly salary
Children: the number of children the customer has
Catalogs: the number of catalogs sent to the customer
Amount: whether the amount spent is at or below average (Below_Avg) or above average (Above_Avg)
We use the read.csv() function to import the CSV file into R as a dataframe named DM. We set stringsAsFactors = FALSE to keep any character columns as-is.
DM <- read.csv(file = "DirectMarketing.csv",
stringsAsFactors = FALSE)
First, we can obtain high-level information about the DM dataframe to look at the variable types and to check for missing (NA) values.
Abstract(DM)
## ------------------------------------------------------------------------------
## DM
##
## data frame: 1000 obs. of 9 variables
## 1000 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 Age character .
## 2 Gender character .
## 3 OwnHome character .
## 4 Married character .
## 5 Location character .
## 6 Salary integer .
## 7 Children integer .
## 8 Catalogs integer .
## 9 Amount character .
Next, we can convert our target class variable, Amount (the class we want to predict), to a nominal factor variable. Note: although it is ordinal, we will use it as a nominal variable for compatibility. We will identify level 1 as Below_Avg and level 2 as Above_Avg to reflect the known ordering.
DM$Amount <- factor(x = DM$Amount,
levels = c("Below_Avg", "Above_Avg"))
We can plot the distribution using a bar plot, which is the default plot for the plot() function when applied to factor variables.
plot(x = DM$Amount,
main = "Amount Spent")
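If we also want the underlying counts behind the bar plot, a quick check is to tabulate the factor directly:
table(DM$Amount)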
First, we can convert our nominal predictor variables to factor variables. We will set up a convenience vector of the names of the nominal variables (noms) so that we can more easily refer to them.
noms <- c("Gender", "OwnHome", "Married", "Location")
DM[ ,noms] <- lapply(X = DM[ ,noms],
FUN = factor)
Next, we can convert our ordinal variables. The Age variable takes on three levels, which we will need to specify with the correct ordering. Since the Children and Catalogs variables take on numeric values, we can use the default ordering and convert them at the same time. We will set up a convenience vector of the names of the ordinal variables (ords) so that we can more easily refer to them.
ords <- c("Age", "Children", "Catalogs")
DM$Age <- factor(x = DM$Age,
levels = c("Young", "Middle", "Old"),
ordered = TRUE)
DM[,c("Children", "Catalogs")] <- lapply(X = DM[,c("Children", "Catalogs")],
FUN = factor,
ordered = TRUE)
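As a quick sanity check, we can confirm the conversions with the str() function, which should now show factor columns for the nominal variables and ordered factor columns for the ordinal variables:
str(DM[ ,c(noms, ords)])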
We only have one numerical variable, Salary. For consistency, we can set up a corresponding convenience vector, nums (although with a single variable we do not strictly need to).
nums <- c("Salary")
Finally, we can create our convenience vector of predictor variables (vars) containing all of our predictors. We can view the predictor names by printing the vars object.
vars <- c(noms, ords, nums)
vars
## [1] "Gender" "OwnHome" "Married" "Location" "Age" "Children" "Catalogs"
## [8] "Salary"
First, we can view summary() information for our prepared data.
summary(DM)
## Age Gender OwnHome Married Location
## Young :287 Female:506 Own :516 Married:502 Close:710
## Middle:508 Male :494 Rent:484 Single :498 Far :290
## Old :205
##
##
##
## Salary Children Catalogs Amount
## Min. : 10100 0:462 6 :252 Below_Avg:601
## 1st Qu.: 29975 1:267 12:282 Above_Avg:399
## Median : 53700 2:146 18:233
## Mean : 56104 3:125 24:233
## 3rd Qu.: 77025
## Max. :168800
For Naive Bayes classification, we know the following: it is okay to keep irrelevant features; missing values can be removed, imputed or ignored during analysis; categorical variables can be incorporated as factor variables; and redundant numeric features need to be removed. We also would like our numeric variables to be approximately normally distributed.
Missing Values Our data has no missing values, but if it did, we could remove them (na.omit()) or perform imputation (or we could handle them during analysis). We can visualize any missing values using the PlotMiss() function from the DescTools package.
PlotMiss(DM)
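A simple numeric alternative is to count the NA values per column; given the Abstract() output above, every entry should be zero:
colSums(is.na(DM))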
Redundant Variables We need to identify highly correlated numeric predictor (X) variables and exclude them from our predictive model. Since we only have one numeric variable, we do not need to worry about redundancy.
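If we did have multiple numeric predictors, one common approach pairs cor() with the findCorrelation() function from caret. Below is a minimal sketch using the built-in mtcars data (since our own data has only one numeric variable); the 0.75 cutoff is an illustrative choice:
corrs <- cor(mtcars)                     # pairwise correlation matrix
high_corr <- findCorrelation(x = corrs,  # correlation matrix
                             cutoff = 0.75) # flag highly correlated columns
colnames(mtcars)[high_corr]              # candidate columns to drop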
Normally Distributed Numeric Variables By assumption, Naive Bayes expects numeric variables to be normally distributed. We can visually inspect the distribution of our numeric variable, Salary, using a histogram.
hist(DM$Salary, main = "Salary")
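We can also quantify the asymmetry with the Skew() function from the DescTools package; a positive value indicates right skew:
Skew(x = DM$Salary)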
We can transform the Salary variable so that it is approximately normal. First, we look at the values of the Salary variable to determine which transformation to use: if they are strictly positive, we can use a Box-Cox transformation; if not, we can use a Yeo-Johnson transformation.
summary(DM$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10100 29975 53700 56104 77025 168800
Since we have strictly positive values, we will use a Box-Cox transformation and then standardize the Salary variable.
cen_bcs <- preProcess(x = DM[ ,vars],
method = c("BoxCox", "center", "scale"))
DM_bcs <- predict(object = cen_bcs,
newdata = DM)
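To confirm the transformation had the intended effect, we can re-plot the transformed variable, which should now look roughly symmetric and centered at 0:
hist(DM_bcs$Salary, main = "Transformed Salary")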
We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (Amount). Since the function takes a random sample, we initialize a random seed first for reproducibility.
set.seed(831) # initialize the random seed
# Generate the list of observations for the
# train dataframe
sub <- createDataPartition(y = DM_bcs$Amount,
p = 0.80,
list = FALSE)
We subset the rows of the DM_bcs dataframe to the row numbers in the sub object to create the train dataframe, and use all observations not in the sub object to create the test dataframe.
train <- DM_bcs[sub, ]
test <- DM_bcs[-sub, ]
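To verify that createDataPartition() preserved the distribution of the outcome variable, we can compare the class proportions in the two sets; they should be close to each other (and to the full data):
round(prop.table(table(train$Amount)), 3)
round(prop.table(table(test$Amount)), 3)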
We use the naiveBayes() function from the e1071 package to perform Naive Bayes classification.
If we have NA values that we did not handle during preprocessing, we can use the na.action argument. It defaults to na.pass, meaning the observations are kept and the NAs are simply not included when calculating probabilities. Alternatively, na.action = na.omit can be specified, in which case observations with NAs are excluded from modeling.
The default laplace value is 0, meaning Laplace smoothing is not applied. To determine whether we need Laplace smoothing, we look for categories with zero frequency within a class.
aggregate(train[ ,c(noms,ords)],
by = list(train$Amount),
FUN = table)
## Group.1 Gender.Female Gender.Male OwnHome.Own OwnHome.Rent Married.Married
## 1 Below_Avg 281 200 186 295 152
## 2 Above_Avg 124 196 229 91 248
## Married.Single Location.Close Location.Far Age.Young Age.Middle Age.Old
## 1 329 367 114 206 190 85
## 2 72 195 125 22 218 80
## Children.0 Children.1 Children.2 Children.3 Catalogs.6 Catalogs.12
## 1 187 121 95 78 176 154
## 2 182 88 25 25 27 67
## Catalogs.18 Catalogs.24
## 1 87 64
## 2 104 122
None of the category counts above are exactly zero, but several are small; to guard against zero-probability categories (which would zero out the entire posterior product), we will set laplace = 1.
nb_mod <- naiveBayes(x = train[ ,vars],
y = train$Amount,
laplace = 1)
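If we want to inspect what the model learned, the fitted naiveBayes object stores the class counts used for the prior (apriori) and the per-variable conditional distributions (tables). For example:
nb_mod$apriori          # class counts underlying the prior probabilities
nb_mod$tables$Location  # conditional probabilities of Location given Amount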
To assess the goodness of fit of the model, we compare training and testing performance. To do so, we first need to assess how well the model does on the training sample.
First, we use the predict() function to obtain the class predictions (type = "class") for the training data (train) based on our NB model. This creates a vector of class predictions.
nb.train <- predict(object = nb_mod, # NB model
newdata = train[ ,vars], # predictors
type = "class")
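If we instead want the posterior probabilities behind these class labels, we can set type = "raw"; the object name nb.train.prob below is just illustrative:
nb.train.prob <- predict(object = nb_mod,        # NB model
                         newdata = train[ ,vars], # predictors
                         type = "raw")            # posterior probabilities
head(nb.train.prob) # one row per observation, one column per class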
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and performance measures for our model applied to the training dataset (train). We can set mode = "everything" to obtain all available performance measures. We identify the Above_Avg class as positive, since that is the class we are more interested in being able to predict. We will save the result so that we can make comparisons later.
train_conf <- confusionMatrix(data = nb.train, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
train_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below_Avg Above_Avg
## Below_Avg 391 55
## Above_Avg 90 265
##
## Accuracy : 0.819
## 95% CI : (0.7905, 0.8451)
## No Information Rate : 0.6005
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6295
##
## Mcnemar's Test P-Value : 0.00475
##
## Sensitivity : 0.8281
## Specificity : 0.8129
## Pos Pred Value : 0.7465
## Neg Pred Value : 0.8767
## Precision : 0.7465
## Recall : 0.8281
## F1 : 0.7852
## Prevalence : 0.3995
## Detection Rate : 0.3308
## Detection Prevalence : 0.4432
## Balanced Accuracy : 0.8205
##
## 'Positive' Class : Above_Avg
##
To assess model performance, we focus on the performance of the model applied to the testing set. Next, we use the predict() function to obtain the class predictions (type = "class") for the testing data based on our NB model.
nb.test <- predict(object = nb_mod, # NB model
newdata = test[ ,vars], # predictors
type = "class")
Again, we use the confusionMatrix() function from the caret package to obtain a confusion matrix and performance measures for our model, this time applied to the testing dataset (test).
test_conf <- confusionMatrix(data = nb.test,
reference = test$Amount,
positive = "Above_Avg",
mode = "everything")
test_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below_Avg Above_Avg
## Below_Avg 99 11
## Above_Avg 21 68
##
## Accuracy : 0.8392
## 95% CI : (0.7806, 0.8873)
## No Information Rate : 0.603
## P-Value [Acc > NIR] : 4.168e-13
##
## Kappa : 0.6712
##
## Mcnemar's Test P-Value : 0.1116
##
## Sensitivity : 0.8608
## Specificity : 0.8250
## Pos Pred Value : 0.7640
## Neg Pred Value : 0.9000
## Precision : 0.7640
## Recall : 0.8608
## F1 : 0.8095
## Prevalence : 0.3970
## Detection Rate : 0.3417
## Detection Prevalence : 0.4472
## Balanced Accuracy : 0.8429
##
## 'Positive' Class : Above_Avg
##
We can describe the overall performance based on our accuracy and kappa values.
test_conf$overall[c("Accuracy", "Kappa")]
## Accuracy Kappa
## 0.8391960 0.6712442
We can describe class-level performance for the different class levels. Note that above, we set positive = "Above_Avg", since we are more interested in predicting customers who will spend a higher-than-average amount.
test_conf$byClass
## Sensitivity Specificity Pos Pred Value
## 0.8607595 0.8250000 0.7640449
## Neg Pred Value Precision Recall
## 0.9000000 0.7640449 0.8607595
## F1 Prevalence Detection Rate
## 0.8095238 0.3969849 0.3417085
## Detection Prevalence Balanced Accuracy
## 0.4472362 0.8428797
To assess the model's goodness of fit, we want to know if the model is balanced, underfitting, or overfitting. We compare the performance on the training and testing sets, using the cbind() function to view our confusionMatrix() output side-by-side.
First, we can compare the overall performance on the training and testing sets.
cbind(Training = train_conf$overall,
Testing = test_conf$overall)
## Training Testing
## Accuracy 8.189763e-01 8.391960e-01
## Kappa 6.294920e-01 6.712442e-01
## AccuracyLower 7.905294e-01 7.806430e-01
## AccuracyUpper 8.450516e-01 8.873350e-01
## AccuracyNull 6.004994e-01 6.030151e-01
## AccuracyPValue 1.528047e-40 4.167886e-13
## McnemarPValue 4.749556e-03 1.116118e-01
Next, we can compare the class level performance on the training and testing sets.
cbind(Training = train_conf$byClass,
Testing = test_conf$byClass)
## Training Testing
## Sensitivity 0.8281250 0.8607595
## Specificity 0.8128898 0.8250000
## Pos Pred Value 0.7464789 0.7640449
## Neg Pred Value 0.8766816 0.9000000
## Precision 0.7464789 0.7640449
## Recall 0.8281250 0.8607595
## F1 0.7851852 0.8095238
## Prevalence 0.3995006 0.3969849
## Detection Rate 0.3308365 0.3417085
## Detection Prevalence 0.4431960 0.4472362
## Balanced Accuracy 0.8205074 0.8428797
As shown, performance is similar on our training and testing sets, so we conclude that the model is balanced (neither underfitting nor overfitting). It is also a high-performing model, with strong overall and class-level performance when applied to the testing set.
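As a final quick check on the fit gap, we can subtract the two accuracy values directly; a small difference supports the conclusion that the model is balanced:
test_conf$overall["Accuracy"] - train_conf$overall["Accuracy"]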