We will use the DescTools, caret and e1071 packages. If you do not already have the e1071 package installed, install it first using the install.packages() function.
install.packages("e1071")
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(e1071)
In this lesson, we use the DirectMarketing.csv file, which contains information on 1000 customers of a business that sells products by mail-order catalog. The marketing manager of the company wants to use historical data about the customers (demographic and past marketing information) to predict whether a customer will spend (Amount) at or below average (Below_Avg) or above average (Above_Avg). The company prioritizes correctly predicting whether a customer will purchase more than the average customer, because it has a limited advertising budget and needs to make strategic marketing decisions.
Variables include:
Age: whether the customer is young, middle-aged, or old (Young, Middle, Old)
Gender: whether the customer is male (Male) or female (Female)
OwnHome: whether the customer owns (Own) or rents (Rent) their home
Married: whether the customer is married (Married) or single (Single)
Location: whether the customer lives close to a store (Close) or not (Far)
Salary: the customer's yearly salary
Children: the number of children the customer has
Catalogs: the number of catalogs sent to the customer
Amount: whether the amount spent is at or below average (Below_Avg) or above average (Above_Avg)
We use the read.csv() function to import the CSV file into R as a dataframe named DM. We set stringsAsFactors = FALSE to keep any character columns as-is.
DM <- read.csv(file = "DirectMarketing.csv",
stringsAsFactors = FALSE)
First, we can obtain high-level information about the DM dataframe to look at the variable types and to check for missing (NA) values.
Abstract(DM)
## ------------------------------------------------------------------------------
## DM
##
## data frame: 1000 obs. of 9 variables
## 1000 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 Age character .
## 2 Gender character .
## 3 OwnHome character .
## 4 Married character .
## 5 Location character .
## 6 Salary integer .
## 7 Children integer .
## 8 Catalogs integer .
## 9 Amount character .
Next, we can convert our target class variable, Amount (the class we want to predict), to a nominal factor variable. Note: although it is ordinal, we will use it as a nominal variable for compatibility. We will identify level 1 as Below_Avg and level 2 as Above_Avg to reflect the known ordering.
DM$Amount <- factor(x = DM$Amount,
levels = c("Below_Avg", "Above_Avg"))
We can plot the distribution using a bar plot, which is the default plot for the plot() function when applied to factor variables.
plot(x = DM$Amount,
main = "Amount Spent")
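If we also want the underlying counts behind the bar plot, a quick check is to tabulate the factor directly:
table(DM$Amount)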
First, we can convert our nominal predictor variables to factor variables. We will set up a convenience vector of the names of the nominal variables (noms) so that we can more easily refer to them.
noms <- c("Gender", "OwnHome", "Married", "Location")
DM[ ,noms] <- lapply(X = DM[ ,noms],
FUN = factor)
Next, we can convert our ordinal variables. The Age variable takes on three levels, which we will need to specify with the correct ordering. Since the Children and Catalogs variables take on numeric values, we can use the default ordering and convert them at the same time. We will set up a convenience vector of the names of the ordinal variables (ords) so that we can more easily refer to them.
ords <- c("Age", "Children", "Catalogs")
DM$Age <- factor(x = DM$Age,
levels = c("Young", "Middle", "Old"),
ordered = TRUE)
DM[,c("Children", "Catalogs")] <- lapply(X = DM[,c("Children", "Catalogs")],
FUN = factor,
ordered = TRUE)
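As a quick sanity check, we can confirm the conversions with the str() function, which should now show factor columns for the nominal variables and ordered factor columns for the ordinal variables:
str(DM[ ,c(noms, ords)])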
We only have one numerical variable, Salary. For consistency, we can set up a corresponding convenience vector, nums (although with a single variable we do not strictly need to).
nums <- c("Salary")
Finally, we can create our convenience vector of predictor variables (vars) containing all of our predictors. We can view the predictor names by printing the vars object.
vars <- c(noms, ords, nums)
vars
## [1] "Gender" "OwnHome" "Married" "Location" "Age" "Children" "Catalogs"
## [8] "Salary"
First, we can view summary() information for our prepared data.
summary(DM)
## Age Gender OwnHome Married Location
## Young :287 Female:506 Own :516 Married:502 Close:710
## Middle:508 Male :494 Rent:484 Single :498 Far :290
## Old :205
##
##
##
## Salary Children Catalogs Amount
## Min. : 10100 0:462 6 :252 Below_Avg:601
## 1st Qu.: 29975 1:267 12:282 Above_Avg:399
## Median : 53700 2:146 18:233
## Mean : 56104 3:125 24:233
## 3rd Qu.: 77025
## Max. :168800
For Naive Bayes classification, we know the following: it is okay to keep irrelevant features; missing values can be removed, imputed or ignored during analysis; categorical variables can be incorporated as factor variables; and redundant numeric features need to be removed. We also would like our numeric variables to be approximately normally distributed.
Missing Values Our data has no missing values, but if it did, we could remove them (na.omit()) or perform imputation (or we could handle them during analysis). We can visualize any missing values using the PlotMiss() function from the DescTools package.
PlotMiss(DM)
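A simple numeric alternative is to count the NA values per column; given the Abstract() output above, every entry should be zero:
colSums(is.na(DM))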
Redundant Variables We need to identify highly correlated numeric predictor (X) variables and exclude them from our predictive model. Since we only have one numeric variable, we do not need to worry about redundancy.
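If we did have multiple numeric predictors, one common approach pairs cor() with the findCorrelation() function from caret. Below is a minimal sketch using the built-in mtcars data (since our own data has only one numeric variable); the 0.75 cutoff is an illustrative choice:
corrs <- cor(mtcars)                     # pairwise correlation matrix
high_corr <- findCorrelation(x = corrs,  # correlation matrix
                             cutoff = 0.75) # flag highly correlated columns
colnames(mtcars)[high_corr]              # candidate columns to drop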
Normally Distributed Numeric Variables By assumption, Naive Bayes expects numeric variables to be normally distributed. We can visually inspect the distribution of our numeric variable, Salary, using a histogram.
hist(DM$Salary, main = "Salary")
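We can also quantify the asymmetry with the Skew() function from the DescTools package; a positive value indicates right skew:
Skew(x = DM$Salary)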
We can transform the Salary variable so that it is approximately normal. First, we look at the values of the Salary variable to determine which transformation to use: if they are strictly positive, we can use a Box-Cox transformation; if not, we can use a Yeo-Johnson transformation.
summary(DM$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10100 29975 53700 56104 77025 168800
Since we have strictly positive values, we will use a Box-Cox transformation and then standardize the Salary variable.
cen_bcs <- preProcess(x = DM[ ,vars],
method = c("BoxCox", "center", "scale"))
DM_bcs <- predict(object = cen_bcs,
newdata = DM)
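To confirm the transformation had the intended effect, we can re-plot the transformed variable, which should now look roughly symmetric and centered at 0:
hist(DM_bcs$Salary, main = "Transformed Salary")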
We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (Amount). Since the function takes a random sample, we initialize a random seed first for reproducibility.
set.seed(831) # initialize the random seed
# Generate the list of observations for the
# train dataframe
sub <- createDataPartition(y = DM_bcs$Amount,
p = 0.80,
list = FALSE)
We subset the rows of the DM_bcs dataframe to the row numbers in the sub object to create the train dataframe, and use all observations not in the sub object to create the test dataframe.
train <- DM_bcs[sub, ]
test <- DM_bcs[-sub, ]
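To verify that createDataPartition() preserved the distribution of the outcome variable, we can compare the class proportions in the two sets; they should be close to each other (and to the full data):
round(prop.table(table(train$Amount)), 3)
round(prop.table(table(test$Amount)), 3)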
We use the naiveBayes() function from the e1071 package to perform Naive Bayes classification.
If we have NA values that we did not handle during preprocessing, we can use the na.action argument. It defaults to na.pass, meaning the observations are kept and the NAs are simply not included when calculating probabilities. Alternatively, na.action = na.omit can be specified, in which case observations with NAs are excluded from modeling.
The default laplace value is 0, meaning Laplace smoothing is not applied. To determine whether we need Laplace smoothing, we look for categories with zero frequency within a class.
aggregate(train[ ,c(noms,ords)],
by = list(train$Amount),
FUN = table)
## Group.1 Gender.Female Gender.Male OwnHome.Own OwnHome.Rent Married.Married
## 1 Below_Avg 281 200 186 295 152
## 2 Above_Avg 124 196 229 91 248
## Married.Single Location.Close Location.Far Age.Young Age.Middle Age.Old
## 1 329 367 114 206 190 85
## 2 72 195 125 22 218 80
## Children.0 Children.1 Children.2 Children.3 Catalogs.6 Catalogs.12
## 1 187 121 95 78 176 154
## 2 182 88 25 25 27 67
## Catalogs.18 Catalogs.24
## 1 87 64
## 2 104 122
None of the category counts above are exactly zero, but several are small; to guard against zero-probability categories (which would zero out the entire posterior product), we will set laplace = 1.
nb_mod <- naiveBayes(x = train[ ,vars],
y = train$Amount,
laplace = 1)
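If we want to inspect what the model learned, the fitted naiveBayes object stores the class counts used for the prior (apriori) and the per-variable conditional distributions (tables). For example:
nb_mod$apriori          # class counts underlying the prior probabilities
nb_mod$tables$Location  # conditional probabilities of Location given Amount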
To assess the goodness of fit of the model, we compare training and testing performance. To do so, we first need to assess how well the model does on the training sample.
First, we use the predict() function to obtain the class predictions (type = "class") for the training data (train) based on our NB model. This creates a vector of class predictions.
nb.train <- predict(object = nb_mod, # NB model
newdata = train[ ,vars], # predictors
type = "class")
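If we instead want the posterior probabilities behind these class labels, we can set type = "raw"; the object name nb.train.prob below is just illustrative:
nb.train.prob <- predict(object = nb_mod,        # NB model
                         newdata = train[ ,vars], # predictors
                         type = "raw")            # posterior probabilities
head(nb.train.prob) # one row per observation, one column per class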
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and performance measures for our model applied to the training dataset (train). We can set mode = "everything" to obtain all available performance measures. We identify the Above_Avg class as positive, since that is the class we are more interested in being able to predict. We will save the result so that we can make comparisons later.
train_conf <- confusionMatrix(data = nb.train, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
train_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below_Avg Above_Avg
## Below_Avg 391 55
## Above_Avg 90 265
##
## Accuracy : 0.819
## 95% CI : (0.7905, 0.8451)
## No Information Rate : 0.6005
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6295
##
## Mcnemar's Test P-Value : 0.00475
##
## Sensitivity : 0.8281
## Specificity : 0.8129
## Pos Pred Value : 0.7465
## Neg Pred Value : 0.8767
## Precision : 0.7465
## Recall : 0.8281
## F1 : 0.7852
## Prevalence : 0.3995
## Detection Rate : 0.3308
## Detection Prevalence : 0.4432
## Balanced Accuracy : 0.8205
##
## 'Positive' Class : Above_Avg
##
To assess model performance, we focus on the performance of the model applied to the testing set. Next, we use the predict() function to obtain the class predictions (type = "class") for the testing data based on our NB model.
nb.test <- predict(object = nb_mod, # NB model
newdata = test[ ,vars], # predictors
type = "class")
Again, we use the confusionMatrix() function from the caret package to obtain a confusion matrix and performance measures for our model, this time applied to the testing dataset (test).
test_conf <- confusionMatrix(data = nb.test,
reference = test$Amount,
positive = "Above_Avg",
mode = "everything")
test_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below_Avg Above_Avg
## Below_Avg 99 11
## Above_Avg 21 68
##
## Accuracy : 0.8392
## 95% CI : (0.7806, 0.8873)
## No Information Rate : 0.603
## P-Value [Acc > NIR] : 4.168e-13
##
## Kappa : 0.6712
##
## Mcnemar's Test P-Value : 0.1116
##
## Sensitivity : 0.8608
## Specificity : 0.8250
## Pos Pred Value : 0.7640
## Neg Pred Value : 0.9000
## Precision : 0.7640
## Recall : 0.8608
## F1 : 0.8095
## Prevalence : 0.3970
## Detection Rate : 0.3417
## Detection Prevalence : 0.4472
## Balanced Accuracy : 0.8429
##
## 'Positive' Class : Above_Avg
##
We can describe the overall performance based on our accuracy and kappa values.
test_conf$overall[c("Accuracy", "Kappa")]
## Accuracy Kappa
## 0.8391960 0.6712442
We can describe class-level performance for the different class levels. Note that above, we set positive = "Above_Avg", since we are more interested in predicting customers who will spend a higher-than-average amount.
test_conf$byClass
## Sensitivity Specificity Pos Pred Value
## 0.8607595 0.8250000 0.7640449
## Neg Pred Value Precision Recall
## 0.9000000 0.7640449 0.8607595
## F1 Prevalence Detection Rate
## 0.8095238 0.3969849 0.3417085
## Detection Prevalence Balanced Accuracy
## 0.4472362 0.8428797
To assess the model's goodness of fit, we want to know if the model is balanced, underfitting, or overfitting. We compare the performance on the training and testing sets, using the cbind() function to view our confusionMatrix() output side-by-side.
First, we can compare the overall performance on the training and testing sets.
cbind(Training = train_conf$overall,
Testing = test_conf$overall)
## Training Testing
## Accuracy 8.189763e-01 8.391960e-01
## Kappa 6.294920e-01 6.712442e-01
## AccuracyLower 7.905294e-01 7.806430e-01
## AccuracyUpper 8.450516e-01 8.873350e-01
## AccuracyNull 6.004994e-01 6.030151e-01
## AccuracyPValue 1.528047e-40 4.167886e-13
## McnemarPValue 4.749556e-03 1.116118e-01
Next, we can compare the class level performance on the training and testing sets.
cbind(Training = train_conf$byClass,
Testing = test_conf$byClass)
## Training Testing
## Sensitivity 0.8281250 0.8607595
## Specificity 0.8128898 0.8250000
## Pos Pred Value 0.7464789 0.7640449
## Neg Pred Value 0.8766816 0.9000000
## Precision 0.7464789 0.7640449
## Recall 0.8281250 0.8607595
## F1 0.7851852 0.8095238
## Prevalence 0.3995006 0.3969849
## Detection Rate 0.3308365 0.3417085
## Detection Prevalence 0.4431960 0.4472362
## Balanced Accuracy 0.8205074 0.8428797
As shown, performance is similar on our training and testing sets, so we conclude that the model is balanced (neither underfitting nor overfitting). It is also a high-performing model, with strong overall and class-level performance when applied to the testing set.
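As a final quick check on the fit gap, we can subtract the two accuracy values directly; a small difference supports the conclusion that the model is balanced:
test_conf$overall["Accuracy"] - train_conf$overall["Accuracy"]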