library(tm)
## Loading required package: NLP
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
library(minqa)

#library(nloptr)
library(MatrixModels)

#install.packages("caret", dependencies = c("Depends", "Suggests"))
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
creditData <- read.csv("creditData.csv")
str(creditData)
## 'data.frame':    1000 obs. of  21 variables:
##  $ Creditability                    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Account.Balance                  : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
##  $ Payment.Status.of.Previous.Credit: int  4 4 2 4 4 4 4 4 4 2 ...
##  $ Purpose                          : int  2 0 9 0 0 0 0 0 3 3 ...
##  $ Credit.Amount                    : int  1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
##  $ Value.Savings.Stocks             : int  1 1 2 1 1 1 1 1 1 3 ...
##  $ Length.of.current.employment     : int  2 3 4 3 3 2 4 2 1 1 ...
##  $ Instalment.per.cent              : int  4 2 2 3 4 1 1 2 4 1 ...
##  $ Sex...Marital.Status             : int  2 3 2 3 3 3 3 3 2 2 ...
##  $ Guarantors                       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Duration.in.Current.address      : int  4 2 4 2 4 3 4 4 4 4 ...
##  $ Most.valuable.available.asset    : int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Age..years.                      : int  21 36 23 39 38 48 39 40 65 23 ...
##  $ Concurrent.Credits               : int  3 3 3 3 1 3 3 3 3 3 ...
##  $ Type.of.apartment                : int  1 1 1 1 2 1 2 2 2 1 ...
##  $ No.of.Credits.at.this.Bank       : int  1 2 1 2 2 2 2 1 2 1 ...
##  $ Occupation                       : int  3 3 2 2 2 2 2 2 1 1 ...
##  $ No.of.dependents                 : int  1 2 1 2 1 2 1 2 1 1 ...
##  $ Telephone                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Foreign.Worker                   : int  1 1 1 2 2 2 2 2 1 1 ...

Naïve Bayes Classifiers, Part 1

Step 1: Exploring and Preparing the Data

sum(is.na(creditData))
## [1] 0

Training and Testing Data Sets

set.seed(12345) 
credit_rand <- creditData[order(runif(1000)), ]
summary(creditData$Credit.Amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424
summary(credit_rand$Credit.Amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424
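The order of runif(1000) gives a random permutation of the row indices, and the identical summaries confirm that shuffling left the Credit.Amount distribution unchanged. An equivalent shuffle (a sketch, not run here, hence commented out) uses sample() directly:

# sample(nrow(creditData)) returns a random permutation of 1:1000
# credit_rand <- creditData[sample(nrow(creditData)), ]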

There are many ways to derive a class label from a numeric column in R. For example, binning Credit.Amount at its median:

# for(i in 1:1000) {
#   if(credit_rand$Credit.Amount[i] >= 2320) { #2320 is the median for this column
#     credit_rand$CreditAmountClass[i] = "yes"} 
#   else {credit_rand$CreditAmountClass[i] = "no"}
# }
credit_rand$Creditability <- as.factor(credit_rand$Creditability)
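A vectorized alternative to the commented loop above (a sketch, likewise not run) uses ifelse() to apply the median test to the whole column at once:

# med <- median(credit_rand$Credit.Amount)  # 2320 for this column
# credit_rand$CreditAmountClass <- ifelse(credit_rand$Credit.Amount >= med, "yes", "no")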

Subsetting the observations (records) so that 75% of the records go to the training set and 25% to the test set:

credit_train <- credit_rand[1:750, ] 
credit_test <- credit_rand[751:1000, ]
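Since caret is already loaded, a stratified split (a sketch, not run here, hence commented out) could use createDataPartition(), which preserves the class proportions in both subsets by construction:

# in_train <- createDataPartition(credit_rand$Creditability, p = 0.75, list = FALSE)
# credit_train <- credit_rand[in_train, ]
# credit_test  <- credit_rand[-in_train, ]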

If the randomization went well, the class proportions in the two splits should be close. We can count the records in each Creditability class with the table() command.

Checking the percentages rather than the raw counts for both the training set and the test set using prop.table():

prop.table(table(credit_train$Creditability))
## 
##         0         1 
## 0.3146667 0.6853333
prop.table(table(credit_test$Creditability))
## 
##     0     1 
## 0.256 0.744

The class ratios in the two splits are close to each other, so the split looks good. Moving on to Step 2.

Step 2: Training a Model on the Data

Creating the Naive Bayes Model

library(naivebayes)
naive_model <- naive_bayes(credit_train$Creditability ~ ., data = credit_train)
naive_model
## ===================== Naive Bayes ===================== 
## Call: 
## naive_bayes.formula(formula = credit_train$Creditability ~ ., 
##     data = credit_train)
## 
## A priori probabilities: 
## 
##         0         1 
## 0.3146667 0.6853333 
## 
## Tables: 
##                
## Account.Balance        0        1
##            mean 1.923729 2.793774
##            sd   1.036826 1.252008
## 
##                           
## Duration.of.Credit..month.        0        1
##                       mean 24.46610 19.20039
##                       sd   13.82208 11.13433
## 
##                                  
## Payment.Status.of.Previous.Credit        0        1
##                              mean 2.161017 2.665370
##                              sd   1.071649 1.045219
## 
##        
## Purpose        0        1
##    mean 2.927966 2.803502
##    sd   2.944722 2.633253
## 
##              
## Credit.Amount        0        1
##          mean 3964.195 2984.177
##          sd   3597.093 2379.685
## 
## # ... and 15 more tables
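Besides hard class labels, predict() for a naive_bayes model can return the posterior probability of each class for every record (a sketch; output not shown):

# posterior probabilities P(class | features) for the first test records
head(predict(naive_model, newdata = credit_test, type = "prob"))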

Step 3: Evaluating Model Performance

To evaluate the performance of the model, we use the predict() function and examine the output in a confusion table:

credit_test_pred <- predict(naive_model, newdata = credit_test) 
table(credit_test_pred, credit_test$Creditability)
##                 
## credit_test_pred   0   1
##                0  42  35
##                1  22 151

Thus, as we can see, there are 22 false negatives and 35 false positives in the confusion matrix (taking class 0, not creditworthy, as the positive class). Further, analyzing the confusion matrix we find that:

The percentage of correctly identified Not-Credit-Worthy records is (42/250) * 100 = 16.8%.
The percentage of correctly identified Credit-Worthy records is (151/250) * 100 = 60.4%.
The overall accuracy is ((42 + 151)/250) * 100 = 77.2%.
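These figures can also be computed in one step; for example, caret's confusionMatrix() reports the accuracy together with per-class statistics (a sketch; output not shown):

# class 0 (not creditworthy) is treated as the positive class, matching the counts above
confusionMatrix(credit_test_pred, credit_test$Creditability, positive = "0")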

The confusion table shows that the naïve Bayes classifier predicts most of the test records correctly, but it is far from perfect: 57 of the 250 predictions disagree with the expected values.

Naïve Bayes Classifiers, Part 2

There is an “order of operations” required to get all this done correctly.

Step 1: Exploring and Preparing the Data

  1. Randomize the data
credit_rand <- creditData[order(runif(1000)), ]
  2. Scale the data
creditDataScaled <- scale(credit_rand[,2:ncol(credit_rand)], center=TRUE, scale = TRUE)
  3. Compute the correlation matrix (note that this does not include the class variable)
m <- cor(creditDataScaled)
  4. Determine the threshold to use for feature (variable) selection (I'll continue to use 0.30 as an example) and perform the feature selection; the flagged columns can be listed, as shown after this list.
highlycor <- findCorrelation(m, 0.30)
  5. Recombine the class variable with the filtered credit data and split it into training and test data sets. BUT before you do this, you will need to go back to the unscaled data to build the feature data set, because the naiveBayes() function seems to do strange things with scaled data when the parameter center equals TRUE!
# caution: highlycor indexes creditDataScaled, which omits column 1 (Creditability),
# so these positions are shifted by one relative to the columns of credit_rand
filteredData <- credit_rand[, -highlycor] 
filteredTraining <- filteredData[1:750, ] 
filteredTest <- filteredData[751:1000, ]
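To see which columns findCorrelation() flagged (a sketch; output not shown):

# names of the features whose pairwise correlation exceeds the 0.30 cutoff
colnames(creditDataScaled)[highlycor]
# confirm how many columns survive the filtering
dim(filteredData)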

In addition, do the usual housekeeping, such as setting a seed at the start so that the overall process is reproducible.

Step 2: Training a Model on the Data

At this point we follow the same process that we have done before.

  1. Build the Naïve Bayes classifier as usual
nb_model <- naiveBayes(as.factor(filteredTraining$Creditability) ~ ., data=filteredTraining)
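The fitted naiveBayes object stores the estimated class priors and the per-feature conditional distributions, which can be inspected directly (a sketch; output not shown):

# class distribution estimated from the training data
nb_model$apriori
# conditional mean and sd (per class) of the first retained feature
nb_model$tables[[1]]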

Step 3: Evaluating Model Performance

  1. Evaluate the Naïve Bayes Classifier as usual
filteredTestPred <- predict(nb_model, newdata = filteredTest) 
table(filteredTestPred, filteredTest$Creditability)
##                 
## filteredTestPred   0   1
##                0  52  47
##                1  18 133

Thus there are 18 false negatives and 47 false positives in the confusion matrix (again taking class 0 as the positive class). Further, analyzing the confusion matrix we find that:

The percentage of correctly identified Not-Credit-Worthy records is (52/250) * 100 = 20.8%.
The percentage of correctly identified Credit-Worthy records is (133/250) * 100 = 53.2%.
The overall accuracy is ((52 + 133)/250) * 100 = 74%.
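The overall accuracy can also be recovered from the table programmatically (a sketch; output not shown):

# correct predictions lie on the diagonal of the confusion table
tab <- table(filteredTestPred, filteredTest$Creditability)
sum(diag(tab)) / sum(tab)  # (52 + 133) / 250 = 0.74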

Feature selection did not improve the performance of the Naïve Bayes classifier: the overall accuracy dropped from 77.2% to 74%.