library(tm)
## Loading required package: NLP
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
library(minqa)
#library(nloptr)
library(MatrixModels)
#install.packages("caret", dependencies = c("Depends", "Suggests"))
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
creditData <- read.csv("creditData.csv")
str(creditData)
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
## $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
## $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
## $ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
## $ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
## $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
## $ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
## $ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
sum(is.na(creditData))
## [1] 0
set.seed(12345)
credit_rand <- creditData[order(runif(1000)), ]
summary(creditData$Credit.Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18424
summary(credit_rand$Credit.Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18424
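There are many ways to recode a numeric column into a class label in R. For example, the loop below (left commented out) labels each record "yes" when its credit amount is at or above the median and "no" otherwise: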
# for(i in 1:1000) {
# if(credit_rand$Credit.Amount[i] >= 2320) { #2320 is the median for this column
# credit_rand$CreditAmountClass[i] = "yes"}
# else {credit_rand$CreditAmountClass[i] = "no"}
# }
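The same recoding can be written without an explicit loop. The following is a minimal sketch, assuming a small made-up data frame (`toy` is hypothetical and stands in for `credit_rand`) and taking the cutoff from `median()` rather than hard-coding 2320:

```r
# Hypothetical toy data standing in for credit_rand
toy <- data.frame(Credit.Amount = c(250, 1366, 2320, 3972, 18424))

# Compute the cutoff from the data instead of typing it in
cutoff <- median(toy$Credit.Amount)

# Vectorized equivalent of the loop: "yes" at or above the median, "no" below it
toy$CreditAmountClass <- ifelse(toy$Credit.Amount >= cutoff, "yes", "no")

toy$CreditAmountClass
```

Because `ifelse()` works on the whole column at once, the code does not depend on the number of rows being exactly 1000.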
credit_rand$Creditability <- as.factor(credit_rand$Creditability)
Subsetting the observations (records) so that 75% of them form the training set and the remaining 25% form the test set:
credit_train <- credit_rand[1:750, ]
credit_test <- credit_rand[751:1000, ]
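The row positions 750 and 751 above are tied to a data set of exactly 1000 rows. As a rough sketch, the same 75/25 split can be derived from the number of rows; the data frame `toy` and the names `n_train`, `toy_train` and `toy_test` are made up for illustration:

```r
set.seed(12345)

# Hypothetical toy data: 20 records with a binary label
toy <- data.frame(id = 1:20, y = rep(c(0, 1), each = 10))

# Shuffle the rows the same way as above, then derive the split point
toy_rand <- toy[order(runif(nrow(toy))), ]
n_train <- floor(0.75 * nrow(toy_rand))

toy_train <- toy_rand[seq_len(n_train), ]
toy_test <- toy_rand[(n_train + 1):nrow(toy_rand), ]

c(train = nrow(toy_train), test = nrow(toy_test))
```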
If the randomization went well, the class proportions of Creditability should be close in the two splits. We can check this with the table() command.
Checking the proportions rather than the raw counts for both the training set and the test set:
prop.table(table(credit_train$Creditability))
##
## 0 1
## 0.3146667 0.6853333
prop.table(table(credit_test$Creditability))
##
## 0 1
## 0.256 0.744
The class ratios in the two splits are reasonably close (about 31%/69% in the training set versus 26%/74% in the test set), so we go ahead with this split and move to Step 2.
library(naivebayes)
naive_model <- naive_bayes(credit_train$Creditability ~ ., data = credit_train)
naive_model
## ===================== Naive Bayes =====================
## Call:
## naive_bayes.formula(formula = credit_train$Creditability ~ .,
## data = credit_train)
##
## A priori probabilities:
##
## 0 1
## 0.3146667 0.6853333
##
## Tables:
##
## Account.Balance 0 1
## mean 1.923729 2.793774
## sd 1.036826 1.252008
##
##
## Duration.of.Credit..month. 0 1
## mean 24.46610 19.20039
## sd 13.82208 11.13433
##
##
## Payment.Status.of.Previous.Credit 0 1
## mean 2.161017 2.665370
## sd 1.071649 1.045219
##
##
## Purpose 0 1
## mean 2.927966 2.803502
## sd 2.944722 2.633253
##
##
## Credit.Amount 0 1
## mean 3964.195 2984.177
## sd 3597.093 2379.685
##
## # ... and 15 more tables
To evaluate the performance of our model, we use the predict() function on the test set and look at the output in a confusion table:
credit_test_pred <- predict(naive_model, newdata = credit_test)
table(credit_test_pred, credit_test$Creditability)
##
## credit_test_pred 0 1
## 0 42 35
## 1 22 151
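Rows are predictions and columns are actual values. Treating class 0 (not credit-worthy) as the positive class, there are 22 false negatives (actual 0 predicted as 1) and 35 false positives (actual 1 predicted as 0) in the confusion matrix. Analyzing the confusion matrix further, we find that: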
Correctly predicted Not-Credit-Worthy records make up (42/250) × 100 = 16.8% of the test set.
Correctly predicted Credit-Worthy records make up (151/250) × 100 = 60.4% of the test set.
The overall accuracy is ((42 + 151)/250) × 100 = 77.2%.
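The accuracy arithmetic can be reproduced directly from the confusion table. This is a small sketch that re-enters the four counts from the output above by hand (the object name `conf` is made up here; in the session itself the result of `table()` could be used instead):

```r
# Counts copied from the confusion table above (rows = predicted, columns = actual)
conf <- matrix(c(42, 22, 35, 151), nrow = 2,
               dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 0.772
```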
The confusion table shows that the naïve Bayes classifier's predictions match the expected values for 193 of the 250 records in the test dataset, so it gets most, but not all, of them right.
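Removing highly correlated predictors requires a specific “order of operations”: randomize the rows, scale the predictors, compute their correlation matrix, find the highly correlated columns, drop them, and only then split into training and test sets.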
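# Reshuffle the rows (set.seed() is not called again, so this order differs from the first shuffle)
credit_rand <- creditData[order(runif(1000)), ]
# Scale the 20 predictors (columns 2 onward; column 1 is Creditability)
creditDataScaled <- scale(credit_rand[,2:ncol(credit_rand)], center=TRUE, scale = TRUE)
m <- cor(creditDataScaled)
# Positions of the predictors flagged for removal at an absolute pairwise correlation cutoff of 0.30
highlycor <- findCorrelation(m, 0.30)
# Note: highlycor indexes the columns of m, which are offset by one from the columns of credit_rand
filteredData <- credit_rand[, -highlycor]
filteredTraining <- filteredData[1:750, ]
filteredTest <- filteredData[751:1000, ]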
In addition, carry out the usual housekeeping, such as initializing the random number generator with set.seed() before the shuffle so that the whole process is reproducible.
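One detail in the order of operations deserves care: findCorrelation() reports column positions relative to the correlation matrix it is given, and that matrix was built from the predictors only. A hedged sketch of one way to avoid a position mismatch is to translate the positions into column names before dropping anything. Everything here is illustrative: `toy` is made-up data, and `drop_idx` is a hand-written stand-in for the vector that a correlation filter would return.

```r
# Hypothetical toy data: first column is the label, the rest are predictors
toy <- data.frame(label = c(0, 1, 0, 1, 1),
                  a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, 6, 8, 11),  # nearly collinear with a
                  c = c(5, 1, 4, 2, 3))

# Correlation matrix of the predictors only (label excluded)
m <- cor(toy[, 2:ncol(toy)])

# Stand-in for the filter result: position 2 of m, which is column "b"
drop_idx <- 2

# Translate positions in m into names, then drop those names from the full data
drop_names <- colnames(m)[drop_idx]
filtered <- toy[, !(names(toy) %in% drop_names), drop = FALSE]

names(filtered)      # "label" "a" "c"
names(toy[, -drop_idx])  # "label" "b" "c": positional drop removes "a" instead
```

Dropping by name also behaves sensibly when nothing is flagged, since an empty `drop_names` leaves every column in place.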
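From this point on we follow the same process as before: train the model (this time with naiveBayes() from the e1071 package), predict on the test set, and tabulate the results.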
nb_model <- naiveBayes(as.factor(filteredTraining$Creditability) ~ ., data=filteredTraining)
filteredTestPred <- predict(nb_model, newdata = filteredTest)
table(filteredTestPred, filteredTest$Creditability)
##
## filteredTestPred 0 1
## 0 52 47
## 1 18 133
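Again treating class 0 (not credit-worthy) as the positive class, there are 18 false negatives (actual 0 predicted as 1) and 47 false positives (actual 1 predicted as 0) in the confusion matrix. Analyzing the confusion matrix further, we find that: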
Correctly predicted Not-Credit-Worthy records make up (52/250) × 100 = 20.8% of the test set.
Correctly predicted Credit-Worthy records make up (133/250) × 100 = 53.2% of the test set.
The overall accuracy is ((52 + 133)/250) × 100 = 74%.
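Which off-diagonal cell counts as a false negative depends on which class is called positive. As an illustration only, the sketch below re-enters the counts from the table above by hand and reads the cells by name, under the assumption that class "0" (not credit-worthy) is the positive class; the object names are made up:

```r
# Counts copied from the confusion table above (rows = predicted, columns = actual)
conf <- matrix(c(52, 18, 47, 133), nrow = 2,
               dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Assumption: class "0" is the positive class
false_neg <- conf["1", "0"]  # actual 0, predicted 1
false_pos <- conf["0", "1"]  # actual 1, predicted 0

c(false_negatives = false_neg, false_positives = false_pos)
```

If class "1" were treated as positive instead, the two labels would simply swap.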
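Filtering out the correlated predictors did not improve the Naïve Bayes classifier: overall accuracy fell from 77.2% to 74%. Because the rows were reshuffled before the second run, the two models were evaluated on different test sets, so the comparison is only approximate.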