Prescriptive analytics takes the predictions a model makes and uses them to drive strategic changes, for example refocusing or enhancing a business model. This laboratory looks specifically at credit risk: the risk of default on a debt because a borrower fails to make the required payments on time. The bank, as the financial institution, analyzes customer data to predict which customers might be credit risks. These predictions then feed into risk management.
This laboratory focuses on Bayesian methods, specifically Naïve Bayes classifiers. We take the brute-force approach: we apply techniques for improving performance to the original algorithm individually and manually.
This lab uses the Credit Data set. The data was imported and explored. It was checked for missing values and has none.
library(readxl)
creditData <- read_excel("/Users/Rodda Ouma/Documents/Harrisburg/Machine Learning/creditData.xlsx")
##converting excel spreadsheet to dataframe
creditData<-as.data.frame(creditData)
str(creditData)
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Account Balance : num 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration of Credit (month) : num 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment Status of Previous Credit: num 4 4 2 4 4 4 4 4 4 2 ...
## $ Purpose : num 2 0 9 0 0 0 0 0 3 3 ...
## $ Credit Amount : num 1049 2799 841 2122 2171 ...
## $ Value Savings/Stocks : num 1 1 2 1 1 1 1 1 1 3 ...
## $ Length of current employment : num 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment per cent : num 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex & Marital Status : num 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration in Current address : num 4 2 4 2 4 3 4 4 4 4 ...
## $ Most valuable available asset : num 2 1 1 1 2 1 1 1 3 4 ...
## $ Age (years) : num 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent Credits : num 3 3 3 3 1 3 3 3 3 3 ...
## $ Type of apartment : num 1 1 1 1 2 1 2 2 2 1 ...
## $ No of Credits at this Bank : num 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : num 3 3 2 2 2 2 2 2 1 1 ...
## $ No of dependents : num 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign Worker : num 1 1 1 2 2 2 2 2 1 1 ...
##Exploring the data by checking for missing data
sum(is.na(creditData))
## [1] 0
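The check above returns a single total. As a minimal sketch, assuming a small made-up data frame (`toy` is hypothetical and not part of the credit data), the same idea can be broken down by column, which is useful when the total is not zero:

```r
# Hypothetical toy data frame (not the credit data) with two missing cells
toy <- data.frame(
  amount  = c(1049, NA, 841),
  age     = c(21, 36, NA),
  purpose = c(2, 0, 9)
)

# Total number of missing cells, the same check as sum(is.na(creditData))
total_missing <- sum(is.na(toy))

# Missing cells per column, which shows where any gaps are
missing_by_column <- colSums(is.na(toy))

print(total_missing)
print(missing_by_column)
```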
summary(creditData)
## Creditability Account Balance Duration of Credit (month)
## Min. :0.0 Min. :1.000 Min. : 4.0
## 1st Qu.:0.0 1st Qu.:1.000 1st Qu.:12.0
## Median :1.0 Median :2.000 Median :18.0
## Mean :0.7 Mean :2.577 Mean :20.9
## 3rd Qu.:1.0 3rd Qu.:4.000 3rd Qu.:24.0
## Max. :1.0 Max. :4.000 Max. :72.0
## Payment Status of Previous Credit Purpose Credit Amount
## Min. :0.000 Min. : 0.000 Min. : 250
## 1st Qu.:2.000 1st Qu.: 1.000 1st Qu.: 1366
## Median :2.000 Median : 2.000 Median : 2320
## Mean :2.545 Mean : 2.828 Mean : 3271
## 3rd Qu.:4.000 3rd Qu.: 3.000 3rd Qu.: 3972
## Max. :4.000 Max. :10.000 Max. :18424
## Value Savings/Stocks Length of current employment Instalment per cent
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000
## Median :1.000 Median :3.000 Median :3.000
## Mean :2.105 Mean :3.384 Mean :2.973
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :4.000
## Sex & Marital Status Guarantors Duration in Current address
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :3.000
## Mean :2.682 Mean :1.145 Mean :2.845
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:4.000
## Max. :4.000 Max. :3.000 Max. :4.000
## Most valuable available asset Age (years) Concurrent Credits
## Min. :1.000 Min. :19.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:27.00 1st Qu.:3.000
## Median :2.000 Median :33.00 Median :3.000
## Mean :2.358 Mean :35.54 Mean :2.675
## 3rd Qu.:3.000 3rd Qu.:42.00 3rd Qu.:3.000
## Max. :4.000 Max. :75.00 Max. :3.000
## Type of apartment No of Credits at this Bank Occupation
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000
## Median :2.000 Median :1.000 Median :3.000
## Mean :1.928 Mean :1.407 Mean :2.904
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :3.000 Max. :4.000 Max. :4.000
## No of dependents Telephone Foreign Worker
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000
## Mean :1.155 Mean :1.404 Mean :1.037
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000
##Creditability is converted to a factor because Naive Bayes classification needs a categorical class variable in order to run.
creditData$Creditability <-as.factor(creditData$Creditability)
summary(creditData$Creditability)
## 0 1
## 300 700
We use a 75%/25% split for training and test data, i.e. 75% of the records form the training set and 25% form the test set.
set.seed(12345)
credit_rand <-creditData[order(runif(1000)),]
credit_train <- credit_rand [1:750,]
credit_test <- credit_rand [751:1000,]
prop.table(table(credit_train$Creditability))
##
## 0 1
## 0.3146667 0.6853333
prop.table(table(credit_test$Creditability))
##
## 0 1
## 0.256 0.744
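The two proportion tables above differ (about 31.5% of class 0 in training against 25.6% in test) because the split is purely random. One possible alternative, shown here only as a sketch on made-up data (`toy` and the helper names are hypothetical), is to sample 75% within each class so that both sets keep the same class mix:

```r
set.seed(12345)

# Hypothetical stand-in for the credit data: 300 records of class 0, 700 of class 1
toy <- data.frame(
  Creditability = factor(rep(c(0, 1), times = c(300, 700))),
  x = rnorm(1000)
)

# Sample 75% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$Creditability)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both tables now show the same class proportions
print(prop.table(table(toy_train$Creditability)))
print(prop.table(table(toy_test$Creditability)))
```

This is not what the lab does; it only illustrates how the class balance of the two sets could be held equal.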
We use the naive_bayes function from the naivebayes package to build the classification model.
library(naivebayes)
naive_model<-naive_bayes(Creditability ~ ., data = credit_train)
naive_model
## ===================== Naive Bayes =====================
## Call:
## naive_bayes.formula(formula = Creditability ~ ., data = credit_train)
##
## A priori probabilities:
##
## 0 1
## 0.3146667 0.6853333
##
## Tables:
##
## Account Balance 0 1
## mean 1.923729 2.793774
## sd 1.036826 1.252008
##
##
## Duration of Credit (month) 0 1
## mean 24.46610 19.20039
## sd 13.82208 11.13433
##
##
## Payment Status of Previous Credit 0 1
## mean 2.161017 2.665370
## sd 1.071649 1.045219
##
##
## Purpose 0 1
## mean 2.927966 2.803502
## sd 2.944722 2.633253
##
##
## Credit Amount 0 1
## mean 3964.195 2984.177
## sd 3597.093 2379.685
##
## # ... and 15 more tables
From the results above, 68.5% of the applicants in the training set are creditworthy (the a priori probability of class 1). We can then evaluate our model by looking at its accuracy on the test set.
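The mean and sd tables in the printout describe one Gaussian per class for each numeric predictor. As a rough sketch of how those numbers turn into a prediction, the code below applies Bayes' rule using only the Account Balance table; the applicant value `x` is hypothetical, and the real model multiplies in the likelihoods of all 20 predictors rather than one.

```r
# Values copied from the model printout above
prior   <- c("0" = 0.3146667, "1" = 0.6853333)
mean_ab <- c("0" = 1.923729,  "1" = 2.793774)
sd_ab   <- c("0" = 1.036826,  "1" = 1.252008)

# Hypothetical applicant with Account Balance coded as 4
x <- 4

# Class-conditional Gaussian densities for this single feature
likelihood <- dnorm(x, mean = mean_ab, sd = sd_ab)

# Bayes' rule: the posterior is proportional to prior times likelihood
unnormalised <- prior * likelihood
posterior <- unnormalised / sum(unnormalised)

print(round(posterior, 3))
```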
##Model Evaluation
conf_nat <-table(predict(naive_model,credit_test),credit_test$Creditability)
conf_nat
##
## 0 1
## 0 42 35
## 1 22 151
(Accuracy<-sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 77.2
From the results above, the model is 77.2% accurate. We will then see whether feature selection can improve the performance of the model.
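Accuracy alone hides how the errors are distributed. The sketch below rebuilds the confusion matrix from the printed counts and derives a few companion figures. Treating class 1 (creditworthy) as the positive class is an assumption made here for illustration; the lab does not state one.

```r
# Confusion matrix copied from the output above:
# rows are predictions, columns are actual classes
conf <- matrix(c(42, 22, 35, 151), nrow = 2,
               dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Share of all test records that were classified correctly
accuracy <- sum(diag(conf)) / sum(conf)

# Assumption: class 1 is the positive class
sensitivity <- conf["1", "1"] / sum(conf[, "1"])  # actual 1s predicted as 1
specificity <- conf["0", "0"] / sum(conf[, "0"])  # actual 0s predicted as 0

# Accuracy of always predicting the majority class, for comparison
baseline <- max(colSums(conf)) / sum(conf)

print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline))
```

Comparing `accuracy` with `baseline` shows how much the model adds over simply labelling every applicant as creditworthy.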
#we'll manually work to improve the performance of the Naïve Bayes classifier
##We first randomize the data
credit_rand2<- creditData[order(runif(1000)), ]
str(credit_rand2)
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : Factor w/ 2 levels "0","1": 2 2 1 2 2 1 1 2 2 2 ...
## $ Account Balance : num 2 2 1 4 2 4 2 3 1 4 ...
## $ Duration of Credit (month) : num 42 24 24 18 12 24 18 18 18 12 ...
## $ Payment Status of Previous Credit: num 4 4 2 2 2 2 2 2 2 4 ...
## $ Purpose : num 9 1 0 9 9 3 0 2 0 6 ...
## $ Credit Amount : num 5954 7758 1371 1950 841 ...
## $ Value Savings/Stocks : num 1 4 5 1 2 3 5 1 2 1 ...
## $ Length of current employment : num 4 5 3 4 4 5 3 2 4 4 ...
## $ Instalment per cent : num 2 2 4 4 2 3 4 1 4 2 ...
## $ Sex & Marital Status : num 2 2 2 3 2 3 2 2 3 3 ...
## $ Guarantors : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration in Current address : num 1 4 4 1 4 2 2 1 3 3 ...
## $ Most valuable available asset : num 1 4 1 3 1 3 2 2 3 1 ...
## $ Age (years) : num 41 29 25 34 23 35 33 45 30 49 ...
## $ Concurrent Credits : num 1 3 3 2 3 1 3 2 3 3 ...
## $ Type of apartment : num 2 1 1 2 1 2 2 2 2 2 ...
## $ No of Credits at this Bank : num 2 1 1 2 1 2 1 1 1 1 ...
## $ Occupation : num 2 3 3 3 2 3 3 2 4 2 ...
## $ No of dependents : num 1 1 1 1 1 1 1 1 2 2 ...
## $ Telephone : num 1 1 1 2 1 2 1 1 2 1 ...
## $ Foreign Worker : num 1 1 1 1 1 1 1 1 1 1 ...
The next step is to scale the data, but we first drop the class variable (Creditability) in the first column, since it is categorical.
## we scale the data
creditDataScaled <- scale(credit_rand2[,2:ncol(credit_rand2)], center=TRUE, scale = TRUE)
View(creditDataScaled)
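As a small illustration of what `scale()` does, the sketch below standardises a few columns of the built-in `mtcars` data, used here purely as a stand-in for the credit data. It also checks that Pearson correlation is the same before and after scaling, so the scaling step does not change the correlation matrix that follows.

```r
# Built-in data set used purely as a stand-in for the credit data
raw <- mtcars[, c("mpg", "hp", "wt")]

# Standardise each column: subtract the mean, divide by the standard deviation
scaled <- scale(raw, center = TRUE, scale = TRUE)

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))

# Pearson correlation is unchanged by this kind of rescaling
same_correlation <- isTRUE(all.equal(cor(raw), cor(scaled)))
print(same_correlation)
```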
We then calculate the correlation among the variables and use the findCorrelation function from the caret package to flag the highly correlated ones. A cutoff of 0.3 is used, and the flagged variables are then removed from the data frame.
##We use the correlation matrix to perform feature (i.e. variable) selection.
##compute the correlation matrix
#note that this does not include the class variable
m <- cor(creditDataScaled)
m
## Account Balance
## Account Balance 1.000000000
## Duration of Credit (month) -0.072013088
## Payment Status of Previous Credit 0.192190688
## Purpose 0.028782569
## Credit Amount -0.042695127
## Value Savings/Stocks 0.222866860
## Length of current employment 0.106338752
## Instalment per cent -0.005279856
## Sex & Marital Status 0.043261280
## Guarantors -0.127736563
## Duration in Current address -0.042233689
## Most valuable available asset -0.032260126
## Age (years) 0.058630740
## Concurrent Credits 0.068273870
## Type of apartment 0.023335309
## No of Credits at this Bank 0.076005137
## Occupation 0.040663061
## No of dependents -0.014145427
## Telephone 0.066295834
## Foreign Worker -0.035186993
## Duration of Credit (month)
## Account Balance -0.07201309
## Duration of Credit (month) 1.00000000
## Payment Status of Previous Credit -0.07718647
## Purpose 0.14749187
## Credit Amount 0.62498846
## Value Savings/Stocks 0.04766092
## Length of current employment 0.05738103
## Instalment per cent 0.07474882
## Sex & Marital Status 0.01478933
## Guarantors -0.02448995
## Duration in Current address 0.03406720
## Most valuable available asset 0.30397125
## Age (years) -0.03754986
## Concurrent Credits -0.06288379
## Type of apartment 0.15312556
## No of Credits at this Bank -0.01128360
## Occupation 0.21090973
## No of dependents -0.02383448
## Telephone 0.16471821
## Foreign Worker -0.13467996
## Payment Status of Previous Credit
## Account Balance 0.19219069
## Duration of Credit (month) -0.07718647
## Payment Status of Previous Credit 1.00000000
## Purpose -0.09033589
## Credit Amount -0.05991485
## Value Savings/Stocks 0.03905788
## Length of current employment 0.13822522
## Instalment per cent 0.04437459
## Sex & Marital Status 0.04217088
## Guarantors -0.04067553
## Duration in Current address 0.06319797
## Most valuable available asset -0.05377676
## Age (years) 0.14633747
## Concurrent Credits 0.15995707
## Type of apartment 0.06142792
## No of Credits at this Bank 0.43706577
## Occupation 0.01035018
## No of dependents 0.01154955
## Telephone 0.05237019
## Foreign Worker 0.02855405
## Purpose Credit Amount
## Account Balance 0.0287825694 -0.042695127
## Duration of Credit (month) 0.1474918712 0.624988461
## Payment Status of Previous Credit -0.0903358941 -0.059914852
## Purpose 1.0000000000 0.068480054
## Credit Amount 0.0684800535 1.000000000
## Value Savings/Stocks -0.0186844687 0.064632168
## Length of current employment 0.0160130053 -0.008376109
## Instalment per cent 0.0483689475 -0.271322281
## Sex & Marital Status 0.0001565929 -0.016094338
## Guarantors -0.0176067538 -0.027830917
## Duration in Current address -0.0382213445 0.028916676
## Most valuable available asset 0.0109663534 0.311602093
## Age (years) -0.0008923856 0.032272677
## Concurrent Credits -0.1002303932 -0.069392010
## Type of apartment 0.0134946967 0.133023634
## No of Credits at this Bank 0.0549353555 0.020785277
## Occupation 0.0080847757 0.285393073
## No of dependents -0.0325768744 0.017143582
## Telephone 0.0783705414 0.277000181
## Foreign Worker -0.1132436689 -0.030661601
## Value Savings/Stocks
## Account Balance 0.222866860
## Duration of Credit (month) 0.047660924
## Payment Status of Previous Credit 0.039057881
## Purpose -0.018684469
## Credit Amount 0.064632168
## Value Savings/Stocks 1.000000000
## Length of current employment 0.120949514
## Instalment per cent 0.021992529
## Sex & Marital Status 0.017348689
## Guarantors -0.105068513
## Duration in Current address 0.091424109
## Most valuable available asset 0.018948001
## Age (years) 0.083433512
## Concurrent Credits 0.001907967
## Type of apartment 0.006643819
## No of Credits at this Bank -0.021644133
## Occupation 0.011708920
## No of dependents 0.027513789
## Telephone 0.087208402
## Foreign Worker 0.010449560
## Length of current employment
## Account Balance 0.106338752
## Duration of Credit (month) 0.057381027
## Payment Status of Previous Credit 0.138225216
## Purpose 0.016013005
## Credit Amount -0.008376109
## Value Savings/Stocks 0.120949514
## Length of current employment 1.000000000
## Instalment per cent 0.126161307
## Sex & Marital Status 0.111278288
## Guarantors -0.008116008
## Duration in Current address 0.245080745
## Most valuable available asset 0.087187468
## Age (years) 0.259116153
## Concurrent Credits -0.007279305
## Type of apartment 0.115077459
## No of Credits at this Bank 0.125790651
## Occupation 0.101224870
## No of dependents 0.097192004
## Telephone 0.060518081
## Foreign Worker -0.022845318
## Instalment per cent Sex & Marital Status
## Account Balance -0.005279856 0.0432612798
## Duration of Credit (month) 0.074748816 0.0147893320
## Payment Status of Previous Credit 0.044374587 0.0421708809
## Purpose 0.048368947 0.0001565929
## Credit Amount -0.271322281 -0.0160943379
## Value Savings/Stocks 0.021992529 0.0173486885
## Length of current employment 0.126161307 0.1112782879
## Instalment per cent 1.000000000 0.1193079016
## Sex & Marital Status 0.119307902 1.0000000000
## Guarantors -0.011397639 0.0506338891
## Duration in Current address 0.049302371 -0.0272690320
## Most valuable available asset 0.053391413 -0.0069404770
## Age (years) 0.057270750 0.0051498271
## Concurrent Credits 0.007893967 -0.0267469446
## Type of apartment 0.091228577 0.0989338012
## No of Credits at this Bank 0.021668743 0.0646718729
## Occupation 0.097755393 -0.0119563566
## No of dependents -0.071206943 0.1221648450
## Telephone 0.014412880 0.0272748748
## Foreign Worker -0.094762307 0.0731034045
## Guarantors Duration in Current address
## Account Balance -0.127736563 -0.042233689
## Duration of Credit (month) -0.024489950 0.034067202
## Payment Status of Previous Credit -0.040675530 0.063197969
## Purpose -0.017606754 -0.038221345
## Credit Amount -0.027830917 0.028916676
## Value Savings/Stocks -0.105068513 0.091424109
## Length of current employment -0.008116008 0.245080745
## Instalment per cent -0.011397639 0.049302371
## Sex & Marital Status 0.050633889 -0.027269032
## Guarantors 1.000000000 -0.025677506
## Duration in Current address -0.025677506 1.000000000
## Most valuable available asset -0.155450138 0.147231116
## Age (years) -0.029825663 0.265626478
## Concurrent Credits -0.038235049 0.022654074
## Type of apartment -0.065449419 0.009989899
## No of Credits at this Bank -0.025446800 0.089625233
## Occupation -0.057962986 0.012654644
## No of dependents 0.020399584 0.042643426
## Telephone -0.075034578 0.095359367
## Foreign Worker 0.140190191 -0.039690633
## Most valuable available asset
## Account Balance -0.032260126
## Duration of Credit (month) 0.303971245
## Payment Status of Previous Credit -0.053776760
## Purpose 0.010966353
## Credit Amount 0.311602093
## Value Savings/Stocks 0.018948001
## Length of current employment 0.087187468
## Instalment per cent 0.053391413
## Sex & Marital Status -0.006940477
## Guarantors -0.155450138
## Duration in Current address 0.147231116
## Most valuable available asset 1.000000000
## Age (years) 0.074551454
## Concurrent Credits -0.107593324
## Type of apartment 0.342968580
## No of Credits at this Bank -0.007765020
## Occupation 0.276149365
## No of dependents 0.011871999
## Telephone 0.196801583
## Foreign Worker -0.132461796
## Age (years) Concurrent Credits
## Account Balance 0.0586307400 0.068273870
## Duration of Credit (month) -0.0375498629 -0.062883787
## Payment Status of Previous Credit 0.1463374687 0.159957065
## Purpose -0.0008923856 -0.100230393
## Credit Amount 0.0322726775 -0.069392010
## Value Savings/Stocks 0.0834335122 0.001907967
## Length of current employment 0.2591161527 -0.007279305
## Instalment per cent 0.0572707503 0.007893967
## Sex & Marital Status 0.0051498271 -0.026746945
## Guarantors -0.0298256629 -0.038235049
## Duration in Current address 0.2656264783 0.022654074
## Most valuable available asset 0.0745514538 -0.107593324
## Age (years) 1.0000000000 -0.030471934
## Concurrent Credits -0.0304719341 1.000000000
## Type of apartment 0.3033464109 -0.097397651
## No of Credits at this Bank 0.1507176513 -0.055809873
## Occupation 0.0153830309 0.006077318
## No of dependents 0.1185891829 -0.076890642
## Telephone 0.1435058472 -0.025139895
## Foreign Worker 0.0139811872 0.007699595
## Type of apartment
## Account Balance 0.023335309
## Duration of Credit (month) 0.153125556
## Payment Status of Previous Credit 0.061427919
## Purpose 0.013494697
## Credit Amount 0.133023634
## Value Savings/Stocks 0.006643819
## Length of current employment 0.115077459
## Instalment per cent 0.091228577
## Sex & Marital Status 0.098933801
## Guarantors -0.065449419
## Duration in Current address 0.009989899
## Most valuable available asset 0.342968580
## Age (years) 0.303346411
## Concurrent Credits -0.097397651
## Type of apartment 1.000000000
## No of Credits at this Bank 0.050019938
## Occupation 0.104243222
## No of dependents 0.115548584
## Telephone 0.100326589
## Foreign Worker -0.083336024
## No of Credits at this Bank Occupation
## Account Balance 0.07600514 0.040663061
## Duration of Credit (month) -0.01128360 0.210909735
## Payment Status of Previous Credit 0.43706577 0.010350179
## Purpose 0.05493536 0.008084776
## Credit Amount 0.02078528 0.285393073
## Value Savings/Stocks -0.02164413 0.011708920
## Length of current employment 0.12579065 0.101224870
## Instalment per cent 0.02166874 0.097755393
## Sex & Marital Status 0.06467187 -0.011956357
## Guarantors -0.02544680 -0.057962986
## Duration in Current address 0.08962523 0.012654644
## Most valuable available asset -0.00776502 0.276149365
## Age (years) 0.15071765 0.015383031
## Concurrent Credits -0.05580987 0.006077318
## Type of apartment 0.05001994 0.104243222
## No of Credits at this Bank 1.00000000 -0.026321269
## Occupation -0.02632127 1.000000000
## No of dependents 0.10966670 -0.093559276
## Telephone 0.06555321 0.383022159
## Foreign Worker -0.01889259 -0.092834959
## No of dependents Telephone
## Account Balance -0.01414543 0.06629583
## Duration of Credit (month) -0.02383448 0.16471821
## Payment Status of Previous Credit 0.01154955 0.05237019
## Purpose -0.03257687 0.07837054
## Credit Amount 0.01714358 0.27700018
## Value Savings/Stocks 0.02751379 0.08720840
## Length of current employment 0.09719200 0.06051808
## Instalment per cent -0.07120694 0.01441288
## Sex & Marital Status 0.12216485 0.02727487
## Guarantors 0.02039958 -0.07503458
## Duration in Current address 0.04264343 0.09535937
## Most valuable available asset 0.01187200 0.19680158
## Age (years) 0.11858918 0.14350585
## Concurrent Credits -0.07689064 -0.02513990
## Type of apartment 0.11554858 0.10032659
## No of Credits at this Bank 0.10966670 0.06555321
## Occupation -0.09355928 0.38302216
## No of dependents 1.00000000 -0.01475344
## Telephone -0.01475344 1.00000000
## Foreign Worker 0.07707085 -0.07501222
## Foreign Worker
## Account Balance -0.035186993
## Duration of Credit (month) -0.134679963
## Payment Status of Previous Credit 0.028554048
## Purpose -0.113243669
## Credit Amount -0.030661601
## Value Savings/Stocks 0.010449560
## Length of current employment -0.022845318
## Instalment per cent -0.094762307
## Sex & Marital Status 0.073103405
## Guarantors 0.140190191
## Duration in Current address -0.039690633
## Most valuable available asset -0.132461796
## Age (years) 0.013981187
## Concurrent Credits 0.007699595
## Type of apartment -0.083336024
## No of Credits at this Bank -0.018892588
## Occupation -0.092834959
## No of dependents 0.077070853
## Telephone -0.075012215
## Foreign Worker 1.000000000
##we want to find the variables that have a pairwise correlation coefficient above 0.3
library(caret)
highlycor <- findCorrelation(m, 0.30)
highlycor
## [1] 5 12 19 15 3
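#Remove the flagged columns and then subdivide into train and test sets
#Note: highlycor holds column positions in m, which omits Creditability, so the same positions in credit_rand2 fall one column to the left (the str() output below reflects the code as run)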
filteredData <- credit_rand2[, -(highlycor)]
#Model split between train and test
str(filteredData)
## 'data.frame': 1000 obs. of 16 variables:
## $ Creditability : Factor w/ 2 levels "0","1": 2 2 1 2 2 1 1 2 2 2 ...
## $ Account Balance : num 2 2 1 4 2 4 2 3 1 4 ...
## $ Payment Status of Previous Credit: num 4 4 2 2 2 2 2 2 2 4 ...
## $ Credit Amount : num 5954 7758 1371 1950 841 ...
## $ Value Savings/Stocks : num 1 4 5 1 2 3 5 1 2 1 ...
## $ Length of current employment : num 4 5 3 4 4 5 3 2 4 4 ...
## $ Instalment per cent : num 2 2 4 4 2 3 4 1 4 2 ...
## $ Sex & Marital Status : num 2 2 2 3 2 3 2 2 3 3 ...
## $ Guarantors : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Most valuable available asset : num 1 4 1 3 1 3 2 2 3 1 ...
## $ Age (years) : num 41 29 25 34 23 35 33 45 30 49 ...
## $ Type of apartment : num 2 1 1 2 1 2 2 2 2 2 ...
## $ No of Credits at this Bank : num 2 1 1 2 1 2 1 1 1 1 ...
## $ Occupation : num 2 3 3 3 2 3 3 2 4 2 ...
## $ Telephone : num 1 1 1 2 1 2 1 1 2 1 ...
## $ Foreign Worker : num 1 1 1 1 1 1 1 1 1 1 ...
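The positions returned by `findCorrelation` count columns of `m`, which was built without the class variable. The sketch below uses only the column names and the index vector printed above to show which predictors those positions name, and which columns the same positions select in a data frame that still has Creditability in column 1. The helper names (`predictors`, `flagged`, `dropped_by_position`, `kept`) are hypothetical.

```r
# Predictor names in the order they appear in m (copied from the str() output)
predictors <- c(
  "Account Balance", "Duration of Credit (month)",
  "Payment Status of Previous Credit", "Purpose", "Credit Amount",
  "Value Savings/Stocks", "Length of current employment",
  "Instalment per cent", "Sex & Marital Status", "Guarantors",
  "Duration in Current address", "Most valuable available asset",
  "Age (years)", "Concurrent Credits", "Type of apartment",
  "No of Credits at this Bank", "Occupation", "No of dependents",
  "Telephone", "Foreign Worker"
)

# Index vector printed by findCorrelation above
highlycor <- c(5, 12, 19, 15, 3)

# Predictors that the positions refer to within m
flagged <- predictors[highlycor]
print(flagged)

# Columns picked by the same positions when the class variable comes first
all_columns <- c("Creditability", predictors)
dropped_by_position <- all_columns[highlycor]
print(dropped_by_position)

# Dropping by name sidesteps the one-column shift
kept <- setdiff(all_columns, flagged)
print(kept)
```

On the real data, one way to drop by name would be `credit_rand2[, !(names(credit_rand2) %in% colnames(m)[highlycor])]`; that variant was not run for this lab, so its accuracy is not reported here.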
filteredTraining <- filteredData[1:750, ]
filteredTest <- filteredData[751:1000, ]
##Train the model
library(naivebayes)
nb_model <- naive_bayes(Creditability ~ ., data=filteredTraining)
## Evaluate the model
filteredTestPred <- predict(nb_model, newdata = filteredTest)
table(filteredTestPred, filteredTest$Creditability)
##
## filteredTestPred 0 1
## 0 52 47
## 1 18 133
(conf_nat <- table(filteredTestPred, filteredTest$Creditability))
##
## filteredTestPred 0 1
## 0 52 47
## 1 18 133
(Accuracy <- sum(diag(conf_nat))/sum(conf_nat)*100)
## [1] 74
From the results above, the accuracy of the model dropped to 74%. This shows that the feature selection did not improve the performance of the model.
Conclusion:
With the re-randomized data and the reduced feature set we got a lower accuracy of 74% (down from 77.2%).
Taking Creditability = 1 as the positive class, the second confusion matrix has 133 true positives, 52 true negatives, 47 false negatives, and 18 false positives.
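Performance did not improve. With only 250 test records, each prediction moves accuracy by 0.4 percentage points, so part of the change may come from the new random split rather than from feature selection; the approach may still pay off on a larger data set.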
To improve accuracy, the false negative and false positive counts should be as small as possible. For example, if both are zero or close to zero, accuracy will be 100% or close to it.
Equivalently, true positives and true negatives should significantly outweigh false positives and false negatives to reach a higher accuracy level.
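The point about the small data set can be made more concrete. As a rough sketch, the code below computes an exact binomial confidence interval for the second model's accuracy from the counts in its confusion matrix (52 + 133 correct out of 250). Treating the test records as independent draws is an assumption, and the interval is only a gauge of how much an accuracy figure can vary on a test set this small.

```r
# Counts taken from the second confusion matrix above
correct <- 52 + 133
n_test  <- 250

# Exact (Clopper-Pearson) 95% confidence interval for the underlying accuracy
ci <- binom.test(correct, n_test)$conf.int

print(correct / n_test)
print(round(ci, 3))
```

If the earlier accuracy of 77.2% falls inside this interval, the gap between the two models is within what sampling variation alone could produce.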