Universal Bank is a relatively young bank growing rapidly in terms of overall customer acquisition. The majority of these customers are liability customers (depositors) with varying sizes of relationship with the bank. The customer base of asset customers (borrowers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business. In particular, it wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise smarter campaigns with better target marketing. The goal is to use k-NN to predict whether a new customer will accept a loan offer. This will serve as the basis for the design of a new campaign.

#Load the required packages and read the csv file
library(dummies); library(dplyr)   #dummy.data.frame(), select()
library(caret); library(class)     #preProcess(), confusionMatrix(), knn()
mydata <- read.csv("UniversalBank.csv")

#Now let's look at the summary table
summary(mydata)
       ID            Age          Experience       Income          ZIP.Code    
 Min.   :   1   Min.   :23.00   Min.   :-3.0   Min.   :  8.00   Min.   : 9307  
 1st Qu.:1251   1st Qu.:35.00   1st Qu.:10.0   1st Qu.: 39.00   1st Qu.:91911  
 Median :2500   Median :45.00   Median :20.0   Median : 64.00   Median :93437  
 Mean   :2500   Mean   :45.34   Mean   :20.1   Mean   : 73.77   Mean   :93152  
 3rd Qu.:3750   3rd Qu.:55.00   3rd Qu.:30.0   3rd Qu.: 98.00   3rd Qu.:94608  
 Max.   :5000   Max.   :67.00   Max.   :43.0   Max.   :224.00   Max.   :96651  
     Family          CCAvg          Education        Mortgage    
 Min.   :1.000   Min.   : 0.000   Min.   :1.000   Min.   :  0.0  
 1st Qu.:1.000   1st Qu.: 0.700   1st Qu.:1.000   1st Qu.:  0.0  
 Median :2.000   Median : 1.500   Median :2.000   Median :  0.0  
 Mean   :2.396   Mean   : 1.938   Mean   :1.881   Mean   : 56.5  
 3rd Qu.:3.000   3rd Qu.: 2.500   3rd Qu.:3.000   3rd Qu.:101.0  
 Max.   :4.000   Max.   :10.000   Max.   :3.000   Max.   :635.0  
 Personal.Loan   Securities.Account   CD.Account         Online      
 Min.   :0.000   Min.   :0.0000     Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.000   1st Qu.:0.0000     1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.000   Median :0.0000     Median :0.0000   Median :1.0000  
 Mean   :0.096   Mean   :0.1044     Mean   :0.0604   Mean   :0.5968  
 3rd Qu.:0.000   3rd Qu.:0.0000     3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.000   Max.   :1.0000     Max.   :1.0000   Max.   :1.0000  
   CreditCard   
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.294  
 3rd Qu.:1.000  
 Max.   :1.000  
#Let's look at the internal structure of our dataset `mydata`
str(mydata)
'data.frame':   5000 obs. of  14 variables:
 $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Age               : int  25 45 39 35 35 37 53 50 35 34 ...
 $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
 $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
 $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
 $ Family            : int  4 3 1 1 4 4 2 1 3 1 ...
 $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
 $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
 $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
 $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
 $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
 $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
 $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...

a. Consider the following customer: Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1. Perform a k-NN classification with all predictors except ID and ZIP code using k = 1. Remember to transform categorical predictors with more than two categories into dummy variables first. Specify the success class as 1 (loan acceptance), and use the default cutoff value of 0.5. How would this customer be classified?

#Converting Education to a factor 
mydata$Education = as.factor(mydata$Education)

#We want to remove ID and ZIP code from our dataset. We also transform the categorical predictor Education, which has more than two categories, into dummy variables.
mydata_dummy = dummy.data.frame(select(mydata,-c(ZIP.Code,ID)))
head(mydata_dummy)
  Age Experience Income Family CCAvg Education1 Education2 Education3 Mortgage
1  25          1     49      4   1.6          1          0          0        0
2  45         19     34      3   1.5          1          0          0        0
3  39         15     11      1   1.0          1          0          0        0
4  35          9    100      1   2.7          0          1          0        0
5  35          8     45      4   1.0          0          1          0        0
6  37         13     29      4   0.4          0          1          0      155
  Personal.Loan Securities.Account CD.Account Online CreditCard
1             0                  1          0      0          0
2             0                  1          0      0          0
3             0                  0          0      0          0
4             0                  0          0      0          0
5             0                  0          0      0          1
6             0                  0          0      1          0
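Note that the dummies package, which provides dummy.data.frame(), has been archived from CRAN. If it is unavailable, an equivalent dummy coding can be produced with caret's dummyVars(); the sketch below assumes the resulting Education.1/Education.2/Education.3 columns are renamed to match the names used here, and the column order (in particular Personal.Loan as column 10) should be checked before reusing the positional indexing that follows.

#Alternative dummy coding with caret::dummyVars (a sketch; verify column order before use)
dmy <- dummyVars(~ ., data = select(mydata, -c(ZIP.Code, ID)))
mydata_dummy <- as.data.frame(predict(dmy, newdata = select(mydata, -c(ZIP.Code, ID))))
#dummyVars names the dummies Education.1, Education.2, Education.3; rename to match the text
names(mydata_dummy) <- sub("Education\\.", "Education", names(mydata_dummy))
head(mydata_dummy)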
#Converting Personal.Loan to a factor in the dataset 'mydata_dummy'
mydata_dummy$Personal.Loan = as.factor(mydata_dummy$Personal.Loan)

#Setting the random seed before we partition the data
set.seed(3.14)

#We divide the data into training (60%) and validation (40%) sets.
train.index <- sample(row.names(mydata_dummy), 0.6*dim(mydata_dummy)[1]) 
test.index <- setdiff(row.names(mydata_dummy), train.index) 
train_data <- mydata_dummy[train.index, ]
valid_data <- mydata_dummy[test.index, ]

#Classifying the given customer
Given_CusData = data.frame(Age=40 , Experience=10, Income = 84, Family = 2, CCAvg = 2, Education1 = 0, Education2 = 1, Education3 = 0, Mortgage = 0, Securities.Account = 0, CD.Account = 0, Online = 1,  CreditCard = 1, stringsAsFactors = FALSE)
Given_CusData
  Age Experience Income Family CCAvg Education1 Education2 Education3 Mortgage
1  40         10     84      2     2          0          1          0        0
  Securities.Account CD.Account Online CreditCard
1                  0          0      1          1
#Normalize the predictors by centering and scaling, using the training data only
#(column 10 is the outcome Personal.Loan, so it is excluded)
norm.values <- preProcess(train_data[, -c(10)], method=c("center", "scale"))
train_data[, -c(10)] <- predict(norm.values, train_data[, -c(10)])
valid_data[, -c(10)] <- predict(norm.values, valid_data[, -c(10)])
new.df <- predict(norm.values, Given_CusData)

#k-NN classification of the new customer with k = 1, as required in part (a)
knn.1 <- knn(train = train_data[,-c(10)], test = new.df, cl = train_data[,10], k=1, prob=TRUE)
knn.attributes <- attributes(knn.1)

knn.attributes[1]
$levels
[1] "0" "1"
knn.attributes[3]
$prob
[1] 1
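The knn() function from the class package classifies by majority vote among the k nearest neighbors, and the "prob" attribute stores the proportion of neighbors that voted for the winning class. To make the 0.5 cutoff from the question explicit, the estimated probability of the success class (1) can be recovered from that attribute and compared to the cutoff; a minimal sketch:

#Recover P(Personal.Loan = 1) from the "prob" attribute (the winning-class vote share)
#and apply the 0.5 cutoff from the question
prob.win <- attr(knn.1, "prob")
prob.1 <- ifelse(knn.1 == "1", prob.win, 1 - prob.win)
ifelse(prob.1 >= 0.5, "accept loan (1)", "reject loan (0)")

With k = 1 the single nearest neighbor decides the class outright, so prob.1 is either 0 or 1 and the cutoff does not change the majority-vote result.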

b. What is a choice of k that balances between overfitting and ignoring the predictor information?

The choice of k that balances between overfitting (too small a k) and ignoring the predictor information (too large a k) is k = 3, which gives the highest validation accuracy (0.9635, tied with k = 5) in the search below.

#Search k = 1 to 14 and record the validation accuracy for each value
my_accurateChoice <- data.frame(k = seq(1, 14, 1), accuracy = rep(0, 14))

for(i in 1:14) {
  test1 <- knn(train = train_data[,-10],test = valid_data[,-10], cl = train_data[,10], k=i, prob=TRUE)
  my_accurateChoice[i, 2] <- confusionMatrix(test1, valid_data[,10])$overall[1]
}
my_accurateChoice
    k accuracy
1   1   0.9585
2   2   0.9565
3   3   0.9635
4   4   0.9600
5   5   0.9635
6   6   0.9615
7   7   0.9605
8   8   0.9590
9   9   0.9560
10 10   0.9565
11 11   0.9555
12 12   0.9545
13 13   0.9540
14 14   0.9530
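A quick plot of the validation accuracy against k makes the trade-off easier to see; a minimal base-R sketch using the my_accurateChoice table built above:

#Visualize validation accuracy as a function of k
plot(my_accurateChoice$k, my_accurateChoice$accuracy, type = "b",
     xlab = "k", ylab = "Validation accuracy", main = "Choosing k for k-NN")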

c. Show the confusion matrix for the validation data that results from using the best k.

#Score the validation data with the best k = 3
test2 <- knn(train = train_data[,-10],test = valid_data[,-10], cl = train_data[,10], k=3, prob=TRUE)
confusionMatrix(test2, valid_data[,10])
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1794   62
         1   11  133
                                          
               Accuracy : 0.9635          
                 95% CI : (0.9543, 0.9713)
    No Information Rate : 0.9025          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7652          
                                          
 Mcnemar's Test P-Value : 4.855e-09       
                                          
            Sensitivity : 0.9939          
            Specificity : 0.6821          
         Pos Pred Value : 0.9666          
         Neg Pred Value : 0.9236          
             Prevalence : 0.9025          
         Detection Rate : 0.8970          
   Detection Prevalence : 0.9280          
      Balanced Accuracy : 0.8380          
                                          
       'Positive' Class : 0               
                                          

d. Consider the following customer: Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1. Classify the customer using the best k.

#Same customer as in part (a): use dummy names matching the training data and normalize with the same preProcess object before scoring
Given_CusData2 <- data.frame(Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education1 = 0, Education2 = 1, Education3 = 0, Mortgage = 0, Securities.Account = 0, CD.Account = 0, Online = 1, CreditCard = 1)
Given_CusData2 <- predict(norm.values, Given_CusData2)
my_knn <- knn(train = train_data[,-10], test = Given_CusData2, cl = train_data[,10], k=3, prob=TRUE)
my_knn
[1] 1
attr(,"prob")
[1] 1
Levels: 0 1

e. Repartition the data, this time into training, validation, and test sets (50% : 30% : 20%). Apply the k-NN method with the k chosen above. Compare the confusion matrix of the test set with that of the training and validation sets. Comment on the differences and their reason.

set.seed(3.14)

#Repartition into training (50%), validation (30%), and test (20%) sets
train.index <- sample(rownames(mydata_dummy), 0.5*dim(mydata_dummy)[1])
valid.index <- sample(setdiff(rownames(mydata_dummy),train.index), 0.3*dim(mydata_dummy)[1])
test.index = setdiff(rownames(mydata_dummy), union(train.index, valid.index))

train_data<- mydata_dummy[train.index, ]
valid_data <- mydata_dummy[valid.index, ]
test_data <- mydata_dummy[test.index, ]

#Re-normalize, centering and scaling with the new training partition only
norm.values <- preProcess(train_data[, -c(10)], method=c("center", "scale"))
train_data[, -c(10)] <- predict(norm.values, train_data[, -c(10)])
valid_data[, -c(10)] <- predict(norm.values, valid_data[, -c(10)])
test_data[,-c(10)] <- predict(norm.values, test_data[,-c(10)])

#Score the test, validation, and training sets with k = 3
test_data1 <- knn(train = train_data[,-c(10)],test = test_data[,-c(10)], cl = train_data[,10], k=3, prob=TRUE)
valid_data1 <- knn(train = train_data[,-c(10)],test = valid_data[,-c(10)], cl = train_data[,10], k=3, prob=TRUE)
train_data1 <- knn(train = train_data[,-c(10)],test = train_data[,-c(10)], cl = train_data[,10], k=3, prob=TRUE)

confusionMatrix(test_data1, test_data[,10])
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 892  34
         1   7  67
                                          
               Accuracy : 0.959           
                 95% CI : (0.9448, 0.9704)
    No Information Rate : 0.899           
    P-Value [Acc > NIR] : 1.416e-12       
                                          
                  Kappa : 0.7438          
                                          
 Mcnemar's Test P-Value : 4.896e-05       
                                          
            Sensitivity : 0.9922          
            Specificity : 0.6634          
         Pos Pred Value : 0.9633          
         Neg Pred Value : 0.9054          
             Prevalence : 0.8990          
         Detection Rate : 0.8920          
   Detection Prevalence : 0.9260          
      Balanced Accuracy : 0.8278          
                                          
       'Positive' Class : 0               
                                          
confusionMatrix(valid_data1, valid_data[,10])
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1345   49
         1    7   99
                                          
               Accuracy : 0.9627          
                 95% CI : (0.9518, 0.9717)
    No Information Rate : 0.9013          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7597          
                                          
 Mcnemar's Test P-Value : 4.281e-08       
                                          
            Sensitivity : 0.9948          
            Specificity : 0.6689          
         Pos Pred Value : 0.9648          
         Neg Pred Value : 0.9340          
             Prevalence : 0.9013          
         Detection Rate : 0.8967          
   Detection Prevalence : 0.9293          
      Balanced Accuracy : 0.8319          
                                          
       'Positive' Class : 0               
                                          
confusionMatrix(train_data1, train_data[,10])
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2266   54
         1    3  177
                                          
               Accuracy : 0.9772          
                 95% CI : (0.9706, 0.9827)
    No Information Rate : 0.9076          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8491          
                                          
 Mcnemar's Test P-Value : 3.528e-11       
                                          
            Sensitivity : 0.9987          
            Specificity : 0.7662          
         Pos Pred Value : 0.9767          
         Neg Pred Value : 0.9833          
             Prevalence : 0.9076          
         Detection Rate : 0.9064          
   Detection Prevalence : 0.9280          
      Balanced Accuracy : 0.8825          
                                          
       'Positive' Class : 0

Comparing the three matrices, accuracy is highest on the training set (0.9772) and lower, but very similar, on the validation (0.9627) and test (0.959) sets. This is expected: each training record is effectively compared against itself (it sits among its own nearest neighbors), so training performance is optimistically biased, while the validation and test partitions contain records the classifier has not seen and therefore give a more honest estimate of how the classifier will perform on new customers.
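As a compact way to line up the three partitions, the overall accuracies can also be pulled directly out of the caret confusion-matrix objects; a small sketch reusing the predictions above:

#Collect the overall accuracy of each partition in one table
data.frame(partition = c("training", "validation", "test"),
           accuracy = c(confusionMatrix(train_data1, train_data[,10])$overall["Accuracy"],
                        confusionMatrix(valid_data1, valid_data[,10])$overall["Accuracy"],
                        confusionMatrix(test_data1, test_data[,10])$overall["Accuracy"]))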