#Introduction and Conclusion
Predicting marital status using k-nearest neighbors on the bank.csv dataset from the UCI website.
k-nearest neighbors (KNN) is a simple algorithm that is commonly used for classification. KNN is non-parametric, which means that it makes no assumptions about the underlying distribution of the data. KNN is also called a “lazy” learner: it does not build a model ahead of time but stores the training data and only generalizes from it once a query is made. For each query, the stored data points closest to the query are looked up and their labels are used for the prediction.

To prepare the data for KNN, we first normalize it. We create a normalization function and apply it with lapply, which applies the function to each element of a list (here, each column of the data frame) and returns a list of the same length, so each result corresponds to one column of the original data. After normalization, we split the data into a training set and a test set. In knn(), the train argument is the data frame of training set cases and the test argument is the data frame of test set cases; if test is given as a vector, it is interpreted as a row vector for a single case. The vector of training labels must have the same length as the number of training rows, otherwise the function will not run.

When we test different k values, we choose the number of neighbors that we want to consider for each prediction. By comparing different values, we can see which number of neighbors works best. For the bank dataset, we used the k values 5, 21, and 59. We used 59 because it is good practice to start with the square root of the number of training cases, here 3521. The k = 59 run used the min-max normalized data, while the k = 5 and k = 21 runs used the z-score standardized data.

After building the k-nearest neighbors classifiers, we consider k = 5 to be the best. In the cross tables, k = 5 is the only setting that yields any divorced predictions (41 of 1000 test cases, 5 of them correct). The other settings never predict divorced, which is bad because divorced is one of the three possible outcomes. This comes at a cost in overall accuracy: 599 of 1000 test cases are classified correctly with k = 5, compared with 638 for k = 21 and 645 for k = 59.
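To make the “lazy” behaviour concrete, here is a minimal sketch of a k-nearest neighbors prediction written in base R. It is only an illustration and not how class::knn is implemented; the names knn_predict_one, train_x, train_y and query are hypothetical, and the toy data is invented for the example. Nothing is fitted in advance: the training rows are simply kept, and the distances are computed when a query arrives.

```r
# Minimal k-NN sketch in base R (illustrative; not the class::knn implementation).
# knn_predict_one, train_x, train_y and query are hypothetical names.
knn_predict_one <- function(train_x, train_y, query, k) {
  stopifnot(nrow(train_x) == length(train_y))  # one label per training row
  # Euclidean distance from the query to every stored training row
  diffs <- sweep(as.matrix(train_x), 2, query)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows, then a majority vote
  nearest <- train_y[order(dists)[seq_len(k)]]
  names(which.max(table(nearest)))
}

# Toy data: two features already on a 0-1 scale, two well-separated classes
train_x <- data.frame(a = c(0.10, 0.20, 0.15, 0.90, 0.80, 0.85),
                      b = c(0.10, 0.15, 0.20, 0.90, 0.85, 0.80))
train_y <- c("single", "single", "single", "married", "married", "married")

knn_predict_one(train_x, train_y, query = c(0.12, 0.18), k = 3)  # "single"
knn_predict_one(train_x, train_y, query = c(0.88, 0.82), k = 3)  # "married"

# Square-root rule of thumb for a starting k with 3521 training cases
round(sqrt(3521))  # 59
```

In this sketch a tied vote goes to the first label in alphabetical order, whereas class::knn breaks ties at random, so the two can disagree on borderline cases.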
#The Five Steps and Appendix
Step 1: Reading in the data
BankData <- read.csv2("bank.csv", sep=";", stringsAsFactors = FALSE)
str(BankData)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 33 35 30 59 35 36 39 41 43 ...
$ job : chr "unemployed" "services" "management" "management" ...
$ marital : chr "married" "married" "single" "married" ...
$ education: chr "primary" "secondary" "tertiary" "tertiary" ...
$ default : chr "no" "no" "no" "no" ...
$ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
$ housing : chr "no" "yes" "yes" "yes" ...
$ loan : chr "no" "yes" "no" "yes" ...
$ contact : chr "cellular" "cellular" "cellular" "unknown" ...
$ day : int 19 11 16 3 5 23 14 6 14 17 ...
$ month : chr "oct" "may" "apr" "jun" ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
$ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
$ previous : int 0 4 1 0 0 3 2 0 0 2 ...
$ poutcome : chr "unknown" "failure" "failure" "unknown" ...
$ y : chr "no" "no" "no" "no" ...
table(BankData$marital)
divorced married single
528 2797 1196
library(purrr)
library(dplyr)
BankData2<- BankData %>% keep(is.numeric) %>% cbind(BankData$marital) %>% rename(marital=`BankData$marital`)
Create a normalization function and test it. Both results should be identical.
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
normalize(c(1, 2, 3, 4, 5))
[1] 0.00 0.25 0.50 0.75 1.00
normalize(c(10, 20, 30, 40, 50))
[1] 0.00 0.25 0.50 0.75 1.00
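The normalize() function above divides by max(x) - min(x), so it would return NaN for a column in which every value is the same, and NA for every element if the column contains a missing value. The bank data does not appear to trigger either case, but a more defensive variant could look like the sketch below. The name normalize_safe is hypothetical and not part of the original analysis.

```r
# Hypothetical variant of normalize() that tolerates NA values and constant columns.
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)  # c(min, max), ignoring missing values
  if (rng[1] == rng[2]) {
    # Constant column: max - min is 0, so return zeros instead of NaN
    out <- rep(0, length(x))
    out[is.na(x)] <- NA
    return(out)
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

normalize_safe(c(1, 2, 3, 4, 5))   # 0.00 0.25 0.50 0.75 1.00
normalize_safe(c(7, 7, 7))         # 0 0 0
normalize_safe(c(10, NA, 30))      # 0 NA 1
```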
normalize the bank data and confirm that normalization worked
bank_n <- as.data.frame(lapply(BankData2[ ,1:7], normalize))
summary(bank_n)
age balance day duration campaign pdays previous
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.2059 1st Qu.:0.04540 1st Qu.:0.2667 1st Qu.:0.03310 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
Median :0.2941 Median :0.05043 Median :0.5000 Median :0.05991 Median :0.02041 Median :0.00000 Median :0.0000
Mean :0.3260 Mean :0.06356 Mean :0.4972 Mean :0.08605 Mean :0.03660 Mean :0.04675 Mean :0.0217
3rd Qu.:0.4412 3rd Qu.:0.06433 3rd Qu.:0.6667 3rd Qu.:0.10758 3rd Qu.:0.04082 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000
bank_train <- bank_n[1:3521, ]
bank_test <- bank_n[3522:4521, ]
bank_train_labels <- BankData2[1:3521, 8]
bank_test_labels <- BankData2[3522:4521, 8]
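The split above takes the first 3521 rows for training and the last 1000 for testing, which is only safe if the rows of the file are in random order. If that is in doubt, a random split is a common alternative. The sketch below uses invented stand-in data (features, labels) so that it runs on its own; in the report, bank_n and the marital column would take their place. The seed value is an arbitrary assumption.

```r
set.seed(123)  # assumed seed, used only to make the sketch reproducible

# Stand-in for bank_n and the marital labels so the sketch runs on its own
n <- 4521
features <- data.frame(x1 = runif(n), x2 = runif(n))
labels <- sample(c("divorced", "married", "single"), n, replace = TRUE)

# Draw 3521 random row indices for training; the remaining 1000 rows form the test set
train_idx <- sample(n, 3521)
train_x <- features[train_idx, ]
test_x  <- features[-train_idx, ]
train_y <- labels[train_idx]
test_y  <- labels[-train_idx]

# knn() needs exactly one label per training row
stopifnot(nrow(train_x) == length(train_y),
          nrow(test_x) == length(test_y))
c(train = nrow(train_x), test = nrow(test_x))
```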
library(class)
bank_test_pred <- knn(train = bank_train, test = bank_test,
cl = bank_train_labels, k = 59)
head(bank_test_pred)
[1] married married married married single married
Levels: divorced married single
Create the cross tabulation of predicted vs. actual
library(gmodels)
CrossTable(x = bank_test_labels, y = bank_test_pred,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 1000
| bank_test_pred
bank_test_labels | married | single | Row Total |
-----------------|-----------|-----------|-----------|
divorced | 121 | 4 | 125 |
| 0.968 | 0.032 | 0.125 |
| 0.137 | 0.034 | |
| 0.121 | 0.004 | |
-----------------|-----------|-----------|-----------|
married | 571 | 40 | 611 |
| 0.935 | 0.065 | 0.611 |
| 0.647 | 0.339 | |
| 0.571 | 0.040 | |
-----------------|-----------|-----------|-----------|
single | 190 | 74 | 264 |
| 0.720 | 0.280 | 0.264 |
| 0.215 | 0.627 | |
| 0.190 | 0.074 | |
-----------------|-----------|-----------|-----------|
Column Total | 882 | 118 | 1000 |
| 0.882 | 0.118 | |
-----------------|-----------|-----------|-----------|
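The cross table can be summarized with two numbers per run: overall accuracy and per-class recall. The sketch below copies the counts from the k = 59 table above into a matrix; the variable names (conf_k59, accuracy, recall) are chosen for illustration. Because no test case was predicted as divorced, that column is filled with zeros.

```r
# Counts copied from the k = 59 cross table (rows = actual, columns = predicted).
# No case was predicted "divorced", so that column is all zeros.
classes <- c("divorced", "married", "single")
conf_k59 <- matrix(c(0, 121,  4,
                     0, 571, 40,
                     0, 190, 74),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = classes, predicted = classes))

accuracy <- sum(diag(conf_k59)) / sum(conf_k59)  # share of correct predictions
recall   <- diag(conf_k59) / rowSums(conf_k59)   # share of each actual class recovered

accuracy          # 0.645
round(recall, 3)  # divorced 0.000, married 0.935, single 0.280
```

The recall values match the "N / Row Total" entries on the diagonal of the cross table, which is a quick way to check that the counts were copied correctly.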
use the scale() function to z-score standardize a data frame and confirm that the transformation was applied correctly
bank_z <- as.data.frame(scale(BankData2[-8]))
summary(bank_z)
age balance day duration campaign pdays previous
Min. :-2.0962 Min. :-1.57350 Min. :-1.80842 Min. :-1.0004 Min. :-0.57677 Min. :-0.4072 Min. :-0.3204
1st Qu.:-0.7725 1st Qu.:-0.44977 1st Qu.:-0.83845 1st Qu.:-0.6156 1st Qu.:-0.57677 1st Qu.:-0.4072 1st Qu.:-0.3204
Median :-0.2052 Median :-0.32517 Median : 0.01027 Median :-0.3039 Median :-0.25520 Median :-0.4072 Median :-0.3204
Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.7403 3rd Qu.: 0.01905 3rd Qu.: 0.61650 3rd Qu.: 0.2503 3rd Qu.: 0.06636 3rd Qu.:-0.4072 3rd Qu.:-0.3204
Max. : 4.3333 Max. :23.18064 Max. : 1.82897 Max. :10.6252 Max. :15.17984 Max. : 8.3023 Max. :14.4414
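As a cross-check on what scale() does, the z-score can also be computed by hand as (x - mean(x)) / sd(x). The sketch below uses the first ten age values printed by str(BankData) as a small sample; the variable names are illustrative.

```r
x <- c(30, 33, 35, 30, 59, 35, 36, 39, 41, 43)  # first ten age values from str(BankData)

z_manual <- (x - mean(x)) / sd(x)  # z-score computed by hand
z_scale  <- as.numeric(scale(x))   # scale() returns a one-column matrix

all.equal(z_manual, z_scale)  # TRUE
round(mean(z_manual), 10)     # 0: the values are centred
sd(z_manual)                  # 1: unit standard deviation
```

Unlike min-max normalization, the standardized values are not confined to the range 0 to 1, which is why the summary above shows maxima such as 23.18 for balance.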
create training and test datasets
bank_train2 <- bank_z[1:3521, ]
bank_test2 <- bank_z[3522:4521, ]
re-classify test cases
bank_test_pred2 <- knn(train = bank_train2, test = bank_test2,
cl = bank_train_labels, k = 5)
head(bank_test_pred2)
[1] married married single married married married
Levels: divorced married single
Create the cross tabulation of predicted vs. actual
CrossTable(x = bank_test_labels, y = bank_test_pred2,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 1000
| bank_test_pred2
bank_test_labels | divorced | married | single | Row Total |
-----------------|-----------|-----------|-----------|-----------|
divorced | 5 | 108 | 12 | 125 |
| 0.040 | 0.864 | 0.096 | 0.125 |
| 0.122 | 0.141 | 0.062 | |
| 0.005 | 0.108 | 0.012 | |
-----------------|-----------|-----------|-----------|-----------|
married | 30 | 496 | 85 | 611 |
| 0.049 | 0.812 | 0.139 | 0.611 |
| 0.732 | 0.649 | 0.436 | |
| 0.030 | 0.496 | 0.085 | |
-----------------|-----------|-----------|-----------|-----------|
single | 6 | 160 | 98 | 264 |
| 0.023 | 0.606 | 0.371 | 0.264 |
| 0.146 | 0.209 | 0.503 | |
| 0.006 | 0.160 | 0.098 | |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 41 | 764 | 195 | 1000 |
| 0.041 | 0.764 | 0.195 | |
-----------------|-----------|-----------|-----------|-----------|
bank_test_pred3 <- knn(train = bank_train2, test = bank_test2,
cl = bank_train_labels, k = 21)
head(bank_test_pred3)
Create the cross tabulation of predicted vs. actual
CrossTable(x = bank_test_labels, y = bank_test_pred3,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 1000
| bank_test_pred3
bank_test_labels | married | single | Row Total |
-----------------|-----------|-----------|-----------|
divorced | 115 | 10 | 125 |
| 0.920 | 0.080 | 0.125 |
| 0.134 | 0.071 | |
| 0.115 | 0.010 | |
-----------------|-----------|-----------|-----------|
married | 559 | 52 | 611 |
| 0.915 | 0.085 | 0.611 |
| 0.651 | 0.369 | |
| 0.559 | 0.052 | |
-----------------|-----------|-----------|-----------|
single | 185 | 79 | 264 |
| 0.701 | 0.299 | 0.264 |
| 0.215 | 0.560 | |
| 0.185 | 0.079 | |
-----------------|-----------|-----------|-----------|
Column Total | 859 | 141 | 1000 |
| 0.859 | 0.141 | |
-----------------|-----------|-----------|-----------|
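Rather than repeating the knn() call for each k, the candidate values can be compared in a loop. The sketch below is an assumption about how that might look, not the code used for this report. It uses the built-in iris data as a stand-in so that it runs without bank.csv; in the report, bank_train2, bank_test2, bank_train_labels and bank_test_labels would be substituted. The seed and the split size of 100 are arbitrary.

```r
library(class)

set.seed(42)  # assumed seed; knn() breaks ties at random
data(iris)    # stand-in dataset, not the bank data

# Standardize the four numeric columns, then split 100 / 50 at random
x <- as.data.frame(scale(iris[, 1:4]))
idx <- sample(nrow(iris), 100)
train_x <- x[idx, ]
test_x  <- x[-idx, ]
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

# Same candidate k values as in the report
k_values <- c(5, 21, 59)
results <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)  # overall accuracy for this k
})
names(results) <- paste0("k=", k_values)
results
```

Accuracy alone would not have flagged the missing divorced predictions discussed in the introduction, so a per-class table is still worth inspecting for the chosen k.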