Question 3.1

Q: Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier: a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional)

My answer

I decided to try the caret package instead of the kknn package we learned about in the previous assignment; ultimately the process feels about the same. In 3.1a) I run cross-validation on a k-nearest-neighbors model after a train-test split, and in 3.1b) I use a train-validate-test split with an SVM.

3.1a)

Train-test split only

Exploring the dataset

# clear env vars
rm(list=ls())

# Load the caret and kknn libraries for k-fold cross-validation of the k-nearest-neighbours model
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(kknn)
## 
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
## 
##     contr.dummy
# Reading the data
data <- read.table("C:/Users/keith/OneDrive/Desktop/credit_card_data.txt",stringsAsFactors=FALSE,header=FALSE)
head(data)
##   V1    V2    V3   V4 V5 V6 V7 V8  V9 V10 V11
## 1  1 30.83 0.000 1.25  1  0  1  1 202   0   1
## 2  0 58.67 4.460 3.04  1  0  6  1  43 560   1
## 3  0 24.50 0.500 1.50  1  1  0  1 280 824   1
## 4  1 27.83 1.540 3.75  1  0  5  0 100   3   1
## 5  1 20.17 5.625 1.71  1  1  0  1 120   0   1
## 6  1 32.08 4.000 2.50  1  1  0  0 360   0   1
# Explore the data before we do the train-test split
summary(data)
##        V1               V2              V3               V4        
##  Min.   :0.0000   Min.   :13.75   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.:22.58   1st Qu.: 1.040   1st Qu.: 0.165  
##  Median :1.0000   Median :28.46   Median : 2.855   Median : 1.000  
##  Mean   :0.6896   Mean   :31.58   Mean   : 4.831   Mean   : 2.242  
##  3rd Qu.:1.0000   3rd Qu.:38.25   3rd Qu.: 7.438   3rd Qu.: 2.615  
##  Max.   :1.0000   Max.   :80.25   Max.   :28.000   Max.   :28.500  
##        V5               V6               V7               V8        
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median : 0.000   Median :1.0000  
##  Mean   :0.5352   Mean   :0.5612   Mean   : 2.498   Mean   :0.5382  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 3.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :67.000   Max.   :1.0000  
##        V9               V10              V11        
##  Min.   :   0.00   Min.   :     0   Min.   :0.0000  
##  1st Qu.:  70.75   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 160.00   Median :     5   Median :0.0000  
##  Mean   : 180.08   Mean   :  1013   Mean   :0.4526  
##  3rd Qu.: 271.00   3rd Qu.:   399   3rd Qu.:1.0000  
##  Max.   :2000.00   Max.   :100000   Max.   :1.0000
str(data)
## 'data.frame':    654 obs. of  11 variables:
##  $ V1 : int  1 0 0 1 1 1 1 0 1 1 ...
##  $ V2 : num  30.8 58.7 24.5 27.8 20.2 ...
##  $ V3 : num  0 4.46 0.5 1.54 5.62 ...
##  $ V4 : num  1.25 3.04 1.5 3.75 1.71 ...
##  $ V5 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V6 : int  0 0 1 0 1 1 1 1 1 1 ...
##  $ V7 : int  1 6 0 5 0 0 0 0 0 0 ...
##  $ V8 : int  1 1 1 0 1 0 0 1 1 0 ...
##  $ V9 : int  202 43 280 100 120 360 164 80 180 52 ...
##  $ V10: int  0 560 824 3 0 0 31285 1349 314 1442 ...
##  $ V11: int  1 1 1 1 1 1 1 1 1 1 ...
#Check for missing values
sum(is.na(data))
## [1] 0

Now that we know there is no missing data and have a rough sense of its shape, we can split the data into train and test sets.

# set.seed for reproducibility (must come before the random split)
set.seed(42)

split <- 0.7 
train_test_mask <- sample(nrow(data), size=floor(nrow(data)*split))
train_set <- data[train_test_mask,] 
test_set <- data[-train_test_mask,]

# number of candidate k values (number of neighbors) for caret to try;
# with tuneLength = 100, caret tests odd k values from 5 up to 203 (see output below)
k_maximum <- 100

# the number of folds is somewhat arbitrary; 10-fold is a common choice
# that works well for a small dataset like this one
k_cv <- 10

# we'll now fit the data to k-nearest neighbors
knn_fit <- train(as.factor(V11)~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
    train_set, 
    method = "knn",
    trControl=trainControl(
               method="repeatedcv",
               number=k_cv,
               repeats=10), 
    preProcess = c("center", "scale"),
    tuneLength = k_maximum)

knn_fit
## k-Nearest Neighbors 
## 
## 457 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 412, 411, 411, 411, 412, 411, ... 
## Resampling results across tuning parameters:
## 
##   k    Accuracy   Kappa    
##     5  0.8405314  0.6780592
##     7  0.8442464  0.6853988
##     9  0.8385700  0.6738250
##    11  0.8374300  0.6711013
##    13  0.8426908  0.6811480
##    15  0.8465845  0.6887650
##    17  0.8417874  0.6782771
##    19  0.8396087  0.6735476
##    21  0.8416087  0.6774086
##    23  0.8440242  0.6821152
##    25  0.8481787  0.6903147
##    27  0.8488357  0.6915087
##    29  0.8507826  0.6951233
##    31  0.8525169  0.6985207
##    33  0.8518599  0.6970019
##    35  0.8498792  0.6928532
##    37  0.8507488  0.6943690
##    39  0.8500966  0.6931037
##    41  0.8501063  0.6929680
##    43  0.8490338  0.6907884
##    45  0.8481643  0.6889633
##    47  0.8466232  0.6856803
##    49  0.8457633  0.6839232
##    51  0.8472995  0.6870783
##    53  0.8477440  0.6880449
##    55  0.8455507  0.6834555
##    57  0.8462029  0.6847690
##    59  0.8472947  0.6868295
##    61  0.8490580  0.6903696
##    63  0.8462077  0.6844447
##    65  0.8470870  0.6862392
##    67  0.8477391  0.6874773
##    69  0.8464106  0.6846488
##    71  0.8466425  0.6853138
##    73  0.8470628  0.6860813
##    75  0.8461884  0.6840884
##    77  0.8455411  0.6828073
##    79  0.8459710  0.6836846
##    81  0.8455314  0.6828027
##    83  0.8442174  0.6799918
##    85  0.8448841  0.6814205
##    87  0.8431353  0.6779543
##    89  0.8418261  0.6751764
##    91  0.8411691  0.6738362
##    93  0.8422657  0.6759829
##    95  0.8407295  0.6727195
##    97  0.8396329  0.6703921
##    99  0.8398599  0.6707581
##   101  0.8400821  0.6710602
##   103  0.8400870  0.6710014
##   105  0.8387874  0.6682671
##   107  0.8394444  0.6697297
##   109  0.8387923  0.6684188
##   111  0.8377005  0.6660583
##   113  0.8370435  0.6646773
##   115  0.8365990  0.6637829
##   117  0.8366039  0.6636971
##   119  0.8368213  0.6641203
##   121  0.8368164  0.6640655
##   123  0.8370338  0.6643570
##   125  0.8370338  0.6643399
##   127  0.8374686  0.6651621
##   129  0.8379130  0.6660614
##   131  0.8379082  0.6661059
##   133  0.8379082  0.6660882
##   135  0.8392126  0.6686572
##   137  0.8383430  0.6668529
##   139  0.8385604  0.6672611
##   141  0.8389952  0.6681772
##   143  0.8381159  0.6662774
##   145  0.8376812  0.6653454
##   147  0.8376763  0.6653885
##   149  0.8376763  0.6654355
##   151  0.8379082  0.6658750
##   153  0.8381256  0.6662885
##   155  0.8374589  0.6648494
##   157  0.8368068  0.6634619
##   159  0.8367971  0.6634004
##   161  0.8359227  0.6616017
##   163  0.8357101  0.6612150
##   165  0.8346087  0.6588280
##   167  0.8346087  0.6588042
##   169  0.8332899  0.6559665
##   171  0.8337343  0.6569049
##   173  0.8324251  0.6541613
##   175  0.8326377  0.6545780
##   177  0.8322077  0.6536486
##   179  0.8315507  0.6522581
##   181  0.8311111  0.6513232
##   183  0.8311159  0.6513619
##   185  0.8313333  0.6518279
##   187  0.8306812  0.6504320
##   189  0.8293768  0.6476716
##   191  0.8300338  0.6489423
##   193  0.8307005  0.6503164
##   195  0.8293961  0.6475306
##   197  0.8289469  0.6465723
##   199  0.8287246  0.6459639
##   201  0.8276232  0.6435548
##   203  0.8276329  0.6434742
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 31.
cat("Using caret to train the k-nearest neighbors model, we're able to get our parameter k=", knn_fit$bestTune$k," and with an accuracy of ", max(knn_fit$results$Accuracy)*100,"%.")
## Using caret to train the k-nearest neighbors model, we're able to get our parameter k= 31  and with an accuracy of  85.25169 %.
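Rather than scanning the long table above, the accuracy-versus-k profile can also be inspected visually; caret provides a plot method for train objects (an optional check, not part of the original run):

# plot cross-validated accuracy against the number of neighbors k;
# the peak should sit at the selected k = 31
plot(knn_fit)
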
# Prediction based on the test set
prediction <- predict(knn_fit, newdata=test_set)
prediction
##   [1] 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 1 0 0 0 1 0 1 0 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 0 1 0 1 1
##  [75] 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1
## [149] 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
# Evaluate the model
levels(prediction)
## [1] "0" "1"
levels(train_set$V11)
## NULL
cm <- confusionMatrix(as.factor(prediction), as.factor(test_set$V11))
cm_acc <- cm$overall["Accuracy"]
cm_acc
##  Accuracy 
## 0.8172589
cat("On running the model on the test set, there is a only a slight drop in accuracy from ", max(knn_fit$results$Accuracy)*100,"% to ",cm_acc*100,"% by ",(max(knn_fit$results$Accuracy)*100) - (cm_acc*100),"% for the predictions.")
## On running the model on the test set, there is a only a slight drop in accuracy from  85.25169 % to  81.72589 % by  3.525802 % for the predictions.

This gap suggests mild overfitting: the model fits some of the random error in the training set and therefore performs slightly worse on unseen test data.
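
One way to probe that is to compare in-sample and out-of-sample accuracy directly. A minimal sketch using the objects above (not part of the original run; the size of the gap is the quantity of interest):

# accuracy on the training data vs. the held-out test set;
# a large gap between the two indicates overfitting
train_acc <- mean(predict(knn_fit, newdata = train_set) == train_set$V11)
test_acc <- mean(predict(knn_fit, newdata = test_set) == test_set$V11)
cat("train:", train_acc, " test:", test_acc, " gap:", train_acc - test_acc)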

3.1b)

Q: b) splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other is optional)

I’ve chosen to use SVM to model the data.

# clear env vars
rm(list=ls())

# call kernlab for ksvm
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(kknn)
# load the data again since we cleared the environment
data <- read.table("C:/Users/keith/OneDrive/Desktop/credit_card_data.txt",stringsAsFactors=FALSE,header=FALSE)
# set.seed for reproducibility (must come before the random splits)
set.seed(42)

# Split the dataset 3:1:1 into train, validation, and test sets (60/20/20).
# First carve out the 60% training set; the remainder becomes validation + test.
split <- 0.6 
train_test_mask <- sample(nrow(data), size=round(nrow(data)*split))
train_set <- data[train_test_mask,] 
remainder_set <- data[-train_test_mask,]

# Split the remainder 50/50 into validation and test sets
val_test_split <- 0.5
val_test_mask <- sample(nrow(remainder_set), size=round(nrow(remainder_set)*val_test_split))
val_set <- remainder_set[val_test_mask,]
test_set <- remainder_set[-val_test_mask,]
str(train_set)
## 'data.frame':    392 obs. of  11 variables:
##  $ V1 : int  1 0 0 0 1 1 0 1 0 1 ...
##  $ V2 : num  23.1 20.8 23.2 20.5 42 ...
##  $ V3 : num  11.5 3 5.88 11.84 9.79 ...
##  $ V4 : num  3.5 0.04 3.17 6 7.96 0 0.875 0.665 8 1.75 ...
##  $ V5 : int  1 1 1 1 1 0 1 1 1 1 ...
##  $ V6 : int  0 1 0 1 0 0 1 1 0 0 ...
##  $ V7 : int  9 0 10 0 8 2 0 0 14 2 ...
##  $ V8 : int  1 1 1 1 1 1 0 0 1 0 ...
##  $ V9 : int  56 100 120 340 0 17 491 0 0 0 ...
##  $ V10: int  742 0 245 0 0 1 0 0 2300 15 ...
##  $ V11: int  1 1 1 1 1 0 1 1 1 1 ...
str(val_set)
## 'data.frame':    131 obs. of  11 variables:
##  $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V2 : num  43.2 33 21.9 27.8 19.3 ...
##  $ V3 : num  3 2.5 0.54 1.5 9.5 ...
##  $ V4 : num  6 7 0.04 2 1 ...
##  $ V5 : int  1 0 1 1 1 1 0 0 1 1 ...
##  $ V6 : int  0 1 0 0 1 0 1 1 1 1 ...
##  $ V7 : int  11 0 1 11 0 11 0 0 0 0 ...
##  $ V8 : int  1 0 0 0 0 0 1 1 1 1 ...
##  $ V9 : int  80 280 840 434 60 290 220 240 260 120 ...
##  $ V10: int  0 0 59 35 400 284 140 1 200 0 ...
##  $ V11: int  1 0 1 1 1 1 0 0 1 0 ...
str(test_set)
## 'data.frame':    131 obs. of  11 variables:
##  $ V1 : int  0 0 1 1 1 1 1 1 0 0 ...
##  $ V2 : num  58.7 24.5 20.2 42.5 48.1 ...
##  $ V3 : num  4.46 0.5 5.62 4.92 6.04 ...
##  $ V4 : num  3.04 1.5 1.71 3.16 0.04 ...
##  $ V5 : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ V6 : int  0 1 1 1 1 0 0 0 1 0 ...
##  $ V7 : int  6 0 0 0 0 10 17 3 0 5 ...
##  $ V8 : int  1 1 1 0 1 0 0 1 0 0 ...
##  $ V9 : int  43 280 120 52 0 320 0 0 0 0 ...
##  $ V10: int  560 824 0 1442 2690 0 0 0 4000 560 ...
##  $ V11: int  1 1 1 1 1 1 1 1 1 1 ...
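
As a quick sanity check, the resulting proportions should come out close to 60/20/20 (optional, using the sets created above):

# verify the 3:1:1 split proportions
nrow(train_set) / nrow(data) # expected ~0.60
nrow(val_set) / nrow(data) # expected ~0.20
nrow(test_set) / nrow(data) # expected ~0.20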

To tune the SVM cost parameter C, I’ll generate a list of candidate C values (powers of 10):

# Define lower and upper bounds and take the powers of 10 between them
low_pwr <- -5
up_pwr <- 5
c_val <- 10^(low_pwr:up_pwr)
c_num <- length(c_val)
print(c_val)
##  [1] 1e-05 1e-04 1e-03 1e-02 1e-01 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05
# re-seed before model training (redundant here, but harmless)
set.seed(42)

# pre-allocate a vector to store the validation accuracy for each C
acc_ksvm <- rep(0,c_num)


# Train SVM over values of C to test (c_val)

for (i in 1:c_num) {
  
  # fit the model on the training set
  # (named ksvm_model so it doesn't shadow kernlab's ksvm() function)
  ksvm_model <- ksvm(as.matrix(train_set[,1:10]),
          as.factor(train_set[,11]),
          type = "C-svc",
          kernel = "vanilladot",
          C = c_val[i],
          scaled=TRUE)
  
  # score this model on the validation set
  pred <- predict(ksvm_model, val_set[,1:10])
  acc_ksvm[i] <- sum(pred == val_set$V11) / nrow(val_set)
}
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters
acc_ksvm[1:c_num]
##  [1] 0.5725191 0.5725191 0.7480916 0.8931298 0.8931298 0.8931298 0.8931298
##  [8] 0.8931298 0.8931298 0.8931298 0.8931298
cat("Best SVM model is iteration: ", which.max(acc_ksvm[1:c_num]))
## Best SVM model is iteration:  4
cat("which gave the best C value of: ", c_val[which.max(acc_ksvm[1:c_num])])
## which gave the best C value of:  0.01
cat("Highest validation set accuracy is: ", max(acc_ksvm[1:c_num]))
## Highest validation set accuracy is:  0.8931298

Best SVM model is iteration 4, which gave the best C value of 0.01; the highest validation set accuracy is 0.8931298.

Using this C value, we can now assess how accurately the “best model” predicts the test data. First we need to refit it, since the model object was overwritten on each iteration of the loop.
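
An alternative, sketched below with the same variable names, would be to store every fitted model in a list during tuning so the best one can be retrieved without refitting (a sketch only; the refit approach is what actually ran):

# keep each model while tuning, then pull out the best by validation accuracy
models <- vector("list", c_num)
for (i in 1:c_num) {
  models[[i]] <- ksvm(as.matrix(train_set[,1:10]), as.factor(train_set[,11]),
                      type = "C-svc", kernel = "vanilladot",
                      C = c_val[i], scaled = TRUE)
}
best_ksvm <- models[[which.max(acc_ksvm)]]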

# Refit the SVM using the optimal C value found on the validation set
best_ksvm <-  ksvm(as.matrix(train_set[,1:10]),
              as.factor(train_set[,11]),
              type = "C-svc", 
              kernel = "vanilladot",
              C = c_val[which.max(acc_ksvm[1:c_num])],
              scaled=TRUE)
##  Setting default kernel parameters
best_acc <- sum(predict(best_ksvm,test_set[,1:10]) == test_set$V11) / nrow(test_set)

cat("Accuracy of the trained model on test data is: ", best_acc)
## Accuracy of the trained model on test data is:  0.870229
cat('In 3.1b) we can see a slight drop in model accuracy from the validation set to the test set, where accuracy fell from', max(acc_ksvm)*100, '% in the validation step to', best_acc*100, '% in the test step.')
## In 3.1b) we can see a slight drop in model accuracy from the validation set to the test set, where accuracy fell from 89.31298 % in the validation step to 87.0229 % in the test step.

4.1)

Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.

One problem I face at work is customer segmentation for marketing campaigns. Customers now expect hyper-personalized, targeted marketing that accounts for differences in life stage and in lifestyle (for example ethnicity or religious beliefs), as well as the inflows and outflows of their current and/or savings accounts and the nature of those transactions.

A few reliable predictors would be:

1. demographic data (e.g. age, sex, etc.)
2. types of products owned (e.g. insurance, home loans, investment products, number of credit cards owned)
3. sum of assets under management
4. total account inflow and outflow
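
As a rough illustration of how these predictors could feed a clustering model, here is a hedged sketch on made-up customer data (every column name and distribution below is hypothetical):

# hypothetical customer features, invented purely for illustration
set.seed(42)
customers <- data.frame(
  age = sample(18:80, 200, replace = TRUE),
  products = sample(0:6, 200, replace = TRUE), # number of products owned
  aum = rlnorm(200, meanlog = 10), # assets under management
  flow = rlnorm(200, meanlog = 8) # total account inflow + outflow
)
# scale first so aum/flow don't dominate the distances, then cluster into 4 segments
seg <- kmeans(scale(customers), centers = 4, nstart = 25)
table(seg$cluster)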

4.2)

The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers and the response is the type of flower. The data is available from the R library datasets and can be accessed with iris once the library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris ). The response values are only given to see how well a specific method performed and should not be used to build the model.

Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.

Set up the environment

# import libraries
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# clear env vars
rm(list =ls())

# set.seed for reproducibility
set.seed(42)

# load iris table and investigate
iris <- read.table("C:/Users/keith/OneDrive/Desktop/iris.txt", sep="")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
plot(iris)

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##    Species         
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
mapping<- c("versicolor"=1,"virginica"=2,"setosa"=3)
iris$Species<- mapping[iris$Species]
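# NB: this numeric coding is arbitrary, and so are the cluster numbers that
# kmeans assigns, so the two labelings need not line up (see the
# cluster-to-species matching at the end of this section)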
#remove species column for clustering
filtered_iris<-iris[,1:4]

#view summary of loaded data as check
head(filtered_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
#Set seed so random variables can be replicated
set.seed(42)


# Use the elbow method to find the optimal number of k
# this function outputs plot to evaluate ideal # clusters
fviz_nbclust(filtered_iris, kmeans, method ="wss")
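
The same elbow curve can be computed by hand in base R, which makes the "wss" method explicit (an optional cross-check of fviz_nbclust):

# total within-cluster sum of squares for k = 1..10; the elbow near k = 3
# is where adding clusters stops paying off
wss <- sapply(1:10, function(k) kmeans(filtered_iris, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS")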

k_means<- kmeans(filtered_iris,centers=3,nstart=25)
#output clustering information
str(k_means)
## List of 9
##  $ cluster     : Named int [1:150] 3 3 3 3 3 3 3 3 3 3 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ centers     : num [1:3, 1:4] 6.85 5.9 5.01 3.07 2.75 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ totss       : num 681
##  $ withinss    : num [1:3] 23.9 39.8 15.2
##  $ tot.withinss: num 78.9
##  $ betweenss   : num 603
##  $ size        : int [1:3] 38 62 50
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
#store species predictions in vector for comparison
pred<- k_means$cluster

#Calculate # of correct species predictions
accuracy<- sum(pred == iris[,5]) / nrow(filtered_iris)
accuracy
## [1] 0.44

The 44% accuracy turns out to be a labeling artifact rather than poor clustering: kmeans numbers its clusters arbitrarily, and here clusters 1 and 2 came out swapped relative to the versicolor = 1 / virginica = 2 mapping chosen above (only setosa happened to land on the matching label 3). Matching each cluster to its majority species before scoring gives a fairer measure, as sketched below.
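
A minimal sketch of that matching, using the objects defined above (the ~0.89 figure is what k = 3 k-means on the four unscaled iris predictors typically achieves, not a guaranteed result):

# contingency table: rows are kmeans clusters, columns are true species codes
conf <- table(cluster = k_means$cluster, species = iris$Species)
conf

# relabel each cluster by the species that dominates it, then rescore
majority <- apply(conf, 1, which.max)
matched_pred <- majority[k_means$cluster]
mean(matched_pred == iris$Species) # typically ~0.89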