Q: Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier: a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional)
I decided to try the caret package instead of the kknn package we learned about in the previous assignment; ultimately the process feels about the same. In this write-up I look at the results of running cross-validation on the data after a train-test split in 3.1a) using k-nearest neighbors, and at the results of a train-validate-test split in 3.1b) using SVM.
## Train-test split only

### Exploring the dataset
# clear env vars
rm(list=ls())
# Load the caret and kknn libraries for k-nearest-neighbours k-fold cross-validation
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(kknn)
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
# Reading the data
data <- read.table("C:/Users/keith/OneDrive/Desktop/credit_card_data.txt",stringsAsFactors=FALSE,header=FALSE)
head(data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1
# Explore the data before we do the train-test split
summary(data)
## V1 V2 V3 V4
## Min. :0.0000 Min. :13.75 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:22.58 1st Qu.: 1.040 1st Qu.: 0.165
## Median :1.0000 Median :28.46 Median : 2.855 Median : 1.000
## Mean :0.6896 Mean :31.58 Mean : 4.831 Mean : 2.242
## 3rd Qu.:1.0000 3rd Qu.:38.25 3rd Qu.: 7.438 3rd Qu.: 2.615
## Max. :1.0000 Max. :80.25 Max. :28.000 Max. :28.500
## V5 V6 V7 V8
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median : 0.000 Median :1.0000
## Mean :0.5352 Mean :0.5612 Mean : 2.498 Mean :0.5382
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 3.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :67.000 Max. :1.0000
## V9 V10 V11
## Min. : 0.00 Min. : 0 Min. :0.0000
## 1st Qu.: 70.75 1st Qu.: 0 1st Qu.:0.0000
## Median : 160.00 Median : 5 Median :0.0000
## Mean : 180.08 Mean : 1013 Mean :0.4526
## 3rd Qu.: 271.00 3rd Qu.: 399 3rd Qu.:1.0000
## Max. :2000.00 Max. :100000 Max. :1.0000
str(data)
## 'data.frame': 654 obs. of 11 variables:
## $ V1 : int 1 0 0 1 1 1 1 0 1 1 ...
## $ V2 : num 30.8 58.7 24.5 27.8 20.2 ...
## $ V3 : num 0 4.46 0.5 1.54 5.62 ...
## $ V4 : num 1.25 3.04 1.5 3.75 1.71 ...
## $ V5 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ V6 : int 0 0 1 0 1 1 1 1 1 1 ...
## $ V7 : int 1 6 0 5 0 0 0 0 0 0 ...
## $ V8 : int 1 1 1 0 1 0 0 1 1 0 ...
## $ V9 : int 202 43 280 100 120 360 164 80 180 52 ...
## $ V10: int 0 560 824 3 0 0 31285 1349 314 1442 ...
## $ V11: int 1 1 1 1 1 1 1 1 1 1 ...
#Check for missing values
sum(is.na(data))
## [1] 0
Now that we know there is no missing data and we have a rough sense of its shape, we can proceed to split the data into train and test sets.
# set.seed for reproducibility (this must come before sample() for the split itself to be reproducible)
set.seed(42)
split <- 0.7
train_test_mask <- sample(nrow(data), size=floor(nrow(data)*split))
train_set <- data[train_test_mask,]
test_set <- data[-train_test_mask,]
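As an aside, sample() does not stratify on the response, so the 0/1 balance can drift a little between the two sets. If that matters, caret's createDataPartition samples within each class; a small sketch (the names train_set2 and test_set2 are hypothetical):

# stratified 70/30 split: createDataPartition samples within each level of V11
idx <- createDataPartition(as.factor(data$V11), p = 0.7, list = FALSE)
train_set2 <- data[idx, ]  # ~70% of rows, class balance preserved
test_set2 <- data[-idx, ]  # remaining ~30%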
# set maximum value of k (number of neighbors) to test
k_maximum <- 100
# the number of folds is somewhat arbitrary; we stick with 10-fold since the dataset is small
k_cv <-10
# we'll now fit the data to k-nearest neighbors
knn_fit <- train(as.factor(V11)~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
train_set,
method = "knn",
trControl=trainControl(
method="repeatedcv",
number=k_cv,
repeats=10),
preProcess = c("center", "scale"),
tuneLength = k_maximum)
knn_fit
## k-Nearest Neighbors
##
## 457 samples
## 10 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 412, 411, 411, 411, 412, 411, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8405314 0.6780592
## 7 0.8442464 0.6853988
## 9 0.8385700 0.6738250
## 11 0.8374300 0.6711013
## 13 0.8426908 0.6811480
## 15 0.8465845 0.6887650
## 17 0.8417874 0.6782771
## 19 0.8396087 0.6735476
## 21 0.8416087 0.6774086
## 23 0.8440242 0.6821152
## 25 0.8481787 0.6903147
## 27 0.8488357 0.6915087
## 29 0.8507826 0.6951233
## 31 0.8525169 0.6985207
## 33 0.8518599 0.6970019
## 35 0.8498792 0.6928532
## 37 0.8507488 0.6943690
## 39 0.8500966 0.6931037
## 41 0.8501063 0.6929680
## 43 0.8490338 0.6907884
## 45 0.8481643 0.6889633
## 47 0.8466232 0.6856803
## 49 0.8457633 0.6839232
## 51 0.8472995 0.6870783
## 53 0.8477440 0.6880449
## 55 0.8455507 0.6834555
## 57 0.8462029 0.6847690
## 59 0.8472947 0.6868295
## 61 0.8490580 0.6903696
## 63 0.8462077 0.6844447
## 65 0.8470870 0.6862392
## 67 0.8477391 0.6874773
## 69 0.8464106 0.6846488
## 71 0.8466425 0.6853138
## 73 0.8470628 0.6860813
## 75 0.8461884 0.6840884
## 77 0.8455411 0.6828073
## 79 0.8459710 0.6836846
## 81 0.8455314 0.6828027
## 83 0.8442174 0.6799918
## 85 0.8448841 0.6814205
## 87 0.8431353 0.6779543
## 89 0.8418261 0.6751764
## 91 0.8411691 0.6738362
## 93 0.8422657 0.6759829
## 95 0.8407295 0.6727195
## 97 0.8396329 0.6703921
## 99 0.8398599 0.6707581
## 101 0.8400821 0.6710602
## 103 0.8400870 0.6710014
## 105 0.8387874 0.6682671
## 107 0.8394444 0.6697297
## 109 0.8387923 0.6684188
## 111 0.8377005 0.6660583
## 113 0.8370435 0.6646773
## 115 0.8365990 0.6637829
## 117 0.8366039 0.6636971
## 119 0.8368213 0.6641203
## 121 0.8368164 0.6640655
## 123 0.8370338 0.6643570
## 125 0.8370338 0.6643399
## 127 0.8374686 0.6651621
## 129 0.8379130 0.6660614
## 131 0.8379082 0.6661059
## 133 0.8379082 0.6660882
## 135 0.8392126 0.6686572
## 137 0.8383430 0.6668529
## 139 0.8385604 0.6672611
## 141 0.8389952 0.6681772
## 143 0.8381159 0.6662774
## 145 0.8376812 0.6653454
## 147 0.8376763 0.6653885
## 149 0.8376763 0.6654355
## 151 0.8379082 0.6658750
## 153 0.8381256 0.6662885
## 155 0.8374589 0.6648494
## 157 0.8368068 0.6634619
## 159 0.8367971 0.6634004
## 161 0.8359227 0.6616017
## 163 0.8357101 0.6612150
## 165 0.8346087 0.6588280
## 167 0.8346087 0.6588042
## 169 0.8332899 0.6559665
## 171 0.8337343 0.6569049
## 173 0.8324251 0.6541613
## 175 0.8326377 0.6545780
## 177 0.8322077 0.6536486
## 179 0.8315507 0.6522581
## 181 0.8311111 0.6513232
## 183 0.8311159 0.6513619
## 185 0.8313333 0.6518279
## 187 0.8306812 0.6504320
## 189 0.8293768 0.6476716
## 191 0.8300338 0.6489423
## 193 0.8307005 0.6503164
## 195 0.8293961 0.6475306
## 197 0.8289469 0.6465723
## 199 0.8287246 0.6459639
## 201 0.8276232 0.6435548
## 203 0.8276329 0.6434742
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 31.
cat("Using caret to train the k-nearest neighbors model, we're able to get our parameter k=", knn_fit$bestTune$k," and with an accuracy of ", max(knn_fit$results$Accuracy)*100,"%.")
## Using caret to train the k-nearest neighbors model, we're able to get our parameter k= 31 and with an accuracy of 85.25169 %.
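As a quick visual check, caret's plot method for train objects draws the cross-validated accuracy against k, which makes the plateau around k = 31 easy to see:

# plot cross-validated accuracy across the tuning grid for k
plot(knn_fit)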
# Prediction based on the test set
prediction <- predict(knn_fit, newdata=test_set)
prediction
## [1] 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 1 0 0 0 1 0 1 0 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 0 1 0 1 1
## [75] 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1
## [149] 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
# Evaluate the model
levels(prediction)
## [1] "0" "1"
levels(train_set$V11) # NULL because V11 is still an integer column, hence the as.factor() coercion below
## NULL
cm <- confusionMatrix(as.factor(prediction), as.factor(test_set$V11))
cm_acc <- cm$overall["Accuracy"]
cm_acc
## Accuracy
## 0.8172589
cat("On running the model on the test set, there is a only a slight drop in accuracy from ", max(knn_fit$results$Accuracy)*100,"% to ",cm_acc*100,"% by ",(max(knn_fit$results$Accuracy)*100) - (cm_acc*100),"% for the predictions.")
## On running the model on the test set, there is a only a slight drop in accuracy from 85.25169 % to 81.72589 % by 3.525802 % for the predictions.
This gap could indicate mild overfitting to the training set: the model fits some of the random error in the training data and so performs slightly worse on the unseen test set.
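To put a number on that gap, here is a quick sketch (assuming knn_fit, train_set, and test_set are still in the session) that scores the tuned model on both sets; resubstitution accuracy well above test accuracy would support the overfitting reading:

# resubstitution accuracy vs. held-out accuracy for the tuned k-NN model
train_acc <- mean(predict(knn_fit, newdata = train_set) == train_set$V11)
test_acc <- mean(predict(knn_fit, newdata = test_set) == test_set$V11)
cat("train accuracy:", train_acc, "| test accuracy:", test_acc)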
b) splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other is optional)
I’ve chosen to use SVM to model the data.
# clear env vars
rm(list=ls())
# call kernlab for ksvm
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(kknn)
# load the data again since we cleared the environment
data <- read.table("C:/Users/keith/OneDrive/Desktop/credit_card_data.txt",stringsAsFactors=FALSE,header=FALSE)
# Split the dataset before validation: 3 parts train, 1 part validation, 1 part test (60/20/20)
# set.seed before sample() so the splits are reproducible
set.seed(42)
split <- 0.6
train_test_mask <- sample(nrow(data), size=round(nrow(data)*split))
train_set <- data[train_test_mask,]
remainder_set <- data[-train_test_mask,]
# From the remainder we'll split evenly into validation and test sets
val_test_split <- 0.5
val_test_mask <- sample(nrow(remainder_set), size=round(nrow(remainder_set)*val_test_split))
val_set <- remainder_set[val_test_mask,]
test_set <- remainder_set[-val_test_mask,]
str(train_set)
## 'data.frame': 392 obs. of 11 variables:
## $ V1 : int 1 0 0 0 1 1 0 1 0 1 ...
## $ V2 : num 23.1 20.8 23.2 20.5 42 ...
## $ V3 : num 11.5 3 5.88 11.84 9.79 ...
## $ V4 : num 3.5 0.04 3.17 6 7.96 0 0.875 0.665 8 1.75 ...
## $ V5 : int 1 1 1 1 1 0 1 1 1 1 ...
## $ V6 : int 0 1 0 1 0 0 1 1 0 0 ...
## $ V7 : int 9 0 10 0 8 2 0 0 14 2 ...
## $ V8 : int 1 1 1 1 1 1 0 0 1 0 ...
## $ V9 : int 56 100 120 340 0 17 491 0 0 0 ...
## $ V10: int 742 0 245 0 0 1 0 0 2300 15 ...
## $ V11: int 1 1 1 1 1 0 1 1 1 1 ...
str(val_set)
## 'data.frame': 131 obs. of 11 variables:
## $ V1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2 : num 43.2 33 21.9 27.8 19.3 ...
## $ V3 : num 3 2.5 0.54 1.5 9.5 ...
## $ V4 : num 6 7 0.04 2 1 ...
## $ V5 : int 1 0 1 1 1 1 0 0 1 1 ...
## $ V6 : int 0 1 0 0 1 0 1 1 1 1 ...
## $ V7 : int 11 0 1 11 0 11 0 0 0 0 ...
## $ V8 : int 1 0 0 0 0 0 1 1 1 1 ...
## $ V9 : int 80 280 840 434 60 290 220 240 260 120 ...
## $ V10: int 0 0 59 35 400 284 140 1 200 0 ...
## $ V11: int 1 0 1 1 1 1 0 0 1 0 ...
str(test_set)
## 'data.frame': 131 obs. of 11 variables:
## $ V1 : int 0 0 1 1 1 1 1 1 0 0 ...
## $ V2 : num 58.7 24.5 20.2 42.5 48.1 ...
## $ V3 : num 4.46 0.5 5.62 4.92 6.04 ...
## $ V4 : num 3.04 1.5 1.71 3.16 0.04 ...
## $ V5 : int 1 1 1 1 0 1 1 1 1 1 ...
## $ V6 : int 0 1 1 1 1 0 0 0 1 0 ...
## $ V7 : int 6 0 0 0 0 10 17 3 0 5 ...
## $ V8 : int 1 1 1 0 1 0 0 1 0 0 ...
## $ V9 : int 43 280 120 52 0 320 0 0 0 0 ...
## $ V10: int 560 824 0 1442 2690 0 0 0 4000 560 ...
## $ V11: int 1 1 1 1 1 1 1 1 1 1 ...
To test multiple values of C, I’ll generate a list of candidate C values.
# Define lower and upper bounds for C and take powers of 10 between them
low_pwr <- -5
up_pwr <- 5
c_val <- 10^(low_pwr:up_pwr)
c_num <- length(c_val)
print(c_val)
## [1] 1e-05 1e-04 1e-03 1e-02 1e-01 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05
# set.seed for reproducibility
set.seed(42)
# create empty list for accuracy storing
acc_ksvm <-rep(0,c_num)
# Train SVM over values of C to test (c_val)
for (i in 1:c_num) {
  # fit the model on the training set
  # (named ksvm_model so it doesn't shadow the ksvm() function)
  ksvm_model <- ksvm(as.matrix(train_set[,1:10]),
                     as.factor(train_set[,11]),
                     type = "C-svc",
                     kernel = "vanilladot",
                     C = c_val[i],
                     scaled = TRUE)
  # compare candidate models on the validation set
  pred <- predict(ksvm_model, val_set[,1:10])
  acc_ksvm[i] <- sum(pred == val_set$V11) / nrow(val_set)
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
acc_ksvm[1:c_num]
## [1] 0.5725191 0.5725191 0.7480916 0.8931298 0.8931298 0.8931298 0.8931298
## [8] 0.8931298 0.8931298 0.8931298 0.8931298
cat("Best SVM model is iteration: ", which.max(acc_ksvm[1:c_num]))
## Best SVM model is iteration: 4
cat("which gave the best C value of: ", c_val[which.max(acc_ksvm[1:c_num])])
## which gave the best C value of: 0.01
cat("Highest validation set accuracy is: ", max(acc_ksvm[1:c_num]))
## Highest validation set accuracy is: 0.8931298
The best SVM model is iteration 4, which gave the best C value of 0.01; the highest validation-set accuracy is 0.8931298.
Using these parameters, we can now assess how accurately the “best model” predicts the test data. First we need to recreate the model since it was overwritten within the loop.
# Retrain the SVM with the optimal C value,
# i.e. the one that gave the best validation-set accuracy
best_ksvm <- ksvm(as.matrix(train_set[,1:10]),
as.factor(train_set[,11]),
type = "C-svc",
kernel = "vanilladot",
C = c_val[which.max(acc_ksvm[1:c_num])],
scaled=TRUE)
## Setting default kernel parameters
best_acc <- sum(predict(best_ksvm,test_set[,1:10]) == test_set$V11) / nrow(test_set)
cat("Accuracy of the trained model on test data is: ", best_acc)
## Accuracy of the trained model on test data is: 0.870229
cat('In 3.1b) there is a slight drop in model accuracy from the validation set to the test set: accuracy fell from', max(acc_ksvm[1:c_num])*100, '% in the validation step to', best_acc*100, '% in the test step.')
## In 3.1b) there is a slight drop in model accuracy from the validation set to the test set: accuracy fell from 89.31298 % in the validation step to 87.0229 % in the test step.
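Both of these numbers come from a single random split, so they carry sampling noise. As a sketch (not part of the original assignment; eval_once is a hypothetical helper), the 60/20/20 split can be repeated over a few seeds with the chosen C to see how much the test accuracy moves:

# repeat the split/train/score cycle over several seeds to gauge variability
eval_once <- function(seed, C = 0.01) {
  set.seed(seed)
  idx <- sample(nrow(data), size = round(nrow(data) * 0.6))
  tr <- data[idx, ]
  rest <- data[-idx, ]
  v_idx <- sample(nrow(rest), size = round(nrow(rest) * 0.5))
  te <- rest[-v_idx, ]  # the validation half goes unused in this check
  m <- ksvm(as.matrix(tr[, 1:10]), as.factor(tr[, 11]),
            type = "C-svc", kernel = "vanilladot", C = C, scaled = TRUE)
  mean(predict(m, te[, 1:10]) == te$V11)
}
sapply(1:5, eval_once)  # five test accuracies, one per seed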
Q: Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.
One problem that I face at work is customer segmentation for marketing campaigns. Customers now expect hyper-personalized, targeted marketing that accounts for differences in life stage and in lifestyle (shaped by factors such as ethnicity and religious beliefs), as well as current and/or savings account inflows and outflows and the nature of those transactions.
I would say a few reliable predictors are:

1. demographic data (e.g. age, sex, etc.)
2. types of products owned (e.g. insurance, home loans, investment products, number of credit cards owned)
3. sum of assets under management
4. total account inflow and outflow
Q: The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers and the response is the type of flower. The data is available from the R library datasets and can be accessed with iris once the library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris ). The response values are only given to see how well a specific method performed and should not be used to build the model.
Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.
### Set up the environment
# import libraries
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# clear env vars
rm(list =ls())
# set.seed for reproducibility
set.seed(42)
# load iris table and investigate
iris <- read.table("C:/Users/keith/OneDrive/Desktop/iris.txt", sep="")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
plot(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## Length:150
## Class :character
## Mode :character
##
##
##
mapping<- c("versicolor"=1,"virginica"=2,"setosa"=3)
iris$Species<- mapping[iris$Species]
#remove species column for clustering
filtered_iris<-iris[,1:4]
#view summary of loaded data as check
head(filtered_iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
#Set seed so random variables can be replicated
set.seed(42)
# Use the elbow method to find the optimal number of k
# this function outputs plot to evaluate ideal # clusters
fviz_nbclust(filtered_iris, kmeans, method ="wss")
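fviz_nbclust renders the elbow ("wss") plot, though the figure itself isn't reproduced here. For reference, the same curve can be computed by hand, a small sketch using only base R and stats::kmeans:

# total within-cluster sum of squares for k = 1..10 (the "elbow" curve)
wss <- sapply(1:10, function(k) kmeans(filtered_iris, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")

On iris the curve typically bends around k = 2 or 3; with three known species, centers = 3 is the natural choice below.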
k_means<- kmeans(filtered_iris,centers=3,nstart=25)
#output clustering information
str(k_means)
## List of 9
## $ cluster : Named int [1:150] 3 3 3 3 3 3 3 3 3 3 ...
## ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
## $ centers : num [1:3, 1:4] 6.85 5.9 5.01 3.07 2.75 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ totss : num 681
## $ withinss : num [1:3] 23.9 39.8 15.2
## $ tot.withinss: num 78.9
## $ betweenss : num 603
## $ size : int [1:3] 38 62 50
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
#store species predictions in vector for comparison
pred<- k_means$cluster
#Calculate # of correct species predictions
accuracy<- sum(pred == iris[,5]) / nrow(filtered_iris)
accuracy
## [1] 0.44
The 44% looks alarming, but it is almost certainly a labeling artifact rather than poor clustering: kmeans assigns arbitrary cluster IDs (1, 2, 3), and there is no reason these should line up with the hard-coded mapping versicolor=1, virginica=2, setosa=3. Comparing raw cluster IDs to species codes only rewards clusters that happen to receive the matching number, so the clusters need to be matched to species before scoring.
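A minimal fix, as a sketch (assuming pred and the integer-coded iris$Species from above): cross-tabulate clusters against species, then relabel each cluster with its majority species before computing accuracy. The names tab, best_match, and relabeled are my own.

# cross-tabulate cluster IDs against the true species codes
tab <- table(cluster = pred, species = iris$Species)
tab
# relabel each cluster with the species code it overlaps most
# (assumes each cluster maps to a distinct species, which holds for iris)
best_match <- as.numeric(colnames(tab)[apply(tab, 1, which.max)])
relabeled <- best_match[pred]
mean(relabeled == iris$Species)  # agreement after matching labels

After relabeling, the agreement should land near 89% for this data: setosa separates cleanly, and most of the remaining confusion is between versicolor and virginica. Since the question also asks for the best combination of predictors, a natural next step is to repeat the clustering on just Petal.Length and Petal.Width, which usually separate the species better than the sepal measurements.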