Question 3.1 Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier:

  1. using cross-validation (do this for the k-nearest-neighbors model; SVM is optional)

Clear the environment

rm(list = ls())

Load libraries

library(ISLR)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice

Load data into data frame and convert last column to a factor

data <- read.table("/Users/djmariano/Downloads/hw1-SP22 2/data 2.2/credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)
data$V11 <- as.factor(data$V11)

Set seed and split data into training (80%) and testing (20%) sets.

set.seed(123)
index <- createDataPartition(data$V11, p = 0.80, list = FALSE)
dataTraining = data[index,]
dataTesting = data[-index,]

Confirm correct number of rows

nrow(dataTraining)
## [1] 524
nrow(dataTesting)
## [1] 130

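Compute centering and scaling parameters from the training predictors, excluding the response column V11. The train() call below repeats this preprocessing through its own preProcess argument, so procvalues is not used again.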

process <- dataTraining[,names(dataTraining) != "V11"]
procvalues <- preProcess(x = process,method = c("center", "scale"))
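
Centering and scaling parameters should be learned from the training rows only and then reused for any new data. With caret, predict(procvalues, newdata) applies the stored transformation. The sketch below shows the same idea in base R on hypothetical toy data (train_x and test_x are made-up stand-ins, not the credit card data).

```r
# Hypothetical toy data standing in for the credit card predictors
set.seed(123)
train_x <- data.frame(a = rnorm(20, mean = 50, sd = 10), b = runif(20))
test_x  <- data.frame(a = rnorm(5,  mean = 50, sd = 10), b = runif(5))

# Learn the centering and scaling parameters from the training rows only
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the same parameters to both sets
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

round(colMeans(train_scaled), 10)  # training columns now have mean 0
apply(train_scaled, 2, sd)         # and standard deviation 1
```

The test columns will generally not have mean 0 and standard deviation 1 exactly, which is expected: they are transformed with the training statistics.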

Use caret's train function to fit kNN with 10-fold cross-validation repeated 5 times, trying 50 values of k (tuneLength = 50 gives the odd values from 5 to 103).

set.seed(123)
control <- trainControl(method="repeatedcv", repeats = 5) # Stores parameters for train function
fit <- train(V11 ~ ., data = dataTraining, method = "knn", trControl = control, preProcess = c("center", "scale"), tuneLength = 50)
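
To make the resampling scheme concrete, here is a rough sketch of repeated k-fold cross-validation written by hand. It is not caret's implementation: it uses class::knn, simple random folds instead of caret's stratified folds, and hypothetical toy data (x, y and cv_accuracy are illustrative names).

```r
library(class)  # provides knn()

# Hypothetical two-class toy data
set.seed(123)
n <- 200
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, 1, 0))

# Accuracy of kNN for one k, averaged over `repeats` rounds of `folds`-fold CV
cv_accuracy <- function(x, y, k, folds = 10, repeats = 5) {
  acc <- c()
  for (r in seq_len(repeats)) {
    # Assign every row to a random fold; a new assignment for each repeat
    fold_id <- sample(rep(seq_len(folds), length.out = nrow(x)))
    for (f in seq_len(folds)) {
      held_out <- fold_id == f
      # Scale with statistics from the training folds only
      mu    <- colMeans(x[!held_out, , drop = FALSE])
      sigma <- apply(x[!held_out, , drop = FALSE], 2, sd)
      tr <- scale(x[!held_out, , drop = FALSE], center = mu, scale = sigma)
      te <- scale(x[held_out, , drop = FALSE],  center = mu, scale = sigma)
      pred <- knn(train = tr, test = te, cl = y[!held_out], k = k)
      acc  <- c(acc, mean(pred == y[held_out]))
    }
  }
  mean(acc)
}

ks  <- seq(5, 25, by = 2)
res <- sapply(ks, function(k) cv_accuracy(x, y, k))
data.frame(k = ks, accuracy = res)
ks[which.max(res)]  # k with the highest cross-validated accuracy
```

Each value of k is scored on 50 held-out folds (10 folds times 5 repeats), which is why the accuracies in the caret output are averages and not single measurements.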

Review the cross-validation results

plot(fit)

fit
## k-Nearest Neighbors 
## 
## 524 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 471, 472, 471, 472, 471, 472, ... 
## Resampling results across tuning parameters:
## 
##   k    Accuracy   Kappa    
##     5  0.8392547  0.6763263
##     7  0.8419032  0.6824450
##     9  0.8365325  0.6715822
##    11  0.8376646  0.6734999
##    13  0.8334843  0.6648276
##    15  0.8431296  0.6841965
##    17  0.8453133  0.6877379
##    19  0.8506555  0.6979263
##    21  0.8448712  0.6852643
##    23  0.8467943  0.6888815
##    25  0.8491093  0.6930676
##    27  0.8449145  0.6845197
##    29  0.8464605  0.6876835
##    31  0.8457206  0.6858855
##    33  0.8453357  0.6851203
##    35  0.8468744  0.6881913
##    37  0.8476146  0.6896066
##    39  0.8480283  0.6902023
##    41  0.8487830  0.6917242
##    43  0.8480137  0.6901194
##    45  0.8487685  0.6914347
##    47  0.8464825  0.6865711
##    49  0.8423098  0.6781444
##    51  0.8430648  0.6795424
##    53  0.8415336  0.6763540
##    55  0.8404015  0.6738811
##    57  0.8396251  0.6721647
##    59  0.8400460  0.6731228
##    61  0.8388997  0.6705533
##    63  0.8377531  0.6680574
##    65  0.8362143  0.6648383
##    67  0.8343281  0.6608542
##    69  0.8354819  0.6632127
##    71  0.8366213  0.6654946
##    73  0.8370059  0.6662211
##    75  0.8343058  0.6606165
##    77  0.8339214  0.6595818
##    79  0.8350825  0.6618596
##    81  0.8347052  0.6611057
##    83  0.8354523  0.6623972
##    85  0.8354523  0.6623050
##    87  0.8346904  0.6608769
##    89  0.8339432  0.6592408
##    91  0.8335658  0.6584243
##    93  0.8343130  0.6600762
##    95  0.8339209  0.6591705
##    97  0.8343055  0.6598959
##    99  0.8342979  0.6599648
##   101  0.8323966  0.6559195
##   103  0.8323894  0.6558225
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 19.

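k = 19 gives the highest cross-validated accuracy, 0.8507. The values for k between roughly 15 and 45 differ by less than 0.01, so the exact choice of k within that range matters little. This accuracy is estimated from resamples of the training set; the 20% test set created above is not used in this part.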

Question 3.1

  2. splitting the data into training, validation, and test data sets (I use kNN in my solution)

Clear the environment

rm(list = ls())

Load libraries

library(ISLR)
library(caret)

Load data into data frame and convert last column to a factor

data <- read.table("/Users/djmariano/Downloads/hw1-SP22 2/data 2.2/credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)
data$V11 <- as.factor(data$V11)

Set seed and split data into training (60%), validation (20%), and testing (20%) sets. The split is done in two steps: take 60% for training, then divide the remaining 40% in half.

set.seed(123)
trainingindex <- createDataPartition(data$V11, p = 0.60, list = FALSE)
dataTraining = data[trainingindex,]
dataSplit = data[-trainingindex,]
validationindex <- createDataPartition(dataSplit$V11, p = 0.50, list = FALSE)
dataValidation <- dataSplit[validationindex,]
dataTesting <- dataSplit[-validationindex,]

Confirm correct number of rows

nrow(dataTraining)
## [1] 393
nrow(dataValidation)
## [1] 131
nrow(dataTesting)
## [1] 130

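Compute centering and scaling parameters from the training predictors, excluding the response column V11. The train() call below repeats this preprocessing through its own preProcess argument, so procvalues is not used again.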

process <- dataTraining[,names(dataTraining) != "V11"]
procvalues <- preProcess(x = process, method = c("center", "scale"))

Use caret's train function to fit kNN with 10-fold cross-validation repeated 5 times, trying 50 values of k.

set.seed(123)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 5) #Define the training control
fit <- train(V11~., data = dataTraining, method = "knn", trControl = control, preProcess = c("center", "scale"), tuneLength = 50)

Review the cross-validation results to find the best value of k (the one with the highest accuracy during training), then make predictions on the test set and compute a confusion matrix to evaluate the model's performance there.

plot(fit)

fit
## k-Nearest Neighbors 
## 
## 393 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 353, 353, 355, 354, 354, 354, ... 
## Resampling results across tuning parameters:
## 
##   k    Accuracy   Kappa    
##     5  0.8350702  0.6671398
##     7  0.8326727  0.6619904
##     9  0.8275695  0.6510167
##    11  0.8387760  0.6728139
##    13  0.8412874  0.6774943
##    15  0.8391957  0.6723895
##    17  0.8427213  0.6793004
##    19  0.8477854  0.6897293
##    21  0.8482982  0.6907484
##    23  0.8467861  0.6877152
##    25  0.8452868  0.6846119
##    27  0.8422092  0.6781653
##    29  0.8401565  0.6738338
##    31  0.8422874  0.6778642
##    33  0.8443279  0.6817877
##    35  0.8474190  0.6881540
##    37  0.8443408  0.6817478
##    39  0.8453671  0.6838032
##    41  0.8463293  0.6857200
##    43  0.8438030  0.6801612
##    45  0.8407503  0.6738007
##    47  0.8381721  0.6685765
##    49  0.8366457  0.6652212
##    51  0.8351201  0.6618047
##    53  0.8315560  0.6545535
##    55  0.8310304  0.6533088
##    57  0.8320951  0.6555318
##    59  0.8284906  0.6482660
##    61  0.8279906  0.6471484
##    63  0.8274906  0.6461210
##    65  0.8280034  0.6468726
##    67  0.8280297  0.6468243
##    69  0.8274906  0.6458542
##    71  0.8269771  0.6447966
##    73  0.8290034  0.6488418
##    75  0.8289771  0.6487096
##    77  0.8299771  0.6508818
##    79  0.8268988  0.6445381
##    81  0.8269251  0.6445009
##    83  0.8269507  0.6443755
##    85  0.8274636  0.6453947
##    87  0.8284899  0.6473190
##    89  0.8284899  0.6473190
##    91  0.8285027  0.6472748
##    93  0.8290290  0.6483820
##    95  0.8305418  0.6513436
##    97  0.8290290  0.6483183
##    99  0.8290290  0.6483183
##   101  0.8295290  0.6493757
##   103  0.8295290  0.6494205
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 21.

k = 21 gives the highest cross-validated accuracy, 0.8483, so train() refits the final model on the full training set with k = 21.
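
The run above chooses k by cross-validation inside the training set and never touches dataValidation. A sketch of the alternative the question describes, where the validation set picks k and the test set is used once at the end, might look like the following. It uses class::knn and hypothetical toy data; in the document the three index sets would correspond to dataTraining, dataValidation and dataTesting.

```r
library(class)  # provides knn()

# Hypothetical two-class toy data
set.seed(123)
n <- 300
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] - x[, 2] + rnorm(n, sd = 0.5) > 0, 1, 0))

# 60% training, 20% validation, 20% test
idx   <- sample(seq_len(n))
train <- idx[1:180]
valid <- idx[181:240]
test  <- idx[241:300]

# Scale every set with the training means and standard deviations
mu    <- colMeans(x[train, ])
sigma <- apply(x[train, ], 2, sd)
xs    <- scale(x, center = mu, scale = sigma)

# Choose k by accuracy on the validation set
ks <- seq(5, 45, by = 2)
valid_acc <- sapply(ks, function(k) {
  pred <- knn(train = xs[train, ], test = xs[valid, ], cl = y[train], k = k)
  mean(pred == y[valid])
})
best_k <- ks[which.max(valid_acc)]

# Report the chosen model once, on the untouched test set
test_pred <- knn(train = xs[train, ], test = xs[test, ], cl = y[train], k = best_k)
test_acc  <- mean(test_pred == y[test])
c(best_k = best_k, validation_accuracy = max(valid_acc), test_accuracy = test_acc)
```

The validation accuracy of the winning k tends to be optimistic because it is the best of several candidates; the test accuracy is the figure to report.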

predictions <- predict(fit, newdata = dataTesting)
confusionMatrix(predictions, dataTesting$V11)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 61 12
##          1 10 47
##                                           
##                Accuracy : 0.8308          
##                  95% CI : (0.7551, 0.8908)
##     No Information Rate : 0.5462          
##     P-Value [Acc > NIR] : 6.893e-12       
##                                           
##                   Kappa : 0.6576          
##                                           
##  Mcnemar's Test P-Value : 0.8312          
##                                           
##             Sensitivity : 0.8592          
##             Specificity : 0.7966          
##          Pos Pred Value : 0.8356          
##          Neg Pred Value : 0.8246          
##              Prevalence : 0.5462          
##          Detection Rate : 0.4692          
##    Detection Prevalence : 0.5615          
##       Balanced Accuracy : 0.8279          
##                                           
##        'Positive' Class : 0               
## 
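
The headline statistics can be checked by hand from the four counts in the confusion matrix. The sketch below copies those counts and recomputes accuracy, sensitivity and specificity, treating '0' as the positive class as the output states. The test accuracy of 0.8308 is slightly below the cross-validated 0.8483.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(61, 10, 12, 47), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # (61 + 47) / 130
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # 61 / 71, true '0' predicted as '0'
specificity <- cm["1", "1"] / sum(cm[, "1"])  # 47 / 59, true '1' predicted as '1'

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
# accuracy 0.8308, sensitivity 0.8592, specificity 0.7966
```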

Question 4.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.

A retail business could use a clustering model to segment its in-store customers by their characteristics and behavior. Knowing the segments would help it allocate resources more effectively and identify new opportunities.

The business might use the following predictors:

  1. Time spent per visit
  2. Amount spent per visit
  3. Number of visits
  4. Gender/Sex
  5. Age

Question 4.2

The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers and the response is the type of flower. The data is available from the R library datasets and can be accessed with iris once the library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris). The response values are only given to see how well a specific method performed and should not be used to build the model.

Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.

Clear the environment

rm(list = ls())

Load libraries

library(stats)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(kknn)
## 
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
## 
##     contr.dummy
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(ggfortify)

Import iris data

data <- read.table("/Users/djmariano/Downloads/hw2-SP22/iris.txt", stringsAsFactors = FALSE, header = TRUE)
head(data)

Exclude the species column for kmeans clustering. We are trying to group points into clusters based upon their attributes without knowing what species they belong to. This is a method of unsupervised learning.

filtered_data <- data[,1:4]
head(filtered_data)

Use an elbow chart to choose the number of clusters. The wssplot function below comes from https://rpubs.com/violetgirl/201598

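set.seed(123)
# Plot the total within-cluster sum of squares for 1..nc clusters
wssplot <- function(data, nc=10, seed=123){
  # With one cluster, the sum of squares is (n - 1) times the summed column variances
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)  # same seed for every k so the plot is reproducible
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
  return(wss)
}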

Evaluate the elbow chart:

wssplot(filtered_data)

##  [1] 681.37060 152.34795  78.85144  71.44525  69.24240  48.07006  37.25063
##  [8]  32.96739  28.39518  26.26296
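
One way to read the elbow numerically is to look at how much each extra cluster reduces the within-group sum of squares. The sketch below copies the ten values printed above and computes the absolute and relative reductions.

```r
# Within-group sums of squares reported above for k = 1..10
wss <- c(681.37060, 152.34795, 78.85144, 71.44525, 69.24240,
         48.07006, 37.25063, 32.96739, 28.39518, 26.26296)

# Reduction gained by each additional cluster, absolute and relative
drop     <- -diff(wss)
rel_drop <- drop / wss[-length(wss)]
data.frame(k = 2:10, drop = round(drop, 2), rel_drop = round(rel_drop, 3))
```

The reductions are not strictly decreasing: moving from 5 to 6 clusters helps more than moving from 4 to 5. A likely cause is that wssplot runs kmeans with a single random start for each k, so some runs may have stopped at a local optimum; passing nstart to kmeans inside the loop would be one way to smooth the curve.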

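The curve flattens after k = 3: going from 2 to 3 clusters reduces the within-group sum of squares from 152.35 to 78.85, while a fourth cluster only reduces it to 71.44. So k = 3 is a reasonable choice, which agrees with the 3 species in the data.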

Now run kmeans with 3 clusters (nstart = 25 keeps the best of 25 random starts) and evaluate the results:

set.seed(123)
kcluster <- kmeans(filtered_data, centers=3, nstart=25)
print(kcluster)
## K-means clustering with 3 clusters of sizes 62, 38, 50
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.901613    2.748387     4.393548    1.433871
## 2     6.850000    3.073684     5.742105    2.071053
## 3     5.006000    3.428000     1.462000    0.246000
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   3   3   3   3   3   3   3   3   3   3   1   1   2   1   1   1   1   1   1   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   2   1   2   2   2   2   1   2   2   2   2   2   2   1   1   2   2   2   2   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   2   1   2   1   2   2   1   1   2   2   2   2   2   1   2   2   2   2   1   2 
## 141 142 143 144 145 146 147 148 149 150 
##   2   2   1   2   2   2   1   2   2   1 
## 
## Within cluster sum of squares by cluster:
## [1] 39.82097 23.87947 15.15100
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

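The ratio between_SS / total_SS = 88.4% is the share of the total variance explained by the clusters. It measures how compact and well separated the clusters are, not how accurately they predict flower type.

To see how well the clusters match the flower types, compare them with the true species labels using the following steps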

Add cluster results to the data

data$cluster <- as.factor(kcluster$cluster)

Compare the clusters with their true species to review accuracy

comparison <- table(data$Species, data$cluster)
print(comparison)
##             
##               1  2  3
##   setosa      0  0 50
##   versicolor 48  2  0
##   virginica  14 36  0
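
The table can be turned into a single accuracy figure by labelling each cluster with its most common species and counting the matches. The sketch below copies the counts from the table above; the majority-label rule is one reasonable convention, and here it assigns a different species to each cluster.

```r
# Counts copied from the comparison table above
# (rows = species, columns = cluster; matrix() fills column by column)
comparison <- matrix(c(0, 48, 14,
                       0,  2, 36,
                       50, 0,  0),
                     nrow = 3,
                     dimnames = list(c("setosa", "versicolor", "virginica"),
                                     c("1", "2", "3")))

# Label each cluster with its most common species and count the matches
correct  <- sum(apply(comparison, 2, max))  # 48 + 36 + 50 = 134
accuracy <- correct / sum(comparison)       # 134 / 150
round(accuracy, 4)
# 0.8933
```

So the clustering recovers the species for 134 of the 150 flowers. All 16 errors are confusions between versicolor and virginica; setosa is separated perfectly.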

We can also visualize the clusters for a better understanding of the groupings. Because there are four predictors, fviz_cluster plots the points on the first two principal components.

fviz_cluster(kcluster, data=filtered_data)
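
The question also asks for the best combination of predictors, which the run above does not explore. A possible sketch is to repeat the clustering for every subset of the four predictors and score each one against the species labels. The labels are used only for scoring, never for fitting. This assumes the built-in iris data matches iris.txt; cluster_accuracy and subsets are illustrative names.

```r
# Built-in iris data; columns 1-4 are the predictors, column 5 is Species
x <- iris[, 1:4]

# Share of points whose cluster's majority species matches their own species
cluster_accuracy <- function(cols, k = 3) {
  set.seed(123)
  km  <- kmeans(x[, cols, drop = FALSE], centers = k, nstart = 25)
  tab <- table(iris$Species, km$cluster)
  sum(apply(tab, 2, max)) / nrow(x)
}

# Every non-empty subset of the four predictors (15 in total)
subsets <- unlist(lapply(1:4, function(m) combn(names(x), m, simplify = FALSE)),
                  recursive = FALSE)

results <- data.frame(
  predictors = sapply(subsets, paste, collapse = " + "),
  accuracy   = sapply(subsets, cluster_accuracy)
)
results[order(-results$accuracy), ]  # best-scoring subsets first
```

The same loop could be extended over several values of k, keeping in mind that the majority-label score can only stay the same or rise as clusters are added, so it should be read alongside the elbow chart and not used alone to choose k.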