I chose the train.kknn method to first fine-tune the optimal K-value (48) and determine the best kernel ("optimal") based on the training data. The motivation was to expose the model to as much training data as possible, after obtaining the "best" set of hyper-parameters from cross-validation, before rolling out the final model on the test data.
Procedurally:
- These best parameters were used to calculate the best average accuracy score (84%) on the entire training data
- That same optimal K-value was then applied to the testing data and the accuracy score recalculated (80%)
The second part of this exercise was to manually partition the data set into 3 chunks: Training (70%), Validation (15%) and Testing (15%). The KKNN method was used to obtain the best K-value (7) and compute the accuracy on the validation data set, which was found to be 90%. Finally, this same K-value was plugged into the model against the test data, and the accuracy dropped to 82%.
Both instances clearly showed testing accuracy lower than training accuracy. One reason could be that the trained models were exposed to subtle nuances of the training data, such as random effects, which inflated the measured accuracy above its true level. Another reason might simply be that the distribution of the testing data differed from that of the training set, lowering the accuracy accordingly.
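As an illustrative check (not part of the original analysis), the distribution-shift explanation could be probed with two-sample Kolmogorov-Smirnov tests on each predictor, using the trainData/testData split constructed in the code below:
#Hedged illustration: compare train vs. test distributions per predictor (V1-V10);
#small p-values would suggest the test set is distributed differently
for (v in paste0("V", 1:10)) {
  p <- suppressWarnings(ks.test(trainData[[v]], testData[[v]])$p.value)
  cat(v, ": two-sample KS p-value =", round(p, 3), "\n")
}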
library(scales)
library(tidyr)
library(dplyr)
library(kableExtra)
library(knitr)
library(kernlab)
library(kknn)
require(ggthemes)
library(ggplot2)
library(tidyverse)
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(caret)
library(quantreg)
library("Hmisc")
library(corrplot)
library(gridExtra)
library(NbClust)
library("scatterplot3d")
#reading txt file
data <- read.table("credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE, sep = "")
head(data,5)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
set.seed(123)
#Generate a random sample of 80% of the rows
random_row<- sample(1:nrow(data),as.integer(0.8*nrow(data)))
trainData = data[random_row,]
#Assign the testData set to the remaining 20% of the original set
testData = data[-random_row,]
set.seed(123)
# Using LOOCV to determine optimal k values
modelloocv<-train.kknn((V11)~.,data=trainData ,kmax=100,scale=T)
#model results
summary(modelloocv)
##
## Call:
## train.kknn(formula = (V11) ~ ., data = trainData, kmax = 100, scale = T)
##
## Type of response variable: continuous
## minimal mean absolute error: 0.1873805
## Minimal mean squared error: 0.1035524
## Best kernel: optimal
## Best k: 48
#Training accuracy: leave-one-out predictions on the training data with the best k
predicted_train <- rep(0, nrow(trainData)) # predictions: start with a vector of all zeros
for (i in 1:nrow(trainData)) { # using the best k returned from LOOCV
  model <- kknn(V11~., trainData[-i,], trainData[i,], k = 48, kernel = "optimal", scale = TRUE)
  predicted_train[i] <- as.integer(fitted(model) + 0.5) # round to 0 or 1
}
# compute the accuracy once, after every row has a prediction
accuracy.kknn.train <- sum(predicted_train == trainData$V11) / nrow(trainData)
print(paste0("Performance on training data is ", percent(accuracy.kknn.train, 2)))
## [1] "Performance on training data is 84%"
cat("Use KKNN with k = 48","\n")
## Use KKNN with k = 48
#Testing accuracy: the same leave-one-out scheme applied within the test set
predicted_test <- rep(0, nrow(testData))
for (i in 1:nrow(testData)) {
  model <- kknn(V11~., testData[-i,], testData[i,], k = 48, kernel = "optimal", scale = TRUE)
  predicted_test[i] <- as.integer(fitted(model) + 0.5) # round to 0 or 1
}
# compute the accuracy once, after every row has a prediction
accuracy.kknn.test <- sum(predicted_test == testData$V11) / nrow(testData)
print(paste0("Performance on test data is ", percent(accuracy.kknn.test, 2)))
## [1] "Performance on test data is 80%"
Observation 1: The original credit card data was initially split into 80% training and 20% testing (hold-out). The LOOCV routine fine-tuned the optimal K-value and kernel of the model based on the training data. The “best” K = 48 was then re-fit on all of the training data to obtain the training accuracy (84%). Lastly, the same k-value of 48 was used to compute the accuracy on the test data (80%).
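For comparison, note that the test loop above performs leave-one-out within the test set itself. A more conventional hold-out evaluation (a minimal sketch, not what was run above) would fit on the training set and predict the untouched test set in one pass:
#Sketch: conventional hold-out scoring with the LOOCV-selected hyper-parameters
holdout.model <- kknn(V11~., trainData, testData, k = 48, kernel = "optimal", scale = TRUE)
holdout.pred <- as.integer(fitted(holdout.model) + 0.5)
sum(holdout.pred == testData$V11) / nrow(testData) # hold-out test accuracy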
set.seed(123)
#Splitting data into a test set & (Training+validation) set
mask_train<- sample(nrow(data), size=floor(nrow(data)*0.7)) # 70% training
cred_train<-data[mask_train,]#training data set
# Split the remaining into validation and test
temp = data[-mask_train, ]
temp2 = sample(nrow(temp), round(nrow(temp)*.5))
cc_validation = temp[temp2, ]#validation dataset
cc_test = temp[-temp2, ]#testing dataset
kmax<-100
acc <- rep(0,kmax)
for (k in 1:kmax) {
# fit k-nearest-neighbor model using training set, validate on validation set
knn.model <- kknn(V11~.,cred_train,cc_validation ,k=k,scale=TRUE)
# compare models using validation set
pred <- as.integer(fitted(knn.model)+0.5) # round off to 0 or 1
acc[k] = sum(pred == cc_validation$V11) / nrow(cc_validation)
}
# find best-performing KNN model on validation data; by determining the max values
df<-data.frame(acc)
print(paste0("The best KKNN Model training accuracy is: ", percent(max(df),2)))
## [1] "The best KKNN Model training accuracy is: 90%"
#determining the K-value that gives the max value
max_kvalue <-which.max(df$acc)
print(paste0("The K-value that best classifies the data points is ", max_kvalue))
## [1] "The K-value that best classifies the data points is 7"
#confusion matrix on validation data
#(re-fit at the best k first: `pred` is otherwise left over from the last loop iteration, k = kmax)
best.valid <- kknn(V11~., cred_train, cc_validation, k = max_kvalue, scale = TRUE)
pred <- as.integer(fitted(best.valid) + 0.5)
confusionMatrix(as.factor(pred), as.factor(cc_validation$V11))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 47 5
## 1 6 40
##
## Accuracy : 0.8878
## 95% CI : (0.808, 0.9426)
## No Information Rate : 0.5408
## P-Value [Acc > NIR] : 1.494e-13
##
## Kappa : 0.7744
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8868
## Specificity : 0.8889
## Pos Pred Value : 0.9038
## Neg Pred Value : 0.8696
## Prevalence : 0.5408
## Detection Rate : 0.4796
## Detection Prevalence : 0.5306
## Balanced Accuracy : 0.8878
##
## 'Positive' Class : 0
##
set.seed(123)
knn.test.model <- kknn(V11~.,cred_train,cc_test,k=which.max(acc),scale=TRUE)
pred <- as.integer(fitted(knn.test.model)+0.5) # round off to 0 or 1
acc.test<-sum(pred == cc_test$V11) / nrow(cc_test)
print(paste0("Performance on test data is ", percent(acc.test,2)))
## [1] "Performance on test data is 82%"
cat("Use KKNN with k = ",which.max(acc),"\n")
## Use KKNN with k = 7
#Confusion matrix for the test data set
confusionMatrix(as.factor(pred),as.factor(cc_test$V11))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 45 5
## 1 12 37
##
## Accuracy : 0.8283
## 95% CI : (0.7394, 0.8967)
## No Information Rate : 0.5758
## P-Value [Acc > NIR] : 7.732e-08
##
## Kappa : 0.656
##
## Mcnemar's Test P-Value : 0.1456
##
## Sensitivity : 0.7895
## Specificity : 0.8810
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.7551
## Prevalence : 0.5758
## Detection Rate : 0.4545
## Detection Prevalence : 0.5051
## Balanced Accuracy : 0.8352
##
## 'Positive' Class : 0
##
Observation 2: Here, the original credit card data was partitioned into 3 sets: Training (70%), Validation (15%) and Testing (15%). KKNN models were fit on the training data and scored on the validation data to determine the “best” k-value; the best k of 7 achieved a validation accuracy of 90%. Finally, this same best k-value of 7 was applied to the test data set to compute the test accuracy (82%).
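A quick optional visual (assuming the acc vector from the loop above is still in scope) makes the choice of k = 7 easier to inspect:
#Optional: validation accuracy as a function of k, with the chosen k marked
plot(1:kmax, acc, type = "l", xlab = "k", ylab = "Validation accuracy")
abline(v = which.max(acc), lty = 2) # dashed line at the best k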
Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.
Answer:
One interesting application of a clustering algorithm would be the segmentation of voters leading up to an election, to provide more tailor-made and persuasive messaging for each targeted demographic. Demographic data on voters could be gathered and clustering models built using the following predictor variables (a toy sketch follows the list):
- Median household income
- Geographic location
- Age
- Educational level
- Gender
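A minimal sketch of how such a segmentation might look, using entirely synthetic data (every variable name below is hypothetical):
#Toy voter segmentation with k-means on synthetic demographic data
set.seed(1)
voters <- data.frame(
  income    = rnorm(500, 60000, 15000),         # household income ($)
  age       = sample(18:90, 500, replace = TRUE),
  education = sample(1:5, 500, replace = TRUE), # ordinal education level
  urbanity  = runif(500)                        # 0 = rural, 1 = urban core
)
segments <- kmeans(scale(voters), centers = 4, nstart = 25)
table(segments$cluster) # segment sizes for targeted messaging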
This unsupervised learning exercise had 3 deliverables: the best combination of predictors, the accuracy level achieved with that set of predictors, and the optimal number of clusters.
In order to visualize how well the observations (species) were grouped together, we used cluster plots and iterated from k=2 to k=5 with nstart = 25. Because the data contains only 3 species, it was graphically evident that k=3 produced the best-delineated clusters with minimal overlap; most members of a species were allocated to the same cluster, with few cross-overs. The BSS/TSS ratio, a measure of "goodness of fit", was another tool used to judge whether additional clusters would yield higher-quality partitions. We realized, however, that blindly using this ratio as a guide can be misleading, as higher ratios may speciously indicate better clusterings.
Therefore, the Elbow Method was the deciding tool for picking the optimal number of clusters; the Elbow plot indicated that k=3 is optimal. The beauty of the Elbow plot is that it is both visual and numerical. It shows how WSS, the total squared distance of the data points from their respective cluster centroids, decreases as the number of clusters k increases. In other words, WSS is a measure of the tightness of each cluster, and the smaller the better. The "kink" or elbow of the plot tells the analyst that they are entering the zone of diminishing returns; increasing k beyond that kink to squeeze out lower WSS is simply uneconomical.
Finally, using the correlation matrix among the predictors along with some 3-D plots, we were able to iteratively pinpoint which combination of predictors gave the best results. The combination of:
- Sepal Width, Petal Length & Petal Width
gave the best-delineated clusters, with accuracy of roughly 87% (the Sepal Length, Petal Length & Petal Width subset scored essentially the same). This was not surprising, because Sepal Width is negatively correlated with the two petal measurements; predictors that are not positively correlated do not occupy the same direction in feature space, which makes the groups easier to separate.
Determine & Report the best combination of predictors, optimal k value & clustering prediction accuracy
3 distinct species (Setosa, Versicolor & Virginica) were used as references in this k-means clustering exercise
#Invoke iris from R
idat<-read.table("iris.txt", header = TRUE)
head(idat,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
#glimpse of the iris data set
summary(idat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Cluster plots for better visualization
Initially with all predictors in place
#Scaling the predictors
scaled.data <- scale(idat[1:4] )
k2 <- kmeans(scaled.data, centers = 2, nstart = 25)
k3 <- kmeans(scaled.data, centers = 3, nstart = 25)
k4 <- kmeans(scaled.data, centers = 4, nstart = 25)
k5 <- kmeans(scaled.data, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = scaled.data) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = scaled.data) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = scaled.data) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = scaled.data) + ggtitle("k = 5")
#Plotting clusters
grid.arrange(p1, p2, p3, p4, nrow = 2, top="Fig1: Cluster Plots")
table(k2$cluster, idat$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 50 50
table(k3$cluster, idat$Species)
##
## setosa versicolor virginica
## 1 0 39 14
## 2 50 0 0
## 3 0 11 36
table(k4$cluster, idat$Species)
##
## setosa versicolor virginica
## 1 25 0 0
## 2 0 39 14
## 3 0 11 36
## 4 25 0 0
table(k5$cluster, idat$Species)
##
## setosa versicolor virginica
## 1 25 0 0
## 2 0 2 27
## 3 25 0 0
## 4 0 27 21
## 5 0 21 2
Observation 3:
The quality of a k-means partition is found by calculating the percentage of the TSS “explained” by the partition using the following formula:
(BSS/TSS)×100%
where BSS and TSS stand for Between Sum of Squares and Total Sum of Squares, respectively. The higher the percentage, the better the score (and thus the quality), because it means that BSS is large and/or WSS is small.
q<-list()
for (i in 1:5) {
model <- kmeans(scaled.data, centers = i, nstart = 25)
BSS<-model$betweenss
TSS<-model$totss
q[i]<-(BSS/TSS)*100
}
#Listings of k values vs. "goodness of fit" (BSS/TSS ratio)
a_mat1<-as.matrix(q)
rownames(a_mat1) <- c("k=1","k=2","k=3","k=4","k=5")
colnames(a_mat1) <- c('% explained by K-partition')
a_mat1
## % explained by K-partition
## k=1 1.907497e-14
## k=2 62.93972
## k=3 76.69658
## k=4 80.98463
## k=5 84.8654
Observation 4:
With the cluster plots suggesting that 3 clusters is best, we double-checked the quality of the partitions.
In particular, we want the data points to be tightly "clumped" together within each cluster: a small within-cluster sum of squares (WSS).
On the other hand, we want a large between-cluster sum of squares (BSS).
Therefore, the higher the BSS/TSS ratio, the better the overall quality of the partition. But this can be deceiving: more clusters naturally lead to a higher ratio, and this will always be the case, since with more classes the partition is finer and the BSS contribution larger, while the "model" grows more complex. In the extreme case where k = n (each observation is a singleton cluster), BSS = TSS and the partition has lost all meaning.
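A quick sanity check of that extreme-case argument (illustrative only; 50 clusters have no substantive meaning for 150 observations):
#As k grows toward n, BSS/TSS climbs toward 1 regardless of cluster quality
k.big <- kmeans(scaled.data, centers = 50, nstart = 5)
k.big$betweenss / k.big$totss # close to 1 even though the clusters are meaningless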
# Elbow method
fviz_nbclust(scaled.data, kmeans, method = "wss") +
geom_vline(xintercept = 3, linetype = 5) + # add line for better visualization
labs(subtitle = "Fig2: Elbow method")+theme_economist()
Observation 5:
From the previous discussion, we were faced with how to choose an optimal number of partitions; in particular, what a good cut-off k value is. To break this tie, we turned to the Elbow Method, which visually and numerically confirms the optimal number of clusters. It plots the diminishing returns in total WSS as the number of clusters increases: the curve drops rapidly at first and then flattens out as k grows, showing where added model complexity starts to outweigh any incremental gain.
The Elbow test resulted in an optimal cluster count of 3 in this case.
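As a complementary check not in the original analysis, the average silhouette width gives a second opinion on k and avoids the elbow's occasional ambiguity:
#Silhouette method: the k with the highest average silhouette width is preferred
fviz_nbclust(scaled.data, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")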
With all Predictors
With Clusters = 3 (determined and explained above)
kmeans_accuracy <-list()
kmeans_accuracyALL<-(50+39+36)/150 # correctly-clustered counts read off the k3 table above
kmeans_accuracy[1]<-kmeans_accuracyALL
cat("Prediction Accuracy (w/ all Predictors) = ",percent(kmeans_accuracyALL,3),"\n")
## Prediction Accuracy (w/ all Predictors) = 84%
Determining the best combination of predictors by:
- Correlation plot (scaled data) between predictors
- 3-dimensional plots
set.seed(2)
#correlation between independent variables
corrmatrix <- cor(scaled.data )
corrplot(corrmatrix, method = 'number',type='upper',bg="lightblue")
mtext("Fig3: Correlation Matrix", at=-.3, line=-14, cex=1.)
#function for 3-D plotting of any three predictor columns of idat
plot.type <- function(x1,x2,x3)
{
  par(bg = 'lightblue', fg = 'black')
  colors <- c("red", "blue", "green")
  colors <- colors[as.numeric(as.factor(idat$Species))] # one color per species
  # x, y and z coordinates (column indices of idat)
  coord <- c(x1,x2,x3)
  s3d <- scatterplot3d(idat[,coord], pch = 16, angle = 125, color = colors, main = "IRIS 3-D Plot")
  legend("bottom", legend = levels(as.factor(idat$Species)),
         col = c("red", "blue", "green"), pch = 16, inset = -0.32, xpd = TRUE, horiz = TRUE)
}
#Predictors: Sepal Length, Sepal Width & Petal Length
plot.type(1,2,3)
#Predictors: Sepal Length, Sepal Width & Petal Width
plot.type(1,2,4)
#Predictors: Sepal Length, Petal Length & Petal Width
plot.type(1,3,4)
#Predictors: Sepal Width, Petal Length & Petal Width
plot.type(2,3,4)
#Cluster-vs-species table for a given trio of predictor columns
kmean.acc <- function(x1,x2,x3)
{
coord1<-c(x1,x2,x3)
kcombo<-kmeans(scaled.data[,coord1],3,nstart=25)
table(kcombo$cluster, idat$Species)
}
kmean.acc(1,2,3)
##
## setosa versicolor virginica
## 1 1 35 13
## 2 49 0 0
## 3 0 15 37
kmeans_accuracy1<-(49+35+37)/150
kmeans_accuracy[2]<-kmeans_accuracy1
cat("Prediction Accuracy (w/ Sepal.Length, Sepal.Width & Petal.Length) = ",percent(kmeans_accuracy1,3),"\n")
## Prediction Accuracy (w/ Sepal.Length, Sepal.Width & Petal.Length) = 81%
kmean.acc(1,2,4)
##
## setosa versicolor virginica
## 1 0 12 34
## 2 49 0 0
## 3 1 38 16
kmeans_accuracy2<-(49+38+34)/150
kmeans_accuracy[3]<-kmeans_accuracy2
cat("Prediction Accuracy (w/ Sepal.Length, Sepal.Width & Petal.Width) = ",percent(kmeans_accuracy2,3),"\n")
## Prediction Accuracy (w/ Sepal.Length, Sepal.Width & Petal.Width) = 81%
kmean.acc(2,3,4)
##
## setosa versicolor virginica
## 1 49 0 0
## 2 1 42 12
## 3 0 8 38
kmeans_accuracy3<-(49+42+38)/150
kmeans_accuracy[4]<-kmeans_accuracy3
cat("Prediction Accuracy (w/ Sepal.Length, Petal.Width & Petal.Length) = ",percent(kmeans_accuracy3,3),"\n")
## Prediction Accuracy (w/ Sepal.Length, Petal.Width & Petal.Length) = 87%
kmean.acc(1,3,4)
##
## setosa versicolor virginica
## 1 0 44 14
## 2 0 5 36
## 3 50 1 0
kmeans_accuracy4<-(50+44+36)/150
kmeans_accuracy[5]<-kmeans_accuracy4
cat("Prediction Accuracy (w/ Petal.Length, Sepal.Width & Petal.Width) = ",percent(kmeans_accuracy4,3),"\n")
## Prediction Accuracy (w/ Petal.Length, Sepal.Width & Petal.Width) = 87%
a_mat2<-as.matrix(kmeans_accuracy)
rownames(a_mat2) <- c("All 4 Predictors","Sepal.Length, Sepal.Width & Petal.Length","Sepal.Length, Sepal.Width & Petal.Width","Sepal.Width, Petal.Length & Petal.Width","Sepal.Length, Petal.Length & Petal.Width")
colnames(a_mat2) <- c('Accuracy %')
a_mat2
## Accuracy %
## All 4 Predictors 0.8333333
## Sepal.Length, Sepal.Width & Petal.Length 0.8066667
## Sepal.Length, Sepal.Width & Petal.Width 0.8066667
## Sepal.Width, Petal.Length & Petal.Width 0.86
## Sepal.Length, Petal.Length & Petal.Width 0.8666667
Observation 6:
First, it should be noted that predictions or accuracy measures are normally almost non-existent in unsupervised learning; for academic purposes, we simply happen to know which species each observation belongs to.
Having said that, the correlation matrix between the 4 predictors was used as a guide to iteratively determine which predictor combinations produced the best delineation among the cluster groups while still returning the best accuracy levels.
The combination of Sepal.Width, Petal.Length & Petal.Width produced the best-delineated clusters with minimal overlap and an accuracy of roughly 87% (the Sepal.Length, Petal.Length & Petal.Width subset scored essentially the same, 0.867 vs. 0.860). Sepal.Width is negatively correlated with the two petal measurements; predictors that are not positively correlated do not occupy the same direction in feature space, making the groups easier to separate.
Lastly, it should be noted that Setosa was the most distinguishable species regardless of which predictors were used.
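As an aside, the hand-tallied accuracies above can be reproduced programmatically; a small helper (not part of the original analysis) assigns each cluster its majority species and scores the matches:
#Majority-vote clustering "accuracy": credit each cluster with its most common species
cluster.accuracy <- function(clusters, labels) {
  tab <- table(clusters, labels)
  sum(apply(tab, 1, max)) / length(labels)
}
cluster.accuracy(k3$cluster, idat$Species) # reproduces the all-predictor figure, (50+39+36)/150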