From a sample of a German credit card transaction dataset, I used the support vector machine (ksvm) and k-nearest-neighbors (kknn) functions in R to find a good classifier for identifying potential customers who are less likely to default on their debts.
The techniques implemented include (a) cross-validation for the k-nearest-neighbors model and (b) splitting the sample data into training, validation, and test sets.
#use set.seed() to make the analysis reproducible
set.seed(1)
data <- read.delim("credit_card_data-headers.txt")
head(data)
## A1 A2 A3 A8 A9 A10 A11 A12 A14 A15 R1
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1
Usually the full dataset would be split into training, validation, and test sets: the training set is used to build the model, the validation set to choose the best model, and the test set to evaluate the chosen model's performance. In the case of the k-nearest-neighbors model, however, cross-validation handles model training and validation together, so there is no need to set aside a separate validation set; a combined training & validation set plus a test set suffices.
#First, choose to use the k-nearest-neighbors model here
#set aside 20% as the test data, and 80% for the training and validation set (combined)
#Use random sampling method to get the subsets
#nrow(data)
Sample1<- sample(1:654, 131)
training <- data[-Sample1, ]
testing <- data[Sample1, ]
#Sample2<- sample(1:262, 131)
#validation <- data[Sample1, ][-Sample2, ]
#testing <- data[Sample1, ][Sample2, ]
#confirming on the data splitting
nrow(training)
## [1] 523
nrow(testing)
## [1] 131
Then build the kknn models using the training data set. As in HW1, we need to decide on a good value of k, and we will use cross-validation to compare models built with different k values and pick the best one. This time, however, we use only the training set for this step instead of the full data set. For simplicity, the training set is split into 10 random folds (a common rule-of-thumb number of folds) for cross-validation.
library(kknn)
#First, to ensure randomization, shuffle the data
training <- training[sample(nrow(training)),]
#Secondly, create labels for 10 random folds out of the shuffled data
folds <- cut(seq(1,nrow(training)), breaks=10, labels=FALSE)
#tested on one fold first (kept commented out in the final version)
# val_label0 <- which(folds == 1, arr.ind = TRUE)
# val_fold0 <- training[val_label0, ]
# train_folds0 <- training[-val_label0, ]
# model_kknn_0 <- kknn(R1~., train_folds0, val_fold0, k = 12, distance = 2,kernel = "optimal",scale = TRUE)
# predict.kknn0 <- predict(model_kknn_0)
# predict.kknn0
# fit0 <- as.integer(predict.kknn0 + 0.5)
# table(fit0, val_fold0[,11])
# accuracy0 <- sum(fit0 == val_fold0[,11])/nrow(val_fold0)
# accuracy0
#develop a mean accuracy function to report back the average accuracy rate of a kknn model with a given K value using 10-fold cross validation
mean_accuracy <- function(X){
model_kknn <- vector(10, mode = "list")
fit <- vector(10, mode = "list")
accuracy <- vector(10, mode = "list")
for (i in 1:10){
val_labels <- which(folds == i, arr.ind = TRUE)
val_fold <- training[val_labels, ]
train_folds <- training[-val_labels, ]
model_kknn[[i]] <- kknn(R1~., train_folds, val_fold, k = X, distance = 2,kernel = "optimal",scale = TRUE)
#round by applying as.integer to the fitted value + 0.5, so that a prediction of 0.5 or above is classified as 1
fit[[i]]<- as.integer(predict(model_kknn[[i]]) +0.5)
accuracy[[i]] <- sum(fit[[i]] == val_fold[,11])/nrow(val_fold)
}
#return the average accuracy rate of the 10 folds model fits
avg_accuracy <- mean(unlist(accuracy))
return(avg_accuracy)
}
#test with K=10
mean_accuracy(10)
## [1] 0.8565675
Now, loop through a range of K values to calculate their corresponding models’ accuracy.
k_val <- c(1:100)
accuracy_set <- vector(100, mode = "list")
for (k in 1:100){
accuracy_set[[k]] <- mean_accuracy(k)
}
Plot the different k values against the average accuracy of each model.
plot(k_val, accuracy_set)
lines(k_val, accuracy_set, lwd=1.5)
Find the optimal K with the highest average accuracy rate
max(unlist(accuracy_set))
## [1] 0.8584906
which.max(unlist(accuracy_set))
## [1] 17
Therefore, using 10-fold cross-validation on the training and validation set, the optimal k value is 17, with the highest average accuracy rate of 85.85%. Now, let's evaluate this chosen model (k = 17) by training it on the full training and validation set and predicting on the initially set-aside test set. Note that unlike ksvm, the kknn function does not return a formula with coefficients that new data can be fed into; it simply returns predictions for the test data it is given.
model_kknn_test <- kknn(R1~., training, testing, k = 17, distance = 2, kernel = "optimal",scale = TRUE)
accuracy_test <- sum(as.integer(predict(model_kknn_test) + 0.5) == testing[,11])/nrow(testing)
accuracy_test
## [1] 0.8320611
Therefore, the chosen model with k = 17 achieves an accuracy rate of 83.2% on the test set.
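As an optional check (a small sketch using the same 0.5 threshold rule as above), the confusion matrix on the test set shows how the errors split between the two classes:
#optional sketch: confusion matrix of the k = 17 kknn model on the test set
pred_test <- as.integer(predict(model_kknn_test) + 0.5)
table(predicted = pred_test, actual = testing$R1)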
Next, let's also try the ksvm function.
#First split the data into 60% training, 20% validation and 20% testing sets
Sample2 <- sample(1:654, 131)
test_svm <- data[Sample2, ]
train_val <- data[-Sample2, ]
Sample3 <- sample(1:523, 131)
val_svm <- train_val[Sample3, ]
train_svm <- train_val[-Sample3, ]
#verify the split
nrow(train_svm)
## [1] 392
nrow(val_svm)
## [1] 131
nrow(test_svm)
## [1] 131
Create a function that computes the validation accuracy rate of a ksvm model with C = x. From the last exercise, we learned that the rbfdot kernel produces the highest accuracy rate among all the kernels tested.
library(kernlab)
## Warning: package 'kernlab' was built under R version 4.0.3
#A function that checks accuracy of the model with C = x
check_accuracy <- function(x){
#build ksvm model
model_svm <- ksvm(as.matrix(train_svm[,1:10]),as.factor(train_svm[,11]),type="C-svc",kernel="rbfdot",C=x,scaled=TRUE)
#validate the result
return(sum(predict(model_svm, val_svm[,1:10]) == val_svm[,11]) / nrow(val_svm))
}
#test with C = 100
check_accuracy(100)
## [1] 0.8244275
Loop through a range of different C values to test their accuracy rates.
C_val <- c(0.001, 0.01, 0.1, 1,10,100,1000,10000,100000,1000000,10000000,100000000,1000000000,10000000000)
accuracy <- vector(length(C_val), mode = "list")
for (i in seq_along(C_val)){
accuracy[[i]] <- check_accuracy(C_val[[i]])
}
Plot out the varying accuracy rates with the changing C values.
plot(log10(C_val), accuracy)
lines(log10(C_val), accuracy, lwd=1.5)
#identify the max accuracy rate achieved
max(unlist(accuracy))
## [1] 0.8549618
C_val[[which.max(unlist(accuracy))]]
## [1] 0.1
Therefore, the highest validation accuracy rate, 0.8549618, is achieved when C = 0.1. Let's retrain the ksvm model with C = 0.1 on the combined training and validation set, and then report its results on the test set. (Since rbfdot is a nonlinear kernel, the coefficients extracted below should be read as a rough linear summary of the model rather than its exact decision function.)
#nrow(train_val)
model_svm_1 <- ksvm(as.matrix(train_val[,1:10]),as.factor(train_val[,11]),type="C-svc",kernel="rbfdot",C=0.1,scaled=TRUE)
#obtain the model formula
# calculate a1...am
a <- colSums(model_svm_1@xmatrix[[1]] * model_svm_1@coef[[1]])
a
## A1 A2 A3 A8 A9 A10 A11
## 0.5723173 2.5354337 3.2895226 5.6671757 16.6691976 -3.6517156 6.0478352
## A12 A14 A15
## -0.8200559 -2.5609595 5.3585085
length(a)
## [1] 10
#calculate a0
a0 <- model_svm_1@b
a0
## [1] -0.4577435
#print the classifier
sprintf("y = %s + %s * A1 + %s * A2 + %s * A3 + %s * A8 + %s * A9 + %s * A10 + %s * A11 + %s * A12 + %s * A14 + %s * A15", a0, a[1], a[2], a[3], a[4], a[5], a[6], a[7], a[8], a[9], a[10])
## [1] "y = -0.45774351924611 + 0.572317331806859 * A1 + 2.5354337298551 * A2 + 3.28952260804462 * A3 + 5.66717572959846 * A8 + 16.669197576736 * A9 + -3.6517156475674 * A10 + 6.04783515749494 * A11 + -0.820055932669521 * A12 + -2.56095950598715 * A14 + 5.35850851466814 * A15"
Then let’s report this model’s accuracy rate using the testing set.
#nrow(test_svm)
accuracy_1 <- sum(predict(model_svm_1, test_svm[,1:10]) == test_svm[,11]) / nrow(test_svm)
accuracy_1
## [1] 0.870229
Therefore, this classifier achieves an accuracy rate of 0.870229 on the test set.
#print the results
sprintf("The best kknn model had K = %s. The best ksvm model is when C = %s, and the predicting formula is: y = %s + %s * A1 + %s * A2 + %s * A3 + %s * A8 + %s * A9 + %s * A10 + %s * A11 + %s * A12 + %s * A14 + %s * A15", which.max(unlist(accuracy_set)), C_val[[which.max(unlist(accuracy))]], a0, a[1], a[2], a[3], a[4], a[5], a[6], a[7], a[8], a[9], a[10])
## [1] "The best kknn model had K = 17. The best ksvm model is when C = 0.1, and the predicting formula is: y = -0.45774351924611 + 0.572317331806859 * A1 + 2.5354337298551 * A2 + 3.28952260804462 * A3 + 5.66717572959846 * A8 + 16.669197576736 * A9 + -3.6517156475674 * A10 + 6.04783515749494 * A11 + -0.820055932669521 * A12 + -2.56095950598715 * A14 + 5.35850851466814 * A15"
Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.
As an urban planning consultant, an important issue I study in my day-to-day job is residential segregation - the physical separation of residential communities with significantly different socioeconomic status and opportunities. Such separation implies inequality of opportunity and, further, points to an unsustainable development pattern for the city itself. One task for which a clustering model would be useful is identifying distinct socioeconomic clusters of a city's population using census and other publicly available data, and then comparing the clusters against their locations to see whether a similar spatial pattern emerges. Significant overlap between the socioeconomic pattern and the spatial pattern is a warning sign for the city government and policy makers to work on fostering equity across the city region.
Some of the key predictors can include (a short illustrative sketch follows this list):
Median Household Income: a continuous predictor that generally varies significantly from city to city and among different parts of a city. For this analysis, census data can be collected at the census-tract level, which is granular enough to support potential city-wide policy recommendations.
Highest Level of Education: can be measured either as a categorical predictor (primary school, high school, college, graduate school, etc.) or as a continuous predictor counting the number of years of education received. Census data can again be collected at the census-tract level.
Percentage of Minority Population: a continuous predictor ranging from 0 to 100%, indicating what share of a census tract's population belongs to minority groups.
Number of Full-time Workers per Household: a continuous predictor indicating the general employment condition per household in different parts of the city. While it may be correlated with median household income (and other predictors), it is worth including because it also reveals the economic opportunities available to a household.
Percentage of Chronic Disease: most likely a continuous predictor showing the prevalence of major chronic diseases in each census tract. This predictor reflects the health condition of the population and is an important aspect of the equity topic.
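Below is a minimal, hypothetical sketch of how these predictors could feed into a k-means model; the data frame tracts and its column names are made up for illustration, and a real analysis would use actual census-tract data.
#hypothetical sketch: k-means on simulated census-tract data (all names and values are assumptions)
tracts <- data.frame(median_income   = rnorm(200, 60000, 20000),
                     years_education = rnorm(200, 13, 2),
                     pct_minority    = runif(200, 0, 100),
                     workers_per_hh  = rnorm(200, 1.5, 0.5),
                     pct_chronic     = runif(200, 5, 40))
#normalize each predictor to [0, 1] so no single variable dominates the distance calculation
tracts_scaled <- as.data.frame(lapply(tracts, function(x) (x - min(x)) / (max(x) - min(x))))
#cluster the tracts into, say, 4 socioeconomic groups; k could be tuned with an elbow plot as done for the iris data below
tract_clusters <- kmeans(tracts_scaled, centers = 4, nstart = 10)
table(tract_clusters$cluster)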
The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers and the response is the type of flower. The data is available from the R library datasets and can be accessed with iris once the library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris ). The response values are only given to see how well a specific method performed and should not be used to build the model.
Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.
#load the iris dataset
data_i_unscaled <- read.table("iris.txt")
head(data_i_unscaled)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(data_i_unscaled)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
nrow(data_i_unscaled)
## [1] 150
#take a look at one of the rows
data_i_unscaled[1,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
#check how many types of flowers are recorded in the data
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
count(data_i_unscaled, Species)
## Species n
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
So there are 50 data points for each of the three types. Now, let’s try to get a quick understanding of the relationship among the four different predictors, namely, Sepal Length, Sepal Width, Petal Length and Petal Width.
Let’s take a look at the predictors’ ranges.
range(data_i_unscaled$Sepal.Length)
## [1] 4.3 7.9
range(data_i_unscaled$Sepal.Width)
## [1] 2.0 4.4
range(data_i_unscaled$Petal.Length)
## [1] 1.0 6.9
range(data_i_unscaled$Petal.Width)
## [1] 0.1 2.5
Here we see that all four predictors take values below 10, i.e., on the order of 1. However, the fourth predictor, Petal Width, takes much smaller values than the other three, so let's scale the data. As mentioned in the lectures, we could use either scaling (normalization) or standardization in this step; the choice depends on whether we need the scaled data to stay in a value range that still makes sense. I tested both approaches and found that standardization turns some of the lengths and widths into negative values (which no longer makes sense for lengths and widths) and gives worse clustering results (a larger within-cluster sum of squares), while normalization improves them. (Strictly speaking, within-cluster sums of squares computed on differently scaled data are not directly comparable, so this comparison is only a rough indication.)
#create a function to scale the column values
normalize <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
#scale the four predictors columns
data_i <- as.data.frame(lapply(data_i_unscaled[1:4], normalize))
data_i$Species <- data_i_unscaled$Species
head(data_i)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 0.22222222 0.6250000 0.06779661 0.04166667 setosa
## 2 0.16666667 0.4166667 0.06779661 0.04166667 setosa
## 3 0.11111111 0.5000000 0.05084746 0.04166667 setosa
## 4 0.08333333 0.4583333 0.08474576 0.04166667 setosa
## 5 0.19444444 0.6666667 0.06779661 0.04166667 setosa
## 6 0.30555556 0.7916667 0.11864407 0.12500000 setosa
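As a quick sanity check (a small sketch), each normalized predictor should now span the range from 0 to 1:
#verify that every scaled predictor ranges from 0 to 1
sapply(data_i[, 1:4], range)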
#I also tested standardization using the method referenced here: https://stackoverflow.com/questions/23619188/r-scaling-numeric-values-only-in-a-dataframe-with-mixed-types
#data_i <- data_i_unscaled %>% mutate_if(is.numeric, scale)
#this method increased the sum of squares significantly.
#plotting pairwise graphs between the predictors
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p1 <- ggplot(data_i, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point()
p2 <- ggplot(data_i, aes(Sepal.Length, Petal.Length, color = Species)) + geom_point()
p3 <- ggplot(data_i, aes(Sepal.Length, Petal.Width, color = Species)) + geom_point()
p4 <- ggplot(data_i, aes(Sepal.Width, Petal.Length, color = Species)) + geom_point()
p5 <- ggplot(data_i, aes(Sepal.Width, Petal.Width, color = Species)) + geom_point()
p6 <- ggplot(data_i, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)
From the plots above, we can see that certain pairs of predictors separate the three species more clearly than others. For example, in the top-left plot of Sepal Length against Sepal Width, versicolor and virginica are mingled together and hard to separate, while in the bottom-right plot of Petal Length against Petal Width, the three species fall into three fairly distinct clusters.
One possible reason is that some pairs of predictors are much more strongly correlated than others, which leads to these different distribution patterns.
To take a closer look, let’s create a correlation matrix across all predictors.
#load the corrplot library
library(corrplot)
## corrplot 0.84 loaded
#?round
#?cor
# Compute a correlation matrix
corr <- round(cor(data_i[,1:4]), 2)
corrplot(corr, method = "circle")
From the correlation matrix above, it is clear that Petal Length is highly correlated with Petal Width, while Sepal Length and Sepal Width are relatively uncorrelated. For a highly correlated pair, the points of each species tend to line up along their own linear trend (think of two linear functions with different coefficients), so for a given value of one predictor the species take noticeably different values of the other, and the plot separates them easily. For an uncorrelated pair, the two predictors behave more independently within each species, so different species can end up with similar values for a given value of one predictor, and the plot looks congested.
However, note that in supervised learning, when choosing a good combination of predictors to predict the species effectively and efficiently, we would want to reduce redundant information, which is exactly what highly correlated predictors carry (multicollinearity in regression).
But for this clustering exercise, let’s test k-means clustering models with different numbers and combinations of predictors.
data_i_pred <- data_i[,1:4]
#Sepal Length, Sepal Width, Petal Length and Petal Width
cluster_0 <- kmeans(data_i_pred, centers = 3, nstart = 10)
#compare with original unscaled data
cluster_unscaled <- kmeans(data_i_unscaled[,1:4], centers = 3, nstart = 10)
#report back the total within-cluster sum of squares
cluster_0$tot.withinss
## [1] 6.982216
cluster_unscaled$tot.withinss
## [1] 78.85144
Normalization indeed gives a much smaller total within-cluster sum of squares (bearing in mind that sums of squares computed on different scales are only roughly comparable). Let's see whether using 3 predictors is better than using all 4, keeping all other parameters the same.
tts_three_pred <- rep(0, 4)
for (i in 1:4){
cluster_pred <- kmeans(data_i_pred[, -i], centers = 3, nstart = 10)
tts_three_pred[[i]] <- cluster_pred$tot.withinss
}
tts_three_pred
## [1] 4.481992 4.309118 6.107412 5.310904
Therefore, among the 3-predictor combinations, the one that drops Sepal Width (i.e., Sepal Length, Petal Length, and Petal Width together) gives the best result. What about 2-predictor combinations? Let's run the same test.
tts_two_pred <- rep(0, 6)
pairs <- combn(c(1:4),2)
pairs
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 1 2 2 3
## [2,] 2 3 4 3 4 4
for (i in 1:6){
cluster_pred <- kmeans(data_i_pred[, pairs[,i]], centers = 3, nstart = 10)
tts_two_pred[[i]] <- cluster_pred$tot.withinss
}
tts_two_pred
## [1] 4.115194 2.729245 3.485827 3.312646 3.576805 1.701875
From the above results, the lowest total within-cluster sum of squares is 1.701875, reached at i = 6, i.e., using only Petal Length and Petal Width as the predictors.
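As a small convenience (a sketch using the objects already defined above), the winning pair can be mapped back to the predictor names programmatically:
#map the best-performing pair index back to the predictor names
colnames(data_i_pred)[pairs[, which.min(tts_two_pred)]]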
Finally, let’s test the single predictor models.
tts_one_pred <- rep(0, 4)
for (i in 1:4){
cluster_pred <- kmeans(data_i_pred[, i], centers = 3, nstart = 10)
tts_one_pred[[i]] <- cluster_pred$tot.withinss
}
tts_one_pred
## [1] 1.2159043 0.9131406 0.7142459 0.8529817
It looks like the single-predictor models score even lower: the minimum within-cluster sum of squares is reached at i = 3, i.e., using only Petal Length as the predictor. Keep in mind, though, that the within-cluster sum of squares naturally shrinks as predictors are dropped, since fewer dimensions contribute to the distances, so this comparison tends to favor models with fewer predictors.
Now, let’s try to find the best optimal k or number of clusters.
#create a list to store the total within-cluster sum of squares for a range of candidate k values
tss_set <- vector(30, mode = "list")
#create a function that reports the total within-cluster sum of squares of a k-means model with a given number of clusters. Use nstart = 10 as it is neither too large nor too small, and for simplicity keep it fixed.
check_tss <- function(x){
#build the k-means model and return its total within-cluster sum of squares
cluster_x <- kmeans(data_i_pred[,3], centers = x, nstart = 10)
return(cluster_x$tot.withinss)
}
#test on k = 1 and k = 5
check_tss(1)
## [1] 13.33885
check_tss(5)
## [1] 0.2511179
#loop through k from 1 to 30, and plot the line between k values and sum of squares
for (k in 1:30){
tss_set[[k]] <- check_tss(k)
}
plot(c(1:30), tss_set, xlim = c(1,31), ylim = c(-2,20), main = "K Values and Total Within-cluster Sum of Squares", xlab = "K values", ylab = "Total Within-cluster Sum of Squares")
lines(c(1:30), tss_set, lwd=2, col = "blue")
text(c(1:30), tss_set, labels = round(unlist(tss_set),2), cex= 0.5, font = 2, pos = 1)
text(c(1:30), tss_set, labels = c(1:30), cex= 0.5, col = "red", font = 2, pos = 3)
Now obtain the marginal reduction in the total within-cluster sum of squares as k increases.
#create a list to store the marginal reduction values
margin_reduct <- rep(0, 29)
#run a loop to get the values
for (i in 1:29) {
margin_reduct[[i]] <- unlist(tss_set)[[i]] - tss_set[[i+1]]
}
#take a look at the marginal reductions
margin_reduct
## [1] 11.3967730126 1.2377851247 0.3429738618 0.1115281653 0.0782937139
## [6] 0.0495761959 0.0233344389 0.0075920190 0.0255148700 0.0069949945
## [11] 0.0102964770 0.0062131714 0.0008321609 0.0103133448 0.0004805328
## [16] 0.0039034933 0.0058558095 0.0019144952 0.0005085296 0.0013787617
## [21] 0.0025485287 0.0008866676 -0.0007982055 0.0032013304 0.0023479553
## [26] 0.0005294563 0.0004009915 0.0008610233 0.0012611339
Plot the marginal reduction by increasing k values.
#use barplot function to plot
barplot(margin_reduct,
main = "K Values and Marginal Reduction of Sum of Squares",
xlab = "K values",
ylab = "Marginal Reduction",
ylim = c(-2,18),
col = "darkred",
horiz = FALSE)
text(c(1:29), margin_reduct, labels = round(margin_reduct,2), col = "red", cex= 0.5, font = 2, pos = 3)
#add the original sum of squares to the labels
text(c(1:30), tss_set, labels = round(unlist(tss_set),2), col = "blue", cex= 0.5, font = 2.5, pos = 3, offset = 2)
Therefore, from the bar chart above, we can see that when k increases from 1 to 2, the total within-cluster sum of squares drops by 11.4 from 13.34, an 85.4% drop. When k goes from 2 to 3, the drop is 1.24, about 9.3% of the original sum of squares. When k increases to 4, the drop is 0.34, only about 2.6% of the original and about 3.0% of the first drop. Beyond that, the marginal reductions are even smaller, less than 1% of the first drop.
Let's require the marginal reduction to exceed 1.33, or 10% of the original total within-cluster sum of squares, for an increase in k to be worthwhile. Under that rule the optimal k is 3, where the total sum of squares is 0.71, about 5.4% of the original. This also makes sense because we are trying to identify clusters corresponding to the three types of flowers here.
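For reference, the percentage drops quoted above can be reproduced directly from the stored values (a short sketch using the objects already defined):
#each marginal reduction expressed as a percentage of the k = 1 total sum of squares
round(margin_reduct / unlist(tss_set)[1] * 100, 1)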
Next, let’s try to evaluate how well this model predicts the flower types.
#refit the chosen model (k = 3, Petal Length only) and record its total within-cluster sum of squares
cluster_best <- kmeans(data_i_pred[,3], centers = 3, nstart = 10)
tss_best <- cluster_best$tot.withinss
tss_best
## [1] 0.7142459
Compare it with the model with Petal Length and Width as predictors:
#for comparison, refit the model with Petal Length and Petal Width as the predictors
cluster_best_1 <- kmeans(data_i_pred[,3:4], centers = 3, nstart = 10)
tss_best_1 <- cluster_best_1$tot.withinss
tss_best_1
## [1] 1.701875
In fact, I later found that if the data is scaled with the scale() function instead, Petal Length and Petal Width together become the best combination of predictors, rather than Petal Length alone!
#take a quick look at the scaled predictors
head(data_i_pred)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   0.22222222   0.6250000   0.06779661  0.04166667
## 2   0.16666667   0.4166667   0.06779661  0.04166667
## 3   0.11111111   0.5000000   0.05084746  0.04166667
## 4   0.08333333   0.4583333   0.08474576  0.04166667
## 5   0.19444444   0.6666667   0.06779661  0.04166667
## 6   0.30555556   0.7916667   0.11864407  0.12500000
#print the clustering result
cluster_best
## K-means clustering with 3 clusters of sizes 49, 50, 51
##
## Cluster means:
## [,1]
## 1 0.54721550
## 2 0.07830508
## 3 0.77234962
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1
## [75] 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 1 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3
## [149] 3 3
##
## Within cluster sum of squares by cluster:
## [1] 0.26831370 0.04245332 0.40347883
## (between_SS / total_SS = 94.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Now let's evaluate and visualize the clustering results.
#create a table comparing the cluster assignments with the original flower types. The cluster labels are reported in the same row order as the original data, so the table() function can be used directly.
table(cluster_best$cluster, data_i$Species)
##
## setosa versicolor virginica
## 1 0 46 3
## 2 50 0 0
## 3 0 4 47
From this table, let's use the ratio of successfully clustered data points to the total number of data points as the measure of quality. Here a point of a given flower type counts as successfully clustered if it falls in the same cluster as the majority of that type.
We can see that a total of 143 (50 + 46 + 47) data points land in the cluster dominated by their own species, and only 7 data points are clustered differently. The success ratio is therefore 143/150 = 95.33%.
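This ratio can also be computed directly from the table (a minimal sketch, assuming each cluster is labeled by its majority species):
#success ratio: sum of each cluster's majority count divided by the total number of points
conf_tab <- table(cluster_best$cluster, data_i$Species)
sum(apply(conf_tab, 1, max)) / nrow(data_i)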
#visualize the three clusters
#append the cluster ID to the original dataset
data_i_unscaled$cluster <- cluster_best$cluster
ggplot(data_i_unscaled, aes(Petal.Length, cluster, color = Species)) + geom_point()
The plot above shows three clear clusters, with a few exceptions between the versicolor and virginica types.
sprintf("Best combination of predictors is using Petal Length as the predictor. Best value of k is 3, which would give a minimum total within-cluster sum of squares of %s, and a success clustering ratio of 0.9533",tss_best)
## [1] "Best combination of predictors is using Petal Length as the predictor. Best value of k is 3, which would give a minimum total within-cluster sum of squares of 0.71424585002056, and a success clustering ratio of 0.9533"