K-Means

Running Code

Selection of data to be used

data <- state.x77[, c("Population", "Income", "Illiteracy")]
head(data)

           Population Income Illiteracy
Alabama          3615   3624        2.1
Alaska            365   6315        1.5
Arizona          2212   4530        1.8
Arkansas         2110   3378        1.9
California      21198   5114        1.1
Colorado         2541   4884        0.7

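# Scale the data so each variable has mean 0 and standard deviation 1
scaled_data <- scale(data)

library(stats)  # attached by default; provides kmeans()

# Apply the K-Means algorithm for 3, 4 and 5 clusters.
# kmeans() starts from randomly chosen centers, so cluster labels and sizes
# can differ between runs unless set.seed() is called first.
k3 <- kmeans(scaled_data, centers = 3)
k4 <- kmeans(scaled_data, centers = 4)
k5 <- kmeans(scaled_data, centers = 5)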


# Extract the cluster assignment of each state

cluster_3 <- k3$cluster
cluster_4 <- k4$cluster
cluster_5 <- k5$cluster

2) What are the sizes of clusters in the 3-cluster case?

# Calculate the number of observations in each cluster
cluster_3 <- k3$cluster
table(cluster_3)

cluster_3
 1  2  3 
29 12  9

3) What are the sizes of clusters in the 4-cluster case?

# Calculate the number of observations in each cluster
cluster_4 <- k4$cluster
table(cluster_4)

cluster_4
 1  2  3  4 
 6  9 23 12

4) What are the sizes of clusters in the 5-cluster case?

# Calculate the number of observations in each cluster
cluster_5 <- k5$cluster
table(cluster_5)

cluster_5
 1  2  3  4  5 
 8 10 10 16  6

5) For the 3-cluster case, give two examples from the cluster that includes Alabama.

# Find which cluster Alabama is in
cluster_alabama <- k3$cluster["Alabama"]

# Select the first two states in the cluster containing Alabama
cluster_data <- subset(scaled_data, k3$cluster == cluster_alabama)
two_samples_alabama <- head(cluster_data, 2)


print(two_samples_alabama)

        Population     Income Illiteracy
Alabama -0.1414316 -1.3211387   1.525758
Arizona -0.4556891  0.1533029   1.033578

6) For the 4-cluster case, give two examples from the cluster that includes Alaska.

# Find which cluster Alaska is in
cluster_alaska <- k4$cluster["Alaska"]

# Select the first two states in the cluster containing Alaska
cluster_data_alaska <- subset(scaled_data, k4$cluster == cluster_alaska)
two_samples_alaska <- head(cluster_data_alaska, 2)

print(two_samples_alaska)

           Population   Income Illiteracy
Alaska      -0.869398 3.058246   0.541398
California   3.796979 1.103716  -0.114842

7) For the 5-cluster case, give two examples from the cluster that includes Arizona.

# Find which cluster Arizona is in
cluster_arizona <- k5$cluster["Arizona"]

# Select the first two states in the cluster containing Arizona
cluster_data_arizona <- subset(scaled_data, k5$cluster == cluster_arizona)
two_samples_arizona <- head(cluster_data_arizona, 2)

print(two_samples_arizona)

        Population     Income Illiteracy
Arizona -0.4556891  0.1533029   1.033578
Georgia  0.1533389 -0.5611340   1.361698

8) Compare the overall quality of the three solutions. Which K value would you select for this data set? Why?

library(ggplot2)
library(factoextra)
 
fviz_nbclust(scaled_data, FUN = kmeans, method = 'silhouette')

The silhouette coefficient takes a value between -1 and 1.

- Close to 1 indicates that the point is well matched to its own cluster and clearly separated from the other clusters.
- Close to 0 indicates that the point lies on the border between two clusters and is not clearly separated from the neighboring one.
- Close to -1 indicates that the point has probably been assigned to the wrong cluster and would fit better in a neighboring one.
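To compare the three fitted solutions directly, the average silhouette width can be computed for k = 3, 4 and 5. The following is a minimal sketch, not necessarily what fviz_nbclust() does internally: it assumes the cluster package is installed (it ships with standard R distributions), and the seed and nstart = 25 are illustrative choices that are not part of the original analysis, so the resulting clusters may differ from the ones printed above.

```r
library(cluster)  # provides silhouette()

# Rebuild the scaled data so this block runs on its own
scaled_data <- scale(state.x77[, c("Population", "Income", "Illiteracy")])
d <- dist(scaled_data)  # Euclidean distances between states

set.seed(42)  # assumption: arbitrary seed, used only for repeatability

# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(3:5, function(k) {
  fit <- kmeans(scaled_data, centers = k, nstart = 25)  # nstart is an assumption
  sil <- silhouette(fit$cluster, d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- paste0("k=", 3:5)

print(round(avg_sil, 3))
```

A higher average width suggests more compact, better separated clusters, so the k with the largest value is the natural candidate under this criterion.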

fviz_nbclust(scaled_data, FUN = kmeans, method = 'gap_stat')

The gap statistic compares the total intracluster variation for different values of k with its expected value under a null reference distribution of the data (i.e. a distribution with no obvious clustering). The reference data sets are generated by Monte Carlo simulation: for each variable (xi) in the data set we compute its range [min(xi), max(xi)] and draw values for the n points uniformly from that interval.

For both the observed data and the reference data, the total intracluster variation is computed for each value of k. The gap for a given k is the difference between the expected log intracluster variation under the reference distribution and the observed one; a larger gap means the clustering structure is further from what would be expected with no clusters.
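The procedure described above can be sketched with clusGap() from the cluster package. This is an illustration under stated assumptions rather than the exact computation behind the plot: the seed, B = 50 reference data sets, K.max = 8 and nstart = 25 are arbitrary choices, and spaceH0 = "original" is set so that the reference points are drawn uniformly from each variable's range, as in the description.

```r
library(cluster)  # provides clusGap() and maxSE()

# Rebuild the scaled data so this block runs on its own
scaled_data <- scale(state.x77[, c("Population", "Income", "Illiteracy")])

set.seed(42)  # assumption: arbitrary seed, used only for repeatability

# B Monte Carlo reference data sets, sampled uniformly over each variable's range
gap <- clusGap(scaled_data, FUNcluster = kmeans, K.max = 8, B = 50,
               spaceH0 = "original", verbose = FALSE, nstart = 25)

# One row per k: logW (observed), E.logW (reference), gap, SE.sim
print(round(gap$Tab, 3))

# Pick k with the "first SE max" rule, one of several selection rules
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(best_k)
```

The gap column equals E.logW minus logW, which is the difference described in the text. Other selection rules (for example method = "globalmax") can give a different k, so the result is best read alongside the silhouette plot.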

9) Visualize your choice in the previous question on a scatter plot (Population on the x-axis, Income on the y-axis). Provide your R code and the output.

library(ggplot2)
library(plotly)

# Reuse the 3-cluster solution fitted on the scaled data above.
# ggplot() expects a data frame, so convert the matrix and attach the labels.
plot_data <- as.data.frame(data)
plot_data$Cluster <- as.factor(k3$cluster)

# Create a scatter plot to visualize the clusters
scatter_plot <- ggplot(plot_data, aes(x = Population, y = Income, color = Cluster)) +
  geom_point(size = 4) +
  scale_color_manual(values = c("green", "purple", "yellow")) +
  labs(title = "K-Means Clusters (k = 3) by Population and Income",
       x = "Population",
       y = "Income") +
  theme_minimal()

# Convert the static plot into an interactive plotly figure
ggplotly(scatter_plot)