Introduction

This week’s assignment revolves around conducting a k-means cluster analysis, an unsupervised machine learning method. The primary objective is to partition observations in a data set into distinct clusters based on the similarity of responses on multiple variables. The clustering variables should predominantly be quantitative, although binary variables can also be considered.
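To make the idea of partitioning by similarity concrete, here is a minimal sketch of a Lloyd-style k-means loop in base R. It is for illustration only: `simple_kmeans` and the `toy` data are hypothetical names introduced here, and R's built-in `kmeans()` uses the Hartigan-Wong algorithm by default rather than this exact procedure.

```r
# Minimal Lloyd-style k-means, for illustration only (not the stats::kmeans default)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen rows as centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every row to every centroid
    d <- vapply(
      seq_len(k),
      function(j) rowSums(sweep(x, 2, centers[j, ])^2),
      numeric(nrow(x))
    )
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update step: move each centroid to the mean of its current members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# Hypothetical toy data: two groups of 20 points centred near 0 and near 5
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

The two alternating steps (assign each observation to its nearest centroid, then recompute the centroids) are the core of the method; the result depends on the random starting centroids, which is why the analysis below uses several random starts.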

Task Overview

The assignment requires running a k-means cluster analysis to identify subgroups of observations in the data set exhibiting similar response patterns to a set of clustering variables. The ultimate goal is to present the syntax used for the analysis, the corresponding output, and a concise written summary.

Steps Taken

Data Preparation: Generate a random dataset for analysis.

# Load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Set seed for reproducibility
set.seed(123)

# Create a random dataset with 100 observations and 4 variables
data <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100),
  binary_var1 = sample(c(0, 1), 100, replace = TRUE),
  binary_var2 = sample(c(0, 1), 100, replace = TRUE)
)

# Check the structure of the dataset
str(data)

## 'data.frame':    100 obs. of  4 variables:
##  $ var1       : num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
##  $ var2       : num  -0.71 0.257 -0.247 -0.348 -0.952 ...
##  $ binary_var1: num  0 0 0 0 0 0 1 0 1 0 ...
##  $ binary_var2: num  1 1 1 1 0 0 0 0 1 0 ...

Variable Selection: Identify and select clustering variables for the analysis.

# Select relevant quantitative and binary variables
clustering_vars <- data %>%
  select(var1, var2, binary_var1, binary_var2)
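Because k-means relies on Euclidean distance, variables with a larger spread weigh more heavily in the solution. Here the 0/1 variables vary less than the standard-normal ones, so one common option is to standardize the columns first. The sketch below shows how that could be done with base R's `scale()`; it is an optional step that the analysis in this report does not apply, and `scaled_vars` is a name introduced here for illustration.

```r
# Recreate the example data so this sketch runs on its own
set.seed(123)
data <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100),
  binary_var1 = sample(c(0, 1), 100, replace = TRUE),
  binary_var2 = sample(c(0, 1), 100, replace = TRUE)
)

# Standardize every column to mean 0 and standard deviation 1
scaled_vars <- as.data.frame(scale(data))

# Check the result: column means are ~0 and standard deviations are 1
round(colMeans(scaled_vars), 10)
sapply(scaled_vars, sd)
```

Passing `scaled_vars` to `kmeans()` instead of the raw variables would give different clusters from the ones reported below, so the two sets of results should not be mixed.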

K-Means Cluster Analysis: Run the k-means cluster analysis.

# kmeans() lives in the stats package, which R attaches at startup, so this call is optional
library(stats)

# Specify the number of clusters (change 3 to your chosen value)
k <- 3

# Run k-means clustering
kmeans_result <- kmeans(clustering_vars, centers = k, nstart = 25)

# Print k-means results
kmeans_result

## K-means clustering with 3 clusters of sizes 36, 36, 28
## 
## Cluster means:
##          var1       var2 binary_var1 binary_var2
## 1  0.98160173 -0.4519972   0.6111111   0.5000000
## 2 -0.66024342 -0.6706386   0.2777778   0.5555556
## 3 -0.09029673  1.0592932   0.6428571   0.6785714
## 
## Clustering vector:
##   [1] 2 3 1 2 2 1 1 2 2 3 1 3 1 1 3 1 1 2 1 2 2 2 2 2 3 2 1 1 2 1 3 3 1 1 1 3 1
##  [38] 3 3 2 3 2 2 1 1 2 2 3 3 2 3 3 3 1 2 1 3 1 3 1 3 2 2 3 2 3 3 2 1 1 2 2 1 3
##  [75] 2 1 2 2 3 2 2 3 2 1 2 1 3 1 3 1 1 1 1 2 1 3 1 1 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 45.49706 36.84375 35.32291
##  (between_SS / total_SS =  47.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
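The value k = 3 was chosen by hand. One common way to sanity-check such a choice is the elbow method: fit k-means for a range of k and compare the total within-cluster sum of squares. The following is a minimal sketch under that assumption; `ks`, `wss`, and `elbow` are names introduced here, and the range 1 to 8 is arbitrary.

```r
# Recreate the example data so this sketch runs on its own
set.seed(123)
data <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100),
  binary_var1 = sample(c(0, 1), 100, replace = TRUE),
  binary_var2 = sample(c(0, 1), 100, replace = TRUE)
)

# Candidate numbers of clusters (an arbitrary range for illustration)
ks <- 1:8

# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) {
  kmeans(data, centers = k, nstart = 25)$tot.withinss
})

# Tabulate the results; look for the k after which the decrease levels off
elbow <- data.frame(k = ks, tot_withinss = wss)
elbow
```

Plotting the table, for example with `plot(ks, wss, type = "b")`, gives the usual elbow chart. With purely random data like this there may be no clear elbow at all, which is itself informative.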

Output and Summary: Present the output and a brief summary.

# Assign cluster labels to original data
data$cluster <- kmeans_result$cluster

# Display the number of observations in each cluster
# (cluster labels are categories, so count them rather than average them)
table(data$cluster)

## 
##  1  2  3 
## 36 36 28

Results and Summary

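The k-means analysis produced three clusters containing 36, 36, and 28 observations. The between-cluster sum of squares is 47.6% of the total sum of squares. Cluster 1 has the highest mean on var1 (0.98), cluster 2 is below average on both var1 (-0.66) and var2 (-0.67), and cluster 3 has the highest mean on var2 (1.06). Because the data were simulated at random, these clusters are an arbitrary partition of noise rather than genuine subgroups; with real data, the cluster means would be read as profiles of the underlying response patterns.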
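The cluster means and the between_SS / total_SS figure shown in the k-means output can be recomputed directly from the fitted object, which helps when writing up the summary. The sketch below refits the model so that it runs on its own; `profile` and `explained` are names introduced here, and the exact numbers and cluster numbering may differ across R versions.

```r
# Recreate the example data and model so this sketch runs on its own
set.seed(123)
data <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100),
  binary_var1 = sample(c(0, 1), 100, replace = TRUE),
  binary_var2 = sample(c(0, 1), 100, replace = TRUE)
)
kmeans_result <- kmeans(data, centers = 3, nstart = 25)

# Share of the total sum of squares that lies between clusters
explained <- kmeans_result$betweenss / kmeans_result$totss
round(100 * explained, 1)

# Mean of every clustering variable within each cluster
labelled <- cbind(data, cluster = kmeans_result$cluster)
profile <- aggregate(. ~ cluster, data = labelled, FUN = mean)
profile
```

The rows of `profile` should match `kmeans_result$centers`, since each centroid is the mean of the observations assigned to it.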

Rationale for Not Using Test Data

Because the dataset is small (100 observations) and cluster analysis is unsupervised, the data were not split into training and test sets. This keeps the assignment simple; the cluster analysis was run on the full dataset.
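If a holdout check were wanted, one possible approach is to fit k-means on a training subset and then assign each held-out observation to its nearest training centroid. The sketch below illustrates this under stated assumptions: the 70/30 split is arbitrary, and `assign_cluster` is a hypothetical helper written here, not a function from any package.

```r
# Recreate the example data so this sketch runs on its own
set.seed(123)
data <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100),
  binary_var1 = sample(c(0, 1), 100, replace = TRUE),
  binary_var2 = sample(c(0, 1), 100, replace = TRUE)
)

# Arbitrary 70/30 split into training and held-out rows
train_idx <- sample(nrow(data), size = 70)
train <- data[train_idx, ]
test <- data[-train_idx, ]

# Fit k-means on the training rows only
fit <- kmeans(train, centers = 3, nstart = 25)

# Hypothetical helper: label each new row with its nearest centroid
assign_cluster <- function(newdata, centers) {
  x <- as.matrix(newdata)
  # Squared Euclidean distance from every row to every centroid
  d <- vapply(
    seq_len(nrow(centers)),
    function(j) rowSums(sweep(x, 2, centers[j, ])^2),
    numeric(nrow(x))
  )
  max.col(-d, ties.method = "first")
}

test_cluster <- assign_cluster(test, fit$centers)
table(test_cluster)
```

Comparing the cluster sizes and means in the held-out rows with those in the training rows gives a rough sense of how stable the solution is.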

Visualizations

Now, let’s create visualizations to explore the data further. We will showcase 10 different types of charts using the ggplot2 package.

# Load necessary library for plotting
library(ggplot2)

# Scatter plot
ggplot(data, aes(x = var1, y = var2, color = factor(cluster))) +
  geom_point() +
  ggtitle("Scatter Plot of var1 vs var2 by Cluster")

# Bar plot
ggplot(data, aes(x = factor(cluster))) +
  geom_bar() +
  ggtitle("Bar Plot of Cluster Distribution")

# Histogram
ggplot(data, aes(x = var1, fill = factor(cluster))) +
  geom_histogram(binwidth = 0.5, position = "identity", alpha = 0.7) +
  ggtitle("Histogram of var1 by Cluster")

# Box plot
ggplot(data, aes(x = factor(cluster), y = var2, fill = factor(cluster))) +
  geom_boxplot() +
  ggtitle("Box Plot of var2 by Cluster")

# Line chart
ggplot(data, aes(x = seq_along(var1), y = var1, group = cluster, color = factor(cluster))) +
  geom_line() +
  ggtitle("Line Chart of var1 by Cluster")

# Violin plot
ggplot(data, aes(x = factor(cluster), y = var2, fill = factor(cluster))) +
  geom_violin() +
  ggtitle("Violin Plot of var2 by Cluster")

# Density plot
ggplot(data, aes(x = var1, fill = factor(cluster))) +
  geom_density(alpha = 0.7) +
  ggtitle("Density Plot of var1 by Cluster")

# Pie chart
pie_data <- data %>%
  group_by(cluster) %>%
  summarise(count = n())
pie_data$label <- paste("Cluster", pie_data$cluster)
pie_chart <- ggplot(pie_data, aes(x = "", y = count, fill = label)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y") +
  ggtitle("Pie Chart of Cluster Distribution")
# Print the stored plot; assigning it alone does not draw it
pie_chart

# Heatmap
correlation_matrix <- cor(clustering_vars)
ggplot(data = as.data.frame(as.table(correlation_matrix)), aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, limits = c(-1, 1)) +
  ggtitle("Correlation Heatmap of Clustering Variables")

# Bubble chart
bubble_data <- data %>%
  group_by(cluster) %>%
  summarise(avg_var1 = mean(var1), avg_var2 = mean(var2), size = n())
ggplot(bubble_data, aes(x = avg_var1, y = avg_var2, size = size, color = factor(cluster))) +
  geom_point(alpha = 0.7) +
  ggtitle("Bubble Chart of Cluster Centers")

In this section, we’ve presented 10 different types of charts to visually explore the data, offering a comprehensive view of the analysis results.

Feel free to adjust the visualizations based on your preferences or add more as needed.

K-Means Cluster Analysis in R

Muhammad Farhhad

February 21, 2024