INTRODUCTION
In this task we perform several exploratory checks to understand the data and then apply K-means, DBSCAN, and GMM clustering.
#" WHY EDA?
Exploratory Data Analysis (EDA) is an important step in any data analysis project because it helps us understand the data we are working with. EDA reveals patterns and relationships, so we get to know the data well before modelling. Typical uses of EDA include identifying data quality issues, understanding the distribution of the data, gaining insights, and generating hypotheses.
# Load the necessary packages
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## Warning: package 'lubridate' was built under R version 4.2.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.1.8
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
## Warning: package 'cluster' was built under R version 4.2.2
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(fpc)
## Warning: package 'fpc' was built under R version 4.2.2
library(mclust)
## Warning: package 'mclust' was built under R version 4.2.2
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
##
## The following object is masked from 'package:purrr':
##
## map
These lines load the packages needed for the analysis: dplyr, ggplot2, and the wider tidyverse for data manipulation and visualization, and cluster, factoextra, fpc, and mclust for clustering and cluster validation.
# Read the data from the csv file
data <- read.csv("C:\\Users\\mugil\\Desktop\\unsupervised learning\\SampleSuperstore.csv")
This line of code reads the data from a CSV file named “SampleSuperstore.csv” and stores it in a variable named data.
# Print the first few rows of the data
head(data)
## Ship.Mode Segment Country City State Postal.Code
## 1 Second Class Consumer United States Henderson Kentucky 42420
## 2 Second Class Consumer United States Henderson Kentucky 42420
## 3 Second Class Corporate United States Los Angeles California 90036
## 4 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 5 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 6 Standard Class Consumer United States Los Angeles California 90032
## Region Category Sub.Category Sales Quantity Discount Profit
## 1 South Furniture Bookcases 261.9600 2 0.00 41.9136
## 2 South Furniture Chairs 731.9400 3 0.00 219.5820
## 3 West Office Supplies Labels 14.6200 2 0.00 6.8714
## 4 South Furniture Tables 957.5775 5 0.45 -383.0310
## 5 South Office Supplies Storage 22.3680 2 0.20 2.5164
## 6 West Furniture Furnishings 48.8600 7 0.00 14.1694
This line of code prints the first few rows of the data, which gives us an idea of what the data looks like.
# Check for missing values
sum(is.na(data))
## [1] 0
This line of code checks for missing values in the data. It uses the is.na() function to identify missing values and the sum() function to count the number of missing values in the data.
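The total count alone does not show where any missing values would occur. A per-column check (a minimal sketch using base R, not part of the original analysis) makes that visible:
# Count missing values per column (a sketch; every count is 0 for this data)
colSums(is.na(data))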
# Check the structure of the data
str(data)
## 'data.frame': 9994 obs. of 13 variables:
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ City : chr "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
## $ State : chr "Kentucky" "Kentucky" "California" "Florida" ...
## $ Postal.Code : int 42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
## $ Region : chr "South" "South" "West" "South" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
This line of code checks the structure of the data, which provides information about the type of data and the number of observations and variables in the dataset.
# Check the summary statistics of numerical variables
summary(data[,c("Sales","Quantity","Discount","Profit")])
## Sales Quantity Discount Profit
## Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
## 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
## Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
## Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
This line of code checks the summary statistics of numerical variables such as Sales, Quantity, Discount, and Profit. It uses the summary() function to calculate basic descriptive statistics such as the minimum, maximum, mean, and quartiles.
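Since dplyr is already loaded, a complementary view (a sketch, not part of the original analysis) is to summarise the numerical variables within each product category:
# Mean Sales and Profit per Category (a sketch using dplyr verbs)
data %>%
  group_by(Category) %>%
  summarise(mean_sales = mean(Sales), mean_profit = mean(Profit), orders = n())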
# Check the unique values and frequency of categorical variables
unique(data$Ship.Mode)
## [1] "Second Class" "Standard Class" "First Class" "Same Day"
table(data$Ship.Mode)
##
## First Class Same Day Second Class Standard Class
## 1538 543 1945 5968
unique(data$Segment)
## [1] "Consumer" "Corporate" "Home Office"
table(data$Segment)
##
## Consumer Corporate Home Office
## 5191 3020 1783
unique(data$Region)
## [1] "South" "West" "Central" "East"
table(data$Region)
##
## Central East South West
## 2323 2848 1620 3203
unique(data$Category)
## [1] "Furniture" "Office Supplies" "Technology"
table(data$Category)
##
## Furniture Office Supplies Technology
## 2121 6026 1847
unique(data$Sub.Category)
## [1] "Bookcases" "Chairs" "Labels" "Tables" "Storage"
## [6] "Furnishings" "Art" "Phones" "Binders" "Appliances"
## [11] "Paper" "Accessories" "Envelopes" "Fasteners" "Supplies"
## [16] "Machines" "Copiers"
table(data$Sub.Category)
##
## Accessories Appliances Art Binders Bookcases Chairs
## 775 466 796 1523 228 617
## Copiers Envelopes Fasteners Furnishings Labels Machines
## 68 254 217 957 364 115
## Paper Phones Storage Supplies Tables
## 1370 889 846 190 319
These lines of code check the unique values and frequencies of the categorical variables Ship Mode, Segment, Region, Category, and Sub-Category. They use the unique() function to find the unique values and the table() function to count the frequency of each.
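Raw counts can be hard to compare across variables; proportions (a sketch, not part of the original analysis) are sometimes easier to read:
# Share of orders per Segment, as percentages (a sketch)
round(prop.table(table(data$Segment)) * 100, 1)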
# Visualize the distribution of numerical variables using histograms
ggplot(data, aes(x=Sales)) +
geom_histogram(bins = 50, fill = "blue") +
labs(x = "Sales", y = "Count")
ggplot(data, aes(x=Quantity)) +
geom_histogram(bins = 30, fill = "red") +
labs(x = "Quantity", y = "Count")
ggplot(data, aes(x=Discount)) +
geom_histogram(bins = 20, fill = "green") +
labs(x = "Discount", y = "Count")
ggplot(data, aes(x=Profit)) +
geom_histogram(bins = 50, fill = "orange") +
labs(x = "Profit", y = "Count")
These lines of code visualize the distribution of the numerical variables Sales, Quantity, Discount, and Profit using histograms. They use the ggplot() function to create a plot, the geom_histogram() function to draw the histogram, and the labs() function to add axis labels.
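The summary statistics above show that Sales is heavily right-skewed (median 54.49 versus a maximum of 22638.48), so most of the histogram mass is crushed against zero. A log-scaled x-axis (a sketch, not part of the original analysis) spreads the distribution out; this works here because Sales is strictly positive:
# Histogram of Sales on a log10 x-axis (a sketch)
ggplot(data, aes(x = Sales)) +
  geom_histogram(bins = 50, fill = "blue") +
  scale_x_log10() +
  labs(x = "Sales (log10 scale)", y = "Count")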
# Visualize the relationship between numerical variables using scatter plots
These lines of code create scatter plots to visualize the relationships between numerical variables in the dataset.
ggplot(data, aes(x=Sales, y=Profit)) +
geom_point(color = "blue") +
labs(x = "Sales", y = "Profit")
This line of code creates a scatter plot with Sales on the x-axis and Profit on the y-axis.
ggplot(data, aes(x=Sales, y=Discount)) +
geom_point(color = "red") +
labs(x = "Sales", y = "Discount")
This line of code creates another scatter plot, this time with Sales on the x-axis and Discount on the y-axis.
ggplot(data, aes(x=Quantity, y=Profit)) +
geom_point(color = "green") +
labs(x = "Quantity", y = "Profit")
This line of code creates a third scatter plot, this time with Quantity on the x-axis and Profit on the y-axis.
# Visualize the distribution of categorical variables using bar plots
These lines of code visualize the distribution of the categorical variables in the dataset using bar plots.
ggplot(data, aes(x=Ship.Mode)) +
geom_bar(fill = "blue") +
labs(x = "Ship Mode", y = "Count")
ggplot(data, aes(x=Segment)) +
geom_bar(fill = "red") +
labs(x = "Segment", y = "Count")
ggplot(data, aes(x=Region)) +
geom_bar(fill = "green") +
labs(x = "Region", y = "Count")
ggplot(data, aes(x=Category)) +
geom_bar(fill = "orange") +
labs(x = "Category", y = "Count")
ggplot(data, aes(x=Sub.Category)) +
geom_bar(fill = "purple") +
labs(x = "Sub-Category", y = "Count")
INTERPRETING SOME OF THE ANALYSIS
Scatter plot of Sales vs. Profit: This plot shows the relationship between Sales and Profit, with Sales on the x-axis and Profit on the y-axis; each blue point represents one row of the dataset. There is a positive relationship between the two variables: as Sales increase, Profit tends to increase as well.

Scatter plot of Sales vs. Discount: This plot shows the relationship between Sales and Discount, with Sales on the x-axis and Discount on the y-axis; each red point represents one row of the dataset. There is a weak negative relationship: heavily discounted orders tend to have somewhat lower Sales.
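These visual impressions can be quantified with Pearson correlation coefficients (a sketch, not part of the original analysis):
# Correlations for the plotted pairs (a sketch)
cor(data$Sales, data$Profit)
cor(data$Sales, data$Discount)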
WHY CLUSTERING?
Clustering is one of the most important techniques in data analysis, as it groups similar observations together. It is also useful for finding outliers. Clustering is a powerful tool for organizing and analyzing complex datasets, and it can produce valuable insights into patterns and relationships in the data.
K-MEANS
# Select the relevant columns for clustering
cols <- c('Sales', 'Profit')
data <- data[cols]
These lines of code select the ‘Sales’ and ‘Profit’ columns and overwrite ‘data’ so that it contains only these two columns, because we cluster on these two variables only.
# Remove any rows with missing values
data <- na.omit(data)
Any rows in the ‘data’ dataframe that contain missing values are removed using the ‘na.omit’ function. This ensures that we only work with complete cases.
# Scale the numerical columns
data_scaled <- scale(data)
The ‘scale’ function is used to standardize the numerical columns in the ‘data’ dataframe. This is necessary because KMeans clustering is sensitive to the scale of the variables.
# Set the number of clusters
num_clusters <- 4
The number of clusters is set to 4 using the ‘num_clusters’ variable.
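The choice of 4 clusters is fixed up front rather than derived from the data. Since factoextra is already loaded, an elbow plot (a sketch, not part of the original analysis) is one common way to sanity-check this choice: look for the k where the within-cluster sum of squares stops dropping sharply.
# Elbow plot of within-cluster sum of squares for k = 1..10 (a sketch)
fviz_nbclust(data_scaled, kmeans, method = "wss", k.max = 10)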
# Perform KMeans clustering
set.seed(42)
km <- kmeans(data_scaled, centers = num_clusters)  # 'km' avoids masking the kmeans() function
The ‘kmeans’ function is used to perform KMeans clustering on the scaled data, and the result is stored in ‘km’ (naming it ‘kmeans’ would mask the function itself). The ‘set.seed’ function is used to ensure that the results are reproducible.
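K-means depends on its random initial centroids; a common refinement (a sketch, not used for the results shown here) is to keep the best of several random restarts via the nstart argument:
# Best of 25 random initializations (a sketch; not used above)
km_multi <- kmeans(data_scaled, centers = num_clusters, nstart = 25)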
# Convert cluster assignments to integer vector
cluster_int <- km$cluster
The cluster assignments are extracted from the model object as an integer vector and stored in ‘cluster_int’.
# Add the cluster labels to the original dataframe
data$Cluster <- as.factor(cluster_int)
The cluster labels are added to the ‘data’ dataframe as a new column called ‘Cluster’.
# Define colors for each cluster
colors <- c('red', 'green', 'blue', 'orange')
A vector of colors is defined to represent each cluster in the final plot.
# Plot the clusters
ggplot(data, aes(x=Sales, y=Profit, color=Cluster)) +
geom_point() + scale_color_manual(values=colors) +
xlab('Sales') + ylab('Profit') +
ggtitle('Cluster Plot') +
theme_minimal()
The final step is to create a scatter plot using the ‘ggplot2’ library to visualize the clusters. The ‘aes’ function specifies the x and y variables, as well as the cluster variable for coloring. The ‘geom_point’ function draws the points, and the ‘scale_color_manual’ function assigns a color to each cluster. Finally, the plot is labeled and titled using ‘xlab’, ‘ylab’, ‘ggtitle’, and ‘theme_minimal’.
# Compute silhouette score
silhouette_avg <- silhouette(cluster_int, dist(data_scaled))
print(paste0('The average silhouette score is: ', round(mean(silhouette_avg[,3]), 2)))
## [1] "The average silhouette score is: 0.74"
These final lines compute the average silhouette score, which measures how well each data point fits within its assigned cluster relative to other clusters.
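Beyond the single average, factoextra can draw a per-observation silhouette plot (a sketch, not part of the original analysis), which shows how well each point sits inside its assigned cluster:
# Per-observation silhouette widths by cluster (a sketch)
fviz_silhouette(silhouette_avg)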
SUMMARY
The KMeans algorithm is used to partition the data into a predetermined number of clusters. The code selects the ‘Sales’ and ‘Profit’ columns from the dataset and creates a new dataset called ‘data’ that only contains these two columns. The data is then scaled using the ‘scale’ function to standardize the numerical columns. The number of clusters is set to 4 using the ‘num_clusters’ variable. The ‘kmeans’ function is used to perform KMeans clustering on the scaled data, and the cluster assignments are added to the ‘data’ dataframe as a new column called ‘Cluster’. Finally, the clusters are visualized using a scatter plot created using the ‘ggplot2’ library, and the average silhouette score is computed to evaluate the quality of the clustering.
DBSCAN
data <- data[,c("Sales", "Profit")]
This line keeps only the ‘Sales’ and ‘Profit’ columns, dropping the ‘Cluster’ column added in the K-means section.
data_scaled <- scale(data)
This line scales the data using the scale() function, standardizing the numerical columns so that they have a mean of 0 and a standard deviation of 1.
db <- dbscan(data_scaled, eps = 0.5, MinPts = 5)  # 'db' avoids masking the dbscan() function
This line performs the DBSCAN clustering algorithm (from the fpc package) on the scaled data. The eps parameter sets the maximum distance between two points in the same neighborhood, and the MinPts parameter specifies the minimum number of points required to form a dense region.
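The eps value of 0.5 is asserted rather than derived. A standard heuristic (a sketch, assuming the separate ‘dbscan’ package is installed; it is distinct from the fpc package used here) is to plot each point’s distance to its MinPts-th nearest neighbour and place eps near the “knee” of the sorted curve:
# k-NN distance plot for choosing eps (a sketch; needs the 'dbscan' package)
dbscan::kNNdistplot(data_scaled, k = 5)
abline(h = 0.5, lty = 2)  # the eps value used above, for reference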
# Number of clusters in the data
n_clusters <- length(unique(db$cluster)) - 1
This line determines the number of clusters in the data by counting the unique cluster labels assigned by DBSCAN. We subtract 1 because fpc’s dbscan() assigns the label 0 to noise points, which do not form a cluster.
# Plot the clusters
ggplot(data, aes(x = Sales, y = Profit, color = factor(db$cluster))) +
geom_point(size = 2) +
scale_color_discrete(name = "Cluster") +
ggtitle(paste("DBSCAN Clustering (Number of clusters: ", n_clusters, ")")) +
xlab("Sales") + ylab("Profit")
These lines create a scatter plot of the data, with each point colored according to its assigned cluster label. The ggplot() function specifies the data and the variables to plot, geom_point() adds the points, scale_color_discrete() sets the color scale for the cluster labels, ggtitle() adds a title that includes the number of clusters, and xlab() and ylab() label the axes.
# Compute the silhouette score
silhouette_avg <- mean(silhouette(as.integer(db$cluster), dist(data_scaled))[, 3])
cat("The average silhouette score is", round(silhouette_avg, 3))
## The average silhouette score is 0.801
The silhouette() function computes the silhouette score for each point in the data based on its cluster label and the distance to other points in its own and neighboring clusters. The mean() function takes the average of the silhouette scores for all points. The silhouette score ranges from -1 to 1, with higher values indicating better cluster quality.
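Note that the score above treats the noise points (label 0) as if they formed a cluster of their own. Recomputing on the non-noise points only (a sketch, not part of the original analysis; it assumes at least two real clusters remain) isolates the quality of the actual clusters:
# Silhouette score excluding DBSCAN noise points, which carry label 0 (a sketch)
keep <- db$cluster != 0
sil_core <- silhouette(db$cluster[keep], dist(data_scaled[keep, ]))
cat("The average silhouette score without noise is", round(mean(sil_core[, 3]), 3))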
SUMMARY
The DBSCAN algorithm is used to identify dense regions in the data, which are then assigned cluster labels. The code selects the ‘Sales’ and ‘Profit’ columns from the original dataset and scales them using the ‘scale’ function. DBSCAN clustering is performed using the ‘dbscan’ function with an epsilon value of 0.5 and a minimum number of points of 5. The number of clusters is determined by counting the number of unique cluster labels assigned by DBSCAN, and the clusters are visualized using a scatter plot created using the ‘ggplot2’ library. Finally, the average silhouette score is computed to evaluate the quality of the clustering.
GMM
# Select the relevant columns for clustering
cols <- c('Sales', 'Profit')
data <- data[cols]
This code selects the ‘Sales’ and ‘Profit’ columns from the original data and assigns them to a new variable called ‘data’.
# Remove any rows with missing values
data <- na.omit(data)
This code removes any rows with missing values from the ‘data’ variable.
# Scale the numerical columns
data_scaled <- scale(data)
This code standardizes the numerical columns in ‘data’ by scaling them to have a mean of 0 and a standard deviation of 1. The resulting scaled matrix is assigned to ‘data_scaled’, as in the earlier sections.
# Set the number of clusters
num_clusters <- 4
This code sets the number of clusters to be used in the clustering algorithm to 4.
# Fit GMM model to the data
gmm <- Mclust(data = data_scaled, G = num_clusters, modelNames = "VEV")
This code fits a Gaussian mixture model (GMM) to the scaled data using the ‘Mclust’ function from the ‘mclust’ package. The ‘G’ parameter specifies the number of mixture components (clusters) and the ‘modelNames’ parameter specifies the covariance structure (“VEV”: variable volume, equal shape, variable orientation).
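Rather than fixing G = 4 and “VEV” up front, mclust can select both the number of components and the covariance model via the Bayesian Information Criterion (a sketch, not run in the original analysis; it can be slow on roughly 10,000 rows):
# BIC-based selection of components and covariance model (a sketch)
bic <- mclustBIC(data_scaled)
plot(bic)
summary(bic)  # top models ranked by BIC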
# Assign cluster labels to each data point
clusters <- gmm$classification
This code assigns a cluster label to each data point based on the output of the GMM model. The resulting cluster labels are stored in a new variable called ‘clusters’.
# Visualize the clustering results
fviz_cluster(list(data = data_scaled, cluster = clusters), geom = "point",
palette = "jco", ggtheme = theme_minimal())
The fviz_cluster() function from the ‘factoextra’ package takes a list containing the data and the cluster labels, plus several optional parameters to customize the plot. Here the observations are drawn as colored points, with each color representing a different cluster.
# Calculate the silhouette score
silhouette_avg <- mean(silhouette(clusters, dist(data_scaled))[, 3])
cat("The average silhouette score is", round(silhouette_avg, 3))
## The average silhouette score is 0.199
This code calculates the average silhouette score for the clustering results using the ‘silhouette’ function from the ‘cluster’ package. The silhouette score measures how similar each data point is to its own cluster compared to other clusters, with scores ranging from -1 to 1. Higher scores indicate better cluster quality.
SUMMARY
The GMM algorithm models the data as a mixture of Gaussian distributions. The code selects the ‘Sales’ and ‘Profit’ columns, scales them, and uses the ‘Mclust’ function to fit a four-component GMM with a ‘VEV’ covariance structure. The resulting cluster assignments are stored in the ‘clusters’ variable. Finally, the clusters are visualized using the ‘factoextra’ library, and the average silhouette score is computed to evaluate the quality of the clustering.
CONCLUSION
In conclusion, K-means, DBSCAN, and GMM are all popular clustering algorithms that are widely used across applications. Each algorithm has its strengths and weaknesses, and the choice between them depends on the characteristics of the dataset and the specific requirements of the problem. In this analysis GMM was clearly less effective than K-means and DBSCAN, as reflected by its much lower average silhouette score (0.199 versus 0.74 for K-means and 0.801 for DBSCAN).