INTRODUCTION
In this task we perform several exploratory checks to understand the data and then apply K-means, DBSCAN, and GMM clustering.
#" WHY EDA?
Exploratory Data Analysis (EDA) is an important step in any data analysis project because it helps us understand the data we are working with. EDA reveals patterns and relationships, so we get to know the data well before modelling. Typical uses of EDA include identifying data quality issues, understanding the distribution of the data, gaining insights, and generating hypotheses.
# Load the necessary packages
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## Warning: package 'lubridate' was built under R version 4.2.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.1.8
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
## Warning: package 'cluster' was built under R version 4.2.2
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(fpc)
## Warning: package 'fpc' was built under R version 4.2.2
library(mclust)
## Warning: package 'mclust' was built under R version 4.2.2
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
##
## The following object is masked from 'package:purrr':
##
## map
These lines load the packages needed for the analysis: dplyr, ggplot2, and the wider tidyverse for data manipulation and visualization, and cluster, factoextra, fpc, and mclust for clustering and cluster validation.
# Read the data from the csv file
data <- read.csv("C:\\Users\\mugil\\Desktop\\unsupervised learning\\SampleSuperstore.csv")
This line of code reads the data from a CSV file named “SampleSuperstore.csv” and stores it in a variable named data.
# Print the first few rows of the data
head(data)
## Ship.Mode Segment Country City State Postal.Code
## 1 Second Class Consumer United States Henderson Kentucky 42420
## 2 Second Class Consumer United States Henderson Kentucky 42420
## 3 Second Class Corporate United States Los Angeles California 90036
## 4 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 5 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 6 Standard Class Consumer United States Los Angeles California 90032
## Region Category Sub.Category Sales Quantity Discount Profit
## 1 South Furniture Bookcases 261.9600 2 0.00 41.9136
## 2 South Furniture Chairs 731.9400 3 0.00 219.5820
## 3 West Office Supplies Labels 14.6200 2 0.00 6.8714
## 4 South Furniture Tables 957.5775 5 0.45 -383.0310
## 5 South Office Supplies Storage 22.3680 2 0.20 2.5164
## 6 West Furniture Furnishings 48.8600 7 0.00 14.1694
This line of code prints the first few rows of the data, which gives us an idea of what the data looks like.
# Check for missing values
sum(is.na(data))
## [1] 0
This line of code checks for missing values in the data. It uses the is.na() function to identify missing values and the sum() function to count the number of missing values in the data.
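The total count alone does not show where any missing values would occur. A per-column check (a minimal sketch using base R, not part of the original analysis) makes that visible:
# Count missing values per column (a sketch; every count is 0 for this data)
colSums(is.na(data))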
# Check the structure of the data
str(data)
## 'data.frame': 9994 obs. of 13 variables:
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ City : chr "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
## $ State : chr "Kentucky" "Kentucky" "California" "Florida" ...
## $ Postal.Code : int 42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
## $ Region : chr "South" "South" "West" "South" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
This line of code checks the structure of the data, which provides information about the type of data and the number of observations and variables in the dataset.
# Check the summary statistics of numerical variables
summary(data[,c("Sales","Quantity","Discount","Profit")])
## Sales Quantity Discount Profit
## Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
## 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
## Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
## Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
This line of code checks the summary statistics of numerical variables such as Sales, Quantity, Discount, and Profit. It uses the summary() function to calculate basic descriptive statistics such as the minimum, maximum, mean, and quartiles.
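Since dplyr is already loaded, a complementary view (a sketch, not part of the original analysis) is to summarise the numerical variables within each product category:
# Mean Sales and Profit per Category (a sketch using dplyr verbs)
data %>%
  group_by(Category) %>%
  summarise(mean_sales = mean(Sales), mean_profit = mean(Profit), orders = n())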
# Check the unique values and frequency of categorical variables
unique(data$Ship.Mode)
## [1] "Second Class" "Standard Class" "First Class" "Same Day"
table(data$Ship.Mode)
##
## First Class Same Day Second Class Standard Class
## 1538 543 1945 5968
unique(data$Segment)
## [1] "Consumer" "Corporate" "Home Office"
table(data$Segment)
##
## Consumer Corporate Home Office
## 5191 3020 1783
unique(data$Region)
## [1] "South" "West" "Central" "East"
table(data$Region)
##
## Central East South West
## 2323 2848 1620 3203
unique(data$Category)
## [1] "Furniture" "Office Supplies" "Technology"
table(data$Category)
##
## Furniture Office Supplies Technology
## 2121 6026 1847
unique(data$Sub.Category)
## [1] "Bookcases" "Chairs" "Labels" "Tables" "Storage"
## [6] "Furnishings" "Art" "Phones" "Binders" "Appliances"
## [11] "Paper" "Accessories" "Envelopes" "Fasteners" "Supplies"
## [16] "Machines" "Copiers"
table(data$Sub.Category)
##
## Accessories Appliances Art Binders Bookcases Chairs
## 775 466 796 1523 228 617
## Copiers Envelopes Fasteners Furnishings Labels Machines
## 68 254 217 957 364 115
## Paper Phones Storage Supplies Tables
## 1370 889 846 190 319
These lines of code check the unique values and frequencies of the categorical variables Ship Mode, Segment, Region, Category, and Sub-Category. They use the unique() function to find the unique values and the table() function to count the frequency of each.
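Raw counts can be hard to compare across variables; proportions (a sketch, not part of the original analysis) are sometimes easier to read:
# Share of orders per Segment, as percentages (a sketch)
round(prop.table(table(data$Segment)) * 100, 1)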
# Visualize the distribution of numerical variables using histograms
ggplot(data, aes(x=Sales)) +
geom_histogram(bins = 50, fill = "blue") +
labs(x = "Sales", y = "Count")
ggplot(data, aes(x=Quantity)) +
geom_histogram(bins = 30, fill = "red") +
labs(x = "Quantity", y = "Count")
ggplot(data, aes(x=Discount)) +
geom_histogram(bins = 20, fill = "green") +
labs(x = "Discount", y = "Count")
ggplot(data, aes(x=Profit)) +
geom_histogram(bins = 50, fill = "orange") +
labs(x = "Profit", y = "Count")
These lines of code visualize the distribution of the numerical variables Sales, Quantity, Discount, and Profit using histograms. They use the ggplot() function to create a plot, the geom_histogram() function to draw the histogram, and the labs() function to add axis labels.
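The summary statistics above show that Sales is heavily right-skewed (median 54.49 versus a maximum of 22638.48), so most of the histogram mass is crushed against zero. A log-scaled x-axis (a sketch, not part of the original analysis) spreads the distribution out; this works here because Sales is strictly positive:
# Histogram of Sales on a log10 x-axis (a sketch)
ggplot(data, aes(x = Sales)) +
  geom_histogram(bins = 50, fill = "blue") +
  scale_x_log10() +
  labs(x = "Sales (log10 scale)", y = "Count")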
# Visualize the relationship between numerical variables using scatter plots
These lines of code create scatter plots to visualize the relationships between numerical variables in the dataset.
ggplot(data, aes(x=Sales, y=Profit)) +
geom_point(color = "blue") +
labs(x = "Sales", y = "Profit")
This line of code creates a scatter plot with Sales on the x-axis and Profit on the y-axis.
ggplot(data, aes(x=Sales, y=Discount)) +
geom_point(color = "red") +
labs(x = "Sales", y = "Discount")
This line of code creates another scatter plot, this time with Sales on the x-axis and Discount on the y-axis.
ggplot(data, aes(x=Quantity, y=Profit)) +
geom_point(color = "green") +
labs(x = "Quantity", y = "Profit")
This line of code creates a third scatter plot, this time with Quantity on the x-axis and Profit on the y-axis.
# Visualize the distribution of categorical variables using bar plots
These lines of code visualize the distribution of the categorical variables in the dataset using bar plots.
ggplot(data, aes(x=Ship.Mode)) +
geom_bar(fill = "blue") +
labs(x = "Ship Mode", y = "Count")
ggplot(data, aes(x=Segment)) +
geom_bar(fill = "red") +
labs(x = "Segment", y = "Count")
ggplot(data, aes(x=Region)) +
geom_bar(fill = "green") +
labs(x = "Region", y = "Count")
ggplot(data, aes(x=Category)) +
geom_bar(fill = "orange") +
labs(x = "Category", y = "Count")
ggplot(data, aes(x=Sub.Category)) +
geom_bar(fill = "purple") +
labs(x = "Sub-Category", y = "Count")
INTERPRETING SOME OF THE ANALYSIS
Scatter plot of Sales vs. Profit: This plot shows the relationship between Sales and Profit, with Sales on the x-axis and Profit on the y-axis; each blue point represents one row of the dataset. There is a positive relationship between the two variables: as Sales increase, Profit tends to increase as well.

Scatter plot of Sales vs. Discount: This plot shows the relationship between Sales and Discount, with Sales on the x-axis and Discount on the y-axis; each red point represents one row of the dataset. There is a weak negative relationship: heavily discounted orders tend to have somewhat lower Sales.
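These visual impressions can be quantified with Pearson correlation coefficients (a sketch, not part of the original analysis):
# Correlations for the plotted pairs (a sketch)
cor(data$Sales, data$Profit)
cor(data$Sales, data$Discount)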
WHY CLUSTERING?
Clustering is one of the most important techniques in data analysis, as it groups similar observations together. It is also useful for finding outliers. Clustering is a powerful tool for organizing and analyzing complex datasets, and it can produce valuable insights into patterns and relationships in the data.
K-MEANS
# Select the relevant columns for clustering
cols <- c('Sales', 'Profit')
data <- data[cols]
These lines of code select the ‘Sales’ and ‘Profit’ columns and overwrite ‘data’ so that it contains only these two columns, because we cluster on these two variables only.
# Remove any rows with missing values
data <- na.omit(data)
Any rows in the ‘data’ dataframe that contain missing values are removed using the ‘na.omit’ function. This ensures that we only work with complete cases.
# Scale the numerical columns
data_scaled <- scale(data)
The ‘scale’ function is used to standardize the numerical columns in the ‘data’ dataframe. This is necessary because KMeans clustering is sensitive to the scale of the variables.
# Set the number of clusters
num_clusters <- 4
The number of clusters is set to 4 using the ‘num_clusters’ variable.
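The choice of 4 clusters is fixed up front rather than derived from the data. Since factoextra is already loaded, an elbow plot (a sketch, not part of the original analysis) is one common way to sanity-check this choice: look for the k where the within-cluster sum of squares stops dropping sharply.
# Elbow plot of within-cluster sum of squares for k = 1..10 (a sketch)
fviz_nbclust(data_scaled, kmeans, method = "wss", k.max = 10)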
# Perform KMeans clustering
set.seed(42)
km <- kmeans(data_scaled, centers = num_clusters)  # 'km' avoids masking the kmeans() function
The ‘kmeans’ function is used to perform KMeans clustering on the scaled data, and the result is stored in ‘km’ (naming it ‘kmeans’ would mask the function itself). The ‘set.seed’ function is used to ensure that the results are reproducible.
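K-means depends on its random initial centroids; a common refinement (a sketch, not used for the results shown here) is to keep the best of several random restarts via the nstart argument:
# Best of 25 random initializations (a sketch; not used above)
km_multi <- kmeans(data_scaled, centers = num_clusters, nstart = 25)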
# Convert cluster assignments to integer vector
cluster_int <- km$cluster
The cluster assignments are extracted from the model object as an integer vector and stored in ‘cluster_int’.
# Add the cluster labels to the original dataframe
data$Cluster <- as.factor(cluster_int)
The cluster labels are added to the ‘data’ dataframe as a new column called ‘Cluster’.
# Define colors for each cluster
colors <- c('red', 'green', 'blue', 'orange')
A vector of colors is defined to represent each cluster in the final plot.
# Plot the clusters
ggplot(data, aes(x=Sales, y=Profit, color=Cluster)) +
geom_point() + scale_color_manual(values=colors) +
xlab('Sales') + ylab('Profit') +
ggtitle('Cluster Plot') +
theme_minimal()
The final step is to create a scatter plot using the ‘ggplot2’ library to visualize the clusters. The ‘aes’ function specifies the x and y variables, as well as the cluster variable for coloring. The ‘geom_point’ function draws the points, and the ‘scale_color_manual’ function assigns a color to each cluster. Finally, the plot is labeled and titled using ‘xlab’, ‘ylab’, ‘ggtitle’, and ‘theme_minimal’.
# Compute silhouette score
silhouette_avg <- silhouette(cluster_int, dist(data_scaled))
print(paste0('The average silhouette score is: ', round(mean(silhouette_avg[,3]), 2)))
## [1] "The average silhouette score is: 0.74"
These final lines compute the average silhouette score, which measures how well each data point fits within its assigned cluster relative to other clusters.
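Beyond the single average, factoextra can draw a per-observation silhouette plot (a sketch, not part of the original analysis), which shows how well each point sits inside its assigned cluster:
# Per-observation silhouette widths by cluster (a sketch)
fviz_silhouette(silhouette_avg)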
SUMMARY
The KMeans algorithm is used to partition the data into a predetermined number of clusters. The code selects the ‘Sales’ and ‘Profit’ columns from the dataset and creates a new dataset called ‘data’ that only contains these two columns. The data is then scaled using the ‘scale’ function to standardize the numerical columns. The number of clusters is set to 4 using the ‘num_clusters’ variable. The ‘kmeans’ function is used to perform KMeans clustering on the scaled data, and the cluster assignments are added to the ‘data’ dataframe as a new column called ‘Cluster’. Finally, the clusters are visualized using a scatter plot created using the ‘ggplot2’ library, and the average silhouette score is computed to evaluate the quality of the clustering.
DBSCAN
data <- data[,c("Sales", "Profit")]
This line keeps only the ‘Sales’ and ‘Profit’ columns, dropping the ‘Cluster’ column added in the K-means section.
data_scaled <- scale(data)
This line scales the data using the scale() function, standardizing the numerical columns so that they have a mean of 0 and a standard deviation of 1.
db <- dbscan(data_scaled, eps = 0.5, MinPts = 5)  # 'db' avoids masking the dbscan() function
This line performs the DBSCAN clustering algorithm (from the fpc package) on the scaled data. The eps parameter sets the maximum distance between two points in the same neighborhood, and the MinPts parameter specifies the minimum number of points required to form a dense region.
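The eps value of 0.5 is asserted rather than derived. A standard heuristic (a sketch, assuming the separate ‘dbscan’ package is installed; it is distinct from the fpc package used here) is to plot each point’s distance to its MinPts-th nearest neighbour and place eps near the “knee” of the sorted curve:
# k-NN distance plot for choosing eps (a sketch; needs the 'dbscan' package)
dbscan::kNNdistplot(data_scaled, k = 5)
abline(h = 0.5, lty = 2)  # the eps value used above, for reference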
# Number of clusters in the data
n_clusters <- length(unique(db$cluster)) - 1
This line determines the number of clusters in the data by counting the unique cluster labels assigned by DBSCAN. We subtract 1 because fpc’s dbscan() assigns the label 0 to noise points, which do not form a cluster.
# Plot the clusters
ggplot(data, aes(x = Sales, y = Profit, color = factor(db$cluster))) +
geom_point(size = 2) +
scale_color_discrete(name = "Cluster") +
ggtitle(paste("DBSCAN Clustering (Number of clusters: ", n_clusters, ")")) +
xlab("Sales") + ylab("Profit")
These lines create a scatter plot of the data, with each point colored according to its assigned cluster label. The ggplot() function specifies the data and the variables to plot, geom_point() adds the points, scale_color_discrete() sets the color scale for the cluster labels, ggtitle() adds a title that includes the number of clusters, and xlab() and ylab() label the axes.
# Compute the silhouette score
silhouette_avg <- mean(silhouette(as.integer(db$cluster), dist(data_scaled))[, 3])
cat("The average silhouette score is", round(silhouette_avg, 3))
## The average silhouette score is 0.801
The silhouette() function computes the silhouette score for each point in the data based on its cluster label and the distance to other points in its own and neighboring clusters. The mean() function takes the average of the silhouette scores for all points. The silhouette score ranges from -1 to 1, with higher values indicating better cluster quality.
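Note that the score above treats the noise points (label 0) as if they formed a cluster of their own. Recomputing on the non-noise points only (a sketch, not part of the original analysis; it assumes at least two real clusters remain) isolates the quality of the actual clusters:
# Silhouette score excluding DBSCAN noise points, which carry label 0 (a sketch)
keep <- db$cluster != 0
sil_core <- silhouette(db$cluster[keep], dist(data_scaled[keep, ]))
cat("The average silhouette score without noise is", round(mean(sil_core[, 3]), 3))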
SUMMARY
The DBSCAN algorithm is used to identify dense regions in the data, which are then assigned cluster labels. The code selects the ‘Sales’ and ‘Profit’ columns from the original dataset and scales them using the ‘scale’ function. DBSCAN clustering is performed using the ‘dbscan’ function with an epsilon value of 0.5 and a minimum number of points of 5. The number of clusters is determined by counting the number of unique cluster labels assigned by DBSCAN, and the clusters are visualized using a scatter plot created using the ‘ggplot2’ library. Finally, the average silhouette score is computed to evaluate the quality of the clustering.
GMM
# Select the relevant columns for clustering
cols <- c('Sales', 'Profit')
data <- data[cols]
This code selects the ‘Sales’ and ‘Profit’ columns from the original data and assigns them to a new variable called ‘data’.
# Remove any rows with missing values
data <- na.omit(data)
This code removes any rows with missing values from the ‘data’ variable.
# Scale the numerical columns
data_scaled <- scale(data)
This code standardizes the numerical columns in ‘data’ by scaling them to have a mean of 0 and a standard deviation of 1. The resulting scaled matrix is assigned to ‘data_scaled’, as in the earlier sections.
# Set the number of clusters
num_clusters <- 4
This code sets the number of clusters to be used in the clustering algorithm to 4.
# Fit GMM model to the data
gmm <- Mclust(data = data_scaled, G = num_clusters, modelNames = "VEV")
This code fits a Gaussian mixture model (GMM) to the scaled data using the ‘Mclust’ function from the ‘mclust’ package. The ‘G’ parameter specifies the number of mixture components (clusters) and the ‘modelNames’ parameter specifies the covariance structure (“VEV”: variable volume, equal shape, variable orientation).
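Rather than fixing G = 4 and “VEV” up front, mclust can select both the number of components and the covariance model via the Bayesian Information Criterion (a sketch, not run in the original analysis; it can be slow on roughly 10,000 rows):
# BIC-based selection of components and covariance model (a sketch)
bic <- mclustBIC(data_scaled)
plot(bic)
summary(bic)  # top models ranked by BIC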
# Assign cluster labels to each data point
clusters <- gmm$classification
This code assigns a cluster label to each data point based on the output of the GMM model. The resulting cluster labels are stored in a new variable called ‘clusters’.
# Visualize the clustering results
fviz_cluster(list(data = data_scaled, cluster = clusters), geom = "point",
palette = "jco", ggtheme = theme_minimal())
The fviz_cluster() function from the ‘factoextra’ package takes a list containing the data and the cluster labels, plus several optional parameters to customize the plot. Here the observations are drawn as colored points, with each color representing a different cluster.
# Calculate the silhouette score
silhouette_avg <- mean(silhouette(clusters, dist(data_scaled))[, 3])
cat("The average silhouette score is", round(silhouette_avg, 3))
## The average silhouette score is 0.199
This code calculates the average silhouette score for the clustering results using the ‘silhouette’ function from the ‘cluster’ package. The silhouette score measures how similar each data point is to its own cluster compared to other clusters, with scores ranging from -1 to 1. Higher scores indicate better cluster quality.
SUMMARY
The GMM algorithm models the data as a mixture of Gaussian distributions. The code selects the ‘Sales’ and ‘Profit’ columns, scales them, and uses the ‘Mclust’ function to fit a four-component GMM with a ‘VEV’ covariance structure. The resulting cluster assignments are stored in the ‘clusters’ variable. Finally, the clusters are visualized using the ‘factoextra’ library, and the average silhouette score is computed to evaluate the quality of the clustering.
CONCLUSION
In conclusion, K-means, DBSCAN, and GMM are all popular clustering algorithms that are widely used across applications. Each algorithm has its strengths and weaknesses, and the choice between them depends on the characteristics of the dataset and the specific requirements of the problem. In this analysis GMM was clearly less effective than K-means and DBSCAN, as reflected by its much lower average silhouette score (0.199 versus 0.74 for K-means and 0.801 for DBSCAN).