Loading packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 is already attached by library(tidyverse) above
data <- read.csv('Mall_Customers.csv')

The dataset contains 200 rows and 5 columns. The 5 variables are CustomerID, Genre (the customer's gender), Age, Annual Income (in thousands) and Spending Score (on a scale of 1 to 100) of the customers in a mall.
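The column names that `read.csv` produces here are a little awkward to type. As a minimal sketch, one could rename them on a copy of the data. The three rows below are the first three customers as they appear in the output of this post, and the short names (`Gender`, `IncomeK`, `SpendingScore`) are hypothetical choices, not something the rest of the post uses.

```r
# First three rows as they appear in the str() output of this post
sample_rows <- data.frame(
  CustomerID = 1:3,
  Genre = c("Male", "Male", "Female"),
  Age = c(19L, 21L, 20L),
  Annual.Income..k.. = c(15L, 15L, 16L),
  Spending.Score..1.100. = c(39L, 81L, 6L)
)

# Work on a copy so code that relies on the original names keeps working
renamed <- sample_rows
# Hypothetical short names, assigned in column order
names(renamed) <- c("CustomerID", "Gender", "Age", "IncomeK", "SpendingScore")

print(renamed)
```

The rest of this post keeps the original names, so the rename is optional.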

str(data)
## 'data.frame':    200 obs. of  5 variables:
##  $ CustomerID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Genre                 : chr  "Male" "Male" "Female" "Female" ...
##  $ Age                   : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual.Income..k..    : int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending.Score..1.100.: int  39 81 6 77 40 76 6 94 3 72 ...
subset(data, Genre == 'Female')[1:20,]
##    CustomerID  Genre Age Annual.Income..k.. Spending.Score..1.100.
## 3           3 Female  20                 16                      6
## 4           4 Female  23                 16                     77
## 5           5 Female  31                 17                     40
## 6           6 Female  22                 17                     76
## 7           7 Female  35                 18                      6
## 8           8 Female  23                 18                     94
## 10         10 Female  30                 19                     72
## 12         12 Female  35                 19                     99
## 13         13 Female  58                 20                     15
## 14         14 Female  24                 20                     77
## 17         17 Female  35                 21                     35
## 20         20 Female  35                 23                     98
## 23         23 Female  46                 25                      5
## 25         25 Female  54                 28                     14
## 27         27 Female  45                 28                     32
## 29         29 Female  40                 29                     31
## 30         30 Female  23                 29                     87
## 32         32 Female  21                 30                     73
## 35         35 Female  49                 33                     14
## 36         36 Female  21                 33                     81
summary(data)
##    CustomerID        Genre                Age        Annual.Income..k..
##  Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00    
##  1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50    
##  Median :100.50   Mode  :character   Median :36.00   Median : 61.50    
##  Mean   :100.50                      Mean   :38.85   Mean   : 60.56    
##  3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00    
##  Max.   :200.00                      Max.   :70.00   Max.   :137.00    
##  Spending.Score..1.100.
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00

We can see that the customers’ ages range from 18 to 70 and that their annual incomes range from 15k to 137k.

Checking for missing values

missing_values <- sum(is.na(data))
print(paste("Number of Missing Values:", missing_values))
## [1] "Number of Missing Values: 0"

There are no missing values in the dataset.
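A single total does not say where missing values sit when there are some. The sketch below counts them per column and counts the fully complete rows. It uses a small toy data frame with deliberately inserted `NA`s (illustrative values, not the mall data).

```r
# Toy data frame with deliberately inserted NAs (illustrative values, not the mall data)
toy <- data.frame(
  Age = c(19, NA, 20, 23),
  Income = c(15, 15, NA, NA),
  Score = c(39, 81, 6, 77)
)

# Missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Note that `is.na()` does not flag empty strings in character columns, so a column such as Genre may need a separate check like `sum(data$Genre == "")`.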

So let’s proceed with our analysis.

Customer Segmentation using K-means Clustering

set.seed(123)  # for reproducibility
customer_data <- read.csv("Mall_Customers.csv")
features <- customer_data[, c("Spending.Score..1.100.", "Annual.Income..k..")]
scaled_features <- scale(features)

Performing K-means clustering (e.g., with k=3)

kmeans_result <- kmeans(scaled_features, centers = 3)
cluster_assignments <- kmeans_result$cluster

K-means is an unsupervised machine learning algorithm that partitions a dataset into K distinct, non-overlapping clusters. Its goal is to group similar data points together, with each cluster represented by its centroid (the mean of the points assigned to it).
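The choice of k = 3 is only an example. One common way to compare candidate values of k is the "elbow" heuristic, which looks at how the total within-cluster sum of squares falls as k grows. The sketch below uses simulated data with three loose groups (not the mall data). The `nstart = 25` setting is an assumption added here to make the result less dependent on the random starting centroids.

```r
set.seed(123)  # k-means starts from random centroids, so fix the seed

# Simulated two-column data with three loose groups (not the mall data)
sim <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.5), rnorm(30, mean = 4, sd = 0.5))
)
sim_scaled <- scale(sim)

# Total within-cluster sum of squares for k = 1..6;
# nstart = 25 keeps the best of 25 random starts for each k
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(sim_scaled, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = k_values, wss = round(wss, 2)))
```

The value of k after which `wss` stops dropping sharply is a reasonable candidate. It is a heuristic rather than a definitive answer.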

Cluster analysis

customer_data$Cluster <- cluster_assignments

Demographics Analysis

demographic_analysis <- customer_data %>%
  group_by(Cluster, Genre) %>%
  summarise(Count = n()) %>%
  ungroup()
## `summarise()` has grouped output by 'Cluster'. You can override using the
## `.groups` argument.

Bar Plot of Gender Distribution in Each Cluster

ggplot(demographic_analysis, aes(x = as.factor(Cluster), y = Count, fill = Genre)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Gender Distribution in Each Cluster", x = "Cluster", y = "Count") +
  theme_minimal()

The bar plot illustrates the gender distribution within each cluster. Cluster 1 has a higher proportion of females (16 females to 10 males), Cluster 2 has slightly more males than females (22 to 18), and Cluster 3 again has a higher proportion of females (78 to 56).
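Because the clusters differ a lot in size, proportions are easier to compare than raw counts. As a small sketch, the shares can be computed from the counts that `print(demographic_analysis)` reports in this post. The `Share` column is a name introduced here for illustration.

```r
# Counts copied from the print(demographic_analysis) output in this post
counts <- data.frame(
  Cluster = c(1, 1, 2, 2, 3, 3),
  Genre   = c("Female", "Male", "Female", "Male", "Female", "Male"),
  Count   = c(16, 10, 18, 22, 78, 56)
)

# Share of each gender within its cluster: count divided by the cluster total
cluster_totals <- ave(counts$Count, counts$Cluster, FUN = sum)
counts$Share <- round(counts$Count / cluster_totals, 3)

print(counts)
```

With these counts, females make up roughly 62% of Cluster 1, 45% of Cluster 2 and 58% of Cluster 3.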

Stacked Bar Plot of Age Group Distribution in Each Cluster

ggplot(customer_data, aes(x = as.factor(Cluster), fill = Age, group = Age)) +
  geom_bar() +
  labs(title = "Age Group Distribution in Each Cluster", x = "Cluster", y = "Count") +
  theme_minimal()

This stacked bar plot breaks down the age group distribution within each cluster. The height of the bar indicates the total count of customers in each cluster, and the segments within the bar show the distribution of customers across age groups.

Cluster 3 has by far the highest total count of customers, and a noticeably large share of them are young people aged 20 to 30. It may be worthwhile to tailor promotions or products to a younger audience in Cluster 3.
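The plot above fills by the raw Age value, so each individual age gets its own segment. If explicit age groups are wanted, `cut()` can bin the ages first. This is a minimal sketch: the ages are those of the first ten customers shown in this post, and the bin edges and labels are hypothetical choices.

```r
# Ages of the first ten customers, taken from the str() output in this post
ages <- c(19, 21, 20, 23, 31, 22, 35, 23, 64, 30)

# Hypothetical bin edges; right = FALSE makes each bin include its lower edge,
# so 30 falls in "30-39" rather than "18-29"
age_group <- cut(
  ages,
  breaks = c(18, 30, 40, 50, 60, Inf),
  labels = c("18-29", "30-39", "40-49", "50-59", "60+"),
  right = FALSE
)

print(table(age_group))
```

Mapping a factor like `age_group` to `fill` in `geom_bar()` would give a discrete legend with one color per age group instead of a continuous gradient.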

print(demographic_analysis)
## # A tibble: 6 × 3
##   Cluster Genre  Count
##     <int> <chr>  <int>
## 1       1 Female    16
## 2       1 Male      10
## 3       2 Female    18
## 4       2 Male      22
## 5       3 Female    78
## 6       3 Male      56

Scatter plot

ggplot(customer_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100., color = as.factor(Cluster))) +
  geom_point() +
  labs(title = "Customer Segmentation", x = "Annual Income", y = "Spending Score") +
  theme_minimal()

The scatter plot illustrates the relationship between customers’ annual income and spending score, color-coded by the identified clusters. Cluster 1 combines a lower annual income with a low spending score. Cluster 2 also shows a low spending score even though its customers’ annual income is high. Cluster 3 tends to have higher spending scores among customers with both higher and lower annual incomes, so it might be a potential high-value customer segment.
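Since the clustering runs on scaled features, the centroids that `kmeans()` returns are in standard-deviation units. To describe clusters in terms of income and spending score, the centroids can be mapped back to the original units. The sketch below uses simulated data (not the mall data), and `nstart = 25` is an assumption added for stability.

```r
set.seed(123)

# Simulated income (in thousands) and spending score; not the mall data
sim <- data.frame(
  income = c(rnorm(30, mean = 25, sd = 5), rnorm(30, mean = 85, sd = 10)),
  score  = c(rnorm(30, mean = 20, sd = 5), rnorm(30, mean = 80, sd = 8))
)
sim_scaled <- scale(sim)
fit <- kmeans(sim_scaled, centers = 2, nstart = 25)

# scale() stores the column means and standard deviations as attributes,
# so centroids can be mapped back: original = scaled * sd + mean
centers_original <- sweep(fit$centers, 2, attr(sim_scaled, "scaled:scale"), "*")
centers_original <- sweep(centers_original, 2, attr(sim_scaled, "scaled:center"), "+")

print(round(centers_original, 1))
```

Each row of the result is the mean income and mean spending score of one cluster, which gives a numeric counterpart to what the scatter plot shows.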

Boxplot

ggplot(customer_data, aes(x = Genre, y = Spending.Score..1.100., fill = Genre)) +
  geom_boxplot() +
  labs(title = "Spending Score by Genre", x = "Genre", y = "Spending Score") +
  theme_minimal()

This boxplot compares the distribution of spending scores between genders. It shows that while the median spending scores are similar for both genders, there is greater variability among male customers.
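The two quantities a boxplot highlights, the median and the interquartile range, can also be computed directly. The sketch below does so for the first ten customers only, with values reconstructed from the `str()` and `subset()` output in this post. Ten rows are far too few to draw conclusions, so this shows the calculation rather than a result.

```r
# First ten customers, reconstructed from the str() and subset() output in this post
first_ten <- data.frame(
  Genre = c("Male", "Male", "Female", "Female", "Female",
            "Female", "Female", "Female", "Male", "Female"),
  Score = c(39, 81, 6, 77, 40, 76, 6, 94, 3, 72)
)

# Median and interquartile range of the spending score for each gender
medians <- tapply(first_ten$Score, first_ten$Genre, median)
iqrs    <- tapply(first_ten$Score, first_ten$Genre, IQR)

print(rbind(median = medians, IQR = iqrs))
```

Running the same two `tapply()` calls on the full `customer_data` would give the numbers behind the boxplot.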

Histogram

ggplot(customer_data, aes(x = Annual.Income..k.., fill = as.factor(Cluster))) +
  # Income is recorded in thousands (15 to 137), so each bin is 5 (i.e. 5k) wide
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.7) +
  labs(title = "Distribution of Annual Income", x = "Annual Income", y = "Frequency") +
  theme_minimal()

Violin plot

ggplot(customer_data, aes(x = as.factor(Cluster), y = Spending.Score..1.100., fill = as.factor(Cluster))) +
  geom_violin() +
  labs(title = "Spending Score Distribution by Cluster", x = "Cluster", y = "Spending Score") +
  theme_minimal()

The violin plot provides a detailed view of the spending score distribution within each cluster. Cluster 1 consists of customers with relatively low spending scores, the highest being below 50. Cluster 2 covers a much wider range of spending scores: most of its customers have lower scores, yet the distribution also peaks towards higher values, indicating more diverse spending behavior than in the other clusters. Cluster 3 consists predominantly of customers with higher spending scores, ranging from around 40 up to about 100.

Faceted Scatter plot

ggplot(customer_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100., color = as.factor(Cluster))) +
  geom_point() +
  facet_wrap(~Cluster, scales = "free") +
  labs(title = "Customer Segmentation", x = "Annual Income", y = "Spending Score") +
  theme_minimal()

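This faceted scatter plot shows spending score against annual income for each cluster separately, which helps in visually discerning the distinct spending behaviors within each cluster. Because the panels use scales = "free", each panel has its own axis ranges, so positions should be compared across panels using the axis labels rather than by eye.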

Okay, that’s it for today!