Loading packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
data = read.csv('Mall_Customers.csv')
The dataset contains 200 rows and 5 columns. The five variables are CustomerID, Genre (the customer's gender), Age, Annual Income (in thousands) and Spending Score (on a scale of 1 to 100) of the customers of a mall.
str(data)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Genre : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual.Income..k.. : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
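The dotted column names such as `Annual.Income..k..` come from `read.csv()`, which by default passes the header row through `make.names()` to produce syntactically valid R names. Here is a minimal sketch of that conversion. The raw header strings are assumptions, since the CSV header row is not shown in this post.

```r
# Assumed raw CSV headers (hypothetical: the file's header row is not shown here)
raw_names <- c("CustomerID", "Genre", "Age",
               "Annual Income (k$)", "Spending Score (1-100)")

# read.csv() applies make.names() to headers when check.names = TRUE (the default):
# every character that is not valid in an R name is replaced by a dot
clean_names <- make.names(raw_names)
print(clean_names)
```

Passing `check.names = FALSE` to `read.csv()` would keep the headers as they are in the file, at the cost of having to quote them with backticks in later code.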
subset(data, Genre == 'Female')[1:20,]
## CustomerID Genre Age Annual.Income..k.. Spending.Score..1.100.
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
## 7 7 Female 35 18 6
## 8 8 Female 23 18 94
## 10 10 Female 30 19 72
## 12 12 Female 35 19 99
## 13 13 Female 58 20 15
## 14 14 Female 24 20 77
## 17 17 Female 35 21 35
## 20 20 Female 35 23 98
## 23 23 Female 46 25 5
## 25 25 Female 54 28 14
## 27 27 Female 45 28 32
## 29 29 Female 40 29 31
## 30 30 Female 23 29 87
## 32 32 Female 21 30 73
## 35 35 Female 49 33 14
## 36 36 Female 21 33 81
summary(data)
## CustomerID Genre Age Annual.Income..k..
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending.Score..1.100.
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
We can see that the customers' ages range from 18 to 70, and their annual income ranges from 15k to 137k.
Checking for missing values
missing_values <- sum(is.na(data))
print(paste("Number of Missing Values:", missing_values))
## [1] "Number of Missing Values: 0"
There are no missing values in the dataset.
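A single total is enough here because it is zero. When it is not, a per-column count shows where the gaps are. A small sketch on a made-up data frame (the `toy` values are illustrative and not taken from the mall dataset):

```r
# Toy data frame with deliberately missing entries (illustrative values only)
toy <- data.frame(
  Age    = c(19, NA, 20, 23),
  Income = c(15, 15, NA, NA)
)

total_missing     <- sum(is.na(toy))      # one overall count
missing_by_column <- colSums(is.na(toy))  # named vector: one count per column

print(total_missing)
print(missing_by_column)
```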
So let's proceed with our analysis.
Customer Segmentation using K means Clustering.
set.seed(123) # for reproducibility
customer_data <- read.csv("Mall_Customers.csv")
features <- customer_data[, c("Spending.Score..1.100.", "Annual.Income..k..")]
scaled_features <- scale(features)
Performing K-means clustering (e.g., with k=3)
kmeans_result <- kmeans(scaled_features, centers = 3)
cluster_assignments <- kmeans_result$cluster
K-means is an unsupervised machine learning algorithm that partitions a dataset into K distinct, non-overlapping clusters. Its goal is to group data points by similarity, with each cluster represented by its centroid (the mean of the points assigned to it).
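To make the idea concrete, here is a minimal sketch on synthetic data rather than the mall dataset. It scales the features, fits k-means, and then compares the total within-cluster sum of squares for several values of k, which is one common way to sanity-check a choice such as k = 3. The toy data, `nstart = 25` and the range 1 to 6 are assumptions made for illustration, not settings used in the analysis above.

```r
set.seed(123)  # k-means starts from random centroids, so fix the seed

# Toy data: two well-separated groups in two dimensions (illustrative only)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)
colnames(toy) <- c("x", "y")

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy)

# nstart = 25 reruns the algorithm from 25 random starts and keeps the best fit
fit <- kmeans(toy_scaled, centers = 2, nstart = 25)

print(fit$centers)         # one centroid per cluster, in scaled units
print(table(fit$cluster))  # number of points assigned to each cluster

# Total within-cluster sum of squares for k = 1..6;
# the k where the curve stops dropping sharply (the "elbow") is a reasonable choice
wss <- sapply(1:6, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 2))
```

On data like this the drop from k = 1 to k = 2 should dominate, which matches the two groups that were generated.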
Cluster analysis
customer_data$Cluster <- cluster_assignments
Demographics Analysis
demographic_analysis <- customer_data %>%
  group_by(Cluster, Genre) %>%
  summarise(Count = n()) %>%
  ungroup()
## `summarise()` has grouped output by 'Cluster'. You can override using the
## `.groups` argument.
Bar Plot of Gender Distribution in Each Cluster
ggplot(demographic_analysis, aes(x = as.factor(Cluster), y = Count, fill = Genre)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Gender Distribution in Each Cluster", x = "Cluster", y = "Count") +
  theme_minimal()
The bar plot illustrates the gender distribution within each cluster. Cluster 1 has more females than males (16 vs. 10), Cluster 2 is close to balanced with slightly more males (22 vs. 18), and Cluster 3 also has more females than males (78 vs. 56).
Stacked Bar Plot of Age Group Distribution in Each Cluster
ggplot(customer_data, aes(x = as.factor(Cluster), fill = Age, group = Age)) +
  geom_bar() +
  labs(title = "Age Group Distribution in Each Cluster", x = "Cluster", y = "Count") +
  theme_minimal()
This stacked bar plot breaks down the age group distribution within each cluster. The height of the bar indicates the total count of customers in each cluster, and the segments within the bar show the distribution of customers across age groups.
Cluster 3 has the highest total count of customers, with a notably large share of young people aged 20 to 30. It may be worthwhile to tailor promotions or products to appeal to a younger audience in Cluster 3.
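In the plot above, Age is mapped as a continuous integer. If discrete age bands are wanted, one option is to bin the ages with `cut()` before counting or plotting. The sketch below uses a small made-up data frame, and the break points and labels are hypothetical choices, not ones used in this analysis.

```r
# Toy data standing in for customer_data (values are illustrative only)
toy <- data.frame(
  Cluster = c(1, 1, 2, 2, 3, 3, 3),
  Age     = c(19, 45, 33, 64, 22, 27, 58)
)

# Bin ages into labelled groups; right = FALSE gives intervals such as [18, 30)
toy$AgeGroup <- cut(
  toy$Age,
  breaks = c(18, 30, 45, 60, Inf),
  labels = c("18-29", "30-44", "45-59", "60+"),
  right  = FALSE
)

# Cross-tabulate cluster against age group
age_table <- table(toy$Cluster, toy$AgeGroup)
print(age_table)
```

Mapping a column like `AgeGroup` to `fill` in `geom_bar()` would then give one stacked segment per age band instead of one per individual age.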
print(demographic_analysis)
## # A tibble: 6 × 3
## Cluster Genre Count
## <int> <chr> <int>
## 1 1 Female 16
## 2 1 Male 10
## 3 2 Female 18
## 4 2 Male 22
## 5 3 Female 78
## 6 3 Male 56
Scatter plot
ggplot(customer_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100., color = as.factor(Cluster))) +
  geom_point() +
  labs(title = "Customer Segmentation", x = "Annual Income", y = "Spending Score") +
  theme_minimal()
The scatter plot shows the relationship between customers’ annual income and spending score, color-coded by cluster. Cluster 1 combines lower annual income with low spending scores. Cluster 2 also shows low spending scores, even though its customers have high annual incomes. Cluster 3 tends to have higher spending scores among customers with both lower and higher annual incomes, which makes it a potential high-value customer segment.
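A reading like this can be checked numerically by averaging each feature within each cluster. Below is a small sketch with base R's `aggregate()`. The `toy` data frame and its shortened column names `Income` and `Score` are stand-ins invented for the example.

```r
# Toy data standing in for customer_data (values are illustrative only)
toy <- data.frame(
  Cluster = c(1, 1, 2, 2, 3, 3),
  Income  = c(20, 30, 80, 90, 25, 85),
  Score   = c(15, 25, 10, 20, 80, 90)
)

# Mean income and mean spending score within each cluster
cluster_means <- aggregate(cbind(Income, Score) ~ Cluster, data = toy, FUN = mean)
print(cluster_means)
```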
Boxplot
ggplot(customer_data, aes(x = Genre, y = Spending.Score..1.100., fill = Genre)) +
  geom_boxplot() +
  labs(title = "Spending Score by Genre", x = "Genre", y = "Spending Score") +
  theme_minimal()
This boxplot compares the distribution of spending scores between genders. It shows that while the median spending scores are similar for both genders, there is greater variability among male customers.
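The two quantities a boxplot makes visible, the median and the spread of the middle half of the data, can also be computed directly. Here is a minimal sketch using `tapply()` on made-up scores. The numbers are chosen so that both groups share a median but differ in spread, and they are not taken from the mall dataset.

```r
# Toy data (illustrative values only, not taken from Mall_Customers.csv)
toy <- data.frame(
  Genre = c("Female", "Female", "Female", "Female",
            "Male", "Male", "Male", "Male"),
  Score = c(40, 45, 55, 60, 10, 40, 60, 90)
)

# Median and interquartile range of the score for each gender
medians <- tapply(toy$Score, toy$Genre, median)
iqrs    <- tapply(toy$Score, toy$Genre, IQR)

print(medians)
print(iqrs)
```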
Histogram
ggplot(customer_data, aes(x = Annual.Income..k.., fill = as.factor(Cluster))) +
  # Income is recorded in thousands (15 to 137), so a bin width of 5 means 5k
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.7) +
  labs(title = "Distribution of Annual Income", x = "Annual Income (k)", y = "Frequency") +
  theme_minimal()
Violin plot
ggplot(customer_data, aes(x = as.factor(Cluster), y = Spending.Score..1.100., fill = as.factor(Cluster))) +
  geom_violin() +
  labs(title = "Spending Score Distribution by Cluster", x = "Cluster", y = "Spending Score") +
  theme_minimal()
The violin plot provides a detailed view of the spending score distribution within each cluster. Cluster 1 consists of customers with relatively low spending scores, the highest being below 50. Cluster 2 covers a much wider range of spending scores: more of its customers have lower scores, yet the distribution also peaks towards higher values, indicating more diverse spending behavior than in the other clusters. Cluster 3 is made up predominantly of customers with higher spending scores, ranging from around 40 up to about 100.
Faceted Scatter plot
ggplot(customer_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100., color = as.factor(Cluster))) +
  geom_point() +
  facet_wrap(~Cluster, scales = "free") +
  labs(title = "Customer Segmentation", x = "Annual Income", y = "Spending Score") +
  theme_minimal()
This faceted scatter plot showcases spending score against annual income for each cluster separately. It helps in visually discerning the distinct spending behaviors within each cluster.
Okay, that’s it for today!