This document presents a segmentation analysis of a dataset of
consumer behaviors. The analysis loads the required libraries, reads
and explores the dataset, scales the data, chooses the number of
clusters with the elbow method, and runs K-means clustering on the
scaled data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
library(factoextra)
## Explanation: Loading the libraries used for data manipulation (tidyverse), clustering algorithms (cluster), and visualization of clustering results (factoextra).
pda_data <- read.csv("C:/Users/gambe/Downloads/PDA2001DataSegmentation.csv")
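## Explanation: Reading the dataset from a local CSV file. The absolute path is specific to the author's machine and must be adjusted to run the analysis elsewhere. The dataset is assumed to contain the features used for the segmentation analysis.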
glimpse(pda_data)
## Rows: 160
## Columns: 33
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Innovator <int> 4, 5, 3, 3, 4, 5, 5, 3, 6, 2, 4, 4, 4, 3, 3, 4, 3, 4, 4, …
## $ usemessage <int> 1, 4, 5, 1, 1, 3, 5, 1, 5, 2, 5, 3, 4, 3, 6, 2, 4, 3, 2, …
## $ usecell <int> 5, 7, 5, 6, 6, 5, 5, 5, 5, 7, 7, 5, 5, 5, 7, 7, 7, 6, 5, …
## $ UsePIM <int> 7, 5, 6, 7, 7, 7, 6, 5, 7, 5, 6, 6, 7, 5, 6, 7, 7, 7, 6, …
## $ InfPassive <int> 4, 6, 3, 5, 7, 3, 6, 6, 3, 4, 5, 7, 7, 5, 5, 3, 7, 3, 7, …
## $ InfActive <int> 6, 5, 4, 7, 5, 7, 4, 5, 7, 5, 7, 3, 7, 6, 4, 6, 6, 5, 5, …
## $ RemoteAccess <int> 3, 6, 7, 3, 5, 4, 3, 4, 1, 5, 2, 3, 4, 2, 3, 4, 4, 4, 3, …
## $ ShareInf <int> 1, 5, 6, 4, 5, 6, 3, 3, 3, 4, 2, 5, 1, 7, 3, 1, 3, 2, 1, …
## $ Monitor <int> 4, 4, 4, 4, 2, 5, 4, 4, 2, 6, 6, 4, 7, 2, 3, 5, 3, 7, 2, …
## $ Email <int> 5, 7, 4, 4, 7, 6, 6, 7, 6, 4, 7, 7, 7, 4, 6, 6, 7, 7, 4, …
## $ Web <int> 5, 3, 4, 7, 5, 7, 5, 4, 4, 6, 6, 6, 3, 6, 7, 6, 7, 4, 7, …
## $ Mmedia <int> 6, 6, 6, 5, 5, 6, 7, 7, 6, 5, 6, 5, 6, 6, 5, 5, 7, 6, 2, …
## $ Ergonomic <int> 4, 1, 3, 5, 4, 3, 4, 1, 3, 3, 4, 6, 4, 3, 4, 4, 5, 2, 4, …
## $ Monthly <int> 15, 15, 35, 15, 25, 40, 20, 15, 15, 25, 20, 15, 30, 20, 3…
## $ Price <int> 280, 310, 370, 390, 410, 250, 310, 230, 240, 290, 230, 16…
## $ Age <int> 48, 43, 48, 28, 54, 62, 53, 48, 46, 59, 53, 21, 54, 41, 2…
## $ Education <int> 3, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 2, 2, 3, 2, 2, …
## $ Income <int> 47, 89, 24, 52, 84, 46, 71, 39, 87, 21, 76, 63, 53, 81, 4…
## $ Construction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ Emergency <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Sales <int> 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, …
## $ Service <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Professional <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Computers <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ PDA <int> 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, …
## $ Cell <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ PC <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Away <int> 3, 3, 6, 7, 2, 4, 6, 3, 4, 6, 4, 6, 4, 4, 6, 5, 5, 5, 1, …
## $ BusinessWeek <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, …
## $ PCMagazine <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ FieldStream <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Mgourmet <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## Explanation: Using glimpse() to get a brief overview of the dataset structure: 160 observations and 33 variables, all stored as integers.
pda_data_scaled <- scale(pda_data[,-1])
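## Explanation: Dropping the first column (ID, an identifier rather than a feature) and standardizing every remaining variable to mean 0 and standard deviation 1. K-means relies on Euclidean distances, so without scaling, variables with large ranges such as Price and Income would dominate the clustering.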
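To make the scaling step concrete, the sketch below applies scale() to a small toy data frame and reproduces one column by hand. The values and the names toy, toy_scaled, and price_z are illustrative assumptions, not part of the PDA dataset.

```r
# Toy data (hypothetical values, not taken from the PDA dataset)
toy <- data.frame(
  Price  = c(280, 310, 370, 390),
  Rating = c(4, 5, 3, 3)
)

# scale() subtracts each column's mean and divides by its standard deviation
toy_scaled <- scale(toy)

# The same z-scores computed by hand for the Price column
price_z <- (toy$Price - mean(toy$Price)) / sd(toy$Price)

round(colMeans(toy_scaled), 10)  # column means after scaling
apply(toy_scaled, 2, sd)         # column standard deviations after scaling
all.equal(as.numeric(toy_scaled[, "Price"]), price_z)
```

One caveat worth checking on real data: a column with zero variance has a standard deviation of 0, so scale() turns it into NaN, and kmeans() would then fail on it.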
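set.seed(123) # kmeans() starts from random centers, so fix the seed for a reproducible plot
fviz_nbclust(pda_data_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) # dashed line at the chosen k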

## Explanation: Choosing the number of clusters from a plot of the total within-cluster sum of squares (WSS) against the number of clusters. The dashed vertical line at 4 marks the point chosen as the elbow, so four clusters are used in this analysis.
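The quantity behind the elbow plot can also be computed directly. The sketch below is a minimal base-R version under stated assumptions: it uses synthetic stand-in data with three well-separated groups instead of the PDA dataset, and the names toy, toy_scaled, k_values, and wss are hypothetical.

```r
set.seed(123)  # k-means starts from random centers

# Stand-in data: three synthetic groups in two dimensions (not the PDA data)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1, ..., 8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# The elbow is the k after which the curve stops dropping sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow is a judgment call: when the curve bends gradually, several values of k can be defensible.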
set.seed(123) # Ensure reproducibility
kmeans_result <- kmeans(pda_data_scaled, centers = 4, nstart = 25)
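## Explanation: Running K-means with the 4 clusters chosen from the elbow plot. nstart = 25 tries 25 random sets of starting centers and keeps the solution with the lowest total within-cluster sum of squares; set.seed() makes the result reproducible.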
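A fitted kmeans object carries several components that are useful for a quick sanity check before profiling. The sketch below shows them on synthetic stand-in data; toy and fit are hypothetical names, and the two-group data is an assumption made only for illustration.

```r
set.seed(123)

# Stand-in data: two synthetic groups of 30 points each (not the PDA data)
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2)
))

fit <- kmeans(toy, centers = 2, nstart = 25)

fit$size                   # number of observations in each cluster
fit$centers                # cluster centers, in scaled units
fit$tot.withinss           # total within-cluster sum of squares
fit$betweenss / fit$totss  # share of total sum of squares between clusters
```

In the analysis above, the same components would be read from kmeans_result; very small clusters or a low between-cluster share would be a reason to revisit the choice of k.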
pda_data_clustered <- pda_data
pda_data_clustered$cluster <- as.factor(kmeans_result$cluster)
pda_data_clustered %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean)) # across() replaces the deprecated funs()
## # A tibble: 4 × 34
## cluster ID Innovator usemessage usecell UsePIM InfPassive InfActive
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 147. 2.41 3.35 4.29 3 6 6.18
## 2 2 31 3.62 3.75 5.80 5.86 4.93 5.20
## 3 3 79.5 2.33 5.61 5.5 2.20 3.83 3.78
## 4 4 122. 5 3.61 5.80 3.93 3.85 3.66
## # ℹ 26 more variables: RemoteAccess <dbl>, ShareInf <dbl>, Monitor <dbl>,
## # Email <dbl>, Web <dbl>, Mmedia <dbl>, Ergonomic <dbl>, Monthly <dbl>,
## # Price <dbl>, Age <dbl>, Education <dbl>, Income <dbl>, Construction <dbl>,
## # Emergency <dbl>, Sales <dbl>, Service <dbl>, Professional <dbl>,
## # Computers <dbl>, PDA <dbl>, Cell <dbl>, PC <dbl>, Away <dbl>,
## # BusinessWeek <dbl>, PCMagazine <dbl>, FieldStream <dbl>, Mgourmet <dbl>
## Explanation: Attaching the cluster labels to the original (unscaled) data and computing the mean of each variable by cluster to profile the segments. The ID column is only an identifier, so its cluster mean carries no meaning.
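A cluster profile is easier to read when it also reports cluster sizes and leaves out the identifier. The sketch below shows one way to do that with dplyr on a toy data frame; the values, the cluster assignments, and the names toy and profile are hypothetical.

```r
library(dplyr)

# Toy data with made-up cluster labels (not the PDA results)
toy <- data.frame(
  ID      = 1:6,
  Price   = c(280, 310, 370, 390, 410, 250),
  Income  = c(47, 89, 24, 52, 84, 46),
  cluster = factor(c(1, 1, 2, 2, 3, 3))
)

profile <- toy %>%
  select(-ID) %>%                         # the identifier is not a feature
  group_by(cluster) %>%
  summarise(across(everything(), mean),   # mean of every remaining variable
            n = n())                      # cluster size, added after across()

profile
```

Placing n = n() after across() keeps the size column out of the column-wise summary, which matters when the summary function is not the mean.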
fviz_cluster(kmeans_result, data = pda_data_scaled)

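## Explanation: Visualizing the clustering results with fviz_cluster. Because the data has more than two variables, the observations are plotted on the first two principal components, which shows how the points are distributed among the clusters and how much the clusters overlap in that projection.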
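The projection behind this kind of plot can be approximated by hand with prcomp(). The sketch below is a rough base-R equivalent on synthetic stand-in data, not a reproduction of what fviz_cluster draws; toy, fit, pca, scores, and var_explained are hypothetical names.

```r
set.seed(123)

# Stand-in data: two synthetic groups in four dimensions (not the PDA data)
toy <- scale(rbind(
  matrix(rnorm(120, mean = 0), ncol = 4),
  matrix(rnorm(120, mean = 4), ncol = 4)
))
fit <- kmeans(toy, centers = 2, nstart = 25)

# Principal component analysis of the already scaled data
pca <- prcomp(toy)
scores <- pca$x[, 1:2]                           # coordinates on PC1 and PC2
var_explained <- pca$sdev^2 / sum(pca$sdev^2)    # variance share per component

# Scatter plot of the first two components, colored by cluster
plot(scores, col = fit$cluster, pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]))
```

When the first two components explain only a small share of the variance, clusters that look overlapping in the plot may still be well separated in the full feature space.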