Introduction

This document presents a data segmentation analysis using a dataset of consumer behaviors. The analysis involves loading necessary libraries, reading and exploring the dataset, scaling the data, finding the optimal number of clusters using the elbow method, and performing K-means clustering.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
library(factoextra)
##Explanation: Loading essential libraries for data manipulation (tidyverse), clustering algorithms (cluster), and visualization of clustering results (factoextra).
pda_data <- read.csv("C:/Users/gambe/Downloads/PDA2001DataSegmentation.csv")

##Explanation: Reading the segmentation dataset from a local CSV file. The path is specific to the author's machine and must be adjusted to wherever the file is stored.
glimpse(pda_data)
## Rows: 160
## Columns: 33
## $ ID           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Innovator    <int> 4, 5, 3, 3, 4, 5, 5, 3, 6, 2, 4, 4, 4, 3, 3, 4, 3, 4, 4, …
## $ usemessage   <int> 1, 4, 5, 1, 1, 3, 5, 1, 5, 2, 5, 3, 4, 3, 6, 2, 4, 3, 2, …
## $ usecell      <int> 5, 7, 5, 6, 6, 5, 5, 5, 5, 7, 7, 5, 5, 5, 7, 7, 7, 6, 5, …
## $ UsePIM       <int> 7, 5, 6, 7, 7, 7, 6, 5, 7, 5, 6, 6, 7, 5, 6, 7, 7, 7, 6, …
## $ InfPassive   <int> 4, 6, 3, 5, 7, 3, 6, 6, 3, 4, 5, 7, 7, 5, 5, 3, 7, 3, 7, …
## $ InfActive    <int> 6, 5, 4, 7, 5, 7, 4, 5, 7, 5, 7, 3, 7, 6, 4, 6, 6, 5, 5, …
## $ RemoteAccess <int> 3, 6, 7, 3, 5, 4, 3, 4, 1, 5, 2, 3, 4, 2, 3, 4, 4, 4, 3, …
## $ ShareInf     <int> 1, 5, 6, 4, 5, 6, 3, 3, 3, 4, 2, 5, 1, 7, 3, 1, 3, 2, 1, …
## $ Monitor      <int> 4, 4, 4, 4, 2, 5, 4, 4, 2, 6, 6, 4, 7, 2, 3, 5, 3, 7, 2, …
## $ Email        <int> 5, 7, 4, 4, 7, 6, 6, 7, 6, 4, 7, 7, 7, 4, 6, 6, 7, 7, 4, …
## $ Web          <int> 5, 3, 4, 7, 5, 7, 5, 4, 4, 6, 6, 6, 3, 6, 7, 6, 7, 4, 7, …
## $ Mmedia       <int> 6, 6, 6, 5, 5, 6, 7, 7, 6, 5, 6, 5, 6, 6, 5, 5, 7, 6, 2, …
## $ Ergonomic    <int> 4, 1, 3, 5, 4, 3, 4, 1, 3, 3, 4, 6, 4, 3, 4, 4, 5, 2, 4, …
## $ Monthly      <int> 15, 15, 35, 15, 25, 40, 20, 15, 15, 25, 20, 15, 30, 20, 3…
## $ Price        <int> 280, 310, 370, 390, 410, 250, 310, 230, 240, 290, 230, 16…
## $ Age          <int> 48, 43, 48, 28, 54, 62, 53, 48, 46, 59, 53, 21, 54, 41, 2…
## $ Education    <int> 3, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 2, 2, 3, 2, 2, …
## $ Income       <int> 47, 89, 24, 52, 84, 46, 71, 39, 87, 21, 76, 63, 53, 81, 4…
## $ Construction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ Emergency    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Sales        <int> 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, …
## $ Service      <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Professional <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Computers    <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ PDA          <int> 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, …
## $ Cell         <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ PC           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Away         <int> 3, 3, 6, 7, 2, 4, 6, 3, 4, 6, 4, 6, 4, 4, 6, 5, 5, 5, 1, …
## $ BusinessWeek <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, …
## $ PCMagazine   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ FieldStream  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Mgourmet     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
##Explanation: Using glimpse() for a quick overview of the dataset structure: 160 observations and 33 variables, all stored as integers.
pda_data_scaled <- scale(pda_data[,-1])

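##Explanation: Standardizing every variable to mean 0 and standard deviation 1 after dropping the ID column (the first column). K-means relies on Euclidean distance, so without scaling, variables with large ranges such as Price would dominate variables with small ranges such as the binary indicators.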
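To make the scaling step concrete, the sketch below standardizes a small toy data frame by hand and compares the result with scale(). The toy values and the names toy, z_manual, and z_scale are illustrative assumptions, not part of the PDA dataset.

```r
# Toy data (hypothetical values, not taken from the PDA dataset):
# a small-range rating next to a price measured in hundreds.
toy <- data.frame(
  rating = c(1, 4, 7, 3, 5),
  price  = c(230, 310, 410, 280, 370)
)

# Manual z-score: subtract the column mean, divide by the column sd.
z_manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# scale() performs the same centering and scaling by default.
z_scale <- scale(toy)

# Both approaches agree; every column now has mean 0 and sd 1.
all.equal(z_manual, z_scale, check.attributes = FALSE)
round(colMeans(z_scale), 10)
apply(z_scale, 2, sd)
```

After standardization, a one-unit difference means "one standard deviation" in every column, which is what lets the distance computation treat the variables comparably.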
set.seed(123) # kmeans() uses random starts, so fix the seed for a reproducible plot
fviz_nbclust(pda_data_scaled, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2)

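##Explanation: Plotting the total within-cluster sum of squares (WSS) against the number of clusters to choose k with the elbow method. The dashed vertical line at k = 4 is added manually with geom_vline() to mark the elbow chosen for this analysis; it is not computed by fviz_nbclust().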
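The quantity behind the elbow plot can also be computed directly. The following is a minimal base-R sketch on synthetic data (an assumption made for illustration; it does not use the PDA dataset): kmeans() is run for several values of k and the total within-cluster sum of squares is collected from each fit.

```r
set.seed(123)  # kmeans() starts from random centers

# Synthetic data (assumption): three well-separated 2-D groups of 30 points.
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 8.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The curve drops steeply up to the true number of groups, then flattens.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

On real survey data the bend is usually less sharp than in this synthetic case, so the choice of k remains partly a judgment call.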
set.seed(123) # Ensure reproducibility
kmeans_result <- kmeans(pda_data_scaled, centers = 4, nstart = 25)

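##Explanation: Running K-means with 4 clusters, the number chosen from the elbow plot. set.seed() makes the random initialization reproducible, and nstart = 25 runs the algorithm from 25 random starts and keeps the solution with the lowest total within-cluster sum of squares.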
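Before profiling, it can help to look at what a kmeans() fit contains. The sketch below uses a small synthetic matrix (the data and the names toy and fit are assumptions for illustration) and inspects the components most relevant to a segmentation.

```r
set.seed(123)

# Synthetic data (assumption): two 2-D groups of 20 points each.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 1), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 1), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 25)

fit$size           # number of observations in each cluster
fit$centers        # cluster centroids, one row per cluster
head(fit$cluster)  # cluster assignment of the first observations

# The total sum of squares splits into a within and a between part.
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# Share of total variation explained by the cluster structure.
fit$betweenss / fit$totss
```

Very unequal cluster sizes or a low between-to-total ratio are hints that the chosen number of clusters deserves a second look.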
pda_data_clustered <- pda_data
pda_data_clustered$cluster <- as.factor(kmeans_result$cluster)

pda_data_clustered %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))
## # A tibble: 4 × 34
##   cluster    ID Innovator usemessage usecell UsePIM InfPassive InfActive
##   <fct>   <dbl>     <dbl>      <dbl>   <dbl>  <dbl>      <dbl>     <dbl>
## 1 1       147.       2.41       3.35    4.29   3          6         6.18
## 2 2        31        3.62       3.75    5.80   5.86       4.93      5.20
## 3 3        79.5      2.33       5.61    5.5    2.20       3.83      3.78
## 4 4       122.       5          3.61    5.80   3.93       3.85      3.66
## # ℹ 26 more variables: RemoteAccess <dbl>, ShareInf <dbl>, Monitor <dbl>,
## #   Email <dbl>, Web <dbl>, Mmedia <dbl>, Ergonomic <dbl>, Monthly <dbl>,
## #   Price <dbl>, Age <dbl>, Education <dbl>, Income <dbl>, Construction <dbl>,
## #   Emergency <dbl>, Sales <dbl>, Service <dbl>, Professional <dbl>,
## #   Computers <dbl>, PDA <dbl>, Cell <dbl>, PC <dbl>, Away <dbl>,
## #   BusinessWeek <dbl>, PCMagazine <dbl>, FieldStream <dbl>, Mgourmet <dbl>
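##Explanation: Attaching the cluster labels to the original (unscaled) data and computing the mean of each variable within each cluster to profile the segments. The mean of ID appears only because every column is summarised; it has no interpretive value.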
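Since an identifier column has no meaningful average, one option is to drop it before profiling. The sketch below shows this with base R's aggregate() on a hypothetical toy data frame; the values and the names toy and profile are assumptions for illustration.

```r
# Toy data (hypothetical values) with an identifier and a cluster label.
toy <- data.frame(
  ID      = 1:6,
  Price   = c(200, 220, 400, 420, 210, 410),
  Age     = c(25, 30, 50, 55, 28, 52),
  cluster = factor(c(1, 1, 2, 2, 1, 2))
)

# Drop ID, then average every remaining variable within each cluster.
profile <- aggregate(. ~ cluster,
                     data = toy[, names(toy) != "ID"],
                     FUN = mean)
profile
```

With dplyr, the analogous step after group_by(cluster) would be summarise(across(-ID, mean)).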
fviz_cluster(kmeans_result, data = pda_data_scaled)

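##Explanation: Visualizing the clusters with fviz_cluster(). Because the data has more than two variables, the function plots the observations on the first two principal components, so the plot shows how well the clusters separate in that two-dimensional projection.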
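The idea behind such a cluster plot can be reproduced with base R: project the scaled data onto its first two principal components and color the points by cluster. This is a rough sketch on synthetic data (an assumption for illustration), not a reimplementation of fviz_cluster().

```r
set.seed(123)

# Synthetic data (assumption): two groups of 20 points in 4 dimensions, scaled.
toy <- scale(rbind(
  matrix(rnorm(80, mean = 0), ncol = 4),
  matrix(rnorm(80, mean = 3), ncol = 4)
))

fit <- kmeans(toy, centers = 2, nstart = 25)

# Principal component scores; keep the first two components for plotting.
pca    <- prcomp(toy)
scores <- pca$x[, 1:2]

# Share of total variance carried by each component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share[1:2], 3)

# Scatter plot of the projection, colored by cluster membership.
plot(scores, col = fit$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
```

When the first two components carry only a small share of the variance, clusters that overlap in the plot may still be well separated in the full variable space.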