# Load the survey data; the file name must match the uploaded file exactly
mydata <- read.csv("customer_segmentation (1).csv")
head(mydata)
## ID CS_helpful Recommend Come_again All_Products Profesionalism Limitation
## 1 1 2 2 2 2 2 2
## 2 2 1 2 1 1 1 1
## 3 3 2 1 1 1 1 2
## 4 4 3 3 2 4 1 2
## 5 5 2 1 3 5 2 1
## 6 6 1 1 3 2 1 1
## Online_grocery delivery Pick_up Find_items other_shops Gender Age Education
## 1 2 3 4 1 2 1 2 2
## 2 2 3 3 1 2 1 2 2
## 3 3 3 2 1 3 1 2 2
## 4 3 3 2 2 2 1 3 5
## 5 2 3 1 2 3 2 4 2
## 6 1 2 1 1 4 1 2 5
str(mydata)
## 'data.frame': 22 obs. of 15 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ CS_helpful : int 2 1 2 3 2 1 2 1 1 1 ...
## $ Recommend : int 2 2 1 3 1 1 1 1 1 1 ...
## $ Come_again : int 2 1 1 2 3 3 1 1 1 1 ...
## $ All_Products : int 2 1 1 4 5 2 2 2 2 1 ...
## $ Profesionalism: int 2 1 1 1 2 1 2 1 2 1 ...
## $ Limitation : int 2 1 2 2 1 1 1 2 1 1 ...
## $ Online_grocery: int 2 2 3 3 2 1 2 1 2 3 ...
## $ delivery : int 3 3 3 3 3 2 2 1 1 2 ...
## $ Pick_up : int 4 3 2 2 1 1 2 2 3 2 ...
## $ Find_items : int 1 1 1 2 2 1 1 2 1 1 ...
## $ other_shops : int 2 2 3 2 3 4 1 4 1 1 ...
## $ Gender : int 1 1 1 1 2 1 1 1 2 2 ...
## $ Age : int 2 2 2 3 4 2 2 2 2 2 ...
## $ Education : int 2 2 2 5 2 5 3 2 1 2 ...
summary(mydata)
## ID CS_helpful Recommend Come_again
## Min. : 1.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 6.25 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :11.50 Median :1.000 Median :1.000 Median :1.000
## Mean :11.50 Mean :1.591 Mean :1.318 Mean :1.455
## 3rd Qu.:16.75 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :22.00 Max. :3.000 Max. :3.000 Max. :3.000
## All_Products Profesionalism Limitation Online_grocery delivery
## Min. :1.000 Min. :1.000 Min. :1.0 Min. :1.000 Min. :1.000
## 1st Qu.:1.250 1st Qu.:1.000 1st Qu.:1.0 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :1.000 Median :1.0 Median :2.000 Median :3.000
## Mean :2.091 Mean :1.409 Mean :1.5 Mean :2.273 Mean :2.409
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.0 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :3.000 Max. :4.0 Max. :3.000 Max. :3.000
## Pick_up Find_items other_shops Gender
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.250 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :2.000 Median :1.000
## Mean :2.455 Mean :1.455 Mean :2.591 Mean :1.273
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:3.750 3rd Qu.:1.750
## Max. :5.000 Max. :3.000 Max. :5.000 Max. :2.000
## Age Education
## Min. :2.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :2.500
## Mean :2.455 Mean :3.182
## 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :4.000 Max. :5.000
# Remove ID column (first column)
cluster_data <- mydata[,-1]
# Keep only numeric columns, since scale() and kmeans() require numeric input
numeric_data <- cluster_data[, sapply(cluster_data, is.numeric)]
# Scale the data
cluster_scaled <- scale(numeric_data)
set.seed(123)
kmeans_result <- kmeans(cluster_scaled, centers = 3, nstart = 25)
# Add cluster labels to original dataset
mydata$Cluster <- as.factor(kmeans_result$cluster)
head(mydata)
## ID CS_helpful Recommend Come_again All_Products Profesionalism Limitation
## 1 1 2 2 2 2 2 2
## 2 2 1 2 1 1 1 1
## 3 3 2 1 1 1 1 2
## 4 4 3 3 2 4 1 2
## 5 5 2 1 3 5 2 1
## 6 6 1 1 3 2 1 1
## Online_grocery delivery Pick_up Find_items other_shops Gender Age Education
## 1 2 3 4 1 2 1 2 2
## 2 2 3 3 1 2 1 2 2
## 3 3 3 2 1 3 1 2 2
## 4 3 3 2 2 2 1 3 5
## 5 2 3 1 2 3 2 4 2
## 6 1 2 1 1 4 1 2 5
## Cluster
## 1 3
## 2 3
## 3 3
## 4 2
## 5 2
## 6 1
table(mydata$Cluster)
##
## 1 2 3
## 6 4 12
Answer:
The table above shows how many of the 22 customers fall into each
cluster: 6 in cluster 1, 4 in cluster 2, and 12 in cluster 3. Cluster 3
is the largest group, with just over half of the respondents.
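As a small illustration, the counts reported by table(mydata$Cluster) can be turned into percentages. The sketch below re-enters the three counts by hand so that it runs on its own; the names cluster_sizes and cluster_pct are introduced here for illustration only.

```r
# Cluster sizes as reported by table(mydata$Cluster) in the output above
cluster_sizes <- c("1" = 6, "2" = 4, "3" = 12)

# Convert the counts to percentages of all respondents
cluster_pct <- round(100 * prop.table(cluster_sizes), 1)
print(cluster_pct)
```

With the real objects available, prop.table(table(mydata$Cluster)) should give the same shares directly.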
aggregate(numeric_data, by = list(Cluster = mydata$Cluster), mean)
## Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1 1 1.000000 1.00 1.500000 2.166667 1.000000
## 2 2 2.500000 2.00 2.500000 3.250000 2.000000
## 3 3 1.583333 1.25 1.083333 1.666667 1.416667
## Limitation Online_grocery delivery Pick_up Find_items other_shops Gender
## 1 1.166667 1.666667 1.666667 2.000000 1.333333 3.666667 1.166667
## 2 2.000000 2.250000 3.000000 1.250000 2.000000 2.750000 1.750000
## 3 1.500000 2.583333 2.583333 3.083333 1.333333 2.000000 1.166667
## Age Education
## 1 2.833333 4.500000
## 2 2.750000 3.000000
## 3 2.166667 2.583333
aggregate(numeric_data, by = list(Cluster = mydata$Cluster), median)
## Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1 1 1.0 1 1.0 2.0 1
## 2 2 2.5 2 2.5 3.5 2
## 3 3 1.5 1 1.0 2.0 1
## Limitation Online_grocery delivery Pick_up Find_items other_shops Gender Age
## 1 1.0 1.5 2 2 1 4.0 1 2.5
## 2 1.5 2.5 3 1 2 2.5 2 2.5
## 3 1.0 3.0 3 3 1 2.0 1 2.0
## Education
## 1 5.0
## 2 2.5
## 3 2.0
Answer:
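Means and medians describe each cluster by showing the typical value of
each variable within it, which makes the differences between customer
groups visible. For example, cluster 2 has the highest mean on
CS_helpful (2.50), Recommend (2.00), and Come_again (2.50), while
cluster 1 has the highest mean Education code (4.50). Differences like
these can inform marketing strategies tailored to each group.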
Answer:
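The median is the better choice when the data contain outliers or are
skewed, because extreme values have little effect on it. The mean is
useful when the values are roughly symmetric around the center. In this
case the median is more reliable: the variables are coded survey
responses on short integer scales and the clusters are small (4 to 12
customers), so a single unusual response can noticeably shift a cluster
mean.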
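A minimal sketch of this effect, using made-up responses on a 1 to 5 scale (the vectors below are hypothetical and are not taken from the survey data): adding one extreme answer moves the mean but leaves the median where it was.

```r
# Hypothetical responses for a small cluster (not from the dataset)
typical      <- c(1, 1, 2, 2, 2)
with_outlier <- c(typical, 5)   # one respondent answers 5

# The mean shifts when the extreme response is added
mean(typical)         # 1.6
mean(with_outlier)    # about 2.17

# The median stays the same
median(typical)       # 2
median(with_outlier)  # 2
```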
Answer:
Important summary measures include the mean, median, minimum, maximum,
and standard deviation, along with the number of observations in each
cluster. Together these describe each cluster’s center and spread and
allow for better comparison between groups.
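One possible way to compute several of these measures per cluster is to apply aggregate() once per statistic. The sketch below runs on synthetic stand-in data so that it is self-contained; toy_data, toy_cluster, and cluster_summaries are illustrative names, and the values are randomly generated rather than taken from the survey.

```r
set.seed(1)

# Synthetic stand-in for numeric_data (assumption: random values, not the survey)
toy_data <- data.frame(
  CS_helpful = sample(1:3, 22, replace = TRUE),
  Recommend  = sample(1:3, 22, replace = TRUE)
)

# Synthetic cluster labels with the same group sizes as in the report (6, 4, 12)
toy_cluster <- factor(rep(1:3, times = c(6, 4, 12)))

# Build one summary table per measure, each with one row per cluster
stats <- list(mean = mean, median = median, sd = sd, min = min, max = max)
cluster_summaries <- lapply(stats, function(f) {
  aggregate(toy_data, by = list(Cluster = toy_cluster), FUN = f)
})

# Inspect the per-cluster standard deviations
cluster_summaries$sd
```

With the real objects, replacing toy_data by numeric_data and toy_cluster by mydata$Cluster should produce the corresponding tables for the survey.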
distance_matrix <- dist(cluster_scaled)
hc_result <- hclust(distance_matrix, method = "ward.D2")
plot(hc_result, main = "Hierarchical Clustering Dendrogram",
     xlab = "", sub = "")
Answer:
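K-means clustering requires choosing the number of clusters beforehand
(here, centers = 3) and works efficiently with large datasets.
Hierarchical clustering builds a dendrogram and does not require a
predefined number of clusters; the number of groups is chosen afterward
by cutting the tree at a given height. I prefer K-means because it is
faster and its cluster assignments and centers are easier to interpret.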
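To compare the two methods directly, the dendrogram can be cut into the same number of groups with cutree() and cross-tabulated against the K-means labels. The following is a sketch on synthetic, well-separated points (an assumption, used here in place of cluster_scaled so the example runs on its own); toy, km, hc, and agreement are illustrative names.

```r
set.seed(123)

# Synthetic toy data: three groups of 10 points in two dimensions (not the survey)
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 6, sd = 0.3), ncol = 2)
)
toy_scaled <- scale(toy)

# K-means with three centers, as in the report
km <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Ward hierarchical clustering, then cut the tree into three groups
hc <- hclust(dist(toy_scaled), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate the two labelings; the label numbers themselves are arbitrary,
# so agreement shows up as each row concentrating in a single column
agreement <- table(kmeans = km$cluster, hclust = hc_groups)
print(agreement)
```

On the survey data, the same comparison would use cluster_scaled, kmeans_result$cluster, and cutree(hc_result, k = 3).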
Answer:
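We should use numeric_data (instead of mydata or
mydata[,-1]) because it contains only the numeric survey variables. By
this point mydata also includes the ID column, which is an identifier
rather than a measurement, and the Cluster factor, for which a mean or
median cannot be computed; mydata[,-1] drops ID but still includes
Cluster. Using numeric_data ensures that every column passed to
aggregate() can be summarized.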
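The difference can be seen on a tiny example. The data frame below is hypothetical (four made-up rows, with Q1 and Q2 standing in for survey items) and mirrors the order of operations in the report: the numeric subset is taken before the Cluster factor is added.

```r
# Hypothetical data: an ID plus two numeric survey items (values are made up)
toy <- data.frame(ID = 1:4, Q1 = c(1, 2, 2, 3), Q2 = c(3, 3, 1, 2))

# Drop ID first, then add the cluster label to the original data frame
toy_numeric <- toy[, -1]
toy$Cluster <- factor(c(1, 1, 2, 2))

# toy[, -1] now contains the non-numeric Cluster column; toy_numeric does not
sapply(toy[, -1], is.numeric)
sapply(toy_numeric, is.numeric)

# Summarizing the numeric subset by cluster works without warnings
res <- aggregate(toy_numeric, by = list(Cluster = toy$Cluster), FUN = mean)
print(res)
```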