There are five columns in the mall_customers.csv file: ID, gender, age, income, and spending_score.
Which variables can be used to cluster the customers using k-means clustering? Briefly explain why.
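ID: Can’t be used because it is an identifier (a label, not a measurement), so distances between IDs are meaningless
gender: Can’t be used because it is categorical, and k-means needs numeric variables to compute means and Euclidean distances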
age: Could be used because it is numeric
income: Could be used because it is numeric
spending_score: Could be used because it is numeric
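To see why an identifier makes a poor clustering variable, recall that k-means groups observations by Euclidean distance. The sketch below uses three made-up customers (hypothetical values, not rows from mall_customers.csv) to show how ID can swamp a meaningful variable like income.

```r
# Toy data: hypothetical values chosen for illustration only
toy <- data.frame(
  ID     = c(1, 2, 200),
  income = c(15, 120, 16)
)

# Distances using income only: customers 1 and 3 are nearly identical
d_income <- as.matrix(dist(toy["income"]))

# Distances using ID and income: customers 1 and 3 now look far apart,
# purely because their arbitrary ID labels differ
d_both <- as.matrix(dist(toy))

d_income[1, 3]
d_both[1, 3]
```

The same problem applies to any categorical variable that has been recoded as arbitrary numbers.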
Regardless of your answer in question 1, use income and spending_score to cluster the customers
Create the appropriate graphs to examine the variables
mall_cust |>
  dplyr::select(income, spending_score) |>
  GGally::ggpairs()
Using the graph(s) you created in question 2, how many clusters do there appear to be in the data? Briefly explain your answer.
From the scatterplot, there appear to be 5 clusters: one in the middle of the plot and one toward each of the four corners.
Create a data set called cust_stan that contains the standardized versions of income and spending_score.
cust_stan <-
  mall_cust |>
  dplyr::select(income:spending_score) |>
  scale() |>
  data.frame()
# If cust_stan was created correctly, the code below should run, and its output lets you check the result
skimr::skim(cust_stan)
Name | cust_stan |
Number of rows | 200 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
income | 0 | 1 | 0 | 1 | -1.73 | -0.73 | 0.04 | 0.66 | 2.91 | ▆▇▇▂▁ |
spending_score | 0 | 1 | 0 | 1 | -1.91 | -0.60 | -0.01 | 0.88 | 1.89 | ▃▃▇▃▃ |
How does the output from skim() tell you that the data were standardized?
Both variables have a mean of 0 and a standard deviation of 1, which is exactly what standardizing produces.
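For reference, standardizing subtracts a column's mean and divides by its standard deviation, z = (x - mean) / sd. A minimal sketch on made-up income values (hypothetical, not taken from the data) suggests that doing this by hand matches what scale() returns:

```r
# Hypothetical incomes, for illustration only
x <- c(15, 40, 60, 78, 137)

# Standardize by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so convert it to a plain vector
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # checks that the two approaches agree
mean(z_manual)                # should be 0, up to floating-point error
sd(z_manual)                  # should be 1, up to floating-point error
```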
Create the two graphs seen in class that help decide the number of clusters in the data below. Make sure to include nstart = 10!
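# Elbow plot: total within-cluster sum of squares for each number of clusters
fviz_nbclust(
  x = cust_stan,
  FUNcluster = kmeans,
  method = 'wss',
  nstart = 10
)
# Silhouette plot: average silhouette width for each number of clusters
fviz_nbclust(
  x = cust_stan,
  FUNcluster = kmeans,
  method = 'silhouette',
  nstart = 10
)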
From all the graphs created so far, how many clusters do you think are in the data?
Both plots above suggest 5 clusters, which agrees with the scatter plot created in question 2!
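For intuition, the elbow ('wss') plot can be roughly reproduced by hand: fit k-means for a range of k and record the total within-cluster sum of squares each time. The sketch below uses simulated data as a hypothetical stand-in for cust_stan (the blob centers and spread are made up), and base graphics instead of factoextra.

```r
set.seed(1)

# Simulated stand-in for cust_stan: 5 tight blobs (hypothetical data)
true_centers <- rbind(
  c(-1.3,  1.1),
  c(-1.3, -1.1),
  c( 0.0,  0.0),
  c( 1.0, -1.3),
  c( 1.0,  1.2)
)
sim <- do.call(rbind, lapply(1:5, function(i) {
  cbind(
    income         = rnorm(40, mean = true_centers[i, 1], sd = 0.2),
    spending_score = rnorm(40, mean = true_centers[i, 2], sd = 0.2)
  )
}))

# Total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) {
  kmeans(sim, centers = k, nstart = 10)$tot.withinss
})

# Look for the "elbow": the k after which adding clusters helps very little
plot(
  1:10, wss, type = "b",
  xlab = "Number of clusters k",
  ylab = "Total within-cluster sum of squares"
)
```

With the real data, cust_stan would take the place of sim.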
Regardless of your answer in the previous question, perform k-means clustering for five (5) clusters. Again, make sure to use nstart = 10! Save the results as cust_k5 and display the centers in the knitted document rounded to 2 decimal places.
# Keep this at the top
RNGversion('4.1.0')
set.seed(1870)
# Create cust_k5 below
cust_k5 <-
  kmeans(
    x = cust_stan,
    centers = 5,
    nstart = 10
  )
# Display the centers of the 5 clusters below
round(cust_k5$centers, 2)
## income spending_score
## 1 -1.33 1.13
## 2 -1.30 -1.13
## 3 -0.20 -0.03
## 4 1.05 -1.28
## 5 0.99 1.24
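The nstart = 10 argument matters because k-means begins from randomly chosen centers and can settle on a poor local solution. With nstart = 10, kmeans() tries 10 random starts and keeps the run with the smallest total within-cluster sum of squares. A rough sketch on simulated data (a hypothetical stand-in for cust_stan, with made-up blob centers):

```r
set.seed(1870)

# Simulated stand-in for cust_stan: 5 blobs (hypothetical data)
true_centers <- rbind(
  c(-1.3,  1.1),
  c(-1.3, -1.1),
  c( 0.0,  0.0),
  c( 1.0, -1.3),
  c( 1.0,  1.2)
)
sim <- do.call(rbind, lapply(1:5, function(i) {
  cbind(
    income         = rnorm(40, mean = true_centers[i, 1], sd = 0.3),
    spending_score = rnorm(40, mean = true_centers[i, 2], sd = 0.3)
  )
}))

# A single random start
set.seed(42)
one_start <- kmeans(sim, centers = 5, nstart = 1)

# Ten random starts; kmeans() keeps the best of them
set.seed(42)
ten_starts <- kmeans(sim, centers = 5, nstart = 10)

# Lower is better; the best of 10 starts is typically at least as good
c(
  one_start  = one_start$tot.withinss,
  ten_starts = ten_starts$tot.withinss
)
```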
The code chunks below will create a set of graphs you can use to describe how the 5 clusters differ across the two variables.
mall_cust |>
  dplyr::select(income, spending_score) |>
  GGally::ggpairs(mapping = aes(color = factor(cust_k5$cluster)))
mall_cust |>
  # Adding the cluster each person was assigned to
  mutate(cluster = factor(cust_k5$cluster)) |>
  # pivoting income and spending score into one column
  pivot_longer(
    cols = income:spending_score,
    names_to = 'variable',
    values_to = 'value'
  ) |>
  # Creating faceted box plots
  ggplot(
    mapping = aes(
      x = value,
      y = cluster,
      fill = cluster
    )
  ) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(
    facets = vars(variable),
    scales = 'free_x'
  ) +
  theme_bw() +
  labs(x = NULL)
What separates (or if you wanna use business lingo, segments) the different customer clusters?
Cluster 1: Low income, high spending score
Cluster 2: Low income, low spending score
Cluster 3: Medium income, medium spending score
Cluster 4: High income, low spending score
Cluster 5: High income, high spending score
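One way to back up labels like these with numbers is to average the original (unstandardized) variables within each cluster. The sketch below uses simulated data as a hypothetical stand-in for mall_cust (the group means and units are made up) and base R's aggregate(); with the real data, mall_cust and cust_k5$cluster would take the place of sim and km$cluster.

```r
set.seed(1870)

# Hypothetical stand-in for mall_cust: 5 groups in made-up original units
group_means <- rbind(
  c(25, 80),
  c(25, 20),
  c(55, 50),
  c(88, 17),
  c(87, 82)
)
sim <- do.call(rbind, lapply(1:5, function(i) {
  data.frame(
    income         = rnorm(40, mean = group_means[i, 1], sd = 5),
    spending_score = rnorm(40, mean = group_means[i, 2], sd = 5)
  )
}))

# Cluster on the standardized variables, as in the steps above
km <- kmeans(scale(sim), centers = 5, nstart = 10)
sim$cluster <- factor(km$cluster)

# Average the original variables within each cluster
cluster_means <- aggregate(
  cbind(income, spending_score) ~ cluster,
  data = sim,
  FUN = mean
)

# Round only the numeric columns (the cluster column is a factor)
cluster_means[, -1] <- round(cluster_means[, -1], 2)
cluster_means
```

Because the means are in the original units, they are easier to describe to a non-technical audience than the standardized centers.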