Module 8 Homework: Clustering Mall Customers

Mall Customers description

There are five columns in the mall_customers.csv file. They are:

ID: A column that uniquely identifies each person
gender: The gender of the shopper
age: The age of the shopper
income: The person’s yearly income (in thousands of $)
spending_score: A score from 1 to 100 assigned to each shopper, where 100 indicates someone who shops and/or spends a lot and 1 indicates someone who never spends any money

Question 1: Appropriate variables to use

Which variables can be used to cluster the customers using k-means clustering? Briefly explain why.

ID: Can’t be used because it is a categorical (or identifier) variable

gender: Can’t be used because it is categorical

age: Could be used because it is numeric

income: Could be used because it is numeric

spending_score: Could be used because it is numeric
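
The reasoning above comes down to the fact that k-means averages the columns and measures Euclidean distance between rows, so it needs numeric inputs. A minimal sketch of that check is below. The data frame toy_cust and its values are hypothetical (only the column names match mall_customers.csv), and storing ID as text is an assumption about how the file is read in.

```r
# Hypothetical rows with the same column names as mall_customers.csv
# (values are made up; ID is ASSUMED to be stored as text)
toy_cust <- data.frame(
  ID = c("001", "002", "003"),
  gender = c("Male", "Female", "Female"),
  age = c(19, 35, 52),
  income = c(15, 60, 120),
  spending_score = c(39, 50, 88)
)

# k-means computes column means and Euclidean distances, so it needs numbers
is_num <- sapply(toy_cust, is.numeric)
is_num

# Keep only the numeric columns as clustering candidates
numeric_vars <- names(toy_cust)[is_num]
numeric_vars
```

If ID were stored as a number it would pass is.numeric() but still should not be clustered on, since its values are labels rather than measurements.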

Question 2: Exploratory data analysis

Regardless of your answer in question 1, use income and spending_score to cluster the customers

Create the appropriate graphs to examine the variables

mall_cust |> 
  dplyr::select(income, spending_score) |> 
  GGally::ggpairs()

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Question 3: How many clusters appear in the data

Using the graph(s) you created in question 2, how many clusters do there appear to be in the data? Briefly explain your answer.

From the scatterplot, there appear to be 5 clusters.

Question 4: Standardizing the data

Create a data set called cust_stan that contains the standardized versions of income and spending_score.

cust_stan <- 
  mall_cust |> 
  dplyr::select(income:spending_score) |> 
  scale() |> 
  data.frame()


# If cust_stan was created correctly, the code below will run and its output can be used to check the result
skimr::skim(cust_stan)

Data summary
Name	cust_stan
Number of rows	200
Number of columns	2
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
income	0	1	0	1	-1.73	-0.73	0.04	0.66	2.91	▆▇▇▂▁
spending_score	0	1	0	1	-1.91	-0.60	-0.01	0.88	1.89	▃▃▇▃▃

How does the output from skim() tell you that the data were standardized?

The mean is 0 and the standard deviation is 1!
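
To see why the mean is 0 and the standard deviation is 1 after standardizing, the sketch below repeats the calculation that scale() performs by hand. The data frame toy and its values are hypothetical and are not taken from mall_customers.csv.

```r
# Hypothetical toy values, NOT taken from mall_customers.csv
toy <- data.frame(
  income = c(15, 40, 60, 78, 137),
  spending_score = c(39, 81, 6, 55, 94)
)

# scale() subtracts each column's mean and divides by its standard deviation
toy_stan <- data.frame(scale(toy))

# The same calculation done by hand for one column
income_by_hand <- (toy$income - mean(toy$income)) / sd(toy$income)

# Both versions agree
all.equal(toy_stan$income, income_by_hand)

# Every standardized column has mean 0 (up to rounding error) and sd 1
round(colMeans(toy_stan), 10)
sapply(toy_stan, sd)
```

Standardizing matters for k-means because the distance between two customers would otherwise be dominated by whichever variable has the larger spread.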

Question 5: Deciding on the number of clusters

Create the two graphs seen in class that help decide the number of clusters in the data. Make sure to include nstart = 10!

fviz_nbclust(
  x = cust_stan,
  FUNcluster = kmeans,
  method = 'wss',
  nstart = 10
)

fviz_nbclust(
  x = cust_stan,
  FUNcluster = kmeans,
  method = 'silhouette',
  nstart = 10
)

From all the graphs created so far, how many clusters do you think are in the data?

Both plots above suggest 5 clusters, which agrees with the scatter plot created in question 2!
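
For intuition about the 'wss' plot, the sketch below computes the total within-cluster sum of squares by hand for k = 1 to 10 using base R only. It is an illustration under stated assumptions, not the code fviz_nbclust() runs internally: sim_stan is simulated data standing in for cust_stan, and the group locations are made up.

```r
# Simulated stand-in for cust_stan: 5 hypothetical groups of 40 points each
set.seed(1870)
group_centers <- rbind(
  c(-1.3, 1.1), c(-1.3, -1.1), c(0, 0), c(1, -1.2), c(1, 1.2)
)
sim_stan <- data.frame(
  income = rep(group_centers[, 1], each = 40) + rnorm(200, sd = 0.25),
  spending_score = rep(group_centers[, 2], each = 40) + rnorm(200, sd = 0.25)
)

# Total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) {
  kmeans(x = sim_stan, centers = k, nstart = 10)$tot.withinss
})

# Elbow plot: look for the k where the curve stops dropping steeply
plot(
  1:10, wss,
  type = "b",
  xlab = "Number of clusters k",
  ylab = "Total within-cluster sum of squares"
)
```

The within-cluster sum of squares generally keeps shrinking as k grows, so the elbow plot is read for the point of diminishing returns rather than for its minimum.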

Question 6: Clustering the data

Regardless of your answer in the previous question, perform k-means clustering for five (5) clusters. Again, make sure to use nstart = 10! Save the results as cust_k5 and display the centers in the knitted document, rounded to 2 decimal places.

# Keep this at the top
RNGversion('4.1.0')
set.seed(1870)

# Create cust_k5 below
cust_k5 <- 
  kmeans(
    x = cust_stan,
    centers = 5,
    nstart = 10
  )

# Display the centers of the 5 clusters below
round(cust_k5$centers, 2)

##   income spending_score
## 1  -1.33           1.13
## 2  -1.30          -1.13
## 3  -0.20          -0.03
## 4   1.05          -1.28
## 5   0.99           1.24
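
The centers above are in standardized units, so a value such as 1.05 means "about one standard deviation above the average income". If centers in the original units are wanted, the standardization can be undone. The sketch below shows one way to do that on simulated data; raw, raw_scaled, fit, and centers_raw are hypothetical names and the values are made up. Note that it keeps the matrix returned by scale(), because converting to a data frame (as was done for cust_stan) drops the stored means and standard deviations.

```r
# Hypothetical raw data, NOT the real mall_customers.csv values
set.seed(1870)
raw <- data.frame(
  income = c(rnorm(50, mean = 25, sd = 5), rnorm(50, mean = 85, sd = 5)),
  spending_score = c(rnorm(50, mean = 20, sd = 5), rnorm(50, mean = 80, sd = 5))
)

# scale() stores the column means and sds as attributes of the returned matrix
raw_scaled <- scale(raw)
col_means <- attr(raw_scaled, "scaled:center")
col_sds <- attr(raw_scaled, "scaled:scale")

# Cluster the standardized data (2 clusters for this toy example)
fit <- kmeans(x = raw_scaled, centers = 2, nstart = 10)

# Undo the standardization: multiply each column by its sd, then add its mean
centers_raw <- sweep(fit$centers, 2, col_sds, "*")
centers_raw <- sweep(centers_raw, 2, col_means, "+")
round(centers_raw, 2)
```

Each row of centers_raw matches the average of the raw variables for the customers assigned to that cluster.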

Question 7: Describe the difference between the 5 clusters

The code chunks below will create a set of graphs you can use to describe how the 5 clusters differ across the two variables.

mall_cust |> 
  dplyr::select(income, spending_score) |> 
  GGally::ggpairs(mapping = aes(color = factor(cust_k5$cluster)))

mall_cust |> 
  # Adding the cluster each person was assigned to
  mutate(cluster = factor(cust_k5$cluster)) |> 
  # pivoting income and spending score into one column
  pivot_longer(
    cols = income:spending_score,
    names_to = 'variable',
    values_to = 'value'
  ) |> 
  # Creating faceted box plots
  ggplot(
    mapping = aes(
      x = value,
      y = cluster,
      fill = cluster
    )
  ) + 
  geom_boxplot(show.legend = FALSE) + 
  facet_wrap(
    facets = vars(variable),
    scales = 'free_x'
  ) + 
  theme_bw() + 
  labs(x = NULL)

What separates (or if you wanna use business lingo, segments) the different customer clusters?

Cluster 1: Low income, high spending score

Cluster 2: Low income, low spending score

Cluster 3: Medium income, medium spending score

Cluster 4: High income, low spending score

Cluster 5: High income, high spending score