# load the packages used throughout
library(tidyverse)
library(factoextra)

# setting theme for all plots
theme_update(plot.title = element_text(hjust = 0.5, size = 2))
theme_update(plot.subtitle = element_text(hjust = 0.5))
mytheme <- theme_minimal() + theme(plot.title = element_text(size = 11, face = "bold"), 
        legend.position = "bottom", 
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        text=(element_text(family = "calibri")),
        plot.background = element_blank())

Market segmentation is the activity of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers based on some type of shared characteristics.

Every customer is different. Under this assumption, consumer-focused service businesses such as shopping centres should tailor their marketing to specific, smaller groups rather than operate on generalizations. The end goal is to drive each group to buy something (profit). In the process, companies also hope to gain better insight into their customers’ preferences and needs, with the idea of discovering what each group finds most valuable in its shopping experience (enrichment).

This is where machine learning comes in. Shopping centres use their customers’ data to build ML models that identify the right customers to target. The aim is not only to increase sales but also to improve the shopping experience.

1 Data Preparation

This dataset consists of 200 customers with their personal information (gender and age) and their spending-related information (annual income and spending score). It is clearly a sample of a much larger set of customer data, used here for unsupervised learning only, and that should not stop us from going further.

# Import Data
data <- read.csv("datasets_42674_74935_Mall_Customers.csv")
glimpse(data)

#> Rows: 200
#> Columns: 5
#> $ CustomerID             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
#> $ Gender                 <fct> Male, Male, Female, Female, Female, Female, ...
#> $ Age                    <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
#> $ Annual.Income..k..     <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
#> $ Spending.Score..1.100. <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

1.1 Data Cleaning

# Checking missing data
anyNA(data)

#> [1] FALSE

# change column names to be more coherent and tidy
data_clean <- data %>% 
  setNames(c("customerid", "gender", "age", "annual_income", "spending_score")) %>% 
  # the ID is a label, not a quantity, so store it as a factor
  mutate(customerid = as.factor(customerid))

data_clean

2 Exploratory Data Analysis

Variable explanations:
* customerid : customer ID number
* gender : female or male
* age : age of each customer
* annual income (k$) : annual income of the customer
* spending score : the score (out of 100) given to a customer by the mall authorities, based on the money spent and the behavior of the customer

Understanding customers through their demographic distribution and spending habits is a crucial step in building a market segmentation. There are many stories of shopping centres that project a highly segmented image aimed at a very narrow group only. For example, if you want to be fancy in Jakarta, you may go to Plaza Indonesia, which means the upper-middle and upper classes are its primary target. However, many shopping centres have both a primary and a secondary target. Their main focus is still the primary one, but they allocate part of their business resources to the less vital segments, although not as much as to the main one.

Again, demographics are a good initial indicator for creating customer buckets.

Below are some insights captured by the EDA:
* Female customers are more frequent than male customers. Looking at gender and age together, both genders visit shopping malls less often as they grow older, although the decline is gentler for male customers than for their female counterparts. The most intense activity happens around ages 25 to 32 for both genders.

# number of customers per gender
data_clean %>% 
  count(gender, name = "freq") %>% 
  ggplot(aes(x = gender, y = freq, fill = gender)) +
  geom_col() + mytheme

# age distribution per gender
ggplot(data_clean, aes(x = age, fill = gender)) + 
  geom_density(alpha = 0.4)

Spending score and annual income are our focal point here since, logically, they correlate more with someone’s future buying intention, and they will be our primary data for clustering.

Most males have a spending score of around 25 to 70, whereas most females have a spending score of around 35 to 75, which suggests that women tend to be the bigger spenders.

More males than females are in the higher income range, but the numbers of males and females are about equal at low annual income.

The plot of annual income against age, drawn as a blue line, and the plot of annual income against spending score, drawn in pink, show how age and spending vary with annual income.

ggplot(data_clean, aes(y = spending_score, fill = gender)) +
  geom_boxplot()

# annual income distribution per gender
ggplot(data_clean, aes(x = gender, y = annual_income, fill = gender)) +
  geom_boxplot()
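
The boxplot reading can be backed by numbers. The following is a minimal sketch, assuming a data frame shaped like `data_clean`; the `toy` data frame and its values are invented for illustration and are not taken from the dataset.

```r
# Hypothetical stand-in for data_clean (values are invented for illustration)
toy <- data.frame(
  gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  spending_score = c(35, 50, 75, 25, 45, 70)
)

# Quartiles of spending score per gender: the numbers behind each box
score_quartiles <- tapply(
  toy$spending_score,
  toy$gender,
  quantile,
  probs = c(0.25, 0.5, 0.75)
)
score_quartiles
```

Swapping `toy` for `data_clean` should give the quartiles that the boxes in the plot are drawn from.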

3 Clustering

K-means clustering centers around 2 important ideas:
* determining the number of centroids k, that is, how many clusters give the most useful segmentation, and
* scaling the data.

Whether to scale depends on whether the attributes have very different ranges or are measured in different units. If the variables are already on comparable ranges, scaling is not necessary and may distort the data instead.

In this case, annual income and spending score are pretty much in the same range, so we can proceed without scaling.
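
For comparison, here is a minimal sketch of what scaling would look like if the variables were on very different ranges. It is not applied in this analysis; the `toy` data frame and its values are invented for illustration.

```r
# Hypothetical customers: income in k$ and a 1-100 spending score (invented values)
toy <- data.frame(
  annual_income = c(15, 60, 137),
  spending_score = c(39, 81, 6)
)

# Compare the spread of each variable before deciding on scaling
sapply(toy, range)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy)

# Euclidean distances that k-means would work with, without and with scaling
dist(toy)
dist(toy_scaled)
```

Because k-means relies on Euclidean distance, a variable with a much wider range would otherwise dominate the result; when the ranges are already similar, the two distance matrices tell much the same story.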

# check whether the data needs scaling
range(data_clean$annual_income)

#> [1]  15 137

range(data_clean$spending_score)

#> [1]  1 99

# select the variables used for clustering
data_scale <- data_clean %>% 
  select(annual_income, spending_score)

# Find optimum cluster
RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK) {
  total_k <- 2:maxK
  # total within-cluster sum of squares for each candidate k
  withinall <- sapply(total_k, function(k) {
    set.seed(101)  # same seed for every k so the runs are comparable
    kmeans(data, centers = k)$tot.withinss
  })
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Clusters", ylab = "Total within-cluster sum of squares")
  # return the values invisibly so they can be inspected later
  invisible(data.frame(k = total_k, tot_withinss = withinall))
}

data_tune <- kmeansTunning(data_scale, maxK = 9)

We’ll use 5 clusters, since beyond k = 5 the curve flattens and each additional cluster reduces the total within-cluster sum of squares only slightly (the elbow).
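
Reading the elbow off a plot is a judgment call. One way to make it less subjective is to look at the percentage drop in the total within-cluster sum of squares for each added cluster. The sketch below uses synthetic data (invented, not the mall data) and base R only; `nstart = 20` is an assumption added to make each fit more stable.

```r
# Synthetic data with three well-separated groups (invented, not the mall data)
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Percentage reduction gained by each additional cluster;
# the elbow is where this drop becomes small
pct_drop <- round(100 * -diff(wss) / head(wss, -1), 1)
data.frame(k = ks[-1], pct_drop = pct_drop)
```

The same calculation could be run on `data_scale` to check whether the drop after 5 clusters is as small as the plot suggests.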

# calculating k-means
RNGkind(sample.kind = "Rounding")
set.seed(100)
data_kmeans <- kmeans(data_scale, centers = 5)
data_kmeans

#> K-means clustering with 5 clusters of sizes 101, 36, 5, 22, 36
#> 
#> Cluster means:
#>   annual_income spending_score
#> 1      48.16832       43.39604
#> 2      83.11111       82.41667
#> 3     129.20000       56.40000
#> 4      25.72727       79.36364
#> 5      84.52778       18.38889
#> 
#> Clustering vector:
#>   [1] 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1
#>  [38] 4 1 4 1 4 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [112] 1 1 1 1 1 1 1 1 1 1 1 1 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2
#> [149] 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5
#> [186] 2 5 2 5 2 5 2 5 2 5 3 3 3 3 3
#> 
#> Within cluster sum of squares by cluster:
#> [1] 42712.297  7718.306  4036.000  3519.455  9873.528
#>  (between_SS / total_SS =  74.9 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
#> [6] "betweenss"    "size"         "iter"         "ifault"

The ratio of the between-cluster sum of squares to the total sum of squares is 74.9%, meaning that most of the total variation comes from the distance between clusters rather than from variation within them. This suggests the data is reasonably well clustered, since observations in the same cluster lie close to one another. The cluster sizes are not evenly distributed, though: they range from 5 to 101 members.
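
The percentage printed by `kmeans` can be reproduced from the components of the fitted object. This is a minimal sketch on synthetic data (invented for illustration); `betweenss`, `tot.withinss` and `totss` are the component names listed in the output above.

```r
# Synthetic two-group data (invented for illustration)
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# The total sum of squares splits into a between-cluster and a within-cluster part
all.equal(fit$totss, fit$betweenss + fit$tot.withinss)

# Share of the total variation explained by the cluster assignment
ratio <- fit$betweenss / fit$totss
round(100 * ratio, 1)
```

Applied to `data_kmeans`, the same expression should return the 74.9% reported above.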

# plotting cluster result
fviz_cluster(data_kmeans, data = data_scale, ggtheme = theme_minimal())

We now have the cluster of each observation. Let’s join the cluster vector to the dataset.

# Join cluster and original data
data_clust <- data_clean %>% 
  mutate(cluster = as.factor(data_kmeans$cluster)) %>% 
  select(cluster, everything())
data_clust

4 Conclusion

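# average income and spending score per cluster
# (only the numeric clustering variables, since the mean of a factor is NA)
data_clust %>% 
  group_by(cluster) %>% 
  dplyr::summarise(across(c(annual_income, spending_score), mean))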

Now, let’s deduce the characteristics of each cluster as our summary :

* Cluster 1 : medium income, medium spending.
* Cluster 2 : high income, high spending. This should be our primary target. They have the money and the willingness to spend.
* Cluster 3 : high income, average spending. This may as well be our secondary target, since they have the resources. We just have to understand what appeals to them the most in general. Further market analysis is also necessary.
* Cluster 4 : low income, high spending. The spendthrifts. This is a danger to them, but not to us.
* Cluster 5 : high income, low spending. These people are quite careful with their money. Perhaps we should launch a separate market analysis here.
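
Once the segments are defined, a new customer can be placed in one of them by finding the nearest cluster center. The sketch below is illustrative only: the centers are copied from the k-means output above, while `assign_cluster` is a hypothetical helper and the customers are made up.

```r
# Cluster centers copied from the k-means output above
centers <- rbind(
  c(48.16832, 43.39604),
  c(83.11111, 82.41667),
  c(129.20000, 56.40000),
  c(25.72727, 79.36364),
  c(84.52778, 18.38889)
)
colnames(centers) <- c("annual_income", "spending_score")

# Hypothetical helper: index of the centroid closest (Euclidean) to one customer
assign_cluster <- function(income, score, centers) {
  d <- sqrt((centers[, "annual_income"] - income)^2 +
            (centers[, "spending_score"] - score)^2)
  unname(which.min(d))
}

# A made-up customer earning 80k$ with a spending score of 85
assign_cluster(80, 85, centers)
```

This customer lands in cluster 2, the high income, high spending group, which matches how k-means itself assigns points: each observation belongs to the cluster whose center is closest.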

Spotify Clustering

Yoanna

8/21/2020
