library(tidyverse)
library(RODBC)
library(recipes)
library(FactoMineR)
library(kableExtra)
library(ggpubr)

1 Introduction

Customers are the key focus for every business. Understanding and knowing their behaviors has been a top priority for the longest time, whether it is to optimize resources or to give a more personal experience to clients who may fall in the same category.

For that same reason this project pretends to create a clear segmentation of the clients, with the idea of better understanding the way different groups of customers behave to subsequently find the specific needs for a more specific subgroup of the total group of clients.

A proper segmentation can be of immense value for any business for correctly directing promotion efforts for products depending on the found groups.

2 Extract data from SQL server

The desired data for this segmentation project is located in an open source datawarehouse provided by microsoft called “AdventureWorks2019”, that contains a variety of tables regarding different characteristics for products, sales, departments, customers and much more, on a certain business.

As mentioned before, the idea in this case will be to perform as clear of a segmentation as possible on the clients or customers of the business. For that reason, the DimCustomer table of the data warehouse will be the main table although some more variables will be extracted from other tables regarding some of the client’s behavior.

2.1 Open A database connection with SQL Server

Before creating a query to extract the data, it is necessary to create a connection with sql server as shown here:

# Creating connection locally

db_conn = odbcConnect("LocalDSN",rows_at_time = 1) 

2.2 Write the correspondent SQL query

Now we create a SQL query which basically extracts all the data related to customers plus some information about their buying patterns like how many objects the have purchased over the time.

sql = "
SET NOCOUNT ON;

WITH CustomerInfo AS (SELECT CustomerKey, BirthDate, MaritalStatus, Gender, YearlyIncome, TotalChildren,EnglishEducation,
EnglishOccupation, HouseOwnerFlag, NumberCarsOwned, EnglishCountryRegionName, StateProvinceName, DateFirstPurchase
FROM DimCustomer
INNER JOIN DimGeography 
ON DimCustomer.GeographyKey = DimGeography.GeographyKey
WHERE EnglishCountryRegionName = 'United States'),

SalesInfo AS (SELECT CustomerKey,SUM(SalesAmount) AS TotalSpent, COUNT(CustomerKey) AS NumberOfSales 
FROM FactInternetSales 
group by CustomerKey),

Reasons AS (SELECT CustomerKey, SalesReasonKey, COUNT(SalesReasonKey) AS Count 
FROM FactInternetSales 
INNER JOIN FactInternetSalesReason 
ON FactInternetSales.SalesOrderNumber = FactInternetSalesReason.SalesOrderNumber
AND FactInternetSales.SalesOrderLineNumber = FactInternetSalesReason.SalesOrderLineNumber
GROUP BY CustomerKey, SalesReasonKey),

added_row_number AS (
  SELECT
    *,
    ROW_NUMBER() OVER(PARTITION BY CustomerKey ORDER BY Count DESC) AS row_number
  FROM Reasons
),

FavoriteReason as (select CustomerKey, DimSalesReason.SalesReasonName from added_row_number 
inner join DimSalesReason on added_row_number.SalesReasonKey = DimSalesReason.SalesReasonKey
where row_number = 1)

SELECT CustomerInfo.*, TotalSpent, NumberOfSales, FavoriteReason.SalesReasonName AS FavoriteReason
FROM CustomerInfo 
INNER JOIN SalesInfo 
ON CustomerInfo.CustomerKey = SalesInfo.CustomerKey
INNER JOIN FavoriteReason 
ON FavoriteReason.CustomerKey = CustomerInfo.CustomerKey 
ORDER BY CustomerInfo.CustomerKey
;"

2.3 Load the data from the SQL query

Once the SQL query has been created, all that is left to do is to load the data from SQL Server into the R environment.

# Saving the SQL data in a Data frame
customer_info = sqlQuery(db_conn,sql, stringsAsFactors = F)

# Closing the previous connection
odbcClose(db_conn)

3 Data contextualization

As previously stated, the loaded data contains a variety of characteristics for 7248 clients for a certain business living in the US.

A brief explanation for each of these variables will be given next up:

  • CustomerKey: Unique identification key for each customer.
  • BirthDate: Date of birth of the customer.
  • MaritalStatus: Marital status of the customer (M = married, S = single).
  • Gender: Gender of the customer (F = female, M = Male).
  • YearlyIncome: Income every year for the customer.
  • TotalChildren: Number of children customer’s children.
  • EnglishEducation: Highest level of education reached by the customer (Bachelors, Graduate Degree, High School, Partial College, Partial High School).
  • English Occupation: Occupation of the customer.
  • HouseOwnerFlag: Indicative variable for owning a house (1 = owns a house, 0 = does not own a house).
  • NumberCarsOwned: Number of cars owned by the customer.
  • EnglishCountryRegionName: Country where the customer is from (United States for everyone).
  • StateProvinceName: State where the customer is from.
  • DateFirstPurchase: Date of the customers’s first purchase.
  • TotalSpent: Total amount of dollars the customer has spent on the business.
  • NumberOfSales: Total number of products that the customer has bought from the business.
  • FavoriteReason: Main reason why the customer has bought products from the business.

4 Data pre-processing

In order to build the desired clustering model it is necessary to manipulate the data to get it into a usable format.

For this purpose we will be creating some variables that might be of interest and using the “recipes” package, which will allow us to create a recipe object with the pre-processing instructions that need to be performed on the data; like scaling, centering, creation of dummy variables and more.

4.1 Creation of interest variables

First we will create a new variable that holds the current age of each customer in years by subtracting the birth data from the final recorded date in the business database which is the 28th of january of 2014.

# Calculating customer's Age
customer_info$Age = as.numeric((as.Date("2014-01-28")-customer_info$BirthDate)/365.25)

Next up, we are interested on building a variable that accounts for the amount of time a person has been a customer by subtracting the date for the first purchase from the final recorded date.

# Calculating customer's time as a client
customer_info$TimeAsClient = as.numeric(as.Date("2014-01-28")-customer_info$DateFirstPurchase)

4.2 Creation of the recipe object

# Dropping unnecessary variables
customer_info2 = customer_info[,-c(1,2,11,13)]

# Creating recipe object for preprocessing
object_recipe = recipe(TotalSpent~.,data = customer_info2)

Now it is desired to perform different concatenations over this object that will add new instructions to the pre-processing.

4.3 Scaling and centering numerical variables

# Including normalization step to recipe object
object_recipe = object_recipe %>%
  step_normalize(all_numeric(),-HouseOwnerFlag)

object_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         13
## 
## Operations:
## 
## Centering and scaling for all_numeric(), -HouseOwnerFlag

The function “step_normalize” will create a specification of a recipe step that will center and scale all the given variables. In this case, “all_numeric” was specified which means that all the numeric variables will be normalized.

4.4 Creation of dummy variables

For this, we need to create dummy variables out of the qualitative variables like the marital status or the favorite reason. A dummy variable is a variable that indicates whether a specific observation meets a certain criteria or not.

# Including dummy variable creation step to recipe
object_recipe = object_recipe %>%
  step_dummy(all_nominal())

object_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         13
## 
## Operations:
## 
## Centering and scaling for all_numeric(), -HouseOwnerFlag
## Dummy variables from all_nominal()

4.5 Creation of the final data

set.seed(111)

# Fitting the recipe to the customer information
trained_recipe = prep(object_recipe,training = customer_info2)
trained_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         13
## 
## Training data contained 7248 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for YearlyIncome, TotalChildren, NumberCarsOwned, N... [trained]
## Dummy variables from MaritalStatus, Gender, EnglishEducation, EnglishOccupatio... [trained]
# Creating new data frame applying the recipe
finalCustomer= bake(trained_recipe,new_data = customer_info2)

Finally we end up with a new data frame that contains all the interesting customer’s information with all the necessary pre-processing. Including, centering, scaling for the numeric variables and creation of dummy variables out of the qualitative ones as we can see in the next subtable for the entire dataset:

# Outputting a part of the data
kbl(finalCustomer[1:10,7:11]) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
TimeAsClient TotalSpent MaritalStatus_S Gender_M EnglishEducation_Graduate.Degree
0.0545748 -0.6465335 0 0 0
-0.0521567 -0.6274189 0 1 0
0.0278919 -0.6131034 1 0 0
0.2718496 0.7678935 1 0 0
0.1879892 0.6692867 0 1 0
0.2527904 0.6924814 1 0 0
0.2642259 0.6634413 0 1 0
0.1574944 -0.6225789 0 1 0
-0.3380445 -0.6610009 0 1 0
-0.0140383 0.6802060 0 1 0

5 K-means segmentation

Now, it was decided to use k means as the algorithm of choice for the segmentation of the customers in the business.

This is an unsupervised learning technique that intends to use a parameter \(k\) to create \(k\) mutually exclusive groups from the data by classifying observations that have similar characteristics together.

The final segmentation provided by the k means will depend on the choice for the parameter \(k\) of clusters. Reason why choosing a proper value for the parameter is notoriously important.

set.seed(2222)
# Vector to store the within cluster sum of squares for different K's
wcss <- vector()
for(i in 1:15){
  # Storing the WCSS for the i-th model
  wcss[i] <- sum(kmeans(finalCustomer, i)$withinss)
}
# Plotting the WCSS for every K used
plot(wcss, type = "ol")

Using k = 3 seems like a good option based on the previous graph, since even though it is not the parameter that returns the best values for the error it seems like after k = 3 the error does not really changes much added to the fact that, as we will see later, using 3 clusters gives a good segmentation of the customers into mutually exclusive groups. Taking this into account and the idea of trying to keep the model as simple as possible while still delivering good results, k = 3 will be the final parameter used for the k means clustering.

set.seed(223)

# Calculating first two principal components from the data
pca = prcomp(finalCustomer,rank = 2)

# Extracting the principal components from the pca object
x_pca = as.data.frame(pca$x)

# Plotting the principal components
ggplot(x_pca,aes(PC1,PC2))+
  geom_point()

set.seed(223)

# Calculating the cluster using the k-means algorithm with K = 3
kmeans_customer = kmeans(finalCustomer,3,iter.max =1000,nstart = 1)

# Including the clusters in the same data frame as the principal components
x_pca$cluster = kmeans_customer$cluster

# Visualizing the clusters using the principal components
ggplot(x_pca,aes(PC1,PC2,col = as.factor(cluster)))+
  geom_point()+
  labs(col = "Cluster")

# Calculating frequency for each cluster
cluster_table = x_pca %>%
  group_by(cluster) %>%
  summarise(Total_customers = n())

# Outputting table with frecuencies
kbl(cluster_table)%>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
cluster Total_customers
1 3268
2 1319
3 2661

As mentioned before, the utilization of 3 clustering groups results in a very clear segmentation of the customers into mutually exclusive groups as we can see in the previous principal components plot. This leads us to believe that using k = 3 for this exercise is the optimal way to go since it keeps a simple model and provides an outstanding segmentation while also keeping a reduced error.

# Including the cluster in the original data frame
customer_info$Cluster = as.factor(x_pca$cluster)

6 Descriptive analysis for the clusters.

Building these groups is only the first thing we are interested on. Next up, it is necessary to study the behavior of the clients that fall into each one of the three groups in some variables that might represent interest for the business.

# Calculating average money spent
customer_info$AverageSpent = customer_info$TotalSpent/customer_info$NumberOfSales


# Plot for average spent by cluster
ggplot(customer_info,aes(x = Cluster, y = AverageSpent,fill = Cluster))+
  geom_boxplot()+
  theme_bw()+
  theme(legend.position = "none")+
  labs(title = "Distribution for the average amount spent for each purchase")

It is very clear how those customers classified as part of the second group, are the ones that tend to spend the most money in general out of all the groups in every purchase. On the other hand it seems that groups 1 and 3 do not present a significant difference in the average money spent.

# Plot for frequency of each gender by cluster
p1 = ggplot(customer_info,aes(x = Cluster,fill = Gender))+
  geom_bar(position = position_dodge())+
  theme_bw()+
  theme(legend.position = "bottom")

# Plot for frequency of marital status by cluster
p2 = ggplot(customer_info,aes(x = Cluster,fill = MaritalStatus))+
  geom_bar(position = position_dodge())+
  theme_bw()+
  theme(legend.position = "bottom")

# Arranging both plots in the same layout
ggarrange(p1,p2)

Now regarding gender, it looks like every cluster has about the same proportion of women and men. Which means that having a specific gender does not make you more likely to fall into a specific group.

On the other hand, is looks like the majority of people classified as part of the first group are married, while the other two groups seem to have a very balanced distribution.

# Plot for the yearly income by cluster
ggplot(customer_info,aes(x = Cluster, y = YearlyIncome,fill = Cluster))+
  geom_boxplot()+
  theme_bw()+
  theme(legend.position = "none")+
  labs(title = "Distribution for the yearly income by cluster")

Even though we saw how the second group is the one that tends to spend the most, the previous plot shows how this group has in average the same yearly income than the first group. This is very interesting, because it means that customers in the second group are far more likely to spend a lot of money, not because of the money they make but because of the things they are willing to buy while the first group seems to be formed by people who get a similar income but do not want to spend it on more expensive products.

# Plot for the number of children by the cluster
ggplot(customer_info,aes(x = Cluster, y = TotalChildren,fill = Cluster))+
  geom_boxplot()+
  theme_bw()+
  theme(legend.position = "none")+
  labs(title = "Distribution for the number of children by cluster")

Now this plot demonstrates a very interesting result. It basically shows how people classified into the first cluster tend to have more children than people on the other two groups. This leads us to believe that the probable reason why the first cluster buys less expensive products than the second group even when both of them tend to have the same yearly income, is because of the number of children, where, since the people in the first group tend to have more children, they cannot afford to buy more expensive products like the second group can.

This result also makes sense since we previously saw how the first group is mainly formed by married people, which makes it reasonable to believe that they will also have more children in average.

# Plot for the yearly income by the level of education
p1 = ggplot(customer_info,aes(x = EnglishEducation,
                              y = YearlyIncome,
                              fill = EnglishEducation))+
  geom_boxplot()+
  theme_bw()+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45,hjust = 1))

# Plot for the frequency of the levels of education by cluster
p2 = ggplot(customer_info,aes(x = Cluster,fill = EnglishEducation))+
  geom_bar(position = position_dodge())+
  theme_bw()+
  theme(legend.position = "none")

# Arranging both plots in the same layout
ggarrange(p1,p2)

# Plot for the yearly income by the occupation 
p1 = ggplot(customer_info,aes(x = EnglishOccupation,y = YearlyIncome,fill = EnglishOccupation))+
  geom_boxplot()+
  theme_bw()+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45,hjust = 1))

# Plot for the frequency ofeach occupation by cluster
p2 = ggplot(customer_info,aes(x = Cluster,fill = EnglishOccupation))+
  geom_bar(position = position_dodge())+
  theme_bw()+
  theme(legend.position = "none")

# Arranging both plots in the same layout
ggarrange(p1,p2)

This set of graphics come to show how the third cluster is mainly formed of people who did not finish their academical studies and have clerical or skilled manual related jobs which, generally, can be related to a lower social class o projected income as we also saw before with the average yearly income.

This gets us to believe that the people who were classified as part of the third cluster are, in general, the ones who have less resources.

# Plot for whether a person owns a flag or not and the cluster
p1 = ggplot(customer_info,aes(x = Cluster,
                         fill = as.factor(HouseOwnerFlag)))+
  geom_bar(position = "fill")+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(fill = "Owns a house")

# Plot for the number of cars owned by cluster
p2 = ggplot(customer_info,aes(x = Cluster,
                              y = NumberCarsOwned,
                              fill = Cluster))+
  geom_boxplot()+
  theme_bw()+
  theme(legend.position = "none")

# Arranging both plots in the same layout
ggarrange(p1,p2)

It seems there is not a big difference regarding the number of cars. On the other hand, it looks like people from the third cluster are a little less likely to own a house but not by much.

# Creating a copy of the original data frame
customer_info3 = customer_info

# Using only the 3 most frequent buying reasons
customer_info3$FavoriteReason = 
  ifelse(customer_info3$FavoriteReason 
      %in% c("Price","On Promotion"),
      customer_info3$FavoriteReason,"Other")

# Plot for the frequency of each reason for buying by cluster.
ggplot(customer_info3,aes(x = Cluster,fill = FavoriteReason))+
  geom_bar(position = "fill")+
  theme_bw()+
  theme(legend.position = "bottom")

In general, as we already suspected, the main reason why people tend to buy at this business is because of the competent prices. However, it is important to mention that those people classified as part of the second cluster, are far more likely to buy products because they were in promotion or for other reasons that may include: Quality, Review, Manufacturer, etc. Which again comes to show how, in general, the second group is the one willing to spend the most since a great portion of them do not really mind the price itself that much.

# Plot for the age by cluster
ggplot(customer_info,aes(x = Cluster,y = Age,fill = Cluster))+
  geom_boxplot()+
  theme_bw()+
  theme(legend.position = "none")

Before, we mentioned how people classified as part of the third group tended to be people with lower incomes and resources in general. This previous plot might give us an idea of why that is; one of the probable reasons why this group tends to be less wealthy is basically because they are also the youngest demographic.

# Plot for the time as a client by cluster
p1  = ggplot(customer_info,aes(x = Cluster,
                               y = TimeAsClient,
                               fill = Cluster))+
  geom_boxplot()+
  theme_bw()+
  theme(legend.position =  "none")

# Plot for the number of products bought by cluster
p2  = ggplot(customer_info,aes(x = Cluster,
                               y = NumberOfSales,
                               fill = Cluster))+
  geom_boxplot()+
  theme_bw()+
  theme(legend.position =  "none")

# Arranging both plots in the same layout
ggarrange(p1,p2)

Despite not being the oldest population, those classified as part of the second cluster are, in general, the clients who have been buying products from the business for the longest time. Probable reason why they are also the ones who have bought more objects in general.

7 Conclusions.

7.1 Cluster 1:

This cluster is mainly formed by older people that seem to be part of a stable family, being usually married and having in general more children than the people located in the other two clusters. Even though they tend to have a very signifiCant yearly income, due to being mostly fully graduated people and having well-paid jobs, it is very clear that their main interest is not to buy expensive or very luxurious items, rather they probably focus on buying products that might represent an improvement for the lifestyle of their family members without having to overspend unnecessarily, usually looking for those objects that have a reasonable price.

7.2 Cluster 2:

On the other hand, the second cluster seems to be formed of people with stable families and life but with less responsibilities like children but maintaining a good financial status because of their general high level of education and good paying jobs. This leads to them being way more likely to worry less about the price and spend more money every time they buy something, or in other words, they are mainly interested in more expensive items meaning they might be a good focus point for the enterprise since they can produce the highest earnings. However, it is also important to note that this is the less populated cluster, so maybe investing to much on them might prove to be counterproductive since only a small proportion of the clients will be accounted for.

7.3 Cluster 3:

Finally, the last cluster seems to group younger people less likely to have a well established family and, in general, with less purchasing power due to many of them not having completed higher levels of education and mainly having less profitable jobs. They generally focus on buying things that are cheap, disregarding the overall quality of the product or more important reasons. This does not mean the business should totally forget about these people; however, investing a lot resources on them might not seem very reasonable.