Unsupervised Learning methods on online sales data using dimension reduction, clustering and association rules.

Business impact: what useful insights could be found for the online retailer?

Daniela Quintero Narvaez

The main goal of this report is to provide meaningful and strategic insights to an online retailer based on the sales data gathered from its customers and products, using unsupervised learning methods. The data set was downloaded from Kaggle and can be found here.

For the first part, a customer segmentation analysis will be performed using dimension reduction and clustering. For the second part, a basket analysis will be carried out on the product categories rather than the SKUs, purely for computational reasons.

Data cleaning and preparation

Don’t forget to set the seed!

Following the reasoning of ‘garbage in - garbage out’, to ensure accurate and meaningful analysis, it is crucial to clean and pre-process the data prior to any modeling. This involves handling missing or incomplete data, as well as identifying and handling outliers or extreme values. Additionally, categorical variables must be converted into numerical variables using techniques such as one-hot encoding. By performing these essential data pre-processing steps, the subsequent modeling efforts are based on reliable and accurate data.

head(sales)
##    order_id order_date   status item_id                    sku qty_ordered
## 1 100354678 2020-10-01 received  574772     oasis_Oasis-064-36          21
## 2 100354678 2020-10-01 received  574774        Fantastic_FT-48          11
## 3 100354680 2020-10-01 complete  574777        mdeal_DMC-610-8           9
## 4 100354680 2020-10-01 complete  574779     oasis_Oasis-061-36           9
## 5 100367357 2020-11-13 received  595185    MEFNAR59C38B6CA08CD           2
## 6 100367357 2020-11-13 received  595186 MEFBUY59B7C3DDC2CA3-42           2
##   price  value discount_amount  total      category payment_method bi_st
## 1  89.9 1798.0               0 1798.0 Men's Fashion            cod Valid
## 2  19.0  190.0               0  190.0 Men's Fashion            cod Valid
## 3 149.9 1199.2               0 1199.2 Men's Fashion            cod   Net
## 4  79.9  639.2               0  639.2 Men's Fashion            cod   Net
## 5  99.9   99.9               0   99.9 Men's Fashion            cod Valid
## 6  39.9   39.9               0   39.9 Men's Fashion            cod Valid
##   cust_id year    month ref_num Name.Prefix First.Name Middle.Initial Last.Name
## 1   60124 2020 Oct-2020  987867        Drs.       Jani              W     Titus
## 2   60124 2020 Oct-2020  987867        Drs.       Jani              W     Titus
## 3   60124 2020 Oct-2020  987867        Drs.       Jani              W     Titus
## 4   60124 2020 Oct-2020  987867        Drs.       Jani              W     Titus
## 5   60124 2020 Nov-2020  987867        Drs.       Jani              W     Titus
## 6   60124 2020 Nov-2020  987867        Drs.       Jani              W     Titus
##   Gender age   full_name               E.Mail Customer.Since         SSN
## 1      F  43 Titus, Jani jani.titus@gmail.com      8/22/2006 627-31-5251
## 2      F  43 Titus, Jani jani.titus@gmail.com      8/22/2006 627-31-5251
## 3      F  43 Titus, Jani jani.titus@gmail.com      8/22/2006 627-31-5251
## 4      F  43 Titus, Jani jani.titus@gmail.com      8/22/2006 627-31-5251
## 5      F  43 Titus, Jani jani.titus@gmail.com      8/22/2006 627-31-5251
## 6      F  43 Titus, Jani jani.titus@gmail.com      8/22/2006 627-31-5251
##      Phone.No. Place.Name County   City State   Zip Region User.Name
## 1 405-959-1129     Vinson Harmon Vinson    OK 73571  South   jwtitus
## 2 405-959-1129     Vinson Harmon Vinson    OK 73571  South   jwtitus
## 3 405-959-1129     Vinson Harmon Vinson    OK 73571  South   jwtitus
## 4 405-959-1129     Vinson Harmon Vinson    OK 73571  South   jwtitus
## 5 405-959-1129     Vinson Harmon Vinson    OK 73571  South   jwtitus
## 6 405-959-1129     Vinson Harmon Vinson    OK 73571  South   jwtitus
##   Discount_Percent
## 1                0
## 2                0
## 3                0
## 4                0
## 5                0
## 6                0
str(sales)
## 'data.frame':    286392 obs. of  36 variables:
##  $ order_id        : chr  "100354678" "100354678" "100354680" "100354680" ...
##  $ order_date      : chr  "2020-10-01" "2020-10-01" "2020-10-01" "2020-10-01" ...
##  $ status          : chr  "received" "received" "complete" "complete" ...
##  $ item_id         : num  574772 574774 574777 574779 595185 ...
##  $ sku             : chr  "oasis_Oasis-064-36" "Fantastic_FT-48" "mdeal_DMC-610-8" "oasis_Oasis-061-36" ...
##  $ qty_ordered     : num  21 11 9 9 2 2 2 2 2 2 ...
##  $ price           : num  89.9 19 149.9 79.9 99.9 ...
##  $ value           : num  1798 190 1199.2 639.2 99.9 ...
##  $ discount_amount : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ total           : num  1798 190 1199.2 639.2 99.9 ...
##  $ category        : chr  "Men's Fashion" "Men's Fashion" "Men's Fashion" "Men's Fashion" ...
##  $ payment_method  : chr  "cod" "cod" "cod" "cod" ...
##  $ bi_st           : chr  "Valid" "Valid" "Net" "Net" ...
##  $ cust_id         : num  60124 60124 60124 60124 60124 ...
##  $ year            : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ month           : chr  "Oct-2020" "Oct-2020" "Oct-2020" "Oct-2020" ...
##  $ ref_num         : int  987867 987867 987867 987867 987867 987867 987867 171143 171143 171143 ...
##  $ Name.Prefix     : chr  "Drs." "Drs." "Drs." "Drs." ...
##  $ First.Name      : chr  "Jani" "Jani" "Jani" "Jani" ...
##  $ Middle.Initial  : chr  "W" "W" "W" "W" ...
##  $ Last.Name       : chr  "Titus" "Titus" "Titus" "Titus" ...
##  $ Gender          : chr  "F" "F" "F" "F" ...
##  $ age             : num  43 43 43 43 43 43 43 28 28 28 ...
##  $ full_name       : chr  "Titus, Jani" "Titus, Jani" "Titus, Jani" "Titus, Jani" ...
##  $ E.Mail          : chr  "jani.titus@gmail.com" "jani.titus@gmail.com" "jani.titus@gmail.com" "jani.titus@gmail.com" ...
##  $ Customer.Since  : chr  "8/22/2006" "8/22/2006" "8/22/2006" "8/22/2006" ...
##  $ SSN             : chr  "627-31-5251" "627-31-5251" "627-31-5251" "627-31-5251" ...
##  $ Phone.No.       : chr  "405-959-1129" "405-959-1129" "405-959-1129" "405-959-1129" ...
##  $ Place.Name      : chr  "Vinson" "Vinson" "Vinson" "Vinson" ...
##  $ County          : chr  "Harmon" "Harmon" "Harmon" "Harmon" ...
##  $ City            : chr  "Vinson" "Vinson" "Vinson" "Vinson" ...
##  $ State           : chr  "OK" "OK" "OK" "OK" ...
##  $ Zip             : int  73571 73571 73571 73571 73571 73571 73571 32042 32042 32042 ...
##  $ Region          : chr  "South" "South" "South" "South" ...
##  $ User.Name       : chr  "jwtitus" "jwtitus" "jwtitus" "jwtitus" ...
##  $ Discount_Percent: num  0 0 0 0 0 0 0 0 0 0 ...

First, it is necessary to modify variable classes so that they are in the correct format for analysis. Next, unnecessary columns containing personal data are removed, both to protect customer privacy and because they will not bring relevant insights to the analysis. Additionally, any rows containing missing values or zero transaction values are removed to ensure the integrity of the data. The age variable is also grouped into categories for easier interpretation of the results. Finally, category names are cleaned to remove unnecessary characters and formatting.
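The cleaning code itself is not shown in the report; below is a minimal sketch of how `cleansales` might be produced using dplyr. The exact filters, the category renaming rule, the age breaks and the seed value are illustrative assumptions, not the author's original code.

library(dplyr)

set.seed(123) # seed value assumed; needed for the stochastic methods used later

cleansales <- sales %>%
  # keep only the columns relevant to the analysis (drops personal data)
  select(order_id, qty_ordered, price, value, total, category,
         payment_method, cust_id, year, month, Gender, State, age) %>%
  # drop rows with missing values or zero transaction values
  filter(!is.na(total), total > 0) %>%
  na.omit() %>%
  mutate(
    order_id       = as.integer(order_id),
    # strip spaces and punctuation from category names, e.g. "Men's Fashion" -> "MensFashion"
    category       = factor(gsub("[^[:alnum:]]", "", category)),
    payment_method = factor(payment_method),
    year           = factor(year),
    month          = factor(substr(month, 1, 3), levels = month.abb, ordered = TRUE),
    Gender         = factor(Gender),
    State          = factor(State),
    # group age into categories for easier interpretation
    age_group      = cut(age, breaks = c(0, 29, 45, 60, Inf),
                         labels = c("Less than 30", "Between 30-45",
                                    "Between 46-60", "More than 60"))
  ) %>%
  select(-age)

str(cleansales)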

## 'data.frame':    174059 obs. of  13 variables:
##  $ order_id      : int  100354678 100354678 100354680 100354680 100367357 100367357 100367360 100356116 100358724 100403536 ...
##  $ qty_ordered   : int  21 11 9 9 2 2 2 1 1 3 ...
##  $ price         : num  89.9 19 149.9 79.9 99.9 ...
##  $ value         : num  1798 190 1199.2 639.2 99.9 ...
##  $ total         : num  1798 190 1199.2 639.2 99.9 ...
##  $ category      : Factor w/ 15 levels "Appliances","BeautynGrooming",..: 9 9 9 9 9 9 10 1 1 1 ...
##  $ payment_method: Factor w/ 13 levels "apg","bankalfalah",..: 4 4 4 4 4 4 4 13 4 6 ...
##  $ cust_id       : int  60124 60124 60124 60124 60124 60124 60124 42485 42485 42485 ...
##  $ year          : Factor w/ 2 levels "2020","2021": 1 1 1 1 1 1 1 1 1 1 ...
##  $ month         : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 10 10 10 10 11 11 11 10 10 12 ...
##  $ Gender        : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 2 2 2 ...
##  $ State         : Factor w/ 51 levels "AK","AL","AR",..: 37 37 37 37 37 37 37 10 10 10 ...
##  $ age_group     : Factor w/ 4 levels "Less than 30",..: 2 2 2 2 2 2 2 1 1 1 ...

After those steps, the data has the following overview:

summary(cleansales)
##     order_id          qty_ordered          price             value        
##  Min.   :100354678   Min.   :  2.000   Min.   :    0.1   Min.   :    1.0  
##  1st Qu.:100401548   1st Qu.:  2.000   1st Qu.:   39.9   1st Qu.:   53.2  
##  Median :100445795   Median :  2.000   Median :   92.3   Median :  135.0  
##  Mean   :100452409   Mean   :  3.168   Mean   :  526.8   Mean   :  685.9  
##  3rd Qu.:100509890   3rd Qu.:  3.000   3rd Qu.:  403.2   3rd Qu.:  674.0  
##  Max.   :100562385   Max.   :501.000   Max.   :29166.7   Max.   :35250.0  
##                                                                           
##      total                    category             payment_method 
##  Min.   :    1.0   MensFashion    :28909   cod            :91053  
##  1st Qu.:   52.0   MobilesnTablets:28247   Easypay        :26100  
##  Median :  124.8   WomansFashion  :19785   easypay_voucher:16221  
##  Mean   :  613.0   Appliances     :17508   Payaxis        :12896  
##  3rd Qu.:  602.7   BeautynGrooming:12608   bankalfalah    : 6126  
##  Max.   :35250.0   Superstore     :10004   Easypay_MA     : 3625  
##                    (Other)        :47210   (Other)        : 8250  
##     cust_id         year           month       Gender        State       
##  Min.   :     4   2020:66352   Dec    :50345   F:79780   CA     : 10088  
##  1st Qu.: 52526   2021:97919   Apr    :20824   M:84491   TX     :  9913  
##  Median : 72380                Mar    :16037             NY     :  9069  
##  Mean   : 67651                Jun    :12890             PA     :  8152  
##  3rd Qu.: 89145                Jan    :11544             IL     :  7893  
##  Max.   :115325                May    : 9732             OH     :  5892  
##                                (Other):42899             (Other):113264  
##          age_group    
##  Less than 30 :35603  
##  Between 30-45:42438  
##  Between 46-60:43957  
##  More than 60 :42273  
##                       
##                       
## 

The clean dataset above will be used to perform both the customer segmentation analysis and the basket analysis. Before each one, additional data preparation steps will be carried out; these are described in the corresponding sections.

It is important to keep in mind the meaning of some of the factor variables for further interpretation, for example:

Age groups:

  1 = Less than 30
  2 = Between 30-45
  3 = Between 46-60
  4 = More than 60

Gender:

  1 = Female
  2 = Male
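These codes are simply the integer encodings of the factor levels, which can be checked directly in R:

# the integer codes follow the order of the factor levels
levels(cleansales$age_group)
levels(cleansales$Gender)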

Customer Segmentation Analysis

Customer segmentation is a process of dividing a large customer base into smaller groups of customers who share similar characteristics, behaviors, or needs. One way to perform customer segmentation is to use dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to reduce the number of variables and visualize the data in a lower-dimensional space. Then, clustering algorithms such as k-means can be applied to group customers based on their similarities. These techniques are highly important as they can help businesses to better understand their customers, tailor marketing strategies to specific customer segments, and improve customer satisfaction and retention.

Exploratory Data Analysis

After completing the data cleaning process, it is important to conduct exploratory data analysis (EDA). This stage should be used to get to know the data as well as the business and its main characteristics; this allows better interpretation of the results and better recommendations for the retailer.

Regarding the customers, it is visible that the proportion of female and male shoppers is almost 50/50:

# F - Female; M- Male
summary(cleansales$Gender)
##     F     M 
## 79780 84491

The customers are also fairly evenly distributed among the different age groups:

#Age group
summary(cleansales$age_group)
##  Less than 30 Between 30-45 Between 46-60  More than 60 
##         35603         42438         43957         42273
#customers per age
ggplot(cleansales, aes(x = age_group)) +
  geom_bar() +
  labs(x = "Age Group", y = "Count of cust_id", title = 'Customers per age group')

The most popular payment method is by far 'cod', which means 'Cash on Delivery', followed by the 'Easypay' options; this may suggest that customers of this online shop like convenient payment methods.

# Payment method
summary(cleansales$payment_method)
##               apg       bankalfalah    cashatdoorstep               cod 
##               353              6126                 9             91053 
##    customercredit           Easypay        Easypay_MA   easypay_voucher 
##              3265             26100              3625             16221 
## financesettlement       jazzvoucher        jazzwallet           mcblite 
##                 0              2223              2194               206 
##           Payaxis 
##             12896

Checking the distribution of the customers across the USA, the largest numbers of customers are in California, Florida, Illinois, Montana, New York, Pennsylvania and Texas, but there is an overall presence across the whole country. States where the business should be strengthened: Delaware, Hawaii and Rhode Island.

#customers per state
ggplot(cleansales, aes(x = State)) +
  geom_bar() +
  labs(x = "State", y = "Count of cust_id", title = 'Customers per State') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Taking the business perspective, the sales in value and in volume per category look as follows:

ggplot(cleansales, aes(x = category, y = total)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "Category", y = "Value", title = "Sales per category") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From the graph above it is visible that the main cash-flow categories are Mobiles and Tablets, Appliances and Entertainment, which together represent more than 80% of the sales of the online store.

ggplot(cleansales, aes(x = category, y = qty_ordered)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "Category", y = "Volume", title = "Volume sold per category") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

By looking at the volumes, the main categories are Men's Fashion, Superstore, Mobiles and Tablets, Women's Fashion, Appliances, and Beauty and Grooming. From these graphs alone, some main inferences about the business model can be made:

  • The company's strength is around Mobiles and Tablets, a category performing well both in value and in volume. The Appliances category follows the same logic, showing weaker performance in value relative to volume, but this can be by design, as mobiles and tablets are likely more expensive items than the appliances being sold online. These could also be called destination categories.
  • Traffic-generating categories, like Men's and Women's Fashion and Superstore, where the performance in volume is significantly higher than in value; generous discounts may happen here to attract consumers to the website or tempt them to expand their basket once they are done with their main shopping.
  • Entertainment, Computing and Others show similar relative performance in both volume and value. Assuming that the majority of customers visit the store for mobiles and tablets, these categories could work perfectly as complements (software, video games, other accessories, etc.).

Now that we have more insights about the business, it is important to move to another aspect of EDA: examining histograms of the data. Histograms provide a visual representation of the distribution of a dataset, showing its range, frequency and shape at a glance, which makes it easier to identify patterns, outliers and potential issues. For the present case, they correspond to the total purchase values, total quantities ordered and prices paid, which are the numeric variables that will be used the most.

All three histograms are right-skewed, i.e. the tail of the distribution extends to the right of the peak. In our case this is particularly problematic, as the analysis may be influenced by the outliers on the high end of the distribution. This can result in the algorithm placing too much emphasis on the high values while ignoring the lower ones, which could lead to biased or incorrect results and may cause the clustering algorithm to create clusters that do not accurately reflect the underlying patterns in the data.

To address the skewness of the data, a log transformation will be applied to bring the shape of the histograms closer to a normal distribution. Bear in mind that it is generally good practice to explore different data transformations and compare the results to determine the best approach for a specific dataset and analysis goals.
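The construction of `clean_log` is not shown; a minimal sketch, assuming plain natural logarithms (all four variables are strictly positive here, so no offset is needed), would be:

clean_log <- cleansales %>%
  mutate(log_qty_ordered = log(qty_ordered),
         log_price       = log(price),
         log_total       = log(total),
         log_value       = log(value))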

hist(clean_log$log_qty_ordered)
title("Total quantity ordered per customer")

hist(clean_log$log_price)
title("Price paid per customer")

# the text mentions three histograms; the third, for the total paid,
# is assumed to be the following
hist(clean_log$log_total)
title("Total paid per customer")

After the log transformation was applied, the three variables in question have a shape similar to a Gaussian bell curve.

Outliers - Boxplots

Outliers can significantly impact the results of the analysis, as they are data points that are significantly different from others and may skew the distribution of data. This can lead, for example, to a biased estimation of cluster centers and can affect the performance of the clustering algorithms. Therefore, removing outliers can help to ensure that the customer segmentation analysis is more accurate and representative of the underlying data distribution. Additionally, removing outliers can help to improve the interpretability of the results by reducing the noise and providing a clearer understanding of the distinct customer groups.
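The box plots referenced below are not shown in the report's code; a sketch of how they might be produced on the log-transformed variables is:

boxplot(clean_log$log_qty_ordered, main = "Total quantity ordered (log)")
boxplot(clean_log$log_price, main = "Price paid (log)")
boxplot(clean_log$log_total, main = "Total paid (log)")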

Based on the above box plots, the data outside the whiskers was removed, as these are really extreme values, probably from customers that made a one-time purchase or perhaps shopped for a special occasion only. The number of rows removed was not highly significant; in this particular case it was no more than 10% of the total data.

# Removing outliers
clean_log1 <- clean_log
clean_log1 <- clean_log1[clean_log1$log_total <= 10 & clean_log1$log_total >= 0, ]
clean_log1 <- clean_log1[clean_log1$log_price <= 10 & clean_log1$log_price >= 0, ]
clean_log1 <- clean_log1[clean_log1$log_qty_ordered <= 3, ]

Dimension Reduction

In order to apply dimension reduction techniques, it is necessary to transform all variables into numeric ones. After this transformation, each row will represent a unique customer ID and each category will have its own column, allowing us to aggregate the quantities ordered and determine whether the amount of items purchased per category is relevant. Since there are thousands of SKUs, it is more practical to group the purchases by category.
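The construction of `customersegm` is not shown in the report; below is a sketch of how it might look, with one row per customer, one column per category (total quantity ordered), per-customer means of the log variables, and the remaining factors recoded as integers. The aggregation choices (sums, means, `first()` for the factors) are assumptions; the column names follow the scaled output shown below.

library(dplyr)
library(tidyr)

customersegm <- clean_log1 %>%
  group_by(cust_id, category) %>%
  summarise(qty = sum(qty_ordered), .groups = "drop") %>%
  pivot_wider(names_from = category, values_from = qty, values_fill = 0) %>%
  left_join(
    clean_log1 %>%
      group_by(cust_id) %>%
      summarise(across(c(log_qty_ordered, log_total, log_price, log_value),
                       \(x) mean(x), .names = "{.col}_mean"),
                # recode the factors as integers (see the legends above)
                age_group      = as.integer(first(age_group)),
                State          = as.integer(first(State)),
                Gender         = as.integer(first(Gender)),
                payment_method = as.integer(first(payment_method))),
    by = "cust_id") %>%
  select(-cust_id)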

Finally, scaling the data before performing PCA is important because it ensures that each variable is equally important and prevents variables with larger scales from dominating the analysis.

datascaled <- scale(customersegm)
head(datascaled)
##      MensFashion MobilesnTablets Appliances HomenLiving Computing
## [1,]   6.9357314       3.8883944  2.3665690  -0.1820035 -0.142464
## [2,]   1.0412076      -0.2843682 -0.2341078   1.6446086 -0.142464
## [3,]  -0.3190671      -0.2843682  2.2810900  -0.1820035 -0.142464
## [4,]  -0.3190671      -0.2843682 -0.2341078  -0.1820035 -0.142464
## [5,]  -0.3190671      -0.2843682 -0.2341078   3.4712207 -0.142464
## [6,]   0.1343578       1.3847369 -0.2341078  -0.1820035 -0.142464
##      BeautynGrooming HealthnSports WomansFashion  Superstore     Soghaat
## [1,]       0.3161139   -0.01369184    -0.2062302 -0.09107498  1.90887790
## [2,]      -0.1748283   -0.01369184    -0.2062302 -0.09107498 -0.08022073
## [3,]      -0.1748283    0.13163932    -0.2062302 -0.09107498 -0.08022073
## [4,]       1.0942390   -0.01369184    -0.2062302 -0.09107498 -0.08022073
## [5,]      -0.1748283    0.13163932    -0.2062302 -0.09107498 -0.08022073
## [6,]      -0.1748283   -0.01369184    -0.2062302 -0.09107498  3.57341300
##       KidsnBaby Entertainment     Others SchoolnEducation       Books
## [1,] -0.1299224    -0.1836163 -0.2187409      -0.04945439 -0.04294828
## [2,]  0.8797685    -0.1836163 -0.2187409      -0.04945439 -0.04294828
## [3,] -0.1299224    -0.1836163 -0.2187409      -0.04945439 -0.04294828
## [4,] -0.1299224    -0.1836163 -0.2187409      -0.04945439 -0.04294828
## [5,] -0.1299224    -0.1836163 -0.2187409      -0.04945439 -0.04294828
## [6,] -0.1299224    -0.1836163 -0.2187409      -0.04945439 -0.04294828
##      log_qty_ordered_mean log_total_mean log_price_mean log_value_mean
## [1,]           -0.4041259     -0.2570028     -0.1265375     -0.2597608
## [2,]           -0.3100046     -1.2715274     -1.2183792     -1.2718723
## [3,]           -0.5100124      1.5061651      1.7232208      1.4759969
## [4,]            2.7415270     -0.4432649     -1.4373444     -0.4647305
## [5,]           -0.1099969     -0.2863268     -0.3011143     -0.3117943
## [6,]           -0.1848584     -1.1551245     -1.0696267     -1.0980011
##       age_group       State     Gender payment_method
## [1,] -0.4982988 -0.08523892  0.9972075     -0.5343675
## [2,]  1.3186351 -0.69836259 -1.0027794     -0.5343675
## [3,] -1.4067658 -0.01711407  0.9972075     -0.5343675
## [4,] -0.4982988 -0.35773833  0.9972075     -0.5343675
## [5,]  1.3186351 -0.28961348  0.9972075     -0.5343675
## [6,]  1.3186351  1.48163266  0.9972075     -0.5343675

Both PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding) will be used for our customer segmentation analysis: PCA for dimension reduction and feature extraction, and t-SNE for visualizing the high-dimensional data in a low-dimensional space. By comparing the clustering results obtained with both techniques, it is likely we will gain a better understanding of the structure of the data and potentially uncover more insights.

Principal Component Analysis

As mentioned before, PCA is a statistical method that can be applied to analyze datasets in our customer segmentation analysis. The technique aims to simplify the dataset by reducing the number of variables while maintaining as much information as possible. It works by transforming the original variables into a new set of variables called principal components, which are combinations of the original variables that are arranged in descending order of the amount of variation they capture, with the first principal component retaining the most variation in the dataset. This approach will be used to identify patterns and relationships among the variables, allowing the retailer to segment its customers more effectively.

pcadata <- prcomp(datascaled, scale. = FALSE)
fviz_eig(pcadata, addlabels = TRUE, ylim = c(0, 15))

After applying PCA, the dimensions together with the Eigenvalues should be analyzed. The eigenvalues are a measure of the amount of variation that is captured by each principal component (PC). The larger the eigenvalue, the more variation is retained by the corresponding PC. Typically, the first few PCs will have larger eigenvalues than the later ones, indicating that they capture the majority of the variation in the data set, meaning that the first PCs provide the most important information about the structure of the data set, and are often the ones used for further analysis.

get_eigenvalue(pcadata)
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.4228598241     14.881999235                    14.88200
## Dim.2  2.3391274917     10.170119529                    25.05212
## Dim.3  1.4161016616      6.156963746                    31.20908
## Dim.4  1.1195347751      4.867542500                    36.07663
## Dim.5  1.0556177672      4.589642466                    40.66627
## Dim.6  1.0422228046      4.531403498                    45.19767
## Dim.7  1.0069628020      4.378099139                    49.57577
## Dim.8  1.0046895197      4.368215303                    53.94399
## Dim.9  1.0007243731      4.350975535                    58.29496
## Dim.10 0.9931723546      4.318140672                    62.61310
## Dim.11 0.9887547823      4.298933836                    66.91204
## Dim.12 0.9386593624      4.081127663                    70.99316
## Dim.13 0.9332072109      4.057422656                    75.05059
## Dim.14 0.8781230268      3.817926203                    78.86851
## Dim.15 0.8191746201      3.561628783                    82.43014
## Dim.16 0.7834451212      3.406283136                    85.83642
## Dim.17 0.7336449604      3.189760697                    89.02618
## Dim.18 0.7114141784      3.093105123                    92.11929
## Dim.19 0.6492877482      2.822990209                    94.94228
## Dim.20 0.6084230641      2.645317670                    97.58760
## Dim.21 0.5524234220      2.401840965                    99.98944
## Dim.22 0.0020833927      0.009058229                    99.99850
## Dim.23 0.0003457369      0.001503204                   100.00000

In this case it is visible from the scree plot and the eigenvalues above that the first 3 components explain the most variance, although together they account for only about 30% of it, meaning that the majority of the variance is not captured by these components. Only when the first 12 dimensions are taken into consideration is it possible to capture 70% of the variance. Ideally this amount of variance would be captured with far fewer dimensions, but here roughly half of the components of our data are needed to reach 70%.

Taking a look at the variable contributions to the first 12 dimensions, the red dashed line represents the expected average contribution, used as a cutoff. Variables above the line are considered important in explaining the variance in the data and should be included in further analysis, while those below the line may be less important and can potentially be excluded. In this case, it is visible that the variables that contribute the most to the 12 dimensions are:

  • Gender
  • Health and Sports category
  • Age group
  • State
  • Average value spent (value before discounts)
  • Average total paid (value after discounts were applied)
  • Average price paid
  • Books category
  • Computing category
  • Superstore category

var <- get_pca_var(pcadata)
fviz_contrib(pcadata, choice = "var", axes = 1:12, top = 15)

This means that the variables above will be the most significant for the customer segmentation analysis. So far, it can be tentatively concluded that the different behaviors towards those categories will give the best differential insights into how the customers behave and how to categorize them.

Similarly, the graph below also shows the contributions of variables to the principal components, where the color of each point corresponds to the squared cosine (cos2) of the variable on the principal components. The squared cosine measures the quality of the variable's representation on a component: it is a value between 0 and 1 that represents the proportion of the variable's variance explained by that component. A value of 1 indicates that the variable is perfectly represented by the component, and a value of 0 indicates that the variable has no relationship with it.

As per the results below, it seems that the only 'well-represented' variables are the ones related to sales (price, total spent), as they are the ones in darker shades of yellow/orange. The other variables show a lower quality of representation by cos2. It is also important to bear in mind that the longer the arrow, the stronger the influence on the PC, and that variables pointing in the same direction represent measures of similar underlying factors, which is the case for several variables in the plot.

fviz_pca_var(pcadata, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # avoid text overlapping
)

The correlation plot below gives an overview of the importance of each variable for each dimension: the first dimension is composed of the purchase-level variables, the second one mainly groups the quantities bought and the 'Others' category, and the 8th, for example, is correlated with State, i.e. geographical location.

corrplot(var$cos2, is.corr=FALSE)

Even if the first few principal components explain a relatively small percentage of the variance, they may still contain useful information for clustering or other analysis tasks. It is important to keep in mind that PCA and t-SNE are both techniques for dimensionality reduction and data visualization, but they have different strengths and weaknesses. After trying both and comparing the results, it will be easier to evaluate which one works better for this dataset and these analysis goals.

t-Distributed Stochastic Neighbor Embedding

In a nutshell, t-SNE is a powerful tool for reducing the dimensionality of high-dimensional data and visualizing it in a low-dimensional space. It is a non-linear technique that creates a probability distribution over pairs of high-dimensional data points and constructs a similar probability distribution over pairs of corresponding points in a two-dimensional space. By minimizing the difference between these two distributions using gradient descent, t-SNE creates a low-dimensional representation of the data that retains the local structure of the high-dimensional data.

# reducing perplexity and increasing iterations
# note: Rtsne's iteration argument is max_iter (n_iter is not an Rtsne argument)
tsnemodel <- Rtsne(datascaled, dims = 2, perplexity = 50,
                   max_iter = 2000, eta = 1000, check_duplicates = FALSE)
tsne_data <- tsnemodel$Y

The algorithm was applied with a perplexity of 50, 2000 iterations and eta = 1000. Perplexity is a hyperparameter in t-SNE that controls the balance between preserving local versus global structure in the data. A value of 50 is a common starting point, but the optimal value depends on the specific dataset and the desired visualization. In this particular case, a higher perplexity took much longer to run and displayed a noisier graph with much more condensed dots, which is why it was preferable to reduce the perplexity but increase the number of iterations from 1000 to 2000.

With the results, this is how the customer data looks in a two-dimensional plot:

plot(tsne_data, pch = 20, main = 'T-SNE Plot')

For practical and graphical purposes, the following steps on clustering will be done using the result from t-SNE.

Clustering

At this stage, clustering will be used to group similar data points in the t-SNE output. Clustering is often used in customer segmentation analysis to identify distinct groups of customers based on their purchasing behavior or other characteristics. Specifically, the k-means algorithm will be used, which splits the data into a pre-specified number of clusters based on the similarity between data points. After performing t-SNE, k-means is a good choice for clustering the data in the new representation and grouping similar points together in this new space.

In order to determine the optimal number of clusters, the elbow plot is presented below. The clustering error is first calculated for a range of cluster numbers (in this case from 1 to 15), and the results are then plotted as a line graph with a point for each number of clusters. The elbow plot helps identify the "elbow point" where the reduction in error begins to level off, indicating the optimal number of clusters.
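The computation of `errort` is not shown; a minimal sketch, using the total within-cluster sum of squares as the error, would be:

set.seed(123) # k-means is stochastic; the seed value is an assumption
errort <- sapply(1:15, function(k) {
  kmeans(tsne_data, centers = k, nstart = 10, iter.max = 5000)$tot.withinss
})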

# Elbow method
df_error <- data.frame(n_clusterst = 1:15, error = errort)
ggplot(df_error, aes(x = n_clusterst, y = error)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of Clusters", y = "Error", title = "Elbow Plot")

As per above, the elbow point for our dataset is 5 clusters, which results in the following customer groups:

model <- kmeans(tsne_data, 5, nstart = 10, iter.max = 5000)
segments <- model$cluster

ggplot(data = as.data.frame(tsne_data), aes(x = V1, y = V2, color = factor(segments))) +
  geom_point() +
  labs(title = "T-SNE Plot with clusters", color = 'Clusters') +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_brewer(palette = "Set2")

But, what do these groups mean?

Analyzing the clusters

In order to give some meaning to the clusters just found, several steps are taken:

  1. Merge the clusters found with the clean data used for the analysis, so that the clusters can be graphed against each of the variables of interest.
  2. As an additional step, group by mode and mean. This ensures proper output in the graphs shown below, as the data had a different format in the rows and columns (grouping, summarizing by the variables of interest and transposing the results).
  3. Graph the variables of interest :)

data_clusters <- cbind(customersegm, cluster = factor(segments))

# Cluster analysis
# Function to compute the mode
data_clusters <- as.data.frame(data_clusters)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Mean quantity ordered per category, by cluster
means_cat <- data_clusters %>%
  group_by(cluster) %>%
  summarise(across(-c(age_group, State, Gender, log_qty_ordered_mean,
                      log_total_mean, log_price_mean, log_value_mean, payment_method), mean))
means_t <- t(means_cat[-1])

# Means of the log-transformed purchase variables, by cluster
means_val <- data_clusters %>%
  group_by(cluster) %>%
  summarise(Average_Qty_orderedlog = mean(log_qty_ordered_mean),
            Average_price_itemlog = mean(log_price_mean),
            Average_total_grosslog = mean(log_value_mean),
            Average_total_netlog = mean(log_total_mean),
            Average_payment_method = mean(payment_method))

meansval_t <- t(means_val[-1])

# Modes of the categorical variables, by cluster
modes_other <- data_clusters %>%
  group_by(cluster) %>%
  summarise(age_group = Mode(age_group),
            Gender = Mode(Gender))
modes_t <- t(modes_other[-1])

The first graph shows the relationship between the clusters and the mode of age group and gender. It is observed that clusters 1 and 4 are characterized by young customers, belonging to the below-30 age group, while the remaining clusters are dominated by older customers. For gender, clusters 1, 3 and 5 are composed mostly of female customers, and clusters 2 and 4 of male customers.

matplot(modes_t, type = "l", ylab = "Value",
        col = c('red', 'black', 'orange', 'green', 'blue'),
        lty = c('dashed', 'dashed', 'dotted', 'dotted', 'dotted'),
        xaxt = "n", lwd = 2)
title('Clusters vs. Mode of Age group and Gender')
axis(1, at = 1:nrow(modes_t), labels = rownames(modes_t))
# legend colors and line types match the matplot call above
legend("topright", legend = 1:ncol(modes_t), col = c('red', 'black', 'orange', 'green', 'blue'),
       lty = c('dashed', 'dashed', 'dotted', 'dotted', 'dotted'), bty = "n", xjust = 1.2, yjust = 1.2,
       lwd = 2)

Next, the clusters are compared on their average value sold per category. Clusters 1 and 4 have a similar level across all categories; the only difference is that cluster 4 has higher average values for Mobiles and Tablets, though this category is also the highest for the customers in cluster 1. Cluster 2 has higher average sales in the categories Men's Fashion, Home and Living, Beauty and Grooming, Women's Fashion, Kids and Education; they are the wealthiest. Cluster 3 has higher spending on the Others, Appliances, Superstore and Health categories. Cluster 5 could be seen as 'middle class' according to their spending; their top categories are Home and Living and Men's Fashion, and they have the lowest spending on Mobiles, Appliances, Entertainment and Others.

In terms of discount 'hunters', the difference between total spend before and after discounts (gross and net) is barely visible, which means that the customers are not intentionally promotion hunters, so these two variables have no influence on our clusters. Clusters 1 and 4 show similar behaviour, with a high average price per item; the customers in cluster 4 are the biggest 'spenders', which is aligned with the information from the previous graph, as they buy the most Mobiles and Tablets. Clusters 2 and 5 also behave similarly: small purchases with small prices. Cluster 3 contains smart shoppers: high quantities but a lower average price per item, with vouchers as the preferred payment method.

matplot(meansval_t, type = "l", ylab = "Value",
        col = 1:nrow(meansval_t), lty = 1:nrow(meansval_t),
        xaxt = "n", lwd = 2)
title('Clusters vs. Average orders values')
axis(1, at = 1:nrow(meansval_t), labels = FALSE)
text(1:nrow(meansval_t), par("usr")[3], srt = 30, adj = 1,
     labels = rownames(meansval_t), xpd = TRUE)
legend("topleft", legend = 1:ncol(meansval_t), col = 1:nrow(meansval_t),
       lty = 1:nrow(meansval_t), bty = "n", xjust = 1.5, yjust = 1.5, lwd = 2)

Overall, from the analysis done, the customers can be segmented/classified as follows:

| Cluster | Dominant gender | Dominant age group | Categories (from largest to smallest) | Spending level | Customer segmentation group |
|---------|-----------------|--------------------|---------------------------------------|----------------|-----------------------------|
| 1 | Female | Below 30 | Mobiles & Tablets | Low spending | Mission-driven buyers - students |
| 2 | Male | Above 60 | Women's Fashion, Beauty, Men's Fashion, Kids, School, Superstore | Highest spending in both total ticket value and average price | 'One-stop' or 'loyal' customers; families, high income |
| 3 | Female | Above 60 | Others, Appliances, Superstore, Health | Medium spending | Trendy customers + need-based |
| 4 | Male | Below 30 | Mobiles & Tablets | Low spending | Mission-driven buyers - young professionals |
| 5 | Female | Above 60 | Home & Living, Women's Fashion, Men's Fashion | Medium spending | Need-based + impulse shoppers |

As per the table above, based on the characteristics and traits gathered from the different variables of the data, customers can be classified into different segments.

Cluster 1: Mission-driven customers. From the EDA at the very beginning, it was observed that Mobiles & Tablets was the retailer's principal category. In practice this means that when someone wants to buy a product from this category, he or she will think of this store. Mission-driven customers have a clear view of what they want to buy and stick to it, which is why customers in this segment only buy products from this category. As they are young and their total spending is quite low (they only purchase what they need, from a single category), the most suited profile would be teenagers or students.

Cluster 2: Loyal or one-stop customers. It is difficult to speak about loyalty per se, as data about re-purchase rates or frequency of intended purchase is not being analyzed. But looking at the spending budget and the categories where these customers shop, it is clear that these are family purchases, covering almost all categories involving 'family' products (clothes, kids, books, school, groceries). These could be called 'one-stop' customers, as they seem to do all their shopping in one place.

Cluster 3: Trendy customers. These customers like new things or have very specific needs and desires. I like to call them 'trendy' because they don't go for mainstream products or categories. Despite Mobiles & Tablets being the retailer's main strength, these customers are not afraid to buy in other, 'unconventional' categories. They could also be shopping in these categories on impulse or through ads. As the most frequent age group is over 60, this supports the idea of a 'trend' or simply trying new things: imagine your father or grandfather buying an appliance online (as this might be another destination category for the retailer) and then being curious about other little devices, gadgets, health supplements, etc.

Cluster 4: Mission-driven customers. Similar to cluster 1, but their spending in the category is slightly higher, which is why they could potentially be young professionals or college students. Still a low budget, but higher than cluster 1.

Cluster 5: Need-based + impulse customers. These customers are mostly over 60 and have a medium shopping budget. They might need some things for their home, but they end up being tempted by the deals in the fashion categories that, based on the initial EDA, the retailer is known to run (to generate traffic and attract customers).

Based on dimension reduction and clustering techniques it was possible to identify the main traits of the customers of the online retailer. These clusters will allow the retailer to tailor their offers and personalize the promotions for each cluster type. Deeper insights will be provided in the last section of Recommendations, where all conclusions will be analyzed as a whole.

Association rules - Basket Analysis

After knowing the profiles, or segmentation, of the customers, it is also important to know what kind of relationship exists between the categories purchased. Ideally this would be done by product, but since the database contains thousands of products, the analysis will be generalized to categories purchased instead of individual products, to save computational time and memory.

Association rules are a data mining technique used to discover relationships between items frequently purchased together in a dataset, such as the categories of products in the online retailer’s basket data. It helps to identify which items are frequently purchased together, which can be used to inform product recommendations, promotional strategies, and store layout optimization/navigation.

Data transformation

The data for association rules analysis should be transformed into a binary matrix, where each row represents a transaction and each column represents a category. The cells of the matrix indicate whether or not the item was purchased in the transaction, with a value of 1 representing a purchase and a value of 0 representing no purchase. This transformation is necessary because association rules algorithms require binary data as input.
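The transformation code is not shown in the report; a sketch consistent with the `summarise()` message below and the `FALSE` (assumed to come from an `anyNA()` check) would be:

library(dplyr)
library(tidyr)

# one row per order, one column per category, with the total quantity per category
categories <- cleansales %>%
  group_by(order_id, category) %>%
  summarise(qty = sum(qty_ordered)) %>%
  pivot_wider(names_from = category, values_from = qty, values_fill = 0) %>%
  ungroup() %>%
  select(-order_id)

anyNA(categories)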

## `summarise()` has grouped output by 'order_id'. You can override using the
## `.groups` argument.
## [1] FALSE

Once it is ensured that there are no NAs or duplicates, and that each row represents a transaction and each column a product category, it is necessary to one-hot encode the data. One-hot encoding is a technique used to transform categorical data into a binary format: it involves creating a new binary variable for each unique value of the categorical variable, with 1 indicating the presence of that value and 0 its absence. This creates a sparse matrix of binary variables that can be used as input for machine learning models. The next step is to replace any value different from 0 with 1 and leave the zeroes as they are. The main objective is to know whether certain categories are bought together or not; the amount of items bought is not important in this analysis.

# Categories: each row is a transaction, each column is a category
# One-hot encoding
categories[, sapply(categories, is.numeric)] <-
  ifelse(categories[, sapply(categories, is.numeric)] != 0, 1, 0)

head(categories)
## # A tibble: 6 × 15
##   MensFashion Computing HomenL…¹ Mobil…² Woman…³ Appli…⁴ Soghaat Healt…⁵ Beaut…⁶
##         <dbl>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1           1         0        0       0       0       0       0       0       0
## 2           1         0        0       0       0       0       0       0       0
## 3           0         1        0       0       0       0       0       0       0
## 4           0         0        1       0       0       0       0       0       0
## 5           0         0        0       1       0       0       0       0       0
## 6           1         0        0       0       0       0       0       0       0
## # … with 6 more variables: Superstore <dbl>, KidsnBaby <dbl>,
## #   Entertainment <dbl>, Others <dbl>, Books <dbl>, SchoolnEducation <dbl>, and
## #   abbreviated variable names ¹​HomenLiving, ²​MobilesnTablets, ³​WomansFashion,
## #   ⁴​Appliances, ⁵​HealthnSports, ⁶​BeautynGrooming

Before starting the analysis, it is important to take another look at the categories and their frequency. This step might give some new knowledge, a deeper understanding, or some reassurance about the performance of the business.

# Convert tibble to long format using tidyr
library(tidyr)
categories_long <- pivot_longer(categories, cols = everything())
# Create bar plot with frequencies
library(ggplot2)
frequencyplot <- ggplot(categories_long, aes(x = reorder(name, value, sum), y = value)) +
  geom_bar(stat = "summary", fun = "sum") +
  labs(x = "Category", y = "Frequency", title = 'Frequency per category' ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
print(frequencyplot)

The plot shows the most purchased categories in ascending order. The top 3 categories are Mobiles & Tablets, Men's Fashion and Appliances.

Equivalence Class Transformation

ECLAT (Equivalence Class Transformation) is an algorithm for mining frequent itemsets in a transaction database. It works depth-first on equivalence classes of itemsets that share a common prefix, intersecting their transaction-ID lists to count support. ECLAT is often preferred for market basket analysis in retail and e-commerce due to its efficiency in handling large datasets and high dimensionality.

categories_df <- as.data.frame(categories)
categories_df <- apply(categories_df, 2, as.logical) # 0/1 -> FALSE/TRUE matrix
categories_bin <- as.matrix(categories_df)
colnames(categories_bin) <- colnames(categories_df)

With a 1% support level for the frequent categories and 2% confidence for the association rules, only 2 association rules were found (Men's Fashion and Women's Fashion). This is why it was decided to try lower values, in order to analyze the top results and the strongest relationships: support = 0.005, confidence = 0.005.

library(arules)
# mine frequent itemsets with a 0.5% minimum support
frequent_itemsets <- eclat(categories_bin, parameter = list(support=0.005))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.005      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 539 
## 
## create itemset ... 
## set transactions ...[15 item(s), 107841 transaction(s)] done [0.01s].
## sorting and recoding items ... [13 item(s)] done [0.00s].
## creating sparse bit matrix ... [13 row(s), 107841 column(s)] done [0.00s].
## writing  ... [22 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
# Generate association rules
transactions <- as(categories_bin, "transactions")
association_rules <- ruleInduction(frequent_itemsets, transactions, confidence=0.005)
# Print the association rules
inspect(association_rules) # set of 18 rules found!!
##      lhs                  rhs               support     confidence lift     
## [1]  {Superstore}      => {BeautynGrooming} 0.007390510 0.14331955 1.9463195
## [2]  {BeautynGrooming} => {Superstore}      0.007390510 0.10036519 1.9463195
## [3]  {HomenLiving}     => {MensFashion}     0.007288508 0.13702929 0.7525655
## [4]  {MensFashion}     => {HomenLiving}     0.007288508 0.04002852 0.7525655
## [5]  {Appliances}      => {HomenLiving}     0.007965431 0.05426748 1.0202684
## [6]  {HomenLiving}     => {Appliances}      0.007965431 0.14975593 1.0202684
## [7]  {BeautynGrooming} => {HomenLiving}     0.006435400 0.08739453 1.6430812
## [8]  {HomenLiving}     => {BeautynGrooming} 0.006435400 0.12099024 1.6430812
## [9]  {BeautynGrooming} => {MensFashion}     0.007946885 0.10792092 0.5927022
## [10] {MensFashion}     => {BeautynGrooming} 0.007946885 0.04364433 0.5927022
## [11] {BeautynGrooming} => {Appliances}      0.005535928 0.07517945 0.5121882
## [12] {Appliances}      => {BeautynGrooming} 0.005535928 0.03771559 0.5121882
## [13] {BeautynGrooming} => {WomansFashion}   0.006787771 0.09217983 0.8117561
## [14] {WomansFashion}   => {BeautynGrooming} 0.006787771 0.05977462 0.8117561
## [15] {WomansFashion}   => {MensFashion}     0.011785870 0.10378899 0.5700096
## [16] {MensFashion}     => {WomansFashion}   0.011785870 0.06472805 0.5700096
## [17] {MobilesnTablets} => {MensFashion}     0.006055211 0.02439571 0.1339814
## [18] {MensFashion}     => {MobilesnTablets} 0.006055211 0.03325525 0.1339814
##      itemset
## [1]  1      
## [2]  1      
## [3]  2      
## [4]  2      
## [5]  3      
## [6]  3      
## [7]  4      
## [8]  4      
## [9]  5      
## [10] 5      
## [11] 6      
## [12] 6      
## [13] 7      
## [14] 7      
## [15] 8      
## [16] 8      
## [17] 9      
## [18] 9

18 rules were found. In order to interpret them, the support and confidence measures can be examined: support indicates the frequency of the rule in the dataset, while confidence measures the strength of the association between A and B. It is generally recommended to focus on rules that have both high support and high confidence. Additionally, lift can be used to assess the strength and significance of the rules: it indicates the degree to which A and B occur together more often than expected by chance.
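As a sanity check, these measures can be computed directly from the binary matrix. The sketch below does so for the rule {WomansFashion} => {MensFashion} and should reproduce the values reported by arules:

n    <- nrow(categories_bin)
both <- sum(categories_bin[, "WomansFashion"] & categories_bin[, "MensFashion"])
supp <- both / n                                       # support of the itemset {A, B}
conf <- both / sum(categories_bin[, "WomansFashion"])  # confidence of A => B
lift <- conf / (sum(categories_bin[, "MensFashion"]) / n)
c(support = supp, confidence = conf, lift = lift)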

Checking the top 5 rules with the highest support, Women's Fashion and Men's Fashion have the highest support, which indicates how frequently the categories in the antecedent and consequent of a rule co-occur in the dataset: the combination of "WomansFashion" and "MensFashion" appears in about 1.18% of all transactions, and similarly for Men's Fashion as an antecedent of Women's Fashion. Overall, around 2.5% of all transactions contain one of the top combinations: Women's and Men's Fashion (and vice versa), Appliances and Home & Living (and vice versa), and Beauty and Men's Fashion.

#Analysis

#top 5 rules with highest support levels
support_rules <- sort(association_rules, by = 'support', decreasing = TRUE)
top_support <- head(support_rules, n = 5)
inspect(top_support)
##     lhs                  rhs             support     confidence lift     
## [1] {WomansFashion}   => {MensFashion}   0.011785870 0.10378899 0.5700096
## [2] {MensFashion}     => {WomansFashion} 0.011785870 0.06472805 0.5700096
## [3] {Appliances}      => {HomenLiving}   0.007965431 0.05426748 1.0202684
## [4] {HomenLiving}     => {Appliances}    0.007965431 0.14975593 1.0202684
## [5] {BeautynGrooming} => {MensFashion}   0.007946885 0.10792092 0.5927022
##     itemset
## [1] 8      
## [2] 8      
## [3] 3      
## [4] 3      
## [5] 5

Looking now at the strength of the association, below are the top 5 rules with the highest confidence levels:

#top 5 rules with highest confidence levels
conf_rules <- sort(association_rules, by = 'confidence', decreasing = TRUE)
top_conf <- head(conf_rules, n = 5)
inspect(top_conf)
##     lhs                  rhs               support     confidence lift     
## [1] {HomenLiving}     => {Appliances}      0.007965431 0.1497559  1.0202684
## [2] {Superstore}      => {BeautynGrooming} 0.007390510 0.1433195  1.9463195
## [3] {HomenLiving}     => {MensFashion}     0.007288508 0.1370293  0.7525655
## [4] {HomenLiving}     => {BeautynGrooming} 0.006435400 0.1209902  1.6430812
## [5] {BeautynGrooming} => {MensFashion}     0.007946885 0.1079209  0.5927022
##     itemset
## [1] 3      
## [2] 1      
## [3] 2      
## [4] 4      
## [5] 5

Taking the Home & Living category, the confidences of its three strongest rules add up to 40.76%:

  • 14.97% of the transactions containing "HomenLiving" also contain "Appliances".
  • 13.70% of the transactions containing "HomenLiving" also contain "MensFashion".
  • 12.09% of the transactions containing "HomenLiving" also contain "BeautynGrooming".

Finally, looking at the top 5 lift levels, it is worth remembering that a lift greater than 1 indicates that the antecedent and consequent are positively correlated, while a lift less than 1 indicates a negative correlation.

As is visible from the table below, there is a positive correlation between the Beauty category and Superstore (and vice versa) and Home & Living (and vice versa), as well as between Appliances and Home & Living, for example.

#top 5 rules with highest lift levels
lift_rules <- sort(association_rules, by = 'lift', decreasing = TRUE)
top_lift <- head(lift_rules, n = 5)
inspect(top_lift)
##     lhs                  rhs               support     confidence lift    
## [1] {Superstore}      => {BeautynGrooming} 0.007390510 0.14331955 1.946320
## [2] {BeautynGrooming} => {Superstore}      0.007390510 0.10036519 1.946320
## [3] {BeautynGrooming} => {HomenLiving}     0.006435400 0.08739453 1.643081
## [4] {HomenLiving}     => {BeautynGrooming} 0.006435400 0.12099024 1.643081
## [5] {Appliances}      => {HomenLiving}     0.007965431 0.05426748 1.020268
##     itemset
## [1] 1      
## [2] 1      
## [3] 4      
## [4] 4      
## [5] 3
library(arulesViz)
plot(association_rules, measure=c("support","confidence"), shading="lift")

The plot above is where everything comes together. Ideally, to draw strong and reliable conclusions, the rules should have high support and confidence. In this particular case, there is one rule with high support and relatively high confidence, but its lift is below 1:

lhs                  rhs             support     confidence lift
{WomansFashion}   => {MensFashion}   0.011785870 0.10378899 0.5700096

The lift value of about 0.57 indicates that Women's Fashion and Men's Fashion are negatively correlated: the presence of one decreases the likelihood of the other being present. This makes sense, as a single woman would probably buy clothes for women and not for men. However, there are other types of customers that shop for the whole family and hence buy both categories, which might be why the lift value is not even lower.

Another interesting insight is to identify the categories with positive correlation (lift > 1) and higher confidence levels, since confidence represents the proportion of transactions containing the antecedent that also contain the consequent. The retailer could promote these categories together in order to increase their share of sales:

lhs              rhs               support     confidence lift
{Superstore}  => {BeautynGrooming} 0.0073905   0.1433195  1.94
{HomenLiving} => {BeautynGrooming} 0.006435400 0.1209902  1.64

If the intention is to increase sales in the Beauty & Grooming category, for example, a good strategy would be to also attract customers through categories such as Superstore and Home & Living, as it is likely that they will also buy something from the Beauty category.

Recommendations

As a practical approach, these are several strategies/recommendations the retailer could follow based on the findings in this paper:

  • From the customer segmentation:

    • From the PCA analysis, it was visible that variables such as age, state, gender and spending contain a lot of important information when it comes to similarities in the customers' purchase behavior. With this in mind, it is important that the marketing strategies are adjusted to these variables; different marketing campaigns can be targeted depending on the State where the customer is located.

    • Additional PCA analysis could be performed per State, to know which are the most important states and target customer clusters more effectively. The opposite analysis is also recommended: the least significant states could be treated as opportunities to develop strong new markets.

    • Cluster 1 customers are the most price-sensitive in the Mobiles and Tablets category, as they have a low budget and buy only what they need. Strong promotions in complementary categories are recommended (accessories that match the product bought, or other kinds of products compatible with it). Low-tier products.

    • Cluster 2 customers are the perfect candidates for a loyalty card/account, where they can get short-term benefits whenever they buy. Vouchers and cross-category promotions are the way to keep them interested. The retailer might benefit from combining offers from one or more categories that these customers buy with another one that they don't buy as much, to increase the basket size.

    • For cluster 3 customers, promote novelties and offer easy return policies and easy payment methods, as this tends to be an over-60 group.

    • Cluster 4 customers can be treated like cluster 1, but with mid-tier products, as their budget is slightly higher.

    • Cluster 5 customers are promo-driven: they purchase in the categories with a cheap price per item, which generate the biggest volumes in sales. They purchase in the fashion and home categories, and they could benefit from a loyalty system and cross-category offers to increase their loyalty and their basket size.

  • From the Association rules:

    • The Women's and Men's Fashion categories could benefit from bundles and a joint promotional agenda: their lift value suggests that they are negatively correlated, but they don't have to be :) It is important that customers see the benefit of buying clothes for special occasions (as a gift to someone else, for example) or as a whole bundle for the family.

    • To increase sales in the Beauty & Grooming category, it might be beneficial to also focus on categories such as Superstore and Home & Living, as customers that buy in these categories are more likely to include something from Beauty afterwards.

    • The top-performing categories are the ones appearing in most of the rules, so these should have a special banner on the home page to strengthen their positioning together with Mobiles and Tablets. In a physical store, a good idea would be to build the layout around the Mobiles and Tablets category, as this will drive people inside the store while they take a look at the rest of the available offer.

    • When customers are shopping in a top-performing category, side banners should advertise products from the least-performing categories according to the cluster the customer belongs to (for example, cluster 2 can be linked with books, health and other kids' products, as they are likely to have an established family). It is important to keep this narrowed to the age group and budget.

    • Giving samples to customers in clusters 1 and 4 is a good way to increase their penetration of the other categories: as they only shop in Mobiles and Tablets, have a low budget and are mission-driven, they won't stop to look around or check offers that are not within their needs.

As a final conclusion, I would like people to explore more around unsupervised learning techniques and see the value in the insights behind them. It is not only about finding the perfect method, algorithm or technique, but about finding the meaningful information behind it that could be very valuable to any company. Even if you do not have any previous programming or scientific background, all you gotta do is try, like me! Thanks for the read!