Unsupervised Machine Learning techniques for market segmentation

Introducción

Market segmentation is the process of dividing a larger market into smaller groups of consumers who have similar needs, characteristics, or behaviors. This allows companies to target specific groups of consumers with tailored marketing messages and product offerings.

Machine learning algorithms can be useful in the process of market segmentation by analyzing large amounts of data to identify patterns and clusters of consumers with similar demographic and geographic characteristics or behaviors. This can help companies more accurately and efficiently identify and target specific market segments and optimize their marketing, advertising, and sales efforts.

About the dataset

The dataset for an Indian store includes 10000 observations and 5 variables:

Variable	Description
reps	Representatives who are involved in the promotion and sale of the products in their respective region.
products	The 12 brands of products promoted by the company.
qty	The quantity of products sold in units per transaction.
revenue	The revenue generated for each transaction.
region	Indicates the region in India where the transaction took place, with four regions: East, North, South, and West India.

Seetharam Indurti (2019) [Dataset available at: Kaggle)

Objective

The aim of this project is to obtain the segmentation of products based on the revenue generated in different regions in order to understand the market trends. The segmentation will help the company to devise marketing strategies and promotional schemes to position the right product according to the preference of the consumers in any given segment.

1. Importing data (.csv)

datos <-read.csv("store.csv", header=T, sep = ",", dec=".")

##Load libraries
library(dplyr) 
library(ggplot2)
library(caret)
library(tibble)
library(tidyr)
library(factoextra)
library(forcats)
library(knitr)

2. Exploratory Data Analysis (EDA)

###2.a. Structure of data####
head(datos, n=10)

##      reps product qty revenue region
## 1  Chitra  Galaxy   2  155.10   West
## 2   Vijay     Jet   2   39.30  North
## 3    Mala  Beacon   3   74.25   West
## 4   Suman   Alpen   3  100.98  North
## 5  Rachna   Orbit   2   44.98  North
## 6    Aash Trident   1   29.25   East
## 7   Chand    Mars   3   68.39  North
## 8   Suraj   Orbit   2   45.44   West
## 9  Rachna   Milka   1   22.38  North
## 10   Bala  Almond   2   49.25   West

str(datos)

## 'data.frame':    10000 obs. of  5 variables:
##  $ reps   : chr  "Chitra" "Vijay" "Mala" "Suman" ...
##  $ product: chr  "Galaxy" "Jet" "Beacon" "Alpen" ...
##  $ qty    : int  2 2 3 3 2 1 3 2 1 2 ...
##  $ revenue: num  155.1 39.3 74.2 101 45 ...
##  $ region : chr  "West" "North" "West" "North" ...

dim(datos)

## [1] 10000     5

#summary statistics
summary(datos)

##      reps             product               qty            revenue       
##  Length:10000       Length:10000       Min.   : 1.000   Min.   :  18.43  
##  Class :character   Class :character   1st Qu.: 2.000   1st Qu.:  39.30  
##  Mode  :character   Mode  :character   Median : 2.000   Median :  58.42  
##                                        Mean   : 3.387   Mean   :  90.57  
##                                        3rd Qu.: 3.000   3rd Qu.:  75.00  
##                                        Max.   :25.000   Max.   :1998.75  
##     region         
##  Length:10000      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

#categorical variables levels
library(forcats)
fct_count(datos$product) #12 different products

## # A tibble: 12 × 2
##    f           n
##    <fct>   <int>
##  1 Almond   1015
##  2 Alpen    1588
##  3 Beacon    651
##  4 Galaxy    342
##  5 Halls     307
##  6 Jet      1274
##  7 Mars      638
##  8 Milka    1294
##  9 Orbit    1255
## 10 Prince    325
## 11 Star      652
## 12 Trident   659

fct_count(datos$reps) #72 different reps

## # A tibble: 72 × 2
##    f           n
##    <fct>   <int>
##  1 Aash      315
##  2 Akila     109
##  3 Alka      116
##  4 Anahit    104
##  5 Ananya    216
##  6 Anusha     82
##  7 Aparna    104
##  8 Bala       97
##  9 Bharath   171
## 10 Bhat       87
## # … with 62 more rows

fct_count(datos$region) #4 regions

## # A tibble: 4 × 2
##   f         n
##   <fct> <int>
## 1 East   1703
## 2 North  3603
## 3 South  1665
## 4 West   3029

The dataset contains 10000 transactions of revenues from 12 different products sold by 72 representatives in 4 regions of India (Western, Norther, Souther and Easter India).

# check missing values
any(!complete.cases(datos))

## [1] FALSE

The dataset does not contain missing values.

Quantitative variables

Data visualization

#Histograms
ggplot(datos, aes(x = qty)) + geom_bar()

ggplot(datos, aes(x = revenue)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Boxplot
ggplot(datos, aes(y = qty)) + geom_boxplot()

ggplot(datos, aes(y = revenue)) + geom_boxplot()

The variables qty and revenue show outliers in their distribution. However, at first we’ll consider all values in the model since it may contain valuable information of importat revenues.

###Check correlation between variables
library(ggcorrplot)
options(repr.plot.width = 6, repr.plot.height = 5)

#Correlogram plot
corr <- round(cor(select_if(datos, is.numeric)), 2)
ggcorrplot(corr, hc.order = T, ggtheme = ggplot2::theme_gray,
           colors = c("red", "white", "blue"), lab = T)

As it was expected, the variables qty and revenue are highly correlated.

What products are fast moving in terms of the quantity and revenue generated?

#Scatter plot
options(repr.plot.width = 11, repr.plot.height = 5)

#Qty and revenue by products
scplot1 <- ggplot(datos, aes(revenue, qty, col = product)) + 
  geom_point() + theme(legend.position = 'bottom') + 
  labs(x='revenue', y ='qty')
print(scplot1)

#Qty and revenue by region
scplot2 <- ggplot(datos, aes(revenue, qty, col = region)) + 
  geom_point() + theme(legend.position = 'bottom') + 
  labs(x='revenue', y ='qty')
print(scplot2)

The scatter plot shows differences in the effectiveness of different products in generating revenue. For example, sales of Galaxy are the ones that generate the highest revenue for the lowest number of items sold. Hence, Galaxy followed by Alpen and Almond are the most fast moving brands in terms of the quantity and revenue generated.

Categorical variables

Data visualization

#Transactions per region
ggplot(data = datos, aes(x = region, y = after_stat(count), fill = region)) +
  geom_bar() +
  scale_fill_brewer(palette ="Greens") +
  labs(title = "Number of transactions per regions") +
  theme_bw() +
  theme(legend.position = "bottom")

Are there regional differences in the quantity of items sold and the revenue achieved?

datos %>% 
  group_by(region) %>% 
  summarise(Tot_qty= sum(qty),
            Tot_revenue=sum(revenue),
            Tot_transactions= n()) %>% 
  kable(caption ="Total transactions per region")

Total transactions per region
region	Tot_qty	Tot_revenue	Tot_transactions
East	5614	146141.5	1703
North	12174	326109.5	3603
South	5908	158819.9	1665
West	10178	274631.3	3029

The north (3603) and west (3029) regions double the number of transactions to the east and south regions. Hence the total quantity of sold products and total revenue in this regions are higher too.

#Products
ggplot(data = datos, aes(x = product, y = after_stat(count), fill = product)) +
  geom_bar() +
  labs(title = "Number of transactions per products") +
  theme_bw() +
  theme(legend.position = "bottom")

datos %>% 
  group_by(product) %>% 
  summarise(tot_qty= sum(qty),
            tot_revenue=sum(revenue),
            tot_transactions= n()) %>% 
  arrange(desc(tot_revenue)) %>% 
  kable(caption = "Total transactions and revenue per products")

Total transactions and revenue per products
product	tot_qty	tot_revenue	tot_transactions
Alpen	5267	177152.07	1588
Orbit	4375	99284.25	1255
Milka	4153	94249.40	1294
Galaxy	1150	90919.18	342
Jet	4262	84077.56	1274
Almond	3258	80581.91	1015
Trident	2436	72289.20	659
Mars	2498	58109.60	638
Beacon	2152	53164.41	651
Star	2101	43637.99	652
Prince	1253	34025.46	325
Halls	969	18211.18	307

Overall, of the 12 products sold in the 4 regions, Alpen is the best-selling in India with a total revenue of 177152.07 rupies, followed by Orbit, Milka, and Galaxy. Halls generated the lowest revenue.

#Products per region
ggplot(data = datos, aes(x = product, y = after_stat(count), fill = region)) +
  geom_bar() +
  labs(title = "Number of transactions per products by region") +
  theme_bw() +
  theme(legend.position = "bottom")

Suppose that the store manager want to identified the more efficient representative in each region in terms of revenue and transactions.

#One alternative is using pivot tables
library(pivottabler)

pt <- PivotTable$new()
pt$addData(datos)
pt$addRowDataGroups("region")
pt$addRowDataGroups("reps")
pt$defineCalculation(calculationName="Tot_transactions", summariseExpression="n()")
pt$defineCalculation(calculationName="Tot_revenues", summariseExpression="sum(revenue)")
pt$evaluatePivot()

# apply the green style for reps with more than 300 transactions
cells <- pt$findCells(minValue=300, maxValue=1000, includeNull=FALSE, includeNA=FALSE)
pt$setStyling(cells=cells, declarations=list("background-color"="#C6EFCE", "color"="#006100"))

# apply the yellow style for revenues higher than 20000
cells <- pt$findCells(minValue=20000, maxValue=50000, includeNull=FALSE, includeNA=FALSE)
pt$setStyling(cells=cells, declarations=list("background-color"="#FFEB9C", "color"="#9C5700"))

pt$renderPivot()

The total revenue generated by all the representatives for the company is 9,05,702.21 rupies, covering 12 brands and involving 10,000 transactions in all the 4 regions. However, there is a difference between regions in the efficiency to generate revenue between the reps. In the table above, in green, the most efficient representatives of the 4 regions that reach more than 300 transactions obtaining total revenues higher than 20,000 rupies (yellow) are indicated.

East region: Aash and Vish
South region: Seet
North region: Rachna
West region: Santosh

Suppose that now the store manager want to see the revenue and quantity broken down by the products and region.

#Sales trends of products per reps
pt2 <- PivotTable$new()
pt2$addData(datos)
pt2$addRowDataGroups("region")
pt2$addColumnDataGroups("product")
pt2$defineCalculation(calculationName="tot_revenue", summariseExpression="sum(revenue)")
pt2$defineCalculation(calculationName="tot_qty", summariseExpression="sum(qty)")
pt2$evaluatePivot()
pt2$renderPivot()

3. Data preprocessing

#Normalizing data
datos_preProces <-preProcess(datos, method = c("scale", "center"))
datos_uns <- predict(datos_preProces,datos)

#One Hot-encoding
datos_uns$reps <- as.factor(datos_uns$reps)
datos_uns$region <- as.factor(datos_uns$region)
datos_uns$product <- as.factor(datos_uns$product)
library(mltools)

## 
## Attaching package: 'mltools'

## The following object is masked from 'package:tidyr':
## 
##     replace_na

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

datos_uns <- one_hot(as.data.table(datos_uns))

#Chequeamos preprocesamiento
str(datos_uns)

## Classes 'data.table' and 'data.frame':   10000 obs. of  90 variables:
##  $ reps_Aash      : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ reps_Akila     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Alka      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Anahit    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Ananya    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Anusha    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Aparna    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Bala      : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ reps_Bharath   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Bhat      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Chand     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ reps_Chandra   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Chitra    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Durga     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Easwar    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Hussain   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Jagdish   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Jaggi     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Javed     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Jay       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_John      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Kamat     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Keshab    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Kishen    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Kishore   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Kumar     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Madhu     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Mak       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Mala      : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ reps_Manju     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Meena     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Mehta     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Mukesh    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Mukund    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Nandini   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Nidhi     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Palak     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Pooja     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Prarth    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Prasad    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Priya     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Rachna    : int  0 0 0 0 1 0 0 0 1 0 ...
##  $ reps_Rahul     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Rajat     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Raji      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Ram       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Ranga     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Ratna     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Ravi      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Reva      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Rishi     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Rohini    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Sai       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Santosh   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Satya     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Satyen    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Seet      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Sesh      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Shaanth   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Sruti     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Suman     : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ reps_Sumedh    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Suraj     : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ reps_Suresh    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Susan     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Swami     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Vaghya    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Veeyes    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Venkat    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Vidya     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reps_Vijay     : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ reps_Vish      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ product_Almond : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ product_Alpen  : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ product_Beacon : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ product_Galaxy : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ product_Halls  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ product_Jet    : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ product_Mars   : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ product_Milka  : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ product_Orbit  : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ product_Prince : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ product_Star   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ product_Trident: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ qty            : num  -0.3213 -0.3213 -0.0897 -0.0897 -0.3213 ...
##  $ revenue        : num  0.5049 -0.4012 -0.1277 0.0815 -0.3567 ...
##  $ region_East    : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ region_North   : int  0 1 0 1 1 0 1 0 1 0 ...
##  $ region_South   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ region_West    : int  1 0 1 0 0 0 0 1 0 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

4. Asessing Clustering Tendency

Before applying any clustering method on your data, it’s important to evaluate whether the data sets contains meaningful clusters. Because, even if the data is uniformly distributed, the k-means algorithm and hierarchical clustering impose a classification. Since the main problem faced by unsupervised learning methods is the difficulty in validating the results (because there is no response variable available to compare them), this type of analysis can be used to evaluate the validity of clustering analysis.

library(clustertend)

## Package `clustertend` is deprecated.  Use package `hopkins` instead.

# Plot data set
fviz_pca_ind(prcomp(datos_uns), title = "PCA",
             habillage = datos$region, palette = "jco",
             geom = "point", ggtheme = theme_classic(),
             legend = "bottom")

As we can see, the data is not randomly or uniformly distributed.

5. Principal Component Analysis (PCA)

Principal Component Analysis belongs to the family of techniques known as unsupervised learning. This method allows to “condense” the information provided by multiple variables into just a few components, to find a linear combination of the original features that capture maximum variance in the dataset.

library(FactoMineR)

#Running a PCA.
store_pca <- PCA(datos_uns, graph = FALSE)
print(store_pca)

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 10000 individuals, described by 90 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

#Exploring PCA()
# Getting the summary of the pca
summary(store_pca)

## 
## Call:
## PCA(X = datos_uns, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               2.520   2.318   2.229   1.962   1.217   1.196   1.183
## % of var.              2.801   2.576   2.477   2.180   1.353   1.329   1.315
## Cumulative % of var.   2.801   5.376   7.853  10.033  11.386  12.715  14.030
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               1.170   1.136   1.134   1.132   1.125   1.105   1.097
## % of var.              1.300   1.262   1.260   1.258   1.250   1.227   1.219
## Cumulative % of var.  15.330  16.593  17.852  19.110  20.360  21.587  22.806
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               1.091   1.031   1.031   1.030   1.030   1.029   1.028
## % of var.              1.213   1.146   1.145   1.144   1.144   1.143   1.142
## Cumulative % of var.  24.019  25.165  26.310  27.454  28.598  29.742  30.884
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               1.025   1.024   1.022   1.020   1.019   1.019   1.018
## % of var.              1.139   1.138   1.136   1.134   1.133   1.132   1.131
## Cumulative % of var.  32.023  33.161  34.296  35.430  36.563  37.695  38.826
##                       Dim.29  Dim.30  Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               1.017   1.017   1.016   1.014   1.014   1.012   1.011
## % of var.              1.130   1.130   1.128   1.127   1.127   1.125   1.124
## Cumulative % of var.  39.956  41.086  42.214  43.342  44.468  45.593  46.717
##                       Dim.36  Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               1.011   1.011   1.011   1.011   1.011   1.011   1.011
## % of var.              1.124   1.124   1.124   1.124   1.123   1.123   1.123
## Cumulative % of var.  47.840  48.964  50.088  51.211  52.335  53.458  54.581
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48  Dim.49
## Variance               1.011   1.011   1.011   1.011   1.010   1.010   1.010
## % of var.              1.123   1.123   1.123   1.123   1.123   1.123   1.123
## Cumulative % of var.  55.704  56.827  57.950  59.073  60.196  61.319  62.441
##                       Dim.50  Dim.51  Dim.52  Dim.53  Dim.54  Dim.55  Dim.56
## Variance               1.010   1.010   1.010   1.010   1.010   1.010   1.010
## % of var.              1.122   1.122   1.122   1.122   1.122   1.122   1.122
## Cumulative % of var.  63.564  64.686  65.808  66.931  68.053  69.175  70.297
##                       Dim.57  Dim.58  Dim.59  Dim.60  Dim.61  Dim.62  Dim.63
## Variance               1.010   1.010   1.010   1.010   1.010   1.009   1.009
## % of var.              1.122   1.122   1.122   1.122   1.122   1.122   1.122
## Cumulative % of var.  71.419  72.541  73.663  74.785  75.907  77.028  78.150
##                       Dim.64  Dim.65  Dim.66  Dim.67  Dim.68  Dim.69  Dim.70
## Variance               1.009   1.009   1.009   1.009   1.009   1.009   1.009
## % of var.              1.121   1.121   1.121   1.121   1.121   1.121   1.121
## Cumulative % of var.  79.271  80.393  81.514  82.635  83.756  84.877  85.998
##                       Dim.71  Dim.72  Dim.73  Dim.74  Dim.75  Dim.76  Dim.77
## Variance               1.008   0.999   0.987   0.980   0.978   0.973   0.969
## % of var.              1.120   1.110   1.097   1.089   1.087   1.081   1.077
## Cumulative % of var.  87.118  88.228  89.324  90.414  91.500  92.581  93.659
##                       Dim.78  Dim.79  Dim.80  Dim.81  Dim.82  Dim.83  Dim.84
## Variance               0.960   0.952   0.947   0.938   0.934   0.917   0.060
## % of var.              1.066   1.058   1.052   1.043   1.037   1.018   0.066
## Cumulative % of var.  94.725  95.783  96.835  97.878  98.915  99.934 100.000
##                       Dim.85  Dim.86  Dim.87  Dim.88  Dim.89  Dim.90
## Variance               0.000   0.000   0.000   0.000   0.000   0.000
## % of var.              0.000   0.000   0.000   0.000   0.000   0.000
## Cumulative % of var. 100.000 100.000 100.000 100.000 100.000 100.000
## 
## Individuals (the 10 first)
##                     Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1               | 11.710 |  1.821  0.013  0.024 | -1.486  0.010  0.016 |  0.114
## 2               | 10.040 | -1.996  0.016  0.040 | -0.700  0.002  0.005 | -0.124
## 3               |  8.381 |  1.960  0.015  0.055 | -1.501  0.010  0.032 | -0.092
## 4               |  8.165 | -1.984  0.016  0.059 | -0.724  0.002  0.008 |  0.022
## 5               |  6.498 | -1.974  0.015  0.092 | -0.809  0.003  0.016 | -0.060
## 6               |  7.318 |  0.438  0.001  0.004 |  2.133  0.020  0.085 | -2.557
## 7               | 10.881 | -1.973  0.015  0.033 | -0.784  0.003  0.005 | -0.128
## 8               | 11.362 |  1.871  0.014  0.027 | -1.507  0.010  0.018 | -0.052
## 9               |  6.507 | -1.927  0.015  0.088 | -0.655  0.002  0.010 | -0.101
## 10              | 10.791 |  1.825  0.013  0.029 | -1.431  0.009  0.018 | -0.132
##                    ctr   cos2  
## 1                0.000  0.000 |
## 2                0.000  0.000 |
## 3                0.000  0.000 |
## 4                0.000  0.000 |
## 5                0.000  0.000 |
## 6                0.029  0.122 |
## 7                0.000  0.000 |
## 8                0.000  0.000 |
## 9                0.000  0.000 |
## 10               0.000  0.000 |
## 
## Variables (the 10 first)
##                    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## reps_Aash       |  0.048  0.091  0.002 |  0.258  2.880  0.067 | -0.299  4.001
## reps_Akila      |  0.123  0.605  0.015 | -0.100  0.429  0.010 | -0.007  0.002
## reps_Alka       |  0.028  0.031  0.001 |  0.143  0.887  0.021 |  0.191  1.631
## reps_Anahit     |  0.121  0.580  0.015 | -0.098  0.413  0.010 | -0.005  0.001
## reps_Ananya     | -0.185  1.352  0.034 | -0.073  0.229  0.005 | -0.003  0.000
## reps_Anusha     |  0.024  0.022  0.001 |  0.128  0.711  0.016 | -0.153  1.050
## reps_Aparna     | -0.126  0.635  0.016 | -0.049  0.102  0.002 | -0.002  0.000
## reps_Bala       |  0.116  0.533  0.013 | -0.094  0.383  0.009 |  0.001  0.000
## reps_Bharath    |  0.035  0.049  0.001 |  0.175  1.322  0.031 |  0.231  2.390
## reps_Bhat       |  0.025  0.025  0.001 |  0.124  0.661  0.015 |  0.166  1.241
##                   cos2  
## reps_Aash        0.089 |
## reps_Akila       0.000 |
## reps_Alka        0.036 |
## reps_Anahit      0.000 |
## reps_Ananya      0.000 |
## reps_Anusha      0.023 |
## reps_Aparna      0.000 |
## reps_Bala        0.000 |
## reps_Bharath     0.053 |
## reps_Bhat        0.028 |

#Tracing variable contributions
store_pca$var$contrib

##                        Dim.1        Dim.2        Dim.3        Dim.4
## reps_Aash       9.126914e-02 2.880312e+00 4.000514e+00 2.585428e-01
## reps_Akila      6.049895e-01 4.293797e-01 2.075660e-03 6.110025e-02
## reps_Alka       3.085068e-02 8.870338e-01 1.630592e+00 1.296619e-02
## reps_Anahit     5.804718e-01 4.130298e-01 1.237727e-03 4.865336e-02
## reps_Ananya     1.351809e+00 2.287304e-01 4.400193e-04 4.345170e-03
## reps_Anusha     2.237026e-02 7.114428e-01 1.049974e+00 1.141791e-04
## reps_Aparna     6.347325e-01 1.022493e-01 1.306734e-04 1.459846e-03
## reps_Bala       5.326743e-01 3.832123e-01 3.652381e-05 3.319142e-02
## reps_Bharath    4.880789e-02 1.322191e+00 2.390174e+00 8.107790e-02
## reps_Bhat       2.467424e-02 6.608977e-01 1.240882e+00 2.180997e-04
## reps_Chand      6.155843e-01 1.001215e-01 3.600039e-04 6.653208e-03
## reps_Chandra    5.547627e-01 3.997474e-01 4.733758e-04 4.298805e-06
## reps_Chitra     5.269208e-01 3.752940e-01 2.499613e-04 1.265482e-03
## reps_Durga      2.817092e-02 8.365107e-01 1.188191e+00 3.649964e-02
## reps_Easwar     1.012080e+00 7.205027e-01 1.461209e-03 3.608812e-02
## reps_Hussain    6.311231e-02 1.828517e+00 2.590902e+00 2.710314e-02
## reps_Jagdish    1.694150e+00 1.211890e+00 2.358543e-03 4.044017e-02
## reps_Jaggi      4.891125e-01 3.517084e-01 8.137054e-04 1.807803e-02
## reps_Javed      2.601138e-02 7.759286e-01 1.103459e+00 3.205657e-02
## reps_Jay        2.448346e-02 6.468499e-01 1.177036e+00 4.939788e-03
## reps_John       5.968412e-01 4.241554e-01 2.231347e-04 5.902222e-02
## reps_Kamat      5.583464e-01 9.164844e-02 1.039784e-03 6.904572e-02
## reps_Keshab     3.141269e-02 8.721165e-01 1.200158e+00 4.034408e-02
## reps_Kishen     2.944384e-02 8.755794e-01 1.230285e+00 4.096359e-02
## reps_Kishore    5.377989e-01 8.909523e-02 1.123309e-03 1.261610e-01
## reps_Kumar      2.629355e-02 7.978231e-01 1.073492e+00 1.073231e-01
## reps_Madhu      3.102445e-02 9.449676e-01 1.369027e+00 2.529577e-03
## reps_Mak        5.613522e-01 3.931571e-01 2.598404e-04 8.010751e-02
## reps_Mala       1.094638e+00 7.884700e-01 1.370512e-04 6.734804e-03
## reps_Manju      5.163068e-01 3.731615e-01 2.820414e-04 4.023765e-03
## reps_Meena      5.538185e-01 3.918799e-01 1.475420e-04 4.885188e-04
## reps_Mehta      6.270911e-01 4.470969e-01 3.853633e-04 9.544612e-02
## reps_Mukesh     5.433165e-01 3.930628e-01 8.957424e-04 1.500349e-02
## reps_Mukund     7.053313e-01 1.165761e-01 1.802971e-04 3.226557e-05
## reps_Nandini    1.836139e+00 3.025454e-01 2.732726e-05 9.420612e-03
## reps_Nidhi      3.042405e-02 8.119538e-01 1.437608e+00 8.063841e-02
## reps_Palak      6.384705e-01 4.496514e-01 6.797111e-05 2.327589e-03
## reps_Pooja      6.052876e-01 1.012586e-01 3.730242e-04 1.339135e-02
## reps_Prarth     5.821456e-01 9.917017e-02 1.304760e-03 5.386421e-02
## reps_Prasad     5.232909e-01 8.620927e-02 8.746298e-05 2.117441e-02
## reps_Priya      6.103982e-01 4.307019e-01 4.608751e-06 7.067387e-03
## reps_Rachna     2.033926e+00 3.354797e-01 1.798543e-04 3.699611e-04
## reps_Rahul      2.912412e-02 8.112981e-01 1.409991e+00 1.596909e-01
## reps_Rajat      6.045053e-01 1.042727e-01 7.346125e-05 9.023017e-04
## reps_Raji       3.036097e-02 8.766230e-01 1.510060e+00 1.489057e-01
## reps_Ram        2.599799e-02 7.139268e-01 1.062281e+00 1.392439e-03
## reps_Ranga      6.905234e-01 1.170781e-01 7.636377e-06 1.136959e-02
## reps_Ratna      5.834138e-01 9.526292e-02 8.322010e-05 3.256593e-03
## reps_Ravi       1.764924e+00 2.871214e-01 8.802324e-06 3.321440e-02
## reps_Reva       4.936773e-01 3.538508e-01 3.616372e-04 3.295340e-03
## reps_Rishi      5.956418e-01 9.730827e-02 2.288875e-04 5.758519e-03
## reps_Rohini     4.828947e-01 8.114726e-02 1.177757e-03 4.530916e-02
## reps_Sai        1.863295e+00 3.070708e-01 1.082394e-03 1.514977e-02
## reps_Santosh    1.860980e+00 1.321478e+00 4.302138e-04 2.068068e-03
## reps_Satya      3.385817e-02 9.906489e-01 1.388218e+00 7.440249e-02
## reps_Satyen     3.248771e-02 8.503359e-01 1.568118e+00 1.015551e-02
## reps_Seet       1.053986e-01 2.896000e+00 5.310312e+00 4.092982e-02
## reps_Sesh       4.948702e-01 8.083477e-02 3.545893e-04 4.523198e-03
## reps_Shaanth    1.797179e+00 2.988703e-01 1.751508e-05 2.296284e-02
## reps_Sruti      2.792923e-02 7.921121e-01 1.420471e+00 1.092452e-02
## reps_Suman      1.077273e+00 1.760712e-01 5.635727e-08 7.924891e-03
## reps_Sumedh     6.918845e-01 4.861714e-01 3.450117e-09 7.507632e-03
## reps_Suraj      4.684067e-01 3.308927e-01 1.973896e-05 2.540159e-02
## reps_Suresh     1.168881e+00 8.379279e-01 7.206562e-04 2.108465e-03
## reps_Susan      6.236323e-01 1.037777e-01 2.118178e-03 1.854908e-01
## reps_Swami      3.097095e-02 8.511155e-01 1.509163e+00 6.787611e-02
## reps_Vaghya     5.848086e-02 1.627130e+00 2.902172e+00 5.295584e-02
## reps_Veeyes     5.419412e-01 9.485072e-02 2.048657e-03 8.269226e-02
## reps_Venkat     6.571476e-01 1.078194e-01 3.761756e-04 9.584989e-03
## reps_Vidya      5.418096e-01 3.890714e-01 1.557574e-04 1.887188e-03
## reps_Vijay      6.689547e-01 1.118592e-01 1.557898e-03 5.209196e-02
## reps_Vish       9.190382e-02 2.807941e+00 4.053434e+00 3.179441e-02
## product_Almond  1.252444e-02 2.820086e-03 2.897915e-02 8.839478e-02
## product_Alpen   3.557211e-03 1.684165e-03 1.492996e-02 3.379919e-01
## product_Beacon  4.476679e-02 1.072185e-02 1.216649e-02 2.560888e-02
## product_Galaxy  4.741947e-03 5.585370e-03 5.152675e-02 3.789638e+00
## product_Halls   8.651118e-03 1.414780e-02 5.573989e-03 1.800614e-01
## product_Jet     2.412117e-02 1.279931e-02 3.711257e-05 4.505851e-01
## product_Mars    1.225051e-03 1.600073e-02 2.971042e-02 6.990947e-02
## product_Milka   3.267631e-02 9.360212e-02 8.651372e-03 3.503110e-01
## product_Orbit   7.532472e-04 4.484999e-02 8.832152e-03 5.422228e-02
## product_Prince  1.141692e-02 1.674871e-03 1.007486e-04 9.753618e-02
## product_Star    4.073355e-04 7.777759e-03 5.059132e-02 1.635066e-01
## product_Trident 3.152009e-04 1.035003e-02 3.843395e-03 2.171605e-01
## qty             1.021555e-03 2.727557e-03 9.573269e-01 4.278458e+01
## revenue         1.137793e-03 1.231221e-07 1.097322e+00 4.757954e+01
## region_East     5.920567e-01 1.777238e+01 2.518601e+01 5.566755e-01
## region_North    3.439830e+01 5.699335e+00 4.739475e-03 1.173116e-03
## region_South    5.583659e-01 1.535545e+01 2.768812e+01 5.890334e-01
## region_West     2.393548e+01 1.706692e+01 7.851776e-03 2.270370e-03
##                        Dim.5
## reps_Aash       5.563493e-01
## reps_Akila      1.040647e-02
## reps_Alka       5.422753e-03
## reps_Anahit     6.495928e-01
## reps_Ananya     2.291160e-01
## reps_Anusha     5.818555e-01
## reps_Aparna     5.319524e-01
## reps_Bala       1.321652e-01
## reps_Bharath    4.910534e-02
## reps_Bhat       1.666652e-01
## reps_Chand      1.157881e-02
## reps_Chandra    1.025434e-01
## reps_Chitra     4.926586e-01
## reps_Durga      7.008470e-02
## reps_Easwar     9.125761e-02
## reps_Hussain    4.431965e-03
## reps_Jagdish    1.500219e-01
## reps_Jaggi      8.108294e-02
## reps_Javed      1.435683e-01
## reps_Jay        5.630798e-02
## reps_John       4.534379e-01
## reps_Kamat      3.675797e-02
## reps_Keshab     2.272323e+00
## reps_Kishen     3.965849e-01
## reps_Kishore    1.055071e-04
## reps_Kumar      3.563015e-01
## reps_Madhu      5.158448e-03
## reps_Mak        4.110437e-04
## reps_Mala       6.246380e-01
## reps_Manju      1.505673e-01
## reps_Meena      7.274106e-04
## reps_Mehta      2.179807e-01
## reps_Mukesh     2.539955e-01
## reps_Mukund     2.018587e-01
## reps_Nandini    3.750417e-01
## reps_Nidhi      1.496604e-03
## reps_Palak      1.153770e-01
## reps_Pooja      4.527582e-01
## reps_Prarth     3.818279e-02
## reps_Prasad     3.417742e-01
## reps_Priya      7.193725e-03
## reps_Rachna     2.475571e-01
## reps_Rahul      1.084123e-03
## reps_Rajat      2.579573e-02
## reps_Raji       5.931738e-03
## reps_Ram        7.826801e-01
## reps_Ranga      2.647436e-02
## reps_Ratna      1.769534e-03
## reps_Ravi       2.977884e-01
## reps_Reva       1.714722e-01
## reps_Rishi      3.048499e-03
## reps_Rohini     1.135946e-01
## reps_Sai        2.041951e-01
## reps_Santosh    1.437317e-02
## reps_Satya      5.537735e-01
## reps_Satyen     5.233452e-03
## reps_Seet       4.433331e-01
## reps_Sesh       2.982912e-01
## reps_Shaanth    5.369421e-01
## reps_Sruti      2.617994e-01
## reps_Suman      1.110145e+00
## reps_Sumedh     3.003468e-01
## reps_Suraj      1.874928e-02
## reps_Suresh     6.881097e-01
## reps_Susan      1.373540e-03
## reps_Swami      1.115170e-01
## reps_Vaghya     2.788027e-01
## reps_Veeyes     2.414574e-02
## reps_Venkat     4.627818e-04
## reps_Vidya      6.173212e-01
## reps_Vijay      2.049095e-02
## reps_Vish       3.048169e-01
## product_Almond  1.073657e-02
## product_Alpen   5.580359e+01
## product_Beacon  2.878500e-01
## product_Galaxy  5.104822e-02
## product_Halls   4.020579e-03
## product_Jet     1.158794e+00
## product_Mars    4.155618e-02
## product_Milka   6.013317e+00
## product_Orbit   1.782020e+01
## product_Prince  1.638989e-02
## product_Star    4.790145e-06
## product_Trident 1.446876e-02
## qty             8.691487e-01
## revenue         7.495622e-03
## region_East     3.414311e-03
## region_North    2.112827e-03
## region_South    1.659424e-03
## region_West     3.943604e-03

#Visualizing PCA
fviz_pca_var(store_pca, col.var = "contrib", gradient.cols = c("#002bbb", "#bb2e00"), repel = TRUE)

#Creating a factor map for the top 10 variables with the highest contributions.
fviz_pca_var(store_pca, select.var = list(contrib = 8), repel = TRUE)

#Barplotting the contributions of variables
fviz_contrib(store_pca, choice = "var", axes = 1, top = 10)

The red line corresponds to the expected percentage if the distributions were uniform.

6. K-means

The number of clusters (k) must be set before we start the algorithm. At first we can use several different values of k and examine the differences in the results.

library(factoextra)
library(rattle)

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Versión 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Escriba 'rattle()' para agitar, sacudir y  rotar sus datos.

library(cluster)
library(gclus)

set.seed(123)
k2 <- kmeans(datos_uns, centers = 2, nstart = 25)
k3 <- kmeans(datos_uns, centers = 3, nstart = 25)
k4 <- kmeans(datos_uns, centers = 4, nstart = 25)
k5 <- kmeans(datos_uns, centers = 5, nstart = 25)
k6 <- kmeans(datos_uns, centers = 6, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = datos_uns) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = datos_uns) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = datos_uns) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = datos_uns) + ggtitle("k = 5")

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(p1, p2, p3, p4, nrow = 2)

#Determining Optimal Clusters with elbow method
set.seed(123)
library(purrr)

## 
## Attaching package: 'purrr'

## The following object is masked from 'package:data.table':
## 
##     transpose

## The following object is masked from 'package:caret':
## 
##     lift

# function to compute total within-cluster sum of square 
wss <- function(k) {
  kmeans(datos_uns, k, nstart = 25 )$tot.withinss
}

# Compute and plot wss for k = 1 to k = 6
k.values <- 1:6

# extract wss for 2-6 clusters
wss_values <- map_dbl(k.values, wss)
#plot
plot(k.values, wss_values,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

According to the elbow method, the optimal K-value is 2.

##Building a k-means model with a k=2
set.seed(123)
km.res1 <- kmeans(datos_uns, 2, nstart = 25)

fviz_cluster(list(data = datos_uns, cluster = km.res1$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco",ggtheme = theme_classic())

#Extracting the vector of cluster assignment from the model
clust_store <- km.res1$cluster

#Building the segment_customers dataframe
segment_store <- mutate(datos_uns, cluster = clust_store)

#Calculating the mean for each category
count(segment_store, cluster)

##    cluster    n
## 1:       1  700
## 2:       2 9300

#Adding the cluster variable to the original dataframe
datos <- datos %>% mutate(cluster = segment_store$cluster)

#Adding the cluster variable to the original dataframe
datos_uns <- datos_uns %>% mutate(cluster = segment_store$cluster)

#It’s possible to compute the mean of each variables by clusters using the original data:
datos %>% 
  group_by(cluster) %>% 
  summarise(mean_revenue = mean(revenue),
            mean_qty = mean(qty),
            tot_transactions = n())

## # A tibble: 2 × 4
##   cluster mean_revenue mean_qty tot_transactions
##     <int>        <dbl>    <dbl>            <int>
## 1       1        484.     17.9               700
## 2       2         60.9     2.30             9300

#visualizing revenue
datos %>% ggplot(aes(revenue)) + geom_histogram(color = "black", fill = "lightblue") + facet_wrap(vars(cluster)) +  geom_vline(aes(xintercept=mean(revenue)),color="blue", linetype="dashed", size = 1)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#visualizing qty
datos %>% ggplot(aes(qty)) + geom_histogram(color = "black", fill = "lightgreen") + facet_wrap(vars(cluster)) +  geom_vline(aes(xintercept=mean(qty)),color="blue", linetype="dashed", size = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

7. Final conclusions

An unsupervised learning in R was perfomed with basic clustering (K-means) and dimesionality reduction (PCA) in order to get some insights from the data.

According with the results of the clustering analysis, we can compare the 2 clusters from the visuals above.

Cluster 1

Has a total of 700 transactions
This group has a high purchasing power when it comes to quantity of sold items.
Have relatively higher total revenue.

Cluster 2

Has a total of 9300 transactions.
This cluster have low purchasing power when it comes to quantity of sold items.
Have lower total revenue
Majority of the transactions have low revenue generating.

This dataset can be used to analyze various aspects of the store’s performance, including sales trends, regional differences, and the effectiveness of different products and representatives in generating revenue.

Other analysis clustering alternatives are hierarchical clustering or converting the numeric variables qty and revenue into categories and using K-modes.

8. References

Alboukadel Kassambara. Practical Guide to Cluster Analysis in R. Unsupervised Machine Learning. (2017).

Joaquín Amat Rodrigo. Análisis de Componentes Principales (Principal Component Analysis, PCA) y t-SNE (2017). Available at link