Market segmentation is the process of dividing a larger market into smaller groups of consumers who have similar needs, characteristics, or behaviors. This allows companies to target specific groups of consumers with tailored marketing messages and product offerings.
Machine learning algorithms can be useful in the process of market segmentation by analyzing large amounts of data to identify patterns and clusters of consumers with similar demographic and geographic characteristics or behaviors. This can help companies more accurately and efficiently identify and target specific market segments and optimize their marketing, advertising, and sales efforts.
The dataset for an Indian store includes 10000 observations and 5 variables:
| Variable | Description |
|---|---|
| reps | Representatives who are involved in the promotion and sale of the products in their respective region. |
| products | The 12 brands of products promoted by the company. |
| qty | The quantity of products sold in units per transaction. |
| revenue | The revenue generated for each transaction. |
| region | Indicates the region in India where the transaction took place, with four regions: East, North, South, and West India. |
Seetharam Indurti (2019) [Dataset available at: Kaggle)
The aim of this project is to obtain the segmentation of products based on the revenue generated in different regions in order to understand the market trends. The segmentation will help the company to devise marketing strategies and promotional schemes to position the right product according to the preference of the consumers in any given segment.
datos <-read.csv("store.csv", header=T, sep = ",", dec=".")
##Load libraries
library(dplyr)
library(ggplot2)
library(caret)
library(tibble)
library(tidyr)
library(factoextra)
library(forcats)
library(knitr)
###2.a. Structure of data####
head(datos, n=10)
## reps product qty revenue region
## 1 Chitra Galaxy 2 155.10 West
## 2 Vijay Jet 2 39.30 North
## 3 Mala Beacon 3 74.25 West
## 4 Suman Alpen 3 100.98 North
## 5 Rachna Orbit 2 44.98 North
## 6 Aash Trident 1 29.25 East
## 7 Chand Mars 3 68.39 North
## 8 Suraj Orbit 2 45.44 West
## 9 Rachna Milka 1 22.38 North
## 10 Bala Almond 2 49.25 West
str(datos)
## 'data.frame': 10000 obs. of 5 variables:
## $ reps : chr "Chitra" "Vijay" "Mala" "Suman" ...
## $ product: chr "Galaxy" "Jet" "Beacon" "Alpen" ...
## $ qty : int 2 2 3 3 2 1 3 2 1 2 ...
## $ revenue: num 155.1 39.3 74.2 101 45 ...
## $ region : chr "West" "North" "West" "North" ...
dim(datos)
## [1] 10000 5
#summary statistics
summary(datos)
## reps product qty revenue
## Length:10000 Length:10000 Min. : 1.000 Min. : 18.43
## Class :character Class :character 1st Qu.: 2.000 1st Qu.: 39.30
## Mode :character Mode :character Median : 2.000 Median : 58.42
## Mean : 3.387 Mean : 90.57
## 3rd Qu.: 3.000 3rd Qu.: 75.00
## Max. :25.000 Max. :1998.75
## region
## Length:10000
## Class :character
## Mode :character
##
##
##
#categorical variables levels
library(forcats)
fct_count(datos$product) #12 different products
## # A tibble: 12 × 2
## f n
## <fct> <int>
## 1 Almond 1015
## 2 Alpen 1588
## 3 Beacon 651
## 4 Galaxy 342
## 5 Halls 307
## 6 Jet 1274
## 7 Mars 638
## 8 Milka 1294
## 9 Orbit 1255
## 10 Prince 325
## 11 Star 652
## 12 Trident 659
fct_count(datos$reps) #72 different reps
## # A tibble: 72 × 2
## f n
## <fct> <int>
## 1 Aash 315
## 2 Akila 109
## 3 Alka 116
## 4 Anahit 104
## 5 Ananya 216
## 6 Anusha 82
## 7 Aparna 104
## 8 Bala 97
## 9 Bharath 171
## 10 Bhat 87
## # … with 62 more rows
fct_count(datos$region) #4 regions
## # A tibble: 4 × 2
## f n
## <fct> <int>
## 1 East 1703
## 2 North 3603
## 3 South 1665
## 4 West 3029
The dataset contains 10000 transactions of revenues from 12 different products sold by 72 representatives in 4 regions of India (Western, Norther, Souther and Easter India).
# check missing values
any(!complete.cases(datos))
## [1] FALSE
The dataset does not contain missing values.
Data visualization
#Histograms
ggplot(datos, aes(x = qty)) + geom_bar()
ggplot(datos, aes(x = revenue)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Boxplot
ggplot(datos, aes(y = qty)) + geom_boxplot()
ggplot(datos, aes(y = revenue)) + geom_boxplot()
The variables qty and revenue show outliers in their distribution. However, at first we’ll consider all values in the model since it may contain valuable information of importat revenues.
###Check correlation between variables
library(ggcorrplot)
options(repr.plot.width = 6, repr.plot.height = 5)
#Correlogram plot
corr <- round(cor(select_if(datos, is.numeric)), 2)
ggcorrplot(corr, hc.order = T, ggtheme = ggplot2::theme_gray,
colors = c("red", "white", "blue"), lab = T)
As it was expected, the variables qty and revenue are highly correlated.
What products are fast moving in terms of the quantity and revenue generated?
#Scatter plot
options(repr.plot.width = 11, repr.plot.height = 5)
#Qty and revenue by products
scplot1 <- ggplot(datos, aes(revenue, qty, col = product)) +
geom_point() + theme(legend.position = 'bottom') +
labs(x='revenue', y ='qty')
print(scplot1)
#Qty and revenue by region
scplot2 <- ggplot(datos, aes(revenue, qty, col = region)) +
geom_point() + theme(legend.position = 'bottom') +
labs(x='revenue', y ='qty')
print(scplot2)
The scatter plot shows differences in the effectiveness of different products in generating revenue. For example, sales of Galaxy are the ones that generate the highest revenue for the lowest number of items sold. Hence, Galaxy followed by Alpen and Almond are the most fast moving brands in terms of the quantity and revenue generated.
Data visualization
#Transactions per region
ggplot(data = datos, aes(x = region, y = after_stat(count), fill = region)) +
geom_bar() +
scale_fill_brewer(palette ="Greens") +
labs(title = "Number of transactions per regions") +
theme_bw() +
theme(legend.position = "bottom")
Are there regional differences in the quantity of items sold and the revenue achieved?
datos %>%
group_by(region) %>%
summarise(Tot_qty= sum(qty),
Tot_revenue=sum(revenue),
Tot_transactions= n()) %>%
kable(caption ="Total transactions per region")
| region | Tot_qty | Tot_revenue | Tot_transactions |
|---|---|---|---|
| East | 5614 | 146141.5 | 1703 |
| North | 12174 | 326109.5 | 3603 |
| South | 5908 | 158819.9 | 1665 |
| West | 10178 | 274631.3 | 3029 |
The north (3603) and west (3029) regions double the number of transactions to the east and south regions. Hence the total quantity of sold products and total revenue in this regions are higher too.
#Products
ggplot(data = datos, aes(x = product, y = after_stat(count), fill = product)) +
geom_bar() +
labs(title = "Number of transactions per products") +
theme_bw() +
theme(legend.position = "bottom")
datos %>%
group_by(product) %>%
summarise(tot_qty= sum(qty),
tot_revenue=sum(revenue),
tot_transactions= n()) %>%
arrange(desc(tot_revenue)) %>%
kable(caption = "Total transactions and revenue per products")
| product | tot_qty | tot_revenue | tot_transactions |
|---|---|---|---|
| Alpen | 5267 | 177152.07 | 1588 |
| Orbit | 4375 | 99284.25 | 1255 |
| Milka | 4153 | 94249.40 | 1294 |
| Galaxy | 1150 | 90919.18 | 342 |
| Jet | 4262 | 84077.56 | 1274 |
| Almond | 3258 | 80581.91 | 1015 |
| Trident | 2436 | 72289.20 | 659 |
| Mars | 2498 | 58109.60 | 638 |
| Beacon | 2152 | 53164.41 | 651 |
| Star | 2101 | 43637.99 | 652 |
| Prince | 1253 | 34025.46 | 325 |
| Halls | 969 | 18211.18 | 307 |
Overall, of the 12 products sold in the 4 regions, Alpen is the best-selling in India with a total revenue of 177152.07 rupies, followed by Orbit, Milka, and Galaxy. Halls generated the lowest revenue.
#Products per region
ggplot(data = datos, aes(x = product, y = after_stat(count), fill = region)) +
geom_bar() +
labs(title = "Number of transactions per products by region") +
theme_bw() +
theme(legend.position = "bottom")
Suppose that the store manager want to identified the more efficient representative in each region in terms of revenue and transactions.
#One alternative is using pivot tables
library(pivottabler)
pt <- PivotTable$new()
pt$addData(datos)
pt$addRowDataGroups("region")
pt$addRowDataGroups("reps")
pt$defineCalculation(calculationName="Tot_transactions", summariseExpression="n()")
pt$defineCalculation(calculationName="Tot_revenues", summariseExpression="sum(revenue)")
pt$evaluatePivot()
# apply the green style for reps with more than 300 transactions
cells <- pt$findCells(minValue=300, maxValue=1000, includeNull=FALSE, includeNA=FALSE)
pt$setStyling(cells=cells, declarations=list("background-color"="#C6EFCE", "color"="#006100"))
# apply the yellow style for revenues higher than 20000
cells <- pt$findCells(minValue=20000, maxValue=50000, includeNull=FALSE, includeNA=FALSE)
pt$setStyling(cells=cells, declarations=list("background-color"="#FFEB9C", "color"="#9C5700"))
pt$renderPivot()
The total revenue generated by all the representatives for the company is 9,05,702.21 rupies, covering 12 brands and involving 10,000 transactions in all the 4 regions. However, there is a difference between regions in the efficiency to generate revenue between the reps. In the table above, in green, the most efficient representatives of the 4 regions that reach more than 300 transactions obtaining total revenues higher than 20,000 rupies (yellow) are indicated.
East region: Aash and Vish
South region: Seet
North region: Rachna
West region: Santosh
Suppose that now the store manager want to see the revenue and quantity broken down by the products and region.
#Sales trends of products per reps
pt2 <- PivotTable$new()
pt2$addData(datos)
pt2$addRowDataGroups("region")
pt2$addColumnDataGroups("product")
pt2$defineCalculation(calculationName="tot_revenue", summariseExpression="sum(revenue)")
pt2$defineCalculation(calculationName="tot_qty", summariseExpression="sum(qty)")
pt2$evaluatePivot()
pt2$renderPivot()
#Normalizing data
datos_preProces <-preProcess(datos, method = c("scale", "center"))
datos_uns <- predict(datos_preProces,datos)
#One Hot-encoding
datos_uns$reps <- as.factor(datos_uns$reps)
datos_uns$region <- as.factor(datos_uns$region)
datos_uns$product <- as.factor(datos_uns$product)
library(mltools)
##
## Attaching package: 'mltools'
## The following object is masked from 'package:tidyr':
##
## replace_na
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
datos_uns <- one_hot(as.data.table(datos_uns))
#Chequeamos preprocesamiento
str(datos_uns)
## Classes 'data.table' and 'data.frame': 10000 obs. of 90 variables:
## $ reps_Aash : int 0 0 0 0 0 1 0 0 0 0 ...
## $ reps_Akila : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Alka : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Anahit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Ananya : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Anusha : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Aparna : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Bala : int 0 0 0 0 0 0 0 0 0 1 ...
## $ reps_Bharath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Bhat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Chand : int 0 0 0 0 0 0 1 0 0 0 ...
## $ reps_Chandra : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Chitra : int 1 0 0 0 0 0 0 0 0 0 ...
## $ reps_Durga : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Easwar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Hussain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Jagdish : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Jaggi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Javed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Jay : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_John : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Kamat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Keshab : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Kishen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Kishore : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Kumar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Madhu : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Mak : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Mala : int 0 0 1 0 0 0 0 0 0 0 ...
## $ reps_Manju : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Meena : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Mehta : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Mukesh : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Mukund : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Nandini : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Nidhi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Palak : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Pooja : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Prarth : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Prasad : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Priya : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Rachna : int 0 0 0 0 1 0 0 0 1 0 ...
## $ reps_Rahul : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Rajat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Raji : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Ram : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Ranga : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Ratna : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Ravi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Reva : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Rishi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Rohini : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Sai : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Santosh : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Satya : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Satyen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Seet : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Sesh : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Shaanth : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Sruti : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Suman : int 0 0 0 1 0 0 0 0 0 0 ...
## $ reps_Sumedh : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Suraj : int 0 0 0 0 0 0 0 1 0 0 ...
## $ reps_Suresh : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Susan : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Swami : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Vaghya : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Veeyes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Venkat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Vidya : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reps_Vijay : int 0 1 0 0 0 0 0 0 0 0 ...
## $ reps_Vish : int 0 0 0 0 0 0 0 0 0 0 ...
## $ product_Almond : int 0 0 0 0 0 0 0 0 0 1 ...
## $ product_Alpen : int 0 0 0 1 0 0 0 0 0 0 ...
## $ product_Beacon : int 0 0 1 0 0 0 0 0 0 0 ...
## $ product_Galaxy : int 1 0 0 0 0 0 0 0 0 0 ...
## $ product_Halls : int 0 0 0 0 0 0 0 0 0 0 ...
## $ product_Jet : int 0 1 0 0 0 0 0 0 0 0 ...
## $ product_Mars : int 0 0 0 0 0 0 1 0 0 0 ...
## $ product_Milka : int 0 0 0 0 0 0 0 0 1 0 ...
## $ product_Orbit : int 0 0 0 0 1 0 0 1 0 0 ...
## $ product_Prince : int 0 0 0 0 0 0 0 0 0 0 ...
## $ product_Star : int 0 0 0 0 0 0 0 0 0 0 ...
## $ product_Trident: int 0 0 0 0 0 1 0 0 0 0 ...
## $ qty : num -0.3213 -0.3213 -0.0897 -0.0897 -0.3213 ...
## $ revenue : num 0.5049 -0.4012 -0.1277 0.0815 -0.3567 ...
## $ region_East : int 0 0 0 0 0 1 0 0 0 0 ...
## $ region_North : int 0 1 0 1 1 0 1 0 1 0 ...
## $ region_South : int 0 0 0 0 0 0 0 0 0 0 ...
## $ region_West : int 1 0 1 0 0 0 0 1 0 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
Before applying any clustering method on your data, it’s important to evaluate whether the data sets contains meaningful clusters. Because, even if the data is uniformly distributed, the k-means algorithm and hierarchical clustering impose a classification. Since the main problem faced by unsupervised learning methods is the difficulty in validating the results (because there is no response variable available to compare them), this type of analysis can be used to evaluate the validity of clustering analysis.
library(clustertend)
## Package `clustertend` is deprecated. Use package `hopkins` instead.
# Plot data set
fviz_pca_ind(prcomp(datos_uns), title = "PCA",
habillage = datos$region, palette = "jco",
geom = "point", ggtheme = theme_classic(),
legend = "bottom")
As we can see, the data is not randomly or uniformly distributed.
Principal Component Analysis belongs to the family of techniques known as unsupervised learning. This method allows to “condense” the information provided by multiple variables into just a few components, to find a linear combination of the original features that capture maximum variance in the dataset.
library(FactoMineR)
#Running a PCA.
store_pca <- PCA(datos_uns, graph = FALSE)
print(store_pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 10000 individuals, described by 90 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
#Exploring PCA()
# Getting the summary of the pca
summary(store_pca)
##
## Call:
## PCA(X = datos_uns, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 2.520 2.318 2.229 1.962 1.217 1.196 1.183
## % of var. 2.801 2.576 2.477 2.180 1.353 1.329 1.315
## Cumulative % of var. 2.801 5.376 7.853 10.033 11.386 12.715 14.030
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 1.170 1.136 1.134 1.132 1.125 1.105 1.097
## % of var. 1.300 1.262 1.260 1.258 1.250 1.227 1.219
## Cumulative % of var. 15.330 16.593 17.852 19.110 20.360 21.587 22.806
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 1.091 1.031 1.031 1.030 1.030 1.029 1.028
## % of var. 1.213 1.146 1.145 1.144 1.144 1.143 1.142
## Cumulative % of var. 24.019 25.165 26.310 27.454 28.598 29.742 30.884
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 1.025 1.024 1.022 1.020 1.019 1.019 1.018
## % of var. 1.139 1.138 1.136 1.134 1.133 1.132 1.131
## Cumulative % of var. 32.023 33.161 34.296 35.430 36.563 37.695 38.826
## Dim.29 Dim.30 Dim.31 Dim.32 Dim.33 Dim.34 Dim.35
## Variance 1.017 1.017 1.016 1.014 1.014 1.012 1.011
## % of var. 1.130 1.130 1.128 1.127 1.127 1.125 1.124
## Cumulative % of var. 39.956 41.086 42.214 43.342 44.468 45.593 46.717
## Dim.36 Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 1.011 1.011 1.011 1.011 1.011 1.011 1.011
## % of var. 1.124 1.124 1.124 1.124 1.123 1.123 1.123
## Cumulative % of var. 47.840 48.964 50.088 51.211 52.335 53.458 54.581
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48 Dim.49
## Variance 1.011 1.011 1.011 1.011 1.010 1.010 1.010
## % of var. 1.123 1.123 1.123 1.123 1.123 1.123 1.123
## Cumulative % of var. 55.704 56.827 57.950 59.073 60.196 61.319 62.441
## Dim.50 Dim.51 Dim.52 Dim.53 Dim.54 Dim.55 Dim.56
## Variance 1.010 1.010 1.010 1.010 1.010 1.010 1.010
## % of var. 1.122 1.122 1.122 1.122 1.122 1.122 1.122
## Cumulative % of var. 63.564 64.686 65.808 66.931 68.053 69.175 70.297
## Dim.57 Dim.58 Dim.59 Dim.60 Dim.61 Dim.62 Dim.63
## Variance 1.010 1.010 1.010 1.010 1.010 1.009 1.009
## % of var. 1.122 1.122 1.122 1.122 1.122 1.122 1.122
## Cumulative % of var. 71.419 72.541 73.663 74.785 75.907 77.028 78.150
## Dim.64 Dim.65 Dim.66 Dim.67 Dim.68 Dim.69 Dim.70
## Variance 1.009 1.009 1.009 1.009 1.009 1.009 1.009
## % of var. 1.121 1.121 1.121 1.121 1.121 1.121 1.121
## Cumulative % of var. 79.271 80.393 81.514 82.635 83.756 84.877 85.998
## Dim.71 Dim.72 Dim.73 Dim.74 Dim.75 Dim.76 Dim.77
## Variance 1.008 0.999 0.987 0.980 0.978 0.973 0.969
## % of var. 1.120 1.110 1.097 1.089 1.087 1.081 1.077
## Cumulative % of var. 87.118 88.228 89.324 90.414 91.500 92.581 93.659
## Dim.78 Dim.79 Dim.80 Dim.81 Dim.82 Dim.83 Dim.84
## Variance 0.960 0.952 0.947 0.938 0.934 0.917 0.060
## % of var. 1.066 1.058 1.052 1.043 1.037 1.018 0.066
## Cumulative % of var. 94.725 95.783 96.835 97.878 98.915 99.934 100.000
## Dim.85 Dim.86 Dim.87 Dim.88 Dim.89 Dim.90
## Variance 0.000 0.000 0.000 0.000 0.000 0.000
## % of var. 0.000 0.000 0.000 0.000 0.000 0.000
## Cumulative % of var. 100.000 100.000 100.000 100.000 100.000 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 11.710 | 1.821 0.013 0.024 | -1.486 0.010 0.016 | 0.114
## 2 | 10.040 | -1.996 0.016 0.040 | -0.700 0.002 0.005 | -0.124
## 3 | 8.381 | 1.960 0.015 0.055 | -1.501 0.010 0.032 | -0.092
## 4 | 8.165 | -1.984 0.016 0.059 | -0.724 0.002 0.008 | 0.022
## 5 | 6.498 | -1.974 0.015 0.092 | -0.809 0.003 0.016 | -0.060
## 6 | 7.318 | 0.438 0.001 0.004 | 2.133 0.020 0.085 | -2.557
## 7 | 10.881 | -1.973 0.015 0.033 | -0.784 0.003 0.005 | -0.128
## 8 | 11.362 | 1.871 0.014 0.027 | -1.507 0.010 0.018 | -0.052
## 9 | 6.507 | -1.927 0.015 0.088 | -0.655 0.002 0.010 | -0.101
## 10 | 10.791 | 1.825 0.013 0.029 | -1.431 0.009 0.018 | -0.132
## ctr cos2
## 1 0.000 0.000 |
## 2 0.000 0.000 |
## 3 0.000 0.000 |
## 4 0.000 0.000 |
## 5 0.000 0.000 |
## 6 0.029 0.122 |
## 7 0.000 0.000 |
## 8 0.000 0.000 |
## 9 0.000 0.000 |
## 10 0.000 0.000 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## reps_Aash | 0.048 0.091 0.002 | 0.258 2.880 0.067 | -0.299 4.001
## reps_Akila | 0.123 0.605 0.015 | -0.100 0.429 0.010 | -0.007 0.002
## reps_Alka | 0.028 0.031 0.001 | 0.143 0.887 0.021 | 0.191 1.631
## reps_Anahit | 0.121 0.580 0.015 | -0.098 0.413 0.010 | -0.005 0.001
## reps_Ananya | -0.185 1.352 0.034 | -0.073 0.229 0.005 | -0.003 0.000
## reps_Anusha | 0.024 0.022 0.001 | 0.128 0.711 0.016 | -0.153 1.050
## reps_Aparna | -0.126 0.635 0.016 | -0.049 0.102 0.002 | -0.002 0.000
## reps_Bala | 0.116 0.533 0.013 | -0.094 0.383 0.009 | 0.001 0.000
## reps_Bharath | 0.035 0.049 0.001 | 0.175 1.322 0.031 | 0.231 2.390
## reps_Bhat | 0.025 0.025 0.001 | 0.124 0.661 0.015 | 0.166 1.241
## cos2
## reps_Aash 0.089 |
## reps_Akila 0.000 |
## reps_Alka 0.036 |
## reps_Anahit 0.000 |
## reps_Ananya 0.000 |
## reps_Anusha 0.023 |
## reps_Aparna 0.000 |
## reps_Bala 0.000 |
## reps_Bharath 0.053 |
## reps_Bhat 0.028 |
#Tracing variable contributions
store_pca$var$contrib
## Dim.1 Dim.2 Dim.3 Dim.4
## reps_Aash 9.126914e-02 2.880312e+00 4.000514e+00 2.585428e-01
## reps_Akila 6.049895e-01 4.293797e-01 2.075660e-03 6.110025e-02
## reps_Alka 3.085068e-02 8.870338e-01 1.630592e+00 1.296619e-02
## reps_Anahit 5.804718e-01 4.130298e-01 1.237727e-03 4.865336e-02
## reps_Ananya 1.351809e+00 2.287304e-01 4.400193e-04 4.345170e-03
## reps_Anusha 2.237026e-02 7.114428e-01 1.049974e+00 1.141791e-04
## reps_Aparna 6.347325e-01 1.022493e-01 1.306734e-04 1.459846e-03
## reps_Bala 5.326743e-01 3.832123e-01 3.652381e-05 3.319142e-02
## reps_Bharath 4.880789e-02 1.322191e+00 2.390174e+00 8.107790e-02
## reps_Bhat 2.467424e-02 6.608977e-01 1.240882e+00 2.180997e-04
## reps_Chand 6.155843e-01 1.001215e-01 3.600039e-04 6.653208e-03
## reps_Chandra 5.547627e-01 3.997474e-01 4.733758e-04 4.298805e-06
## reps_Chitra 5.269208e-01 3.752940e-01 2.499613e-04 1.265482e-03
## reps_Durga 2.817092e-02 8.365107e-01 1.188191e+00 3.649964e-02
## reps_Easwar 1.012080e+00 7.205027e-01 1.461209e-03 3.608812e-02
## reps_Hussain 6.311231e-02 1.828517e+00 2.590902e+00 2.710314e-02
## reps_Jagdish 1.694150e+00 1.211890e+00 2.358543e-03 4.044017e-02
## reps_Jaggi 4.891125e-01 3.517084e-01 8.137054e-04 1.807803e-02
## reps_Javed 2.601138e-02 7.759286e-01 1.103459e+00 3.205657e-02
## reps_Jay 2.448346e-02 6.468499e-01 1.177036e+00 4.939788e-03
## reps_John 5.968412e-01 4.241554e-01 2.231347e-04 5.902222e-02
## reps_Kamat 5.583464e-01 9.164844e-02 1.039784e-03 6.904572e-02
## reps_Keshab 3.141269e-02 8.721165e-01 1.200158e+00 4.034408e-02
## reps_Kishen 2.944384e-02 8.755794e-01 1.230285e+00 4.096359e-02
## reps_Kishore 5.377989e-01 8.909523e-02 1.123309e-03 1.261610e-01
## reps_Kumar 2.629355e-02 7.978231e-01 1.073492e+00 1.073231e-01
## reps_Madhu 3.102445e-02 9.449676e-01 1.369027e+00 2.529577e-03
## reps_Mak 5.613522e-01 3.931571e-01 2.598404e-04 8.010751e-02
## reps_Mala 1.094638e+00 7.884700e-01 1.370512e-04 6.734804e-03
## reps_Manju 5.163068e-01 3.731615e-01 2.820414e-04 4.023765e-03
## reps_Meena 5.538185e-01 3.918799e-01 1.475420e-04 4.885188e-04
## reps_Mehta 6.270911e-01 4.470969e-01 3.853633e-04 9.544612e-02
## reps_Mukesh 5.433165e-01 3.930628e-01 8.957424e-04 1.500349e-02
## reps_Mukund 7.053313e-01 1.165761e-01 1.802971e-04 3.226557e-05
## reps_Nandini 1.836139e+00 3.025454e-01 2.732726e-05 9.420612e-03
## reps_Nidhi 3.042405e-02 8.119538e-01 1.437608e+00 8.063841e-02
## reps_Palak 6.384705e-01 4.496514e-01 6.797111e-05 2.327589e-03
## reps_Pooja 6.052876e-01 1.012586e-01 3.730242e-04 1.339135e-02
## reps_Prarth 5.821456e-01 9.917017e-02 1.304760e-03 5.386421e-02
## reps_Prasad 5.232909e-01 8.620927e-02 8.746298e-05 2.117441e-02
## reps_Priya 6.103982e-01 4.307019e-01 4.608751e-06 7.067387e-03
## reps_Rachna 2.033926e+00 3.354797e-01 1.798543e-04 3.699611e-04
## reps_Rahul 2.912412e-02 8.112981e-01 1.409991e+00 1.596909e-01
## reps_Rajat 6.045053e-01 1.042727e-01 7.346125e-05 9.023017e-04
## reps_Raji 3.036097e-02 8.766230e-01 1.510060e+00 1.489057e-01
## reps_Ram 2.599799e-02 7.139268e-01 1.062281e+00 1.392439e-03
## reps_Ranga 6.905234e-01 1.170781e-01 7.636377e-06 1.136959e-02
## reps_Ratna 5.834138e-01 9.526292e-02 8.322010e-05 3.256593e-03
## reps_Ravi 1.764924e+00 2.871214e-01 8.802324e-06 3.321440e-02
## reps_Reva 4.936773e-01 3.538508e-01 3.616372e-04 3.295340e-03
## reps_Rishi 5.956418e-01 9.730827e-02 2.288875e-04 5.758519e-03
## reps_Rohini 4.828947e-01 8.114726e-02 1.177757e-03 4.530916e-02
## reps_Sai 1.863295e+00 3.070708e-01 1.082394e-03 1.514977e-02
## reps_Santosh 1.860980e+00 1.321478e+00 4.302138e-04 2.068068e-03
## reps_Satya 3.385817e-02 9.906489e-01 1.388218e+00 7.440249e-02
## reps_Satyen 3.248771e-02 8.503359e-01 1.568118e+00 1.015551e-02
## reps_Seet 1.053986e-01 2.896000e+00 5.310312e+00 4.092982e-02
## reps_Sesh 4.948702e-01 8.083477e-02 3.545893e-04 4.523198e-03
## reps_Shaanth 1.797179e+00 2.988703e-01 1.751508e-05 2.296284e-02
## reps_Sruti 2.792923e-02 7.921121e-01 1.420471e+00 1.092452e-02
## reps_Suman 1.077273e+00 1.760712e-01 5.635727e-08 7.924891e-03
## reps_Sumedh 6.918845e-01 4.861714e-01 3.450117e-09 7.507632e-03
## reps_Suraj 4.684067e-01 3.308927e-01 1.973896e-05 2.540159e-02
## reps_Suresh 1.168881e+00 8.379279e-01 7.206562e-04 2.108465e-03
## reps_Susan 6.236323e-01 1.037777e-01 2.118178e-03 1.854908e-01
## reps_Swami 3.097095e-02 8.511155e-01 1.509163e+00 6.787611e-02
## reps_Vaghya 5.848086e-02 1.627130e+00 2.902172e+00 5.295584e-02
## reps_Veeyes 5.419412e-01 9.485072e-02 2.048657e-03 8.269226e-02
## reps_Venkat 6.571476e-01 1.078194e-01 3.761756e-04 9.584989e-03
## reps_Vidya 5.418096e-01 3.890714e-01 1.557574e-04 1.887188e-03
## reps_Vijay 6.689547e-01 1.118592e-01 1.557898e-03 5.209196e-02
## reps_Vish 9.190382e-02 2.807941e+00 4.053434e+00 3.179441e-02
## product_Almond 1.252444e-02 2.820086e-03 2.897915e-02 8.839478e-02
## product_Alpen 3.557211e-03 1.684165e-03 1.492996e-02 3.379919e-01
## product_Beacon 4.476679e-02 1.072185e-02 1.216649e-02 2.560888e-02
## product_Galaxy 4.741947e-03 5.585370e-03 5.152675e-02 3.789638e+00
## product_Halls 8.651118e-03 1.414780e-02 5.573989e-03 1.800614e-01
## product_Jet 2.412117e-02 1.279931e-02 3.711257e-05 4.505851e-01
## product_Mars 1.225051e-03 1.600073e-02 2.971042e-02 6.990947e-02
## product_Milka 3.267631e-02 9.360212e-02 8.651372e-03 3.503110e-01
## product_Orbit 7.532472e-04 4.484999e-02 8.832152e-03 5.422228e-02
## product_Prince 1.141692e-02 1.674871e-03 1.007486e-04 9.753618e-02
## product_Star 4.073355e-04 7.777759e-03 5.059132e-02 1.635066e-01
## product_Trident 3.152009e-04 1.035003e-02 3.843395e-03 2.171605e-01
## qty 1.021555e-03 2.727557e-03 9.573269e-01 4.278458e+01
## revenue 1.137793e-03 1.231221e-07 1.097322e+00 4.757954e+01
## region_East 5.920567e-01 1.777238e+01 2.518601e+01 5.566755e-01
## region_North 3.439830e+01 5.699335e+00 4.739475e-03 1.173116e-03
## region_South 5.583659e-01 1.535545e+01 2.768812e+01 5.890334e-01
## region_West 2.393548e+01 1.706692e+01 7.851776e-03 2.270370e-03
## Dim.5
## reps_Aash 5.563493e-01
## reps_Akila 1.040647e-02
## reps_Alka 5.422753e-03
## reps_Anahit 6.495928e-01
## reps_Ananya 2.291160e-01
## reps_Anusha 5.818555e-01
## reps_Aparna 5.319524e-01
## reps_Bala 1.321652e-01
## reps_Bharath 4.910534e-02
## reps_Bhat 1.666652e-01
## reps_Chand 1.157881e-02
## reps_Chandra 1.025434e-01
## reps_Chitra 4.926586e-01
## reps_Durga 7.008470e-02
## reps_Easwar 9.125761e-02
## reps_Hussain 4.431965e-03
## reps_Jagdish 1.500219e-01
## reps_Jaggi 8.108294e-02
## reps_Javed 1.435683e-01
## reps_Jay 5.630798e-02
## reps_John 4.534379e-01
## reps_Kamat 3.675797e-02
## reps_Keshab 2.272323e+00
## reps_Kishen 3.965849e-01
## reps_Kishore 1.055071e-04
## reps_Kumar 3.563015e-01
## reps_Madhu 5.158448e-03
## reps_Mak 4.110437e-04
## reps_Mala 6.246380e-01
## reps_Manju 1.505673e-01
## reps_Meena 7.274106e-04
## reps_Mehta 2.179807e-01
## reps_Mukesh 2.539955e-01
## reps_Mukund 2.018587e-01
## reps_Nandini 3.750417e-01
## reps_Nidhi 1.496604e-03
## reps_Palak 1.153770e-01
## reps_Pooja 4.527582e-01
## reps_Prarth 3.818279e-02
## reps_Prasad 3.417742e-01
## reps_Priya 7.193725e-03
## reps_Rachna 2.475571e-01
## reps_Rahul 1.084123e-03
## reps_Rajat 2.579573e-02
## reps_Raji 5.931738e-03
## reps_Ram 7.826801e-01
## reps_Ranga 2.647436e-02
## reps_Ratna 1.769534e-03
## reps_Ravi 2.977884e-01
## reps_Reva 1.714722e-01
## reps_Rishi 3.048499e-03
## reps_Rohini 1.135946e-01
## reps_Sai 2.041951e-01
## reps_Santosh 1.437317e-02
## reps_Satya 5.537735e-01
## reps_Satyen 5.233452e-03
## reps_Seet 4.433331e-01
## reps_Sesh 2.982912e-01
## reps_Shaanth 5.369421e-01
## reps_Sruti 2.617994e-01
## reps_Suman 1.110145e+00
## reps_Sumedh 3.003468e-01
## reps_Suraj 1.874928e-02
## reps_Suresh 6.881097e-01
## reps_Susan 1.373540e-03
## reps_Swami 1.115170e-01
## reps_Vaghya 2.788027e-01
## reps_Veeyes 2.414574e-02
## reps_Venkat 4.627818e-04
## reps_Vidya 6.173212e-01
## reps_Vijay 2.049095e-02
## reps_Vish 3.048169e-01
## product_Almond 1.073657e-02
## product_Alpen 5.580359e+01
## product_Beacon 2.878500e-01
## product_Galaxy 5.104822e-02
## product_Halls 4.020579e-03
## product_Jet 1.158794e+00
## product_Mars 4.155618e-02
## product_Milka 6.013317e+00
## product_Orbit 1.782020e+01
## product_Prince 1.638989e-02
## product_Star 4.790145e-06
## product_Trident 1.446876e-02
## qty 8.691487e-01
## revenue 7.495622e-03
## region_East 3.414311e-03
## region_North 2.112827e-03
## region_South 1.659424e-03
## region_West 3.943604e-03
#Visualizing PCA
fviz_pca_var(store_pca, col.var = "contrib", gradient.cols = c("#002bbb", "#bb2e00"), repel = TRUE)
#Creating a factor map for the top 10 variables with the highest contributions.
fviz_pca_var(store_pca, select.var = list(contrib = 8), repel = TRUE)
#Barplotting the contributions of variables
fviz_contrib(store_pca, choice = "var", axes = 1, top = 10)
The red line corresponds to the expected percentage if the distributions were uniform.
The number of clusters (k) must be set before we start the algorithm. At first we can use several different values of k and examine the differences in the results.
library(factoextra)
library(rattle)
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Versión 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Escriba 'rattle()' para agitar, sacudir y rotar sus datos.
library(cluster)
library(gclus)
set.seed(123)
k2 <- kmeans(datos_uns, centers = 2, nstart = 25)
k3 <- kmeans(datos_uns, centers = 3, nstart = 25)
k4 <- kmeans(datos_uns, centers = 4, nstart = 25)
k5 <- kmeans(datos_uns, centers = 5, nstart = 25)
k6 <- kmeans(datos_uns, centers = 6, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = datos_uns) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = datos_uns) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = datos_uns) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = datos_uns) + ggtitle("k = 5")
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p1, p2, p3, p4, nrow = 2)
#Determining Optimal Clusters with elbow method
set.seed(123)
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:data.table':
##
## transpose
## The following object is masked from 'package:caret':
##
## lift
# function to compute total within-cluster sum of square
wss <- function(k) {
kmeans(datos_uns, k, nstart = 25 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 6
k.values <- 1:6
# extract wss for 2-6 clusters
wss_values <- map_dbl(k.values, wss)
#plot
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
According to the elbow method, the optimal K-value is 2.
##Building a k-means model with a k=2
set.seed(123)
km.res1 <- kmeans(datos_uns, 2, nstart = 25)
fviz_cluster(list(data = datos_uns, cluster = km.res1$cluster),
ellipse.type = "norm", geom = "point", stand = FALSE,
palette = "jco",ggtheme = theme_classic())
#Extracting the vector of cluster assignment from the model
clust_store <- km.res1$cluster
#Building the segment_customers dataframe
segment_store <- mutate(datos_uns, cluster = clust_store)
#Calculating the mean for each category
count(segment_store, cluster)
## cluster n
## 1: 1 700
## 2: 2 9300
#Adding the cluster variable to the original dataframe
datos <- datos %>% mutate(cluster = segment_store$cluster)
#Adding the cluster variable to the original dataframe
datos_uns <- datos_uns %>% mutate(cluster = segment_store$cluster)
#It’s possible to compute the mean of each variables by clusters using the original data:
datos %>%
group_by(cluster) %>%
summarise(mean_revenue = mean(revenue),
mean_qty = mean(qty),
tot_transactions = n())
## # A tibble: 2 × 4
## cluster mean_revenue mean_qty tot_transactions
## <int> <dbl> <dbl> <int>
## 1 1 484. 17.9 700
## 2 2 60.9 2.30 9300
#visualizing revenue
datos %>% ggplot(aes(revenue)) + geom_histogram(color = "black", fill = "lightblue") + facet_wrap(vars(cluster)) + geom_vline(aes(xintercept=mean(revenue)),color="blue", linetype="dashed", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#visualizing qty
datos %>% ggplot(aes(qty)) + geom_histogram(color = "black", fill = "lightgreen") + facet_wrap(vars(cluster)) + geom_vline(aes(xintercept=mean(qty)),color="blue", linetype="dashed", size = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
An unsupervised learning in R was perfomed with basic clustering (K-means) and dimesionality reduction (PCA) in order to get some insights from the data.
According with the results of the clustering analysis, we can compare the 2 clusters from the visuals above.
Cluster 1
Has a total of 700 transactions
This group has a high purchasing power when it comes to quantity of sold items.
Have relatively higher total revenue.
Cluster 2
Has a total of 9300 transactions.
This cluster have low purchasing power when it comes to quantity of sold items.
Have lower total revenue
Majority of the transactions have low revenue generating.
This dataset can be used to analyze various aspects of the store’s performance, including sales trends, regional differences, and the effectiveness of different products and representatives in generating revenue.
Other analysis clustering alternatives are hierarchical clustering or converting the numeric variables qty and revenue into categories and using K-modes.
Alboukadel Kassambara. Practical Guide to Cluster Analysis in R. Unsupervised Machine Learning. (2017).
Joaquín Amat Rodrigo. Análisis de Componentes Principales (Principal Component Analysis, PCA) y t-SNE (2017). Available at link