Introduction

This short report concerns the problem of customer segmentation. The approach presented in the following study divides customers into groups based on their purchasing habits. The methods used to identify customer groups were Principal Component Analysis (PCA) and K-Means clustering. PCA is a dimensionality reduction algorithm, which makes it particularly useful for processing large datasets: it decomposes the data into principal components (PCs). This makes it a good basis for clustering models such as the K-Means algorithm, which groups similar observations based on Euclidean distance. The researcher selects the number of clusters k, and the algorithm searches for the best centroids for the k groups. The resulting groups can then be inspected to determine which factors their members share. In the case of customers, these are their buying preferences.
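
As a minimal illustration of this workflow (the built-in iris data is used here purely as a stand-in for the customer data):

# PCA for dimensionality reduction, then K-Means on the reduced data
pcaDemo <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
kmDemo  <- kmeans(pcaDemo$x[, 1:2], centers = 3, nstart = 25)
table(kmDemo$cluster, iris$Species)  # compare the clusters to the known species labels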

The data used for the following research is a collection of fictional data describing bike shops (customers), bikes (products), and sales orders for a bike manufacturer (https://github.com/mdancho84/orderSimulatoR/tree/master/data).

The hypothesis is the following: bike shops decide which bikes to buy based on features such as unit price (high-end vs. affordable), primary category (mountain vs. road), frame material (aluminum vs. carbon), etc. The measure used for clustering will be the quantity of bikes purchased.

Data preparation

Three data sources (customers, products, orders) have been read and merged, as sketched at the end of this section. The following table presents the first 6 rows of the combined dataset.

head(customerTrends)
## # A tibble: 6 x 35
## # Groups:   model, category1, category2, frame [6]
##                 model category1  category2    frame        price
##                 <chr>     <chr>      <chr>    <chr>       <fctr>
## 1         Bad Habit 1  Mountain      Trail Aluminum [ 415, 3500)
## 2         Bad Habit 2  Mountain      Trail Aluminum [ 415, 3500)
## 3 Beast of the East 1  Mountain      Trail Aluminum [ 415, 3500)
## 4 Beast of the East 2  Mountain      Trail Aluminum [ 415, 3500)
## 5 Beast of the East 3  Mountain      Trail Aluminum [ 415, 3500)
## 6   CAAD Disc Ultegra      Road Elite Road Aluminum [ 415, 3500)
## # ... with 30 more variables: `Albuquerque Cycles` <dbl>, `Ann Arbor
## #   Speed` <dbl>, `Austin Cruisers` <dbl>, `Cincinnati Speed` <dbl>,
## #   `Columbus Race Equipment` <dbl>, `Dallas Cycles` <dbl>, `Denver Bike
## #   Shop` <dbl>, `Detroit Cycles` <dbl>, `Indianapolis Velocipedes` <dbl>,
## #   `Ithaca Mountain Climbers` <dbl>, `Kansas City 29ers` <dbl>, `Las
## #   Vegas Cycles` <dbl>, `Los Angeles Cycles` <dbl>, `Louisville Race
## #   Equipment` <dbl>, `Miami Race Equipment` <dbl>, `Minneapolis Bike
## #   Shop` <dbl>, `Nashville Cruisers` <dbl>, `New Orleans
## #   Velocipedes` <dbl>, `New York Cycles` <dbl>, `Oklahoma City Race
## #   Equipment` <dbl>, `Philadelphia Bike Shop` <dbl>, `Phoenix
## #   Bi-peds` <dbl>, `Pittsburgh Mountain Machines` <dbl>, `Portland
## #   Bi-peds` <dbl>, `Providence Bi-peds` <dbl>, `San Antonio Bike
## #   Shop` <dbl>, `San Francisco Cruisers` <dbl>, `Seattle Race
## #   Equipment` <dbl>, `Tampa 29ers` <dbl>, `Wichita Speed` <dbl>

One can see that the dataset is multidimensional: it contains 35 variables (five bike model attributes plus one quantity column per bike shop).
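
For completeness, a sketch of the merge step described above. The file names come from the linked repository; the key column names are assumptions.

# Read and merge the three data sources ----------------------------------------
library(readxl)
library(dplyr)

customers <- read_excel("bikeshops.xlsx")   # bike shops
products  <- read_excel("bikes.xlsx")       # bike models
orders    <- read_excel("orders.xlsx")      # sales orders

combined <- orders %>%
  left_join(customers, by = c("customer.id" = "bikeshop.id")) %>%
  left_join(products,  by = c("product.id"  = "bike.id"))
# customerTrends is then obtained by aggregating the purchased quantities
# per bike model and spreading the bike shops into columns.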

PCA application

The analysis has been performed in R. PCA can be applied through the prcomp() function. To obtain the best results, the data has been scaled and centered.

The main purpose of the PCA analysis is the reduction of the dimensionality of the data. Therefore, the variance explained by the components needs to be examined after the PCA algorithm has been applied. PCA transforms the data into a set of mutually orthogonal dimensions, ordered by the amount of variance they capture. The greater the variance explained, the more information is summarized by the PC.

# PCA using prcomp() -----------------------------------------------------------
pca <- prcomp(t(customerTrends[, -(1:5)]), scale. = TRUE, center = TRUE)  # transpose so the bike shops become observations
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6     PC7
## Standard deviation     4.9852 4.2206 2.2308 2.18304 2.1169 2.03586 1.86265
## Proportion of Variance 0.2562 0.1836 0.0513 0.04913 0.0462 0.04273 0.03577
## Cumulative Proportion  0.2562 0.4399 0.4911 0.54028 0.5865 0.62921 0.66498
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     1.80739 1.76329 1.70165 1.61928 1.55759 1.50889
## Proportion of Variance 0.03368 0.03205 0.02985 0.02703 0.02501 0.02347
## Cumulative Proportion  0.69865 0.73071 0.76056 0.78759 0.81260 0.83607
##                           PC14    PC15    PC16    PC17    PC18    PC19
## Standard deviation     1.46505 1.30348 1.26612 1.22705 1.18511 1.13653
## Proportion of Variance 0.02213 0.01752 0.01653 0.01552 0.01448 0.01332
## Cumulative Proportion  0.85820 0.87572 0.89224 0.90776 0.92224 0.93556
##                           PC20    PC21    PC22    PC23    PC24    PC25
## Standard deviation     1.04480 0.98599 0.87912 0.84801 0.80364 0.74209
## Proportion of Variance 0.01125 0.01002 0.00797 0.00741 0.00666 0.00568
## Cumulative Proportion  0.94681 0.95684 0.96480 0.97222 0.97888 0.98455
##                           PC26   PC27   PC28    PC29      PC30
## Standard deviation     0.69318 0.6607 0.5990 0.47169 2.214e-15
## Proportion of Variance 0.00495 0.0045 0.0037 0.00229 0.000e+00
## Cumulative Proportion  0.98951 0.9940 0.9977 1.00000 1.000e+00

plot(pca, type = "l")  # scree plot of the component variances

The presented table and the scree plot show the importance of the components and the variance explained by the model. The first 5 or 6 components together explain roughly 60% of the variance. However, there is a sharp drop between the second and the third component (18.4% vs. 5.1% of variance explained). Therefore, the first two components, which jointly explain about 44% of the variance, may be sufficient.
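
The same proportions can be recomputed directly from the standard deviations stored in the prcomp object:

# Variance explained per component, derived from pca$sdev
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(pve)[1:6], 4)  # cumulative proportion for PC1-PC6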

Visualization of the results

For further analysis, K-Means clustering can be deployed. The number of centers taken into consideration varies between three and six.
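
The clustering step does not appear in the output above; a minimal sketch of how the plots below may have been produced (shown here for five centers, the other configurations being analogous):

library(ggplot2)

set.seed(42)                                   # K-Means is sensitive to initialization
km <- kmeans(pca$x[, 1:2], centers = 5, nstart = 50)

plotData <- data.frame(pca$x[, 1:2],
                       cluster = factor(km$cluster),
                       shop    = rownames(pca$x))

gg2 <- ggplot(plotData, aes(x = PC1, y = PC2, color = cluster, label = shop)) +
  geom_text(size = 3) +
  labs(title = "Bike shop segments on the first two principal components")
gg2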

[Figures: K-Means clustering results for 3, 4, 5, and 6 centers, plotted on the first two principal components.]

The analysis of the outputs reveals that the best configuration for the classification is the K-Means algorithm with the number of centers equal to 5. Nonetheless, it is visible that group number 5 is not cleanly separated. On the other hand, it is quite clear that five segments exist. After inspecting Groups 2 and 5, one can say that they are very similar in their preference for bikes in the low-end price range.
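
One way to verify this similarity is to list the shops assigned to each segment (a sketch, reusing the five-center model km from above; the cluster numbering depends on the random initialization):

# Bike shops per segment
split(names(km$cluster), km$cluster)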

Network Visualization

The research also includes an interactive graphical analysis of the customer connections and of the strengths of their relationships.

One possible visualization method is a dendrogram. However, this solution requires limiting the number of edges to keep the graph legible.
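
The objects used below, simIgraph (the customer similarity graph) and ceb (its edge-betweenness communities), are not constructed in the output above. One plausible sketch, using correlation between the shops' purchase profiles as the similarity measure (the 0.6 threshold is an assumption):

library(igraph)

simMat <- cor(as.matrix(customerTrends[, -(1:5)]))  # shop-vs-shop similarity
adj <- (simMat > 0.6) * 1                           # keep only strong links to limit the edges
diag(adj) <- 0                                      # drop self-loops
simIgraph <- graph_from_adjacency_matrix(adj, mode = "undirected")
ceb <- cluster_edge_betweenness(simIgraph)          # community detection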

dendPlot(ceb, mode = "hclust")  # dendrogram of the edge-betweenness communities


plot(x = ceb, y = simIgraph)  # communities highlighted on the similarity graph
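
The forceNetwork() call below expects the graph in networkD3's list format. A sketch of the conversion, using the detected communities as the node groups:

library(networkD3)

# Convert the igraph object to networkD3's links/nodes list
simIgraph_d3 <- igraph_to_networkD3(simIgraph, group = membership(ceb))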

# Create force directed network plot
forceNetwork(Links = simIgraph_d3$links, Nodes = simIgraph_d3$nodes, 
             Source = 'source', Target = 'target', 
             NodeID = 'name', Group = 'group', 
             fontSize = 16, fontFamily = 'Arial', linkDistance = 100,
             zoom = TRUE)

Such a network visualization is an effective way of presenting the relationships between the customers. It makes it possible to easily distinguish particular groups of customers.