Introduction

The data comes from https://www.kaggle.com/datasets/amritdhaliwal/shopify-shoe-store-data. It records orders that users placed at different shoe stores. We are going to investigate how users choose which stores to buy shoes from.

Data Loading and Preparation

First, let’s load the dataset and take a look:

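library(dplyr)    # %>% and select()
library(arules)   # transactions class, apriori(), inspect()
library(viridis)  # viridis() colour palette

shoe_data <- read.csv("shoe_data.csv")
shoe_data <- shoe_data[1:1000, ] # keep the first 1,000 orders; my memory is not enough to mine rules on the full data :)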

head(shoe_data)
##   order_id shop_id user_id order_amount total_items payment_method
## 1        1      53     746          224           2           cash
## 2        2      92     925           90           1           cash
## 3        3      44     861          144           1           cash
## 4        4      18     935          156           1    credit_card
## 5        5      18     883          156           1    credit_card
## 6        6      58     882          138           1    credit_card
##         created_at
## 1 2017-03-13 12:36
## 2 2017-03-03 17:38
## 3  2017-03-14 4:23
## 4 2017-03-26 12:43
## 5  2017-03-01 4:35
## 6 2017-03-14 15:25

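We only want to investigate the relation between user_id and shop_id, since that’s the scope of this analysis. To do this, let’s create another data frame with just those two columns and convert it to the transactions format that the arules package expects, using the as function. Each user becomes one transaction, and the shops that user ordered from are its items: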

basket_data <- shoe_data %>%
  select(user_id, shop_id)

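# One transaction per user: the items are the shops that user ordered from.
# (Coercing basket_data directly would treat each order row as its own transaction.)
transactions <- as(split(basket_data$shop_id, basket_data$user_id), "transactions")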
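
To make the reshaping step concrete, here is a minimal base R sketch on made-up data. The `toy` data frame and its values are hypothetical and not taken from the dataset. `split()` groups the shop IDs by user, so each list element is one user’s basket. The `unique()` step is an assumption on my part rather than something the original code does: it drops repeat orders from the same shop, since a basket should list each shop only once.

```r
# Hypothetical toy orders: one row per order (not from the real dataset)
toy <- data.frame(
  user_id = c(1, 1, 2, 3, 3, 3),
  shop_id = c(10, 20, 10, 20, 30, 20)
)

# Group shop IDs by user: each list element is one user's "basket" of shops
baskets <- split(toy$shop_id, toy$user_id)

# Drop repeated shops inside a basket (user 3 ordered from shop 20 twice)
baskets <- lapply(baskets, unique)

str(baskets)
```

A list shaped like `baskets` is what gets passed to `as(..., "transactions")` in the real analysis.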

Exploratory Data Analysis

A quick exploratory plot shows how the orders are distributed across shop IDs:

# hist() has no `type` argument, so it is dropped here
hist(basket_data$shop_id, main = "Histogram of Transactions by Shop", xlab = "shop_id", col = viridis(10))
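
Because `hist()` treats shop_id as a number and bins ranges of IDs together, it can hide which individual shops are busiest. As a rough sketch, an exact per-shop count can be built in base R with `table()`. The vector below is made up for illustration and is not the real data.

```r
# Hypothetical shop IDs for a handful of orders (not the real dataset)
shop_id <- c(53, 92, 44, 18, 18, 58, 18, 53)

# Exact number of orders per shop, most frequent first
orders_per_shop <- sort(table(shop_id), decreasing = TRUE)

print(orders_per_shop)
```

On the real data, the same idea would be `table(basket_data$shop_id)`, and the result could be handed to `barplot()` to get one bar per shop.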

Market Basket Analysis

Now, we’ll perform the market basket analysis using the Apriori algorithm:

rules <- apriori(transactions, parameter = list(supp = 0.005, conf = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[100 item(s), 288 transaction(s)] done [0.00s].
## sorting and recoding items ... [99 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(sort(rules, by = "lift"))
##     lhs         rhs  support     confidence coverage    lift     count
## [1] {29, 45} => {15} 0.006944444 1.0        0.006944444 32.00000 2    
## [2] {51, 60} => {61} 0.006944444 1.0        0.006944444 32.00000 2    
## [3] {60, 61} => {51} 0.006944444 1.0        0.006944444 28.80000 2    
## [4] {15, 29} => {45} 0.006944444 1.0        0.006944444 26.18182 2    
## [5] {15, 45} => {29} 0.006944444 1.0        0.006944444 24.00000 2    
## [6] {51, 61} => {60} 0.006944444 1.0        0.006944444 18.00000 2    
## [7] {76}     => {50} 0.010416667 0.5        0.020833333 11.07692 3

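Setting support to 0.005 means that we only keep shop combinations (itemsets) that appear in at least 0.5% of all transactions. Since 0.5% of 288 transactions is 1.44, a combination effectively needs to show up in at least two users’ baskets. After lots of trial and error with the support value, I concluded that there is no significant relation between users’ preferences for different shops: even the strongest rules above rest on only two or three users. I presume most users tend to shop from the same shop.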
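
To see where the numbers in the rules table come from, the sketch below recomputes the metrics for the rule {76} => {50} by hand. The inputs are read off the arules output above. The variable names are my own, and the count for shop 50 is backed out from the reported lift rather than taken from the data directly.

```r
# Values read from the arules output for rule {76} => {50}
n_transactions <- 288          # "288 transaction(s)"
count_both     <- 3            # baskets containing both shop 76 and shop 50
coverage       <- 0.020833333  # share of baskets containing shop 76
lift           <- 11.07692

# support = baskets with both shops / all baskets
support <- count_both / n_transactions

# coverage * n gives the number of baskets containing the left-hand side
count_lhs <- round(coverage * n_transactions)

# confidence = baskets with both shops / baskets with the left-hand side
confidence <- count_both / count_lhs

# lift = confidence / support(rhs), so support(rhs) can be backed out
support_rhs <- confidence / lift
count_rhs   <- round(support_rhs * n_transactions)

# Minimum support expressed as a number of baskets
min_count <- 0.005 * n_transactions

print(c(support = support, confidence = confidence,
        count_lhs = count_lhs, count_rhs = count_rhs, min_count = min_count))
```

So the rule says that 3 of the 6 users who ordered from shop 76 also ordered from shop 50. The lift looks large mainly because shop 50 appears in only about 13 of the 288 baskets, which is worth keeping in mind when the counts are this small.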

Conclusion

Even though the results from this dataset are somewhat anti-climactic, we saw how a market basket analysis can be used to draw conclusions about users’ shopping behaviour. It is important for businesses to allocate their resources based on consumer behaviour.