The data comes from https://www.kaggle.com/datasets/amritdhaliwal/shopify-shoe-store-data. It records orders placed by users at different shoe stores. We are going to investigate how users choose which stores to buy shoes from.
First, let’s load the dataset and take a look:
shoe_data <- read.csv("shoe_data.csv")
shoe_data <- shoe_data[1:1000, ] # keep the first 1000 orders; my memory is not enough to mine rules on more :)
head(shoe_data)
## order_id shop_id user_id order_amount total_items payment_method
## 1 1 53 746 224 2 cash
## 2 2 92 925 90 1 cash
## 3 3 44 861 144 1 cash
## 4 4 18 935 156 1 credit_card
## 5 5 18 883 156 1 credit_card
## 6 6 58 882 138 1 credit_card
## created_at
## 1 2017-03-13 12:36
## 2 2017-03-03 17:38
## 3 2017-03-14 4:23
## 4 2017-03-26 12:43
## 5 2017-03-01 4:35
## 6 2017-03-14 15:25
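We only want to investigate the relation between user_id and shop_id,
since that’s the scope. To do this, let’s create another data frame with
just those two columns and convert it to the format the arules package
expects with the as function, treating the shops each user ordered from
as one transaction: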
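library(dplyr)  # for %>% and select()
library(arules) # for the "transactions" class and apriori()
basket_data <- shoe_data %>%
  select(user_id, shop_id)
# One transaction per user: the distinct shops that user ordered from.
# unique() drops repeat orders at the same shop, which arules treats as duplicated items.
transactions <- as(lapply(split(basket_data$shop_id, basket_data$user_id), unique), "transactions")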
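To make the grouping step concrete, here is a minimal base R sketch of what split produces before the conversion to transactions. The toy data frame is made up for illustration and is not taken from shoe_data.csv; the unique() call is an added assumption that repeat orders at the same shop should count once.

```r
# Toy orders (hypothetical values, for illustration only)
toy <- data.frame(
  user_id = c(1, 1, 2, 3, 3, 3),
  shop_id = c(10, 20, 10, 20, 20, 30)
)

# split() groups the shop_id values by user_id and returns a named list;
# unique() keeps each shop once per user
baskets <- lapply(split(toy$shop_id, toy$user_id), unique)

# Show the resulting list: one element (basket) per user
str(baskets)
```

Each list element then becomes one row of the transactions object, so the number of transactions equals the number of distinct users, not the number of orders.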
Performing a quick exploratory analysis to understand the distribution of transactions yields:
hist(basket_data$shop_id, main = "Histogram of Transactions by Shop", xlab = "shop_id", col = viridis::viridis(10)) # hist() has no "type" argument
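Because hist treats shop_id as a number and bins neighbouring IDs together, a per-shop count can be easier to read. A minimal base R sketch, using only the six shop_id values printed by head above (so the counts are illustrative, not the full picture); the variable names are my own:

```r
# shop_id values of the six rows shown by head(shoe_data)
head_shop_id <- c(53, 92, 44, 18, 18, 58)

# Count orders per shop and sort from most to least frequent
orders_per_shop <- sort(table(head_shop_id), decreasing = TRUE)

# One bar per shop instead of one bar per range of IDs
barplot(orders_per_shop, main = "Orders per shop", xlab = "shop_id", ylab = "Orders")
```

On the real data the same two lines would apply to basket_data$shop_id, possibly restricted to the most frequent shops to keep the plot legible.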
Now, we’ll perform the market basket analysis using the Apriori algorithm:
rules <- apriori(transactions, parameter = list(supp = 0.005, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[100 item(s), 288 transaction(s)] done [0.00s].
## sorting and recoding items ... [99 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(rules, by = "lift"))
## lhs rhs support confidence coverage lift count
## [1] {29, 45} => {15} 0.006944444 1.0 0.006944444 32.00000 2
## [2] {51, 60} => {61} 0.006944444 1.0 0.006944444 32.00000 2
## [3] {60, 61} => {51} 0.006944444 1.0 0.006944444 28.80000 2
## [4] {15, 29} => {45} 0.006944444 1.0 0.006944444 26.18182 2
## [5] {15, 45} => {29} 0.006944444 1.0 0.006944444 24.00000 2
## [6] {51, 61} => {60} 0.006944444 1.0 0.006944444 18.00000 2
## [7] {76} => {50} 0.010416667 0.5 0.020833333 11.07692 3
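The columns in this table are tied together by simple arithmetic, which helps when judging how much evidence a rule carries. The sketch below works through rule [7], {76} => {50}, using only numbers printed above; it assumes the standard definitions (support = count / transactions, confidence = support / coverage, lift = confidence / support of the rhs), and the variable names are my own.

```r
n_transactions <- 288       # from the apriori log
rule_count     <- 3         # "count" column: users with both shop 76 and shop 50
confidence     <- 0.5       # "confidence" column
lift           <- 11.07692  # "lift" column

# support: share of users whose basket contains both shops
support <- rule_count / n_transactions

# coverage: share of users whose basket contains the lhs {76}
coverage  <- support / confidence
lhs_count <- round(coverage * n_transactions)

# lift = confidence / support(rhs), so the rhs support can be backed out
rhs_support <- confidence / lift
rhs_count   <- round(rhs_support * n_transactions)

print(c(support = support, coverage = coverage))
print(c(lhs_count = lhs_count, rhs_count = rhs_count))
```

Read this way, the rule says that of the few users who ordered from shop 76, half also ordered from shop 50, which is a very small sample to generalise from.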
Setting support to 0.005 means that we only keep itemsets that appear in at least 0.5% of all transactions; with 288 transactions (one per user), that is at least 2 users. Every rule above rests on just 2 or 3 users, so even the large lift values are weak evidence. After lots of trial and error with the support value, I concluded that there is no significant relationship between users’ preferences for different shops. I presume most users tend to buy from the same shop.
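The presumption about users sticking to one shop can be checked directly by counting distinct shops per user. A minimal sketch on made-up toy data (not the real dataset, so the numbers below say nothing about shoe_data.csv); on the real data the same calls would take basket_data$shop_id and basket_data$user_id:

```r
# Toy orders (hypothetical values, for illustration only)
toy_orders <- data.frame(
  user_id = c(1, 1, 2, 3, 3, 3),
  shop_id = c(10, 20, 10, 20, 20, 30)
)

# Number of distinct shops each user ordered from
shops_per_user <- tapply(
  toy_orders$shop_id, toy_orders$user_id,
  function(x) length(unique(x))
)

# How many users have 1, 2, ... distinct shops
print(table(shops_per_user))

# Share of users who ordered from a single shop only
single_shop_share <- mean(shops_per_user == 1)
print(single_shop_share)
```

A share close to 1 would support the presumption; a low share would instead suggest that users spread their orders across shops without any shop pairs recurring often enough to pass the support threshold.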
Even though the results for this dataset are anticlimactic, we saw how a market basket analysis can yield conclusions about users’ shopping behaviour. Such conclusions matter because businesses need to allocate their resources based on consumer behaviour.