1. Introduction

This workbook explores a dataset pertaining to the transactions of a non-store online retailer based in the United Kingdom. This workbook will firsty perform basic data exploration of the dataset and implement add value models focusing on CRM management and consumer purchasing patterns.

Variable Summary:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.

1.2 Goals for this project

Analyze data initially to derive insights that typically add value in an online retail context. Features such as stock code, unit price, customer id etc will allow us answer such question as which SKUs and customers drive revenue the most and when are the busiest times of the year for the retailer.

The features present will allow us segment customers by performing a Recency, Frequency and Monetary(RFM) based on their past purchasing behavior. Based on the customer’s segmentation the most optimal add value action can be executed in relation to each respective customer.

Finally, a market basket analysis will be performed to identify which SKUs are frequently purchased together.Knowledge of this will allow for revenue to be maximised by upselling.

1.3 Load data and view variable summaries.

retail <- read.csv('C:\\Users\\Finta\\Downloads\\OnlineRetail.csv\\OnlineRetail.csv')
summary(retail)

##   InvoiceNo          StockCode         Description           Quantity        
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.00  
##                                                           Mean   :     9.55  
##                                                           3rd Qu.:    10.00  
##                                                           Max.   : 80995.00  
##                                                                              
##  InvoiceDate          UnitPrice           CustomerID       Country         
##  Length:541909      Min.   :-11062.06   Min.   :12346    Length:541909     
##  Class :character   1st Qu.:     1.25   1st Qu.:13953    Class :character  
##  Mode  :character   Median :     2.08   Median :15152    Mode  :character  
##                     Mean   :     4.61   Mean   :15288                      
##                     3rd Qu.:     4.13   3rd Qu.:16791                      
##                     Max.   : 38970.00   Max.   :18287                      
##                                         NA's   :135080

2. Preprocessing

Add additional variables to the data set and standardise the variable names. The additional variables will allow us analyse the data better and will prove useful when building our respective models.

retail <- retail %>% set_colnames(names(.) %>% to_snake_case())


retail <- retail %>% mutate(
  total_spend = unit_price * quantity, 
  customer_id = as.character(customer_id),
  invoice_hour = as.numeric(format(strptime(invoice_date, "%m/%d/%Y %H:%M"),format = "%H")),
  invoice_date = as.Date(invoice_date, format = '%m/%d/%Y'),
  invoice_month = as.factor(format(invoice_date,"%m")),
  invoice_day = as.factor(format(invoice_date,"%A")), 
  invoice_week = as.factor(format(invoice_date,"%V")),
  refund = str_detect(retail$invoice_no,"^C"),
)

New Variable Notes:
- refund - Denotes whether observation was canceled transaction and money refunded. Refunds can be identified by having the letter C as the first letter of an observation’s invoice number value.
- total_spend - Total revenue for each respective transaction.

2.1 Remove obvious errors

Some obvious errors exist such as an extremely large order quantity input which was canceled. Some empty transactions exist which have a unit price of zero and customer id value. The value these observations bring is minimal so are removed.

#Remove obvious transaction errors
transaction_errors <- retail %>% filter(total_spend < -5000 | total_spend > 5000) %>%
  pull(total_spend)
`%notin%` <- Negate(`%in%`)
retail <- retail %>% filter(total_spend %notin% transaction_errors)

# Remove 'empty' transactions: Total spend equals 0.00
retail <- retail %>% filter(total_spend != 0.00)

2.2 Non-transactional records

Using stock code and description variables we can see the rows that contain non-transactional information such as samples.

# Non-transactional stock code & description values

lapply(1:9, function(x){
  j <- retail[which(nchar(retail$stock_code) == x),]
  head(table(j$description))
})

## [[1]]
## 
## Discount   Manual  SAMPLES 
##       77      563       63 
## 
## [[2]]
## 
## CARRIAGE 
##      143 
## 
## [[3]]
## 
## DOTCOM POSTAGE 
##            707 
## 
## [[4]]
## 
##            CRUK Commission PADS TO MATCH ALL CUSHIONS 
##                         16                          3 
##                    POSTAGE 
##                       1250 
## 
## [[5]]
## 
##  50'S CHRISTMAS GIFT BAG LARGE              DOLLY GIRL BEAKER 
##                            130                            181 
##    I LOVE LONDON MINI BACKPACK    I LOVE LONDON MINI RUCKSACK 
##                             88                              1 
##        NINE DRAWER OFFICE TIDY     OVAL WALL MIRROR DIAMANTE  
##                             34                            161 
## 
## [[6]]
## 
##     4 PURPLE FLOCK DINNER CANDLES  3 BLACK CATS W HEARTS BLANK CARD 
##                                41                                 3 
##   3 GARDENIA MORRIS BOXED CANDLES       3 ROSE MORRIS BOXED CANDLES 
##                                89                               174 
## 3 WHITE CHOC MORRIS BOXED CANDLES      3D DOG PICTURE PLAYING CARDS 
##                               125                                90 
## 
## [[7]]
## 
## EDWARDIAN PARASOL BLACK 
##                     388 
## 
## [[8]]
## 
##          BOXED GLASS ASHTRAY               BOYS PARTY BAG 
##                            4                           11 
##        CAMOUFLAGE DOG COLLAR   HAYNES CAMPER SHOULDER BAG 
##                            1                            1 
##        OOH LA LA DOGS COLLAR SUNJAR LED NIGHT NIGHT LIGHT 
##                            1                            2 
## 
## [[9]]
## 
##      AMAZON FEE GIRLS PARTY BAG 
##               8              13

Counting the most common stock codes according to the number of characters in it. Stock codes with values of D,M,S represent discounts, manual imput and samples respectively.

Stock code characters = 1: Discount, Manual input or Sample.
Stock code characters = 2: ‘carriage’ primarily pertaining to transactions to Ireland.
Stock code characters = 3: Dotcom Postage.
Stock code characters = 4: Postage.
Stock code characters = 7: Unsaleable, destroyed.
Stock code characters = 9: Amazon Fee

# Amazon fee
amazon_fee <- retail[which(retail$stock_code == 'AMAZONFEE'),]
amazon_fee_total <- amazon_fee %>% 
  summarise(total_amazon_fee = round(sum(total_spend))) %>%
  pull(total_amazon_fee) %>%
  comma()
# Total Amazon Fee cost

Outgoing expense labeled ‘AMAZONFEE’ totals £-20,207

# Postage
postage_fee <- retail[which(retail$stock_code == 'POST'),]
postage_fee_total <- postage_fee %>%
  summarise(total_postage_fee = round(sum(total_spend))) %>%
  pull(total_postage_fee) %>%
  comma()

Revenue labeled ‘POST’ recorded in primarily non UK transactions totals £66,231.

#Remove chosen values
retail <- retail %>% filter(stock_code %notin% c('S','M'))
retail <- retail[-which(retail$stock_code == 'AMAZONFEE'),]
retail <- retail[-which(retail$stock_code == 'POST'),]

Removing transactions pertaining to stock code values for samples, manual input, amazon fees and postage.

2.3 Refunds

refund_per_transaction <- round(mean(retail$refund == 'TRUE'),3)
refund_per_transaction

## [1] 0.016

Total number of refunds per item bought is 0.016

refund_invoice <-retail %>% filter(refund == 'TRUE') %>% 
  group_by(invoice_no,customer_id) %>%
  summarise(refund_total = sum(total_spend)) %>%
  mutate(refund_total = abs(refund_total)) # Data frame containing refunded invoices only. 

sale_invoice <- retail %>% filter(refund == 'FALSE') %>% 
  group_by(invoice_no,customer_id) %>%
  summarise(refund_total = sum(total_spend)) # Data frame containing sales invoices only. 


# Match refunds with sales based off total invoice amount and customer id
sale_refund_df <- sale_invoice %>% inner_join(refund_invoice,by = c('customer_id','refund_total'))
sale_refund_invoices <- c(sale_refund_df$invoice_no.x,sale_refund_df$invoice_no.y) 
# Combining sales & refund invoice numbers in one vector

retail %>% filter(invoice_no == '537217' | invoice_no == 'C537406') %>% select(-c(invoice_date,invoice_hour,invoice_month, invoice_day,invoice_week))

##   invoice_no stock_code                    description quantity unit_price
## 1     537217      22849   BREAD BIN, DINER STYLE, MINT        4      14.95
## 2     537217      22847  BREAD BIN, DINER STYLE, IVORY        4      14.95
## 3     537217      22927 GREEN GIANT GARDEN THERMOMETER        4       5.95
## 4     537217      22926 IVORY GIANT GARDEN THERMOMETER        4       5.95
## 5    C537406      22926 IVORY GIANT GARDEN THERMOMETER       -4       5.95
## 6    C537406      22927 GREEN GIANT GARDEN THERMOMETER       -4       5.95
## 7    C537406      22847    BREAD BIN DINER STYLE IVORY       -4      14.95
## 8    C537406      22849     BREAD BIN DINER STYLE MINT       -4      14.95
##   customer_id        country total_spend refund
## 1       15502 United Kingdom        59.8  FALSE
## 2       15502 United Kingdom        59.8  FALSE
## 3       15502 United Kingdom        23.8  FALSE
## 4       15502 United Kingdom        23.8  FALSE
## 5       15502 United Kingdom       -23.8   TRUE
## 6       15502 United Kingdom       -23.8   TRUE
## 7       15502 United Kingdom       -59.8   TRUE
## 8       15502 United Kingdom       -59.8   TRUE

retail <- retail %>% filter(invoice_no %notin% sale_refund_invoices) 
# Removing invoice numbers pertaining to initial transaction and refund.

Some canceled sales invoices can be matched up with their sales equivalent by matching the customer id and if the absolute value of each invoice is identical.

2.4 Missing values

#Missing Values
lapply(retail, function(x){
  missing <- is.na(x)
  length(which(missing == 'TRUE'))
}) #Function counting the number of missing values in each variable

## $invoice_no
## [1] 0
## 
## $stock_code
## [1] 0
## 
## $description
## [1] 0
## 
## $quantity
## [1] 0
## 
## $invoice_date
## [1] 0
## 
## $unit_price
## [1] 0
## 
## $customer_id
## [1] 132118
## 
## $country
## [1] 0
## 
## $total_spend
## [1] 0
## 
## $invoice_hour
## [1] 0
## 
## $invoice_month
## [1] 0
## 
## $invoice_day
## [1] 0
## 
## $invoice_week
## [1] 0
## 
## $refund
## [1] 0

customer_id is the only variable with missing values

#customer_id missing values
mean(is.na(retail$customer_id))

## [1] 0.2470442

Proportion of missing values in customer_id variable.

# Total spend comparing existing and missing customer ids
m1 <- retail %>% mutate(missing_cust_id = is.na(customer_id)) %>%
  ggplot(aes(total_spend, fill = missing_cust_id)) +
  geom_histogram() +
  scale_y_continuous(trans = "log10",labels = comma) +
  scale_x_continuous(trans = "log2",labels = comma) +
  xlab("Total Spend") +
  ylab("Count") +
  ggtitle("Total Spend by missing and existing customer id") +
  guides(fill=guide_legend('Missing Customer Id')) 


# Daily transactions comparing exisiting and missing customer ids
m2 <-retail %>% mutate(missing_cust_id = is.na(customer_id)) %>%
  ggplot(aes(invoice_day, fill = missing_cust_id,label = )) +
  geom_histogram(stat = "count") +
  xlab("Day") +
  ylab("Transaction Count") +
  ggtitle("Transactions by day of the week") +
  guides(fill=guide_legend('Missing Customer Id'))

# Country transactions comparing existing and missing customer ids
m3 <-retail %>% mutate(missing_cust_id = is.na(customer_id)) %>%
  ggplot(aes(country, fill = missing_cust_id)) +
  geom_histogram(stat = "count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  guides(fill=guide_legend('Missing Customer Id'))

# Month transactions comparing existing and missing customer ids
m4 <- retail %>% mutate(missing_cust_id = is.na(customer_id)) %>%
  ggplot(aes(invoice_month, fill = missing_cust_id)) +
  geom_histogram(stat = "count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  xlab("Month") +
  ylab("Transaction Count") +
  ggtitle("Transactions by month") +
  guides(fill=guide_legend('Missing Customer Id'))

# Epoch transactions comparing existing and missing customer ids
m5 <- retail %>% mutate(missing_cust_id = is.na(customer_id)) %>%
  ggplot(aes(invoice_week, fill = missing_cust_id)) +
  geom_histogram(stat = "count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  guides(fill=guide_legend('Missing Customer Id'))

grid.arrange(m1,m2,m4)

No discernible reason as to why missing customer id values exist.

3. Exploring Variables

3.1 Invoice_Number

Each individual invoice number represents the purchase(s) made by a customer on a given occasion.

unique_invoices <- comma(length(unique(retail$invoice_no)))
# Number of transactions

A total of 22,801 unique invoices contained in this data set

# Total spend per invoice
invoice_spend <- retail %>% group_by(invoice_no) %>%
  summarise(total_invoice_amount = sum(total_spend)) %>% 
  arrange(desc(total_invoice_amount))
summary(invoice_spend$total_invoice_amount)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -16459.00     97.09    250.60    435.13    439.23  52940.94

Difference between median and mean caused by a number of high revenue generating invoices.

# Total spend per invoice histogram
invoice_hist <- invoice_spend %>%
  ggplot(aes(total_invoice_amount)) +
  geom_histogram(bins = 10) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  geom_vline(xintercept = mean(invoice_spend$total_invoice_amount),        # Add line for mean
             col = "green",
             lwd = 3) +
  geom_vline(xintercept = median(invoice_spend$total_invoice_amount),        # Add line for median
             col = "red",
             lwd = 3) +
  xlab("Total Invoice Amount") +
  ylab("Count") +
  ggtitle("Invoice Amount Histogram")

invoice_hist_log <- invoice_spend %>%
  ggplot(aes(total_invoice_amount)) +
  geom_histogram(bins = 10) +
  scale_x_continuous(trans = 'log2', labels = comma) +
  scale_y_continuous(trans = "log2",labels = comma) +
  geom_vline(xintercept = mean(invoice_spend$total_invoice_amount),       
             col = "green",
             lwd = 3) +                                                     # Add line for mean
  geom_vline(xintercept = median(invoice_spend$total_invoice_amount),        
             col = "red",
             lwd = 3) +                                                     # Add line for median
  xlab("Total Invoice Amount") +
  ylab("Count") +
  ggtitle("Invoice Amount Histogram - Log 2") 

grid.arrange(invoice_hist,invoice_hist_log)

: Plot 1: Right skewed distribution of the total invoice amounts; a relatively small number of invoices have much larger total amounts than the median invoice amount. Plot 2: Scaled distribution of total invoice amount using log2.

Green Line = Mean: 435
Red Line = Median: 250.6

invoice_item_quantities <- retail %>% group_by(invoice_no) %>%
 summarise(quantities_bought = sum(quantity)) %>%
  arrange(desc(quantities_bought))
summary(invoice_item_quantities$quantities_bought)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -9360.0    35.0   124.0   232.2   264.0 15049.0

invoice_item_quantities %>% ggplot(aes('',quantities_bought)) +
  geom_boxplot() +
  xlab("") +
  ylab("Quanties per Invoice") +
  ggtitle("Quantity of Items per Invoice")

: Summarized distribution of quantity of items bought per invoice.

# number of items bought per transactions

invoice_quantities <- invoice_item_quantities %>%
  ggplot(aes(quantities_bought)) +
  geom_histogram() +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  geom_vline(xintercept = mean(invoice_item_quantities$quantities_bought),        # Add line for mean
             col = "green",
             lwd = 3) +
  geom_vline(xintercept = median(invoice_item_quantities$quantities_bought),        # Add line for median
             col = "red",
             lwd = 3) +
  xlab("Quantities Bought Per Invoice") +
  ylab("Count") +
  ggtitle("Quantities Per Invoice Histogram") 

invoice_quantities_log <- invoice_item_quantities %>%
  ggplot(aes(quantities_bought)) +
  geom_histogram() +
  scale_x_continuous(trans = 'log2',labels = comma) +
  scale_y_continuous(trans = "log2",labels = comma) +
  geom_vline(xintercept = mean(invoice_item_quantities$quantities_bought),        # Add line for mean
             col = "green",
             lwd = 3) +
  geom_vline(xintercept = median(invoice_item_quantities$quantities_bought),        # Add line for median
             col = "red",
             lwd = 3) +
  xlab("Quantities Bought Per Invoice") +
  ylab("Count") +
  ggtitle("Quantities Per Invoice Histogram - Log2") 

grid.arrange(invoice_quantities,invoice_quantities_log)

3.2 Stock Code

## stock_code
sku_num <- comma(length(unique(retail$stock_code))) #Number of SKUS

Number of unique SKUs is 3,930

# SKU sales count
sku_sales_count <- retail %>% filter(nchar(retail$stock_code) == 6 |nchar(retail$stock_code) == 7 | nchar(retail$stock_code) == 5) %>%
  count(stock_code)
summary(sku_sales_count$n)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    17.0    66.0   136.5   171.0  2290.0

Summary of individual SKU sales count

*Note: Sales count in this context denotes how often an indivual stock code was included in an invoice.

#SKU sales distribution
sku_sales_count  %>%
  arrange(desc(n)) %>%
  ggplot(aes(x = reorder(stock_code,n),n)) +
  geom_bar(stat = 'identity',position="dodge") +
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) +
  coord_flip() +
  xlab("Stock Code") +
  ylab("SKU Invoice Inclusion") +
  ggtitle("SKU Invoice Inclusion Count") +
  scale_y_continuous(labels = comma)# Conforms strongly to the power rule

SKU purchase distribution conforms strongly to the power rule; a small number of SKUs account for a large percentage of overall SKUs sold.

# SKU sales distribution by quantile - Mosaic graph
sku_sales_count  %>%
  mutate(stock_quant = ntile(rank(n),10)) %>%
  group_by(stock_quant) %>%
  summarise(total_quantile_count = sum(n)) %>%
  mutate(items_sold_proportion = round(total_quantile_count/sum(total_quantile_count),2)) %>%
  ggplot(aes(area = total_quantile_count, fill = stock_quant, label = paste0('Q',stock_quant,'\n\n',items_sold_proportion)))+
  geom_treemap() +
  geom_treemap_text(colour = "Black",
                    place = "centre",
                    size = 20,
                    show.legend = TRUE) +
  theme(legend.position = "none") +
  scale_fill_viridis(option="mako") +
  ggtitle("Quantile Sales Count Proportion")

: Each SKU is allocated a quantile based on its respective sales count with the top selling SKUs assigned to Q10.

The top 10% selling SKUs account for 45% of total SKUs sold.
The top 20% selling SKUs account for 65% of total SKUs sold.

# Most popular skus
popular_stock_code <- sku_sales_count %>% slice_max(n,n = 20)
popular_stock_code <- popular_stock_code %>% left_join(retail) %>% group_by(stock_code,description,n) %>%
    summarise(total_spend = sum(total_spend)) %>%
  filter(total_spend > 0) %>%
  arrange(desc(n)) %>% 
  rename('total_transaction_number' = n)
popular_stock_code[1:2,]

## # A tibble: 2 x 4
## # Groups:   stock_code, description [2]
##   stock_code description                        total_transaction_n~ total_spend
##   <chr>      <chr>                                             <int>       <dbl>
## 1 85123A     CREAM HANGING HEART T-LIGHT HOLDER                 2290        179.
## 2 85123A     WHITE HANGING HEART T-LIGHT HOLDER                 2290      97610.

Some stock codes have more than one corresponding description as can be seen above with the highest selling SKU.

popular_stock_code %>%
  arrange(total_transaction_number) %>%
  ggplot(aes(total_transaction_number,fct_inorder(description))) +
  geom_bar(stat = 'identity') +
  xlab("Count of SKUs included in an invoice") +
  ylab("SKU Description") +
  ggtitle("Most Popular SKUs") +
  scale_x_continuous(labels = comma)

: Some stock codes have more than one corresponding description as can be seen above with the highest selling SKU.

# Most profitable skus
profitable_stock_code <- popular_stock_code %>% left_join(retail) %>% group_by(stock_code,description,total_transaction_number) %>%
    summarise(total_spend = sum(total_spend)) %>%
  filter(total_spend > 0) %>%
  arrange(desc(total_spend))
profitable_stock_code

## # A tibble: 26 x 4
## # Groups:   stock_code, description [26]
##    stock_code description                          total_transactio~ total_spend
##    <chr>      <chr>                                            <int>       <dbl>
##  1 22423      "REGENCY CAKESTAND 3 TIER"                        2163     162625.
##  2 47566      "PARTY BUNTING"                                   1714      98244.
##  3 85123A     "WHITE HANGING HEART T-LIGHT HOLDER"              2290      97610.
##  4 85099B     "JUMBO BAG RED RETROSPOT"                         2142      92356.
##  5 22086      "PAPER CHAIN KIT 50'S CHRISTMAS "                 1195      63486.
##  6 84879      "ASSORTED COLOUR BIRD ORNAMENT"                   1495      58960.
##  7 22386      "JUMBO BAG PINK POLKADOT"                         1240      41600.
##  8 22720      "SET OF 3 CAKE TINS PANTRY DESIGN "               1461      37428.
##  9 22960      "JAM MAKING SET WITH JARS"                        1218      36116.
## 10 20725      "LUNCH BAG RED RETROSPOT"                         1629      34897.
## # ... with 16 more rows

Highest revenue generating SKUs

profitable_stock_code %>%
  arrange(total_spend) %>%
  ggplot(aes(total_spend,fct_inorder(description))) +
  geom_bar(stat = 'identity') +
  xlab("Total Revenue") +
  ylab("SKU Description") +
  ggtitle("Most Profitable SKUs") +
  scale_x_continuous(labels = comma)

# Non performing SKUs
non_performing_skus <- sku_sales_count %>% filter(n < 4) %>%
  left_join(retail) %>%
  filter(total_spend < 50) %>%
  group_by(stock_code,description) %>%
  summarise(stock_code_revenue = sum(total_spend),number_of_basket_entries = n())
non_performing_skus_nrow <- non_performing_skus %>% nrow()

367 SKUs have a total revenue of under £50.

# frequently returned stock items 
retail %>% filter(nchar(retail$stock_code) == 6 |nchar(retail$stock_code) == 7) %>%
  filter(refund == 'TRUE') %>%
  count(stock_code) %>%
  arrange(desc(n)) %>%
  top_n(6)

##   stock_code  n
## 1     85099B 37
## 2     85123A 35
## 3     75049L 20
## 4     82494L 13
## 5     85159B 13
## 6     79191C 12

3.3 Quantity

summary(retail$quantity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -9360.0     1.0     3.0     9.9    10.0  4800.0

#Distribution of SKU quantities purchased in a transaction
retail %>% 
  filter(quantity > 0) %>%
  ggplot(aes(quantity)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(trans = 'log2',labels = comma) +
  geom_vline(xintercept = mean(retail$quantity),        # Add line for mean
             col = "green",
             lwd = 3) +
    geom_vline(xintercept = median(retail$quantity),        # Add line for median
             col = "red",
             lwd = 3) +
  xlab("Quantities of SKUs bought") +
  ylab("Count") +
  ggtitle("SKU Quantities Bought") +
  scale_y_continuous(labels = comma)

: Green Line = Mean: 10
Red Line = Median: 3

quantities_sold_per_invoice <- retail %>% group_by(invoice_no) %>%
  summarise(total_quantities_sold = sum(quantity))
quantities_sold_per_invoice %>% ggplot(aes(total_quantities_sold)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(trans = 'log2',labels = comma) +
  xlab("Quantity Per Invoice") +
  ylab("Count") +
  ggtitle("Quantity Count Per Invoice") +
  scale_y_continuous(labels = comma)

3.4 Customer ID

mean(is.na(retail$customer_id))

## [1] 0.2470442

Total proportion of missing ids is 0.2470442

length(unique(retail$customer_id))

## [1] 4352

Total number of unique customer ids is 4352

customer_spend <- retail %>% 
  filter(!is.na(customer_id)) %>%
  group_by(customer_id) %>%
  summarise(total_customer_spend = sum(total_spend)) %>%
  arrange(desc(total_customer_spend)) %>%
  mutate(proportion_of_total_spend = total_customer_spend/sum(total_customer_spend))

customer_spend %>% top_n(10)

## # A tibble: 10 x 3
##    customer_id total_customer_spend proportion_of_total_spend
##    <chr>                      <dbl>                     <dbl>
##  1 14646                    278572.                   0.0339 
##  2 18102                    259657.                   0.0316 
##  3 17450                    182506.                   0.0222 
##  4 14911                    128222.                   0.0156 
##  5 12415                    123638.                   0.0150 
##  6 14156                    114505.                   0.0139 
##  7 17511                     88138.                   0.0107 
##  8 16684                     65892.                   0.00801
##  9 13694                     62677.                   0.00762
## 10 15311                     59444.                   0.00723

Highest revenue customers and their proportion of overall revenue.

# Customer spend by quantile
customer_spend %>%
  mutate(quantile = ntile(rank(total_customer_spend),10)) %>%
  group_by(quantile) %>%
  summarise(total_spend = sum(total_customer_spend)) %>%
  mutate(percent_of_sales = total_spend/sum(total_spend),percent_of_sales = scales::percent(percent_of_sales),total_spend = total_spend/1000000) %>%
  ggplot(aes(area = total_spend,fill = quantile,label = paste(percent_of_sales, round(total_spend,digits = 2),paste('Q',quantile), sep = "\n"))) +
  geom_treemap() +
  geom_treemap_text(colour = "Black",
                    place = "centre",
                    size = 20,
                    show.legend = TRUE) +
  theme(legend.position = "none") +
  scale_fill_viridis(option="mako")

: Line 1: Percentage of Revenue.
Line 2: Revenue in millions.
Line 3: Quantile.

The top 20% of customers account for around 73% of total revenue.

3.5 Country

length(unique(retail$country))

## [1] 38

Number of countries sold to.

customer_country <- retail %>% select(customer_id,country) %>%
  group_by(customer_id) %>%
  summarise(country = unique(country))
table(customer_country$country)

## 
##            Australia              Austria              Bahrain 
##                    9                   11                    2 
##              Belgium               Brazil               Canada 
##                   25                    1                    4 
##      Channel Islands               Cyprus       Czech Republic 
##                    9                    7                    1 
##              Denmark                 EIRE   European Community 
##                    9                    4                    1 
##              Finland               France              Germany 
##                   12                   88                   95 
##               Greece            Hong Kong              Iceland 
##                    4                    1                    1 
##               Israel                Italy                Japan 
##                    5                   14                    8 
##              Lebanon            Lithuania                Malta 
##                    1                    1                    2 
##          Netherlands               Norway               Poland 
##                    9                   10                    6 
##             Portugal                  RSA         Saudi Arabia 
##                   20                    1                    1 
##            Singapore                Spain               Sweden 
##                    1                   29                    8 
##          Switzerland United Arab Emirates       United Kingdom 
##                   22                    2                 3935 
##          Unspecified                  USA 
##                    5                    2

Number of customers per country

retail %>% group_by(country) %>%
  summarise(spend_per_country = round(sum(total_spend))) %>%
  mutate(prop = round(spend_per_country/sum(spend_per_country),2)) %>% 
  ggplot(aes(area = spend_per_country, fill = country, label = paste(country,comma(spend_per_country),prop,sep = "\n"))) +
  geom_treemap() +
  geom_treemap_text(colour = "Black",
                    place = "centre",
                    size = 20,
                    show.legend = TRUE) +
  theme(legend.position = "none")

: Line 1: Country.
Line 2: Total Revenue.
Line 3: Proportion of Revenue.

# Non UK Country Spend
retail %>% filter(country != 'United Kingdom') %>%
  group_by(country) %>%
  summarise(spend_per_country = round(sum(total_spend))) %>%
  arrange(spend_per_country) %>%
  ggplot(aes(fct_inorder(country),spend_per_country)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous(trans = 'log2') +
  coord_flip() +
  ylab("Revenue Per Country") +
  xlab("Country") +
  ggtitle("Total Revenue Per Non-UK Country") +
  scale_y_continuous(labels = comma)

# Customers per country
retail %>% group_by(country) %>%
  summarise(no_of_customers = length(unique(customer_id))) %>%
  arrange(no_of_customers) %>%
  ggplot(aes(fct_inorder(country),no_of_customers)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  ylab("Customers Per Country") +
  xlab("Country") +
  ggtitle("Total Customers Per Country") +
  scale_y_continuous(labels = comma)

# Customers per country minus UK
retail %>% filter(country != 'United Kingdom') %>%
  group_by(country) %>%
  summarise(no_of_customers = length(unique(customer_id))) %>%
  arrange(no_of_customers) %>%
  ggplot(aes(fct_inorder(country),no_of_customers)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  ylab("Customers Per Country") +
  xlab("Country") +
  ggtitle("Total Customers Per Non-UK Country") +
  scale_y_continuous(labels = comma)

# Invoices per country
retail %>% group_by(country) %>%
  summarise(invoices_per_country = length(unique(invoice_no))) %>%
  arrange(invoices_per_country) %>%
  ggplot(aes(fct_inorder(country),invoices_per_country)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  ylab("Invoices Per Country") +
  xlab("Country") +
  ggtitle("Total Invoices Per Country") +
  scale_y_continuous(labels = comma)

# Invoices per country minus UK
retail %>% filter(country != 'United Kingdom') %>%
  group_by(country) %>%
  summarise(invoices_per_country = length(unique(invoice_no))) %>%
  arrange(invoices_per_country) %>%
  ggplot(aes(fct_inorder(country),invoices_per_country)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  ylab("Invoices Per Country") +
  xlab("Country") +
  ggtitle("Total Invoices Per Non-UK Country") +
  scale_y_continuous(labels = comma)

4. Time Series

range(retail$invoice_date)

## [1] "2010-12-01" "2011-12-09"

Incomplete calender year for 2011 so dates should range from Dec 2010 to Nov 2011

calender_year_sales <- retail %>% filter(between(invoice_date,as.Date('2010-12-01'),as.Date('2011-11-30')))
range(calender_year_sales$invoice_date)

## [1] "2010-12-01" "2011-11-30"

Remove dates non pertinent to the above range.

epoch_summary <- calender_year_sales %>% 
  group_by(invoice_week) %>%
  summarise(total_weekly_earnings = sum(total_spend),
            weekly_transactions_no = length(unique(invoice_no)))
epoch_summary_ordered <- epoch_summary[c(48:51,1:47),]

epoch_summary_scaled <- epoch_summary_ordered %>%
  mutate(total_weekly_earnings = scale(total_weekly_earnings),
         weekly_transactions_no = scale(weekly_transactions_no))
head(epoch_summary_ordered)

## # A tibble: 6 x 3
##   invoice_week total_weekly_earnings weekly_transactions_no
##   <fct>                        <dbl>                  <int>
## 1 48                         365926.                    874
## 2 49                         308210.                    606
## 3 50                         204954.                    540
## 4 51                          88393.                    199
## 5 01                         127541.                    281
## 6 02                         182902.                    301

head(epoch_summary_scaled)

## # A tibble: 6 x 3
##   invoice_week total_weekly_earnings[,1] weekly_transactions_no[,1]
##   <fct>                            <dbl>                      <dbl>
## 1 48                              2.45                        3.29 
## 2 49                              1.67                        1.31 
## 3 50                              0.265                       0.822
## 4 51                             -1.32                       -1.69 
## 5 01                             -0.789                      -1.09 
## 6 02                             -0.0354                     -0.942

Data frames detailing both scaled and total weekly revenue and number of transactions(invoices recorded)

epoch_summary_scaled %>%
  gather(factor,value,-invoice_week) %>%
  mutate(factor = as.factor(factor)) %>%
  mutate(factor = recode_factor(factor,total_weekly_earnings = 'Weekly Revenue',weekly_transactions_no = 'Weekly Transactions')) %>%
  ggplot(aes(fct_inorder(invoice_week),value,group= factor,color = factor)) +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  guides(color =guide_legend('')) +
  ylab("Z Score") +
  xlab("Invoice Week") +
  ggtitle("Transaction and Revenue Weekly Distribution") +
  scale_y_continuous(labels = comma)

: Weekly revenue heavily correlated with the number of transactions.
Week 37 to 41 sees a divergence from this correlation. Possible that some wholesalers place large revenue generating orders during this window.

epoch_summary_earnings <- epoch_summary_ordered %>% 
  select(invoice_week,total_weekly_earnings) %>%
  gather(total_weekly_earnings,value,-invoice_week) %>%
  ggplot(aes(fct_inorder(invoice_week),value,group = total_weekly_earnings)) +
  geom_point() +
  geom_line() +
  geom_smooth() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylab("Revenue") +
  xlab("Invoice Week") +
  ggtitle("Total Weekly Revenue") +
  scale_y_continuous(labels = comma)
epoch_summary_earnings

: Invoice revenue per week.

# Total sales by month
monthly_earnings <- calender_year_sales %>% group_by(invoice_month) %>%
  summarise(earnings_by_month = sum(total_spend))
monthly_earnings <- monthly_earnings[c(12,1:11),]
monthly_earnings %>%
  ggplot(aes(fct_inorder(invoice_month),earnings_by_month,group = 1)) +
  geom_line() +
  ylab("Revenue") +
  xlab("Invoice Month") +
  ggtitle("Total Montly Revenue") +
  scale_y_continuous(labels = comma)

5. RFM Analysis

arb_date <- max(retail$invoice_date) + 1

rfm_retail <- retail %>% filter(!is.na(customer_id)) %>%
  group_by(invoice_no,customer_id,invoice_date) %>%
  summarise(total_spend = sum(total_spend))

Prepare data frame for RFM analysis by removing rows with missing customer id values.

#RFM Table
rfm_data  <- rfm_table_order(
  data = rfm_retail,
  customer_id = customer_id,
  order_date = invoice_date,
  revenue = total_spend,
  analysis_date = arb_date,
  recency_bins = 5,
  frequency_bins = 5,
  monetary_bins = 5,
)
rfm_data[1:6,]

## # A tibble: 6 x 9
##   customer_id date_most_recent recency_days transaction_count amount
##   <chr>       <date>                  <dbl>             <dbl>  <dbl>
## 1 12347       2011-12-07                  3                 7  4310 
## 2 12348       2011-09-25                 76                 4  1437.
## 3 12349       2011-11-21                 19                 1  1458.
## 4 12350       2011-02-02                311                 1   294.
## 5 12352       2011-11-03                 37                 5  1145.
## 6 12353       2011-05-19                205                 1    89 
## # ... with 4 more variables: recency_score <int>, frequency_score <int>,
## #   monetary_score <int>, rfm_score <dbl>

Assign RFM score to each customer id.

# Assign segments names
segment_names <- c("Champions", "Loyal Customers", "Potential Loyalist",
                   "New Customers", "Promising", "Need Attention", "About To Sleep",
                   "At Risk", "Can't Lose Them", "Hibernating", "Lost")

# Assign classification rules for segments
recency_lower   <- c(4, 2, 3, 4, 3, 3, 2, 1, 1, 2, 1)
recency_upper   <- c(5, 4, 5, 5, 4, 4, 3, 2, 1, 3, 1)
frequency_lower <- c(4, 3, 1, 1, 1, 3, 1, 2, 4, 2, 1)
frequency_upper <- c(5, 4, 3, 1, 1, 4, 2, 5, 5, 3, 1)
monetary_lower  <- c(4, 4, 1, 1, 1, 3, 1, 2, 4, 2, 1)
monetary_upper  <- c(5, 5, 3, 1, 1, 4, 2, 5, 5, 3, 1)

segments <- rfm_segment(rfm_data, segment_names, recency_lower, recency_upper,
                        frequency_lower, frequency_upper, monetary_lower, monetary_upper)

# Combine customer segmentation with main data frame
combined_rfm_df <- retail %>% left_join(segments) %>% 
  select(customer_id,total_spend,segment,rfm_score,recency_score,frequency_score,monetary_score,stock_code)
combined_rfm_df <- combined_rfm_df %>% group_by(customer_id) %>%
  summarise(total_spend = sum(total_spend),segment = unique(segment),rfm_score = unique(rfm_score),
            recency_score = unique(recency_score),frequency_score = unique(frequency_score), monetary_score = unique(monetary_score))
combined_rfm_df

## # A tibble: 4,352 x 7
##    customer_id total_spend segment         rfm_score recency_score frequency_score
##    <chr>             <dbl> <chr>               <dbl>         <int>           <int>
##  1 12347             4310  Champions             555             5               5
##  2 12348             1437. Loyal Customers       244             2               4
##  3 12349             1458. Others                414             4               1
##  4 12350              294. Others                112             1               1
##  5 12352             1145. Loyal Customers       344             3               4
##  6 12353               89  Lost                  111             1               1
##  7 12354             1079. Others                114             1               1
##  8 12355              459. Others                112             1               1
##  9 12356             2487. Loyal Customers       435             4               3
## 10 12357             6208. Others                315             3               1
## # ... with 4,342 more rows, and 1 more variable: monetary_score <int>

# Count of customers per section
combined_rfm_df %>% filter(!is.na(customer_id)) %>%
  group_by(segment) %>%
  summarise(count = n(), spend = round(sum(total_spend)))

## # A tibble: 8 x 3
##   segment            count   spend
##   <chr>              <int>   <dbl>
## 1 About To Sleep       317   74833
## 2 At Risk              552  530908
## 3 Champions            957 5561999
## 4 Lost                 319   39460
## 5 Loyal Customers      365  673955
## 6 Need Attention       121   83678
## 7 Others               724  882363
## 8 Potential Loyalist   996  380382

segment_count <- combined_rfm_df %>% filter(!is.na(customer_id)) %>%
  group_by(segment) %>%
  summarise(count = n(), spend = sum(total_spend)) %>%
    arrange(count) %>%
  ggplot(aes(count,fct_inorder(segment))) +
  geom_bar(stat = 'identity') +
  ylab("Segment") +
  xlab("Count") +
  ggtitle("Total Segment Count") +
  scale_x_continuous(labels = comma)

segment_spend <- combined_rfm_df %>% filter(!is.na(customer_id)) %>%
  group_by(segment) %>%
  summarise(count = n(), spend = sum(total_spend)) %>%
    arrange(spend) %>%
  ggplot(aes(spend,fct_inorder(segment))) +
  geom_bar(stat = 'identity') +
  ylab("Segment") +
  xlab("Revenue") +
  ggtitle("Total Segment Revenue") +
  scale_x_continuous(labels = comma)
  
grid.arrange(segment_count,segment_spend)

6. Market Basket Analysis

6.1 Feature: Description

Model to identify what products are likely to be purchased together and be used to develop or optimize cross selling initiatives on retail sites.

# Preprocess data for basket analysis - Description
retail_basket <- retail %>% filter(refund == 'FALSE') %>%
  select(invoice_no,description)
retail_basket <- retail_basket %>% group_by(invoice_no) %>%
  mutate(id = cur_group_id()) %>% 
  ungroup() %>%
  select(id, description) # Assign basket variable as a key.

# Convert data to class transaction
retail_list <- split(retail$description,retail_basket$id) # Group items according the basket they belong to
retail_trans <- as(retail_list,'transactions') #Convert list to transactions
retail_rules <- apriori(retail_trans, parameter=list(supp=0.007, conf=0.4)) # Create rule object

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.007      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 136 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4033 item(s), 19503 transaction(s)] done [0.29s].
## sorting and recoding items ... [1187 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.13s].
## writing ... [1962 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

# Visualise rules
plot(retail_rules)

: Support: The percentage of transactions that contain all of the items in an itemset.
Confidence: The probability that a transaction that contains the items on the left hand side of the rule also contains the item on the right hand side.
Lift: The increase in likelihood of the right hand side item of the rule being purchased given the left hand side item(s) of the rule are purchased.

# Inspect rules
inspect(head(sort(retail_rules,by='lift'),35))

##      lhs                        rhs                        support confidence    coverage     lift count
## [1]  {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER MINT,                                                                                 
##       HERB MARKER THYME}     => {HERB MARKER PARSLEY}  0.007075834  0.9787234 0.007229657 80.20186   138
## [2]  {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER ROSEMARY,                                                                             
##       HERB MARKER THYME}     => {HERB MARKER PARSLEY}  0.007537302  0.9735099 0.007742399 79.77464   147
## [3]  {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER MINT,                                                                                 
##       HERB MARKER ROSEMARY}  => {HERB MARKER PARSLEY}  0.007127109  0.9720280 0.007332205 79.65320   139
## [4]  {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER ROSEMARY}  => {HERB MARKER THYME}    0.007537302  0.9545455 0.007896221 78.88347   147
## [5]  {HERB MARKER MINT,                                                                                 
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER ROSEMARY}  => {HERB MARKER CHIVES }  0.007127109  0.8424242 0.008460237 78.61148   139
## [6]  {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER THYME}     => {HERB MARKER ROSEMARY} 0.007537302  0.9607843 0.007844947 78.40241   147
## [7]  {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER THYME}     => {HERB MARKER PARSLEY}  0.007844947  0.9562500 0.008203866 78.36027   153
## [8]  {HERB MARKER BASIL,                                                                                
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER ROSEMARY}  => {HERB MARKER THYME}    0.007844947  0.9444444 0.008306414 78.04873   153
## [9]  {HERB MARKER BASIL,                                                                                
##       HERB MARKER MINT,                                                                                 
##       HERB MARKER ROSEMARY}  => {HERB MARKER THYME}    0.007434754  0.9415584 0.007896221 77.81023   145
## [10] {HERB MARKER MINT,                                                                                 
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER ROSEMARY}  => {HERB MARKER THYME}    0.007896221  0.9333333 0.008460237 77.13051   154
## [11] {HERB MARKER MINT,                                                                                 
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER THYME}     => {HERB MARKER CHIVES }  0.007075834  0.8263473 0.008562785 77.11125   138
## [12] {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER THYME}     => {HERB MARKER ROSEMARY} 0.007742399  0.9437500 0.008203866 77.01237   151
## [13] {HERB MARKER BASIL,                                                                                
##       HERB MARKER MINT,                                                                                 
##       HERB MARKER THYME}     => {HERB MARKER ROSEMARY} 0.007434754  0.9415584 0.007896221 76.83353   145
## [14] {HERB MARKER BASIL,                                                                                
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER THYME}     => {HERB MARKER ROSEMARY} 0.007844947  0.9386503 0.008357689 76.59622   153
## [15] {HERB MARKER MINT,                                                                                 
##       HERB MARKER ROSEMARY,                                                                             
##       HERB MARKER THYME}     => {HERB MARKER PARSLEY}  0.007896221  0.9333333 0.008460237 76.48235   154
## [16] {HERB MARKER PARSLEY,                                                                              
##       HERB MARKER ROSEMARY,                                                                             
##       HERB MARKER THYME}     => {HERB MARKER CHIVES }  0.007537302  0.8166667 0.009229349 76.20789   147
## [17] {HERB MARKER BASIL,                                                                                
##       HERB MARKER MINT,                                                                                 
##       HERB MARKER THYME}     => {HERB MARKER PARSLEY}  0.007332205  0.9285714 0.007896221 76.09214   143
## [18] {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER ROSEMARY}  => {HERB MARKER PARSLEY}  0.007896221  0.9277108 0.008511511 76.02162   154
## [19] {HERB MARKER PARSLEY,                                                                              
##       HERB MARKER ROSEMARY}  => {HERB MARKER THYME}    0.009229349  0.9183673 0.010049736 75.89372   180
## [20] {HERB MARKER MINT,                                                                                 
##       HERB MARKER THYME}     => {HERB MARKER PARSLEY}  0.008562785  0.9226519 0.009280623 75.60706   167
## [21] {HERB MARKER MINT,                                                                                 
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER THYME}     => {HERB MARKER ROSEMARY} 0.007896221  0.9221557 0.008562785 75.25022   154
## [22] {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER ROSEMARY}  => {HERB MARKER THYME}    0.007742399  0.9096386 0.008511511 75.17238   151
## [23] {HERB MARKER BASIL,                                                                                
##       HERB MARKER MINT,                                                                                 
##       HERB MARKER ROSEMARY}  => {HERB MARKER PARSLEY}  0.007229657  0.9155844 0.007896221 75.02791   141
## [24] {HERB MARKER MINT,                                                                                 
##       HERB MARKER ROSEMARY}  => {HERB MARKER THYME}    0.008460237  0.9065934 0.009331898 74.92073   165
## [25] {HERB MARKER BASIL,                                                                                
##       HERB MARKER ROSEMARY}  => {HERB MARKER THYME}    0.008614059  0.9032258 0.009536994 74.64243   168
## [26] {HERB MARKER BASIL,                                                                                
##       HERB MARKER ROSEMARY,                                                                             
##       HERB MARKER THYME}     => {HERB MARKER PARSLEY}  0.007844947  0.9107143 0.008614059 74.62883   153
## [27] {HERB MARKER PARSLEY,                                                                              
##       HERB MARKER THYME}     => {HERB MARKER ROSEMARY} 0.009229349  0.9137056 0.010101010 74.56067   180
## [28] {HERB MARKER BASIL,                                                                                
##       HERB MARKER THYME}     => {HERB MARKER ROSEMARY} 0.008614059  0.9130435 0.009434446 74.50664   168
## [29] {HERB MARKER MINT,                                                                                 
##       HERB MARKER THYME}     => {HERB MARKER ROSEMARY} 0.008460237  0.9116022 0.009280623 74.38903   165
## [30] {HERB MARKER BASIL,                                                                                
##       HERB MARKER MINT,                                                                                 
##       HERB MARKER PARSLEY}   => {HERB MARKER THYME}    0.007332205  0.8993711 0.008152592 74.32387   143
## [31] {HERB MARKER MINT,                                                                                 
##       HERB MARKER ROSEMARY}  => {HERB MARKER PARSLEY}  0.008460237  0.9065934 0.009331898 74.29114   165
## [32] {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER MINT}      => {HERB MARKER PARSLEY}  0.007947495  0.9064327 0.008767882 74.27797   155
## [33] {HERB MARKER MINT,                                                                                 
##       HERB MARKER PARSLEY}   => {HERB MARKER CHIVES }  0.007947495  0.7948718 0.009998462 74.17409   155
## [34] {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER ROSEMARY}  => {HERB MARKER MINT}     0.007127109  0.9025974 0.007896221 73.96369   139
## [35] {HERB MARKER CHIVES ,                                                                              
##       HERB MARKER PARSLEY,                                                                              
##       HERB MARKER THYME}     => {HERB MARKER MINT}     0.007075834  0.9019608 0.007844947 73.91152   138

retail_hi_lift <- head(sort(retail_rules, by = 'lift'),50)
plot(retail_hi_lift,method = 'graph',col = 'blue',cex = 1)

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

itemFrequencyPlot(retail_trans,topN=30,type = 'absolut',horiz = TRUE)

: Most frequently purchased items.
## 6.2 Feature: Stock Code

Model is applied in the same way except the variable now used in stock code.

# Preprocess data for basket analysis - Stock Code
retail_basket_stock <- retail %>% filter(refund == 'FALSE') %>%
  select(invoice_no,stock_code)
retail_basket_stock <- retail_basket_stock %>% group_by(invoice_no) %>%
  mutate(id = cur_group_id()) %>% 
  ungroup() %>%
  select(id, stock_code) # Assign basket variable as a key. 
summary(retail_basket_stock)

##        id         stock_code       
##  Min.   :    1   Length:527223     
##  1st Qu.: 4836   Class :character  
##  Median :10324   Mode  :character  
##  Mean   :10058                     
##  3rd Qu.:15171                     
##  Max.   :19503

# Convert data to class transaction
retail_list_stock <- split(retail$stock_code,retail_basket_stock$id) # Group items according the basket they belong to
retail_trans_stock <- as(retail_list_stock,'transactions') #Convert list to transactions
retail_rules_stock <- apriori(retail_trans_stock, parameter=list(supp=0.007, conf=0.4)) # Create rule object

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.007      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 136 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3930 item(s), 19503 transaction(s)] done [0.31s].
## sorting and recoding items ... [1185 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.12s].
## writing ... [2345 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

# Visualise rules
plot(retail_rules_stock)

# Inspect rules
inspect(head(sort(retail_rules_stock,by='lift'),35))

##      lhs                      rhs     support     confidence coverage   
## [1]  {22916, 22919, 22921} => {22918} 0.007075834 0.9787234  0.007229657
## [2]  {22916, 22917, 22921} => {22918} 0.007537302 0.9735099  0.007742399
## [3]  {22917, 22919, 22921} => {22918} 0.007127109 0.9720280  0.007332205
## [4]  {22917, 22918, 22921} => {22916} 0.007537302 0.9545455  0.007896221
## [5]  {22917, 22918, 22919} => {22921} 0.007127109 0.8424242  0.008460237
## [6]  {22916, 22918, 22921} => {22917} 0.007537302 0.9607843  0.007844947
## [7]  {22916, 22921}        => {22918} 0.007844947 0.9562500  0.008203866
## [8]  {22917, 22918, 22920} => {22916} 0.007844947 0.9444444  0.008306414
## [9]  {22917, 22919, 22920} => {22916} 0.007434754 0.9415584  0.007896221
## [10] {22917, 22918, 22919} => {22916} 0.007896221 0.9333333  0.008460237
## [11] {22916, 22918, 22919} => {22921} 0.007075834 0.8263473  0.008562785
## [12] {22916, 22921}        => {22917} 0.007742399 0.9437500  0.008203866
## [13] {22916, 22919, 22920} => {22917} 0.007434754 0.9415584  0.007896221
## [14] {22916, 22918, 22920} => {22917} 0.007844947 0.9386503  0.008357689
## [15] {22916, 22917, 22919} => {22918} 0.007896221 0.9333333  0.008460237
## [16] {22916, 22917, 22918} => {22921} 0.007537302 0.8166667  0.009229349
## [17] {22916, 22919, 22920} => {22918} 0.007332205 0.9285714  0.007896221
## [18] {22917, 22921}        => {22918} 0.007896221 0.9277108  0.008511511
## [19] {22917, 22918}        => {22916} 0.009229349 0.9183673  0.010049736
## [20] {22916, 22919}        => {22918} 0.008562785 0.9226519  0.009280623
## [21] {22916, 22918, 22919} => {22917} 0.007896221 0.9221557  0.008562785
## [22] {22917, 22921}        => {22916} 0.007742399 0.9096386  0.008511511
## [23] {22917, 22919, 22920} => {22918} 0.007229657 0.9155844  0.007896221
## [24] {22917, 22919}        => {22916} 0.008460237 0.9065934  0.009331898
## [25] {22917, 22920}        => {22916} 0.008614059 0.9032258  0.009536994
## [26] {22916, 22917, 22920} => {22918} 0.007844947 0.9107143  0.008614059
## [27] {22916, 22918}        => {22917} 0.009229349 0.9137056  0.010101010
## [28] {22916, 22920}        => {22917} 0.008614059 0.9130435  0.009434446
## [29] {22916, 22919}        => {22917} 0.008460237 0.9116022  0.009280623
## [30] {22918, 22919, 22920} => {22916} 0.007332205 0.8993711  0.008152592
## [31] {22917, 22919}        => {22918} 0.008460237 0.9065934  0.009331898
## [32] {22919, 22921}        => {22918} 0.007947495 0.9064327  0.008767882
## [33] {22918, 22919}        => {22921} 0.007947495 0.7948718  0.009998462
## [34] {22917, 22918, 22921} => {22919} 0.007127109 0.9025974  0.007896221
## [35] {22916, 22918, 22921} => {22919} 0.007075834 0.9019608  0.007844947
##      lift     count
## [1]  80.20186 138  
## [2]  79.77464 147  
## [3]  79.65320 139  
## [4]  78.88347 147  
## [5]  78.61148 139  
## [6]  78.40241 147  
## [7]  78.36027 153  
## [8]  78.04873 153  
## [9]  77.81023 145  
## [10] 77.13051 154  
## [11] 77.11125 138  
## [12] 77.01237 151  
## [13] 76.83353 145  
## [14] 76.59622 153  
## [15] 76.48235 154  
## [16] 76.20789 147  
## [17] 76.09214 143  
## [18] 76.02162 154  
## [19] 75.89372 180  
## [20] 75.60706 167  
## [21] 75.25022 154  
## [22] 75.17238 151  
## [23] 75.02791 141  
## [24] 74.92073 165  
## [25] 74.64243 168  
## [26] 74.62883 153  
## [27] 74.56067 180  
## [28] 74.50664 168  
## [29] 74.38903 165  
## [30] 74.32387 143  
## [31] 74.29114 165  
## [32] 74.27797 155  
## [33] 74.17409 155  
## [34] 73.96369 139  
## [35] 73.91152 138

retail_hi_lift_stock <- head(sort(retail_rules_stock, by = 'lift'),50)
plot(retail_hi_lift_stock,method = 'graph',col = 'blue',cex = 1)

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

itemFrequencyPlot(retail_trans_stock,topN=30,type = 'absolut',horiz = TRUE)

Online Retail Dataset

Fintan Cannon

3/9/2022