Load required libraries and data sets

Note that you will need to install these packages if you have not used them before.

#### Example code to install packages
#install.packages("data.table")

#### Load required libraries
library(data.table)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.1
library(ggmosaic)
library(readr)
library(skimr)
library(here)
## here() starts at /Users/NguyenTran/Documents/GitHub/Quantium-purchasing-behavior-project
# library(dplyr)

#### Import data
transactionData <- fread(here("data","QVI_transaction_data.csv"))
customerData <- fread(here("data","QVI_purchase_behaviour.csv"))

Exploratory data analysis

The first step in any analysis is to first understand the data. Let’s take a look at each of the data sets provided.

Examining transaction data

We can use str() to look at the format of each column and see a sample of the data. As we have read in the dataset as a data.table object, we can also run transactionData in the console to see a sample of the data, or use head(transactionData) to look at the first 6 rows.

Let’s check if columns we would expect to be numeric are in numeric form and date columns are in date format.

#### Examine transaction data
str(transactionData)
## Classes 'data.table' and 'data.frame': 264836 obs. of 8 variables:
## $ DATE : int 43390 43599 43605 43329 43330 43604 43601 43601 43332 43330 ...
## $ STORE_NBR : int 1 1 1 2 2 4 4 4 5 7 ...
## $ LYLTY_CARD_NBR: int 1000 1307 1343 2373 2426 4074 4149 4196 5026 7150 ...
## $ TXN_ID : int 1 348 383 974 1038 2982 3333 3539 4525 6900 ...
## $ PROD_NBR : int 5 66 61 69 108 57 16 24 42 52 ...
## $ PROD_NAME : chr "Natural Chip Compny SeaSalt175g" "CCs Nacho Cheese 175g" "Smiths Crinkle Cut Chips Chicken 170g" "Smiths Chip Thinly S/Cream&Onion 175g" ...
## $ PROD_QTY : int 2 3 2 5 3 1 1 1 1 2 ...
## $ TOT_SALES : num 6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(transactionData)
##     DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR
## 1: 43390         1           1000      1        5
## 2: 43599         1           1307    348       66
## 3: 43605         1           1343    383       61
## 4: 43329         2           2373    974       69
## 5: 43330         2           2426   1038      108
## 6: 43604         4           4074   2982       57
##                                   PROD_NAME PROD_QTY TOT_SALES
## 1:   Natural Chip        Compny SeaSalt175g        2       6.0
## 2:                 CCs Nacho Cheese    175g        3       6.3
## 3:   Smiths Crinkle Cut  Chips Chicken 170g        2       2.9
## 4:   Smiths Chip Thinly  S/Cream&Onion 175g        5      15.0
## 5: Kettle Tortilla ChpsHny&Jlpno Chili 150g        3      13.8
## 6: Old El Paso Salsa   Dip Tomato Mild 300g        1       5.1
summary(transactionData)
##       DATE         STORE_NBR     LYLTY_CARD_NBR        TXN_ID       
##  Min.   :43282   Min.   :  1.0   Min.   :   1000   Min.   :      1  
##  1st Qu.:43373   1st Qu.: 70.0   1st Qu.:  70021   1st Qu.:  67602  
##  Median :43464   Median :130.0   Median : 130358   Median : 135138  
##  Mean   :43464   Mean   :135.1   Mean   : 135550   Mean   : 135158  
##  3rd Qu.:43555   3rd Qu.:203.0   3rd Qu.: 203094   3rd Qu.: 202701  
##  Max.   :43646   Max.   :272.0   Max.   :2373711   Max.   :2415841  
##     PROD_NBR       PROD_NAME            PROD_QTY         TOT_SALES      
##  Min.   :  1.00   Length:264836      Min.   :  1.000   Min.   :  1.500  
##  1st Qu.: 28.00   Class :character   1st Qu.:  2.000   1st Qu.:  5.400  
##  Median : 56.00   Mode  :character   Median :  2.000   Median :  7.400  
##  Mean   : 56.58                      Mean   :  1.907   Mean   :  7.304  
##  3rd Qu.: 85.00                      3rd Qu.:  2.000   3rd Qu.:  9.200  
##  Max.   :114.00                      Max.   :200.000   Max.   :650.000
# expected numeric columns are indeed numeric
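
As an additional programmatic check (a quick sketch, output omitted), we can list each column's class directly:

#### Check column classes programmatically
sapply(transactionData, class)
# DATE will show as "integer", which we address next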

We can see that the date column is in an integer format. Let’s change this to a date format.

#### Convert DATE column to a date format
#### A quick search online tells us that CSV and Excel integer dates begin on 30
#### Dec 1899
transactionData$DATE <- as.Date(transactionData$DATE, origin = "1899-12-30")
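
As a quick sanity check on the conversion (a sketch; this output is not part of the original, but the later summary confirms it), the dates should now span the year of data:

#### Check the converted date range
range(transactionData$DATE)
# should span 2018-07-01 to 2019-06-30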

We should check that we are looking at the right products by examining PROD_NAME.

#### Examine PROD_NAME by generating a summary of the PROD_NAME column
levels(as.factor(transactionData$PROD_NAME))
##   [1] "Burger Rings 220g"                       
##   [2] "CCs Nacho Cheese    175g"                
##   [3] "CCs Original 175g"                       
##   [4] "CCs Tasty Cheese    175g"                
##   [5] "Cheetos Chs & Bacon Balls 190g"          
##   [6] "Cheetos Puffs 165g"                      
##   [7] "Cheezels Cheese 330g"                    
##   [8] "Cheezels Cheese Box 125g"                
##   [9] "Cobs Popd Sea Salt  Chips 110g"          
##  [10] "Cobs Popd Sour Crm  &Chives Chips 110g"  
##  [11] "Cobs Popd Swt/Chlli &Sr/Cream Chips 110g"
##  [12] "Dorito Corn Chp     Supreme 380g"        
##  [13] "Doritos Cheese      Supreme 330g"        
##  [14] "Doritos Corn Chip Mexican Jalapeno 150g" 
##  [15] "Doritos Corn Chip Southern Chicken 150g" 
##  [16] "Doritos Corn Chips  Cheese Supreme 170g" 
##  [17] "Doritos Corn Chips  Nacho Cheese 170g"   
##  [18] "Doritos Corn Chips  Original 170g"       
##  [19] "Doritos Mexicana    170g"                
##  [20] "Doritos Salsa       Medium 300g"         
##  [21] "Doritos Salsa Mild  300g"                
##  [22] "French Fries Potato Chips 175g"          
##  [23] "Grain Waves         Sweet Chilli 210g"   
##  [24] "Grain Waves Sour    Cream&Chives 210G"   
##  [25] "GrnWves Plus Btroot & Chilli Jam 180g"   
##  [26] "Infuzions BBQ Rib   Prawn Crackers 110g" 
##  [27] "Infuzions Mango     Chutny Papadums 70g" 
##  [28] "Infuzions SourCream&Herbs Veg Strws 110g"
##  [29] "Infuzions Thai SweetChili PotatoMix 110g"
##  [30] "Infzns Crn Crnchers Tangy Gcamole 110g"  
##  [31] "Kettle 135g Swt Pot Sea Salt"            
##  [32] "Kettle Chilli 175g"                      
##  [33] "Kettle Honey Soy    Chicken 175g"        
##  [34] "Kettle Mozzarella   Basil & Pesto 175g"  
##  [35] "Kettle Original 175g"                    
##  [36] "Kettle Sea Salt     And Vinegar 175g"    
##  [37] "Kettle Sensations   BBQ&Maple 150g"      
##  [38] "Kettle Sensations   Camembert & Fig 150g"
##  [39] "Kettle Sensations   Siracha Lime 150g"   
##  [40] "Kettle Sweet Chilli And Sour Cream 175g" 
##  [41] "Kettle Tortilla ChpsBtroot&Ricotta 150g" 
##  [42] "Kettle Tortilla ChpsFeta&Garlic 150g"    
##  [43] "Kettle Tortilla ChpsHny&Jlpno Chili 150g"
##  [44] "Natural Chip        Compny SeaSalt175g"  
##  [45] "Natural Chip Co     Tmato Hrb&Spce 175g" 
##  [46] "Natural ChipCo      Hony Soy Chckn175g"  
##  [47] "Natural ChipCo Sea  Salt & Vinegr 175g"  
##  [48] "NCC Sour Cream &    Garden Chives 175g"  
##  [49] "Old El Paso Salsa   Dip Chnky Tom Ht300g"
##  [50] "Old El Paso Salsa   Dip Tomato Med 300g" 
##  [51] "Old El Paso Salsa   Dip Tomato Mild 300g"
##  [52] "Pringles Barbeque   134g"                
##  [53] "Pringles Chicken    Salt Crips 134g"     
##  [54] "Pringles Mystery    Flavour 134g"        
##  [55] "Pringles Original   Crisps 134g"         
##  [56] "Pringles Slt Vingar 134g"                
##  [57] "Pringles SourCream  Onion 134g"          
##  [58] "Pringles Sthrn FriedChicken 134g"        
##  [59] "Pringles Sweet&Spcy BBQ 134g"            
##  [60] "Red Rock Deli Chikn&Garlic Aioli 150g"   
##  [61] "Red Rock Deli Sp    Salt & Truffle 150G" 
##  [62] "Red Rock Deli SR    Salsa & Mzzrlla 150g"
##  [63] "Red Rock Deli Thai  Chilli&Lime 150g"    
##  [64] "RRD Chilli&         Coconut 150g"        
##  [65] "RRD Honey Soy       Chicken 165g"        
##  [66] "RRD Lime & Pepper   165g"                
##  [67] "RRD Pc Sea Salt     165g"                
##  [68] "RRD Salt & Vinegar  165g"                
##  [69] "RRD SR Slow Rst     Pork Belly 150g"     
##  [70] "RRD Steak &         Chimuchurri 150g"    
##  [71] "RRD Sweet Chilli &  Sour Cream 165g"     
##  [72] "Smith Crinkle Cut   Bolognese 150g"      
##  [73] "Smith Crinkle Cut   Mac N Cheese 150g"   
##  [74] "Smiths Chip Thinly  Cut Original 175g"   
##  [75] "Smiths Chip Thinly  CutSalt/Vinegr175g"  
##  [76] "Smiths Chip Thinly  S/Cream&Onion 175g"  
##  [77] "Smiths Crinkle      Original 330g"       
##  [78] "Smiths Crinkle Chips Salt & Vinegar 330g"
##  [79] "Smiths Crinkle Cut  Chips Barbecue 170g" 
##  [80] "Smiths Crinkle Cut  Chips Chicken 170g"  
##  [81] "Smiths Crinkle Cut  Chips Chs&Onion170g" 
##  [82] "Smiths Crinkle Cut  Chips Original 170g" 
##  [83] "Smiths Crinkle Cut  French OnionDip 150g"
##  [84] "Smiths Crinkle Cut  Salt & Vinegar 170g" 
##  [85] "Smiths Crinkle Cut  Snag&Sauce 150g"     
##  [86] "Smiths Crinkle Cut  Tomato Salsa 150g"   
##  [87] "Smiths Crnkle Chip  Orgnl Big Bag 380g"  
##  [88] "Smiths Thinly       Swt Chli&S/Cream175G"
##  [89] "Smiths Thinly Cut   Roast Chicken 175g"  
##  [90] "Snbts Whlgrn Crisps Cheddr&Mstrd 90g"    
##  [91] "Sunbites Whlegrn    Crisps Frch/Onin 90g"
##  [92] "Thins Chips         Originl saltd 175g"  
##  [93] "Thins Chips Light&  Tangy 175g"          
##  [94] "Thins Chips Salt &  Vinegar 175g"        
##  [95] "Thins Chips Seasonedchicken 175g"        
##  [96] "Thins Potato Chips  Hot & Spicy 175g"    
##  [97] "Tostitos Lightly    Salted 175g"         
##  [98] "Tostitos Smoked     Chipotle 175g"       
##  [99] "Tostitos Splash Of  Lime 175g"           
## [100] "Twisties Cheese     270g"                
## [101] "Twisties Cheese     Burger 250g"         
## [102] "Twisties Chicken270g"                    
## [103] "Tyrrells Crisps     Ched & Chives 165g"  
## [104] "Tyrrells Crisps     Lightly Salted 165g" 
## [105] "Woolworths Cheese   Rings 190g"          
## [106] "Woolworths Medium   Salsa 300g"          
## [107] "Woolworths Mild     Salsa 300g"          
## [108] "WW Crinkle Cut      Chicken 175g"        
## [109] "WW Crinkle Cut      Original 175g"       
## [110] "WW D/Style Chip     Sea Salt 200g"       
## [111] "WW Original Corn    Chips 200g"          
## [112] "WW Original Stacked Chips 160g"          
## [113] "WW Sour Cream &OnionStacked Chips 160g"  
## [114] "WW Supreme Cheese   Corn Chips 200g"
prod_counts <- table(transactionData$PROD_NAME)

# count of unique products
length(unique(transactionData$PROD_NAME))
## [1] 114
# most common product
names(prod_counts)[which.max(prod_counts)]
## [1] "Kettle Mozzarella   Basil & Pesto 175g"
# freq of the most common product
max(prod_counts)
## [1] 3304
# frequency of occurrence for the top ten products
head(sort(prod_counts,decreasing = TRUE),n = 10)
## 
##   Kettle Mozzarella   Basil & Pesto 175g 
##                                     3304 
## Kettle Tortilla ChpsHny&Jlpno Chili 150g 
##                                     3296 
## Cobs Popd Swt/Chlli &Sr/Cream Chips 110g 
##                                     3269 
##   Tyrrells Crisps     Ched & Chives 165g 
##                                     3268 
##           Cobs Popd Sea Salt  Chips 110g 
##                                     3265 
##             Kettle 135g Swt Pot Sea Salt 
##                                     3257 
##            Tostitos Splash Of  Lime 175g 
##                                     3252 
## Infuzions Thai SweetChili PotatoMix 110g 
##                                     3242 
##   Smiths Crnkle Chip  Orgnl Big Bag 380g 
##                                     3233 
##     Thins Potato Chips  Hot & Spicy 175g 
##                                     3229
# sort(summary.factor(transactionData[, PROD_NAME]), decreasing = TRUE)

Looks like we are definitely looking at potato chips but how can we check that these are all chips? We can do some basic text analysis by summarizing the individual words in the product name.

#### Examine the words in PROD_NAME to see if there are any incorrect entries
#### such as products that are not chips

# split the unique product names on tab characters ("\t")
# (note: the names contain no tabs, so each element remains a full product name)
# then unlist the list of vectors into a single character vector
# and convert it into a data.table object
productWords <- data.table(unlist(strsplit(unique(transactionData[, PROD_NAME]), "\t"))) 
setnames(productWords, 'words')
productWords
##                                         words
##   1:   Natural Chip        Compny SeaSalt175g
##   2:                 CCs Nacho Cheese    175g
##   3:   Smiths Crinkle Cut  Chips Chicken 170g
##   4:   Smiths Chip Thinly  S/Cream&Onion 175g
##   5: Kettle Tortilla ChpsHny&Jlpno Chili 150g
##  ---                                         
## 110:    Red Rock Deli Chikn&Garlic Aioli 150g
## 111:      RRD SR Slow Rst     Pork Belly 150g
## 112:                 RRD Pc Sea Salt     165g
## 113:       Smith Crinkle Cut   Bolognese 150g
## 114:                 Doritos Salsa Mild  300g
# play around
head(unique(transactionData[, PROD_NAME]))
## [1] "Natural Chip        Compny SeaSalt175g"  
## [2] "CCs Nacho Cheese    175g"                
## [3] "Smiths Crinkle Cut  Chips Chicken 170g"  
## [4] "Smiths Chip Thinly  S/Cream&Onion 175g"  
## [5] "Kettle Tortilla ChpsHny&Jlpno Chili 150g"
## [6] "Old El Paso Salsa   Dip Tomato Mild 300g"
# play around
head(strsplit(unique(transactionData[, PROD_NAME]), "\t"))
## [[1]]
## [1] "Natural Chip        Compny SeaSalt175g"
## 
## [[2]]
## [1] "CCs Nacho Cheese    175g"
## 
## [[3]]
## [1] "Smiths Crinkle Cut  Chips Chicken 170g"
## 
## [[4]]
## [1] "Smiths Chip Thinly  S/Cream&Onion 175g"
## 
## [[5]]
## [1] "Kettle Tortilla ChpsHny&Jlpno Chili 150g"
## 
## [[6]]
## [1] "Old El Paso Salsa   Dip Tomato Mild 300g"
# this is a list of vectors
# [[1]] means the first element in this list, [[2]] second element and so on
# [1] means the first element within the first vector

head(transactionData$PROD_NAME)
## [1] "Natural Chip        Compny SeaSalt175g"  
## [2] "CCs Nacho Cheese    175g"                
## [3] "Smiths Crinkle Cut  Chips Chicken 170g"  
## [4] "Smiths Chip Thinly  S/Cream&Onion 175g"  
## [5] "Kettle Tortilla ChpsHny&Jlpno Chili 150g"
## [6] "Old El Paso Salsa   Dip Tomato Mild 300g"

As we are only interested in words that will tell us if the product is chips or not, let’s remove all words with digits and special characters such as ‘&’ from our set of product words. We can do this with regular expressions, using gsub() to replace the unwanted characters.

# Remove digits and special characters, then sort the distinct
# words by frequency of occurrence.

# str(productWords) to recall what object we are working with

#### Removing digits
# productWords is a data.table with one column, 'words'
# the regex "\\d+.$" matches one or more digits followed by one final character
# (the trailing "g" or "G" of the pack weight) at the end of the string
# the gsub function then replaces the matched pattern with an empty string
productWords[, words := gsub("\\d+.$", "", words)]

# note the := operator is specific to the data.table package. 
# it is used to update or assign values to columns within a data.table object. 

#### Removing special characters
productWords[, words := gsub("\\W+", " ", words)]

#### Let's look at the most common words by counting the number of times a word
# appears and sorting them by this frequency in order of highest to lowest frequency
sort(table(unlist(strsplit(unique(productWords[, words]), " "))), decreasing = TRUE)
##
## Chips Smiths Crinkle Cut Kettle
## 21 16 14 14 13
## Cheese Salt Original Chip Doritos
## 12 12 10 9 9
## Salsa Chicken Corn Cream Pringles
## 9 8 8 8 8
## RRD Chilli WW Sea Sour
## 8 7 7 6 6
## Crisps Thinly Thins Vinegar Chives
## 5 5 5 5 4
## Deli Infuzions Lime Natural Red
## 4 4 4 4 4
## Rock Supreme Sweet BBQ CCs
## 4 4 4 3 3
## Cobs Dip El Mild Old
## 3 3 3 3 3
## Onion Paso Popd Sensations Soy
## 3 3 3 3 3
## Swt Tomato Tortilla Tostitos Twisties
## 3 3 3 3 3
## Woolworths And Burger Cheetos Cheezels
## 3 2 2 2 2
## ChipCo Chs French Garlic Grain
## 2 2 2 2 2
## Honey Lightly Medium Nacho Potato
## 2 2 2 2 2
## Rings S Salted Smith SourCream
## 2 2 2 2 2
## SR Tangy Thai Tyrrells Vinegr
## 2 2 2 2 2
## Waves 135g Aioli Bacon Bag
## 2 1 1 1 1
## Balls Barbecue Barbeque Basil Belly
## 1 1 1 1 1
## Big Bolognese Box Btroot Camembert
## 1 1 1 1 1
## Chckn Ched Cheddr Chikn Chili
## 1 1 1 1 1
## Chimuchurri Chipotle Chli Chlli Chnky
## 1 1 1 1 1
## Chp ChpsBtroot ChpsFeta ChpsHny Chutny
## 1 1 1 1 1
## Co Coconut Compny Crackers Crips
## 1 1 1 1 1
## Crm Crn Crnchers Crnkle CutSalt
## 1 1 1 1 1
## D Dorito Fig Flavour Frch
## 1 1 1 1 1
## FriedChicken Fries Garden Gcamole GrnWves
## 1 1 1 1 1
## Herbs Hony Hot Hrb Ht
## 1 1 1 1 1
## Infzns Jalapeno Jam Jlpno Light
## 1 1 1 1 1
## Mac Mango Maple Med Mexican
## 1 1 1 1 1
## Mexicana Mozzarella Mstrd Mystery Mzzrlla
## 1 1 1 1 1
## N NCC Of Onin OnionDip
## 1 1 1 1 1
## OnionStacked Orgnl Originl Papadums Pc
## 1 1 1 1 1
## Pepper Pesto Plus Pork Pot
## 1 1 1 1 1
## PotatoMix Prawn Puffs Rib Ricotta
## 1 1 1 1 1
## Roast Rst saltd Sauce SeaSalt
## 1 1 1 1 1
## Seasonedchicken Siracha Slow Slt Smoked
## 1 1 1 1 1
## Snag Snbts Southern Sp Spce
## 1 1 1 1 1
## Spcy Spicy Splash Sr Stacked
## 1 1 1 1 1
## Steak Sthrn Strws Style Sunbites
## 1 1 1 1 1
## SweetChili Tasty Tmato Tom Truffle
## 1 1 1 1 1
## Veg Vingar Whlegrn Whlgrn
## 1 1 1 1

There are salsa products in the dataset but we are only interested in the chips category, so let’s remove these.

#### Remove salsa products
# 1) identify rows that are salsa products
# 2) remove these rows

# helpful to recall data.table structure
# DT[i, j, by]
##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by

# create a new column 'SALSA' containing logical values
# to identify rows that have the word 'salsa' in PROD_NAME
# use tolower to standardize the word 'salsa'
transactionData[, SALSA := grepl("salsa", tolower(PROD_NAME))]
# [SALSA == FALSE, ] selects only rows that are not salsa products
# then [, SALSA := NULL] deletes the 'SALSA' column
transactionData <- transactionData[SALSA == FALSE, ][, SALSA := NULL]
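
As a quick check (a sketch, not part of the original output), we can confirm that no salsa products remain:

#### Confirm salsa products were removed
transactionData[grepl("salsa", tolower(PROD_NAME)), .N]
# should return 0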

Next, we can use summary() to check summary statistics such as the mean, min and max values for each feature, to see if there are any obvious outliers in the data and whether there are any nulls in any of the columns (an NA's count will appear in the output if there are any nulls).

#### Summarise the data to check for nulls and possible outliers
summary(transactionData)
##       DATE              STORE_NBR     LYLTY_CARD_NBR        TXN_ID       
##  Min.   :2018-07-01   Min.   :  1.0   Min.   :   1000   Min.   :      1  
##  1st Qu.:2018-09-30   1st Qu.: 70.0   1st Qu.:  70015   1st Qu.:  67569  
##  Median :2018-12-30   Median :130.0   Median : 130367   Median : 135183  
##  Mean   :2018-12-30   Mean   :135.1   Mean   : 135531   Mean   : 135131  
##  3rd Qu.:2019-03-31   3rd Qu.:203.0   3rd Qu.: 203084   3rd Qu.: 202654  
##  Max.   :2019-06-30   Max.   :272.0   Max.   :2373711   Max.   :2415841  
##     PROD_NBR       PROD_NAME            PROD_QTY         TOT_SALES      
##  Min.   :  1.00   Length:246742      Min.   :  1.000   Min.   :  1.700  
##  1st Qu.: 26.00   Class :character   1st Qu.:  2.000   1st Qu.:  5.800  
##  Median : 53.00   Mode  :character   Median :  2.000   Median :  7.400  
##  Mean   : 56.35                      Mean   :  1.908   Mean   :  7.321  
##  3rd Qu.: 87.00                      3rd Qu.:  2.000   3rd Qu.:  8.800  
##  Max.   :114.00                      Max.   :200.000   Max.   :650.000

There are no nulls in the columns, but product quantity appears to have an outlier which we should investigate further. Let’s look at the case where 200 packets of chips are bought in one transaction.

#### Filter the dataset to find the outlier
transactionData[PROD_QTY == 200]
##          DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR
## 1: 2018-08-19       226         226000 226201        4
## 2: 2019-05-20       226         226000 226210        4
##                           PROD_NAME PROD_QTY TOT_SALES
## 1: Dorito Corn Chp     Supreme 380g      200       650
## 2: Dorito Corn Chp     Supreme 380g      200       650

There are two transactions where 200 packets of chips are bought in one transaction and both of these transactions were by the same customer.

#### Let's see if the customer has had other transactions
transactionData[LYLTY_CARD_NBR == 226000]
##          DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR
## 1: 2018-08-19       226         226000 226201        4
## 2: 2019-05-20       226         226000 226210        4
##                           PROD_NAME PROD_QTY TOT_SALES
## 1: Dorito Corn Chp     Supreme 380g      200       650
## 2: Dorito Corn Chp     Supreme 380g      200       650

It looks like this customer has only had the two transactions over the year and is not an ordinary retail customer. The customer might be buying chips for commercial purposes instead. We’ll remove this loyalty card number from further analysis.

#### Filter out the customer based on the loyalty card number
transactionData <- transactionData[LYLTY_CARD_NBR != 226000]

#### Re-examine transaction data
summary(transactionData)
##       DATE              STORE_NBR     LYLTY_CARD_NBR        TXN_ID       
##  Min.   :2018-07-01   Min.   :  1.0   Min.   :   1000   Min.   :      1  
##  1st Qu.:2018-09-30   1st Qu.: 70.0   1st Qu.:  70015   1st Qu.:  67569  
##  Median :2018-12-30   Median :130.0   Median : 130367   Median : 135182  
##  Mean   :2018-12-30   Mean   :135.1   Mean   : 135530   Mean   : 135130  
##  3rd Qu.:2019-03-31   3rd Qu.:203.0   3rd Qu.: 203083   3rd Qu.: 202652  
##  Max.   :2019-06-30   Max.   :272.0   Max.   :2373711   Max.   :2415841  
##     PROD_NBR       PROD_NAME            PROD_QTY       TOT_SALES     
##  Min.   :  1.00   Length:246740      Min.   :1.000   Min.   : 1.700  
##  1st Qu.: 26.00   Class :character   1st Qu.:2.000   1st Qu.: 5.800  
##  Median : 53.00   Mode  :character   Median :2.000   Median : 7.400  
##  Mean   : 56.35                      Mean   :1.906   Mean   : 7.316  
##  3rd Qu.: 87.00                      3rd Qu.:2.000   3rd Qu.: 8.800  
##  Max.   :114.00                      Max.   :5.000   Max.   :29.500

That’s better. Now, let’s look at the number of transaction lines over time to see if there are any obvious data issues such as missing data.

# this part performs operations by group

#### Count the number of transactions by date
transaction_count <- transactionData[, .N, by = DATE]
transaction_count
##            DATE   N
##   1: 2018-10-17 682
##   2: 2019-05-14 705
##   3: 2019-05-20 707
##   4: 2018-08-17 663
##   5: 2018-08-18 683
##  ---               
## 360: 2018-12-08 622
## 361: 2019-01-30 689
## 362: 2019-02-09 671
## 363: 2018-08-31 658
## 364: 2019-02-12 684

There are only 364 rows, i.e. only 364 distinct dates, which indicates a missing date. Let’s create a sequence of dates from 1 Jul 2018 to 30 Jun 2019 and use this to create a chart of the number of transactions over time to find the missing date.

#### Create a sequence of dates and join this onto the count of transactions by date

# create a column of dates that includes every day from 1 Jul 2018 to
# 30 Jun 2019, then join it onto the data to fill in the missing day.

# use this format seq(start_date, end_date, by = "day")
date_seq <- seq(as.Date("2018-7-1"), as.Date("2019-6-30"), by = "day")
# convert date_seq to data.table
date_seq_dt <- data.table(DATE = date_seq)

# perform a right join between transaction count and date sequence
# a right join is a type of database join operation that combines two tables based on a common column, 
# while including all rows from the right table and matching rows from the left table
# right table in this case is date_seq_dt
transactions_by_day <- merge.data.table(transaction_count,date_seq_dt, 
                                        by = "DATE", all.y = TRUE)
# or we can do transaction_count[date_seq_dt, on = "DATE"] but this is a bit less intuitive


#### Setting plot themes to format graphs
theme_set(theme_bw())
theme_update(plot.title = element_text(hjust = 0.5))

#### Plot transactions by date
ggplot(transactions_by_day, aes(x = DATE, y = N)) +
  geom_line() +
  labs(x = "Day", y = "Number of transactions", title = "Transactions over time") +
  scale_x_date(breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

We can see that there is an increase in purchases in December and a break in late December. Let’s zoom in on this.

#### Filter to December and look at individual days

# Over to you - recreate the chart above zoomed in to the relevant dates.
ggplot(transactions_by_day[DATE %like% "-12-", ], aes(x = DATE, y = N)) +
  geom_line() +
  labs(x = "Day", y = "Number of transactions", title = "Transactions over time") +
  scale_x_date(breaks = "1 day") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

We can see that the increase in sales occurs in the lead-up to Christmas and that there are zero sales on Christmas day itself. This is due to shops being closed on Christmas day.
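
We can also confirm the missing date programmatically instead of reading it off the chart; after the right join, the missing day is the one whose transaction count is NA (a quick sketch, not part of the original output):

#### Identify the missing date directly
transactions_by_day[is.na(N), DATE]
# should return "2018-12-25"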

Now that we are satisfied that the data no longer has outliers, we can move on to creating other features such as brand of chips or pack size from PROD_NAME. We will start with pack size.

#### Pack size
#### We can work this out by taking the digits that are in PROD_NAME
transactionData[, PACK_SIZE := parse_number(PROD_NAME)]

#### Always check your output
#### Let's check if the pack sizes look sensible
transactionData[, .N, PACK_SIZE][order(PACK_SIZE)]
##     PACK_SIZE     N
##  1:        70  1507
##  2:        90  3008
##  3:       110 22387
##  4:       125  1454
##  5:       134 25102
##  6:       135  3257
##  7:       150 40203
##  8:       160  2970
##  9:       165 15297
## 10:       170 19983
## 11:       175 66390
## 12:       180  1468
## 13:       190  2995
## 14:       200  4473
## 15:       210  6272
## 16:       220  1564
## 17:       250  3169
## 18:       270  6285
## 19:       330 12540
## 20:       380  6416

The largest size is 380g and the smallest size is 70g - seems sensible!

#### Let's plot a histogram of PACK_SIZE. Although it is numeric, we know it is
# effectively a categorical variable rather than a continuous one.

# plot a histogram showing the number of transactions by pack size.
ggplot(transactionData, aes(x = PACK_SIZE)) +
  geom_histogram(bins = 20)

Pack sizes created look reasonable.
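
Since PACK_SIZE takes only 20 distinct values, an alternative view (a sketch, not part of the original analysis) is a column chart over the distinct sizes, which avoids any binning artifacts from the histogram:

#### Alternative: column chart over the distinct pack sizes
ggplot(transactionData[, .N, by = PACK_SIZE], aes(x = factor(PACK_SIZE), y = N)) +
  geom_col() +
  labs(x = "Pack size (g)", y = "Number of transactions")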

Now to create brands, we can use the first word in PROD_NAME to work out the brand name…

#### Brands
# Over to you! Create a column which contains the brand of the product, by
# extracting it from the product name.


# Extract brand name from PROD_NAME
transactionData[, BRAND := sub("^([A-Za-z]+).*", "\\1", PROD_NAME)]
# '^([A-Za-z]+)' matches and captures one or more consecutive alphabetic characters at the start of the string.
# '.*' matches any remaining characters in the string.
# '\\1' is the replacement string that refers to the captured group (brand name).



# another way to write
# transactionData$Brand <- gsub("^(\\S+).*", "\\1", transactionData$PROD_NAME)

# another (long) way, using loop to iterate through PROD_NAME and apply a function to extract brand name
# transactionData[, BRAND := unlist(lapply(transactionData$PROD_NAME,
#                                          FUN = function(x){unlist(strsplit(x," "))[1]}))]



#### Checking brands
# check to see if the results look reasonable.
head(transactionData$BRAND)
## [1] "Natural" "CCs"     "Smiths"  "Smiths"  "Kettle"  "Smiths"

Some of the brand names look like they belong to the same brand, such as RED and RRD, which are both Red Rock Deli chips. Let’s combine these together.

#### Clean brand names
transactionData[BRAND %in% c("RED", "Red"), BRAND := "RRD"]
transactionData[BRAND == "Dorito", BRAND := "Doritos"]
transactionData[BRAND == "Smith", BRAND := "Smiths"]
transactionData[BRAND == "Snbts", BRAND := "Sunbites"]
transactionData[BRAND == "WW", BRAND := "Woolworths"]

# GrnWves is short for Grain Waves, so it could arguably be merged with "Grain":
# transactionData[BRAND == "GrnWves", BRAND := "Grain"]

#### Check to see if the results look reasonable
table(transactionData$BRAND)
## 
##     Burger        CCs    Cheetos   Cheezels       Cobs    Doritos     French 
##       1564       4551       2927       4603       9693      25224       1418 
##      Grain    GrnWves  Infuzions     Infzns     Kettle    Natural        NCC 
##       6272       1468      11057       3144      41288       6050       1419 
##   Pringles        RRD     Smiths   Sunbites      Thins   Tostitos   Twisties 
##      25102      16321      30353       3008      14075       9471       9454 
##   Tyrrells Woolworths 
##       6442      11836

Examining customer data

Now that we are happy with the transaction dataset, let’s have a look at the customer dataset.

#### Examining customer data
str(customerData)
## Classes 'data.table' and 'data.frame': 72637 obs. of 3 variables:
## $ LYLTY_CARD_NBR : int 1000 1002 1003 1004 1005 1007 1009 1010 1011 1012 ...
## $ LIFESTAGE : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "OLDER SINGLES/COUPLES" ...
## $ PREMIUM_CUSTOMER: chr "Premium" "Mainstream" "Budget" "Mainstream" ...
## - attr(*, ".internal.selfref")=<externalptr>
skim(customerData)
Data summary

Name                     customerData
Number of rows           72637
Number of columns        3
Key                      NULL
Column type frequency:
  character              2
  numeric                1
Group variables          None

Variable type: character

skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
LIFESTAGE                  0              1    8   22      0         7           0
PREMIUM_CUSTOMER           0              1    6   10      0         3           0

Variable type: numeric

skim_variable   n_missing  complete_rate      mean        sd    p0    p25     p50     p75     p100  hist
LYLTY_CARD_NBR          0              1  136185.9  89892.93  1000  66202  134040  203375  2373711  ▇▁▁▁▁
# distributions of columns
sort(table(customerData$LIFESTAGE), decreasing = TRUE)
## 
##               RETIREES  OLDER SINGLES/COUPLES  YOUNG SINGLES/COUPLES 
##                  14805                  14609                  14441 
##         OLDER FAMILIES         YOUNG FAMILIES MIDAGE SINGLES/COUPLES 
##                   9780                   9178                   7275 
##           NEW FAMILIES 
##                   2549
sort(table(customerData$PREMIUM_CUSTOMER), decreasing = TRUE)
## 
## Mainstream     Budget    Premium 
##      29245      24470      18922
#### Merge transaction data to customer data
data <- merge(transactionData, customerData, all.x = TRUE)

As the number of rows in data is the same as in transactionData, we can be sure that no duplicates were created. This is because we created data by setting all.x = TRUE (in other words, a left join), which takes all the rows in transactionData (the x, or first-mentioned, table) and joins on the customer details from rows of customerData with matching values in the shared LYLTY_CARD_NBR column.
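
We can verify this row-count claim directly (a quick sketch, output omitted):

#### Confirm the join created no duplicate rows
nrow(data) == nrow(transactionData)
# should return TRUE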

Let’s also check whether some transactions were not matched to a customer by checking for nulls.

# check if any transactions did not have a matched customer.
skim(data)
Data summary

Name                     data
Number of rows           246740
Number of columns        12
Key                      LYLTY_CARD_NBR
Column type frequency:
  character              4
  Date                   1
  numeric                7
Group variables          None

Variable type: character

skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
PROD_NAME                  0              1   17   40      0       105           0
BRAND                      0              1    3   10      0        23           0
LIFESTAGE                  0              1    8   22      0         7           0
PREMIUM_CUSTOMER           0              1    6   10      0         3           0

Variable type: Date

skim_variable  n_missing  complete_rate  min         max         median      n_unique
DATE                   0              1  2018-07-01  2019-06-30  2018-12-30       364

Variable type: numeric

skim_variable   n_missing  complete_rate       mean        sd      p0       p25       p50       p75       p100  hist
LYLTY_CARD_NBR          0              1  135530.25  80715.20  1000.0  70015.00  130367.0  203083.2  2373711.0  ▇▁▁▁▁
STORE_NBR               0              1     135.05     76.79     1.0     70.00     130.0     203.0      272.0  ▆▇▇▇▇
TXN_ID                  0              1  135130.36  78147.60     1.0  67568.75  135181.5  202652.2  2415841.0  ▇▁▁▁▁
PROD_NBR                0              1      56.35     33.70     1.0     26.00      53.0      87.0      114.0  ▇▇▆▇▇
PROD_QTY                0              1       1.91      0.34     1.0      2.00       2.0       2.0        5.0  ▁▇▁▁▁
TOT_SALES               0              1       7.32      2.47     1.7      5.80       7.4       8.8       29.5  ▆▇▁▁▁
PACK_SIZE               0              1     175.58     59.43    70.0    150.00     170.0     175.0      380.0  ▁▇▁▁▁
# to only specifically check for nulls
sapply(data, function(x) sum(is.na(x)))
##   LYLTY_CARD_NBR             DATE        STORE_NBR           TXN_ID 
##                0                0                0                0 
##         PROD_NBR        PROD_NAME         PROD_QTY        TOT_SALES 
##                0                0                0                0 
##        PACK_SIZE            BRAND        LIFESTAGE PREMIUM_CUSTOMER 
##                0                0                0                0

Great, there are no nulls! So all the customers in the transaction data have been accounted for in the customer dataset.

Note that if you are continuing with Task 2, you may want to retain this dataset, which you can write out as a csv.

fwrite(data, here("data","QVI_data.csv"))

Data exploration is now complete!

Data analysis on customer segments

Now that the data is ready for analysis, we can define some metrics of interest to the client:

  • Who spends the most on chips (total sales), describing customers by lifestage and how premium their general purchasing behaviour is
  • How many customers are in each segment
  • How many chips are bought per customer by segment
  • What’s the average chip price by customer segment

We could also ask our data team for more information. Examples are:

  • The customer’s total grocery spend over the period and total spend for each chips transaction to understand what proportion of their grocery spend is on chips
  • Proportion of customers in each customer segment overall to compare against the mix of customers who purchase chips

Let’s start with calculating total sales by LIFESTAGE and PREMIUM_CUSTOMER and plotting the split by these segments to describe which customer segment contributes most to chip sales.

#### Calculate total sales by LIFESTAGE and PREMIUM_CUSTOMER and create a plot
sales <- data[, .(SALES = sum(TOT_SALES)), .(LIFESTAGE, PREMIUM_CUSTOMER)]

#### Create plot
p <- ggplot(data = sales) +
  # weight = SALES means that the size or area of each mosaic tile will be proportional to the corresponding sales value. 
  # product() is used to create a combined factor variable that represents different combinations of PREMIUM_CUSTOMER and LIFESTAGE
  geom_mosaic(aes(weight = SALES, x = product(PREMIUM_CUSTOMER, LIFESTAGE), fill = PREMIUM_CUSTOMER)) +
  labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
# The labels are generated by calculating the proportion of each tile's weight (SALES) 
# relative to the sum of all weights in the plot.
p + geom_text(data = ggplot_build(p)$data[[1]], 
              # ggplot_build(p)$data is a list of layers that contains the data used to create the layers in a ggplot object p.
              aes(x = (xmin + xmax)/2 , # this is to position the text 
                  y = (ymin + ymax)/2,  # in the middle of each tile
                  label = as.character(paste(round(.wt/sum(.wt),3)*100, '%'))))
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
##   Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Note that the percentage of each tile is relative to the total sales.

Sales are coming mainly from Budget - older families, Mainstream - young singles/couples, and Mainstream - retirees.

Let’s see if the higher sales are due to there being more customers who buy chips.

#### Calculate Number of customers by LIFESTAGE and PREMIUM_CUSTOMER and create a plot.
# uniqueN is equivalent to length(unique(x))
customers <- data[, .(CUSTOMERS = uniqueN(LYLTY_CARD_NBR)), .(LIFESTAGE, PREMIUM_CUSTOMER)][order(-CUSTOMERS)]
 
p <- ggplot(data = customers) +
  geom_mosaic(aes(weight = CUSTOMERS, x = product(PREMIUM_CUSTOMER,LIFESTAGE), fill = PREMIUM_CUSTOMER)) +
  labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of customers") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p + geom_text(data = ggplot_build(p)$data[[1]], 
              aes(x = (xmin + xmax)/2 , 
                  y = (ymin + ymax)/2, 
                  label = as.character(paste(round(.wt/sum(.wt),3)*100,'%'))))

# .wt refers to the weight variable specified in 'weight' aesthetic of the geom_mosaic() layer


ggplot(customers, aes(x = LIFESTAGE, y = CUSTOMERS, fill = PREMIUM_CUSTOMER)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

There are more Mainstream - young singles/couples and Mainstream - retirees who buy chips. This contributes to there being more sales in these customer segments, but it is not a major driver for the Budget - older families segment.

Higher sales may also be driven by more units of chips being bought per customer. Let’s have a look at this next.

#### Average number of units per customer by LIFESTAGE and PREMIUM_CUSTOMER
avg_units <- data[, .(AVG = sum(PROD_QTY)/uniqueN(LYLTY_CARD_NBR)),.(LIFESTAGE, PREMIUM_CUSTOMER)][order(-AVG)]
avg_units
##                  LIFESTAGE PREMIUM_CUSTOMER      AVG
##  1:         OLDER FAMILIES       Mainstream 9.255380
##  2:         OLDER FAMILIES           Budget 9.076773
##  3:         OLDER FAMILIES          Premium 9.071717
##  4:         YOUNG FAMILIES           Budget 8.722995
##  5:         YOUNG FAMILIES          Premium 8.716013
##  6:         YOUNG FAMILIES       Mainstream 8.638361
##  7:  OLDER SINGLES/COUPLES           Budget 6.781398
##  8:  OLDER SINGLES/COUPLES          Premium 6.769543
##  9:  OLDER SINGLES/COUPLES       Mainstream 6.712021
## 10: MIDAGE SINGLES/COUPLES       Mainstream 6.432080
## 11:               RETIREES           Budget 6.141847
## 12:               RETIREES          Premium 6.103358
## 13: MIDAGE SINGLES/COUPLES          Premium 6.078514
## 14: MIDAGE SINGLES/COUPLES           Budget 6.026459
## 15:               RETIREES       Mainstream 5.925920
## 16:           NEW FAMILIES       Mainstream 4.891566
## 17:           NEW FAMILIES           Budget 4.821527
## 18:           NEW FAMILIES          Premium 4.815652
## 19:  YOUNG SINGLES/COUPLES       Mainstream 4.575597
## 20:  YOUNG SINGLES/COUPLES          Premium 4.264113
## 21:  YOUNG SINGLES/COUPLES           Budget 4.250069
##                  LIFESTAGE PREMIUM_CUSTOMER      AVG
ggplot(avg_units, aes(x = LIFESTAGE, y = AVG, fill = PREMIUM_CUSTOMER)) +
  geom_bar(stat = "identity", position = "dodge") +   
  labs(x = "LIFESTAGE", y = "Units per Customer", title = "Average Units per Customer by LIFESTAGE and PREMIUM_CUSTOMER") +
  theme(axis.text.x = element_text(angle = 45, hjust = .7))

Older families and young families in general buy more chips per customer.

Let’s also investigate the average price per unit of chips bought by each customer segment, as this is also a driver of total sales.

#### Calculate the average price per unit sold (average sale price) 
#### by LIFESTAGE and PREMIUM_CUSTOMER

avg_price <- data[, mean(TOT_SALES/PROD_QTY), by = .(LIFESTAGE,PREMIUM_CUSTOMER)]
avg_price[order(-V1)]
##                  LIFESTAGE PREMIUM_CUSTOMER       V1
##  1:  YOUNG SINGLES/COUPLES       Mainstream 4.065642
##  2: MIDAGE SINGLES/COUPLES       Mainstream 3.994241
##  3:               RETIREES           Budget 3.924404
##  4:               RETIREES          Premium 3.920942
##  5:           NEW FAMILIES           Budget 3.917688
##  6:           NEW FAMILIES       Mainstream 3.916133
##  7:  OLDER SINGLES/COUPLES          Premium 3.893182
##  8:  OLDER SINGLES/COUPLES           Budget 3.882096
##  9:           NEW FAMILIES          Premium 3.872110
## 10:               RETIREES       Mainstream 3.844294
## 11:  OLDER SINGLES/COUPLES       Mainstream 3.814665
## 12: MIDAGE SINGLES/COUPLES          Premium 3.770698
## 13:         YOUNG FAMILIES          Premium 3.762150
## 14:         YOUNG FAMILIES           Budget 3.760737
## 15:         OLDER FAMILIES           Budget 3.745340
## 16: MIDAGE SINGLES/COUPLES           Budget 3.743328
## 17:         OLDER FAMILIES       Mainstream 3.737077
## 18:         YOUNG FAMILIES       Mainstream 3.724533
## 19:         OLDER FAMILIES          Premium 3.717000
## 20:  YOUNG SINGLES/COUPLES          Premium 3.665414
## 21:  YOUNG SINGLES/COUPLES           Budget 3.657366
##                  LIFESTAGE PREMIUM_CUSTOMER       V1
ggplot(avg_price, aes(x = LIFESTAGE, y = V1, fill = PREMIUM_CUSTOMER)) +
  geom_bar(stat = "identity", position = "dodge") +   
  labs(x = "LIFESTAGE", y = "Average price per unit", 
       title = "Average price per unit chips bought by LIFESTAGE and PREMIUM_CUSTOMER") +
  theme(axis.text.x = element_text(angle = 45, hjust = .7))

Mainstream midage and young singles/couples are willing to pay more per packet of chips compared to their budget and premium counterparts. This may be due to premium shoppers being more likely to buy healthy snacks, and when they buy chips, it is mainly for entertainment purposes rather than their own consumption. This is also supported by there being fewer premium midage and young singles/couples buying chips compared to their mainstream counterparts.

As the difference in average price per unit isn’t large, we can test whether this difference is statistically significant.

Let mu_m be the mean unit price of mainstream midage and young singles/couples, and let mu_pb be the mean unit price of premium and budget midage and young singles/couples.

H0: true difference in mu_m and mu_pb is equal to 0

Ha: true difference in mu_m and mu_pb is greater than 0

#### Perform an independent t-test between mainstream vs premium and budget midage
#### and young singles and couples

# create a new column with the unit price for each row
# note that := adds the column to 'data' by reference, so pricePerUnit and data refer to the same table
pricePerUnit <- data[, price := TOT_SALES/PROD_QTY]

# Subset the data for each sample
mainstream <- pricePerUnit[PREMIUM_CUSTOMER %in% c("Mainstream") &
                           LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES")]
other <- pricePerUnit[PREMIUM_CUSTOMER %in% c("Budget", "Premium") &
                      LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES")]

# Perform a t-test
# t.test will take care of calculating the sample mean for each sample, 
# just input the price vector for each sample
result <- t.test(mainstream$price, other$price, alternative = "greater")
result
## 
##  Welch Two Sample t-test
## 
## data:  mainstream$price and other$price
## t = 37.624, df = 54791, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.3187234       Inf
## sample estimates:
## mean of x mean of y 
##  4.039786  3.706491
# Extract the p-value
p_value <- result$p.value
p_value
## [1] 3.483677e-306
# # Check if the p-value is below the desired significance level (e.g., 0.05)
# if (p_value < 0.05) {
#   print("The unit price for mainstream, young and mid-age singles and couples is significantly higher.")
# } else {
#   print("The unit price for mainstream, young and mid-age singles and couples is not significantly higher.")
# }

The t-test results in a p-value of 3.483677e-306, i.e. the unit price for mainstream young and midage singles and couples is significantly higher than that of budget or premium young and midage singles and couples.

Deep dive into specific customer segments for insights

We have found quite a few interesting insights that we can dive deeper into.

We might want to target customer segments that contribute the most to sales to retain them or further increase sales. Let’s look at Mainstream - young singles/couples.

For instance, let’s find out if they tend to buy a particular brand of chips compared to other segments.

#### Deep dive into Mainstream, young singles/couples

# Are there brands that this customer segment prefers more than others?
# We could use affinity analysis or apriori analysis.


# 'data' is a transactional data set where each row is a transaction


# affinity analysis

# subset two relevant groups to compare
# this is the segment of interest
segment1 <- data[PREMIUM_CUSTOMER == "Mainstream" &
                 LIFESTAGE == "YOUNG SINGLES/COUPLES", ]
# the rest of population serves as the baseline comparison
Other <- data[!(PREMIUM_CUSTOMER == "Mainstream" &
                 LIFESTAGE == "YOUNG SINGLES/COUPLES"), ]

# how many units did segment1 buy?
quantity_segment1 <- segment1[, sum(PROD_QTY)]
# how many units did the rest buy?
quantity_other <- Other[, sum(PROD_QTY)]

# calculate the proportion of units of each brand bought by segment1
quantity_segment1_byBrand <- segment1[, .(TargetSeg = sum(PROD_QTY)/quantity_segment1), by = BRAND]
# calculate the proportion of units of each brand bought by the rest of the population
quantity_Other_byBrand <- Other[, .(otherSegs = sum(PROD_QTY)/quantity_other), by = BRAND]

# calculate and compare to see which brands segment1 particularly tends to buy compared to other segments.
brand_proportions <- merge(quantity_segment1_byBrand,quantity_Other_byBrand)[, AffinityToBrand := TargetSeg/otherSegs] 
brand_proportions[order(-AffinityToBrand)]
##          BRAND   TargetSeg   otherSegs AffinityToBrand
##  1:   Tyrrells 0.031552795 0.025692464       1.2280953
##  2:   Twisties 0.046183575 0.037876520       1.2193194
##  3:    Doritos 0.122760524 0.101074684       1.2145526
##  4:     Kettle 0.197984817 0.165553442       1.1958967
##  5:   Tostitos 0.045410628 0.037977861       1.1957131
##  6:     Infzns 0.014934438 0.012573300       1.1877898
##  7:   Pringles 0.119420290 0.100634769       1.1866703
##  8:      Grain 0.029123533 0.025121265       1.1593180
##  9:       Cobs 0.044637681 0.039048861       1.1431238
## 10:  Infuzions 0.049744651 0.044491379       1.1180739
## 11:      Thins 0.060372671 0.056986370       1.0594230
## 12:   Cheezels 0.017971014 0.018646902       0.9637534
## 13:     Smiths 0.096369910 0.124583692       0.7735355
## 14:     French 0.003947550 0.005758060       0.6855694
## 15:    Cheetos 0.008033126 0.012066591       0.6657329
## 16:        RRD 0.043809524 0.067493678       0.6490908
## 17:    Natural 0.015955832 0.024980768       0.6387246
## 18:        NCC 0.003643892 0.005873221       0.6204248
## 19:        CCs 0.011180124 0.018895650       0.5916771
## 20:    GrnWves 0.003588682 0.006066692       0.5915385
## 21:   Sunbites 0.006349206 0.012580210       0.5046980
## 22: Woolworths 0.024099379 0.049427188       0.4875733
## 23:     Burger 0.002926156 0.006596434       0.4435967
##          BRAND   TargetSeg   otherSegs AffinityToBrand

We can see that:

  • Mainstream young singles/couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population

  • Mainstream young singles/couples are 56% less likely to purchase Burger Rings compared to the rest of the population

Let’s also find out if our target segment tends to buy larger packs of chips.

#### Preferred pack size compared to the rest of the population

# Over to you! Do the same for pack size.
# calculate the proportion of units of each pack size bought by segment1
quantity_segment1_byPackSize <- segment1[, .(TargetSeg = sum(PROD_QTY)/quantity_segment1), by = PACK_SIZE]
# calculate the proportion of units of each pack size bought by the rest of the population
quantity_Other_byPackSize <- Other[, .(otherSegs = sum(PROD_QTY)/quantity_other), by = PACK_SIZE]

# calculate and compare to see which pack sizes segment1 particularly tends to buy compared to other segments
packSize_proportions <- merge(quantity_segment1_byPackSize,quantity_Other_byPackSize)[, AffinityToPack := TargetSeg/otherSegs] 
packSize_proportions[order(-AffinityToPack)]
##     PACK_SIZE   TargetSeg   otherSegs AffinityToPack
##  1:       270 0.031828847 0.025095929      1.2682873
##  2:       380 0.032160110 0.025584213      1.2570295
##  3:       330 0.061283644 0.050161917      1.2217166
##  4:       134 0.119420290 0.100634769      1.1866703
##  5:       110 0.106280193 0.089791190      1.1836372
##  6:       210 0.029123533 0.025121265      1.1593180
##  7:       135 0.014768806 0.013075403      1.1295106
##  8:       250 0.014354727 0.012780590      1.1231662
##  9:       170 0.080772947 0.080985964      0.9973697
## 10:       150 0.157598344 0.163420656      0.9643722
## 11:       175 0.254989648 0.270006956      0.9443818
## 12:       165 0.055652174 0.062267662      0.8937572
## 13:       190 0.007481021 0.012442016      0.6012708
## 14:       180 0.003588682 0.006066692      0.5915385
## 15:       160 0.006404417 0.012372920      0.5176157
## 16:        90 0.006349206 0.012580210      0.5046980
## 17:       125 0.003008972 0.006036750      0.4984423
## 18:       200 0.008971705 0.018656115      0.4808989
## 19:        70 0.003036577 0.006322350      0.4802924
## 20:       220 0.002926156 0.006596434      0.4435967
  • Mainstream young singles/couples are 27% more likely to purchase a 270g pack size compared to the rest of the population

  • Mainstream young singles/couples are 56% less likely to purchase pack size of 220g compared to the rest of the population

Let’s dive into what brands sell this pack size.

data[PACK_SIZE == 270, unique(PROD_NAME)]
## [1] "Twisties Cheese     270g" "Twisties Chicken270g"

Twisties is the only brand that offers a 270g pack size. Twisties also comes second in our brand affinity analysis, which helps explain why our target segment buys this pack size so often.

Conclusion

Let’s recap what we’ve found! This may serve as a precursor to more formal testing and analysis.

  • Sales have mainly been driven by Budget - older families, Mainstream - young singles/couples, and Mainstream - retirees shoppers.
  • Although Budget - older families accounts for the largest proportion of sales (8.7%), the number of customers in this segment (6.5%) is not as large as in the Mainstream - young singles/couples (11.1%) and Mainstream - retirees (8.9%) segments.
  • This implies that sales could be further improved in the Mainstream - young singles/couples and Mainstream - retirees segments.
  • We also found that Mainstream - young singles/couples are among the segments that buy the fewest units of chips per customer. However, they are willing to pay the most per packet.
  • A notable mention is Mainstream - midage singles/couples, who are likewise willing to pay more per packet than their budget and premium counterparts.
  • This implies that Mainstream - young singles/couples prefer higher-quality chips and/or more popular brands. Their most preferred brands are Tyrrells, Twisties, Doritos, Kettle, and Tostitos.
  • Mainstream - young singles/couples are 27% more likely to purchase a 270g pack size.
  • 270g seems to be the ideal pack size for Mainstream - young singles/couples; however, Twisties is the only brand in this store offering that pack size. Stocking more brands in a 270g pack size is a potential step towards more sales.
  • Overall, we suggest an emphasis on the Mainstream - young singles/couples segment. One suggestion would be to place Tyrrells chips near areas of the store that this target segment tends to visit the most.