Note that you will need to install these packages if you have not used them before.
#### Example code to install packages
#install.packages("data.table")
#### Load required libraries
library(data.table)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.1
library(ggmosaic)
library(readr)
library(skimr)
library(here)
## here() starts at /Users/NguyenTran/Documents/GitHub/Quantium-purchasing-behavior-project
# library(dplyr)
#### Import data
transactionData <- fread(here("data","QVI_transaction_data.csv"))
customerData <- fread(here("data","QVI_purchase_behaviour.csv"))
The first step in any analysis is to understand the data, so let's take a look at each of the data sets provided.
We can use str() to look at the format of each column and see a sample of the data. As we have read the dataset in as a data.table object, we can also run transactionData in the console to see a sample of the data, or use head(transactionData) to look at the first six rows.
Let’s check if columns we would expect to be numeric are in numeric form and date columns are in date format.
#### Examine transaction data
str(transactionData)
## Classes 'data.table' and 'data.frame': 264836 obs. of 8 variables:
## $ DATE : int 43390 43599 43605 43329 43330 43604 43601 43601 43332 43330 ...
## $ STORE_NBR : int 1 1 1 2 2 4 4 4 5 7 ...
## $ LYLTY_CARD_NBR: int 1000 1307 1343 2373 2426 4074 4149 4196 5026 7150 ...
## $ TXN_ID : int 1 348 383 974 1038 2982 3333 3539 4525 6900 ...
## $ PROD_NBR : int 5 66 61 69 108 57 16 24 42 52 ...
## $ PROD_NAME : chr "Natural Chip Compny SeaSalt175g" "CCs Nacho Cheese 175g"
"Smiths Crinkle Cut Chips Chicken 170g" "Smiths Chip Thinly S/Cream&Onion 175g"
...
## $ PROD_QTY : int 2 3 2 5 3 1 1 1 1 2 ...
## $ TOT_SALES : num 6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(transactionData)
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR
## 1: 43390 1 1000 1 5
## 2: 43599 1 1307 348 66
## 3: 43605 1 1343 383 61
## 4: 43329 2 2373 974 69
## 5: 43330 2 2426 1038 108
## 6: 43604 4 4074 2982 57
## PROD_NAME PROD_QTY TOT_SALES
## 1: Natural Chip Compny SeaSalt175g 2 6.0
## 2: CCs Nacho Cheese 175g 3 6.3
## 3: Smiths Crinkle Cut Chips Chicken 170g 2 2.9
## 4: Smiths Chip Thinly S/Cream&Onion 175g 5 15.0
## 5: Kettle Tortilla ChpsHny&Jlpno Chili 150g 3 13.8
## 6: Old El Paso Salsa Dip Tomato Mild 300g 1 5.1
summary(transactionData)
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID
## Min. :43282 Min. : 1.0 Min. : 1000 Min. : 1
## 1st Qu.:43373 1st Qu.: 70.0 1st Qu.: 70021 1st Qu.: 67602
## Median :43464 Median :130.0 Median : 130358 Median : 135138
## Mean :43464 Mean :135.1 Mean : 135550 Mean : 135158
## 3rd Qu.:43555 3rd Qu.:203.0 3rd Qu.: 203094 3rd Qu.: 202701
## Max. :43646 Max. :272.0 Max. :2373711 Max. :2415841
## PROD_NBR PROD_NAME PROD_QTY TOT_SALES
## Min. : 1.00 Length:264836 Min. : 1.000 Min. : 1.500
## 1st Qu.: 28.00 Class :character 1st Qu.: 2.000 1st Qu.: 5.400
## Median : 56.00 Mode :character Median : 2.000 Median : 7.400
## Mean : 56.58 Mean : 1.907 Mean : 7.304
## 3rd Qu.: 85.00 3rd Qu.: 2.000 3rd Qu.: 9.200
## Max. :114.00 Max. :200.000 Max. :650.000
# expected numeric columns are indeed numeric
We can see that the date column is in an integer format. Let’s change this to a date format.
#### Convert DATE column to a date format
#### A quick search online tells us that CSV and Excel integer dates begin on 30
#### Dec 1899
transactionData$DATE <- as.Date(transactionData$DATE, origin = "1899-12-30")
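To verify the conversion worked (a quick sketch), we can check that the dates are now of class Date and span the expected period of 1 Jul 2018 to 30 Jun 2019.
#### Check the converted DATE column
class(transactionData$DATE)
range(transactionData$DATE)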
We should check that we are looking at the right products by examining PROD_NAME.
#### Examine PROD_NAME by generating a summary of the PROD_NAME column
levels(as.factor(transactionData$PROD_NAME))
## [1] "Burger Rings 220g"
## [2] "CCs Nacho Cheese 175g"
## [3] "CCs Original 175g"
## [4] "CCs Tasty Cheese 175g"
## [5] "Cheetos Chs & Bacon Balls 190g"
## [6] "Cheetos Puffs 165g"
## [7] "Cheezels Cheese 330g"
## [8] "Cheezels Cheese Box 125g"
## [9] "Cobs Popd Sea Salt Chips 110g"
## [10] "Cobs Popd Sour Crm &Chives Chips 110g"
## [11] "Cobs Popd Swt/Chlli &Sr/Cream Chips 110g"
## [12] "Dorito Corn Chp Supreme 380g"
## [13] "Doritos Cheese Supreme 330g"
## [14] "Doritos Corn Chip Mexican Jalapeno 150g"
## [15] "Doritos Corn Chip Southern Chicken 150g"
## [16] "Doritos Corn Chips Cheese Supreme 170g"
## [17] "Doritos Corn Chips Nacho Cheese 170g"
## [18] "Doritos Corn Chips Original 170g"
## [19] "Doritos Mexicana 170g"
## [20] "Doritos Salsa Medium 300g"
## [21] "Doritos Salsa Mild 300g"
## [22] "French Fries Potato Chips 175g"
## [23] "Grain Waves Sweet Chilli 210g"
## [24] "Grain Waves Sour Cream&Chives 210G"
## [25] "GrnWves Plus Btroot & Chilli Jam 180g"
## [26] "Infuzions BBQ Rib Prawn Crackers 110g"
## [27] "Infuzions Mango Chutny Papadums 70g"
## [28] "Infuzions SourCream&Herbs Veg Strws 110g"
## [29] "Infuzions Thai SweetChili PotatoMix 110g"
## [30] "Infzns Crn Crnchers Tangy Gcamole 110g"
## [31] "Kettle 135g Swt Pot Sea Salt"
## [32] "Kettle Chilli 175g"
## [33] "Kettle Honey Soy Chicken 175g"
## [34] "Kettle Mozzarella Basil & Pesto 175g"
## [35] "Kettle Original 175g"
## [36] "Kettle Sea Salt And Vinegar 175g"
## [37] "Kettle Sensations BBQ&Maple 150g"
## [38] "Kettle Sensations Camembert & Fig 150g"
## [39] "Kettle Sensations Siracha Lime 150g"
## [40] "Kettle Sweet Chilli And Sour Cream 175g"
## [41] "Kettle Tortilla ChpsBtroot&Ricotta 150g"
## [42] "Kettle Tortilla ChpsFeta&Garlic 150g"
## [43] "Kettle Tortilla ChpsHny&Jlpno Chili 150g"
## [44] "Natural Chip Compny SeaSalt175g"
## [45] "Natural Chip Co Tmato Hrb&Spce 175g"
## [46] "Natural ChipCo Hony Soy Chckn175g"
## [47] "Natural ChipCo Sea Salt & Vinegr 175g"
## [48] "NCC Sour Cream & Garden Chives 175g"
## [49] "Old El Paso Salsa Dip Chnky Tom Ht300g"
## [50] "Old El Paso Salsa Dip Tomato Med 300g"
## [51] "Old El Paso Salsa Dip Tomato Mild 300g"
## [52] "Pringles Barbeque 134g"
## [53] "Pringles Chicken Salt Crips 134g"
## [54] "Pringles Mystery Flavour 134g"
## [55] "Pringles Original Crisps 134g"
## [56] "Pringles Slt Vingar 134g"
## [57] "Pringles SourCream Onion 134g"
## [58] "Pringles Sthrn FriedChicken 134g"
## [59] "Pringles Sweet&Spcy BBQ 134g"
## [60] "Red Rock Deli Chikn&Garlic Aioli 150g"
## [61] "Red Rock Deli Sp Salt & Truffle 150G"
## [62] "Red Rock Deli SR Salsa & Mzzrlla 150g"
## [63] "Red Rock Deli Thai Chilli&Lime 150g"
## [64] "RRD Chilli& Coconut 150g"
## [65] "RRD Honey Soy Chicken 165g"
## [66] "RRD Lime & Pepper 165g"
## [67] "RRD Pc Sea Salt 165g"
## [68] "RRD Salt & Vinegar 165g"
## [69] "RRD SR Slow Rst Pork Belly 150g"
## [70] "RRD Steak & Chimuchurri 150g"
## [71] "RRD Sweet Chilli & Sour Cream 165g"
## [72] "Smith Crinkle Cut Bolognese 150g"
## [73] "Smith Crinkle Cut Mac N Cheese 150g"
## [74] "Smiths Chip Thinly Cut Original 175g"
## [75] "Smiths Chip Thinly CutSalt/Vinegr175g"
## [76] "Smiths Chip Thinly S/Cream&Onion 175g"
## [77] "Smiths Crinkle Original 330g"
## [78] "Smiths Crinkle Chips Salt & Vinegar 330g"
## [79] "Smiths Crinkle Cut Chips Barbecue 170g"
## [80] "Smiths Crinkle Cut Chips Chicken 170g"
## [81] "Smiths Crinkle Cut Chips Chs&Onion170g"
## [82] "Smiths Crinkle Cut Chips Original 170g"
## [83] "Smiths Crinkle Cut French OnionDip 150g"
## [84] "Smiths Crinkle Cut Salt & Vinegar 170g"
## [85] "Smiths Crinkle Cut Snag&Sauce 150g"
## [86] "Smiths Crinkle Cut Tomato Salsa 150g"
## [87] "Smiths Crnkle Chip Orgnl Big Bag 380g"
## [88] "Smiths Thinly Swt Chli&S/Cream175G"
## [89] "Smiths Thinly Cut Roast Chicken 175g"
## [90] "Snbts Whlgrn Crisps Cheddr&Mstrd 90g"
## [91] "Sunbites Whlegrn Crisps Frch/Onin 90g"
## [92] "Thins Chips Originl saltd 175g"
## [93] "Thins Chips Light& Tangy 175g"
## [94] "Thins Chips Salt & Vinegar 175g"
## [95] "Thins Chips Seasonedchicken 175g"
## [96] "Thins Potato Chips Hot & Spicy 175g"
## [97] "Tostitos Lightly Salted 175g"
## [98] "Tostitos Smoked Chipotle 175g"
## [99] "Tostitos Splash Of Lime 175g"
## [100] "Twisties Cheese 270g"
## [101] "Twisties Cheese Burger 250g"
## [102] "Twisties Chicken270g"
## [103] "Tyrrells Crisps Ched & Chives 165g"
## [104] "Tyrrells Crisps Lightly Salted 165g"
## [105] "Woolworths Cheese Rings 190g"
## [106] "Woolworths Medium Salsa 300g"
## [107] "Woolworths Mild Salsa 300g"
## [108] "WW Crinkle Cut Chicken 175g"
## [109] "WW Crinkle Cut Original 175g"
## [110] "WW D/Style Chip Sea Salt 200g"
## [111] "WW Original Corn Chips 200g"
## [112] "WW Original Stacked Chips 160g"
## [113] "WW Sour Cream &OnionStacked Chips 160g"
## [114] "WW Supreme Cheese Corn Chips 200g"
prod_counts <- table(transactionData$PROD_NAME)
# count of unique products
length(unique(transactionData$PROD_NAME))
## [1] 114
# most common product
names(prod_counts)[which.max(prod_counts)]
## [1] "Kettle Mozzarella Basil & Pesto 175g"
# freq of the most common product
max(prod_counts)
## [1] 3304
# frequency of occurrence for the top ten products
head(sort(prod_counts,decreasing = TRUE),n = 10)
##
## Kettle Mozzarella Basil & Pesto 175g
## 3304
## Kettle Tortilla ChpsHny&Jlpno Chili 150g
## 3296
## Cobs Popd Swt/Chlli &Sr/Cream Chips 110g
## 3269
## Tyrrells Crisps Ched & Chives 165g
## 3268
## Cobs Popd Sea Salt Chips 110g
## 3265
## Kettle 135g Swt Pot Sea Salt
## 3257
## Tostitos Splash Of Lime 175g
## 3252
## Infuzions Thai SweetChili PotatoMix 110g
## 3242
## Smiths Crnkle Chip Orgnl Big Bag 380g
## 3233
## Thins Potato Chips Hot & Spicy 175g
## 3229
# sort(summary.factor(transactionData[, PROD_NAME]), decreasing = TRUE)
Looks like we are definitely looking at potato chips but how can we check that these are all chips? We can do some basic text analysis by summarizing the individual words in the product name.
#### Examine the words in PROD_NAME to see if there are any incorrect entries
#### such as products that are not chips
# split the unique product names, unique(transactionData[, PROD_NAME]), on tab characters
# then unlist the list of vectors into a single character vector
# and finally convert it into a data.table object
# (product names contain no tabs, so each name stays whole here; individual words are split out later)
productWords <- data.table(unlist(strsplit(unique(transactionData[, PROD_NAME]), "\t")))
setnames(productWords, 'words')
productWords
## words
## 1: Natural Chip Compny SeaSalt175g
## 2: CCs Nacho Cheese 175g
## 3: Smiths Crinkle Cut Chips Chicken 170g
## 4: Smiths Chip Thinly S/Cream&Onion 175g
## 5: Kettle Tortilla ChpsHny&Jlpno Chili 150g
## ---
## 110: Red Rock Deli Chikn&Garlic Aioli 150g
## 111: RRD SR Slow Rst Pork Belly 150g
## 112: RRD Pc Sea Salt 165g
## 113: Smith Crinkle Cut Bolognese 150g
## 114: Doritos Salsa Mild 300g
# play around
head(unique(transactionData[, PROD_NAME]))
## [1] "Natural Chip Compny SeaSalt175g"
## [2] "CCs Nacho Cheese 175g"
## [3] "Smiths Crinkle Cut Chips Chicken 170g"
## [4] "Smiths Chip Thinly S/Cream&Onion 175g"
## [5] "Kettle Tortilla ChpsHny&Jlpno Chili 150g"
## [6] "Old El Paso Salsa Dip Tomato Mild 300g"
# play around
head(strsplit(unique(transactionData[, PROD_NAME]), "\t"))
## [[1]]
## [1] "Natural Chip Compny SeaSalt175g"
##
## [[2]]
## [1] "CCs Nacho Cheese 175g"
##
## [[3]]
## [1] "Smiths Crinkle Cut Chips Chicken 170g"
##
## [[4]]
## [1] "Smiths Chip Thinly S/Cream&Onion 175g"
##
## [[5]]
## [1] "Kettle Tortilla ChpsHny&Jlpno Chili 150g"
##
## [[6]]
## [1] "Old El Paso Salsa Dip Tomato Mild 300g"
# this is a list of vectors
# [[1]] means the first element in this list, [[2]] second element and so on
# [1] means the first element within the first vector
head(transactionData$PROD_NAME)
## [1] "Natural Chip Compny SeaSalt175g"
## [2] "CCs Nacho Cheese 175g"
## [3] "Smiths Crinkle Cut Chips Chicken 170g"
## [4] "Smiths Chip Thinly S/Cream&Onion 175g"
## [5] "Kettle Tortilla ChpsHny&Jlpno Chili 150g"
## [6] "Old El Paso Salsa Dip Tomato Mild 300g"
As we are only interested in words that tell us whether the product is chips or not, let's strip digits and special characters such as '&' from our set of product words. We can do this using gsub().
# Remove digits, and special characters, and then sort the distinct
# words by frequency of occurrence.
# str(productWords) to recall what object we are working with
#### Removing digits
# productWords is a data.table with one column 'words'
# the regex "\\d+.$" matches one or more digits followed by any single character
# (here the trailing 'g') at the end of the string
# the gsub function then replaces the matched pattern with an empty string
productWords[, words := gsub("\\d+.$", "", words)]
# note the := operator is specific to the data.table package.
# it is used to update or assign values to columns within a data.table object.
#### Removing special characters
productWords[, words := gsub("\\W+", " ", words)]
#### Let's look at the most common words by counting the number of times a word
# appears and sorting them by this frequency in order of highest to lowest frequency
sort(table(unlist(strsplit(unique(productWords[, words]), " "))), decreasing = TRUE)
##
## Chips Smiths Crinkle Cut Kettle
## 21 16 14 14 13
## Cheese Salt Original Chip Doritos
## 12 12 10 9 9
## Salsa Chicken Corn Cream Pringles
## 9 8 8 8 8
## RRD Chilli WW Sea Sour
## 8 7 7 6 6
## Crisps Thinly Thins Vinegar Chives
## 5 5 5 5 4
## Deli Infuzions Lime Natural Red
## 4 4 4 4 4
## Rock Supreme Sweet BBQ CCs
## 4 4 4 3 3
## Cobs Dip El Mild Old
## 3 3 3 3 3
## Onion Paso Popd Sensations Soy
## 3 3 3 3 3
## Swt Tomato Tortilla Tostitos Twisties
## 3 3 3 3 3
## Woolworths And Burger Cheetos Cheezels
## 3 2 2 2 2
## ChipCo Chs French Garlic Grain
## 2 2 2 2 2
## Honey Lightly Medium Nacho Potato
## 2 2 2 2 2
## Rings S Salted Smith SourCream
## 2 2 2 2 2
## SR Tangy Thai Tyrrells Vinegr
## 2 2 2 2 2
## Waves 135g Aioli Bacon Bag
## 2 1 1 1 1
## Balls Barbecue Barbeque Basil Belly
## 1 1 1 1 1
## Big Bolognese Box Btroot Camembert
## 1 1 1 1 1
## Chckn Ched Cheddr Chikn Chili
## 1 1 1 1 1
## Chimuchurri Chipotle Chli Chlli Chnky
## 1 1 1 1 1
## Chp ChpsBtroot ChpsFeta ChpsHny Chutny
## 1 1 1 1 1
## Co Coconut Compny Crackers Crips
## 1 1 1 1 1
## Crm Crn Crnchers Crnkle CutSalt
## 1 1 1 1 1
## D Dorito Fig Flavour Frch
## 1 1 1 1 1
## FriedChicken Fries Garden Gcamole GrnWves
## 1 1 1 1 1
## Herbs Hony Hot Hrb Ht
## 1 1 1 1 1
## Infzns Jalapeno Jam Jlpno Light
## 1 1 1 1 1
## Mac Mango Maple Med Mexican
## 1 1 1 1 1
## Mexicana Mozzarella Mstrd Mystery Mzzrlla
## 1 1 1 1 1
## N NCC Of Onin OnionDip
## 1 1 1 1 1
## OnionStacked Orgnl Originl Papadums Pc
## 1 1 1 1 1
## Pepper Pesto Plus Pork Pot
## 1 1 1 1 1
## PotatoMix Prawn Puffs Rib Ricotta
## 1 1 1 1 1
## Roast Rst saltd Sauce SeaSalt
## 1 1 1 1 1
## Seasonedchicken Siracha Slow Slt Smoked
## 1 1 1 1 1
## Snag Snbts Southern Sp Spce
## 1 1 1 1 1
## Spcy Spicy Splash Sr Stacked
## 1 1 1 1 1
## Steak Sthrn Strws Style Sunbites
## 1 1 1 1 1
## SweetChili Tasty Tmato Tom Truffle
## 1 1 1 1 1
## Veg Vingar Whlegrn Whlgrn
## 1 1 1 1
There are salsa products in the dataset but we are only interested in the chips category, so let’s remove these.
#### Remove salsa products
# 1) identify rows that are salsa products
# 2) remove these rows
# helpful to recall data.table structure
# DT[i, j, by]
# R:   i                 j                 by
# SQL: where | order by   select | update   group by
# create a new column 'SALSA' containing logical values
# to identify rows that have the word 'salsa' in PROD_NAME
# use tolower to standardize the word 'salsa'
transactionData[, SALSA := grepl("salsa", tolower(PROD_NAME))]
# [SALSA == FALSE, ] selects only rows that are not salsa products
# then [, SALSA := NULL] deletes the 'SALSA' column
transactionData <- transactionData[SALSA == FALSE, ][, SALSA := NULL]
Next, we can use summary() to check summary statistics such as mean, min and max values for each feature to see if there are any obvious outliers in the data and if there are any nulls in any of the columns (an NA's count will appear in the output if there are any nulls).
#### Summarise the data to check for nulls and possible outliers
summary(transactionData)
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID
## Min. :2018-07-01 Min. : 1.0 Min. : 1000 Min. : 1
## 1st Qu.:2018-09-30 1st Qu.: 70.0 1st Qu.: 70015 1st Qu.: 67569
## Median :2018-12-30 Median :130.0 Median : 130367 Median : 135183
## Mean :2018-12-30 Mean :135.1 Mean : 135531 Mean : 135131
## 3rd Qu.:2019-03-31 3rd Qu.:203.0 3rd Qu.: 203084 3rd Qu.: 202654
## Max. :2019-06-30 Max. :272.0 Max. :2373711 Max. :2415841
## PROD_NBR PROD_NAME PROD_QTY TOT_SALES
## Min. : 1.00 Length:246742 Min. : 1.000 Min. : 1.700
## 1st Qu.: 26.00 Class :character 1st Qu.: 2.000 1st Qu.: 5.800
## Median : 53.00 Mode :character Median : 2.000 Median : 7.400
## Mean : 56.35 Mean : 1.908 Mean : 7.321
## 3rd Qu.: 87.00 3rd Qu.: 2.000 3rd Qu.: 8.800
## Max. :114.00 Max. :200.000 Max. :650.000
There are no nulls in the columns, but product quantity appears to have an outlier which we should investigate further. Let's look at the case where 200 packets of chips are bought in one transaction.
#### Filter the dataset to find the outlier
transactionData[PROD_QTY == 200]
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR
## 1: 2018-08-19 226 226000 226201 4
## 2: 2019-05-20 226 226000 226210 4
## PROD_NAME PROD_QTY TOT_SALES
## 1: Dorito Corn Chp Supreme 380g 200 650
## 2: Dorito Corn Chp Supreme 380g 200 650
There are two transactions where 200 packets of chips are bought in one transaction and both of these transactions were by the same customer.
#### Let's see if the customer has had other transactions
transactionData[LYLTY_CARD_NBR == 226000]
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR
## 1: 2018-08-19 226 226000 226201 4
## 2: 2019-05-20 226 226000 226210 4
## PROD_NAME PROD_QTY TOT_SALES
## 1: Dorito Corn Chp Supreme 380g 200 650
## 2: Dorito Corn Chp Supreme 380g 200 650
It looks like this customer has only had the two transactions over the year and is not an ordinary retail customer. The customer might be buying chips for commercial purposes instead. We’ll remove this loyalty card number from further analysis.
#### Filter out the customer based on the loyalty card number
transactionData <- transactionData[LYLTY_CARD_NBR != 226000]
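As a quick sanity check (a sketch), we can confirm that no transactions remain for this loyalty card; the count below should be zero.
#### Check the outlier customer has been removed
transactionData[LYLTY_CARD_NBR == 226000, .N]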
#### Re-examine transaction data
summary(transactionData)
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID
## Min. :2018-07-01 Min. : 1.0 Min. : 1000 Min. : 1
## 1st Qu.:2018-09-30 1st Qu.: 70.0 1st Qu.: 70015 1st Qu.: 67569
## Median :2018-12-30 Median :130.0 Median : 130367 Median : 135182
## Mean :2018-12-30 Mean :135.1 Mean : 135530 Mean : 135130
## 3rd Qu.:2019-03-31 3rd Qu.:203.0 3rd Qu.: 203083 3rd Qu.: 202652
## Max. :2019-06-30 Max. :272.0 Max. :2373711 Max. :2415841
## PROD_NBR PROD_NAME PROD_QTY TOT_SALES
## Min. : 1.00 Length:246740 Min. :1.000 Min. : 1.700
## 1st Qu.: 26.00 Class :character 1st Qu.:2.000 1st Qu.: 5.800
## Median : 53.00 Mode :character Median :2.000 Median : 7.400
## Mean : 56.35 Mean :1.906 Mean : 7.316
## 3rd Qu.: 87.00 3rd Qu.:2.000 3rd Qu.: 8.800
## Max. :114.00 Max. :5.000 Max. :29.500
That’s better. Now, let’s look at the number of transaction lines over time to see if there are any obvious data issues such as missing data.
# this part demonstrates performing operations by group
#### Count the number of transactions by date
transaction_count <- transactionData[, .N, by = DATE]
transaction_count
## DATE N
## 1: 2018-10-17 682
## 2: 2019-05-14 705
## 3: 2019-05-20 707
## 4: 2018-08-17 663
## 5: 2018-08-18 683
## ---
## 360: 2018-12-08 622
## 361: 2019-01-30 689
## 362: 2019-02-09 671
## 363: 2018-08-31 658
## 364: 2019-02-12 684
There are only 364 rows, meaning only 364 dates, which indicates a missing date. Let's create a sequence of dates from 1 Jul 2018 to 30 Jun 2019 and use this to create a chart of the number of transactions over time to find the missing date.
#### Create a sequence of dates and join this onto the count of transactions by date
# create a column of dates that includes every day from 1 Jul 2018 to
# 30 Jun 2019, then join it onto the data to fill in the missing day.
# use this format seq(start_date, end_date, by = "day")
date_seq <- seq(as.Date("2018-7-1"), as.Date("2019-6-30"), by = "day")
# convert date_seq to data.table
date_seq_dt <- data.table(DATE = date_seq)
# perform a right join between transaction count and date sequence
# a right join is a type of database join operation that combines two tables based on a common column,
# while including all rows from the right table and matching rows from the left table
# right table in this case is date_seq_dt
transactions_by_day <- merge.data.table(transaction_count,date_seq_dt,
by = "DATE", all.y = TRUE)
# or we can do transaction_count[date_seq_dt, on = "DATE"] but this is a bit less intuitive
#### Setting plot themes to format graphs
theme_set(theme_bw())
theme_update(plot.title = element_text(hjust = 0.5))
#### Plot transactions by date
ggplot(transactions_by_day, aes(x = DATE, y = N)) +
geom_line() +
labs(x = "Day", y = "Number of transactions", title = "Transactions over time") +
scale_x_date(breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
We can see that there is an increase in purchases in December and a break in late December. Let’s zoom in on this.
#### Filter to December and look at individual days
# Over to you - recreate the chart above zoomed in to the relevant dates.
ggplot(transactions_by_day[DATE %like% "-12-", ], aes(x = DATE, y = N)) +
geom_line() +
labs(x = "Day", y = "Number of transactions", title = "Transactions over time") +
scale_x_date(breaks = "1 day") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
We can see that the increase in sales occurs in the lead-up to Christmas and that there are zero sales on Christmas day itself. This is due to shops being closed on Christmas day.
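As a quick check (a sketch using the transactions_by_day table created above), we can confirm which date has no transactions after the join; the only row with a missing count should be Christmas day.
#### Confirm the missing date
transactions_by_day[is.na(N)]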
Now that we are satisfied that the data no longer has outliers, we can move on to creating other features such as brand of chips or pack size from PROD_NAME. We will start with pack size.
#### Pack size
#### We can work this out by taking the digits that are in PROD_NAME
transactionData[, PACK_SIZE := parse_number(PROD_NAME)]
#### Always check your output
#### Let's check if the pack sizes look sensible
transactionData[, .N, PACK_SIZE][order(PACK_SIZE)]
## PACK_SIZE N
## 1: 70 1507
## 2: 90 3008
## 3: 110 22387
## 4: 125 1454
## 5: 134 25102
## 6: 135 3257
## 7: 150 40203
## 8: 160 2970
## 9: 165 15297
## 10: 170 19983
## 11: 175 66390
## 12: 180 1468
## 13: 190 2995
## 14: 200 4473
## 15: 210 6272
## 16: 220 1564
## 17: 250 3169
## 18: 270 6285
## 19: 330 12540
## 20: 380 6416
The largest size is 380g and the smallest size is 70g - seems sensible!
#### Let's plot a histogram of PACK_SIZE since we know that it is a categorical
# variable and not a continuous variable even though it is numeric.
# plot a histogram showing the number of transactions by pack size.
ggplot(transactionData, aes(x = PACK_SIZE)) +
geom_histogram(bins = 20)
Pack sizes created look reasonable.
Now to create brands, we can use the first word in PROD_NAME to work out the brand name…
#### Brands
# Over to you! Create a column which contains the brand of the product, by
# extracting it from the product name.
# Extract brand name from PROD_NAME
transactionData[, BRAND := sub("^([A-Za-z]+).*", "\\1", PROD_NAME)]
# '^([A-Za-z]+)' matches and captures one or more consecutive alphabetic characters at the start of the string.
# '.*' matches any remaining characters in the string.
# '\\1' is the replacement string that refers to the captured group (brand name).
# another way to write
# transactionData$Brand <- gsub("^(\\S+).*", "\\1", transactionData$PROD_NAME)
# another (long) way, using loop to iterate through PROD_NAME and apply a function to extract brand name
# transactionData[, BRAND := unlist(lapply(transactionData$PROD_NAME,
# FUN = function(x){unlist(strsplit(x," "))[1]}))]
#### Checking brands
# check to see if the results look reasonable.
head(transactionData$BRAND)
## [1] "Natural" "CCs" "Smiths" "Smiths" "Kettle" "Smiths"
Some of the brand names look like they refer to the same brand - such as RED and RRD, which are both Red Rock Deli chips. Let's combine these.
#### Clean brand names
transactionData[BRAND %in% c("RED", "Red"), BRAND := "RRD"]
transactionData[BRAND == "Dorito", BRAND := "Doritos"]
transactionData[BRAND == "Smith", BRAND := "Smiths"]
transactionData[BRAND == "Snbts", BRAND := "Sunbites"]
transactionData[BRAND == "WW", BRAND := "Woolworths"]
# transactionData[BRAND == "GrnWves", BRAND := "Sunbites"] maybe?
#### Check to see if the results look reasonable
table(transactionData$BRAND)
##
## Burger CCs Cheetos Cheezels Cobs Doritos French
## 1564 4551 2927 4603 9693 25224 1418
## Grain GrnWves Infuzions Infzns Kettle Natural NCC
## 6272 1468 11057 3144 41288 6050 1419
## Pringles RRD Smiths Sunbites Thins Tostitos Twisties
## 25102 16321 30353 3008 14075 9471 9454
## Tyrrells Woolworths
## 6442 11836
Now that we are happy with the transaction dataset, let’s have a look at the customer dataset.
#### Examining customer data
str(customerData)
## Classes 'data.table' and 'data.frame': 72637 obs. of 3 variables:
## $ LYLTY_CARD_NBR : int 1000 1002 1003 1004 1005 1007 1009 1010 1011 1012 ...
## $ LIFESTAGE : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG
FAMILIES" "OLDER SINGLES/COUPLES" ...
## $ PREMIUM_CUSTOMER: chr "Premium" "Mainstream" "Budget" "Mainstream" ...
## - attr(*, ".internal.selfref")=<externalptr>
skim(customerData)
| Name | customerData |
| Number of rows | 72637 |
| Number of columns | 3 |
| Key | NULL |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| LIFESTAGE | 0 | 1 | 8 | 22 | 0 | 7 | 0 |
| PREMIUM_CUSTOMER | 0 | 1 | 6 | 10 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| LYLTY_CARD_NBR | 0 | 1 | 136185.9 | 89892.93 | 1000 | 66202 | 134040 | 203375 | 2373711 | ▇▁▁▁▁ |
# distributions of columns
sort(table(customerData$LIFESTAGE), decreasing = TRUE)
##
## RETIREES OLDER SINGLES/COUPLES YOUNG SINGLES/COUPLES
## 14805 14609 14441
## OLDER FAMILIES YOUNG FAMILIES MIDAGE SINGLES/COUPLES
## 9780 9178 7275
## NEW FAMILIES
## 2549
sort(table(customerData$PREMIUM_CUSTOMER), decreasing = TRUE)
##
## Mainstream Budget Premium
## 29245 24470 18922
#### Merge transaction data to customer data
data <- merge(transactionData, customerData, all.x = TRUE)
As the number of rows in data is the same as that of transactionData, we can be sure that no duplicates were created. This is because we created data by setting all.x = TRUE (in other words, a left join), which keeps every row of transactionData and attaches the matching customer details from customerData via the shared LYLTY_CARD_NBR column.
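A one-line check of this (a sketch) is to compare the row counts of the two tables directly; this should return TRUE.
nrow(data) == nrow(transactionData)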
Let's also check whether any transactions did not match to a customer by checking for nulls.
# check if any transactions did not have a matched customer.
skim(data)
| Name | data |
| Number of rows | 246740 |
| Number of columns | 12 |
| Key | LYLTY_CARD_NBR |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 1 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| PROD_NAME | 0 | 1 | 17 | 40 | 0 | 105 | 0 |
| BRAND | 0 | 1 | 3 | 10 | 0 | 23 | 0 |
| LIFESTAGE | 0 | 1 | 8 | 22 | 0 | 7 | 0 |
| PREMIUM_CUSTOMER | 0 | 1 | 6 | 10 | 0 | 3 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| DATE | 0 | 1 | 2018-07-01 | 2019-06-30 | 2018-12-30 | 364 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| LYLTY_CARD_NBR | 0 | 1 | 135530.25 | 80715.20 | 1000.0 | 70015.00 | 130367.0 | 203083.2 | 2373711.0 | ▇▁▁▁▁ |
| STORE_NBR | 0 | 1 | 135.05 | 76.79 | 1.0 | 70.00 | 130.0 | 203.0 | 272.0 | ▆▇▇▇▇ |
| TXN_ID | 0 | 1 | 135130.36 | 78147.60 | 1.0 | 67568.75 | 135181.5 | 202652.2 | 2415841.0 | ▇▁▁▁▁ |
| PROD_NBR | 0 | 1 | 56.35 | 33.70 | 1.0 | 26.00 | 53.0 | 87.0 | 114.0 | ▇▇▆▇▇ |
| PROD_QTY | 0 | 1 | 1.91 | 0.34 | 1.0 | 2.00 | 2.0 | 2.0 | 5.0 | ▁▇▁▁▁ |
| TOT_SALES | 0 | 1 | 7.32 | 2.47 | 1.7 | 5.80 | 7.4 | 8.8 | 29.5 | ▆▇▁▁▁ |
| PACK_SIZE | 0 | 1 | 175.58 | 59.43 | 70.0 | 150.00 | 170.0 | 175.0 | 380.0 | ▁▇▁▁▁ |
# to only specifically check for nulls
sapply(data, function(x) sum(is.na(x)))
## LYLTY_CARD_NBR DATE STORE_NBR TXN_ID
## 0 0 0 0
## PROD_NBR PROD_NAME PROD_QTY TOT_SALES
## 0 0 0 0
## PACK_SIZE BRAND LIFESTAGE PREMIUM_CUSTOMER
## 0 0 0 0
Great, there are no nulls! So all the customers in the transaction data have been accounted for in the customer dataset.
Note that if you are continuing with Task 2, you may want to retain this dataset, which you can write out as a CSV:
fwrite(data, here("data","QVI_data.csv"))
Data exploration is now complete!
Now that the data is ready for analysis, we can define some metrics of interest to the client: total sales by customer segment, the number of customers in each segment, the average number of chip units bought per customer, and the average price per unit of chips.
We could also ask our data team for more information where needed.
Let’s start with calculating total sales by LIFESTAGE and PREMIUM_CUSTOMER and plotting the split by these segments to describe which customer segment contributes most to chip sales.
#### calculate Total sales by LIFESTAGE and PREMIUM_CUSTOMER and create a plot
sales <- data[, .(SALES = sum(TOT_SALES)), .(LIFESTAGE, PREMIUM_CUSTOMER)]
#### Create plot
p <- ggplot(data = sales) +
# weight = SALES means that the size or area of each mosaic tile will be proportional to the corresponding sales value.
# product() is used to create a combined factor variable that represents different combinations of PREMIUM_CUSTOMER and LIFESTAGE
geom_mosaic(aes(weight = SALES, x = product(PREMIUM_CUSTOMER, LIFESTAGE), fill = PREMIUM_CUSTOMER)) +
labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# The labels are generated by calculating the proportion of each tile's weight (SALES)
# relative to the sum of all weights in the plot.
p + geom_text(data = ggplot_build(p)$data[[1]],
# ggplot_build(p)$data is a list of layers that contains the data used to create the layers in a ggplot object p.
aes(x = (xmin + xmax)/2 , # this is to position the text
y = (ymin + ymax)/2, # in the middle of each tile
label = as.character(paste(round(.wt/sum(.wt),3)*100, '%'))))
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
## Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Note that the percentage of each tile is relative to the total sales.
Sales are coming mainly from Budget - older families, Mainstream - young singles/couples, and Mainstream - retirees.
Let’s see if the higher sales are due to there being more customers who buy chips.
#### Calculate Number of customers by LIFESTAGE and PREMIUM_CUSTOMER and create a plot.
# uniqueN is equivalent to length(unique(x))
customers <- data[, .(CUSTOMERS = uniqueN(LYLTY_CARD_NBR)), .(LIFESTAGE, PREMIUM_CUSTOMER)][order(-CUSTOMERS)]
p <- ggplot(data = customers) +
geom_mosaic(aes(weight = CUSTOMERS, x = product(PREMIUM_CUSTOMER,LIFESTAGE), fill = PREMIUM_CUSTOMER)) +
labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of customers") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p + geom_text(data = ggplot_build(p)$data[[1]],
aes(x = (xmin + xmax)/2 ,
y = (ymin + ymax)/2,
label = as.character(paste(round(.wt/sum(.wt),3)*100,'%'))))
# .wt refers to the weight variable specified in 'weight' aesthetic of the geom_mosaic() layer
ggplot(customers, aes(x = LIFESTAGE, y = CUSTOMERS, fill = PREMIUM_CUSTOMER)) +
geom_bar(position = "dodge", stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
There are more Mainstream - young singles/couples and Mainstream - retirees who buy chips. This contributes to there being more sales to these customer segments but this is not a major driver for the Budget - Older families segment.
Higher sales may also be driven by more units of chips being bought per customer. Let’s have a look at this next.
#### Average number of units per customer by LIFESTAGE and PREMIUM_CUSTOMER
avg_units <- data[, .(AVG = sum(PROD_QTY)/uniqueN(LYLTY_CARD_NBR)),.(LIFESTAGE, PREMIUM_CUSTOMER)][order(-AVG)]
avg_units
## LIFESTAGE PREMIUM_CUSTOMER AVG
## 1: OLDER FAMILIES Mainstream 9.255380
## 2: OLDER FAMILIES Budget 9.076773
## 3: OLDER FAMILIES Premium 9.071717
## 4: YOUNG FAMILIES Budget 8.722995
## 5: YOUNG FAMILIES Premium 8.716013
## 6: YOUNG FAMILIES Mainstream 8.638361
## 7: OLDER SINGLES/COUPLES Budget 6.781398
## 8: OLDER SINGLES/COUPLES Premium 6.769543
## 9: OLDER SINGLES/COUPLES Mainstream 6.712021
## 10: MIDAGE SINGLES/COUPLES Mainstream 6.432080
## 11: RETIREES Budget 6.141847
## 12: RETIREES Premium 6.103358
## 13: MIDAGE SINGLES/COUPLES Premium 6.078514
## 14: MIDAGE SINGLES/COUPLES Budget 6.026459
## 15: RETIREES Mainstream 5.925920
## 16: NEW FAMILIES Mainstream 4.891566
## 17: NEW FAMILIES Budget 4.821527
## 18: NEW FAMILIES Premium 4.815652
## 19: YOUNG SINGLES/COUPLES Mainstream 4.575597
## 20: YOUNG SINGLES/COUPLES Premium 4.264113
## 21: YOUNG SINGLES/COUPLES Budget 4.250069
## LIFESTAGE PREMIUM_CUSTOMER AVG
ggplot(avg_units, aes(x = LIFESTAGE, y = AVG, fill = PREMIUM_CUSTOMER)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "LIFESTAGE", y = "Units per Customer", title = "Average Units per Customer by LIFESTAGE and PREMIUM_CUSTOMER") +
theme(axis.text.x = element_text(angle = 45, hjust = .7))
Older families and young families in general buy more chips per customer.
Let’s also investigate the average price per unit chips bought for each customer segment as this is also a driver of total sales.
#### Calculate the average price per unit sold (average sale price)
#### by LIFESTAGE and PREMIUM_CUSTOMER
avg_price <- data[, mean(TOT_SALES/PROD_QTY), by = .(LIFESTAGE,PREMIUM_CUSTOMER)]
avg_price[order(-V1)]
## LIFESTAGE PREMIUM_CUSTOMER V1
## 1: YOUNG SINGLES/COUPLES Mainstream 4.065642
## 2: MIDAGE SINGLES/COUPLES Mainstream 3.994241
## 3: RETIREES Budget 3.924404
## 4: RETIREES Premium 3.920942
## 5: NEW FAMILIES Budget 3.917688
## 6: NEW FAMILIES Mainstream 3.916133
## 7: OLDER SINGLES/COUPLES Premium 3.893182
## 8: OLDER SINGLES/COUPLES Budget 3.882096
## 9: NEW FAMILIES Premium 3.872110
## 10: RETIREES Mainstream 3.844294
## 11: OLDER SINGLES/COUPLES Mainstream 3.814665
## 12: MIDAGE SINGLES/COUPLES Premium 3.770698
## 13: YOUNG FAMILIES Premium 3.762150
## 14: YOUNG FAMILIES Budget 3.760737
## 15: OLDER FAMILIES Budget 3.745340
## 16: MIDAGE SINGLES/COUPLES Budget 3.743328
## 17: OLDER FAMILIES Mainstream 3.737077
## 18: YOUNG FAMILIES Mainstream 3.724533
## 19: OLDER FAMILIES Premium 3.717000
## 20: YOUNG SINGLES/COUPLES Premium 3.665414
## 21: YOUNG SINGLES/COUPLES Budget 3.657366
## LIFESTAGE PREMIUM_CUSTOMER V1
ggplot(avg_price, aes(x = LIFESTAGE, y = V1, fill = PREMIUM_CUSTOMER)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "LIFESTAGE", y = "Average price per unit",
title = "Average price per unit chips bought by LIFESTAGE and PREMIUM_CUSTOMER") +
theme(axis.text.x = element_text(angle = 45, hjust = .7))
Mainstream midage and young singles and couples are willing to pay more per packet of chips than their budget and premium counterparts. This may be because premium shoppers are more likely to buy healthy snacks, and when they do buy chips it is mainly for entertaining rather than for their own consumption. This is also supported by there being fewer premium midage and young singles and couples buying chips compared to their mainstream counterparts.
As the difference in average price per unit isn't large, we can test whether this difference is statistically significant.
Let mu_m be the mean unit price for mainstream midage and young singles and couples, and let mu_pb be the mean unit price for premium and budget midage and young singles and couples.
H0: the true difference mu_m - mu_pb is equal to 0
Ha: the true difference mu_m - mu_pb is greater than 0
#### Perform an independent t-test between mainstream vs premium and budget midage
#### and young singles and couples
# create a new column consisting of the price for each row
pricePerUnit <- data[, price := TOT_SALES/PROD_QTY]
# Subset the data for each sample
mainstream <- pricePerUnit[PREMIUM_CUSTOMER %in% c("Mainstream") &
LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES")]
other <- pricePerUnit[PREMIUM_CUSTOMER %in% c("Budget", "Premium") &
LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES")]
# Perform a t-test
# t.test will take care of calculating the sample mean for each sample,
# just input the price vector for each sample
result <- t.test(mainstream$price, other$price, alternative = "greater")
result
##
## Welch Two Sample t-test
##
## data: mainstream$price and other$price
## t = 37.624, df = 54791, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.3187234 Inf
## sample estimates:
## mean of x mean of y
## 4.039786 3.706491
# Extract the p-value
p_value <- result$p.value
p_value
## [1] 3.483677e-306
# # Check if the p-value is below the desired significance level (e.g., 0.05)
# if (p_value < 0.05) {
# print("The unit price for mainstream, young and mid-age singles and couples is significantly higher.")
# } else {
# print("The unit price for mainstream, young and mid-age singles and couples is not significantly higher.")
# }
The t-test results in a p-value of 3.483677e-306, i.e. the unit price for mainstream young and midage singles and couples is indeed significantly higher than that for budget or premium young and midage singles and couples.
We have found quite a few interesting insights that we can dive deeper into.
We might want to target customer segments that contribute the most to sales to retain them or further increase sales. Let’s look at Mainstream - young singles/couples.
For instance, let’s find out if they tend to buy a particular brand of chips compared to other segments.
#### Deep dive into Mainstream, young singles/couples
# Are there brands that this customer segment prefers
# more than others?
# Could use affinity analysis or apriori analysis
# 'data' is a transactional data set where each row is a transaction
# affinity analysis
# subset two relevant groups to compare
# this is the segment of interest
segment1 <- data[PREMIUM_CUSTOMER == "Mainstream" &
LIFESTAGE == "YOUNG SINGLES/COUPLES", ]
# the rest of population serves as the baseline comparison
Other <- data[!(PREMIUM_CUSTOMER == "Mainstream" &
LIFESTAGE == "YOUNG SINGLES/COUPLES"), ]
# how many units did segment1 buy?
quantity_segment1 <- segment1[, sum(PROD_QTY)]
# how many units did the rest buy?
quantity_other <- Other[, sum(PROD_QTY)]
# calculate the share of segment1's total units accounted for by each brand
quantity_segment1_byBrand <- segment1[, .(TargetSeg = sum(PROD_QTY)/quantity_segment1), by = BRAND]
# calculate the share of the rest of the population's total units accounted for by each brand
quantity_Other_byBrand <- Other[, .(otherSegs = sum(PROD_QTY)/quantity_other), by = BRAND]
# calculate and compare to see which brands segment1 particularly tends to buy compared to other segments.
brand_proportions <- merge(quantity_segment1_byBrand,quantity_Other_byBrand)[, AffinityToBrand := TargetSeg/otherSegs]
brand_proportions[order(-AffinityToBrand)]
## BRAND TargetSeg otherSegs AffinityToBrand
## 1: Tyrrells 0.031552795 0.025692464 1.2280953
## 2: Twisties 0.046183575 0.037876520 1.2193194
## 3: Doritos 0.122760524 0.101074684 1.2145526
## 4: Kettle 0.197984817 0.165553442 1.1958967
## 5: Tostitos 0.045410628 0.037977861 1.1957131
## 6: Infzns 0.014934438 0.012573300 1.1877898
## 7: Pringles 0.119420290 0.100634769 1.1866703
## 8: Grain 0.029123533 0.025121265 1.1593180
## 9: Cobs 0.044637681 0.039048861 1.1431238
## 10: Infuzions 0.049744651 0.044491379 1.1180739
## 11: Thins 0.060372671 0.056986370 1.0594230
## 12: Cheezels 0.017971014 0.018646902 0.9637534
## 13: Smiths 0.096369910 0.124583692 0.7735355
## 14: French 0.003947550 0.005758060 0.6855694
## 15: Cheetos 0.008033126 0.012066591 0.6657329
## 16: RRD 0.043809524 0.067493678 0.6490908
## 17: Natural 0.015955832 0.024980768 0.6387246
## 18: NCC 0.003643892 0.005873221 0.6204248
## 19: CCs 0.011180124 0.018895650 0.5916771
## 20: GrnWves 0.003588682 0.006066692 0.5915385
## 21: Sunbites 0.006349206 0.012580210 0.5046980
## 22: Woolworths 0.024099379 0.049427188 0.4875733
## 23: Burger 0.002926156 0.006596434 0.4435967
## BRAND TargetSeg otherSegs AffinityToBrand
We can see that:
Mainstream young singles/couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population.
Mainstream young singles/couples are 56% less likely to purchase Burger Rings compared to the rest of the population.
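One way to read these ratios is as percentage lifts (a sketch using the brand_proportions table built above; the liftPct column name is just illustrative).
# liftPct: positive values mean segment1 is more likely than the rest of the
# population to buy that brand, negative values mean less likely
brand_proportions[, liftPct := round((AffinityToBrand - 1) * 100, 1)]
brand_proportions[order(-liftPct)]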
Let’s also find out if our target segment tends to buy larger packs of chips.
#### Preferred pack size compared to the rest of the population
# Over to you! Do the same for pack size.
# calculate the share of segment1's total units accounted for by each pack size
quantity_segment1_byPackSize <- segment1[, .(TargetSeg = sum(PROD_QTY)/quantity_segment1), by = PACK_SIZE]
# calculate the share of the rest of the population's total units accounted for by each pack size
quantity_Other_byPackSize <- Other[, .(otherSegs = sum(PROD_QTY)/quantity_other), by = PACK_SIZE]
# calculate and compare to see which pack sizes segment1 particularly tends to buy compared to other segments.
packSize_proportions <- merge(quantity_segment1_byPackSize,quantity_Other_byPackSize)[, AffinityToPack := TargetSeg/otherSegs]
packSize_proportions[order(-AffinityToPack)]
## PACK_SIZE TargetSeg otherSegs AffinityToPack
## 1: 270 0.031828847 0.025095929 1.2682873
## 2: 380 0.032160110 0.025584213 1.2570295
## 3: 330 0.061283644 0.050161917 1.2217166
## 4: 134 0.119420290 0.100634769 1.1866703
## 5: 110 0.106280193 0.089791190 1.1836372
## 6: 210 0.029123533 0.025121265 1.1593180
## 7: 135 0.014768806 0.013075403 1.1295106
## 8: 250 0.014354727 0.012780590 1.1231662
## 9: 170 0.080772947 0.080985964 0.9973697
## 10: 150 0.157598344 0.163420656 0.9643722
## 11: 175 0.254989648 0.270006956 0.9443818
## 12: 165 0.055652174 0.062267662 0.8937572
## 13: 190 0.007481021 0.012442016 0.6012708
## 14: 180 0.003588682 0.006066692 0.5915385
## 15: 160 0.006404417 0.012372920 0.5176157
## 16: 90 0.006349206 0.012580210 0.5046980
## 17: 125 0.003008972 0.006036750 0.4984423
## 18: 200 0.008971705 0.018656115 0.4808989
## 19: 70 0.003036577 0.006322350 0.4802924
## 20: 220 0.002926156 0.006596434 0.4435967
Mainstream young singles/couples are 27% more likely to purchase a 270g pack size compared to the rest of the population.
Mainstream young singles/couples are 56% less likely to purchase a 220g pack size compared to the rest of the population.
Let’s dive into what brands sell this pack size.
data[PACK_SIZE == 270, unique(PROD_NAME)]
## [1] "Twisties Cheese 270g" "Twisties Chicken270g"
Twisties is the only brand that offers a 270g pack size. This brand also comes in second in our brand affinity analysis, which helps explain why our target segment buys a lot of this pack size.
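As a cross-check (a sketch using the BRAND and PACK_SIZE columns created earlier), we can also list which pack sizes Twisties sells.
data[BRAND == "Twisties", unique(PACK_SIZE)]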
Let’s recap what we’ve found! This may serve as a precursor to more formal testing and analysis.