Replication of chapter 4 & 5 Master Thesis

knitr::opts_chunk$set(echo = TRUE)

** Chapter 4 **

We load the packages & data sets in this analysis.

library(readr)
library(readxl)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ dplyr   1.0.10
## ✔ tibble  3.1.8      ✔ stringr 1.5.0 
## ✔ tidyr   1.2.1      ✔ forcats 0.5.2 
## ✔ purrr   1.0.1      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(plyr)

## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:purrr':
## 
##     compact

library(dplyr)
library(ggplot2)
library(reshape2)

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

edible_grocery <- read.csv("edible_grocery.csv")
sku_weights <- read.csv("sku_weight.csv")

Adding the necessary columns

edible_grocery$spend=edible_grocery$units*edible_grocery$price

# to add volume we need sku weights. 
edible_merge=join(edible_grocery,sku_weights,by='sku_id')
edible_grocery$volume=(edible_merge$units*edible_merge$weight)/1000

Creating the year variable and using subsetting by year.

edible_grocery$year=ifelse(edible_grocery$week<52, 1,2)

# let's subset the data using year
year_1=edible_grocery[edible_grocery$year==1,]
year_2=edible_grocery[edible_grocery$year==2,]

We insert a pivot table where Rows is trans id and Columns is brand, Filters is year (selecting just year 1), and Values is (Max of) panel id. This creates Sheet2.

# Sheet2 (1)

# we subset the data to the columns that we neeed
sheet2_data=year_1[,c(1,2,7)]

# create the pivot table
p=dcast(sheet2_data, trans_id ~ brand, value.var="panel_id", fun.aggregate=max)

## Warning in .fun(.value[0], ...): no non-missing arguments to max; returning -Inf

p= do.call(data.frame,                      # Replace Inf in data by NA
                   lapply(p,
                          function(x) replace(x, is.infinite(x), NA)))
                                            # Print updated data frame
p$grandtotal=pmax(p$Alpha, p$Bravo, p$Other,p$Charlie,p$Delta,na.rm = T)
p=p[order(p$trans_id),]

Our goal is to create a transaction summary table which indicates whether or not each brand was purchased on each transaction. (Obviously a category transaction occurred.) Working in Sheet2, we enter the following eight variable names in row 4, starting at I4: trans id panel id Alpha Bravo Charlie Delta Other Category • The transaction id is in column A, so we enter =A5 in cell I5. • The panellist id associated witth each transaction is given in columns B–G. We enter =G5 in cell J5. • We want to create an indicator of whether or not each brand was purchased on each transaction. A non-zero entry in columns B–F indicates that a transaction did occur. We enter =1*(B5>0) in cell K5 and copy the formula across to P5. • We copy this row of formulas down to row 20034.

# Sheet2 (2)

# replace the non-missing values with 1
D <- data.frame(ifelse(is.na(p[c(2,3,4,5,6)]),0,1))
D$grandtotal=p$grandtotal
category=as.numeric(ifelse(D$grandtotal>=1,1, 0))
D=D[,-7]
trans_id=p$trans_id
panel_id=p$grandtotal

sheet2=data.frame(trans_id,panel_id,D, category)
sheet2=sheet2[,-8]

With the active cell somewhere in this new table, we insert a pivot table where Rows is panel id and Values is (Sum of) Alpha, Bravo, Charlie, Delta, Other, Category. This creates the Sheet3 worksheet. • You will see a button at the RHS of cell A3. Click on it and “Sort Smallest to Largest”. • We create a cleaned-up version of this table by copying A4:G4577 to a new worksheet, which we call panellist x brand (trans), and we add the following seven variable names to row 1: panel id Alpha Bravo Charlie Delta Other Category

# Sheet3

# since we are doing a new analysis, we again subset the data to the way we want it.
sheet_3=sheet2[,c(2,3,4,5,6,7,8)]
agg=aggregate(.~panel_id, sheet_3, sum)

# panellist x brand (trans)

panellist_x_brand_trans=agg

We create a cleaned-up version of this table by copying A4:G4577 to a new worksheet, which we call panellist x brand (trans), and we add the following seven variable names to row 1: panel id Alpha Bravo Charlie Delta Other Category We also need to create a similar table that summarises how much each panellist spent on each brand (and in the category) during year 1. This is much easier to create.

• Going back to edible grocery, we insert a pivot table where Rows is panel id, Columns is brand, Filters is year (selecting just year 1), and Values is (Sum of) spend. This creates the Sheet5 worksheet.

# Sheet5

# we use year1 data
sheet_5=year_1[,c(1,7,8)]
sheet_5_pivot=dcast(sheet_5, panel_id ~ brand, value.var="spend", fun.aggregate=sum)
sheet_5_pivot[sheet_5_pivot == 0] <- NA

sheet_5_pivot$grandtotal=rowSums(sheet_5_pivot[ , c(2,3,4,5,6)], na.rm = T)

Cleaned up version of sheet 5

# Sheet5 cleaned up

panellist_x_brand_spend=sheet_5_pivot
colnames(panellist_x_brand_spend)[7] <- "Category"

Going back to edible grocery, we insert a yet another pivot table where Rows is panel id, Columns is brand, Filters is year (selecting just year 1), and Values is (Sum of) volume. This creates the Sheet7 worksheet. • You will see a button at the RHS of cell A4. Click on it and “Sort Smallest to Largest”. • As above, we create a cleaned-up version of this table in a new worksheet, which we call panellist x brand (volume).

# Sheet7

# replace the non-missing values with 1
sheet_7=year_1[,c(1,7,9)]
sheet_7_pivot=dcast(sheet_7, panel_id ~ brand, value.var="volume", fun.aggregate=sum)
sheet_7_pivot[sheet_7_pivot == 0] <- NA
sheet_7_pivot$grandtotal=rowSums(sheet_7_pivot[ , c(2,3,4,5,6)], na.rm = T)

# panellist x brand (volume)

# cleaned up version of the data
panellist_x_brand_volume=sheet_7_pivot
colnames(panellist_x_brand_volume)[7] <- "Category"

# solutions to 4.2.1 (panellist x brand (trans))

# number of buyers
buyers=colSums(agg[c("Alpha", "Bravo","Charlie", "Delta" , "Other" ,  "category")] > 0)

# number of transactions
transactions=colSums(agg[c("Alpha", "Bravo",   "Charlie", "Delta" ,   "Other" ,  "category")])

# penetration (%)
penetration=buyers/transactions*100

# PPB
ppb=transactions/buyers

Looking at the panellist x brand (trans) worksheet, the distribution of category purchasing is determined by counting how many households made one category purchase (panellists 3102011, 3102046, etc.), two category purchases (panellists 3102012, 3102021, etc.), and so on. We do this in the following manner. • With the active cell somewhere in the main table, we insert a pivottable where Rows is Category and Values is (Count of) panel id. We rename the worksheet Trans distribution – Category

# Trans distribution -- Category

# since we are doing this in R, we have to subset the panellist_X_brand_trans to get the relevant rows that are needed.

trans_subset=panellist_x_brand_trans[,c(1,7)]
trans_subset$panel_id=as.character(trans_subset$panel_id)
trans_pivot=aggregate(.~category, trans_subset, length)

# here is a step towards the creation of the histogram
panel_0=5021-sum(trans_pivot$panel_id)
row_o=data.frame(0,panel_0)
colnames(row_o)=colnames(trans_pivot)

trans_pivot=rbind(row_o,trans_pivot)
trans_pivot$percent=round(trans_pivot$panel_id/sum(trans_pivot$panel_id)*100,2)
trans_pivot2=trans_pivot[-1,]

new_trans=trans_pivot[c(1:15),-3]
panel_over15=sum(trans_pivot[c(16:25),2], na.rm = T)
over_15=data.frame('15+', panel_over15)
colnames(over_15)=colnames(new_trans)

final_trans=rbind(new_trans, over_15)
final_trans$percent=round(final_trans$panel_id/sum(final_trans$panel_id)*100,2)

Making the first plot.

barplot(final_trans$percent,
xlab = "Number of category transactions",
ylim = c(0, 20),
ylab='% Households',
names.arg = final_trans$category,
col = "blue",
main = 'Distribution of category purchasing in year 1',
horiz = F)

**4.2.2

Let us now create the distribution of purchase frequency for Alpha, which is given in Figure 4.2. With the one exception noted below, the logic follows that associated with creating the distribution of category purchasing. With the active cell somewhere in the main table in panellist x brand (trans), we insert a pivot table where Rows is Alpha and Values is (Count of) panel id. We rename the worksheet Trans distribution – Alpha • We copy cells A4:B22 to D4:E22. • In contrast to the pivot table output associated with our summary of category purchasing, this pivot table does contain a zero category. However, we must be careful in our interpretation of the associated frequency. We see that 1950 category buyers did not buy Alpha in year 1. However, in order to have a complete summary of brand purchasing, we should also account for those 447 households that made no category purchases that year. The number of panellists making zero purchases of Alpha is the total number of panellists (5021) minus the number of panellists that made 1, 2, 3, . . . purchases. We enter =5021-SUM(E5:E22) in cell E4.5 • In column F, we compute the proportion of panellists that made 0, 1, 2, . . . Alpha purchases, expressed as a percentage (via number formatting). • When plotting the data, we choose to right-censor the distribution at 10 — see columns H–J and the associated plot. we repeat the same processes that we did above with a new approach.

# Trans distribution -- Alpha

# we subset the data to have alpha
alpha_subset=panellist_x_brand_trans[,c(1,2)]
alpha_pivot=aggregate(.~Alpha, alpha_subset, length)

# here is a step towards the creation of the histogram
alpha_pivot=alpha_pivot[-1,]
alpha_0=5021-sum(alpha_pivot$panel_id)
row_0=data.frame(0,alpha_0)
colnames(row_0)=colnames(alpha_pivot)

alpha_pivot=rbind(row_0,alpha_pivot)
alpha_pivot$percent=round(alpha_pivot$panel_id/sum(alpha_pivot$panel_id)*100,2)
#alpha_pivot=alpha_pivot[-1,]

We create the second frequency table in sheet “Trans distribution – Alpha”

new_alpha=alpha_pivot[c(1:10),][,-3]
panel_over10=sum(trans_pivot[c(11:20),2], na.rm = T)
over_10=data.frame('10+', panel_over10)
colnames(over_10)=colnames(new_alpha)
final_alpha=rbind(new_alpha, over_10)
final_alpha$percent=round(final_alpha$panel_id/sum(final_alpha$panel_id)*100,2)

Let’s make the barchart

barplot(final_alpha$percent,
xlab = "Number of category transactions",
ylim = c(0,50),
ylab='% Households',
names.arg = final_alpha$Alpha,
col = "blue",
horiz = F)

4.3

We now turn our attention to creating summaries of total spend. Our initial goal is to create Figure 4.3, which is a histogram of category spend (in dollars) across those panellists that made at least one purchase in the category in year 1. In this plot, the raw total spend data have been binned in bins with a width of dollar 5. We see that 16% of category buyers spent up to $5 in the category during this one-year period; 23% spent between dollar 5 and $10; . . . and 2% spent more than $50.

We copy the panel id and Category columns from panellist x brand (spend) into a new worksheet, which we rename Spend distribution – Category.

# Spend distribution -- Category

spend_distribution_category=panellist_x_brand_spend[,c(1,7)]
summary(spend_distribution_category$Category)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.27    6.75   12.44   15.57   20.64  166.70

There appears to be quite a bit of variability in category spend. To get further insight into the distribution of total category spend across the panellists, we use Excel’s percentile function to determine the total spend quantities associate with the various percentile levels. – We enter 0.05 in cell D8, =D8+0.05 in cell D9, and copy this formula down to D26. Cells D8:D26 are then formatted as percentages.

# Percentile
percent=seq(0.05, 0.95, by=0.05)
percentiles=quantile(spend_distribution_category$Category, probs = percent)

How many bins do we go with? This is an empirical question. Since 5% of the panel spent more than $39.72, we certainly want to go above 40 in order to get a sense of how they are spread between $39.72 and the maximum of 166.70. We will go with $50. If too many panellists have spent more than $50, we can always add more bins. If too few panellists fall into this bin, we can always combine the bins we have created. • We will use Excel Histogram data analysis tool to determine how many panellists’ total spend fell into each bin with a width of $5. – We first specify the range of histogram bins we wish to use. We enter 5 in cell G8, =G8+5 in cell G9, and copy this formula down to G17. This gives us 10 bins with a width of $5. – Clicking on Data > Data Analysis > Histogram, we specify cells B2:B4575 as the input range, cells G8:G17 as the bin range, and select I7 as the output range. – The resulting histogram is given in I7:J18. We see that 750 panellists spent $5 or less in the category during year 1, 1073 spent between 5 and 10, . . . , and 109 spent more than $50. – We convert the raw counts into percentages and plot these percentages as a bar chart

# Bins, Frequency, Percentage
tags=c('(0-5)', '(5-10)', '(10-15)', '(15-20)', '(20-25)', '(25-30)', '(30-35)','(35-40)', '(40-45)','(45-50)', '50+')
binns=cut(spend_distribution_category$Category, 
                  breaks=c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50,100), 
                  include.lowest=TRUE, 
                  right=TRUE,
    labels=tags)
bins=data.frame(table(binns))
bins$percent=round(bins$Freq/sum(bins$Freq)*100,2)

# Distribution of Category Spend (Barplot)
barplot(bins$percent,
xlab = "Category spend ($)",
main='Distribution of Category Spend',
ylim = c(0,25),
ylab='% Households',
names.arg = bins$binns,
col = "blue",
horiz = F)

We copy the panel id and Alpha columns from panellist x brand (spend) into a new worksheet, which we rename Spend distribution – Alpha.

# Spend distribution -- Alpha

Spend_distribution_Alpha=panellist_x_brand_spend[,c(1,2)]

We immediately notice that there are a number of empty cells in the Alpha column. While these panelists purchased in the category during the year, they did not purchase any of Alpha’s SKUs. The first thing we need to do is remove these observations. • One way to do ths is to first sort this table on Alpha from “Largest to Smallest”. The last entry for Alpha occurs in cell B2625. We delete the contents of A2626:A4575 (1950 cells), giving us a table that contains only those panellists that purchased at least one Alpha SKU in year 1.

# removing NA
Spend_distribution_Alpha=na.omit(Spend_distribution_Alpha)

Having looked at some basic descriptive stats (as above), we will summarise the data using bins of width $2, right censoring at $40.

# Descriptive stats (summary)
summary(Spend_distribution_Alpha$Alpha)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.57    3.69    9.57   12.59   17.03  105.90

# Percentile
percentile_alpha=quantile(Spend_distribution_Alpha$Alpha, probs = percent)

# Bins, Frequency, Percentage
tags_alpha=c('(0-2)', '(2-4)', '(4-6)', '(6-8)', '(8-10)', '(10-12)', '(12-14)','(14-16)', '(16-18)','(18-20)', '(20-22)','(22-24)', '(24-26)','(26-28)','(28-30)', '(30-32)', '(32-34)','(34-36)', '(36-38)', '(38-40)', '(40+)')
binns_alpha=cut(Spend_distribution_Alpha$Alpha, 
                  breaks=c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20,22,24,26,28,30,32,34,36,38,40,100), 
                  include.lowest=TRUE, 
                  right=TRUE,
    labels=tags_alpha)
bins_alpha=data.frame(table(binns_alpha))
bins_alpha$percent=round(bins_alpha$Freq/sum(bins_alpha$Freq)*100,2)

Creating the barchart (on the sheet “Spend distribution – Alpha”).

# Distribution of Spend on Alpha (barchart)
barplot(bins_alpha$percent,
xlab = "Spend ($)",
main='Distribution of Category Spend on Alpha',
ylim = c(0,25),
ylab='% Households',
names.arg = bins_alpha$binns,
col = "blue",
horiz = F,las=2)

4.4

We insert a new worksheet, which we call Decile data (i), and copy the panel id and Category columns from panellist x brand (spend) into columns A and B. We rename column B spend. • We copy the Category column from panellist x brand (trans) into column C of Decile data (i) and rename this column trans.

# Decile data (i)
decile_data_i=data.frame(panellist_x_brand_spend[,c(1,7)], panellist_x_brand_trans$category)
colnames(decile_data_i)=c('panel_id','spend', 'trans')

We would like to count the number of unique brands purchased by each panelist in year 1. – We enter # unique brands in cell D1. – The number of unique brands purchased by each customer is simply the number of brands in panellist x brand (trans) that have a non-zero number of transactions. – We perform this calculation by entering =COUNTIF(’panellist x brand (trans)’!B2:F2,“>0”) in cell D2.9 – We copy this formula down to D4575. – Finally, we copy cells D2:D4575 and “paste values” onto these same cells. (We need to do this as we will sort the table at a later stage and keeping the formula “live” would introduce errors.)

decile_data_i$unique_brands=rowSums(panellist_x_brand_trans[,c(2,3,4,5,6)]!=0)

• Before we go any further, let’s make a copy of this worksheet, calling it Decile data (ii).

decile_data_ii=decile_data_i

• Returning to Decile data (i), the next step is to determine the decile bin into which each panellist falls. There are several ways to do this. Here’s one approach: – We sort the table of data by spend from “Largest to Smallest”. – Next we create a rank variable. Having entered rank in cell E1, we enter 1 in cell E2, =E2+1 in E3, and copy the formula down to E4575. – We convert the rank number into a decile number by entering =INT(10*(E2-1)/E$4575)+1in F2 and copying the formula down to F4575. (Make sure you understand the logic of the formula. Why is −1 in the formula?) – We enter decile in cell E1.

decile_data_i=decile_data_i[order(decile_data_i$spend, decreasing = T),]
decile_data_i$rank=order(decile_data_i$spend, decreasing = T)

# we are creating the deciles
decile_data_i$decile=ntile(decile_data_i$rank,10)

# Final version of sheet "Decile data (i)"

We insert a pivot table where Rows is decile and Values is (Count of) panel id, (Sum of) spend, (Sum of) trans, and (Average of) # unique brands. This creates the Sheet16 worksheet.

decile_sub=decile_data_i[,c(1,2,3,4,6)]

sheet16 = decile_sub %>%
  group_by(decile) %>%
  summarise_all(funs(n(),sum, sum, mean))

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

# columns to remain with, in order to get the final version of "Sheet16"
sheet16=sheet16[,c(1,2,7,8,17)]
colnames(sheet16)=c('decile', 'panel_id', 'spend','trans','unique_brands')

We create the various columns of Table 4.1 in the following manner. For decile 1, we enter the following formulas: cell H4 % HHs is =B4/B$14 (formatted as a percentage) cell I4 % Spend is =C4/C$14 (formatted as a percentage) cell J4 % Trans is =D4/D$14 (formatted as a percentage) cell K4 spend/HH is =C4/B4 cell L4 Cat trans/HH is =D4/B4 cell M4 Avg spend/trans is =C4/D4 cell N4 # unique brands is simply a copy of relevant entry in column E, =E4 • We copy this row of formulas down to row 13.

# (Due to naming in R, we do not start with % for the columns "%HHs ; %Spend ; %Trans.)

hhs=round(sheet16$panel_id/sum(sheet16$panel_id),2)*100
spend_per=round(sheet16$spend/sum(sheet16$spend),2)*100
trans_per=round(sheet16$trans/sum(sheet16$trans),2)*100
spend_div_hhs=round(sheet16$spend/sheet16$panel_id, 2)
trans_div_hhs=round(sheet16$trans/sheet16$panel_id,1)
spend_div_trans=round(sheet16$spend/sheet16$trans, 2)

# Sheet16_2 (data frame)
sheet_16_2=data.frame(hhs, spend_per, trans_per, spend_div_hhs,trans_div_hhs,spend_div_trans, sheet16[,5])

**4.5 [Optional] Creating Lorenz Curves

Turning to Decile data (ii), we sort the table of data by spend from “Largest to Smallest”. • Next we create a cumulative spend column. We enter Cum spend in cell E1, =B2 in E2, =E2+B3 in E3, and copy the formula down to E4575. • The next step is to determine the cumulative total spend numbers that correspond to the boundaries between each decile. – We first need to know the value of overall category purchasing by these panellists. This lurks in cell E4575. So as to make our analysis a little more transparent, we copy this value to cell J1 (using the formula =E4575) and enter Total category spend in cell I1). – One-tenth is this is computed by entering =J1/10 in cell J2; we label this Size of 10th equal split (cell I2). – We compute the cutoffs between the first and second, second and third, . . . , and ninth and tenth deciles by entering =J2 in cell J5, =J5+J2 in cell J6, and copying this formula down to cell J13.

# Decile data (ii)

decile_data_ii=decile_data_ii[order(decile_data_ii$spend, decreasing = T),]
decile_data_ii$cum_spend=cumsum(decile_data_ii$spend)

# we determine the cutoffs
total_category_spend = sum(decile_data_ii$spend)  #Total category spend
size_of_split = total_category_spend/10           #Size of 10th equal split       

# we create the cutoffs
cutoff = size_of_split*(seq(1,9,by=1))
deciles = quantile(cutoff, seq(0, 1, 0.1))

#as.numeric(cut(decile_data_ii$cum_spend, deciles, include.lowest = TRUE)) # we create
decile_data_ii$decile=NA
decile_data_ii$decile[decile_data_ii$cum_spend<7091.854] =1
decile_data_ii$decile[decile_data_ii$cum_spend>=7091.854&decile_data_ii$cum_spend<14183.708]=2
decile_data_ii$decile[decile_data_ii$cum_spend>=14183.708 & decile_data_ii$cum_spend<21275.562]=3
decile_data_ii$decile[decile_data_ii$cum_spend>=21275.562 &decile_data_ii$cum_spend<28367.41]= 4
decile_data_ii$decile[decile_data_ii$cum_spend>=28367.41& decile_data_ii$cum_spend<35459.270]=5
decile_data_ii$decile[decile_data_ii$cum_spend>=35459.270 & decile_data_ii$cum_spend<42551.124]= 6 
decile_data_ii$decile[decile_data_ii$cum_spend>=42551.124 & decile_data_ii$cum_spend<49642.978]= 7
decile_data_ii$decile[decile_data_ii$cum_spend>=49642.978 & decile_data_ii$cum_spend<56734.832]= 8
decile_data_ii$decile[decile_data_ii$cum_spend>=56734.832& decile_data_ii$cum_spend<63826.686]=9
decile_data_ii$decile[decile_data_ii$cum_spend>=63826.686] = 10

# Sheet17

decile_sub1=decile_data_ii[,c(1,2,3,4,6)]
sheet17 = decile_sub1 %>%
  group_by(decile) %>%
  summarise_all(funs(length,sum, sum, mean))

# columns to remain with in order to get the final version of "Sheet17".
sheet17=sheet17[,c(1,2,7,8,17)]
colnames(sheet17)=c('decile', 'panel_id', 'spend','trans','unique_brands')


# Sheet17_2
percent_HHs=round(sheet17$panel_id/sum(sheet17$panel_id)*100,0)
percent_spend=round(sheet17$spend/sum(sheet17$spend)*100,0)
percent_trans=round(sheet17$trans/sum(sheet17$trans)*100,0)
spend_HH=round(sheet17$spend/sheet17$panel_id,2)
cat_trans_HH=round(sheet17$trans/sheet17$panel_id,1)
avgspend_trans=round(sheet17$spend/sheet17$trans,2)
sheet17_2=data.frame(decile=sheet17$decile,'% HHs'=percent_HHs,"%Spend"=percent_spend,"% Trans"=percent_trans, "Spend/HH"=spend_HH, "Cat trans/HH"=cat_trans_HH, "Avg spend/trans"=avgspend_trans,"# unique brands"=round(sheet17$unique_brands,1))

# Lorenz -- Alpha (trans)

# Count of Panel_id with Alpha
lorenz_alpha = alpha_pivot[c(1,2)]
lorenz_alpha = lorenz_alpha[-1,]


# to get the second data frame of sheet "Lorenz -- Alpha (trans)"
transactions_lorenz=lorenz_alpha$Alpha
frequency_alpha=lorenz_alpha$panel_id
total_purchases=transactions_lorenz*frequency_alpha
percent_buyers=frequency_alpha/sum(frequency_alpha)
percent_transactions=total_purchases/sum(total_purchases)
cum_buyers=cumsum(percent_buyers)*100
cum_transactions=cumsum(percent_transactions)*100
lorenz_alpha_df = data.frame(transactions_lorenz, frequency_alpha,
                             total_purchases, percent_buyers,
                             percent_transactions, cum_buyers, cum_transactions)


# to include the 0%
cum_buyers=append(0, cum_buyers)
cum_transactions=append(0, cum_transactions)              

# let's make the plot (Lorenz Curve for Alpha transactions)
plot(cum_buyers, cum_transactions, type = 'l', col='blue', xlab='% Alpha buyers', ylab = '% Alpha transactions', main='Lorenz Curve for Alpha transactions')

We start by making a copy of Spend distribution – Alpha, renaming it Lorenz – Alpha (spend) and delete all the content right of column C. We then sort this table by Alpha, from “Smallest to Largest” and insert a row above row 2. (Row 2 is now empty.) • We start by calculating the percentage of total Alpha purchasing accounted for by each panellist. We first compute in cell B2627 the total spend on Alpha (=SUM(B3:B2626)). Next we enter =B3/B2627 in cell C3, formatting the result as a percentage, and copy the formula down to cell C2626. • Next, we compute the cumulative percentage of buyers number. The total of Alpha buyers is determined by entering =COUNT(A3:A2626) in cell A2627. We enter =(ROW()-2)/A2627 in cell D2, format the result as a percentage, and copy the formula down to cell D2626. • In order to compute the cumulative percentage of spend numbers, we first enter 0 in cell E2 and format the cell as a percentage. Next we enter=E2+C3in cell E3, formatting the result as a percentage, and copy the formula down to cell E2626.

# Lorenz -- Alpha (spend)

# panel_id & Alpha
lorenz_spend=Spend_distribution_Alpha
lorenz_spend=lorenz_spend[order(lorenz_spend$Alpha),]

lorenz_spend$per_spend=lorenz_spend$Alpha/sum(lorenz_spend$Alpha)
percen_buyers=round(as.numeric(rownames(lorenz_spend))/sum(as.numeric(rownames(lorenz_spend))), 4)
lorenz_spend$cum_buyers=cumsum(percen_buyers)*100
lorenz_spend$cum_spend=cumsum(lorenz_spend$per_spend)*100

# let's make the plot (Lorenz Curve for Alpha spend)
plot(lorenz_spend$cum_buyers,lorenz_spend$cum_spend, type = 'l', col='blue',xlab='% Alpha buyers',ylab = '% Alpha revenue', main='Lorenz Curve for Alpha spend')

** Chapter 5 **

We start by copying the panellist x brand (trans) worksheet from the workbook used in the previous chapter and save it in a new workbook (say chapter 5.xlsx). We delete the content to the right of column H. • We insert a new worksheet (# brands) and copy column A from panellist x brand (trans). • We compute the number of brands purchased by each household by counting the number of non-zero entries in each row of the panellist x brand (trans) table. We enter =COUNTIF(’panellist x brand (trans)’!B2:F2,“>0”) in cell B2 and copy the formula down to B4575. We label column this column # brands (cell B1).

# # brands

brands=rowSums(panellist_x_brand_trans[,c(2,3,4,5,6)] > 0)
brands=data.frame(panellist_x_brand_trans[1],brands)

Inserting a pivot table where Rows is # brands and Values is (Count of) panel id gives us the distribution of the number of brands purchased in year 1 by those panellists that made at least one category purchase that year. Expressing these counts as percentages gives us Table 5.1

# Sheet3

sheet3=aggregate(.~brands, brands, length)
sheet3$percentages=round((sheet3$panel_id/sum(sheet3$panel_id))*100,0)

We make a copy of the worksheet # brands and rename it # brands x cat purchasing. In column C we report the number of category purchases made by each household; we simply copy column G from panelist x brand (trans).

# # brands x cat. purchasing

brands_of_cat_purchasing=brands
brands_of_cat_purchasing$category=panellist_x_brand_trans$category

We insert a pivot table where Rows is Category, Columns is # brands, and Values is (Count of) panel id. Looking at the resulting pivot table output (Sheet5), we see that there quite a high level of sole-brand loyalty (i.e., only buying one brand), even as the number of category purchases increases. We have two panellists that made 20 category purchases, all with the same brand. That’s some level of loyalty! We can compute the average number of different brands purchased for each level of category purchasing by entering =SUMPRODUCT(B4:E4, B5:E5)/F5 in cell H5 and copying the formula down to H27. We note that the average number of brands purchased does increase as a function of category purchasing. This should not come as a surprise, as more category purchases equals more opportunities to buy different brands. We saw in Chapter 3 that two brands (Alpha and Bravo) had a combined value market share of 86%. As such, the relatively low number of different brands purchased in the year is not too surprising. We now consider three common analyses designed to give insight into the nature of multibrand buying behaviour.

# Sheet5

sheet_5_chapter5=dcast(brands_of_cat_purchasing, category ~ brands, value.var="panel_id", fun.aggregate=length)

sheet_5_chapter5$grandtotal=rowSums(sheet_5_chapter5[c(2,3,4,5)],na.rm = T)
pl=c(1,2,3,4)

sheet_5_chapter5$average=round((as.matrix(sheet_5_chapter5[c(2,3,4,5)])%*%pl)/sheet_5_chapter5$grandtotal,1)

sheet_5_chapter5[sheet_5_chapter5 == 0] <- NA

sheet_5_chapter5

The first thing we need to do is to create the “ever buyers” matrix. We insert a new worksheet called Ever buyers. We copy in both column A and cells B1:F1 from panellist x brand (trans). We want to populate this table with zeros and ones, depending on whether or not each household made at least one purchase of the brand in question. We enter =1*(’panellist x brand (trans)’!B2>0) in cell B2, and copy the formula across and down to cell F4575.

# Ever buyers

ever_buyers=panellist_x_brand_trans[c(2,3,4,5,6)]
ever_buyers[ever_buyers > 0] <- 1 
ever_buyers=data.frame(panellist_x_brand_trans[1], ever_buyers)

We insert a new worksheet and rename it Duplication of purchase. The first thing we do is compute the total number of buyers of each brand. (These numbers are simply the column totals of the “ever buyers” matrix.) We enter the brand names in cells C1:G1. Next we enter=SUM(’Ever buyers’!B2:B4575)in cell C2 and copy the formula across to cell G2.

# Duplication of purchase

# computing the column total to get "# brand buyers":
totals=colSums(ever_buyers[c(2,3,4,5,6)])

The duplication count table is created by pre-multiplying the “ever buyers” matrix by its own transpose. Excel has a series of functions for basic matrix operations, including transpose and matrix multiplication. Having entered the brand names in cells C5:G5 and B6:B10, we type=MMULT(TRANSPOSE(’Ever buyers’!B2:F4575),’Ever buyers’ !B2:F4575) in cell C6. At the time of writing, using the latest version of Office 365, the rest of the table is automatically filled when is pressed. (If you are using an older version, the cell should display #VALUE! when you press . Copy this formula across and down to cell G10. (Cells C6:G10 should all display #VALUE!.) With cells C6:G10 highlighted, press and then + + .)

# # people who purchased
ever_buyers1=t(ever_buyers[-1])
dup_count=as.data.frame(as.matrix(ever_buyers1)%*%as.matrix(ever_buyers[-1]))

To make the final step of the calculation easy, we want the totals given in cells C2:G2 in cells I6:I10. We do this using the Excel’s transpose function. We type =TRANSPOSE(C2:G2) in cell I6. At the time of writing, using the latest version of Office 365, cells I6:I10 are automatically filled when is pressed. (If using an older version of Excel , you will need to copy the formula down to cell I10. With cells I6:I10 highlighted, press and then + + .)

# totals transposed
totals_transpose=t(totals)

The duplication of purchase table is created by dividing each entry of the duplication count table (cells C6:G10) by the total number of buyers of the brand associated with each row (cells I6:I10). We first enter the brand names in cells C13:G13 and B14:B18. Next we enter =IF(B14=C5,““,C6/I6) in cell C14, format as a percentage, and copy the formula across and down to cell G18.

# % people who purchased
dup_puchase=dup_count/matrix(totals_transpose,ncol = 1)*100
round(dup_puchase,0)

We first copy the panellist x brand (volume) worksheet from the workbook used in the previous chapter into the workbook we are using for this chapter’s analysis. We should first check that the rows in this worksheet correspond to rows in Ever buyers. We can check this by entering =1-(A2=’Ever buyers’!A2) in cell H2 of panellist x brand (volume) and copying the formula down to H4575. If the same panellist is associated with each row in both worksheet, which is what we want, the sum of this column is 0.

# panellist x brand (volume)

volume_chapter_5=panellist_x_brand_volume
volume_chapter_5[is.na(volume_chapter_5)]=0
volume_chapter_5$check=ifelse(volume_chapter_5[1]==ever_buyers[1],0,1)

Inserting a new worksheet and renaming it SCR, we enter the five brand names in cells B2:F2. • We compute in row 3 the total purchasing of each brand, which is simply the sum of the relevant column in panellist x brand (volume). We enter =SUM(’panellist x brand (volume)’!B2:B4575) in cell B3 and copy the formula across to cell F3. The next step is to compute the total amount of category purchasing conditioned on the fact that at least one purchase of the brand of interest was made. Recall that the “ever buyers” matrix (Ever buyers) indicates whether or not each panellist ever purchased each brand. For any given brand, we want to sum up total category purchasing (column G from panellist x brand (volume)) across those

panellists for whom the ever-buy indicator is 1. We can do this by multiplying each cell in the relevant column of the “ever buyers” matrix by the associated panellist’s cell in the Category column in panellist x brand (volume) and then summing across panellists. We enter =SUMPRODUCT(’Ever buyers’!B2:B4575,’panellist x brand (volume)’!G2:G4575) in cell B4 and copy the formula across to cell F4. • SCR is the ratio of brand purchasing to category purchasing. We enter =B3/B4 in cell B5, format the result as a percentage, and copy the formula across to cell F5.

# SCR

brand_purchasing = colSums(volume_chapter_5[c(2,3,4,5,6)],na.rm = T)
category_purchasing = t(colSums(crossprod(as.matrix(ever_buyers[c(2,3,4,5,6)]),as.matrix(volume_chapter_5[c(2,3,4,5,6)]))))

# SCR (%)
scr = brand_purchasing/category_purchasing* 100

• Looking at panellist x brand (volume), we see that each panellist’s lack of purchasing of any brand is indicated by an empty cell. These empty cells cause a problem when using MMULT. We need a lack of purchasing to be indicated by 0. We could use the Excel feature that allows us to fill empty cells with zeros or we can replicate the table in the following manner. We enter =B2 in cell J2 and copy the formula across and down to N4575. We insert a new worksheet, renaming it Cross purchase, and enter the brand names in cells C2:G2 and B3:B7. • We type=MMULT(TRANSPOSE(’Ever buyers’!B2:F4575),’panellist x brand (volume)’!J2:N4575) in cell C3 and press Enter. As noted above, the rest of the table should automatically be populated if you are using the latest version of Excel.

# Cross purchase

# "Total volume purchased of x" as a function of "Purchasers of x"
purchasers=data.frame(t(as.matrix(ever_buyers[c(2,3,4,5,6)])) %*%as.matrix(volume_chapter_5[c(2,3,4,5,6)]))

Having entered the brand names in cells C10:G10 and B11:B15, we enter =C3/SUM(C3:G3) in cell C11, formatting the result as a percentage. We then copy this formula across and down to G15, giving us the cross purchase table. (The diagonal of this table gives us the SCR numbers computed in the worksheet SCR.)

# idem (%)
purchasers_percent=data.frame(t(round(sweep(purchasers,2,colSums(purchasers),`/`)*100,0)))

For a given brand, we can plot the associated row entries as a pie chart — see, for example, the worksheet Importance of competition plot, which plots the numbers in cells C11:G11 in the Cross purchase.

# Importance of competition plot

ggplot(purchasers_percent, aes(x = "", y = Alpha, fill = row.names(purchasers_percent))) +
  geom_col() +
  geom_text(aes(label = Alpha),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y")+
  ggtitle('Importance of Competition plot to Buyers of Alpha')

We insert a new worksheet, renaming it Importance against expectation, and enter the brand names in cells B1:F1. • We first enter the cross purchasing percentages for Alpha in row 2: we enter =’Cross purchase’!C11 in cell B2 and copy the formula across to cell F2. Next, we compute the volume market share numbers for each brand in row 3. We enter =SUM(’panellist x brand (volume)’!B2:B4575)/SUM(’panellist x brand (volume)’!G2: G4575) in cell B3, format the result as a percentage, and copy the formula across to cell F3.

# Importance against expectation

sums=colSums(panellist_x_brand_volume[c(2,3,4,5,6,7)], na.rm = T)
# share of category purchasing
share_category_purchasing=purchasers_percent$Alpha

# market share (%)
marketshare=(sums/sums[6])[-6] *100

In row 5, we compute the share of residual purchasing amongst Alpha buyers (i.e., the percentage of the category purchasing not accounted for by Alpha that goes to each of the other brands). We enter =C2/(1-B2) in cell C5 and copy the formula across to cell F5.

# Share of residual purchasing -- Alpha buyers
Share_of_residual_purchasing_Alpha_buyers=share_category_purchasing/(100-share_category_purchasing[1])

In row 6, we compute the residual share of category purchasing (across all category buyers) once Alpha is removed (i.e., when we exclude Alpha, what percentage of (the remaining) category purchasing goes to each of the other brands). We enter =C3/(1-B3) in cell C6 and copy the formula across to cell F6.

# Share of residual purchasing -- category
Share_of_residual_purchasing_category=marketshare/(100-marketshare[1])

Let us consider Charlie. With reference to cell D5, we see that it accounted for 26% of the category purchasing by Alpha buyers that did not go to Alpha. If the purchasing of Alpha buyers was consistent with overall market patterns (as reflected in the volume market shares), we would expect Charlie to account for 19% of their purchasing (cell D6). We can therefore say that Charlie’s share of purchasing amongst the buyers of Alpha is above expectation (when expectation is based on overall patterns of buying behaviour). • In row 7 we create an index against expectation. We enter =100*C5/C6 in cell C7 and copy the formula across to cell F7. We plot these numbers using a horizontal bar chart.

# Index against expectation
Index_against_expectation=(Share_of_residual_purchasing_Alpha_buyers/Share_of_residual_purchasing_category)*100

# plotting "Importance against expectation"
barplot(Index_against_expectation[-1],
main='Importance Against Expectation',
col = "blue",
horiz = T,las=2)

We can repeat these analyses using spend rather than volume purchasing. Using the panellist x brand (spend) worksheet from the workbook and following the logic outlined above, we create Cross purchase (spend). We see, for example, that buyers of Alpha spent 71% of their category spend on Alpha. This is in contrast to the 69% of their category volume requirements satisfied by Alpha. We can create a spend-based importantance against expectation plot, using value market share as the reference.

# panellist x brand (spend)

spend_chapter_5=panellist_x_brand_spend
spend_chapter_5[is.na(spend_chapter_5)]=0
spend_chapter_5$check=ifelse(spend_chapter_5[1]==ever_buyers[1],0,1)

# Cross purchase (spend)

# "Total spend of x" as a function of "Purchasers of x"
purchasers_spend=data.frame(t(as.matrix(ever_buyers[c(2,3,4,5,6)])) %*%as.matrix(spend_chapter_5[c(2,3,4,5,6)]))

# idem (%)
purchasers_spend_percent=round(sweep(purchasers_spend,2,colSums(purchasers_spend),`/`)*100,0)

Replication of chapter 4 & 5 Master Thesis

Panji BHARATA

January 2023

4.4