Replication of chapter 6b Master Thesis

knitr::opts_chunk$set(echo = TRUE)

We start by loading the libraries needed.

library(readr)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ dplyr   1.0.10
## ✔ tibble  3.1.8      ✔ stringr 1.5.0 
## ✔ tidyr   1.2.1      ✔ forcats 0.5.2 
## ✔ purrr   1.0.1      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(plyr)

## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:purrr':
## 
##     compact

library(dplyr)
library(ggplot2)
library(readxl)
library(lattice)
library(reshape2)

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

We continue by opening kiwibubbles tran.csv in Excel. We immediately save it as an Excel workbook, say chapter 6b.xlsx).

# kiwibubbles_trans
chapter6b=read.csv("kiwibubbles_trans.csv")

We will just focus on purchasing by those panellists in Market 2. We make a copy of the kiwibubbles tran worksheet (renaming it Market2), delete the records associated with market 1 (rows 2–552) and the column corresponding to the Market field.

# Market 2

market2=chapter6b[chapter6b$Market==2,-2]

The next step is to determine the so-called “depth of repeat” (DoR) level associated with each transaction. Is it a trial purchase (DoR = 0), a first repeat purchase (DoR = 1), a second repeat purchase (DoR = 2), etc. Labelling column E DoR (cell E1), we enter =IF(A2<>A1,0,E1 +1) in cell E2 and copy the formula down to E307. (Note that this formula assumes the records are sorted chronologically for each panellist.)

n <- nrow(market2)
if (n > 1) for(i in 2:n) market2$DOR[i] <- ifelse(market2$ID[i]!=market2$ID[i-1],0,market2$DOR[i-1]+1)

market2$DOR[1]=0

The next step is to create a table that tells us how many trial, first repeat, etc. purchases (columns) occurred in each week (rows). We want there to be 52 rows, one for each week of the test. However, it turns out that this panel of 1499 households only purchased the test product in 49 weeks; no purchases occurred in weeks 25, 39, and 41. How can we create a table that will contain zeros in the rows corresponding to these three weeks? At the bottom of column B, we add the numbers 1, 2, . . . , 52. For these new records, we assign a depth-of-repeat level of −1 (cells E308:E359). • We insert a pivot table where Rows is Week, Columns is DoR, and Values is (Count of) ID, and rename the resulting worksheet DoR by week. 4 We create a cleaned-up version of this table to the right of the pivot table output, adding meaningful column names.

# DoR by week

dor_by_week=dcast(market2, Week ~ DOR, value.var="ID", fun.aggregate=length)
dor_by_week=dor_by_week[,-14]
dor_by_week$grandtotal=rowSums(dor_by_week[,-1])

# create a cleaned up version by changing the column names
dor_by_week_cleaned=dor_by_week
colnames(dor_by_week_cleaned)=c('Week', 'Trial', 'R1', 'R2', 'R3', 'R4', "R5", 'R6', 'R7', 'R8','R9', 'R10', 'R11', 'Total')

We create Figure 6.8, which breaks weekly sales into its trial and repeat components, in the following manner: – We insert a new worksheet, renaming it TR decomposition, and add the week numbers in column A. – The weekly trial numbers are extracted from the worksheet DoR by week by entering =’DoR by week’!R5 in cell B2 and copying the formula down to B53. We label this column Trial (cell B1).

# TR decomposition

TR_decomposition=data.frame(dor_by_week_cleaned[,c(1,2)])
colnames(TR_decomposition)=c('Week', 'Trial')
TR_decomposition$Repeat=dor_by_week_cleaned$Total - dor_by_week_cleaned$Trial

Given these two columns of data, we create the basic trial/repeat decomposition of weekly sales plot using the chart type “2-D Stacked Area”. – We can also used the chart type “2-D 100% Stacked Area” to create a plot that shows the percentage of weekly sales due to trial versus repeat purchasing.

# creating the first chart: "Trial/Repeat Decomposition of Sales"
fig <- plot_ly(x = ~TR_decomposition$Week, y = ~TR_decomposition$Trial, type = 'scatter', mode = 'lines', name = 'Trial', fill = 'tozeroy')
fig = fig %>% add_trace(x = ~TR_decomposition$Week, y = ~TR_decomposition$Repeat, name = 'Repeat', fill = 'tozeroy')
fig = fig %>% layout(xaxis = list(title = 'Week'),
         yaxis = list(title = 'Sales (transactions)')) %>% layout(title = 'Trial/Repeat Decomposition of Sales')

fig

We create Figure 6.9, which breaks cumulative sales into its trial, first repeat (FR), and additional repeat (AR) components, in the following manner: – We insert a new worksheet, renaming it Cum. sales decomposition, and add the week numbers in column A.

The weekly trial and first-repeat numbers are extracted from the worksheet DoR by week by entering =’DoR by week’!R5 in cell B2 and =’DoR by week’!S5 in cell C2, and copying these formulas down to row 53. – The AR sales numbers are computed by entering =SUM(’DoR by week’!T5:AC5) in cell D2 and copying the formula down to D53. – We enter the column headings Trial FR AR in cells B1:D1. – Having copied the week numbers from column A to column F and entered the column headings Trial FR AR in cells G1:I1, we create the cumulative trial, first-repeat, and additional-repeat by first entering =B2 in cell G2, =G2+B3 in cell G3, copying this second formula down to G53, and then copying the formulas in cells G2:G53 across columns H and I

# Cum. sales decomposition

# First table
cum_sales_decomposition=data.frame(dor_by_week_cleaned[1],dor_by_week_cleaned[2], FR=dor_by_week_cleaned[3],AR=rowSums(dor_by_week_cleaned[,c(4,5,6,7,8,9,10,11,12,13)]))
colnames(cum_sales_decomposition)=c("Week" , "Trial", "FR"  ,  "AR"   )

# Cumulative table
cum_sales_decomposition1=cum_sales_decomposition
cum_sales_decomposition1$Trial=cumsum(cum_sales_decomposition1$Trial)
cum_sales_decomposition1$FR=cumsum(cum_sales_decomposition1$FR)
cum_sales_decomposition1$AR=cumsum(cum_sales_decomposition1$AR)

Given these three columns of data, we create the trial/FR/AR decomposition of cumulative sales plot using the chart type “2-D Stacked Area”.

We insert a new worksheet, renaming it Cum. trial, and add the week numbers in column A. – We need to extract the cumulative trial numbers from the worksheet Cum. sales decomposition. We do so by entering=’Cum. sales decomposition’!G2 in cell B2 and copying the formula down to B53. – Recall that there are 1499 panellists in Market 2. We compute the % cumulative trial numbers by entering =B2/1499 in cell C2 (formatting the answer as a percentage) and copying the formula down to C53.

# Cum. trial

cum_trial=data.frame(cum_sales_decomposition[1],cum_trail=cum_sales_decomposition[,2])
cum_trial$'% cum_trial'=round((cum_trial$cum_trail/1499)*100,1)
cum_trial1=cum_trial    # creating a copy of cum_trial
cum_trial1$cum_trail=cumsum(cum_trial$cum_trail)
cum_trial1$`% cum_trial`=round((cum_trial1$cum_trail/1499)*100,1)


# let's create the plot !
plot(cum_trial1$`% cum_trial`,type = 'l',col='blue',xlab = 'week', ylab = '% Households', main = 'Kiwi Bubbles Cumulative Trial')

We create Figure 6.11, which plots the evolution of the percentage of panellists that made a trial purchase that have gone on to make a (first) repeat purchase, in the following manner: – We insert a new worksheet, renaming it % Triers repeating, and add the week numbers in column A. – We enter =’Cum. sales decomposition’!G2 in cell B2 and =’Cum. sales decomposition’!H2 in cell C2, and copy these formulas down to row 53. This gives us the trial and first repeat numbers from the worksheet Cum. sales decomposition. – The % triers repeating numbers are computed by entering =C2/B2 in cell D2 and copying the formula down to D53. – We insert a line chart which plots the numbers in D2:D53.

# % Triers repeating

triers_repeating=data.frame(cum_sales_decomposition1[,c(1,2,3)])
colnames(triers_repeating)=c('Week','cum.Trial', 'Cum.FR')
triers_repeating$'% Triers Repeating'=round(((triers_repeating$Cum.FR/triers_repeating$cum.Trial)*100),1)


# let's create the plot !
plot(triers_repeating$`% Triers Repeating`,type = 'l',col='blue',xlab = 'week', ylab = '% Triers repeating', main = '% Triers repeating')

We make a copy of the worksheet Market 2, renaming it Cum. FR by trial class (a), and delete the numbers we added below row 307. • We create a Trial week variable (cell F1) that equals the value of Week if this is a trial purchase and −99 otherwise. We enter the formula =IF(E2=0,B2,-99) in cell F2 and copy it down to F307. • Next, we create a FR delta variable (cell G1) that tells us how many weeks after the trial purchase the panellist’s first repeat purchase occurred (assuming it was observed). We enter the formula =IF(E3=1, B3-F2,-99) in cell G2 and copy down to cell G307; −99 indicates that a first repeat purchase was not observed for this panellist in the 52-week observation period.6

# Cum. FR by trial class (a)

cum_FR_by_trial_class_a=market2
cum_FR_by_trial_class_a$Trial_Week=ifelse(cum_FR_by_trial_class_a$DOR==0,cum_FR_by_trial_class_a$Week,-99)
n <- nrow(cum_FR_by_trial_class_a)
if (n > 1) for(i in 2:n) cum_FR_by_trial_class_a$FR_delta[i] <- ifelse(cum_FR_by_trial_class_a$DOR[i+1]==1,(cum_FR_by_trial_class_a$Week[i+1]-cum_FR_by_trial_class_a$Trial_Week[i]),-99)

# before we do anything let's expand the rows
D=data.frame(matrix(NA, nrow = 104, ncol = 7))

# the values that will be added to trial week
p=rep(-99,52)
p1=seq(1,52)
pp=c(p,p1)
D$X6=pp

# the values that will be added to FR delta
k=seq(0,51)
KK=c(k,p)
D$X7=KK
colnames(D)=colnames(cum_FR_by_trial_class_a)
cum_FR_by_trial_class_a=rbind(cum_FR_by_trial_class_a,D)
cum_FR_by_trial_class_a[1,7]=-99

# Cum. FR by trial class (b)

cum_FR_by_trial_class_b=dcast(cum_FR_by_trial_class_a, Trial_Week ~ FR_delta, value.var="ID", fun.aggregate=length)
cum_FR_by_trial_class_b=cum_FR_by_trial_class_b[,-55]
cum_FR_by_trial_class_b[,2]=cum_FR_by_trial_class_b[,2]-1
cum_FR_by_trial_class_b[1,]=cum_FR_by_trial_class_b[1,]-1

# I noted that everything has a value that is 1 larger. So I subtracted 1 from the dataframe.

cum_FR_by_trial_class_b[1,2]=167 # we leave the column names as they are because R does not allow numeric column names.
cum_FR_by_trial_class_b[1,1]=-99
cum_FR_by_trial_class_b$grandtotal=rowSums(cum_FR_by_trial_class_b[,-1])

We now create a cleaned-up version of this table. – We first insert a new worksheet, renaming it Cum. FR by trial class (c). – We enter the “Trial week” numbers in cells B3:B54. The number of triers in each week are extracted from the relevant row totals in the pivot table we just created. We enter =’Cum. FR by trial class (b)’!C6 in cell C3 and copy the formula down to C54. – The “Weeks after trial” numbers (0–51) are entered in D2:BC2, and the main entries from the pivot table are extracted by entering=’Cum. FR by trial class (b)’!C6 in cell D3 and copying the formula across and down to BC54

# Cum. FR by trial class (c)

cum_FR_by_trial_class_c=data.frame(Trial_week=cum_FR_by_trial_class_b[-1,1],"#tiers"=cum_FR_by_trial_class_b$grandtotal[-1])
cum_FR_by_trial_class_c=cbind(cum_FR_by_trial_class_c,cum_FR_by_trial_class_b[-1,-c(1,2,55)])
row.names(cum_FR_by_trial_class_c)=1:52

• We want to create a version of this table in which the rows report the cumulative percentage of triers that have made a first repeat purchase so many weeks after their trial purchase. We enter the “Trial week” numbers in cells B58:B109 and the corresponding number of triers numbers in cells C58:C109. (Enter =C3 in cell C58 and copy the formula down to cell C109). The “Weeks after trial” numbers (0–51) are entered in cells D57:BC57. Next, we enter =IF( B58>52-D57,““,IF(C58>0,SUM(D3:D3)/C58,0))in cell D58, format the result as a percentage, and copy the formula across and down to BC109. (Notice how this formula automatically suppresses the entries that were previously greyed out.)

# first let's replace those rows where the condition is this IF($B58>52-D$57,""
cum_FR_by_trial_class_c1=cum_FR_by_trial_class_c[,-c(1,2)]

for (i in 1:nrow(cum_FR_by_trial_class_c1)){
  for (j in 1:(ncol(cum_FR_by_trial_class_c1))){
    if (isTRUE((as.integer(rownames(cum_FR_by_trial_class_c1[j])[i]))>=(53-as.integer(colnames(cum_FR_by_trial_class_c1[j]))))){
      cum_FR_by_trial_class_c1[i,j]=NA
    }
  }
}

# next step is to apply the next if statement
# to make things easier, let's replace NA with -99

cum_FR_by_trial_class_c2=cbind(cum_FR_by_trial_class_c[2],cum_FR_by_trial_class_c1)
for (i in 1:nrow(cum_FR_by_trial_class_c2)){
  for (j in 1:(ncol(cum_FR_by_trial_class_c2))){
    if (cum_FR_by_trial_class_c2$X.tiers[i]>0){
      cum_FR_by_trial_class_c1[i,j]=(sum(cum_FR_by_trial_class_c2[i,1:j]))/cum_FR_by_trial_class_c2$X.tiers[i]
    }
  }
}

cum_FR_by_trial_class_c1=cum_FR_by_trial_class_c1-1
cum_FR_by_trial_class_c1=cbind(cum_FR_by_trial_class_c[2],cum_FR_by_trial_class_c1)[-2]
cum_FR_by_trial_class_c1[cum_FR_by_trial_class_c1<0]=0
colnames(cum_FR_by_trial_class_c1)=seq(0,51)

# get the sumproduct
seq1=seq(0,26)
subs=cum_FR_by_trial_class_c1[1:27,2:28]
H=round((crossprod(as.matrix(seq1),as.matrix(subs))/101)*100,1)

# create the plot "Time to First Repeat"
plot(H[1,], type='l', col='blue', main='Time to First Repeat',xlab = 'Weeks after trial', ylab = '% Triers')

We can perform similar analyses for the time from first repeat to second repeat, second repeat to third repeat, and so on. An undocumented analysis of first repeat to second repeat is given in the worksheets Cum. 2R by FR class (a), Cum. 2R by FR class (b), and Cum. 2R by FR class (c)

# Cum. 2R by FR class (a)

# we borrow a lot from what was done in the previous section. We only change the namings.

cum_2FR_by_trial_class_a=market2

cum_2FR_by_trial_class_a$FR_week=ifelse(cum_2FR_by_trial_class_a$DOR==1,cum_2FR_by_trial_class_a$Week,-99)
n <- nrow(cum_2FR_by_trial_class_a)
if (n > 1) for(i in 2:n) cum_2FR_by_trial_class_a$FR2_delta[i] <- ifelse(cum_2FR_by_trial_class_a$DOR[i+1]==2,(cum_2FR_by_trial_class_a$Week[i+1]-cum_2FR_by_trial_class_a$FR_week[i]),-99)

# before we do anything lets expand the rows
D=data.frame(matrix(NA, nrow = 104, ncol = 7))

# the values that will be added to trial week
p=rep(-99,52)
p1=seq(1,52)
pp=c(p,p1)
D$X6=pp

# the values that will be added to FR delta
k=seq(0,51)
KK=c(k,p)
D$X7=KK
colnames(D)=colnames(cum_2FR_by_trial_class_a)
cum_2FR_by_trial_class_a=rbind(cum_2FR_by_trial_class_a,D)
cum_2FR_by_trial_class_a[1,7]=-99

# the final chef-d'oeuvre

# Cum. 2R by FR class (b)

cum_2FR_by_trial_class_b=dcast(cum_2FR_by_trial_class_a, FR_week ~ FR2_delta, value.var="ID", fun.aggregate=length)
cum_2FR_by_trial_class_b=cum_2FR_by_trial_class_b[,-55]
cum_2FR_by_trial_class_b[,2]=cum_2FR_by_trial_class_b[,2]-1
cum_2FR_by_trial_class_b[1,]=cum_2FR_by_trial_class_b[1,]-1

# I noted that everything has a value that is 1 larger. So I subtracted 1 from the dataframe.

cum_2FR_by_trial_class_b[1,2]=254 # we leave the column names as they are because R does not allow numeric column names.
cum_2FR_by_trial_class_b[1,1]=-99
cum_2FR_by_trial_class_b$grandtotal=rowSums(cum_2FR_by_trial_class_b[,-1])

# Cum. 2R by FR class (c)

cum_2FR_by_trial_class_c=data.frame(FR_week=cum_2FR_by_trial_class_b[-1,1],"#tiers"=cum_2FR_by_trial_class_b$grandtotal[-1])
cum_2FR_by_trial_class_c=cbind(cum_2FR_by_trial_class_c,cum_2FR_by_trial_class_b[-1,-c(1,2,55)])
row.names(cum_2FR_by_trial_class_c)=1:52

cum_2FR_by_trial_class_c1=cum_2FR_by_trial_class_c[,-c(1,2)]
         
for (i in 1:nrow(cum_2FR_by_trial_class_c1)){
  for (j in 1:(ncol(cum_2FR_by_trial_class_c1))){
    if (isTRUE((as.integer(rownames(cum_2FR_by_trial_class_c1[j])[i]))>=(53-as.integer(colnames(cum_2FR_by_trial_class_c1[j]))))){
      cum_2FR_by_trial_class_c1[i,j]=NA
    }
  }
}

# next step is to apply the next if statement
# to make things easier, let's replace NA with -99

cum_2FR_by_trial_class_c2=cbind(cum_2FR_by_trial_class_c[2],cum_2FR_by_trial_class_c1)
for (i in 1:nrow(cum_2FR_by_trial_class_c2)){
  for (j in 1:(ncol(cum_2FR_by_trial_class_c2))){
    if (cum_2FR_by_trial_class_c2$X.tiers[i]>0){
      cum_2FR_by_trial_class_c1[i,j]=(sum(cum_2FR_by_trial_class_c2[i,1:j]))/cum_2FR_by_trial_class_c2$X.tiers[i]
    }
  }
}

cum_2FR_by_trial_class_c1=cum_2FR_by_trial_class_c1-1
cum_2FR_by_trial_class_c1=cbind(cum_2FR_by_trial_class_c[2],cum_2FR_by_trial_class_c1)[-2]
cum_2FR_by_trial_class_c1[cum_2FR_by_trial_class_c1<0]=0
colnames(cum_2FR_by_trial_class_c1)=seq(0,51)
cum_2FR_by_trial_class_c1=cum_2FR_by_trial_class_c1[-53]

subs1=cum_2FR_by_trial_class_c1[1:27,2:28]
R=round((crossprod(as.matrix(seq1),as.matrix(subs1))/101)*100,1)

# create the plot "Time to Second Repeat"
plot(R[1,], type='l', col='blue', main='Time to Second Repeat',xlab='Weeks after first repeat', ylab='% First repeaters')

Replication of chapter 6b Master Thesis

Panji BHARATA

January 2023