Replication of chapter 6a Master Thesis

knitr::opts_chunk$set(echo = TRUE)

We start by loading the libraries needed.

library(readr)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ dplyr   1.0.10
## ✔ tibble  3.1.8      ✔ stringr 1.5.0 
## ✔ tidyr   1.2.1      ✔ forcats 0.5.2 
## ✔ purrr   1.0.1      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(plyr)

## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:purrr':
## 
##     compact

library(dplyr)
library(ggplot2)
library(readxl)
library(lattice)
library(reshape2)

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

We copy the edible grocery worksheet from the workbook used in Chapter 4 to a new workbook. (Don’t forget to save this new workbook, say as chapter 6a.xlsx.)

# edible_grocery

chapter_6a=read_excel("edible_grocery.xlsx")

## New names:
## • `sku_id` -> `sku_id...4`
## • `weight` -> `weight...9`
## • `` -> `...12`
## • `sku_id` -> `sku_id...13`
## • `weight` -> `weight...14`

# R gives new names to the columns, especially the ones that have a common name.

The first thing we will do is check whether PPB = 1. We insert a pivot table where Rows is trans id, Filter is brand (selecting just Alpha) and Values is (Average of) week and (Average of) panel id. This gives us a table (Sheet2) that reports the panellist id and week associated with each transaction.

# Sheet2

func=function(data, filt){
    sheet2_6=data[,c(1,2,3,7)]
    sheet_another=data[,c(3,7,10)]
    rev_spend=data[,c(3,7,8)]
    sheet1=sheet2_6[sheet2_6$brand==filt,]
    sheet2=aggregate( cbind(week,panel_id)~trans_id, data=sheet1[,-4],FUN=mean)
    volume=dcast(sheet_another, week ~ brand, value.var="volume", fun.aggregate=sum)
    revenue=dcast(rev_spend, week ~ brand, value.var="spend", fun.aggregate=sum)
     #list("integer" = foo, "names" = bar)
    return (list (sheet2=sheet2, Volume=volume, revenue=revenue)) 
}

# The function above has two roles. One is to subset the data and get the sheet 2 based on the filtered brand value.

# Secondly, it calculates the volume and revenue in chapter 3 basing on the brand value chosen in the filters.

func_return=func(chapter_6a,'Alpha')
sheet2=func_return$sheet2

We copy the contents of this table (cells A4:C18303) to adjacent cells (e.g., E4:G18303) and add the relevant column labels trans id week panel id in row 4. • With the active cell somewhere in this new table, we insert a pivot table where Rows is week and Values is (Count of) trans id and (Count of) panel id. We note that the number of transaction ids associated with each week equals the number of panellist ids associated with each week. In other words, no panellist made more than one purchase in any given week, which means PPB = 1, as assumed above.

# Sheet3

sheet3=aggregate( cbind(trans_id,panel_id)~week, data=sheet2,FUN=length)

We finish by copying the week numbers and panellist counts into a new worksheet (which we call Weekly summary), adding the labels week and # panellists in cells A1 and B1, respectively. The next step is to add the weekly volume and revenue numbers for Alpha to Weekly summary. We computed these numbers in Chapter 3 and so we copy the relevant columns from the associated workbook, adding the labels volume and revenue in cells C1 and D1, respectively. We now have all the data to create the numbers associated with the revenue decomposition.

# Weekly summary

# let's get the volume from chapter 3.
# here, we compute the volume and revenue again.

volume=func_return$Volume[,2]
revenue=func_return$revenue[,2]

Weekly_summary=cbind(sheet3[,1],sheet3[,3],volume,revenue)
colnames(Weekly_summary)=c('Week','# panellists', 'Volume', "revenue")

Weekly penetration is simply the number of panellists active in any given week divided by the size of the panel. We enter penetration in cell E1, enter =B2/5021 in cell E2, format as a percentage, and copy the formula down to E105

# Adding "penetration"
Weekly_summary=as.data.frame(Weekly_summary)
Weekly_summary$penetration=(Weekly_summary$`# panellists`/5021)*100

• Average order value is simply total revenue for any given week divided by the number of panellists that made at least one purchase of Alpha in that week. Having entered avg order value in cell F1, we enter =D2/B2 in cell F2 and copy the formula down to F105.

# Adding "avg order value"
Weekly_summary$avg_order_value=Weekly_summary$revenue/Weekly_summary$`# panellists`

As noted above, this quantity can be decomposed into average order volume and average price per unit volume (in this case, kg). • Average order volume is simply (total) volume sold in any given week divided by the number of panellists that made at least one purchase of Alpha in that week. Having entered avg order volume in cell G1, we enter =C2/B2 in cell G2 and copy the formula down to G105. • Average price per unit volume is simply total revenue for any given week divided by (total) volume sold in that week. Having entered avg price/kg in cell H1, we enter =D2/C2 in cell H2 and copy the formula down to H105.

# Adding "avg order volume" & "avg price kg"
Weekly_summary$avg_order_volume=Weekly_summary$Volume/Weekly_summary$`# panellists`
Weekly_summary$avg_price_kg=Weekly_summary$revenue/Weekly_summary$Volume

Weekly_summary1=Weekly_summary

# Plotting "Weekly Penetration"
ggplot(data=Weekly_summary, aes(x=Week, y=penetration)) +
  geom_line(color="blue") +ggtitle('Weekly Penetration') +
  ylab("weekly penetration (%)")

# Plotting "Weekly Average Order Value"
ggplot(data=Weekly_summary, aes(x=Week, y=avg_order_value)) +
  geom_line(color="blue") +ggtitle('Weekly Average Order Value') +
  ylab("average order value ($)")

# Plotting "Average Price/kg"
ggplot(data=Weekly_summary, aes(x=Week, y=avg_price_kg)) +
  geom_line(color="blue") +ggtitle('Average Price/kg') +
  ylab("average price per kg ($)")

# Plotting "Average Order Volume (kg)"
ggplot(data=Weekly_summary, aes(x=Week, y=avg_order_volume)) +
  geom_line(color="blue") +ggtitle('Average Order Volume (kg)') +
  ylab("average order volume")

# Plotting Revenue and Penetration
ggplot(Weekly_summary, aes(Week, revenue)) +
  ggtitle("Revenue (orange) & Penetration (blue)") +
  geom_line(color = "blue") +
  geom_line(aes(y = penetration*200), colour = "orange") +
    scale_y_continuous("revenue ($)", sec.axis = sec_axis(~./20000, name =
                                                        "penetration"))

#twoord.plot(lx=Weekly_summary$Week,ly=Weekly_summary$revenue,rx= Weekly_summary$Week,ry=Weekly_summary$penetration, type= c("l", "l"), xaxt = 'n', yaxt = 'n', xlab="week", ylab='Revenue',lylim=c(0,2000))
#par(mar = c(5, 4, 4, 4) + 0.3)     # Additional space for second y-axis
#plot(Weekly_summary$Week, Weekly_summary$revenue, type = 'l', col = 'blue', lwd=2, ylab = 'Revenue', xlab = 'weeks')   # Create first plot
#par(new = TRUE)                    # Add new plot
#plot(Weekly_summary$Week, Weekly_summary$penetration, type='l', col = '#ee9a00',                                       # Create second plot without axes
#axes = TRUE, xlab = "", ylab = "",lwd=2 , ylim = c(0,12))
#axis(side = 4, at = pretty(range(Weekly_summary$penetration)))     # Add second axis
#mtext("penetration", side = 4, line =3)

# all weeks correlation
cor(Weekly_summary1[,-c(1,2,3)])

##                     revenue penetration avg_order_value avg_order_volume
## revenue           1.0000000   0.9818491      0.54234453        0.7584870
## penetration       0.9818491   1.0000000      0.38687004        0.7034499
## avg_order_value   0.5423445   0.3868700      1.00000000        0.6914778
## avg_order_volume  0.7584870   0.7034499      0.69147783        1.0000000
## avg_price_kg     -0.5895791  -0.6540344     -0.07304781       -0.7541574
##                  avg_price_kg
## revenue           -0.58957906
## penetration       -0.65403436
## avg_order_value   -0.07304781
## avg_order_volume  -0.75415741
## avg_price_kg       1.00000000

# weeks 1-52 correlation
week_1_52=Weekly_summary1[Weekly_summary$Week<=52,]
cor(week_1_52[,-c(1,2,3)])

##                     revenue penetration avg_order_value avg_order_volume
## revenue           1.0000000   0.9901953       0.7876445        0.8397546
## penetration       0.9901953   1.0000000       0.7072936        0.7778116
## avg_order_value   0.7876445   0.7072936       1.0000000        0.9564885
## avg_order_volume  0.8397546   0.7778116       0.9564885        1.0000000
## avg_price_kg     -0.5335911  -0.5531073      -0.3491929       -0.6064231
##                  avg_price_kg
## revenue            -0.5335911
## penetration        -0.5531073
## avg_order_value    -0.3491929
## avg_order_volume   -0.6064231
## avg_price_kg        1.0000000

# weeks 53-104 correlation
week_53_104=Weekly_summary1[Weekly_summary$Week>=53,]
cor(week_53_104[,-c(1,2,3)])

##                     revenue penetration avg_order_value avg_order_volume
## revenue           1.0000000   0.9779584      0.41359141        0.7699498
## penetration       0.9779584   1.0000000      0.22708650        0.7104118
## avg_order_value   0.4135914   0.2270865      1.00000000        0.6020706
## avg_order_volume  0.7699498   0.7104118      0.60207059        1.0000000
## avg_price_kg     -0.6850206  -0.7482855     -0.03484664       -0.8041476
##                  avg_price_kg
## revenue           -0.68502064
## penetration       -0.74828547
## avg_order_value   -0.03484664
## avg_order_volume  -0.80414759
## avg_price_kg       1.00000000

# Sheet5

sheet5_func=function(data, filt){
  sheet5=data[,c(1,7,11)]
  sheet5_1=sheet5[sheet5$brand==filt,]
  dcast(sheet5_1, panel_id ~ year, value.var="panel_id", fun.aggregate=sum)
}

sheet5=sheet5_func(chapter_6a,'Alpha')
sheet5$grandtotal=rowSums(sheet5[,c(2,3)])
sheet5$sum_year1=ifelse(sheet5$`1`>=1,1,0)
sheet5$sum_year2=ifelse(sheet5$`2`>=1,1,0)

Next, we compute total revenue and volume for Alpha by year. Going back to the edible grocery worksheet, we insert a pivot table where Rows is year, Filter is brand (selecting just Alpha) and Values is (Sum of) spend and (Sum of) volume. This gives us a new worksheet (Sheet6).

# Sheet6

sheet6=chapter_6a[,c(7,11,10,8)]
sheet6=sheet6[sheet6$brand=="Alpha",][,-1]
sheet6_final=aggregate(cbind(spend, volume)~year, data = sheet6, FUN=sum)

We add a new worksheet, which we call Annual Summary. We enter the column labels Year 1 and Year 2 in cells B1:C1 and the row labels # buyers # transactions revenue total volume in cells A2:A5. The relevant values are copied from Sheet5 and Sheet6. Recalling that Sheet3 contains a weekly count of the number of transactions associated with Alpha, we compute the total number of transactions for years 1 and 2 by entering =SUM(Sheet2!B4:B55) and =SUM(Sheet2!B56:B107) in cells B3 and C3, respectively.

# Annual summary

# the following creates the first part of the annual summary table
num_buyers=c(2624, 2759)
num_transactions=c(sum(head(sheet3$trans_id,52)),sum(tail(sheet3$panel_id,52)))
revenue_years=sheet6_final$spend
total_volume=sheet6_final$volume

annual_summary1=data.frame(matrix(ncol = 2, nrow = 0))
colnames(annual_summary1)=c('year1', 'year2')
annual_summary1[1,]=num_buyers
annual_summary1[2,]=num_transactions
annual_summary1[3,]=revenue_years
annual_summary1[4,]=total_volume
rownames(annual_summary1)=c('num_buyers','num_transactions','revenue_years','total_volume')

# this creates the second part of the annual summary table
penetrate=num_buyers/5021
ppb=num_transactions/num_buyers
avg_order_value=revenue_years/num_transactions
avg_order_volume=total_volume/num_transactions
avg_price_kg=revenue_years/total_volume

annual_summary=data.frame(matrix(nrow = 0, ncol = 2))
colnames(annual_summary)=c('year1', 'year2')
annual_summary[1,]=penetrate
annual_summary[2,]=ppb
annual_summary[3,]=avg_order_value
annual_summary[4,]=avg_order_volume
annual_summary[5,]=avg_price_kg
rownames(annual_summary)=c('penetration','ppb','avg_order_value','avg_order_volume','avg_price_kg')
annual_summary$percentchange=((annual_summary[,2]/annual_summary[,1])-1)*100

We make a copy of Sheet2, and delete rows 1–2 and columns A–D

# Sheet2 (2)

sheet2_copy=sheet2

We create a year variable which indicates whether transaction is associated with the first or second year. We first enter year in cell D1. Next we enter =IF(B2<=52,1,2) in D2 and copy this formula down to D18301.

# creating a year variable
sheet2_copy$year=ifelse(sheet2_copy$week<=52,1,2)

We insert a pivot table where Rows is panel id, Columns is year, and Values is (Count of) trans id. • We enter the column headings panel id year 1 year 2 in cells F4:H4. • We enter =A5 in cell F5 and copying the formula across and down to H3146.

# Sheet9

sheet9=dcast(sheet2_copy, panel_id ~ year, value.var="trans_id", fun.aggregate=length)
colnames(sheet9)=c('panel_id', 'year_1', 'year_2')

With the active cell somewhere in this new table, we insert a new pivot table where Rows is year 1, Columns is year 2, and Values is (Count of) panel id. We rename the resulting worksheet y1 vs. y2 joint distrib).

# y1 vs. y2 joint distrib

y1_vs_y2_joint_distrib=dcast(sheet9, year_1 ~ year_2, value.var="panel_id", fun.aggregate=length)
rownames(y1_vs_y2_joint_distrib)=y1_vs_y2_joint_distrib$year_1
y1_vs_y2_joint_distrib=y1_vs_y2_joint_distrib[,-1]

The pivot table is very sparse for high transaction counts in both the first and second periods. We therefore create a right-censored version of the table (replacing 10 with 10+). Having entering 0, 1, . . . , 10+ in cells Z4:AJ4 and Y5:Y15, we enter =B5 in cell Z5 and copy the formula across and down to AI14.

# creating a right-censored version of the previous table, replacing 10 with 10+
y1vsy2=y1_vs_y2_joint_distrib[1:10,]
sumy1_y2=colSums(y1_vs_y2_joint_distrib[11:19,])

new_y1vsy2=rbind(y1vsy2, '10+'=sumy1_y2)
first_subset=new_y1vsy2[, 1:10]
sum_cols=rowSums(new_y1vsy2[,11:18])
y1_vs_y2_joint_distrib=cbind(first_subset,'10+'=sum_cols)

Note that we have been working with a subset of the original dataset that only contains the purchasing of those that bought Alpha at least once in the two years. Looking at the bottom-right cell of the pivot table, we see that there are 3142 such households. The panel contains 5021 panellists. Therefore the correct number of households that made zero purchases of Alpha in years 1 and 2 is obtained by entering =5021-3142 in the (0,0) cell of the table (Z5). Next, we compute the 10+ numbers by first entering =SUM(B15:B23) in cell Z15 and copying the formula across to AI15. Next, we enter =SUM(L5:S5)in cell AJ5 and copy the formula down to AJ14. Finally, the number of panellists that made 10+ purchases in both years is computed by entering =SUM(L15:S23) in cell AJ15. This gives us Table 6.2.

y1_vs_y2_joint_distrib[1,1]=5021-3142

How do we read this table? Cells Z6:AJ6 tell us how many people who bought Alpha once in year 1 bought Alpha 0, 1, 2, . . . times in year 2. For example, 259 households didn’t buy Alpha in year 2, 201 bought Alpha once in year 2, and so on.

# Columns sums & Percentages
col_sum=colSums(y1_vs_y2_joint_distrib)
col_perc=(col_sum/5021)*100

Having computed the row (cells AL5:AL15) and column (cells Z17:AJ17) totals, we compute the marginal distribution of purchasing of Alpha in the first year in cells AM5:AM15, and the marginal distribution of purchasing in the second year in cells Z18:AJ18.

# Row sums & Percentages
row_sum=rowSums(y1_vs_y2_joint_distrib)
row_perc=(row_sum/5021)*100

# Distribution of Alpha transactions
alpha_transact=data.frame(transactions = rownames(y1_vs_y2_joint_distrib),year2=col_perc, year1=row_perc)

# Distribution of Alpha transactions
data_long3 <- alpha_transact%>%               # Apply gather function
  gather(year, value, - c(transactions))


# drawing the barchart "Distribution of Alpha transactions"
ggplot(data_long3, aes(x=transactions, y=value, fill=year)) + 
  geom_bar(stat="identity",position=position_dodge()) +
  ggtitle("Distribution of Alpha transactions") +
  xlab("Number of transactions") + ylab ("distribution") +
  xlim(c(0,1,2,3,4,5,6,7,8,9,'10+')) + scale_y_continuous(labels=function(x) paste0(x,"%"))

# Sheet2 (3)

sheet2_new=sheet2
sheet2_new$quarter=as.integer((sheet2_new$week-1)/13)+1

We insert a pivot table where Rows is panel id, Columns is quarter, and Values is (Sum of) trans id. We rename this worksheet Repeat rates.

# Repeat rates

# creating the first table
repeat_rates=dcast(sheet2_new, panel_id ~ quarter, value.var="trans_id", fun.aggregate=sum)

Having entered in the quarter labels ( Q1 . . . Q8 ) in cells L4:S4, we create the ever-buyers (by quarter) table by entering =1*(B5>0) in cell L5 and copying the formula across and down to S3146. An entry of 1 means the panellist made at least one purchase of Alpha in the quarter; 0 means no purchase (of Alpha) occurred.

# creating the second table
repeat_rates1=ifelse(repeat_rates[,-1]>=1, 1,0)
colnames(repeat_rates1)=c('Q1', 'Q2', 'Q3','Q4', 'Q5','Q6','Q7','Q8')

Recall that the repeat rate is the percentage of a brand’s customers in a given period who also purchase the product in the following period. How many households bought Alpha in Q1? It is simply the sum of the numbers under Q1. (This is the denominator.) If you purchased Alpha in both periods, you will have a 1 for Q1 and a 1 for Q2. If we multiple these two columns of numbers together, only those customers that purchased Alpha in both periods will have a 1. Any other combination of purchasing will result in a zero. Summing up this product gives us the number of households that purchased Alpha at least once in both periods. (This is the numerator.) • Entering =SUMPRODUCT(L5:L3146,M5:M3146)/SUM(L5:L3146) in cell U5 gives us the Q1-Q2 repeat rate. We see that 66% of those households that purchased Alpha in Q1 bought the brand again at least once in Q2.

# first step to create the "Repeat rate (Alpha)" chart
q1_q2=round(sum(repeat_rates1[,1]*repeat_rates1[,2])/sum(repeat_rates1[,1]),2)*100
q2_q3=round(sum(repeat_rates1[,2]*repeat_rates1[,3])/sum(repeat_rates1[,2]),2)*100
q3_q4=round(sum(repeat_rates1[,3]*repeat_rates1[,4])/sum(repeat_rates1[,3]),2)*100
q4_q5=round(sum(repeat_rates1[,4]*repeat_rates1[,5])/sum(repeat_rates1[,4]),2)*100
q5_q6=round(sum(repeat_rates1[,5]*repeat_rates1[,6])/sum(repeat_rates1[,5]),2)*100
q6_q7=round(sum(repeat_rates1[,6]*repeat_rates1[,7])/sum(repeat_rates1[,6]),2)*100
q7_q8=round(sum(repeat_rates1[,7]*repeat_rates1[,8])/sum(repeat_rates1[,7]),2)*100

repeat_rat_alpha=c(q1_q2,q2_q3,q3_q4,q4_q5,q5_q6,q7_q8)
repeat_rate_labels=c('q1_q2','q2_q3','q3_q4','q4_q5','q5_q6','q7_q8')

Copying this formula across to AA5 gives us the repeat rates for the other quarters. Plotting these numbers gives us Figure 6.7.

# final step to plot the "Repeat rate (Alpha)" chart
repeat_data=data.frame(quarter=repeat_rate_labels, values=repeat_rat_alpha)
ggplot(repeat_data, aes(x=quarter,y=values, group=1))+ geom_line(color="blue") +ylim(0,80) +scale_y_continuous(labels=function(x) paste0(x,"%")) + ggtitle("Repeat rate (Alpha)") + ylab("repeat rate (alpha)")

## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.

Replication of chapter 6a Master Thesis

Panji BHARATA

January 2023