Growth and Marketing Analytics in R

$Marketing Analytics - Jaime Iglesias$

Marketing Analytics - Jaime Iglesias

Introduction

This personal project shows some of the most important metrics in growth and marketing analytics, the data manipulation was conducted with the statistical programming language, R. This analysis is a simple approach for descriptive and predictive marketing analytics as a way to fuel a company’s user and revenue growth.

This same dataset has been used for a Statistical Cluster Segmentation, this project can also be found in my website. https://www.jaimeiglesias.com/data-science-projects

About the dataset used: The Dataset was obtained from the UCI Machine Learning Repository.

Dataset columns:

InvoiceNo
StockCode
Description
Quantity
InvoiceDate
UnitPrice
CustomerID
Country

Importing the Online Retail Dataset

The data is provided in an excel file and has been imported into R for the analysis. The type of columns (text, date, numeric) has been assigned as imported for easyness of use throughout the project.

It is particularly important to assign the right type of variable to column date as trends over time are key for the type of analysis conducted.

# Importing libraries 
library(readxl) # To read Excel files
library(dplyr) # Package for Data Manipulation in R 
library(ggplot2) #  Visualization Package
library(DT) # Package used to better display tables in the Rmarkdown
library(lubridate) # Pacakge specifically for date manipulation
library(scales) # Package for number formatting

# Importing Data into R
OnlineRetailData1 <- read_excel("1.Online Retail.xlsx", 
                               col_types = c("text", "text", "text", 
                                             "numeric", "date", "numeric", "numeric", 
                                             "text"))

# Showing the first 20 observations of the dataset
DT::datatable(head(OnlineRetailData1, 20),
              rownames = FALSE,
              options = list(
                pageLength = 5))

Data Preprocessing and Cleaning

For this analysis, only complete and non duplicated observations are kept.

These are two important asssumptions as they impact later results of the analysis. Unfortunately, there is not much context to the dataset nor an engineering team to contact for data collection practices questions.

Note

Part of the cleaning code referenced from Jerry Dormetus - Link to his project at the bottom of this analysis.

# Preprocessing - Eliminating duplicates
# observations with duplicated values
dupes <- which(duplicated(OnlineRetailData1))
# Subsetting out the duplicated values
OnlineRetailData2 <- OnlineRetailData1[-dupes,]
# Preprocessing for Missing Values
OnlineRetailData <- OnlineRetailData2[complete.cases(OnlineRetailData2),]
# Removing not useful datasets
rm(OnlineRetailData1, dupes)
rm(OnlineRetailData2)

Dataset Description

Below is a summary statistics of the numeric and date variables in the dataset.

There are 401,604 transactions in the dataset.
Ranging from December 1, 2010 to December 9, 2011.
Average Unit Price of $3.47
Average Quantity per transaction 12.18

OnlineRetailData %>%
  select(InvoiceNo, Quantity, InvoiceDate, UnitPrice) %>%
  summary()
##   InvoiceNo            Quantity          InvoiceDate                 
##  Length:401604      Min.   :-80995.00   Min.   :2010-12-01 08:26:00  
##  Class :character   1st Qu.:     2.00   1st Qu.:2011-04-06 15:02:00  
##  Mode  :character   Median :     5.00   Median :2011-07-29 15:40:00  
##                     Mean   :    12.18   Mean   :2011-07-10 12:08:23  
##                     3rd Qu.:    12.00   3rd Qu.:2011-10-20 11:58:30  
##                     Max.   : 80995.00   Max.   :2011-12-09 12:50:00  
##    UnitPrice       
##  Min.   :    0.00  
##  1st Qu.:    1.25  
##  Median :    1.95  
##  Mean   :    3.47  
##  3rd Qu.:    3.75  
##  Max.   :38970.00

Online Retailer Performance Analysis

Number of Countries

There are more than 35 countries where this business has sold at least one product, most of the transactions were made in the UK, about (88%) of the total transactions.

This is important to know as we move into prioritazition of efforts and strategy, however, this analysis comprehends a big picture of the overall performance and growth in all the countries.

# Transactions per Country 
Transactions_per_Country <- OnlineRetailData %>%
  group_by(Country) %>%
  summarise(Number_of_Transactions = n()) %>%
  arrange(-Number_of_Transactions)


DT::datatable((Transactions_per_Country),
              rownames = FALSE)

Monthly Revenue

Total Revenue is calculated using the unit price and quantity (UnitPrice * Quantity) per observation. Here we are digging for potential swings on monthly revenue over the course of December 2010 and December 2011.

There is decrease in Revenue at the start of the year, reaching the lowest point in the month of April (USD 425,222). Upward trend after April with slight dips in June and July; revenue swings up again the rest of the year, reaching peak monthly revenue in November (USD 1,126,815). December 2011 transactions are comprised of only 9 days, reason why we see the sudden monthly revenue change decrease.

# Attaching variables for easy use moving forward
attach(OnlineRetailData)
# Monthly Revenue
# Multiplyng UnitPrice and Quantity to create a Revenue column
OnlineRetailData <- OnlineRetailData %>%
  mutate(Revenue = `UnitPrice`*`Quantity`)

# Grouping revenu by month to generate a plot
Monthly_Revenue <- OnlineRetailData %>%
  select(InvoiceDate, Revenue) %>%
  group_by(Date=floor_date(InvoiceDate, "month")) %>% # Grouping by months in type POSIXct
  summarise(Monthly_Revenue = sum(Revenue))  # Aggregating by Revenue

# Plotting a graph line of Online Retailer Monthly Revenue
Monthly_Revenue_Visz <- ggplot(data=Monthly_Revenue, aes(Date, Monthly_Revenue)) +
  geom_line(colour='darkblue') +
  geom_point() + 
  scale_y_continuous(labels = scales::dollar) + # Changing the Monthly_Revenue variable to Dollars
  labs(y='Revenue') + # Chanign y-axis name
  theme_light()
print(Monthly_Revenue_Visz)

Revenue Percentage Change over Month

Breaking point in the month of April where Revenue percentage change oscillates drastically from March and moving forward to May 2011. Further analysis can reveal where this change happens on geographic, product or customer level to plan stratigically in the future.

# Percentage Change 
# Calculating the percentage change 
Perc_Change <- Monthly_Revenue %>%
  mutate(Percent_Change = (Monthly_Revenue/lag(Monthly_Revenue) - 1))  # Percentage change 
  
# Plotting the percent change Monthly Revenue
Perc_Change_Visz <- ggplot(data=Perc_Change, aes(Date, Percent_Change)) +
  geom_line(colour='darkblue') +
  geom_point() +
  labs(y='% Change in Revenue') + # Changing y-axis name
  scale_y_continuous(labels = scales::percent)  + # Better display of numbers
  theme_light()
print(Perc_Change_Visz)

Monthly Active Users (MAU)

For the purpose of this analysis, MAU is calculated based on the number of users that made at least one purchase a month.

A lot more activity after the month of August where the number of active users increases until November, suprassing the 1,600 MAU mark.

Unique_MU <- OnlineRetailData %>%
  mutate(Date=floor_date(InvoiceDate, "month")) %>%
  group_by(Date)  %>%
  summarise(Unique_Active_Users=n_distinct(CustomerID)) # Selecting unique buyers a month 

# Plotting Unique Monthly Users
UM_line <- ggplot(data=Unique_MU, aes(Date, Unique_Active_Users)) +
  geom_line(colour='darkblue') +
  geom_point() +
  labs(y='Active Users') +
  scale_y_continuous(labels = scales::comma) +
  theme_light()
print(UM_line)

Monthly Transactions

So far, there has been a nice increasing trend of active users, it is expected that this trend is also followed by an increse in the total number of transactions. In fact, this is the case as total number of transactions also increase. November represents the month with the greatest number of transactions surpassing 30,000.

# Number of Unique Invoices a Month - Monthly Order Count
Monthly_Orders <- OnlineRetailData %>%
  mutate(Date=floor_date(InvoiceDate, "month")) %>%
  group_by(Date)  %>%
  summarise(Invoices=n_distinct(InvoiceNo)) # Selecting unique Invoice Number a month

Monthly_Order_line <- ggplot(data=Monthly_Orders, aes(Date,Invoices)) +
  geom_line(colour='darkblue') +
  geom_point() +
  labs(y='Number of Transactions') +
  scale_y_continuous(labels = scales::comma) +
  theme_light()
print(Monthly_Order_line)

Average Revenue per Order

There is a slight variation of USD 5 between the highets and lowest monthly average revenue per order. It remains near USD 20 throughout the year.

# Average Revenue per Order by Month
Revenue_per_Order <- OnlineRetailData %>%
  mutate(Date=floor_date(InvoiceDate, "month")) %>%
  group_by(Date) %>%
  summarise(Monthly_Revenue_Average = mean(Revenue))

Average_Revenue_per_Transc <- ggplot(Revenue_per_Order, aes(Date, Monthly_Revenue_Average))  +
  geom_text(aes(label = round(Monthly_Revenue_Average,2)), vjust = -0.3) +
  geom_bar(stat = 'identity')  +
  labs(y= 'Avg. Revenue per Order') +
  scale_y_continuous(labels = scales::dollar, limits = c(0,30)) +
  theme_light()
print(Average_Revenue_per_Transc)

Revenue per Customer Type - NEW vs EXISTING

Digging into revenue generated by new and existing customer, there are clear trends that will help the Online Retailer stratigize in months to come.

Revenue from Existing users remain high and strong throught the year, however there are two possible scenarios to explain the decrease in revenue of New users.

There are less New users being acquired over time. There is less money spend by New users when they first joined. To answer this question, let’s look into how many new customer are being acquired per month in the next graph.

Can revenue generated by existing customers be maintained in the long run? Further analysis required.

First_purchase <- OnlineRetailData %>%
  group_by(CustomerID) %>%
  mutate(Date=floor_date(InvoiceDate, "month"),First_Purchase= min(InvoiceDate), # New month column and minimum purchase date column by user
         Type = case_when(InvoiceDate == First_Purchase ~'NEW', TRUE ~ 'EXISTING')) # New column for NEW and EXISTING users based on date joined

New_vs_Existing_Revenue <- First_purchase %>%
  group_by(Date = floor_date(InvoiceDate, 'month'), Type) %>%
  summarise(Revenue = sum(Revenue))

# Ploting New vs Existing Revenue over month
ggplot(New_vs_Existing_Revenue, aes(Date, Revenue, col=Type)) + 
  geom_line() + 
  scale_y_continuous(labels = scales::dollar, limits = c(0,900000), breaks = c(150000, 300000, 450000, 600000, 750000, 900000)) +
  theme_light()

Monthly New Users Ratio

Here the ratio of new users per month. On January 2011, more than half of the users were new customers, over the course of the year, this ratio declines to less than 20% of new customer a month by the end of November.

Can we implement growth and user acquisition strategies to increase the monthly number of new users?

# Counting New Unique users per month
New_User_Ratio <-  First_purchase %>%
  filter(Type == 'NEW') %>%
  group_by(Date) %>%
  summarise(Number_New_Users = n_distinct(CustomerID))

All_Users_Month <- First_purchase %>%
  group_by(Date) %>%
  summarise(Total_Users= n_distinct(CustomerID))
  
New_Old_User_Ratio <- inner_join(New_User_Ratio, All_Users_Month) %>%
  mutate(Ratio = round((`Number_New_Users`/`Total_Users`)*100))

# Visualizing New Users Ratio
New_Users <- ggplot(New_Old_User_Ratio, aes(Date,Ratio))  +
  geom_text(aes(label = Ratio), vjust = -0.3) +
  geom_bar(stat = 'identity') +
  labs(y= 'User Acquisition Ratio %') +
  theme_light()
print(New_Users)

Retention Rate Panel

The retention panel has been built to understand the how many users have remained purchasing over the course of every month.

Column 0, represents the number of customers acquired by month and reading horizontally the percentage of these cutomers who still purchase from the Online Retailer in later months.

A lot can be learned from the last row, which summarizes the overall retention rate information. Here in column 0, it can be seen the average number of users acquired monthly (398). Also, the month with highest rate of retention was October with a rate of 24.80%.

# Creating a Cohort Analysis to estimate how well we retain customers/users

# Setting up a column for the year data to make it easier to focus on 2011
OnlineRetailData$Year <- as.numeric(format(OnlineRetailData$InvoiceDate, '%Y'))

# There is only one month previous to 2011
Cohorts_2011 <- OnlineRetailData[OnlineRetailData$Year==2011,]

# Selecting relevant features for the cohort
Cohorts_2011 <- Cohorts_2011 %>%
  select(CustomerID, InvoiceDate, Year)

Date_Joined <- Cohorts_2011 %>%
  group_by(CustomerID) %>%
  summarise(First_Purchase_Date = min(InvoiceDate))

# -------------------------------------------------------
# Below - Code referenced from Jerry Dormetus 
# Link to his project: https://rstudio-pubs-static.s3.amazonaws.com/365184_904c4369586e49fc8fa08adcae1d559d.html#preprocessing:_data_types

Date_Joined$Cohort <- as.numeric(format(Date_Joined$First_Purchase_Date, "%m"))
Cohorts_2011 <- merge(Cohorts_2011, Date_Joined)

# Number of days and months the customer has been active - (Making a purchase)
Cohorts_2011$Days_Active <- as.numeric(difftime(Cohorts_2011$InvoiceDate, Cohorts_2011$First_Purchase_Date, units = 'days'))

# Dividing the days by 30 to get the number of months
Cohorts_2011$Months_Active <- floor(Cohorts_2011$Days_Active/30)

# Dumping the day element from the join date column
Cohorts_2011$First_Purchase_Date <- format(Cohorts_2011$First_Purchase_Date, "%Y-%m")

# Now we remove the day element from the InvoiceDate data since
# this Cohort Analysis is based on monthly activity.
Cohorts_2011$InvoiceDate <- format(Cohorts_2011$InvoiceDate, "%Y-%m")
# Its important that we remove the day data so that we can remove
# extra observations in the months where customers have multiple
# observations in a single month due to having multiple orders in
# a single month. We'll use the function duplicated for this later

# We relabel the cohort column data to something more intuitive for the sake
# of the report consumers, then factor them since these are sequential
groups <- c("January",
            "February",
            "March",
            "April",
            "May",
            "June",
            "July",
            "August",
            "September",
            "October",
            "November",
            "December")

for(i in 1:12){
  Cohorts_2011[Cohorts_2011$Cohort==i,"Cohort"] <- groups[i]
}

Cohorts_2011$Cohort <- factor(Cohorts_2011$Cohort,ordered = T,levels = groups)

# Montlhy Active Users Panel
# By excluding both columns Age_by_Day and Age_by_Month
# we're able to remove the extra monthly observations to
# avoid counting unique customer IDs multiple times in 
# any single month.

# The day and month Age variables keep us from removing
# duplicates which is why we need to exclude them both
dupes2 <- which(duplicated(Cohorts_2011[,c(-6,-7)]))

# Removing the duplicate observations
Cohorts_2011 <- Cohorts_2011[-dupes2,]

rm(dupes2)


# Creating rows for each cohort group
# Creating columns for each value in the Age_by_Month column;0-11
# The default aggregation setup for dcast is, fun.aggregate = length
cohorts.wide <- reshape2::dcast(Cohorts_2011,Cohort~Months_Active,
                                value.var="CustomerID",
                                fun.aggregate = length)

# Cloning the output for retention and churn mixpanels
# to be used later
cw.retention <- cohorts.wide
cw.churn <- cohorts.wide


# Calculating the percentages. month number/join month number
# DT will handle the *100 and % formating.
# The sequence needs to be reversed because if the first
# column is worked on, everything else will be divided by 1.
# Instead of formatting column 0 to show 100% for each row, it seems
# more useful to leave this as the original count, showing how
# many new customers were acquired in its respective month. This
# is why the for loop ends right before column 0.
for (i in rev(3:ncol(cw.retention))){
  cw.retention[,i] <- round(cw.retention[,i]/cw.retention[,2],4)
}
rm(i)

# Cloning the retention mixpanel
retention.avgs <- cw.retention

# When calculating the column averages, 0 won't get ignored,
# which is a problem. Converting these 0 to NAs solves this issue.
retention.avgs[retention.avgs == 0.0000] <- NA
avgs.ret <- round(apply(retention.avgs[,-1],2,mean, na.rm=TRUE),4)

# We use the zero because this is a numerical vector
# Changing it after the merge can't happen due to the
# factoring of the Cohort labels
avgs.ret <- c(0,avgs.ret)

# Adding the averages row to the retention mixpanel
cw.retention <- rbind(cw.retention,avgs.ret)

# Creating 19 breaks and 20 rgb color values ranging from blue to white
breaks <- quantile(cw.retention[,3:13], probs = seq(.05, .95, .05), na.rm = TRUE)
colors <- sapply(round(seq(155, 80, length.out = length(breaks) + 1), 0),
                 function(x){ rgb(x,x,155, maxColorValue = 155) } )


# The retention rate mixpanel
DT::datatable(cw.retention,
              class = 'cell-border stripe',
             rownames = FALSE,
             options = list(
               ordering=F,
               dom = 't',
               pageLength = 13) ) %>%
             formatStyle("0",
                         backgroundColor = 'lightgrey',
                         fontWeight = 'bold') %>%
  formatPercentage(c(3:13),2) %>% # We don't want column 0 in %
  formatStyle("1", fontWeight = 'bold') %>%
  formatStyle(names(cw.retention[c(-1,-2)]),color = 'white',fontWeight = 'bold', backgroundColor = styleInterval(breaks,colors))

Reference

Part of the Retention Rate code referenced from Jerry Dormetus -

Link to his project.

Summary

The intention of the following highlights is to provide a performance snapshot for the Online Retailer in 2011. Aditionally, to give broad recommendations and generalized on strategies for the upcoming year.

Highlights

88% of the 400,000 transactions coming from United Kingdom. Followed by Germany and France with less than 10,000 transactions in 2011 each.
Good and steady revenue after Q2 of 2011, specially after Q2 and Q3.
High variations on revenue before and after the month April that deserve further investigation.
Steady upward trend of monthly active users MAU and unique transactions.
Increasing revenue trend from Existing users with a concerning decrease in New users revenue.
Lower user acquisiton rate as year progresses, affecting revenue generated by these New customers and future healthy growth of the Online Retailer.
Strong retention rate thorughout the year with more than 20% of exiting customers purchasing every month.