Introduction

The analysis of product performance over a period and customer transactions for a period can provide useful insights on how different products are performing and how customer engagement changes over time. It can be used to answer important questions like:

  • What products are growing or shrinking with changing customer engagement?
  • How does parameters like basket penetration, revenue and units sold change for each particular product over time?
  • which demographic group is spending more on which categories/commodities/products?
  • what is the general trends of household spending over time?
  • How does the general trend of spend vary among different age groups and income groups?

In general, these analyses can be useful for the management to get insights on which product or commodity are doing well in terms of sales and number of units sold and also can be used for selective marketing of products among each demographic group

Packages Required

To analyze this data, we will use the following R packages:

library(ggplot2)   # visualizing data
library(dplyr)     # transforming (joining, summarizing, etc.) data
## Warning: package 'dplyr' was built under R version 3.5.2
library(DT)        # for printing nice HTML output tables
library(tidyr)     # tidying data
library(ggfittext) # for labelling treemap elements
library(treemapify)# for plotting treemaps
library(data.table)# for importing large datasets
library(knitr)     # for printing HTML outputs of summary tables

Data Preparation

Source Data

The data is obtained from leading retailer and the source dataset contains household level transactions over two years from a group of 5000 households who are frequent shoppers at the retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.

The source data consists of three csv files which consists of the Household data, products data and transactions data. The households data includes the data corresponding to 5000 households with demographic details such as Age range, Marital status, Income range, Homeowner description, Household compensation, Size and Number of children. The Products data has the product information such as Product number, Department, Commodity, Private/National brand and Natural/Organic flag. The transactions data has the transactions details corresponding to each household for the products they purchased for a period of two years.

The missing values are recorded mainly in three different formats. For certain variables it is marked as ‘Unknown’, certain others it is ‘null’,‘NOT AVAILABLE’ and some are marked as ‘NA’ as well.

Importing the data

Prior to assessing how customer engagement over time by analyzing the household transactions, data is imported, vaidated and cleaned.The data is obtained from three different csv file corresponding to Households data, Products data and Transaction data respectively.

households <- fread("data/5000_households.csv", strip.white = T) 

products <- fread("data/5000_products.csv", strip.white = T)

transactions <- fread("data/5000_transactions.csv", strip.white = T)

#using strip.white to remove the trailing spaces in the data

Creating Tidy Data

Once the data has been imported from the source files, the next step is to validate the data and clean the data into a tidy format.

For each data set, the columns which have incomlete/unidentifiable names are renamed with identifiable names. Since the missing values are recorded differently for each variable, we need to change those values to ‘NA’ in order to make them uniform. The code below takes care of this process. The variables are first converted into characters for this purpose and then after the data is cleaned, they are changed to appropriate data types.

Finally, the cleaned data sets are merged to a single final dataset with all the required variables.

# create tidy households data

households[] <- lapply(households, as.character)

# null values and all the data points where the data is not available are assigned a uniform 
# value NA

households[households[,] == "null"] <- NA
households[households[,] == "NOT AVAILABLE"] <- NA
households[households[,] == "Not Available"] <- NA
households[households[,] == "Unknown"] <- NA

households[] <- lapply(households, factor)
households$HSHD_NUM <- as.integer(households$HSHD_NUM)
colnames(households)[2] <- "LOYALTY_FLAG"
colnames(households)[4] <- "MARITAL_STATUS"
colnames(households)[6] <- "HOMEOWNER_DESCRIPTION"

# create tidy products data

colnames(products)[5] <- "NATURAL/ORGANIC_FLAG"
colnames(products)[4] <- "BRAND_TYPE"

# create tidy transactions data

colnames(transactions)[3] <- "PURCHASE_DATE"
colnames(transactions)[7] <- "STORE_REGION"

# converting purchase date to a date column

transactions$PURCHASE_DATE <- as.character(transactions$PURCHASE_DATE)
transactions$PURCHASE_DATE <- as.Date(transactions$PURCHASE_DATE, format = "%d-%b-%y")

# create tidy merged data frame

hh_trans <- merge(households, transactions, by = "HSHD_NUM")
final_dataset <- merge(hh_trans, products, by = "PRODUCT_NUM")

# slicing and dicing data for new visuals

data_year_2016 <- final_dataset[which(final_dataset$YEAR == 2016) , ]
View(data_year_2016)

data_year_2016 <- data_year_2016 %>%
  separate(PURCHASE_DATE, sep = "-", into = c("year_1", "month_1", "day_1"))

data_year_2017 <- final_dataset[which(final_dataset$YEAR == 2017) , ]
View(data_year_2017)

data_year_2017 <- data_year_2017 %>%
  separate(PURCHASE_DATE, sep = "-", into = c("year_1", "month_1", "day_1"))

Final Data set

The final cleaned dataset consists of 21 variables in total as shown in the data table below. The variables which we are mainly concerned about are WEEK_NUM (varies from 1-104), SPEND(spend corresponding to each transaction), UNITS (number of units for each transaction). In addition to this, we are interested in factors such as DEPARTMENT, COMMODITY, AGE_RANGE, MARITAL_STATUS, INCOME_RANGE, HSHD_COMPOSITION, HH_SIZE and CHILDREN.

data_sample <- head(final_dataset, n = 50)

datatable(data_sample)

Exploratory Data Analysis

The main goal of this analysis is to find out how customer engagement changes over time. In order to get an overall picture of this, we may need to perform the following analyses.

  • Product wise Analysis - Top 10 products, Revenue and number of units sold, Sales vs year ago, Basket penetration
  • Demographic Analysis - Trend of Household spend over time, Spend across Income Range and Age Range

A.Top 10 Products

### top 10 commodities 2016

data_top_10_commodity_2016 <- data_year_2016 %>%
  group_by(COMMODITY) %>%
  summarise(SPEND = sum(SPEND)) %>%
  arrange(desc(SPEND)) %>%
  top_n(10)

top_10_plot_2016 <- ggplot(data = data_top_10_commodity_2016) +
  geom_bar(mapping = aes(x = reorder(data_top_10_commodity_2016$COMMODITY, -data_top_10_commodity_2016$SPEND), 
                         y = data_top_10_commodity_2016$SPEND), stat = "identity" ,fill = "tomato3") +
  labs(title = "Top 10 commodities of 2016", x = "Commodity", y = "Spend") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(top_10_plot_2016)

## top_10 commodities 2017

data_top_10_commodity_2017 <- data_year_2017 %>%
  group_by(COMMODITY) %>%
  summarise(SPEND = sum(SPEND)) %>%
  arrange(desc(SPEND)) %>%
  top_n(10)

top_10_plot_2017 <- ggplot(data = data_top_10_commodity_2017) +
  geom_bar(mapping = aes(x = reorder(data_top_10_commodity_2017$COMMODITY, -data_top_10_commodity_2017$SPEND), 
                         y = data_top_10_commodity_2017$SPEND), stat = "identity" ,fill = "tomato3") +
  labs(title = "Top 10 commodities of 2017", x = "Commodity", y = "Spend") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(top_10_plot_2017)

Summary/Explanation: The above two bar graphs shows the top 10 products of 2016 and 2017 in terms of revenue generated which is calcutaed by aggregating the Spend column available in the data. A look at both the graphs shows that the set of products in the top 10 bucket remained the same in both the years. Among them, Grocery staple, produce and diary are the top 3 revenue generating products for both the years.

B.Revenue and number of units sold

## Products tree map

data_tree_map_2016 <- data_year_2016 %>%
  group_by(COMMODITY) %>%
  summarise(SPEND = sum(SPEND), UNITS = sum(UNITS)) %>%
  arrange(SPEND) %>%
  na.omit()

View(data_tree_map_2016)

ggplot(data_tree_map_2016, aes(area = SPEND, fill = UNITS, label = COMMODITY)) +
  geom_treemap() +
  geom_treemap_text(fontface = "italic", colour = "white", place = "centre",
                    grow = TRUE) +
  labs(title = "Products Tree Map - Revenue and Number of Units sold")

Summary/Explanation: The tree map arranges the products in the order of revenue generated, the largest box being the top revenue generating product and the smallest box being the lowest. The color gradient gives an indication of the number of units sold, the lightest colored box being the product with most number of units sold.

This gives an idea about which products are doing well in the market and which are not. The marketing team can focus on improving the sales of those products which are not doing great but has the potential to generate more revenue.

C.Sales vs year ago

## Sales vs previous year

data_sales_2016 <- data_year_2016 %>%
  group_by(COMMODITY) %>%
  summarise(SPEND_previous = sum(SPEND)) %>%
  na.omit()


data_sales_2017 <- data_year_2017 %>%
  group_by(COMMODITY) %>%
  summarise(SPEND = sum(SPEND)) %>%
  na.omit()


sales_data <- merge(data_sales_2017, data_sales_2016, by = "COMMODITY")

colnames(sales_data)[1] <- "Product"

colnames(sales_data)[2] <- "Sales_Current_Year"

colnames(sales_data)[3] <- "Sales_Previous_Year"


sales_data$SALES_PERCENT_CHANGE <- ((sales_data$Sales_Current_Year - sales_data$Sales_Previous_Year) 
                                      / sales_data$Sales_Previous_Year)*100

sales_data$SALES_PERCENT_CHANGE <- round(sales_data$SALES_PERCENT_CHANGE, 2)

colnames(sales_data)[4] <- "Percentage_change_in_Sales"


index <- with(sales_data, order(-abs(Percentage_change_in_Sales)))
sales_data <- sales_data[index, ]


datatable(sales_data, caption = "Table 1: Percent change in sales for each product sorted in the descending order of magnitude of change")

Summary/Explanation: The above table is sorted in the order of percentage change in sales for the current year(2017) as compared to the year ago (2016). The products whose sales have increased significatly can easily be tracked. Also the products whose sales has gone down significantly can be identified and the reason for the same can be investigated.

D.Basket Penetration

## Basket penetration
## calculations for previous year

tot_num_basket_2016 <- length(unique(data_year_2016$BASKET_NUM))

data_basket_2016 <- data_year_2016 %>%
  select(COMMODITY, BASKET_NUM) %>%
  group_by(COMMODITY) %>%
  summarise(BASKET_NUM_count_prev_yr = length(unique(BASKET_NUM))) %>%
  na.omit()

data_basket_2016$basket_penetration_prev_yr <- (data_basket_2016$BASKET_NUM_count_prev_yr / tot_num_basket_2016)*100

data_basket_2016$basket_penetration_prev_yr <- round(data_basket_2016$basket_penetration_prev_yr, 4)

## calculations for current year

tot_num_basket_2017 <- length(unique(data_year_2017$BASKET_NUM))

data_basket_2017 <- data_year_2017 %>%
  select(COMMODITY, BASKET_NUM) %>%
  group_by(COMMODITY) %>%
  summarise(BASKET_NUM_count = length(unique(BASKET_NUM))) %>%
  na.omit()

data_basket_2017$basket_penetration <- (data_basket_2017$BASKET_NUM_count
                                                / tot_num_basket_2017)*100

data_basket_2017$basket_penetration <- round(data_basket_2017$basket_penetration, 4)

## summary table for basket penetration

basket_penetration_summary <- merge(data_basket_2017, data_basket_2016, by = 'COMMODITY')
View(basket_penetration_summary)

basket_penetration_summary$change <- (basket_penetration_summary$basket_penetration -
                                                          basket_penetration_summary$basket_penetration_prev_yr)

basket_penetration_summary$change <- round(basket_penetration_summary$change, 4)
  
index_2 <- with(basket_penetration_summary, order(-abs(change)))
basket_penetration_summary <- basket_penetration_summary[index_2, ]

colnames(basket_penetration_summary)[1] <- "Product"
colnames(basket_penetration_summary)[2] <- "Basket_count"
colnames(basket_penetration_summary)[3] <- "Basket_Penetration"
colnames(basket_penetration_summary)[4] <- "Basket_count_prev_year"
colnames(basket_penetration_summary)[5] <- "Basket_Penetration_prev_year"
colnames(basket_penetration_summary)[6] <- "change_in_basket_penetration"

datatable(basket_penetration_summary, caption = "Table 2: Basket Penetration for current year, previous year and change in percentage sorted in the descending order of magnitude of change")

Summary/Explanation: Basket Penetration is an important parameter which relates to what percentage of shopper marktes, the product is in. The above table shows a summary of the Basket Penetration for the current and previous years (2016 and 2017). The change between the two years is also displayed and the table is sorted in the descending order of the magnitude of change.

E.Trend of Household spend over time

## Line chart

data_line_chart <- final_dataset %>%
  group_by(WEEK_NUM) %>%
  summarise(SPEND = sum(SPEND)) %>%
  arrange(WEEK_NUM) %>%
  na.omit()


ggplot(data = data_line_chart, aes(x = WEEK_NUM, y = SPEND)) +
  geom_line(color = 'purple4', size = 2) +
  labs(title = "Household Spend over Time", x = "Week Number", y = "Spend") +
  theme_minimal()

Summary/Explanation: The above line graph shows the general trend of spend among the households over time. The X axis shows the week number and Y axis shows the corresponding aggregated spend. The two spikes in the graph indicates that maximum shopping occurs during the end of the year probably due to the holidays and the festival season.

F.Income Range and Age Range

## Income range and household size 

data_income_range_2016 <- data_year_2016 %>%
  group_by(INCOME_RANGE, AGE_RANGE) %>%
  summarise(SPEND = sum(SPEND)) %>%
  arrange(INCOME_RANGE) %>%
  na.omit()

plot_5 <- ggplot(data = data_income_range_2016) +
  geom_bar(mapping = aes(x = reorder(data_income_range_2016$AGE_RANGE, data_income_range_2016$AGE_RANGE), 
                         y = data_income_range_2016$SPEND), stat = "identity" ,fill = "turquoise4") +
  facet_wrap(~ data_income_range_2016$INCOME_RANGE ) +
  labs(title = "Income Range & Age Range Trend 2016", x = "Age Range", y = "Spend") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(plot_5)

Summary/Explanation: The facet wrap bar graph spend trends among different Age group across different Income Ranges.It can be observed that, the customers in the Age group ‘45-54’ and ‘55-64’ spent the maximum in all income groups except ‘Under 35k’ group. In the ‘under 35k’ group, the older people tend to spend more.

Summary

The main goal of this analysis was to find out how different products are doing in the market and how customer engagement changes over time. In order to get an overall picture of this, the data for the current year(2017) and previous year(2016) were compared.

  • The data was cleaned, transformed and manipulated accordingly for the analysis. Variuos graphs and tables were plotted for each section of the analysis and a brief summary for each were provided.

  • The various visualizations and tables listed under the anayses Top 10 products, Products tree map, Sales comparison for both years and Basket Penetration gives an idea about which products are doing well in the market and which are not.

  • The visuals in the demographic analyses gives insights about the general trends of spend among households and also on which age groups are spending more among different income groups.

  • In general, the product analysis can be useful for the management to plan on improving the sales of those particular products which are not performing as expected.

  • The information in the demographic analyses can be used for selective marketing of products among each demographic group.

  • The analysis can be improved by building on detailed analysis which includes a drill down on to a particular category of product or a particular demographic factor and the interactions between these could be found using a regression anaysis.