Carbo Loading

The purpose of this project is to perform data analysis and visualization on the Carbo-Loading data set. Carbo-Loading contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. These categories were chosen so that interactions between the categories can be detected and studied.

We used R to perform data analysis and visualization to explore and identify trends in the customer behaviour, and uncover insights to understand the interactions between the categories.

1. Introduction

1.1 Problem Statement:

Penetration of Product: To identify which brands are performing better and adopt marketing strategies to low performing products.
Aisles/location recommendation for products.
Product Recomendation: Identifying buying patterns of customers to make product recommendations.

1.2 Addressing the problem

Analyze which brands are performing better and use the causal data to determine the marketing strategies of better performing brands and using it for low performing brands.
Analyze the sales of products based on their location on the weekly mailer.
Analyze the sales of products based on their in-store display location.
Analyze what products are frequently purchased together to provide recommendations to customers when they purchase one of those products

1.3 How will this analysis help the consumer?

The analysis will help the consumer in understanding the performance of brands and their markteing strategies which can be used to promote the brands of the client. The analysis can be used for providing product recommendations to customers based on their previous purchases. Analysis will help in identyifying the most effective in-store locations and in the weekly mailer.

1.4 Steps involved:

Load Required Packages
Importing multiple Datasets
Clean Up and Prepare Data for Analysis
Exploratory Data Analysis
Data Visualization
Summary of Findings

2. Required Packages

library(haven)    # to import SAS dataset
library(dplyr)    # to manuplate data
library(tidyr)    # to tidy up data
library(stringr)  # to perform string operations
library(ggplot2)  # to visualize the data
library(scales)   # rescaling the data

3. Data Prepartion

3.1 Source of Data - link

The data contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. These categories were chosen so that interactions between the categories can be detected and studied. Here we use 4 data sets from a relational database:

Transactions : The table contains transactions of a sample 2 years of Pasta, paste sauce, syrup and pancake at the household level obtained through loyalty program of a leading US grocer
Store lookup : Provides the zipcode of each store
Product : Provides detailed product information for each product in the transaction table
Causal lookup : Provides trade activity of each product / week

3.2 Data importing

file_names <- c("causal_lookup", "product_lookup", "store_lookup", "transactions")
file2 <- ".sas7bdat"

for(i in seq_along(file_names)) {
  path <- paste0("data/", file_names[i], file2)
  if(file.exists(path)) {
    df <- read_sas(path)
    assign(file_names[i], data.frame(df))
  } else {
    print(paste0("invalid filename ", paste0(file_names[i]), file2))
}
}

3.3 Data Cleaning

Steps involved Cleaning:

Use functions head(), str() & summary() to identify irregularities in the datasets.
Changing the class of necessary variables.
Check for blank and missing values using custom function missval_df().
treatment for missing values.

Using custom function missval_df for cleaning the dataset. The function takes the name of the dataset as the input parameter. The function imputes blanks with NA and computes the number and percentage of missing values in each column.

missval_df<-function(df){
  df[df == ""] <- NA
  mv<-sapply(df, function(x) sum(is.na(x)))
  miss_tab<-c("Variable"="abc", "Missing_values" = 1, "Missing_percentage"=1)
  for(z in seq_along(df)){
    Variable<-colnames(df[z])
    tv<-nrow(df)
    Missing_values<-as.numeric(mv[z])
    per<-Missing_values/tv*100
    Missing_percentage<-round(per, digits = 2)
    row_n<-cbind(Variable, Missing_values, Missing_percentage)
    miss_tab<-rbind(miss_tab, row_n)
  }
  print(t(miss_tab[-1,]))
}

Causal Dataset

glimpse(causal_lookup)

## Observations: 351,372
## Variables: 6
## $ upc          <chr> "7680850108", "5100001212", "5100002792", "362000...
## $ store        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ week         <dbl> 68, 66, 72, 55, 68, 55, 66, 55, 55, 76, 66, 66, 6...
## $ feature_desc <chr> "Wrap Interior Feature", "Wrap Back Feature", "In...
## $ display_desc <chr> "Not on Display", "Not on Display", "Not on Displ...
## $ geography    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

#converting vars from char type to factor
causal_lookup$upc <- as.factor(causal_lookup$upc)
causal_lookup$store <- as.factor(causal_lookup$store)
causal_lookup$feature_desc <- as.factor(causal_lookup$feature_desc)
causal_lookup$display_desc <- as.factor(causal_lookup$display_desc)
causal_lookup$geography <- as.factor(causal_lookup$geography)

#check if blanks are present and replace it with NA and check for missing values
missval_df(causal_lookup)

##                                                                      
## Variable           "upc" "store" "week" "feature_desc" "display_desc"
## Missing_values     "0"   "0"     "0"    "0"            "0"           
## Missing_percentage "0"   "0"     "0"    "0"            "0"           
##                               
## Variable           "geography"
## Missing_values     "0"        
## Missing_percentage "0"

Product Dataset

glimpse(product_lookup)

## Observations: 927
## Variables: 5
## $ upc                 <chr> "111112360", "566300023", "566300028", "56...
## $ product_description <chr> "VINCENT S ORIG MARINARA S", "PINE MOUNTAI...
## $ commodity           <chr> "pasta sauce", "syrups", "syrups", "syrups...
## $ brand               <chr> "Vincent's", "Pine Mountain", "Miller", "M...
## $ product_size        <chr> "25 OZ", "40 OZ", "19 OZ", "12 OZ", "19 OZ...

product_lookup$commodity <- as.factor(product_lookup$commodity)
product_lookup$brand <- as.factor(product_lookup$brand)
product_lookup$product_size <- as.factor(product_lookup$product_size)

#check if blanks are present and replace it with NA and check for missing values
missval_df(product_lookup)

##                                                                   
## Variable           "upc" "product_description" "commodity" "brand"
## Missing_values     "0"   "0"                   "0"         "0"    
## Missing_percentage "0"   "0"                   "0"         "0"    
##                                  
## Variable           "product_size"
## Missing_values     "3"           
## Missing_percentage "0.32"

Store Dataset

glimpse(store_lookup)

## Observations: 387
## Variables: 2
## $ store          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ store_zip_code <chr> "37865", "30084", "30039", "31210", "30044", "3...

#converting columns of store_lookup to factors
store_lookup$store <- as.character(store_lookup$store)
store_lookup$store_zip_code <- as.character(store_lookup$store_zip_code)

#check if blanks are present and replace it with NA and check for missing values
missval_df(store_lookup)

##                                            
## Variable           "store" "store_zip_code"
## Missing_values     "0"     "0"             
## Missing_percentage "0"     "0"

Transaction Dataset

glimpse(transactions)

## Observations: 5,197,681
## Variables: 11
## $ upc                 <chr> "7680850106", "3620000470", "1800028064", ...
## $ dollar_sales        <dbl> 0.80, 3.59, 2.25, 0.85, 2.19, 2.19, 3.45, ...
## $ units               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ time_of_transaction <chr> "1100", "1100", "1137", "1148", "1323", "1...
## $ geography           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ week                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ household           <dbl> 125434, 125434, 108320, 162016, 89437, 894...
## $ store               <dbl> 244, 244, 244, 244, 244, 244, 244, 244, 24...
## $ basket              <dbl> 1, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 8, 8, 9, ...
## $ day                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ coupon              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

transactions$upc <- as.factor(transactions$upc)
transactions$geography <- as.factor(transactions$geography)
transactions$household <- as.factor(transactions$household)
transactions$store <- as.factor(transactions$store)
transactions$basket <- as.factor(transactions$basket)


#filtering out entries where dollar_sales < 0
transactions<- filter(transactions, dollar_sales > 0)

#check if blanks are present and replace it with NA and check for missing values
missval_df(transactions)

##                                                                      
## Variable           "upc" "dollar_sales" "units" "time_of_transaction"
## Missing_values     "0"   "0"            "0"     "0"                  
## Missing_percentage "0"   "0"            "0"     "0"                  
##                                                                         
## Variable           "geography" "week" "household" "store" "basket" "day"
## Missing_values     "0"         "0"    "0"         "0"     "0"      "0"  
## Missing_percentage "0"         "0"    "0"         "0"     "0"      "0"  
##                            
## Variable           "coupon"
## Missing_values     "0"     
## Missing_percentage "0"

4. Exploratory Data Analysis

4.1 Uncover new information

Combining variables of concern from all datasets into a master dataset which contains all the variables and observations that is necessary for our analysis. For example, we will join the transaction table and product table to add information such as brand, commodity and product size to the transaction table. Grouping of data based on brands, store location, etc., and creating variables to show aggregate values based on groups.

complete <- transactions %>%
            left_join(product_lookup, by = "upc")
#ordering the column commodity
new_order <- c("pasta", "pasta sauce", "pancake mixes", "syrups")
complete <- arrange(transform(complete,
                  commodity=factor(commodity,levels=new_order)),commodity)

A. Performance of brands

Here we use dplyr and ggplot2 packages to find the high performing brands and low performing brands in each category of commodity (pasta, pasta sauce, pancake mix and syrup) based on number of items sold and try to analyze the reason for the difference in performance between the high performing and low performing brands.

# Identifying high performing brands in each category of commodity
brand_analysis_hi <- complete %>% 
  filter(brand != "NA") %>% 
  group_by(commodity, brand ) %>% 
  summarise(total_sum = sum(dollar_sales)) %>%
  top_n(3, total_sum)

#Plot
ggplot(brand_analysis_hi, aes(x= brand, y = total_sum)) +
geom_col(col = "orange", fill = "indian red") +
facet_wrap(~ commodity, nrow = 2, scales = "free") + 
  geom_text(aes(label = round(total_sum), vjust = 0.3))

# Identifying low performing brands in each category of commodity
brand_analysis_lw <- complete %>% filter(brand != "NA") %>% 
  group_by(commodity, brand ) %>% 
  summarise(total_sum = sum(dollar_sales)) %>%
  top_n(-20, total_sum)

#Plot
ggplot(brand_analysis_lw, aes(x= brand, y = total_sum)) +
  geom_col(col = "orange", fill = "indian red") +
  facet_wrap(~ commodity, nrow = 2, scales = "free") + 
  #geom_text(aes(label = round(total_sum), vjust = 1)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Comparing the Marketing strategies of high performing and low performing brands:

In order to compare the marketing strategies of the high performing and low performing brands we compare the variables feature_desc (describes the location of the product on the weekly mailer) and display_desc (describes the location of temporary in-store display containing the product) of products in each brand.

#Selecting upc code for high performing brands in pastas by Combining 
#data from transactions, product_lookup and causal_lookup tables  
top_perf_pasta <- brand_analysis_hi %>%
  filter(commodity=="pasta") %>% 
  left_join(product_lookup, by = c("brand", "commodity")) %>% 
  select(upc, commodity, brand) %>% 
  left_join(causal_lookup, by = "upc")
for(i in 1:ncol(top_perf_pasta)) {
  top_perf_pasta[[i]] <- as.factor(top_perf_pasta[[i]])
}


#Selecting upc code for low performing brands in pastas by Combining 
#data from transactions, product_lookup and causal_lookup tables  
low_perf_pasta <- brand_analysis_lw[1:20,] %>% 
  left_join(product_lookup, by = c("brand", "commodity")) %>% 
  select(upc, product_description, commodity, brand) %>% 
  left_join(causal_lookup, by = "upc")
for(i in 1:ncol(low_perf_pasta)) {
  low_perf_pasta[[i]] <- as.factor(low_perf_pasta[[i]])
}

# Analysing feature_desc of high performing pasta brands
top_perf_pasta %>% mutate(count = 1) %>% 
  filter(!is.na(feature_desc)) %>% 
  group_by(feature_desc) %>% 
  summarise(sum = sum(count)) %>% top_n(5, sum) %>% 
  ggplot(aes(x = feature_desc, y = sum)) + 
  geom_col(col = "orange", fill = "indian red") + 
  geom_text(aes(label = sum, vjust = -0.3)) + 
  labs(title = "High performing pasta brands", subtitle = "feature_desc")

# Analysing feature_desc of low performing brands in pastas
low_perf_pasta %>% mutate(count = 1) %>%
  filter(!is.na(feature_desc)) %>%
  group_by(feature_desc) %>%
  summarise(sum = sum(count)) %>% top_n(5, sum) %>%
  ggplot(aes(x = feature_desc, y = sum)) +
  geom_col(col = "orange", fill = "indian red", width = 0.2) +
  geom_text(aes(label = sum, vjust = -0.3)) +
  labs(title = "Low performing pasta brands", subtitle = "feature_desc")

From the above plots we can observe that the high performing brands advertise more in the weekly mailer which could be the reason for more sales.

# Analysing display_desc of high performing pasta brands -----------------
top_perf_pasta %>% mutate(count = 1) %>% 
  filter(!is.na(display_desc)) %>% 
  group_by(display_desc) %>% 
  summarise(sum = sum(count)) %>% top_n(5, sum) %>% 
  ggplot(aes(x = display_desc, y = sum)) + 
  geom_col(col = "orange", fill = "indian red") + 
  geom_text(aes(label = sum, vjust = -0.3)) + 
  labs(title = "High performing pasta brands", subtitle = "display_desc")

# Analysing display_desc of low performing pasta brands ------------------
low_perf_pasta %>% mutate(count = 1) %>% 
  filter(!is.na(display_desc)) %>% 
  group_by(display_desc) %>% 
  summarise(sum = sum(count)) %>% top_n(5, sum) %>% 
  ggplot(aes(x = display_desc, y = sum)) + 
  geom_col(col = "orange", fill = "indian red") + 
  geom_text(aes(label = sum, vjust = -0.3)) + 
  labs(title = "Low performing pasta brands", subtitle = "display_desc")

From the above plots we can observe that the high performing brands are more on display at prime locations inside the store which could be the reason for more sales

B. Product Recommendation

Analyzing the correlation between pancake mix and syrup of different brands

#selecting only pancake mix and syrups from the entire data  
grouped_pc_sy <- complete %>%  
                  filter(commodity == c("pancake mixes", "syrups")) %>% 
                  select(basket, brand, commodity)

#dividing the basket into pancake mix and syrups using the spread function
grouped_pc_sy_sprd <- grouped_pc_sy %>% 
                      distinct(basket, commodity, .keep_all = TRUE) %>% 
                      spread(commodity, brand)

#selecting baskets that have both pancake mix and syrups
grouped_pc_sy_sprd <- grouped_pc_sy_sprd[complete.cases(grouped_pc_sy_sprd), ]

grouped_pc_sy_sprd_tbl <- grouped_pc_sy_sprd %>% 
                          group_by(`pancake mixes`, syrups ) %>% 
                          summarise(count = n())

#rescaling the count variable                          
grouped_pc_sy_sprd_tbl <- plyr::ddply(grouped_pc_sy_sprd_tbl, plyr::.(syrups)
                                    , transform, rescale = rescale(count))

# generating heat map using ggplot to visualize the correlation between pancake 
# mix and syrups of different brands
grouped_pc_sy_sprd_tbl %>% filter(rescale > 0.5) %>%
ggplot(aes(pancake.mixes,syrups)) + 
  geom_tile(aes(fill = rescale), colour = "white") + 
  scale_fill_gradient(low = "steel blue",high = "indian red")

In the above plot we can only see the correlation between pancake mixes and syrups with rescale value greater than 0.5 (rescale value is calculated by counting the number of times the combination of products are purchased together and then converting this count to a scale of 0 to 1 for all combination of products).

The boxes in red indicate that the combination of products are frequently bought together. Client can use this information to provide product recommendations to the customer, promote products to a customers who purchase one of the products in the frequently brought combination of products by offering coupons on one product or by offering discount when the two products are brought together. The client can also use this information to order products in the store. Products that are brought frequently can be kept next to each other.

C. Peak hours

Using the transaction table and product table we can determine the time during which maximum number of transactions occur.

transactions$time <- as.factor(substr(transactions$time_of_transaction, 1, 2))
transactions_time <-transactions %>% 
                    left_join(product_lookup, by = "upc") %>% 
                    select(upc, time, commodity) %>% 
                    filter(!is.na(commodity))

transactions_time <- arrange(transform(transactions_time,
                     commodity=factor(commodity,levels=new_order)),commodity)

transactions_time %>% 
  ggplot(aes(x=time)) + 
  geom_histogram(stat="count",fill="indianred") +
  facet_wrap(~ commodity, nrow = 2) + 
  scale_y_continuous(labels = comma) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

From the above plots we can observe that peak hours of transaction are between 15:00 hrs to 19:00 hours across all categories of commodity. We see that the sales of pasta and pasta sauce are greater than 3 times that of pancake mixes and syrups. This information can be used to suggest the client that more billing counters have to be open during the peak hours (3 pm to 7 pm) and that more assistants are required in the Pasta & Pasta Sauce sections than Pancake mix & Syrup section.

5. Summary of Findings

Summary

High performing brands advertise more(featured more in weekly mailers and in-store display areas) and have more variety (different flavors and sizes) of products as compared to low performing brands. Adopting these strategies could increase the sales of the low performing brands
We observe that certain brands of Pancake mixes and syrups are frequently bought together. The client can use this information to recommend or coupons / discounts to the customers when one of the items in the frequently brought combinations are purchased.
We observe that the sales of pasta and pasta sauce are greater than 3 times that of pancake mixes and syrups. This information can be used to suggest the client that more assistants are required in the Pasta & Pasta Sauce sections than Pancake mix & Syrup section.
The peak hours of transaction are between 15:00 hrs and 19:00 hours across all categories of commodity. This information can be used to suggest the client that more billing counters have to be open during the peak hours (3 pm to 7 pm).

Data Wrangling with R - Final Project

Trishul Gowda Ashok [M12906045]

December 09, 2018