The purpose of this project is to perform data analysis and visualization on the Carbo-Loading data set. Carbo-Loading contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. These categories were chosen so that interactions between the categories can be detected and studied.
We used R to perform data analysis and visualization to explore and identify trends in the customer behaviour, and uncover insights to understand the interactions between the categories.
The analysis will help the consumer in understanding the performance of brands and their markteing strategies which can be used to promote the brands of the client. The analysis can be used for providing product recommendations to customers based on their previous purchases. Analysis will help in identyifying the most effective in-store locations and in the weekly mailer.
library(haven) # to import SAS dataset
library(dplyr) # to manuplate data
library(tidyr) # to tidy up data
library(stringr) # to perform string operations
library(ggplot2) # to visualize the data
library(scales) # rescaling the dataThe data contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. These categories were chosen so that interactions between the categories can be detected and studied. Here we use 4 data sets from a relational database:
file_names <- c("causal_lookup", "product_lookup", "store_lookup", "transactions")
file2 <- ".sas7bdat"
for(i in seq_along(file_names)) {
path <- paste0("data/", file_names[i], file2)
if(file.exists(path)) {
df <- read_sas(path)
assign(file_names[i], data.frame(df))
} else {
print(paste0("invalid filename ", paste0(file_names[i]), file2))
}
}Steps involved Cleaning:
Using custom function missval_df for cleaning the dataset. The function takes the name of the dataset as the input parameter. The function imputes blanks with NA and computes the number and percentage of missing values in each column.
missval_df<-function(df){
df[df == ""] <- NA
mv<-sapply(df, function(x) sum(is.na(x)))
miss_tab<-c("Variable"="abc", "Missing_values" = 1, "Missing_percentage"=1)
for(z in seq_along(df)){
Variable<-colnames(df[z])
tv<-nrow(df)
Missing_values<-as.numeric(mv[z])
per<-Missing_values/tv*100
Missing_percentage<-round(per, digits = 2)
row_n<-cbind(Variable, Missing_values, Missing_percentage)
miss_tab<-rbind(miss_tab, row_n)
}
print(t(miss_tab[-1,]))
}glimpse(causal_lookup)## Observations: 351,372
## Variables: 6
## $ upc <chr> "7680850108", "5100001212", "5100002792", "362000...
## $ store <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ week <dbl> 68, 66, 72, 55, 68, 55, 66, 55, 55, 76, 66, 66, 6...
## $ feature_desc <chr> "Wrap Interior Feature", "Wrap Back Feature", "In...
## $ display_desc <chr> "Not on Display", "Not on Display", "Not on Displ...
## $ geography <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#converting vars from char type to factor
causal_lookup$upc <- as.factor(causal_lookup$upc)
causal_lookup$store <- as.factor(causal_lookup$store)
causal_lookup$feature_desc <- as.factor(causal_lookup$feature_desc)
causal_lookup$display_desc <- as.factor(causal_lookup$display_desc)
causal_lookup$geography <- as.factor(causal_lookup$geography)
#check if blanks are present and replace it with NA and check for missing values
missval_df(causal_lookup)##
## Variable "upc" "store" "week" "feature_desc" "display_desc"
## Missing_values "0" "0" "0" "0" "0"
## Missing_percentage "0" "0" "0" "0" "0"
##
## Variable "geography"
## Missing_values "0"
## Missing_percentage "0"
glimpse(product_lookup)## Observations: 927
## Variables: 5
## $ upc <chr> "111112360", "566300023", "566300028", "56...
## $ product_description <chr> "VINCENT S ORIG MARINARA S", "PINE MOUNTAI...
## $ commodity <chr> "pasta sauce", "syrups", "syrups", "syrups...
## $ brand <chr> "Vincent's", "Pine Mountain", "Miller", "M...
## $ product_size <chr> "25 OZ", "40 OZ", "19 OZ", "12 OZ", "19 OZ...
product_lookup$commodity <- as.factor(product_lookup$commodity)
product_lookup$brand <- as.factor(product_lookup$brand)
product_lookup$product_size <- as.factor(product_lookup$product_size)
#check if blanks are present and replace it with NA and check for missing values
missval_df(product_lookup)##
## Variable "upc" "product_description" "commodity" "brand"
## Missing_values "0" "0" "0" "0"
## Missing_percentage "0" "0" "0" "0"
##
## Variable "product_size"
## Missing_values "3"
## Missing_percentage "0.32"
glimpse(store_lookup)## Observations: 387
## Variables: 2
## $ store <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ store_zip_code <chr> "37865", "30084", "30039", "31210", "30044", "3...
#converting columns of store_lookup to factors
store_lookup$store <- as.character(store_lookup$store)
store_lookup$store_zip_code <- as.character(store_lookup$store_zip_code)
#check if blanks are present and replace it with NA and check for missing values
missval_df(store_lookup)##
## Variable "store" "store_zip_code"
## Missing_values "0" "0"
## Missing_percentage "0" "0"
glimpse(transactions)## Observations: 5,197,681
## Variables: 11
## $ upc <chr> "7680850106", "3620000470", "1800028064", ...
## $ dollar_sales <dbl> 0.80, 3.59, 2.25, 0.85, 2.19, 2.19, 3.45, ...
## $ units <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ time_of_transaction <chr> "1100", "1100", "1137", "1148", "1323", "1...
## $ geography <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ week <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ household <dbl> 125434, 125434, 108320, 162016, 89437, 894...
## $ store <dbl> 244, 244, 244, 244, 244, 244, 244, 244, 24...
## $ basket <dbl> 1, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 8, 8, 9, ...
## $ day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ coupon <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
transactions$upc <- as.factor(transactions$upc)
transactions$geography <- as.factor(transactions$geography)
transactions$household <- as.factor(transactions$household)
transactions$store <- as.factor(transactions$store)
transactions$basket <- as.factor(transactions$basket)
#filtering out entries where dollar_sales < 0
transactions<- filter(transactions, dollar_sales > 0)
#check if blanks are present and replace it with NA and check for missing values
missval_df(transactions)##
## Variable "upc" "dollar_sales" "units" "time_of_transaction"
## Missing_values "0" "0" "0" "0"
## Missing_percentage "0" "0" "0" "0"
##
## Variable "geography" "week" "household" "store" "basket" "day"
## Missing_values "0" "0" "0" "0" "0" "0"
## Missing_percentage "0" "0" "0" "0" "0" "0"
##
## Variable "coupon"
## Missing_values "0"
## Missing_percentage "0"
Combining variables of concern from all datasets into a master dataset which contains all the variables and observations that is necessary for our analysis. For example, we will join the transaction table and product table to add information such as brand, commodity and product size to the transaction table. Grouping of data based on brands, store location, etc., and creating variables to show aggregate values based on groups.
complete <- transactions %>%
left_join(product_lookup, by = "upc")
#ordering the column commodity
new_order <- c("pasta", "pasta sauce", "pancake mixes", "syrups")
complete <- arrange(transform(complete,
commodity=factor(commodity,levels=new_order)),commodity)Here we use dplyr and ggplot2 packages to find the high performing brands and low performing brands in each category of commodity (pasta, pasta sauce, pancake mix and syrup) based on number of items sold and try to analyze the reason for the difference in performance between the high performing and low performing brands.
# Identifying high performing brands in each category of commodity
brand_analysis_hi <- complete %>%
filter(brand != "NA") %>%
group_by(commodity, brand ) %>%
summarise(total_sum = sum(dollar_sales)) %>%
top_n(3, total_sum)
#Plot
ggplot(brand_analysis_hi, aes(x= brand, y = total_sum)) +
geom_col(col = "orange", fill = "indian red") +
facet_wrap(~ commodity, nrow = 2, scales = "free") +
geom_text(aes(label = round(total_sum), vjust = 0.3))# Identifying low performing brands in each category of commodity
brand_analysis_lw <- complete %>% filter(brand != "NA") %>%
group_by(commodity, brand ) %>%
summarise(total_sum = sum(dollar_sales)) %>%
top_n(-20, total_sum)
#Plot
ggplot(brand_analysis_lw, aes(x= brand, y = total_sum)) +
geom_col(col = "orange", fill = "indian red") +
facet_wrap(~ commodity, nrow = 2, scales = "free") +
#geom_text(aes(label = round(total_sum), vjust = 1)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))In order to compare the marketing strategies of the high performing and low performing brands we compare the variables feature_desc (describes the location of the product on the weekly mailer) and display_desc (describes the location of temporary in-store display containing the product) of products in each brand.
#Selecting upc code for high performing brands in pastas by Combining
#data from transactions, product_lookup and causal_lookup tables
top_perf_pasta <- brand_analysis_hi %>%
filter(commodity=="pasta") %>%
left_join(product_lookup, by = c("brand", "commodity")) %>%
select(upc, commodity, brand) %>%
left_join(causal_lookup, by = "upc")
for(i in 1:ncol(top_perf_pasta)) {
top_perf_pasta[[i]] <- as.factor(top_perf_pasta[[i]])
}
#Selecting upc code for low performing brands in pastas by Combining
#data from transactions, product_lookup and causal_lookup tables
low_perf_pasta <- brand_analysis_lw[1:20,] %>%
left_join(product_lookup, by = c("brand", "commodity")) %>%
select(upc, product_description, commodity, brand) %>%
left_join(causal_lookup, by = "upc")
for(i in 1:ncol(low_perf_pasta)) {
low_perf_pasta[[i]] <- as.factor(low_perf_pasta[[i]])
}
# Analysing feature_desc of high performing pasta brands
top_perf_pasta %>% mutate(count = 1) %>%
filter(!is.na(feature_desc)) %>%
group_by(feature_desc) %>%
summarise(sum = sum(count)) %>% top_n(5, sum) %>%
ggplot(aes(x = feature_desc, y = sum)) +
geom_col(col = "orange", fill = "indian red") +
geom_text(aes(label = sum, vjust = -0.3)) +
labs(title = "High performing pasta brands", subtitle = "feature_desc")# Analysing feature_desc of low performing brands in pastas
low_perf_pasta %>% mutate(count = 1) %>%
filter(!is.na(feature_desc)) %>%
group_by(feature_desc) %>%
summarise(sum = sum(count)) %>% top_n(5, sum) %>%
ggplot(aes(x = feature_desc, y = sum)) +
geom_col(col = "orange", fill = "indian red", width = 0.2) +
geom_text(aes(label = sum, vjust = -0.3)) +
labs(title = "Low performing pasta brands", subtitle = "feature_desc")From the above plots we can observe that the high performing brands advertise more in the weekly mailer which could be the reason for more sales.
# Analysing display_desc of high performing pasta brands -----------------
top_perf_pasta %>% mutate(count = 1) %>%
filter(!is.na(display_desc)) %>%
group_by(display_desc) %>%
summarise(sum = sum(count)) %>% top_n(5, sum) %>%
ggplot(aes(x = display_desc, y = sum)) +
geom_col(col = "orange", fill = "indian red") +
geom_text(aes(label = sum, vjust = -0.3)) +
labs(title = "High performing pasta brands", subtitle = "display_desc")# Analysing display_desc of low performing pasta brands ------------------
low_perf_pasta %>% mutate(count = 1) %>%
filter(!is.na(display_desc)) %>%
group_by(display_desc) %>%
summarise(sum = sum(count)) %>% top_n(5, sum) %>%
ggplot(aes(x = display_desc, y = sum)) +
geom_col(col = "orange", fill = "indian red") +
geom_text(aes(label = sum, vjust = -0.3)) +
labs(title = "Low performing pasta brands", subtitle = "display_desc")From the above plots we can observe that the high performing brands are more on display at prime locations inside the store which could be the reason for more sales
#selecting only pancake mix and syrups from the entire data
grouped_pc_sy <- complete %>%
filter(commodity == c("pancake mixes", "syrups")) %>%
select(basket, brand, commodity)
#dividing the basket into pancake mix and syrups using the spread function
grouped_pc_sy_sprd <- grouped_pc_sy %>%
distinct(basket, commodity, .keep_all = TRUE) %>%
spread(commodity, brand)
#selecting baskets that have both pancake mix and syrups
grouped_pc_sy_sprd <- grouped_pc_sy_sprd[complete.cases(grouped_pc_sy_sprd), ]
grouped_pc_sy_sprd_tbl <- grouped_pc_sy_sprd %>%
group_by(`pancake mixes`, syrups ) %>%
summarise(count = n())
#rescaling the count variable
grouped_pc_sy_sprd_tbl <- plyr::ddply(grouped_pc_sy_sprd_tbl, plyr::.(syrups)
, transform, rescale = rescale(count))
# generating heat map using ggplot to visualize the correlation between pancake
# mix and syrups of different brands
grouped_pc_sy_sprd_tbl %>% filter(rescale > 0.5) %>%
ggplot(aes(pancake.mixes,syrups)) +
geom_tile(aes(fill = rescale), colour = "white") +
scale_fill_gradient(low = "steel blue",high = "indian red")In the above plot we can only see the correlation between pancake mixes and syrups with rescale value greater than 0.5 (rescale value is calculated by counting the number of times the combination of products are purchased together and then converting this count to a scale of 0 to 1 for all combination of products).
The boxes in red indicate that the combination of products are frequently bought together. Client can use this information to provide product recommendations to the customer, promote products to a customers who purchase one of the products in the frequently brought combination of products by offering coupons on one product or by offering discount when the two products are brought together. The client can also use this information to order products in the store. Products that are brought frequently can be kept next to each other.
Using the transaction table and product table we can determine the time during which maximum number of transactions occur.
transactions$time <- as.factor(substr(transactions$time_of_transaction, 1, 2))
transactions_time <-transactions %>%
left_join(product_lookup, by = "upc") %>%
select(upc, time, commodity) %>%
filter(!is.na(commodity))
transactions_time <- arrange(transform(transactions_time,
commodity=factor(commodity,levels=new_order)),commodity)
transactions_time %>%
ggplot(aes(x=time)) +
geom_histogram(stat="count",fill="indianred") +
facet_wrap(~ commodity, nrow = 2) +
scale_y_continuous(labels = comma) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))From the above plots we can observe that peak hours of transaction are between 15:00 hrs to 19:00 hours across all categories of commodity. We see that the sales of pasta and pasta sauce are greater than 3 times that of pancake mixes and syrups. This information can be used to suggest the client that more billing counters have to be open during the peak hours (3 pm to 7 pm) and that more assistants are required in the Pasta & Pasta Sauce sections than Pancake mix & Syrup section.
Summary
High performing brands advertise more(featured more in weekly mailers and in-store display areas) and have more variety (different flavors and sizes) of products as compared to low performing brands. Adopting these strategies could increase the sales of the low performing brands
We observe that certain brands of Pancake mixes and syrups are frequently bought together. The client can use this information to recommend or coupons / discounts to the customers when one of the items in the frequently brought combinations are purchased.
We observe that the sales of pasta and pasta sauce are greater than 3 times that of pancake mixes and syrups. This information can be used to suggest the client that more assistants are required in the Pasta & Pasta Sauce sections than Pancake mix & Syrup section.
The peak hours of transaction are between 15:00 hrs and 19:00 hours across all categories of commodity. This information can be used to suggest the client that more billing counters have to be open during the peak hours (3 pm to 7 pm).