This report analyzes the trends of customers’ purchase for select items and explore methods to improve sales. We study the interactions between four commodities, namely Pasta, Pasta Sauce, Syrup and Pancake-Mix. The available data consists of purchase information spanning two years.
The data contains purchase information mapped to household and brand data for a duration of two years. Our goal is to find a pattern in consumer purchase and explore ways to increase sales.
The analysis is geared toward using coupons to improve sales. Coupons have become an essential tool in retail. They are often used for promotional activities, for building customer loyalty and driving sale of a product by bucketing them with existing popular items.
We assess the impact of coupons on the spending habits of the consumers in terms of the number of units purchased and the overall amount spent by the consumer. In addition to this, we also want to explore the relationship between the sale of complementary and substitute product categories.
Our focus is on the following KPIs:
We analyze these KPIs at commodity, brand and household level.
We apply generalized statistical methods, which can be easily transferred to similar problems in other commercial segments. In addition, the gained insights are readily applicable to similar product combinations. This approach can also be used to understand the effectiveness of any similar promotional activity.
Based on the data provided, we plan to develop a multiple linear regression model which will facilitate the understanding of the degree of variability in the customer spending as a function of coupon usage.
This analysis will help retailers
Following R packages were used in this analysis:
The data set for this project “Carbo-Loading SAS” can be downloaded at this location.
The data used for this analysis is gathered over a period of 2 years and has been fully anonymized. It contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup and Pancake Mix. These categories were chosen so that interactions between the categories can be detected and studied. The data-set consists of four tables exported from SAS. The schema design of the source tables involved is as follows:
Following are the details of the Source tables(Final Table)
| Column Name | Data type | Definition |
|---|---|---|
| upc | numeric | standard 10-digit upc. Uniquely identifies a commodity of a brand |
| store | numeric | Identifier for a store |
| week | numeric | Ordered identifier for a week, can vary from 43 to 104 |
| feature_desc | varchar | Product location on weekly mailer |
| display_desc | varchar | Product location in temporary in-store display |
| geography | numeric | Identifier for a geography |
| Column Name | Data type | Definition |
|---|---|---|
| upc | numeric | standard 10-digit upc. Uniquely identifies a commodity of a brand |
| dollar_sales | numeric | amount of dollars spent by customer |
| units | numeric | number of units purchased |
| time_of_transaction | numeric | military time of transaction |
| geography | numeric | Identifier for a geography |
| week | numeric | Ordered identifier for a week, can vary from 1 to 104 |
| household | numeric | identifies a unique household |
| store | numeric | Identifier for a store |
| basket | numeric | identifies a unique trip to the store |
| day | numeric | Day of transaction from 1 to 728 |
| coupon | numeric | indicates a coupon was used, 1 = Yes, 0 = No |
| Column Name | Data type | Definition |
|---|---|---|
| store | numeric | Unique identifier of store |
| store_zip_code | numeric | Zip code of the store |
| Column Name | Data type | Definition |
|---|---|---|
| upc | numeric | standard 10-digit upc. Uniquely identifies a commodity of a brand |
| product_description | varchar | description of the product |
| commodity | varchar | what type of product that is |
| brand | varchar | to which brand does the product belong |
| product_size | varchar | what are the dimensions of the product like weight |
Based on this, following entities were defined for analysis:
| Entity | Description |
|---|---|
| Brand | The Brand of the Product sold |
| Commodity | The type of Product sold, like Pasta Sauce, Pasta, etc. |
| Geography | Geographical group of stores |
| Store | Point of sale for each product |
| Time | The time when the transaction occurred |
| Day | Chronological Day of sales for a 2-year period |
| Week | Chronological Week of sales for a 2-year period |
| Household | Represents a unique customer |
| Basket | Represents the trip the customer made |
| Coupon | Whether the coupon was used or not while purchase |
| Causal | Details of the promotional offer made for a given product |
All KPIs involve the following two measures:
| KPI | Description |
|---|---|
| Dollar Sales | Amount spent during a transaction |
| Unit | number of units purchased during a transaction |
The Count aggregation of any entity can also be counted as a KPI
All the .SAS files involved were imported into R using haven package. The files to be imported are kept at the home directory for the R project and hence does not require specifying
causal_lookup <- read_sas("data/causal_lookup.sas7bdat", NULL)
product_lookup <- read_sas("data/product_lookup.sas7bdat", NULL)
store_lookup <- read_sas("data/store_lookup.sas7bdat", NULL)
transactions <- read_sas("data/transactions.sas7bdat", NULL)
While importing, not all columns are imported as per their expected data type and hence we check if the data types need recoding
str(causal_lookup)
## Classes 'tbl_df', 'tbl' and 'data.frame': 351372 obs. of 6 variables:
## $ upc : chr "7680850108" "5100001212" "5100002792" "3620000300" ...
## ..- attr(*, "format.sas")= chr "$"
## $ store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ week : num 68 66 72 55 68 55 66 55 55 76 ...
## $ feature_desc: chr "Wrap Interior Feature" "Wrap Back Feature" "Interior Page Feature" "Wrap Interior Feature" ...
## ..- attr(*, "format.sas")= chr "$"
## $ display_desc: chr "Not on Display" "Not on Display" "Not on Display" "Not on Display" ...
## ..- attr(*, "format.sas")= chr "$"
## $ geography : num 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "label")= chr "DH_CAUSAL_LOOKUP"
str(product_lookup)
## Classes 'tbl_df', 'tbl' and 'data.frame': 927 obs. of 5 variables:
## $ upc : chr "111112360" "566300023" "566300028" "566300029" ...
## $ product_description: chr "VINCENT S ORIG MARINARA S" "PINE MOUNTAIN SYRUP" "MILLER CANE SYRUP" "MILLER CANE SYRUP" ...
## $ commodity : chr "pasta sauce" "syrups" "syrups" "syrups" ...
## $ brand : chr "Vincent's" "Pine Mountain" "Miller" "Miller" ...
## $ product_size : chr "25 OZ" "40 OZ" "19 OZ" "12 OZ" ...
## - attr(*, "label")= chr "DH_PRODUCT_LOOKUP"
str(store_lookup)
## Classes 'tbl_df', 'tbl' and 'data.frame': 387 obs. of 2 variables:
## $ store : num 1 2 3 4 5 6 7 8 9 10 ...
## $ store_zip_code: chr "37865" "30084" "30039" "31210" ...
## ..- attr(*, "format.sas")= chr "$"
## - attr(*, "label")= chr "DH_STORE_LOOKUP"
str(transactions)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5197681 obs. of 11 variables:
## $ upc : chr "7680850106" "3620000470" "1800028064" "9999985067" ...
## ..- attr(*, "format.sas")= chr "$"
## $ dollar_sales : num 0.8 3.59 2.25 0.85 2.19 2.19 3.45 1.29 0.75 2.19 ...
## ..- attr(*, "label")= chr "SHOP_SPEND"
## $ units : num 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "QUANTITY"
## $ time_of_transaction: chr "1100" "1100" "1137" "1148" ...
## ..- attr(*, "label")= chr "TRANS_TIME"
## ..- attr(*, "format.sas")= chr "$"
## $ geography : num 2 2 2 2 2 2 2 2 2 2 ...
## $ week : num 1 1 1 1 1 1 1 1 1 1 ...
## $ household : num 125434 125434 108320 162016 89437 ...
## $ store : num 244 244 244 244 244 244 244 244 244 244 ...
## $ basket : num 1 1 2 3 4 4 5 5 6 7 ...
## $ day : num 1 1 1 1 1 1 1 1 1 1 ...
## $ coupon : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "label")= chr "DH_TRANSACTIONS"
Based on the observations made above, following code transforms the require data types for the columns
store_lookup$store_zip_code = as.numeric(store_lookup$store_zip_code)
product_lookup$upc = as.numeric(product_lookup$upc)
causal_lookup$upc = as.numeric(causal_lookup$upc)
transactions$upc = as.numeric(transactions$upc)
Also, as discussed in previous section, since our analysis is not at Product level, we are creating a subset of product_lookup coded as commodity_lookup which stores distinct combinations of Brands and Commodities
commodity_lookup <- product_lookup %>% distinct(upc, brand, commodity)
We are also checking if all Zipcodes provided are valid
#checking if all stroe zip codes are valid
us_postal_codes <- read.csv("pool/us_postal_codes.csv")
setdiff(store_lookup$store_zip_code,us_postal_codes$Zip.Code)
Since Transactions is the most important table in our data-set, all fields of this table are validated to check if there are any orphaned keys, missing information or anomalous data. For example, we verify that all transactions recorded in this table do belong to the set of Commodities we have in the Commodity_Lookup
#Transaction has occurred for valid commodities
setdiff(transactions$upc, commodity_lookup$upc)
#Transaction has occurred at valid store
setdiff(transactions$store, store_lookup$store)
#Transaction Time validation
distinct(as.data.frame(nchar(transactions$time_of_transaction)))
#checking minutes
time.check1 <- as.numeric(substr(transactions$time_of_transaction, 3, 4))
time.check11 <- -1 < time.check1 & 60 > time.check1
sum(time.check11) == nrow(transactions)
#checking hours
time.check2 <- as.numeric(substr(transactions$time_of_transaction, 1, 2))
time.check22 <- -1 < time.check2 & 24 > time.check2
sum(time.check22) == nrow(transactions)
#checking geography
distinct(as.data.frame(transactions$geography))
#checking week
week.check <- 0 < transactions$week & 105 > transactions$week
sum(week.check) == nrow(transactions)
#checking day
day.check <- 0 < transactions$day & 729 > transactions$day
sum(day.check) == nrow(transactions)
#checking coupon
distinct(as.data.frame(transactions$coupon))
Similarly, the relate table causal_lookup which stores promotional campaign information for various commodities sold
#causal for valid commodities
setdiff(causal_lookup$upc, commodity_lookup$upc)
#causal at valid store
setdiff(causal_lookup$store, store_lookup$store)
#checking week
week.check <- -1 < causal_lookup$week & 43 > causal_lookup$week
sum(week.check) == 0
#causal for valid geographies
distinct(as.data.frame(causal_lookup$geography))
According to the definition of UPC, it should be of 10 digits. However, for 1.12% transactions, the length of UPC is 9 digits. Nonetheless, since this does not violate the referential integrity of this variable because we do have 9 digits’ UPCs in causal_lookup and commodity_lookup, we choose to not remove these rows. This is validated by the following code:
distinct(as.data.frame(nchar(causal_lookup$upc)))
## nchar(causal_lookup$upc)
## 1 10
## 2 9
distinct(as.data.frame(nchar(commodity_lookup$upc)))
## nchar(commodity_lookup$upc)
## 1 9
## 2 10
distinct(as.data.frame(nchar(transactions$upc)))
## nchar(transactions$upc)
## 1 10
## 2 9
Steps in subsection Data Import and Cleaning clean the data and make it ready for consumption. Below is a brief of how our concerned datasets looks like:
str(causal_lookup)
## Classes 'tbl_df', 'tbl' and 'data.frame': 351372 obs. of 6 variables:
## $ upc : num 7.68e+09 5.10e+09 5.10e+09 3.62e+09 4.11e+09 ...
## $ store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ week : num 68 66 72 55 68 55 66 55 55 76 ...
## $ feature_desc: chr "Wrap Interior Feature" "Wrap Back Feature" "Interior Page Feature" "Wrap Interior Feature" ...
## ..- attr(*, "format.sas")= chr "$"
## $ display_desc: chr "Not on Display" "Not on Display" "Not on Display" "Not on Display" ...
## ..- attr(*, "format.sas")= chr "$"
## $ geography : num 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "label")= chr "DH_CAUSAL_LOOKUP"
str(commodity_lookup)
## Classes 'tbl_df', 'tbl' and 'data.frame': 927 obs. of 3 variables:
## $ upc : num 1.11e+08 5.66e+08 5.66e+08 5.66e+08 5.66e+08 ...
## $ commodity: chr "pasta sauce" "syrups" "syrups" "syrups" ...
## $ brand : chr "Vincent's" "Pine Mountain" "Miller" "Miller" ...
## - attr(*, "label")= chr "DH_PRODUCT_LOOKUP"
str(store_lookup)
## Classes 'tbl_df', 'tbl' and 'data.frame': 387 obs. of 2 variables:
## $ store : num 1 2 3 4 5 6 7 8 9 10 ...
## $ store_zip_code: num 37865 30084 30039 31210 30044 ...
## - attr(*, "label")= chr "DH_STORE_LOOKUP"
str(transactions)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5197681 obs. of 11 variables:
## $ upc : num 7.68e+09 3.62e+09 1.80e+09 1.00e+10 1.00e+10 ...
## $ dollar_sales : num 0.8 3.59 2.25 0.85 2.19 2.19 3.45 1.29 0.75 2.19 ...
## ..- attr(*, "label")= chr "SHOP_SPEND"
## $ units : num 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "QUANTITY"
## $ time_of_transaction: chr "1100" "1100" "1137" "1148" ...
## ..- attr(*, "label")= chr "TRANS_TIME"
## ..- attr(*, "format.sas")= chr "$"
## $ geography : num 2 2 2 2 2 2 2 2 2 2 ...
## $ week : num 1 1 1 1 1 1 1 1 1 1 ...
## $ household : num 125434 125434 108320 162016 89437 ...
## $ store : num 244 244 244 244 244 244 244 244 244 244 ...
## $ basket : num 1 1 2 3 4 4 5 5 6 7 ...
## $ day : num 1 1 1 1 1 1 1 1 1 1 ...
## $ coupon : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "label")= chr "DH_TRANSACTIONS"
Below is the list of all the variables(dimensions) and their summaries:
| Sr. No. | Variable | Table Name | Data type | Definition | Values |
|---|---|---|---|---|---|
| 1 | basket | TRANSACTIONS | numeric | unique identifier of a trip made by a Household to a store | numeric key |
| 2 | brand | COMMODITY_LOOKUP | varchar | unique name of a Brand to which multiple commodities can belong | character values |
| 3 | commodity | COMMODITY_LOOKUP | varchar | unique name of a commodity | pancake mixes, syrups, pasta sauce & pasta |
| 4 | coupon | TRANSACTIONS | numeric | indicates a coupon was used | 1 = Yes, 0 = No |
| 5 | day | TRANSACTIONS | numeric | Chronologically ordered identifier for Day of transaction | numeric, 1 to 728 |
| 6 | display_desc | CAUSAL_LOOKUP | varchar | Product location in temporary in-store display | character values |
| 7 | feature_desc | CAUSAL_LOOKUP | varchar | Product location on weekly mailer | character values |
| 8 | geography | CAUSAL_LOOKUP, TRANSACTIONS | numeric | identifier for a geography | only two groups, 1 or 2 |
| 9 | household | TRANSACTIONS | numeric | unique identifier for a household(customer) | numeric key |
| 10 | store | CAUSAL_LOOKUP, TRANSACTIONS, STORE_LOOKUP | numeric | unique identifier for a store | numeric, 1 to 387 |
| 11 | store_zip_code | STORE_LOOKUP | numeric | Zip code of the store | standard zip codes |
| 12 | time_of_transaction | TRANSACTIONS | numeric | military time of transaction | 4-digit military time |
| 13 | upc | CAUSAL_LOOKUP, TRANSACTIONS, STORE_LOOKUP | numeric | Uniquely identifies a commodity of a brand | numeric key |
| 14 | week | CAUSAL_LOOKUP, TRANSACTIONS | numeric | Chronologically ordered identifier for a week | numeric, 1 to 104 |
Below is the list of all the variables(measures) and their summaries
summary(transactions$dollar_sales)
summary(transactions$units)
| Sr. no. | variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|
| 1 | dollar_sales | -11.76 | 0.99 | 1.5 | 1.757 | 2.19 | 153.14 |
| 2 | units | 1 | 1 | 1 | 1.197 | 1 | 156 |
To discover hidden insights in the data, we plan on doing the following:
Following types of data visualizations shall be crucial during the course of our analysis
inner_join(transactions, commodity_lookup) %>%
group_by(week, commodity) %>%
summarize(sum_sales = sum(dollar_sales)) %>%
ggplot(aes(x=week,y = sum_sales,color = commodity)) +
geom_line()
Following are some of the points of information that we lack in order to better analyze this data-set
Geographic distribution of stores
We plan on using Linear Regression and Cluster Analysis in order predict the following: