Lab 7: Collapsing by groups and summarizing within groups

Learning Objectives

  • relational databases
  • collapsing data using group_by() and summarize()
  • calculating within-group statistics using group_by() and mutate()

1. Relational database

In this lab we will work with data from the marketing analytics firm dunnhumby. The data contain household-level transactions over a period of two years in four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. The data come in four files: transactions, products, stores, and promotions (dunnhumby calls this last one causal). The relationships among these four files are illustrated below. The full description of the data is here.

Data that is stored in multiple tables with relations among the tables is called a relational database.
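For example, here is a minimal sketch of a relational structure: two toy tables (with made-up values) that share a key column, upc, which lets us link each transaction to its product. We will use inner_join() for real later in this lab.

library(dplyr)
# two toy tables that share the key column upc (values are made up)
toy_trans <- data.frame(upc = c(101, 101, 102), dollar_sales = c(1.50, 2.00, 0.80))
toy_prods <- data.frame(upc = c(101, 102), brand = c("Ragu", "Prego"))
# inner_join() matches rows on the shared key, attaching brand to each transaction
inner_join(toy_trans, toy_prods, by = "upc")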

2. Exploring the data

Let’s load the transactions and products data into R, starting with the transactions.

library(dplyr)
# adjust the path below to where you saved the file, or read it directly from
# https://www.dropbox.com/s/30gqnhk3tqsawfa/dh_transactions.csv?raw=1
trans <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab7 summarizing by group/dh_transactions.csv")
str(trans)
## 'data.frame':    5197681 obs. of  11 variables:
##  $ upc                : num  7.68e+09 3.62e+09 1.80e+09 1.00e+10 1.00e+10 ...
##  $ dollar_sales       : num  0.8 3.59 2.25 0.85 2.19 2.19 3.45 1.29 0.75 2.19 ...
##  $ units              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ time_of_transaction: int  1100 1100 1137 1148 1323 1323 1415 1415 1513 1523 ...
##  $ geography          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ week               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ household          : int  125434 125434 108320 162016 89437 89437 158549 158549 18851 118337 ...
##  $ store              : int  244 244 244 244 244 244 244 244 244 244 ...
##  $ basket             : int  1 1 2 3 4 4 5 5 6 7 ...
##  $ day                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ coupon             : int  0 0 0 0 0 0 0 0 0 0 ...

Transactions is a fairly large data set with over 5 million observations. We see that all variables are stored as numeric (either num or int), even though the upc code, the time of transaction, and the household and store ids have no numerical meaning.
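If we wanted R to treat these codes as labels rather than quantities, we could convert them to character. This is just a sketch, not a step the lab requires; note that if you convert upc here, you would need to convert it in the products table as well, or later joins on upc will fail with incompatible types.

# a sketch: store identifiers as character so R treats them as labels, not numbers
trans_chr <- mutate(trans, upc = as.character(upc),
                    household = as.character(household),
                    store = as.character(store))
str(select(trans_chr, upc, household, store))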

Let’s look at summary statistics of some of the key numerical variables.

summary(select(trans,dollar_sales, units))
##   dollar_sales         units        
##  Min.   :-11.760   Min.   :  1.000  
##  1st Qu.:  0.990   1st Qu.:  1.000  
##  Median :  1.500   Median :  1.000  
##  Mean   :  1.757   Mean   :  1.197  
##  3rd Qu.:  2.190   3rd Qu.:  1.000  
##  Max.   :153.140   Max.   :156.000

We see that some of the transactions have negative values for dollar_sales. These are likely returns. We will filter them out (ideally we would also drop the original purchases that were later returned, but for now we just drop the returns).

trans <- filter(trans, dollar_sales>0)
head(trans, n=5)
##          upc dollar_sales units time_of_transaction geography week
## 1 7680850106         0.80     1                1100         2    1
## 2 3620000470         3.59     1                1100         2    1
## 3 1800028064         2.25     1                1137         2    1
## 4 9999985067         0.85     1                1148         2    1
## 5 9999985131         2.19     1                1323         2    1
##   household store basket day coupon
## 1    125434   244      1   1      0
## 2    125434   244      1   1      0
## 3    108320   244      2   1      0
## 4    162016   244      3   1      0
## 5     89437   244      4   1      0

Let’s now look at the products data.

# again, adjust the path to where you saved dh_product_lookup.csv
prods <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab7 summarizing by group/dh_product_lookup.csv")
str(prods)
## 'data.frame':    927 obs. of  5 variables:
##  $ upc                : num  1.11e+08 5.66e+08 5.66e+08 5.66e+08 5.66e+08 ...
##  $ product_description: Factor w/ 872 levels "(S)BARILLA GEMELLI PASTA",..: 864 565 452 452 565 71 93 49 99 50 ...
##  $ commodity          : Factor w/ 4 levels "pancake mixes",..: 3 4 4 4 4 3 3 3 3 3 ...
##  $ brand              : Factor w/ 131 levels "Al Dente","Alaga",..: 127 101 82 82 101 11 11 11 11 11 ...
##  $ product_size       : Factor w/ 146 levels " ","##########",..: 55 87 42 18 42 60 61 62 62 62 ...
head(prods, n=5)
##         upc       product_description   commodity         brand
## 1 111112360 VINCENT S ORIG MARINARA S pasta sauce     Vincent's
## 2 566300023       PINE MOUNTAIN SYRUP      syrups Pine Mountain
## 3 566300028         MILLER CANE SYRUP      syrups        Miller
## 4 566300029         MILLER CANE SYRUP      syrups        Miller
## 5 566300035       PINE MOUNTAIN SYRUP      syrups Pine Mountain
##   product_size
## 1        25 OZ
## 2        40 OZ
## 3        19 OZ
## 4        12 OZ
## 5        19 OZ

There are 927 products in the products file. Looking at the number of factor levels, we see 872 distinct product descriptions, the 4 product categories (commodity) we expected (pancake mixes, pasta, pasta sauce, syrups), 131 brands, and so on.
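We can list the category names directly with levels(), which returns the levels of a factor:

levels(prods$commodity)
## [1] "pancake mixes" "pasta"         "pasta sauce"   "syrups"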

3. Collapsing data using group_by() and summarize()

Suppose we wanted to calculate total sales of each product (as identified by its upc code). Right now each upc code appears in the transactions data many times: we have only 927 products but over 5 million transactions. We want to aggregate to the product level, which we can accomplish using dplyr’s group_by() and summarize() functions.

prodsales <- trans %>% group_by(upc) %>% summarize(tot_sales=sum(dollar_sales)) 
str(prodsales)
## Classes 'tbl_df', 'tbl' and 'data.frame':    926 obs. of  2 variables:
##  $ upc      : num  1.11e+08 5.66e+08 5.66e+08 5.66e+08 5.66e+08 ...
##  $ tot_sales: num  184.68 231.93 3446.55 2.98 33.25 ...
head(prodsales, n=5)
## Source: local data frame [5 x 2]
## 
##         upc tot_sales
##       (dbl)     (dbl)
## 1 111112360    184.68
## 2 566300023    231.93
## 3 566300028   3446.55
## 4 566300029      2.98
## 5 566300035     33.25

The function summarize() effectively collapsed our transactions data to 926 observations, the number of unique upc codes in the transactions data. In the process of collapsing, we asked R to sum() the dollar sales for each upc code. For example, we see that upc code 111112360 had total sales of 184.68 dollars. To make this data a bit more useful, let’s merge in the information about these upc codes from the products table. Thankfully, both data sets have a upc code, allowing us to link them. The collapsed transactions data has 926 observations whereas the products data set has 927; this is probably because we filtered out one of the products when we dropped the negative sales. We will use inner_join() and thus not include this product in the merged data set. Let’s also sort by sales so we can see the top-selling products.

prodsales <- inner_join(prodsales,prods,by="upc")
prodsales <- arrange(prodsales, desc(tot_sales))
head(prodsales, n=5)
## Source: local data frame [5 x 6]
## 
##          upc tot_sales                product_description     commodity
##        (dbl)     (dbl)                             (fctr)        (fctr)
## 1 3620000250 146484.22             RAGU TRADITIONAL PLAIN   pasta sauce
## 2 5100002549 115526.82          PREGO REG SPAGHETTI SAUCE   pasta sauce
## 3 3000005970  98922.68          AUNT JEMIMA ORIGINL SYRUP        syrups
## 4 9999967727  96155.55 PRIVATE LABEL SYRUP PLASTIC BOTTLE        syrups
## 5 3000005300  93322.41          AJ BUTTERMILK PANCAKE MIX pancake mixes
## Variables not shown: brand (fctr), product_size (fctr)

With about 146 thousand dollars in sales, Ragu Traditional Plain is the top seller in our data set.

Note that we could have calculated a lot more about the products while collapsing. We could have taken an average with mean(), counted the number of transactions with n(), or used many other functions within summarize(). For example, suppose we also wanted to know the average unit price of each product and the number of transactions for each product.

prodsales <- trans %>% group_by(upc) %>% 
  summarize(tot_sales=sum(dollar_sales), av_price=mean(dollar_sales/units), no_trans=n())
prodsales <- inner_join(prodsales,prods,by="upc")
prodsales <- arrange(prodsales, desc(av_price))
head(prodsales, n=5)
## Source: local data frame [5 x 8]
## 
##          upc tot_sales  av_price no_trans       product_description
##        (dbl)     (dbl)     (dbl)    (int)                    (fctr)
## 1 6172005180     12.00 12.000000        1            KARO RED SYRUP
## 2 2100077260     68.88 11.480000        6   KRAFT 5 PK MAC & CHEESE
## 3 6971911731     61.93  8.847143        7      PATTY PIZZAILO SAUCE
## 4 6971911431    358.59  8.740000       40 PATSYS PUTTANESCA SCE 24Z
## 5 6971911231    566.35  8.726842       57 PATSYS TOMATO BASIL SAUCE
## Variables not shown: commodity (fctr), brand (fctr), product_size (fctr)

At an average price of 12 dollars, Karo Red Syrup is the most expensive product in our data set. Interestingly, it was purchased only once.
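A single purchase is not much evidence about a product’s typical price, so we might restrict the ranking to products with some minimum number of transactions. A sketch (the cutoff of 10 is arbitrary):

# rank by average price only among products purchased at least 10 times
prodsales %>% filter(no_trans >= 10) %>% arrange(desc(av_price)) %>% head(n=5)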

IN-CLASS EXERCISE 1: Calculate total sales in each product category.

IN-CLASS EXERCISE 2: How many different products are in each product category?

4. Summarizing within groups using group_by() and mutate()

Now that we have prices for each product, let’s calculate how these prices differ from the average price in each product’s category. This means that we need to calculate the average price within each category, but we don’t want the data to collapse: we want a new variable in our prodsales data set that holds the average price in each category. We can accomplish this using group_by() and mutate(), because mutate() calculates statistics within the groups defined by group_by().

prodsales <- prodsales %>% group_by(commodity) %>% mutate(av_cat_price=mean(av_price))
head(select(prodsales, product_description, commodity, av_price, av_cat_price), n=5)
## Source: local data frame [5 x 4]
## Groups: commodity [3]
## 
##         product_description   commodity  av_price av_cat_price
##                      (fctr)      (fctr)     (dbl)        (dbl)
## 1            KARO RED SYRUP      syrups 12.000000     3.040105
## 2   KRAFT 5 PK MAC & CHEESE       pasta 11.480000     1.770425
## 3      PATTY PIZZAILO SAUCE pasta sauce  8.847143     3.064678
## 4 PATSYS PUTTANESCA SCE 24Z pasta sauce  8.740000     3.064678
## 5 PATSYS TOMATO BASIL SAUCE pasta sauce  8.726842     3.064678

Notice that we grouped the data by product category (the commodity variable). The number of observations is still 926, and av_cat_price is the same in all rows that share a category. Let’s calculate how each product’s price deviates from the average price in its category, and then plot this deviation against the number of transactions.

library(ggplot2)
prodsales$price_dev <- prodsales$av_price - prodsales$av_cat_price
ggplot(prodsales, aes(x=price_dev, y=no_trans)) + geom_point() 

We may want to use log scale for the number of transactions. Also, to make the graph more informative let’s vary the color by product category.

ggplot(prodsales, aes(x=price_dev, y=no_trans, color=commodity)) + geom_point() +
  scale_y_continuous(trans="log",
                     breaks=c(1, 2, 5, 10, 25, 50, 100, 200, 500, 1000,
                              2000, 5000, 10000, 20000, 50000, 100000)) +
  labs(title="Number of transactions in each product and the deviation \nfrom average price in the product's category") +
  xlab("deviation of price from average category price") +
  ylab("number of transactions (log scale)") +
  geom_vline(xintercept=0)

It is pretty clear that products priced below their category average tend to have a higher volume of transactions.
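One rough way to quantify what the plot suggests (a quick check, not part of the original analysis) is the correlation between the price deviation and the log of the transaction count:

# a negative correlation is consistent with the pattern in the plot
cor(prodsales$price_dev, log(prodsales$no_trans))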

IN-CLASS EXERCISE 3: Suppose we wanted to focus on transactions from frequent buyers in each category. Create a data set that keeps only transactions from household and product category combinations in which the household made at least 10 transactions in that product category.

IN-CLASS EXERCISE 4: Among households who bought at least 10 items in a product category, how many different brands within a product category does the median household buy?

Exercises:

  1. Load in dunnhumby’s transaction data and filter out negative sales.

  2. Create a data set that contains total spending by each household. What is the average spending across all households? What is the id of the highest-spending household? How much did that household spend?

  3. Create a data set that contains total spending on each brand in each category. How many category and brand combinations are there? On average, how many brands are in each product category? (Hint 1: you can group_by() more than one variable; e.g., to group by both category and brand, use group_by(commodity, brand). Hint 2: since neither product category nor brand is in the transactions data, you will first need to merge that information in. Hint 3: use the inner_join() function in case some of the products are not in the trans data.)

  4. Using the data from question 3, calculate the market share for each brand in each category. Which brand has the biggest share in each category?

  5. Create a data set that contains total dollar value of sales and total volume of transactions (i.e. number of transactions) for each store. Merge this with the store_lookup data at this address https://www.dropbox.com/s/lqwhwlpsbjpdo1y/dh_store_lookup.csv?raw=1. Show the first 6 rows of this data.

  6. Are there zip codes that have more than one store? Aggregate your store data to the zip code level. Which zip code has the highest volume of sales? How much is that volume?