Lab 5: Collapsing data using `group_by()` and `summarize()`

Learning Objectives

relational database
exploring data
collapsing data using group_by() and summarize()

1. Relational database

In this lab we will work with data from a marketing analytics firm dunnhumby. Data contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. The data is given to us in four different files: transactions, products, stores, promotions (they call it causal). The relationships among these four files are illustrated below. The full description of the data is here.

Data that is stored in multiple tables with relations among the tables is called a relational database.

2. Exploring the data

Let’s load in the products and transactions data into R.

library(dplyr)
library(ggplot2)
library(stargazer)
#trans <- read.csv("https://www.dropbox.com/s/30gqnhk3tqsawfa/dh_transactions.csv?raw=1")
trans <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab5 summarizing by group/dh_transactions.csv")
str(trans)

## 'data.frame':    5197681 obs. of  11 variables:
##  $ upc                : num  7.68e+09 3.62e+09 1.80e+09 1.00e+10 1.00e+10 ...
##  $ dollar_sales       : num  0.8 3.59 2.25 0.85 2.19 2.19 3.45 1.29 0.75 2.19 ...
##  $ units              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ time_of_transaction: int  1100 1100 1137 1148 1323 1323 1415 1415 1513 1523 ...
##  $ geography          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ week               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ household          : int  125434 125434 108320 162016 89437 89437 158549 158549 18851 118337 ...
##  $ store              : int  244 244 244 244 244 244 244 244 244 244 ...
##  $ basket             : int  1 1 2 3 4 4 5 5 6 7 ...
##  $ day                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ coupon             : int  0 0 0 0 0 0 0 0 0 0 ...

Transactions is a fairly large data set with over 5 million observations. We see that all variables are stored as numerical (either as num or int), even though there is no numerical meaning to upc code, time of transactions or household or store id.

Let’s look at summary statistics of some of the key numerical variables.

stargazer(select(trans,dollar_sales, units), type="text", median=TRUE)

## 
## ============================================================
## Statistic        N     Mean  St. Dev.   Min   Median   Max  
## ------------------------------------------------------------
## dollar_sales 5,197,681 1.757  1.129   -11.760 1.500  153.140
## units        5,197,681 1.197  0.569      1      1      156  
## ------------------------------------------------------------

We see that some of the transactions have negative values for dollar_sales. It is possible that these are merchandise returns. We will filter them out (although we should probably filter out the original purchase as well, but for now, we just drop the returns).

trans <- filter(trans, dollar_sales>0)
head(trans, n=3)

##          upc dollar_sales units time_of_transaction geography week
## 1 7680850106         0.80     1                1100         2    1
## 2 3620000470         3.59     1                1100         2    1
## 3 1800028064         2.25     1                1137         2    1
##   household store basket day coupon
## 1    125434   244      1   1      0
## 2    125434   244      1   1      0
## 3    108320   244      2   1      0

Let’s now look at the products data.

#prods <- read.csv("https://www.dropbox.com/s/ja9am94rmwtte0n/dh_product_lookup.csv?dl=0")
prods <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab5 summarizing by group/dh_product_lookup.csv")
str(prods)

## 'data.frame':    927 obs. of  5 variables:
##  $ upc                : num  1.11e+08 5.66e+08 5.66e+08 5.66e+08 5.66e+08 ...
##  $ product_description: Factor w/ 872 levels "(S)BARILLA GEMELLI PASTA",..: 864 565 452 452 565 71 93 49 99 50 ...
##  $ commodity          : Factor w/ 4 levels "pancake mixes",..: 3 4 4 4 4 3 3 3 3 3 ...
##  $ brand              : Factor w/ 131 levels "Al Dente","Alaga",..: 127 101 82 82 101 11 11 11 11 11 ...
##  $ product_size       : Factor w/ 146 levels " ","##########",..: 55 87 42 18 42 60 61 62 62 62 ...

head(prods, n=3)

##         upc       product_description   commodity         brand
## 1 111112360 VINCENT S ORIG MARINARA S pasta sauce     Vincent's
## 2 566300023       PINE MOUNTAIN SYRUP      syrups Pine Mountain
## 3 566300028         MILLER CANE SYRUP      syrups        Miller
##   product_size
## 1        25 OZ
## 2        40 OZ
## 3        19 OZ

There are 927 products in the products file. Looking at the number of factor levels, we see 872 product descriptions, 4 product categories (commodity) as we expected (pasta, pasta sauce, pancake mix, syrup), 131 brands etc.

3. Collapsing data using `group_by()` and `summarize()`

Suppose we wanted to calculate total sales for each product (as identified by the upc code). Right now each upc code appears in the transaction data many many times - note that we have only 927 products and over 5 million transactions. We want to aggregate to the product level. We can accomplish this using dplyr’s group_by() and summarize() functions.

prodsales <- trans %>% group_by(upc) %>% summarize(tot_sales=sum(dollar_sales)) 
str(prodsales)

## Classes 'tbl_df', 'tbl' and 'data.frame':    926 obs. of  2 variables:
##  $ upc      : num  1.11e+08 5.66e+08 5.66e+08 5.66e+08 5.66e+08 ...
##  $ tot_sales: num  184.68 231.93 3446.55 2.98 33.25 ...

head(prodsales, n=3)

## # A tibble: 3 × 2
##         upc tot_sales
##       <dbl>     <dbl>
## 1 111112360    184.68
## 2 566300023    231.93
## 3 566300028   3446.55

The function summarize() effectively collapsed our transactions data to 926 observations. This is the number of unique upc codes in the transactions data. In the process of collapsing we asked R to calculate (sum()) the dollar sales variable by each upc code. For example we see that upc code 111112360 had total sales of 184.68 dollars.

To make this data a bit more useful, let’s merge in the information about these upc codes from the products table. Thankfully, both data sets have upc code allowing us to link the two data sets. The collapsed transaction data has 926 observations whereas the products data set has 927. This is probably because we filtered out one of of the products when we dropped the negative sales. We will use inner_join() and thus not include this product in the merged data set. Let’s also sort by sales so we can see the top selling products.

prodsales <- inner_join(prodsales,prods,by="upc")
prodsales <- arrange(prodsales, desc(tot_sales))
head(prodsales, n=3)

## # A tibble: 3 × 6
##          upc tot_sales       product_description   commodity       brand
##        <dbl>     <dbl>                    <fctr>      <fctr>      <fctr>
## 1 3620000250 146484.22    RAGU TRADITIONAL PLAIN pasta sauce        Ragu
## 2 5100002549 115526.82 PREGO REG SPAGHETTI SAUCE pasta sauce       Prego
## 3 3000005970  98922.68 AUNT JEMIMA ORIGINL SYRUP      syrups Aunt Jemima
## # ... with 1 more variables: product_size <fctr>

With about 146 thousand dollars in sales, Ragu Traditional Plain is the top seller in our data set.

Note that we could have calculated a lot more information about the products while collapsing. We could have taken an average (=mean()), or count the number of transaction (=n()) and many other functions that can be included within summarize(). For example, suppose we also wanted to know the average unit price of each product and the number of transactions in each product.

prodsales <- trans %>% group_by(upc) %>% 
  summarize(tot_sales=sum(dollar_sales), av_price=mean(dollar_sales/units), no_trans=n())
prodsales <- inner_join(prodsales,prods,by="upc")
prodsales <- arrange(prodsales, desc(av_price))
head(prodsales, n=5)

## # A tibble: 5 × 8
##          upc tot_sales  av_price no_trans       product_description
##        <dbl>     <dbl>     <dbl>    <int>                    <fctr>
## 1 6172005180     12.00 12.000000        1            KARO RED SYRUP
## 2 2100077260     68.88 11.480000        6   KRAFT 5 PK MAC & CHEESE
## 3 6971911731     61.93  8.847143        7      PATTY PIZZAILO SAUCE
## 4 6971911431    358.59  8.740000       40 PATSYS PUTTANESCA SCE 24Z
## 5 6971911231    566.35  8.726842       57 PATSYS TOMATO BASIL SAUCE
## # ... with 3 more variables: commodity <fctr>, brand <fctr>,
## #   product_size <fctr>

At the average price of 12 dollars, Karo Red Syrup is the most expensive product in our data set. Interestingly, it was purchased only once.

Let’s plot the number of transactions against the average unit price.

ggplot(prodsales) + geom_point(aes(x=no_trans, y=av_price, color=commodity)) + 
  scale_y_continuous(trans = "log", breaks=c(1,2,4,10)) +
  scale_x_continuous(trans = "log", breaks=c(1,2,5,10,20,50,100,500,2000,10000,40000))

IN-CLASS EXERCISE 1: Which brands have the biggest sales in each product category?

IN-CLASS EXERCISE 2: Load in the NHL data and create win variable that equals one if the home team won and 0 otherwise.

nhl <- read.csv("https://www.dropbox.com/s/krlwr6z38ol6wer/NHLseason2016.csv?raw=1")
nhl$win <- ifelse(nhl$goals_home-nhl$goals_visit>0,1,0)

Calculate the number of wins for each team.

Exercises:

Load in dunnhumby’s transaction data and filter out negative sales.
Create a data set that contains total spending by each household. What is average of spending across all households? What is the id of the highest spending household? How much did that household spend?
In how many stores does a typical household shop?
Create a data set that contains total dollar value of sales and total volume of transactions (i.e. number of transactions) for each store. Merge this with the store_lookup data at this address https://www.dropbox.com/s/lqwhwlpsbjpdo1y/dh_store_lookup.csv?raw=1. Show the first 6 rows of this data.
Are there zip codes that have more than one store? Aggregate your store data to the zip code level. Which zip code has the highest volume of sales? How much is that volume?
FUN BONUS: Where are these stores? Use this (https://www.dropbox.com/s/q8es41fulxwpge7/US%20zip%20codes.csv?raw=1) file to get the longitude and latitude of each zip code. Merge it with your zip code level store data and plot the location of these zip codes using the leaflet() function from the leaflet package. Below is the syntax for plotting points in yourdata on a map. Good luck!

library(leaflet)
leaflet() %>% 
  addTiles() %>%
  addCircleMarkers(~LNG,~LAT, data=yourdata)