In this lab we will work with data from the marketing analytics firm dunnhumby. The data contain household-level transactions over a period of two years in four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. The data are given to us in four files: transactions, products, stores, and promotions (which dunnhumby calls causal). The relationships among these four files are illustrated below. The full description of the data is here.
Data that is stored in multiple tables with relations among the tables is called a relational database.
Let’s load the products and transactions data into R.
library(dplyr)
# the data can also be read directly from Dropbox:
# trans <- read.csv("https://www.dropbox.com/s/30gqnhk3tqsawfa/dh_transactions.csv?raw=1")
trans <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab7 summarizing by group/dh_transactions.csv")
str(trans)
## 'data.frame': 5197681 obs. of 11 variables:
## $ upc : num 7.68e+09 3.62e+09 1.80e+09 1.00e+10 1.00e+10 ...
## $ dollar_sales : num 0.8 3.59 2.25 0.85 2.19 2.19 3.45 1.29 0.75 2.19 ...
## $ units : int 1 1 1 1 1 1 1 1 1 1 ...
## $ time_of_transaction: int 1100 1100 1137 1148 1323 1323 1415 1415 1513 1523 ...
## $ geography : int 2 2 2 2 2 2 2 2 2 2 ...
## $ week : int 1 1 1 1 1 1 1 1 1 1 ...
## $ household : int 125434 125434 108320 162016 89437 89437 158549 158549 18851 118337 ...
## $ store : int 244 244 244 244 244 244 244 244 244 244 ...
## $ basket : int 1 1 2 3 4 4 5 5 6 7 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ coupon : int 0 0 0 0 0 0 0 0 0 0 ...
The transactions data set is fairly large, with over 5 million observations. Note that all variables are stored as numbers (either num or int), even though upc codes, transaction times, and household and store ids are identifiers with no numerical meaning.
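Purely as a sketch (we keep working with the numeric versions so that the rest of the lab runs unchanged), identifiers like these could be stored as text instead:
# illustrative only: store id variables as text rather than numbers
trans_ids <- trans %>%
  mutate(household = as.character(household),
         store = as.character(store))
str(select(trans_ids, household, store))
Here trans_ids is just a throwaway name for the converted copy; we continue with the original trans below.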
Let’s look at summary statistics of some of the key numerical variables.
summary(select(trans,dollar_sales, units))
## dollar_sales units
## Min. :-11.760 Min. : 1.000
## 1st Qu.: 0.990 1st Qu.: 1.000
## Median : 1.500 Median : 1.000
## Mean : 1.757 Mean : 1.197
## 3rd Qu.: 2.190 3rd Qu.: 1.000
## Max. :153.140 Max. :156.000
We see that some transactions have negative values of dollar_sales. These are likely returns. We will filter them out (ideally we would also drop the matching original purchases, but for now we just drop the returns).
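Before dropping them, it is worth a quick check of how many such rows there are (the exact count will depend on your copy of the data):
# count transactions with zero or negative dollar sales
sum(trans$dollar_sales <= 0)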
trans <- filter(trans, dollar_sales>0)
head(trans, n=5)
## upc dollar_sales units time_of_transaction geography week
## 1 7680850106 0.80 1 1100 2 1
## 2 3620000470 3.59 1 1100 2 1
## 3 1800028064 2.25 1 1137 2 1
## 4 9999985067 0.85 1 1148 2 1
## 5 9999985131 2.19 1 1323 2 1
## household store basket day coupon
## 1 125434 244 1 1 0
## 2 125434 244 1 1 0
## 3 108320 244 2 1 0
## 4 162016 244 3 1 0
## 5 89437 244 4 1 0
Let’s now look at the products data.
prods <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab7 summarizing by group/dh_product_lookup.csv")
str(prods)
## 'data.frame': 927 obs. of 5 variables:
## $ upc : num 1.11e+08 5.66e+08 5.66e+08 5.66e+08 5.66e+08 ...
## $ product_description: Factor w/ 872 levels "(S)BARILLA GEMELLI PASTA",..: 864 565 452 452 565 71 93 49 99 50 ...
## $ commodity : Factor w/ 4 levels "pancake mixes",..: 3 4 4 4 4 3 3 3 3 3 ...
## $ brand : Factor w/ 131 levels "Al Dente","Alaga",..: 127 101 82 82 101 11 11 11 11 11 ...
## $ product_size : Factor w/ 146 levels " ","##########",..: 55 87 42 18 42 60 61 62 62 62 ...
head(prods, n=5)
## upc product_description commodity brand
## 1 111112360 VINCENT S ORIG MARINARA S pasta sauce Vincent's
## 2 566300023 PINE MOUNTAIN SYRUP syrups Pine Mountain
## 3 566300028 MILLER CANE SYRUP syrups Miller
## 4 566300029 MILLER CANE SYRUP syrups Miller
## 5 566300035 PINE MOUNTAIN SYRUP syrups Pine Mountain
## product_size
## 1 25 OZ
## 2 40 OZ
## 3 19 OZ
## 4 12 OZ
## 5 19 OZ
There are 927 products in the products file. Looking at the number of factor levels, we see 872 distinct product descriptions, the 4 product categories (commodity) we expected (pasta, pasta sauce, pancake mixes, syrups), 131 brands, and so on.
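We can confirm the four categories by listing the levels of the commodity factor:
# the four product categories
levels(prods$commodity)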
group_by() and summarize()
Suppose we wanted to calculate total sales for each product (as identified by its upc code). Right now each upc code appears in the transactions data many times: we have only 927 products but over 5 million transactions. We want to aggregate the data to the product level, which we can accomplish with dplyr’s group_by() and summarize() functions.
prodsales <- trans %>% group_by(upc) %>% summarize(tot_sales=sum(dollar_sales))
str(prodsales)
## Classes 'tbl_df', 'tbl' and 'data.frame': 926 obs. of 2 variables:
## $ upc : num 1.11e+08 5.66e+08 5.66e+08 5.66e+08 5.66e+08 ...
## $ tot_sales: num 184.68 231.93 3446.55 2.98 33.25 ...
head(prodsales, n=5)
## Source: local data frame [5 x 2]
##
## upc tot_sales
## (dbl) (dbl)
## 1 111112360 184.68
## 2 566300023 231.93
## 3 566300028 3446.55
## 4 566300029 2.98
## 5 566300035 33.25
The summarize() function collapsed our transactions data to 926 observations, the number of unique upc codes in the transactions data. In the process of collapsing, we asked R to sum() the dollar_sales variable for each upc code. For example, upc code 111112360 had total sales of 184.68 dollars. To make this data a bit more useful, let’s merge in the information about these upc codes from the products table. Thankfully, both data sets have a upc column, allowing us to link them. The collapsed transactions data has 926 observations whereas the products data set has 927; this is probably because we filtered out one of the products when we dropped the negative sales. We will use inner_join(), which will not include that product in the merged data set. Let’s also sort by sales so we can see the top-selling products.
prodsales <- inner_join(prodsales,prods,by="upc")
prodsales <- arrange(prodsales, desc(tot_sales))
head(prodsales, n=5)
## Source: local data frame [5 x 6]
##
## upc tot_sales product_description commodity
## (dbl) (dbl) (fctr) (fctr)
## 1 3620000250 146484.22 RAGU TRADITIONAL PLAIN pasta sauce
## 2 5100002549 115526.82 PREGO REG SPAGHETTI SAUCE pasta sauce
## 3 3000005970 98922.68 AUNT JEMIMA ORIGINL SYRUP syrups
## 4 9999967727 96155.55 PRIVATE LABEL SYRUP PLASTIC BOTTLE syrups
## 5 3000005300 93322.41 AJ BUTTERMILK PANCAKE MIX pancake mixes
## Variables not shown: brand (fctr), product_size (fctr)
With about 146 thousand dollars in sales, Ragu Traditional Plain is the top seller in our data set.
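If we were curious which product got left out of the merge, dplyr’s anti_join() returns the rows of its first argument that have no match in the second; a quick sketch:
# the product in prods with no matching (positive-sales) transactions
anti_join(prods, prodsales, by="upc")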
Note that we could have calculated much more information about the products while collapsing. We could have taken an average (mean()), counted the number of transactions (n()), or applied many other functions inside summarize(). For example, suppose we also wanted to know the average unit price of each product and the number of transactions for each product.
prodsales <- trans %>% group_by(upc) %>%
summarize(tot_sales=sum(dollar_sales), av_price=mean(dollar_sales/units), no_trans=n())
prodsales <- inner_join(prodsales,prods,by="upc")
prodsales <- arrange(prodsales, desc(av_price))
head(prodsales, n=5)
## Source: local data frame [5 x 8]
##
## upc tot_sales av_price no_trans product_description
## (dbl) (dbl) (dbl) (int) (fctr)
## 1 6172005180 12.00 12.000000 1 KARO RED SYRUP
## 2 2100077260 68.88 11.480000 6 KRAFT 5 PK MAC & CHEESE
## 3 6971911731 61.93 8.847143 7 PATTY PIZZAILO SAUCE
## 4 6971911431 358.59 8.740000 40 PATSYS PUTTANESCA SCE 24Z
## 5 6971911231 566.35 8.726842 57 PATSYS TOMATO BASIL SAUCE
## Variables not shown: commodity (fctr), brand (fctr), product_size (fctr)
At the average price of 12 dollars, Karo Red Syrup is the most expensive product in our data set. Interestingly, it was purchased only once.
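One caveat: av_price is a simple average of per-transaction unit prices, so a one-unit purchase counts as much as a ten-unit purchase. If we instead wanted a quantity-weighted price (total dollars divided by total units), a sketch would be:
# quantity-weighted unit price for each product
wt_prices <- trans %>%
  group_by(upc) %>%
  summarize(wt_price = sum(dollar_sales)/sum(units))
The two measures agree whenever every transaction involves a single unit.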
IN-CLASS EXERCISE 1: Calculate total sales in each product category.
IN-CLASS EXERCISE 2: How many different products are in each product category?
group_by() and mutate()
Now that we have a price for each product, let’s calculate how these prices differ from the average price in each product category. This means we need to calculate the average price within each category, but we don’t want the data to collapse: we want a new variable in our prodsales data set that holds the average price of each product’s category. We can accomplish this with group_by() and mutate(), because mutate() calculates statistics within the groups defined by group_by().
prodsales <- prodsales %>% group_by(commodity) %>% mutate(av_cat_price=mean(av_price))
head(select(prodsales, product_description, commodity, av_price, av_cat_price), n=5)
## Source: local data frame [5 x 4]
## Groups: commodity [3]
##
## product_description commodity av_price av_cat_price
## (fctr) (fctr) (dbl) (dbl)
## 1 KARO RED SYRUP syrups 12.000000 3.040105
## 2 KRAFT 5 PK MAC & CHEESE pasta 11.480000 1.770425
## 3 PATTY PIZZAILO SAUCE pasta sauce 8.847143 3.064678
## 4 PATSYS PUTTANESCA SCE 24Z pasta sauce 8.740000 3.064678
## 5 PATSYS TOMATO BASIL SAUCE pasta sauce 8.726842 3.064678
Notice that we grouped the data by product category (the commodity variable). The number of observations is still 926, and av_cat_price is the same in all rows that share a category. Let’s calculate how each product’s price deviates from the average price in its category, and then plot this deviation against the number of transactions.
library(ggplot2)
prodsales$price_dev <- prodsales$av_price - prodsales$av_cat_price
ggplot(prodsales, aes(x=price_dev, y=no_trans)) + geom_point()
We may want to use a log scale for the number of transactions. Also, to make the graph more informative, let’s vary the color by product category.
ggplot(prodsales, aes(x=price_dev, y=no_trans, color=commodity)) + geom_point() +
  scale_y_continuous(trans="log",
                     breaks=c(1, 2, 5, 10, 25, 50, 100, 200, 500, 1000,
                              2000, 5000, 10000, 20000, 50000, 100000)) +
  labs(title="Number of transactions in each product and the deviation \nfrom average price in the product's category") +
  xlab("deviation of price from average category price") +
  ylab("number of transactions (log scale)") +
  geom_vline(xintercept=0)  # vertical reference line at zero price deviation
It is pretty clear that relatively lower-priced items tend to have a somewhat higher volume of transactions.
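One rough way to put a number on this pattern (the relationship need not be linear, so treat it only as a summary) is the correlation between the price deviation and the log of the number of transactions:
# correlation between price deviation and (log) transaction volume
cor(prodsales$price_dev, log(prodsales$no_trans))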
IN-CLASS EXERCISE 3: Suppose we wanted to focus on transactions from frequent buyers in each category. Create a data set containing only transactions from household and product category combinations in which the household made at least 10 transactions in that product category.
IN-CLASS EXERCISE 4: Among households who bought at least 10 items in a product category, how many different brands does the median household buy in a product category?
EXERCISES
1. Load in dunnhumby’s transaction data and filter out negative sales.
2. Create a data set that contains total spending by each household. What is the average spending across all households? What is the id of the highest-spending household? How much did that household spend?
3. Create a data set that contains total spending on each brand in each category. How many category and brand combinations are there? On average, how many brands are in each product category? (Hint 1: you can group_by() more than one variable, e.g. to group by both category and brand, use group_by(commodity, brand). Hint 2: since neither product category nor brand is in the transactions data, you will first need to merge that information in. Hint 3: use the inner_join() function in case some of the products are not in the trans data.)
4. Using the data from question 3, calculate the market share of each brand in each category. Which brand has the biggest share in each category?
5. Create a data set that contains the total dollar value of sales and the total volume of transactions (i.e., the number of transactions) for each store. Merge this with the store_lookup data at this address: https://www.dropbox.com/s/lqwhwlpsbjpdo1y/dh_store_lookup.csv?raw=1. Show the first 6 rows of this data.
6. Are there zip codes that have more than one store? Aggregate your store data to the zip code level. Which zip code has the highest volume of sales? How much is that volume?