Let’s load in our prodsales data from Lab 5.
# load dplyr and ggplot2, then read in the product-level sales data from Lab 5
library(dplyr)
library(ggplot2)
prodsales <- read.csv("https://www.dropbox.com/s/3lbd20guk994obs/prodsales.csv?raw=1")
group_by() and mutate()

Our prodsales data has the average price for each product. Since some product categories, such as pasta sauce, tend to be more expensive than others, such as pasta, let’s calculate how the price of each individual product differs from the average price of its product category. This means that we need to calculate the average price within each category, but we don’t want the data to collapse; we want a new variable in our prodsales data set that holds the average price of each product’s category. We can accomplish this with group_by() and mutate(), because mutate() calculates summary statistics within the groups defined by group_by().
prodsales <- prodsales %>% group_by(commodity) %>% mutate(av_cat_price=mean(av_price))
head(select(prodsales, product_description, commodity, av_price, av_cat_price), n=5)
## Source: local data frame [5 x 4]
## Groups: commodity [3]
##
##         product_description   commodity  av_price av_cat_price
##                       <fctr>      <fctr>     <dbl>        <dbl>
## 1             KARO RED SYRUP      syrups 12.000000     3.040105
## 2    KRAFT 5 PK MAC & CHEESE       pasta 11.480000     1.770425
## 3       PATTY PIZZAILO SAUCE pasta sauce  8.847143     3.064678
## 4  PATSYS PUTTANESCA SCE 24Z pasta sauce  8.740000     3.064678
## 5  PATSYS TOMATO BASIL SAUCE pasta sauce  8.726842     3.064678
Notice that we grouped the data by product category (the variable commodity). The number of observations is still 926, and av_cat_price is the same in all rows that belong to the same category. Let’s calculate how each product’s price deviates from the average price in its category, and then plot this deviation against the number of transactions. We may want to use a log scale for the number of transactions, and to make the graph more informative we can vary the color by product category.
library(ggplot2)
prodsales$price_dev <- prodsales$av_price - prodsales$av_cat_price   # deviation from the category average
ggplot(prodsales, aes(y=price_dev, x=no_trans, color=commodity)) + geom_point() +
  scale_x_continuous(trans="log", breaks=c(1, 2, 5, 10, 25, 50, 100, 500, 2000, 20000, 100000)) +
  labs(title="Number of transactions in each product and the deviation \nfrom average price in the product's category") +
  ylab("deviation of price from average category price") +
  xlab("number of transactions (log scale)") +
  geom_hline(aes(yintercept=0))
It is pretty clear that relatively lower-priced items tend to have a higher volume of transactions. We can also detect a somewhat higher price elasticity of demand for pasta than for pasta sauce.
IN-CLASS EXERCISE 1 (using the rank() function): Calculate the market share of each brand in its product category. Create a data frame with the top three brands in each category.
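A rough sketch of how rank() can be used here, not a full solution; the column names brand and dollar_sales are assumptions about the data (if your data only records no_trans, shares could instead be based on transaction counts):

brand_shares <- prodsales %>%
  group_by(commodity, brand) %>%                         # brand is an assumed column name
  summarize(brand_sales = sum(dollar_sales)) %>%         # one row per brand within each category
  mutate(share = brand_sales / sum(brand_sales),         # market share within the category
         rank_in_cat = rank(-share))                     # rank(-share): rank 1 = largest share
top_brands <- filter(brand_shares, rank_in_cat <= 3)     # keep the top three brands per category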
IN-CLASS EXERCISE 2 (using the cumsum() function): Suppose we wanted to predict the outcome of the games in our NHL data set. What variables would you want to use?
Suppose you wanted to create a measure of the opposing team’s success so far in the season, and suppose further that you wanted to measure that success by counting the number of games the team has won so far in the season. Create a variable that calculates that number for each team (home and visiting) for each game. The code below is meant to get you started. Can you tell what it does?
nhl <- read.csv("https://www.dropbox.com/s/krlwr6z38ol6wer/NHLseason2016.csv?raw=1")
nhl$Date <- as.Date(nhl$Date, "%Y-%m-%d")         # convert Date from text to Date class
# reshape to one row per team per game: first the home teams...
home_games <- nhl %>%
  rename(team=Home, goals_for=goals_home, goals_against=goals_visit) %>%
  select(Date, team, goals_for, goals_against)
# ...then the visiting teams
away_games <- nhl %>%
  rename(team=Visitor, goals_for=goals_visit, goals_against=goals_home) %>%
  select(Date, team, goals_for, goals_against)
games <- bind_rows(home_games, away_games)        # stack home and away records
games$win <- ifelse(games$goals_for - games$goals_against > 0, 1, 0)   # 1 if the team won the game
games <- arrange(games, team, Date)               # order each team's games chronologically
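One possible way to finish the exercise, assuming we define “success so far” as wins accumulated before the current game: with the games arranged by team and Date, cumsum() inside group_by(team) gives a running win count, and subtracting win excludes the current game.

games <- games %>%
  group_by(team) %>%
  mutate(wins_so_far = cumsum(win) - win)   # wins before the current game

The remaining step, merging wins_so_far back into the nhl data for both the home and the visiting team, is left for the exercise.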
Load in dunnhumby’s transaction data and filter out negative sales. Load in the products data as well.
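A minimal sketch of this step; the file paths and the column name dollar_sales are placeholders, so use the locations and variable names from your earlier labs:

transactions <- read.csv("transactions.csv")              # placeholder path for the transaction data
products <- read.csv("products.csv")                      # placeholder path for the products data
transactions <- filter(transactions, dollar_sales >= 0)   # keep only non-negative sales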
Let’s examine brand loyalty. As a first step, let’s focus on transactions from frequent buyers in each category: create a data set that contains only transactions from household–category pairs in which the household made at least 10 transactions in that product category. (Hint: Since the transaction data only has upc but not product category, you will need to merge the product category into the transaction data.)
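One possible approach, assuming the transaction data has household and upc columns and the products data has upc and commodity (adjust the names to match your data):

trans_cat <- transactions %>%
  left_join(select(products, upc, commodity), by="upc") %>%   # merge the product category into the transactions
  group_by(household, commodity) %>%
  filter(n() >= 10)                                           # keep household-category pairs with 10+ transactions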
Among households who bought at least 10 items in a product category, how many different brands in that category does the median household buy? (Hint: You may need to “collapse” your data three times.)
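One reading of the three “collapses”, again assuming a brand column that has been merged in from the products data in the same way as commodity:

brand_loyalty <- trans_cat %>%
  group_by(commodity, household, brand) %>%
  summarize(n_trans = n()) %>%                   # collapse 1: one row per category-household-brand
  summarize(n_brands = n()) %>%                  # collapse 2: number of brands each household buys, by category
  summarize(median_brands = median(n_brands))    # collapse 3: the median household, one row per category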
Would you say that brand loyalty varies by product category?