Tipping Data with dplyr

The tips dataset is included in the R reshape2 package.
The description reads as follows:
One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He collected several variables:
- tip in dollars
- bill in dollars
- sex of the bill payer
- whether there were smokers in the party
- day of the week
- time of day
- size of the party.
In all he recorded 244 tips. The data was reported in a collection of case studies for business statistics (Bryant & Smith 1995).
The data set and example is pulled from:
Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing

Let’s view a few rows of the data:

library(reshape2)
head(tips)
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

Using dplyr

I like working with dplyr, so let’s make sure that’s loaded and ready. We’ll also use the dplyr glimpse() function to view the data in a slightly different way than we would using the base R head() function. Then we’ll use dplyr’s tbl_df() function, which makes the data frame a little bit tidier.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(tips)
## Observations: 244
## Variables: 7
## $ total_bill (dbl) 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26....
## $ tip        (dbl) 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.9...
## $ sex        (fctr) Female, Male, Male, Male, Female, Male, Male, Male...
## $ smoker     (fctr) No, No, No, No, No, No, No, No, No, No, No, No, No...
## $ day        (fctr) Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, ...
## $ time       (fctr) Dinner, Dinner, Dinner, Dinner, Dinner, Dinner, Di...
## $ size       (int) 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, ...
tips2 <- tbl_df(tips)
head(tips2)
## Source: local data frame [6 x 7]
## 
##   total_bill   tip    sex smoker    day   time  size
##        (dbl) (dbl) (fctr) (fctr) (fctr) (fctr) (int)
## 1      16.99  1.01 Female     No    Sun Dinner     2
## 2      10.34  1.66   Male     No    Sun Dinner     3
## 3      21.01  3.50   Male     No    Sun Dinner     3
## 4      23.68  3.31   Male     No    Sun Dinner     2
## 5      24.59  3.61 Female     No    Sun Dinner     4
## 6      25.29  4.71   Male     No    Sun Dinner     4

Effect of Party Size

I’m interested in the effect of the size of the party on the tipping outcome, so let’s trim this data with dplyr’s select() function.

tips_by_party_size <- select(tips2, size, total_bill, tip)
head(tips_by_party_size)
## Source: local data frame [6 x 3]
## 
##    size total_bill   tip
##   (int)      (dbl) (dbl)
## 1     2      16.99  1.01
## 2     3      10.34  1.66
## 3     3      21.01  3.50
## 4     2      23.68  3.31
## 5     4      24.59  3.61
## 6     4      25.29  4.71

Getting the Percentage Column

This is all well and good, but we care about the outcome in terms of a percentage. We need to create this column from the eisting data, which we can do with dplyr’s mutate() function. We’re going to replace the original “tips by party size” variable (rather than storing the result in a new variable) since we’re not losing any data by doing so, just adding a new column.

tips_by_party_size <- mutate(tips_by_party_size, tip_perc =    round(tip/total_bill, 2))
head(tips_by_party_size)
## Source: local data frame [6 x 4]
## 
##    size total_bill   tip tip_perc
##   (int)      (dbl) (dbl)    (dbl)
## 1     2      16.99  1.01     0.06
## 2     3      10.34  1.66     0.16
## 3     3      21.01  3.50     0.17
## 4     2      23.68  3.31     0.14
## 5     4      24.59  3.61     0.15
## 6     4      25.29  4.71     0.19

The tip_perc Data Frame

Now that I have the actual measurement I’m interested, I can trim more of the data frame. This time I’ll store the result in a new variable. This data frame will contain sizes of parties, and corresponding tip percentages. Once again, I’m using the select() function.

t_perc_size <- select(tips_by_party_size, size, tip_perc)
head(t_perc_size)
## Source: local data frame [6 x 2]
## 
##    size tip_perc
##   (int)    (dbl)
## 1     2     0.06
## 2     3     0.16
## 3     3     0.17
## 4     2     0.14
## 5     4     0.15
## 6     4     0.19

Grouping the Results by Party Size

Now we can use plyr’s group_by() and summarize() functions to produce a data frame consisting of different party sizes, and the corresponding average tip percentages per size in aggregate. This is exactly like using “GROUP BY” in SQL when you “SELECT” a regular column to group along with a built-in function (in this case it would be “AVG”).

size_cats <- summarize(group_by(t_perc_size, size), mean(tip_perc))
head(size_cats)
## Source: local data frame [6 x 2]
## 
##    size mean(tip_perc)
##   (int)          (dbl)
## 1     1      0.2175000
## 2     2      0.1657051
## 3     3      0.1526316
## 4     4      0.1464865
## 5     5      0.1420000
## 6     6      0.1550000

Plotting the Results

Let’s look at the results in graphical form.

plot(size_cats, pch=19, type="b",col="green", 
     main="Tip Percentage by Party Size")