The tips dataset is included in the R reshape2 package.
The description reads as follows:
One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He collected several variables:
- tip in dollars
- bill in dollars
- sex of the bill payer
- whether there were smokers in the party
- day of the week
- time of day
- size of the party.
In all he recorded 244 tips. The data was reported in a collection of case studies for business statistics (Bryant & Smith 1995).
The data set and example is pulled from:
Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing
Let’s view a few rows of the data:
library(reshape2)
head(tips)
## total_bill tip sex smoker day time size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
I like working with dplyr, so let’s make sure that’s loaded and ready. We’ll also use the dplyr glimpse() function to view the data in a slightly different way than we would using the base R head() function. Then we’ll use dplyr’s tbl_df() function, which makes the data frame a little bit tidier.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(tips)
## Observations: 244
## Variables: 7
## $ total_bill (dbl) 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26....
## $ tip (dbl) 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.9...
## $ sex (fctr) Female, Male, Male, Male, Female, Male, Male, Male...
## $ smoker (fctr) No, No, No, No, No, No, No, No, No, No, No, No, No...
## $ day (fctr) Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, ...
## $ time (fctr) Dinner, Dinner, Dinner, Dinner, Dinner, Dinner, Di...
## $ size (int) 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, ...
tips2 <- tbl_df(tips)
head(tips2)
## Source: local data frame [6 x 7]
##
## total_bill tip sex smoker day time size
## (dbl) (dbl) (fctr) (fctr) (fctr) (fctr) (int)
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
I’m interested in the effect of the size of the party on the tipping outcome, so let’s trim this data with dplyr’s select() function.
tips_by_party_size <- select(tips2, size, total_bill, tip)
head(tips_by_party_size)
## Source: local data frame [6 x 3]
##
## size total_bill tip
## (int) (dbl) (dbl)
## 1 2 16.99 1.01
## 2 3 10.34 1.66
## 3 3 21.01 3.50
## 4 2 23.68 3.31
## 5 4 24.59 3.61
## 6 4 25.29 4.71
This is all well and good, but we care about the outcome in terms of a percentage. We need to create this column from the eisting data, which we can do with dplyr’s mutate() function. We’re going to replace the original “tips by party size” variable (rather than storing the result in a new variable) since we’re not losing any data by doing so, just adding a new column.
tips_by_party_size <- mutate(tips_by_party_size, tip_perc = round(tip/total_bill, 2))
head(tips_by_party_size)
## Source: local data frame [6 x 4]
##
## size total_bill tip tip_perc
## (int) (dbl) (dbl) (dbl)
## 1 2 16.99 1.01 0.06
## 2 3 10.34 1.66 0.16
## 3 3 21.01 3.50 0.17
## 4 2 23.68 3.31 0.14
## 5 4 24.59 3.61 0.15
## 6 4 25.29 4.71 0.19
Now that I have the actual measurement I’m interested, I can trim more of the data frame. This time I’ll store the result in a new variable. This data frame will contain sizes of parties, and corresponding tip percentages. Once again, I’m using the select() function.
t_perc_size <- select(tips_by_party_size, size, tip_perc)
head(t_perc_size)
## Source: local data frame [6 x 2]
##
## size tip_perc
## (int) (dbl)
## 1 2 0.06
## 2 3 0.16
## 3 3 0.17
## 4 2 0.14
## 5 4 0.15
## 6 4 0.19
Now we can use plyr’s group_by() and summarize() functions to produce a data frame consisting of different party sizes, and the corresponding average tip percentages per size in aggregate. This is exactly like using “GROUP BY” in SQL when you “SELECT” a regular column to group along with a built-in function (in this case it would be “AVG”).
size_cats <- summarize(group_by(t_perc_size, size), mean(tip_perc))
head(size_cats)
## Source: local data frame [6 x 2]
##
## size mean(tip_perc)
## (int) (dbl)
## 1 1 0.2175000
## 2 2 0.1657051
## 3 3 0.1526316
## 4 4 0.1464865
## 5 5 0.1420000
## 6 6 0.1550000
Let’s look at the results in graphical form.
plot(size_cats, pch=19, type="b",col="green",
main="Tip Percentage by Party Size")