I. Description

The owner of a restaurant was interested in studying the tipping patterns of his customers. He collected restaurant bills over a two-week period that he believes provide a good sample of his customers. The data recorded include the amount of the bill, size of the tip, percentage tip, number of customers in the group, whether or not a credit card was used, day of the week, and a coded identity of the server.

II. Questions

  1. How many variables are included in the data set? Which ones are numerical, and which ones are categorical?
  2. Is the average percentage tip the same for all three servers?
  3. Is the relationship between bill amount and the tip amount the same for those customers that paid with a credit card and those that paid with cash?
  4. Is there a higher tipping percentage with more guests?
  5. Do the average percentage tips on Wednesday and on Friday indicate a significant difference?
  6. What is the relationship between the amount of the bill and the percentage of the tip?

As stated in the description above, the owner of the restaurant wishes to analyze these data in order to discover the tipping patterns of his customers. If data analysis showed that changing one particular variable was associated with a significant increase in tipping, the owner could use this information to increase the tips collected by his servers. Because the amount of a tip generally depends on the amount of the bill, we decided that exploring how the different variables in the data set relate to percentage tip would be most valuable. Thus, each member of our group contributed a question that could help us determine which variables most affected the percentage tip paid by customers.

For Question 4, we tested to see if there was a higher tipping percentage with more guests because we were curious whether larger groups of customers felt compelled to pay the server a higher tip in gratitude for accommodating their group. In Question 5, we compared a weekday to a weekend evening to see if customers were more generous in their tipping on a weekend evening. Lastly, in Question 6, we wanted to determine what relationship existed between the amount of the bill and the percentage of the tip. We suspected that a customer with a particularly large bill might be more likely to pay the server a larger percentage tip. Overall, in conducting our data analysis, we intentionally chose tests that would reveal significant relationships between the percentage tip and other variables to see if changing method of payment, group size, day of week, etc. could help a restaurant owner acquire higher tips.

III. About the Data

Question 1.

How many variables are included in the data set? Which ones are numerical, and which ones are categorical?

There are seven variables being measured in the data set. These variables are bill amount, tip amount, method of payment ("Credit"), number of guests, day, server, and percent tip. The numerical variables are bill, tip, guests, and percent tip. The categorical variables are method of payment (whether or not a credit card was used), day of the week, and server.

mydata = read.csv("http://bit.ly/1StTazL",header=T)
str(mydata)
## 'data.frame':    140 obs. of  7 variables:
##  $ Bill  : num  10.2 18.4 11.7 9.2 18.1 ...
##  $ Tip   : num  1.83 2.75 2.28 1.8 4 3.13 5 3.35 7.25 3 ...
##  $ Credit: Factor w/ 2 levels "n","y": 1 1 2 1 1 2 2 2 2 1 ...
##  $ Guests: int  1 2 1 1 3 2 2 2 2 2 ...
##  $ Day   : Factor w/ 5 levels "F","M","R","T",..: 5 2 5 5 5 5 3 4 5 1 ...
##  $ Server: Factor w/ 3 levels "A","B","C": 1 2 1 1 3 2 3 1 1 3 ...
##  $ PctTip: num  18 14.9 19.5 19.6 22.1 15 19.9 18 18.2 13.4 ...
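The same split between numerical and categorical variables can be checked programmatically. The following is only a minimal sketch: `demo` is a hypothetical three-row stand-in built from the first three values of each column in the str() output above, so that the example runs without downloading the file.

```r
# Hypothetical stand-in: first three rows as displayed by str(mydata) above
demo <- data.frame(
  Bill   = c(10.2, 18.4, 11.7),
  Tip    = c(1.83, 2.75, 2.28),
  Credit = factor(c("n", "n", "y")),
  Guests = c(1L, 2L, 1L),
  Day    = factor(c("W", "M", "W")),
  Server = factor(c("A", "B", "A")),
  PctTip = c(18.0, 14.9, 19.5)
)

# is.numeric() is TRUE for num and int columns, FALSE for factors
is_num <- sapply(demo, is.numeric)
numeric_vars <- names(demo)[is_num]
categorical_vars <- names(demo)[!is_num]

print(numeric_vars)
print(categorical_vars)
```

Replacing `demo` with `mydata` should give the same grouping. In newer versions of R, read.csv() reads text columns as character rather than factor by default, but is.numeric() still separates them from the numerical columns.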

Summary statistics of data can be seen below:

summary(mydata)
##       Bill            Tip         Credit     Guests      Day    Server
##  Min.   : 1.66   Min.   : 0.250   n:92   Min.   :1.000   F:25   A:55  
##  1st Qu.:15.37   1st Qu.: 2.145   y:48   1st Qu.:2.000   M:18   B:55  
##  Median :19.95   Median : 3.340          Median :2.000   R:32   C:30  
##  Mean   :23.08   Mean   : 3.925          Mean   :2.129   T:13         
##  3rd Qu.:28.92   3rd Qu.: 5.000          3rd Qu.:2.000   W:52         
##  Max.   :70.51   Max.   :15.000          Max.   :7.000                
##      PctTip     
##  Min.   : 6.70  
##  1st Qu.:14.28  
##  Median :16.35  
##  Mean   :16.70  
##  3rd Qu.:18.20  
##  Max.   :42.20

Overall, the average bill amount is $23.08 and the average tip amount is $3.93, with an average percentage tip of 16.7%; the average number of guests per party is about 2. Nearly twice as many customers paid with cash as with a credit card, and Wednesday was the busiest day at the restaurant in this sample.

One particular feature of the data that stood out to us was the wide variability in the bill amounts reported. We were also surprised by the wide variability in tip amounts, which ranged from $0.25 to $15.00. This variability can be seen in the boxplots below:

x= c(mydata$Bill)
y= c(mydata$Tip)
boxplot(x,y, main= "Summary Statistics of Bill and Tip Amount", ylab= "Amount ($)", names= c("Bill", "Tip"))

Representing the data for bill amount with a boxplot revealed that there was an unusually high number of larger bills that can be classified as outliers. There were also several outliers past the upper fence for tip amount, though no outliers below the lower fence.
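The fences mentioned above can be approximated from the quartiles in the summary() output. This is a rough sketch using the usual 1.5 x IQR rule; note that boxplot() computes its hinges with fivenum(), which can differ slightly from the quartiles printed by summary(), so the values are approximate.

```r
# Quartiles copied from the summary(mydata) output above
q <- data.frame(
  variable = c("Bill", "Tip"),
  q1 = c(15.37, 2.145),
  q3 = c(28.92, 5.000)
)

# Fences: points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are drawn as outliers
q$iqr <- q$q3 - q$q1
q$lower_fence <- q$q1 - 1.5 * q$iqr
q$upper_fence <- q$q3 + 1.5 * q$iqr

print(q)
```

Both lower fences come out negative. Since a bill or tip cannot be negative, this is consistent with the absence of outliers below the lower fence, while the maximum bill ($70.51) and maximum tip ($15.00) lie well above the upper fences.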

IV. Discussion and Data Analysis

(Question 1 answered in “About the Data”)

Question 2.

Is the average percentage tip the same for all three servers?

In order to find the average tip for each server, we first had to separate the data by server.

table(mydata$Server)
## 
##  A  B  C 
## 55 55 30
ServerA = subset(mydata, Server == "A")
ServerB = subset(mydata, Server == "B")
ServerC = subset(mydata, Server == "C")

Next, we found the average tip for each server.

mean(ServerA$Tip)
## [1] 4.047636
mean(ServerB$Tip)
## [1] 3.478
mean(ServerC$Tip)
## [1] 4.517667

The average tip of each server can be represented by the bar graph below:

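# Mean tip per server, rounded from the mean() output above.
# The second positional argument of barplot() is the bar width, so it is omitted here.
y= c(4.05, 3.48, 4.52)
barplot(y, main= "Average Tip per Server", xlab= "Server", ylab= "Average Tip ($)", names.arg= c("ServerA", "ServerB", "ServerC"))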

Although the averages clearly differ, we next needed to assess whether these differences were statistically meaningful. To determine whether the differences in the average tip of each server were significant, we conducted three separate t-tests, one for each pair of servers, at a 95% confidence level. In each test, the null hypothesis is that the difference in mean tips, ‘ud’, is equal to 0; the alternative hypothesis is that the difference ‘ud’ is not equal to 0.

t.test(ServerA$Tip, ServerB$Tip)
## 
##  Welch Two Sample t-test
## 
## data:  ServerA$Tip and ServerB$Tip
## t = 1.247, df = 96.843, p-value = 0.2154
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3369945  1.4762672
## sample estimates:
## mean of x mean of y 
##  4.047636  3.478000
t.test(ServerA$Tip, ServerC$Tip)
## 
##  Welch Two Sample t-test
## 
## data:  ServerA$Tip and ServerC$Tip
## t = -0.74044, df = 59.03, p-value = 0.462
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.7402434  0.8001828
## sample estimates:
## mean of x mean of y 
##  4.047636  4.517667
t.test(ServerB$Tip, ServerC$Tip)
## 
##  Welch Two Sample t-test
## 
## data:  ServerB$Tip and ServerC$Tip
## t = -1.804, df = 44.534, p-value = 0.078
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.2007664  0.1214331
## sample estimates:
## mean of x mean of y 
##  3.478000  4.517667

In conclusion, the average tips for Servers A, B, and C, respectively, were $4.05, $3.48, and $4.52. We tested whether the differences among these values were significant by running three separate Welch Two-Sample t-tests. When comparing Server A and Server B, the p-value was 0.2154 and the 95% confidence interval for the difference in averages was -0.337 < ud < 1.476; when comparing Servers A and C, the p-value was 0.462 and the 95% confidence interval was -1.74 < ud < 0.8; when comparing Servers B and C, the p-value was 0.078 and the 95% confidence interval was -2.201 < ud < 0.121. All three p-values exceed 0.05 and all three intervals contain 0, so there is not sufficient evidence to say that there is a significant difference in the average tips earned by each server. Note that these tests compare the average tip amount (Tip) rather than the average percentage tip (PctTip) named in the question, so the question as posed was not tested directly.
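Because three tests were run on the same data, the chance of at least one false positive is higher than 5%. One common safeguard, which was not part of our original analysis, is to adjust the p-values for multiple comparisons. The sketch below applies a Bonferroni adjustment to the three p-values reported above; the vector names are our own labels.

```r
# p-values copied from the three Welch t-tests above
p_raw <- c(A_vs_B = 0.2154, A_vs_C = 0.462, B_vs_C = 0.078)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")
print(p_bonf)

# Is any pairwise difference significant at the 0.05 level after adjustment?
any_significant <- any(p_bonf < 0.05)
print(any_significant)
```

The adjustment only makes each p-value larger, so it does not change the conclusion that no pairwise difference is significant.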

Question 3.

Is the relationship between bill amount and the tip amount the same for those customers that paid with a credit card and those that paid with cash?

In order to determine the relationships between bill amount and tip amount for each method of payment, we first needed to separate the data into two subsets: those who paid with credit card, and those who paid with cash.

table(mydata$Credit)
## 
##  n  y 
## 92 48
Credit = subset(mydata, Credit == "y")
Cash = subset(mydata, Credit == "n")

Next, we determined the correlation coefficient and linear regression model describing the relationship between bill amount and tip amount only for those that paid with a credit card.

x= Credit$Bill
y= Credit$Tip
cor(x,y)
## [1] 0.9524133
myreg= lm(y ~ x)
summary(myreg)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3302 -0.6097  0.0792  0.3089  4.0112 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.370420   0.289664  -1.279    0.207    
## x            0.187075   0.008828  21.192   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9 on 46 degrees of freedom
## Multiple R-squared:  0.9071, Adjusted R-squared:  0.9051 
## F-statistic: 449.1 on 1 and 46 DF,  p-value: < 2.2e-16

We then conducted the same analysis to determine the correlation coefficient and linear regression model describing the relationship between bill amount and tip amount for those that paid with cash.

x= Cash$Bill
y= Cash$Tip
cor(x,y)
## [1] 0.8425103
myreg= lm(y ~ x)
summary(myreg)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3179 -0.5195 -0.1635  0.2944  6.0212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.14272    0.25838  -0.552    0.582    
## x            0.17390    0.01172  14.838   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.087 on 90 degrees of freedom
## Multiple R-squared:  0.7098, Adjusted R-squared:  0.7066 
## F-statistic: 220.2 on 1 and 90 DF,  p-value: < 2.2e-16

In order to visually compare these two sets of data, we constructed the following scatterplots with their regression lines:

Credit = subset(mydata, Credit == "y")
Cash = subset(mydata, Credit == "n")
x= Credit$Bill
y= Credit$Tip

myreg= lm(y ~ x)

plot(x,y, main= "Credit", xlab= "Bill Amount ($)", ylab= "Tip Amount ($)")
abline(myreg, col= "red")

Credit = subset(mydata, Credit == "y")
Cash = subset(mydata, Credit == "n")
x= Cash$Bill
y= Cash$Tip

myreg= lm(y ~ x)
plot(x,y, main= "Cash", xlab= "Bill Amount ($)", ylab= "Tip Amount ($)")
abline(myreg, col= "red")

We also compared these two relationships by considering the average bill amount for all of the data in the set, $23.08. Using the linear regression models found through analysis, we calculated the expected tip amount for a customer paying his $23.08 bill using a credit card as compared to the expected tip amount for a customer paying for his $23.08 bill in cash.

Credit:

0.187075 * 23.08 - 0.370420
## [1] 3.947271

Cash:

0.17390 * 23.08 - 0.14272
## [1] 3.870892

In summary, the relationship between bill amount and tip amount for those who paid with a credit card has a correlation of r= 0.9524 and can be modeled by the linear regression y= 0.187075x - 0.37042. The relationship for those who paid with cash has a correlation of r= 0.8425 and can be modeled by the linear regression y= 0.1739x - 0.14272. The two fitted lines are similar: the credit card slope is slightly steeper (about 18.7 cents of tip per dollar of bill, versus 17.4 cents for cash), and the credit card data lie closer to their line. We did not formally test whether the two slopes differ, so these models suggest, rather than prove, that customers who pay with a credit card tend to give slightly higher tips. For example, if a customer had a bill amount of $23.08 and paid with a credit card, their expected tip amount would be $3.95; if, however, they had paid with cash, the expected tip amount would be $3.87. This difference of about 8 cents on an average bill is very small, but it is still interesting to note.
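One rough way to judge whether the two slopes really differ is to compare them using the standard errors printed in the two regression summaries. This check was not part of our original analysis; it is a sketch that assumes the credit and cash groups are independent samples and that a normal approximation is adequate for these sample sizes.

```r
# Slopes and standard errors copied from the two summary(myreg) outputs above
b_credit  <- 0.187075
se_credit <- 0.008828
b_cash    <- 0.17390
se_cash   <- 0.01172

# Difference in slopes and its standard error (independent groups assumed)
diff_slopes <- b_credit - b_cash
se_diff <- sqrt(se_credit^2 + se_cash^2)

# Approximate z statistic and two-sided p-value
z <- diff_slopes / se_diff
p_value <- 2 * pnorm(-abs(z))

print(round(c(difference = diff_slopes, z = z, p_value = p_value), 4))
```

The difference in slopes is smaller than its own standard error, so under these assumptions the data do not show a statistically significant difference between the credit card and cash relationships. A more complete approach would be to fit a single model with an interaction term on the full data set.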

Question 4.

Is there a higher tipping percentage with more guests?

In order to determine the relationship between the number of guests in a party and the percentage tip paid, we found the correlation coefficient between these two variables and determined the linear model that represents the data.

x= mydata$Guests
y= mydata$PctTip
cor(x,y)
## [1] 0.0720307
myreg= lm(y ~ x)
summary(myreg)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9530 -2.3624 -0.6218  1.3970 25.5470 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.9777     0.9310  17.162   <2e-16 ***
## x             0.3377     0.3980   0.848    0.398    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.567 on 138 degrees of freedom
## Multiple R-squared:  0.005188,   Adjusted R-squared:  -0.00202 
## F-statistic: 0.7197 on 1 and 138 DF,  p-value: 0.3977

The relationship between the number of guests and the percentage tip is shown in the scatterplot below:

x= mydata$Guests
y= mydata$PctTip
myreg= lm(y ~ x)
plot(x,y, main= "Percent Tip per Number of Guests", xlab= "Number of Guests", ylab= "Percentage Tip")
abline(myreg, col= "red")

In order to consider this information more practically, we used the linear regression model describing the data to determine how the percentage tip would change with increasing party sizes of 2, 4, and 6 guests. The results are shown below:

0.3377 * 2 + 15.9777
## [1] 16.6531
0.3377 * 4 + 15.9777
## [1] 17.3285
0.3377 * 6 + 15.9777
## [1] 18.0039

To see how this change in percentage tip translates into a change in amount of tip, the following calculations were performed using the average bill amount for the data set:

23.08 * .1665
## [1] 3.84282
23.08 * .1733
## [1] 3.999764
23.08 * .18
## [1] 4.1544

In summary, there is a very weak positive correlation of r= 0.072 between number of guests and tipping percentage, which can be modeled by the linear regression y= 0.3377x + 15.9777. The slope is not statistically significant (p = 0.398), and the model explains less than 1% of the variation in percentage tip (R-squared = 0.005). Taking the fitted line at face value, we would predict a 16.65% tip for two guests, a 17.33% tip for four guests, and an 18% tip for six guests. For an average bill amount of $23.08, this translates into tip amounts of $3.84 for two guests, $4.00 for four guests, and $4.15 for six guests. Thus, percentage tip rose slightly with party size in this sample, but the evidence is too weak to conclude that party size affects the percentage of tip that is paid.
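The significance of a correlation can be checked directly from r and the sample size. The sketch below uses the standard t statistic for a correlation coefficient, with r and n taken from the output above; it should reproduce the t value and p-value reported for the slope in the regression summary, because the two tests are equivalent in simple linear regression.

```r
# Correlation and sample size taken from the output above
r <- 0.0720307
n <- 140

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value on n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(c(t = t_stat, p_value = p_value), 4))
```

On the full data set, cor.test(mydata$Guests, mydata$PctTip) performs the same test in one call.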

Question 5.

Do the average percentage tips on Wednesday and on Friday indicate a significant difference?

In order to determine the average percent tip paid on Wednesdays and Fridays at the restaurant, we first separated the relevant data into two subsets: Wednesday and Friday.

table(mydata$Day)
## 
##  F  M  R  T  W 
## 25 18 32 13 52
Wednesday= subset(mydata, Day == "W")
Friday= subset(mydata, Day == "F")

Next, we calculated the average percentage tip for both Wednesday and Friday.

mean(Wednesday$PctTip)
## [1] 16.81154
mean(Friday$PctTip)
## [1] 16.268

Although these simple calculations indicate that there is a minor difference, we did not feel this was sufficient evidence to prove a meaningful difference in percentage tip between the two days. Thus, we conducted a T-Test to analyze the data further:

t.test(Wednesday$PctTip, Friday$PctTip)
## 
##  Welch Two Sample t-test
## 
## data:  Wednesday$PctTip and Friday$PctTip
## t = 0.40181, df = 31.61, p-value = 0.6905
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.213212  3.300289
## sample estimates:
## mean of x mean of y 
##  16.81154  16.26800

The difference in the distribution of percentage tip per day of the week can be shown by the following boxplots:

x= c(Wednesday$PctTip)
y= c(Friday$PctTip)
boxplot(x,y, main= "Percentage Tip per Day of Week", ylab= "Percentage Tip", names= c("Wednesday", "Friday"))

In summary, the average percent tip on Wednesdays was found to be 16.81%, and the average percent tip on Fridays was 16.27%. When running a Welch Two-Sample t-test on the difference between these averages (null hypothesis: difference ‘ud’ is equal to 0; alternative hypothesis: difference ‘ud’ is not equal to 0), the p-value was 0.6905 and the 95% confidence interval was -2.213 < ud < 3.3. Because the p-value is far above 0.05 and the interval contains 0, we do not have sufficient evidence to suggest that the difference between the average percentage of tip on Wednesdays and Fridays is meaningful or significant.
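To see how the confidence interval relates to the other numbers in the t-test output, it can be rebuilt from the two sample means, the t statistic, and the degrees of freedom. This is only an illustrative sketch using values copied from the output above; the variable names are our own.

```r
# Values copied from the Welch t-test output above
mean_wed <- 16.81154
mean_fri <- 16.26800
t_stat   <- 0.40181
df       <- 31.61

# t = difference / standard error, so the standard error can be recovered
diff_means <- mean_wed - mean_fri
se_diff <- diff_means / t_stat

# 95% interval: difference plus or minus the critical t value times the standard error
margin <- qt(0.975, df) * se_diff
ci <- c(lower = diff_means - margin, upper = diff_means + margin)

print(round(ci, 3))
```

The result should land very close to the interval reported by t.test(), and it makes clear why the interval contains 0: the margin of error is several times larger than the observed difference of about half a percentage point.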

Question 6.

What is the relationship between the amount of the bill and the percentage of the tip?

In order to ascertain the relationship between bill amount and the percentage tip for the entire data set, we calculated the correlation coefficient as well as the linear regression model describing the data:

x= mydata$Bill
y= mydata$PctTip
cor(x,y)
## [1] 0.1260588
myreg= lm(y ~ x)
summary(myreg)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1099 -2.5014 -0.6406  1.5424 25.4749 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.63782    0.80641  19.392   <2e-16 ***
## x            0.04588    0.03073   1.493    0.138    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.543 on 138 degrees of freedom
## Multiple R-squared:  0.01589,    Adjusted R-squared:  0.00876 
## F-statistic: 2.228 on 1 and 138 DF,  p-value: 0.1378

The relationship between amount of bill and percentage tip is shown in the scatterplot below, together with the fitted regression line:

x= mydata$Bill
y= mydata$PctTip
myreg= lm(y ~ x)
plot(x, y, main= "Percentage Tip given Bill Amount", ylab= "Percentage Tip", xlab= "Amount of Bill ($)")
abline(myreg, col= "red")

In order to consider this relationship more practically, we again considered the average bill amount paid at the restaurant, $23.08, and calculated what the expected percentage of tip paid would be. We then doubled the average bill amount and calculated what the expected percentage tip would be for a bill totaling $46.16.

0.0459 * 23.08 + 15.638
## [1] 16.69737
0.0459 * 46.16 + 15.638
## [1] 17.75674

In conclusion, although we suspected that the percentage tip might increase with the amount of the bill, the linear regression model indicates that this relationship is very weak. The correlation coefficient is r= 0.126, the slope is not statistically significant (p = 0.138), and the linear equation that represents the relationship is y= 0.0459x + 15.638. So, for example, if a customer had the average bill amount of $23.08, his expected percentage of tip would be 16.697%. If, however, his bill had been doubled to $46.16, his expected percentage of tip would only increase to 17.757%.
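Two quick calculations help put the strength of this relationship in perspective. The sketch below squares the correlation to get the share of variation explained, and scales the slope to a $10 increase in the bill; both inputs are copied from the output above.

```r
# Correlation and slope copied from the output above
r <- 0.1260588
slope <- 0.04588

# In simple linear regression, R-squared equals the squared correlation
r_squared <- r^2

# Predicted change in percentage tip for each additional $10 on the bill
change_per_10_dollars <- slope * 10

print(round(c(r_squared = r_squared, change_per_10_dollars = change_per_10_dollars), 4))
```

The squared correlation matches the Multiple R-squared in the regression summary: bill amount accounts for under 2% of the variation in percentage tip, and an extra $10 on the bill is associated with less than half a percentage point of additional tip.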

V. Summary

Our key findings surprised us: contrary to what we expected, no strong relationship between percentage tip and any other variable seemed to exist. Where we expected to see significant relationships, correlations, or differences, we instead found relatively high p-values and low correlations. For example, we suspected we might see a difference in percentage tip depending on the day of the week, but when comparing Wednesday to Friday, we found no statistically significant difference (p-value of 0.691, well above the 0.05 threshold). We also expected to see a significant increase in percentage of tip with increasing bill amount, but this relationship only had a small positive correlation of r= 0.126.

Overall, our analysis of the data shows that server, day of the week, and the amount of the bill had minimal effect on the tip paid by the customer. Those who paid with a credit card seemed to give slightly higher tips than those who paid with cash, and percentage tip increased slightly with party size, although the party-size relationship was not statistically significant (p = 0.398). For the restaurant owner who wished to discover the tipping patterns of his customers, the most we could tentatively recommend would be to have enough large tables to accommodate large parties of guests (as these parties seemed to pay tips that were slightly higher than those of smaller parties) and to encourage parties to pay with credit cards.

Given the opportunity to conduct this analysis again or more thoroughly, we could improve it by running more t-tests to test the significance of differences among groups in the data. We could also test at a stricter significance level than 0.05 (that is, a confidence level higher than 95%) to reduce the chance of reporting a difference that is not real. Including a chi-square test could also allow us to compare categorical variables more closely.

Lastly, this data set contained a great deal of variability and included several outliers. We did not eliminate any of the outliers for our testing, and leaving them in could have skewed some of the results. Had we excluded these observations from our testing, we may have obtained slightly different results.