1.) Description of Problem:
For our project, we conducted an analysis to determine which factors affect the amount a dining customer tips. The tip a customer leaves can depend on numerous factors, such as the total bill, the sex of the customer, the customer's smoking/nonsmoking status, the day of the week, the time of day, and the size of the customer's party. In our study, we use one dependent variable, tip, and two independent variables, total bill and day of the week. The “day of the week” variable has four levels: Thursday, Friday, Saturday, and Sunday.
One goal of this project is to determine which days of the week bring in the most and the least tips. A second goal is to find out whether the tip rises as the customer's total bill increases; in other words, we test whether there is a positive linear association between tip and total bill. We conducted this study so that employees in the food industry know which day leads to the most tips and how much additional money they can expect to earn from tips. In addition, we want employees to be able to tell which customers are under-tipping or over-tipping relative to their total bill.
2.) Description of Data:
The Tips dataset we are using for this project was created by one waiter who recorded information about each tip he received over a period of a few months working in one restaurant. This data was collected in 1995. The waiter probably collected the data because he was curious to see which factors played the biggest role in shaping the amount of tips he was earning. There are 244 observations and 7 variables in the dataset. The 7 variables are tip in dollars, total bill in dollars, sex of the bill payer, whether there were smokers in the party, day of the week, time of day, and size of the party.
Of the seven variables, three are numerical while four are categorical. Numerical variables include total bill, tip, and size of the party. Categorical variables include sex of the bill payer, smoker/nonsmoker status of customer, day of the week, and time of the day. Each of the categorical variables has multiple levels. Sex has two levels (male and female), smoker status has two levels (Yes and No), day of the week has four levels (Thursday, Friday, Saturday, Sunday), and time of day has two levels (lunch and dinner).
Several issues could distort the data during collection. These include holidays, birthdays, wedding receptions, and other big occasions, all of which would lead to higher bills and higher tips, as well as the type of restaurant (for example, seafood or fast food). Bills at a seafood restaurant could plausibly be left-skewed, meaning the mean bill is less than the median bill. In contrast, bills at a fast food restaurant could be right-skewed, meaning the mean bill is greater than the median bill. Other external issues include poor customer service and any bias related to the sex of the bill payer, the size of the party, or smoking status.
3.) Proposed Analysis
i.) Data Restructuring
a.) Approach: Find the proportion of tips for each day X using
tip proportion = (total tips on day X) / (total tips over all days)
Find the tip percentage using
tip percentage = (total tips on day X) / (total tips over all days) * 100
b.) Reason: to get the proportion and percentage of tips received on each of
the four days of the week in the dataset
c.) Questions addressed:
1.) Which day earned the most tips?
2.) Which day earned the fewest tips?
3.) Do customers tip more as their total bill
increases?
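The restructuring step can be sketched in a few lines of R. This is a minimal illustration, not the code used for the report: the helper name `tip_share` and the six toy records are made up for the example and are not taken from the Tips dataset.

```r
# Hypothetical helper: percentage of all tip dollars collected on each day.
tip_share <- function(tip, day) {
  totals <- tapply(tip, day, sum)   # total tip per day
  100 * totals / sum(totals)        # each day's share, in percent
}

# Made-up records for illustration only (not from the Tips dataset)
toy <- data.frame(
  day = c("Thur", "Thur", "Fri", "Sat", "Sat", "Sat"),
  tip = c(2.00, 3.00, 1.00, 4.00, 5.00, 5.00)
)

share <- tip_share(toy$tip, toy$day)
share                              # percentage of tips per day
names(share)[which.max(share)]     # day with the largest share
names(share)[which.min(share)]     # day with the smallest share
```

Summing per day with `tapply` avoids typing counts and means by hand, so the shares always add up to exactly 100 percent.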
ii.) Summary Statistics
a.) Approach: Find the five-number summary (min, Q1, median, Q3, max) and the
mean for the total bill and tip variables.
b.) Reason: to find the center (mean, median), the extremes (min, max), and
the IQR
c.) Questions addressed:
1.) Using the quartiles, we can see the tip amounts below which 25%, 50%,
and 75% of customers fall, and how wide the middle 50% of tips
is.
2.) Shows employees what size of tip to expect, especially on
days that bring in a lot of tips.
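To make the quartile interpretation concrete, here is a small sketch in base R. The vector `toy_tip` is made up for illustration and is not part of the Tips dataset. Note that `fivenum()` returns Tukey's hinges, which can differ slightly from the quartiles printed by `summary()` on other data.

```r
# Made-up tip amounts in dollars, for illustration only (not from the Tips dataset)
toy_tip <- c(1.00, 1.50, 2.00, 2.50, 3.00, 3.50, 4.00, 5.00, 10.00)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(toy_tip)
names(five) <- c("min", "q1", "median", "q3", "max")

# Width of the middle half of the tips
iqr <- five[["q3"]] - five[["q1"]]

# Share of customers tipping at or below each quartile
share_below <- sapply(five[c("q1", "median", "q3")],
                      function(cut) mean(toy_tip <= cut))

five
iqr
share_below
summary(toy_tip)   # quantile-based quartiles plus the mean
```

With only nine observations the shares are close to, but not exactly, 25%, 50%, and 75%; the approximation improves with larger samples.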
iii.) Graphical Summaries
a.) Approach: Find the following graphical summaries
i.) Bar chart of days of the week
ii.) Scatterplot -> x = total bill, y = tips (include a regression line)
iii.) Boxplots for five-number summaries from total bill and tips
variable
iv.) Segmented bar chart of the proportion of tips
b.) Reason: to gain a visual understanding of the data that's been
collected, and to help simplify the process of drawing conclusions from the data
c.) Questions addressed:
1.) Day with lowest tips?
2.) Day with highest tips?
3.) IQR: what range of tips did the middle 50% of the sample pay?
4.) Min. and max. tip for each day?
iv.) Regression
a.) Approach: Run a simple linear regression of the tip received on the
total bill of the customer
b.) Reason: to see how total bill and tip are related
c.) Questions addressed:
1.) The higher the total bill, the higher the tips?
2.) The lower the total bill, the lower the tips?
3.) Upward trend?
4.) Results
Bar Chart for Days of the Week
Scatterplot of tip vs. total bill
Boxplot of 5-number summary: total bill variable
Boxplot of 5-number summary: tip variable
Boxplot of Tip vs. Day of the Week
Summary for Day of the Week
 Fri  Sat  Sun Thur
  19   87   76   62
Tip Total for Dataset
[1] 731.6
Total Tip on Thursday (# times tipped*mean)
[1] 171.8
Total Tip on Friday (# times tipped*mean)
[1] 51.96
Total Tip on Saturday (# times tipped*mean)
[1] 260.4
Total Tip on Sunday (# times tipped*mean)
[1] 247.4
Percentage of Tips-Thursday
[1] 23.48
Percentage of Tips-Friday
[1] 7.103
Percentage of Tips-Saturday
[1] 35.59
Percentage of Tips-Sunday
[1] 33.81
Numerical Summary: tip variable
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    2.00    2.90    3.00    3.56   10.00
Numerical Summary: total bill variable
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.07   13.30   17.80   19.80   24.10   50.80
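The quartiles in these summaries are enough to work out the IQR and the upper cutoff that a boxplot uses to draw points as outliers (Q3 + 1.5 × IQR, the default in `boxplot()`). The sketch below uses the rounded values as printed above; treating this cutoff as a marker for unusually large tips or bills is our assumption, not part of the original analysis.

```r
# Quartiles and maxima copied from the printed summaries (rounded as reported)
q <- data.frame(
  variable = c("tip", "total_bill"),
  q1  = c(2.00, 13.30),
  q3  = c(3.56, 24.10),
  max = c(10.00, 50.80)
)

q$iqr              <- q$q3 - q$q1          # spread of the middle 50%
q$upper_fence      <- q$q3 + 1.5 * q$iqr   # boxplot rule for high outliers
q$has_high_outlier <- q$max > q$upper_fence

q
```

For both variables the maximum lies above the upper cutoff, which is consistent with a right-skewed distribution (the mean exceeds the median in both summaries).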
Linear Regression Analysis: tip vs. total bill
Call:
lm(formula = tip ~ total_bill)
Residuals:
   Min     1Q Median     3Q    Max
-3.198 -0.565 -0.097  0.486  3.743
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.92027    0.15973    5.76  2.5e-08 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.02 on 242 degrees of freedom
Multiple R-squared:  0.457, Adjusted R-squared:  0.454
F-statistic:  203 on 1 and 242 DF,  p-value: <2e-16
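For a regression with a single predictor, two textbook identities tie the printed statistics together: R-squared is the square of the Pearson correlation between tip and total bill, and the F-statistic is the square of the slope's t value. The sketch below applies them to the rounded values in the output; the slope (0.105025) is taken from the fitted equation reported in section 5, and the results are only as precise as those rounded inputs.

```r
# Values copied from the regression output (rounded as printed)
r_squared <- 0.457
f_stat    <- 203
slope     <- 0.105025   # from the fitted equation in section 5

# Correlation has the same sign as the slope in simple linear regression
r <- sign(slope) * sqrt(r_squared)

# With one predictor, F = t^2 for the slope
t_slope <- sqrt(f_stat)

round(r, 3)        # approximate correlation between tip and total bill
round(t_slope, 2)  # approximate t value for the slope
```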
Summary of Tip vs. Day of the Week
day: Fri Min. 1st Qu. Median Mean 3rd Qu. Max.
day: Sat Min. 1st Qu. Median Mean 3rd Qu. Max.
day: Sun Min. 1st Qu. Median Mean 3rd Qu. Max.
day: Thur
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.25    2.00    2.30    2.77    3.36    6.70
5.) The fitted statistical model relating the tip earned to the customer's total bill, taken from the regression output above, is:
tip = 0.920270 + 0.105025*total_bill
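As a rough illustration of how the fitted equation could be used, the sketch below predicts the tip for a given bill and measures how far an observed tip falls from that prediction. The function name `predict_tip` and the example customer (a $20.00 bill with a $2.00 tip) are hypothetical; only the two coefficients come from our output.

```r
# Coefficients from the fitted model above
intercept <- 0.920270
slope     <- 0.105025

# Hypothetical helper: expected tip (in dollars) for a given total bill
predict_tip <- function(total_bill) intercept + slope * total_bill

predict_tip(20)      # expected tip on a $20.00 bill
predict_tip(19.80)   # at the mean bill, this is close to the mean tip of 3.00

# Made-up customer: $20.00 bill, $2.00 tip
residual <- 2.00 - predict_tip(20)
residual             # negative: this customer tipped less than the model expects
```

A residual can be judged against the residual standard error of 1.02 from the regression output: a customer about one dollar below the line is within the typical scatter, so calling them an under-tipper would need a larger gap.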
6.) Summary of Analysis/Findings
After conducting statistical analysis on the Tips dataset, we found that the
day with the highest total tips is Saturday ($260.39) and the day with the
lowest total tips is Friday ($51.96). Saturday tips accounted for about 36
percent of the waiter's overall tips, while Friday tips accounted for about
7 percent. These totals partly reflect how many tips were recorded each day
(87 on Saturday, 19 on Friday); the mean tip was highest on Sunday (3.255
dollars) and lowest on Friday (2.735 dollars). In the simple linear
regression of tip on total bill, the p-value is below 2e-16, so we reject
the null hypothesis that the slope is equal to 0 (p-value < 0.05). Because
the estimated slope is positive (0.105), we have evidence of a positive
linear association between tip and total bill: each additional dollar on
the bill is associated with about $0.105 more in tip on average. Total bill
explains about 46 percent of the variation in tip (R-squared = 0.457).
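Because the daily totals are the product of the number of tips recorded and the mean tip, a day can rank first in total tips without having the largest typical tip. A minimal sketch, using the per-day counts and means that are hard-coded in the appendix (the vector names are ours), shows the two rankings side by side:

```r
# Counts and mean tips per day, as hard-coded in the appendix
n_tips   <- c(Thur = 62, Fri = 19, Sat = 87, Sun = 76)
mean_tip <- c(Thur = 2.771, Fri = 2.735, Sat = 2.993, Sun = 3.255)

# Total tip per day = (# times tipped) * (mean tip)
total_tip <- n_tips * mean_tip

names(which.max(total_tip))   # "Sat": most tip dollars overall
names(which.max(mean_tip))    # "Sun": largest average tip
names(which.min(total_tip))   # "Fri": fewest tip dollars overall
names(which.min(mean_tip))    # "Fri": smallest average tip
```

For an employee choosing shifts, the total reflects how busy a day is, while the mean reflects what a single table is likely to leave.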
7.) Reference
Bryant, P. G. and Smith, M. (1995). Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing.
8.) Appendix
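{r setoptions, echo=FALSE}
# results = "asis" writes raw output into the report; the default "markup" keeps printed tables aligned
knitr::opts_chunk$set(echo = FALSE, results = "asis")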
{r}
# reshape2 provides the tips data set; ggplot2 provides qplot()
library(reshape2)
library(ggplot2)
# View(tips) is omitted: it needs an interactive display and fails when knitting
attach(tips)  # lets the later chunks refer to tip, total_bill and day directly
# Bar chart of the number of tips recorded on each day
qplot(day, data = tips, geom = "bar")
{r}
library(ggplot2)
qplot(total_bill, tip, data = tips)
{r}
# Boxplots of total bill, tip, and tip split by day of the week
boxplot(total_bill, ylab = "Total Bill Amount")
boxplot(tip, ylab = "Tip Amount")
boxplot(tip ~ day, xlab = "Day of the Week", ylab = "Tip Amount")
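# Number of tips recorded on each day
summary(day)
# Total of all tips in the dataset
tiptotal <- sum(tip)
tiptotal
# Total tip per day = (# times tipped) * (mean tip that day); the counts come
# from summary(day) and the means from by(tip, day, summary) below
totthur <- 62 * 2.771
totthur
totfri <- 19 * 2.735
totfri
totsat <- 87 * 2.993
totsat
totsun <- 76 * 3.255
totsun
# Each day's total as a percentage of all tips
y1 <- (totthur * 100) / tiptotal
y1
y2 <- (totfri * 100) / tiptotal
y2
y3 <- (totsat * 100) / tiptotal
y3
y4 <- (totsun * 100) / tiptotal
y4
# Numerical summaries of tip and total bill
summary(tip)
summary(total_bill)
# Simple linear regression of tip on total bill
regmodel1 <- lm(tip ~ total_bill)
summary(regmodel1)
# Summary of tip for each day of the week
by(tip, day, summary)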