data(tips, package="reshape2")
library(reshape2)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Tips at Restaurants

Due: 3/20/2023 before 11:59pm. Submit in Canvas (file upload).

In this homework we will use the tips data set. This data set is part of the reshape2 package. You can load the data set by executing the command:

data(tips, package="reshape2")

The information contained in the data is collected by one waiter, who recorded over the course of a season information about each tip he received working in one restaurant. See ?tips for a description of all of the variables.

  1. Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.
  2. How many parties did the waiter serve? What type are the variables that he collected? Show your code.
str(tips)
## 'data.frame':    244 obs. of  7 variables:
##  $ total_bill: num  17 10.3 21 23.7 24.6 ...
##  $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
##  $ smoker    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ time      : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size      : int  2 3 3 2 4 4 2 4 2 2 ...

The waiter served 244 parties. He recorded 7 variables: total_bill and tip are numeric; sex, smoker, day, and time are factors; and size is an integer.
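
The same answer can be read off programmatically instead of from the str() printout. This is only a sketch; the names n_parties, var_types, and type_counts are introduced here for illustration.

```r
# Load the data exactly as the homework does.
data(tips, package = "reshape2")

n_parties   <- nrow(tips)           # one row per party served
var_types   <- sapply(tips, class)  # class of every column
type_counts <- table(var_types)     # number of columns of each class

n_parties
var_types
type_counts
```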

  3. Create a new variable size.factor which translates the variable size to a factor. Should size be a factor or a numerical variable? Give your reasoning.
tips$size.factor <- as.factor(tips$size)

Size should be numerical. A factor stores each distinct value as a level, that is, as a category label, so the values can no longer be used in arithmetic. Party size is a count that can take many values, and the data are easier to read and summarize when it stays numeric.
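
A small sketch of the practical difference, assuming the same tips data: mean() works on the numeric column but returns NA (with a warning) on the factor version. The names size_num, size_fac, mean_fac, and size_back are made up for this example.

```r
data(tips, package = "reshape2")

size_num <- tips$size             # integer party size
size_fac <- as.factor(tips$size)  # same values stored as levels

mean(size_num)    # arithmetic works on the numeric column
levels(size_fac)  # a factor only knows its distinct labels

# mean() is not defined for factors: it warns and returns NA.
mean_fac <- suppressWarnings(mean(size_fac))
mean_fac

# Getting numbers back means converting the labels, not the internal codes.
size_back <- as.numeric(as.character(size_fac))
identical(size_back, as.numeric(size_num))
```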

  4. How does the tipping amount (tip) depend on the overall bill (total_bill)? Use the ggplot2 package to show a chart, describe the relationship in words. Describe at least two types of anomalies in the plot. What do they mean?
ggplot(tips, aes(x=total_bill, y = tip)) + geom_point()

There is a positive linear relationship between the total bill and the tip: the higher the bill, the higher the tip.
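
One way to back up the word "linear" is to compute the correlation and overlay a least-squares line on the scatterplot. This is a sketch, not part of the original answer; r, fit, and p are illustrative names.

```r
library(ggplot2)
data(tips, package = "reshape2")

r   <- cor(tips$total_bill, tips$tip)     # Pearson correlation
fit <- lm(tip ~ total_bill, data = tips)  # straight-line fit

# Same scatterplot as above, with the fitted line added.
p <- ggplot(tips, aes(x = total_bill, y = tip)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

r
coef(fit)
p
```

A positive slope for total_bill in coef(fit) corresponds to "the higher the bill, the higher the tip".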

  5. Introduce a variable tiprate into the data set, that incorporates the rate of tips. What information is available for the best tipper, what for the worst? What is the average rate for tips?
tips$tiprate <- tips$tip/tips$total_bill
filter(tips, tiprate==max(tips$tiprate))
##     total_bill  tip  sex smoker day   time size size.factor   tiprate
## 173       7.25 5.15 Male    Yes Sun Dinner    2           2 0.7103448
filter(tips, tiprate==min(tips$tiprate))
##     total_bill  tip  sex smoker day   time size size.factor    tiprate
## 238      32.83 1.17 Male    Yes Sat Dinner    2           2 0.03563814
mean(tips$tiprate)
## [1] 0.1608026

The best tipper is a male smoker in a party of 2 with a total bill of 7.25 and a tip of 5.15 at Sunday dinner, for a tip rate of 0.7103448 (about 71%).

The worst tipper is a male smoker in a party of 2 with a total bill of 32.83 and a tip of 1.17 at Saturday dinner, for a tip rate of 0.03563814 (about 3.6%).

The average tip rate is 0.1608026, or about 16%.
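
As a cross-check, the same rows can be pulled out with which.max() and which.min(). The sketch also contrasts the mean of the individual rates with a pooled rate (total tips over total bills), which weights large bills more heavily and so need not give the same number. The names best, worst, mean_of_rates, and pooled_rate are introduced here for illustration.

```r
data(tips, package = "reshape2")
tips$tiprate <- tips$tip / tips$total_bill

best  <- tips[which.max(tips$tiprate), ]  # row with the highest tip rate
worst <- tips[which.min(tips$tiprate), ]  # row with the lowest tip rate

mean_of_rates <- mean(tips$tiprate)                    # every party counts equally
pooled_rate   <- sum(tips$tip) / sum(tips$total_bill)  # every dollar counts equally

best
worst
mean_of_rates
pooled_rate
```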

  6. How do smoking behavior and gender of the person who pays impact the relationship between tip and total bill? Find a visualization that incorporates all four variables. Interpret the result.
ggplot(data=tips, aes(x=total_bill, y=tip)) + geom_point() +facet_grid(smoker~sex)

There is a linear relationship between the tip amount and the total bill for nonsmokers, with a higher correlation for females.
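
The claim about correlation can be checked numerically by computing it within each of the four smoker-by-sex panels. This is a sketch using dplyr; by_group is an illustrative name.

```r
library(dplyr)
data(tips, package = "reshape2")

# One row per facet: group size and correlation between bill and tip.
by_group <- tips %>%
  group_by(smoker, sex) %>%
  summarise(n = n(), r = cor(total_bill, tip), .groups = "drop")

by_group
```

Comparing the r column across rows shows which panels have the tightest linear relationship.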

  7. Use ggplot2 to find a graphical summary of the relationship between day of the week and gender of the person paying the bill. What can you say about this relationship?
ggplot(tips, aes(x=day, fill=sex)) + geom_bar(position="fill")

On Thursday and Friday, men and women pay the bill about equally often. On Saturday and Sunday, men pay the bill more often than women.
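
The proportions shown by the filled bar chart can also be tabulated directly. A sketch, with counts and shares as illustrative names:

```r
data(tips, package = "reshape2")

counts <- table(tips$day, tips$sex)        # number of bills by day and sex
shares <- prop.table(counts, margin = 1)   # proportions within each day

counts
round(shares, 2)
```

Each row of shares sums to 1, matching the bar heights in the plot.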

  8. Use ggplot2 to make a boxplot of the tiprate on different days. Rank the levels of day by the average of the tiprate. What can you say about this relationship?
ggplot(tips, aes(x=day, y=tiprate)) + geom_boxplot()
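
The boxplot above leaves day in its default level order. One possible way to rank the levels by average tiprate is reorder(), sketched here; day_means and p are illustrative names.

```r
library(ggplot2)
data(tips, package = "reshape2")
tips$tiprate <- tips$tip / tips$total_bill

# Mean tip rate per day, sorted from lowest to highest.
day_means <- sort(tapply(tips$tiprate, tips$day, mean))

# reorder() sorts the levels of day by the mean of tiprate.
p <- ggplot(tips, aes(x = reorder(day, tiprate, FUN = mean), y = tiprate)) +
  geom_boxplot() +
  labs(x = "day (ordered by mean tiprate)")

day_means
p
```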

On Thursdays and Fridays the tip rates fall in the lower range, on Saturdays they are in the middle, and on Sundays they are the highest. There are also outliers on Saturday and Sunday.

Note: your submission is supposed to be fully reproducible, i.e. the TA and I will ‘knit’ your submission in RStudio.

For the submission: submit your solution in an R Markdown file and (just for insurance) submit the corresponding html (or Word) file with it.