data(tips, package="reshape2")
library(reshape2)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
In this homework we will use the tips data set. This
data set is part of the reshape2 package. You can load the
data set by executing the command:
data(tips, package="reshape2")
The information contained in the data is collected by one waiter, who
recorded over the course of a season information about each tip he
received working in one restaurant. See ?tips for a
description of all of the variables.
str(tips)
## 'data.frame': 244 obs. of 7 variables:
## $ total_bill: num 17 10.3 21 23.7 24.6 ...
## $ tip : num 1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
## $ smoker : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ time : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
## $ size : int 2 3 3 2 4 4 2 4 2 2 ...
The waiter served 244 parties. There are 7 variables: total_bill and tip are numeric; sex, smoker, day, and time are factors; and size is an integer.
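The variable types can also be tabulated directly instead of being read off the str() output. This is a minimal sketch; the name var_types is introduced here for illustration only.

```r
# Load the data exactly as the assignment describes.
data(tips, package = "reshape2")

# Class of every column, returned as a named character vector.
var_types <- sapply(tips, class)
print(var_types)

# How many columns fall into each class.
print(table(var_types))
```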
Introduce a variable size.factor which translates the
variable size to a factor. Should size be a
factor or a numerical variable? Give your reasoning.
tips$size.factor <- as.factor(tips$size)
Size should be numerical. A factor stores the data as a fixed set of levels, but party size is a count that can take many values, and the data are harder to read and summarize when it is stored as a factor.
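One way to see the practical difference is to try the same summaries on both versions of the variable. The sketch below assumes only base R and the tips data; the names size_factor, mean_size, factor_mean, size_counts, and size_back are illustrative.

```r
data(tips, package = "reshape2")
size_factor <- as.factor(tips$size)

# The numeric version supports arithmetic summaries.
mean_size <- mean(tips$size)
print(mean_size)

# mean() on a factor returns NA and raises a warning, suppressed here.
factor_mean <- suppressWarnings(mean(size_factor))
print(factor_mean)

# A factor is still convenient for counting parties of each size.
size_counts <- table(size_factor)
print(size_counts)

# Converting back to numbers should go through the level labels,
# because as.numeric() on a factor returns the internal level codes.
size_back <- as.numeric(as.character(size_factor))
```

A factor version remains useful for plots that should treat each party size as its own category, so keeping both variables is a reasonable choice.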
How does the tip amount (tip) depend on the overall
bill (total_bill)? Use the ggplot2 package to
show a chart, describe the relationship in words. Describe at least two
types of anomalies in the plot. What do they mean?
ggplot(tips, aes(x=total_bill, y = tip)) + geom_point()
There is a roughly linear, positive relationship between the bill and the tip: the higher the bill, the higher the tip.
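The strength of this relationship can be checked numerically as well as visually. The following is a sketch, not part of the original solution: it computes the Pearson correlation, fits a simple linear model, and overlays the fitted line on the scatterplot. The names bill_tip_cor, fit, and p are illustrative.

```r
library(ggplot2)
data(tips, package = "reshape2")

# Pearson correlation between bill and tip.
bill_tip_cor <- cor(tips$total_bill, tips$tip)
print(bill_tip_cor)

# Simple linear regression of tip on total bill.
fit <- lm(tip ~ total_bill, data = tips)
print(coef(fit))

# Scatterplot with the least-squares line drawn on top.
p <- ggplot(tips, aes(x = total_bill, y = tip)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p)
```

Points far above or below the fitted line are natural candidates when looking for anomalies in the plot.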
Introduce a variable tiprate into the data set that
incorporates the rate of tips. What information is available for the
best tipper, what for the worst? What is the average rate for tips?
tips$tiprate <- tips$tip/tips$total_bill
filter(tips, tiprate==max(tips$tiprate))
## total_bill tip sex smoker day time size size.factor tiprate
## 173 7.25 5.15 Male Yes Sun Dinner 2 2 0.7103448
filter(tips, tiprate==min(tips$tiprate))
## total_bill tip sex smoker day time size size.factor tiprate
## 238 32.83 1.17 Male Yes Sat Dinner 2 2 0.03563814
mean(tips$tiprate)
## [1] 0.1608026
The best tipper is a male smoker in a party of 2 at Sunday dinner, with a total bill of 7.25, a tip of 5.15, and a tip rate of 0.710 (about 71%).
The worst tipper is a male smoker in a party of 2 at Saturday dinner, with a total bill of 32.83, a tip of 1.17, and a tip rate of 0.0356 (about 3.6%).
The average tip rate is 0.1608026, or about 16%.
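The same rows can be pulled out with base R indexing, which avoids the dependency on dplyr. This is a sketch under the assumption that a single best and a single worst tipper exist; which.max() and which.min() return only the first match, whereas the filter() calls above would return every tied row. The names best, worst, and avg_rate are illustrative.

```r
data(tips, package = "reshape2")
tips$tiprate <- tips$tip / tips$total_bill

# Row with the highest and the lowest tip rate.
best <- tips[which.max(tips$tiprate), ]
worst <- tips[which.min(tips$tiprate), ]
print(best)
print(worst)

# Average tip rate, shown as a percentage rounded to one decimal.
avg_rate <- mean(tips$tiprate)
print(round(100 * avg_rate, 1))
```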
ggplot(data=tips, aes(x=total_bill, y=tip)) + geom_point() +facet_grid(smoker~sex)
There is a linear relationship between the tip amount and the total bill for nonsmokers, with a higher correlation for females.
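A claim about which panel shows the stronger relationship is easier to defend with a number per panel. The sketch below computes the bill-tip correlation for each combination of smoker and sex; the name group_cor and its column names are illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
data(tips, package = "reshape2")

# One row per facet: group size and Pearson correlation of bill and tip.
group_cor <- tips %>%
  group_by(smoker, sex) %>%
  summarise(
    n = n(),
    cor_bill_tip = cor(total_bill, tip),
    .groups = "drop"
  )
print(group_cor)
```

Group sizes are reported alongside the correlations because a correlation from a small group is less reliable.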
ggplot(tips, aes(x=day, fill=sex)) + geom_bar(position="fill")
On Thursday and Friday, men and women pay the bill about equally often, but on Saturday and Sunday men pay the bill more often than women.
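The proportions drawn in the stacked bar chart can be printed as a table, which makes the comparison between days exact. This is a minimal base R sketch; the names day_sex_counts and day_sex_share are illustrative.

```r
data(tips, package = "reshape2")

# Number of bills paid by each sex on each day.
day_sex_counts <- table(tips$day, tips$sex)
print(day_sex_counts)

# Share within each day (rows sum to 1), matching position = "fill".
day_sex_share <- prop.table(day_sex_counts, margin = 1)
print(round(day_sex_share, 2))
```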
tiprate on
different days. Rank the levels of day by the average of
the tiprate. What can you say about this relationship?
ggplot(tips, aes(x=day, y=tiprate)) + geom_boxplot()
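The ranking of days by average tip rate can be computed rather than estimated from the boxplot, which displays medians and not means. The sketch below uses base R; the names day_means and day_ranked are illustrative, and tiprate is recomputed so the block runs on its own.

```r
data(tips, package = "reshape2")
tips$tiprate <- tips$tip / tips$total_bill

# Mean tip rate per day, sorted from lowest to highest.
day_means <- sort(tapply(tips$tiprate, tips$day, mean))
print(day_means)

# Reorder the factor levels of day by mean tip rate (ascending).
tips$day_ranked <- reorder(tips$day, tips$tiprate, FUN = mean)
print(levels(tips$day_ranked))
```

Using day_ranked on the x axis of the boxplot would then draw the days in ranked order.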
In the boxplot, tip rates on Thursdays and Fridays appear lowest, Saturdays fall in the middle, and Sundays appear highest. There are also outliers on Saturday and Sunday.
Note: your submission is supposed to be fully reproducible, i.e. the TA and I will ‘knit’ your submission in RStudio.
For the submission: submit your solution in an R Markdown file and (just for insurance) submit the corresponding html (or Word) file with it.