1 Research question (2.5 points)

1.1 Clear research question that can be tested statistically (1.5 points)

Tipping culture is interesting to observe because it reflects a country's social norms and generosity. In the US in particular, where the social pressure to tip is high, differences in tipping habits are worth examining. One such difference is whether tipping norms, and with them “generosity”, differ between the genders. Hence, this research question arises: Is there a difference between the relative amounts tipped by men and women?

1.2 Which variables need to be collected to answer your research question (1 point)

This research question concerns the whole population of men and women in the US. However, this population cannot easily be examined: much transaction data is unavailable or inaccessible, and gathering it would be very resource-intensive. Therefore, a sample must be drawn to investigate the research question. The sample should cover two variables of interest: the gender of each observation and the corresponding relative amount tipped.

2 Data (2.5 points)

2.1 Import of the data, presentation of the data using a function head, definition of the variables used (0.5 points)

The data is imported with read.csv(), and the first six rows are displayed with head().

dt <- read.csv("./tip.csv", header = TRUE, sep = ",", dec = ".")

head(dt)
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

Furthermore, all non-numeric variables are converted to factors.

dt$sex <- factor(dt$sex,
             levels = c("Female", "Male"),
             labels = c("Female", "Male"))

dt$smoker <- factor(dt$smoker,
                    levels = c("No", "Yes"),
                    labels = c("No", "Yes"))

dt$day <- factor(dt$day,
                 levels = c("Thur", "Fri", "Sat", "Sun"),
                 labels = c("Thur", "Fri", "Sat", "Sun"))

dt$time <- factor(dt$time,
                  levels = c("Lunch", "Dinner"),
                  labels = c("Lunch", "Dinner"))

Explanation of all variables:

  • total_bill: Variable defining the total amount of the bill in US$ (numeric)
  • tip: Variable defining the amount of the tip in US$ (numeric)
  • sex: Variable defining the gender of an observation - Female or Male (factor)
  • smoker: Variable defining if an observation is a smoker or not - No or Yes (factor)
  • day: Variable defining the day of the week the observation occurred - Thur, Fri, Sat, or Sun (factor)
  • time: Variable defining whether the meal of a specific observation was lunch or dinner - Lunch or Dinner (factor)
  • size: Variable defining the size of the party for a specific observation, in number of people (numeric)

This dataset contains many useful variables, but not the most important one: there is no variable for the tip relative to the total bill. This variable is needed to make tips on bills of different sizes comparable. Therefore, it is created with the following code:

dt$tip_percent <- dt$tip/dt$total_bill

This variable gives the tip as a proportion of the total bill; for example, a value of 0.16 corresponds to a tip of 16% of the bill.
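As a quick check of the scale, the calculation can be reproduced for the first two rows shown in the head() output above. This is only a minimal sketch; the data frame name example is hypothetical and not part of the analysis.

```r
# First two rows from the head() output above
example <- data.frame(total_bill = c(16.99, 10.34),
                      tip        = c(1.01, 1.66))

# Same calculation as for dt$tip_percent: tip as a proportion of the bill
example$tip_percent <- example$tip / example$total_bill

# Multiply by 100 to express the proportion as a percentage
pct <- round(100 * example$tip_percent, 1)
pct
## [1]  5.9 16.1
```

The stored values stay on the 0 to 1 scale; the multiplication by 100 is only for reading them as percentages.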

2.2 Definition of the unit of observation and the sample size (0.5 points)

The unit of observation is a group of customers in a restaurant on a given day. The sample size of this dataset can be determined using the following code:

nrow(dt)
## [1] 244

We see that this sample has 244 observations.

2.3 Source of the data set (0.5 points)

The sample is from kaggle and can be found here.

2.4 Basic descriptive statistics (1 point) - estimate a few parameters (e.g., functions summary, describe, etc.) and explanation

Descriptive Statistics

When using the describe() function from the psych library, it is only sensible to use numeric data. Therefore, I will run this command excluding non-numeric data.

library(psych)
describe(dt[sapply(dt, is.numeric)])
##             vars   n  mean   sd median trimmed  mad  min   max range skew
## total_bill     1 244 19.79 8.90  17.80   18.73 7.46 3.07 50.81 47.74 1.12
## tip            2 244  3.00 1.38   2.90    2.84 1.33 1.00 10.00  9.00 1.45
## size           3 244  2.57 0.95   2.00    2.42 0.00 1.00  6.00  5.00 1.43
## tip_percent    4 244  0.16 0.06   0.15    0.16 0.05 0.04  0.71  0.67 3.31
##             kurtosis   se
## total_bill      1.14 0.57
## tip             3.50 0.09
## size            1.63 0.06
## tip_percent    26.31 0.00

What we can see from this output is that we have four numeric variables, and for each observation, there is a value for each of those variables (always n=244). Let me take total_bill to describe what the individual parameters mean.

As already discussed, n reflects the number of observations of that variable in the sample. So in total, there are 244 different observations for total_bill.

The mean of total_bill is the average value of the observations, calculated by dividing the sum of all values by the number of values. Here, the mean of 19.79 means that the average bill in this sample was $19.79.

Sd stands for standard deviation and is a measure of spread. It describes how far the data is spread around the sample mean. If the data were approximately normally distributed, about 68% of the observations would lie within one standard deviation of the mean (19.79 ± 8.9); since total_bill is skewed, this is only a rough guide here.

The median gives us information about the true middle of the observations. This means that for total_bill, 50% of observations have a value above 17.8, and 50% of observations have a value below 17.8.

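The mad is the median absolute deviation: the median of the absolute distances between the observations and the median, which R multiplies by a constant (1.4826) by default to make it comparable to the standard deviation under normality. It is a measure of spread that is robust to outliers. The value 7.46 means that a typical observation lies roughly 7.46 away from the median on this scaled measure.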

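Trimmed stands for trimmed mean, that is, the mean after removing a fixed share of the lowest and highest observations (10% at each end by default in describe()). This makes it less sensitive to extreme values than the ordinary mean.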

Min and max are pretty self-explanatory; in this case, 3.07 is the lowest price for a meal paid and 50.81 is the highest price for a meal paid. Range is the absolute difference of min and max.

The skew indicates whether the distribution of this variable is asymmetric; in a skewed distribution, the mean and the median typically differ. Since the value 1.12 is positive, the data is skewed to the right, with a long tail of high values. This is typical for spending data.

Kurtosis is a measure of how heavy the tails of a distribution are. The higher the value, the more observations lie far out in the tails, i.e., the more prone the variable is to outliers. The very high value for tip_percent (26.31) reflects a few extreme tip proportions.

Lastly, the se, the standard error, is a measure of how much the sample mean is expected to vary around the true mean from sample to sample. It is calculated as the standard deviation divided by the square root of n.
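The parameters described above can also be computed directly in base R. The following is a minimal sketch on a small hypothetical vector x (not taken from tip.csv), followed by a check of the standard error of total_bill from the sd and n reported by describe().

```r
# Hypothetical bill amounts, for illustration only (not from tip.csv)
x <- c(8, 10, 12, 14, 15, 17, 18, 21, 25, 48)

n       <- length(x)
se      <- sd(x) / sqrt(n)       # standard error of the mean
trimmed <- mean(x, trim = 0.1)   # drops the lowest and highest 10% of values
mad_x   <- mad(x)                # median absolute deviation, scaled by 1.4826

c(mean = mean(x), median = median(x), trimmed = trimmed, mad = mad_x, se = se)

# Standard error of total_bill from the values reported above (sd = 8.90, n = 244)
se_total_bill <- round(8.90 / sqrt(244), 2)
se_total_bill
## [1] 0.57
```

In the hypothetical vector, the single large value (48) pulls the mean (18.8) above both the median (16) and the trimmed mean (16.5), which is the same pattern as in the right-skewed total_bill.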

Summary Statistics

For non-numeric variables it is best to analyse frequencies of observations to get an overview.

summary(dt[!sapply(dt, is.numeric)])
##      sex      smoker      day         time    
##  Female: 87   No :151   Thur:62   Lunch : 68  
##  Male  :157   Yes: 93   Fri :19   Dinner:176  
##                         Sat :87               
##                         Sun :76
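To put these frequencies into perspective, the counts can be converted into shares. A minimal sketch, using the counts for sex reported in the output above (the vector name sex_counts is an assumption for illustration):

```r
# Counts taken from the summary() output above
sex_counts <- c(Female = 87, Male = 157)

# Share of each group in the sample of 244 observations
sex_share <- round(prop.table(sex_counts), 3)
sex_share
## Female   Male 
##  0.357  0.643
```

Roughly 36% of the observations are female and 64% are male, so the two groups compared later are of unequal size.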

3 Analysis (7.5 points)

3.1 Determine which statistical test to use and why (1 point)

The variable of interest is the tip as a proportion of the total bill. This variable is numeric and continuous. Within this variable, I hypothesize differences between two groups, namely men and women. Since the grouping variable is categorical with two levels, a t-test is the natural starting point. The measurements in this sample are independent of each other, and each observation is measured only once. Therefore, an independent samples t-test or, if its assumptions are not met, the appropriate non-parametric alternative should be used for the analysis.

3.2 Evaluate all assumptions (1.5 points)

Normality Assumption

The variable of interest, in this case, tip percentage of total bill, must be normally distributed for each of the two populations. Since we do not have population data, we use sample data. To test normality for each of the groups, one uses a Shapiro-Wilk normality test. This test has normality as the null hypothesis.

library(rstatix)
library(dplyr)

result_sw <- dt %>% 
  group_by(sex) %>% 
  shapiro_test(tip_percent)

print(result_sw)
## # A tibble: 2 × 4
##   sex    variable    statistic        p
##   <fct>  <chr>           <dbl>    <dbl>
## 1 Female tip_percent     0.898 4.72e- 6
## 2 Male   tip_percent     0.745 3.22e-15

The groupwise Shapiro-Wilk test showed that the distribution of the tip percentage among females, as well as the distribution of tip percentage among males, departed significantly from normality (W = 0.898, p < 0.001; W = 0.745, p < 0.001). Therefore, we can reject the null hypothesis of normality for both groups. This already indicates that I have to use a non-parametric test; however, it might still be good practice to test the other assumptions.
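The same groupwise test can also be run with base R's shapiro.test(). The sketch below uses simulated, right-skewed data rather than the tips data; the names sim, group, and value are hypothetical.

```r
set.seed(1)

# Simulated right-skewed data, for illustration only (not the tips data)
sim <- data.frame(
  group = rep(c("A", "B"), each = 50),
  value = c(rexp(50), rexp(50))
)

# Run shapiro.test() within each group and collect W and the p-value
sw <- sapply(split(sim$value, sim$group), function(v) {
  res <- shapiro.test(v)
  c(W = unname(res$statistic), p = res$p.value)
})

# One row per group, one column per quantity
round(t(sw), 4)
```

As in the output above, a small p-value for a group would lead to rejecting the null hypothesis of normality for that group.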

Independence Assumption

It is reasonably safe to assume that the two samples are independent, since a person is very unlikely to appear in both gender groups. However, because this data was collected on multiple days, we cannot rule out that some people visited the restaurant more than once and were therefore sampled twice. For now, we assume that this was not the case.

Equal Variance

To test for equal variances in the two samples, Levene’s test is applied. Its null hypothesis is that the variances are equal. Levene’s test can be conducted in the following way.

library(car)

leveneTest(dt$tip_percent, dt$sex)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.4592 0.4986
##       242

We cannot reject the null hypothesis of Levene’s test of equal variances at a 5% alpha level, F(1, 242) = 0.46, p = 0.499. Therefore, equal variance can be assumed. However, since the normality assumption has been rejected anyway, I need to use a non-parametric alternative.
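To make clear what the test measures: with center = median, as shown in the output header, Levene’s test amounts to a one-way ANOVA on the absolute deviations of each value from its own group median. The following base R sketch illustrates this on simulated data (not the tips data); the names sim, abs_dev, and levene_tab are hypothetical.

```r
set.seed(2)

# Simulated data, for illustration only (not the tips data)
sim <- data.frame(
  group = factor(rep(c("A", "B"), times = c(30, 40))),
  value = c(rnorm(30, sd = 1), rnorm(40, sd = 1.2))
)

# Absolute deviation of each value from the median of its own group
sim$abs_dev <- abs(sim$value - ave(sim$value, sim$group, FUN = median))

# One-way ANOVA on the absolute deviations: F statistic and p-value of the test
levene_tab <- anova(lm(abs_dev ~ group, data = sim))
levene_tab
```

If one group were much more spread out than the other, its absolute deviations would be larger on average, and the ANOVA would return a large F value.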

Graphical Analysis

It is also possible to assess the normality and the equal variance assumptions graphically using histograms. We can do that with the following code:

library(ggplot2)

gg_female <- ggplot(dt[dt$sex == "Female",], aes(x = tip_percent))+
  geom_histogram(binwidth = 0.005)+
  ggtitle("Female")+
  xlab("Tip in percent")+
  xlim(0,0.9)

gg_male <- ggplot(dt[dt$sex == "Male",], aes(x = tip_percent))+
  geom_histogram(binwidth = 0.005)+
  ggtitle("Male")+
  xlab("Tip in percent")+
  xlim(0,0.9)

library(ggpubr)
ggarrange(gg_female, gg_male,
          nrow = 2)

What we can see in this graphical analysis is consistent with what we observed with the tests above. We can see that we have to reject the normality assumption because the histograms for both groups do not appear very normally distributed. The histogram for females looks somewhat bimodal, and the one for males has multiple peaks. Additionally, both histograms have at least one outlier.

Regarding the equal variance assumption, because the x-axis has the same range in both plots and because they are arranged on top of each other, it is quite easy to compare them visually. The spread in both plots looks similar, which supports the finding of Levene’s test that the variances are roughly equal.

Nevertheless, since we cannot assume normality, we require a Wilcoxon rank sum test to analyze this relationship.

3.3 Perform the appropriate statistical test based on the results of the assumption evaluation and its interpretation (2.5 points)

The Wilcoxon rank sum test can be performed in the following way:

wilcox.test(dt$tip_percent ~ dt$sex,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  dt$tip_percent by dt$sex
## W = 7619, p-value = 0.1349
## alternative hypothesis: true location shift is not equal to 0

The null hypothesis of the Wilcoxon rank sum test states that the distributions of the two groups have the same location. With W = 7619 and p = 0.135, we fail to reject the null hypothesis at a 5% alpha level. Therefore, we fail to find a significant difference in the location of the two distributions. This can be illustrated with the below density plot of both distributions.

ggplot(dt, aes(x = tip_percent, color = sex, fill = sex)) +
  geom_density(size = 1, alpha = 0.5) +
  labs(title = "Density Plot of Tip Percent by Sex",
       x = "Tip Percent",
       y = "Density")

What we can see from this plot is that the distributions are pretty much overlapping. Therefore, the Wilcoxon rank sum test cannot find sufficient evidence for a difference here.
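To show where the W statistic comes from, it can be computed by hand: all values are ranked together, the ranks of the first group are summed, and the smallest possible rank sum for that group is subtracted. A minimal sketch with hypothetical tip proportions (not taken from the dataset):

```r
# Hypothetical tip proportions, for illustration only
female <- c(0.12, 0.18, 0.20, 0.25)
male   <- c(0.10, 0.14, 0.15, 0.16, 0.22)

# Rank all values together, then sum the ranks of the first group
ranks <- rank(c(female, male))
n1    <- length(female)
W     <- sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2
W
## [1] 14

# Same statistic as reported by wilcox.test()
W_test <- unname(wilcox.test(female, male)$statistic)
W_test
## [1] 14
```

W counts the number of female-male pairs in which the female value is larger. Here that is 14 out of 4 × 5 = 20 pairs; a value near half the number of pairs indicates strongly overlapping distributions.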

3.4 Calculation of the effect size and its interpretation (2.5 points)

The effect size can be calculated and interpreted as follows:

library(effectsize)

effectsize(wilcox.test(dt$tip_percent ~ dt$sex,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## 0.12              | [-0.04, 0.26]
interpret_rank_biserial(0.12)
## [1] "small"
## (Rules: funder2019)

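We can see that the effect size, r = 0.12, is small based on the rules by Funder and Ozer (2019). In addition, the 95% confidence interval [-0.04, 0.26] includes zero, which is consistent with the non-significant test result.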
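The rank-biserial correlation can be related back to the test statistic. As a sketch, assuming the usual formula r = 2W / (n1 × n2) − 1, the reported value can be reproduced from W and the two group sizes given earlier in this report:

```r
W  <- 7619   # Wilcoxon rank sum statistic reported above
n1 <- 87     # number of Female observations
n2 <- 157    # number of Male observations

# Share of Female-Male pairs favouring Female, rescaled from [0, 1] to [-1, 1]
r_rb <- 2 * W / (n1 * n2) - 1
round(r_rb, 2)
## [1] 0.12
```

In other words, in about 56% of all female-male pairs the female observation has the higher tip proportion, and in about 44% the male observation does; the difference between these two shares is the effect size of 0.12.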

4 Conclusion (2.5 points)

Clear answer to your research question based on the results of the statistical test performed (2.5 points)

Research Question:

Is there a difference between the relative amounts tipped by men and women?

To examine this research question, a Wilcoxon rank sum test was applied. This test was used because the normality assumption, tested with a groupwise Shapiro-Wilk test (W = 0.898, p < 0.001; W = 0.745, p < 0.001), was violated for both groups, and therefore, a non-parametric alternative to the two-sample t-test was required. The Wilcoxon rank sum test yielded W = 7619, p = 0.135, with a small effect size (r = 0.12), and hence we fail to reject the null hypothesis. Therefore, we find no statistically significant difference in the relative amounts tipped by men and women.


Note:

  • If data is not related to business or economics —> 0 points
  • If completely different functions/interpretations than those used in class are used (without proper citation) —> 0 points