mydata <- read.csv("~/Desktop/IMB/2. SEMESTER/MULTIVARIATE ANALYSIS/HOMEWORKS/HW2/menu.csv")

Removing the columns that will not be relevant for the further analysis.

mydata1 <- mydata[c(-3,-5,-7,-8,-9,-10,-11,-12,-13,-14,-16,-18,-21,-22,-23,-24)]
head(mydata1,10)
##     Category                                                          Item
## 1  Breakfast                                                  Egg McMuffin
## 2  Breakfast                                             Egg White Delight
## 3  Breakfast                                              Sausage McMuffin
## 4  Breakfast                                     Sausage McMuffin with Egg
## 5  Breakfast                              Sausage McMuffin with Egg Whites
## 6  Breakfast                                          Steak & Egg McMuffin
## 7  Breakfast                 Bacon, Egg & Cheese Biscuit (Regular Biscuit)
## 8  Breakfast                   Bacon, Egg & Cheese Biscuit (Large Biscuit)
## 9  Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Regular Biscuit)
## 10 Breakfast   Bacon, Egg & Cheese Biscuit with Egg Whites (Large Biscuit)
##    Calories Total.Fat Carbohydrates Dietary.Fiber Sugars Protein
## 1       300        13            31             4      3      17
## 2       250         8            30             4      3      18
## 3       370        23            29             4      2      14
## 4       450        28            30             4      2      21
## 5       400        23            30             4      2      21
## 6       430        23            31             4      3      26
## 7       460        26            38             2      3      19
## 8       520        30            43             3      4      19
## 9       410        20            36             2      3      20
## 10      470        25            42             3      4      20

Renaming the columns in order to be grouped easier and better understand each of the variables in the data set.

colnames(mydata1) <- c("Category", "Name", "Calories", "Fat", "Carbs", "Fiber", "Sugar", "Protein")
mydata1[mydata1 == 'Beef & Pork'] <- 'Main'
mydata1[mydata1 == 'Chicken & Fish'] <- 'Main'
mydata1[mydata1 == 'Snacks & Sides'] <- 'Sides'
head(mydata1,10)
##     Category                                                          Name
## 1  Breakfast                                                  Egg McMuffin
## 2  Breakfast                                             Egg White Delight
## 3  Breakfast                                              Sausage McMuffin
## 4  Breakfast                                     Sausage McMuffin with Egg
## 5  Breakfast                              Sausage McMuffin with Egg Whites
## 6  Breakfast                                          Steak & Egg McMuffin
## 7  Breakfast                 Bacon, Egg & Cheese Biscuit (Regular Biscuit)
## 8  Breakfast                   Bacon, Egg & Cheese Biscuit (Large Biscuit)
## 9  Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Regular Biscuit)
## 10 Breakfast   Bacon, Egg & Cheese Biscuit with Egg Whites (Large Biscuit)
##    Calories Fat Carbs Fiber Sugar Protein
## 1       300  13    31     4     3      17
## 2       250   8    30     4     3      18
## 3       370  23    29     4     2      14
## 4       450  28    30     4     2      21
## 5       400  23    30     4     2      21
## 6       430  23    31     4     3      26
## 7       460  26    38     2     3      19
## 8       520  30    43     3     4      19
## 9       410  20    36     2     3      20
## 10      470  25    42     3     4      20

The unit of observation in my sample is an item on the McDonald’s menu. The original size of the data set 260 units of observation (McDonald’s Menu Items) with 24 variables. I removed columns that included variables that did not hold potential for further hypothesis testing. After this the data set (mydata1) was left with 8 variables. The variables that were left are:

The source of the above data was found on the Kaggle website, the author is McDonald’s. Retrieved January 9th, 2023, from https://www.kaggle.com/datasets/mcdonalds/nutrition-facts.

The research questions will be different for each type of hypothesis testing method and will be listed after each of the hypothesis statements.

Retrieving the descriptive statistics to get a short overview of the data set and the corresponding variables.

get_summary_stats(mydata1[c(-1,-2)])
## # A tibble: 6 × 13
##   variable     n   min   max median     q1    q3   iqr    mad   mean     sd
##   <fct>    <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1 Calories   260     0  1880  340   210    500   290   208.   368.   240.  
## 2 Fat        260     0   118   11     2.38  22.2  19.9  14.8   14.2   14.2 
## 3 Carbs      260     0   141   44    30     60    30    20.8   47.3   28.3 
## 4 Fiber      260     0     7    1     0      3     3     1.48   1.63   1.57
## 5 Sugar      260     0   128   17.5   5.75  48    42.2  24.5   29.4   28.7 
## 6 Protein    260     0    87   12     4     19    15    11.9   13.3   11.4 
## # … with 2 more variables: se <dbl>, ci <dbl>

HYPOTHESIS TESTING METHODS

TESTING HYPOTHESES ABOUT THE DIFFERENCES IN TWO POPULATIONS ARITHMETIC MEAN/LOCATIONS OF THE DISTRIBUTION (USING: INDEPENDENT T-TEST/WILCOXON RANK SUM TEST)

Research question: Is there a difference between the amount of calories that items which are part of the Breakfast category and those items that are part of the Main (courses) category contain?

Hypothesis: Items that are part of the Breakfast category contain a different amount of calories than the items that are part of the Main (courses) category.

H0: The arithmetic means of Breakfast items and Main items on the McDonald’s menu are the same (MeanBreakfast = MeanMain). -> Distribution locations (of Breakfast and Main) are the same.

H1: The arithmetic means of Breakfast items and Main items on the McDonald’s menu are different (MeanBreakfast ≠ MeanMain). -> Distribution locations (of Breakfast and Main) are not the same.

Before we can decide on which testing method (test) to perform, we have to check whether the conditions and assumptions for the parametric test are fulfilled. The conditions and assumptions are, that the variable has to be numeric (which calories are, so this condition is met), that the distribution of the variable is normal for both populations (normality), the data comes from two populations (Breakfast and Mains) and that the variable has the same variance in both populations.

To check for normality, first create a histogram for both categories and then use the Shapiro-Wilk test, to further confirm the assumptions.

Breakfast_distribution <- mydata1  %>% 
                            group_by(Category) %>% 
                              filter(Category == 'Breakfast') %>% 
                                ggplot(aes(x = Calories)) +
                                geom_histogram(binwidth = 100, colour = "white", fill = "black") +
                                ylab("Frequency") +
                                xlab("Calories of Breakfast Item") +
                                theme_minimal()

Main_distribution <- mydata1  %>% 
                      group_by(Category) %>% 
                        filter(Category == 'Main') %>%
                          ggplot(aes(x = Calories)) +
                          geom_histogram(binwidth = 100, colour = "white", fill = "black") +
                          ylab("Frequency") +
                          xlab("Calories of Main Items") +
                          theme_minimal()

ggarrange(Breakfast_distribution, Main_distribution,
          ncol = 2, nrow = 1)

From the above histograms we can observe that neither of the distributions can be by eye confirmed as normal, that is why the Shapiro-Wilk test (below) has been performed on both categories.

mydata1 %>%
  group_by(Category) %>%
    filter(Category == 'Breakfast') %>% 
      shapiro_test(Calories)
## # A tibble: 1 × 4
##   Category  variable statistic        p
##   <chr>     <chr>        <dbl>    <dbl>
## 1 Breakfast Calories     0.879 0.000368
mydata1 %>%
  group_by(Category) %>%
    filter(Category == 'Main') %>% 
      shapiro_test(Calories)
## # A tibble: 1 × 4
##   Category variable statistic            p
##   <chr>    <chr>        <dbl>        <dbl>
## 1 Main     Calories     0.681 0.0000000288

The results of the Shapiro-Wilk test show that neither distribution of the variables are not normal, therefore the null hypothesis (H0:Distribution is normal) is rejected, both at p-value < 0.001. For this reason I will continue with the non-parametric alternative to the independent samples t-test which is the Wilcoxon Rank Sum Test.

As said, the normality assumption is not met, therefore, the selected test is Wilcoxon Rank Sum Test. For this test the data is filtered in order to only contain the two observed categories (Breakfast and Mains).

mydata_filtered <- mydata1[mydata1$Category == 'Breakfast' | mydata1$Category == 'Main',]

wilcox.test(mydata_filtered$Calories ~ mydata_filtered$Category,
                  paired = FALSE,
                  correct = FALSE,
                  exact = FALSE,
                  alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata_filtered$Calories by mydata_filtered$Category
## W = 866.5, p-value = 0.8896
## alternative hypothesis: true location shift is not equal to 0

Since we allow for the variation in calories to be in both directions (more and less calories in either category) the two sided test was performed. From the p-value being very high (almost 89%), we cannot reject the null hypothesis, meaning the population location distributions cannot be confirmed as different.

effectsize(wilcox.test(mydata_filtered$Calories ~ mydata_filtered$Category,
                  paired = FALSE,
                  correct = FALSE,
                  exact = FALSE,
                  alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## -0.02             | [-0.26, 0.23]
interpret_rank_biserial(0.02)
## [1] "tiny"
## (Rules: funder2019)

The effect size from this test is as can be interpreted as “tiny”, according to the Biserial scale.

From the sample data, we are unable to confirm that the calories in Breakfast menu items are different from the menu items classified as Main (dishes), since the p-value of the test does not meet the required alpha (0.05), furthermore, the effect size is tiny.

TESTING HYPOTHESES ABOUT THE EQUALITY OF THREE OR MORE POPULATION ARITHMETIC MEANS FOR INDEPENDENT SAMPLES (USING: ONE-WAY ANOVA/KRUSKAL-WALLIS RANK SUM TEST)

Research question: Is there a difference between the amount of average calories contained in each menu item from different categories?

Hypothesis: Items that are from the same menu category on average contain a different amount of calories than those items that are included in a different category.

H0: MeanBreakfast = MeanMain = MeanSalads = MeanSides = MeanDesserts = MeanBeverages = MeanCoffee & Tea = MeanSmoothies & Shakes -> All category distribution locations of variable are the same.

H1: At least one mean Mean is different from the rest. -> At least one distribution location of Calories variable is different.

Before deciding whether the testing method will be parametric or non-parametric the conditions and assumptions have to be assessed. The conditions and assumptions for the one-way ANOVA are that, the variable must be numeric (Calories are numeric, so this condition is met), the variable in the population is normally distributed within each of the categories (normality), that the variance of the analyzed variable is the same in all categories (homoscedasticity) and that there are no significant outliers.

First taking a look at some descriptive statistics for each of the categories on the menu.

describeBy(x = mydata1$Calories, group = mydata1$Category)
## 
##  Descriptive statistics by group 
## group: Beverages
##    vars  n  mean    sd median trimmed    mad min max range skew kurtosis    se
## X1    1 27 113.7 99.19    100  109.13 148.26   0 280   280 0.25    -1.25 19.09
## ------------------------------------------------------------ 
## group: Breakfast
##    vars  n   mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 42 526.67 221.68    470  496.76 111.19 150 1150  1000 1.21     1.18
##       se
## X1 34.21
## ------------------------------------------------------------ 
## group: Coffee & Tea
##    vars  n   mean     sd median trimmed    mad min max range skew kurtosis
## X1    1 95 283.89 157.81    270  277.79 148.26   0 760   760 0.46     0.29
##       se
## X1 16.19
## ------------------------------------------------------------ 
## group: Desserts
##    vars n   mean     sd median trimmed    mad min max range  skew kurtosis
## X1    1 7 222.14 108.08    250  222.14 133.43  45 340   295 -0.35    -1.57
##       se
## X1 40.85
## ------------------------------------------------------------ 
## group: Main
##    vars  n  mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 42 531.9 259.29    490  500.59 155.67 190 1880  1690 3.31    14.76
##       se
## X1 40.01
## ------------------------------------------------------------ 
## group: Salads
##    vars n mean     sd median trimmed   mad min max range skew kurtosis    se
## X1    1 6  270 127.44    255     270 170.5 140 450   310 0.21    -1.88 52.03
## ------------------------------------------------------------ 
## group: Sides
##    vars  n   mean     sd median trimmed    mad min max range  skew kurtosis
## X1    1 13 245.77 141.77    260  242.73 118.61  15 510   495 -0.13    -0.88
##       se
## X1 39.32
## ------------------------------------------------------------ 
## group: Smoothies & Shakes
##    vars  n   mean     sd median trimmed    mad min max range skew kurtosis
## X1    1 28 531.43 230.87    540  528.33 296.52 210 930   720 0.07    -1.46
##       se
## X1 43.63

Checking for the normality of distributions of all of the categories for the calories variable using the Shapiro-Wilk Test.

mydata1 %>% 
  group_by(Category) %>% 
  shapiro_test(Calories)
## # A tibble: 8 × 4
##   Category           variable statistic            p
##   <chr>              <chr>        <dbl>        <dbl>
## 1 Beverages          Calories     0.880 0.00472     
## 2 Breakfast          Calories     0.879 0.000368    
## 3 Coffee & Tea       Calories     0.974 0.0539      
## 4 Desserts           Calories     0.930 0.550       
## 5 Main               Calories     0.681 0.0000000288
## 6 Salads             Calories     0.917 0.484       
## 7 Sides              Calories     0.951 0.607       
## 8 Smoothies & Shakes Calories     0.925 0.0472

Checking that the variances of calories are the same in all menu categories, using the Levene test.

leveneTest(Calories ~ Category,
           data = mydata1)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   7  1.9749 0.05895 .
##       252                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the above tables, it can be seen that some of the p-values of the food categories in the Shapiro-Wilk test are under the 0.05 threshold. Therefore, the null hypothesis (H0:Distributions are normal), has to be rejected. In the Levene test, the p-value does not meet the required alpha of 0.05 since it is slightly above this threshold (0.059). Hence, for both reasons, the non-parametric alternative has to be used for testing the hypothesis, in this case the Kruskal-Wallis Rank Sum Test. Therefore, the hypothesis that will be tested is that at least one distribution location of the calories variable is different.

The Kruskal-Wallis test is performed below where the Calories are stated as a function of the Category in which the item is defined as.

kruskal.test(mydata1$Calories ~ mydata1$Category)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  mydata1$Calories by mydata1$Category
## Kruskal-Wallis chi-squared = 120.75, df = 7, p-value < 2.2e-16
kruskal_effsize(Calories ~ Category,
                data = mydata1)
## # A tibble: 1 × 5
##   .y.          n effsize method  magnitude
## * <chr>    <int>   <dbl> <chr>   <ord>    
## 1 Calories   260   0.451 eta2[H] large

With the test returning a p-value of less than 0.001, we can confirm the hypothesis (H1: At least one distribution location of variable Calories is different from the others.). The chi-squared value from the test is 120.75 and the effect size is large (0.45).

Checking the relations directly between the categories using the Wilcoxon test for each pair separatly.

(grouped_nonparametric <- wilcox_test(Calories ~ Category,
                                     paired = FALSE,
                                     p.adjust.method = "bonferroni", 
                                     data = mydata1))
## # A tibble: 28 × 9
##    .y.      group1    group2          n1    n2 stati…¹        p    p.adj p.adj…²
##  * <chr>    <chr>     <chr>        <int> <int>   <dbl>    <dbl>    <dbl> <chr>  
##  1 Calories Beverages Breakfast       27    42    16.5 1.27e-11 3.56e-10 ****   
##  2 Calories Beverages Coffee & Tea    27    95   466   4.66e- 7 1.3 e- 5 ****   
##  3 Calories Beverages Desserts        27     7    40   2   e- 2 5.6 e- 1 ns     
##  4 Calories Beverages Main            27    42    12.5 8.92e-12 2.5 e-10 ****   
##  5 Calories Beverages Salads          27     6    25   9   e- 3 2.43e- 1 ns     
##  6 Calories Beverages Sides           27    13    71   2   e- 3 7   e- 2 ns     
##  7 Calories Beverages Smoothies &…    27    28    24   2.41e- 9 6.75e- 8 ****   
##  8 Calories Breakfast Coffee & Tea    42    95  3348   2.69e-10 7.53e- 9 ****   
##  9 Calories Breakfast Desserts        42     7   276   2.39e- 4 7   e- 3 **     
## 10 Calories Breakfast Main            42    42   866.  8.93e- 1 1   e+ 0 ns     
## # … with 18 more rows, and abbreviated variable names ¹​statistic, ²​p.adj.signif

The post-hoc tests revealed that there are not significant differences between all of the pairs of categories, since when using the Bonferroni adjustment method (p-values multiplied by the number of comparisons) 16 of the 28 pairs (approximately 57%) showed that the p-value was not significant.

TESTING HYPOTHESES ABOUT THE POPULATION ARITHMETIC MEAN (USING: T-TEST/SIGN TEST/WILCOXON SIGNED RANK TEST)

Research question: Does an item on the McDonald’s menu, on average, have more or less than 400 calories (20% of the daily recommended calorie intake for men and 25% of the daily recommended calorie intake for women)?

Hypothesis: An item on the McDonald’s menu, on average, 400 calories.

H0: The average calorie amount of an item on the McDonald’s menu is 400 (Mean = 400). -> The median calorie amount of an item on the McDonald’s menu is 400 (Median = 400).

H1: The average calorie amount of an item on the McDonald’s menu is not 400 (Mean ≠ 400). -> The median calorie amount of an item on the McDonald’s menu is not 400 (Median ≠ 400).

Before deciding for the testing method (test) to perform, the conditions and assumptions for the parametric alternative have to be checked. The conditions and assumptions are, that the variable is numeric (which calories are, so this condition is met), that the distribution of the variable is normal (normality) and that there are no outliers in the population.

Testing the normality assumption of the calories variable for the population (McDonald’s menu). First by plotting the histogram of the distribution of calories and then confirming the assumption using the Shapiro-Wilk test.

mydata1 %>% 
  ggplot(aes(x = Calories)) +
  geom_histogram(binwidth = 100, colour = "white", fill = "black") +
  ylab("Frequency") +
  xlab("Calories") +
  theme_minimal()

shapiro_test(Calories,
             data = mydata1)
## # A tibble: 1 × 3
##   variable statistic        p
##   <chr>        <dbl>    <dbl>
## 1 Calories     0.919 1.12e-10

While the distribution would appear to be close to normal, with a slight right skew, the Shapiro-Wilk test (where H0:Distribution is normal) rejects the null hypothesis. Since the normality assumption is in the case of this data not met, the appropriate testing method is the Wilcoxon Signed Rank Test.

For this hypothesis external research was conducted. The calories of fast food items from other restaurants as well were found on the following websites: https://www.calories.info/food/fast-food and https://medlineplus.gov/ency/patientinstructions/000887.htm. The median from both samples returned a values of approximately 400 (395 and 399 respectively).

The median calorie amount of all items on the McDonald’s menu is calculated bellow, to have a reference point for the following tests.

median(mydata1$Calories)
## [1] 340
wilcox.test(mydata1$Calories,
            mu = 400,
            alternative = "two.sided",
            correct = FALSE)
## 
##  Wilcoxon signed rank test
## 
## data:  mydata1$Calories
## V = 12518, p-value = 0.0006674
## alternative hypothesis: true location is not equal to 400

From the above result we can reject the null hypothesis (H0: The median calorie amount of an item on the McDonald’s menu is 400 (Median = 400)) at p<0.001.

effectsize(wilcox.test(mydata1$Calories,
            mu = 400,
            alternative = "two.sided",
            correct = FALSE))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.24             | [-0.37, -0.11]
## 
## - Deviation from a difference of 400.
interpret_rank_biserial(0.24)
## [1] "medium"
## (Rules: funder2019)

Based on the data, we found that the median calorie amount of an item on the McDonald’s menu was significantly the assumed amount of 400 calories (p < 0.001, r = 0.24 meaning a medium effect). The surprising result is that in comparison to other fast food options, the menu items from McDonald’s contain less calories. (From these statistics we can infer that McDonald’s is healthy :)).