mydata <- read.csv("~/Desktop/IMB/2. SEMESTER/MULTIVARIATE ANALYSIS/HOMEWORKS/HW2/menu.csv")
Removing the columns that will not be relevant for the further analysis.
mydata1 <- mydata[c(-3,-5,-7,-8,-9,-10,-11,-12,-13,-14,-16,-18,-21,-22,-23,-24)]
head(mydata1,10)
## Category Item
## 1 Breakfast Egg McMuffin
## 2 Breakfast Egg White Delight
## 3 Breakfast Sausage McMuffin
## 4 Breakfast Sausage McMuffin with Egg
## 5 Breakfast Sausage McMuffin with Egg Whites
## 6 Breakfast Steak & Egg McMuffin
## 7 Breakfast Bacon, Egg & Cheese Biscuit (Regular Biscuit)
## 8 Breakfast Bacon, Egg & Cheese Biscuit (Large Biscuit)
## 9 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Regular Biscuit)
## 10 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Large Biscuit)
## Calories Total.Fat Carbohydrates Dietary.Fiber Sugars Protein
## 1 300 13 31 4 3 17
## 2 250 8 30 4 3 18
## 3 370 23 29 4 2 14
## 4 450 28 30 4 2 21
## 5 400 23 30 4 2 21
## 6 430 23 31 4 3 26
## 7 460 26 38 2 3 19
## 8 520 30 43 3 4 19
## 9 410 20 36 2 3 20
## 10 470 25 42 3 4 20
Renaming the columns in order to be grouped easier and better understand each of the variables in the data set.
colnames(mydata1) <- c("Category", "Name", "Calories", "Fat", "Carbs", "Fiber", "Sugar", "Protein")
mydata1[mydata1 == 'Beef & Pork'] <- 'Main'
mydata1[mydata1 == 'Chicken & Fish'] <- 'Main'
mydata1[mydata1 == 'Snacks & Sides'] <- 'Sides'
head(mydata1,10)
## Category Name
## 1 Breakfast Egg McMuffin
## 2 Breakfast Egg White Delight
## 3 Breakfast Sausage McMuffin
## 4 Breakfast Sausage McMuffin with Egg
## 5 Breakfast Sausage McMuffin with Egg Whites
## 6 Breakfast Steak & Egg McMuffin
## 7 Breakfast Bacon, Egg & Cheese Biscuit (Regular Biscuit)
## 8 Breakfast Bacon, Egg & Cheese Biscuit (Large Biscuit)
## 9 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Regular Biscuit)
## 10 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Large Biscuit)
## Calories Fat Carbs Fiber Sugar Protein
## 1 300 13 31 4 3 17
## 2 250 8 30 4 3 18
## 3 370 23 29 4 2 14
## 4 450 28 30 4 2 21
## 5 400 23 30 4 2 21
## 6 430 23 31 4 3 26
## 7 460 26 38 2 3 19
## 8 520 30 43 3 4 19
## 9 410 20 36 2 3 20
## 10 470 25 42 3 4 20
The unit of observation in my sample is an item on the McDonald’s menu. The original size of the data set 260 units of observation (McDonald’s Menu Items) with 24 variables. I removed columns that included variables that did not hold potential for further hypothesis testing. After this the data set (mydata1) was left with 8 variables. The variables that were left are:
Category: The subcategory of each food item that each Menu Item falls into (Breakfast, Main, Salads, Sides, Desserts, Beverages, Coffee & Tea & Smoothies & Shakes).
Menu Item: The name of each food item on the McDonald’s menu.
Calories: Amount of kilo calories of each food item on the McDonald’s menu.
Fat: Amount of fat in grams in each food item on the McDonald’s menu.
Carbs: Amount of carbohydrates in grams in each food item on the McDonald’s menu.
Fiber: Amount of fiber in grams in each food item on the McDonald’s menu.
Sugar: Amount of sugar in grams in each food item on the McDonald’s menu.
Protein: Amount of protein in grams in each food item on the McDonald’s menu.
The source of the above data was found on the Kaggle website, the author is McDonald’s. Retrieved January 9th, 2023, from https://www.kaggle.com/datasets/mcdonalds/nutrition-facts.
The research questions will be different for each type of hypothesis testing method and will be listed after each of the hypothesis statements.
Retrieving the descriptive statistics to get a short overview of the data set and the corresponding variables.
get_summary_stats(mydata1[c(-1,-2)])
## # A tibble: 6 × 13
## variable n min max median q1 q3 iqr mad mean sd
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Calories 260 0 1880 340 210 500 290 208. 368. 240.
## 2 Fat 260 0 118 11 2.38 22.2 19.9 14.8 14.2 14.2
## 3 Carbs 260 0 141 44 30 60 30 20.8 47.3 28.3
## 4 Fiber 260 0 7 1 0 3 3 1.48 1.63 1.57
## 5 Sugar 260 0 128 17.5 5.75 48 42.2 24.5 29.4 28.7
## 6 Protein 260 0 87 12 4 19 15 11.9 13.3 11.4
## # … with 2 more variables: se <dbl>, ci <dbl>
Research question: Is there a difference between the amount of calories that items which are part of the Breakfast category and those items that are part of the Main (courses) category contain?
Hypothesis: Items that are part of the Breakfast category contain a different amount of calories than the items that are part of the Main (courses) category.
H0: The arithmetic means of Breakfast items and Main items on the McDonald’s menu are the same (MeanBreakfast = MeanMain). -> Distribution locations (of Breakfast and Main) are the same.
H1: The arithmetic means of Breakfast items and Main items on the McDonald’s menu are different (MeanBreakfast ≠ MeanMain). -> Distribution locations (of Breakfast and Main) are not the same.
Before we can decide on which testing method (test) to perform, we have to check whether the conditions and assumptions for the parametric test are fulfilled. The conditions and assumptions are, that the variable has to be numeric (which calories are, so this condition is met), that the distribution of the variable is normal for both populations (normality), the data comes from two populations (Breakfast and Mains) and that the variable has the same variance in both populations.
To check for normality, first create a histogram for both categories and then use the Shapiro-Wilk test, to further confirm the assumptions.
Breakfast_distribution <- mydata1 %>%
group_by(Category) %>%
filter(Category == 'Breakfast') %>%
ggplot(aes(x = Calories)) +
geom_histogram(binwidth = 100, colour = "white", fill = "black") +
ylab("Frequency") +
xlab("Calories of Breakfast Item") +
theme_minimal()
Main_distribution <- mydata1 %>%
group_by(Category) %>%
filter(Category == 'Main') %>%
ggplot(aes(x = Calories)) +
geom_histogram(binwidth = 100, colour = "white", fill = "black") +
ylab("Frequency") +
xlab("Calories of Main Items") +
theme_minimal()
ggarrange(Breakfast_distribution, Main_distribution,
ncol = 2, nrow = 1)
From the above histograms we can observe that neither of the distributions can be by eye confirmed as normal, that is why the Shapiro-Wilk test (below) has been performed on both categories.
mydata1 %>%
group_by(Category) %>%
filter(Category == 'Breakfast') %>%
shapiro_test(Calories)
## # A tibble: 1 × 4
## Category variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Breakfast Calories 0.879 0.000368
mydata1 %>%
group_by(Category) %>%
filter(Category == 'Main') %>%
shapiro_test(Calories)
## # A tibble: 1 × 4
## Category variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Main Calories 0.681 0.0000000288
The results of the Shapiro-Wilk test show that neither distribution of the variables are not normal, therefore the null hypothesis (H0:Distribution is normal) is rejected, both at p-value < 0.001. For this reason I will continue with the non-parametric alternative to the independent samples t-test which is the Wilcoxon Rank Sum Test.
As said, the normality assumption is not met, therefore, the selected test is Wilcoxon Rank Sum Test. For this test the data is filtered in order to only contain the two observed categories (Breakfast and Mains).
mydata_filtered <- mydata1[mydata1$Category == 'Breakfast' | mydata1$Category == 'Main',]
wilcox.test(mydata_filtered$Calories ~ mydata_filtered$Category,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata_filtered$Calories by mydata_filtered$Category
## W = 866.5, p-value = 0.8896
## alternative hypothesis: true location shift is not equal to 0
Since we allow for the variation in calories to be in both directions (more and less calories in either category) the two sided test was performed. From the p-value being very high (almost 89%), we cannot reject the null hypothesis, meaning the population location distributions cannot be confirmed as different.
effectsize(wilcox.test(mydata_filtered$Calories ~ mydata_filtered$Category,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.02 | [-0.26, 0.23]
interpret_rank_biserial(0.02)
## [1] "tiny"
## (Rules: funder2019)
The effect size from this test is as can be interpreted as “tiny”, according to the Biserial scale.
From the sample data, we are unable to confirm that the calories in Breakfast menu items are different from the menu items classified as Main (dishes), since the p-value of the test does not meet the required alpha (0.05), furthermore, the effect size is tiny.
Research question: Is there a difference between the amount of average calories contained in each menu item from different categories?
Hypothesis: Items that are from the same menu category on average contain a different amount of calories than those items that are included in a different category.
H0: MeanBreakfast = MeanMain = MeanSalads = MeanSides = MeanDesserts = MeanBeverages = MeanCoffee & Tea = MeanSmoothies & Shakes -> All category distribution locations of variable are the same.
H1: At least one mean Mean is different from the rest. -> At least one distribution location of Calories variable is different.
Before deciding whether the testing method will be parametric or non-parametric the conditions and assumptions have to be assessed. The conditions and assumptions for the one-way ANOVA are that, the variable must be numeric (Calories are numeric, so this condition is met), the variable in the population is normally distributed within each of the categories (normality), that the variance of the analyzed variable is the same in all categories (homoscedasticity) and that there are no significant outliers.
First taking a look at some descriptive statistics for each of the categories on the menu.
describeBy(x = mydata1$Calories, group = mydata1$Category)
##
## Descriptive statistics by group
## group: Beverages
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 27 113.7 99.19 100 109.13 148.26 0 280 280 0.25 -1.25 19.09
## ------------------------------------------------------------
## group: Breakfast
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 42 526.67 221.68 470 496.76 111.19 150 1150 1000 1.21 1.18
## se
## X1 34.21
## ------------------------------------------------------------
## group: Coffee & Tea
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 95 283.89 157.81 270 277.79 148.26 0 760 760 0.46 0.29
## se
## X1 16.19
## ------------------------------------------------------------
## group: Desserts
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 7 222.14 108.08 250 222.14 133.43 45 340 295 -0.35 -1.57
## se
## X1 40.85
## ------------------------------------------------------------
## group: Main
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 42 531.9 259.29 490 500.59 155.67 190 1880 1690 3.31 14.76
## se
## X1 40.01
## ------------------------------------------------------------
## group: Salads
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 6 270 127.44 255 270 170.5 140 450 310 0.21 -1.88 52.03
## ------------------------------------------------------------
## group: Sides
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 13 245.77 141.77 260 242.73 118.61 15 510 495 -0.13 -0.88
## se
## X1 39.32
## ------------------------------------------------------------
## group: Smoothies & Shakes
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 28 531.43 230.87 540 528.33 296.52 210 930 720 0.07 -1.46
## se
## X1 43.63
Checking for the normality of distributions of all of the categories for the calories variable using the Shapiro-Wilk Test.
mydata1 %>%
group_by(Category) %>%
shapiro_test(Calories)
## # A tibble: 8 × 4
## Category variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Beverages Calories 0.880 0.00472
## 2 Breakfast Calories 0.879 0.000368
## 3 Coffee & Tea Calories 0.974 0.0539
## 4 Desserts Calories 0.930 0.550
## 5 Main Calories 0.681 0.0000000288
## 6 Salads Calories 0.917 0.484
## 7 Sides Calories 0.951 0.607
## 8 Smoothies & Shakes Calories 0.925 0.0472
Checking that the variances of calories are the same in all menu categories, using the Levene test.
leveneTest(Calories ~ Category,
data = mydata1)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 7 1.9749 0.05895 .
## 252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above tables, it can be seen that some of the p-values of the food categories in the Shapiro-Wilk test are under the 0.05 threshold. Therefore, the null hypothesis (H0:Distributions are normal), has to be rejected. In the Levene test, the p-value does not meet the required alpha of 0.05 since it is slightly above this threshold (0.059). Hence, for both reasons, the non-parametric alternative has to be used for testing the hypothesis, in this case the Kruskal-Wallis Rank Sum Test. Therefore, the hypothesis that will be tested is that at least one distribution location of the calories variable is different.
The Kruskal-Wallis test is performed below where the Calories are stated as a function of the Category in which the item is defined as.
kruskal.test(mydata1$Calories ~ mydata1$Category)
##
## Kruskal-Wallis rank sum test
##
## data: mydata1$Calories by mydata1$Category
## Kruskal-Wallis chi-squared = 120.75, df = 7, p-value < 2.2e-16
kruskal_effsize(Calories ~ Category,
data = mydata1)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 Calories 260 0.451 eta2[H] large
With the test returning a p-value of less than 0.001, we can confirm the hypothesis (H1: At least one distribution location of variable Calories is different from the others.). The chi-squared value from the test is 120.75 and the effect size is large (0.45).
Checking the relations directly between the categories using the Wilcoxon test for each pair separatly.
(grouped_nonparametric <- wilcox_test(Calories ~ Category,
paired = FALSE,
p.adjust.method = "bonferroni",
data = mydata1))
## # A tibble: 28 × 9
## .y. group1 group2 n1 n2 stati…¹ p p.adj p.adj…²
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Calories Beverages Breakfast 27 42 16.5 1.27e-11 3.56e-10 ****
## 2 Calories Beverages Coffee & Tea 27 95 466 4.66e- 7 1.3 e- 5 ****
## 3 Calories Beverages Desserts 27 7 40 2 e- 2 5.6 e- 1 ns
## 4 Calories Beverages Main 27 42 12.5 8.92e-12 2.5 e-10 ****
## 5 Calories Beverages Salads 27 6 25 9 e- 3 2.43e- 1 ns
## 6 Calories Beverages Sides 27 13 71 2 e- 3 7 e- 2 ns
## 7 Calories Beverages Smoothies &… 27 28 24 2.41e- 9 6.75e- 8 ****
## 8 Calories Breakfast Coffee & Tea 42 95 3348 2.69e-10 7.53e- 9 ****
## 9 Calories Breakfast Desserts 42 7 276 2.39e- 4 7 e- 3 **
## 10 Calories Breakfast Main 42 42 866. 8.93e- 1 1 e+ 0 ns
## # … with 18 more rows, and abbreviated variable names ¹statistic, ²p.adj.signif
The post-hoc tests revealed that there are not significant differences between all of the pairs of categories, since when using the Bonferroni adjustment method (p-values multiplied by the number of comparisons) 16 of the 28 pairs (approximately 57%) showed that the p-value was not significant.
Research question: Does an item on the McDonald’s menu, on average, have more or less than 400 calories (20% of the daily recommended calorie intake for men and 25% of the daily recommended calorie intake for women)?
Hypothesis: An item on the McDonald’s menu, on average, 400 calories.
H0: The average calorie amount of an item on the McDonald’s menu is 400 (Mean = 400). -> The median calorie amount of an item on the McDonald’s menu is 400 (Median = 400).
H1: The average calorie amount of an item on the McDonald’s menu is not 400 (Mean ≠ 400). -> The median calorie amount of an item on the McDonald’s menu is not 400 (Median ≠ 400).
Before deciding for the testing method (test) to perform, the conditions and assumptions for the parametric alternative have to be checked. The conditions and assumptions are, that the variable is numeric (which calories are, so this condition is met), that the distribution of the variable is normal (normality) and that there are no outliers in the population.
Testing the normality assumption of the calories variable for the population (McDonald’s menu). First by plotting the histogram of the distribution of calories and then confirming the assumption using the Shapiro-Wilk test.
mydata1 %>%
ggplot(aes(x = Calories)) +
geom_histogram(binwidth = 100, colour = "white", fill = "black") +
ylab("Frequency") +
xlab("Calories") +
theme_minimal()
shapiro_test(Calories,
data = mydata1)
## # A tibble: 1 × 3
## variable statistic p
## <chr> <dbl> <dbl>
## 1 Calories 0.919 1.12e-10
While the distribution would appear to be close to normal, with a slight right skew, the Shapiro-Wilk test (where H0:Distribution is normal) rejects the null hypothesis. Since the normality assumption is in the case of this data not met, the appropriate testing method is the Wilcoxon Signed Rank Test.
For this hypothesis external research was conducted. The calories of fast food items from other restaurants as well were found on the following websites: https://www.calories.info/food/fast-food and https://medlineplus.gov/ency/patientinstructions/000887.htm. The median from both samples returned a values of approximately 400 (395 and 399 respectively).
The median calorie amount of all items on the McDonald’s menu is calculated bellow, to have a reference point for the following tests.
median(mydata1$Calories)
## [1] 340
wilcox.test(mydata1$Calories,
mu = 400,
alternative = "two.sided",
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: mydata1$Calories
## V = 12518, p-value = 0.0006674
## alternative hypothesis: true location is not equal to 400
From the above result we can reject the null hypothesis (H0: The median calorie amount of an item on the McDonald’s menu is 400 (Median = 400)) at p<0.001.
effectsize(wilcox.test(mydata1$Calories,
mu = 400,
alternative = "two.sided",
correct = FALSE))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.24 | [-0.37, -0.11]
##
## - Deviation from a difference of 400.
interpret_rank_biserial(0.24)
## [1] "medium"
## (Rules: funder2019)
Based on the data, we found that the median calorie amount of an item on the McDonald’s menu was significantly the assumed amount of 400 calories (p < 0.001, r = 0.24 meaning a medium effect). The surprising result is that in comparison to other fast food options, the menu items from McDonald’s contain less calories. (From these statistics we can infer that McDonald’s is healthy :)).