mydata <- read.csv("C:/Users/Tajda/Downloads/burger-king-menu (2).csv") [,c(1,2,3,5,12,13)]
colnames(mydata) <- c ("Item", "Category", "Calories", "Fat", "Sugars", "Proteins")
head(mydata)
## Item Category Calories Fat Sugars Proteins
## 1 Whopper® Sandwich Burgers 660 40 11 28
## 2 Whopper® Sandwich with Cheese Burgers 740 46 11 32
## 3 Bacon & Cheese Whopper® Sandwich Burgers 790 51 11 35
## 4 Double Whopper® Sandwich Burgers 900 58 11 48
## 5 Double Whopper® Sandwich with Cheese Burgers 980 64 11 52
## 6 Triple Whopper® Sandwich Burgers 1130 75 11 67
Independent samples t-test, first research hypothesis: Burgers have the same average number of calories as Breakfast.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
mydata2 <- mydata[mydata$Category != "Chicken",]
describeBy(mydata2$Calories, g = mydata2$Category) #First I want to show average number of calories in categories Burgers and Breakfast
##
## Descriptive statistics by group
## group: Breakfast
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 33 389.39 262.45 380 376.3 341 10 930 920 0.23 -1 45.69
## ------------------------------------------------------------
## group: Burgers
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 26 635 343.42 630 622.27 467.02 170 1220 1050 0.29 -1.4 67.35
From above we can see a difference in arithmetic mean of number of calories. To conclude for population I will make a formal test (if the assumptions are violated I will use non-parametric alternative test )
Independent samples t-test
H0: average number of calories (Burgers) = average number of calories (Breakfast).
H1: average number of calories (Burgers) IS NOT THE SAME AS average number of calories (Breakfast).
First I will check normality of distribution of both groups (one of assumptions) (graphically) and then confirm it with Shapiro-Wilk test. If this assumption is violated I will apply non-parametric test: Wilcoxon Rank Sum Test.
library(ggplot2)
ggplot(mydata2, aes(x = Calories)) +
geom_histogram(binwidth = 80, colour="black", fill = "lightgreen") +
facet_wrap(~Category , ncol = 1) +
ylab("Frequency")
From the histogram above we cannot conclude if distribution of variable (in both groups) is normal or not so I will do Shapiro Wilk test.
Hypotheses for each group separately:
H0: Distribution of variable calories is normal.
H1: Distribution of variable calories is not normal.
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata2 %>%
group_by(Category) %>%
shapiro_test(Calories)
## # A tibble: 2 × 4
## Category variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Breakfast Calories 0.951 0.141
## 2 Burgers Calories 0.913 0.0313
From the results above we can say:
Category(group) Breakfast: p value= 0.141 > 5 %: we cannot reject H0.
Category(group) Burgers: p value= 0.0312 < 5 %: we can reject null hypothesis, meaning distribution of calories in this group is not normal.
Because distribution is not normal in both groups I will apply alternative test: non-parametric test: Wilcoxon Rank Sum Test.
H0: Distribution locations of variable Calories are the same.
H1: Distribution location of variable Calories are not the same.
wilcox.test(mydata2$Calories ~ mydata2$Category,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata2$Calories by mydata2$Category
## W = 259.5, p-value = 0.009635
## alternative hypothesis: true location shift is not equal to 0
As p value =0.009635 which is less than 5 % we can reject H0, which means there are differences between distribution locations. To see how big differences are –> Effect size
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(mydata2$Calories ~ mydata2$Category,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.40 | [-0.62, -0.12]
interpret_rank_biserial(0.40)
## [1] "very large"
## (Rules: funder2019)
Conclusions:
Based on the sample data, we have found that burgers and breakfasts differ in the number of calories (p = 0.009635 < 5% ) - Burgers have higher number of calories, the difference in distribution of variable is very large (r=0.40).
Research hypothesis: Burgers, breakfast and chicken have the same arithmetic mean of grams of proteins.
H0: μ (Burgers)= μ (Breakfast) = μ (Chicken)
(The average number of grams of proteins is the same in all three categories)
H1: At least one μ is different.
(The average number of grams of proteins is differenct in at least one group).
describeBy(x = mydata$Proteins, group = mydata$Category) #I want to show the average number of grams of proteins in every group
##
## Descriptive statistics by group
## group: Breakfast
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 33 11.76 10.81 12 11.07 16.31 0 32 32 0.34 -1.39 1.88
## ------------------------------------------------------------
## group: Burgers
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 26 33.5 19.83 28.5 32.41 21.5 8 71 63 0.53 -1.21 3.89
## ------------------------------------------------------------
## group: Chicken
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 18 19.5 10.44 18 19.12 9.64 5 40 35 0.51 -0.95 2.46
First checking if variable (Proteins) in normally distributed in all three groups
library(ggplot2)
ggplot(mydata, aes(x = Proteins)) +
geom_histogram(binwidth = 5, colour="black", fill = "lightblue") +
facet_wrap(~Category , ncol = 1) +
ylab("Frequency")
To check normality –> Shapiro-Wilk test of normality.
Hypotheses for each group separately:
H0: Distribution of variable Proteins is normal.
H1: Distribution of variable Proteins is not normal.
mydata %>%
group_by(Category) %>%
shapiro_test(Proteins)
## # A tibble: 3 × 4
## Category variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Breakfast Proteins 0.881 0.00174
## 2 Burgers Proteins 0.897 0.0133
## 3 Chicken Proteins 0.929 0.183
From the results above we can say:
Category(group) Breakfast: p value= 0.0017 < 5 %: we can reject H0, meaning distribution of proteins in this group is not normal.
Category(group) Burgers: p value= 0.0133 < 5 %: we can reject H0, meaning distribution of proteins in this group is not normal.
Category(group) Chicken: p value= 0.183 > 5 %: we cannot reject H0.
Because distribution of chosen variable is not normal in every group I will apply alternative test non-parametric Kruskal-Wallis Rank Sum Test.
H0: All distribution locations of variable Proteins are the same in all 3 groups
H1: At least one distribution location of variable Proteins is different.
kruskal.test(Proteins ~ Category,
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Proteins by Category
## Kruskal-Wallis chi-squared = 21.164, df = 2, p-value = 2.537e-05
(P - value (2.537e-05) is low. We can reject H0 with p < 0.001, which means at least one distribution location of variable protein is different.
kruskal_effsize(Proteins ~ Category,
data = mydata)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 Proteins 77 0.259 eta2[H] large
Based on the sample data, we have found that the distribution of grams of proteins differs for at least of the group (chi-squared = 21.164 and p < 0.001 ) - the difference in distribution locations of variable is large (effsize=0.2589763).
library(rstatix)
groups_nonpar <- wilcox_test(Proteins ~ Category,
paired = FALSE,
p.adjust.method = "bonferroni",
data = mydata)
groups_nonpar
## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.s…¹
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Proteins Breakfast Burgers 33 26 146 0.0000155 0.0000465 ****
## 2 Proteins Breakfast Chicken 33 18 184. 0.025 0.076 ns
## 3 Proteins Burgers Chicken 26 18 339 0.013 0.038 *
## # … with abbreviated variable name ¹p.adj.signif
From the table above, we can see that p.adj value between groups breakfast and chicken is more than 5% which means that there are no differences in distribution of grams of proteins in this pair of group (not significant, p.ajd > 5%). For other two pairs of groups there are differences (p.ajd < 5%) in distribution of grams of proteins.
I show additional variables from my data set (saturated and trans fat) to compare the average number of those in category Burgers. (Before I just did not show them).
Saturated fat: how many grams (g) of saturated fat is in each menu item
Trans fat: how many grams (g) of trans fat is in each menu item
Research hypothesis: Burgers differ in average number (grams) of saturated and trans fat.
mydata3 <- read.csv("C:/Users/Tajda/Downloads/burger-king-menu (2).csv") [,c(1,2,3,5,6,7,12,13)]
head(mydata3)
## Item Category Calories Fat..g.
## 1 Whopper® Sandwich Burgers 660 40
## 2 Whopper® Sandwich with Cheese Burgers 740 46
## 3 Bacon & Cheese Whopper® Sandwich Burgers 790 51
## 4 Double Whopper® Sandwich Burgers 900 58
## 5 Double Whopper® Sandwich with Cheese Burgers 980 64
## 6 Triple Whopper® Sandwich Burgers 1130 75
## Saturated.Fat..g. Trans.Fat..g. Sugars..g. Protein..g.
## 1 12 1.5 11 28
## 2 16 2.0 11 32
## 3 17 2.0 11 35
## 4 20 3.0 11 48
## 5 24 3.0 11 52
## 6 28 4.0 11 67
colnames(mydata3) <- c ("Item", "Category", "Calories", "Fat", "SaturatedFat" ,"TransFat" , "Sugars", "Proteins")
head(mydata3)
## Item Category Calories Fat SaturatedFat
## 1 Whopper® Sandwich Burgers 660 40 12
## 2 Whopper® Sandwich with Cheese Burgers 740 46 16
## 3 Bacon & Cheese Whopper® Sandwich Burgers 790 51 17
## 4 Double Whopper® Sandwich Burgers 900 58 20
## 5 Double Whopper® Sandwich with Cheese Burgers 980 64 24
## 6 Triple Whopper® Sandwich Burgers 1130 75 28
## TransFat Sugars Proteins
## 1 1.5 11 28
## 2 2.0 11 32
## 3 2.0 11 35
## 4 3.0 11 48
## 5 3.0 11 52
## 6 4.0 11 67
describe(mydata3[mydata$Category == "Burgers",]$SaturatedFat)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 26 14.65 10 12.5 14.18 10.38 1.5 33 31.5 0.47 -1.19 1.96
describe(mydata3[mydata$Category == "Burgers",]$TransFat)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 26 1.75 1.36 1.5 1.68 1.48 0 4.5 4.5 0.4 -1.18 0.27
First I have to calculate differences between two variables (TransFat and SaturatedFat) to see if they are normally distributed.
mydata3$Difference <- mydata3$TransFat - mydata3$SaturatedFat
describe(mydata3$Difference)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 77 -9.17 7.23 -7.5 -8.37 6.67 -29.5 0 29.5 -0.91 0.17 0.82
library(ggplot2)
ggplot(mydata3, aes(x = Difference)) +
geom_histogram(binwidth = 1, color = "black", fill = "pink") +
xlab("Differences") +
ylab("Frequency")
Shapiro-Wilk test to test if distribution of Differences is normal.
H0: Variable (Differences) is normally distributed.
H1: Variable (Differences) is normally distributed.
shapiro.test(mydata3$Difference)
##
## Shapiro-Wilk normality test
##
## data: mydata3$Difference
## W = 0.9173, p-value = 9.657e-05
With p < 0.001 which is less than 5% we can reject H0 which means differences are not normally distributed –>
Alternative: Wilcoxon Signed Rank Test (Hypothesis about the difference in two distribution locations):
H0: Distribution locations are the same.
H1: Distribution locations are different.
wilcox.test(mydata3$TransFat, mydata3$SaturatedFat,
paired = TRUE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon signed rank test
##
## data: mydata3$TransFat and mydata3$SaturatedFat
## V = 0, p-value = 1.12e-13
## alternative hypothesis: true location shift is not equal to 0
With p<0.001 we can reject H0 which means distribution location are different. To see how big the difference is –> Effect size
effectsize(wilcox.test(mydata3$TransFat, mydata3$SaturatedFat,
paired = TRUE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -1.00 | [-1.00, -1.00]
interpret_rank_biserial(1, rules = "funder2019")
## [1] "very large"
## (Rules: funder2019)
Effect size: 1 –> more than 0.4 according to Funder and Ozer (2019) means very large.
Based on the sample data, we have found that number (grams) of saturated fat differs compared to number (grams) of trans fat in Burgers (with p < 0.001 ) - the difference in distribution locations of variables is large (r=1). Burgers have higher number (grams) of saturated fat compared to trans fat.