Homework2_ Korošec

mydata <- read.csv("C:/Users/Tajda/Downloads/burger-king-menu (2).csv") [,c(1,2,3,5,12,13)]

colnames(mydata) <- c ("Item", "Category", "Calories", "Fat", "Sugars", "Proteins")
head(mydata)

##                                   Item Category Calories Fat Sugars Proteins
## 1                    Whopper® Sandwich  Burgers      660  40     11       28
## 2        Whopper® Sandwich with Cheese  Burgers      740  46     11       32
## 3     Bacon & Cheese Whopper® Sandwich  Burgers      790  51     11       35
## 4             Double Whopper® Sandwich  Burgers      900  58     11       48
## 5 Double Whopper® Sandwich with Cheese  Burgers      980  64     11       52
## 6             Triple Whopper® Sandwich  Burgers     1130  75     11       67

Description: This data set is a collection of nutritional information for all major menu items offered by Burger King. The data set includes information on the number of calories, total fat,sugars and protein found in each menu item.

Unit of observation: menu items offered by Burger King,

Sample size: 77 (units of observation).

Categories: breakfast, chicken, burgers

Calories: number of calories (kcal) is in each menu item (meal)

Fat: how many grams (g) of fats is in each menu item.

Sugars:how many grams (g) of sugars is in each menu item.

Proteins:how many grams (g) of proteins is in each menu item.

Source: https://www.kaggle.com/datasets/mattop/burger-king-menu-nutrition-data?select=burger-king-menu.csv

Research question: Are there variations in nutrients of menu items offered by Burger king?

Hypothesis about the difference between two population arithmetic means(independent samples)

Independent samples t-test, first research hypothesis: Burgers have the same average number of calories as Breakfast.

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

mydata2 <- mydata[mydata$Category != "Chicken",]
describeBy(mydata2$Calories, g = mydata2$Category) #First I want to show average number of calories in categories Burgers and Breakfast

## 
##  Descriptive statistics by group 
## group: Breakfast
##    vars  n   mean     sd median trimmed mad min max range skew kurtosis    se
## X1    1 33 389.39 262.45    380   376.3 341  10 930   920 0.23       -1 45.69
## ------------------------------------------------------------ 
## group: Burgers
##    vars  n mean     sd median trimmed    mad min  max range skew kurtosis    se
## X1    1 26  635 343.42    630  622.27 467.02 170 1220  1050 0.29     -1.4 67.35

From above we can see a difference in arithmetic mean of number of calories. To conclude for population I will make a formal test (if the assumptions are violated I will use non-parametric alternative test )

Independent samples t-test

H0: average number of calories (Burgers) = average number of calories (Breakfast).

H1: average number of calories (Burgers) IS NOT THE SAME AS average number of calories (Breakfast).

First I will check normality of distribution of both groups (one of assumptions) (graphically) and then confirm it with Shapiro-Wilk test. If this assumption is violated I will apply non-parametric test: Wilcoxon Rank Sum Test.

library(ggplot2)
ggplot(mydata2, aes(x = Calories)) +
  geom_histogram(binwidth = 80, colour="black", fill = "lightgreen") +
  facet_wrap(~Category , ncol = 1) + 
  ylab("Frequency")

From the histogram above we cannot conclude if distribution of variable (in both groups) is normal or not so I will do Shapiro Wilk test.

Hypotheses for each group separately:

H0: Distribution of variable calories is normal.

H1: Distribution of variable calories is not normal.

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata2 %>%
  group_by(Category) %>%
  shapiro_test(Calories)

## # A tibble: 2 × 4
##   Category  variable statistic      p
##   <chr>     <chr>        <dbl>  <dbl>
## 1 Breakfast Calories     0.951 0.141 
## 2 Burgers   Calories     0.913 0.0313

From the results above we can say:

Category(group) Breakfast: p value= 0.141 > 5 %: we cannot reject H0.

Category(group) Burgers: p value= 0.0312 < 5 %: we can reject null hypothesis, meaning distribution of calories in this group is not normal.

Because distribution is not normal in both groups I will apply alternative test: non-parametric test: Wilcoxon Rank Sum Test.

H0: Distribution locations of variable Calories are the same.

H1: Distribution location of variable Calories are not the same.

wilcox.test(mydata2$Calories ~ mydata2$Category,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata2$Calories by mydata2$Category
## W = 259.5, p-value = 0.009635
## alternative hypothesis: true location shift is not equal to 0

As p value =0.009635 which is less than 5 % we can reject H0, which means there are differences between distribution locations. To see how big differences are –> Effect size

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize(wilcox.test(mydata2$Calories ~ mydata2$Category,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.40             | [-0.62, -0.12]

interpret_rank_biserial(0.40)

## [1] "very large"
## (Rules: funder2019)

Conclusions:

Based on the sample data, we have found that burgers and breakfasts differ in the number of calories (p = 0.009635 < 5% ) - Burgers have higher number of calories, the difference in distribution of variable is very large (r=0.40).

Hypothesis about the equality of three or more population arithmetic means for independent samples, One-way analysis of variance - Anova, if assumtions are violated –> Kruskal-Wallis rank sum test

Research hypothesis: Burgers, breakfast and chicken have the same arithmetic mean of grams of proteins.

H0: μ (Burgers)= μ (Breakfast) = μ (Chicken)

(The average number of grams of proteins is the same in all three categories)

H1: At least one μ is different.

(The average number of grams of proteins is differenct in at least one group).

describeBy(x = mydata$Proteins, group = mydata$Category) #I want to show the average number of grams of proteins in every group

## 
##  Descriptive statistics by group 
## group: Breakfast
##    vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 33 11.76 10.81     12   11.07 16.31   0  32    32 0.34    -1.39 1.88
## ------------------------------------------------------------ 
## group: Burgers
##    vars  n mean    sd median trimmed  mad min max range skew kurtosis   se
## X1    1 26 33.5 19.83   28.5   32.41 21.5   8  71    63 0.53    -1.21 3.89
## ------------------------------------------------------------ 
## group: Chicken
##    vars  n mean    sd median trimmed  mad min max range skew kurtosis   se
## X1    1 18 19.5 10.44     18   19.12 9.64   5  40    35 0.51    -0.95 2.46

First checking if variable (Proteins) in normally distributed in all three groups

library(ggplot2)
ggplot(mydata, aes(x = Proteins)) +
  geom_histogram(binwidth = 5, colour="black", fill = "lightblue") +
  facet_wrap(~Category , ncol = 1) + 
  ylab("Frequency")

To check normality –> Shapiro-Wilk test of normality.

Hypotheses for each group separately:

H0: Distribution of variable Proteins is normal.

H1: Distribution of variable Proteins is not normal.

mydata %>%
  group_by(Category) %>%
  shapiro_test(Proteins)

## # A tibble: 3 × 4
##   Category  variable statistic       p
##   <chr>     <chr>        <dbl>   <dbl>
## 1 Breakfast Proteins     0.881 0.00174
## 2 Burgers   Proteins     0.897 0.0133 
## 3 Chicken   Proteins     0.929 0.183

From the results above we can say:

Category(group) Breakfast: p value= 0.0017 < 5 %: we can reject H0, meaning distribution of proteins in this group is not normal.

Category(group) Burgers: p value= 0.0133 < 5 %: we can reject H0, meaning distribution of proteins in this group is not normal.

Category(group) Chicken: p value= 0.183 > 5 %: we cannot reject H0.

Because distribution of chosen variable is not normal in every group I will apply alternative test non-parametric Kruskal-Wallis Rank Sum Test.

H0: All distribution locations of variable Proteins are the same in all 3 groups

H1: At least one distribution location of variable Proteins is different.

kruskal.test(Proteins ~ Category, 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Proteins by Category
## Kruskal-Wallis chi-squared = 21.164, df = 2, p-value = 2.537e-05

(P - value (2.537e-05) is low. We can reject H0 with p < 0.001, which means at least one distribution location of variable protein is different.

kruskal_effsize(Proteins ~ Category, 
                data = mydata)

## # A tibble: 1 × 5
##   .y.          n effsize method  magnitude
## * <chr>    <int>   <dbl> <chr>   <ord>    
## 1 Proteins    77   0.259 eta2[H] large

Based on the sample data, we have found that the distribution of grams of proteins differs for at least of the group (chi-squared = 21.164 and p < 0.001 ) - the difference in distribution locations of variable is large (effsize=0.2589763).

library(rstatix)

groups_nonpar <- wilcox_test(Proteins ~ Category,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = mydata)

groups_nonpar

## # A tibble: 3 × 9
##   .y.      group1    group2     n1    n2 statistic         p     p.adj p.adj.s…¹
## * <chr>    <chr>     <chr>   <int> <int>     <dbl>     <dbl>     <dbl> <chr>    
## 1 Proteins Breakfast Burgers    33    26      146  0.0000155 0.0000465 ****     
## 2 Proteins Breakfast Chicken    33    18      184. 0.025     0.076     ns       
## 3 Proteins Burgers   Chicken    26    18      339  0.013     0.038     *        
## # … with abbreviated variable name ¹p.adj.signif

From the table above, we can see that p.adj value between groups breakfast and chicken is more than 5% which means that there are no differences in distribution of grams of proteins in this pair of group (not significant, p.ajd > 5%). For other two pairs of groups there are differences (p.ajd < 5%) in distribution of grams of proteins.

Hypothesis about the difference between two arithmetic means, paired samples t-test

I show additional variables from my data set (saturated and trans fat) to compare the average number of those in category Burgers. (Before I just did not show them).

Saturated fat: how many grams (g) of saturated fat is in each menu item

Trans fat: how many grams (g) of trans fat is in each menu item

Research hypothesis: Burgers differ in average number (grams) of saturated and trans fat.

mydata3 <- read.csv("C:/Users/Tajda/Downloads/burger-king-menu (2).csv") [,c(1,2,3,5,6,7,12,13)]

head(mydata3)

##                                   Item Category Calories Fat..g.
## 1                    Whopper® Sandwich  Burgers      660      40
## 2        Whopper® Sandwich with Cheese  Burgers      740      46
## 3     Bacon & Cheese Whopper® Sandwich  Burgers      790      51
## 4             Double Whopper® Sandwich  Burgers      900      58
## 5 Double Whopper® Sandwich with Cheese  Burgers      980      64
## 6             Triple Whopper® Sandwich  Burgers     1130      75
##   Saturated.Fat..g. Trans.Fat..g. Sugars..g. Protein..g.
## 1                12           1.5         11          28
## 2                16           2.0         11          32
## 3                17           2.0         11          35
## 4                20           3.0         11          48
## 5                24           3.0         11          52
## 6                28           4.0         11          67

colnames(mydata3) <- c ("Item", "Category", "Calories", "Fat", "SaturatedFat" ,"TransFat" , "Sugars", "Proteins")
head(mydata3)

##                                   Item Category Calories Fat SaturatedFat
## 1                    Whopper® Sandwich  Burgers      660  40           12
## 2        Whopper® Sandwich with Cheese  Burgers      740  46           16
## 3     Bacon & Cheese Whopper® Sandwich  Burgers      790  51           17
## 4             Double Whopper® Sandwich  Burgers      900  58           20
## 5 Double Whopper® Sandwich with Cheese  Burgers      980  64           24
## 6             Triple Whopper® Sandwich  Burgers     1130  75           28
##   TransFat Sugars Proteins
## 1      1.5     11       28
## 2      2.0     11       32
## 3      2.0     11       35
## 4      3.0     11       48
## 5      3.0     11       52
## 6      4.0     11       67

describe(mydata3[mydata$Category == "Burgers",]$SaturatedFat)

##    vars  n  mean sd median trimmed   mad min max range skew kurtosis   se
## X1    1 26 14.65 10   12.5   14.18 10.38 1.5  33  31.5 0.47    -1.19 1.96

describe(mydata3[mydata$Category == "Burgers",]$TransFat)

##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 26 1.75 1.36    1.5    1.68 1.48   0 4.5   4.5  0.4    -1.18 0.27

First I have to calculate differences between two variables (TransFat and SaturatedFat) to see if they are normally distributed.

mydata3$Difference <- mydata3$TransFat - mydata3$SaturatedFat
describe(mydata3$Difference)

##    vars  n  mean   sd median trimmed  mad   min max range  skew kurtosis   se
## X1    1 77 -9.17 7.23   -7.5   -8.37 6.67 -29.5   0  29.5 -0.91     0.17 0.82

library(ggplot2)
ggplot(mydata3, aes(x = Difference)) +
  geom_histogram(binwidth = 1, color = "black", fill = "pink") +
  xlab("Differences") +
  ylab("Frequency")

Shapiro-Wilk test to test if distribution of Differences is normal.

H0: Variable (Differences) is normally distributed.

H1: Variable (Differences) is normally distributed.

shapiro.test(mydata3$Difference)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata3$Difference
## W = 0.9173, p-value = 9.657e-05

With p < 0.001 which is less than 5% we can reject H0 which means differences are not normally distributed –>

Alternative: Wilcoxon Signed Rank Test (Hypothesis about the difference in two distribution locations):

H0: Distribution locations are the same.

H1: Distribution locations are different.

wilcox.test(mydata3$TransFat,  mydata3$SaturatedFat,
            paired = TRUE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon signed rank test
## 
## data:  mydata3$TransFat and mydata3$SaturatedFat
## V = 0, p-value = 1.12e-13
## alternative hypothesis: true location shift is not equal to 0

With p<0.001 we can reject H0 which means distribution location are different. To see how big the difference is –> Effect size

effectsize(wilcox.test(mydata3$TransFat,  mydata3$SaturatedFat,
            paired = TRUE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -1.00             | [-1.00, -1.00]

interpret_rank_biserial(1, rules = "funder2019")

## [1] "very large"
## (Rules: funder2019)

Effect size: 1 –> more than 0.4 according to Funder and Ozer (2019) means very large.

Based on the sample data, we have found that number (grams) of saturated fat differs compared to number (grams) of trans fat in Burgers (with p < 0.001 ) - the difference in distribution locations of variables is large (r=1). Burgers have higher number (grams) of saturated fat compared to trans fat.

Homework2_ Korošec_Tajda

Tajda_Korošec

2023-01-11

Description: This data set is a collection of nutritional information for all major menu items offered by Burger King. The data set includes information on the number of calories, total fat,sugars and protein found in each menu item.

Hypothesis about the difference between two population arithmetic means(independent samples)

Hypothesis about the equality of three or more population arithmetic means for independent samples, One-way analysis of variance - Anova, if assumtions are violated –> Kruskal-Wallis rank sum test

Hypothesis about the difference between two arithmetic means, paired samples t-test