mydata <- read.csv("C:/Users/Tajda/Downloads/burger-king-menu (2).csv")[, c(1, 2, 3, 5, 12, 13)]
colnames(mydata) <- c("Item", "Category", "Calories", "Fat", "Sugars", "Proteins")
head(mydata)
## Item Category Calories Fat Sugars Proteins
## 1 Whopper® Sandwich Burgers 660 40 11 28
## 2 Whopper® Sandwich with Cheese Burgers 740 46 11 32
## 3 Bacon & Cheese Whopper® Sandwich Burgers 790 51 11 35
## 4 Double Whopper® Sandwich Burgers 900 58 11 48
## 5 Double Whopper® Sandwich with Cheese Burgers 980 64 11 52
## 6 Triple Whopper® Sandwich Burgers 1130 75 11 67
Unit of observation: menu items offered by Burger King.
Sample size: 77 (units of observation).
Categories: Breakfast, Chicken, Burgers.
Calories: number of calories (kcal) in each menu item (meal).
Fat: how many grams (g) of fat are in each menu item.
Sugars: how many grams (g) of sugars are in each menu item.
Proteins: how many grams (g) of proteins are in each menu item.
Source: https://www.kaggle.com/datasets/mattop/burger-king-menu-nutrition-data?select=burger-king-menu.csv
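As a quick cross-check of the stated sample size and the category labels, one could run the following (a minimal sketch; it assumes the data have been loaded and renamed as above):
nrow(mydata) #number of menu items; should return 77
table(mydata$Category) #number of items per category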
Regression model:
In the regression model, we will examine the relationship between the variables in our data: how fat, sugars, proteins and the category of a menu item affect its calories, and which of these independent variables has the largest effect on the number of calories.
Theory behind: Macronutrients, which are essential nutrients needed in large quantities daily, consist of carbohydrates (sugars), proteins, and fats. Each macronutrient supplies energy, measured in calories, but the amount of energy per gram varies: carbohydrates and proteins have 4 calories per gram, while fats have 9 calories per gram. (Source: https://www.msdmanuals.com/home/disorders-of-nutrition/overview-of-nutrition/carbohydrates,-proteins,-and-fats)
Based on this, we expect fat to have the biggest impact.
Calories: dependent variable
Category, Fat, Sugars, Proteins: independent variables
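As a worked example of the energy arithmetic behind this expectation, consider the Whopper® Sandwich from the table above (40 g fat, 11 g sugars, 28 g proteins); a minimal sketch:
9 * 40 + 4 * 11 + 4 * 28 #energy from fat, sugars and proteins: 360 + 44 + 112 = 516 kcal
The result (516 kcal) is below the listed 660 kcal, presumably because sugars are only one part of total carbohydrates, which our selected columns do not include in full. Still, fat contributes by far the most energy per gram, which is why we expect it to dominate.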
mydata$CategoryFactor <- factor(mydata$Category,
                                levels = c("Burgers", "Chicken", "Breakfast"),
                                labels = c("Burgers", "Chicken", "Breakfast")) #Data manipulation
head(mydata) #Data manipulation
## Item Category Calories Fat Sugars Proteins CategoryFactor
## 1 Whopper® Sandwich Burgers 660 40 11 28 Burgers
## 2 Whopper® Sandwich with Cheese Burgers 740 46 11 32 Burgers
## 3 Bacon & Cheese Whopper® Sandwich Burgers 790 51 11 35 Burgers
## 4 Double Whopper® Sandwich Burgers 900 58 11 48 Burgers
## 5 Double Whopper® Sandwich with Cheese Burgers 980 64 11 52 Burgers
## 6 Triple Whopper® Sandwich Burgers 1130 75 11 67 Burgers
library(car) #scatterplotMatrix() and vif() used below come from the car package
scatterplotMatrix(mydata[c(3, 4, 5, 6)], smooth = FALSE)
From the scatterplot matrix above we can observe a positive relationship among all pairs of variables. The assumption of linearity is also not violated, because we do not see any curves.
myreg <- lm(data = mydata, Calories ~ CategoryFactor + Fat + Proteins + Sugars)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[c(3, 4, 5, 6)]))
## Calories Fat Sugars Proteins
## Calories 1.00 0.98 0.44 0.91
## Fat 0.98 1.00 0.30 0.91
## Sugars 0.44 0.30 1.00 0.30
## Proteins 0.91 0.91 0.30 1.00
##
## n= 77
##
##
## P
## Calories Fat Sugars Proteins
## Calories 0.0000 0.0000 0.0000
## Fat 0.0000 0.0083 0.0000
## Sugars 0.0000 0.0083 0.0072
## Proteins 0.0000 0.0000 0.0072
From the matrix above we can say that the variable Fat has the strongest correlation with Calories, since its r value (the Pearson correlation coefficient) is the highest (0.98). The coefficient r is a number between -1 and 1 that measures the strength and direction of the relationship between two variables.
All p-values are low (< 5%), meaning every pairwise correlation is statistically significant.
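The strongest pairwise correlation can be cross-checked in base R with cor.test(), which tests a single pair at a time (a sketch, not part of the original output):
cor.test(mydata$Calories, mydata$Fat) #should reproduce r = 0.98 with p-value < 0.001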
vif(myreg) # Multicollinearity:
## GVIF Df GVIF^(1/(2*Df))
## CategoryFactor 2.296206 2 1.230985
## Fat 8.630276 1 2.937733
## Proteins 11.122781 1 3.335083
## Sugars 1.183032 1 1.087673
mean(vif(myreg))
## [1] 3.068647
Based on the VIF test above, we can conclude that our model has a problem with multicollinearity, because the GVIF values for the variables Fat and Proteins are above 5, which is the usual cut-off for this test. For a categorical variable with more than two categories (Df > 1), we should instead compare the last column, GVIF^(1/(2*Df)), with √5 ≈ 2.24. So in our case there is no multicollinearity problem with the variable Category, since its value (1.23) is below 2.24. The average VIF value is 3.068. We will remove the variable Proteins, since its value is the highest.
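A minimal sketch of this decision rule, using the column names from car's vif() output above, would flag exactly the variables whose adjusted GVIF exceeds the cut-off:
gvif <- vif(myreg)
sqrt(5) #cut-off for the GVIF^(1/(2*Df)) column: approximately 2.236
gvif[gvif[, "GVIF^(1/(2*Df))"] > sqrt(5), ] #should flag Fat and Proteins only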
mydata <- mydata[, -6] #Removing the variable Proteins (column six)
myreg <- lm(data = mydata, Calories ~ CategoryFactor + Fat + Sugars)
vif(myreg) # Checking Multicolinearity after removing variable
## GVIF Df GVIF^(1/(2*Df))
## CategoryFactor 1.203170 2 1.047326
## Fat 1.229530 1 1.108842
## Sugars 1.181629 1 1.087028
mean(vif(myreg))
## [1] 1.206392
From the results above we can see that after removing the variable Proteins we no longer have a problem with multicollinearity, so there is no strong relationship among the explanatory variables anymore. The values drop closer to 1 and they are all < 5.
mydata$StdResid <- round(rstandard(myreg), 3) #Standardized residuals of myreg, rounded to 3 decimals: how far each unit is from the regression line (units outside ±3 are outliers and need to be removed)
mydata$CooksD <- round(cooks.distance(myreg), 3) #With Cook's distance we find units with high impact.
hist(mydata$StdResid, #We check if std. residuals are normally distributed.
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
From the histogram above we can see an outlier, since the values of the standardized residuals are not all between -3 and 3. The outlier is at approximately 4.
shapiro.test(mydata$StdResid) #formal test to see if distribution of errors is normal (assumption of normality of errors based on residuals).
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95122, p-value = 0.005032
H0: Errors are normally distributed.
H1: Errors are NOT normally distributed.
We can reject H0 with p-value 0.005 (< 5%) -> errors are not normally distributed.
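As an additional visual check of the same assumption (a sketch, not part of the original analysis), we could draw a normal Q-Q plot of the standardized residuals:
qqnorm(mydata$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata$StdResid) #points should follow this line if errors are normally distributed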
hist(mydata$CooksD, #With this histogram we can find units with high impact
main = "Histogram of Cooks distances",
xlab = "Cooks distance",
ylab = "Frequency")
If a Cook's distance is above 1, the unit must be removed, but here all values are below 1. Units whose distances are significantly larger than the others (a visible gap in the histogram) also need to be removed.
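A minimal sketch of this rule, listing any unit whose Cook's distance exceeds the hard cut-off of 1 (here the result should be empty):
mydata[mydata$CooksD > 1, ] #units above the cut-off; none in our data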
tail(mydata[order(mydata$StdResid),], 6)
## Item Category Calories Fat Sugars CategoryFactor StdResid CooksD
## 45 CROISSAN’WICH® Egg & Cheese Breakfast 340 18 4 Breakfast 1.337 0.012
## 65 Hash Browns – large Breakfast 670 44 0 Breakfast 1.427 0.031
## 47 CROISSAN’WICH® Ham, Egg & Cheese Breakfast 370 19 5 Breakfast 1.567 0.016
## 11 Single Quarter Pound King Sandwich Burgers 580 29 10 Burgers 1.848 0.031
## 33 Chicken Nuggets- 20pc Chicken 860 54 1 Chicken 1.864 0.060
## 59 EGG-NORMOUS BURRITO™ Breakfast 780 42 4 Breakfast 4.238 0.185
From the table above we can identify our outlier. This is item 59 (std.res = 4.238). We will remove it.
head(mydata[order(-mydata$CooksD),], 6)
## Item Category Calories Fat Sugars CategoryFactor StdResid CooksD
## 59 EGG-NORMOUS BURRITO™ Breakfast 780 42 4 Breakfast 4.238 0.185
## 10 Cheddar Bacon King Sandwich Burgers 1190 84 11 Burgers -2.220 0.123
## 62 Pancake and Sausage platter Breakfast 610 31 30 Breakfast -1.384 0.092
## 33 Chicken Nuggets- 20pc Chicken 860 54 1 Chicken 1.864 0.060
## 74 Ranch Dipping Sauce (1 oz) Breakfast 140 15 1 Breakfast -2.221 0.042
## 60 BK™ Ultimate Breakfast Platter Breakfast 930 44 40 Breakfast 0.604 0.040
From the table above we can identify the units with high impact: items 59 (CooksD = 0.185), 10 (CooksD = 0.123) and 62 (CooksD = 0.092). We will remove them.
mydata <- mydata[c(-10, -59, -62), ] #Removing the outlier and units with high impact
myreg <- lm(data = mydata, Calories ~ CategoryFactor + Fat + Sugars) #Re-estimation of the model after removal
mydata$StdResid <- round(rstandard(myreg), 3) #Recompute standardized residuals from the re-estimated model, so the plot below does not use stale values
mydata$StdFittedValues <- scale(myreg$fitted.values) #Homoskedasticity
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
From the scatterplot above we can see that the standardized residuals are random (without any pattern), so there is no problem with heteroskedasticity, since the variance is constant. We will still run a formal test.
In addition, we can see that the assumption of linearity is met, because there is no curve which would indicate a pattern in the sample data. (We also observed this in the scatterplot matrix.)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(myreg) #Formal test for checking heteroskedasticity
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ------------------------------------
## Response : Calories
## Variables: fitted values of Calories
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.3996987
## Prob > Chi2 = 0.5272449
We cannot reject H0 (p = 0.527 > 5%). So variance is constant and there is no heteroskedasticity. Assumption of homoskedasticity is therefore NOT violated.
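The same hypothesis can be cross-checked with bptest() from the lmtest package (a sketch; the package is not used in the original analysis, and its studentized version gives slightly different numbers, but the conclusion should match):
library(lmtest)
bptest(myreg) #should likewise fail to reject H0 of constant variance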
summary(myreg)
##
## Call:
## lm(formula = Calories ~ CategoryFactor + Fat + Sugars, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.502 -21.500 1.696 18.845 72.131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0711 10.3479 3.486 0.000857 ***
## CategoryFactorChicken 3.2635 10.4958 0.311 0.756789
## CategoryFactorBreakfast -31.2670 9.2334 -3.386 0.001173 **
## Fat 13.7138 0.2126 64.505 < 2e-16 ***
## Sugars 7.9911 0.6413 12.461 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.62 on 69 degrees of freedom
## Multiple R-squared: 0.9889, Adjusted R-squared: 0.9883
## F-statistic: 1536 on 4 and 69 DF, p-value: < 2.2e-16
Explaining coefficients:
The number 36.07 represents the intercept; the other estimates are slopes, and together they form the regression equation: predicted Calories = 36.07 + 3.26·Chicken - 31.27·Breakfast + 13.71·Fat + 7.99·Sugars.
Category Chicken: since the p-value is > 5%, we did not find a statistically significant difference in calories between chicken items and burgers.
Category Breakfast: given the values of all other variables, breakfast items have on average 31.27 fewer calories than burgers (the base for comparison is Burgers), with p = 0.001.
If fat is increased by 1 g, the number of calories on average increases by 13.71 kcal, assuming everything else remains unchanged (statistically significant effect with p < 0.001).
If sugars are increased by 1 g, the number of calories on average increases by 7.99 kcal, assuming everything else remains unchanged (statistically significant effect with p < 0.001).
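To see the coefficients in action, we can predict the calories of the first item, the Whopper® Sandwich (a burger with 40 g of fat and 11 g of sugars); a minimal sketch with predict():
predict(myreg, newdata = data.frame(CategoryFactor = "Burgers", Fat = 40, Sugars = 11))
#36.07 + 13.71 * 40 + 7.99 * 11 = approximately 673 kcal, close to the listed 660 kcal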
Explaining F-statistic (ANOVA):
H0: ρ² (rho squared) = 0
H1: ρ² > 0
We can reject H0 at p < 0.001 -> we found a relationship between the dependent variable and at least one explanatory variable. The model is good.
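The reported p-value can be recovered from the F statistic and its degrees of freedom shown in the summary above (a sketch):
pf(1536, df1 = 4, df2 = 69, lower.tail = FALSE) #p-value of the overall F-test, < 2.2e-16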
Multiple R-squared = 0.9889 -> 98.89% of the variability of the dependent variable (Calories) is explained by the linear effect of the 3 independent variables (Fat, Sugars and Category).
sqrt(summary(myreg)$r.squared) # Multiple coefficient of correlation
## [1] 0.994433
The linear relationship between Calories and the 3 explanatory variables is very strong (between 0.9 and 1).