mydata <- read.csv("C:/Users/Tajda/Downloads/burger-king-menu (2).csv")[, c(1, 2, 3, 5, 12, 13)]
colnames(mydata) <- c("Item", "Category", "Calories", "Fat", "Sugars", "Proteins")
head(mydata)
## Item Category Calories Fat Sugars Proteins
## 1 Whopper® Sandwich Burgers 660 40 11 28
## 2 Whopper® Sandwich with Cheese Burgers 740 46 11 32
## 3 Bacon & Cheese Whopper® Sandwich Burgers 790 51 11 35
## 4 Double Whopper® Sandwich Burgers 900 58 11 48
## 5 Double Whopper® Sandwich with Cheese Burgers 980 64 11 52
## 6 Triple Whopper® Sandwich Burgers 1130 75 11 67
Unit of observation: menu items offered by Burger King.
Sample size: 77 (units of observation).
Categories: Breakfast, Chicken, Burgers.
Calories: number of calories (kcal) in each menu item (meal).
Fat: how many grams (g) of fat are in each menu item.
Sugars: how many grams (g) of sugars are in each menu item.
Proteins: how many grams (g) of proteins are in each menu item.
Source: https://www.kaggle.com/datasets/mattop/burger-king-menu-nutrition-data?select=burger-king-menu.csv
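As a quick cross-check of the stated sample size and the category labels, one could run the following (a minimal sketch; it assumes the data have been loaded and renamed as above):
nrow(mydata) #number of menu items; should return 77
table(mydata$Category) #number of items per category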
Regression model:
In the regression model, we will examine the relationship between the variables in our data: how fat, sugars, proteins and the category of a menu item affect its calories, and which of these independent variables has the largest effect on the number of calories.
Theory behind: Macronutrients, which are essential nutrients needed in large quantities daily, consist of carbohydrates (sugars), proteins, and fats. Each macronutrient supplies energy, measured in calories, but the amount of energy per gram varies: carbohydrates and proteins have 4 calories per gram, while fats have 9 calories per gram. (Source: https://www.msdmanuals.com/home/disorders-of-nutrition/overview-of-nutrition/carbohydrates,-proteins,-and-fats)
Based on this, we expect fat to have the biggest impact.
Calories: dependent variable
Category, Fat, Sugars, Proteins: independent variables
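As a worked example of the energy arithmetic behind this expectation, consider the Whopper® Sandwich from the table above (40 g fat, 11 g sugars, 28 g proteins); a minimal sketch:
9 * 40 + 4 * 11 + 4 * 28 #energy from fat, sugars and proteins: 360 + 44 + 112 = 516 kcal
The result (516 kcal) is below the listed 660 kcal, presumably because sugars are only one part of total carbohydrates, which our selected columns do not include in full. Still, fat contributes by far the most energy per gram, which is why we expect it to dominate.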
mydata$CategoryFactor <- factor(mydata$Category,
                                levels = c("Burgers", "Chicken", "Breakfast"),
                                labels = c("Burgers", "Chicken", "Breakfast")) #Data manipulation
head(mydata) #Data manipulation
## Item Category Calories Fat Sugars Proteins CategoryFactor
## 1 Whopper® Sandwich Burgers 660 40 11 28 Burgers
## 2 Whopper® Sandwich with Cheese Burgers 740 46 11 32 Burgers
## 3 Bacon & Cheese Whopper® Sandwich Burgers 790 51 11 35 Burgers
## 4 Double Whopper® Sandwich Burgers 900 58 11 48 Burgers
## 5 Double Whopper® Sandwich with Cheese Burgers 980 64 11 52 Burgers
## 6 Triple Whopper® Sandwich Burgers 1130 75 11 67 Burgers
library(car) #scatterplotMatrix() and vif() used below come from the car package
scatterplotMatrix(mydata[c(3, 4, 5, 6)], smooth = FALSE)
From the scatterplot matrix above we can observe a positive relationship among all pairs of variables. The assumption of linearity is also not violated, because we do not see any curves.
myreg <- lm(data = mydata, Calories ~ CategoryFactor + Fat + Proteins + Sugars)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[c(3, 4, 5, 6)]))
## Calories Fat Sugars Proteins
## Calories 1.00 0.98 0.44 0.91
## Fat 0.98 1.00 0.30 0.91
## Sugars 0.44 0.30 1.00 0.30
## Proteins 0.91 0.91 0.30 1.00
##
## n= 77
##
##
## P
## Calories Fat Sugars Proteins
## Calories 0.0000 0.0000 0.0000
## Fat 0.0000 0.0083 0.0000
## Sugars 0.0000 0.0083 0.0072
## Proteins 0.0000 0.0000 0.0072
From the matrix above we can say that the variable Fat has the strongest correlation with Calories, since its r value (the Pearson correlation coefficient) is the highest (0.98). The coefficient r is a number between -1 and 1 that measures the strength and direction of the relationship between two variables.
All p-values are low (< 5%), meaning every pairwise correlation is statistically significant.
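The strongest pairwise correlation can be cross-checked in base R with cor.test(), which tests a single pair at a time (a sketch, not part of the original output):
cor.test(mydata$Calories, mydata$Fat) #should reproduce r = 0.98 with p-value < 0.001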
vif(myreg) # Multicollinearity:
## GVIF Df GVIF^(1/(2*Df))
## CategoryFactor 2.296206 2 1.230985
## Fat 8.630276 1 2.937733
## Proteins 11.122781 1 3.335083
## Sugars 1.183032 1 1.087673
mean(vif(myreg))
## [1] 3.068647
Based on the VIF test above, we can conclude that our model has a problem with multicollinearity, because the GVIF values for the variables Fat and Proteins are above 5, which is the usual cut-off for this test. For a categorical variable with more than two categories (Df > 1), we should instead compare the last column, GVIF^(1/(2*Df)), with √5 ≈ 2.24. So in our case there is no multicollinearity problem with the variable Category, since its value (1.23) is below 2.24. The average VIF value is 3.068. We will remove the variable Proteins, since its value is the highest.
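A minimal sketch of this decision rule, using the column names from car's vif() output above, would flag exactly the variables whose adjusted GVIF exceeds the cut-off:
gvif <- vif(myreg)
sqrt(5) #cut-off for the GVIF^(1/(2*Df)) column: approximately 2.236
gvif[gvif[, "GVIF^(1/(2*Df))"] > sqrt(5), ] #should flag Fat and Proteins only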
mydata <- mydata[, -6] #Removing the variable Proteins (column six)
myreg <- lm(data = mydata, Calories ~ CategoryFactor + Fat + Sugars)
vif(myreg) # Checking Multicolinearity after removing variable
## GVIF Df GVIF^(1/(2*Df))
## CategoryFactor 1.203170 2 1.047326
## Fat 1.229530 1 1.108842
## Sugars 1.181629 1 1.087028
mean(vif(myreg))
## [1] 1.206392
From the results above we can see that after removing the variable Proteins we no longer have a problem with multicollinearity, so there is no strong relationship among the explanatory variables anymore. The values drop closer to 1 and they are all < 5.
mydata$StdResid <- round(rstandard(myreg), 3) #Standardized residuals of myreg, rounded to 3 decimals: how far each unit is from the regression line (units outside ±3 are outliers and need to be removed)
mydata$CooksD <- round(cooks.distance(myreg), 3) #With Cook's distance we find units with high impact.
hist(mydata$StdResid, #We check if std. residuals are normally distributed.
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
From the histogram above we can see an outlier, since the values of the standardized residuals are not all between -3 and 3. The outlier is at approximately 4.
shapiro.test(mydata$StdResid) #formal test to see if distribution of errors is normal (assumption of normality of errors based on residuals).
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95122, p-value = 0.005032
H0: Errors are normally distributed.
H1: Errors are NOT normally distributed.
We can reject H0 with p-value 0.005 (< 5%) -> errors are not normally distributed.
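As an additional visual check of the same assumption (a sketch, not part of the original analysis), we could draw a normal Q-Q plot of the standardized residuals:
qqnorm(mydata$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata$StdResid) #points should follow this line if errors are normally distributed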
hist(mydata$CooksD, #With this histogram we can find units with high impact
main = "Histogram of Cooks distances",
xlab = "Cooks distance",
ylab = "Frequency")
If a Cook's distance is above 1, the unit must be removed, but here all values are below 1. Units whose distances are significantly larger than the others (a visible gap in the histogram) also need to be removed.
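A minimal sketch of this rule, listing any unit whose Cook's distance exceeds the hard cut-off of 1 (here the result should be empty):
mydata[mydata$CooksD > 1, ] #units above the cut-off; none in our data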
tail(mydata[order(mydata$StdResid),], 6)
## Item Category Calories Fat Sugars CategoryFactor StdResid CooksD
## 45 CROISSAN’WICH® Egg & Cheese Breakfast 340 18 4 Breakfast 1.337 0.012
## 65 Hash Browns – large Breakfast 670 44 0 Breakfast 1.427 0.031
## 47 CROISSAN’WICH® Ham, Egg & Cheese Breakfast 370 19 5 Breakfast 1.567 0.016
## 11 Single Quarter Pound King Sandwich Burgers 580 29 10 Burgers 1.848 0.031
## 33 Chicken Nuggets- 20pc Chicken 860 54 1 Chicken 1.864 0.060
## 59 EGG-NORMOUS BURRITO™ Breakfast 780 42 4 Breakfast 4.238 0.185
From the table above we can identify our outlier. This is item 59 (std.res = 4.238). We will remove it.
head(mydata[order(-mydata$CooksD),], 6)
## Item Category Calories Fat Sugars CategoryFactor StdResid CooksD
## 59 EGG-NORMOUS BURRITO™ Breakfast 780 42 4 Breakfast 4.238 0.185
## 10 Cheddar Bacon King Sandwich Burgers 1190 84 11 Burgers -2.220 0.123
## 62 Pancake and Sausage platter Breakfast 610 31 30 Breakfast -1.384 0.092
## 33 Chicken Nuggets- 20pc Chicken 860 54 1 Chicken 1.864 0.060
## 74 Ranch Dipping Sauce (1 oz) Breakfast 140 15 1 Breakfast -2.221 0.042
## 60 BK™ Ultimate Breakfast Platter Breakfast 930 44 40 Breakfast 0.604 0.040
From the table above we can identify the units with high impact: items 59 (CooksD = 0.185), 10 (CooksD = 0.123) and 62 (CooksD = 0.092). We will remove them.
mydata <- mydata[c(-10, -59, -62), ] #Removing the outlier and units with high impact
myreg <- lm(data = mydata, Calories ~ CategoryFactor + Fat + Sugars) #Re-estimation of the model after removal
mydata$StdResid <- round(rstandard(myreg), 3) #Recompute standardized residuals from the re-estimated model, so the plot below does not use stale values
mydata$StdFittedValues <- scale(myreg$fitted.values) #Homoskedasticity
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
From the scatterplot above we can see that the standardized residuals are random (without any pattern), so there is no problem with heteroskedasticity, since the variance is constant. We will still run a formal test.
In addition, we can see that the assumption of linearity is met, because there is no curve which would indicate a pattern in the sample data. (We also observed this in the scatterplot matrix.)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(myreg) #Formal test for checking heteroskedasticity
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ------------------------------------
## Response : Calories
## Variables: fitted values of Calories
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.3996987
## Prob > Chi2 = 0.5272449
We cannot reject H0 (p = 0.527 > 5%). So variance is constant and there is no heteroskedasticity. Assumption of homoskedasticity is therefore NOT violated.
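The same hypothesis can be cross-checked with bptest() from the lmtest package (a sketch; the package is not used in the original analysis, and its studentized version gives slightly different numbers, but the conclusion should match):
library(lmtest)
bptest(myreg) #should likewise fail to reject H0 of constant variance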
summary(myreg)
##
## Call:
## lm(formula = Calories ~ CategoryFactor + Fat + Sugars, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.502 -21.500 1.696 18.845 72.131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0711 10.3479 3.486 0.000857 ***
## CategoryFactorChicken 3.2635 10.4958 0.311 0.756789
## CategoryFactorBreakfast -31.2670 9.2334 -3.386 0.001173 **
## Fat 13.7138 0.2126 64.505 < 2e-16 ***
## Sugars 7.9911 0.6413 12.461 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.62 on 69 degrees of freedom
## Multiple R-squared: 0.9889, Adjusted R-squared: 0.9883
## F-statistic: 1536 on 4 and 69 DF, p-value: < 2.2e-16
Explaining coefficients:
The number 36.07 represents the intercept; the other estimates are slopes, and together they form the regression equation: predicted Calories = 36.07 + 3.26·Chicken - 31.27·Breakfast + 13.71·Fat + 7.99·Sugars.
Category Chicken: since the p-value is > 5%, we did not find a statistically significant difference in calories between chicken items and burgers.
Category Breakfast: given the values of all other variables, breakfast items have on average 31.27 fewer calories than burgers (the base for comparison is Burgers), with p = 0.001.
If fat is increased by 1 g, the number of calories on average increases by 13.71 kcal, assuming everything else remains unchanged (statistically significant effect with p < 0.001).
If sugars are increased by 1 g, the number of calories on average increases by 7.99 kcal, assuming everything else remains unchanged (statistically significant effect with p < 0.001).
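To see the coefficients in action, we can predict the calories of the first item, the Whopper® Sandwich (a burger with 40 g of fat and 11 g of sugars); a minimal sketch with predict():
predict(myreg, newdata = data.frame(CategoryFactor = "Burgers", Fat = 40, Sugars = 11))
#36.07 + 13.71 * 40 + 7.99 * 11 = approximately 673 kcal, close to the listed 660 kcal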
Explaining F-statistic (ANOVA):
H0: ρ² (rho squared) = 0
H1: ρ² > 0
We can reject H0 at p < 0.001 -> we found a relationship between the dependent variable and at least one explanatory variable. The model is good.
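The reported p-value can be recovered from the F statistic and its degrees of freedom shown in the summary above (a sketch):
pf(1536, df1 = 4, df2 = 69, lower.tail = FALSE) #p-value of the overall F-test, < 2.2e-16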
Multiple R-squared = 0.9889 -> 98.89% of the variability of the dependent variable (Calories) is explained by the linear effect of the 3 independent variables (Fat, Sugars and Category).
sqrt(summary(myreg)$r.squared) # Multiple coefficient of correlation
## [1] 0.994433
The linear relationship between Calories and the 3 explanatory variables is very strong (between 0.9 and 1).