Building Regression Models

Pizza Ratings

Manufacturers of frozen foods often reformulate their products to maintain and increase customer satisfaction and sales. So they pay particular attention to evaluations of their products in comparison to their competitors’ products. Frozen pizzas are a major sector of the frozen food market, accounting for an average of $3.1 billion in sales in the five-year period from 2008 to 2012. The prestigious Consumer’s Union rated frozen pizzas for flavor and quality, assigning an overall score to each brand tested. Let’s build a regression model to predict the Consumer’s Union score from Calories, Type (1 = cheese, 0 = pepperoni), and Fat content.

My data

str(x)

## 'data.frame':    29 obs. of  6 variables:
##  $ Brand   : Factor w/ 29 levels "Amy's organic",..: 8 7 4 1 19 25 12 23 15 2 ...
##  $ Score   : int  89 86 85 81 80 79 75 75 73 67 ...
##  $ Cost    : num  0.98 1.23 0.94 1.92 0.84 0.96 0.8 0.96 0.91 0.89 ...
##  $ Calories: int  364 334 332 341 307 335 292 364 384 333 ...
##  $ Fat     : int  15 11 12 14 9 12 9 18 20 12 ...
##  $ Type    : int  1 1 1 1 1 1 1 1 1 1 ...

Let’s regress Scores on Calories, Fat, and Type

m <- lm(Score ~ Calories + Type + Fat, data = x)
summary(m)

## 
## Call:
## lm(formula = Score ~ Calories + Type + Fat, data = x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40.63  -7.75   3.95  15.29  26.24 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -148.8173    77.9854  -1.908   0.0679 .
## Calories       0.7430     0.3066   2.424   0.0229 *
## Type          15.6344     8.1033   1.929   0.0651 .
## Fat           -3.8914     2.1381  -1.820   0.0807 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.79 on 25 degrees of freedom
## Multiple R-squared:  0.2873, Adjusted R-squared:  0.2018 
## F-statistic:  3.36 on 3 and 25 DF,  p-value: 0.03464

\[\hat {Scores} = -148.8173 + 0.7430(Calories) + 15.6344(Type) - 3.8914(Fat)\] The coefficient of Type is 15.63, which means cheese pizzas (Type = 1) on average scored about 15.6 points higher than pepperoni pizzas, after allowing for the effects of calories and fat.

The p-value of the t-test of the coefficient for Type is 0.0651. The statistical significance depends on $\alpha$: if $\alpha \ge 0.1$ we reject $H_{0}$ and it’s significant; if $\alpha \le 0.5$ we fail to reject $H_{0}$ and it’s insignificant.

Unusual observations?

plot(m$fitted.values, m$residuals, xlab = 'Predicted', ylab = 'Residuals')
abline(0,0)

Yes, we do see a couple of unusual points in the bottom right corner of the residual plot. They are Reggio and Michelina. Both pizzas are predicted to have high scores. But both have scores more than 30 points lower than predicted.

idx1 <- order(m$fitted.values, decreasing = TRUE)[1] #extracting unusual points
idx2 <- order(m$fitted.values, decreasing = TRUE)[2]
x[c(idx1, idx2), ]

##          Brand Score Cost Calories Fat Type
## 12    Reggio's    55 1.02      367  13    1
## 16 Michelina's    45 1.28      394  19    1

rstandard(m)[c(idx1, idx2)]

##        12        16 
## -2.047205 -2.281745

hatvalues(m)[c(idx1, idx2)]

##        12        16 
## 0.2990986 0.1903315

cooks.distance(m)[c(idx1, idx2)]

##        12        16 
## 0.4471159 0.3059690

dffits(m)[c(idx1, idx2)]

##        12        16 
## -1.436219 -1.218180

Their standardized residuals are -2.047 and -2.282, leverages are 0.299 and 0.19, Cook’s distances are 0.447 and 0.306, and DFFITS measures are -1.436 and -1.218. From the boxplots of Cook’s distances and DFFITS measures, we see that both Reggio and Michelina pizzas stick out like sour thumbs from the other values.

par(mfrow = c(1,2))
boxplot(cooks.distance(m))
boxplot(dffits(m))

Therefore, they are both influential. From the boxplots, we see there are three influential points. Besides Reggio and Michelina pizzas, the Healthy Choice pepperoni pizza is also an influential case.

idx.cook1 <- order(cooks.distance(m), decreasing = TRUE)[1]
idx.cook2 <- order(cooks.distance(m), decreasing = TRUE)[2]
idx.cook3 <- order(cooks.distance(m), decreasing = TRUE)[3]
x[c(idx.cook1, idx.cook2, idx.cook3),]

##                       Brand Score Cost Calories Fat Type
## 29 Healthy Choice pepperoni    15 1.62      280   4    0
## 12                 Reggio's    55 1.02      367  13    1
## 16              Michelina's    45 1.28      394  19    1

idx.dfit1 <- order(dffits(m))[1]
idx.dfit2 <- order(dffits(m))[2]
idx.dfit3 <- order(dffits(m))[3]
x[c(idx.dfit1, idx.dfit2, idx.dfit3),]

##                       Brand Score Cost Calories Fat Type
## 12                 Reggio's    55 1.02      367  13    1
## 29 Healthy Choice pepperoni    15 1.62      280   4    0
## 16              Michelina's    45 1.28      394  19    1

Let’s remove all the influential cases from the dataset and refit the multiple regression model. Also, we will compare the new model to the old one based on their summaries in addition to checking the assumptions for the new model.

x_new <- x[-c(idx.dfit1, idx.dfit2, idx.dfit3),]
m2 <- lm(Score ~ Calories + Fat + Type, data = x_new)
summary(m2)

## 
## Call:
## lm(formula = Score ~ Calories + Fat + Type, data = x_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.878  -8.372  -2.298   7.960  31.503 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -351.9436    65.4809  -5.375 2.14e-05 ***
## Calories       1.5951     0.2559   6.234 2.84e-06 ***
## Fat           -9.8278     1.7557  -5.598 1.26e-05 ***
## Type          18.1209     6.3100   2.872  0.00886 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 22 degrees of freedom
## Multiple R-squared:  0.6639, Adjusted R-squared:  0.6181 
## F-statistic: 14.48 on 3 and 22 DF,  p-value: 1.99e-05

The new model seems to be much better than the old one. It has a higher $R^2$, a more significant overall F test, and more significant t-tests.

plot(m2$fitted.values, m2$residuals, xlab = "predicted", ylab = "residuals")
abline(0,0)

par(mfrow = c(1, 2))
hist(m2$residuals, main = "", xlab = 'Residuals')
qqnorm(m2$residual)
qqline(m2$residuals)

Does there exist any serious collinearity?

Yes, there exists a serious collinearity in the model because the VIF measures for Calories and Fat are both greater than 10. It’s because those two variables are highly correlated (r = 0.94603).

library(car)

## Warning: package 'car' was built under R version 3.6.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 3.6.1

vif(m2)

## Calories      Fat     Type 
## 12.52500 12.37517  1.48502

cor(x$Calories, x$Fat)

## [1] 0.94603

We now use the full dataset. Do we need to consider the interaction between Calories and Type?

we should consider the interaction between Calories and Type. It is because the cheese pizzas (Type = 1) and the pepperoni pizzas (Type = 0) have quite different slopes (0.34 vs 0.014) when regress Score on each of them separately.

The coefficient of interaction term is -0.65. It means that holding Fat fixed the difference in scores between cheese pizzas and pepperoni pizzas decreases by 0.65 points when Calories increases by 1 unit.

library(lme4)

## Warning: package 'lme4' was built under R version 3.6.3

## Loading required package: Matrix

## Warning: package 'Matrix' was built under R version 3.6.2

## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car

summary(lmList(Score ~ Calories | Type, data = x))

## Call:
##   Model: Score ~ Calories | NULL 
##    Data: x 
## 
## Coefficients:
##    (Intercept) 
##    Estimate Std. Error   t value  Pr(>|t|)
## 0 -70.52839   50.66690 -1.392001 0.1761829
## 1  61.16563   61.10983  1.000913 0.3264589
##    Calories 
##     Estimate Std. Error    t value  Pr(>|t|)
## 0 0.34015474  0.1372922 2.47759743 0.0203390
## 1 0.01425078  0.1795580 0.07936587 0.9373735
## 
## Residual standard error: 20.23601 on 25 degrees of freedom

summary(lm(Score ~ Calories + Type + Fat + Type*Calories, data = x))

## 
## Call:
## lm(formula = Score ~ Calories + Type + Fat + Type * Calories, 
##     data = x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.148 -12.860   5.229  13.291  21.158 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -329.4968    90.5332  -3.640 0.001303 ** 
## Calories         1.3787     0.3400   4.055 0.000458 ***
## Type           242.2710    75.5779   3.206 0.003789 ** 
## Fat             -6.8262     2.0988  -3.252 0.003383 ** 
## Calories:Type   -0.6536     0.2170  -3.012 0.006034 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.21 on 24 degrees of freedom
## Multiple R-squared:  0.4828, Adjusted R-squared:  0.3966 
## F-statistic: 5.601 on 4 and 24 DF,  p-value: 0.002489

Building Regression Models

Christian Zuna Largo

6/4/2020

Pizza Ratings