Manufacturers of frozen foods often reformulate their products to maintain and increase customer satisfaction and sales. So they pay particular attention to evaluations of their products in comparison to their competitors’ products. Frozen pizzas are a major sector of the frozen food market, accounting for an average of $3.1 billion in sales in the five-year period from 2008 to 2012. The prestigious Consumer’s Union rated frozen pizzas for flavor and quality, assigning an overall score to each brand tested. Let’s build a regression model to predict the Consumer’s Union score from Calories, Type (1 = cheese, 0 = pepperoni), and Fat content.
My data
str(x)
## 'data.frame': 29 obs. of 6 variables:
## $ Brand : Factor w/ 29 levels "Amy's organic",..: 8 7 4 1 19 25 12 23 15 2 ...
## $ Score : int 89 86 85 81 80 79 75 75 73 67 ...
## $ Cost : num 0.98 1.23 0.94 1.92 0.84 0.96 0.8 0.96 0.91 0.89 ...
## $ Calories: int 364 334 332 341 307 335 292 364 384 333 ...
## $ Fat : int 15 11 12 14 9 12 9 18 20 12 ...
## $ Type : int 1 1 1 1 1 1 1 1 1 1 ...
Let’s regress Scores on Calories, Fat, and Type
m <- lm(Score ~ Calories + Type + Fat, data = x)
summary(m)
##
## Call:
## lm(formula = Score ~ Calories + Type + Fat, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.63 -7.75 3.95 15.29 26.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -148.8173 77.9854 -1.908 0.0679 .
## Calories 0.7430 0.3066 2.424 0.0229 *
## Type 15.6344 8.1033 1.929 0.0651 .
## Fat -3.8914 2.1381 -1.820 0.0807 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.79 on 25 degrees of freedom
## Multiple R-squared: 0.2873, Adjusted R-squared: 0.2018
## F-statistic: 3.36 on 3 and 25 DF, p-value: 0.03464
\[\hat {Scores} = -148.8173 + 0.7430(Calories) + 15.6344(Type) - 3.8914(Fat)\] The coefficient of Type is 15.63, which means cheese pizzas (Type = 1) on average scored about 15.6 points higher than pepperoni pizzas, after allowing for the effects of calories and fat.
The p-value of the t-test of the coefficient for Type is 0.0651. The statistical significance depends on \(\alpha\): if \(\alpha \ge 0.1\) we reject \(H_{0}\) and it’s significant; if \(\alpha \le 0.5\) we fail to reject \(H_{0}\) and it’s insignificant.
Unusual observations?
plot(m$fitted.values, m$residuals, xlab = 'Predicted', ylab = 'Residuals')
abline(0,0)
Yes, we do see a couple of unusual points in the bottom right corner of the residual plot. They are Reggio and Michelina. Both pizzas are predicted to have high scores. But both have scores more than 30 points lower than predicted.
idx1 <- order(m$fitted.values, decreasing = TRUE)[1] #extracting unusual points
idx2 <- order(m$fitted.values, decreasing = TRUE)[2]
x[c(idx1, idx2), ]
## Brand Score Cost Calories Fat Type
## 12 Reggio's 55 1.02 367 13 1
## 16 Michelina's 45 1.28 394 19 1
rstandard(m)[c(idx1, idx2)]
## 12 16
## -2.047205 -2.281745
hatvalues(m)[c(idx1, idx2)]
## 12 16
## 0.2990986 0.1903315
cooks.distance(m)[c(idx1, idx2)]
## 12 16
## 0.4471159 0.3059690
dffits(m)[c(idx1, idx2)]
## 12 16
## -1.436219 -1.218180
Their standardized residuals are -2.047 and -2.282, leverages are 0.299 and 0.19, Cook’s distances are 0.447 and 0.306, and DFFITS measures are -1.436 and -1.218. From the boxplots of Cook’s distances and DFFITS measures, we see that both Reggio and Michelina pizzas stick out like sour thumbs from the other values.
par(mfrow = c(1,2))
boxplot(cooks.distance(m))
boxplot(dffits(m))
Therefore, they are both influential. From the boxplots, we see there are three influential points. Besides Reggio and Michelina pizzas, the Healthy Choice pepperoni pizza is also an influential case.
idx.cook1 <- order(cooks.distance(m), decreasing = TRUE)[1]
idx.cook2 <- order(cooks.distance(m), decreasing = TRUE)[2]
idx.cook3 <- order(cooks.distance(m), decreasing = TRUE)[3]
x[c(idx.cook1, idx.cook2, idx.cook3),]
## Brand Score Cost Calories Fat Type
## 29 Healthy Choice pepperoni 15 1.62 280 4 0
## 12 Reggio's 55 1.02 367 13 1
## 16 Michelina's 45 1.28 394 19 1
idx.dfit1 <- order(dffits(m))[1]
idx.dfit2 <- order(dffits(m))[2]
idx.dfit3 <- order(dffits(m))[3]
x[c(idx.dfit1, idx.dfit2, idx.dfit3),]
## Brand Score Cost Calories Fat Type
## 12 Reggio's 55 1.02 367 13 1
## 29 Healthy Choice pepperoni 15 1.62 280 4 0
## 16 Michelina's 45 1.28 394 19 1
Let’s remove all the influential cases from the dataset and refit the multiple regression model. Also, we will compare the new model to the old one based on their summaries in addition to checking the assumptions for the new model.
x_new <- x[-c(idx.dfit1, idx.dfit2, idx.dfit3),]
m2 <- lm(Score ~ Calories + Fat + Type, data = x_new)
summary(m2)
##
## Call:
## lm(formula = Score ~ Calories + Fat + Type, data = x_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.878 -8.372 -2.298 7.960 31.503
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -351.9436 65.4809 -5.375 2.14e-05 ***
## Calories 1.5951 0.2559 6.234 2.84e-06 ***
## Fat -9.8278 1.7557 -5.598 1.26e-05 ***
## Type 18.1209 6.3100 2.872 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.04 on 22 degrees of freedom
## Multiple R-squared: 0.6639, Adjusted R-squared: 0.6181
## F-statistic: 14.48 on 3 and 22 DF, p-value: 1.99e-05
The new model seems to be much better than the old one. It has a higher \(R^2\), a more significant overall F test, and more significant t-tests.
plot(m2$fitted.values, m2$residuals, xlab = "predicted", ylab = "residuals")
abline(0,0)
par(mfrow = c(1, 2))
hist(m2$residuals, main = "", xlab = 'Residuals')
qqnorm(m2$residual)
qqline(m2$residuals)
Does there exist any serious collinearity?
Yes, there exists a serious collinearity in the model because the VIF measures for Calories and Fat are both greater than 10. It’s because those two variables are highly correlated (r = 0.94603).
library(car)
## Warning: package 'car' was built under R version 3.6.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.6.1
vif(m2)
## Calories Fat Type
## 12.52500 12.37517 1.48502
cor(x$Calories, x$Fat)
## [1] 0.94603
We now use the full dataset. Do we need to consider the interaction between Calories and Type?
we should consider the interaction between Calories and Type. It is because the cheese pizzas (Type = 1) and the pepperoni pizzas (Type = 0) have quite different slopes (0.34 vs 0.014) when regress Score on each of them separately.
The coefficient of interaction term is -0.65. It means that holding Fat fixed the difference in scores between cheese pizzas and pepperoni pizzas decreases by 0.65 points when Calories increases by 1 unit.
library(lme4)
## Warning: package 'lme4' was built under R version 3.6.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.6.2
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
summary(lmList(Score ~ Calories | Type, data = x))
## Call:
## Model: Score ~ Calories | NULL
## Data: x
##
## Coefficients:
## (Intercept)
## Estimate Std. Error t value Pr(>|t|)
## 0 -70.52839 50.66690 -1.392001 0.1761829
## 1 61.16563 61.10983 1.000913 0.3264589
## Calories
## Estimate Std. Error t value Pr(>|t|)
## 0 0.34015474 0.1372922 2.47759743 0.0203390
## 1 0.01425078 0.1795580 0.07936587 0.9373735
##
## Residual standard error: 20.23601 on 25 degrees of freedom
summary(lm(Score ~ Calories + Type + Fat + Type*Calories, data = x))
##
## Call:
## lm(formula = Score ~ Calories + Type + Fat + Type * Calories,
## data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.148 -12.860 5.229 13.291 21.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -329.4968 90.5332 -3.640 0.001303 **
## Calories 1.3787 0.3400 4.055 0.000458 ***
## Type 242.2710 75.5779 3.206 0.003789 **
## Fat -6.8262 2.0988 -3.252 0.003383 **
## Calories:Type -0.6536 0.2170 -3.012 0.006034 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.21 on 24 degrees of freedom
## Multiple R-squared: 0.4828, Adjusted R-squared: 0.3966
## F-statistic: 5.601 on 4 and 24 DF, p-value: 0.002489