candy<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv",
header=TRUE)
str(candy)
## 'data.frame': 85 obs. of 13 variables:
## $ competitorname : Factor w/ 85 levels "100 Grand","3 Musketeers",..: 1 2 45 46 3 4 5 6 7 8 ...
## $ chocolate : int 1 1 0 0 0 1 1 0 0 0 ...
## $ fruity : int 0 0 0 0 1 0 0 0 0 1 ...
## $ caramel : int 1 0 0 0 0 0 1 0 0 1 ...
## $ peanutyalmondy : int 0 0 0 0 0 1 1 1 0 0 ...
## $ nougat : int 0 1 0 0 0 0 1 0 0 0 ...
## $ crispedricewafer: int 1 0 0 0 0 0 0 0 0 0 ...
## $ hard : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bar : int 1 1 0 0 0 1 1 0 0 0 ...
## $ pluribus : int 0 0 0 0 0 0 0 1 1 0 ...
## $ sugarpercent : num 0.732 0.604 0.011 0.011 0.906 ...
## $ pricepercent : num 0.86 0.511 0.116 0.511 0.511 ...
## $ winpercent : num 67 67.6 32.3 46.1 52.3 ...
ggplot(candy, aes(x=sugarpercent, y=winpercent))+
geom_point()+
ggtitle("Sugar percent vs. Win percent")+
theme_minimal()
Describe scatterplot: The scatterplot above has a somewhat positive direction. While it could be described as linear, the strength is very weak and as such, potential outliers are unclear (though the point at the top left of the graph appears to be one).
SLR for winpercent as a function of sugarpercent:
mod<- lm(winpercent ~ sugarpercent, data=candy)
summary(mod)
##
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.924 -11.066 -1.168 9.252 36.851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.609 3.086 14.455 <2e-16 ***
## sugarpercent 11.924 5.560 2.145 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared: 0.05251, Adjusted R-squared: 0.04109
## F-statistic: 4.6 on 1 and 83 DF, p-value: 0.0349
The slope of 11.924 tells us that when there is a one unit increase in the percentage of sugar in a candy, we expect an 11.924 percent increase in the win percent on average.
The p-value for this model is 0.0349, so we can determine that the relationship between sugarpercent and winpercent is statistically significant.
Add least squares line to scatterplot:
ggplot(candy, aes(x=sugarpercent, y=winpercent))+
geom_point()+
theme_minimal()+
ggtitle("Sugar percent vs. Win percent")+
geom_smooth(method='lm', se=FALSE, col="light blue")
## `geom_smooth()` using formula 'y ~ x'
ggplot(mod, aes(x=mod$fitted, y=mod$residuals))+
geom_point()+
theme_minimal()+
ggtitle("Residual plot")+
geom_abline(slope=0,intercept=0, lty=2, lwd=1,color="pink")+
geom_smooth(se=FALSE, color="gray")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We observe that the points appear to be spread equally above and below y=0 (which models the mean zero assumption) and as fitted values increase, suggesting homoskedasticity. By adding the geom_smooth line (using LOESS), we investigate the potential of non-linearity and observe that our data appears to be generally linear.
qqnorm(mod$residuals)
qqline(mod$residuals)
Observe that our points generally lie on the straight line. Thus, our assumption of normality is appropriate.
mod2<- lm(winpercent~sugarpercent+chocolate, data=candy)
summary(mod2)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.6981 -8.5153 -0.4489 7.7602 27.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.262 2.552 14.990 < 2e-16 ***
## sugarpercent 8.567 4.355 1.967 0.0525 .
## chocolate 18.273 2.469 7.401 1.06e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared: 0.432, Adjusted R-squared: 0.4181
## F-statistic: 31.18 on 2 and 82 DF, p-value: 8.5e-11
The slope of 18.273 for the predictor “chocolate” tells us that when a candy has chocolate in it, we expect an 18.273 percent increase in the win percent on average.
If a candy has chocolate in it (represented by a value of 1), then the regression equation is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 38.262 + 8.567 x_1 + 18.273 (1) = 56.535 + 8.567 x_1\) where \(x_1\) represents sugarpercent and \(y\) represents winpercent. If a candy does not have chocolate in it (represented by a value of 0) then the equation is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 38.262 + 8.567 x_1 + 18.273 (0) = 38.262 + 8.567 x_1\) where \(x_1\) represents sugarpercent and \(y\) represents winpercent.
Add parallel least squares lines to scatter plot:
ggplot(candy, aes(x=sugarpercent, y=winpercent, color=as.factor(chocolate)))+
geom_point()+
theme_minimal()+
labs(color="Chocolate (0 = no, 1=yes)")+
geom_abline(slope=mod2$coefficients[2], intercept=mod2$coefficients[1], col="pink")+
geom_abline(slope=mod2$coefficients[2], intercept=mod2$coefficients[1]+mod2$coefficients[3], col="light blue")
mod3<- lm(winpercent~sugarpercent*chocolate, data=candy)
summary(mod3)
##
## Call:
## lm(formula = winpercent ~ sugarpercent * chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4007 -8.0463 -0.7059 6.2815 28.5003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.778 2.878 13.820 <2e-16 ***
## sugarpercent 5.221 5.257 0.993 0.3236
## chocolate 13.051 5.230 2.495 0.0146 *
## sugarpercent:chocolate 10.586 9.350 1.132 0.2609
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.21 on 81 degrees of freedom
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.4201
## F-statistic: 21.28 on 3 and 81 DF, p-value: 2.916e-10
We can interpret the coefficient estimate for sugarpercent:chocolate to be the increase in sugarpercent if a candy is chocolate. Thus, for chocolate candies we expect a 10.586% increase in sugar content compared to non-chocolate candies.
When a candy is not chocolate (represented by a value of 0), we have \(y = 39.778 + 5.221x_1 + 13.051(0) + 10.586x_1 (0) = 39.778 + 5.221x_1\) where \(y\) represents winpercent and \(x_1\) represents sugarpercent. When a candy has chocolate (represented by a value of 1), we have \(y = 39.778 + 5.221x_1 + 13.051(1) + 10.586x_1 (1) = 52.829 + 15.807x_1\) where \(y\) represents winpercent and \(x_1\) represents sugarpercent.
Add lines:
ggplot(candy, aes(x=sugarpercent, y=winpercent, color=as.factor(chocolate)))+
geom_point()+
theme_minimal()+
labs(color="Chocolate (0 = no, 1=yes)")+
geom_abline(slope = mod3$coefficients[2], intercept = mod3$coefficients[1], col="pink")+
geom_abline(slope = mod3$coefficients[2]+mod3$coefficients[4], intercept = mod3$coefficients[1]+mod3$coefficients[3], col="light blue")
mod4<- lm(winpercent ~ chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer +hard+bar+pluribus+sugarpercent+pricepercent, data=candy)
summary(mod4)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + peanutyalmondy +
## nougat + crispedricewafer + hard + bar + pluribus + sugarpercent +
## pricepercent, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2244 -6.6247 0.1986 6.8420 23.8680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5340 4.3199 7.994 1.44e-11 ***
## chocolate 19.7481 3.8987 5.065 2.96e-06 ***
## fruity 9.4223 3.7630 2.504 0.01452 *
## caramel 2.2245 3.6574 0.608 0.54493
## peanutyalmondy 10.0707 3.6158 2.785 0.00681 **
## nougat 0.8043 5.7164 0.141 0.88849
## crispedricewafer 8.9190 5.2679 1.693 0.09470 .
## hard -6.1653 3.4551 -1.784 0.07852 .
## bar 0.4415 5.0611 0.087 0.93072
## pluribus -0.8545 3.0401 -0.281 0.77945
## sugarpercent 9.0868 4.6595 1.950 0.05500 .
## pricepercent -5.9284 5.5132 -1.075 0.28578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.7 on 73 degrees of freedom
## Multiple R-squared: 0.5402, Adjusted R-squared: 0.4709
## F-statistic: 7.797 on 11 and 73 DF, p-value: 9.504e-09
We see that chocolate, fruity, and peanutyalmondy have p values of less than 0.05 and thus are significant at the 0.05 significance level. The following is a reduced model to include only these predictors:
reducedmod4<-lm(winpercent ~ chocolate+fruity+peanutyalmondy, data=candy)
summary(reducedmod4)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0497 -7.3084 -0.4523 7.9446 23.8712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.788 3.237 11.057 < 2e-16 ***
## chocolate 21.983 3.599 6.108 3.34e-08 ***
## fruity 7.753 3.625 2.139 0.0354 *
## peanutyalmondy 9.066 3.520 2.576 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4475
## F-statistic: 23.68 on 3 and 81 DF, p-value: 4.209e-11
Since we are having technical difficulties with the leaps package including 95 variables instead of the 13 we consider in this data set, let us try doing it “by hand.” We start with the predictor chocolate because it has the lowest p value in our saturated model and then add by next lowest p value:
modfwd1<- lm(winpercent~chocolate, data=candy)
summary(modfwd1)
##
## Call:
## lm(formula = winpercent ~ chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.1995 -7.5633 -0.2379 8.5623 24.8954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.142 1.648 25.574 < 2e-16 ***
## chocolate 18.779 2.498 7.519 5.86e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.42 on 83 degrees of freedom
## Multiple R-squared: 0.4052, Adjusted R-squared: 0.398
## F-statistic: 56.53 on 1 and 83 DF, p-value: 5.86e-11
modfwd2<- lm(winpercent~chocolate+peanutyalmondy, data=candy)
summary(modfwd2)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.0296 -7.6657 0.0797 7.3629 25.2130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.825 1.619 25.828 < 2e-16 ***
## chocolate 16.625 2.640 6.297 1.42e-08 ***
## peanutyalmondy 7.623 3.529 2.160 0.0337 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.17 on 82 degrees of freedom
## Multiple R-squared: 0.4372, Adjusted R-squared: 0.4235
## F-statistic: 31.85 on 2 and 82 DF, p-value: 5.822e-11
modfwd3<- lm(winpercent~chocolate+peanutyalmondy+fruity, data=candy)
summary(modfwd3)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0497 -7.3084 -0.4523 7.9446 23.8712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.788 3.237 11.057 < 2e-16 ***
## chocolate 21.983 3.599 6.108 3.34e-08 ***
## peanutyalmondy 9.066 3.520 2.576 0.0118 *
## fruity 7.753 3.625 2.139 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4475
## F-statistic: 23.68 on 3 and 81 DF, p-value: 4.209e-11
modfwd4<- lm(winpercent~chocolate+peanutyalmondy+fruity+sugarpercent, data=candy)
summary(modfwd4)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity +
## sugarpercent, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.0204 -7.9139 -0.0392 7.4155 25.8389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.803 3.612 9.081 6.09e-14 ***
## chocolate 21.301 3.574 5.961 6.45e-08 ***
## peanutyalmondy 8.657 3.482 2.486 0.0150 *
## fruity 7.271 3.588 2.026 0.0461 *
## sugarpercent 7.449 4.206 1.771 0.0804 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.8 on 80 degrees of freedom
## Multiple R-squared: 0.4874, Adjusted R-squared: 0.4617
## F-statistic: 19.01 on 4 and 80 DF, p-value: 5.056e-11
# modfwd4 took it too far, so let us stick with modfwd3
summary(modfwd3)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0497 -7.3084 -0.4523 7.9446 23.8712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.788 3.237 11.057 < 2e-16 ***
## chocolate 21.983 3.599 6.108 3.34e-08 ***
## peanutyalmondy 9.066 3.520 2.576 0.0118 *
## fruity 7.753 3.625 2.139 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4475
## F-statistic: 23.68 on 3 and 81 DF, p-value: 4.209e-11
Because the predictor sugarpercent is not significant at the 0.05 level, we will not include it in our model. Thus, through forward stepwise selection, we have attained the model with winpercent as a function of chocolate, peanutyalmondy, and fruity.
anova(modfwd3)
## Analysis of Variance Table
##
## Response: winpercent
## Df Sum Sq Mean Sq F value Pr(>F)
## chocolate 1 7368.5 7368.5 61.6029 1.491e-11 ***
## peanutyalmondy 1 582.5 582.5 4.8700 0.03016 *
## fruity 1 547.3 547.3 4.5754 0.03545 *
## Residuals 81 9688.7 119.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(reducedmod4)
## Analysis of Variance Table
##
## Response: winpercent
## Df Sum Sq Mean Sq F value Pr(>F)
## chocolate 1 7368.5 7368.5 61.6029 1.491e-11 ***
## fruity 1 336.1 336.1 2.8100 0.09753 .
## peanutyalmondy 1 793.7 793.7 6.6353 0.01181 *
## Residuals 81 9688.7 119.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(reducedmod4)
## 2.5 % 97.5 %
## (Intercept) 29.3481409 42.22839
## chocolate 14.8220911 29.14475
## fruity 0.5412327 14.96470
## peanutyalmondy 2.0632117 16.06867
indicatormod<- lm(winpercent~chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus, data=candy)
summary(indicatormod)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + peanutyalmondy +
## nougat + crispedricewafer + hard + bar + pluribus, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.6779 -5.6765 0.3966 7.0583 21.9144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.0155 4.0781 8.586 9.13e-13 ***
## chocolate 19.9058 3.8975 5.107 2.41e-06 ***
## fruity 10.2677 3.7887 2.710 0.00833 **
## caramel 3.3843 3.6034 0.939 0.35065
## peanutyalmondy 10.1410 3.5949 2.821 0.00612 **
## nougat 2.4163 5.6897 0.425 0.67229
## crispedricewafer 8.9915 5.3279 1.688 0.09564 .
## hard -4.8726 3.4394 -1.417 0.16072
## bar -0.7220 4.8707 -0.148 0.88256
## pluribus -0.1599 3.0115 -0.053 0.95779
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.85 on 75 degrees of freedom
## Multiple R-squared: 0.5148, Adjusted R-squared: 0.4566
## F-statistic: 8.842 on 9 and 75 DF, p-value: 6.052e-09
Sour Patch watermelon would merit the following values in our binary variables: chocolate = 0, fruity = 1, caramel = 0, peanutyalmondy = 0, nougat = 0, crispedricewafer = 0, hard = 0, bar = 0, pluribus = 1. Thus, we attain the equation \[y_{(SP watermelon)} = 35.0155 + 10.2677(1) - 0.1599(1) = 45.1233\] Therefore, we would expect Sour Patch watermelon to win 45.12% of the time according to this model. We confirm this with R:
predict(indicatormod, data.frame(chocolate=0, fruity=1, caramel=0, peanutyalmondy=0, nougat=0, crispedricewafer=0, hard=0, bar=0, pluribus=1), type="response")
## 1
## 45.12327
Extra credit: use R to find this prediction interval
predict(indicatormod, data.frame(chocolate=0, fruity=1, caramel=0, peanutyalmondy=0, nougat=0, crispedricewafer=0, hard=0, bar=0, pluribus=1), interval="prediction")
## fit lwr upr
## 1 45.12327 23.06349 67.18306