candy<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv",header=TRUE)
str(candy)
## 'data.frame': 85 obs. of 13 variables:
## $ competitorname : chr "100 Grand" "3 Musketeers" "One dime" "One quarter" ...
## $ chocolate : int 1 1 0 0 0 1 1 0 0 0 ...
## $ fruity : int 0 0 0 0 1 0 0 0 0 1 ...
## $ caramel : int 1 0 0 0 0 0 1 0 0 1 ...
## $ peanutyalmondy : int 0 0 0 0 0 1 1 1 0 0 ...
## $ nougat : int 0 1 0 0 0 0 1 0 0 0 ...
## $ crispedricewafer: int 1 0 0 0 0 0 0 0 0 0 ...
## $ hard : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bar : int 1 1 0 0 0 1 1 0 0 0 ...
## $ pluribus : int 0 0 0 0 0 0 0 1 1 0 ...
## $ sugarpercent : num 0.732 0.604 0.011 0.011 0.906 ...
## $ pricepercent : num 0.86 0.511 0.116 0.511 0.511 ...
## $ winpercent : num 67 67.6 32.3 46.1 52.3 ...
ggplot(candy, aes(sugarpercent, winpercent)) +
geom_point() +
theme_light()
There doesn’t appear to be a strong correlation here. It seems slightly positive, possibly linear, with maybe one outlier with low sugar and high wins.
slrcandy <- lm(winpercent~sugarpercent, data=candy)
summary(slrcandy)
##
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.924 -11.066 -1.168 9.252 36.851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.609 3.086 14.455 <2e-16 ***
## sugarpercent 11.924 5.560 2.145 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared: 0.05251, Adjusted R-squared: 0.04109
## F-statistic: 4.6 on 1 and 83 DF, p-value: 0.0349
As the sugar increases by 1%, the percentage of wins increases by 11.92.
There is a significant relationship, because our predictor variable p value is less than .05.
ggplot(candy, aes(sugarpercent, winpercent)) +
geom_point() +
geom_abline(intercept = slrcandy$coefficients[1], slope= slrcandy$coefficients[2], color = "blue", lty = 1) +
theme_light()
ggplot(candy, aes(sugarpercent, slrcandy$residuals)) +
geom_point()+
geom_hline(yintercept=0, lty = 2, color = "blue") +
theme_light()
It seems like the mean of the residuals is 0, and there is no pattern in the distribution, so we have homoscedasticity.
qqnorm(slrcandy$residuals)
qqline(slrcandy$residuals)
The residuals don’t map on perfectly to the line, but it seems close enough to conclude our data is normal.
Fitting a model with 2 predictors:
candymod <- lm(winpercent ~ sugarpercent + chocolate, data = candy)
summary(candymod)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.6981 -8.5153 -0.4489 7.7602 27.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.262 2.552 14.990 < 2e-16 ***
## sugarpercent 8.567 4.355 1.967 0.0525 .
## chocolate 18.273 2.469 7.401 1.06e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared: 0.432, Adjusted R-squared: 0.4181
## F-statistic: 31.18 on 2 and 82 DF, p-value: 8.5e-11
If chocolate is present, the average percent of wins is 18.27% higher than candies without chocolate, when controlling for the percentage of sugar. This is a significant effect.
Model when chocolate is present: \[ y = 56.533 + 8.567x \] Model when chocolate is not present: \[ y = 38.262 + 8.567x \]
ggplot(candy, aes(sugarpercent, winpercent, color = as.factor(chocolate))) +
geom_point() +
geom_abline(intercept = candymod$coefficients[1], slope= candymod$coefficients[2],color = "red", lty = 1) +
geom_abline(intercept = candymod$coefficients[1]+candymod$coefficients[3], slope = candymod$coefficients[2],color = "blue", lty = 1) +
theme_light()
Fitting a model with 2 predictors and an interaction:
candymod2 <- lm(winpercent ~ sugarpercent + chocolate + sugarpercent*chocolate, data = candy)
summary(candymod2)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate + sugarpercent *
## chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4007 -8.0463 -0.7059 6.2815 28.5003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.778 2.878 13.820 <2e-16 ***
## sugarpercent 5.221 5.257 0.993 0.3236
## chocolate 13.051 5.230 2.495 0.0146 *
## sugarpercent:chocolate 10.586 9.350 1.132 0.2609
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.21 on 81 degrees of freedom
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.4201
## F-statistic: 21.28 on 3 and 81 DF, p-value: 2.916e-10
If chocolate is present, the association between percentage of sugar and wins is more dramatic- i.e. as percentage of sugar increases, the percentage of wins increases by 15.7 compared to when chocolate is absent. However this effect is not significant, since p = .26.
When chocolate is present: \[ y = 39.778 + 13.051 + (10.586 + 5.221)x \] \[ y = 52.829 + 15.807x \]
When chocolate is absent: \[39.778 + 5.221x \]
ggplot(candy, aes(sugarpercent, winpercent, color = as.factor(chocolate))) +
geom_point() +
geom_abline(intercept = (candymod2$coefficients[1]+candymod2$coefficients[3]), slope= (candymod2$coefficients[2]+candymod2$coefficients[4]),color = "blue", lty = 1) +
geom_abline(intercept = candymod$coefficients[1], slope = candymod$coefficients[2],color = "red", lty = 1) +
theme_light()
Fitting a fully saturated model.
saturatedmod <- lm(winpercent~chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + pricepercent, data = candy)
summary(saturatedmod)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + peanutyalmondy +
## nougat + crispedricewafer + hard + bar + pluribus + pricepercent,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5288 -5.9519 0.0209 7.0941 21.5661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.71037 4.35784 8.195 5.55e-12 ***
## chocolate 20.19265 3.96509 5.093 2.61e-06 ***
## fruity 10.25959 3.80857 2.694 0.00873 **
## caramel 3.61280 3.65473 0.989 0.32612
## peanutyalmondy 10.46296 3.67799 2.845 0.00574 **
## nougat 1.99093 5.79061 0.344 0.73196
## crispedricewafer 9.14263 5.36546 1.704 0.09258 .
## hard -4.90968 3.45832 -1.420 0.15990
## bar 0.03111 5.15156 0.006 0.99520
## pluribus 0.05223 3.06073 0.017 0.98643
## pricepercent -2.50284 5.32401 -0.470 0.63966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 74 degrees of freedom
## Multiple R-squared: 0.5163, Adjusted R-squared: 0.4509
## F-statistic: 7.897 on 10 and 74 DF, p-value: 1.704e-08
Now using only the significant (p<.05) predictors from the saturated model:
reducedmod <- lm(winpercent ~ chocolate + fruity + peanutyalmondy, data = candy)
summary(reducedmod)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0497 -7.3084 -0.4523 7.9446 23.8712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.788 3.237 11.057 < 2e-16 ***
## chocolate 21.983 3.599 6.108 3.34e-08 ***
## fruity 7.753 3.625 2.139 0.0354 *
## peanutyalmondy 9.066 3.520 2.576 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4475
## F-statistic: 23.68 on 3 and 81 DF, p-value: 4.209e-11
Using forward selection:
# install.packages(leaps)
library(leaps)
regfit.fwd <- regsubsets(winpercent ~chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + pricepercent, data = candy, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(winpercent ~ chocolate + fruity + caramel +
## peanutyalmondy + nougat + crispedricewafer + hard + bar +
## pluribus + pricepercent, data = candy, method = "forward")
## 10 Variables (and intercept)
## Forced in Forced out
## chocolate FALSE FALSE
## fruity FALSE FALSE
## caramel FALSE FALSE
## peanutyalmondy FALSE FALSE
## nougat FALSE FALSE
## crispedricewafer FALSE FALSE
## hard FALSE FALSE
## bar FALSE FALSE
## pluribus FALSE FALSE
## pricepercent FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard
## 1 ( 1 ) "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " "*" " " " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " " " "
## 4 ( 1 ) "*" "*" " " "*" " " "*" " "
## 5 ( 1 ) "*" "*" " " "*" " " "*" "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## bar pluribus pricepercent
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " "*"
## 8 ( 1 ) " " " " "*"
forwardmod <- lm(winpercent ~ chocolate + peanutyalmondy+ fruity + crispedricewafer + hard + caramel + pricepercent + nougat, data = candy)
summary(forwardmod)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity +
## crispedricewafer + hard + caramel + pricepercent + nougat,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5098 -5.9621 0.0165 7.1088 21.5797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.746 3.791 9.430 2.01e-14 ***
## chocolate 20.190 3.796 5.319 1.02e-06 ***
## peanutyalmondy 10.456 3.604 2.901 0.00486 **
## fruity 10.261 3.754 2.733 0.00780 **
## crispedricewafer 9.141 4.853 1.883 0.06347 .
## hard -4.921 3.352 -1.468 0.14622
## caramel 3.605 3.576 1.008 0.31658
## pricepercent -2.493 4.993 -0.499 0.61905
## nougat 1.991 4.760 0.418 0.67698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.76 on 76 degrees of freedom
## Multiple R-squared: 0.5163, Adjusted R-squared: 0.4653
## F-statistic: 10.14 on 8 and 76 DF, p-value: 1.635e-09
Variables stop being significant at crispedricewafer, so I would only include the variables above that.
Comparing the two models:
finalforwardmod <- lm(winpercent~chocolate + peanutyalmondy + fruity, data = candy)
summary(finalforwardmod)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0497 -7.3084 -0.4523 7.9446 23.8712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.788 3.237 11.057 < 2e-16 ***
## chocolate 21.983 3.599 6.108 3.34e-08 ***
## peanutyalmondy 9.066 3.520 2.576 0.0118 *
## fruity 7.753 3.625 2.139 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4475
## F-statistic: 23.68 on 3 and 81 DF, p-value: 4.209e-11
summary(reducedmod)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0497 -7.3084 -0.4523 7.9446 23.8712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.788 3.237 11.057 < 2e-16 ***
## chocolate 21.983 3.599 6.108 3.34e-08 ***
## fruity 7.753 3.625 2.139 0.0354 *
## peanutyalmondy 9.066 3.520 2.576 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4475
## F-statistic: 23.68 on 3 and 81 DF, p-value: 4.209e-11
anova(finalforwardmod)
## Analysis of Variance Table
##
## Response: winpercent
## Df Sum Sq Mean Sq F value Pr(>F)
## chocolate 1 7368.5 7368.5 61.6029 1.491e-11 ***
## peanutyalmondy 1 582.5 582.5 4.8700 0.03016 *
## fruity 1 547.3 547.3 4.5754 0.03545 *
## Residuals 81 9688.7 119.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(reducedmod)
## Analysis of Variance Table
##
## Response: winpercent
## Df Sum Sq Mean Sq F value Pr(>F)
## chocolate 1 7368.5 7368.5 61.6029 1.491e-11 ***
## fruity 1 336.1 336.1 2.8100 0.09753 .
## peanutyalmondy 1 793.7 793.7 6.6353 0.01181 *
## Residuals 81 9688.7 119.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both models ended up with the same variables. They both have an MSE of 119.6. Looking at the ANOVA table, I would choose the forward selection model, because the p values for the F test are smaller.
95% confidence intervals:
confint(finalforwardmod)
## 2.5 % 97.5 %
## (Intercept) 29.3481409 42.22839
## chocolate 14.8220911 29.14475
## peanutyalmondy 2.0632117 16.06867
## fruity 0.5412327 14.96470
Estimating Win Percentage of York Peppermint Patties
york <- data.frame(c("York Peppermint Patty", 1, 0, 0, 0, 0, 0, 0, 0, 0, NULL, NULL, NULL))
indicatormod <- lm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus, data = candy)
# yorkprediction <- predict(indicatormod, newdata = york, interval = "prediction")
# The prediction function keeps giving me an error I can't figure out, so I wasn't able to find the values, but this is my best guess.
# Predicted Values for the winpercent of york patties, and the lower and upper bounds of the prediction interval
# yorkprediction$fit
# yorkprediction$lwr
# yorkprediction$upr