candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv", header = TRUE)
View(candy)
library(tidyverse)
## ── Attaching packages ─────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(leaps)
ggplot(candy, aes(x = sugarpercent, y = winpercent))+
geom_point()
Direction: The direction looks some what positive
Form: The scatterplot could be linear, but it is very hard to tell
Strength: The strength of the scatterplot is weak
Outliers: There are possibly some outliers around 1 of sugarpercent and 81 of winpercent
mod1 <- lm(winpercent ~ sugarpercent, data = candy)
summary(mod1)
##
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.924 -11.066 -1.168 9.252 36.851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.609 3.086 14.455 <2e-16 ***
## sugarpercent 11.924 5.560 2.145 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared: 0.05251, Adjusted R-squared: 0.04109
## F-statistic: 4.6 on 1 and 83 DF, p-value: 0.0349
For every one unit increase in sugar percent, winpercent will increase by 11.924 units.
Yes, there is a significant relationship between sugarpercent and winpercent because the p-value is fairly small.
ggplot(candy, aes(x = sugarpercent, y = winpercent))+
geom_point()+
geom_abline(slope = mod1$coefficients[2], intercept = mod1$coefficients[1], color = "red", lwd = 1)
res <- residuals(mod1)
fit <- fitted.values(mod1)
ggplot(NULL, aes(x = fit, y = res)) +
geom_point()+
geom_hline(yintercept = 0, col = "blue", lty = 2)
Since most of the points are near the line, there is not much noise. Also, since the points are not evenly distributed on both sides of the line, the mean zero assumptions are not met.
qqnorm(res)
qqline(res, col = "red", lwd = 2)
Looking at the graph, since the points follow the line, the plot follows normality.
mod2 <- lm(winpercent ~ sugarpercent + chocolate, data = candy)
summary(mod2)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.6981 -8.5153 -0.4489 7.7602 27.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.262 2.552 14.990 < 2e-16 ***
## sugarpercent 8.567 4.355 1.967 0.0525 .
## chocolate 18.273 2.469 7.401 1.06e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared: 0.432, Adjusted R-squared: 0.4181
## F-statistic: 31.18 on 2 and 82 DF, p-value: 8.5e-11
For every one unit change in sugarpercent, winpercent increases by 8.576 units. If the candy has chocolate, then winpercent increases by 18.273 units.
y = winpercent and x = sugarpercent.
If chocolate is 0: \(y = 38.262+8.567x\)
If chocolate is 1: \(y=56.535+8.567x\)
ggplot(candy, aes(x = sugarpercent, y = winpercent, color = chocolate))+
geom_point()+
geom_abline(slope = mod2$coefficients[2], intercept = mod2$coefficients[1], color = "red") +
geom_abline(slope = mod2$coefficients[2], intercept = mod2$coefficients[1]+mod2$coefficients[3], color = "green")
Yes, since our p-value for chocolate is small, there is a significant effect of the interaction in the model.
mod3 <- lm(winpercent ~ sugarpercent + chocolate + sugarpercent*chocolate, data = candy)
summary(mod3)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate + sugarpercent *
## chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4007 -8.0463 -0.7059 6.2815 28.5003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.778 2.878 13.820 <2e-16 ***
## sugarpercent 5.221 5.257 0.993 0.3236
## chocolate 13.051 5.230 2.495 0.0146 *
## sugarpercent:chocolate 10.586 9.350 1.132 0.2609
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.21 on 81 degrees of freedom
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.4201
## F-statistic: 21.28 on 3 and 81 DF, p-value: 2.916e-10
y = winpercent and x = sugarpercent.
If chocolate is 0: \(y = 39.778+5.221x\)
If chocolate is 1: \(y = 52.829+15.807x\)
ggplot(candy, aes(x = sugarpercent, y = winpercent, color = chocolate))+
geom_point()+
geom_abline(slope = mod3$coefficients[2], intercept = mod3$coefficients[1], color = "red") +
geom_abline(slope = mod3$coefficients[2]+mod3$coefficients[4], intercept = mod3$coefficients[1]+mod3$coefficients[3], color = "green")
No there is no significant effect because we have a large p-value.
mod4 <- lm(winpercent ~ sugarpercent+chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus+pricepercent, data = candy)
summary(mod4)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate + fruity +
## caramel + peanutyalmondy + nougat + crispedricewafer + hard +
## bar + pluribus + pricepercent, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2244 -6.6247 0.1986 6.8420 23.8680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5340 4.3199 7.994 1.44e-11 ***
## sugarpercent 9.0868 4.6595 1.950 0.05500 .
## chocolate 19.7481 3.8987 5.065 2.96e-06 ***
## fruity 9.4223 3.7630 2.504 0.01452 *
## caramel 2.2245 3.6574 0.608 0.54493
## peanutyalmondy 10.0707 3.6158 2.785 0.00681 **
## nougat 0.8043 5.7164 0.141 0.88849
## crispedricewafer 8.9190 5.2679 1.693 0.09470 .
## hard -6.1653 3.4551 -1.784 0.07852 .
## bar 0.4415 5.0611 0.087 0.93072
## pluribus -0.8545 3.0401 -0.281 0.77945
## pricepercent -5.9284 5.5132 -1.075 0.28578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.7 on 73 degrees of freedom
## Multiple R-squared: 0.5402, Adjusted R-squared: 0.4709
## F-statistic: 7.797 on 11 and 73 DF, p-value: 9.504e-09
The variable that are significant at a 0.05 significance level are chocolate, peanutyalmondy, and fruity.
#REDUCED MODEL
mod5 <- lm(winpercent ~ chocolate+peanutyalmondy+fruity, data = candy)
summary(mod5)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0497 -7.3084 -0.4523 7.9446 23.8712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.788 3.237 11.057 < 2e-16 ***
## chocolate 21.983 3.599 6.108 3.34e-08 ***
## peanutyalmondy 9.066 3.520 2.576 0.0118 *
## fruity 7.753 3.625 2.139 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4475
## F-statistic: 23.68 on 3 and 81 DF, p-value: 4.209e-11
candy <- subset(candy, select = -c(competitorname))
regfitfwd <- regsubsets(winpercent~., data = candy,
method = "forward")
summary(regfitfwd)
## Subset selection object
## Call: regsubsets.formula(winpercent ~ ., data = candy, method = "forward")
## 11 Variables (and intercept)
## Forced in Forced out
## chocolate FALSE FALSE
## fruity FALSE FALSE
## caramel FALSE FALSE
## peanutyalmondy FALSE FALSE
## nougat FALSE FALSE
## crispedricewafer FALSE FALSE
## hard FALSE FALSE
## bar FALSE FALSE
## pluribus FALSE FALSE
## sugarpercent FALSE FALSE
## pricepercent FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard
## 1 ( 1 ) "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " "*" " " " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " " " "
## 4 ( 1 ) "*" "*" " " "*" " " "*" " "
## 5 ( 1 ) "*" "*" " " "*" " " "*" " "
## 6 ( 1 ) "*" "*" " " "*" " " "*" "*"
## 7 ( 1 ) "*" "*" " " "*" " " "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## bar pluribus sugarpercent pricepercent
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " "*" " "
## 6 ( 1 ) " " " " "*" " "
## 7 ( 1 ) " " " " "*" "*"
## 8 ( 1 ) " " " " "*" "*"
The variables I would put in the model is chocolate, fruity, peanutyalmondy, crispedricewafer, and hard.
mod6 <- lm(winpercent ~ chocolate+fruity+peanutyalmondy+crispedricewafer+hard, data = candy)
summary(mod6)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy +
## crispedricewafer + hard, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.8057 -5.8376 0.2539 7.5341 21.7866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.015 3.191 11.287 < 2e-16 ***
## chocolate 19.858 3.629 5.472 5.09e-07 ***
## fruity 9.236 3.617 2.554 0.01259 *
## peanutyalmondy 10.027 3.494 2.870 0.00526 **
## crispedricewafer 8.864 4.554 1.946 0.05516 .
## hard -4.835 3.316 -1.458 0.14887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.67 on 79 degrees of freedom
## Multiple R-squared: 0.5053, Adjusted R-squared: 0.474
## F-statistic: 16.14 on 5 and 79 DF, p-value: 6.141e-11
The variables in the model are different because I added crisped rice wafer and hard after doing the forward selection.
mse1 <- mean(residuals(mod5)^2)
mse1
## [1] 113.9847
The MSE of the reduced model from a is 113.9847
mse2 <- mean(residuals(mod6)^2)
mse2
## [1] 105.8438
the MSE of the final model from part b is 105.8438
The model I would prefer is the final model from part b because it has a smaller p-value.
confint(mod2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 33.18452 43.33971
## sugarpercent -0.09603 17.22927
## chocolate 13.36170 23.18492
My favorite candy is Jolly Ranchers. From the indicator variables jolly ranchers are fruity, hard, and pluribus.
10.3-4.9-0.2
## [1] 5.2
Looking at the fivethirtyeight article, my estimated fitted value is 5.2
I would use a prediction interval to create a reasonable range for observing this new candy because it accounts for a linear model.
mod7 <- lm(winpercent ~ fruity + pluribus + hard, data = candy)
summary(mod3)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate + sugarpercent *
## chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4007 -8.0463 -0.7059 6.2815 28.5003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.778 2.878 13.820 <2e-16 ***
## sugarpercent 5.221 5.257 0.993 0.3236
## chocolate 13.051 5.230 2.495 0.0146 *
## sugarpercent:chocolate 10.586 9.350 1.132 0.2609
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.21 on 81 degrees of freedom
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.4201
## F-statistic: 21.28 on 3 and 81 DF, p-value: 2.916e-10
predict(mod7, interval = "predict")
## Warning in predict.lm(mod7, interval = "predict"): predictions on current data refer to _future_ responses
## fit lwr upr
## 1 57.58055 30.557169 84.60392
## 2 57.58055 30.557169 84.60392
## 3 57.58055 30.557169 84.60392
## 4 57.58055 30.557169 84.60392
## 5 50.30251 22.801667 77.80335
## 6 57.58055 30.557169 84.60392
## 7 57.58055 30.557169 84.60392
## 8 52.59729 25.422416 79.77217
## 9 52.59729 25.422416 79.77217
## 10 50.30251 22.801667 77.80335
## 11 57.58055 30.557169 84.60392
## 12 45.31926 18.143204 72.49531
## 13 45.31926 18.143204 72.49531
## 14 45.31926 18.143204 72.49531
## 15 42.19631 14.449053 69.94357
## 16 45.31926 18.143204 72.49531
## 17 42.19631 14.449053 69.94357
## 18 37.21306 9.573131 64.85299
## 19 45.31926 18.143204 72.49531
## 20 52.59729 25.422416 79.77217
## 21 45.31926 18.143204 72.49531
## 22 45.31926 18.143204 72.49531
## 23 52.59729 25.422416 79.77217
## 24 57.58055 30.557169 84.60392
## 25 57.58055 30.557169 84.60392
## 26 57.58055 30.557169 84.60392
## 27 37.21306 9.573131 64.85299
## 28 52.59729 25.422416 79.77217
## 29 57.58055 30.557169 84.60392
## 30 50.30251 22.801667 77.80335
## 31 42.19631 14.449053 69.94357
## 32 50.30251 22.801667 77.80335
## 33 52.59729 25.422416 79.77217
## 34 52.59729 25.422416 79.77217
## 35 45.31926 18.143204 72.49531
## 36 52.59729 25.422416 79.77217
## 37 57.58055 30.557169 84.60392
## 38 57.58055 30.557169 84.60392
## 39 57.58055 30.557169 84.60392
## 40 57.58055 30.557169 84.60392
## 41 57.58055 30.557169 84.60392
## 42 37.21306 9.573131 64.85299
## 43 57.58055 30.557169 84.60392
## 44 57.58055 30.557169 84.60392
## 45 45.31926 18.143204 72.49531
## 46 45.31926 18.143204 72.49531
## 47 57.58055 30.557169 84.60392
## 48 52.59729 25.422416 79.77217
## 49 52.59729 25.422416 79.77217
## 50 37.21306 9.573131 64.85299
## 51 45.31926 18.143204 72.49531
## 52 57.58055 30.557169 84.60392
## 53 57.58055 30.557169 84.60392
## 54 52.59729 25.422416 79.77217
## 55 57.58055 30.557169 84.60392
## 56 42.19631 14.449053 69.94357
## 57 52.59729 25.422416 79.77217
## 58 44.49110 16.054787 72.92741
## 59 37.21306 9.573131 64.85299
## 60 52.59729 25.422416 79.77217
## 61 45.31926 18.143204 72.49531
## 62 45.31926 18.143204 72.49531
## 63 52.59729 25.422416 79.77217
## 64 37.21306 9.573131 64.85299
## 65 57.58055 30.557169 84.60392
## 66 57.58055 30.557169 84.60392
## 67 45.31926 18.143204 72.49531
## 68 45.31926 18.143204 72.49531
## 69 45.31926 18.143204 72.49531
## 70 37.21306 9.573131 64.85299
## 71 52.59729 25.422416 79.77217
## 72 57.58055 30.557169 84.60392
## 73 50.30251 22.801667 77.80335
## 74 45.31926 18.143204 72.49531
## 75 42.19631 14.449053 69.94357
## 76 57.58055 30.557169 84.60392
## 77 52.59729 25.422416 79.77217
## 78 57.58055 30.557169 84.60392
## 79 45.31926 18.143204 72.49531
## 80 57.58055 30.557169 84.60392
## 81 50.30251 22.801667 77.80335
## 82 42.19631 14.449053 69.94357
## 83 45.31926 18.143204 72.49531
## 84 49.47435 21.392347 77.55635
## 85 52.59729 25.422416 79.77217