library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(leaps)
candy<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv", header=TRUE)
#View(candy)
str(candy)
## 'data.frame': 85 obs. of 13 variables:
## $ competitorname : chr "100 Grand" "3 Musketeers" "One dime" "One quarter" ...
## $ chocolate : int 1 1 0 0 0 1 1 0 0 0 ...
## $ fruity : int 0 0 0 0 1 0 0 0 0 1 ...
## $ caramel : int 1 0 0 0 0 0 1 0 0 1 ...
## $ peanutyalmondy : int 0 0 0 0 0 1 1 1 0 0 ...
## $ nougat : int 0 1 0 0 0 0 1 0 0 0 ...
## $ crispedricewafer: int 1 0 0 0 0 0 0 0 0 0 ...
## $ hard : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bar : int 1 1 0 0 0 1 1 0 0 0 ...
## $ pluribus : int 0 0 0 0 0 0 0 1 1 0 ...
## $ sugarpercent : num 0.732 0.604 0.011 0.011 0.906 ...
## $ pricepercent : num 0.86 0.511 0.116 0.511 0.511 ...
## $ winpercent : num 67 67.6 32.3 46.1 52.3 ...
a) Create a scatterplot of sugarpercent vs winpercent
ggplot(candy,aes(sugarpercent,winpercent))+
geom_point()
b) Describe the scatterplot above (direction, form, strength, outliers)
c) Create a simple linear regression (SLR) model for winpercent as a function of sugarpercent.
lm1<- lm(winpercent~sugarpercent, data=candy)
summary(lm1)
##
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.924 -11.066 -1.168 9.252 36.851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.609 3.086 14.455 <2e-16 ***
## sugarpercent 11.924 5.560 2.145 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared: 0.05251, Adjusted R-squared: 0.04109
## F-statistic: 4.6 on 1 and 83 DF, p-value: 0.0349
ggplot(candy,aes(sugarpercent,winpercent))+
geom_point()+
geom_abline(intercept = lm1$coefficients[1], slope=lm1$coefficients[2], color="pink", lwd=1)
d) Should we trust the inference we made in the previous step? To assess this, check the model diagnostics.
- No, we should not trust the last model.
I & II)
par(mfrow = c(2, 2))
plot(lm1)
Now think of your favorite candy. What qualities does it have? Is it chocolatey? Is it fruity? Does it have caramel? Now that you are thinking of your favorite candy, choose one of the indicator variables in the dataset to use as an explanatory variable.
- I choose Kit Kat! It is chocolatey and has wafers, so I use chocolate as the indicator variable.
a) Fit a regression model for winpercent as a function of sugarpercent and the variable of your choice.
lm2<- lm(winpercent~sugarpercent+chocolate, data=candy)
summary(lm2)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.6981 -8.5153 -0.4489 7.7602 27.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.262 2.552 14.990 < 2e-16 ***
## sugarpercent 8.567 4.355 1.967 0.0525 .
## chocolate 18.273 2.469 7.401 1.06e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared: 0.432, Adjusted R-squared: 0.4181
## F-statistic: 31.18 on 2 and 82 DF, p-value: 8.5e-11
ggplot(candy,aes(sugarpercent,winpercent))+
geom_point()+
geom_abline(intercept = lm2$coefficients[1], slope=lm2$coefficients[2], color="pink", lwd=1)+
geom_abline(intercept = lm2$coefficients[1]+lm2$coefficients[3], slope=lm2$coefficients[2], color="brown", lwd=1)
b) Fit a regression model for winpercent as a function of sugarpercent, the variable of your choice, and an interaction of the two.
lm3<- lm(winpercent~sugarpercent*chocolate, data=candy)
summary(lm3)
##
## Call:
## lm(formula = winpercent ~ sugarpercent * chocolate, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4007 -8.0463 -0.7059 6.2815 28.5003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.778 2.878 13.820 <2e-16 ***
## sugarpercent 5.221 5.257 0.993 0.3236
## chocolate 13.051 5.230 2.495 0.0146 *
## sugarpercent:chocolate 10.586 9.350 1.132 0.2609
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.21 on 81 degrees of freedom
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.4201
## F-statistic: 21.28 on 3 and 81 DF, p-value: 2.916e-10
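I) Interpret the slope coefficient for the interaction in the context of the problem.
- The interaction coefficient (10.586) is the difference in the sugarpercent slope between candies with chocolate and candies without it.
- For candies without chocolate, the predicted win percentage is 39.78 when sugarpercent is 0, and it increases by 5.22 points as sugarpercent goes from 0 to 1.
- For candies with chocolate, the predicted win percentage is 52.83 (39.778 + 13.051) when sugarpercent is 0, and it increases by 15.81 points (5.221 + 10.586) as sugarpercent goes from 0 to 1.
- The interaction is not significant at the 0.05 level (p = 0.2609), so there is no clear evidence that the two slopes differ.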
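As a quick check on how an interaction model splits into two lines, the intercept and slope for each group can be rebuilt from the printed lm3 coefficients. This is only a sketch: the numbers are copied by hand from the summary(lm3) output, and the names b, line_no_choc, and line_choc are made up for illustration.

```r
# Coefficients copied from the summary(lm3) output above (rounded as printed)
b <- c(intercept = 39.778, sugar = 5.221, chocolate = 13.051, interaction = 10.586)

# Non-chocolate candies (chocolate = 0): the chocolate and interaction terms drop out
line_no_choc <- c(intercept = unname(b["intercept"]),
                  slope     = unname(b["sugar"]))

# Chocolate candies (chocolate = 1): both the intercept and the slope shift
line_choc <- c(intercept = unname(b["intercept"] + b["chocolate"]),
               slope     = unname(b["sugar"] + b["interaction"]))

print(rbind(no_chocolate = line_no_choc, chocolate = line_choc))
```

The gap between the two slopes is exactly the interaction coefficient, 10.586.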
ggplot(candy,aes(sugarpercent,winpercent))+
geom_point()+
geom_abline(intercept = lm3$coefficients[1], slope=lm3$coefficients[2], color="pink", lwd=1)+
# chocolate line: the interaction coefficient is added to the slope
geom_abline(intercept = lm3$coefficients[1]+lm3$coefficients[3], slope=lm3$coefficients[2]+lm3$coefficients[4], color="brown", lwd=1)
a) First, fit a fully saturated model with all the variables. Which variables are significant at a 0.05 significance level? What is the reduced model?
- Only chocolate, fruity, and peanutyalmondy are significant, so the reduced model keeps just those three predictors.
#fully saturated model
lm4<-lm(winpercent~
sugarpercent+chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus+pricepercent,
data=candy)
summary(lm4)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate + fruity +
## caramel + peanutyalmondy + nougat + crispedricewafer + hard +
## bar + pluribus + pricepercent, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2244 -6.6247 0.1986 6.8420 23.8680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5340 4.3199 7.994 1.44e-11 ***
## sugarpercent 9.0868 4.6595 1.950 0.05500 .
## chocolate 19.7481 3.8987 5.065 2.96e-06 ***
## fruity 9.4223 3.7630 2.504 0.01452 *
## caramel 2.2245 3.6574 0.608 0.54493
## peanutyalmondy 10.0707 3.6158 2.785 0.00681 **
## nougat 0.8043 5.7164 0.141 0.88849
## crispedricewafer 8.9190 5.2679 1.693 0.09470 .
## hard -6.1653 3.4551 -1.784 0.07852 .
## bar 0.4415 5.0611 0.087 0.93072
## pluribus -0.8545 3.0401 -0.281 0.77945
## pricepercent -5.9284 5.5132 -1.075 0.28578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.7 on 73 degrees of freedom
## Multiple R-squared: 0.5402, Adjusted R-squared: 0.4709
## F-statistic: 7.797 on 11 and 73 DF, p-value: 9.504e-09
#reduced model
lm4.5<-lm(winpercent~ chocolate+fruity+peanutyalmondy,data=candy)
summary(lm4.5)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0497 -7.3084 -0.4523 7.9446 23.8712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.788 3.237 11.057 < 2e-16 ***
## chocolate 21.983 3.599 6.108 3.34e-08 ***
## fruity 7.753 3.625 2.139 0.0354 *
## peanutyalmondy 9.066 3.520 2.576 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4475
## F-statistic: 23.68 on 3 and 81 DF, p-value: 4.209e-11
b) Now, perform either forward or backward selection. What variables would you put in the model?
regfit.fwd<-regsubsets(winpercent~sugarpercent+chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus+pricepercent, data=candy,
method="forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(winpercent ~ sugarpercent + chocolate + fruity +
## caramel + peanutyalmondy + nougat + crispedricewafer + hard +
## bar + pluribus + pricepercent, data = candy, method = "forward")
## 11 Variables (and intercept)
## Forced in Forced out
## sugarpercent FALSE FALSE
## chocolate FALSE FALSE
## fruity FALSE FALSE
## caramel FALSE FALSE
## peanutyalmondy FALSE FALSE
## nougat FALSE FALSE
## crispedricewafer FALSE FALSE
## hard FALSE FALSE
## bar FALSE FALSE
## pluribus FALSE FALSE
## pricepercent FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## sugarpercent chocolate fruity caramel peanutyalmondy nougat
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" " " " " "*" " "
## 3 ( 1 ) " " "*" "*" " " "*" " "
## 4 ( 1 ) " " "*" "*" " " "*" " "
## 5 ( 1 ) "*" "*" "*" " " "*" " "
## 6 ( 1 ) "*" "*" "*" " " "*" " "
## 7 ( 1 ) "*" "*" "*" " " "*" " "
## 8 ( 1 ) "*" "*" "*" "*" "*" " "
## crispedricewafer hard bar pluribus pricepercent
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" " " " " "*"
## 8 ( 1 ) "*" "*" " " " " "*"
lm5<-lm(winpercent~chocolate+peanutyalmondy+fruity, data=candy)
summary(lm5)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0497 -7.3084 -0.4523 7.9446 23.8712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.788 3.237 11.057 < 2e-16 ***
## chocolate 21.983 3.599 6.108 3.34e-08 ***
## peanutyalmondy 9.066 3.520 2.576 0.0118 *
## fruity 7.753 3.625 2.139 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4475
## F-statistic: 23.68 on 3 and 81 DF, p-value: 4.209e-11
anova(lm4.5)
## Analysis of Variance Table
##
## Response: winpercent
## Df Sum Sq Mean Sq F value Pr(>F)
## chocolate 1 7368.5 7368.5 61.6029 1.491e-11 ***
## fruity 1 336.1 336.1 2.8100 0.09753 .
## peanutyalmondy 1 793.7 793.7 6.6353 0.01181 *
## Residuals 81 9688.7 119.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# For lm4.5: model sum of squares = 7368.5 + 336.1 + 793.7 = 8498.3; MSE (residual mean square) = 119.6
anova(lm5)
## Analysis of Variance Table
##
## Response: winpercent
## Df Sum Sq Mean Sq F value Pr(>F)
## chocolate 1 7368.5 7368.5 61.6029 1.491e-11 ***
## peanutyalmondy 1 582.5 582.5 4.8700 0.03016 *
## fruity 1 547.3 547.3 4.5754 0.03545 *
## Residuals 81 9688.7 119.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# For lm5: model sum of squares = 7368.5 + 582.5 + 547.3 = 8498.3; MSE (residual mean square) = 119.6
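III) Which model would you prefer and why?
- Both approaches end up with the same three predictors (chocolate, fruity, and peanutyalmondy), so lm4.5 and lm5 are the same fitted model and there is no reason to prefer one over the other.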
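The sums of squares in the two ANOVA tables can be checked by hand. The sketch below copies the values from the anova() output above (the variable names are illustrative) and suggests that the order of the terms changes the individual sequential sums of squares but not their total or the residual mean square.

```r
# Sequential (Type I) sums of squares copied from the anova() tables above
ss_lm4.5 <- c(chocolate = 7368.5, fruity = 336.1, peanutyalmondy = 793.7)
ss_lm5   <- c(chocolate = 7368.5, peanutyalmondy = 582.5, fruity = 547.3)

# Residual sum of squares and degrees of freedom (identical in both tables)
rss    <- 9688.7
df_res <- 81

model_ss_lm4.5 <- sum(ss_lm4.5)  # total explained by the three predictors
model_ss_lm5   <- sum(ss_lm5)
mse            <- rss / df_res   # residual mean square
rse            <- sqrt(mse)      # residual standard error

print(c(model_ss_lm4.5 = model_ss_lm4.5, model_ss_lm5 = model_ss_lm5,
        mse = mse, rse = rse))
```

The square root of the MSE matches the residual standard error of 10.94 reported by summary() for both models.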
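a) Provide 95% confidence intervals for the coefficient estimates of your favorite model (after comparing your models in the previous section). You can use the appropriate R function to do this.
- The intervals below are approximate, computed from the summary(lm5) output as estimate ± t(0.975, 81) × standard error; confint(lm5) returns the exact values. The p-values are those from summary(lm5).
- Intercept: (29.3, 42.2), p-value < 2e-16
- Chocolate: (14.8, 29.1), p-value 3.34e-08
- Peanutyalmondy: (2.1, 16.1), p-value 0.0118
- Fruity: (0.5, 15.0), p-value 0.0354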
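A 95% confidence interval for each coefficient is the estimate plus or minus a t quantile times the standard error. In R, confint(lm5) does this directly; the sketch below reproduces the calculation from the printed summary(lm5) table so that it runs without the data. The values are copied as printed, so the results are approximate, and est, se, and ci are illustrative names.

```r
# Estimates and standard errors copied from the summary(lm5) output above
est <- c(intercept = 35.788, chocolate = 21.983, peanutyalmondy = 9.066, fruity = 7.753)
se  <- c(intercept = 3.237,  chocolate = 3.599,  peanutyalmondy = 3.520, fruity = 3.625)

df_res <- 81                      # residual degrees of freedom from the summary
t_crit <- qt(0.975, df = df_res)  # two-sided 95% critical value

# Lower and upper limits, one row per coefficient
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
print(round(ci, 2))
```

None of the intervals contains 0, which agrees with all four p-values being below 0.05.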
b) Finally, look at the list of candies included in this dataset. Think of your favorite candy that is not in the dataset (like Raisinets!). What is it? Imagine that you want to estimate the win percentage for this new candy.
- I choose Jolly Ranchers, which are hard and fruity.
lm6<-lm(winpercent~chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus, data=candy)
summary(lm6)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + peanutyalmondy +
## nougat + crispedricewafer + hard + bar + pluribus, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.6779 -5.6765 0.3966 7.0583 21.9144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.0155 4.0781 8.586 9.13e-13 ***
## chocolate 19.9058 3.8975 5.107 2.41e-06 ***
## fruity 10.2677 3.7887 2.710 0.00833 **
## caramel 3.3843 3.6034 0.939 0.35065
## peanutyalmondy 10.1410 3.5949 2.821 0.00612 **
## nougat 2.4163 5.6897 0.425 0.67229
## crispedricewafer 8.9915 5.3279 1.688 0.09564 .
## hard -4.8726 3.4394 -1.417 0.16072
## bar -0.7220 4.8707 -0.148 0.88256
## pluribus -0.1599 3.0115 -0.053 0.95779
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.85 on 75 degrees of freedom
## Multiple R-squared: 0.5148, Adjusted R-squared: 0.4566
## F-statistic: 8.842 on 9 and 75 DF, p-value: 6.052e-09
35.0155+10.2677-4.8726
## [1] 40.4106
The estimated win percentage for Jolly Ranchers (fruity and hard, with every other indicator set to 0) is about 40.41.
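The hand calculation above is the same as multiplying the lm6 coefficients by the candy's indicator values and summing. A minimal sketch, with the coefficients copied from the summary(lm6) output and a hypothetical profile vector new_candy; with the data loaded, predict(lm6, newdata = ...) would be the usual route.

```r
# Coefficients copied from the summary(lm6) output above
coefs <- c(intercept = 35.0155, chocolate = 19.9058, fruity = 10.2677,
           caramel = 3.3843, peanutyalmondy = 10.1410, nougat = 2.4163,
           crispedricewafer = 8.9915, hard = -4.8726, bar = -0.7220,
           pluribus = -0.1599)

# Hypothetical profile: fruity and hard, every other indicator set to 0
new_candy <- c(intercept = 1, chocolate = 0, fruity = 1, caramel = 0,
               peanutyalmondy = 0, nougat = 0, crispedricewafer = 0,
               hard = 1, bar = 0, pluribus = 0)

# Align by name so the order of the two vectors cannot get out of sync
predicted <- sum(coefs * new_candy[names(coefs)])
print(predicted)
```

Changing the 0/1 entries in new_candy gives the prediction for any other combination of indicators under the same model.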