PART 2

candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv", header = TRUE)
View(candy)
library(tidyverse)
## ── Attaching packages ─────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(leaps)

B)

a).

ggplot(candy, aes(x = sugarpercent, y = winpercent))+
  geom_point()

b).

Direction: The direction appears somewhat positive.

Form: The relationship could be linear, but the form is hard to judge from this scatterplot.

Strength: The association is weak (a quick correlation check is sketched below).

Outliers: There may be a few possible outliers with sugarpercent near 1 and winpercent near 81.
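
As a rough numeric check on the direction and strength described above, the sample correlation can be computed; this is just a sketch using the two variables already plotted.

# Pearson correlation between sugarpercent and winpercent;
# a small positive value would be consistent with a weak positive association
cor(candy$sugarpercent, candy$winpercent)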

c).

mod1 <- lm(winpercent ~ sugarpercent, data = candy)
summary(mod1)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.924 -11.066  -1.168   9.252  36.851 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    44.609      3.086  14.455   <2e-16 ***
## sugarpercent   11.924      5.560   2.145   0.0349 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared:  0.05251,    Adjusted R-squared:  0.04109 
## F-statistic:   4.6 on 1 and 83 DF,  p-value: 0.0349
i)

For every one-unit increase in sugarpercent, the predicted winpercent increases by about 11.924 units (note that sugarpercent is a percentile scaled from 0 to 1, so one unit spans the whole range).

ii)

Yes, there is a significant relationship between sugarpercent and winpercent: the p-value for the slope is 0.0349, which is below the 0.05 level.
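
The slope and its p-value can also be pulled straight from the fitted object; a small sketch (the values should match the summary above):

# slope estimate for sugarpercent and its p-value from mod1
coef(mod1)["sugarpercent"]
summary(mod1)$coefficients["sugarpercent", "Pr(>|t|)"]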

iii)
ggplot(candy, aes(x = sugarpercent, y = winpercent))+
  geom_point()+
  geom_abline(slope = mod1$coefficients[2], intercept = mod1$coefficients[1], color = "red", lwd = 1)

d).

i)
res <- residuals(mod1)
fit <- fitted.values(mod1)

ggplot(NULL, aes(x = fit, y = res)) +
  geom_point()+
  geom_hline(yintercept = 0, col = "blue", lty = 2)

Most of the residuals lie fairly close to the zero line, so there is not an extreme amount of noise. However, the points are not evenly scattered above and below the line, so the mean-zero assumption does not appear to be met.
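
The same diagnostics are also available from base R's plotting method for lm objects; a sketch (which = 1 is residuals vs. fitted, which = 3 is the scale-location plot for checking constant variance):

# base-R diagnostic plots for mod1
plot(mod1, which = 1)  # residuals vs. fitted values
plot(mod1, which = 3)  # scale-location plot (constant-variance check)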

ii)
qqnorm(res)
qqline(res, col = "red", lwd = 2)

Since the points follow the reference line fairly closely, the normality assumption appears reasonable.
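
As a complementary numeric check of normality (not required by the plot-based answer above), a Shapiro-Wilk test could be run on the residuals; a large p-value would be consistent with the QQ plot reading.

# Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
shapiro.test(res)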

C)

a).

mod2 <- lm(winpercent ~ sugarpercent + chocolate, data = candy)
summary(mod2)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.6981  -8.5153  -0.4489   7.7602  27.4820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    38.262      2.552  14.990  < 2e-16 ***
## sugarpercent    8.567      4.355   1.967   0.0525 .  
## chocolate      18.273      2.469   7.401 1.06e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.4181 
## F-statistic: 31.18 on 2 and 82 DF,  p-value: 8.5e-11
i)

Holding chocolate fixed, a one-unit increase in sugarpercent raises the predicted winpercent by 8.567 units. Holding sugarpercent fixed, a candy with chocolate has a predicted winpercent 18.273 units higher than one without.

ii)

y = winpercent and x = sugarpercent.

If chocolate is 0: \(y = 38.262+8.567x\)

If chocolate is 1: \(y=56.535+8.567x\)
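
These two intercepts come directly from the coefficients of mod2; a quick sketch of the arithmetic (b is just a temporary name for the coefficient vector):

b <- coef(mod2)
b["(Intercept)"]                   # intercept when chocolate = 0 (about 38.262)
b["(Intercept)"] + b["chocolate"]  # intercept when chocolate = 1 (about 56.535)
b["sugarpercent"]                  # common slope (about 8.567)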

iii)
ggplot(candy, aes(x = sugarpercent, y = winpercent, color = chocolate))+
  geom_point()+
  geom_abline(slope = mod2$coefficients[2], intercept = mod2$coefficients[1], color = "red") +
  geom_abline(slope = mod2$coefficients[2], intercept = mod2$coefficients[1]+mod2$coefficients[3], color = "green")

iv)

Yes, the p-value for chocolate (1.06e-10) is very small, so chocolate has a significant effect in this model.
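
An equivalent way to judge whether adding chocolate helps is a partial F-test comparing the sugarpercent-only model with this one; a sketch:

# partial F-test: does adding chocolate improve on mod1?
anova(mod1, mod2)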

b).

i)
mod3 <- lm(winpercent ~ sugarpercent + chocolate + sugarpercent*chocolate, data = candy)
summary(mod3)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate + sugarpercent * 
##     chocolate, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4007  -8.0463  -0.7059   6.2815  28.5003 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              39.778      2.878  13.820   <2e-16 ***
## sugarpercent              5.221      5.257   0.993   0.3236    
## chocolate                13.051      5.230   2.495   0.0146 *  
## sugarpercent:chocolate   10.586      9.350   1.132   0.2609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.21 on 81 degrees of freedom
## Multiple R-squared:  0.4408, Adjusted R-squared:  0.4201 
## F-statistic: 21.28 on 3 and 81 DF,  p-value: 2.916e-10
ii)

y = winpercent and x = sugarpercent.

If chocolate is 0: \(y = 39.778+5.221x\)

If chocolate is 1: \(y = 52.829+15.807x\)

iii)
ggplot(candy, aes(x = sugarpercent, y = winpercent, color = chocolate))+
  geom_point()+
  geom_abline(slope = mod3$coefficients[2], intercept = mod3$coefficients[1], color = "red") +
  geom_abline(slope = mod3$coefficients[2]+mod3$coefficients[4], intercept = mod3$coefficients[1]+mod3$coefficients[3], color = "green")

iv)

No, the interaction is not significant: the p-value for sugarpercent:chocolate is 0.2609, which is larger than 0.05.
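
The same conclusion can be checked with a partial F-test of the interaction term, comparing the additive model and the interaction model; a sketch:

# partial F-test for the sugarpercent:chocolate interaction
anova(mod2, mod3)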

D)

a).

mod4 <- lm(winpercent ~ sugarpercent+chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus+pricepercent, data = candy)
summary(mod4)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate + fruity + 
##     caramel + peanutyalmondy + nougat + crispedricewafer + hard + 
##     bar + pluribus + pricepercent, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2244  -6.6247   0.1986   6.8420  23.8680 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       34.5340     4.3199   7.994 1.44e-11 ***
## sugarpercent       9.0868     4.6595   1.950  0.05500 .  
## chocolate         19.7481     3.8987   5.065 2.96e-06 ***
## fruity             9.4223     3.7630   2.504  0.01452 *  
## caramel            2.2245     3.6574   0.608  0.54493    
## peanutyalmondy    10.0707     3.6158   2.785  0.00681 ** 
## nougat             0.8043     5.7164   0.141  0.88849    
## crispedricewafer   8.9190     5.2679   1.693  0.09470 .  
## hard              -6.1653     3.4551  -1.784  0.07852 .  
## bar                0.4415     5.0611   0.087  0.93072    
## pluribus          -0.8545     3.0401  -0.281  0.77945    
## pricepercent      -5.9284     5.5132  -1.075  0.28578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.7 on 73 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.4709 
## F-statistic: 7.797 on 11 and 73 DF,  p-value: 9.504e-09

The variables that are significant at the 0.05 significance level are chocolate, peanutyalmondy, and fruity.
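
Those predictors can also be picked out programmatically from the coefficient table of the full model; a sketch (pvals is just a temporary name, and the intercept row is dropped):

# names of predictors in mod4 with p-values below 0.05
pvals <- summary(mod4)$coefficients[-1, "Pr(>|t|)"]
names(pvals)[pvals < 0.05]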

#REDUCED MODEL
mod5 <- lm(winpercent ~ chocolate+peanutyalmondy+fruity, data = candy)
summary(mod5)
## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0497  -7.3084  -0.4523   7.9446  23.8712 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.788      3.237  11.057  < 2e-16 ***
## chocolate        21.983      3.599   6.108 3.34e-08 ***
## peanutyalmondy    9.066      3.520   2.576   0.0118 *  
## fruity            7.753      3.625   2.139   0.0354 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4475 
## F-statistic: 23.68 on 3 and 81 DF,  p-value: 4.209e-11

b).

candy <- subset(candy, select = -c(competitorname))
regfitfwd <- regsubsets(winpercent~., data = candy,
                        method = "forward")
summary(regfitfwd)
## Subset selection object
## Call: regsubsets.formula(winpercent ~ ., data = candy, method = "forward")
## 11 Variables  (and intercept)
##                  Forced in Forced out
## chocolate            FALSE      FALSE
## fruity               FALSE      FALSE
## caramel              FALSE      FALSE
## peanutyalmondy       FALSE      FALSE
## nougat               FALSE      FALSE
## crispedricewafer     FALSE      FALSE
## hard                 FALSE      FALSE
## bar                  FALSE      FALSE
## pluribus             FALSE      FALSE
## sugarpercent         FALSE      FALSE
## pricepercent         FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard
## 1  ( 1 ) "*"       " "    " "     " "            " "    " "              " " 
## 2  ( 1 ) "*"       " "    " "     "*"            " "    " "              " " 
## 3  ( 1 ) "*"       "*"    " "     "*"            " "    " "              " " 
## 4  ( 1 ) "*"       "*"    " "     "*"            " "    "*"              " " 
## 5  ( 1 ) "*"       "*"    " "     "*"            " "    "*"              " " 
## 6  ( 1 ) "*"       "*"    " "     "*"            " "    "*"              "*" 
## 7  ( 1 ) "*"       "*"    " "     "*"            " "    "*"              "*" 
## 8  ( 1 ) "*"       "*"    "*"     "*"            " "    "*"              "*" 
##          bar pluribus sugarpercent pricepercent
## 1  ( 1 ) " " " "      " "          " "         
## 2  ( 1 ) " " " "      " "          " "         
## 3  ( 1 ) " " " "      " "          " "         
## 4  ( 1 ) " " " "      " "          " "         
## 5  ( 1 ) " " " "      "*"          " "         
## 6  ( 1 ) " " " "      "*"          " "         
## 7  ( 1 ) " " " "      "*"          "*"         
## 8  ( 1 ) " " " "      "*"          "*"

The variables I would put in the model are chocolate, fruity, peanutyalmondy, crispedricewafer, and hard.
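
This choice can be sanity-checked against the selection criteria that regsubsets() reports (adjusted R-squared, Mallows' Cp, BIC); a sketch, noting that the best size by each criterion need not agree and that the size-5 model on the forward path includes sugarpercent rather than hard:

fwd_sum <- summary(regfitfwd)
which.max(fwd_sum$adjr2)  # model size with the largest adjusted R-squared
which.min(fwd_sum$cp)     # model size with the smallest Mallows' Cp
which.min(fwd_sum$bic)    # model size with the smallest BIC
coef(regfitfwd, 5)        # coefficients of the size-5 model on the forward path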

mod6 <- lm(winpercent ~ chocolate+fruity+peanutyalmondy+crispedricewafer+hard, data = candy)
summary(mod6)
## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy + 
##     crispedricewafer + hard, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.8057  -5.8376   0.2539   7.5341  21.7866 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        36.015      3.191  11.287  < 2e-16 ***
## chocolate          19.858      3.629   5.472 5.09e-07 ***
## fruity              9.236      3.617   2.554  0.01259 *  
## peanutyalmondy     10.027      3.494   2.870  0.00526 ** 
## crispedricewafer    8.864      4.554   1.946  0.05516 .  
## hard               -4.835      3.316  -1.458  0.14887    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.67 on 79 degrees of freedom
## Multiple R-squared:  0.5053, Adjusted R-squared:  0.474 
## F-statistic: 16.14 on 5 and 79 DF,  p-value: 6.141e-11

c).

i)

The variables in the two models are different because forward selection led me to add crispedricewafer and hard to the three predictors from part (a).

ii)
mse1 <- mean(residuals(mod5)^2)
mse1
## [1] 113.9847

The MSE of the reduced model from part (a) is 113.9847.

mse2 <- mean(residuals(mod6)^2)
mse2
## [1] 105.8438

The MSE of the final model from part (b) is 105.8438.

iii)

The model I would prefer is the final model from part (b) because it has the smaller MSE (105.84 vs. 113.98).
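
As an additional, optional comparison beyond in-sample MSE, information criteria that penalize model size could also be checked; a sketch (lower values are better):

# AIC/BIC comparison of the reduced model (mod5) and the forward-selection model (mod6)
AIC(mod5, mod6)
BIC(mod5, mod6)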

E)

a).

confint(mod2, level = 0.95)
##                 2.5 %   97.5 %
## (Intercept)  33.18452 43.33971
## sugarpercent -0.09603 17.22927
## chocolate    13.36170 23.18492

b).

i)

My favorite candy is Jolly Ranchers. Based on the indicator variables, Jolly Ranchers are fruity, hard, and pluribus.

10.3-4.9-0.2
## [1] 5.2

Using the attribute effects reported in the FiveThirtyEight article, my estimated fitted value is 10.3 - 4.9 - 0.2 = 5.2.

ii)

I would use a prediction interval to give a reasonable range for this new candy's winpercent, because a prediction interval accounts for the variability of an individual observation rather than just the uncertainty in the mean response.

mod7 <- lm(winpercent ~ fruity + pluribus + hard, data = candy)
predict(mod7, interval = "predict")
## Warning in predict.lm(mod7, interval = "predict"): predictions on current data refer to _future_ responses
##         fit       lwr      upr
## 1  57.58055 30.557169 84.60392
## 2  57.58055 30.557169 84.60392
## 3  57.58055 30.557169 84.60392
## 4  57.58055 30.557169 84.60392
## 5  50.30251 22.801667 77.80335
## 6  57.58055 30.557169 84.60392
## 7  57.58055 30.557169 84.60392
## 8  52.59729 25.422416 79.77217
## 9  52.59729 25.422416 79.77217
## 10 50.30251 22.801667 77.80335
## 11 57.58055 30.557169 84.60392
## 12 45.31926 18.143204 72.49531
## 13 45.31926 18.143204 72.49531
## 14 45.31926 18.143204 72.49531
## 15 42.19631 14.449053 69.94357
## 16 45.31926 18.143204 72.49531
## 17 42.19631 14.449053 69.94357
## 18 37.21306  9.573131 64.85299
## 19 45.31926 18.143204 72.49531
## 20 52.59729 25.422416 79.77217
## 21 45.31926 18.143204 72.49531
## 22 45.31926 18.143204 72.49531
## 23 52.59729 25.422416 79.77217
## 24 57.58055 30.557169 84.60392
## 25 57.58055 30.557169 84.60392
## 26 57.58055 30.557169 84.60392
## 27 37.21306  9.573131 64.85299
## 28 52.59729 25.422416 79.77217
## 29 57.58055 30.557169 84.60392
## 30 50.30251 22.801667 77.80335
## 31 42.19631 14.449053 69.94357
## 32 50.30251 22.801667 77.80335
## 33 52.59729 25.422416 79.77217
## 34 52.59729 25.422416 79.77217
## 35 45.31926 18.143204 72.49531
## 36 52.59729 25.422416 79.77217
## 37 57.58055 30.557169 84.60392
## 38 57.58055 30.557169 84.60392
## 39 57.58055 30.557169 84.60392
## 40 57.58055 30.557169 84.60392
## 41 57.58055 30.557169 84.60392
## 42 37.21306  9.573131 64.85299
## 43 57.58055 30.557169 84.60392
## 44 57.58055 30.557169 84.60392
## 45 45.31926 18.143204 72.49531
## 46 45.31926 18.143204 72.49531
## 47 57.58055 30.557169 84.60392
## 48 52.59729 25.422416 79.77217
## 49 52.59729 25.422416 79.77217
## 50 37.21306  9.573131 64.85299
## 51 45.31926 18.143204 72.49531
## 52 57.58055 30.557169 84.60392
## 53 57.58055 30.557169 84.60392
## 54 52.59729 25.422416 79.77217
## 55 57.58055 30.557169 84.60392
## 56 42.19631 14.449053 69.94357
## 57 52.59729 25.422416 79.77217
## 58 44.49110 16.054787 72.92741
## 59 37.21306  9.573131 64.85299
## 60 52.59729 25.422416 79.77217
## 61 45.31926 18.143204 72.49531
## 62 45.31926 18.143204 72.49531
## 63 52.59729 25.422416 79.77217
## 64 37.21306  9.573131 64.85299
## 65 57.58055 30.557169 84.60392
## 66 57.58055 30.557169 84.60392
## 67 45.31926 18.143204 72.49531
## 68 45.31926 18.143204 72.49531
## 69 45.31926 18.143204 72.49531
## 70 37.21306  9.573131 64.85299
## 71 52.59729 25.422416 79.77217
## 72 57.58055 30.557169 84.60392
## 73 50.30251 22.801667 77.80335
## 74 45.31926 18.143204 72.49531
## 75 42.19631 14.449053 69.94357
## 76 57.58055 30.557169 84.60392
## 77 52.59729 25.422416 79.77217
## 78 57.58055 30.557169 84.60392
## 79 45.31926 18.143204 72.49531
## 80 57.58055 30.557169 84.60392
## 81 50.30251 22.801667 77.80335
## 82 42.19631 14.449053 69.94357
## 83 45.31926 18.143204 72.49531
## 84 49.47435 21.392347 77.55635
## 85 52.59729 25.422416 79.77217
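
Since the question concerns one specific candy, the interval for the Jolly Rancher attribute profile (fruity = 1, pluribus = 1, hard = 1) could also be requested directly instead of printing an interval for every row; a sketch (newcandy is just an illustrative name):

# prediction interval for a single new candy that is fruity, hard, and comes in a bag/box
newcandy <- data.frame(fruity = 1, pluribus = 1, hard = 1)
predict(mod7, newdata = newcandy, interval = "prediction")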