Midterm 2

Part II

Import the Data

candy<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv",header=TRUE)

str(candy)

## 'data.frame':    85 obs. of  13 variables:
##  $ competitorname  : chr  "100 Grand" "3 Musketeers" "One dime" "One quarter" ...
##  $ chocolate       : int  1 1 0 0 0 1 1 0 0 0 ...
##  $ fruity          : int  0 0 0 0 1 0 0 0 0 1 ...
##  $ caramel         : int  1 0 0 0 0 0 1 0 0 1 ...
##  $ peanutyalmondy  : int  0 0 0 0 0 1 1 1 0 0 ...
##  $ nougat          : int  0 1 0 0 0 0 1 0 0 0 ...
##  $ crispedricewafer: int  1 0 0 0 0 0 0 0 0 0 ...
##  $ hard            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bar             : int  1 1 0 0 0 1 1 0 0 0 ...
##  $ pluribus        : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ sugarpercent    : num  0.732 0.604 0.011 0.011 0.906 ...
##  $ pricepercent    : num  0.86 0.511 0.116 0.511 0.511 ...
##  $ winpercent      : num  67 67.6 32.3 46.1 52.3 ...

Simple Linear Regression

ggplot(candy, aes(sugarpercent, winpercent)) +
  geom_point() +
  theme_light()

There doesn’t appear to be a strong correlation here. It seems slightly positive, possibly linear, with maybe one outlier with low sugar and high wins.

slrcandy <- lm(winpercent~sugarpercent, data=candy)
summary(slrcandy)

## 
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.924 -11.066  -1.168   9.252  36.851 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    44.609      3.086  14.455   <2e-16 ***
## sugarpercent   11.924      5.560   2.145   0.0349 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared:  0.05251,    Adjusted R-squared:  0.04109 
## F-statistic:   4.6 on 1 and 83 DF,  p-value: 0.0349

As the sugar increases by 1%, the percentage of wins increases by 11.92.

There is a significant relationship, because our predictor variable p value is less than .05.

ggplot(candy, aes(sugarpercent, winpercent)) +
  geom_point() +
  geom_abline(intercept = slrcandy$coefficients[1], slope= slrcandy$coefficients[2], color = "blue", lty = 1) +
  theme_light()

Assessing Assumptions

ggplot(candy, aes(sugarpercent, slrcandy$residuals)) +
  geom_point()+
  geom_hline(yintercept=0, lty = 2, color = "blue") +
  theme_light()

It seems like the mean of the residuals is 0, and there is no pattern in the distribution, so we have homoscedasticity.

qqnorm(slrcandy$residuals)
qqline(slrcandy$residuals)

The residuals don’t map on perfectly to the line, but it seems close enough to conclude our data is normal.

Categorical Explanatory Variables and Interactions

Fitting a model with 2 predictors:

candymod <- lm(winpercent ~ sugarpercent + chocolate, data = candy)
summary(candymod)

## 
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.6981  -8.5153  -0.4489   7.7602  27.4820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    38.262      2.552  14.990  < 2e-16 ***
## sugarpercent    8.567      4.355   1.967   0.0525 .  
## chocolate      18.273      2.469   7.401 1.06e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.4181 
## F-statistic: 31.18 on 2 and 82 DF,  p-value: 8.5e-11

If chocolate is present, the average percent of wins is 18.27% higher than candies without chocolate, when controlling for the percentage of sugar. This is a significant effect.

Model when chocolate is present: \[ y = 56.533 + 8.567x \] Model when chocolate is not present: \[ y = 38.262 + 8.567x \]

ggplot(candy, aes(sugarpercent, winpercent, color = as.factor(chocolate))) +
  geom_point() +
  geom_abline(intercept = candymod$coefficients[1], slope= candymod$coefficients[2],color = "red", lty = 1) +
  geom_abline(intercept = candymod$coefficients[1]+candymod$coefficients[3], slope = candymod$coefficients[2],color = "blue", lty = 1) +
  theme_light()

Fitting a model with 2 predictors and an interaction:

candymod2 <- lm(winpercent ~ sugarpercent + chocolate + sugarpercent*chocolate, data = candy)
summary(candymod2)

## 
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate + sugarpercent * 
##     chocolate, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4007  -8.0463  -0.7059   6.2815  28.5003 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              39.778      2.878  13.820   <2e-16 ***
## sugarpercent              5.221      5.257   0.993   0.3236    
## chocolate                13.051      5.230   2.495   0.0146 *  
## sugarpercent:chocolate   10.586      9.350   1.132   0.2609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.21 on 81 degrees of freedom
## Multiple R-squared:  0.4408, Adjusted R-squared:  0.4201 
## F-statistic: 21.28 on 3 and 81 DF,  p-value: 2.916e-10

If chocolate is present, the association between percentage of sugar and wins is more dramatic- i.e. as percentage of sugar increases, the percentage of wins increases by 15.7 compared to when chocolate is absent. However this effect is not significant, since p = .26.

When chocolate is present: \[ y = 39.778 + 13.051 + (10.586 + 5.221)x \] \[ y = 52.829 + 15.807x \]

When chocolate is absent: \[39.778 + 5.221x \]

ggplot(candy, aes(sugarpercent, winpercent, color = as.factor(chocolate))) +
  geom_point() +
  geom_abline(intercept = (candymod2$coefficients[1]+candymod2$coefficients[3]), slope= (candymod2$coefficients[2]+candymod2$coefficients[4]),color = "blue", lty = 1) +
  geom_abline(intercept = candymod$coefficients[1], slope = candymod$coefficients[2],color = "red", lty = 1) +
  theme_light()

Model Selection

Fitting a fully saturated model.

saturatedmod <- lm(winpercent~chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + pricepercent, data = candy)
summary(saturatedmod)

## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + peanutyalmondy + 
##     nougat + crispedricewafer + hard + bar + pluribus + pricepercent, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.5288  -5.9519   0.0209   7.0941  21.5661 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.71037    4.35784   8.195 5.55e-12 ***
## chocolate        20.19265    3.96509   5.093 2.61e-06 ***
## fruity           10.25959    3.80857   2.694  0.00873 ** 
## caramel           3.61280    3.65473   0.989  0.32612    
## peanutyalmondy   10.46296    3.67799   2.845  0.00574 ** 
## nougat            1.99093    5.79061   0.344  0.73196    
## crispedricewafer  9.14263    5.36546   1.704  0.09258 .  
## hard             -4.90968    3.45832  -1.420  0.15990    
## bar               0.03111    5.15156   0.006  0.99520    
## pluribus          0.05223    3.06073   0.017  0.98643    
## pricepercent     -2.50284    5.32401  -0.470  0.63966    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.9 on 74 degrees of freedom
## Multiple R-squared:  0.5163, Adjusted R-squared:  0.4509 
## F-statistic: 7.897 on 10 and 74 DF,  p-value: 1.704e-08

Now using only the significant (p<.05) predictors from the saturated model:

reducedmod <- lm(winpercent ~ chocolate + fruity + peanutyalmondy, data = candy)
summary(reducedmod)

## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0497  -7.3084  -0.4523   7.9446  23.8712 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.788      3.237  11.057  < 2e-16 ***
## chocolate        21.983      3.599   6.108 3.34e-08 ***
## fruity            7.753      3.625   2.139   0.0354 *  
## peanutyalmondy    9.066      3.520   2.576   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4475 
## F-statistic: 23.68 on 3 and 81 DF,  p-value: 4.209e-11

Using forward selection:

# install.packages(leaps)
library(leaps)

regfit.fwd <- regsubsets(winpercent ~chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + pricepercent, data = candy, method = "forward")
summary(regfit.fwd)

## Subset selection object
## Call: regsubsets.formula(winpercent ~ chocolate + fruity + caramel + 
##     peanutyalmondy + nougat + crispedricewafer + hard + bar + 
##     pluribus + pricepercent, data = candy, method = "forward")
## 10 Variables  (and intercept)
##                  Forced in Forced out
## chocolate            FALSE      FALSE
## fruity               FALSE      FALSE
## caramel              FALSE      FALSE
## peanutyalmondy       FALSE      FALSE
## nougat               FALSE      FALSE
## crispedricewafer     FALSE      FALSE
## hard                 FALSE      FALSE
## bar                  FALSE      FALSE
## pluribus             FALSE      FALSE
## pricepercent         FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard
## 1  ( 1 ) "*"       " "    " "     " "            " "    " "              " " 
## 2  ( 1 ) "*"       " "    " "     "*"            " "    " "              " " 
## 3  ( 1 ) "*"       "*"    " "     "*"            " "    " "              " " 
## 4  ( 1 ) "*"       "*"    " "     "*"            " "    "*"              " " 
## 5  ( 1 ) "*"       "*"    " "     "*"            " "    "*"              "*" 
## 6  ( 1 ) "*"       "*"    "*"     "*"            " "    "*"              "*" 
## 7  ( 1 ) "*"       "*"    "*"     "*"            " "    "*"              "*" 
## 8  ( 1 ) "*"       "*"    "*"     "*"            "*"    "*"              "*" 
##          bar pluribus pricepercent
## 1  ( 1 ) " " " "      " "         
## 2  ( 1 ) " " " "      " "         
## 3  ( 1 ) " " " "      " "         
## 4  ( 1 ) " " " "      " "         
## 5  ( 1 ) " " " "      " "         
## 6  ( 1 ) " " " "      " "         
## 7  ( 1 ) " " " "      "*"         
## 8  ( 1 ) " " " "      "*"

forwardmod <- lm(winpercent ~ chocolate + peanutyalmondy+ fruity + crispedricewafer + hard + caramel + pricepercent + nougat, data = candy)
summary(forwardmod)

## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity + 
##     crispedricewafer + hard + caramel + pricepercent + nougat, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.5098  -5.9621   0.0165   7.1088  21.5797 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        35.746      3.791   9.430 2.01e-14 ***
## chocolate          20.190      3.796   5.319 1.02e-06 ***
## peanutyalmondy     10.456      3.604   2.901  0.00486 ** 
## fruity             10.261      3.754   2.733  0.00780 ** 
## crispedricewafer    9.141      4.853   1.883  0.06347 .  
## hard               -4.921      3.352  -1.468  0.14622    
## caramel             3.605      3.576   1.008  0.31658    
## pricepercent       -2.493      4.993  -0.499  0.61905    
## nougat              1.991      4.760   0.418  0.67698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.76 on 76 degrees of freedom
## Multiple R-squared:  0.5163, Adjusted R-squared:  0.4653 
## F-statistic: 10.14 on 8 and 76 DF,  p-value: 1.635e-09

Variables stop being significant at crispedricewafer, so I would only include the variables above that.

Comparing the two models:

finalforwardmod <- lm(winpercent~chocolate + peanutyalmondy + fruity, data = candy)
summary(finalforwardmod)

## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0497  -7.3084  -0.4523   7.9446  23.8712 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.788      3.237  11.057  < 2e-16 ***
## chocolate        21.983      3.599   6.108 3.34e-08 ***
## peanutyalmondy    9.066      3.520   2.576   0.0118 *  
## fruity            7.753      3.625   2.139   0.0354 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4475 
## F-statistic: 23.68 on 3 and 81 DF,  p-value: 4.209e-11

summary(reducedmod)

## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0497  -7.3084  -0.4523   7.9446  23.8712 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.788      3.237  11.057  < 2e-16 ***
## chocolate        21.983      3.599   6.108 3.34e-08 ***
## fruity            7.753      3.625   2.139   0.0354 *  
## peanutyalmondy    9.066      3.520   2.576   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4475 
## F-statistic: 23.68 on 3 and 81 DF,  p-value: 4.209e-11

anova(finalforwardmod)

## Analysis of Variance Table
## 
## Response: winpercent
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## chocolate       1 7368.5  7368.5 61.6029 1.491e-11 ***
## peanutyalmondy  1  582.5   582.5  4.8700   0.03016 *  
## fruity          1  547.3   547.3  4.5754   0.03545 *  
## Residuals      81 9688.7   119.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(reducedmod)

## Analysis of Variance Table
## 
## Response: winpercent
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## chocolate       1 7368.5  7368.5 61.6029 1.491e-11 ***
## fruity          1  336.1   336.1  2.8100   0.09753 .  
## peanutyalmondy  1  793.7   793.7  6.6353   0.01181 *  
## Residuals      81 9688.7   119.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both models ended up with the same variables. They both have an MSE of 119.6. Looking at the ANOVA table, I would choose the forward selection model, because the p values for the F test are smaller.

Inference and Prediction

95% confidence intervals:

confint(finalforwardmod)

##                     2.5 %   97.5 %
## (Intercept)    29.3481409 42.22839
## chocolate      14.8220911 29.14475
## peanutyalmondy  2.0632117 16.06867
## fruity          0.5412327 14.96470

Estimating Win Percentage of York Peppermint Patties

We will use a prediction interval to estimate the range for the new observation because confidence intervals are based on means of scores, which will have significantly less variance than a single score. The prediction interval is wider than the confidence interval to accommodate the larger variance in estimating a single score.

york <- data.frame(c("York Peppermint Patty", 1, 0, 0, 0, 0, 0, 0, 0, 0, NULL, NULL, NULL))


indicatormod <- lm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus, data = candy)

# yorkprediction <- predict(indicatormod, newdata = york, interval = "prediction")

# The prediction function keeps giving me an error I can't figure out, so I wasn't able to find the values, but this is my best guess.



# Predicted Values for the winpercent of york patties, and the lower and upper bounds of the prediction interval

# yorkprediction$fit
# yorkprediction$lwr
# yorkprediction$upr