midterm2

Part II

(A) Import data

candy<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv",
                header=TRUE)
str(candy)

## 'data.frame':    85 obs. of  13 variables:
##  $ competitorname  : Factor w/ 85 levels "100 Grand","3 Musketeers",..: 1 2 45 46 3 4 5 6 7 8 ...
##  $ chocolate       : int  1 1 0 0 0 1 1 0 0 0 ...
##  $ fruity          : int  0 0 0 0 1 0 0 0 0 1 ...
##  $ caramel         : int  1 0 0 0 0 0 1 0 0 1 ...
##  $ peanutyalmondy  : int  0 0 0 0 0 1 1 1 0 0 ...
##  $ nougat          : int  0 1 0 0 0 0 1 0 0 0 ...
##  $ crispedricewafer: int  1 0 0 0 0 0 0 0 0 0 ...
##  $ hard            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bar             : int  1 1 0 0 0 1 1 0 0 0 ...
##  $ pluribus        : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ sugarpercent    : num  0.732 0.604 0.011 0.011 0.906 ...
##  $ pricepercent    : num  0.86 0.511 0.116 0.511 0.511 ...
##  $ winpercent      : num  67 67.6 32.3 46.1 52.3 ...

(B) Simple linear regression

Scatterplot of sugarpercent vs winpercent

ggplot(candy, aes(x=sugarpercent, y=winpercent))+
  geom_point()+
  ggtitle("Sugar percent vs. Win percent")+
  theme_minimal()

Describe scatterplot: The scatterplot above has a somewhat positive direction. While it could be described as linear, the strength is very weak and as such, potential outliers are unclear (though the point at the top left of the graph appears to be one).
SLR for winpercent as a function of sugarpercent:

mod<- lm(winpercent ~ sugarpercent, data=candy)
summary(mod)

## 
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.924 -11.066  -1.168   9.252  36.851 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    44.609      3.086  14.455   <2e-16 ***
## sugarpercent   11.924      5.560   2.145   0.0349 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared:  0.05251,    Adjusted R-squared:  0.04109 
## F-statistic:   4.6 on 1 and 83 DF,  p-value: 0.0349

The slope of 11.924 tells us that when there is a one unit increase in the percentage of sugar in a candy, we expect an 11.924 percent increase in the win percent on average.
The p-value for this model is 0.0349, so we can determine that the relationship between sugarpercent and winpercent is statistically significant.
Add least squares line to scatterplot:

ggplot(candy, aes(x=sugarpercent, y=winpercent))+
  geom_point()+
  theme_minimal()+
  ggtitle("Sugar percent vs. Win percent")+
  geom_smooth(method='lm', se=FALSE, col="light blue")

## `geom_smooth()` using formula 'y ~ x'

1. Residuals plot

ggplot(mod, aes(x=mod$fitted, y=mod$residuals))+
  geom_point()+
  theme_minimal()+
  ggtitle("Residual plot")+
  geom_abline(slope=0,intercept=0, lty=2, lwd=1,color="pink")+
  geom_smooth(se=FALSE, color="gray")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We observe that the points appear to be spread equally above and below y=0 (which models the mean zero assumption) and as fitted values increase, suggesting homoskedasticity. By adding the geom_smooth line (using LOESS), we investigate the potential of non-linearity and observe that our data appears to be generally linear.

QQ Plot

qqnorm(mod$residuals)
qqline(mod$residuals)

Observe that our points generally lie on the straight line. Thus, our assumption of normality is appropriate.

(C) Categorical explanatory variables and interactions

Regression model for winpercent as function of sugarpercent and chocolate:

mod2<- lm(winpercent~sugarpercent+chocolate, data=candy)
summary(mod2)

## 
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.6981  -8.5153  -0.4489   7.7602  27.4820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    38.262      2.552  14.990  < 2e-16 ***
## sugarpercent    8.567      4.355   1.967   0.0525 .  
## chocolate      18.273      2.469   7.401 1.06e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.4181 
## F-statistic: 31.18 on 2 and 82 DF,  p-value: 8.5e-11

The slope of 18.273 for the predictor “chocolate” tells us that when a candy has chocolate in it, we expect an 18.273 percent increase in the win percent on average.
If a candy has chocolate in it (represented by a value of 1), then the regression equation is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 38.262 + 8.567 x_1 + 18.273 (1) = 56.535 + 8.567 x_1\) where \(x_1\) represents sugarpercent and \(y\) represents winpercent. If a candy does not have chocolate in it (represented by a value of 0) then the equation is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 38.262 + 8.567 x_1 + 18.273 (0) = 38.262 + 8.567 x_1\) where \(x_1\) represents sugarpercent and \(y\) represents winpercent.
Add parallel least squares lines to scatter plot:

ggplot(candy, aes(x=sugarpercent, y=winpercent, color=as.factor(chocolate)))+
  geom_point()+
  theme_minimal()+
  labs(color="Chocolate (0 = no, 1=yes)")+
  geom_abline(slope=mod2$coefficients[2], intercept=mod2$coefficients[1], col="pink")+
  geom_abline(slope=mod2$coefficients[2], intercept=mod2$coefficients[1]+mod2$coefficients[3], col="light blue")

Looking at the distance between the parallel lines as well as our linear model, we observe that the variable “chocolate” has a significant affect.

Regression model including interaction between predictors:

mod3<- lm(winpercent~sugarpercent*chocolate, data=candy)
summary(mod3)

## 
## Call:
## lm(formula = winpercent ~ sugarpercent * chocolate, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4007  -8.0463  -0.7059   6.2815  28.5003 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              39.778      2.878  13.820   <2e-16 ***
## sugarpercent              5.221      5.257   0.993   0.3236    
## chocolate                13.051      5.230   2.495   0.0146 *  
## sugarpercent:chocolate   10.586      9.350   1.132   0.2609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.21 on 81 degrees of freedom
## Multiple R-squared:  0.4408, Adjusted R-squared:  0.4201 
## F-statistic: 21.28 on 3 and 81 DF,  p-value: 2.916e-10

We can interpret the coefficient estimate for sugarpercent:chocolate to be the increase in sugarpercent if a candy is chocolate. Thus, for chocolate candies we expect a 10.586% increase in sugar content compared to non-chocolate candies.
When a candy is not chocolate (represented by a value of 0), we have \(y = 39.778 + 5.221x_1 + 13.051(0) + 10.586x_1 (0) = 39.778 + 5.221x_1\) where \(y\) represents winpercent and \(x_1\) represents sugarpercent. When a candy has chocolate (represented by a value of 1), we have \(y = 39.778 + 5.221x_1 + 13.051(1) + 10.586x_1 (1) = 52.829 + 15.807x_1\) where \(y\) represents winpercent and \(x_1\) represents sugarpercent.
Add lines:

ggplot(candy, aes(x=sugarpercent, y=winpercent, color=as.factor(chocolate)))+
  geom_point()+
  theme_minimal()+
  labs(color="Chocolate (0 = no, 1=yes)")+
  geom_abline(slope = mod3$coefficients[2], intercept = mod3$coefficients[1], col="pink")+
  geom_abline(slope = mod3$coefficients[2]+mod3$coefficients[4], intercept = mod3$coefficients[1]+mod3$coefficients[3], col="light blue")

There is not a significant impact of the interaction on our model.

(D) Model Selection

Fully saturated model:

mod4<- lm(winpercent ~ chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer +hard+bar+pluribus+sugarpercent+pricepercent, data=candy)
summary(mod4)

## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + peanutyalmondy + 
##     nougat + crispedricewafer + hard + bar + pluribus + sugarpercent + 
##     pricepercent, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2244  -6.6247   0.1986   6.8420  23.8680 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       34.5340     4.3199   7.994 1.44e-11 ***
## chocolate         19.7481     3.8987   5.065 2.96e-06 ***
## fruity             9.4223     3.7630   2.504  0.01452 *  
## caramel            2.2245     3.6574   0.608  0.54493    
## peanutyalmondy    10.0707     3.6158   2.785  0.00681 ** 
## nougat             0.8043     5.7164   0.141  0.88849    
## crispedricewafer   8.9190     5.2679   1.693  0.09470 .  
## hard              -6.1653     3.4551  -1.784  0.07852 .  
## bar                0.4415     5.0611   0.087  0.93072    
## pluribus          -0.8545     3.0401  -0.281  0.77945    
## sugarpercent       9.0868     4.6595   1.950  0.05500 .  
## pricepercent      -5.9284     5.5132  -1.075  0.28578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.7 on 73 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.4709 
## F-statistic: 7.797 on 11 and 73 DF,  p-value: 9.504e-09

We see that chocolate, fruity, and peanutyalmondy have p values of less than 0.05 and thus are significant at the 0.05 significance level. The following is a reduced model to include only these predictors:

reducedmod4<-lm(winpercent ~ chocolate+fruity+peanutyalmondy, data=candy)
summary(reducedmod4)

## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0497  -7.3084  -0.4523   7.9446  23.8712 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.788      3.237  11.057  < 2e-16 ***
## chocolate        21.983      3.599   6.108 3.34e-08 ***
## fruity            7.753      3.625   2.139   0.0354 *  
## peanutyalmondy    9.066      3.520   2.576   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4475 
## F-statistic: 23.68 on 3 and 81 DF,  p-value: 4.209e-11

We will perform forward selection.

Since we are having technical difficulties with the leaps package including 95 variables instead of the 13 we consider in this data set, let us try doing it “by hand.” We start with the predictor chocolate because it has the lowest p value in our saturated model and then add by next lowest p value:

modfwd1<- lm(winpercent~chocolate, data=candy)
summary(modfwd1)

## 
## Call:
## lm(formula = winpercent ~ chocolate, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.1995  -7.5633  -0.2379   8.5623  24.8954 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.142      1.648  25.574  < 2e-16 ***
## chocolate     18.779      2.498   7.519 5.86e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.42 on 83 degrees of freedom
## Multiple R-squared:  0.4052, Adjusted R-squared:  0.398 
## F-statistic: 56.53 on 1 and 83 DF,  p-value: 5.86e-11

modfwd2<- lm(winpercent~chocolate+peanutyalmondy, data=candy)
summary(modfwd2)

## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0296  -7.6657   0.0797   7.3629  25.2130 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      41.825      1.619  25.828  < 2e-16 ***
## chocolate        16.625      2.640   6.297 1.42e-08 ***
## peanutyalmondy    7.623      3.529   2.160   0.0337 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.17 on 82 degrees of freedom
## Multiple R-squared:  0.4372, Adjusted R-squared:  0.4235 
## F-statistic: 31.85 on 2 and 82 DF,  p-value: 5.822e-11

modfwd3<- lm(winpercent~chocolate+peanutyalmondy+fruity, data=candy)
summary(modfwd3)

## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0497  -7.3084  -0.4523   7.9446  23.8712 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.788      3.237  11.057  < 2e-16 ***
## chocolate        21.983      3.599   6.108 3.34e-08 ***
## peanutyalmondy    9.066      3.520   2.576   0.0118 *  
## fruity            7.753      3.625   2.139   0.0354 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4475 
## F-statistic: 23.68 on 3 and 81 DF,  p-value: 4.209e-11

modfwd4<- lm(winpercent~chocolate+peanutyalmondy+fruity+sugarpercent, data=candy)
summary(modfwd4)

## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity + 
##     sugarpercent, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.0204  -7.9139  -0.0392   7.4155  25.8389 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      32.803      3.612   9.081 6.09e-14 ***
## chocolate        21.301      3.574   5.961 6.45e-08 ***
## peanutyalmondy    8.657      3.482   2.486   0.0150 *  
## fruity            7.271      3.588   2.026   0.0461 *  
## sugarpercent      7.449      4.206   1.771   0.0804 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.8 on 80 degrees of freedom
## Multiple R-squared:  0.4874, Adjusted R-squared:  0.4617 
## F-statistic: 19.01 on 4 and 80 DF,  p-value: 5.056e-11

# modfwd4 took it too far, so let us stick with modfwd3
summary(modfwd3)

## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0497  -7.3084  -0.4523   7.9446  23.8712 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.788      3.237  11.057  < 2e-16 ***
## chocolate        21.983      3.599   6.108 3.34e-08 ***
## peanutyalmondy    9.066      3.520   2.576   0.0118 *  
## fruity            7.753      3.625   2.139   0.0354 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4475 
## F-statistic: 23.68 on 3 and 81 DF,  p-value: 4.209e-11

Because the predictor sugarpercent is not significant at the 0.05 level, we will not include it in our model. Thus, through forward stepwise selection, we have attained the model with winpercent as a function of chocolate, peanutyalmondy, and fruity.

1. Our models from part (a) and part (b) would include the same variables: choclate, peanutyalmondy, and fruity. (ii) As such, they would have the same MSE which we can observe in the ANOVA table below is 119.6. (iii) N/A (same model)

anova(modfwd3)

## Analysis of Variance Table
## 
## Response: winpercent
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## chocolate       1 7368.5  7368.5 61.6029 1.491e-11 ***
## peanutyalmondy  1  582.5   582.5  4.8700   0.03016 *  
## fruity          1  547.3   547.3  4.5754   0.03545 *  
## Residuals      81 9688.7   119.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(reducedmod4)

## Analysis of Variance Table
## 
## Response: winpercent
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## chocolate       1 7368.5  7368.5 61.6029 1.491e-11 ***
## fruity          1  336.1   336.1  2.8100   0.09753 .  
## peanutyalmondy  1  793.7   793.7  6.6353   0.01181 *  
## Residuals      81 9688.7   119.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(E) Inference and prediction

95% confidence interval for coefficient estimates of model:

confint(reducedmod4)

##                     2.5 %   97.5 %
## (Intercept)    29.3481409 42.22839
## chocolate      14.8220911 29.14475
## fruity          0.5412327 14.96470
## peanutyalmondy  2.0632117 16.06867

My absolute favorite candy in the world is Sour Patch watermelon. I thought I was a tried-and-true Snickers gal until I tried them. While the dataset includes other Sour Patch candy, it does not include watermelon (which is a whole different flavor experience in my humble opinion). I am even munching on some right now as I work on this midterm.

Consider this model which includes the indicator variables:

indicatormod<- lm(winpercent~chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus, data=candy)
summary(indicatormod)

## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + peanutyalmondy + 
##     nougat + crispedricewafer + hard + bar + pluribus, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6779  -5.6765   0.3966   7.0583  21.9144 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       35.0155     4.0781   8.586 9.13e-13 ***
## chocolate         19.9058     3.8975   5.107 2.41e-06 ***
## fruity            10.2677     3.7887   2.710  0.00833 ** 
## caramel            3.3843     3.6034   0.939  0.35065    
## peanutyalmondy    10.1410     3.5949   2.821  0.00612 ** 
## nougat             2.4163     5.6897   0.425  0.67229    
## crispedricewafer   8.9915     5.3279   1.688  0.09564 .  
## hard              -4.8726     3.4394  -1.417  0.16072    
## bar               -0.7220     4.8707  -0.148  0.88256    
## pluribus          -0.1599     3.0115  -0.053  0.95779    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.85 on 75 degrees of freedom
## Multiple R-squared:  0.5148, Adjusted R-squared:  0.4566 
## F-statistic: 8.842 on 9 and 75 DF,  p-value: 6.052e-09

Sour Patch watermelon would merit the following values in our binary variables: chocolate = 0, fruity = 1, caramel = 0, peanutyalmondy = 0, nougat = 0, crispedricewafer = 0, hard = 0, bar = 0, pluribus = 1. Thus, we attain the equation \[y_{(SP watermelon)} = 35.0155 + 10.2677(1) - 0.1599(1) = 45.1233\] Therefore, we would expect Sour Patch watermelon to win 45.12% of the time according to this model. We confirm this with R:

predict(indicatormod, data.frame(chocolate=0, fruity=1, caramel=0, peanutyalmondy=0, nougat=0, crispedricewafer=0, hard=0, bar=0, pluribus=1), type="response")

##        1 
## 45.12327

In this case we would use a prediction interval because we are attempting to predict the range in which a brand new observation would fall.

Extra credit: use R to find this prediction interval

predict(indicatormod, data.frame(chocolate=0, fruity=1, caramel=0, peanutyalmondy=0, nougat=0, crispedricewafer=0, hard=0, bar=0, pluribus=1), interval="prediction")

##        fit      lwr      upr
## 1 45.12327 23.06349 67.18306