Midterm Part II

A) Import the data into R

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(leaps)
candy<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv", header=TRUE)
#View(candy)
str(candy)
## 'data.frame':    85 obs. of  13 variables:
##  $ competitorname  : chr  "100 Grand" "3 Musketeers" "One dime" "One quarter" ...
##  $ chocolate       : int  1 1 0 0 0 1 1 0 0 0 ...
##  $ fruity          : int  0 0 0 0 1 0 0 0 0 1 ...
##  $ caramel         : int  1 0 0 0 0 0 1 0 0 1 ...
##  $ peanutyalmondy  : int  0 0 0 0 0 1 1 1 0 0 ...
##  $ nougat          : int  0 1 0 0 0 0 1 0 0 0 ...
##  $ crispedricewafer: int  1 0 0 0 0 0 0 0 0 0 ...
##  $ hard            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bar             : int  1 1 0 0 0 1 1 0 0 0 ...
##  $ pluribus        : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ sugarpercent    : num  0.732 0.604 0.011 0.011 0.906 ...
##  $ pricepercent    : num  0.86 0.511 0.116 0.511 0.511 ...
##  $ winpercent      : num  67 67.6 32.3 46.1 52.3 ...

B) Simple Linear Regression (SLR)

a) Create a scatterplot of sugarpercent vs winpercent

ggplot(candy,aes(sugarpercent,winpercent))+
  geom_point()

b) Describe the scatterplot above (direction, form, strength, outliers)

  • There is no discernible direction, and the two variables are not strongly correlated. The win percent appears to be scattered almost randomly with respect to the sugar percent of the candy.

c) Create a simple linear regression (SLR) model for winpercent as a function of sugarpercent.

lm1<- lm(winpercent~sugarpercent, data=candy)
summary(lm1)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.924 -11.066  -1.168   9.252  36.851 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    44.609      3.086  14.455   <2e-16 ***
## sugarpercent   11.924      5.560   2.145   0.0349 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared:  0.05251,    Adjusted R-squared:  0.04109 
## F-statistic:   4.6 on 1 and 83 DF,  p-value: 0.0349
  1. Interpret the slope in the context of the problem.
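  • The slope is 11.92: as sugarpercent goes from 0 to 1 (the lowest to the highest sugar percentile), the predicted win percent rises by 11.92 percentage points, so each 0.1 increase in sugarpercent adds about 1.19 points. The intercept, 44.61%, is the predicted win percent for a candy at sugarpercent 0.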
  1. Is there a significant relationship between sugarpercent and winpercent?
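  • Yes, but the relationship is weak. The p-value for the slope is 0.0349, which is below 0.05, so the relationship is statistically significant at the 5% level. However, R-squared is only 0.0525, so sugarpercent explains about 5% of the variation in winpercent.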
  1. Add the least squares line to your scatter plot.
ggplot(candy,aes(sugarpercent,winpercent))+
  geom_point()+
  geom_abline(intercept = lm1$coefficients[1], slope=lm1$coefficients[2], color="pink", lwd=1)
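
The slope interpretation and the significance check can both be reproduced from the numbers printed by summary(lm1). The following is a minimal sketch that copies the rounded coefficients and t statistic from that output; the helper name predict_win is hypothetical and is not part of the original analysis.

```r
# Coefficients copied (rounded) from the summary(lm1) output above.
b0 <- 44.609   # intercept
b1 <- 11.924   # slope on sugarpercent (a 0-1 percentile scale)

# Hypothetical helper: fitted winpercent for a given sugarpercent.
predict_win <- function(sugar) b0 + b1 * sugar

# A 0.1 increase in sugarpercent changes the fitted value by b1 * 0.1.
step_effect <- predict_win(0.6) - predict_win(0.5)

# Two-sided p-value for the slope from its t statistic and 83 residual df.
t_slope <- 2.145
p_slope <- 2 * pt(-abs(t_slope), df = 83)

print(c(step_effect = step_effect, p_slope = p_slope))
```

Comparing p_slope with 0.05 gives the significance decision, while step_effect expresses the slope in units that are easier to read than a full 0-to-1 change in sugarpercent.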

d) Should we trust the inference we made in the previous step? To assess this check the model diagnostics.

  • No, we should not trust the last model.

I & II)

par(mfrow = c(2, 2))
plot(lm1)
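
The diagnostic plots can be backed up with a couple of numeric checks. This is a rough sketch, not a full diagnostic workflow: it refits lm1 from the same CSV used in part A, runs a Shapiro-Wilk test on the residuals, and flags points by Cook's distance. The 4/n cutoff is a common rule of thumb that is assumed here, not something the assignment specifies.

```r
# Refit the SLR from part B(c) on the same FiveThirtyEight CSV used in part A.
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv"
candy <- read.csv(url, header = TRUE)
lm1 <- lm(winpercent ~ sugarpercent, data = candy)

res <- residuals(lm1)

# Normality of residuals: Shapiro-Wilk test (stats package, base R).
sw <- shapiro.test(res)

# Influence: flag candies whose Cook's distance exceeds 4/n (assumed rule of thumb).
cd <- cooks.distance(lm1)
flagged <- candy$competitorname[cd > 4 / nrow(candy)]

print(sw$p.value)
print(flagged)
```

A small Shapiro-Wilk p-value or several flagged candies would support the decision not to trust the inference; the residuals-vs-fitted and scale-location panels still need to be read by eye for curvature and non-constant variance.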

C)

Now think of your favorite candy. What qualities does it have? Is it chocolatey? Is it fruity? Does it have caramel? Now that you are thinking of your favorite candy choose one of the indicator variables in the dataset to use as an explanatory variable.

  • Kit Kat is the candy I choose! It is chocolatey and has wafers.

a) Fit a regression model for winpercent as a function of sugarpercent and the variable of your choice.

lm2<- lm(winpercent~sugarpercent+chocolate, data=candy)
summary(lm2)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.6981  -8.5153  -0.4489   7.7602  27.4820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    38.262      2.552  14.990  < 2e-16 ***
## sugarpercent    8.567      4.355   1.967   0.0525 .  
## chocolate      18.273      2.469   7.401 1.06e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.4181 
## F-statistic: 31.18 on 2 and 82 DF,  p-value: 8.5e-11
  1. Interpret the slope coefficient for your variable in the context of the problem
  • The chocolate coefficient is 18.27: at any fixed sugarpercent, a candy that contains chocolate has a predicted win percent 18.27 percentage points higher than a candy that does not.
  • For a candy without chocolate, the predicted win percent is 38.26% at sugarpercent 0 and rises by 8.57 points as sugarpercent goes from 0 to 1. For a candy with chocolate, it starts at 56.54% (38.262 + 18.273) and rises by the same 8.57 points.
  1. Write the two equations that result from your model.
  • If candy has no chocolate: Y=8.567x+38.262, where x is sugarpercent and Y is the predicted winpercent
  • If candy has chocolate: Y=8.567x+(38.262+18.273)
  1. Add the two parallel least squares lines to your scatter plot.
ggplot(candy,aes(sugarpercent,winpercent))+
  geom_point()+
  geom_abline(intercept = lm2$coefficients[1], slope=lm2$coefficients[2], color="pink", lwd=1)+
  geom_abline(intercept = lm2$coefficients[1]+lm2$coefficients[3], slope=lm2$coefficients[2], color="brown", lwd=1)

  1. Is there a significant effect of your variable in the model?
  • Yes. Chocolate boosts the win % and is statistically significant (p = 1.06e-10), but the amount of sugar in the candy is not significant at the 0.05 level (p = 0.0525).
  • It is still easy to assume that sugar content matters to a candy's rating once we account for whether it contains chocolate. I don't think many people would like a chocolate bar with no sugar in it.
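
The two parallel lines implied by the additive model can be checked numerically. This is a small sketch using the rounded coefficients printed by summary(lm2); the helper fit_lm2 is a hypothetical name introduced only for illustration.

```r
# Coefficients copied (rounded) from the summary(lm2) output above.
b <- c(intercept = 38.262, sugarpercent = 8.567, chocolate = 18.273)

# Hypothetical helper: fitted winpercent under the additive model.
fit_lm2 <- function(sugar, chocolate) {
  unname(b["intercept"] + b["sugarpercent"] * sugar + b["chocolate"] * chocolate)
}

sugar_grid <- c(0, 0.5, 1)
no_choc <- fit_lm2(sugar_grid, chocolate = 0)
choc    <- fit_lm2(sugar_grid, chocolate = 1)

# The gap between the two lines equals the chocolate coefficient at every sugar level.
gap <- choc - no_choc
print(data.frame(sugar_grid, no_choc, choc, gap))
```

Because the gap column is constant, the lines are parallel: chocolate shifts the intercept but leaves the sugarpercent slope unchanged.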

b) Fit a regression model for winpercent as a function of sugarpercent, the variable of your choice, and an interaction of the two.

lm3<- lm(winpercent~sugarpercent*chocolate, data=candy)
summary(lm3)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent * chocolate, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4007  -8.0463  -0.7059   6.2815  28.5003 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              39.778      2.878  13.820   <2e-16 ***
## sugarpercent              5.221      5.257   0.993   0.3236    
## chocolate                13.051      5.230   2.495   0.0146 *  
## sugarpercent:chocolate   10.586      9.350   1.132   0.2609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.21 on 81 degrees of freedom
## Multiple R-squared:  0.4408, Adjusted R-squared:  0.4201 
## F-statistic: 21.28 on 3 and 81 DF,  p-value: 2.916e-10

I) Interpret the slope coefficient for the interaction in the context of the problem.

  • The interaction coefficient is 10.586: the slope on sugarpercent is 10.586 points steeper for candies with chocolate than for candies without. For a candy without chocolate, the predicted win percent is 39.78% at sugarpercent 0 and rises by 5.22 points as sugarpercent goes from 0 to 1.
  • For a candy with chocolate, the predicted win percent is 52.83% (39.778 + 13.051) at sugarpercent 0 and rises by 15.81 points (5.221 + 10.586) over the same range.

  1. Write the two equations that result from your model.
  • If candy has no chocolate: Y=5.221x+39.778
  • If candy has chocolate: Y=(5.221+10.586)x+(39.778+13.051)
  1. Add the two lines to your scatter plot.
ggplot(candy,aes(sugarpercent,winpercent))+
  geom_point()+
  geom_abline(intercept = lm3$coefficients[1], slope=lm3$coefficients[2], color="pink", lwd=1)+
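  # chocolate line: both the intercept and the slope shift under the interaction model
  geom_abline(intercept = lm3$coefficients[1]+lm3$coefficients[3], slope=lm3$coefficients[2]+lm3$coefficients[4], color="brown", lwd=1)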

  1. Is there a significant effect of the interaction in the model?
  • No. The interaction term is not significant (p = 0.2609). The only variable that is significant in this model is whether the candy contains chocolate (p = 0.0146).
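
With an interaction term the two lines are no longer parallel, because the chocolate group gets its own intercept and its own slope. The sketch below derives both lines from the rounded coefficients printed by summary(lm3); the names lines_lm3 and gap_at are hypothetical.

```r
# Coefficients copied (rounded) from the summary(lm3) output above.
b <- c(intercept = 39.778, sugar = 5.221, chocolate = 13.051, interaction = 10.586)

# Intercept and slope of the fitted line for each group.
lines_lm3 <- data.frame(
  group     = c("no chocolate", "chocolate"),
  intercept = unname(c(b["intercept"], b["intercept"] + b["chocolate"])),
  slope     = unname(c(b["sugar"],     b["sugar"] + b["interaction"]))
)
print(lines_lm3)

# Hypothetical helper: chocolate minus no-chocolate fitted value at sugarpercent s.
gap_at <- function(s) unname(b["chocolate"] + b["interaction"] * s)
print(gap_at(c(0, 1)))
```

The gap between the groups now depends on sugarpercent, which is exactly what the interaction coefficient measures. Since that coefficient is not significant here, the simpler parallel-lines model is a reasonable alternative.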

D) Model Selection

a) First, fit a fully saturated model with all the variables. Which variables are significant at a 0.05 significance level? What is the reduced model?

  • Only chocolate, fruity, and peanutyalmondy are significant at the 0.05 level, so the reduced model is winpercent ~ chocolate + fruity + peanutyalmondy.

#fully saturated model
lm4<-lm(winpercent~
          sugarpercent+chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus+pricepercent, 
        data=candy)
summary(lm4)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate + fruity + 
##     caramel + peanutyalmondy + nougat + crispedricewafer + hard + 
##     bar + pluribus + pricepercent, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2244  -6.6247   0.1986   6.8420  23.8680 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       34.5340     4.3199   7.994 1.44e-11 ***
## sugarpercent       9.0868     4.6595   1.950  0.05500 .  
## chocolate         19.7481     3.8987   5.065 2.96e-06 ***
## fruity             9.4223     3.7630   2.504  0.01452 *  
## caramel            2.2245     3.6574   0.608  0.54493    
## peanutyalmondy    10.0707     3.6158   2.785  0.00681 ** 
## nougat             0.8043     5.7164   0.141  0.88849    
## crispedricewafer   8.9190     5.2679   1.693  0.09470 .  
## hard              -6.1653     3.4551  -1.784  0.07852 .  
## bar                0.4415     5.0611   0.087  0.93072    
## pluribus          -0.8545     3.0401  -0.281  0.77945    
## pricepercent      -5.9284     5.5132  -1.075  0.28578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.7 on 73 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.4709 
## F-statistic: 7.797 on 11 and 73 DF,  p-value: 9.504e-09
#reduced model
lm4.5<-lm(winpercent~ chocolate+fruity+peanutyalmondy,data=candy)
summary(lm4.5)
## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + peanutyalmondy, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0497  -7.3084  -0.4523   7.9446  23.8712 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.788      3.237  11.057  < 2e-16 ***
## chocolate        21.983      3.599   6.108 3.34e-08 ***
## fruity            7.753      3.625   2.139   0.0354 *  
## peanutyalmondy    9.066      3.520   2.576   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4475 
## F-statistic: 23.68 on 3 and 81 DF,  p-value: 4.209e-11
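
One way to check whether dropping the eight non-significant terms costs much fit is a partial F-test of the reduced model against the full one. With the data loaded, anova(lm4.5, lm4) would do this directly; the sketch below is an approximation that uses only the rounded R-squared values and residual degrees of freedom printed in the two summaries.

```r
# Values copied (rounded) from the summary(lm4) and summary(lm4.5) output above.
r2_full <- 0.5402; df_full <- 73   # full model: R-squared, residual df
r2_red  <- 0.4673; df_red  <- 81   # reduced model: R-squared, residual df

# Number of terms dropped when going from the full to the reduced model.
q <- df_red - df_full

# Partial F statistic written in terms of R-squared (same total sum of squares).
f_stat <- ((r2_full - r2_red) / q) / ((1 - r2_full) / df_full)
p_val  <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

print(c(dropped = q, F = f_stat, p = p_val))
```

A p-value above 0.05 would mean the dropped terms do not jointly improve the fit by a significant amount, which supports using the reduced model. Because the inputs are rounded, the result is only approximate.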

b) Now, perform either forward or backward selection. What variables would you put in the model?

regfit.fwd<-regsubsets(winpercent~sugarpercent+chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus+pricepercent, data=candy,
                       method="forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(winpercent ~ sugarpercent + chocolate + fruity + 
##     caramel + peanutyalmondy + nougat + crispedricewafer + hard + 
##     bar + pluribus + pricepercent, data = candy, method = "forward")
## 11 Variables  (and intercept)
##                  Forced in Forced out
## sugarpercent         FALSE      FALSE
## chocolate            FALSE      FALSE
## fruity               FALSE      FALSE
## caramel              FALSE      FALSE
## peanutyalmondy       FALSE      FALSE
## nougat               FALSE      FALSE
## crispedricewafer     FALSE      FALSE
## hard                 FALSE      FALSE
## bar                  FALSE      FALSE
## pluribus             FALSE      FALSE
## pricepercent         FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          sugarpercent chocolate fruity caramel peanutyalmondy nougat
## 1  ( 1 ) " "          "*"       " "    " "     " "            " "   
## 2  ( 1 ) " "          "*"       " "    " "     "*"            " "   
## 3  ( 1 ) " "          "*"       "*"    " "     "*"            " "   
## 4  ( 1 ) " "          "*"       "*"    " "     "*"            " "   
## 5  ( 1 ) "*"          "*"       "*"    " "     "*"            " "   
## 6  ( 1 ) "*"          "*"       "*"    " "     "*"            " "   
## 7  ( 1 ) "*"          "*"       "*"    " "     "*"            " "   
## 8  ( 1 ) "*"          "*"       "*"    "*"     "*"            " "   
##          crispedricewafer hard bar pluribus pricepercent
## 1  ( 1 ) " "              " "  " " " "      " "         
## 2  ( 1 ) " "              " "  " " " "      " "         
## 3  ( 1 ) " "              " "  " " " "      " "         
## 4  ( 1 ) "*"              " "  " " " "      " "         
## 5  ( 1 ) "*"              " "  " " " "      " "         
## 6  ( 1 ) "*"              "*"  " " " "      " "         
## 7  ( 1 ) "*"              "*"  " " " "      "*"         
## 8  ( 1 ) "*"              "*"  " " " "      "*"
lm5<-lm(winpercent~chocolate+peanutyalmondy+fruity, data=candy)
summary(lm5)
## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0497  -7.3084  -0.4523   7.9446  23.8712 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.788      3.237  11.057  < 2e-16 ***
## chocolate        21.983      3.599   6.108 3.34e-08 ***
## peanutyalmondy    9.066      3.520   2.576   0.0118 *  
## fruity            7.753      3.625   2.139   0.0354 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.94 on 81 degrees of freedom
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4475 
## F-statistic: 23.68 on 3 and 81 DF,  p-value: 4.209e-11
  • Chocolate, peanutyalmondy, and fruity are the variables I would use for the model.
  1. Compare and contrast the reduced model from (a) and the final model from part (b).

I) Are the variables in the models the same or different?

  • The variables are the same.

II) What are the MSEs for each model?
anova(lm4.5)
## Analysis of Variance Table
## 
## Response: winpercent
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## chocolate       1 7368.5  7368.5 61.6029 1.491e-11 ***
## fruity          1  336.1   336.1  2.8100   0.09753 .  
## peanutyalmondy  1  793.7   793.7  6.6353   0.01181 *  
## Residuals      81 9688.7   119.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# MSE for lm4.5 is 119.6 (the residual Mean Sq: 9688.7 / 81)
anova(lm5)
## Analysis of Variance Table
## 
## Response: winpercent
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## chocolate       1 7368.5  7368.5 61.6029 1.491e-11 ***
## peanutyalmondy  1  582.5   582.5  4.8700   0.03016 *  
## fruity          1  547.3   547.3  4.5754   0.03545 *  
## Residuals      81 9688.7   119.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# MSE for lm5 is 119.6 (the residual Mean Sq: 9688.7 / 81)
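
The MSE is the residual sum of squares divided by the residual degrees of freedom, which is the Mean Sq entry on the Residuals row of the ANOVA table. A short sketch using the values printed in the tables above:

```r
# Residuals row copied from the anova() tables above (identical for both models).
sse    <- 9688.7   # residual sum of squares
df_res <- 81       # residual degrees of freedom

# Mean squared error and the residual standard error it implies.
mse <- sse / df_res
rse <- sqrt(mse)

print(c(MSE = mse, RSE = rse))
```

The square root of the MSE should match the "Residual standard error" line in the model summary, which is a quick consistency check. The two models share the same MSE because they contain the same predictors; only the order of the terms in the sequential ANOVA table differs.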

III) Which model would you prefer and why?

  • The models ended up being the same, so I have no preference between them.

E) Inference and Prediction

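a) Provide 95% confidence intervals for the coefficient estimates of your favorite model (after comparing your models in the previous section). You can use the appropriate R function to do this.

  • The model summary gives the following p-values: Intercept < 2e-16, chocolate 3.34e-08, peanutyalmondy 0.0118, fruity 0.0354.
  • The 95% confidence intervals are estimate ± t × SE with t ≈ 1.99 on 81 degrees of freedom (confint(lm5) returns them directly). From the printed estimates and standard errors they are approximately: Intercept (29.3, 42.2), chocolate (14.8, 29.1), peanutyalmondy (2.1, 16.1), fruity (0.5, 15.0).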
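
A 95% confidence interval for each coefficient is the estimate plus or minus a t critical value times the standard error. With the fitted model available, confint(lm5) returns these directly; the sketch below rebuilds them by hand from the rounded estimates and standard errors printed by summary(lm5), so the last digit may differ slightly from confint().

```r
# Estimates and standard errors copied (rounded) from the summary(lm5) output above.
est <- c(intercept = 35.788, chocolate = 21.983, peanutyalmondy = 9.066, fruity = 7.753)
se  <- c(3.237, 3.599, 3.520, 3.625)
df_res <- 81   # residual degrees of freedom

# Two-sided 95% critical value of the t distribution.
t_crit <- qt(0.975, df = df_res)

# Lower and upper limits for each coefficient.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 2))
```

An interval that excludes zero corresponds to a coefficient that is significant at the 0.05 level, so the intervals and the p-values should tell the same story.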

b) Finally, look at the list of candies included in this dataset. Think of your favorite candy that is not in the dataset (like raisinettes!). What is it? Imagine that you want to estimate the win percentage for this new candy.

  • I choose Jolly Ranchers, which are hard and fruity.

  1. Use the model that contains only the indicator variables (from the fivethirtyeight article) to find the estimated fitted value for this new candy.
lm6<-lm(winpercent~chocolate+fruity+caramel+peanutyalmondy+nougat+crispedricewafer+hard+bar+pluribus, data=candy) 
summary(lm6)
## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + peanutyalmondy + 
##     nougat + crispedricewafer + hard + bar + pluribus, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6779  -5.6765   0.3966   7.0583  21.9144 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       35.0155     4.0781   8.586 9.13e-13 ***
## chocolate         19.9058     3.8975   5.107 2.41e-06 ***
## fruity            10.2677     3.7887   2.710  0.00833 ** 
## caramel            3.3843     3.6034   0.939  0.35065    
## peanutyalmondy    10.1410     3.5949   2.821  0.00612 ** 
## nougat             2.4163     5.6897   0.425  0.67229    
## crispedricewafer   8.9915     5.3279   1.688  0.09564 .  
## hard              -4.8726     3.4394  -1.417  0.16072    
## bar               -0.7220     4.8707  -0.148  0.88256    
## pluribus          -0.1599     3.0115  -0.053  0.95779    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.85 on 75 degrees of freedom
## Multiple R-squared:  0.5148, Adjusted R-squared:  0.4566 
## F-statistic: 8.842 on 9 and 75 DF,  p-value: 6.052e-09
35.0155+10.2677-4.8726
## [1] 40.4106

The estimated win percent for Jolly Ranchers is 40.41% (intercept + fruity coefficient + hard coefficient).

  1. Would you use a prediction or confidence intervals to create a reasonable range for observing this new candy? Why?
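  • A prediction interval is the appropriate choice, because we want a range for one new candy rather than for the average win percent of all hard, fruity candies. A prediction interval includes the candy-to-candy variation around the mean and is therefore wider than a confidence interval. Even so, I would be cautious about using this model for new candies, as many of the predictors are not statistically significant.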
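
The difference between the two kinds of interval can be seen by asking predict() for both. This is a sketch under stated assumptions: it refits lm6 from the same CSV used in part A and describes the new candy as hard and fruity with every other indicator set to 0, which is an assumed profile. The object names new_candy, conf_int, and pred_int are illustrative.

```r
# Refit the indicator-only model from part E(b) on the same FiveThirtyEight CSV.
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv"
candy <- read.csv(url, header = TRUE)
lm6 <- lm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat +
            crispedricewafer + hard + bar + pluribus, data = candy)

# Assumed profile for the new candy: hard and fruity, all other indicators 0.
new_candy <- data.frame(chocolate = 0, fruity = 1, caramel = 0, peanutyalmondy = 0,
                        nougat = 0, crispedricewafer = 0, hard = 1, bar = 0, pluribus = 0)

# Confidence interval: range for the mean win percent of candies with this profile.
conf_int <- predict(lm6, newdata = new_candy, interval = "confidence", level = 0.95)

# Prediction interval: range for a single new candy with this profile.
pred_int <- predict(lm6, newdata = new_candy, interval = "prediction", level = 0.95)

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

Both rows share the same fitted value, which should agree with the hand calculation above. The prediction row is wider because it adds the residual variance of an individual candy to the uncertainty in the estimated mean.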