Midterm 2

Midterm 2 Part II

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   1.0.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(dplyr)
library(broom)
candy<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv",
 header=TRUE)

B) Simple Linear Regression

a) Create a scatterplot of sugarpercent vs winpercent.

b) Describe the scatterplot above (direction, form, strength, outliers)

The scatterplot shows a positive, linear, weak relationship with possible outliers.
Although the graph looks scattered, there is a stronger concentration of dots along the bottom left and top right. However, the strength is weak since the points are very diluted throughout the graph.

c) Create a simple linear regression (SLR) model for winpercent as a function of sugarpercent.

slr = lm(Sugar ~ Win)
summary(slr)

## 
## Call:
## lm(formula = Sugar ~ Win)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58358 -0.21530 -0.01378  0.20718  0.56069 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 0.257063   0.107593   2.389   0.0192 *
## Win         0.004404   0.002053   2.145   0.0349 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2769 on 83 degrees of freedom
## Multiple R-squared:  0.05251,    Adjusted R-squared:  0.04109 
## F-statistic:   4.6 on 1 and 83 DF,  p-value: 0.0349

plot(Sugar ~ Win)
abline(coef(slr[1]), coef(slr)[2])

The slope is the estimated rate of change between sugar and win percentage. So, for every one unit of win percentage, we can expect 0.0044 increase in sugar percentage. The intercept in the expected mean value of sugar percentage or sugar percentage when win is zero. In this case, win percentage cannot really reach zero so it is not signifigant in this case.

d) Should we trust the inference we made in the previous step? To assess this check the model diagnostics.

summary(slr)

## 
## Call:
## lm(formula = Sugar ~ Win)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58358 -0.21530 -0.01378  0.20718  0.56069 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 0.257063   0.107593   2.389   0.0192 *
## Win         0.004404   0.002053   2.145   0.0349 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2769 on 83 degrees of freedom
## Multiple R-squared:  0.05251,    Adjusted R-squared:  0.04109 
## F-statistic:   4.6 on 1 and 83 DF,  p-value: 0.0349

plot(Sugar ~ Win)
abline(coef(slr[1]), coef(slr)[2])

plot(slr)

Looking at the Residual vs Fitted plot and the Scale-Location plot, the points do not form a clear pattern and are roughly spread out below and above the line. Therefore, we can assume linearity and homoscedasticity. Looking at the qq plot, we can see that the points roughly follow the 45 degree line with expection of three flagged values. There is slight departure at the extremes but we can assume that the data is roughly normal.

C) Categorical Explanatory Variables and Interactions

a) Fit a regression model for winpercent as a function of sugarpercent and the variable of your choice.

Peanut = factor(candy$peanutyalmondy)
model = lm(Win ~ Sugar + Peanut, data=candy)

model

## 
## Call:
## lm(formula = Win ~ Sugar + Peanut, data = candy)
## 
## Coefficients:
## (Intercept)        Sugar      Peanut1  
##       42.93        10.14        15.34

summary(model)

## 
## Call:
## lm(formula = Win ~ Sugar + Peanut, data = candy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.034 -10.086  -0.079   9.469  33.170 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.934      2.880  14.908  < 2e-16 ***
## Sugar         10.145      5.151   1.969 0.052274 .  
## Peanut1       15.343      3.904   3.930 0.000176 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.3 on 82 degrees of freedom
## Multiple R-squared:  0.2027, Adjusted R-squared:  0.1833 
## F-statistic: 10.42 on 2 and 82 DF,  p-value: 9.254e-05

aug_model = augment(model)
data_plot=ggplot(data = aug_model, aes(x= Sugar, y = Win, color = Peanut ))+
                   geom_point()+
                   geom_line(aes(x=Sugar, y=.fitted))+
                   theme_bw()+
                   ggtitle("Win Percent vs Sugar Percent and Peanuts/Almonds")
data_plot

The intercept for the equation without peanuts/almonds is the expected value of win percentage. The second intercept tells us that we can expect 15.34 more win percent for candy with peanuts/almonds. The slope is the same in that it represents the increase of win percentage for every unit increase in sugar. Whether the candy has peanuts/almonds does have an effect on the win percentage in the candy. We can see in the plot above that candies without peanuts/almonds (red) have less win percentage that candies that do (blue).

The two equations are as follows: \[ y = 10.145x+42.934 \] for Win percentage without peanuts or almonds and \[ y = 10.145+58.274.\]

for Win percentage with peanuts or almonds.

b) Fit a regression model for winpercent as a function of sugarpercent, the variable of your choice, and an interaction of the two.

int_model = lm(Win ~ Sugar + Peanut + Sugar:Peanut, data = candy)
int_model

## 
## Call:
## lm(formula = Win ~ Sugar + Peanut + Sugar:Peanut, data = candy)
## 
## Coefficients:
##   (Intercept)          Sugar        Peanut1  Sugar:Peanut1  
##        43.399          9.151         10.477          9.231

summary(int_model)

## 
## Call:
## lm(formula = Win ~ Sugar + Peanut + Sugar:Peanut, data = candy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.212  -9.951   0.054   9.466  33.248 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     43.399      3.012  14.410   <2e-16 ***
## Sugar            9.151      5.476   1.671   0.0985 .  
## Peanut1         10.477      9.632   1.088   0.2799    
## Sugar:Peanut1    9.231     16.692   0.553   0.5818    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.35 on 81 degrees of freedom
## Multiple R-squared:  0.2057, Adjusted R-squared:  0.1763 
## F-statistic: 6.992 on 3 and 81 DF,  p-value: 0.0003055

int_pea = coef(int_model)[1]
int_sug_pea=coef(int_model)[1] + coef(int_model)[3]

slope_pea = coef(int_model)[2]
slope_sug_pea = coef(int_model)[2] + coef(int_model)[4]

ggplot(int_model, aes(x = Sugar, y=Win, color = Peanut))+
  geom_point()+
  geom_abline(intercept = int_pea, slope = slope_pea, col = 2 )+
  geom_abline(intercept = int_sug_pea, slope = slope_sug_pea, col = 4)+
  theme_bw()+
  ggtitle("Interaction Win vs Sugar and Peanuts/Almonds")

slope_pea

##    Sugar 
## 9.151143

int_pea

## (Intercept) 
##    43.39861

slope_sug_pea

##    Sugar 
## 18.38191

int_sug_pea

## (Intercept) 
##    53.87594

The interpretation of this model is similar to the one before. The two equations represent the relationship between win percentage vs sugar and peanut/almonds but it accounts for the relationship sugar and peanuts/almonds have with each other with a new multiplied term. The intercepts are the expected mean value of the win percentage for both peanut/almonds and without. The slopes represent the amount of win percentage increase for every unit of sugar percentage increase. These coefficients account for the interaction between sugar and peanuts/almonds. The two equations would be: \[y = 9.15114 +43.399\] \[y = 18.38191 + 53.876\]

D) Model Selection

a) First, fit a fully saturated model with all the variables. Which variables are significant at a 0.05 significance level? What is the reduced model?

library(leaps)
mod_org = lm(Win ~ candy$chocolate + candy$fruity + candy$caramel + Peanut+ candy$nougat+ candy$crispedricewafer+ candy$hard+candy$bar+candy$pluribus + Sugar+candy$pricepercent)

#summary(mod_org)

mod_red = lm(Win ~ candy$fruity+ candy$chocolate+Peanut)

mod_sel = lm(candy$winpercent ~ candy$chocolate + candy$fruity + Peanut+ candy$crispedricewafer+ Sugar)

summary(mod_sel)

## 
## Call:
## lm(formula = candy$winpercent ~ candy$chocolate + candy$fruity + 
##     Peanut + candy$crispedricewafer + Sugar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.2424  -7.0503   0.5657   7.3837  25.6678 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              32.605      3.556   9.170 4.54e-14 ***
## candy$chocolate          19.670      3.618   5.436 5.89e-07 ***
## candy$fruity              7.702      3.538   2.177  0.03246 *  
## Peanut1                   9.851      3.482   2.829  0.00592 ** 
## candy$crispedricewafer    8.654      4.537   1.907  0.06010 .  
## Sugar                     7.044      4.144   1.700  0.09308 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.62 on 79 degrees of freedom
## Multiple R-squared:  0.5099, Adjusted R-squared:  0.4789 
## F-statistic: 16.44 on 5 and 79 DF,  p-value: 4.295e-11

The variables that are signifigant at the 0.05 level are fruity, chocolate, and peanut.

b) Now, perform either forward or backward selection. What variables would you put in the model?

After performing a backward selection, the variables I would put in the model are chocolate, fruity, peanuts/almond, cripsed rice wafer, and sugar percentage.

c) Compare and contrast the reduced model from (a) and the final model from part (b).

#Calculating MSE

mean(mod_red$residuals^2)

## [1] 113.9847

mean(mod_sel$residuals^2)

## [1] 104.8556

The difference is that the selection model has an additional sugar percentage and crisped rice wafer. Since the MSE of the reduced model is greater, I would prefer to use the selection model using the backwards method.

E) Inference and Prediction

a) Provide 95% confidence intervals for the coefficient estimates of your favorite model (after comparing your models in the previous section). You can use the appropriate R function to do this.

#Calculating Confidence Intervals

confint(mod_sel, level =.95)

##                             2.5 %   97.5 %
## (Intercept)            25.5272415 39.68204
## candy$chocolate        12.4676293 26.87236
## candy$fruity            0.6603272 14.74279
## Peanut1                 2.9192271 16.78228
## candy$crispedricewafer -0.3766684 17.68536
## Sugar                  -1.2039901 15.29226

b) Finally, Look at the list of candies included in this dataset. Think of your favorite candy that is not in the dataset (like raisinettes!). What is it? Imagine that you want to estimate the win percentage for this new candy.

Using the table from the article, Oreos would score a win percentage of 19.7. (+19.9 for chocolate and -.2 for multipiece). Because we are looking at a specific and individual instance, a prediction interval would be more appropriate.

#Didnt quite get to finish this part and make my data frame fit.

oreo = data.frame(chocolate = 1, fruity=0, peanutyalmondy=0, crispedricewafer = 0, sugarpercentage=.5)

predict(mod_sel, newdata = oreo, interval = "prediction")

## Warning: 'newdata' had 1 row but variables found have 85 rows

##         fit      lwr      upr
## 1  66.08529 43.40406 88.76651
## 2  56.52929 34.85789 78.20070
## 3  32.68212 10.40030 54.96394
## 4  32.68212 10.40030 54.96394
## 5  46.68818 24.96352 68.41285
## 6  65.40091 43.46712 87.33470
## 7  66.38005 44.44942 88.31067
## 8  44.66021 21.92973 67.39069
## 9  38.98663 16.54121 61.43204
## 10 44.56085 23.11147 66.01024
## 11 56.52929 34.85789 78.20070
## 12 45.46250 23.93117 66.99384
## 13 40.63023 18.93314 62.32731
## 14 45.46250 23.93117 66.99384
## 15 45.46250 23.93117 66.99384
## 16 41.20080 19.60025 62.80135
## 17 45.46250 23.93117 66.99384
## 18 46.68818 24.96352 68.41285
## 19 43.58172 22.16266 65.00078
## 20 35.88016 13.82182 57.93850
## 21 43.58172 22.16266 65.00078
## 22 43.58172 22.16266 65.00078
## 23 53.16924 31.32811 75.01037
## 24 63.95796 41.31812 86.59779
## 25 55.30361 33.65282 76.95441
## 26 55.30361 33.65282 76.95441
## 27 40.96130 19.32269 62.59991
## 28 53.66233 31.89049 75.43417
## 29 63.13379 40.43642 85.83116
## 30 41.85591 20.34113 63.37068
## 31 40.63023 18.93314 62.32731
## 32 42.18698 20.70527 63.66869
## 33 67.93680 45.88812 89.98548
## 34 58.08605 36.25207 79.92003
## 35 46.44868 24.76913 68.12824
## 36 54.40196 32.70565 76.09828
## 37 56.52929 34.85789 78.20070
## 38 57.43094 35.68384 79.17804
## 39 59.07223 37.05765 81.08680
## 40 54.47945 32.78906 76.16984
## 41 64.33020 42.32448 86.33593
## 42 46.27962 24.62978 67.92947
## 43 66.38005 44.44942 88.31067
## 44 63.13379 40.43642 85.83116
## 45 41.69389 20.16042 63.22737
## 46 41.85591 20.34113 63.37068
## 47 45.73092 23.03320 68.42863
## 48 66.30256 44.37387 88.23125
## 49 33.25974 11.06482 55.45466
## 50 44.56085 23.11147 66.01024
## 51 44.39884 22.95870 65.83898
## 52 62.36489 40.04372 84.68606
## 53 67.19717 45.22334 89.17099
## 54 64.98531 43.03207 86.93855
## 55 69.08499 46.85366 91.31633
## 56 45.46250 23.93117 66.99384
## 57 58.33259 36.45904 80.20615
## 58 37.76095 15.53845 59.98344
## 59 46.44868 24.76913 68.12824
## 60 53.82434 32.07198 75.57671
## 61 46.93473 25.15995 68.70951
## 62 46.93473 25.15995 68.70951
## 63 54.15542 32.43774 75.87310
## 64 42.18698 20.70527 63.66869
## 65 65.97149 44.04683 87.89615
## 66 75.03439 51.66907 98.39971
## 67 40.79224 19.12462 62.45987
## 68 40.79224 19.12462 62.45987
## 69 41.36986 19.79403 62.94570
## 70 44.31431 22.87833 65.75029
## 71 39.40223 16.86097 61.94349
## 72 35.54909 13.49699 57.60118
## 73 41.44735 19.88224 63.01245
## 74 44.56085 23.11147 66.01024
## 75 64.23085 41.69815 86.76355
## 76 54.47945 32.78906 76.16984
## 77 53.50031 31.70737 75.29326
## 78 55.55016 33.90286 77.19746
## 79 42.51101 21.05492 63.96710
## 80 64.77508 42.15175 87.39840
## 81 41.85591 20.34113 63.37068
## 82 40.96130 19.32269 62.59991
## 83 42.51101 21.05492 63.96710
## 84 33.91485 11.79386 56.03584
## 85 67.07147 44.27854 89.86439