library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 1.0.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(dplyr)
library(broom)
candy<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv",
header=TRUE)
The scatterplot shows a positive, linear, weak relationship with possible outliers.
Although the graph looks scattered, there is a stronger concentration of dots along the bottom left and top right. However, the strength is weak since the points are very diluted throughout the graph.
slr = lm(Sugar ~ Win)
summary(slr)
##
## Call:
## lm(formula = Sugar ~ Win)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58358 -0.21530 -0.01378 0.20718 0.56069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.257063 0.107593 2.389 0.0192 *
## Win 0.004404 0.002053 2.145 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2769 on 83 degrees of freedom
## Multiple R-squared: 0.05251, Adjusted R-squared: 0.04109
## F-statistic: 4.6 on 1 and 83 DF, p-value: 0.0349
plot(Sugar ~ Win)
abline(coef(slr[1]), coef(slr)[2])
The slope is the estimated rate of change between sugar and win percentage. So, for every one unit of win percentage, we can expect 0.0044 increase in sugar percentage. The intercept in the expected mean value of sugar percentage or sugar percentage when win is zero. In this case, win percentage cannot really reach zero so it is not signifigant in this case.
summary(slr)
##
## Call:
## lm(formula = Sugar ~ Win)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58358 -0.21530 -0.01378 0.20718 0.56069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.257063 0.107593 2.389 0.0192 *
## Win 0.004404 0.002053 2.145 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2769 on 83 degrees of freedom
## Multiple R-squared: 0.05251, Adjusted R-squared: 0.04109
## F-statistic: 4.6 on 1 and 83 DF, p-value: 0.0349
plot(Sugar ~ Win)
abline(coef(slr[1]), coef(slr)[2])
plot(slr)
Looking at the Residual vs Fitted plot and the Scale-Location plot, the points do not form a clear pattern and are roughly spread out below and above the line. Therefore, we can assume linearity and homoscedasticity. Looking at the qq plot, we can see that the points roughly follow the 45 degree line with expection of three flagged values. There is slight departure at the extremes but we can assume that the data is roughly normal.
Peanut = factor(candy$peanutyalmondy)
model = lm(Win ~ Sugar + Peanut, data=candy)
model
##
## Call:
## lm(formula = Win ~ Sugar + Peanut, data = candy)
##
## Coefficients:
## (Intercept) Sugar Peanut1
## 42.93 10.14 15.34
summary(model)
##
## Call:
## lm(formula = Win ~ Sugar + Peanut, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.034 -10.086 -0.079 9.469 33.170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.934 2.880 14.908 < 2e-16 ***
## Sugar 10.145 5.151 1.969 0.052274 .
## Peanut1 15.343 3.904 3.930 0.000176 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.3 on 82 degrees of freedom
## Multiple R-squared: 0.2027, Adjusted R-squared: 0.1833
## F-statistic: 10.42 on 2 and 82 DF, p-value: 9.254e-05
aug_model = augment(model)
data_plot=ggplot(data = aug_model, aes(x= Sugar, y = Win, color = Peanut ))+
geom_point()+
geom_line(aes(x=Sugar, y=.fitted))+
theme_bw()+
ggtitle("Win Percent vs Sugar Percent and Peanuts/Almonds")
data_plot
The intercept for the equation without peanuts/almonds is the expected value of win percentage. The second intercept tells us that we can expect 15.34 more win percent for candy with peanuts/almonds. The slope is the same in that it represents the increase of win percentage for every unit increase in sugar. Whether the candy has peanuts/almonds does have an effect on the win percentage in the candy. We can see in the plot above that candies without peanuts/almonds (red) have less win percentage that candies that do (blue).
The two equations are as follows: \[ y = 10.145x+42.934 \] for Win percentage without peanuts or almonds and \[ y = 10.145+58.274.\]
for Win percentage with peanuts or almonds.
int_model = lm(Win ~ Sugar + Peanut + Sugar:Peanut, data = candy)
int_model
##
## Call:
## lm(formula = Win ~ Sugar + Peanut + Sugar:Peanut, data = candy)
##
## Coefficients:
## (Intercept) Sugar Peanut1 Sugar:Peanut1
## 43.399 9.151 10.477 9.231
summary(int_model)
##
## Call:
## lm(formula = Win ~ Sugar + Peanut + Sugar:Peanut, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.212 -9.951 0.054 9.466 33.248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.399 3.012 14.410 <2e-16 ***
## Sugar 9.151 5.476 1.671 0.0985 .
## Peanut1 10.477 9.632 1.088 0.2799
## Sugar:Peanut1 9.231 16.692 0.553 0.5818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.35 on 81 degrees of freedom
## Multiple R-squared: 0.2057, Adjusted R-squared: 0.1763
## F-statistic: 6.992 on 3 and 81 DF, p-value: 0.0003055
int_pea = coef(int_model)[1]
int_sug_pea=coef(int_model)[1] + coef(int_model)[3]
slope_pea = coef(int_model)[2]
slope_sug_pea = coef(int_model)[2] + coef(int_model)[4]
ggplot(int_model, aes(x = Sugar, y=Win, color = Peanut))+
geom_point()+
geom_abline(intercept = int_pea, slope = slope_pea, col = 2 )+
geom_abline(intercept = int_sug_pea, slope = slope_sug_pea, col = 4)+
theme_bw()+
ggtitle("Interaction Win vs Sugar and Peanuts/Almonds")
slope_pea
## Sugar
## 9.151143
int_pea
## (Intercept)
## 43.39861
slope_sug_pea
## Sugar
## 18.38191
int_sug_pea
## (Intercept)
## 53.87594
The interpretation of this model is similar to the one before. The two equations represent the relationship between win percentage vs sugar and peanut/almonds but it accounts for the relationship sugar and peanuts/almonds have with each other with a new multiplied term. The intercepts are the expected mean value of the win percentage for both peanut/almonds and without. The slopes represent the amount of win percentage increase for every unit of sugar percentage increase. These coefficients account for the interaction between sugar and peanuts/almonds. The two equations would be: \[y = 9.15114 +43.399\] \[y = 18.38191 + 53.876\]
library(leaps)
mod_org = lm(Win ~ candy$chocolate + candy$fruity + candy$caramel + Peanut+ candy$nougat+ candy$crispedricewafer+ candy$hard+candy$bar+candy$pluribus + Sugar+candy$pricepercent)
#summary(mod_org)
mod_red = lm(Win ~ candy$fruity+ candy$chocolate+Peanut)
mod_sel = lm(candy$winpercent ~ candy$chocolate + candy$fruity + Peanut+ candy$crispedricewafer+ Sugar)
summary(mod_sel)
##
## Call:
## lm(formula = candy$winpercent ~ candy$chocolate + candy$fruity +
## Peanut + candy$crispedricewafer + Sugar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2424 -7.0503 0.5657 7.3837 25.6678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.605 3.556 9.170 4.54e-14 ***
## candy$chocolate 19.670 3.618 5.436 5.89e-07 ***
## candy$fruity 7.702 3.538 2.177 0.03246 *
## Peanut1 9.851 3.482 2.829 0.00592 **
## candy$crispedricewafer 8.654 4.537 1.907 0.06010 .
## Sugar 7.044 4.144 1.700 0.09308 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.62 on 79 degrees of freedom
## Multiple R-squared: 0.5099, Adjusted R-squared: 0.4789
## F-statistic: 16.44 on 5 and 79 DF, p-value: 4.295e-11
The variables that are signifigant at the 0.05 level are fruity, chocolate, and peanut.
After performing a backward selection, the variables I would put in the model are chocolate, fruity, peanuts/almond, cripsed rice wafer, and sugar percentage.
#Calculating MSE
mean(mod_red$residuals^2)
## [1] 113.9847
mean(mod_sel$residuals^2)
## [1] 104.8556
The difference is that the selection model has an additional sugar percentage and crisped rice wafer. Since the MSE of the reduced model is greater, I would prefer to use the selection model using the backwards method.
#Calculating Confidence Intervals
confint(mod_sel, level =.95)
## 2.5 % 97.5 %
## (Intercept) 25.5272415 39.68204
## candy$chocolate 12.4676293 26.87236
## candy$fruity 0.6603272 14.74279
## Peanut1 2.9192271 16.78228
## candy$crispedricewafer -0.3766684 17.68536
## Sugar -1.2039901 15.29226
Using the table from the article, Oreos would score a win percentage of 19.7. (+19.9 for chocolate and -.2 for multipiece). Because we are looking at a specific and individual instance, a prediction interval would be more appropriate.
#Didnt quite get to finish this part and make my data frame fit.
oreo = data.frame(chocolate = 1, fruity=0, peanutyalmondy=0, crispedricewafer = 0, sugarpercentage=.5)
predict(mod_sel, newdata = oreo, interval = "prediction")
## Warning: 'newdata' had 1 row but variables found have 85 rows
## fit lwr upr
## 1 66.08529 43.40406 88.76651
## 2 56.52929 34.85789 78.20070
## 3 32.68212 10.40030 54.96394
## 4 32.68212 10.40030 54.96394
## 5 46.68818 24.96352 68.41285
## 6 65.40091 43.46712 87.33470
## 7 66.38005 44.44942 88.31067
## 8 44.66021 21.92973 67.39069
## 9 38.98663 16.54121 61.43204
## 10 44.56085 23.11147 66.01024
## 11 56.52929 34.85789 78.20070
## 12 45.46250 23.93117 66.99384
## 13 40.63023 18.93314 62.32731
## 14 45.46250 23.93117 66.99384
## 15 45.46250 23.93117 66.99384
## 16 41.20080 19.60025 62.80135
## 17 45.46250 23.93117 66.99384
## 18 46.68818 24.96352 68.41285
## 19 43.58172 22.16266 65.00078
## 20 35.88016 13.82182 57.93850
## 21 43.58172 22.16266 65.00078
## 22 43.58172 22.16266 65.00078
## 23 53.16924 31.32811 75.01037
## 24 63.95796 41.31812 86.59779
## 25 55.30361 33.65282 76.95441
## 26 55.30361 33.65282 76.95441
## 27 40.96130 19.32269 62.59991
## 28 53.66233 31.89049 75.43417
## 29 63.13379 40.43642 85.83116
## 30 41.85591 20.34113 63.37068
## 31 40.63023 18.93314 62.32731
## 32 42.18698 20.70527 63.66869
## 33 67.93680 45.88812 89.98548
## 34 58.08605 36.25207 79.92003
## 35 46.44868 24.76913 68.12824
## 36 54.40196 32.70565 76.09828
## 37 56.52929 34.85789 78.20070
## 38 57.43094 35.68384 79.17804
## 39 59.07223 37.05765 81.08680
## 40 54.47945 32.78906 76.16984
## 41 64.33020 42.32448 86.33593
## 42 46.27962 24.62978 67.92947
## 43 66.38005 44.44942 88.31067
## 44 63.13379 40.43642 85.83116
## 45 41.69389 20.16042 63.22737
## 46 41.85591 20.34113 63.37068
## 47 45.73092 23.03320 68.42863
## 48 66.30256 44.37387 88.23125
## 49 33.25974 11.06482 55.45466
## 50 44.56085 23.11147 66.01024
## 51 44.39884 22.95870 65.83898
## 52 62.36489 40.04372 84.68606
## 53 67.19717 45.22334 89.17099
## 54 64.98531 43.03207 86.93855
## 55 69.08499 46.85366 91.31633
## 56 45.46250 23.93117 66.99384
## 57 58.33259 36.45904 80.20615
## 58 37.76095 15.53845 59.98344
## 59 46.44868 24.76913 68.12824
## 60 53.82434 32.07198 75.57671
## 61 46.93473 25.15995 68.70951
## 62 46.93473 25.15995 68.70951
## 63 54.15542 32.43774 75.87310
## 64 42.18698 20.70527 63.66869
## 65 65.97149 44.04683 87.89615
## 66 75.03439 51.66907 98.39971
## 67 40.79224 19.12462 62.45987
## 68 40.79224 19.12462 62.45987
## 69 41.36986 19.79403 62.94570
## 70 44.31431 22.87833 65.75029
## 71 39.40223 16.86097 61.94349
## 72 35.54909 13.49699 57.60118
## 73 41.44735 19.88224 63.01245
## 74 44.56085 23.11147 66.01024
## 75 64.23085 41.69815 86.76355
## 76 54.47945 32.78906 76.16984
## 77 53.50031 31.70737 75.29326
## 78 55.55016 33.90286 77.19746
## 79 42.51101 21.05492 63.96710
## 80 64.77508 42.15175 87.39840
## 81 41.85591 20.34113 63.37068
## 82 40.96130 19.32269 62.59991
## 83 42.51101 21.05492 63.96710
## 84 33.91485 11.79386 56.03584
## 85 67.07147 44.27854 89.86439