OnLine Assessment: A Halloween Candy Regression: R Notebook

Welcome to the Halloween Candy Multiple Linear Regression Online Assessment! This assessment is due prior to class on Monday, 19 October. You should include your code and upload your Rmd and html notebook to Sakai uner OnLine Assessment. You do not need to hide your code or output.

Before starting the assessment, please view the short video produced by 538 that is posted on Sakai and here: https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/. Please also check out the metadata here: https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/README.md

First you will need to install a new package named readr. This will allow us to read in csv files from github.

library(readr)
library(ggplot2)

library(readr)
candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   competitorname = col_character(),
##   chocolate = col_double(),
##   fruity = col_double(),
##   caramel = col_double(),
##   peanutyalmondy = col_double(),
##   nougat = col_double(),
##   crispedricewafer = col_double(),
##   hard = col_double(),
##   bar = col_double(),
##   pluribus = col_double(),
##   sugarpercent = col_double(),
##   pricepercent = col_double(),
##   winpercent = col_double()
## )

str(candy)

## tibble [85 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ competitorname  : chr [1:85] "100 Grand" "3 Musketeers" "One dime" "One quarter" ...
##  $ chocolate       : num [1:85] 1 1 0 0 0 1 1 0 0 0 ...
##  $ fruity          : num [1:85] 0 0 0 0 1 0 0 0 0 1 ...
##  $ caramel         : num [1:85] 1 0 0 0 0 0 1 0 0 1 ...
##  $ peanutyalmondy  : num [1:85] 0 0 0 0 0 1 1 1 0 0 ...
##  $ nougat          : num [1:85] 0 1 0 0 0 0 1 0 0 0 ...
##  $ crispedricewafer: num [1:85] 1 0 0 0 0 0 0 0 0 0 ...
##  $ hard            : num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##  $ bar             : num [1:85] 1 1 0 0 0 1 1 0 0 0 ...
##  $ pluribus        : num [1:85] 0 0 0 0 0 0 0 1 1 0 ...
##  $ sugarpercent    : num [1:85] 0.732 0.604 0.011 0.011 0.906 ...
##  $ pricepercent    : num [1:85] 0.86 0.511 0.116 0.511 0.511 ...
##  $ winpercent      : num [1:85] 67 67.6 32.3 46.1 52.3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   competitorname = col_character(),
##   ..   chocolate = col_double(),
##   ..   fruity = col_double(),
##   ..   caramel = col_double(),
##   ..   peanutyalmondy = col_double(),
##   ..   nougat = col_double(),
##   ..   crispedricewafer = col_double(),
##   ..   hard = col_double(),
##   ..   bar = col_double(),
##   ..   pluribus = col_double(),
##   ..   sugarpercent = col_double(),
##   ..   pricepercent = col_double(),
##   ..   winpercent = col_double()
##   .. )

head(candy)

Question 1 (5 points)

You are running an analysis of candy preferences to determine which candy should be purchased for kids at Halloween. Select five explanatory variables to predict winpercent (The overall win percentage according to 269,000 matchups.). Interpret the coefficient of two of the explanatory variables. In your discussion, be sure to discuss their statistical significance.

library(car)

## Loading required package: carData

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.0.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.0.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(GGally)

## Warning: package 'GGally' was built under R version 4.0.3

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

The five explanatory variables I’m choosing to predict winpercent are: chocolate, fruity, caramel, nougat and bar. The null hypothesis are none of the explanatory variables affect predict the winpercent and/or the slopes are zero. Clearly stated: Ho: All slope coefficients are equal to zero Ha: At least one coefficient is not equal to zero

candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   competitorname = col_character(),
##   chocolate = col_double(),
##   fruity = col_double(),
##   caramel = col_double(),
##   peanutyalmondy = col_double(),
##   nougat = col_double(),
##   crispedricewafer = col_double(),
##   hard = col_double(),
##   bar = col_double(),
##   pluribus = col_double(),
##   sugarpercent = col_double(),
##   pricepercent = col_double(),
##   winpercent = col_double()
## )

candy.df<-candy

model.1<-lm(winpercent~ chocolate+fruity+caramel + nougat + bar, data=candy.df, na.action = na.exclude)
summary(model.1)

## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + nougat + 
##     bar, data = candy.df, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.678  -6.605  -1.180   8.547  25.781 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.3087     3.4493  10.526  < 2e-16 ***
## chocolate    22.0909     4.0409   5.467 5.19e-07 ***
## fruity        7.1472     3.8253   1.868   0.0654 .  
## caramel       3.1383     3.6885   0.851   0.3974    
## nougat        0.3939     5.4042   0.073   0.9421    
## bar           2.6210     4.1078   0.638   0.5253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.4 on 79 degrees of freedom
## Multiple R-squared:  0.4357, Adjusted R-squared:    0.4 
## F-statistic:  12.2 on 5 and 79 DF,  p-value: 9.003e-09

Interpreting chocolate and fruity: Candy with chocolate is associated with a 22.09 point increase in the predicted overall win percentage according to 269,000 matchups, holding other variables in the model constant. This slope parameter is significantly different than zero at a significance level of 5.19*10^-7. Fruity candy is associated with a 7.14 point increase in the predicted overall win percentage according to 269,000 matchups, holding other variables in the model constant (P=.06).

Question 2 (5 points)

Report the adjusted R2 and interpret the F-statistic. The R2 is 0.43 and interpreting the F-statistic of (12.2, df=79, p=9*10^-9), it strongly suggests there is at least one slope.

Question 3 (5 points)

Discuss the assumptions of multiple linear regression. Develop a residual vs fitted plot to inform your discussion. You may use plot() or construct your plot.

candy.df$model.1.res = resid(model.1, na.action = na.exclude)

candy.df$model.1.fit = fitted(model.1, na.action = na.exclude)

ggplot(candy.df, aes(x=model.1.fit, y=model.1.res)) +
  geom_point() +
  #geom_text(label=candy.df$competitorname)+
geom_hline(yintercept=0, col="red", lwd=2)+
ggtitle("Model 1 Residual vs Fitted Plot") +
  xlab("Fitted Win percent") + ylab("Residuals")

Five assumptions of multiple linear regression: Linear relationship: There needs to be a linear relationship between the independent and dependent variables. In this case, there doesn’t appear to be a linear relationship given my independent variables are all binary. Multivariate normality: this requires all variables to be normal which can be checked with a Q-Q plot or a goodness of fit test. No or little multicollinearity: This assures no independent variables are independent of one and other which the variables used are. I will test this in the next question. No auto-correlation: Auto-correlation is when the residuals are not independent from the others. Each residual are independent from one and other. Homoscedasticity: Lastly, a scatter plot can check whether the error terms along the regression line are equal. This data passes that test.

Question 4 (5 points)

Test for heteroskedasticity with the Breusch-Pagan Test and multicollinearity with the Variance Inflation Factor Test (VIF). Discuss results of these two tests in the context of the model in Question 1. Do you need to make any adjustments to your model? You do not need to make adjustments, but rather discuss how you might address potential violations of the assumptions.

lmtest::bptest(model.1)

## 
##  studentized Breusch-Pagan test
## 
## data:  model.1
## BP = 13.788, df = 5, p-value = 0.01701

car::vif(model.1)

## chocolate    fruity   caramel    nougat       bar 
##  2.626485  2.366911  1.224752  1.444210  2.053955

Multicollinearity can be present when the VIF is greater than 10. In my case, all 5 independent variables are below 2.6 which passes the test. Ho: Error variances are equal

In the context of question 1, I hypothesized that all slopes are equal to zero. In that case, I reject my null hypothesis. The only changes I’d need to address would be in relation to the vertical residuals in my fitted plot which do not pass the linearity assumption.

For fun:

c<-1-.43#Tolerance
x<-1/c#VIF
c

## [1] 0.57

## [1] 1.754386