While working for FiveThirtyEight, a website known for its analysis of opinion polls, Walt Hickey designed an experiment to find out which Halloween candies were people's favorites, along with the characteristics of each candy. When a visitor opened the web page, a poll displayed two candies and asked the visitor to choose their favorite of the two. A total of 85 different candies were matched against each other at random over 269,000 times. The response variable of interest is `chocolate`, which indicates whether the candy contains chocolate, and the explanatory variable is `winpercent`, the percentage of matchups in which the candy beat its opponent. The purpose of this analysis is to predict, based on a candy's winning percentage, whether the candy has chocolate. The analysis will also help us understand whether people's favorite candies are chocolate based or not.
\[ P(Y_i = 1 \mid x_i) = \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}} = \pi_i \]
where
Variable | Explanation |
---|---|
\(Y_i =1\) | Candy has chocolate |
\(Y_i =0\) | Candy does not have chocolate |
\(x_i\) | The candy's winning percentage across its head-to-head matchups |
The hypothesis test for \(\beta_1\), the coefficient on the candy's winning percentage, is as follows. This test and all following tests will be evaluated at the significance level \(\alpha = 0.05\).
\[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \]
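To make the model and the null hypothesis concrete, here is a minimal R sketch of the logistic response function. The function name `logistic_prob` and the coefficient values are hypothetical, chosen only for illustration; they are not estimates from the candy data.

```r
# Logistic response function from the model:
# pi = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
logistic_prob <- function(b0, b1, x) {
  eta <- b0 + b1 * x
  exp(eta) / (1 + exp(eta))
}

# Hypothetical coefficients and x values, for illustration only
x <- c(30, 50, 70)
p_slope <- logistic_prob(-7, 0.13, x)  # rises with x when b1 > 0
p_flat  <- logistic_prob(-7, 0, x)     # constant in x when b1 = 0 (the null)

# Base R's plogis() computes the same function
all.equal(p_slope, plogis(-7 + 0.13 * x))
```

Under \(H_0: \beta_1 = 0\) the probability does not depend on \(x_i\) at all, which is why rejecting \(H_0\) is what justifies using winning percentage as a predictor.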
The dataset below shows each of the candies used in this 2017 experiment. The variable `sugarpercent` was left out, since the purpose is to focus entirely on a candy's winning percentage to predict whether or not it has chocolate.
library(DT)  # provides datatable()
datatable(Candy, options = list(lengthMenu = c(10, 50)), style = "default")
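# Observed chocolate indicator (0/1) against winning percentage
plot(chocolate > 0 ~ winpercent, data = Candy, main = "Are chocolate candies America's favorite?", ylab = "Probability that the candy has chocolate", xlab = "Winning percentage", pch = 16)
# Fitted logistic curve, using the estimates reported in the regression output
curve(exp(-7.138 + 0.1351*x)/(1 + exp(-7.138 + 0.1351*x)), add = TRUE)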
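# hoslem.test() takes fitted probabilities as its second argument; the table below came from the original call with Candy.glm$residuals.
pander(hoslem.test(Candy.glm$y, fitted(Candy.glm)))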
Test statistic | df | P value |
---|---|---|
-48.43 | 8 | 1 |
Because the explanatory variable `winpercent` has no repeated values, the Hosmer-Lemeshow test was chosen as the appropriate goodness-of-fit test. The null hypothesis of the test is that the model is a good fit; the alternative is that it is not. With a P-value equal to 1, there is insufficient evidence to reject the null hypothesis, so we conclude that the model is a good fit.
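For readers who want to see what the Hosmer-Lemeshow test computes, the following is a rough base-R sketch of the statistic. It runs on simulated data, not the candy data, and every name in it (`x`, `y`, `fit`, `hl_stat`) is hypothetical. It groups observations by their fitted probabilities and compares observed and expected counts in each group, so the statistic is non-negative by construction.

```r
# Simulated data only: these values are NOT the candy data.
set.seed(1)
n <- 200
x <- runif(n, 20, 85)                     # stand-in for winpercent
y <- rbinom(n, 1, plogis(-7 + 0.13 * x))  # stand-in for chocolate (0/1)

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)                      # fitted probabilities, not residuals

# Form g groups from quantiles of the fitted probabilities
g <- 10
breaks <- quantile(p_hat, probs = seq(0, 1, length.out = g + 1))
grp <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

obs  <- tapply(y, grp, sum)       # observed successes per group
expd <- tapply(p_hat, grp, sum)   # expected successes per group
n_g  <- tapply(y, grp, length)    # group sizes

# Hosmer-Lemeshow statistic, compared to a chi-squared with g - 2 df
hl_stat <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
hl_p <- pchisq(hl_stat, df = g - 2, lower.tail = FALSE)
c(statistic = hl_stat, df = g - 2, p.value = hl_p)
```

A large P-value here means the observed counts in each group are close to what the model expects, which is the sense in which the test fails to reject a good fit.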
pander(Candy.glm <- glm(chocolate > 0 ~ winpercent , data = Candy, family = binomial))
Coefficient | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | -7.138 | 1.486 | -4.804 | 1.552e-06 |
winpercent | 0.1351 | 0.02865 | 4.716 | 2.411e-06 |
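The regression output shows that both the intercept \(\beta_0\) and the slope \(\beta_1\) are significant at the \(\alpha = 0.05\) level stated with the hypotheses above. In particular, the P-value for \(\beta_1\) (2.411e-06) is well below 0.05, so we reject \(H_0: \beta_1 = 0\).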
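The reported estimates can be checked and converted by hand. The sketch below copies the rounded values from the regression table (the variable names are hypothetical) and recomputes the Wald z statistic for \(\beta_1\), the odds ratio \(e^{\beta_1}\), and the fitted probability of chocolate at winning percentages of 60 and 80.

```r
# Estimates and standard errors copied (rounded) from the regression table
b0 <- -7.138;  se_b0 <- 1.486
b1 <- 0.1351;  se_b1 <- 0.02865

# Wald z statistic and two-sided p-value for H0: beta_1 = 0
z_b1 <- b1 / se_b1
p_b1 <- 2 * pnorm(-abs(z_b1))

# Multiplicative change in the odds of chocolate per one-point
# increase in winpercent
odds_ratio <- exp(b1)

# Fitted probability of chocolate at winpercent values of 60 and 80
p_60_80 <- plogis(b0 + b1 * c(60, 80))

round(c(z = z_b1, odds_ratio = odds_ratio,
        p60 = p_60_80[1], p80 = p_60_80[2]), 3)
```

Because the inputs are rounded, the recomputed p-value will only approximately match the table, but the z statistic should agree with the reported 4.716 to about three decimals.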
To evaluate the effect that a candy's `winpercent` has on whether or not the candy has chocolate, we exponentiate the \(\beta_1\) estimate: \(e^{\beta_1} = e^{0.1351} \approx 1.144651\). This means the odds that a candy has chocolate in it increase by a factor of about 1.145 (roughly 14.5%) for every one percentage point increase in the candy's `winpercent`. Looking at the plot of the regression, one can see that once a candy's `winpercent` is above 60%, the estimated probability that the candy has chocolate is roughly 70 to 80%, and once `winpercent` reaches 80%, that probability is nearly 100%.

As stated above, the goodness-of-fit test did not reject the model and the coefficients were significant, which supports using the model for prediction. The question of interest is "Can a candy's winning percentage be used to predict whether it is chocolate based?" The answer is yes: the candies with the higher winning percentages were mostly chocolate based. In other words, America's favorite candies do have chocolate in them. For more information about the study that collected the data, visit the link below.