Project 2: Simple Logistic Regression

Background

While working for FiveThirtyEight, a website that focuses on opinion polls, Walt Hickey designed an experiment to find out which Halloween candies people like best and which characteristics those candies share. Each visitor to the site was shown two candies and asked to choose a favorite. In total, 85 different candies were pitted against each other in random pairings over 269,000 times. The response variable of interest is chocolate, which indicates whether the candy contains chocolate, and the explanatory variable is winpercent, the percentage of matchups in which the candy beat its opponent. The purpose of this analysis is to predict, from a candy's winning percentage, whether the candy contains chocolate. The analysis will also help us understand whether people's favorite candies tend to be chocolate based.

Logistic Model

\[ P(Y_i = 1|x_i) = \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}} = \pi_i \] Where

Variable	Explanation
$Y_i =1$	Candy has chocolate
$Y_i =0$	Candy does not have chocolate
$x_i$	The candy's winning percentage across its head-to-head matchups
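
The same model can be written on the odds and log-odds scales, which is the form used when interpreting $e^{\beta_1}$. This is a standard algebraic rearrangement of the equation above, not an additional assumption:

\[ \frac{\pi_i}{1-\pi_i} = e^{\beta_0+\beta_1 x_i} \quad\Longleftrightarrow\quad \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i \]

Increasing $x_i$ by one unit therefore multiplies the odds that the candy has chocolate by $e^{\beta_1}$.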

Hypothesis

The hypothesis test for $\beta_1$, the coefficient on the candy's winning percentage, is as follows. This test and all following tests will be evaluated at the significance level $\alpha = 0.05$.

\[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \]

Dataset

The dataset below lists each of the candies used in this experiment, which was run in 2017. The variable sugarpercent was left out because the purpose is to focus entirely on a candy's winning percentage when predicting whether or not it contains chocolate.

datatable(Candy, options=list(lengthMenu = c(10,50)),  style = "default")

Plots

plot(chocolate > 0 ~ winpercent, data=Candy, main="Are chocolate candies America's favorite?", ylab='Probability that the candy has chocolate', xlab='Winning percentage', pch=16)

# Fitted logistic curve, using the estimates from the regression output in the Analysis section
curve(exp(-7.138 + 0.1351*x)/(1+exp(-7.138 + 0.1351*x)), add=TRUE)

Goodness of fit

pander(hoslem.test(Candy.glm$y, Candy.glm$residuals))

Hosmer and Lemeshow goodness of fit (GOF) test: `Candy.glm$y, Candy.glm$residuals`
Test statistic	df	P value
-48.43	8	1

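Because the explanatory variable winpercent has no repeated values, the Hosmer-Lemeshow test was chosen as the appropriate goodness-of-fit test. The null hypothesis of the test is that the model is a good fit, with the alternative being that the model is not a good fit. With a P-value equal to 1, there is insufficient evidence to reject the null hypothesis, which would suggest that the model fits adequately. However, the reported test statistic is negative, which is impossible for a chi-squared statistic. This happens because hoslem.test expects the fitted probabilities (Candy.glm$fitted.values) as its second argument, while the call above passes Candy.glm$residuals. The conclusion of good fit should therefore be confirmed by rerunning the test with the fitted values.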
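
The Hosmer-Lemeshow statistic compares observed and expected counts of successes within groups formed from the model's fitted probabilities, so it is computed from fitted values and cannot be negative. The following is a minimal base-R sketch on simulated data (not the candy data). The helper name `hl_test` and the simulation parameters are assumptions made for illustration only. With the ResourceSelection package, the corresponding call would pass the fitted probabilities as the second argument, for example `hoslem.test(Candy.glm$y, Candy.glm$fitted.values)`.

```r
# Simulated data for illustration only; this is NOT the candy data set.
set.seed(1)
x <- runif(85, min = 20, max = 85)
y <- rbinom(85, size = 1, prob = plogis(-7 + 0.14 * x))
fit <- glm(y ~ x, family = binomial)

# Hypothetical helper: Hosmer-Lemeshow statistic computed from observed
# outcomes and fitted probabilities, using g groups of roughly equal size.
hl_test <- function(obs, fitted, g = 10) {
  cuts <- quantile(fitted, probs = seq(0, 1, length.out = g + 1))
  grp  <- cut(fitted, breaks = cuts, include.lowest = TRUE)
  n    <- tapply(obs, grp, length)    # group sizes
  o1   <- tapply(obs, grp, sum)       # observed successes per group
  e1   <- tapply(fitted, grp, sum)    # expected successes per group
  stat <- sum((o1 - e1)^2 / (e1 * (1 - e1 / n)))
  c(statistic = stat, df = g - 2,
    p.value = pchisq(stat, df = g - 2, lower.tail = FALSE))
}

res <- hl_test(fit$y, fit$fitted.values)
print(res)
```

Each term in the sum is a squared difference divided by a positive quantity, so the statistic is always non-negative. A negative value is a sign that something other than fitted probabilities was supplied.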

Analysis

pander(Candy.glm <- glm(chocolate > 0 ~ winpercent , data = Candy, family = binomial))

Fitting generalized (binomial/logit) linear model: chocolate > 0 ~ winpercent
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-7.138	1.486	-4.804	1.552e-06
winpercent	0.1351	0.02865	4.716	2.411e-06

The regression output shows that both the intercept $\beta_0$ and the slope $\beta_1$ are significant at the $\alpha = 0.05$ level stated above. In particular, the p-value for $\beta_1$ (2.411e-06) leads us to reject $H_0: \beta_1 = 0$.
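
As a rough check on the table, the z value and p-value for winpercent can be reproduced from the reported estimate and standard error. This sketch assumes the standard Wald test that `glm` reports for binomial models, and it uses the rounded values from the output, so the result will agree only up to rounding.

```r
# Estimate and standard error for winpercent, copied from the output above
b1    <- 0.1351
se_b1 <- 0.02865

z <- b1 / se_b1            # Wald z statistic
p <- 2 * pnorm(-abs(z))    # two-sided p-value from the standard normal

print(c(z = z, p.value = p))
print(p < 0.05)            # compare with the significance level alpha = 0.05
```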

Interpretation

To evaluate the effect that a candy's winpercent has on whether or not the candy has chocolate, the estimate of $\beta_1$ is exponentiated: $e^{\beta_1} = e^{0.1351} = 1.144651$. This shows that the odds that the candy has chocolate increase by a factor of 1.144651, or about 14.5%, for every one percentage point increase in the candy's winpercent. Looking at the plot of the regression, once a candy's winpercent is above 60%, the estimated probability that the candy has chocolate exceeds roughly 70%, and once the winpercent reaches 80%, that probability is nearly 100%.

The goodness-of-fit test did not reject the model and both coefficients were significant, so the model appears reasonable to use for prediction. The question of interest is “Can a candy's winning percentage be used to predict whether it is chocolate based?” The answer is yes: candies with higher winning percentages were mostly chocolate based. In other words, America's favorite candies tend to have chocolate in them. For more information about the study that collected the data, visit the link below.
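
The odds ratio and the estimated probabilities discussed above can be recomputed directly from the reported coefficients. This is a small sketch that copies the rounded estimates from the regression output rather than refitting the model, so the values are approximate.

```r
# Coefficients copied from the regression output in the Analysis section
b0 <- -7.138
b1 <- 0.1351

# Multiplicative change in the odds of chocolate per one-point
# increase in winpercent
odds_ratio <- exp(b1)

# Estimated probability of chocolate at winpercent = 60 and 80;
# plogis(z) is exp(z) / (1 + exp(z))
probs <- plogis(b0 + b1 * c(60, 80))

print(round(odds_ratio, 4))
print(round(probs, 3))
```

With these estimates, the probabilities come out at roughly 0.72 for a winpercent of 60 and roughly 0.98 for a winpercent of 80.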

Links

https://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/
