Questions
1. Background
Go to https://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/ and watch the video and read the article.
Discuss the research question addressed in this article:
ANSWER: DATA: The “candy_rankings” data come from FiveThirtyEight's online poll, in which visitors were shown pairs of Halloween candies and asked which one they preferred; the win percentages are based on about 269,000 such matchups. In the data, the name of each Halloween candy is stored in competitorname. The “candy_rankings” data set consists of 85 rows and 13 variables: competitorname, chocolate, fruity, caramel, peanutyalmondy, nougat, crispedricewafer, hard, bar, pluribus, sugarpercent, pricepercent and winpercent. The research question is: why are some candies preferred to others? Is it because of what the candy contains (chocolate, fruit flavor, caramel, peanuts or almonds, nougat, crisped rice or wafer)? Is it because of its form (hard candy or bar), or because it comes as one of many pieces in a bag or box (pluribus)? Or is it related to its sugar content or price? We will estimate a multiple linear regression model to measure how each of these observable characteristics is associated with winpercent. The data are described using the R code in the accompanying R file.
2. Data
Load the data set candy_rankings from the package fivethirtyeight.
install.packages("fivethirtyeight")
library(fivethirtyeight)
data(candy_rankings)
View(candy_rankings)
# ANSWER:
3. Summary Table
Create a well-formatted and informative summary statistics table. Interpret the mean of two of your variables.
# ANSWER:
library(stargazer)
Please cite as:
Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
sillynames <- data.frame(candy_rankings)
stargazer(sillynames, type = "text")
==============================================================================
Statistic          N    Mean   St. Dev.   Min     Pctl(25)  Pctl(75)   Max
------------------------------------------------------------------------------
chocolate          85   0.435   0.499     0       0         1          1
fruity             85   0.447   0.500     0       0         1          1
caramel            85   0.165   0.373     0       0         0          1
peanutyalmondy     85   0.165   0.373     0       0         0          1
nougat             85   0.082   0.277     0       0         0          1
crispedricewafer   85   0.082   0.277     0       0         0          1
hard               85   0.176   0.383     0       0         0          1
bar                85   0.247   0.434     0       0         0          1
pluribus           85   0.518   0.503     0       0         1          1
sugarpercent       85   0.479   0.283     0.011   0.220     0.732      0.988
pricepercent       85   0.469   0.286     0.011   0.255     0.651      0.976
winpercent         85  50.317  14.714    22.445  39.141    59.864     84.180
------------------------------------------------------------------------------
ANSWER: The table summarizes 12 of the 13 variables for all 85 candies; competitorname is a text label and does not appear in it. The first column names the variable, followed by the number of observations (N), the mean, the standard deviation, the minimum, the 25th percentile, the 75th percentile and the maximum. The mean, in the third column, is the simple (unweighted) average of a variable over the 85 candies. Because chocolate is a 0/1 indicator, its mean of 0.435 is a proportion: about 43.5% of the candies contain chocolate. Likewise, the pluribus mean of 0.518 means that about 51.8% of the candies come as one of many pieces in a bag or box. The means of the other indicator variables are read the same way.
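To make the "mean of an indicator is a proportion" reading concrete, here is a minimal R sketch. It uses only the N and mean values printed in the summary table above; the toy vector at the end is made up for illustration and is not part of the candy data.

```r
# Values copied from the summary table above.
n_candies      <- 85
mean_chocolate <- 0.435
mean_pluribus  <- 0.518

# The mean of a 0/1 variable is the share of ones,
# so mean * N recovers the (rounded) number of candies with that trait.
n_chocolate <- round(mean_chocolate * n_candies)
n_pluribus  <- round(mean_pluribus * n_candies)

cat("Candies containing chocolate:", n_chocolate, "of", n_candies, "\n")
cat("Candies sold as one of many: ", n_pluribus, "of", n_candies, "\n")

# Toy check with a made-up logical vector (NOT the candy data):
# three of four entries are TRUE, so the mean is 0.75.
toy       <- c(TRUE, FALSE, TRUE, TRUE)
toy_share <- mean(toy)
```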
4. Summary Graphs
Create two high-quality graphs/figures that illustrate the data in a way that is informative for the reader. Explain, in text, what these figures illustrate.
# ANSWER:
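hist(candy_rankings$sugarpercent, main = "Distribution of sugarpercent", xlab = "sugarpercent")

hist(candy_rankings$pricepercent, main = "Distribution of pricepercent", xlab = "pricepercent")

# ANSWER:
boxplot(candy_rankings$sugarpercent, main = "Boxplot of sugarpercent", ylab = "sugarpercent")

boxplot(candy_rankings$pricepercent, main = "Boxplot of pricepercent", ylab = "pricepercent")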

ANSWER:
5. Empirical Model
Write the multiple linear regression empirical model that was estimated in the article. Interpret the coefficients in the context of the problem. What would it mean for a coefficient to be statistically significant? What are the assumptions made when using this linear model?
ANSWER: We model winpercent, the dependent variable, as a linear function of eleven independent variables: chocolate, fruity, caramel, peanutyalmondy, nougat, crispedricewafer, hard, bar, pluribus, sugarpercent and pricepercent. The first nine are dummy variables (FALSE = 0, TRUE = 1); sugarpercent and pricepercent are continuous percentiles between 0 and 1. The non-estimated multiple linear regression model is

winpercent = B0 + B1 chocolate + B2 fruity + B3 caramel + B4 peanutyalmondy + B5 nougat + B6 crispedricewafer + B7 hard + B8 bar + B9 pluribus + B10 sugarpercent + B11 pricepercent + e,

where e is the error term. The model has 12 parameters (B0 through B11) and 12 variables (winpercent plus the eleven regressors).

B0 is the constant term: the expected winpercent of a candy for which every dummy equals 0 and sugarpercent and pricepercent are both 0. Each dummy coefficient is a difference in expected winpercent, measured in percentage points of the roughly 269,000 matchups and holding all other variables constant, between a candy with the characteristic (D = 1) and a candy without it (D = 0). B1 is the difference for candies that contain chocolate, B2 for fruity candies, B3 for candies that contain caramel, B4 for candies that contain peanuts or almonds, B5 for candies that contain nougat, B6 for candies that contain crisped rice or wafer, B7 for hard candies, B8 for candy bars, and B9 for candies that come as one of many in a bag or box. For the continuous variables, holding all other variables constant, a one-unit increase in sugarpercent is associated with a change of B10 in winpercent, and a one-unit increase in pricepercent with a change of B11.

A coefficient is statistically significant when we can reject the null hypothesis that it equals zero at the chosen significance level (here 0.05), that is, when the estimate is too far from zero relative to its standard error to be plausibly due to sampling variation alone. The linear model assumes that winpercent is linear in the parameters, that the error term has mean zero given the regressors, that the errors have constant variance and are uncorrelated across candies, that no regressor is an exact linear combination of the others, and, for exact t and F tests in a sample of this size, that the errors are normally distributed.

For the estimated model reported in Section 6, the F-statistic is 7.797 on 11 and 73 degrees of freedom, and its p-value is less than an alpha level of 0.05, so the regression equation as a whole is statistically significant.
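The dummy-variable interpretation used above can be written out as a short derivation. This is only a sketch in standard notation: it writes the document's B0, B1 and B10 as \beta_0, \beta_1 and \beta_{10}, collects all regressors other than the one being changed into X, and relies on the assumption that the error term has mean zero given the regressors.

```latex
% Model (middle terms abbreviated):
\[
\text{winpercent} = \beta_0 + \beta_1\,\text{chocolate} + \dots
  + \beta_{10}\,\text{sugarpercent} + \beta_{11}\,\text{pricepercent} + e,
\qquad \mathbb{E}[e \mid \text{regressors}] = 0 .
\]

% Dummy regressor: switching chocolate from 0 to 1 with X held fixed
\[
\mathbb{E}[\text{winpercent} \mid \text{chocolate}=1, X]
- \mathbb{E}[\text{winpercent} \mid \text{chocolate}=0, X]
= \beta_1 .
\]

% Continuous regressor: a change of \Delta in sugarpercent with X held fixed
\[
\mathbb{E}[\text{winpercent} \mid \text{sugarpercent}=s+\Delta, X]
- \mathbb{E}[\text{winpercent} \mid \text{sugarpercent}=s, X]
= \beta_{10}\,\Delta .
\]
```

Every other term cancels in each difference, which is what "holding all other variables constant" means here.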
6. Results
Replicate the regression table in the article. Display results in a nicely-formatted table.
# ANSWER:
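# D, J, A, P, C, R, V, E, B are the 0/1 dummies for chocolate, fruity, caramel, peanutyalmondy, nougat, crispedricewafer, hard, bar, pluribus (their construction is not shown here)
lm(formula = winpercent ~ D + J + A + P + C + R + V + E + B + sugarpercent + pricepercent, data = candy_rankings)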
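The call above uses the short names D, J, A, P, C, R, V, E and B, but the write-up does not show how they were created. One way they could be built is sketched below. The helper add_dummies and the vector dummy_map are hypothetical names introduced for illustration; the name-to-variable mapping is the one used in the Discussion. The toy data frame is made up so the sketch runs without the package, and the real-data step runs only if fivethirtyeight is installed.

```r
# Hypothetical mapping from short dummy names to candy_rankings columns.
dummy_map <- c(D = "chocolate", J = "fruity", A = "caramel",
               P = "peanutyalmondy", C = "nougat", R = "crispedricewafer",
               V = "hard", E = "bar", B = "pluribus")

# Hypothetical helper: add a 0/1 integer copy of each logical column.
add_dummies <- function(df, map = dummy_map) {
  for (short in names(map)) {
    df[[short]] <- as.integer(df[[map[[short]]]])
  }
  df
}

# Toy data (NOT the candy data): two rows, every trait TRUE then FALSE.
toy <- as.data.frame(matrix(c(TRUE, FALSE), nrow = 2, ncol = length(dummy_map),
                            dimnames = list(NULL, unname(dummy_map))))
toy_coded <- add_dummies(toy)

# With the real data, if the package is available:
if (requireNamespace("fivethirtyeight", quietly = TRUE)) {
  data("candy_rankings", package = "fivethirtyeight")
  candy <- add_dummies(as.data.frame(candy_rankings))
  fit <- lm(winpercent ~ D + J + A + P + C + R + V + E + B +
              sugarpercent + pricepercent, data = candy)
}
```

Passing the original logical columns to lm() directly would give the same estimates, since R codes TRUE/FALSE regressors as 0/1 indicators; the recode only shortens the names.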
===============================================
                      Dependent variable:
                      -------------------------
                      winpercent
-----------------------------------------------
D                     19.748*** (3.899)
J                      9.422**  (3.763)
A                      2.224    (3.657)
P                     10.071*** (3.616)
C                      0.804    (5.716)
R                      8.919*   (5.268)
V                     -6.165*   (3.455)
E                      0.442    (5.061)
B                     -0.854    (3.040)
sugarpercent           9.087*   (4.659)
pricepercent          -5.928    (5.513)
Constant              34.534*** (4.320)
-----------------------------------------------
Observations          85
R2                    0.540
Adjusted R2           0.471
Residual Std. Error   10.703 (df = 73)
F Statistic           7.797*** (df = 11; 73)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Call:
lm(formula = winpercent ~ D + J + A + P + C + R + V + E + B + sugarpercent + pricepercent, data = candy_rankings)

Residuals:
     Min       1Q   Median       3Q      Max
-20.2244  -6.6247   0.1986   6.8420  23.8680

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    34.5340     4.3199   7.994 1.44e-11 ***
D              19.7481     3.8987   5.065 2.96e-06 ***
J               9.4223     3.7630   2.504  0.01452 *
A               2.2245     3.6574   0.608  0.54493
P              10.0707     3.6158   2.785  0.00681 **
C               0.8043     5.7164   0.141  0.88849
R               8.9190     5.2679   1.693  0.09470 .
V              -6.1653     3.4551  -1.784  0.07852 .
E               0.4415     5.0611   0.087  0.93072
B              -0.8545     3.0401  -0.281  0.77945
sugarpercent    9.0868     4.6595   1.950  0.05500 .
pricepercent   -5.9284     5.5132  -1.075  0.28578
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.7 on 73 degrees of freedom
Multiple R-squared: 0.5402, Adjusted R-squared: 0.4709
F-statistic: 7.797 on 11 and 73 DF, p-value: 9.504e-09
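The columns of the coefficient table are linked by simple formulas, and checking one row by hand shows how to read them. The sketch below uses only numbers printed in the output above (the estimate and standard error for D, the 73 residual degrees of freedom, R-squared, 85 observations and 11 regressors); because those inputs are rounded, the results agree with the printed values only approximately.

```r
# Values copied from the regression output above (row D = chocolate).
est      <- 19.7481
se       <- 3.8987
df_resid <- 73

# t value = estimate / standard error
t_value <- est / se

# Two-sided p-value from the t distribution with 73 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# 95% confidence interval: estimate +/- critical t * standard error
ci_95 <- est + c(-1, 1) * qt(0.975, df = df_resid) * se

# Adjusted R-squared from R-squared, n observations and k regressors
r2 <- 0.5402
n  <- 85
k  <- 11
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

cat("t value:", round(t_value, 3), "\n")
cat("p-value:", signif(p_value, 3), "\n")
cat("95% CI: ", round(ci_95, 2), "\n")
cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
```

A 95% interval for the chocolate coefficient that excludes zero tells the same story as a p-value below 0.05.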
Based on the results of our regression, the estimated model is
winpercenthat = 34.5340 + 19.7481 D1 + 9.4223 D2 + 2.2245 D3 + 10.0707 D4 + 0.8043 D5 + 8.9190 D6 - 6.1653 D7 + 0.4415 D8 - 0.8545 D9 + 9.0868 sugarpercent - 5.9284 pricepercent
In the estimated model, D1 through D9 are the dummy variables D, J, A, P, C, R, V, E and B, in that order.
7. Discussion
Interpret your results from question 6. Did the model explain much of the variation in the dependent variable? Do you feel like anything may have been omitted from the regression that should have been included?
ANSWER:
The intercept, 34.5340, is the predicted winpercent of a candy that has none of the nine characteristics and for which sugarpercent and pricepercent are both 0. Each dummy coefficient is a difference in predicted winpercent, in percentage points and holding all other variables constant, between a candy with the characteristic (dummy equal to 1) and one without it. Based on the roughly 269,000 matchups, candies that contain chocolate are predicted to score 19.7481 points higher than candies that do not. Fruity candies score 9.4223 points higher than non-fruity candies. Candies that contain caramel score 2.2245 points higher than those that do not. Candies that contain peanuts or almonds score 10.0707 points higher. Candies that contain nougat score 0.8043 points higher. Candies that contain crisped rice or wafer score 8.9190 points higher. Hard candies score 6.1653 points lower than candies that are not hard. Candy bars score 0.4415 points higher than candies that are not bars. Candies that come as one of many in a bag or box score 0.8545 points lower than candies that do not.

Because sugarpercent and pricepercent are percentiles measured from 0 to 1, a one-unit increase spans their whole range. Holding all other variables constant, a one-unit increase in sugarpercent corresponds to a 9.0868 point increase in winpercent, so a 0.10 increase corresponds to about 0.91 points. Holding all other variables constant, a one-unit increase in pricepercent corresponds to a 5.9284 point decrease in winpercent, so a 0.10 increase corresponds to about 0.59 points less.
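The individual interpretations above combine into a prediction for any one candy. The R sketch below plugs the estimated coefficients from the regression output into the fitted equation for a hypothetical candy (chocolate with peanuts or almonds, at the 0.5 level of both sugarpercent and pricepercent). The candy profile is an assumption chosen for illustration, not a row from the data.

```r
# Estimated coefficients copied from the regression output.
coefs <- c(Intercept = 34.5340, chocolate = 19.7481, fruity = 9.4223,
           caramel = 2.2245, peanutyalmondy = 10.0707, nougat = 0.8043,
           crispedricewafer = 8.9190, hard = -6.1653, bar = 0.4415,
           pluribus = -0.8545, sugarpercent = 9.0868, pricepercent = -5.9284)

# Hypothetical candy profile (an assumption for illustration only).
candy <- c(Intercept = 1, chocolate = 1, fruity = 0, caramel = 0,
           peanutyalmondy = 1, nougat = 0, crispedricewafer = 0, hard = 0,
           bar = 0, pluribus = 0, sugarpercent = 0.5, pricepercent = 0.5)

# Predicted winpercent = sum of coefficient * value over all terms.
predicted <- sum(coefs * candy[names(coefs)])

# Same candy without chocolate: the gap is the chocolate coefficient.
candy_no_choc <- replace(candy, "chocolate", 0)
predicted_no_choc <- sum(coefs * candy_no_choc[names(coefs)])

cat("Predicted winpercent:", round(predicted, 3), "\n")
cat("Without chocolate:   ", round(predicted_no_choc, 3), "\n")
```

The difference between the two predictions equals 19.7481, which is the "holding all other variables constant" reading of the chocolate coefficient.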
Hypothesis testing using the F-statistic
H0: B1 = B2 = B3 = B4 = B5 = B6 = B7 = B8 = B9 = B10 = B11 = 0
H1: at least one of B1, ..., B11 is not equal to 0
F-statistic: 7.797 on 11 and 73 DF, p-value: 9.504e-09
The F-statistic is 7.797 and its p-value, 9.504e-09, is less than an alpha level of 0.05. We therefore reject H0 and conclude that the regression equation is statistically significant.

R^2 interpretation: From our results, R^2 is 0.540. This means that 54.0% of the total variance in winpercent is explained by the model.

H0: B1 = 0 H1: B1 ≠ 0
D 19.7481 3.8987 5.065 2.96e-06
Here the variable D represents chocolate. The p-value, 2.96e-06, is less than 0.05, so we reject H0 and conclude that chocolate is statistically significant in the model.

H0: B2 = 0 H1: B2 ≠ 0
J 9.4223 3.7630 2.504 0.01452
Here the variable J represents fruity. The p-value, 0.01452, is less than an alpha level of 0.05, so we reject H0 and conclude that fruity is statistically significant in the model.

H0: B4 = 0 H1: B4 ≠ 0
P 10.0707 3.6158 2.785 0.00681
Here the variable P represents peanutyalmondy. The p-value, 0.00681, is less than an alpha level of 0.05, so we reject H0 and conclude that peanutyalmondy is statistically significant in the model.

All the remaining variables have p-values greater than an alpha level of 0.05, so we fail to reject H0 for each of them and conclude that caramel, nougat, crispedricewafer, hard, bar, pluribus, sugarpercent and pricepercent are not statistically significant at the 5% level. Of these, crispedricewafer, hard and sugarpercent have p-values below 0.10.
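The overall F test above can be reproduced from the reported R-squared, which makes the link between "variance explained" and "joint significance" explicit. The sketch below uses only values from the regression output (R-squared = 0.5402, 85 observations, 11 regressors); since R-squared is rounded, the result matches the printed 7.797 only approximately.

```r
# Values copied from the regression output.
r2 <- 0.5402
n  <- 85
k  <- 11          # number of slope coefficients tested jointly
df_resid <- n - k - 1

# Overall F-statistic: explained variation per regressor
# relative to unexplained variation per residual degree of freedom.
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# p-value: upper tail of the F distribution with (11, 73) degrees of freedom
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

# 5% critical value: reject H0 when f_stat exceeds it
f_crit <- qf(0.95, df1 = k, df2 = df_resid)

cat("F-statistic:", round(f_stat, 3), "\n")
cat("p-value:    ", signif(p_value, 3), "\n")
cat("5% critical value:", round(f_crit, 3), "\n")
```

Comparing f_stat with f_crit is equivalent to comparing the p-value with 0.05; both lead to rejecting H0 here.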
