Answer all the following questions in this notebook either
following the # ANSWER: or **ANSWER** prompts.
For mathematical equations, replace the text between the $$
with the appropriate formulas.
Use complete sentences, proof-read, and show all necessary work. After you finish, make sure to re-run all the R-chunks before clicking “Preview” and submitting. You can do this by clicking the dropdown menu that says “Run” in the toolbar above and then selecting “Run All”. There is a non-zero probability that Dr. Selby will download the Rmd file and try to run it on her computer. If it does not run from top to bottom, she will not accept the assignment.
Submit your .nb.html file on Canvas by the date and time listed above.
Lastly, post all code used to load necessary libraries in the following R-chunk.
# ANSWER:
library(fivethirtyeight)
Warning: package ‘fivethirtyeight’ was built under R version 4.2.3
Some larger datasets need to be installed separately, like senators and house_district_forecast. To install these, we recommend you install the fivethirtyeightdata package by running: install.packages('fivethirtyeightdata', repos = 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
library(stargazer)
library(dplyr)
library(ggplot2)
Go to https://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/ and watch the video and read the article.
Discuss the research question addressed in this article:
ANSWER: In this article and video, the researcher is trying to discover which qualities or components of a candy make it more desirable than other candies. He analyzed the general ingredients (chocolate vs. fruit flavored) and other variables such as caramel, crisp/wafer, nougat, and nuts to see how candies with those features fared against other candies. Then he identified the variables that made the biggest impact.
Load the data set candy_rankings from the package
fivethirtyeight.
# ANSWER:
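?candy_rankings                          # help page describing each variable
data("candy_rankings")
head(candy_rankings)
if (interactive()) View(candy_rankings)  # open the data viewer only in an interactive session
attach(candy_rankings)                   # lets later chunks refer to columns by name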
Create a well-formatted and informative summary statistics table. Interpret the mean of two of your variables.
# ANSWER:
stargazer(data.frame(candy_rankings), type = "html", title = "Summary Statistics of Candy Rankings", covariate.labels = c("Chocolate", "Fruity", "Caramel", "Nuts", "Nougat", "Crispy/Wafer", "Hard", "Bar", "Pluribus", "Sugar %", "Price %", "Win %"))
| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| Chocolate | 85 | 0.435 | 0.499 | 0 | 1 |
| Fruity | 85 | 0.447 | 0.500 | 0 | 1 |
| Caramel | 85 | 0.165 | 0.373 | 0 | 1 |
| Nuts | 85 | 0.165 | 0.373 | 0 | 1 |
| Nougat | 85 | 0.082 | 0.277 | 0 | 1 |
| Crispy/Wafer | 85 | 0.082 | 0.277 | 0 | 1 |
| Hard | 85 | 0.176 | 0.383 | 0 | 1 |
| Bar | 85 | 0.247 | 0.434 | 0 | 1 |
| Pluribus | 85 | 0.518 | 0.503 | 0 | 1 |
| Sugar % | 85 | 0.479 | 0.283 | 0.011 | 0.988 |
| Price % | 85 | 0.469 | 0.286 | 0.011 | 0.976 |
| Win % | 85 | 50.317 | 14.714 | 22.445 | 84.180 |
## HTML version so the table shows in the preview. Remember to add results = 'asis' to the {r} chunk header or the table will not render.
ANSWER: For the candy_rankings data set from fivethirtyeight, we can see in this table that the mean of the sugar variable (sugarpercent) was 0.479, or 47.9%. Because sugarpercent records the percentile of sugar a candy falls under within the data set, this says that the average candy among the 85 analyzed sits near the 48th percentile for sugar. It makes sense that some candies are higher in sugar and some are lower, but since they are candy, sweetness is a component and sugar should be present.
We can also see that, of the variables describing what the candies were made of, chocolate appeared in approximately 43.5% of the candies. So a little under half of the candies contain chocolate.
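As a cross-check on the table, the same means can be computed directly from the data. This is a minimal sketch, assuming the fivethirtyeight and dplyr packages are installed; the object name `candy_means` is only illustrative and not part of the assignment.

```r
library(fivethirtyeight)
library(dplyr)

data("candy_rankings")

# The mean of a TRUE/FALSE column is the proportion of TRUE values.
# sugarpercent and pricepercent are stored on a 0-1 scale,
# winpercent on a 0-100 scale.
candy_means <- candy_rankings %>%
  summarise(
    n_candies       = n(),
    prop_chocolate  = mean(chocolate),
    prop_fruity     = mean(fruity),
    mean_sugar      = mean(sugarpercent),
    mean_price      = mean(pricepercent),
    mean_winpercent = mean(winpercent)
  )

print(candy_means)
```

Reading the values off this one-row summary makes it harder to pick up a number from the wrong row of the stargazer table.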
Create two high-quality graphs/figures that illustrate the data in a way that is informative for the reader. Explain, in text, what these figures illustrate.
# ANSWER:
#Graph 1= Scatterplot
ggplot(data = candy_rankings, aes(x = sugarpercent, y = pricepercent, col= chocolate)) +
geom_point()+
scale_color_manual(labels=c("Non Chocolate", "Chocolate"), values = c("red", "blue"))+
labs(title = "Sugar Percent vs Price for Chocolate and Non Chocolate Candy", x = "Percent of Sugar", y = "Price Percent")
# ANSWER:
#Graph 2 - Barplot
ggplot(candy_rankings, aes(x = reorder(competitorname, -winpercent), winpercent, fill= chocolate)) +
geom_col() +
scale_x_discrete(guide = guide_axis(n.dodge=1))+
labs(title = "Chocolate Candies VS Non-Chocolate Candies",
y = "Win Percentage",
x = "Candy Names") +
theme(axis.text.x = element_text(angle=90), text=element_text(size=9))
ANSWER: In the first graph, I chose a scatterplot to see if there is a trend between the sugar amount and the price. I also wanted to see where the chocolate candies (which were more popular) fall in that relationship. As we can see, there is a positive trend: as the sugar percent increases, the price percent tends to increase as well. Additionally, by coloring the points by whether the candies contain chocolate, we can see that the more expensive and sugary candies often include chocolate.
In the second graphic, I used a bar graph to show the relationship between each candy (by name) and the percentage of matchups it won against other candies. The video and the article suggested that chocolate usually had a higher win percentage, so I added a fill color to show which candies contain chocolate and which do not. With the bars in decreasing order, I can make a general statement from the graph that chocolate candies typically had a higher win percentage than non-chocolate candies.
Both graphs represent the data well, because it appears having chocolate was more influential than other variables in ranking the candies.
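The two visual impressions above can also be put into numbers. The following is a rough sketch, assuming the fivethirtyeight and dplyr packages are installed; `sugar_price_cor` and `win_by_chocolate` are illustrative names.

```r
library(fivethirtyeight)
library(dplyr)

data("candy_rankings")

# Graph 1: Pearson correlation between the sugar and price percentiles
sugar_price_cor <- cor(candy_rankings$sugarpercent,
                       candy_rankings$pricepercent)

# Graph 2: average win percentage for chocolate vs. non-chocolate candies
win_by_chocolate <- candy_rankings %>%
  group_by(chocolate) %>%
  summarise(
    n          = n(),
    mean_win   = mean(winpercent),
    median_win = median(winpercent)
  )

print(sugar_price_cor)
print(win_by_chocolate)
```

A positive correlation supports the upward trend read from the scatterplot, and the group means show how large the chocolate advantage in the bar graph is on average.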
Write the multiple linear regression empirical model that was estimated in the article. Interpret the coefficients in the context of the problem. What would it mean for a coefficient to be statistically significant? What are the assumptions made when using this linear model?
\[ winpercent_{i} = \beta_0 + \beta_1 chocolate_{i} + \beta_2 fruity_{i} + \beta_3 caramel_{i} + \beta_4 nuts_{i} + \beta_5 nougat_{i} + \beta_6 crispy_{i} + \beta_7 hard_{i} + \beta_8 bar_{i} + \beta_9 pluribus_{i} + \varepsilon_{i} \]
ANSWER: Each coefficient measures how the expected win percentage changes when a candy has the corresponding characteristic, holding the other characteristics fixed. For example, the coefficient Beta_1 on chocolate tells us how many percentage points higher (or lower, if it is negative) we expect the win percentage of a chocolate candy to be compared with an otherwise identical candy without chocolate. The intercept Beta_0 is the expected win percentage of a candy with none of the listed characteristics. Each of the coefficients (if included) should have some meaningful impact on how well our model predicts the ranking of a specific candy.
If a coefficient is statistically significant, the relationship between that variable and our response variable is stronger than we would expect from chance alone. In other words, if we assume a variable has no influence (its coefficient is zero) and find that the estimate is statistically significant, we can reject the assumption that it has no impact and argue that we should use this variable because it helps provide a better regression model for predicting the outcome; in our case, how well a candy is liked.
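To make the interpretation and the significance test concrete, the estimate, standard error, t statistic, and p-value for one coefficient can be pulled out of a fitted model. This is a minimal sketch, assuming the fivethirtyeight package is installed and using the column names from candy_rankings (peanutyalmondy and crispedricewafer stand in for nuts and crispy); `fit`, `coef_table`, and `choc_row` are illustrative names.

```r
library(fivethirtyeight)

data("candy_rankings")

# Fit the nine-characteristic model written above
fit <- lm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy +
            nougat + crispedricewafer + hard + bar + pluribus,
          data = candy_rankings)

# Columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients

# Row for chocolate (matched by prefix, since the row name may carry a
# suffix such as "TRUE" when the column is logical)
choc_row <- coef_table[grep("^chocolate", rownames(coef_table)), ]
print(choc_row)

# 95% confidence intervals; an interval that excludes 0 corresponds to
# significance at the 5% level
print(confint(fit, level = 0.95))
```

The `Pr(>|t|)` entry is the p-value for the null hypothesis that the coefficient is zero, which is the "no influence" assumption described above.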
We make several assumptions for this linear model.
1) The regression has a linear form. (The win percent is modeled as a linear function of the coefficients plus an error term.)
2) The error term has an expected value of zero. So we do not want to see any systematic pattern in our error terms or residuals.
3) None of the independent variables is correlated with the error term. This means that we cannot use a variable (e.g. nougat) to predict the error of our results.
4) The error terms are not correlated with each other. We should not be able to predict one error amount from a previous error amount.
5) The error terms have constant variance. If we look at the residual plot, there should be no fanning-out pattern suggesting that some observations have larger errors than others; we want to see errors of a similar size for larger and for smaller fitted values.
6) No independent variable is perfectly correlated with another independent variable (if it were, one variable could be written exactly in terms of the other). So the caramel variable should not be an exact function of whether a candy is crispy or has a wafer, and the same must hold when we compare all of the independent variables to each other.
7) Optional: The error terms are normally distributed. This means the errors are centered at 0 and are equally likely to fall the same distance above or below the predicted value.
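Some of these assumptions can be examined informally once a model is fitted. The sketch below uses only base R plus the fivethirtyeight data and is one possible set of checks, not a complete diagnostic procedure; the object names are illustrative. Note that assumptions 2 and 3 cannot be verified from the residuals alone, because least-squares residuals average to zero and are uncorrelated with the regressors by construction whenever an intercept is included.

```r
library(fivethirtyeight)

data("candy_rankings")

fit <- lm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy +
            nougat + crispedricewafer + hard + bar + pluribus,
          data = candy_rankings)

res         <- resid(fit)
fitted_vals <- fitted(fit)

# Assumptions 1 and 5: look for curvature or a fanning-out pattern
plot(fitted_vals, res,
     xlab = "Fitted win percent", ylab = "Residual",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Assumption 6: pairwise correlations among the regressors
# (model.matrix() converts the columns to numeric; drop the intercept)
X             <- model.matrix(fit)[, -1]
regressor_cor <- cor(X)
max_abs_cor   <- max(abs(regressor_cor[upper.tri(regressor_cor)]))
print(round(regressor_cor, 2))

# Assumption 7: normal Q-Q plot and Shapiro-Wilk test of the residuals
qqnorm(res)
qqline(res)
normality_test <- shapiro.test(res)
print(normality_test)
```

A largest absolute pairwise correlation below 1 rules out perfect correlation between two regressors, although it does not rule out an exact linear combination of three or more.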
Replicate the regression table in the article. Display results in a nicely-formatted table.
# ANSWER:
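# Model 1: candy characteristics only
candy_mls <- lm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus, data = candy_rankings)
# Model 2: adds sugarpercent and pricepercent
candy_mls2 <- lm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + sugarpercent + pricepercent, data = candy_rankings)
# Model 3: keeps only the variables significant at the 10% level in model 2
candy_mls3 <- lm(winpercent ~ chocolate + fruity + peanutyalmondy + crispedricewafer + hard + sugarpercent, data = candy_rankings)
stargazer(candy_mls, candy_mls2, candy_mls3, type = "text")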
Warning: length of NULL cannot be changed
Warning: length of NULL cannot be changed
Warning: length of NULL cannot be changed
Warning: length of NULL cannot be changed
Warning: length of NULL cannot be changed
Warning: number of rows of result is not a multiple of vector length (arg 2)
Warning: number of rows of result is not a multiple of vector length (arg 2)
=======================================================================================
                                            Dependent variable:
                    -------------------------------------------------------------------
                                                winpercent
                             (1)                   (2)                    (3)
---------------------------------------------------------------------------------------
chocolate                 19.906***             19.748***              19.147***
                           (3.897)               (3.899)                (3.587)
fruity                    10.268***              9.422**                8.881**
                           (3.789)               (3.763)                (3.561)
caramel                     3.384                 2.224
                           (3.603)               (3.657)
peanutyalmondy            10.141***             10.071***               9.483***
                           (3.595)               (3.616)                (3.446)
nougat                      2.416                 0.804
                           (5.690)               (5.716)
crispedricewafer           8.992*                 8.919*                 8.385*
                           (5.328)               (5.268)                (4.484)
hard                       -4.873                -6.165*                -5.669*
                           (3.439)               (3.455)                (3.289)
bar                        -0.722                 0.442
                           (4.871)               (5.061)
pluribus                   -0.160                 -0.854
                           (3.012)               (3.040)
sugarpercent                                      9.087*                 7.979*
                                                 (4.659)                (4.129)
pricepercent                                      -5.928
                                                 (5.513)
Constant                  35.015***             34.534***              32.941***
                           (4.078)               (4.320)                (3.518)
---------------------------------------------------------------------------------------
Observations                 85                     85                     85
R2                          0.515                 0.540                  0.528
Adjusted R2                 0.457                 0.471                  0.492
Residual Std. Error   10.847 (df = 75)       10.703 (df = 73)       10.492 (df = 78)
F Statistic         8.842*** (df = 9; 75) 7.797*** (df = 11; 73) 14.538*** (df = 6; 78)
=======================================================================================
Note:                                                       *p<0.1; **p<0.05; ***p<0.01
Interpret your results from question 6. Did the model explain much of the variation in the dependent variable? Do you feel like anything may have been omitted from the regression that should have been included?
ANSWER: In all three models we are trying to predict how well a candy will fare against other candies. The constant is positive and similar across the three models (approximately 33 to 35). The variables that are statistically significant in at least one model all have positive coefficients, with the exception of whether the candy is hard, which has a negative coefficient in every model. In the original model (candy_mls), I think the sugarpercent variable should have been included, since it appears to be statistically significant at the 10% level. In the other two models, I included the sugar percent and then considered which of the variables might be omitted. The candy_mls2 model includes all the variables, and candy_mls3 includes only the variables that were statistically significant at the 10% level. Comparing the three regressions, the constants remained similar, as did the coefficients of the variables common to each model. Additionally, the R^2 and residual standard error also remained similar: the models explain about 52% to 54% of the variability in win percent, with a residual standard error of 10.5 to 10.8. So the model explains roughly half of the variation in the dependent variable, and the other half is left unexplained; things not in this data set, such as brand recognition or serving size, may have been omitted. Since the differences between the models are not large, the extra variables in the bigger models appear to add little, and the smallest model (candy_mls3) has the highest adjusted R^2 (0.492). One might argue that a simpler model with fewer variables would be easier to explain to a person as well as comparably useful for predicting a candy's win percent.
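The comparison of the three models can be summarized in one small table, and because the six-variable model is nested inside the eleven-variable model, an F-test can check whether the dropped variables jointly matter. This is a sketch under the assumption that the fivethirtyeight package is installed; `fit1`, `fit2`, `fit3`, `comparison`, and `nested_test` are illustrative names, and the F-test is an addition that was not part of the original answer.

```r
library(fivethirtyeight)

data("candy_rankings")

fit1 <- lm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy +
             nougat + crispedricewafer + hard + bar + pluribus,
           data = candy_rankings)
fit2 <- update(fit1, . ~ . + sugarpercent + pricepercent)
fit3 <- lm(winpercent ~ chocolate + fruity + peanutyalmondy +
             crispedricewafer + hard + sugarpercent,
           data = candy_rankings)

models <- list(fit1, fit2, fit3)

# Fit statistics side by side
comparison <- data.frame(
  model  = c("candy_mls", "candy_mls2", "candy_mls3"),
  r2     = sapply(models, function(m) summary(m)$r.squared),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  sigma  = sapply(models, function(m) summary(m)$sigma),
  aic    = sapply(models, AIC)
)
print(comparison)

# Nested F-test: do the five variables dropped from model 2 to get
# model 3 (caramel, nougat, bar, pluribus, pricepercent) jointly help?
nested_test <- anova(fit3, fit2)
print(nested_test)
```

A large p-value in the nested test would support keeping the simpler model, which matches the argument made above from the adjusted R^2 values.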