library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
candy <- read_csv("candy.csv")
## Rows: 85 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): competitorname
## dbl (12): chocolate, fruity, caramel, peanutyalmondy, nougat, crispedricewaf...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
My research question was: Does sugar predict how much people prefer a candy?
My dataset is from 2015 and contains 13 variables and 85 rows. Each row is a specific candy (such as 3 Musketeers), and the variables describe different attributes of the candy. The variables relevant to my analysis were win percent and sugar percentile. The win percent variable was calculated by presenting visitors to the 538 website with a random 1 to 1 matchup between two candies in the dataset and asking them to select which of the two they preferred. There were a total of 269,000 matchups across the whole dataset. A candy’s win percent was the percent of 1 to 1 matchups it won out of the total matchups it was included in. For example, if a candy was included in 10 matchups and won 6, its win percent would be 60. The sugar percentile gives the percentile of how much sugar a specific candy had relative to all the candies in the dataset. I used win percent to represent how much people preferred a candy.
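As a minimal sketch of the win percent calculation described above, the helper below reproduces the 6-out-of-10 example. The function name `win_percent()` is my own illustrative choice and is not part of the FiveThirtyEight data or code.

```r
# Hypothetical helper (not from the original dataset or analysis):
# win percent is wins divided by matchups, on a 0-100 scale.
win_percent <- function(wins, matchups) {
  100 * wins / matchups
}

# A candy that appeared in 10 matchups and won 6 of them.
win_percent(wins = 6, matchups = 10)
```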
I used the “Candy Power Ranking” dataset first published by FiveThirtyEight in 2015 and then archived on GitHub: github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv.
To clean my dataset, I selected the columns for the variables I am using and filtered out any NA data. The win percent variable was already on a 0 to 100 scale, but sugar percentile was stored as a decimal between 0 and 1. I therefore multiplied sugar percentile by 100 so that both variables are on the same 0 to 100 scale.
For exploratory data analysis, I calculated the mean, median, minimum, and maximum of win percent. To visualize my data, I made a histogram of win percent values.
The mean of win percent is expected to be very close to 50% because every matchup produces exactly one win and one loss, so if every candy appeared in the same number of matchups, wins and losses would cancel one another out. However, since matchups were shown randomly, candies did not all appear equally often and the mean may not be exactly 50%. Note that I found the median of win percent to be lower than the mean (47.8 < 50.3), indicating a right skew, which can be seen in the histogram.
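To illustrate why the mean sits near, but not exactly at, 50%, here is a small simulation sketch. It uses made-up data, not the FiveThirtyEight matchups: the `appeal` values, the logistic win rule, and the number of simulated matchups are all assumptions chosen only for illustration.

```r
set.seed(1)

n_candies  <- 85      # same number of candies as the dataset
n_matchups <- 20000   # assumption: far fewer than the real 269,000, for speed
appeal     <- rnorm(n_candies)  # hypothetical latent appeal of each candy

# Draw two different candies for every matchup.
a      <- sample(n_candies, n_matchups, replace = TRUE)
offset <- sample(n_candies - 1, n_matchups, replace = TRUE)
b      <- (a + offset - 1) %% n_candies + 1

# Candy a wins with a probability that grows with its appeal advantage.
a_wins <- runif(n_matchups) < plogis(appeal[a] - appeal[b])

wins        <- tabulate(c(a[a_wins], b[!a_wins]), nbins = n_candies)
appearances <- tabulate(c(a, b), nbins = n_candies)
win_pct     <- 100 * wins / appearances

# Weighted by appearances, the average is exactly 50:
# each matchup adds one win and two appearances.
weighted_mean <- 100 * sum(wins) / sum(appearances)

# The plain mean across candies is close to 50 but usually not exactly 50,
# because candies appear different numbers of times.
unweighted_mean <- mean(win_pct)

c(weighted = weighted_mean, unweighted = unweighted_mean)
```

The unweighted mean is the quantity that `mean(candy$winpercent)` corresponds to, which is why a value slightly away from 50 is not surprising.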
I used a linear regression to test my research question because I had two quantitative variables, and I wanted to determine if one predicted another. I made a scatter plot with my regression line overlaid to visualize my regression.
# Selecting the relevant columns (keeping the candy name for reference), filtering out any NA data, and multiplying sugarpercent by 100 so that it is on the same 0-100 scale as winpercent.
candy <- candy %>%
  select(competitorname, sugarpercent, winpercent) %>%
  filter(!is.na(sugarpercent), !is.na(winpercent)) %>%
  mutate(sugarpercent = sugarpercent * 100)
# Finding the mean, median, min, and max of the win percent. Notice the median (47.8) is slightly lower than the mean (50.3).
mean(candy$winpercent)
## [1] 50.31676
median(candy$winpercent)
## [1] 47.82975
max(candy$winpercent)
## [1] 84.18029
min(candy$winpercent)
## [1] 22.44534
# Plotting a histogram of win percent values. Notice a right skew.
ggplot(candy, aes(x = winpercent)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Win Percent", x = "Win Percent", y = "Number of Candies") +
  theme_minimal()
# Plotting a scatterplot with a linear model line.
ggplot(candy, aes(x = sugarpercent, y = winpercent)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", color = "steelblue", se = FALSE) +
  labs(title = "Sugar Percentile vs Win Percent",
       x = "Sugar Percentile", y = "Win Percent") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Predictor variable: sugar percentile
Response variable: win percent
Correlation: 0.2291507
Linear regression: p = 0.0349, intercept = 44.6094, slope = 0.1192, R2 = 0.05251 (adjusted R2 = 0.04109), RMSE = 14.23832
The correlation is positive but has a small absolute value (below 0.3), indicating a weak positive correlation between sugar percentile and win percent.
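From the slope of the linear model, for a 1-point increase in sugar percentile, predicted win percent increases by 0.1192 percentage points. Or, for a 10-point increase in sugar percentile, predicted win percent increases by 1.192 percentage points. The variation in sugar percentile explains about 5% of the variation in win percent (multiple R-squared = 0.0525; adjusted R-squared = 0.0411). While the relationship is statistically significant at the 0.05 level (p = 0.0349), sugar percentile accounts for only a very small share of the differences in win percent.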
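For a regression with a single predictor, the multiple R-squared equals the squared correlation, and the adjusted R-squared shrinks it slightly to account for sample size. The sketch below checks this using only the numbers reported in this document (the correlation and the 85 rows); the variable names are my own.

```r
# Values taken from the output reported in this document.
r <- 0.2291507   # correlation between sugarpercent and winpercent
n <- 85          # number of candies
p <- 1           # number of predictors in the model

# With one predictor, multiple R-squared is the squared correlation.
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(c(multiple = r_squared, adjusted = adj_r_squared), 5)
```

These agree, up to rounding, with the "Multiple R-squared" and "Adjusted R-squared" lines of the `summary(model)` output.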
I used a residuals plot to test the assumption of homoscedasticity and a Q-Q plot to test the assumption of normality. There are a few outliers in the residuals plot, but most of the residuals have an even spread across all fitted values. There is not a clear funnel shape to the graph, indicating that the residuals are homoscedastic. The residuals closely follow the line of the Q-Q plot and do not dramatically deviate at the ends, indicating that the residuals are approximately normally distributed.
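The two plots are visual checks. As a hedged complement, the sketch below shows how the same two assumptions could be checked numerically. It runs on simulated stand-in data, not the candy data, so its p-values say nothing about my model. The Shapiro-Wilk test (`shapiro.test()` from base R) addresses normality, and the second check is a hand-computed studentized Breusch-Pagan statistic (n times the R-squared from regressing the squared residuals on the predictor); neither was part of my original analysis.

```r
set.seed(42)

# Simulated stand-in data with the same shape as the candy data (85 rows).
# The coefficients and noise level are assumptions for illustration only.
sim <- data.frame(sugarpercent = runif(85, min = 0, max = 100))
sim$winpercent <- 45 + 0.12 * sim$sugarpercent + rnorm(85, sd = 14)

fit <- lm(winpercent ~ sugarpercent, data = sim)

# Normality of residuals: Shapiro-Wilk test.
# A small p-value would suggest the residuals are not normal.
normality <- shapiro.test(residuals(fit))

# Homoscedasticity: studentized Breusch-Pagan statistic computed by hand.
# Regress the squared residuals on the predictor; n * R-squared is compared
# to a chi-squared distribution with 1 degree of freedom.
sim$sq_resid <- residuals(fit)^2
aux     <- lm(sq_resid ~ sugarpercent, data = sim)
bp_stat <- nrow(sim) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = normality$p.value, breusch_pagan_p = bp_p)
```

To apply the same checks to the real analysis, `fit` and `sim` would be replaced by `model` and `candy`.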
For candies at the 10th, 50th, and 90th sugar percentile, my model predicts a win percent of about 46%, 51%, and 55% respectively.
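These predictions can be reproduced by hand from the fitted line, win percent = intercept + slope × sugar percentile. The sketch below uses the rounded coefficients reported in this document, so the results differ from the `predict()` output only in the later decimal places.

```r
# Rounded coefficients from the reported regression output.
intercept <- 44.6094
slope     <- 0.1192

# Sugar percentiles of interest, on the 0-100 scale used in the model.
sugar <- c(10, 50, 90)

# Predicted win percent from the fitted line.
predicted <- intercept + slope * sugar

round(predicted)
# → 46 51 55
```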
From the root mean squared error, my model’s predictions are typically off by about 14 percentage points of win percent.
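The RMSE (14.24) is slightly smaller than the residual standard error in the model summary (14.41) because the two divide the same sum of squared residuals by different numbers: the RMSE divides by the number of rows, while the residual standard error divides by the residual degrees of freedom. The sketch below recovers one from the other using only the reported values; small differences come from rounding.

```r
# Values from the reported summary(model) output.
rse      <- 14.41   # residual standard error
df_resid <- 83      # residual degrees of freedom (n - 2)
n        <- 85      # number of candies

# Recover the residual sum of squares from the residual standard error.
rss <- rse^2 * df_resid

# RMSE divides by n instead of the residual degrees of freedom.
rmse_from_rse <- sqrt(rss / n)

round(rmse_from_rse, 2)
```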
# Finding the correlation between my variables.
cor(candy$sugarpercent, candy$winpercent, use = "complete.obs")
## [1] 0.2291507
# Making my linear model and finding the relevant statistics.
model <- lm(winpercent ~ sugarpercent, data=candy)
summary(model)
##
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.924 -11.066 -1.168 9.252 36.851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.6094 3.0861 14.455 <2e-16 ***
## sugarpercent 0.1192 0.0556 2.145 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared: 0.05251, Adjusted R-squared: 0.04109
## F-statistic: 4.6 on 1 and 83 DF, p-value: 0.0349
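As a rough check on how the numbers in this summary fit together, the sketch below recomputes the slope's t value, its two-sided p-value, and the F-statistic from the rounded figures printed above. Because the inputs are rounded, the recomputed values match the output only approximately.

```r
# Rounded values from the coefficient table above.
estimate  <- 0.1192   # slope for sugarpercent
std_error <- 0.0556   # standard error of the slope
t_reported <- 2.145   # t value as printed
df_resid  <- 83       # residual degrees of freedom

# The t value is the estimate divided by its standard error.
t_from_rounded <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_reported), df = df_resid)

# With a single predictor, the F-statistic is the squared t value.
f_stat <- t_reported^2

round(c(t = t_from_rounded, p = p_value, F = f_stat), 4)
```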
# Checking the homoscedasticity assumption with a residuals vs. fitted plot.
plot(model, which = 1)
# Checking the normality of residuals assumption with a Q-Q plot.
plot(model, which = 2)
# Finding the root mean squared error.
rmse <- sqrt(mean(residuals(model)^2))
rmse
## [1] 14.23832
# Predicting the win percent of a candy in the 10th, 50th, and 90th sugar percentile.
prediction <- data.frame(sugarpercent = c(10, 50, 90))
predict(model, prediction)
## 1 2 3
## 45.80183 50.57137 55.34092
My key finding was that sugar percentile does predict win percent (p = 0.0349), but this relationship is very weak. Variation in sugar percentile explains only about 5% of the variation in win percent, and my model’s predictions are typically off by about 14 percentage points.
My result is interesting because it challenges conventional wisdom about sugar being important to candy liking. While the answer to my research question may be yes, sugar percentile does predict win percent, my findings show that the association between sugar and candy liking is quite small.
This opens up future research questions about whether other physical properties of the candy, such as salt content or crunchiness, might better predict liking, or whether external factors like cost or prevalence affect liking.
FiveThirtyEight. “Data/Candy-Power-Ranking/Candy-Data.csv at Master · Fivethirtyeight/Data.” GitHub, 2022, github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv.