DATA 606 Data Project

Part 1 - Introduction

People like candies. Especial holidays like Halloween, candies are even more popular than usual. Many businesses have to supply different candies for customers to choose, but there are so many varieties of candies in the world, and each business will not stock up all kinds of candies. So those candies that are the best for sale or those that are the most popular became a key for businesses to order. Therefore, I want to see if the sugar content of a candy judge its popularity.

Part 2 - Data

The data set was found in FiveThirtyEight and I got it from its github. Ans it contains fields of ingredients, textures, shape and packaging style. The whole data set follow this link, The dependent variable is the win_percent which is numerical, the independent variable is sugar_percent and candy_name, they are numerical and categorical respectively.

Research question: Does sugar content predictive of win percent?

Part 3 - Exploratory data analysis

There are 85 observations and 4 useful variables. Each observation is presenting one kind of candy.

# import data
candy <- read.csv("https://raw.githubusercontent.com/Sugarcane-svg/R/main/R606/final_project/candy.csv")

# over view of data set
candy$X <- NULL
glimpse(candy)

## Rows: 85
## Columns: 4
## $ candy_name         <chr> "100 Grand", "3 Musketeers", "One dime", "One quart…
## $ sugar_percent      <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.…
## $ unit_price_percent <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.…
## $ win_percent        <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 5…

# summary of data set
candy %>% 
  dplyr::select(-c(candy_name)) %>%
  describe()

##                    vars  n  mean    sd median trimmed   mad   min   max range
## sugar_percent         1 85  0.48  0.28   0.47    0.48  0.36  0.01  0.99  0.98
## unit_price_percent    2 85  0.47  0.29   0.47    0.46  0.31  0.01  0.98  0.97
## win_percent           3 85 50.32 14.71  47.83   49.80 14.56 22.45 84.18 61.73
##                    skew kurtosis   se
## sugar_percent      0.09    -1.18 0.03
## unit_price_percent 0.13    -1.20 0.03
## win_percent        0.32    -0.67 1.60

# histogram of sugar_percent
ggplot(candy, aes(x = sugar_percent)) + 
  geom_histogram(binwidth = 0.10) +
  xlab("percentile of sugar")

# histogram of win_percent
ggplot(candy, aes(x = win_percent)) + 
  geom_histogram(binwidth = 5) +
  xlab("overall win percentage")

Part 4 - Inference

win vs. sugar

Lets take a look at the scatter plot

ggplot(candy, aes(x = sugar_percent, y = win_percent)) + 
  geom_point()

The summary of linear model

lm <- lm(candy$win_percent ~ candy$sugar_percent)
summary(lm)

## 
## Call:
## lm(formula = candy$win_percent ~ candy$sugar_percent)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.924 -11.066  -1.168   9.252  36.851 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           44.609      3.086  14.455   <2e-16 ***
## candy$sugar_percent   11.924      5.560   2.145   0.0349 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.41 on 83 degrees of freedom
## Multiple R-squared:  0.05251,    Adjusted R-squared:  0.04109 
## F-statistic:   4.6 on 1 and 83 DF,  p-value: 0.0349

So we get the equation:

  $\widehat{win} = 11.924 * \widehat{sugar} + 44.609$

the slope is positive, which mean that every percent of sugar added in the candy will have 11.924 higher chance to win. The \(R^2=0.053\) which shows that the 5.3% of variability in winpercent is explained by sugarpercent.

Now, we take a look at the residual plot, it doesn’t seem to have strange pattern, so linear model is fine to use.

ggplot(data = lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values",
       y = "Residuals", 
       title = "Residual plot")

the following is the scatter plot with y = win_percent and x = sugar_percent, the blue line is the linear model which use sugar percent to predict the win percent.

ggplot(data = candy, aes(x = sugar_percent, y = win_percent))+
  geom_point()+
  geom_smooth(method = "lm", formula = y~x, se = FALSE)+
  labs(title = "win percent vs. sugar percent")

From the blue line in the plot above, we see it does not pass though many points, and \(R^2\) restates that this linear model only explain 5.3% of variability. However, the residual plot indicate that it is fine to use linear model to make the prediction. So the next we can do is to do the forward selection and see if adding one more variable will increase \(R^2\) value to achieve better linear model.

forward selection (win vs. sugar + unit price)

# linear model with 2 variables
lm_2 <- candy %>% 
  select(-c(candy_name))
lm2 <- lm(lm_2$win_percent~lm_2$sugar_percent+lm_2$unit_price_percent)
summary(lm2)

## 
## Call:
## lm(formula = lm_2$win_percent ~ lm_2$sugar_percent + lm_2$unit_price_percent)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.880  -8.575  -0.759   8.142  37.502 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               39.787      3.438  11.574  < 2e-16 ***
## lm_2$sugar_percent         6.731      5.664   1.188  0.23808    
## lm_2$unit_price_percent   15.586      5.605   2.781  0.00672 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.86 on 82 degrees of freedom
## Multiple R-squared:  0.1342, Adjusted R-squared:  0.113 
## F-statistic: 6.353 on 2 and 82 DF,  p-value: 0.002722

resid_and_fitted <- data.frame(resid = lm2$residuals, fitted = lm2$fitted.values) %>%
  mutate(leverage_point = ifelse(resid > 30 | resid < -30, "yes", "no"))

The \(R^2\) from one variable is around 5.3%, let’s call it \(R_1^2\), but \(R_2^2\) (the adjusted \(R^2\) from two variables) is around 11.3%, which means that value of \(R_2^2\) increases 6%. The coefficient of sugar_percent from 11.9 down to 6.7, which implies that there is bias when using only one variable to predict the win_percent, while adding one more variable, the additional variable reduces some bias, that is the reason why the coefficients of sugar_percent are different.

The equation: \(\widehat{win} = 39.79 + 6.73*\widehat{sugar} + 15.586*\widehat{price}\)

Now, we are going to take a look at the residual plot. Even though there are two point may be considered high leverage point(the one is in the upper left and the other one is on the lower right, it would result the slope of linear model tilt down), the variability of approximately constant.

ggplot(resid_and_fitted, aes(fitted, resid, color = leverage_point)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "fitted value",
       y = "residuals",
       title = "residual plot with 2 variables")

The plot of linear model with 2 variables

ggplot(data = candy, aes(x = sugar_percent + unit_price_percent, y = win_percent))+
  geom_point()+
  geom_smooth(method = "lm", formula = y~x, se = FALSE)+
  labs(title = "win percent vs. sugar percent and unit price percent")

backward elimination (win vs. unit price)

we know that 2 variables increasing the goodness of fit of model. However, when we take a close look that the summary table, the p-value for both variable tells us something as well. It is greater than significant level in sugar_percent but it is below the significant level for unit_price percent, therefore, I guess, if we take out the sugar_percent, it may increase the goodness of fit as well.

# linear model with only unit_price
lm3 <- lm(candy$win_percent~candy$unit_price_percent)
summary(lm3)

## 
## Call:
## lm(formula = candy$win_percent ~ candy$unit_price_percent)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.889  -8.573  -0.544   8.784  34.926 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                41.979      2.908  14.435  < 2e-16 ***
## candy$unit_price_percent   17.783      5.305   3.352  0.00121 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.89 on 83 degrees of freedom
## Multiple R-squared:  0.1192, Adjusted R-squared:  0.1086 
## F-statistic: 11.24 on 1 and 83 DF,  p-value: 0.001209

Because there is only one explanatory variable, we are going to take a look at simple \(R^2\) instead of adjusted \(R^2\). From the observation, the \(R_2^2\) is 11.3%, and the \(R^2\) here is 11.9%. It means that modeling win percent from unit price percent increases the goodness of fit approximately 0.6%.

the equation: \(\widehat{win} = 41.98 + 17.78 *\widehat{price}\)

Interpretation: slope: every additional unite price will result 17.78 more win percent. intercept: if unit price were set to 0, the win percent would be 41.98.

the residual plot shows the variance is roughlt constant.

ggplot(lm3, aes(.fitted, .resid)) +
  geom_point()+
  geom_hline(yintercept = 0, linetype = "dashed")

below is the linear model of win vs unit price

ggplot(data = candy, aes(x = unit_price_percent, y = win_percent))+
  geom_point()+
  geom_smooth(method = "lm", formula = y~x, se = FALSE)+
  labs(title = "win percent vs. unit price percent")

Part 5 - Conclusion

sugar content does somewhat prediction on the candy win percentage. However, with sugar percent and unit price percent, the \(R^2\) of corresponding linear model explains that 6% higher variability than the one with only variable sugar percent. Furthermore, according to the backward elimination analysis, the resulting \(R^2\) of variable unit price percent has the highest value. Therefore, I will conclude that these three model can all predict the win percent, however, the unit price percentage will be the best model among the three.