Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I chose to use the candy dataset from FiveThirtyEight again: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking

candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
head(candy)
##   competitorname chocolate fruity caramel peanutyalmondy nougat
## 1      100 Grand         1      0       1              0      0
## 2   3 Musketeers         1      0       0              0      1
## 3       One dime         0      0       0              0      0
## 4    One quarter         0      0       0              0      0
## 5      Air Heads         0      1       0              0      0
## 6     Almond Joy         1      0       0              1      0
##   crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1                1    0   1        0        0.732        0.860   66.97173
## 2                0    0   1        0        0.604        0.511   67.60294
## 3                0    0   0        0        0.011        0.116   32.26109
## 4                0    0   0        0        0.011        0.511   46.11650
## 5                0    0   0        0        0.906        0.511   52.34146
## 6                0    0   1        0        0.465        0.767   50.34755

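Let’s build a multiple regression model that predicts a candy’s price percentile (pricepercent) from whether it is a candy bar (bar), its sugar percentile (sugarpercent), the interaction between bar and sugarpercent, and several dichotomous ingredient indicators.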

multi_model <- lm(pricepercent ~ bar + bar * sugarpercent + chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer, data = candy)
summary(multi_model)
## 
## Call:
## lm(formula = pricepercent ~ bar + bar * sugarpercent + chocolate + 
##     fruity + caramel + peanutyalmondy + nougat + crispedricewafer, 
##     data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39739 -0.13889 -0.01221  0.11147  0.72789 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       0.22076    0.07852   2.811  0.00629 **
## bar               0.46444    0.20388   2.278  0.02557 * 
## sugarpercent      0.29501    0.09203   3.205  0.00198 **
## chocolate         0.10382    0.08123   1.278  0.20518   
## fruity           -0.03077    0.07798  -0.395  0.69430   
## caramel           0.07520    0.08390   0.896  0.37297   
## peanutyalmondy    0.09518    0.07436   1.280  0.20453   
## nougat           -0.15867    0.11939  -1.329  0.18787   
## crispedricewafer  0.03504    0.11115   0.315  0.75344   
## bar:sugarpercent -0.43831    0.37281  -1.176  0.24345   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2242 on 75 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.3844 
## F-statistic: 6.828 on 9 and 75 DF,  p-value: 4.177e-07
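
The model above does not contain a quadratic term. Here is a minimal sketch of one way to add one, assuming sugarpercent is the variable to square. That choice and the helper name fit_with_quadratic are mine for illustration. No output is shown because this refit is only a sketch.

```r
# Hypothetical helper: refit the model with an added squared sugarpercent term.
# I() makes ^ mean arithmetic squaring instead of a formula operator.
fit_with_quadratic <- function(data) {
  lm(pricepercent ~ bar * sugarpercent + I(sugarpercent^2) + chocolate +
       fruity + caramel + peanutyalmondy + nougat + crispedricewafer,
     data = data)
}

# Only runs when the candy data frame and multi_model from above are loaded
if (exists("candy") && exists("multi_model")) {
  quad_model <- fit_with_quadratic(candy)
  print(summary(quad_model))
  # Nested-model F test: does the squared term improve the fit?
  print(anova(multi_model, quad_model))
}
```

In the formula, bar * sugarpercent already expands to bar + sugarpercent + bar:sugarpercent, so the separate bar term is not needed.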

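The model has nine predictor terms plus an intercept: bar, sugarpercent, chocolate, fruity, caramel, peanutyalmondy, nougat, crispedricewafer, and the interaction bar:sugarpercent. The response, pricepercent, is the candy's unit-price percentile within the dataset, on a 0 to 1 scale.

The intercept (0.221) is the predicted pricepercent for a non-bar candy with none of the listed ingredients and a sugarpercent of 0. The bar coefficient (0.464) is the predicted difference between bars and non-bars when sugarpercent is 0. The sugarpercent coefficient (0.295) is the slope for non-bar candies: moving from the lowest to the highest sugar percentile raises the predicted pricepercent by about 0.295. The interaction coefficient (-0.438) changes that slope for bars to 0.295 - 0.438 = -0.143, but the interaction is not statistically significant (p = 0.243), so there is little evidence that the slopes really differ.

Each ingredient coefficient (chocolate, fruity, caramel, peanutyalmondy, nougat, crispedricewafer) is the predicted difference in pricepercent between candies with and without that ingredient, holding the other terms fixed. None of them is significant at the 0.05 level; crispedricewafer and fruity have the highest p-values (0.753 and 0.694). Among the predictors, sugarpercent has the lowest p-value (0.002), followed by bar (0.026). Overall the model explains about 45% of the variance in pricepercent (adjusted R-squared 0.38).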
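
One way to read the bar:sugarpercent interaction is as a pair of sugarpercent slopes, one for non-bars and one for bars. Here is a minimal sketch, with the estimates copied by hand from the summary output above. The names est, slope_nonbar, slope_bar, and bar_gap are illustrative only.

```r
# Estimates copied by hand from the summary() output above
est <- c(bar = 0.46444, sugarpercent = 0.29501, bar_sugar = -0.43831)

# sugarpercent slope for non-bars (bar = 0) and for bars (bar = 1)
slope_nonbar <- est[["sugarpercent"]]
slope_bar <- est[["sugarpercent"]] + est[["bar_sugar"]]

# Predicted bar vs. non-bar gap in pricepercent at a given sugarpercent
bar_gap <- function(sugar) est[["bar"]] + est[["bar_sugar"]] * sugar

round(c(slope_nonbar = slope_nonbar, slope_bar = slope_bar), 3)
round(bar_gap(c(0, 0.5, 1)), 3)
```

With these estimates, the predicted price premium for bars shrinks as sugarpercent increases. Since the interaction term is not significant, this pattern should be read with caution.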

Residual Analysis

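plot(fitted(multi_model), resid(multi_model), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero residual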

qqnorm(resid(multi_model))
qqline(resid(multi_model))

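Based on the residual analysis, I would say that the linear model is reasonably appropriate. In the normal Q-Q plot, most of the points fall close to the reference line, which suggests that the residuals are approximately normally distributed. The Q-Q plot only addresses normality, though. The residuals-versus-fitted plot is what speaks to linearity and constant variance: a random scatter around zero, with no curve or funnel shape, would support the linear form. The residual summary (minimum -0.397, maximum 0.728) hints at a somewhat longer right tail, so the fit looks adequate rather than perfect.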
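
The visual checks can be backed up with a couple of numeric summaries. Here is a minimal sketch using only base R. The helper name residual_checks is hypothetical, and the Shapiro-Wilk test is my choice of normality check, not part of the original analysis.

```r
# Hypothetical helper: numeric summaries to accompany the residual plots
residual_checks <- function(model) {
  r <- resid(model)
  f <- fitted(model)
  list(
    # Shapiro-Wilk test: a small p-value suggests non-normal residuals
    shapiro_p = shapiro.test(r)$p.value,
    # pricepercent lies between 0 and 1, so fitted values outside
    # that range would be a warning sign for the linear form
    fitted_range = range(f)
  )
}

# Only runs when multi_model from above is available
if (exists("multi_model")) {
  print(residual_checks(multi_model))
}
```

Neither number replaces the plots. They just give a second, less subjective reading of the same residuals.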