Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
I chose to use the candy dataset from FiveThirtyEight again: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking
candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
head(candy)
## competitorname chocolate fruity caramel peanutyalmondy nougat
## 1 100 Grand 1 0 1 0 0
## 2 3 Musketeers 1 0 0 0 1
## 3 One dime 0 0 0 0 0
## 4 One quarter 0 0 0 0 0
## 5 Air Heads 0 1 0 0 0
## 6 Almond Joy 1 0 0 1 0
## crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1 1 0 1 0 0.732 0.860 66.97173
## 2 0 0 1 0 0.604 0.511 67.60294
## 3 0 0 0 0 0.011 0.116 32.26109
## 4 0 0 0 0 0.011 0.511 46.11650
## 5 0 0 0 0 0.906 0.511 52.34146
## 6 0 0 1 0 0.465 0.767 50.34755
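Let’s build a multiple regression model that predicts a candy’s price percentile (pricepercent) from whether it is a candy bar (bar), its sugar percentile (sugarpercent), the interaction between the two, and several dichotomous ingredient indicators.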
multi_model <- lm(pricepercent ~ bar + bar * sugarpercent + chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer, data = candy)
summary(multi_model)
##
## Call:
## lm(formula = pricepercent ~ bar + bar * sugarpercent + chocolate +
## fruity + caramel + peanutyalmondy + nougat + crispedricewafer,
## data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39739 -0.13889 -0.01221 0.11147 0.72789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.22076 0.07852 2.811 0.00629 **
## bar 0.46444 0.20388 2.278 0.02557 *
## sugarpercent 0.29501 0.09203 3.205 0.00198 **
## chocolate 0.10382 0.08123 1.278 0.20518
## fruity -0.03077 0.07798 -0.395 0.69430
## caramel 0.07520 0.08390 0.896 0.37297
## peanutyalmondy 0.09518 0.07436 1.280 0.20453
## nougat -0.15867 0.11939 -1.329 0.18787
## crispedricewafer 0.03504 0.11115 0.315 0.75344
## bar:sugarpercent -0.43831 0.37281 -1.176 0.24345
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2242 on 75 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.3844
## F-statistic: 6.828 on 9 and 75 DF, p-value: 4.177e-07
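The model has ten coefficients: the intercept plus nine predictor terms (bar, sugarpercent, chocolate, fruity, caramel, peanutyalmondy, nougat, crispedricewafer, and the interaction bar:sugarpercent). pricepercent is the response, not a predictor. The intercept (0.221) is the predicted price percentile for a candy that is not a bar, has a sugar percentile of 0, and has none of the listed ingredients. The sugarpercent coefficient (0.295) is the change in pricepercent per unit increase in sugarpercent for candies that are not bars. The bar coefficient (0.464) is the difference between bars and non-bars when sugarpercent is 0, and the bar:sugarpercent coefficient (-0.438) is how much the sugarpercent slope changes for bars. Each ingredient coefficient is the difference in predicted pricepercent between candies with and without that ingredient, holding the other terms fixed. At the 0.05 level only the intercept, bar, and sugarpercent are significant. Sugarpercent has the lowest p-value (0.002), while crispedricewafer (0.753) and fruity (0.694) have the highest. The interaction term is not significant (p = 0.243). The adjusted R-squared is 0.38, so the model explains a bit more than a third of the variation in pricepercent.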
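To make the interaction term concrete, the arithmetic can be worked out directly from the estimates printed in the summary above. This is only a sketch: the three numbers are copied by hand from that output rather than re-estimated, and `bar_gap` is a hypothetical helper name introduced for illustration.

```r
# Estimates copied from the summary output above (not re-estimated here)
b_bar         <- 0.46444   # bar
b_sugar       <- 0.29501   # sugarpercent
b_interaction <- -0.43831  # bar:sugarpercent

# Slope of sugarpercent within each group
slope_non_bar <- b_sugar                  # candies with bar = 0
slope_bar     <- b_sugar + b_interaction  # candies with bar = 1

# Predicted bar vs. non-bar difference in pricepercent at a given sugar level
# (bar_gap is a hypothetical helper, not part of the original analysis)
bar_gap <- function(sugar) b_bar + b_interaction * sugar

slope_non_bar
slope_bar
bar_gap(c(0, 0.5, 1))
```

The point estimates suggest the sugar slope flips sign for bars and that the price gap between bars and non-bars shrinks as sugarpercent rises, but since the interaction is not significant this pattern should be read with caution.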
Residual Analysis
plot(fitted(multi_model), resid(multi_model))
qqnorm(resid(multi_model))
qqline(resid(multi_model))
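The same diagnostics can be made a little easier to read with axis labels and a zero reference line. The sketch below refits the model so that it runs on its own. The Shapiro-Wilk test is an addition that was not part of the original analysis, included only as an optional numeric complement to the Q-Q plot.

```r
# Reload the data and refit the same model so this block is self-contained
candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
multi_model <- lm(pricepercent ~ bar + bar * sugarpercent + chocolate + fruity +
                    caramel + peanutyalmondy + nougat + crispedricewafer,
                  data = candy)

res <- resid(multi_model)
fit <- fitted(multi_model)

# Residuals vs. fitted: look for curvature or a funnel shape around zero
plot(fit, res,
     xlab = "Fitted pricepercent", ylab = "Residual",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

# Q-Q plot: points near the line suggest approximately normal residuals
qqnorm(res)
qqline(res)

# Optional formal check of normality (assumption: added here for illustration)
shapiro.test(res)
```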
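Based on the residual analysis, I would say that the linear model is appropriate. In the Q-Q plot, most of the points fall close to the reference line, which tells us that the residuals are approximately normally distributed. Normality is only one of the conditions, though: the residuals-versus-fitted plot is where curvature or non-constant variance would show up, so it should also look like a patternless band around zero.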
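The model above has no quadratic term, although the exercise calls for one. One way to add it, and at the same time test whether a straight-line effect of sugarpercent is adequate, is to include a squared sugarpercent term and compare the two nested fits. This is a sketch under assumptions: the choice of sugarpercent as the squared variable and the names `linear_model` and `quad_model` are illustrative, and no results are reported here because the extended model was not run in the original analysis.

```r
# Reload the data so this block is self-contained
candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")

# Same model as in the analysis above
linear_model <- lm(pricepercent ~ bar + bar * sugarpercent + chocolate + fruity +
                     caramel + peanutyalmondy + nougat + crispedricewafer,
                   data = candy)

# Add a quadratic term; I() makes ^ mean arithmetic squaring inside a formula
quad_model <- update(linear_model, . ~ . + I(sugarpercent^2))

summary(quad_model)

# F-test for the nested models: a small p-value would favor keeping the squared term
anova(linear_model, quad_model)
```

If the squared term turns out not to be significant, that would support treating the effect of sugarpercent as linear.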