Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
The data I chose is the candy dataset from FiveThirtyEight: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking
candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
head(candy)
##   competitorname chocolate fruity caramel peanutyalmondy nougat
## 1      100 Grand         1      0       1              0      0
## 2   3 Musketeers         1      0       0              0      1
## 3       One dime         0      0       0              0      0
## 4    One quarter         0      0       0              0      0
## 5      Air Heads         0      1       0              0      0
## 6     Almond Joy         1      0       0              1      0
##   crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1                1    0   1        0        0.732        0.860   66.97173
## 2                0    0   1        0        0.604        0.511   67.60294
## 3                0    0   0        0        0.011        0.116   32.26109
## 4                0    0   0        0        0.011        0.511   46.11650
## 5                0    0   0        0        0.906        0.511   52.34146
## 6                0    0   1        0        0.465        0.767   50.34755
Let’s build a regression model that predicts whether a candy contains chocolate (coded 0 or 1) from its sugar percentile.
model <- lm(chocolate ~ sugarpercent, data = candy)
summary(model)
##
## Call:
## lm(formula = chocolate ~ sugarpercent, data = candy)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.5246 -0.4328 -0.3600  0.5417  0.6464
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.3474     0.1069   3.250  0.00167 **
## sugarpercent   0.1837     0.1925   0.954  0.34274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.499 on 83 degrees of freedom
## Multiple R-squared: 0.01085, Adjusted R-squared: -0.001066
## F-statistic: 0.9105 on 1 and 83 DF, p-value: 0.3427
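One way to put the slope estimate in context is a confidence interval. The sketch below refits the same model and calls `confint()`. Since the summary reports p = 0.343 for `sugarpercent`, the 95% interval for the slope should include zero. The names `ci` and `slope_includes_zero` are introduced here only for illustration.

```r
# Reload the data and refit the model so this block runs on its own.
candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
model <- lm(chocolate ~ sugarpercent, data = candy)

# 95% confidence intervals for the intercept and the slope.
ci <- confint(model, level = 0.95)
print(ci)

# TRUE when the slope interval contains zero, i.e. no significant linear trend.
slope_includes_zero <- ci["sugarpercent", 1] < 0 && ci["sugarpercent", 2] > 0
print(slope_includes_zero)
```

An interval that straddles zero means the data are compatible with no linear relationship between sugar percentile and chocolate at all.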
Residual analysis of the model
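# Residuals vs. fitted values: look for patterns or non-constant spread
plot(fitted(model), resid(model))
# Normal Q-Q plot of the residuals with a reference line
qqnorm(resid(model))
qqline(resid(model))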
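The plots above can be backed up with a numeric check. The following is a rough sketch, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and redraws the residuals-versus-fitted plot with points colored by whether the candy contains chocolate. The names `res` and `sw` and the color choices are assumptions made for illustration.

```r
# Reload the data and refit the model so this block runs on its own.
candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
model <- lm(chocolate ~ sugarpercent, data = candy)
res <- resid(model)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals.
sw <- shapiro.test(res)
print(sw)

# Because chocolate is 0 or 1, each residual is either -fitted or 1 - fitted,
# so the points split into two parallel bands, one per class.
plot(fitted(model), res,
     col = ifelse(candy$chocolate == 1, "brown", "grey40"),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```

Coloring by class makes it easier to see that the structure in the residuals comes from the 0/1 response itself rather than from any particular candy.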
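Based on the residual analysis, I would say that the linear model is not appropriate. In the Q-Q plot, the points do not follow the reference line, which tells us that the residuals are not normally distributed. The residuals-versus-fitted plot shows two separate bands instead of a random scatter around zero. Both results are expected because chocolate is a binary (0/1) variable, so it is not a suitable response for a linear regression. The summary output points the same way: the slope for sugarpercent is not significant (p = 0.343) and the R-squared is only about 0.011, so sugar percentile explains almost none of the variation in chocolate.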
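Because `chocolate` only takes the values 0 and 1, a logistic regression is the usual alternative to a linear model for this kind of response. The following is a minimal sketch, not part of the original analysis, using the same data URL. The names `logit_model`, `new_candy`, and `probs` are made up for illustration, and the three sugar percentiles are arbitrary example inputs.

```r
# Reload the data so this block runs on its own.
candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")

# Logistic regression: models the log-odds of containing chocolate.
logit_model <- glm(chocolate ~ sugarpercent, data = candy, family = binomial)
print(summary(logit_model))

# Predicted probability of chocolate at a few example sugar percentiles.
new_candy <- data.frame(sugarpercent = c(0.1, 0.5, 0.9))
probs <- predict(logit_model, newdata = new_candy, type = "response")
print(probs)
```

Unlike the fitted values from `lm()`, these predictions are constrained to lie between 0 and 1, so they can be read as probabilities. A better-suited model form does not by itself make sugar percentile a useful predictor of chocolate.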