Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

The data I chose is the candy power ranking dataset from FiveThirtyEight: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking

candy <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
head(candy)
##   competitorname chocolate fruity caramel peanutyalmondy nougat
## 1      100 Grand         1      0       1              0      0
## 2   3 Musketeers         1      0       0              0      1
## 3       One dime         0      0       0              0      0
## 4    One quarter         0      0       0              0      0
## 5      Air Heads         0      1       0              0      0
## 6     Almond Joy         1      0       0              1      0
##   crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1                1    0   1        0        0.732        0.860   66.97173
## 2                0    0   1        0        0.604        0.511   67.60294
## 3                0    0   0        0        0.011        0.116   32.26109
## 4                0    0   0        0        0.011        0.511   46.11650
## 5                0    0   0        0        0.906        0.511   52.34146
## 6                0    0   1        0        0.465        0.767   50.34755

Let’s build a regression model that predicts whether a candy contains chocolate (coded 0 or 1) from its sugar percentage (sugarpercent).

model <- lm(chocolate ~ sugarpercent, data = candy)
summary(model)
## 
## Call:
## lm(formula = chocolate ~ sugarpercent, data = candy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5246 -0.4328 -0.3600  0.5417  0.6464 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    0.3474     0.1069   3.250  0.00167 **
## sugarpercent   0.1837     0.1925   0.954  0.34274   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.499 on 83 degrees of freedom
## Multiple R-squared:  0.01085,    Adjusted R-squared:  -0.001066 
## F-statistic: 0.9105 on 1 and 83 DF,  p-value: 0.3427
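
To make the summary concrete, here is a minimal sketch that evaluates the fitted line by hand. It assumes the rounded coefficients printed above and uses only the six rows shown by head(candy); the variable names (b0, b1, fitted_vals, resid_vals) are my own and purely illustrative.

```r
# Coefficients copied from the summary output above (rounded as printed)
b0 <- 0.3474   # intercept
b1 <- 0.1837   # slope for sugarpercent

# The six rows shown by head(candy)
sugar     <- c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465)
chocolate <- c(1, 1, 0, 0, 0, 1)

# Fitted value = intercept + slope * sugarpercent
fitted_vals <- b0 + b1 * sugar

# Residual = observed - fitted; the observed value is always 0 or 1
resid_vals <- chocolate - fitted_vals

data.frame(sugar, chocolate,
           fitted = round(fitted_vals, 3),
           resid  = round(resid_vals, 3))
```

Because the observed value can only be 0 or 1 while the fitted values stay strictly between them for these rows, every chocolate candy gets a positive residual and every non-chocolate candy gets a negative one.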

Residual analysis of the model

plot(fitted(model), resid(model))

qqnorm(resid(model))
qqline(resid(model))
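
As a possible extension of these two plots, the sketch below wraps them in a small helper that adds a zero reference line and also runs a Shapiro-Wilk normality test on the residuals. The helper name check_residuals is hypothetical, and the data it is demonstrated on is simulated stand-in data with a 0/1 response, not the candy data, so the block runs on its own.

```r
# Hypothetical helper: draws both diagnostic plots with reference lines
# and returns the Shapiro-Wilk test for normality of the residuals.
check_residuals <- function(fit) {
  r <- resid(fit)
  old_par <- par(mfrow = c(1, 2))   # show the two plots side by side
  on.exit(par(old_par))             # restore plotting settings afterwards

  # Residuals vs fitted values, with a dashed horizontal line at zero
  plot(fitted(fit), r, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)

  # Normal Q-Q plot with its reference line
  qqnorm(r)
  qqline(r)

  shapiro.test(r)
}

# Simulated stand-in data (NOT the candy data): a 0/1 response, one predictor
set.seed(1)
x <- runif(85)
y <- rbinom(85, size = 1, prob = 0.4)
sim_fit <- lm(y ~ x)

sw <- check_residuals(sim_fit)
sw$p.value   # a small p-value is evidence against normal residuals
```

Calling check_residuals(model) on the fit from above should give the same kind of picture for the candy data.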

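Based on the residual analysis, I would say that the linear model is not appropriate. In the normal Q-Q plot, the points do not follow the reference line, which tells us that the residuals are not normally distributed. This is expected because chocolate is a binary (0/1) variable: each residual is either 1 minus the fitted value or 0 minus the fitted value, so the residuals fall into two separate bands instead of scattering randomly around zero. The model summary points the same way, since the slope for sugarpercent is not statistically significant (p = 0.343) and the model explains only about 1% of the variance (R-squared = 0.011).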