Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
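The dataset we will use is Orange, a built-in R dataset that records the growth of orange trees. We want to find out whether the age of a tree is linearly related to its trunk circumference. Note that the model below uses age as the response and circumference as the predictor.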
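Before fitting anything, a quick look at the strength of the linear association helps frame the question. The following is a minimal sketch using base R's cor(); the variable name r is only illustrative and is not part of the original analysis.

```r
# Pearson correlation between tree age and trunk circumference
r <- cor(Orange$age, Orange$circumference)
r

# For a linear model with a single predictor, R-squared equals the squared correlation
r^2
```

A correlation close to 1 or -1 suggests a strong linear trend, but it says nothing about whether the residuals behave well, which is why the residual analysis is still needed.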
head(Orange)
##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142
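# Scatter plot of the raw data: age on the x-axis, circumference on the y-axis
plot(Orange$age, Orange$circumference)
# Fit the linear model with age as the response and circumference as the predictor
lm1 <- lm(Orange$age ~ Orange$circumference)
# Replot with the predictor on the x-axis so the fitted line can be overlaid
plot(Orange$circumference, Orange$age)
abline(lm1)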
summary(lm1)
##
## Call:
## lm(formula = Orange$age ~ Orange$circumference)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -317.88 -140.90  -17.20   96.54  471.16
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)           16.6036    78.1406   0.212    0.833
## Orange$circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
## F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14
plot(fitted(lm1), resid(lm1))
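In the residuals-versus-fitted plot, the points are not scattered evenly around 0. A well-fitted linear model would show residuals with no visible pattern and a roughly constant spread, so we conclude that the linear model does not fit these data well, even though the R-squared of 0.8345 is fairly high.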
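The visual judgment above can be made a little easier to read by adding a zero reference line and a smoother, and by comparing the residual spread at low and high fitted values. This is a rough sketch under the assumption that the same model is refit as lm1; the names res, fit, low_half, sd_low, and sd_high are illustrative, and the split at the median fitted value is an informal check, not a formal test.

```r
# Refit the same model so this block runs on its own
lm1 <- lm(Orange$age ~ Orange$circumference)

res <- resid(lm1)
fit <- fitted(lm1)

# Residuals vs fitted values with a zero reference line and a lowess smoother
plot(fit, res, xlab = "Fitted age", ylab = "Residual")
abline(h = 0, lty = 2)
lines(lowess(fit, res), col = "red")

# Informal constant-variance check: residual spread below vs above the median fitted value
low_half <- fit <= median(fit)
sd_low <- sd(res[low_half])
sd_high <- sd(res[!low_half])
c(low = sd_low, high = sd_high)
```

If the smoother bends away from the dashed line, or the two standard deviations differ markedly, that supports the conclusion that the linearity or constant-variance assumption is questionable.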
qqnorm(resid(lm1))
qqline(resid(lm1))
In the Q-Q plot, the points follow the reference line closely, apart from a few outliers. This indicates that the residuals are approximately normally distributed.
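The reading of the Q-Q plot can be complemented with a numerical check. The sketch below assumes the same model is refit as lm1 and uses base R's shapiro.test() and rstandard(); the names sw and std_res and the cutoff of 2 for flagging possible outliers are illustrative choices, not part of the original analysis.

```r
# Refit the same model so this block runs on its own
lm1 <- lm(Orange$age ~ Orange$circumference)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(resid(lm1))
sw$p.value

# Standardized residuals larger than 2 in absolute value flag possible outliers
std_res <- rstandard(lm1)
which(abs(std_res) > 2)
```

A large p-value means the test gives no evidence against normality, which would agree with the Q-Q plot; a small one would suggest the deviations at the tails matter.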