Objective

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

The dataset we will use is Orange, a built-in R dataset that records the growth of orange trees. We want to find out whether the age of a tree is linearly related to the circumference of its trunk.
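Before modeling, it can help to check how the data are laid out. The sketch below assumes the built-in `Orange` data frame from R's `datasets` package, where, according to the R documentation, `age` is recorded in days and `circumference` in millimetres. The names `n_obs` and `per_tree` are introduced here only for illustration.

```r
# Load and inspect the built-in Orange data set
data(Orange)
str(Orange)                      # column types and the first few values

n_obs <- nrow(Orange)            # total number of measurements
per_tree <- table(Orange$Tree)   # number of measurements recorded for each tree

print(n_obs)
print(per_tree)
```

Each tree is measured several times, so the observations are repeated measures rather than fully independent draws. That is worth keeping in mind when judging whether a simple linear model is appropriate.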

Data Visualization

head(Orange)
##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142
# Scatterplot of circumference (y-axis) against age (x-axis)
plot(Orange$age, Orange$circumference)

# Fit a simple linear regression with age as the response and circumference as the predictor
lm1 <- lm(Orange$age ~ Orange$circumference)

# Scatterplot of age against circumference with the fitted regression line overlaid
plot(Orange$circumference, Orange$age)
abline(lm1)

summary(lm1)
## 
## Call:
## lm(formula = Orange$age ~ Orange$circumference)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.88 -140.90  -17.20   96.54  471.16 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           16.6036    78.1406   0.212    0.833    
## Orange$circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14
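
The model above treats age as the response and circumference as the predictor. Since the stated question is whether age affects circumference, the regression could also be fitted in the other direction. The following is a minimal sketch of that alternative, not part of the original analysis; `lm2`, `r2_lm1` and `r2_lm2` are names chosen here for illustration. In simple linear regression with one predictor, R-squared is the squared correlation between the two variables, so it is the same in both directions.

```r
# Same fit as the model above, written with the data argument
lm1 <- lm(age ~ circumference, data = Orange)

# Assumption: circumference is treated as the response and age as the predictor
lm2 <- lm(circumference ~ age, data = Orange)

# Extract R-squared from both fits; they agree in simple linear regression
r2_lm1 <- summary(lm1)$r.squared
r2_lm2 <- summary(lm2)$r.squared
print(c(r2_lm1, r2_lm2))

# Scatterplot matching the direction of lm2, with its fitted line
plot(Orange$age, Orange$circumference)
abline(lm2)
```

The slope and intercept differ between the two fits because they are expressed in different units, so the coefficients should be read against whichever variable is on the left-hand side of the formula.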

Residual Analysis

# Residuals plotted against fitted values
plot(fitted(lm1), resid(lm1))

In the plot of residuals against fitted values, the points are not scattered evenly around 0. This suggests that the linear model's assumption of constant error variance may not hold, so we conclude that the model is not well fitted.
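
The visual impression can be made a little easier to judge with a reference line and a smoother. The sketch below refits the same model so that it runs on its own, then adds a dashed line at 0 and a lowess curve to the residual plot. The Spearman correlation between the absolute residuals and the fitted values is only a rough, informal indicator of changing spread chosen here for illustration, not a formal test; `spread_trend` is a hypothetical name.

```r
# Refit the model so this block is self-contained
lm1 <- lm(Orange$age ~ Orange$circumference)

# Residuals vs fitted values with a zero reference line and a lowess smoother
plot(fitted(lm1), resid(lm1), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(fitted(lm1), resid(lm1)), col = "red")

# Informal check: does the size of the residuals trend with the fitted values?
spread_trend <- cor(abs(resid(lm1)), fitted(lm1), method = "spearman")
print(spread_trend)
```

A smoother that stays close to the dashed line and a correlation near 0 would be consistent with evenly scattered residuals. A clear bend or a correlation far from 0 would point the other way.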

qqnorm(resid(lm1))
qqline(resid(lm1))

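In the Q-Q plot for the model, the points follow a straight line closely, with a few outliers. This indicates that the residuals are approximately normally distributed. Taken together with the residual plot, the normality assumption looks reasonable, but the uneven scatter of the residuals means the linear model is not fully appropriate for these data.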
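
A Q-Q plot is a visual check, and it can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test from base R (`shapiro.test`) to the residuals. This test is an addition for illustration and was not part of the original analysis; `sw` is a name chosen here.

```r
# Refit the model so this block is self-contained
lm1 <- lm(Orange$age ~ Orange$circumference)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(resid(lm1))
print(sw)
```

A large p-value means the test does not reject normality. It does not prove that the residuals are normal, so it is best read alongside the Q-Q plot rather than in place of it.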