Using R, create a simple linear regression model and test its assumptions. You may use any data that interest you.
For this exercise I have picked the built-in dataset trees.
# Display the structure of the trees dataset:
str(trees)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
# Show summary statistics with the summary() function:
summary(trees)
##      Girth           Height       Volume
##  Min.   : 8.30   Min.   :63   Min.   :10.20
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40
##  Median :12.90   Median :76   Median :24.20
##  Mean   :13.25   Mean   :76   Mean   :30.17
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30
##  Max.   :20.60   Max.   :87   Max.   :77.00
==> First, visualize the trees data with a scatter plot. The x-axis is Height and the y-axis is Volume:
plot(trees[,"Height"], trees[,"Volume"], main='trees DATASET', xlab='Height', ylab='Volume')
The plot shows that as tree Height increases, Volume also tends to increase, as expected. A regression model will help us quantify this relationship.
==> Next, the linear model function:
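Before fitting the model, the strength of the trend in the scatter plot can be given a rough number with the Pearson correlation. This is only a minimal sketch; the variable name height_volume_cor is introduced here for illustration and is not part of the original analysis.

```r
# Pearson correlation between Height and Volume in the built-in trees data
height_volume_cor <- cor(trees$Height, trees$Volume)
height_volume_cor

# cor.test() adds a confidence interval and a test of zero correlation
cor.test(trees$Height, trees$Volume)
```

For a simple linear regression with one predictor, the square of this correlation equals the Multiple R-squared that summary() reports for the fitted model.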
# Create a linear model
trees_olm <- lm(Volume ~ Height, data = trees)
trees_olm
##
## Call:
## lm(formula = Volume ~ Height, data = trees)
##
## Coefficients:
## (Intercept)       Height
##     -87.124        1.543
plot(trees[,"Height"], trees[,"Volume"], main='trees DATASET with Linear Model Function', xlab='Height', ylab='Volume')
abline(trees_olm, col="blue")
==> Next, evaluate the quality of the model:
summary(trees_olm)
##
## Call:
## lm(formula = Volume ~ Height, data = trees)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.274  -9.894  -2.894  12.068  29.852
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.1236    29.2731  -2.976 0.005835 **
## Height        1.5433     0.3839   4.021 0.000378 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 29 degrees of freedom
## Multiple R-squared: 0.3579, Adjusted R-squared: 0.3358
## F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784
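The Multiple R-squared value of 0.3579 (Adjusted R-squared 0.3358) tells us that the model explains about 36% of the variation in Volume. The Height coefficient is statistically significant (p = 0.000378), so there is a linear association, but Height alone leaves most of the variation unexplained. The model is therefore not a good fit on its own; adding more predictors would likely improve it.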
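To make the R-squared figure concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below also tries the idea of adding a predictor, using Girth, the only other variable in the dataset. The names ss_res, ss_tot, r2_manual and trees_olm2 are illustrative assumptions, not part of the original analysis.

```r
# Refit the single-predictor model so this block runs on its own
trees_olm <- lm(Volume ~ Height, data = trees)

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(resid(trees_olm)^2)
ss_tot <- sum((trees$Volume - mean(trees$Volume))^2)
r2_manual <- 1 - ss_res / ss_tot
r2_manual

# Hypothetical extension: add Girth as a second predictor
trees_olm2 <- lm(Volume ~ Height + Girth, data = trees)
summary(trees_olm2)$r.squared

# F-test comparing the two nested models
anova(trees_olm, trees_olm2)
```

Since R-squared never decreases when a predictor is added, the Adjusted R-squared and the anova() comparison are the more informative numbers to look at when judging whether the extra predictor is worthwhile.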
==> Next, residual analysis:
hist(trees_olm$residuals)
qqnorm(resid(trees_olm))
qqline(resid(trees_olm))
The histogram of the residuals appears to be approximately normally distributed.
As we can see from the quantile-quantile (Q-Q) plot, most points lie close to the theoretical line, which is consistent with approximately normal residuals. There is some divergence at the lower and upper quantiles, though, which suggests that the tails of the residual distribution depart from normality, so the normality assumption is only approximately met.
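The visual reading of the histogram and Q-Q plot can be complemented with a formal test. The sketch below uses the Shapiro-Wilk test from base R as one possible choice; the original analysis does not use it, and the object name sw is only illustrative.

```r
# Refit the model so this block runs on its own
trees_olm <- lm(Volume ~ Height, data = trees)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(trees_olm))
sw
```

A small p-value (for example below 0.05) would be evidence against normally distributed residuals. With only 31 observations the test has limited power, so it is best read alongside the plots rather than instead of them.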
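# Plotting residuals against Height
library(ggplot2)  # needed for ggplot()
trees$pred <- predict(trees_olm, newdata = trees)
# Residual = observed - predicted (the usual sign convention)
trees$resid <- trees$Volume - trees$pred
ggplot(trees, aes(x = Height, y = resid)) +
  geom_point() +
  theme_bw() +
  geom_hline(yintercept = 0)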
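A residual plot like this is usually read for constant spread around the zero line. As a rough numeric companion, the sketch below hand-rolls a Breusch-Pagan-style check in base R: it regresses the squared residuals on Height and compares n times the R-squared of that auxiliary regression against a chi-squared distribution with one degree of freedom. This is an assumption-laden illustration rather than part of the original analysis, and the names sq_resid, aux, bp_stat and bp_p are hypothetical.

```r
# Refit the model so this block runs on its own
trees_olm <- lm(Volume ~ Height, data = trees)

# Auxiliary regression: squared residuals on the predictor
sq_resid <- resid(trees_olm)^2
aux <- lm(sq_resid ~ trees$Height)

# Studentized Breusch-Pagan-style statistic: n * R-squared of the auxiliary fit
bp_stat <- length(sq_resid) * summary(aux)$r.squared

# Compare against chi-squared with 1 degree of freedom (one predictor)
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)
```

A small p-value would suggest that the residual variance changes with Height, which would violate the constant-variance assumption of the linear model.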