Discussion 11 and 12

Using R, create a simple linear regression model and test its assumptions. You may use any data that interest you.

For this exercise I have picked R's built-in dataset, trees.

#Displaying the trees dataset contents:
str(trees)
## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
#Showing summary statistics with the summary() function:
summary(trees)
##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00

==> First, visualize the trees data using a scatter plot. The x-axis is Height and the y-axis is Volume:

plot(trees[,"Height"], trees[,"Volume"], main='trees DATASET', xlab='Height', ylab='Volume')

The plot shows that as tree Height increases, Volume also tends to increase, as expected. A regression model will help us quantify this relationship.
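
As a quick numeric companion to the scatter plot, the sample correlation between Height and Volume can be computed with base R's cor() and cor.test(). This is only a sketch; the variable name height_volume_cor is an illustrative choice and not part of the original analysis.

```r
# Pearson correlation between Height and Volume in the built-in trees data
height_volume_cor <- cor(trees$Height, trees$Volume)
height_volume_cor

# cor.test() adds a confidence interval and a test of zero correlation
cor.test(trees$Height, trees$Volume)
```

For a regression with a single predictor, the square of this correlation equals the Multiple R-squared that summary() reports for the fitted model.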


==> Next, the linear model function:

# Create a linear model
trees_olm <- lm(Volume ~ Height, data = trees)
trees_olm
## 
## Call:
## lm(formula = Volume ~ Height, data = trees)
## 
## Coefficients:
## (Intercept)       Height  
##     -87.124        1.543
plot(trees[,"Height"], trees[,"Volume"], main='trees DATASET with Linear Model Function', xlab='Height', ylab='Volume')

abline(trees_olm, col="blue")
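
To make the coefficients concrete, the fitted line can be used to predict Volume for a given Height. The sketch below refits the same model so it runs on its own; new_heights, predicted_volume, manual_volume and the chosen heights are assumptions for illustration only.

```r
# Refit the model from above so this snippet is self-contained
trees_olm <- lm(Volume ~ Height, data = trees)

# Hypothetical heights (inside the observed range of 63 to 87) to predict for
new_heights <- data.frame(Height = c(70, 76, 80))

# Point predictions from the fitted line
predicted_volume <- predict(trees_olm, newdata = new_heights)
predicted_volume

# The same values computed by hand: Intercept + slope * Height
manual_volume <- coef(trees_olm)[["(Intercept)"]] +
  coef(trees_olm)[["Height"]] * new_heights$Height
manual_volume

# 95% confidence intervals for the intercept and the slope
confint(trees_olm)
```

The slope of about 1.54 means that each additional unit of Height is associated with roughly 1.54 more units of Volume on average. The negative intercept has no physical meaning here, because a Height of 0 lies far outside the range of the data.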


==> Next, quality evaluation of the model:

summary(trees_olm)
## 
## Call:
## lm(formula = Volume ~ Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.274  -9.894  -2.894  12.068  29.852 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.1236    29.2731  -2.976 0.005835 ** 
## Height        1.5433     0.3839   4.021 0.000378 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.4 on 29 degrees of freedom
## Multiple R-squared:  0.3579, Adjusted R-squared:  0.3358 
## F-statistic: 16.16 on 1 and 29 DF,  p-value: 0.0003784

The Multiple R-squared value of 0.3579 and Adjusted R-squared of 0.3358 tell us that the model explains about 36% of the variation in Volume. The Height coefficient is statistically significant (p = 0.000378), but Height alone leaves most of the variation unexplained, so the model is not a good fit. Perhaps a larger set of predictors would help make this model better.
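
One way to explore the idea of adding predictors is to include Girth, the other column in trees, and compare the two fits. This is a sketch rather than part of the original analysis; the name trees_mlm is hypothetical.

```r
# Original one-predictor model, refit so the snippet runs on its own
trees_olm <- lm(Volume ~ Height, data = trees)

# Hypothetical extended model that adds Girth as a second predictor
trees_mlm <- lm(Volume ~ Height + Girth, data = trees)

# Compare adjusted R-squared, which penalizes extra predictors
summary(trees_olm)$adj.r.squared
summary(trees_mlm)$adj.r.squared

# F-test for whether adding Girth significantly improves the fit
anova(trees_olm, trees_mlm)
```

Adjusted R-squared is used for the comparison because plain R-squared can never decrease when a predictor is added, even an uninformative one.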


==> Next, residual analysis:

hist(trees_olm$residuals)

qqnorm(resid(trees_olm))
qqline(resid(trees_olm))

The histogram of the residuals appears to be roughly normally distributed.
As we can see from the quantile-quantile (QQ) plot, most points lie close to the theoretical qqline, which suggests that the residuals (not the observed data themselves) are approximately normal. There is some divergence, though, towards the lower and upper quantiles. That suggests the tails depart from normality and is a further sign that our model is not a good fit.
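
The visual check can be backed up with a formal normality test. The sketch below applies the Shapiro-Wilk test from base R to the residuals; the name sw_result is an assumption for illustration, and the test is an addition to the original analysis.

```r
# Refit the model so the snippet is self-contained
trees_olm <- lm(Volume ~ Height, data = trees)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw_result <- shapiro.test(resid(trees_olm))
sw_result
```

A small p-value (for example below 0.05) would be evidence against normality. With only 31 observations the test has limited power, so it is best read alongside the histogram and QQ plot, not instead of them.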


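# Plotting residuals against Height
library(ggplot2)  # needed for ggplot()

trees$pred <- predict(trees_olm, newdata = trees)
# Residual = observed - predicted, matching resid(trees_olm)
trees$resid <- trees$Volume - trees$pred

ggplot(trees, aes(x = Height, y = resid)) +
  geom_point() +
  theme_bw() +
  geom_hline(yintercept = 0)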
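
The residual plot against Height can be complemented with the standard diagnostic plots that base R produces for an lm object, which help check linearity and constant variance. This is a minimal sketch using only base graphics; old_par and residual_sum are illustrative names.

```r
# Refit the model so the snippet is self-contained
trees_olm <- lm(Volume ~ Height, data = trees)

# Arrange the four default lm diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
old_par <- par(mfrow = c(2, 2))
plot(trees_olm)
par(old_par)  # restore the previous plotting layout

# Sanity check: least-squares residuals with an intercept sum to about zero
residual_sum <- sum(resid(trees_olm))
residual_sum
```

In the Residuals vs Fitted panel, a curved trend would point to non-linearity, and a funnel shape would point to non-constant variance.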