Set up data

datasets::trees
##    Girth Height Volume
## 1    8.3     70   10.3
## 2    8.6     65   10.3
## 3    8.8     63   10.2
## 4   10.5     72   16.4
## 5   10.7     81   18.8
## 6   10.8     83   19.7
## 7   11.0     66   15.6
## 8   11.0     75   18.2
## 9   11.1     80   22.6
## 10  11.2     75   19.9
## 11  11.3     79   24.2
## 12  11.4     76   21.0
## 13  11.4     76   21.4
## 14  11.7     69   21.3
## 15  12.0     75   19.1
## 16  12.9     74   22.2
## 17  12.9     85   33.8
## 18  13.3     86   27.4
## 19  13.7     71   25.7
## 20  13.8     64   24.9
## 21  14.0     78   34.5
## 22  14.2     80   31.7
## 23  14.5     74   36.3
## 24  16.0     72   38.3
## 25  16.3     77   42.6
## 26  17.3     81   55.4
## 27  17.5     82   55.7
## 28  17.9     80   58.3
## 29  18.0     80   51.5
## 30  18.0     80   51.0
## 31  20.6     87   77.0
df <- trees
plot(df$Girth, df$Volume)

Correlation

cor(df$Girth, df$Volume)
## [1] 0.9671194
cor(df$Girth, df$Height)
## [1] 0.5192801
cor(df$Volume, df$Height)
## [1] 0.5982497

Correlation ranges from -1 to 1 and measures how strongly two variables are linearly related. The closer the correlation coefficient is to -1 or 1, the stronger the relationship. The correlation between girth and volume (0.97) is very strong and positive. The correlation between girth and height (0.52) is moderate and positive. The correlation between volume and height (0.60) is also moderate and positive.
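
As a quick cross-check, the three pairwise correlations can be read from a single correlation matrix, and one coefficient can be reproduced from the definition of Pearson's r (covariance divided by the product of the standard deviations). This is a minimal sketch; `cor_mat`, `x`, `y`, and `r_manual` are illustrative names that are not part of the analysis above.

```r
# Pairwise Pearson correlations for all three columns at once
cor_mat <- cor(datasets::trees)
round(cor_mat, 3)

# Reproduce one coefficient from its definition:
# r = cov(x, y) / (sd(x) * sd(y))
x <- datasets::trees$Girth
y <- datasets::trees$Volume
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Should agree with cor() up to floating-point error
all.equal(r_manual, cor(x, y))
```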

Regression Analysis

tree_model = lm(Volume ~ Girth, data = df)
summary(tree_model)
## 
## Call:
## lm(formula = Volume ~ Girth, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

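Volume = -36.9435 + 5.0659 * Girth

  - A 1-unit increase in Girth is associated with an average increase of 5.0659 units in Volume.
  - When Girth is zero, the predicted Volume is -36.9435. A negative volume is not physically meaningful; zero lies far outside the observed Girth range (8.3 to 20.6), so the intercept should be read only as the anchor of the fitted line.
  - The model explains 93.53% of the variance in Volume (multiple R-squared; the adjusted R-squared is 93.31%).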
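
The fitted equation can be used directly for prediction. The sketch below refits the same model so that it runs on its own, then predicts Volume for a hypothetical tree with Girth = 12 (a value chosen only for illustration) both with `predict()` and by plugging into the equation. The names `fit`, `b`, and `new_tree` are assumptions of this example.

```r
# Refit the simple linear regression so the example is self-contained
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Named coefficients: "(Intercept)" and the "Girth" slope
b <- coef(fit)

# Hypothetical tree with Girth = 12, inside the observed range
new_tree <- data.frame(Girth = 12)
pred_predict <- predict(fit, newdata = new_tree)
pred_manual <- b["(Intercept)"] + b["Girth"] * 12

# Both routes should give the same value
all.equal(unname(pred_predict), unname(pred_manual))

# Share of variance explained, and its adjusted counterpart
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared
round(c(r2 = r2, adj_r2 = adj_r2), 4)
```

Predictions are most trustworthy for Girth values inside the range of the data; the negative intercept shows what happens when the line is extended well beyond it.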

Linear Model Graph

library(ggplot2)
ggplot(data = df, aes(x = Girth, y = Volume)) +
  geom_point() +
  stat_smooth(method = "lm", col = "dodgerblue3") +
  theme(panel.background = element_rect(fill = "white"),
        axis.line.x=element_line(),
        axis.line.y=element_line()) +
  ggtitle("Linear Model Fitted to Data")
## `geom_smooth()` using formula = 'y ~ x'

Residuals

plot(tree_model)

Residuals vs Fitted - The residuals should be scattered evenly above and below the zero line, with no clear pattern. There may be a few more points below the line, but the spread looks even enough.

Normal Q-Q - If the residuals are normally distributed, about 95% of them will lie within 2 standard deviations of the mean. Most of the points are within this range, with only 2 outliers outside it. The points also follow a roughly straight line, which is consistent with normally distributed residuals.
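
The visual reading of the Q-Q plot can be backed up numerically. The sketch below refits the model, counts the share of standardized residuals within 2 standard deviations, and runs a Shapiro-Wilk test. The test is an addition for illustration rather than part of the original analysis, and `std_res`, `share_within_2`, and `sw` are illustrative names.

```r
# Refit the model so the example is self-contained
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Standardized residuals: residuals scaled to roughly unit variance
std_res <- rstandard(fit)

# Proportion within 2 SD; about 0.95 is expected under normality
share_within_2 <- mean(abs(std_res) <= 2)
share_within_2

# Row labels of any observations beyond 2 SD
names(std_res)[abs(std_res) > 2]

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(residuals(fit))
sw$p.value
```

With only 31 observations the test has limited power, so it is best read alongside the plot rather than instead of it.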

Scale-Location - Shows whether the residuals are spread equally across the range of fitted values, which checks homoscedasticity (equal variance of the residuals). The trend line is relatively flat, with some bumps caused by individual residuals.
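
A rough numerical companion to the Scale-Location plot is an auxiliary regression of the squared residuals on the predictor, in the spirit of the studentized (Koenker) Breusch-Pagan test. This is a hand-rolled sketch in base R, not a call to a packaged test, and the names `aux_data`, `aux`, `bp_stat`, and `bp_p` are assumptions of this example.

```r
# Refit the model so the example is self-contained
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Auxiliary regression: squared residuals on the predictor
aux_data <- data.frame(sq_res = residuals(fit)^2,
                       Girth  = datasets::trees$Girth)
aux <- lm(sq_res ~ Girth, data = aux_data)

# Studentized form of the statistic: n * R^2 of the auxiliary fit
n <- nobs(fit)
bp_stat <- n * summary(aux)$r.squared

# Compare with a chi-squared distribution; df = 1 auxiliary predictor
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value would suggest that the residual variance changes with Girth; a large one gives no evidence against constant variance.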

Residuals vs Leverage - Shows influential data points that have a large effect on the fitted model. Observation 31 has a large effect on the fit.
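
Leverage and influence can also be inspected as numbers instead of read off the plot. The sketch below uses `hatvalues()` and `cooks.distance()` from base R; the 4 / n cutoff is a common rule of thumb assumed here for illustration, and `lev`, `cooks`, and `flagged` are illustrative names.

```r
# Refit the model so the example is self-contained
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Leverage: how unusual each observation's Girth is
lev <- hatvalues(fit)

# Cook's distance: how much the fit changes if a point is dropped
cooks <- cooks.distance(fit)

# Row positions of the highest-leverage and most influential points
most_leverage <- unname(which.max(lev))
most_influential <- unname(which.max(cooks))
c(most_leverage = most_leverage, most_influential = most_influential)

# Rule-of-thumb flag (an assumption): Cook's distance above 4 / n
flagged <- which(cooks > 4 / nobs(fit))
flagged
```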

Take 31 out

df1 <- df[-31,]
tree_model1 = lm(Volume ~ Girth, data = df1)
plot(tree_model1)

The diagnostic plots look a little better after removing observation 31.
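
To see how much observation 31 actually moves the fit, the coefficients and residual standard errors of the two models can be placed side by side. This is a minimal sketch; `full`, `reduced`, `coef_compare`, and `sigma_compare` are illustrative names.

```r
# Fit with all 31 trees and with observation 31 removed
full <- lm(Volume ~ Girth, data = datasets::trees)
reduced <- lm(Volume ~ Girth, data = datasets::trees[-31, ])

# Coefficients side by side: one row per model
coef_compare <- rbind(all_rows = coef(full), without_31 = coef(reduced))
coef_compare

# Residual standard error of each fit
sigma_compare <- c(all_rows   = summary(full)$sigma,
                   without_31 = summary(reduced)$sigma)
sigma_compare
```

A large shift in the slope or intercept would confirm that the point is influential. Dropping it still needs a substantive justification, since it is a real measurement.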

4 assumptions

  1. Linearity- The data show a linear trend, as seen in the “Linear Model Fitted to Data” graph.
  2. Nearly normal residuals- The residuals in the Normal Q-Q graph appear to follow a normal distribution. There are a few outliers, but the points as a whole follow a straight line.
  3. Constant variability- The variability of the residuals around the least squares regression line seems roughly constant.
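  4. Independent observations- This assumption is about the trees, not about the relationship between the variables: the measurements for one tree should not depend on the measurements for any other tree. Volume and girth themselves are strongly related (that is what the model describes). Independence cannot be verified from the diagnostic plots; it is plausible if each tree was sampled and measured separately.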

It does seem like the Gauss-Markov assumptions held for the most part. A log transformation might have made the residual plots look closer to normal.
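
One way to try the log idea is sketched below. The text does not say which variables to transform, so taking the log of both Volume and Girth is an assumption of this example, and `log_fit` and `sw_log` are illustrative names.

```r
# Log-log model (assumption: both variables are transformed)
log_fit <- lm(log(Volume) ~ log(Girth), data = datasets::trees)

# Slope is now an elasticity: approximate % change in Volume
# per 1% change in Girth
coef(log_fit)

# Fit quality on the log scale (not directly comparable to the
# R-squared of the untransformed model)
summary(log_fit)$r.squared

# Normality check on the new residuals
sw_log <- shapiro.test(residuals(log_fit))
sw_log$p.value
```

Calling `plot(log_fit)` would redraw the four diagnostic plots for comparison with the untransformed model.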