datasets::trees
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
## 7 11.0 66 15.6
## 8 11.0 75 18.2
## 9 11.1 80 22.6
## 10 11.2 75 19.9
## 11 11.3 79 24.2
## 12 11.4 76 21.0
## 13 11.4 76 21.4
## 14 11.7 69 21.3
## 15 12.0 75 19.1
## 16 12.9 74 22.2
## 17 12.9 85 33.8
## 18 13.3 86 27.4
## 19 13.7 71 25.7
## 20 13.8 64 24.9
## 21 14.0 78 34.5
## 22 14.2 80 31.7
## 23 14.5 74 36.3
## 24 16.0 72 38.3
## 25 16.3 77 42.6
## 26 17.3 81 55.4
## 27 17.5 82 55.7
## 28 17.9 80 58.3
## 29 18.0 80 51.5
## 30 18.0 80 51.0
## 31 20.6 87 77.0
df <- trees
plot(df$Girth, df$Volume)
cor(df$Girth, df$Volume)
## [1] 0.9671194
cor(df$Girth, df$Height)
## [1] 0.5192801
cor(df$Volume, df$Height)
## [1] 0.5982497
Correlation ranges from -1 to 1 and measures the strength and direction of the linear relationship between two variables. The closer the correlation coefficient is to -1 or 1, the stronger the relationship. The correlation between girth and volume (0.97) is very strong and positive. The correlation between girth and height (0.52) is moderate and positive. The correlation between volume and height (0.60) is also moderate and positive.
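The three pairwise correlations can also be computed in a single call. The sketch below assumes the built-in `trees` data set; `cor_matrix` and `girth_volume_test` are illustrative names that do not appear elsewhere in this analysis. `cor.test()` adds a confidence interval for one pair.

```r
# Pairwise Pearson correlations for all three variables at once
cor_matrix <- cor(datasets::trees)
print(round(cor_matrix, 3))

# Test whether the Girth-Volume correlation differs from zero
# and report a 95% confidence interval for it
girth_volume_test <- cor.test(datasets::trees$Girth, datasets::trees$Volume)
print(girth_volume_test$conf.int)
```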
tree_model <- lm(Volume ~ Girth, data = df)
summary(tree_model)
##
## Call:
## lm(formula = Volume ~ Girth, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
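Volume = -36.9435 + 5.0659 * Girth

- A 1-unit increase in Girth is associated with an increase of 5.0659 units in Volume.
- When Girth is zero, the predicted Volume is -36.9435. A negative volume is not physically meaningful. Girth = 0 lies far outside the observed range (8.3 to 20.6), so the intercept should not be interpreted literally.
- The model explains 93.53% of the variance in Volume (multiple R-squared). The adjusted R-squared is 93.31%.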
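As a rough illustration of how the fitted equation is used, the sketch below refits the same model and predicts Volume for a girth of 12, a value chosen only for illustration. The names `fit`, `new_tree`, `predicted_volume`, `manual_volume`, and `r_squared` are assumptions introduced here.

```r
# Refit the simple linear model on the built-in trees data
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Predict Volume for a hypothetical tree with Girth = 12
new_tree <- data.frame(Girth = 12)
predicted_volume <- predict(fit, newdata = new_tree)

# The same value computed by hand from the estimated coefficients
manual_volume <- coef(fit)[["(Intercept)"]] + coef(fit)[["Girth"]] * 12

# Share of variance explained (multiple R-squared)
r_squared <- summary(fit)$r.squared
print(c(predicted = unname(predicted_volume), manual = manual_volume, r2 = r_squared))
```

`predict()` and the hand calculation should agree, which is a quick check that the equation has been read off the summary correctly.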
library(ggplot2)
ggplot(data = df, aes(x = Girth, y = Volume)) +
  geom_point() +
  stat_smooth(method = "lm", col = "dodgerblue3") +
  theme(panel.background = element_rect(fill = "white"),
        axis.line.x = element_line(),
        axis.line.y = element_line()) +
  ggtitle("Linear Model Fitted to Data")
## `geom_smooth()` using formula = 'y ~ x'
## Residuals
plot(tree_model)
Residuals vs Fitted - The residuals should be scattered evenly above and below the zero line with no clear pattern. There may be a few more points below the line, but the spread is reasonably even.
Normal Q-Q - If the residuals are normally distributed, about 95% of the standardized residuals will lie within 2 standard deviations of the mean. Most of the points are within this range, with only 2 points outside it. The residuals also follow the straight reference line closely, which indicates that they are approximately normally distributed.
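The Q-Q reading can be cross-checked numerically. The sketch below assumes the same `Volume ~ Girth` model. It lists the standardized residuals larger than 2 in absolute value and runs a Shapiro-Wilk test. `fit` and `std_res` are illustrative names.

```r
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Standardized residuals: roughly 95% should fall within +/- 2 under normality
std_res <- rstandard(fit)
print(which(abs(std_res) > 2))

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
print(shapiro.test(residuals(fit)))
```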
Scale-Location - Shows whether the residuals are spread equally across the range of fitted values, which checks homoscedasticity (equal variance of the residuals). The trend line is relatively flat, with a few bumps caused by individual residuals.
Residuals vs Leverage - Shows influential data points that have a large effect on the fitted model. The 31st observation has a large effect on the linear model.
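Leverage and influence can also be inspected directly rather than read off the plot. This is a minimal sketch assuming the same `Volume ~ Girth` model. The names `leverage`, `cooks_d`, `most_leveraged`, and `most_influential` are introduced here for illustration, and the 4 / n cutoff is a common rule of thumb, not a threshold taken from this analysis.

```r
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Leverage (hat values) and Cook's distance for every observation
leverage <- hatvalues(fit)
cooks_d <- cooks.distance(fit)

# Row numbers of the highest-leverage and most influential observations
most_leveraged <- unname(which.max(leverage))
most_influential <- unname(which.max(cooks_d))
print(c(leverage = most_leveraged, cooks = most_influential))

# A common rule of thumb flags points with Cook's distance above 4 / n
print(which(cooks_d > 4 / nrow(datasets::trees)))
```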
# Drop observation 31 (the high-leverage point) and refit the model
df1 <- df[-31, ]
tree_model1 <- lm(Volume ~ Girth, data = df1)
plot(tree_model1)
The graphs look a little better after removing observation 31.
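To see how much the removed point actually moves the fit, the coefficients of the two models can be placed side by side. This is a sketch only. `full_fit`, `reduced_fit`, and `comparison` are illustrative names, and both models are refit here so the block runs on its own.

```r
# Fit with all 31 rows and with row 31 removed
full_fit <- lm(Volume ~ Girth, data = datasets::trees)
reduced_fit <- lm(Volume ~ Girth, data = datasets::trees[-31, ])

# One row of coefficients per model
comparison <- rbind(all_rows = coef(full_fit), without_31 = coef(reduced_fit))
print(comparison)
```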
The Gauss-Markov assumptions seem to hold for the most part. A log transformation may have helped the residual plots look closer to normal.
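One way to try the log idea is a log-log fit. This is only a sketch of one possible transformation: taking logs of both Volume and Girth is an assumption, since the text does not say which variable to transform, and `log_fit` is an illustrative name.

```r
# Fit the model on the log scale for both variables
log_fit <- lm(log(Volume) ~ log(Girth), data = datasets::trees)
print(summary(log_fit)$r.squared)

# Draw the four diagnostic plots in a 2 x 2 grid for comparison
old_par <- par(mfrow = c(2, 2))
plot(log_fit)
par(old_par)
```

The R-squared on the log scale is not directly comparable to that of the original model because the response variable differs, so the diagnostic plots are the more useful comparison.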