Using R, build a multiple regression model for data that interests
you. Include in this model at least one quadratic term, one dichotomous
term, and one dichotomous vs. quantitative interaction term. Interpret
all coefficients. Conduct residual analysis. Was the linear model
appropriate? Why or why not?
Load required libraries
library(tidyverse)
library(kableExtra)
Read the Data into memory
data(trees)
Display the data
The data is a built-in dataset provided by R. It records the diameter (labelled Girth, in inches), height (in feet), and volume (in cubic feet) of black cherry trees.
trees_display <- kable(trees) %>%
  kable_paper("hover", full_width = FALSE) %>%
  scroll_box(width = "850px", height = "350px")
trees_display
| Girth | Height | Volume |
|---|---|---|
| 8.3 | 70 | 10.3 |
| 8.6 | 65 | 10.3 |
| 8.8 | 63 | 10.2 |
| 10.5 | 72 | 16.4 |
| 10.7 | 81 | 18.8 |
| 10.8 | 83 | 19.7 |
| 11.0 | 66 | 15.6 |
| 11.0 | 75 | 18.2 |
| 11.1 | 80 | 22.6 |
| 11.2 | 75 | 19.9 |
| 11.3 | 79 | 24.2 |
| 11.4 | 76 | 21.0 |
| 11.4 | 76 | 21.4 |
| 11.7 | 69 | 21.3 |
| 12.0 | 75 | 19.1 |
| 12.9 | 74 | 22.2 |
| 12.9 | 85 | 33.8 |
| 13.3 | 86 | 27.4 |
| 13.7 | 71 | 25.7 |
| 13.8 | 64 | 24.9 |
| 14.0 | 78 | 34.5 |
| 14.2 | 80 | 31.7 |
| 14.5 | 74 | 36.3 |
| 16.0 | 72 | 38.3 |
| 16.3 | 77 | 42.6 |
| 17.3 | 81 | 55.4 |
| 17.5 | 82 | 55.7 |
| 17.9 | 80 | 58.3 |
| 18.0 | 80 | 51.5 |
| 18.0 | 80 | 51.0 |
| 20.6 | 87 | 77.0 |
Glimpse of the data
glimpse(trees)
## Rows: 31
## Columns: 3
## $ Girth <dbl> 8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2, 11.3, …
## $ Height <dbl> 70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75, 74,…
## $ Volume <dbl> 10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 24.…
Summary of the data
summary(trees)
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
Create the model terms
trees_transformed <- trees %>%
  mutate(
    Height2 = Height^2,                                       # quadratic term
    HoursCategory = ifelse(Height > mean(Height), 1.0, 0.0),  # dichotomous term: 1 if Height is above the mean, else 0
    Height_Girth = Height * Girth                             # interaction term (both Height and Girth are quantitative)
  )
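The assignment asks for a dichotomous vs. quantitative interaction, whereas `Height_Girth` multiplies two quantitative variables. As a minimal sketch of what such a term could look like, the block below builds a 0/1 indicator and multiplies it by `Girth`. The names `trees_alt`, `Tall`, `Tall_Girth`, and `lm_alt` are hypothetical and are not used elsewhere in this analysis.

```r
# Sketch only: a dichotomous x quantitative interaction on the built-in trees data.
# trees_alt, Tall, Tall_Girth and lm_alt are illustrative names (assumptions).
data(trees)

trees_alt <- trees
# Dichotomous term: 1 if the tree is taller than the mean height, else 0
trees_alt$Tall <- as.numeric(trees_alt$Height > mean(trees_alt$Height))
# Interaction term: equals Girth for tall trees and 0 for the rest
trees_alt$Tall_Girth <- trees_alt$Tall * trees_alt$Girth

# The interaction coefficient estimates how much the Girth slope
# differs between tall trees and the remaining trees
lm_alt <- lm(Volume ~ Girth + Tall + Tall_Girth, data = trees_alt)
coef(lm_alt)
```

In a model of this form, the coefficient on `Girth` is the slope for trees at or below the mean height, and the coefficient on `Tall_Girth` is the additional slope for taller trees.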
Scatter Plot - Q4
p <- ggplot(trees_transformed,
            aes(x = Height2 + HoursCategory + Height_Girth, y = Volume)) +
  geom_point() +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(colour = "lemonchiffon3"),
    panel.grid.minor = element_line(colour = "lemonchiffon3"),
    axis.title = element_text(size = 13),
    axis.text = element_text(size = 11),
    axis.text.x = element_text(family = "sans", size = 11),
    axis.text.y = element_text(family = "sans", size = 11),
    plot.title = element_text(size = 15, hjust = 0.5),
    panel.background = element_rect(fill = "gray85"),
    plot.background = element_rect(fill = "antiquewhite")
  ) +
  labs(title = "Volume - Multi Regression",
       x = "Height2 + HoursCategory + Height_Girth", y = "Volume")
p
Multiple Linear Regression - Q4
lm_tree <- lm(Volume ~ Height2 + HoursCategory + Height_Girth, data = trees_transformed)
summary(lm_tree)
##
## Call:
## lm(formula = Volume ~ Height2 + HoursCategory + Height_Girth,
## data = trees_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9613 -2.3670 0.6707 2.4586 5.3090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.736987 5.343963 -2.196 0.03684 *
## Height2 -0.003460 0.001122 -3.084 0.00467 **
## HoursCategory 2.021131 2.001415 1.010 0.32153
## Height_Girth 0.060037 0.002753 21.807 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.109 on 27 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9642
## F-statistic: 270.5 on 3 and 27 DF, p-value: < 2.2e-16
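To help interpret the coefficients, one option is to put each estimate next to a 95% confidence interval. The sketch below refits the same model with base R so that it runs on its own; `coef_table` is a hypothetical name introduced here.

```r
# Sketch: coefficient estimates with 95% confidence intervals.
# The model is rebuilt with base R so this block is self-contained;
# coef_table is an illustrative name (assumption).
data(trees)
trees_transformed <- transform(
  trees,
  Height2       = Height^2,
  HoursCategory = ifelse(Height > mean(Height), 1, 0),
  Height_Girth  = Height * Girth
)
lm_tree <- lm(Volume ~ Height2 + HoursCategory + Height_Girth,
              data = trees_transformed)

# One row per coefficient: point estimate, lower bound, upper bound
coef_table <- cbind(Estimate = coef(lm_tree),
                    confint(lm_tree, level = 0.95))
round(coef_table, 4)
```

Reading the summary above: with the other terms held fixed, a one-unit increase in `Height_Girth` is associated with about 0.06 cubic feet more volume, and a one-unit increase in `Height2` with about 0.0035 cubic feet less. The `HoursCategory` estimate of about 2.02 is the shift for trees taller than the mean, but with p = 0.32 it is not statistically distinguishable from zero. The intercept has no practical meaning, because no tree has zero height and girth.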
Residuals vs Fitted - Q4
plot(fitted(lm_tree),resid(lm_tree), main="Residuals vs Fitted", xlab = "Fitted", ylab = "Residuals")
abline(0, 0)
Q-Q Plot
qqnorm(resid(lm_tree))
qqline(resid(lm_tree))
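A formal test can complement the visual read of the Q-Q plot. The sketch below applies the Shapiro-Wilk test from base R's `stats` package to the residuals. The model is refit so that the block runs on its own, and `sw` is a hypothetical name.

```r
# Sketch: Shapiro-Wilk normality test on the model residuals.
# The model is rebuilt with base R so this block is self-contained;
# sw is an illustrative name (assumption).
data(trees)
trees_transformed <- transform(
  trees,
  Height2       = Height^2,
  HoursCategory = ifelse(Height > mean(Height), 1, 0),
  Height_Girth  = Height * Girth
)
lm_tree <- lm(Volume ~ Height2 + HoursCategory + Height_Girth,
              data = trees_transformed)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(resid(lm_tree))
sw
```

A small p-value would be evidence against normal residuals. With only 31 observations the test has limited power, so it is best read together with the plot rather than in place of it.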
Volume and the combined predictor (Height2 + HoursCategory + Height_Girth) appear to have a
linear relationship, so the linearity condition is satisfied.
Even though this model might suffer from multicollinearity, a linear model in general seems to be appropriate for the data.
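The multicollinearity concern can be checked with variance inflation factors (VIFs). The sketch below computes them by hand in base R: each predictor is regressed on the other two and VIF = 1 / (1 - R²). The names `predictors` and `vif_manual` are hypothetical, and the model terms are rebuilt so that the block runs on its own.

```r
# Sketch: variance inflation factors computed manually with base R.
# predictors and vif_manual are illustrative names (assumptions).
data(trees)
trees_transformed <- transform(
  trees,
  Height2       = Height^2,
  HoursCategory = ifelse(Height > mean(Height), 1, 0),
  Height_Girth  = Height * Girth
)

predictors <- c("Height2", "HoursCategory", "Height_Girth")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Regress predictor p on the remaining predictors
  aux <- lm(reformulate(others, response = p), data = trees_transformed)
  # VIF = 1 / (1 - R^2) of that auxiliary regression
  1 / (1 - summary(aux)$r.squared)
})
vif_manual
```

A VIF close to 1 suggests little collinearity. Common rules of thumb treat values above roughly 5 or 10 as a warning sign, although those cutoffs are conventions rather than strict limits.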