Question

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Solution

Load required libraries

library(tidyverse)
library(kableExtra)

Read the Data into memory

data(trees)

Display the data

The data is a built-in dataset provided by R. It contains the girth (diameter), height, and volume of 31 black cherry trees.

trees_display = kable(trees) %>%
                  kable_paper("hover", full_width = F) %>%
                  scroll_box(width = "850px", height = "350px")
trees_display
Girth Height Volume
8.3 70 10.3
8.6 65 10.3
8.8 63 10.2
10.5 72 16.4
10.7 81 18.8
10.8 83 19.7
11.0 66 15.6
11.0 75 18.2
11.1 80 22.6
11.2 75 19.9
11.3 79 24.2
11.4 76 21.0
11.4 76 21.4
11.7 69 21.3
12.0 75 19.1
12.9 74 22.2
12.9 85 33.8
13.3 86 27.4
13.7 71 25.7
13.8 64 24.9
14.0 78 34.5
14.2 80 31.7
14.5 74 36.3
16.0 72 38.3
16.3 77 42.6
17.3 81 55.4
17.5 82 55.7
17.9 80 58.3
18.0 80 51.5
18.0 80 51.0
20.6 87 77.0


Glimpse of the data

glimpse(trees)
## Rows: 31
## Columns: 3
## $ Girth  <dbl> 8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2, 11.3, …
## $ Height <dbl> 70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75, 74,…
## $ Volume <dbl> 10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 24.…


Summary of the data

summary(trees)
##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00

Create the model terms

trees_transformed <- trees %>%
  mutate(
    Height2 = Height ^ 2,                                     # quadratic term
    HoursCategory = ifelse(Height > mean(Height), 1.0, 0.0),  # dichotomous term: 1 if taller than the mean height, else 0
    Height_Girth = Height * Girth                             # interaction term (product of two quantitative variables)
  )
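The Height_Girth column multiplies two quantitative variables. The question asks for a dichotomous vs. quantitative interaction, which could be built by multiplying a 0/1 indicator by a quantitative variable. The sketch below is a hypothetical variant, not the model fitted in this solution; the names trees_alt, TallCategory, Tall_Girth and lm_alt are illustrative assumptions, and its coefficients are not reported here.

```r
# Hypothetical variant: indicator-by-quantitative interaction.
# This is NOT the model fitted in this solution.
data(trees)

trees_alt <- transform(
  trees,
  Height2      = Height^2,                            # quadratic term
  TallCategory = ifelse(Height > mean(Height), 1, 0)  # dichotomous term: 1 = taller than average
)

# Interaction column: equals Girth for taller-than-average trees, 0 otherwise
trees_alt$Tall_Girth <- trees_alt$TallCategory * trees_alt$Girth

# Formula form of the same idea: TallCategory:Girth lets the Girth slope
# differ between the two height groups
lm_alt <- lm(Volume ~ Girth + Height2 + TallCategory + TallCategory:Girth,
             data = trees_alt)
coef(lm_alt)
```

In such a model, the interaction coefficient would be read as the difference in the Girth slope between taller-than-average trees and the rest.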

Scatter Plot - Q4

p = ggplot(trees_transformed, aes(x = (Height2 + HoursCategory + Height_Girth), y = Volume)) +
    geom_point() +
    theme_minimal() +
    theme(panel.grid.major = element_line(colour = "lemonchiffon3"),
          panel.grid.minor = element_line(colour = "lemonchiffon3"),
          axis.title = element_text(size = 13),
          axis.text = element_text(size = 11),
          axis.text.x = element_text(family = "sans", size = 11),
          axis.text.y = element_text(family = "sans", size = 11),
          plot.title = element_text(size = 15, hjust = 0.5),
          panel.background = element_rect(fill = "gray85"),
          plot.background = element_rect(fill = "antiquewhite")) +
    labs(title = "Volume - Multi Regression",
         x = "Height2 + HoursCategory + Height_Girth", y = "Volume")
p


Multiple Linear Regression - Q4

lm_tree <- lm(Volume ~ Height2 + HoursCategory + Height_Girth, data = trees_transformed)
summary(lm_tree)
## 
## Call:
## lm(formula = Volume ~ Height2 + HoursCategory + Height_Girth, 
##     data = trees_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9613 -2.3670  0.6707  2.4586  5.3090 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -11.736987   5.343963  -2.196  0.03684 *  
## Height2        -0.003460   0.001122  -3.084  0.00467 ** 
## HoursCategory   2.021131   2.001415   1.010  0.32153    
## Height_Girth    0.060037   0.002753  21.807  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.109 on 27 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9642 
## F-statistic: 270.5 on 3 and 27 DF,  p-value: < 2.2e-16

Residual vs Fitted - Q4

plot(fitted(lm_tree),resid(lm_tree), main="Residuals vs Fitted", xlab = "Fitted", ylab = "Residuals")
abline(0, 0)


Q-Q Plot

qqnorm(resid(lm_tree))
qqline(resid(lm_tree))
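The two plots can be complemented with formal checks. The sketch below, which assumes the same model as in this solution and uses base R only, runs a Shapiro-Wilk test on the residuals and computes a studentized Breusch-Pagan statistic by hand. The names sw, aux, bp_stat and bp_p are introduced here for illustration; no results are reported because these tests were not part of the original analysis.

```r
# Rebuild the model from this solution using base R only
data(trees)
trees_transformed <- transform(
  trees,
  Height2       = Height^2,
  HoursCategory = ifelse(Height > mean(Height), 1, 0),
  Height_Girth  = Height * Girth
)
lm_tree <- lm(Volume ~ Height2 + HoursCategory + Height_Girth,
              data = trees_transformed)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal)
sw <- shapiro.test(resid(lm_tree))
print(sw)

# Constant variance: studentized Breusch-Pagan statistic computed by hand.
# Regress the squared residuals on the predictors; n * R^2 is approximately
# chi-squared with 3 degrees of freedom under H0 (constant variance).
trees_transformed$res_sq <- resid(lm_tree)^2
aux <- lm(res_sq ~ Height2 + HoursCategory + Height_Girth,
          data = trees_transformed)
bp_stat <- nrow(trees_transformed) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 3, lower.tail = FALSE)
cat("BP statistic:", bp_stat, " p-value:", bp_p, "\n")
```

Large p-values in both tests would be consistent with the visual reading of the plots; small p-values would call the corresponding assumption into question.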


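Interpret the F-statistic, R^2, residual standard error, and p-values only:
F-statistic: 270.5 on 3 and 27 DF
p-value: < 2.2e-16 - Because the p-value of the overall F-test is far below 0.05, the model as a whole is statistically significant. Among the individual coefficients, Height2 (p = 0.00467) and Height_Girth (p < 2e-16) are significant at the 0.05 level, while HoursCategory (p = 0.32153) is not.
R-squared: 0.9678 - The model accounts for about 97% of the variability in Volume.
Residual standard error: 3.109 on 27 degrees of freedom - The observed Volume typically deviates from the fitted regression values by about 3.109 cubic feet.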
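To make these summary statistics less of a black box, the sketch below recomputes them from the residuals. It assumes the same model as in this solution, rebuilt with base R; the names rss, tss, rse, r2 and f_stat are introduced here for illustration.

```r
# Rebuild the model from this solution using base R only
data(trees)
trees_transformed <- transform(
  trees,
  Height2       = Height^2,
  HoursCategory = ifelse(Height > mean(Height), 1, 0),
  Height_Girth  = Height * Girth
)
lm_tree <- lm(Volume ~ Height2 + HoursCategory + Height_Girth,
              data = trees_transformed)

rss      <- sum(resid(lm_tree)^2)                      # residual sum of squares
tss      <- sum((trees$Volume - mean(trees$Volume))^2) # total sum of squares
df_resid <- df.residual(lm_tree)                       # n - p = 31 - 4 = 27
k        <- 3                                          # number of predictors

rse    <- sqrt(rss / df_resid)                  # residual standard error
r2     <- 1 - rss / tss                         # multiple R-squared
f_stat <- ((tss - rss) / k) / (rss / df_resid)  # overall F-statistic

# These should agree with the values printed by summary(lm_tree)
c(rse = rse, r2 = r2, f_stat = f_stat)
```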
There are four (4) main assumptions for Linear Regression:
  • Linearity: The relationship between X and Y must be linear. As can be seen from the scatter plot above, the combined predictor (Height2 + HoursCategory + Height_Girth) appears to have a linear relationship with Volume, so this condition is satisfied.
  • Homoscedasticity: The residuals should have constant variance. In the Residuals vs Fitted plot shown above, the spread of the residuals appears roughly constant, so the homoscedasticity criterion is satisfied.
  • Normality: The residuals should be normally distributed. In the Q-Q plot shown above, the residuals are nearly normal.
  • Independence: The observations should be independent of each other. This is difficult to determine from the data alone, and we may have to rely on the assumptions provided by the data collector. The residual plot shows no obvious pattern, so we assume this criterion is met.

Since the Linearity, Homoscedasticity, and Normality conditions are satisfied, we can conclude that the assumptions for Linear Regression are met.

Even though this model might suffer from multicollinearity (all three predictors are derived from Height), in general, a linear model seems to be appropriate for the data.
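One way to check the multicollinearity concern is a variance inflation factor (VIF) for each predictor, computed as 1 / (1 - R^2) from regressing that predictor on the others. The sketch below does this by hand in base R, assuming the same predictors as in this solution; the names predictors and vif are introduced here for illustration, and no VIF values are reported because this check was not part of the original analysis.

```r
# Rebuild the predictors from this solution using base R only
data(trees)
trees_transformed <- transform(
  trees,
  Height2       = Height^2,
  HoursCategory = ifelse(Height > mean(Height), 1, 0),
  Height_Girth  = Height * Girth
)

predictors <- c("Height2", "HoursCategory", "Height_Girth")

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = trees_transformed)
  1 / (1 - summary(aux)$r.squared)
})
vif
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, although the threshold is a convention rather than a strict test.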