Linear Regression

Assignment instructions

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Introduction of Concepts

Multiple Regression, or Multiple Linear Regression, provides an expected value of a response, or dependent, variable using multiple explanatory, or independent, variables.

A quadratic term is a second column created by raising another column to the power of 2. This lets you use linear regression to model a specific non-linear term. If the power of 2 isn’t the optimal non linear relationship you could use polynomial regression.

a dichotomous term is one that splits a dataset into two mutually exclusive subsets whose union is the original set, also called a dichotomy.

From wikipedia, “a quantitative interaction between A and B is a situation where the magnitude of the effect of B depends on the value of A, but the direction of the effect of B is constant for all A”. I’m not sure how practical this is. If A is 0 or 1 and we multiply A times B then we’re only seeing the value of B for the subset of the data where A = 1.

Interpret all coefficients of the regression model: Each coefficient, one for each independent/explanatory variable, tells you how much the mean of the dependent variable shifts, positive or negative, for a one unit shift in the independent variable all other variables held equal. If the p-values for any given coefficient is below your significance level then the relationship is statistically significant for a larger sample, otherwise not.

Conduct residual analysis. Residual analysis is looking at how different the expected versus the actual outcome for the data points are which tells you how well the model accounts for variation in the data. First look at the residual plots to see intuitively if the residuals aren’t too low or too high for any given range. High R-squared values tell you there are smaller differences between the observed data and the fitted values. When the data points are closer to the regression line then the R-squared is higher.

This website helped explain residual analysis and has greater information than I go into here: https://statisticsbyjim.com/regression/interpret-r-squared-regression/

Data

# Here we select the built-in-to-R data, trees
#data(trees)
df <- trees
summary(df)

##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00

# Here we inspect what some of the rows look like
# Already this looks uninteresting because Volume should be a function of Girth and Height.  We don't have a column for species of tree and we don't know how volume is calculated (weight of chipped tree times density of chips?)
# However Girth would make an interesting quadratic factor since pi*r^2 is area of a cross-section of the tree.
head(df)

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

Build new terms

# make a quadratic term 
df$Girth_sq <- df$Girth^2

# make a dichotomous term
df$Short <- df$Height < 74

# make a dichotomous vs. quantitative interaction term
df$ShortGirth <- df$Short * df$Girth

# Show new terms
head(df)

##   Girth Height Volume Girth_sq Short ShortGirth
## 1   8.3     70   10.3    68.89  TRUE        8.3
## 2   8.6     65   10.3    73.96  TRUE        8.6
## 3   8.8     63   10.2    77.44  TRUE        8.8
## 4  10.5     72   16.4   110.25  TRUE       10.5
## 5  10.7     81   18.8   114.49 FALSE        0.0
## 6  10.8     83   19.7   116.64 FALSE        0.0

Build a Multiple Regression Model

# Here we build the multiple linear regression model, mlr, where Volume is dependent on the other variables.
mlr <- lm(Volume ~ Girth+Height+Girth_sq+Short+ShortGirth, data=df)
summary(mlr)

## 
## Call:
## lm(formula = Volume ~ Girth + Height + Girth_sq + Short + ShortGirth, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5903 -1.3269  0.3297  1.9062  3.9737 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -22.03145   21.02174  -1.048  0.30465   
## Girth        -1.23040    2.13511  -0.576  0.56959   
## Height        0.37547    0.14793   2.538  0.01775 * 
## Girth_sq      0.21522    0.07186   2.995  0.00611 **
## ShortTRUE     7.75375    7.92361   0.979  0.33717   
## ShortGirth   -0.63883    0.59305  -1.077  0.29168   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.663 on 25 degrees of freedom
## Multiple R-squared:  0.9781, Adjusted R-squared:  0.9738 
## F-statistic: 223.7 on 5 and 25 DF,  p-value: < 2.2e-16

Interpret all coefficients

Going line by line:

Girth -1.23040 with a p-value of 0.56959

An increase in Girth is associated with a negative change in Volume, which doesn’t make sense but it also has a p-value of 0.56959, which is above our standard 0.05 so it’s not a statistically significant result.

Height 0.37547 with a p-value of 0.01775

An increase in height is associated with a statistically significant increase in Volume.

Girth_sq 0.21522 with a p-value of 0.00611

An increase in girth squared is associated with a statistically significant increase in volume. We predicted this when looking at the data since pi*r^2 is the area of a cross section of the tree.

ShortTRUE 7.75375 with a p-value of 0.33717

A tree being shorter than 74 is associated with an increase in volume which does not make sense (unless there are multiple species of trees) however the p-value is well above 0.05 so it’s not a statistically significant relationship.

ShortGirth -0.63883 with a p-value of 0.29168

The girth of just short trees which taller trees girth being set to zero has a negative relationship with volume. This makes sense because taller trees are likely to have more volume so setting their girth to zero will mean a unit-increase in ShortGirth means less volume, however the p-value is above 0.05 so it’s not statistically significant.

Let’s fix our model

We’ve adjusted our model by removing the explanatory variables that weren’t significant. Now let’s move on to residual analysis.

mlr <- lm(Volume ~ Height+Girth_sq, data=df)
summary(mlr)

## 
## Call:
## lm(formula = Volume ~ Height + Girth_sq, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8844 -2.2105  0.1196  2.6134  4.2404 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -27.511603   6.557697  -4.195 0.000248 ***
## Height        0.348809   0.093152   3.744 0.000830 ***
## Girth_sq      0.168458   0.006679  25.222  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.799 on 28 degrees of freedom
## Multiple R-squared:  0.9729, Adjusted R-squared:  0.971 
## F-statistic: 503.2 on 2 and 28 DF,  p-value: < 2.2e-16

Conduct residual analysis

#This is a hard plot to read.  It looks like over the range of expected outputs from ~5 to ~75 the expected values are always zero and the residuals are plotted as how far away from the expected value they were, positioned over the expected value they were expected to have
#It's not clear that the residuals are fairly close to the line which would mean r-squared was high.  Which it is at 97%.
#Also it's hard to tell if the residuals are consistent or unbiased across the range of expected values.  It looks like it's concentrated at expected volumes of 20.  So from this graph I'd say we have a problem but that doesn't seem supported by the next graph or the R-squared value.
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

ggplot(mlr, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

# Here's the qq plot.  Since the last discussion we found the qqline function which show that the residuals are normal except for below one standard deviation below and a little less than one standard deviation above.
# It looks like we're ok.  The residuals drop off the line towards the top before we get to the first standard deviation away so maybe that points to the fewer dots in the bottom right quadrant of the residual plot above.
qqnorm(mlr$residuals)
qqline(mlr$residuals)

Conclusion

The multiple linear regression model after the adjustment looks appropriate. It seems trivial because as we discussed in the Data section Volume is calculated from height and the area of a cross section of the tree, which is a consonant (pi/2) times girth squared, which we calculated as our quadratic term. This is born out with the high R-squared statistic, saying 97% of the variance in our data was explained by our two variables, height and girth squared.