Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Multiple Regression, or Multiple Linear Regression, provides an expected value of a response, or dependent, variable using multiple explanatory, or independent, variables.
A quadratic term is a second column created by raising another column to the power of 2. This lets you use linear regression to model a specific non-linear term. If the power of 2 isn’t the optimal non linear relationship you could use polynomial regression.
a dichotomous term is one that splits a dataset into two mutually exclusive subsets whose union is the original set, also called a dichotomy.
From wikipedia, “a quantitative interaction between A and B is a situation where the magnitude of the effect of B depends on the value of A, but the direction of the effect of B is constant for all A”. I’m not sure how practical this is. If A is 0 or 1 and we multiply A times B then we’re only seeing the value of B for the subset of the data where A = 1.
Interpret all coefficients of the regression model: Each coefficient, one for each independent/explanatory variable, tells you how much the mean of the dependent variable shifts, positive or negative, for a one unit shift in the independent variable all other variables held equal. If the p-values for any given coefficient is below your significance level then the relationship is statistically significant for a larger sample, otherwise not.
Conduct residual analysis. Residual analysis is looking at how different the expected versus the actual outcome for the data points are which tells you how well the model accounts for variation in the data. First look at the residual plots to see intuitively if the residuals aren’t too low or too high for any given range. High R-squared values tell you there are smaller differences between the observed data and the fitted values. When the data points are closer to the regression line then the R-squared is higher.
# Here we select the built-in-to-R data, trees
#data(trees)
df <- trees
summary(df)
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
# Here we inspect what some of the rows look like
# Already this looks uninteresting because Volume should be a function of Girth and Height. We don't have a column for species of tree and we don't know how volume is calculated (weight of chipped tree times density of chips?)
# However Girth would make an interesting quadratic factor since pi*r^2 is area of a cross-section of the tree.
head(df)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
# make a quadratic term
df$Girth_sq <- df$Girth^2
# make a dichotomous term
df$Short <- df$Height < 74
# make a dichotomous vs. quantitative interaction term
df$ShortGirth <- df$Short * df$Girth
# Show new terms
head(df)
## Girth Height Volume Girth_sq Short ShortGirth
## 1 8.3 70 10.3 68.89 TRUE 8.3
## 2 8.6 65 10.3 73.96 TRUE 8.6
## 3 8.8 63 10.2 77.44 TRUE 8.8
## 4 10.5 72 16.4 110.25 TRUE 10.5
## 5 10.7 81 18.8 114.49 FALSE 0.0
## 6 10.8 83 19.7 116.64 FALSE 0.0
# Here we build the multiple linear regression model, mlr, where Volume is dependent on the other variables.
mlr <- lm(Volume ~ Girth+Height+Girth_sq+Short+ShortGirth, data=df)
summary(mlr)
##
## Call:
## lm(formula = Volume ~ Girth + Height + Girth_sq + Short + ShortGirth,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5903 -1.3269 0.3297 1.9062 3.9737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.03145 21.02174 -1.048 0.30465
## Girth -1.23040 2.13511 -0.576 0.56959
## Height 0.37547 0.14793 2.538 0.01775 *
## Girth_sq 0.21522 0.07186 2.995 0.00611 **
## ShortTRUE 7.75375 7.92361 0.979 0.33717
## ShortGirth -0.63883 0.59305 -1.077 0.29168
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.663 on 25 degrees of freedom
## Multiple R-squared: 0.9781, Adjusted R-squared: 0.9738
## F-statistic: 223.7 on 5 and 25 DF, p-value: < 2.2e-16
Going line by line:
Girth -1.23040 with a p-value of 0.56959
An increase in Girth is associated with a negative change in Volume, which doesn’t make sense but it also has a p-value of 0.56959, which is above our standard 0.05 so it’s not a statistically significant result.
Height 0.37547 with a p-value of 0.01775
An increase in height is associated with a statistically significant increase in Volume.
Girth_sq 0.21522 with a p-value of 0.00611
An increase in girth squared is associated with a statistically significant increase in volume. We predicted this when looking at the data since pi*r^2 is the area of a cross section of the tree.
ShortTRUE 7.75375 with a p-value of 0.33717
A tree being shorter than 74 is associated with an increase in volume which does not make sense (unless there are multiple species of trees) however the p-value is well above 0.05 so it’s not a statistically significant relationship.
ShortGirth -0.63883 with a p-value of 0.29168
The girth of just short trees which taller trees girth being set to zero has a negative relationship with volume. This makes sense because taller trees are likely to have more volume so setting their girth to zero will mean a unit-increase in ShortGirth means less volume, however the p-value is above 0.05 so it’s not statistically significant.
We’ve adjusted our model by removing the explanatory variables that weren’t significant. Now let’s move on to residual analysis.
mlr <- lm(Volume ~ Height+Girth_sq, data=df)
summary(mlr)
##
## Call:
## lm(formula = Volume ~ Height + Girth_sq, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8844 -2.2105 0.1196 2.6134 4.2404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.511603 6.557697 -4.195 0.000248 ***
## Height 0.348809 0.093152 3.744 0.000830 ***
## Girth_sq 0.168458 0.006679 25.222 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.799 on 28 degrees of freedom
## Multiple R-squared: 0.9729, Adjusted R-squared: 0.971
## F-statistic: 503.2 on 2 and 28 DF, p-value: < 2.2e-16
#This is a hard plot to read. It looks like over the range of expected outputs from ~5 to ~75 the expected values are always zero and the residuals are plotted as how far away from the expected value they were, positioned over the expected value they were expected to have
#It's not clear that the residuals are fairly close to the line which would mean r-squared was high. Which it is at 97%.
#Also it's hard to tell if the residuals are consistent or unbiased across the range of expected values. It looks like it's concentrated at expected volumes of 20. So from this graph I'd say we have a problem but that doesn't seem supported by the next graph or the R-squared value.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## âś” ggplot2 3.4.2 âś” purrr 1.0.1
## âś” tibble 3.1.8 âś” dplyr 1.0.10
## âś” tidyr 1.3.0 âś” stringr 1.5.0
## âś” readr 2.1.3 âś” forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
ggplot(mlr, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
# Here's the qq plot. Since the last discussion we found the qqline function which show that the residuals are normal except for below one standard deviation below and a little less than one standard deviation above.
# It looks like we're ok. The residuals drop off the line towards the top before we get to the first standard deviation away so maybe that points to the fewer dots in the bottom right quadrant of the residual plot above.
qqnorm(mlr$residuals)
qqline(mlr$residuals)
The multiple linear regression model after the adjustment looks appropriate. It seems trivial because as we discussed in the Data section Volume is calculated from height and the area of a cross section of the tree, which is a consonant (pi/2) times girth squared, which we calculated as our quadratic term. This is born out with the high R-squared statistic, saying 97% of the variance in our data was explained by our two variables, height and girth squared.