LINEAR REGRESSION

To better understand some aspects of univariate and multivariate linear regression, the following investigations have been undertaken:

Graphical Representation of Linear Regression and Associated Outputs: A function to visualise the output of a simple model with a single predictor variable. It shows the data with the regression line, the confidence interval for the coefficients, the mean absolute error, the prediction interval and the residuals.

Categorical or Continuous?: Example of modelling a numerical variable as continuous or ordered categorical.

Test for Linear and Non-Linear Relationships for Individual Predictors: A function that fits each predictor in a data frame individually against a common response. The predictor can be given a polynomial, square root, log base 10 or natural log transformation. The coefficients, standard error, p value, RSE and R squared of each model can then be compared.

Finding Interaction Terms: A function that fits every pair of predictors in a data frame with an interaction term and outputs the p values for comparison.

library(tidyverse)
library(ISLR)
library(MASS)
select <- dplyr::select # MASS also exports select(); keep the dplyr version

Graphical Representation of Linear Regression and Associated Outputs

To gain further insight into the output of a linear regression model, the key quantities are shown graphically. The function plot_lm() fits a simple linear regression of a single predictor against a response (both entered as strings).

Two plots are generated. The first shows:

  • A scatterplot of the data, coloured by whether the point lies within a prediction interval (PI) at a 95% confidence level.
  • A blue solid linear regression line with a grey shaded area showing a point-wise 95% confidence band, i.e. the true regression line should lie within this area at a 95% confidence level.
  • Black solid lines showing the upper and lower bounds of the regression line at a 95% confidence level (CI). These are analogous to the grey shaded area but are calculated from the whole data set using the standard errors of the intercept coefficient, b0, and of the regression coefficient, b1. They illustrate the interpretation of the coefficients b0 and b1 and their associated confidence intervals: in the case of b1, the average increase in the response, y, for a unit increase in the predictor, x, lies between the upper and lower bounds. (A short sketch contrasting these bounds with the point-wise band follows after this list.)
  • Small dashed lines showing the upper and lower bounds of the fitted values plus or minus the training mean absolute error (MAE). The MAE represents how far, on average, each observation is from the regression line along the y-axis. Note that in the three examples given below approximately 60-65% of data points lie within this region.
  • Dashed lines showing the upper and lower bounds of the prediction interval (PI) at a 95% confidence level, i.e. approximately 95% of the training data will lie within this region. When a prediction interval is given for a prediction made by the model, it means that, based on the training data, there is a 95% probability the new observation will lie between these bounds.

The second plot shows:

  • A scatterplot of the residuals, coloured by whether the residual lies within the MAE.
  • Dotted lines showing plus and minus the MAE.

The fraction of training data within the MAE and prediction interval is also given.
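
As a minimal sketch of the difference between the point-wise confidence band (the grey ribbon) and the coefficient-based bounds (the black CI lines), using the ISLR::Auto example that appears below (output not shown here):

# sketch: compare the point-wise 95% confidence band with the coefficient-based bounds
sketch_model <- lm(horsepower ~ displacement, data = Auto)
# point-wise confidence band at displacement = 300 (the grey ribbon)
predict(sketch_model, data.frame(displacement = 300), interval = "confidence")
# coefficient-based upper bound at displacement = 300 (the black CI line)
cf <- summary(sketch_model)$coefficients
t_crit <- qt(.975, nrow(Auto) - 2)
(cf[1, 1] + t_crit * cf[1, 2]) + (cf[2, 1] + t_crit * cf[2, 2]) * 300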

plot_lm <- function(df, response, predictor, show_summary=TRUE){
  
  # fit regression model
  model <- lm(df[[response]] ~ df[[predictor]])
  summ <- summary(model)
  
  # retrieve coefficients
  b0 <- summ$coefficients[1,1]
  b0_se <- summ$coefficients[1,2]
  b1 <- summ$coefficients[2,1]
  b1_se <- summ$coefficients[2,2]

  # calculate t statistic for 95% confidence interval on coefficients
  qt <- qt(.975, nrow(df) - 2)

  # metric upper and lower limits
  df_upper <- df %>% 
    mutate(CI = (b0 + qt * b0_se) + (b1 + qt * b1_se) * df[[predictor]],
           MAE = model$fitted.values + mean(abs(model$residuals)),
           PI = predict(model, df[predictor], interval = "predict")[,"upr"]) 
  
  df_lower <- df %>% 
    mutate(CI = (b0 - qt * b0_se) + (b1 - qt * b1_se) * df[[predictor]],
           MAE = model$fitted.values - mean(abs(model$residuals)),
           PI = predict(model, df[predictor], interval = "predict")[,"lwr"])
  
  df <- df %>% mutate(in_MAE = df[[response]] < df_upper %>% pull(MAE) & df[[response]] > df_lower %>% pull(MAE),
                      in_PI = df[[response]] < df_upper %>% pull(PI) & df[[response]] > df_lower %>% pull(PI))
  
  df_upper <- df_upper %>% gather("metric", "value", c("CI", "MAE", "PI"))
  df_lower <- df_lower %>% gather("metric", "value", c("CI", "MAE", "PI"))
  
  # fitted model plot
  plot <- ggplot(df, aes(df[[predictor]], df[[response]])) +
    geom_point(aes(col = in_PI)) +
    geom_smooth(method = "lm") +
    geom_line(data = df_upper, aes(df_upper[[predictor]], value, linetype = metric)) +
    geom_line(data = df_lower, aes(df_lower[[predictor]], value, linetype = metric)) +
    labs(x = colnames(df[predictor]), y = colnames(df[response]))

  # residual plot  
  df_resid_upper <- df %>% 
    mutate(residual_mean = mean(model$residuals), 
           MAE = mean(abs(model$residuals))) %>% 
    gather("metric", "value", residual_mean:MAE)
  
  df_resid_lower <- df %>% mutate(metric = "MAE", value = -mean(abs(model$residuals)))
  
  res_plot <- ggplot(df, aes(df[[predictor]], model$residuals)) +
    geom_point(aes(col = in_MAE)) +
    geom_hline(data = df_resid_upper, aes(yintercept = value, linetype = metric)) +
    geom_hline(data = df_resid_lower, aes(yintercept = value, linetype = metric)) +
    labs(x = colnames(df[predictor]))
  
  if(show_summary==TRUE){
    list(res_plot, plot, summ)
  } else {
    list(res_plot, plot)
  }
}

Examples:

These examples only consider one predictor in the interpretation of the model outputs. Other predictors will likely affect the response.

ISLR::Auto, the horsepower of a car vs its engine displacement in cubic inches.

plot_lm(Auto, "horsepower", "displacement")
## [[1]]

## 
## [[2]]

## 
## [[3]]
## 
## Call:
## lm(formula = df[[response]] ~ df[[predictor]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.819 -10.695  -0.819   8.676  64.742 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     40.306108   1.815099   22.21   <2e-16 ***
## df[[predictor]]  0.330038   0.008223   40.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.02 on 390 degrees of freedom
## Multiple R-squared:  0.8051, Adjusted R-squared:  0.8046 
## F-statistic:  1611 on 1 and 390 DF,  p-value: < 2.2e-16

A car with a 300 cubic inch engine is predicted to have 139 horsepower on average, plus or minus an average error of 12.5 hp. There is a 95% probability that its horsepower will lie between 106 and 173. For an increase of 100 cubic inches, the average horsepower will increase by between 31 and 35.

# fit regression model
auto_model <- lm(horsepower ~ displacement, Auto)
auto_summ <- summary(auto_model)
# retrieve coefficients
b0 <- auto_summ$coefficients[1,1]
b0_se <- auto_summ$coefficients[1,2]
b1 <- auto_summ$coefficients[2,1]
b1_se <- auto_summ$coefficients[2,2]
# prediction
b0 + b1 * 300
## [1] 139.3174
# MAE
mean(abs(auto_model$residuals))
## [1] 12.49397
# prediction interval
predict(auto_model, data.frame(displacement=300), interval = "predict")
##        fit      lwr     upr
## 1 139.3174 105.7768 172.858
# calculate t statistic for 95% confidence interval on coefficients
qt <- qt(.975, nrow(Auto) - 2)
# unit increase in predictor gives increase in response
b1 - qt * b1_se
## [1] 0.3138698
b1 + qt * b1_se
## [1] 0.3462055

ISLR::Carseats, Sales of child car seats at a store (in thousands of units) vs the local advertising budget at that store (in thousands of dollars).

plot_lm(Carseats, "Sales", "Advertising")
## [[1]]

## 
## [[2]]

## 
## [[3]]
## 
## Call:
## lm(formula = df[[response]] ~ df[[predictor]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3770 -1.9634 -0.1037  1.7222  8.3208 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.7370     0.1925  35.007  < 2e-16 ***
## df[[predictor]]   0.1144     0.0205   5.583 4.38e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.723 on 398 degrees of freedom
## Multiple R-squared:  0.07263,    Adjusted R-squared:  0.0703 
## F-statistic: 31.17 on 1 and 398 DF,  p-value: 4.378e-08
# Carseats %>% filter(Advertising == 0) %>% summarise(mean(Sales))

MASS::Boston, average number of rooms per dwelling rm in an area of Boston vs median value of owner-occupied homes medv (in thousands of dollars) for that area.

plot_lm(Boston, "rm", "medv")
## [[1]]

## 
## [[2]]

## 
## [[3]]
## 
## Call:
## lm(formula = df[[response]] ~ df[[predictor]])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.98750 -0.24448  0.01893  0.27379  2.52898 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.087639   0.059510   85.49   <2e-16 ***
## df[[predictor]] 0.053122   0.002446   21.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5054 on 504 degrees of freedom
## Multiple R-squared:  0.4835, Adjusted R-squared:  0.4825 
## F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16

Categorical or Continuous?

Are some variables better modelled as continuous or as categorical? In the Auto data set cylinders could be treated as an ordered categorical variable, since it is not possible to have a fraction of a cylinder.

In this example cylinders seems better treated as a factor, as the RSE is lower and the R squared is higher.

# linear model with cylinders as continuous
summary(lm(mpg ~ cylinders, Auto))
## 
## Call:
## lm(formula = mpg ~ cylinders, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2413  -3.1832  -0.6332   2.5491  17.9168 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.9155     0.8349   51.40   <2e-16 ***
## cylinders    -3.5581     0.1457  -24.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.914 on 390 degrees of freedom
## Multiple R-squared:  0.6047, Adjusted R-squared:  0.6037 
## F-statistic: 596.6 on 1 and 390 DF,  p-value: < 2.2e-16
# linear model with cylinders as factors
Auto_df <- Auto %>% select(-name) %>% mutate(fact_cyl = as.factor(cylinders))
lm_fact_cyl <- summary(lm(mpg~fact_cyl, Auto_df))
lm_fact_cyl
## 
## Call:
## lm(formula = mpg ~ fact_cyl, data = Auto_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2839  -2.9037  -0.9631   2.3437  18.0265 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.5500     2.3494   8.747  < 2e-16 ***
## fact_cyl4     8.7339     2.3729   3.681 0.000266 ***
## fact_cyl5     6.8167     3.5888   1.899 0.058250 .  
## fact_cyl6    -0.5765     2.4053  -0.240 0.810708    
## fact_cyl8    -5.5869     2.3946  -2.333 0.020153 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.699 on 387 degrees of freedom
## Multiple R-squared:  0.6413, Adjusted R-squared:  0.6376 
## F-statistic:   173 on 4 and 387 DF,  p-value: < 2.2e-16
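
Since the continuous model is nested within the factor model (the factor coding can reproduce any linear trend across the five cylinder levels), a partial F-test is another way to compare the two fits. A minimal sketch, with output not shown here:

# sketch: F-test comparing continuous cylinders against cylinders as a factor
anova(lm(mpg ~ cylinders, Auto),
      lm(mpg ~ as.factor(cylinders), Auto))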

The chart below shows why converting cylinders to an ordered categorical variable performs better. The blue line is the regression using the numerical continuous values and the red line is the regression using the categorical levels. The red line better captures the mean mpg for each number of cylinders.

# create data frame of regression coefficients
plotr <- lm_fact_cyl$coefficients %>% 
  as.data.frame() %>% rownames_to_column("fact_cyl") 
# convert factors back to numerical for plotting purposes
plotr <- plotr %>% 
  mutate(cylinders = ifelse(fact_cyl == "(Intercept)", 3, str_replace_all(fact_cyl, "\\D", "")) %>% as.numeric(),
         mpg_fact_cyl = ifelse(cylinders == 3, plotr[1,2], plotr[1,2] + Estimate))

ggplot(Auto_df, aes(cylinders, mpg)) + geom_point() + geom_smooth(method = "lm") +
  geom_line(data = plotr, aes(cylinders, mpg_fact_cyl), col = "red")

Test for Linear and Non-Linear Relationships for Individual Predictors

To test for linear and non-linear relationships between each individual predictor and a response, the function test_relationship() is used. For every predictor it fits a separate linear model to the response and can apply a transformation to the predictor.

The output is a data frame containing regression coefficient estimate, standard error, p value, RSE, R squared and boolean of significance for each predictor.

Arguments:

  • df: data frame for analysis
  • response: string naming the response variable
  • n_poly: numerical polynomial order, defaults to 1 (a plain univariate regression)
  • sqrt: boolean for a square root transformation of the predictor, defaults to FALSE
  • log: "10" for a base 10 log transformation and "natural" for a natural log transformation of the predictor, defaults to FALSE
  • signif_level: significance level used to assess coefficients, defaults to 0.05
  • show_intercept: boolean to show output for the intercept

test_relationship <- function(df, response, n_poly=1, sqrt=FALSE, log=FALSE, 
                              signif_level=0.05, show_intercept=TRUE){
  
  # remove categorical variables if fitting polynomials, square root or logs
  if(n_poly > 1 | sqrt == TRUE | log != FALSE) df <- df %>% select_if(is.numeric) 
  predictors <- colnames(df)[colnames(df) != response] # remove response to create predictor names vector
  
  # fit linear model
  fit_lm <- function(response, x, n_poly){
    if(sqrt == TRUE) {formula <- as.formula("df[[response]] ~ sqrt(df[[x]])")
    } else if(log == "10") {formula <- as.formula("df[[response]] ~ log10(df[[x]])")
    } else if(log == "natural") {formula <- as.formula("df[[response]] ~ log(df[[x]])")
    } else {
      formula <- data.frame(poly = seq(1, n_poly)) %>%
        mutate(term = case_when(poly == 1 ~ "df[[x]]",
                                poly > 1  ~ paste("I(df[[x]]^", as.character(poly), ")"))) %>%
        pull(term) %>% paste(., collapse=" + ") %>% paste("df[[response]] ~", .) %>%
        as.formula()
    }
    summary(lm(formula))
  }

  # model output as nested tables
  lm_output <- mapply(fit_lm, response, predictors, n_poly)
  colnames(lm_output) <- predictors
  model_output <- t(lm_output) %>% as.data.frame() %>% rownames_to_column("predictor")

  # extract coefficients
  coefs <- model_output %>%
    mutate(coefficients = map(coefficients, ~as.data.frame(.x)),
           coef = map(coefficients, ~.x %>% as.data.frame() %>% rownames())) %>% # levels for factors
    unnest(coef, coefficients) %>%
    rename(t_value = "t value", pval = "Pr(>|t|)")

  # extract single entry model stats
  model_stats <- model_output %>%
    unnest(rse = sigma, r.squared, .drop = TRUE)

  # combine coefficients and model stats
  output <- left_join(coefs, model_stats, by = "predictor") %>% select(-t_value) %>%
    mutate(signif = pval < signif_level)
  
  # show intercept or not
  if(show_intercept==FALSE) output %>% filter(coef != "(Intercept)") else output
}
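
The square root and log options are used in the same way as the polynomial option. As an illustrative call only (not run in this document), each numeric predictor in Auto could be tested against mpg with a square root transformation:

# illustrative call: square root transformation of each numeric predictor against mpg
test_relationship(Auto, "mpg", sqrt = TRUE, show_intercept = FALSE)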

Exercise 15 from chapter 3 of Introduction to Statistical Learning with Applications in R (ISLR) illustrates the use of test_relationship(). This involves prediction of per capita crime rate crim using the Boston data set.

Boston_df <- Boston %>% 
  mutate(chas = as.factor(chas),
         rad = as.factor(rad))

Fit simple linear models for each predictor.

single_pred <- test_relationship(Boston_df, "crim", show_intercept = FALSE)
single_pred
## # A tibble: 20 x 28
##    predictor call.x terms.x residuals.x Estimate `Std. Error`     pval aliased.x
##    <chr>     <name> <named> <named lis>    <dbl>        <dbl>    <dbl> <named l>
##  1 zn        <lang… <terms> <dbl [506]>  -0.0739      0.0161  5.51e- 6 <lgl [2]>
##  2 indus     <lang… <terms> <dbl [506]>   0.510       0.0510  1.45e-21 <lgl [2]>
##  3 chas      <lang… <terms> <dbl [506]>  -1.89        1.51    2.09e- 1 <lgl [2]>
##  4 nox       <lang… <terms> <dbl [506]>  31.2         3.00    3.75e-23 <lgl [2]>
##  5 rm        <lang… <terms> <dbl [506]>  -2.68        0.532   6.35e- 7 <lgl [2]>
##  6 age       <lang… <terms> <dbl [506]>   0.108       0.0127  2.85e-16 <lgl [2]>
##  7 dis       <lang… <terms> <dbl [506]>  -1.55        0.168   8.52e-19 <lgl [2]>
##  8 rad       <lang… <terms> <dbl [506]>   0.0473      2.03    9.81e- 1 <lgl [9]>
##  9 rad       <lang… <terms> <dbl [506]>   0.0613      1.85    9.74e- 1 <lgl [9]>
## 10 rad       <lang… <terms> <dbl [506]>   0.358       1.63    8.27e- 1 <lgl [9]>
## 11 rad       <lang… <terms> <dbl [506]>   0.652       1.63    6.89e- 1 <lgl [9]>
## 12 rad       <lang… <terms> <dbl [506]>   0.114       2.00    9.54e- 1 <lgl [9]>
## 13 rad       <lang… <terms> <dbl [506]>   0.114       2.21    9.59e- 1 <lgl [9]>
## 14 rad       <lang… <terms> <dbl [506]>   0.335       2.03    8.69e- 1 <lgl [9]>
## 15 rad       <lang… <terms> <dbl [506]>  12.7         1.61    1.84e-14 <lgl [9]>
## 16 tax       <lang… <terms> <dbl [506]>   0.0297      0.00185 2.36e-47 <lgl [2]>
## 17 ptratio   <lang… <terms> <dbl [506]>   1.15        0.169   2.94e-11 <lgl [2]>
## 18 black     <lang… <terms> <dbl [506]>  -0.0363      0.00387 2.49e-19 <lgl [2]>
## 19 lstat     <lang… <terms> <dbl [506]>   0.549       0.0478  2.65e-27 <lgl [2]>
## 20 medv      <lang… <terms> <dbl [506]>  -0.363       0.0384  1.17e-19 <lgl [2]>
## # … with 20 more variables: sigma.x <named list>, df.x <named list>,
## #   r.squared.x <named list>, adj.r.squared.x <named list>, fstatistic.x <named
## #   list>, cov.unscaled.x <named list>, coef <chr>, call.y <named list>,
## #   terms.y <named list>, residuals.y <named list>, coefficients <named list>,
## #   aliased.y <named list>, sigma.y <named list>, df.y <named list>,
## #   r.squared.y <dbl>, adj.r.squared.y <named list>, fstatistic.y <named list>,
## #   cov.unscaled.y <named list>, rse <dbl>, signif <lgl>

These predictors are statistically significant (i.e. all except chas):

single_pred %>% filter(signif == TRUE) %>% pull(predictor)
##  [1] "zn"      "indus"   "nox"     "rm"      "age"     "dis"     "rad"    
##  [8] "tax"     "ptratio" "black"   "lstat"   "medv"

Fitting the full multiple regression model gives zn, nox, dis, rad (via the rad24 level) and medv as significant.

all_pred_summ <- summary(lm(crim ~ ., Boston_df))
all_pred_summ
## 
## Call:
## lm(formula = crim ~ ., data = Boston_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.910 -1.832 -0.269  0.928 74.823 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21.286683   7.720816   2.757 0.006052 ** 
## zn            0.038739   0.019656   1.971 0.049315 *  
## indus        -0.078839   0.087368  -0.902 0.367302    
## chas1        -0.750706   1.195613  -0.628 0.530376    
## nox         -10.812042   5.441141  -1.987 0.047474 *  
## rm            0.397653   0.621841   0.639 0.522814    
## age           0.001899   0.018168   0.105 0.916815    
## dis          -1.016331   0.290162  -3.503 0.000503 ***
## rad2         -0.704036   2.031423  -0.347 0.729063    
## rad3          0.555208   1.857126   0.299 0.765098    
## rad4          0.207187   1.638875   0.126 0.899452    
## rad5          0.491248   1.668619   0.294 0.768575    
## rad6         -0.925776   2.011933  -0.460 0.645620    
## rad7          1.614476   2.178361   0.741 0.458966    
## rad8          1.608235   2.069816   0.777 0.437541    
## rad24        12.045021   2.440130   4.936  1.1e-06 ***
## tax          -0.003125   0.005376  -0.581 0.561365    
## ptratio      -0.351181   0.206908  -1.697 0.090285 .  
## black        -0.007033   0.003688  -1.907 0.057075 .  
## lstat         0.121993   0.076796   1.589 0.112815    
## medv         -0.205330   0.061673  -3.329 0.000937 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.447 on 485 degrees of freedom
## Multiple R-squared:  0.4604, Adjusted R-squared:  0.4382 
## F-statistic: 20.69 on 20 and 485 DF,  p-value: < 2.2e-16

Plot the multivariate coefficients against the univariate coefficients. The dashed line shows where the coefficients are equal. Most coefficients are similar, except nox.

# attach category numbers to predictor strings
uni_pred <- single_pred %>% 
  mutate(predictor = paste(predictor, str_replace(coef, "df\\[\\[x\\]\\]", ""), sep = "")) %>% 
  rename(univariate_coefficient = Estimate)

# data frame of multiple regression
multi_pred <- data.frame(multivariate_coefficient = all_pred_summ$coefficients[, "Estimate"]) %>% 
  rownames_to_column("predictor")

# join regression results data frames and plot
coeffs <- left_join(uni_pred, multi_pred)
ggplot(coeffs, aes(univariate_coefficient, multivariate_coefficient)) + geom_point() + 
  geom_abline(slope=1, intercept=0, linetype=2) +
  geom_text(data = coeffs %>% filter(univariate_coefficient > 30), aes(label = predictor), vjust=-1)

Check if there are any third-order polynomial relationships.

poly_pred <- test_relationship(Boston_df, "crim", n_poly = 3, show_intercept = FALSE)
poly_pred
## # A tibble: 33 x 28
##    predictor call.x terms.x residuals.x Estimate `Std. Error`     pval aliased.x
##    <chr>     <name> <named> <named lis>    <dbl>        <dbl>    <dbl> <named l>
##  1 zn        <lang… <terms> <dbl [506]> -3.32e-1    0.110     2.61e- 3 <lgl [4]>
##  2 zn        <lang… <terms> <dbl [506]>  6.48e-3    0.00386   9.38e- 2 <lgl [4]>
##  3 zn        <lang… <terms> <dbl [506]> -3.78e-5    0.0000314 2.30e- 1 <lgl [4]>
##  4 indus     <lang… <terms> <dbl [506]> -1.97e+0    0.482     5.30e- 5 <lgl [4]>
##  5 indus     <lang… <terms> <dbl [506]>  2.52e-1    0.0393    3.42e-10 <lgl [4]>
##  6 indus     <lang… <terms> <dbl [506]> -6.98e-3    0.000957  1.20e-12 <lgl [4]>
##  7 nox       <lang… <terms> <dbl [506]> -1.28e+3  170.        2.76e-13 <lgl [4]>
##  8 nox       <lang… <terms> <dbl [506]>  2.25e+3  280.        6.81e-15 <lgl [4]>
##  9 nox       <lang… <terms> <dbl [506]> -1.25e+3  149.        6.96e-16 <lgl [4]>
## 10 rm        <lang… <terms> <dbl [506]> -3.92e+1   31.3       2.12e- 1 <lgl [4]>
## # … with 23 more rows, and 20 more variables: sigma.x <named list>, df.x <named
## #   list>, r.squared.x <named list>, adj.r.squared.x <named list>,
## #   fstatistic.x <named list>, cov.unscaled.x <named list>, coef <chr>,
## #   call.y <named list>, terms.y <named list>, residuals.y <named list>,
## #   coefficients <named list>, aliased.y <named list>, sigma.y <named list>,
## #   df.y <named list>, r.squared.y <dbl>, adj.r.squared.y <named list>,
## #   fstatistic.y <named list>, cov.unscaled.y <named list>, rse <dbl>,
## #   signif <lgl>

The predictors below are significant for all three polynomial orders. The R squared values also increase, though only from poor fits to slightly less poor ones!

# find predictors with all 3 polynomial orders being significant
poly_signif <- poly_pred %>% 
  group_by(predictor) %>% summarise(sum_signif = sum(signif)) %>% filter(sum_signif == 3) %>% pull(predictor)

# compare r squared for simple models and polynomial models each with single predictors
inner_join(single_pred, poly_pred, by = c("predictor", "coef"), suffix = c(".single", ".poly")) %>% 
  select(predictor, matches("signif|r.squared")) %>% filter(predictor %in% poly_signif) 
## # A tibble: 5 x 11
##   predictor r.squared.x.sin… adj.r.squared.x… r.squared.y.sin… adj.r.squared.y…
##   <chr>     <named list>     <named list>                <dbl> <named list>    
## 1 indus     <dbl [1]>        <dbl [1]>                  0.165  <dbl [1]>       
## 2 nox       <dbl [1]>        <dbl [1]>                  0.177  <dbl [1]>       
## 3 dis       <dbl [1]>        <dbl [1]>                  0.144  <dbl [1]>       
## 4 ptratio   <dbl [1]>        <dbl [1]>                  0.0841 <dbl [1]>       
## 5 medv      <dbl [1]>        <dbl [1]>                  0.151  <dbl [1]>       
## # … with 6 more variables: signif.single <lgl>, r.squared.x.poly <named list>,
## #   adj.r.squared.x.poly <named list>, r.squared.y.poly <dbl>,
## #   adj.r.squared.y.poly <named list>, signif.poly <lgl>

Finding Interaction Terms

One potential approach to finding interaction terms is to fit a series of models, each containing two predictors in the form y ~ x1 + x2 + x1:x2, for every pair of predictors.

The function find_interactions() does this for a single response (entered as a string) at a set significance level. It outputs two tables: the first gives the p values for the coefficients of the intercept, x1, x2 and x1:x2 for each pair of predictors, and the second translates these p values into booleans of statistical significance.

find_interactions <- function(df, response, signif_level=0.05){
  vars <- colnames(df)[colnames(df) != response] # remove response variable
  combos <- expand.grid(vars, vars) %>% filter(Var1 != Var2) # remove same predictors
  # get all permutations for lm
  # https://stackoverflow.com/questions/17017374/how-to-expand-grid-with-string-elements-the-half
  combos <- apply(combos, 1, sort) # sort variables A-Z in their pairs (ie sort across rows)
  combos <- t(combos) # transpose back to long df
  combos <- combos[!duplicated(combos),] # use boolean filtering to remove duplicates

  fit_lm <- function(response, x1, x2){
    summary(lm(df[[response]] ~ df[[x1]] * df[[x2]]))$coefficients[,4]
  }
  
  pvals_df <- mapply(fit_lm, response, combos[,1], combos[,2]) %>% as.data.frame() # use the supplied response, not a hard-coded variable
  column_names <- paste(combos[,1], combos[,2], sep = ":")
  colnames(pvals_df) <- column_names
  rownames(pvals_df) <- c("int", "x1", "x2", "x1:x2")
  signif_df <- data.frame(pvals_df < signif_level)
  colnames(signif_df) <- column_names
  
  list(pvals_df, signif_df)
}

Interactions for the Auto dataset, with mpg as the response. Most pairs of predictors give statistically significant interaction terms!

find_interactions(Auto %>% select(-name), "mpg")
## [[1]]
##       cylinders:displacement cylinders:horsepower cylinders:weight
## int             5.334199e-64         1.856863e-77     5.032775e-51
## x1              8.076482e-06         3.299066e-31     1.261187e-08
## x2              1.502868e-15         1.260080e-28     4.184416e-19
## x1:x2           2.237576e-08         4.826362e-21     2.828875e-07
##       acceleration:cylinders cylinders:year cylinders:origin
## int             6.125745e-09   5.471026e-05     3.473597e-46
## x1              2.990319e-02   4.477560e-02     3.307858e-13
## x2              3.150044e-02   5.988064e-11     5.861205e-01
## x1:x2           5.488879e-02   1.994514e-03     1.307719e-01
##       displacement:horsepower displacement:weight acceleration:displacement
## int             3.020057e-121        5.988019e-94              3.129597e-14
## x1               4.320329e-39        1.852096e-11              6.443529e-06
## x2               2.802363e-28        5.225958e-23              7.819670e-01
## x1:x2            1.681625e-25        1.062306e-09              1.384785e-08
##       displacement:year displacement:origin horsepower:weight
## int        8.918936e-17        4.255163e-55      1.234344e-91
## x1         1.315283e-09        8.183386e-02      2.322587e-18
## x2         1.869071e-31        2.896267e-05      5.137202e-36
## x1:x2      4.955428e-13        6.396804e-04      9.933617e-15
##       acceleration:horsepower horsepower:year horsepower:origin
## int              2.117840e-20    1.104778e-22      1.045534e-43
## x1               1.837764e-04    6.333010e-18      3.958037e-04
## x2               5.216637e-01    1.167343e-34      1.995061e-11
## x1:x2            4.446532e-14    7.366572e-22      1.948668e-06
##       acceleration:weight  weight:year origin:weight acceleration:year
## int          1.568129e-08 3.295213e-16  1.251800e-51       0.009032344
## x1           3.495387e-04 1.136624e-09  6.094951e-03       0.336094144
## x2           3.076021e-02 5.881890e-28  7.561205e-12       0.003653537
## x1:x2        4.263375e-03 8.015486e-14  4.230193e-02       0.547269163
##       acceleration:origin  origin:year
## int           0.420919335 1.568401e-11
## x1            0.002811774 1.137792e-02
## x2            0.160574086 1.679918e-15
## x1:x2         0.697108414 6.211123e-02
## 
## [[2]]
##       cylinders:displacement cylinders:horsepower cylinders:weight
## int                     TRUE                 TRUE             TRUE
## x1                      TRUE                 TRUE             TRUE
## x2                      TRUE                 TRUE             TRUE
## x1:x2                   TRUE                 TRUE             TRUE
##       acceleration:cylinders cylinders:year cylinders:origin
## int                     TRUE           TRUE             TRUE
## x1                      TRUE           TRUE             TRUE
## x2                      TRUE           TRUE            FALSE
## x1:x2                  FALSE           TRUE            FALSE
##       displacement:horsepower displacement:weight acceleration:displacement
## int                      TRUE                TRUE                      TRUE
## x1                       TRUE                TRUE                      TRUE
## x2                       TRUE                TRUE                     FALSE
## x1:x2                    TRUE                TRUE                      TRUE
##       displacement:year displacement:origin horsepower:weight
## int                TRUE                TRUE              TRUE
## x1                 TRUE               FALSE              TRUE
## x2                 TRUE                TRUE              TRUE
## x1:x2              TRUE                TRUE              TRUE
##       acceleration:horsepower horsepower:year horsepower:origin
## int                      TRUE            TRUE              TRUE
## x1                       TRUE            TRUE              TRUE
## x2                      FALSE            TRUE              TRUE
## x1:x2                    TRUE            TRUE              TRUE
##       acceleration:weight weight:year origin:weight acceleration:year
## int                  TRUE        TRUE          TRUE              TRUE
## x1                   TRUE        TRUE          TRUE             FALSE
## x2                   TRUE        TRUE          TRUE              TRUE
## x1:x2                TRUE        TRUE          TRUE             FALSE
##       acceleration:origin origin:year
## int                 FALSE        TRUE
## x1                   TRUE        TRUE
## x2                  FALSE        TRUE
## x1:x2               FALSE       FALSE

The hierarchical principle states that if an interaction effect is included, the main effects should be kept in the model even if they aren't statistically significant (i.e. if the x1:x2 p value is low enough, include x1 and x2 even if their p values are too high).
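
A minimal sketch of what this implies in formula terms, with the predictor pair chosen only for illustration:

# sketch: keep both main effects alongside the interaction (hierarchical principle)
keep_main <- lm(mpg ~ cylinders + displacement + cylinders:displacement, data = Auto)
# an interaction-only model drops the main effects and is discouraged
interaction_only <- lm(mpg ~ cylinders:displacement, data = Auto)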

However, we can't rely solely on the significance of each y ~ x1 + x2 + x1:x2 model from find_interactions(). For example, in the Auto dataset cylinders:displacement and displacement:weight are both statistically significant on their own, but when combined in the same model, lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto), only displacement:weight remains significant.

Comparing the RSE and adjusted R squared for each interaction model could yield a better analysis; a minimal sketch of such a comparison follows the model summary below.

summary(lm(formula = mpg ~ cylinders * displacement + displacement * weight, data = Auto))
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement * 
##     weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2934  -2.5184  -0.3476   1.8399  17.7723 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.262e+01  2.237e+00  23.519  < 2e-16 ***
## cylinders               7.606e-01  7.669e-01   0.992    0.322    
## displacement           -7.351e-02  1.669e-02  -4.403 1.38e-05 ***
## weight                 -9.888e-03  1.329e-03  -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03  3.426e-03  -0.872    0.384    
## displacement:weight     2.128e-05  5.002e-06   4.254 2.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7237 
## F-statistic: 205.8 on 5 and 386 DF,  p-value: < 2.2e-16
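
A minimal sketch of such a comparison, reusing the pairwise-model approach of find_interactions() with pairs chosen only for illustration (output not shown here):

# sketch: extract the RSE and adjusted R squared for a pairwise interaction model
pair_stats <- function(df, response, x1, x2){
  s <- summary(lm(df[[response]] ~ df[[x1]] * df[[x2]]))
  data.frame(pair = paste(x1, x2, sep = ":"), rse = s$sigma, adj_r_squared = s$adj.r.squared)
}
rbind(pair_stats(Auto, "mpg", "cylinders", "displacement"),
      pair_stats(Auto, "mpg", "displacement", "weight"))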

Future Considerations

  • Include legend for plot_lm()
  • Compare RSE and adjusted R squared for find_interactions()