To better understand some aspects of univariate and multivariate linear regression, the following investigations have been undertaken:
Graphical Representation of Linear Regression and Associated Outputs: a function to visualise the output of a simple model with a single predictor variable. It shows the data with the regression line, confidence intervals for the coefficients, the mean absolute error, the prediction interval and the residuals.
Categorical or Continuous?: an example of modelling a numerical variable as either continuous or ordered categorical.
Test for Linear and Non-Linear Relationships for Individual Predictors: a function that fits each predictor in a data frame individually against a common response. The predictor can be given a polynomial, square root, log base 10 or natural log transformation. The coefficient estimates, standard errors, p values, RSE and R squared of each model can then be compared.
Finding Interaction Terms: a function that fits every pair of predictors in a data frame with an interaction term and outputs the p values for comparison.
library(tidyverse)
library(ISLR)
library(MASS)
select <- dplyr::select
To get further insight into the output from the linear regression model, it is shown graphically. The function plot_lm() fits a simple linear regression of a single predictor against a response (both entered as strings).
Two plots are generated. The first shows:
- Scatterplot of the data, coloured by whether the point lies within a prediction interval (PI) at a 95% confidence level.
- Blue solid linear regression line with a grey shaded area showing a point-wise 95% confidence level, i.e. the regression line should lie within this area at a 95% confidence level.
- Black solid lines showing the upper and lower bounds of the regression line at a 95% confidence level (CI). This is analogous to the grey shaded area but is calculated from the whole data set using the standard errors of the intercept coefficient, b0, and of the regression coefficient, b1. It illustrates the interpretation of the coefficients b0 and b1 and their associated confidence intervals: in the case of b1, the average increase in the response, y, for a unit increase in the predictor, x, lies between the slopes of the upper and lower bounds.
- Small dashed lines showing the upper and lower bounds of the fitted values plus or minus the training mean absolute error (MAE). This represents how far, on average, each observation lies from the regression line along the y-axis. Note that in the three examples given below approximately 60-65% of the data points lie within this region.
- Dashed lines showing the upper and lower bounds of the prediction interval (PI) at a 95% confidence level, i.e. approximately 95% of the training data will lie within this region. When a prediction interval is given for a prediction made by the model, it is saying that, based on the training data, there is a 95% probability that the new observation will lie between these bounds.
The second plot shows:
- Scatterplot of the residuals, coloured by whether the residual lies within the MAE.
- Dotted lines showing the MAE.
The fraction of the training data lying within the MAE band and within the prediction interval can also be checked directly; a short sketch follows the function below.
plot_lm <- function(df, response, predictor, show_summary=TRUE){
  # fit regression model
  model <- lm(df[[response]] ~ df[[predictor]])
  summ <- summary(model)
  # retrieve coefficients
  b0 <- summ$coefficients[1,1]
  b0_se <- summ$coefficients[1,2]
  b1 <- summ$coefficients[2,1]
  b1_se <- summ$coefficients[2,2]
  # calculate t statistic for 95% confidence interval on coefficients
  qt <- qt(.975, nrow(df) - 2)
  # metric upper and lower limits
  df_upper <- df %>%
    mutate(CI = (b0 + qt * b0_se) + (b1 + qt * b1_se) * df[[predictor]],
           MAE = model$fitted.values + mean(abs(model$residuals)),
           PI = predict(model, df[predictor], interval = "predict")[,"upr"])
  df_lower <- df %>%
    mutate(CI = (b0 - qt * b0_se) + (b1 - qt * b1_se) * df[[predictor]],
           MAE = model$fitted.values - mean(abs(model$residuals)),
           PI = predict(model, df[predictor], interval = "predict")[,"lwr"])
  df <- df %>% mutate(in_MAE = df[[response]] < df_upper %>% pull(MAE) & df[[response]] > df_lower %>% pull(MAE),
                      in_PI = df[[response]] < df_upper %>% pull(PI) & df[[response]] > df_lower %>% pull(PI))
  df_upper <- df_upper %>% gather("metric", "value", c("CI", "MAE", "PI"))
  df_lower <- df_lower %>% gather("metric", "value", c("CI", "MAE", "PI"))
  # fitted model plot
  plot <- ggplot(df, aes(df[[predictor]], df[[response]])) +
    geom_point(aes(col = in_PI)) +
    geom_smooth(method = "lm") +
    geom_line(data = df_upper, aes(df_upper[[predictor]], value, linetype = metric)) +
    geom_line(data = df_lower, aes(df_lower[[predictor]], value, linetype = metric)) +
    labs(x = colnames(df[predictor]), y = colnames(df[response]))
  # residual plot
  df_resid_upper <- df %>%
    mutate(residual_mean = mean(model$residuals),
           MAE = mean(abs(model$residuals))) %>%
    gather("metric", "value", residual_mean:MAE)
  df_resid_lower <- df %>% mutate(metric = "MAE", value = -mean(abs(model$residuals)))
  res_plot <- ggplot(df, aes(df[[predictor]], model$residuals)) +
    geom_point(aes(col = in_MAE)) +
    geom_hline(data = df_resid_upper, aes(yintercept = value, linetype = metric)) +
    geom_hline(data = df_resid_lower, aes(yintercept = value, linetype = metric)) +
    labs(x = colnames(df[predictor]))
  if(show_summary==TRUE){
    list(res_plot, plot, summ)
  } else {
    list(res_plot, plot)
  }
}
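plot_lm() colours the points by these in_MAE and in_PI flags but does not itself return the coverage fractions. They can be computed directly outside the function; a minimal sketch for the Auto example used below (the check_* names are purely illustrative, and the warning from predicting on the training data is expected):
# fraction of Auto observations within +/- MAE of the fit and within the 95% prediction interval
check_model <- lm(horsepower ~ displacement, Auto)
check_mae <- mean(abs(residuals(check_model)))
check_pi <- predict(check_model, interval = "prediction")
mean(abs(residuals(check_model)) < check_mae)
mean(Auto$horsepower > check_pi[, "lwr"] & Auto$horsepower < check_pi[, "upr"])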
Examples:
These examples consider only one predictor in the interpretation of the model outputs; other predictors will likely also affect the response.
ISLR::Auto, horsepower of a car vs its engine size displacement in cubic inches.
plot_lm(Auto, "horsepower", "displacement")
## [[1]]
##
## [[2]]
##
## [[3]]
##
## Call:
## lm(formula = df[[response]] ~ df[[predictor]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.819 -10.695 -0.819 8.676 64.742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.306108 1.815099 22.21 <2e-16 ***
## df[[predictor]] 0.330038 0.008223 40.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.02 on 390 degrees of freedom
## Multiple R-squared: 0.8051, Adjusted R-squared: 0.8046
## F-statistic: 1611 on 1 and 390 DF, p-value: < 2.2e-16
A car with a 300 cubic inch engine is predicted to have 139 horsepower on average, plus or minus an average error of 12.5 hp. There is a 95% probability that it will have between 106 and 173 horsepower. For an increase of 100 cubic inches, the average horsepower increases by between 31 and 35.
# fit regression model
auto_model <- lm(horsepower ~ displacement, Auto)
auto_summ <- summary(auto_model)
# retrieve coefficients
b0 <- auto_summ$coefficients[1,1]
b0_se <- auto_summ$coefficients[1,2]
b1 <- auto_summ$coefficients[2,1]
b1_se <- auto_summ$coefficients[2,2]
# prediction
b0 + b1 * 300
## [1] 139.3174
# MAE
mean(abs(auto_model$residuals))
## [1] 12.49397
# prediction interval
predict(auto_model, data.frame(displacement=300), interval = "predict")
## fit lwr upr
## 1 139.3174 105.7768 172.858
# calculate t statistic for 95% confidence interval on coefficients
qt <- qt(.975, nrow(Auto) - 2)
# unit increase in predictor gives increase in response
b1 - qt * b1_se
## [1] 0.3138698
b1 + qt * b1_se
## [1] 0.3462055
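The coefficient interval above can also be read directly from confint(), and changing the interval argument to "confidence" gives the narrower interval for the mean response at displacement = 300, in contrast to the prediction interval for an individual car shown above (output omitted):
# 95% confidence intervals for the coefficients; matches the manual qt calculation above
confint(auto_model)
# confidence interval for the mean response at displacement = 300
predict(auto_model, data.frame(displacement = 300), interval = "confidence")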
ISLR::Carseats, Sales of child car seats at a store (in thousands of units) vs the local advertising budget at that store (in thousands of dollars).
plot_lm(Carseats, "Sales", "Advertising")
## [[1]]
##
## [[2]]
##
## [[3]]
##
## Call:
## lm(formula = df[[response]] ~ df[[predictor]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3770 -1.9634 -0.1037 1.7222 8.3208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7370 0.1925 35.007 < 2e-16 ***
## df[[predictor]] 0.1144 0.0205 5.583 4.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.723 on 398 degrees of freedom
## Multiple R-squared: 0.07263, Adjusted R-squared: 0.0703
## F-statistic: 31.17 on 1 and 398 DF, p-value: 4.378e-08
The intercept suggests a store with no local advertising sells around 6,737 car seats on average, and each additional $1,000 of advertising is associated with roughly 114 more seats sold.
# Carseats %>% filter(Advertising == 0) %>% summarise(mean(Sales))
MASS::Boston, average number of rooms per dwelling rm in an area of Boston vs median value of owner-occupied homes medv (in thousands of dollars) for that area.
plot_lm(Boston, "rm", "medv")
## [[1]]
##
## [[2]]
##
## [[3]]
##
## Call:
## lm(formula = df[[response]] ~ df[[predictor]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.98750 -0.24448 0.01893 0.27379 2.52898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.087639 0.059510 85.49 <2e-16 ***
## df[[predictor]] 0.053122 0.002446 21.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5054 on 504 degrees of freedom
## Multiple R-squared: 0.4835, Adjusted R-squared: 0.4825
## F-statistic: 471.8 on 1 and 504 DF, p-value: < 2.2e-16
Are some variables better treated as continuous or as categorical? In the Auto data set, cylinders could be treated as an ordered categorical variable, since it is not possible to have a fraction of a cylinder.
In this example it seems better treated as a factor, as the RSE is lower and the R squared is higher.
# linear model with cylinders as continuous
summary(lm(mpg ~ cylinders, Auto))
##
## Call:
## lm(formula = mpg ~ cylinders, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2413 -3.1832 -0.6332 2.5491 17.9168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.9155 0.8349 51.40 <2e-16 ***
## cylinders -3.5581 0.1457 -24.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.914 on 390 degrees of freedom
## Multiple R-squared: 0.6047, Adjusted R-squared: 0.6037
## F-statistic: 596.6 on 1 and 390 DF, p-value: < 2.2e-16
# linear model with cylinders as factors
Auto_df <- Auto %>% select(-name) %>% mutate(fact_cyl = as.factor(cylinders))
lm_fact_cyl <- summary(lm(mpg~fact_cyl, Auto_df))
lm_fact_cyl
##
## Call:
## lm(formula = mpg ~ fact_cyl, data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2839 -2.9037 -0.9631 2.3437 18.0265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.5500 2.3494 8.747 < 2e-16 ***
## fact_cyl4 8.7339 2.3729 3.681 0.000266 ***
## fact_cyl5 6.8167 3.5888 1.899 0.058250 .
## fact_cyl6 -0.5765 2.4053 -0.240 0.810708
## fact_cyl8 -5.5869 2.3946 -2.333 0.020153 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.699 on 387 degrees of freedom
## Multiple R-squared: 0.6413, Adjusted R-squared: 0.6376
## F-statistic: 173 on 4 and 387 DF, p-value: < 2.2e-16
The chart below shows why converting cylinders to an ordered categorical variable performs better. The blue line is the regression using the numerical continuous values and the red line is the regression using the ordered categorical version. The red line is better able to capture the mean mpg for each number of cylinders.
# create data frame of regression coefficients
plotr <- lm_fact_cyl$coefficients %>%
  as.data.frame() %>% rownames_to_column("fact_cyl")
# convert factors back to numerical for plotting purposes
plotr <- plotr %>%
  mutate(cylinders = ifelse(fact_cyl == "(Intercept)", 3, str_replace_all(fact_cyl, "\\D", "")) %>% as.numeric(),
         mpg_fact_cyl = ifelse(cylinders == 3, plotr[1,2], plotr[1,2] + Estimate))
ggplot(Auto_df, aes(cylinders, mpg)) + geom_point() + geom_smooth(method = "lm") +
  geom_line(data = plotr, aes(cylinders, mpg_fact_cyl), col = "red")
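Because a regression on a single factor fits one parameter per level, its fitted values are exactly the per-group means, which is why the red line tracks the mean mpg for each number of cylinders. This can be confirmed directly (output omitted):
# mean mpg for each number of cylinders; these equal the fitted values of the factor model
Auto_df %>% group_by(cylinders) %>% summarise(mean_mpg = mean(mpg))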
To test for linear and non-linear relationships between each individual predictor and a response, the function test_relationship() is used. For every predictor it fits a simple linear model against the response, optionally applying a transformation to the predictor.
The output is a data frame containing the regression coefficient estimate, standard error, p value, RSE, R squared and a boolean of significance for each predictor.
Arguments:
- df: data frame for analysis
- response: string naming the response variable
- n_poly: numerical polynomial order; defaults to 1, a simple linear fit
- sqrt: boolean for a square root transformation of the predictor; defaults to FALSE
- log: "10" for a base 10 log and "natural" for a natural log transformation of the predictor; defaults to FALSE
- signif_level: significance level used to assess the coefficients; defaults to 0.05
- show_intercept: boolean to show output for the intercept
test_relationship <- function(df, response, n_poly=1, sqrt=FALSE, log=FALSE,
                              signif_level=0.05, show_intercept=TRUE){
  # remove categorical variables if fitting polynomials, square root or logs
  if(n_poly > 1 | sqrt == TRUE | log != FALSE) df <- df %>% select_if(is.numeric)
  predictors <- colnames(df)[colnames(df) != response] # remove response to create predictor names vector
  # fit linear model
  fit_lm <- function(response, x, n_poly){
    if(sqrt == TRUE) {formula <- as.formula("df[[response]] ~ sqrt(df[[x]])")
    } else if(log == "10") {formula <- as.formula("df[[response]] ~ log10(df[[x]])")
    } else if(log == "natural") {formula <- as.formula("df[[response]] ~ log(df[[x]])")
    } else {
      formula <- data.frame(poly = seq(1, n_poly)) %>%
        mutate(term = case_when(poly == 1 ~ "df[[x]]",
                                poly > 1 ~ paste("I(df[[x]]^", as.character(poly), ")"))) %>%
        pull(term) %>% paste(., collapse=" + ") %>% paste("df[[response]] ~", .) %>%
        as.formula()
    }
    summary(lm(formula))
  }
  # model output as nested tables
  lm_output <- mapply(fit_lm, response, predictors, n_poly)
  colnames(lm_output) <- predictors
  model_output <- t(lm_output) %>% as.data.frame() %>% rownames_to_column("predictor")
  # extract coefficients
  coefs <- model_output %>%
    mutate(coefficients = map(coefficients, ~as.data.frame(.x)),
           coef = map(coefficients, ~.x %>% as.data.frame() %>% rownames())) %>% # levels for factors
    unnest(coef, coefficients) %>%
    rename(t_value = "t value", pval = "Pr(>|t|)")
  # extract single entry model stats
  model_stats <- model_output %>%
    unnest(rse = sigma, r.squared, .drop = TRUE)
  # combine coefficients and model stats
  output <- left_join(coefs, model_stats, by = "predictor") %>% select(-t_value) %>%
    mutate(signif = pval < signif_level)
  # show intercept or not
  if(show_intercept==FALSE) output %>% filter(coef != "(Intercept)") else output
}
Exercise 15 from chapter 3 of Introduction to Statistical Learning with Applications in R (ISLR) illustrates the use of test_relationship(). This involves prediction of per capita crime rate crim using the Boston data set.
Boston_df <- Boston %>%
  mutate(chas = as.factor(chas),
         rad = as.factor(rad))
Fit simple linear models for each predictor.
single_pred <- test_relationship(Boston_df, "crim", show_intercept = FALSE)
single_pred
## # A tibble: 20 x 28
## predictor call.x terms.x residuals.x Estimate `Std. Error` pval aliased.x
## <chr> <name> <named> <named lis> <dbl> <dbl> <dbl> <named l>
## 1 zn <lang… <terms> <dbl [506]> -0.0739 0.0161 5.51e- 6 <lgl [2]>
## 2 indus <lang… <terms> <dbl [506]> 0.510 0.0510 1.45e-21 <lgl [2]>
## 3 chas <lang… <terms> <dbl [506]> -1.89 1.51 2.09e- 1 <lgl [2]>
## 4 nox <lang… <terms> <dbl [506]> 31.2 3.00 3.75e-23 <lgl [2]>
## 5 rm <lang… <terms> <dbl [506]> -2.68 0.532 6.35e- 7 <lgl [2]>
## 6 age <lang… <terms> <dbl [506]> 0.108 0.0127 2.85e-16 <lgl [2]>
## 7 dis <lang… <terms> <dbl [506]> -1.55 0.168 8.52e-19 <lgl [2]>
## 8 rad <lang… <terms> <dbl [506]> 0.0473 2.03 9.81e- 1 <lgl [9]>
## 9 rad <lang… <terms> <dbl [506]> 0.0613 1.85 9.74e- 1 <lgl [9]>
## 10 rad <lang… <terms> <dbl [506]> 0.358 1.63 8.27e- 1 <lgl [9]>
## 11 rad <lang… <terms> <dbl [506]> 0.652 1.63 6.89e- 1 <lgl [9]>
## 12 rad <lang… <terms> <dbl [506]> 0.114 2.00 9.54e- 1 <lgl [9]>
## 13 rad <lang… <terms> <dbl [506]> 0.114 2.21 9.59e- 1 <lgl [9]>
## 14 rad <lang… <terms> <dbl [506]> 0.335 2.03 8.69e- 1 <lgl [9]>
## 15 rad <lang… <terms> <dbl [506]> 12.7 1.61 1.84e-14 <lgl [9]>
## 16 tax <lang… <terms> <dbl [506]> 0.0297 0.00185 2.36e-47 <lgl [2]>
## 17 ptratio <lang… <terms> <dbl [506]> 1.15 0.169 2.94e-11 <lgl [2]>
## 18 black <lang… <terms> <dbl [506]> -0.0363 0.00387 2.49e-19 <lgl [2]>
## 19 lstat <lang… <terms> <dbl [506]> 0.549 0.0478 2.65e-27 <lgl [2]>
## 20 medv <lang… <terms> <dbl [506]> -0.363 0.0384 1.17e-19 <lgl [2]>
## # … with 20 more variables: sigma.x <named list>, df.x <named list>,
## # r.squared.x <named list>, adj.r.squared.x <named list>, fstatistic.x <named
## # list>, cov.unscaled.x <named list>, coef <chr>, call.y <named list>,
## # terms.y <named list>, residuals.y <named list>, coefficients <named list>,
## # aliased.y <named list>, sigma.y <named list>, df.y <named list>,
## # r.squared.y <dbl>, adj.r.squared.y <named list>, fstatistic.y <named list>,
## # cov.unscaled.y <named list>, rse <dbl>, signif <lgl>
These predictors are statistically significant (i.e. all except chas):
single_pred %>% filter(signif == TRUE) %>% pull(predictor)
## [1] "zn" "indus" "nox" "rm" "age" "dis" "rad"
## [8] "tax" "ptratio" "black" "lstat" "medv"
Fitting the full model gives zn, nox, dis, rad and medv as significant.
all_pred_summ <- summary(lm(crim ~ ., Boston_df))
all_pred_summ
##
## Call:
## lm(formula = crim ~ ., data = Boston_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.910 -1.832 -0.269 0.928 74.823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.286683 7.720816 2.757 0.006052 **
## zn 0.038739 0.019656 1.971 0.049315 *
## indus -0.078839 0.087368 -0.902 0.367302
## chas1 -0.750706 1.195613 -0.628 0.530376
## nox -10.812042 5.441141 -1.987 0.047474 *
## rm 0.397653 0.621841 0.639 0.522814
## age 0.001899 0.018168 0.105 0.916815
## dis -1.016331 0.290162 -3.503 0.000503 ***
## rad2 -0.704036 2.031423 -0.347 0.729063
## rad3 0.555208 1.857126 0.299 0.765098
## rad4 0.207187 1.638875 0.126 0.899452
## rad5 0.491248 1.668619 0.294 0.768575
## rad6 -0.925776 2.011933 -0.460 0.645620
## rad7 1.614476 2.178361 0.741 0.458966
## rad8 1.608235 2.069816 0.777 0.437541
## rad24 12.045021 2.440130 4.936 1.1e-06 ***
## tax -0.003125 0.005376 -0.581 0.561365
## ptratio -0.351181 0.206908 -1.697 0.090285 .
## black -0.007033 0.003688 -1.907 0.057075 .
## lstat 0.121993 0.076796 1.589 0.112815
## medv -0.205330 0.061673 -3.329 0.000937 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.447 on 485 degrees of freedom
## Multiple R-squared: 0.4604, Adjusted R-squared: 0.4382
## F-statistic: 20.69 on 20 and 485 DF, p-value: < 2.2e-16
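The significant terms can also be pulled out of the summary programmatically rather than read off the table; the line below returns the intercept together with zn, nox, dis, rad24 and medv (output omitted):
# coefficients of the full model with p values below 0.05
rownames(all_pred_summ$coefficients)[all_pred_summ$coefficients[, "Pr(>|t|)"] < 0.05]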
Plot the multivariate coefficients against the univariate coefficients. The dashed line shows where the coefficients are equal. Most coefficients are similar, except nox.
# attach category numbers to predictor strings
uni_pred <- single_pred %>%
  mutate(predictor = paste(predictor, str_replace(coef, "df\\[\\[x\\]\\]", ""), sep = "")) %>%
  rename(univariate_coefficient = Estimate)
# data frame of multiple regression
multi_pred <- data.frame(multivariate_coefficient = all_pred_summ$coefficients[, "Estimate"]) %>%
  rownames_to_column("predictor")
# join regression results data frames and plot
coeffs <- left_join(uni_pred, multi_pred)
ggplot(coeffs, aes(univariate_coefficient, multivariate_coefficient)) + geom_point() +
  geom_abline(slope=1, intercept=0, linetype=2) +
  geom_text(data = coeffs %>% filter(univariate_coefficient > 30), aes(label = predictor), vjust=-1)
Check whether there are any third-order polynomial relationships.
poly_pred <- test_relationship(Boston_df, "crim", n_poly = 3, show_intercept = FALSE)
poly_pred
## # A tibble: 33 x 28
## predictor call.x terms.x residuals.x Estimate `Std. Error` pval aliased.x
## <chr> <name> <named> <named lis> <dbl> <dbl> <dbl> <named l>
## 1 zn <lang… <terms> <dbl [506]> -3.32e-1 0.110 2.61e- 3 <lgl [4]>
## 2 zn <lang… <terms> <dbl [506]> 6.48e-3 0.00386 9.38e- 2 <lgl [4]>
## 3 zn <lang… <terms> <dbl [506]> -3.78e-5 0.0000314 2.30e- 1 <lgl [4]>
## 4 indus <lang… <terms> <dbl [506]> -1.97e+0 0.482 5.30e- 5 <lgl [4]>
## 5 indus <lang… <terms> <dbl [506]> 2.52e-1 0.0393 3.42e-10 <lgl [4]>
## 6 indus <lang… <terms> <dbl [506]> -6.98e-3 0.000957 1.20e-12 <lgl [4]>
## 7 nox <lang… <terms> <dbl [506]> -1.28e+3 170. 2.76e-13 <lgl [4]>
## 8 nox <lang… <terms> <dbl [506]> 2.25e+3 280. 6.81e-15 <lgl [4]>
## 9 nox <lang… <terms> <dbl [506]> -1.25e+3 149. 6.96e-16 <lgl [4]>
## 10 rm <lang… <terms> <dbl [506]> -3.92e+1 31.3 2.12e- 1 <lgl [4]>
## # … with 23 more rows, and 20 more variables: sigma.x <named list>, df.x <named
## # list>, r.squared.x <named list>, adj.r.squared.x <named list>,
## # fstatistic.x <named list>, cov.unscaled.x <named list>, coef <chr>,
## # call.y <named list>, terms.y <named list>, residuals.y <named list>,
## # coefficients <named list>, aliased.y <named list>, sigma.y <named list>,
## # df.y <named list>, r.squared.y <dbl>, adj.r.squared.y <named list>,
## # fstatistic.y <named list>, cov.unscaled.y <named list>, rse <dbl>,
## # signif <lgl>
The predictors below have significant coefficients for all three polynomial orders. The R squared values also increase, though only from poor to slightly less poor!
# find predictors with all 3 polynomial orders being significant
poly_signif <- poly_pred %>%
  group_by(predictor) %>% summarise(sum_signif = sum(signif)) %>% filter(sum_signif == 3) %>% pull(predictor)
# compare r squared for simple models and polynomial models each with single predictors
inner_join(single_pred, poly_pred, by = c("predictor", "coef"), suffix = c(".single", ".poly")) %>%
  select(predictor, matches("signif|r.squared")) %>% filter(predictor %in% poly_signif)
## # A tibble: 5 x 11
## predictor r.squared.x.sin… adj.r.squared.x… r.squared.y.sin… adj.r.squared.y…
## <chr> <named list> <named list> <dbl> <named list>
## 1 indus <dbl [1]> <dbl [1]> 0.165 <dbl [1]>
## 2 nox <dbl [1]> <dbl [1]> 0.177 <dbl [1]>
## 3 dis <dbl [1]> <dbl [1]> 0.144 <dbl [1]>
## 4 ptratio <dbl [1]> <dbl [1]> 0.0841 <dbl [1]>
## 5 medv <dbl [1]> <dbl [1]> 0.151 <dbl [1]>
## # … with 6 more variables: signif.single <lgl>, r.squared.x.poly <named list>,
## # adj.r.squared.x.poly <named list>, r.squared.y.poly <dbl>,
## # adj.r.squared.y.poly <named list>, signif.poly <lgl>
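For completeness, the sqrt and log transformations are requested in the same way as n_poly; for example, a square root version of the screen above could be run as follows (a usage sketch, output not shown):
# square root transformation of each numeric predictor against crim
sqrt_pred <- test_relationship(Boston_df, "crim", sqrt = TRUE, show_intercept = FALSE)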
One potential approach to finding interaction terms is to fit a series of two-predictor models of the form y ~ x1 + x2 + x1:x2 for every pair of predictors.
The function find_interactions() does this for a single response (entered as a string) at a set significance level. It outputs two tables: the first gives the p values for the coefficients of the intercept, x1, x2 and x1:x2 for each pair of predictors, and the second translates these p values into a boolean of statistical significance.
find_interactions <- function(df, response, signif_level=0.05){
  vars <- colnames(df)[colnames(df) != response] # remove response variable
  combos <- expand.grid(vars, vars) %>% filter(Var1 != Var2) # remove pairs of identical predictors
  # get all permutations for lm
  # https://stackoverflow.com/questions/17017374/how-to-expand-grid-with-string-elements-the-half
  combos <- apply(combos, 1, sort) # sort variables A-Z in their pairs (i.e. sort across rows)
  combos <- t(combos) # transpose back to long df
  combos <- combos[!duplicated(combos),] # use boolean filtering to remove duplicates
  fit_lm <- function(response, x1, x2){
    summary(lm(df[[response]] ~ df[[x1]] * df[[x2]]))$coefficients[,4]
  }
  pvals_df <- mapply(fit_lm, response, combos[,1], combos[,2]) %>% as.data.frame()
  column_names <- paste(combos[,1], combos[,2], sep = ":")
  colnames(pvals_df) <- column_names
  rownames(pvals_df) <- c("int", "x1", "x2", "x1:x2")
  signif_df <- data.frame(pvals_df < signif_level)
  colnames(signif_df) <- column_names
  list(pvals_df, signif_df)
}
Interactions for the Auto dataset. Most pairs of predictors have a statistically significant interaction term!
find_interactions(Auto %>% select(-name), "mpg")
## [[1]]
## cylinders:displacement cylinders:horsepower cylinders:weight
## int 5.334199e-64 1.856863e-77 5.032775e-51
## x1 8.076482e-06 3.299066e-31 1.261187e-08
## x2 1.502868e-15 1.260080e-28 4.184416e-19
## x1:x2 2.237576e-08 4.826362e-21 2.828875e-07
## acceleration:cylinders cylinders:year cylinders:origin
## int 6.125745e-09 5.471026e-05 3.473597e-46
## x1 2.990319e-02 4.477560e-02 3.307858e-13
## x2 3.150044e-02 5.988064e-11 5.861205e-01
## x1:x2 5.488879e-02 1.994514e-03 1.307719e-01
## displacement:horsepower displacement:weight acceleration:displacement
## int 3.020057e-121 5.988019e-94 3.129597e-14
## x1 4.320329e-39 1.852096e-11 6.443529e-06
## x2 2.802363e-28 5.225958e-23 7.819670e-01
## x1:x2 1.681625e-25 1.062306e-09 1.384785e-08
## displacement:year displacement:origin horsepower:weight
## int 8.918936e-17 4.255163e-55 1.234344e-91
## x1 1.315283e-09 8.183386e-02 2.322587e-18
## x2 1.869071e-31 2.896267e-05 5.137202e-36
## x1:x2 4.955428e-13 6.396804e-04 9.933617e-15
## acceleration:horsepower horsepower:year horsepower:origin
## int 2.117840e-20 1.104778e-22 1.045534e-43
## x1 1.837764e-04 6.333010e-18 3.958037e-04
## x2 5.216637e-01 1.167343e-34 1.995061e-11
## x1:x2 4.446532e-14 7.366572e-22 1.948668e-06
## acceleration:weight weight:year origin:weight acceleration:year
## int 1.568129e-08 3.295213e-16 1.251800e-51 0.009032344
## x1 3.495387e-04 1.136624e-09 6.094951e-03 0.336094144
## x2 3.076021e-02 5.881890e-28 7.561205e-12 0.003653537
## x1:x2 4.263375e-03 8.015486e-14 4.230193e-02 0.547269163
## acceleration:origin origin:year
## int 0.420919335 1.568401e-11
## x1 0.002811774 1.137792e-02
## x2 0.160574086 1.679918e-15
## x1:x2 0.697108414 6.211123e-02
##
## [[2]]
## cylinders:displacement cylinders:horsepower cylinders:weight
## int TRUE TRUE TRUE
## x1 TRUE TRUE TRUE
## x2 TRUE TRUE TRUE
## x1:x2 TRUE TRUE TRUE
## acceleration:cylinders cylinders:year cylinders:origin
## int TRUE TRUE TRUE
## x1 TRUE TRUE TRUE
## x2 TRUE TRUE FALSE
## x1:x2 FALSE TRUE FALSE
## displacement:horsepower displacement:weight acceleration:displacement
## int TRUE TRUE TRUE
## x1 TRUE TRUE TRUE
## x2 TRUE TRUE FALSE
## x1:x2 TRUE TRUE TRUE
## displacement:year displacement:origin horsepower:weight
## int TRUE TRUE TRUE
## x1 TRUE FALSE TRUE
## x2 TRUE TRUE TRUE
## x1:x2 TRUE TRUE TRUE
## acceleration:horsepower horsepower:year horsepower:origin
## int TRUE TRUE TRUE
## x1 TRUE TRUE TRUE
## x2 FALSE TRUE TRUE
## x1:x2 TRUE TRUE TRUE
## acceleration:weight weight:year origin:weight acceleration:year
## int TRUE TRUE TRUE TRUE
## x1 TRUE TRUE TRUE FALSE
## x2 TRUE TRUE TRUE TRUE
## x1:x2 TRUE TRUE TRUE FALSE
## acceleration:origin origin:year
## int FALSE TRUE
## x1 TRUE TRUE
## x2 FALSE TRUE
## x1:x2 FALSE FALSE
The hierarchical principle states that if there is an interaction effect, the main effects should be left in the model even if they are not statistically significant (i.e. if the x1:x2 p value is low enough, include x1 and x2 even if their p values are too high).
However, we can't just look at each y ~ x1 + x2 + x1:x2 fit for significance using find_interactions(). For example, in the Auto dataset cylinders:displacement and displacement:weight are both statistically significant on their own, but when combined in the same model, lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto), only displacement:weight remains significant.
Comparing the RSE and adjusted R squared of each interaction model could yield a better analysis; a sketch of this comparison follows the summary below.
summary(lm(formula = mpg ~ cylinders * displacement + displacement * weight, data = Auto))
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2934 -2.5184 -0.3476 1.8399 17.7723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.262e+01 2.237e+00 23.519 < 2e-16 ***
## cylinders 7.606e-01 7.669e-01 0.992 0.322
## displacement -7.351e-02 1.669e-02 -4.403 1.38e-05 ***
## weight -9.888e-03 1.329e-03 -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03 3.426e-03 -0.872 0.384
## displacement:weight 2.128e-05 5.002e-06 4.254 2.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7237
## F-statistic: 205.8 on 5 and 386 DF, p-value: < 2.2e-16
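One rough way to make that comparison is to record the RSE and adjusted R squared of every pairwise interaction model and rank them. A minimal sketch is below; the interaction_fit_stats() helper is illustrative and not part of the functions above.
interaction_fit_stats <- function(df, response){
  vars <- setdiff(colnames(df), response)
  pairs <- t(combn(vars, 2)) # every unordered pair of predictors
  stats <- apply(pairs, 1, function(p){
    s <- summary(lm(reformulate(paste(p[1], "*", p[2]), response), data = df))
    c(rse = s$sigma, adj_r_squared = s$adj.r.squared)
  })
  data.frame(pair = paste(pairs[, 1], pairs[, 2], sep = ":"), t(stats))
}
# e.g. interaction_fit_stats(Auto %>% select(-name), "mpg") %>% arrange(rse)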