The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy tibbles.
It summarizes key information about models in tidy
tibble()s. broom provides three verbs
to make it
convenient to interact with model objects:
tidy() summarizes information about model components glance() reports information about the entire model augment() adds information about observations to a data set
#vignette("broom")
#Installation
# we recommend installing the entire tidyverse
# modeling set, which includes broom:
#install.packages("tidymodels")
# alternatively, to install just broom:
#install.packages("broom")
#install.packages("https://cran.rstudio.com/bin/windows/Rtools")
# to get the development version from GitHub:
#install.packages("devtools")
#devtools::install_github("tidymodels/broom")
tidy() produces a tibble() where each row contains information about an important component of a model.For regression models, this often corresponds to regression coefficients. This is can be useful if you want to inspect a model or create custom visualizations.
#devtools::install_github("kassmbara/datarium")
#library(tidyverse) # for Data Manipulation
#install.packages("datarium")
library(broom)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ dials 1.1.0 ✔ rsample 1.1.1
## ✔ dplyr 1.1.0 ✔ tibble 3.1.8
## ✔ ggplot2 3.4.0 ✔ tidyr 1.3.0
## ✔ infer 1.0.4 ✔ tune 1.0.1
## ✔ modeldata 1.1.0 ✔ workflows 1.1.2
## ✔ parsnip 1.0.3 ✔ workflowsets 1.0.0
## ✔ purrr 1.0.1 ✔ yardstick 1.1.0
## ✔ recipes 1.0.4
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2
## ──
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
library(datarium)
data("marketing", package = "datarium")
head(marketing, 4)
## youtube facebook newspaper sales
## 1 276.12 45.36 83.04 26.52
## 2 53.40 47.16 54.12 12.48
## 3 20.64 55.08 83.16 11.16
## 4 181.80 49.56 70.20 22.20
We want to build a model for estimating sales based on the advertising budget invested in youtube, facebook and newspaper, as follow:
model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(model)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5932 -1.0690 0.2902 1.4272 3.3951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.526667 0.374290 9.422 <2e-16 ***
## youtube 0.045765 0.001395 32.809 <2e-16 ***
## facebook 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
summary(model)$coefficient # to see only the regression cofficient part
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.526667243 0.374289884 9.4222884 1.267295e-17
## youtube 0.045764645 0.001394897 32.8086244 1.509960e-81
## facebook 0.188530017 0.008611234 21.8934961 1.505339e-54
## newspaper -0.001037493 0.005871010 -0.1767146 8.599151e-01
For a given the predictor, the t-statistic evaluates whether or not there is significant association between the predictor and the outcome variable, that is whether the beta coefficient of the predictor is significantly different from zero.
It can be seen that, changing in youtube and facebook advertising budget are significantly associated to changes in sales while changes in newspaper budget is not significantly associated with sales.
For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one unit increase in predictor, holding all other predictors fixed.
For example, for a fixed amount of youtube and newspaper advertising budget, spending an additional 1 000 dollars on facebook advertising leads to an increase in sales by approximately 0.1885*1000 = 189 sale units, on average.
The youtube coefficient suggests that for every 1 000 dollars increase in youtube advertising budget, holding all other predictors constant, we can expect an increase of 0.045*1000 = 45 sales units, on average.
We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and newspaper advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.
As the newspaper variable is not significant, it is possible to remove it from the model:
model2 <- lm(sales ~ youtube + facebook, data = marketing)
summary(model)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5932 -1.0690 0.2902 1.4272 3.3951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.526667 0.374290 9.422 <2e-16 ***
## youtube 0.045765 0.001395 32.809 <2e-16 ***
## facebook 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Finally, our model equation can be written as follow: sales = 3.5 + 0.045youtube + 0.187facebook.
The confidence interval of the model coefficient can be extracted as follow:
confint(model2)
## 2.5 % 97.5 %
## (Intercept) 2.80841159 4.20222820
## youtube 0.04301292 0.04849671
## facebook 0.17213877 0.20384969
As we have seen in simple linear regression, the overall quality of the model can be assessed by examining the R-squared (R2) and Residual Standard Error (RSE).
In multiple linear regression, the R2 represents the correlation coefficient between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y. For this reason, the value of R will always be positive and will range from zero to one.
R2 represents the proportion of variance, in the outcome variable y, that may be predicted by knowing the value of the x variables. An R2 value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
A problem with the R2, is that, it will always increase when more variables are added to the model, even if those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust the R2 by taking into account the number of predictor variables.
The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of x variables included in the prediction model.
The RSE estimate gives a measure of error of prediction. The lower the RSE, the more accurate the model (on the data in hand).The error-rate can be estimated by
dividing the RSE by the mean outcome variable:
sigma(model2)/mean(marketing$sales)
## [1] 0.1199045
These model objects are messy. why?
1.Extracting coefficients takes multiple steps:data.frame(coef(summary(lmfit))
Information in row names (can’t combine models)
Column names are inconvenient (access with $“Pr(>|t|)”,converts to Pr…t..)
Information is computed in print method, not stored
Thus, broom’s tidy() method does the work for us!
broom defines tidying methods for extracting three kinds of statistics from an object:
augment(): observation-level statistics tidy(): component-level statistics glance(): model-level statistics
broom takes model objects and turns them into tidy data frames that can be used with tidy tools such as dplyr, ggplot2, tidyverse, etc
library(broom)
tidy(model2) # each row is a coefficient
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.51 0.353 9.92 4.57e-19
## 2 youtube 0.0458 0.00139 32.9 5.44e-82
## 3 facebook 0.188 0.00804 23.4 9.78e-59
#One function to call,Information stored in columns,never row names,
# Convenient column names including the p-value column
each row is an observation from the original data
augment(model2) # note that new columns start with .
## # A tibble: 200 × 9
## sales youtube facebook .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 26.5 276. 45.4 24.7 1.85 0.0140 2.02 0.00406 0.925
## 2 12.5 53.4 47.2 14.8 -2.33 0.0188 2.02 0.00871 -1.17
## 3 11.2 20.6 55.1 14.8 -3.64 0.0295 2.01 0.0341 -1.83
## 4 22.2 182. 49.6 21.1 1.06 0.0124 2.02 0.00117 0.528
## 5 15.5 217. 13.0 15.9 -0.389 0.00951 2.02 0.000120 -0.194
## 6 8.64 10.4 58.7 15.0 -6.37 0.0347 1.97 0.124 -3.22
## 7 14.2 69 39.4 14.1 0.0981 0.0129 2.02 0.0000105 0.0490
## 8 15.8 144. 23.5 14.5 1.31 0.00576 2.02 0.000823 0.653
## 9 5.76 10.3 2.52 4.45 1.31 0.0271 2.02 0.00401 0.658
## 10 12.7 240. 3.12 15.1 -2.34 0.0171 2.02 0.00797 -1.17
## # … with 190 more rows
glance(model2)
## # A tibble: 1 × 12
## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.897 0.896 2.02 860. 4.83e-98 2 -423. 853. 867. 802.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
#Nonlinear least squares: before
n <- nls(mpg ~ k * e ^ wt, data = mtcars, start = list(k = 1, e = 2))
summary(n)
##
## Formula: mpg ~ k * e^wt
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## k 49.65969 3.78876 13.11 5.96e-14 ***
## e 0.74559 0.01986 37.54 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.668 on 30 degrees of freedom
##
## Number of iterations to convergence: 10
## Achieved convergence tolerance: 2.043e-06
tidy(n) # each row is one estimated parameter
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 k 49.7 3.79 13.1 5.96e-14
## 2 e 0.746 0.0199 37.5 8.86e-27
augment(n) # each row is an observation from the original data
## # A tibble: 32 × 4
## mpg wt .fitted .resid
## <dbl> <dbl> <dbl> <dbl>
## 1 21 2.62 23.0 -2.01
## 2 21 2.88 21.4 -0.352
## 3 22.8 2.32 25.1 -2.33
## 4 21.4 3.22 19.3 2.08
## 5 18.7 3.44 18.1 0.611
## 6 18.1 3.46 18.0 0.117
## 7 14.3 3.57 17.4 -3.11
## 8 24.4 3.19 19.5 4.93
## 9 22.8 3.15 19.7 3.10
## 10 19.2 3.44 18.1 1.11
## # … with 22 more rows
glance(n) # one row for the model
## # A tibble: 1 × 9
## sigma isConv finTol logLik AIC BIC deviance df.residual nobs
## <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 2.67 TRUE 0.00000204 -75.8 158. 162. 214. 30 32
m <- as.matrix(iris[, -5])
k <- kmeans(m, 3)
k
## K-means clustering with 3 clusters of sizes 50, 38, 62
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.006000 3.428000 1.462000 0.246000
## 2 6.850000 3.073684 5.742105 2.071053
## 3 5.901613 2.748387 4.393548 1.433871
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
## [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
## [149] 2 3
##
## Within cluster sum of squares by cluster:
## [1] 15.15100 23.87947 39.82097
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
tidy(k) # each row is one cluster
## # A tibble: 3 × 7
## Sepal.Length Sepal.Width Petal.Length Petal.Width size withinss cluster
## <dbl> <dbl> <dbl> <dbl> <int> <dbl> <fct>
## 1 5.01 3.43 1.46 0.246 50 15.2 1
## 2 6.85 3.07 5.74 2.07 38 23.9 2
## 3 5.90 2.75 4.39 1.43 62 39.8 3
head(augment(k, m)) # each row is one assignment
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width .cluster
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 1
## 2 4.9 3 1.4 0.2 1
## 3 4.7 3.2 1.3 0.2 1
## 4 4.6 3.1 1.5 0.2 1
## 5 5 3.6 1.4 0.2 1
## 6 5.4 3.9 1.7 0.4 1
glance(k) # one row describing the entire clustering operation
## # A tibble: 1 × 4
## totss tot.withinss betweenss iter
## <dbl> <dbl> <dbl> <int>
## 1 681. 78.9 603. 3
#Why are tidy models useful? 1. ggplot2 can visualize tidy model data
td <- tidy(model2, conf.int = TRUE)
ggplot(td, aes(estimate, term, color = term))+
geom_point()+
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high))
# Multiple Models
Tidy models can be combined and compared • different parameters • different methods • bootstrap replicates • subgroup models (within each country, gene…) • ensemble voting
augumented <- augment(model2)
ggplot(augumented, aes(facebook,.fitted)) +geom_line()
Basic Use: Plotting Results from One or More Regression Models Generating dot-and-whisker plots from model objects generated by the most commonly used regression functions is straightforward. To make a basic dot-and-whisker plot of any single model object of a class supported by parameters::parameters, simply pass it to dwplot.
#Package preload
#install.packages("dotwhisker")
library(dotwhisker)
library(dplyr)
# draw a dot-and-whisker plot
dwplot(model2)
By default, the whiskers span the 95% confidence interval. To change the width of the confidence interval, specify a ci argument to pass to parameters::parameters():
dwplot(model2, ci = .60) + # using 60% of confidence intervals
theme(legend.position = "none")
# run a regression compatible with tidy
m1 <- lm(mpg ~ wt + cyl + disp + gear, data = mtcars)
# draw a dot-and-whisker plot
dwplot(m1)
Plotting the results of more than one regression model is just as
easy.
Just pass the model objects to dwplot as a list. The dodge_size argument is used to adjust the space between the estimates of one variable when multiple models are presented in a single plot. Its default value of .4 will usually be fine, but, depending on the dimensions of the desired plot, more pleasing results may be achieved by setting dodge_size to lower values when the plotted results include a relatively small number of predictors or to higher values when many models appear on the same plot.
m2 <- update(m1, . ~ . + hp) # add another predictor
m3 <- update(m2, . ~ . + am) # and another
dwplot(list(m1, m2, m3))
Model intercepts are rarely theoretically interesting (see
Kastellec and Leoni 2007, 765),so they are excluded by dwplot by
default. They are easy to include if desired, however, by setting
the show_intercept argument to true.
dwplot(list(m1, m2, m3), show_intercept = TRUE)
Users are free to customize the order of the models and variables to
present with the arguments model_order and vars_order. Moreover, the
output of dwplot is a ggplot object. Add or change any ggplot layers
after calling dwplot to achieve the desired presentation. Users can
provide a named character vector to relabel_predictors, a dotwhisker
function, conveniently renames the predictors. Note that both vars_order
and relabel_predictors changes the presenting order of variables. When
both are used, the later overwrites the former.
dwplot(list(m1, m2, m3),
vline = geom_vline(
xintercept = 0,
colour = "grey60",
linetype = 2
),
vars_order = c("am", "cyl", "disp", "gear", "hp", "wt"),
model_order = c("Model 2", "Model 1", "Model 3")
) %>% # plot line at zero _behind_coefs
relabel_predictors(
c(
am = "Manual",
cyl = "Cylinders",
disp = "Displacement",
wt = "Weight",
gear = "Gears",
hp = "Horsepower"
)
) +
theme_bw(base_size = 4) +
# Setting `base_size` for fit the theme
# No need to set `base_size` in most usage
xlab("Coefficient Estimate") + ylab("") +
geom_vline(xintercept = 0,
colour = "grey60",
linetype = 2) +
ggtitle("Predicting Gas Mileage") +
theme(
plot.title = element_text(face = "bold"),
legend.position = c(0.007, 0.01),
legend.justification = c(0, 0),
legend.background = element_rect(colour = "grey80"),
legend.title = element_blank()
)
##Plotting Results Stored in a Tidy Data Frame
# regression compatible with tidy
m1_df <- broom::tidy(m1) # create data.frame of regression results
m1_df # a tidy data.frame available for dwplot
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 43.5 4.86 8.96 0.00000000142
## 2 wt -3.79 1.08 -3.51 0.00161
## 3 cyl -1.78 0.614 -2.91 0.00722
## 4 disp 0.00694 0.0120 0.578 0.568
## 5 gear -0.490 0.790 -0.621 0.540
dwplot(m1_df) #same as dwplot(m1)
Using tidy can be helpful when one wishes to omit certain model
estimates from the plot. To illustrate, we drop the intercept
(although this is in fact done by dwplot automatically by default):
m1_df <-
broom::tidy(m1) %>% filter(term != "(Intercept)") %>% mutate(model = "Model 1")
m2_df <-
broom::tidy(m2) %>% filter(term != "(Intercept)") %>% mutate(model = "Model 2")
two_models <- rbind(m1_df, m2_df)
dwplot(two_models)
# Transform cyl to factor variable in the data
m_factor <-
lm(mpg ~ wt + cyl + disp + gear, data = mtcars %>% mutate(cyl = factor(cyl)))
# Remove all model estimates that start with cyl*
m_factor_df <- broom::tidy(m_factor) %>%
filter(!grepl('cyl*', term))
dwplot(m_factor_df)
# Run model on subsets of data, save results as tidy df, make a model variable,
#and relabel predictors
by_trans <- mtcars %>%
group_by(am) %>% # group data by trans
do(broom::tidy(lm(mpg ~ wt + cyl + disp + gear, data = .))) %>%
# run model on each grp
rename(model = am) %>% # make model variable
relabel_predictors(c(
wt = "Weight",
# relabel predictors
cyl = "Cylinders",
disp = "Displacement",
gear = "Gear"
))
by_trans
## # A tibble: 10 × 6
## # Groups: model [2]
## model term estimate std.error statistic p.value
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 0 Weight -2.81 1.27 -2.22 0.0434
## 2 0 Cylinders -1.30 0.599 -2.17 0.0473
## 3 0 Displacement 0.00692 0.0129 0.534 0.601
## 4 0 Gear 1.26 1.81 0.696 0.498
## 5 0 (Intercept) 30.7 7.41 4.15 0.000986
## 6 1 Weight -7.53 2.77 -2.71 0.0265
## 7 1 Cylinders 0.198 1.70 0.116 0.910
## 8 1 Displacement -0.0146 0.0315 -0.464 0.655
## 9 1 Gear -1.08 2.14 -0.506 0.627
## 10 1 (Intercept) 48.4 11.1 4.34 0.00247
dwplot(by_trans,
vline = geom_vline(
xintercept = 0,
colour = "grey60",
linetype = 2
)) + # plot line at zero _behind_ coefs
theme_bw(base_size = 4) + xlab("Coefficient Estimate") + ylab("") +
ggtitle("Predicting Gas Mileage by Transmission Type") +
theme(
plot.title = element_text(face = "bold"),
legend.position = c(0.007, 0.01),
legend.justification = c(0, 0),
legend.background = element_rect(colour = "grey80"),
legend.title.align = .5
) +
scale_colour_grey(
start = .3,
end = .7,
name = "Transmission",
breaks = c(0, 1),
labels = c("Automatic", "Manual")
)
dwplot(
by_trans,
vline = geom_vline(
xintercept = 0,
colour = "grey60",
linetype = 2
),
# plot line at zero _behind_ coefs
dot_args = list(aes(shape = model)),
whisker_args = list(aes(linetype = model))
) +
theme_bw(base_size = 4) + xlab("Coefficient Estimate") + ylab("") +
ggtitle("Predicting Gas Mileage by Transmission Type") +
theme(
plot.title = element_text(face = "bold"),
legend.position = c(0.007, 0.01),
legend.justification = c(0, 0),
legend.background = element_rect(colour = "grey80"),
legend.title.align = .5
) +
scale_colour_grey(
start = .1,
end = .1,
# if start and end same value, use same colour for all models
name = "Model",
breaks = c(0, 1),
labels = c("Automatic", "Manual")
) +
scale_shape_discrete(
name = "Model",
breaks = c(0, 1),
labels = c("Automatic", "Manual")
) +
guides(
shape = guide_legend("Model"),
colour = guide_legend("Model")
) # Combine the legends for shape and color