This is a reproduction of the model-basics workflow from Chapter 23 of R for Data Science. The purpose was to learn the concepts. I skipped the simple-model section and instead concentrate on visualizing models, which adds functionality from the modelr package. I deviate a little from the book because focusing on the new modelr package was what helped my learning, so what is here differs from the chapter in places, mostly extending it for my own benefit.
library(tidyverse) # ggplot2, purrr, dplyr, tidyr, readr, tibble
library(modelr)
options(na.action = na.warn)
Use modelr::data_grid() to make an evenly spaced grid for modelling. This is a little unintuitive at first, but examples help. I’ll step through the modelr::sim1 data set.
modelr::sim1 # sim1 data set
Visualize the data set:
sim1 %>%
ggplot(aes(x, y)) +
geom_point()
Create a model:
sim1_mod <- lm(y ~ x, data = sim1)
Initially this was confusing, but after exploring the documentation I realized data_grid() lets us generate an evenly spaced grid covering the unique values of the variables we give it. This capability is important because data is often bunched up (typically with continuous variables), and visualizing a model along the x-axis works much better when the x-points are evenly distributed.
As it turns out, sim1 is evenly spaced, so the data grid we create is not gaining us much (it does reduce the repetition in x-points), but it becomes very important when the data is not uniformly spaced.
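To make this concrete, here is a minimal sketch on a made-up tibble (df is hypothetical, not from the book) showing that data_grid() keeps one row per unique value, sorted:
df <- tibble(x = c(3, 1, 1, 2, 2, 2))
df %>% data_grid(x)
# expect a 3-row tibble with x = 1, 2, 3 (duplicates collapsed, values sorted)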
We want to model y as a function of x, so we set up our grid along the range of x:
grid <- sim1 %>%
data_grid(x)
grid
Next, use add_predictions() to add predictions from the model sim1_mod to the grid.
grid <- grid %>%
add_predictions(sim1_mod)
grid
Visualize the predictions by applying both the sim1 data and the grid data to a ggplot:
ggplot(data = sim1) +
geom_point(aes(x = x, y = y)) +
# Overlay grid predictions on ggplot
geom_line(aes(x = x, y = pred), data = grid, color = 'red', size = 1)
The predictions tell us what pattern the model captured, and the residuals tell us what the model missed. The key is to look for patterns in the residuals. Use add_residuals() to add residuals to the original sim1 data. Note that it’s important to add residuals to the initial data frame (and not grid!) because the residuals measure how far the original data points are from the model. The data grid is for visualizing the model predictions, while the original data is for assessing the residuals.
sim1 <- sim1 %>%
add_residuals(sim1_mod)
sim1
Visualize the residuals:
ggplot(sim1, aes(x, resid)) +
geom_ref_line(h = 0) +
geom_point() +
scale_y_continuous(limits = c(-5,5))
The residuals look like random noise, which suggests that the model, sim1_mod, did a good job of capturing the pattern.
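Another quick check from the same chapter is a frequency polygon of the residuals; roughly symmetric noise centred on zero supports the same conclusion:
ggplot(sim1, aes(resid)) +
geom_freqpoly(binwidth = 0.5)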
Instead of using lm() to fit a straight line, you can use loess() to fit a smooth curve. Repeat the process of model fitting, grid generation, predictions, and visualization on sim1 using loess() instead of lm(). How does the result compare to geom_smooth()?
Generate the model and the data grid:
# loess model
sim1_mod_loess <- loess(y ~ x, data = sim1)
# Generate grid
grid <- sim1 %>%
data_grid(x) %>%
add_predictions(sim1_mod_loess)
Visualize the predictions and compare with geom_smooth():
ggplot(sim1, aes(x = x)) +
geom_point(aes(y = y)) +
geom_smooth(aes(y = y)) +
geom_line(aes(y = pred), data = grid, colour = "red", size = .5)
The loess model’s predictions track the line generated by geom_smooth() almost exactly, which makes sense because geom_smooth() defaults to a loess fit for small data sets.
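geom_smooth() picks loess automatically for fewer than 1,000 observations, which is why the two curves agree; to make the comparison explicit, you can request the method directly (a small sketch; se = FALSE just hides the confidence band):
ggplot(sim1, aes(x, y)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
geom_line(aes(y = pred), data = grid, colour = "red", linetype = "dashed")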
What does geom_ref_line() do? What package does it come from? Why is displaying a reference line in plots showing residuals useful and important?
We need to visualize the residuals. First, add the residuals to sim1. Note that this replaces the previous resid column unless you supply a new name (e.g. add_residuals(sim1_mod_loess, "some_new_name_for_loess_residuals")).
sim1 <- sim1 %>%
add_residuals(sim1_mod_loess)
sim1
Now we can visualize the residuals using geom_ref_line():
ggplot(sim1, aes(x, resid)) +
geom_ref_line(h = 0, size = 2, colour = 'steelblue') +
geom_point() +
scale_y_continuous(limits = c(-5,5))
We can see that a reference line has been added at y = 0. geom_ref_line() comes from modelr (not ggplot2). A reference line is useful in residual plots because it anchors the eye at zero, making any systematic pattern above or below the line easy to spot.
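For comparison, geom_ref_line(h = 0) is essentially a convenience wrapper around a plain horizontal line; a roughly equivalent ggplot2-only version looks like this (white and size 2 are modelr’s defaults):
ggplot(sim1, aes(x, resid)) +
geom_hline(yintercept = 0, colour = "white", size = 2) + # what geom_ref_line(h = 0) draws
geom_point()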
Note that I skip the model_matrix() function, which converts a data set and a formula (e.g. y ~ x) into the model matrix that describes what R is modelling under the hood. If interested, check the documentation.
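For completeness, here is a one-line taste (the result is described in the comment):
sim1 %>% model_matrix(y ~ x)
# a tibble with an `(Intercept)` column of 1s plus the x column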
Next we analyze the modelr::sim2 data set, which contains a categorical predictor x with levels a, b, c, and d.
sim2
Let’s visualize:
ggplot(sim2, aes(x, y)) +
geom_point()
Next, fit a model and add predictions:
mod2 <- lm(y ~ x, data = sim2)
grid <- sim2 %>%
data_grid(x) %>%
add_predictions(mod2)
grid
Visualize the data with the predictions. Remember, grid contains one prediction per unique value of x, which for sim2 means one per category.
ggplot(sim2, aes(x = x)) +
geom_point(aes(y = y)) +
geom_point(data = grid, aes(y = pred), color = "red", size = 4)
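One caveat from the chapter: the model can only predict for levels it has seen, so asking for an unobserved category fails:
tibble(x = "e") %>%
add_predictions(mod2)
# Error: factor x has new level e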
Data set sim3 contains both continuous and categorical predictors to include in a model:
sim3
We want to model y using both x1 and x2. First, let’s visualize using a scatter plot for continuous y vs x1 and using a color for the categorical x2:
ggplot(sim3, aes(x1, y, color = x2)) +
geom_point()
Two ways to fit the data:
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
To visualize these models we need two new tricks:
We have two predictors, so we need to give data_grid() both variables. It finds all the unique values of x1 and x2 and then generates all combinations.
To generate predictions from both models simultaneously, we can use gather_predictions() which adds each prediction as a row. The complement of gather_predictions() is spread_predictions() which adds each prediction to a new column.
Using gather_predictions():
grid <- sim3 %>%
data_grid(x1, x2) %>%
gather_predictions(mod1, mod2)
grid
Using spread_predictions():
grid_spread <- sim3 %>%
data_grid(x1, x2) %>%
spread_predictions(mod1, mod2)
grid_spread
Note that the only difference between gather and spread is the layout of the predictions. Gather returns predictions in a tidy format (good for plotting), whereas spread returns a more intuitive format with one prediction column per model. We’ll continue with the gather format.
We can visualize the results using a facet_wrap():
ggplot(data = sim3, aes(x = x1, y = y, color = x2)) +
geom_point() +
geom_line(data = grid, aes(y = pred)) + # Overlay predictions
facet_wrap(~ model) # facet on model variable
We can also take this one step further using facet_grid, which really shows off the difference in model performance:
ggplot(data = sim3, aes(x = x1, y = y, color = x2)) +
geom_point() +
geom_line(data = grid, aes(y = pred)) + # Overlay predictions
facet_grid(model ~ x2) # facet on both x2 and model variable
Finally, we can take a look at the residuals to see what pattern is leftover:
sim3 <- sim3 %>%
gather_residuals(mod1, mod2)
ggplot(sim3, aes(x1, resid, color = x2)) +
geom_ref_line(h = 0) +
geom_point() +
facet_grid(model ~ x2)
We can qualitatively see that mod2 does a much better job of removing the pattern versus mod1.
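To put a rough number on that visual impression, we can compare the spread of the residuals by model (a quick sketch, not in the chapter):
sim3 %>%
group_by(model) %>%
summarise(resid_sd = sd(resid))
# expect a noticeably smaller standard deviation for mod2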
Let’s take a look at sim4:
sim4
We want to use x1 and x2 to predict y, but now both predictors are continuous. We create two models, one without interactions and one with:
mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)
Before we jump into predicting using models, it’s important to understand the data. Let’s take a look at x1 first:
sim4$x1 %>% unique()
[1] -1.0000000 -0.7777778 -0.5555556 -0.3333333 -0.1111111 0.1111111 0.3333333 0.5555556
[9] 0.7777778 1.0000000
x1 has ten uniformly spaced numeric values. A convenient way to build a grid over numeric data is to use data_grid() in conjunction with seq_range() to control the grid size. seq_range() takes the range of a vector (in our case x1) and generates a uniform sequence of n values across it.
sim4$x1 %>% seq_range(n = 5)
[1] -1.0 -0.5 0.0 0.5 1.0
Important arguments to seq_range() (besides n):
pretty = TRUE: generates a pretty sequence with rounded values:
seq_range(c(0.0123, 0.923423), n = 5)
[1] 0.0123000 0.2400808 0.4678615 0.6956423 0.9234230
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
[1] 0.0 0.2 0.4 0.6 0.8 1.0
trim = 0.2: trims off 20% of the tail values (i.e. -1 becomes -0.8):
sim4$x1 %>% seq_range(n = 5, trim = 0.2) %>% round(2)
[1] -0.8 -0.4 0.0 0.4 0.8
expand = 0.2: expands the range by 20%:
sim4$x1 %>% seq_range(n = 5, expand = 0.2)
[1] -1.2 -0.6 0.0 0.6 1.2
We want to use the models to generate predictions for visualization. We combine data_grid() with seq_range() to create a single 5 x 5 grid of (x1, x2) values, then gather predictions from both models over it:
grid <- sim4 %>%
data_grid(
x1 = seq_range(x1, n = 5),
x2 = seq_range(x2, n = 5)
) %>%
gather_predictions(mod1, mod2)
grid
Visualizing is a challenge with multiple predictors, but we can manage it with a combination of color and faceting:
ggplot(data = sim4,
aes(x = x1,
y = y,
color = x2 %>% round(2) %>% factor(),
group = x2
)
) +
geom_point() +
geom_line(data = grid,
aes(x = x1,
y = pred,
color = x2 %>% round(2) %>% factor(),
group = x2
)
) +
labs(color = "x2") +
facet_wrap(~ model, ncol = 1)
The interaction term allows the effect of x1 to depend on x2, so the model considers both predictors simultaneously.
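As with sim3, we can sanity-check the two sim4 models by comparing their residual distributions (a sketch in the spirit of the chapter’s exercises):
sim4 %>%
gather_residuals(mod1, mod2) %>%
ggplot(aes(resid, colour = model)) +
geom_freqpoly(binwidth = 0.5)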
Transformations can be used to approximate non-linear functions.
Create a non-linear data set:
set.seed(150)
sim5 <- tibble(
x = seq(0, 3.5 * pi, length.out = 50),
y = 4 * sin(x) + rnorm(length(x))
)
sim5
Visualize the data set:
ggplot(sim5, aes(x, y)) +
geom_point()
Fit models to the data using natural splines:
library(splines)
mod1 <- lm(y ~ ns(x, 1), data = sim5)
mod2 <- lm(y ~ ns(x, 2), data = sim5)
mod3 <- lm(y ~ ns(x, 3), data = sim5)
mod4 <- lm(y ~ ns(x, 4), data = sim5)
mod5 <- lm(y ~ ns(x, 5), data = sim5)
Create the data grid:
grid <- sim5 %>%
data_grid(x = seq_range(x, n = 50, expand = 0.1)) %>%
gather_predictions(mod1, mod2, mod3, mod4, mod5)
grid
Visualize the predictions:
ggplot(sim5, aes(x, y)) +
geom_point() +
geom_line(data = grid, aes(y = pred), color = "red", size = 2) +
facet_wrap(~ model)
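The chapter also warns that splines extrapolate badly outside the range of the data; widening the grid makes this visible (a sketch reusing the models above):
grid_wide <- sim5 %>%
data_grid(x = seq_range(x, n = 50, expand = 0.5)) %>%
gather_predictions(mod1, mod2, mod3, mod4, mod5)
ggplot(sim5, aes(x, y)) +
geom_point() +
geom_line(data = grid_wide, aes(y = pred), color = "red") +
facet_wrap(~ model)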
By default, R silently drops rows containing NAs. To get a warning instead, set options(na.action = na.warn).
options(na.action = na.warn)
df <- tribble(
~x, ~y,
1, 2.2,
2, NA,
3, 3.5,
4, 8.3,
NA, 10
)
mod <- lm(y ~ x, data = df)
Dropping 2 rows with missing values
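To suppress the warning for an individual model, pass na.action = na.exclude to that call; nobs() then reports how many observations were actually used:
mod <- lm(y ~ x, data = df, na.action = na.exclude)
nobs(mod)
[1] 3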
Generalized linear models, e.g. stats::glm(). Linear models assume that the response is continuous and the error has a normal distribution. Generalized linear models extend linear models to include non-continuous responses (e.g. binary data or counts). They work by defining a distance metric based on the statistical idea of likelihood.
Generalized additive models, e.g. mgcv::gam(), extend generalized linear models to incorporate arbitrary smooth functions. That means you can write a formula like y ~ s(x) which becomes an equation like y = f(x) and let gam() estimate what that function is (subject to some smoothness constraints to make the problem tractable).
Penalized linear models, e.g. glmnet::glmnet(), add a penalty term to the distance that penalizes complex models (as defined by the distance between the parameter vector and the origin). This tends to make models that generalize better to new data sets from the same population.
Robust linear models, e.g. MASS::rlm(), tweak the distance to down-weight points that are very far away. This makes them less sensitive to the presence of outliers, at the cost of being not quite as good when there are no outliers.
Trees, e.g. rpart::rpart(), attack the problem in a completely different way than linear models. They fit a piece-wise constant model, splitting the data into progressively smaller and smaller pieces. Trees aren’t terribly effective by themselves, but they are very powerful when used in aggregate by models like random forests (e.g. randomForest::randomForest()) or gradient boosting machines (e.g. xgboost::xgboost()).
In addition, the caret package provides a set of functions that streamline the process of creating predictive models.
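To close, a few minimal call sketches for the families above (illustrations only, each assuming its package is installed; note that loading MASS masks dplyr::select()):
mod_glm <- glm(y ~ x, family = gaussian, data = sim1) # with a gaussian family this is equivalent to lm()
library(mgcv)
mod_gam <- gam(y ~ s(x), data = sim5) # generalized additive model: smooth fit of y on x
library(MASS)
mod_rlm <- rlm(y ~ x, data = sim1) # robust linear model: down-weights outliers
library(rpart)
mod_tree <- rpart(y ~ x1 + x2, data = sim4) # regression tree: piece-wise constant fit
library(glmnet)
mod_net <- glmnet(as.matrix(sim4[c("x1", "x2")]), sim4$y) # penalized linear model; takes a matrix, not a formula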