Load Libraries

library(tidyverse)
library(modelr)
library(nycflights13)


1. Introduction to model building


In this module, we will learn how to build a model using examples of model building with diamonds and `flights data sets.

Before we go to a case study, let’s learn some basic functions in model building. We will still limit ourselves to linear models (possibly with interaction terms). More models are going to be introduced in future courses.


Least Square Solution

First, let’s understand the mathematical principle to build a model. Usually the goal is to find the optimized parameters for a particular model by minimizing the error between model prediction and measured data.

Let’s look at sim1 data set, which is a toy example in modelr package. This data set contains artificial data about the relationship between two continuous random variables:

sim1
## # A tibble: 30 × 2
##        x     y
##    <int> <dbl>
##  1     1  4.20
##  2     1  7.51
##  3     1  2.13
##  4     2  8.99
##  5     2 10.2 
##  6     2 11.3 
##  7     3  7.36
##  8     3 10.5 
##  9     3 10.5 
## 10     4 12.4 
## # … with 20 more rows
ggplot(sim1, aes(x,y)) + 
  geom_point()

There is an apparent pattern in the data - a linear relationship. So we are going to use a simple linear model to fit the data:

\[ y = \beta_0 + \beta_1 x\]

However, out of all possible lines (depending on the values of the intercept \(\beta_0\) and the slope of \(x\), \(\beta_1\)), which one is the best? Of course, we hope to minimize the error and we need a metric here. The most commonly used metric in statistical modeling is the mean square error (MSE):

In the figure above (which shifts the \(x\) values of sim1 a little bit to make things more visually clear), the solid line is a linear fit to the data shown as scattered points. The error for each data point is marked as the blue vertical segments. The length of each blue segment represents the error (the distance between model prediction and actual target value).

The mathematical principle here is to minimize the sum of square errors:

\[ \mathrm{find} \; \beta_0, \beta_1 \; \mathrm{to \; minimize \;} \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y_i})^2\]

where \(\hat{y_i}\) is the model prediction for each \(x_i\) value, in a linear model it is

\[ \hat{y_i} = \beta_0 + \beta_1 x_i \] Such a solution is called the least square solution (LSE). The mathematical details regarding how to find the values of \(\beta_0\) and \(\beta_1\) are introduced in courses such as Regression Analysis.


Visualise models

Next, let’s learn how to visualise our model which is useful to verify its correctness, understand our model behavior etc. We will learn a few useful functions that help us analyze our model.

For example, to create a linear model for sim1 data set and visualise it, we aim to create something like the plot below:

model1 <- lm(y ~ x, data = sim1)

grid <- sim1 %>%
  data_grid(x) %>%
  add_predictions(model1) 

ggplot() + 
  geom_point(aes(x, y), data = sim1) +
  geom_line(aes(x, y = pred), data = grid, colour = "red", size = 1)

In this figure, we both plot the original data in a scatter plot (by geom_point) and the fitted model in a line plot (by geom_line). For the latter part, we introduce two new functions here data_grid and add_predictions.

The modelr::data_grid() function finds all unique values for each variable and generates all combinations to be used in creating model predictions at each possible \(x\) values:

unique(sim1$x)
##  [1]  1  2  3  4  5  6  7  8  9 10
data_grid(data = sim1, x)
## # A tibble: 10 × 1
##        x
##    <int>
##  1     1
##  2     2
##  3     3
##  4     4
##  5     5
##  6     6
##  7     7
##  8     8
##  9     9
## 10    10

Then the add_predictions function take the data frame created by data_grid as the first argument and the model as the second argument, to create a new column named pred containing model predictions for each \(x\) value.

model1 <- lm(y ~ x, data = sim1)
grid <- data_grid(sim1, x)
grid <- add_predictions(grid, model1) # Update "grid" with a new column `pred`
grid
## # A tibble: 10 × 2
##        x  pred
##    <int> <dbl>
##  1     1  6.27
##  2     2  8.32
##  3     3 10.4 
##  4     4 12.4 
##  5     5 14.5 
##  6     6 16.5 
##  7     7 18.6 
##  8     8 20.6 
##  9     9 22.7 
## 10    10 24.7

Then we can plot this out:

ggplot() + 
  geom_line(aes(x = x, y = pred), data = grid)

Then plotting this line along with the original data results in the graph at the beginning of this subsection.


Visualise residuals

We have learned how to use plot(<model>, which = 1) to visualise the residuals of a model. Sometimes we want to create some customized residual plots. let’s learn how to do it:

Similar to add_predictions, there is a function add_residuals to add residuals to a data frame, in a new column resid.

sim1 <- add_residuals(sim1, model1)
sim1
## # A tibble: 30 × 3
##        x     y    resid
##    <int> <dbl>    <dbl>
##  1     1  4.20 -2.07   
##  2     1  7.51  1.24   
##  3     1  2.13 -4.15   
##  4     2  8.99  0.665  
##  5     2 10.2   1.92   
##  6     2 11.3   2.97   
##  7     3  7.36 -3.02   
##  8     3 10.5   0.130  
##  9     3 10.5   0.136  
## 10     4 12.4   0.00763
## # … with 20 more rows

Since there is a unique residual for each data point, usually we need to update the original data set when using add_residuals.

Now we can visualise the residuals in our own way. If we hope to get something similar to plot(model1, which = 1), we can do the following:

ggplot(sim1, aes(y, resid)) + 
  geom_ref_line(h = 0) + 
  geom_point()

It can be useful to plot the residuals against \(x\) as well:

ggplot(sim1, aes(x, resid)) + 
  geom_ref_line(h = 0) + 
  geom_point()

This looks like random noise, suggesting that our model has done a good job of capturing the patterns in the data set. If we hope to check the normality plot of our residuals, we can do:

qqnorm(sim1$resid)

The residuals look pretty normal, which validates our model assumption.


Modeling Categorical Variables

Next, let’s study the basics of using a linear model (possibly with interaction terms) to model data with categorical variables.

For categorical variables, since their values are labels, not numbers, we have to convert their value to numbers before putting them into a linear model (or most other models).

There are three different situations here:

  • The categorical variable is binary (only two possible lables)
  • The categorical variable is not binary and not ordered
  • The categorical variable is ordered

We will use the function model_matrix in modelr to understand how this works. Let’s first look at a binary variable.


Binary categorical variable

bank_data <- read_csv("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets/BankChurners.csv")

unique(bank_data$Gender)
## [1] "M" "F"

So we see that in the bank customer data, there are only two values for Gender - “M” (male) and “F” (female). Let’s say we hope to model the continuous variable Credit_Limit with Gender, we can use model_matrix function to see what values are used in model equations.

model_data <- model_matrix(bank_data, Credit_Limit ~ Gender) 
mutate(model_data, Gender = bank_data$Gender)
## # A tibble: 10,127 × 3
##    `(Intercept)` GenderM Gender
##            <dbl>   <dbl> <chr> 
##  1             1       1 M     
##  2             1       0 F     
##  3             1       1 M     
##  4             1       0 F     
##  5             1       1 M     
##  6             1       1 M     
##  7             1       1 M     
##  8             1       1 M     
##  9             1       1 M     
## 10             1       1 M     
## # … with 10,117 more rows

The data frame returned by model_matrix summarises the \(x\) values (but not including \(y\)) used for model fitting. For a linear model, there is always a column of ones for the Intercept \(\beta_0\) since \(y = \beta_0 * 1 + \beta_1 * x\). For GenderM, we see that R automatically encodes “M” into a value of one, and “F” into a value of zero:

\[ \mathrm{Gender} = \cases{1, \quad \mathrm{for\;label\;"M"} \\ 0, \quad \mathrm{for\;label\;"F"}}\] Here the name “GenderM” is due to the fact that the value “M” is encoded as one.


Non-binary, non-ordered categorical variable

If a non-ordered categofical variable has more than two labels, the most commonly encoding method is the one-hot coding. Let’s use mpg data set as an example:

unique(mpg$drv)
## [1] "f" "4" "r"
model_matrix(mpg, cty ~ drv) %>%
  mutate(drv = mpg$drv) %>%
  print(n=30)
## # A tibble: 234 × 4
##    `(Intercept)`  drvf  drvr drv  
##            <dbl> <dbl> <dbl> <chr>
##  1             1     1     0 f    
##  2             1     1     0 f    
##  3             1     1     0 f    
##  4             1     1     0 f    
##  5             1     1     0 f    
##  6             1     1     0 f    
##  7             1     1     0 f    
##  8             1     0     0 4    
##  9             1     0     0 4    
## 10             1     0     0 4    
## 11             1     0     0 4    
## 12             1     0     0 4    
## 13             1     0     0 4    
## 14             1     0     0 4    
## 15             1     0     0 4    
## 16             1     0     0 4    
## 17             1     0     0 4    
## 18             1     0     0 4    
## 19             1     0     1 r    
## 20             1     0     1 r    
## 21             1     0     1 r    
## 22             1     0     1 r    
## 23             1     0     1 r    
## 24             1     0     1 r    
## 25             1     0     1 r    
## 26             1     0     1 r    
## 27             1     0     1 r    
## 28             1     0     1 r    
## 29             1     0     0 4    
## 30             1     0     0 4    
## # … with 204 more rows

Here the drv (drive train) is a categorical variable with three possible labels: “f” (forward), “r” (rear) and “4” (4-wheel). In this case, it is not a good idea to encode them as “1”, “2”, “3”, since there is no intrinsic order in the variable.

Instead, for non-ordered categorical variables with \(n\) labels, we create \(n-1\) new binary dummy variables, which takes the value of one for the 1st, 2nd, …, (n-1)th variable, and zero otherwise. For the \(n\)th label, all dummy variables are zero. In this example:

\[ \mathrm{drvf} = \cases{1, \quad \mathrm{for\;label\;"f"} \\ 0, \quad \mathrm{for\;label\;"r"\; and\;"4"}} \] \[ \mathrm{drvr} = \cases{1, \quad \mathrm{for\;label\;"r"} \\ 0, \quad \mathrm{for\;label\;"f"\; and\;"4"}} \] So for label “4”, both drvf and drvr take the value of zero. This encoding system is called one-hot encoding.


Ordered (ordinal) Variable

For an ordered variable, things are more complicated. There are multiple ways of labeling an ordered variable. For example, using “1”, “2”, “3”, “4”, etc. R uses a relatively complicated encoding method called polynomial contrasts. The theory behind is beyond the scope of this course. But we can glimpse how it looks like:

unique(diamonds$cut)
## [1] Ideal     Premium   Good      Very Good Fair     
## Levels: Fair < Good < Very Good < Premium < Ideal
model_matrix(diamonds, price ~ cut) %>%
  mutate(cut = diamonds$cut)
## # A tibble: 53,940 × 6
##    `(Intercept)`     cut.L  cut.Q     cut.C `cut^4` cut      
##            <dbl>     <dbl>  <dbl>     <dbl>   <dbl> <ord>    
##  1             1  6.32e- 1  0.535  3.16e- 1   0.120 Ideal    
##  2             1  3.16e- 1 -0.267 -6.32e- 1  -0.478 Premium  
##  3             1 -3.16e- 1 -0.267  6.32e- 1  -0.478 Good     
##  4             1  3.16e- 1 -0.267 -6.32e- 1  -0.478 Premium  
##  5             1 -3.16e- 1 -0.267  6.32e- 1  -0.478 Good     
##  6             1 -1.48e-18 -0.535 -3.89e-16   0.717 Very Good
##  7             1 -1.48e-18 -0.535 -3.89e-16   0.717 Very Good
##  8             1 -1.48e-18 -0.535 -3.89e-16   0.717 Very Good
##  9             1 -6.32e- 1  0.535 -3.16e- 1   0.120 Fair     
## 10             1 -1.48e-18 -0.535 -3.89e-16   0.717 Very Good
## # … with 53,930 more rows

In the diamonds data set, we have a few ordered variables - cut, color and clarity. If we model price against cut, we see that R creates four new numeric variables (since there are five levels in cut) from cut to cut^4 (“L”, “Q”, “C” represent “linear”, “quadratic” and “cubic” respectively). Each level is converted into a particular combination of values for linear fit. That’s all we need to know for now.


Modeling examples

Next, let’s see how to create linear models in different situations. We have learned how to do that with two numeric variables. It’s good to do that again with diamonds data set by fitting price to carat.


Modeling two numeric variables

Before modeling, let’s visualise our data first.

ggplot(diamonds, aes(carat, price)) + 
  geom_bin_2d(bins = 50)

It seems that price follows a non-linear relationship with carat. To do a reasonable linear model, we can take logarithm to both of them (which usually make things much more linear), and also ignore diamonds of more than 3 carats since they are rare.

diamonds1 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(log_price = log2(price), log_carat = log2(carat))

ggplot(diamonds1, aes(log_carat, log_price)) + 
  geom_bin_2d(bins = 50)

Now the relationship looks very linear, so let’s fit it and visualise it.

model_diamonds <- lm(log_price ~ log_carat, data = diamonds1) 
plot(model_diamonds, which = 1)

plot(model_diamonds, which = 2)

grid <- diamonds1 %>%
  data_grid(log_carat) %>%
  add_predictions(model_diamonds)

ggplot(diamonds1) + 
  geom_bin_2d(aes(log_carat, log_price)) +
  geom_line(aes(log_carat, pred), data = grid, colour = "red", size = 1)

We can also visualise in the original scale

ggplot(diamonds1) + 
  geom_bin_2d(aes(carat, price)) +
  geom_line(aes(2^log_carat, 2^pred), data = grid, colour = "red", size = 1)

Let’s check the coefficients for the model:

model_diamonds$coefficients
## (Intercept)   log_carat 
##   12.190950    1.678231

So the model is

\[ \mathrm{log\_price} = 12.19 + 1.68 \times \mathrm{log\_carat}\]

or equivalently

\[ \mathrm{price} = 4672 \times \mathrm{carat} ^ {1.68} \]


Modeling a continuous variable with a categorical variable

Next, let’s explore the relationship between carat and cut. As we have analyzed earlier in the semester, the diamonds data set shows a “weird” trend that better cut quality results in overall lower price:

ggplot(diamonds, aes(cut, price)) + geom_boxplot()

We had explained this result by the relationship between carat and cut - better-cut diamonds are smaller on average in this data set. Let’s now model this relationship.

diamonds2 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(cut = as.character(cut))

Let’s ignore the ordered status of cut and simply make cut a non-ordered character. Let’s then see the model matrix first:

model_matrix(diamonds2, carat ~ cut)
## # A tibble: 53,908 × 5
##    `(Intercept)` cutGood cutIdeal cutPremium `cutVery Good`
##            <dbl>   <dbl>    <dbl>      <dbl>          <dbl>
##  1             1       0        1          0              0
##  2             1       0        0          1              0
##  3             1       1        0          0              0
##  4             1       0        0          1              0
##  5             1       1        0          0              0
##  6             1       0        0          0              1
##  7             1       0        0          0              1
##  8             1       0        0          0              1
##  9             1       0        0          0              0
## 10             1       0        0          0              1
## # … with 53,898 more rows

Now we see that the one-hot encoding is implemented. Let’s put this into a linear model.

model_diamonds2 <- lm(carat ~ cut, data = diamonds2)
model_diamonds2$coefficients
##  (Intercept)      cutGood     cutIdeal   cutPremium cutVery Good 
##    1.0302687   -0.1824062   -0.3278925   -0.1405634   -0.2243366

We need to interpret this result correctly. Note that we encode all dummy variables to be zero for fair cut. Therefore, the Intercept here is the average carat for fair cut diamonds. For other dummy variables, the slope shows the difference of average carat from the Intercept, which are all negative indicating smaller diamonds.

We can visualise this model in the following way:

grid <- diamonds2 %>%
  data_grid(cut) %>%
  add_predictions(model_diamonds2)

grid
## # A tibble: 5 × 2
##   cut        pred
##   <chr>     <dbl>
## 1 Fair      1.03 
## 2 Good      0.848
## 3 Ideal     0.702
## 4 Premium   0.890
## 5 Very Good 0.806

So the predicted values are simply the mean carat for each category:

diamonds2$cut <- fct_relevel(diamonds2$cut, "Fair", "Good", "Very Good", "Premium", "Ideal") # make the labels in the right order for plotting purpose since we removed the order previously

ggplot() + 
  geom_point(aes(cut, carat), data = diamonds2) +
  geom_point(aes(cut, pred), data = grid, colour = "red", size = 4)

From the plot, it is obvious that fair cut diamonds are heavier than ideal cut diamonds on average.


Modeling price with carat and cut together

Fitting price with carat or cut alone is not particularly interesting. Now let’s fit price with carat and cut at the same time, which would provide better insights.

When we have both numeric and categorical variables in \(x\), there are two models to use. One is without the interaction term:

\[ \mathrm{price} = \beta_0 + \beta_{carat} \times \mathrm{carat} + \sum \beta_i \times \mathrm{cut\_dummy_i}\]

In this case for each cut group, there is a linear relationship between price and carat with the same slope but different intercepts.

diamonds3 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(log_price = log2(price), log_carat = log2(carat)) %>%
  mutate(cut = as.character(cut)) 

diamonds3$cut <- fct_relevel(diamonds3$cut, "Fair", "Good", "Very Good", "Premium", "Ideal")
  


model_diamonds3 <- lm(log_price ~ log_carat + cut, data = diamonds3)

grid <- diamonds3 %>%
  data_grid(log_carat, cut) %>%
  add_predictions(model_diamonds3)

grid
## # A tibble: 1,280 × 3
##    log_carat cut        pred
##        <dbl> <fct>     <dbl>
##  1     -2.32 Fair       7.89
##  2     -2.32 Good       8.12
##  3     -2.32 Very Good  8.24
##  4     -2.32 Premium    8.23
##  5     -2.32 Ideal      8.35
##  6     -2.25 Fair       8.01
##  7     -2.25 Good       8.24
##  8     -2.25 Very Good  8.36
##  9     -2.25 Premium    8.35
## 10     -2.25 Ideal      8.47
## # … with 1,270 more rows
ggplot(diamonds3) + 
  geom_point(aes(x = log_carat, y = log_price, colour = cut)) + 
  geom_line(aes(x = log_carat, y = pred, colour = cut), data = grid)

This may not be a good idea since the linear model for each cut group shares the same slope. We may add the interaction term to resolve this:

\[ \mathrm{price} = \beta_0 + \beta_{carat} \times \mathrm{carat} + \sum \beta_i \times \mathrm{cut\_dummy_i} + \sum \beta_{c,i} \times \mathrm{carat} \times \mathrm{cut\_dummy_i}\]

By introducing the interaction terms, now for different cut groups the slopes can be different.

diamonds3 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(log_price = log2(price), log_carat = log2(carat)) %>%
  mutate(cut = as.character(cut)) 

diamonds3$cut <- fct_relevel(diamonds3$cut, "Fair", "Good", "Very Good", "Premium", "Ideal")
  


model_diamonds4 <- lm(log_price ~ log_carat * cut, data = diamonds3)

grid <- diamonds3 %>%
  data_grid(log_carat, cut) %>%
  add_predictions(model_diamonds4)

grid
## # A tibble: 1,280 × 3
##    log_carat cut        pred
##        <dbl> <fct>     <dbl>
##  1     -2.32 Fair       8.30
##  2     -2.32 Good       8.05
##  3     -2.32 Very Good  8.18
##  4     -2.32 Premium    8.31
##  5     -2.32 Ideal      8.33
##  6     -2.25 Fair       8.41
##  7     -2.25 Good       8.17
##  8     -2.25 Very Good  8.31
##  9     -2.25 Premium    8.42
## 10     -2.25 Ideal      8.45
## # … with 1,270 more rows
ggplot(diamonds3) + 
  geom_point(aes(x = log_carat, y = log_price, colour = cut)) + 
  geom_line(aes(x = log_carat, y = pred, colour = cut), data = grid)

Actually in this case, the interaction terms do not change the trend very significantly. But if we check their p-values, we should still keep them.

summary(model_diamonds4)
## 
## Call:
## lm(formula = log_price ~ log_carat * cut, data = diamonds3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26407 -0.23864 -0.00931  0.23171  2.01353 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            11.818586   0.009230 1280.52   <2e-16 ***
## log_carat               1.515429   0.013897  109.05   <2e-16 ***
## cutGood                 0.266001   0.010996   24.19   <2e-16 ***
## cutVery Good            0.376526   0.010042   37.49   <2e-16 ***
## cutPremium              0.341711   0.009852   34.68   <2e-16 ***
## cutIdeal                0.478839   0.009830   48.71   <2e-16 ***
## log_carat:cutGood       0.221837   0.015386   14.42   <2e-16 ***
## log_carat:cutVery Good  0.211954   0.014445   14.67   <2e-16 ***
## log_carat:cutPremium    0.144763   0.014351   10.09   <2e-16 ***
## log_carat:cutIdeal      0.192736   0.014232   13.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3647 on 53898 degrees of freedom
## Multiple R-squared:  0.9378, Adjusted R-squared:  0.9378 
## F-statistic: 9.036e+04 on 9 and 53898 DF,  p-value: < 2.2e-16

But overall, we see that after considering the carat, we correctly see that for the same carat value, on average the best cutting quality results in the highest average price, and the worst cutting quality results in the lowest average price.