Load Libraries

library(tidyverse)
library(modelr)
library(nycflights13)


0. Introduction


Starting with this module, we are going to learn some modeling basics to lay the foundation for future courses such as data mining or machine learning.

The focus of our course is data visualization and exploration. However, these tasks usually serve the purpose of producing better models. Therefore, understanding some modeling basics helps us carry out EDA with a clearer sense of the next steps.


What is a model?

Within the context of data science, a model usually refers to a mathematical model that captures the relationship between variables in a given data set.

For example, a simple linear model can be represented by a function

\[ y = ax + b\]

Here \(y\) is called the target variable or the response variable, and \(x\) can be called a feature, predictor, or independent variable, etc. In a linear model, \(y\) is usually a numeric variable, and \(x\) can be numeric or categorical.

\(a\) and \(b\) are coefficients of the model. They are usually determined by data using different algorithms and approaches.

This model is called a linear model since no matter what values \(a\) and \(b\) take, \(y\) and \(x\) always follow a linear relationship.

On the other hand, a quadratic model is in the function form of:

\[ y = ax^2 + bx + c\]

It is obvious that the name “quadratic” comes from the fact that when \(a \neq 0\), \(y\) is a quadratic function of \(x\).


Deterministic model vs stochastic model

An important way to classify models is by whether the model formula contains a random error that is not accounted for by the given data. For example, if we have data on \(x\) and \(y\), then the model

\[ \begin{equation}\tag{1} y = \beta_0 + \beta_1 x \end{equation}\]

is a deterministic model because given any \(x\) value we can compute a definite \(y\) value from the given parameters \((\beta_0, \beta_1)\).

On the other hand, a model is called stochastic if there is an error term that can be assumed to follow some probability distribution independent of \(x\) but cannot be predicted with certainty. For example,

\[ \begin{equation}\tag{2} y = \beta_0 + \beta_1 x + \epsilon, \; \epsilon \sim \mathcal{N}(0, \sigma^2) \end{equation}\]

Here we assume that even for a given \(x\) value, there is still some variance in \(y\) that cannot be accounted for, which is nearly always true for real data. However, formulas (1) and (2) frequently lead to the same fitted model, because the least squares solution for the parameters in (1) coincides with the maximum-likelihood estimates of the parameters in (2). You will learn more details in the Linear Regression course.
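To make the distinction concrete, here is a small simulation sketch: we generate data from the stochastic model (2) with arbitrarily chosen parameters (2, 3, and \(\sigma = 1\) are illustrative values, not taken from any real data) and let lm() recover them.

set.seed(1)                              # for reproducibility
n <- 100
x <- runif(n, 0, 10)                     # predictor values
y_det  <- 2 + 3 * x                      # deterministic model (1): no error term
y_stoc <- 2 + 3 * x + rnorm(n, sd = 1)   # stochastic model (2): adds Normal(0, 1) noise

# Fitting the noisy data recovers estimates close to the true values (2, 3)
coef(lm(y_stoc ~ x))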


Least Square Solution

First, let’s understand the mathematical principle behind building a model. Usually the goal is to find the optimal parameters for a particular model by minimizing the error between the model predictions and the measured data.

Let’s look at the sim1 data set, a toy example from the modelr package. It contains artificial data on the relationship between two continuous variables:

sim1
## # A tibble: 30 × 2
##        x     y
##    <int> <dbl>
##  1     1  4.20
##  2     1  7.51
##  3     1  2.13
##  4     2  8.99
##  5     2 10.2 
##  6     2 11.3 
##  7     3  7.36
##  8     3 10.5 
##  9     3 10.5 
## 10     4 12.4 
## # … with 20 more rows
ggplot(sim1, aes(x,y)) + 
  geom_point()

There is an apparent pattern in the data - a linear relationship. So we are going to use a simple linear model to fit the data:

\[ y = \beta_0 + \beta_1 x\]

However, out of all possible lines (determined by the intercept \(\beta_0\) and the slope \(\beta_1\)), which one is the best? We hope to minimize the error, and for that we need a metric. The most commonly used metric in statistical modeling is the mean squared error (MSE):

In the figure above (which jitters the \(x\) values of sim1 a little to make things visually clearer), the solid line is a linear fit to the data shown as scattered points. The error for each data point is marked by a blue vertical segment, whose length is the distance between the model prediction and the actual target value.
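If you want to reproduce a figure like this yourself, here is a rough sketch using sim1 (without the horizontal jitter used in the original figure); the fitted line is assumed to come from lm():

fit <- lm(y ~ x, data = sim1)

sim1 %>%
  mutate(pred = predict(fit)) %>%                                # model predictions for each point
  ggplot(aes(x, y)) +
  geom_segment(aes(xend = x, yend = pred), colour = "blue") +    # error segments
  geom_point() +
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2])    # fitted line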

The mathematical principle here is to minimize the mean of the squared errors (equivalently, their sum):

\[ \mathrm{find} \; \beta_0, \beta_1 \; \mathrm{to \; minimize \;} \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y_i})^2\]

where \(\hat{y_i}\) is the model prediction for each \(x_i\) value, in a linear model it is

\[ \hat{y_i} = \beta_0 + \beta_1 x_i \]

Such a solution is called the least squares solution (LSE). The mathematical details of how to find the values of \(\beta_0\) and \(\beta_1\) are introduced in courses such as Regression Analysis.
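To build intuition, here is a brute-force sketch that evaluates the MSE over a grid of candidate \((\beta_0, \beta_1)\) values for sim1 and picks the pair with the smallest MSE; lm() solves the same minimization analytically, so the two answers should be very close. The grid ranges are arbitrary choices.

mse <- function(beta0, beta1, data = sim1) {
  mean((data$y - (beta0 + beta1 * data$x))^2)     # mean squared error for this candidate line
}

grid <- expand.grid(beta0 = seq(0, 10, 0.1), beta1 = seq(0, 5, 0.05))
grid$mse <- mapply(mse, grid$beta0, grid$beta1)
grid[which.min(grid$mse), ]                       # compare with coef(lm(y ~ x, data = sim1))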


Linear vs non-linear model

Linear models are simple, while non-linear models are more flexible since they have more parameters. An important rule of thumb in data modeling is that a more flexible model is not necessarily better than a simple one, because of the overfitting problem. Which one to use depends on factors such as data size, data quality, and the actual trend in the data. Let’s look at the following example to understand this.

The figure below is from the textbook “Introduction to Statistical Learning”. The left panel shows wage (in $1,000s) versus age from an income survey of men in the central Atlantic region of the United States. The right panel shows wage versus year for the same data set.

For wage vs age, we see that on average wage first increases with age and then decreases at around age 60. This makes sense: people usually earn more as their working experience increases, but after retirement they typically earn less since they no longer work as much.

This increasing-then-decreasing trend can be captured by a quadratic model, but not by a linear model. Therefore, it is appropriate to model the relationship between wage and age as a quadratic function.

On the other hand, for wage vs year, on average there is a slow but steady increase between 2003 and 2009. In this case, a linear model is enough to capture the steadily increasing or decreasing trend.

Although the linear model is just a special case of the quadratic model (the case where the quadratic term vanishes), using the more flexible quadratic model is not necessarily the better choice, because:

  • More flexible (complicated) models are more prone to overfitting, which means modeling the noise rather than the useful information in the data set.
  • More flexible (complicated) models take more computational time. In some situations, this can be critical.
  • More flexible (complicated) models require more data to work well, which is sometimes not available.

You will learn more details in a course on statistical learning; this is only a very brief introduction to what modeling is.


1. Simple Linear Regression

Now let’s learn how to do simple modeling with base R. We will do linear regression, which is simple yet very important in modeling data. There are two basic types of linear regression:

  • simple linear regression: there is only one independent variable

\[ y = \beta_0 + \beta_1 x\]

  • multiple linear regression: there is more than one independent variable

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\]

The goal is to find the best estimates of the \(\beta_i\) coefficients from the given data set and to quantify their uncertainty.

For this part, we are going to use the following libraries:

library(MASS)
library(ISLR2)

If you have not installed these packages, please install them. The examples are from section 3.6 of the textbook “Introduction to Statistical Learning, 2nd Edition”. The electronic version is free for access online:

https://hastie.su.domains/ISLR2/ISLRv2_website.pdf

Now let’s use the Boston data set from the ISLR2 library. The data set records medv (median house value) for 506 census tracts in Boston.

Check the help documentation for the meaning of each variable with ?Boston.

class(Boston)
## [1] "data.frame"
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

Let’s start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor.

lm.fit <- lm(medv ~ lstat, data = Boston)

The input medv ~ lstat is called a formula. The variable on the left of ~ is the response, and the variables on the right are predictors. The meaning of medv ~ lstat is precisely a linear model:

\[ medv = \beta_0 + \beta_1 * lstat\]

The lm() function then performs the actual data fitting to find out the best estimated values of \(\beta_0\) and \(\beta_1\). The data argument tells lm() which data set to use.

The result of this linear model is stored in the object lm.fit, which is a list. If we print the object itself, we get a very brief summary of the model giving the estimated coefficients.

lm.fit
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95

So \(\beta_0\) (intercept) is estimated to be 34.55, and \(\beta_1\) (coefficient of lstat) is estimated to be -0.95. So our model turns out to be:

\[ \begin{equation} \tag{3} medv = 34.55 - 0.95 * lstat \end{equation}\]

To get a more comprehensive look at the model details, we can use the summary() function:

summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16


Inferential model vs predictive model

Usually, a model serves one of two main purposes - statistical inference or prediction on unseen data. Using the example above, let’s see how the two purposes differ from, yet relate to, each other.


Prediction

For prediction, we simply put a new value of lstat into our model as the input, and it gives us a predicted value of \(medv\). The following code predicts values of \(medv\) for \(lstat = 5, 10, 15\).

predict(lm.fit, data.frame(lstat = c(5, 10, 15)))
##        1        2        3 
## 29.80359 25.05335 20.30310

Note that this prediction comes from model (3), which is a deterministic model predicting the average medv value for a given lstat. Usually we also hope to know the uncertainty of this prediction. We may do the following:

predict(lm.fit, data.frame(lstat = 10), interval = "confidence")
##        fit      lwr      upr
## 1 25.05335 24.47413 25.63256
predict(lm.fit, data.frame(lstat = 10), interval = "prediction")
##        fit      lwr      upr
## 1 25.05335 12.82763 37.27907

So given an lstat of 10 (10 percent of the population with lower socioeconomic status), the model predicts the average median house value to be 25.053 (in $1,000s). The 95% confidence interval for this average is [24.47, 25.63].

The 95% interval for an individual medv value is [12.83, 37.28] (obtained by setting interval = "prediction"). This is called a prediction interval. It is much wider because of the additional randomness of individual observations around the average. In other words, the random error \(\epsilon\) in equation (2) also plays a role here.
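As a quick sanity check, the point prediction above is just the fitted line (3) evaluated at lstat = 10:

coef(lm.fit)[1] + coef(lm.fit)[2] * 10   # 34.55 - 0.95 * 10, approximately 25.05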


Inference

For inference, we usually need to perform some kind of statistical test (such as a hypothesis test) to understand and explain relationships between variables in the underlying population. These inferences often concern the mechanisms and forces shaping the data.

For example, if we hope to know whether lstat has a significant impact on medv, we may simply check its p-value in the lm.fit result.

summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

The result shows that the p-value for the hypothesis test \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\) is nearly zero. So we should reject \(H_0\), and our data serve as evidence that lstat does have a non-negligible effect on medv.
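If we need this p-value programmatically (for example, in a report), it can be pulled directly out of the summary object; a small sketch:

summary(lm.fit)$coefficients["lstat", "Pr(>|t|)"]   # p-value for the lstat coefficient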

We can also check the confidence interval estimates for our parameters:

confint(lm.fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505

We will have another lecture on the statistical tests that are typically performed on tabular data, for different data types and purposes.


lm object is a list

We can check the data structure of lm.fit by

typeof(lm.fit)
## [1] "list"
class(lm.fit)
## [1] "lm"
str(lm.fit)
## List of 12
##  $ coefficients : Named num [1:2] 34.55 -0.95
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "lstat"
##  $ residuals    : Named num [1:506] -5.82 -4.27 3.97 1.64 6.71 ...
##   ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:506] -506.86 -152.46 4.43 2.12 7.13 ...
##   ..- attr(*, "names")= chr [1:506] "(Intercept)" "lstat" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:506] 29.8 25.9 30.7 31.8 29.5 ...
##   ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:506, 1:2] -22.4944 0.0445 0.0445 0.0445 0.0445 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:506] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "lstat"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.04 1.02
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 504
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = medv ~ lstat, data = Boston)
##  $ terms        :Classes 'terms', 'formula'  language medv ~ lstat
##   .. ..- attr(*, "variables")= language list(medv, lstat)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "medv" "lstat"
##   .. .. .. ..$ : chr "lstat"
##   .. ..- attr(*, "term.labels")= chr "lstat"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(medv, lstat)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "medv" "lstat"
##  $ model        :'data.frame':   506 obs. of  2 variables:
##   ..$ medv : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##   ..$ lstat: num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language medv ~ lstat
##   .. .. ..- attr(*, "variables")= language list(medv, lstat)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "medv" "lstat"
##   .. .. .. .. ..$ : chr "lstat"
##   .. .. ..- attr(*, "term.labels")= chr "lstat"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(medv, lstat)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "medv" "lstat"
##  - attr(*, "class")= chr "lm"

We see that there is a lot of information stored in lm.fit! Usually, we don’t work on lm objects directly using $. Rather, we use accessor functions to extract the information we need, such as:

coef(lm.fit)
## (Intercept)       lstat 
##  34.5538409  -0.9500494
confint(lm.fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505


Visualize fitting results

In this example, most likely we would like to see how well our model fits the data by plotting the fitting line on top of raw data:

attach(Boston)
plot(lstat, medv)
abline(lm.fit)

Here we use the attach() function to tell R to look up variables in the Boston data set for any subsequent functions. The plot() function is the base R function for creating a scatter plot. The abline() function accepts the coefficients from an lm result and draws the corresponding fitted line.

Of course we can do the same by ggplot:

ggplot(Boston, aes(x = lstat, y = medv)) + 
  geom_point() + 
  geom_abline(intercept = coef(lm.fit)[1], slope = coef(lm.fit)[2], colour = "blue")

An important step in data modeling is to validate the model assumptions. For example, simple linear regression assumes that the residual errors are normally distributed, with no remaining dependence on the predictor. We can check this by making the residual plot:

plot(lm.fit, which = 1)

The dashed line marks zero error, and we expect the residuals to scatter around it without any dependence on the fitted values (and hence on lstat). However, here we clearly see that the error does depend on lstat, which casts doubt on the validity of the model.

We can also make a normality plot to check the residual distribution:

plot(lm.fit, which = 2)

The residuals also deviate from normality in the upper tail (theoretical quantiles roughly between 1 and 3). So a linear model may be problematic here. This is expected, since the raw data show a non-linear trend; a quadratic model would probably work better.


Fit a polynomial model to data

To fit a polynomial model (in this case a quadratic model) to data, we may do the following:

attach(Boston)
lm.fit2 <- lm(medv ~ poly(lstat, 2))

This fits data to the model

\[ medv = \beta_0 + \beta_1 * lstat + \beta_2 * lstat^2\]
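One subtlety worth noting: by default poly() uses orthogonal polynomials, so the printed coefficients are not the \(\beta\)'s in the formula above, although the fitted values are identical. If you want coefficients directly on the raw lstat and lstat^2 scale, a sketch:

# raw = TRUE gives coefficients for lstat and lstat^2 themselves;
# the fitted values are the same as with the orthogonal version
lm.fit2.raw <- lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston)
coef(lm.fit2.raw)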

We can check how well the model fits the data by the code:

plot(lstat, medv)
ix <- sort(lstat, index.return = TRUE)$ix                      # ordering of points by lstat
lines(lstat[ix], predict(lm.fit2)[ix], col = "red", lwd = 2)   # draw the fitted curve left to right

More details of polynomial model will be given in a future example.


2. Multiple Linear Regression


Now let’s look at an example of multiple linear regression. This part is from:

https://www.tmwr.org/base-r.html

We are going to use experimental data from McDonald (2009), by way of Mangiafico (2015), on the relationship between the ambient temperature and the rate of cricket chirps per minute. Data were collected for two species: O. exclamationis and O. niveus. The data are contained in a data frame called crickets in the modeldata package with a total of 31 data points.

library(tidyverse)

data(crickets, package = "modeldata")
names(crickets)
## [1] "species" "temp"    "rate"
# Plot the temperature on the x-axis, the chirp rate on the y-axis. The plot
# elements will be colored differently for each species:
ggplot(crickets, 
       aes(x = temp, y = rate, color = species, pch = species, lty = species)) + 
  # Plot points for each data point and color by species
  geom_point(size = 2) + 
  # Show a simple linear model fit created separately for each species:
  geom_smooth(method = lm, se = FALSE) + 
  scale_color_brewer(palette = "Paired") +
  labs(x = "Temperature (C)", y = "Chirp Rate (per minute)")

The data exhibit fairly linear trends for each species. For a given temperature, O. exclamationis appears to chirp more per minute than the other species. For an inferential model, the researchers might have specified the following null hypotheses prior to seeing the data: temperature has no effect on the chirp rate, and there is no difference in chirp rate between the two species.

There may be some scientific or practical value in predicting the chirp rate but in this example we will focus on inference.

For multiple linear regression, we use the following formula

rate ~ temp + species
## rate ~ temp + species

Here rate is the response, and both temp and species are predictors. The formula does not mean temp “plus” species in the arithmetic sense; it specifies a general linear model:

\[ rate = \beta_0 + \beta_1 \times temp + \beta_2 \times species\]

Note that species is not a quantitative variable; in the data frame, it is represented as a factor column with levels “O. exclamationis” and “O. niveus”. The vast majority of model functions cannot operate on non-numeric data, so the model needs to encode the species data in a numeric format. The most common approach is to use indicator variables (also known as dummy variables) in place of the original qualitative values.

In this instance, since species has two possible values, the model formula will automatically encode this column as numeric by adding a new column that has a value of zero when the species is “O. exclamationis” and a value of one when the data correspond to “O. niveus”.
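We can see this encoding directly with base R’s model.matrix() (a small sketch; the dummy column is named speciesO. niveus after the level that gets encoded as one):

head(model.matrix(rate ~ temp + species, data = crickets))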

attach(crickets)
lm.fit2 <- lm(rate ~ temp + species)
lm.fit2
## 
## Call:
## lm(formula = rate ~ temp + species)
## 
## Coefficients:
##      (Intercept)              temp  speciesO. niveus  
##           -7.211             3.603           -10.065

To interpret the result: with each additional degree of temperature, the chirp rate increases by about 3.6 chirps per minute on average, and the species O. niveus chirps about 10.1 fewer times per minute than O. exclamationis on average.

The only issue in this analysis is the intercept value: it implies that at 0° C there is a negative chirp rate for both species. While this doesn’t make physical sense, the data only go as low as 17.2° C, so interpreting the model at 0° C would be an extrapolation, which is a bad idea. That said, the model fit is good within the observed temperature range, and the conclusions should be limited to that range.
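As a small sketch, we can also use predict() to compare the two species at a temperature inside the observed range (20° C is an arbitrary choice; the species labels are assumed to match the factor levels in the data):

new_obs <- data.frame(temp = 20, species = c("O. exclamationis", "O. niveus"))
predict(lm.fit2, new_obs)   # predicted chirps per minute for each species at 20 C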


Modeling Categorical Variables

Next, let’s study the basics of using a linear model (possibly with interaction terms) to model data with categorical variables.

For categorical variables, since their values are labels rather than numbers, we have to convert them to numbers before putting them into a linear model (or most other models).

There are three different situations here:

  • The categorical variable is binary (only two possible labels)
  • The categorical variable is not binary and not ordered
  • The categorical variable is ordered

We will use the function model_matrix in modelr to understand how this works. Let’s first look at a binary variable.


Binary categorical variable

bank_data <- read_csv("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets/BankChurners.csv")

unique(bank_data$Gender)
## [1] "M" "F"

So we see that in the bank customer data, there are only two values for Gender - “M” (male) and “F” (female). Suppose we hope to model the continuous variable Credit_Limit with Gender; we can use the model_matrix() function to see what values are used in the model equations.

model_data <- model_matrix(bank_data, Credit_Limit ~ Gender) 
mutate(model_data, Gender = bank_data$Gender)
## # A tibble: 10,127 × 3
##    `(Intercept)` GenderM Gender
##            <dbl>   <dbl> <chr> 
##  1             1       1 M     
##  2             1       0 F     
##  3             1       1 M     
##  4             1       0 F     
##  5             1       1 M     
##  6             1       1 M     
##  7             1       1 M     
##  8             1       1 M     
##  9             1       1 M     
## 10             1       1 M     
## # … with 10,117 more rows

The data frame returned by model_matrix summarises the \(x\) values (not including \(y\)) used for model fitting. For a linear model, there is always a column of ones for the intercept \(\beta_0\), since \(y = \beta_0 \times 1 + \beta_1 \times x\). For GenderM, we see that R automatically encodes “M” as one and “F” as zero:

\[ \mathrm{GenderM} = \cases{1, \quad \mathrm{for\;label\;"M"} \\ 0, \quad \mathrm{for\;label\;"F"}}\]

Here the name “GenderM” reflects the fact that the label “M” is encoded as one.


Non-binary, non-ordered categorical variable

If a non-ordered categorical variable has more than two labels, the most common approach is to create indicator (dummy) variables, essentially one-hot encoding. Let’s use the mpg data set as an example:

unique(mpg$drv)
## [1] "f" "4" "r"
model_matrix(mpg, cty ~ drv) %>%
  mutate(drv = mpg$drv) %>%
  print(n=30)
## # A tibble: 234 × 4
##    `(Intercept)`  drvf  drvr drv  
##            <dbl> <dbl> <dbl> <chr>
##  1             1     1     0 f    
##  2             1     1     0 f    
##  3             1     1     0 f    
##  4             1     1     0 f    
##  5             1     1     0 f    
##  6             1     1     0 f    
##  7             1     1     0 f    
##  8             1     0     0 4    
##  9             1     0     0 4    
## 10             1     0     0 4    
## 11             1     0     0 4    
## 12             1     0     0 4    
## 13             1     0     0 4    
## 14             1     0     0 4    
## 15             1     0     0 4    
## 16             1     0     0 4    
## 17             1     0     0 4    
## 18             1     0     0 4    
## 19             1     0     1 r    
## 20             1     0     1 r    
## 21             1     0     1 r    
## 22             1     0     1 r    
## 23             1     0     1 r    
## 24             1     0     1 r    
## 25             1     0     1 r    
## 26             1     0     1 r    
## 27             1     0     1 r    
## 28             1     0     1 r    
## 29             1     0     0 4    
## 30             1     0     0 4    
## # … with 204 more rows

Here drv (drive train) is a categorical variable with three possible labels: “f” (front-wheel drive), “r” (rear-wheel drive) and “4” (four-wheel drive). In this case, it is not a good idea to encode them as 1, 2, 3, since there is no intrinsic order in the variable.

Instead, for a non-ordered categorical variable with \(n\) labels, we create \(n-1\) new binary dummy variables: the \(i\)-th dummy variable takes the value one when the observation has the \(i\)-th label, and zero otherwise. For the \(n\)-th (reference) label, all dummy variables are zero. In this example:

\[ \mathrm{drvf} = \cases{1, \quad \mathrm{for\;label\;"f"} \\ 0, \quad \mathrm{for\;labels\;"r"\;and\;"4"}} \]

\[ \mathrm{drvr} = \cases{1, \quad \mathrm{for\;label\;"r"} \\ 0, \quad \mathrm{for\;labels\;"f"\;and\;"4"}} \]

So for the label “4”, both drvf and drvr take the value zero. This dummy-variable scheme is essentially one-hot encoding with the redundant reference column dropped.


Ordered (ordinal) Variable

For an ordered variable, things are more complicated, and there are multiple ways of encoding it (for example, using 1, 2, 3, 4, and so on). R uses a relatively sophisticated method called polynomial contrasts. The theory behind it is beyond the scope of this course, but we can glimpse what it looks like:

unique(diamonds$cut)
## [1] Ideal     Premium   Good      Very Good Fair     
## Levels: Fair < Good < Very Good < Premium < Ideal
model_matrix(diamonds, price ~ cut) %>%
  mutate(cut = diamonds$cut)
## # A tibble: 53,940 × 6
##    `(Intercept)`     cut.L  cut.Q     cut.C `cut^4` cut      
##            <dbl>     <dbl>  <dbl>     <dbl>   <dbl> <ord>    
##  1             1  6.32e- 1  0.535  3.16e- 1   0.120 Ideal    
##  2             1  3.16e- 1 -0.267 -6.32e- 1  -0.478 Premium  
##  3             1 -3.16e- 1 -0.267  6.32e- 1  -0.478 Good     
##  4             1  3.16e- 1 -0.267 -6.32e- 1  -0.478 Premium  
##  5             1 -3.16e- 1 -0.267  6.32e- 1  -0.478 Good     
##  6             1 -1.48e-18 -0.535 -3.89e-16   0.717 Very Good
##  7             1 -1.48e-18 -0.535 -3.89e-16   0.717 Very Good
##  8             1 -1.48e-18 -0.535 -3.89e-16   0.717 Very Good
##  9             1 -6.32e- 1  0.535 -3.16e- 1   0.120 Fair     
## 10             1 -1.48e-18 -0.535 -3.89e-16   0.717 Very Good
## # … with 53,930 more rows

In the diamonds data set, we have a few ordered variables - cut, color and clarity. If we model price against cut, we see that R creates four new numeric variables (since there are five levels in cut), named cut.L through cut^4, where “L”, “Q” and “C” stand for “linear”, “quadratic” and “cubic”, respectively. Each level is converted into a particular combination of these values for the linear fit. That’s all we need to know for now.
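If you are curious, the contrast matrix that R uses for a five-level ordered factor can be inspected directly; its columns correspond to cut.L, cut.Q, cut.C and cut^4:

contr.poly(5)   # polynomial contrasts for a factor with 5 ordered levels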


3. Modeling examples

Next, let’s see how to create linear models in different situations. First, let’s see when and how to include interaction terms between variables.


Interaction term

Recall that for a linear model with multiple predictors, the modeling function is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\]

The coefficients \(\beta_1, \beta_2, \cdots, \beta_n\) represent the increase (or decrease) in \(y\) for each unit increase in \(x_1, x_2, \cdots, x_n\), respectively, holding all other predictors constant.

Quite obviously, this assumes that the effect of each predictor is separable and that there is no interaction between them. In other words, the values of \(x_2, x_3, \cdots, x_n\) have nothing to do with \(\beta_1\). This may not be true in real situations.

To study this effect, we are going to use an advertising data set from the textbook Introduction to Statistical Learning. The data set can be downloaded from: https://www.statlearning.com/s/Advertising.csv


a) The Advertising data set

The data set has four variables, where sales is the target variable, measured in thousands of units sold for a particular product in 200 different markets. The three predictors TV, radio, and newspaper give the advertising expenditure (in thousands of dollars) in each of the three media in each market.

If we plot sales against each of the predictors, we obtain the following graphs:

So we see that, on average, sales increases with all three variables. Now let’s model this problem; we may get some different insights.
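For reference, panels like the ones described above can be reproduced with a sketch along these lines (run it after the advertising data frame is read in below; medium and budget are just temporary column names used for plotting):

advertising %>%
  pivot_longer(c(TV, radio, newspaper), names_to = "medium", values_to = "budget") %>%
  ggplot(aes(budget, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +    # simple linear fit within each panel
  facet_wrap(~ medium, scales = "free_x")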

First, let’s do a simple linear regression with each of the three variables.

advertising <- read_csv("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets/Advertising.csv")
glimpse(advertising)
## Rows: 200
## Columns: 5
## $ ...1      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ TV        <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.…
## $ radio     <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6, 5.…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2, 2…
## $ sales     <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6, 8.6…
attach(advertising)
lm.fit.tv <- lm(sales ~ TV)
summary(lm.fit.tv)
## 
## Call:
## lm(formula = sales ~ TV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

To interpret the result: on average, each additional thousand dollars spent on TV advertising results in about 0.047 thousand, or roughly 47, additional units of the product sold. The p-value is very low, indicating that we can safely reject the null hypothesis that TV advertising has no effect (\(\beta_1 = 0\)).

Similarly, let’s now do this for the other two variables:

lm.fit.radio <- lm(sales ~ radio)
summary(lm.fit.radio)
## 
## Call:
## lm(formula = sales ~ radio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7305  -2.1324   0.7707   2.7775   8.1810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.31164    0.56290  16.542   <2e-16 ***
## radio        0.20250    0.02041   9.921   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3287 
## F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16
lm.fit.news <- lm(sales ~ newspaper)
summary(lm.fit.news)
## 
## Call:
## lm(formula = sales ~ newspaper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2272  -3.3873  -0.8392   3.5059  12.7751 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
## newspaper    0.05469    0.01658    3.30  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared:  0.05212,    Adjusted R-squared:  0.04733 
## F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148

So we see that each additional thousand dollars spent on radio and newspaper advertising results in about 203 and 55 additional units sold, respectively. It seems that radio is the most effective advertising channel by a large margin, while TV and newspaper have similar effects.

Now let’s do the multiple linear regression, that is, study the effects of all three variables together.

lm.fit.all <- lm(sales ~ TV + radio + newspaper)
summary(lm.fit.all)
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

This result is different from those of the simple linear regressions. Now the coefficient of newspaper is very small, and its p-value suggests that there is no significant association between sales and newspaper once TV and radio are accounted for.

Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite? In fact it does. Consider the correlation matrix for the three predictor variables and response variable:

cor(advertising)
##                  ...1         TV       radio   newspaper       sales
## ...1       1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
## TV         0.01771469 1.00000000  0.05480866  0.05664787  0.78222442
## radio     -0.11068044 0.05480866  1.00000000  0.35410375  0.57622257
## newspaper -0.15494414 0.05664787  0.35410375  1.00000000  0.22829903
## sales     -0.05161625 0.78222442  0.57622257  0.22829903  1.00000000

Notice that the correlation between radio and newspaper is 0.35. This indicates that markets with high newspaper advertising tend to also have high radio advertising.

Now suppose that the multiple regression is correct and newspaper advertising is not associated with sales, but radio advertising is associated with sales. Then in markets where we spend more on radio our sales will tend to be higher, and as our correlation matrix shows, we also tend to spend more on newspaper advertising in those same markets.

Hence, in a simple linear regression that only examines sales versus newspaper, we observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising is not directly associated with sales. Newspaper advertising acts as a surrogate for radio advertising; newspaper gets “credit” for the effect of radio on sales.
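We can see this directly by adding radio to the newspaper-only regression (a sketch): once radio is in the model, the newspaper coefficient shrinks substantially and is no longer significant.

summary(lm(sales ~ newspaper + radio, data = advertising))$coefficients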

There is rich theory and information behind the analysis above. Here we only give a brief introduction to this example. For more details, you may refer to the textbook of Introduction to Statistical Learning.

It is also worth noting that regression models only capture association, not causation. So there is no guarantee that increasing expenditure on radio advertising will result in more sales.


Interaction effect

Now go back to the interaction effect. The statistical model for a multiple linear regression is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon\]

where \(\epsilon\) is assumed to absorb all factors related to \(y\) that are missing from the data set, so we simply cannot model them using \(\beta_1, \beta_2, \cdots, \beta_n\). This term is called the irreducible error, since it cannot be predicted from the available predictors.

In linear regression theory, \(\epsilon\) is assumed to satisfy

  • \(\epsilon\) should be normally distributed
  • \(\epsilon\) should be independent of all predictors
  • \(\epsilon\) should have a constant variance

In real problems, any of these assumptions can be invalid. In particular, the second issue can sometimes be resolved by introducing interaction terms and nonlinear terms, as shown below.

We already know that the association between sales and newspaper is weak once we account for radio. Therefore, let’s remove newspaper from our model.

lm.fit.tvradio <- lm(sales ~ TV + radio)

Now let’s check the residual plot to see whether there is any non-linear effect or heterogeneous variance:

plot(lm.fit.tvradio, which = 1)

So we see a relatively clear non-linear trend here, which suggests a potential non-linear relationship.

Now let’s see whether the residuals depend on \(TV \times radio\).

advertising <- advertising %>%
  mutate(res_tvradio = lm.fit.tvradio$residual) 

ggplot(advertising, mapping = aes(TV*radio, res_tvradio)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "red")

Here we first add the residuals of the last fit to the data set, and then plot them against \(TV \times radio\). We see a fairly obvious increasing trend here, so this term should indeed be included in the fit.


Synergy effect

The product of two variables, such as \(TV \times radio\), is called an interaction term. Now the regression model becomes:

\[ sales = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_{12} \times TV \times radio\]

What is the meaning of this model? It simply means that when there is a unit increase in TV advertising expenditure, the increase in sales is no longer a constant; it depends on the value of radio as well:

\[ sales = \beta_0 + (\beta_1 + \beta_{12} \times radio) \times TV + \beta_2 \times radio\]

Symmetrically, a unit increase in radio advertising expenditure also results in a different change in sales, depending on the value of TV:

\[ sales = \beta_0 + \beta_1 \times TV + (\beta_2 + \beta_{12} \times TV) \times radio\]

Such an effect is commonly seen in real life and is called a synergy effect. That is, if the company runs TV and radio advertising at the same time, each medium helps the other to be more effective.


Adding interaction term in R

In R, adding an interaction term can be done in a few ways:

lm(target ~ var1 + var2 + var1:var2)
lm(target ~ var1 * var2)
lm(target ~ (var1 + var2)^2)

All three ways of writing the formula give the same model:

\[ target = \beta_0 + \beta_1 \times var1 + \beta_2 \times var2 + \beta_{12} \times var1 \times var2\]

Note that in an R formula, ~ (var1 + var2)^2 is not the same as \((var1^2 + var2^2 + 2 \times var1 \times var2)\), nor does var1 * var2 mean the arithmetic product \(var1 \times var2\).
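A quick sketch to confirm that the three ways of writing the formula produce the same model, using TV and radio as var1 and var2:

f1 <- lm(sales ~ TV + radio + TV:radio, data = advertising)
f2 <- lm(sales ~ TV * radio,            data = advertising)
f3 <- lm(sales ~ (TV + radio)^2,        data = advertising)
cbind(coef(f1), coef(f2), coef(f3))   # identical coefficient columns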

Now let’s do the regression using the new model:

lm.fit.interact <- lm(sales ~ (TV + radio)^2)
plot(lm.fit.interact, which = 1)

We see that there is still some nonlinear effect left, but the magnitude of the residuals is smaller overall, so this is a better in-sample fit than the previous one (which is not necessarily always a good thing, due to the potential overfitting problem).

Let’s look at the fit summary:

coef(lm.fit.interact)
## (Intercept)          TV       radio    TV:radio 
## 6.750220203 0.019101074 0.028860340 0.001086495
summary(lm.fit.interact)
## 
## Call:
## lm(formula = sales ~ (TV + radio)^2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

We see that all three terms (TV, radio, and TV:radio) have low p-values, suggesting that we should reject the null hypotheses that their coefficients are zero. So all three terms have a non-negligible effect on sales in this model.


Polynomial regression

Next, let’s see whether we can improve the model to account for the remaining nonlinear effect in the residual plot. We can check which variable is responsible for it:

plot(advertising$TV, lm.fit.interact$residuals)

plot(advertising$radio, lm.fit.interact$residuals)

plot(advertising$radio * advertising$TV, lm.fit.interact$residuals)

So it seems that the residual is a quadratic function of TV. Then let’s throw the \(TV^2\) term into the model.

lm.fit.interact2 <- lm(sales ~ (TV + radio)^2 + I(TV^2))
summary(lm.fit.interact2)
## 
## Call:
## lm(formula = sales ~ (TV + radio)^2 + I(TV^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9949 -0.2969 -0.0066  0.3798  1.1686 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.137e+00  1.927e-01  26.663  < 2e-16 ***
## TV           5.092e-02  2.232e-03  22.810  < 2e-16 ***
## radio        3.516e-02  5.901e-03   5.959 1.17e-08 ***
## I(TV^2)     -1.097e-04  6.893e-06 -15.920  < 2e-16 ***
## TV:radio     1.077e-03  3.466e-05  31.061  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6238 on 195 degrees of freedom
## Multiple R-squared:  0.986,  Adjusted R-squared:  0.9857 
## F-statistic:  3432 on 4 and 195 DF,  p-value: < 2.2e-16

To throw in a quadratic term, we use I(TV^2). The expression inside I() is interpreted arithmetically, exactly as written, rather than as formula syntax. Now let’s look at our residual plots.

plot(advertising$TV, lm.fit.interact2$residuals)

plot(advertising$radio, lm.fit.interact2$residuals)

plot(advertising$radio * advertising$TV, lm.fit.interact2$residuals)

There is still some visible nonlinear relationship between the residuals and TV. So we can keep adding higher power terms of TV to the model, such as \(TV^3\), \(TV^4\), and \(TV^5\), until this association disappears.

lm.fit.interact5 <- lm(sales ~ (TV + radio)^2 + I(TV^2) + I(TV^3) + I(TV^4) + I(TV^5))
summary(lm.fit.interact5)
## 
## Call:
## lm(formula = sales ~ (TV + radio)^2 + I(TV^2) + I(TV^3) + I(TV^4) + 
##     I(TV^5))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.96970 -0.17890  0.00147  0.20770  1.03572 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.698e+00  2.070e-01  13.034  < 2e-16 ***
## TV           2.015e-01  1.271e-02  15.856  < 2e-16 ***
## radio        4.302e-02  3.906e-03  11.014  < 2e-16 ***
## I(TV^2)     -2.621e-03  2.649e-04  -9.893  < 2e-16 ***
## I(TV^3)      1.743e-05  2.258e-06   7.719 6.27e-13 ***
## I(TV^4)     -5.473e-08  8.381e-09  -6.530 5.74e-10 ***
## I(TV^5)      6.467e-11  1.124e-11   5.752 3.43e-08 ***
## TV:radio     1.036e-03  2.288e-05  45.272  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4087 on 192 degrees of freedom
## Multiple R-squared:  0.9941, Adjusted R-squared:  0.9939 
## F-statistic:  4605 on 7 and 192 DF,  p-value: < 2.2e-16
plot(lm.fit.interact5, which = 1)

plot(lm.fit.interact5, which = 2)

In principle, these models are still linear models, in the sense that the target variable is still a linear function of the predictors; the predictors themselves are simply nonlinear functions of the original predictors. When those functions are power terms, this is called polynomial regression.
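As a side note, the same model as lm.fit.interact5 can be written more compactly with poly(); a sketch (raw = TRUE keeps the terms on the original TV scale, so the fitted values match):

lm(sales ~ poly(TV, 5, raw = TRUE) + radio + TV:radio, data = advertising)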

Now we can visualise the goodness of fit against the original data. Since only two original variables are involved (the other predictors are simply functions of TV and radio), it is still possible to visualise the data and the fitted surface in 3D. Here we use the function plotPlane from the package rockchalk.

library(rockchalk)

par(mfrow=c(1,2), mar=c(1,1,1,1))

plotPlane(lm.fit.interact5, "TV", "radio", pch=16, col=rgb(0,0,1,0.1), drawArrows=TRUE, alength=0, acol="red", alty=1,alwd=1, theta=315, phi=10)

plotPlane(lm.fit.interact5, "TV", "radio", pch=16, col=rgb(0,0,1,0.1), drawArrows=TRUE, alength=0, acol="red", alty=1,alwd=1, theta=315, phi=-10)

So we see that our fit is already quite accurate and the error is pretty small overall.


b) diamonds data set

In the next example, let’s model diamond price with the diamonds data set. First, let’s study the relationship between price and carat. Visualizing our data gives us

ggplot(diamonds, aes(carat, price)) + 
  geom_bin_2d(bins = 50)

It seems that price follows a non-linear relationship with carat. To fit a reasonable linear model, we can take the logarithm of both variables (which usually makes the relationship much more linear), and also ignore diamonds of more than 3 carats since they are rare.

diamonds1 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(log_price = log2(price), log_carat = log2(carat))

ggplot(diamonds1, aes(log_carat, log_price)) + 
  geom_bin_2d(bins = 50)

Now the relationship looks very linear, so let’s fit it and visualise it.

model_diamonds <- lm(log_price ~ log_carat, data = diamonds1) 
plot(model_diamonds, which = 1)

plot(model_diamonds, which = 2)

grid <- diamonds1 %>%
  data_grid(log_carat) %>%
  add_predictions(model_diamonds)

ggplot(diamonds1) + 
  geom_bin_2d(aes(log_carat, log_price)) +
  geom_line(aes(log_carat, pred), data = grid, colour = "red", size = 1)

We can also visualise it on the original scale:

ggplot(diamonds1) + 
  geom_bin_2d(aes(carat, price)) +
  geom_line(aes(2^log_carat, 2^pred), data = grid, colour = "red", size = 1)

Let’s check the coefficients for the model:

model_diamonds$coefficients
## (Intercept)   log_carat 
##   12.190950    1.678231

So the model is

\[ \mathrm{log\_price} = 12.19 + 1.68 \times \mathrm{log\_carat}\]

or equivalently

\[ \mathrm{price} = 4672 \times \mathrm{carat} ^ {1.68} \]
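As a quick sketch, we can check the back-transformed prediction on the dollar scale, say for a one-carat diamond (carat = 1 is an arbitrary choice):

# log2(1) = 0, so the prediction is just the intercept; 2^12.19 is roughly 4.7 thousand dollars
2 ^ predict(model_diamonds, data.frame(log_carat = log2(1)))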


Modeling a continuous variable with a categorical variable

Next, let’s explore the relationship between carat and cut. As we analyzed earlier in the semester, the diamonds data set shows a “weird” trend in which better cut quality is associated with a lower overall price:

ggplot(diamonds, aes(cut, price)) + geom_boxplot()

We had explained this result by the relationship between carat and cut - better-cut diamonds are smaller on average in this data set. Let’s now model this relationship.

diamonds2 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(cut = as.character(cut))

Let’s ignore the ordered status of cut and simply treat it as a non-ordered character variable. Let’s look at the model matrix first:

model_matrix(diamonds2, carat ~ cut)
## # A tibble: 53,908 × 5
##    `(Intercept)` cutGood cutIdeal cutPremium `cutVery Good`
##            <dbl>   <dbl>    <dbl>      <dbl>          <dbl>
##  1             1       0        1          0              0
##  2             1       0        0          1              0
##  3             1       1        0          0              0
##  4             1       0        0          1              0
##  5             1       1        0          0              0
##  6             1       0        0          0              1
##  7             1       0        0          0              1
##  8             1       0        0          0              1
##  9             1       0        0          0              0
## 10             1       0        0          0              1
## # … with 53,898 more rows

Now we see that the one-hot encoding is implemented. Let’s put this into a linear model.

model_diamonds2 <- lm(carat ~ cut, data = diamonds2)
model_diamonds2$coefficients
##  (Intercept)      cutGood     cutIdeal   cutPremium cutVery Good 
##    1.0302687   -0.1824062   -0.3278925   -0.1405634   -0.2243366

We need to interpret this result correctly. Note that all dummy variables are zero for the fair cut, so the intercept here is the average carat of fair-cut diamonds. For the other cut levels, each coefficient gives the difference between that level’s average carat and the intercept; they are all negative, indicating smaller diamonds on average.
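We can verify this interpretation by computing the group means directly (a sketch): the mean for fair cut should equal the intercept, and each other group mean should equal the intercept plus the corresponding coefficient.

diamonds2 %>%
  group_by(cut) %>%
  summarise(mean_carat = mean(carat))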

We can visualise this model in the following way:

grid <- diamonds2 %>%
  data_grid(cut) %>%
  add_predictions(model_diamonds2)

grid
## # A tibble: 5 × 2
##   cut        pred
##   <chr>     <dbl>
## 1 Fair      1.03 
## 2 Good      0.848
## 3 Ideal     0.702
## 4 Premium   0.890
## 5 Very Good 0.806

So the predicted values are simply the mean carat for each category:

diamonds2$cut <- fct_relevel(diamonds2$cut, "Fair", "Good", "Very Good", "Premium", "Ideal") # make the labels in the right order for plotting purpose since we removed the order previously

ggplot() + 
  geom_point(aes(cut, carat), data = diamonds2) +
  geom_point(aes(cut, pred), data = grid, colour = "red", size = 4)

From the plot, it is obvious that fair cut diamonds are heavier than ideal cut diamonds on average.


Modeling price with carat and cut together

Fitting price with carat or cut alone is not particularly interesting. Now let’s fit price with carat and cut at the same time, which would provide better insights.

When we have both numeric and categorical variables in \(x\), there are two models to use. One is without the interaction term:

\[ \mathrm{price} = \beta_0 + \beta_{carat} \times \mathrm{carat} + \sum \beta_i \times \mathrm{cut\_dummy_i}\]

In this case, for each cut group there is a linear relationship (fitted below on the log scale, between log_price and log_carat) with the same slope but different intercepts.

diamonds3 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(log_price = log2(price), log_carat = log2(carat)) %>%
  mutate(cut = as.character(cut)) 

diamonds3$cut <- fct_relevel(diamonds3$cut, "Fair", "Good", "Very Good", "Premium", "Ideal")
  


model_diamonds3 <- lm(log_price ~ log_carat + cut, data = diamonds3)

grid <- diamonds3 %>%
  data_grid(log_carat, cut) %>%
  add_predictions(model_diamonds3)

grid
## # A tibble: 1,280 × 3
##    log_carat cut        pred
##        <dbl> <fct>     <dbl>
##  1     -2.32 Fair       7.89
##  2     -2.32 Good       8.12
##  3     -2.32 Very Good  8.24
##  4     -2.32 Premium    8.23
##  5     -2.32 Ideal      8.35
##  6     -2.25 Fair       8.01
##  7     -2.25 Good       8.24
##  8     -2.25 Very Good  8.36
##  9     -2.25 Premium    8.35
## 10     -2.25 Ideal      8.47
## # … with 1,270 more rows
ggplot(diamonds3) + 
  geom_point(aes(x = log_carat, y = log_price, colour = cut)) + 
  geom_line(aes(x = log_carat, y = pred, colour = cut), data = grid)

This may not be a good idea, since the linear models for the cut groups all share the same slope. We can add interaction terms to resolve this:

\[ \mathrm{price} = \beta_0 + \beta_{carat} \times \mathrm{carat} + \sum \beta_i \times \mathrm{cut\_dummy_i} + \sum \beta_{c,i} \times \mathrm{carat} \times \mathrm{cut\_dummy_i}\]

By introducing the interaction terms, now for different cut groups the slopes can be different.

diamonds3 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(log_price = log2(price), log_carat = log2(carat)) %>%
  mutate(cut = as.character(cut)) 

diamonds3$cut <- fct_relevel(diamonds3$cut, "Fair", "Good", "Very Good", "Premium", "Ideal")
  


model_diamonds4 <- lm(log_price ~ log_carat * cut, data = diamonds3)

grid <- diamonds3 %>%
  data_grid(log_carat, cut) %>%
  add_predictions(model_diamonds4)

grid
## # A tibble: 1,280 × 3
##    log_carat cut        pred
##        <dbl> <fct>     <dbl>
##  1     -2.32 Fair       8.30
##  2     -2.32 Good       8.05
##  3     -2.32 Very Good  8.18
##  4     -2.32 Premium    8.31
##  5     -2.32 Ideal      8.33
##  6     -2.25 Fair       8.41
##  7     -2.25 Good       8.17
##  8     -2.25 Very Good  8.31
##  9     -2.25 Premium    8.42
## 10     -2.25 Ideal      8.45
## # … with 1,270 more rows
ggplot(diamonds3) + 
  geom_point(aes(x = log_carat, y = log_price, colour = cut)) + 
  geom_line(aes(x = log_carat, y = pred, colour = cut), data = grid)

In this case, the interaction terms do not actually change the trend very much. But judging from their p-values, we should still keep them.

summary(model_diamonds4)
## 
## Call:
## lm(formula = log_price ~ log_carat * cut, data = diamonds3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26407 -0.23864 -0.00931  0.23171  2.01353 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            11.818586   0.009230 1280.52   <2e-16 ***
## log_carat               1.515429   0.013897  109.05   <2e-16 ***
## cutGood                 0.266001   0.010996   24.19   <2e-16 ***
## cutVery Good            0.376526   0.010042   37.49   <2e-16 ***
## cutPremium              0.341711   0.009852   34.68   <2e-16 ***
## cutIdeal                0.478839   0.009830   48.71   <2e-16 ***
## log_carat:cutGood       0.221837   0.015386   14.42   <2e-16 ***
## log_carat:cutVery Good  0.211954   0.014445   14.67   <2e-16 ***
## log_carat:cutPremium    0.144763   0.014351   10.09   <2e-16 ***
## log_carat:cutIdeal      0.192736   0.014232   13.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3647 on 53898 degrees of freedom
## Multiple R-squared:  0.9378, Adjusted R-squared:  0.9378 
## F-statistic: 9.036e+04 on 9 and 53898 DF,  p-value: < 2.2e-16

Overall, after accounting for carat, we see that for the same carat value, the best cut quality yields the highest average price and the worst cut quality the lowest, as we would expect.