library(tidyverse)
library(modelr)
library(nycflights13)
Now let’s look at an example of multiple linear regression. This part is from:
https://www.tmwr.org/base-r.html
We are going to use experimental data from McDonald (2009), by way of Mangiafico (2015), on the relationship between the ambient temperature and the rate of cricket chirps per minute. Data were collected for two species: O. exclamationis and O. niveus. The data are contained in a data frame called crickets in the modeldata package, with a total of 31 data points.
library(tidyverse)
data(crickets, package = "modeldata")
names(crickets)
## [1] "species" "temp" "rate"
# Plot the temperature on the x-axis, the chirp rate on the y-axis. The plot
# elements will be colored differently for each species:
ggplot(crickets,
aes(x = temp, y = rate, color = species, pch = species, lty = species)) +
# Plot points for each data point and color by species
geom_point(size = 2) +
# Show a simple linear model fit created separately for each species:
geom_smooth(method = lm, se = FALSE) +
scale_color_brewer(palette = "Paired") +
labs(x = "Temperature (C)", y = "Chirp Rate (per minute)")
The data exhibit fairly linear trends for each species. For a given temperature, O. exclamationis appears to chirp more per minute than the other species. For an inferential model, the researchers might have specified the following null hypotheses prior to seeing the data:
Temperature has no effect on the chirp rate.
There are no differences between the species’ chirp rate.
There may be some scientific or practical value in predicting the chirp rate, but in this example we will focus on inference.
For multiple linear regression, we use the following formula:
rate ~ temp + species
## rate ~ temp + species
Here rate is the response and both temp and species are predictors. The true meaning of this formula is not temp “plus” species, but a general linear model:
\[ rate = \beta_0 + \beta_1 \times temp + \beta_2 \times species\] Note that species is not a quantitative variable; in the data frame, it is represented as a factor column with levels “O. exclamationis” and “O. niveus”. The vast majority of model functions cannot operate on non-numeric data, so the model needs to encode the species data in a numeric format. The most common approach is to use indicator variables (also known as dummy variables) in place of the original qualitative values.
In this instance, since species has two possible values, the model formula will automatically encode this column as numeric by adding a new column that has a value of zero when the species is “O. exclamationis” and a value of one when the data correspond to “O. niveus”.
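We can inspect this encoding directly with model_matrix() from the modelr package loaded above (a quick check; output not shown):
# Inspect the design matrix R builds for this model:
# `speciesO. niveus` is 1 for O. niveus and 0 for O. exclamationis.
model_matrix(crickets, rate ~ temp + species)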
attach(crickets)
lm.fit2 <- lm(rate ~ temp + species)
lm.fit2
##
## Call:
## lm(formula = rate ~ temp + species)
##
## Coefficients:
## (Intercept) temp speciesO. niveus
## -7.211 3.603 -10.065
To interpret the result: with each additional degree of temperature, the chirp rate increases by about 3.6 chirps per minute on average. The species O. niveus chirps about 10 times per minute less than O. exclamationis at the same temperature, on average.
The only issue in this analysis is the intercept value: it implies a negative chirp rate at 0° C for both species. While this does not make physical sense, the data only go as low as 17.2° C, so interpreting the model at 0° C would be an extrapolation, and a bad idea. That said, the model fits well within the observed temperature range, and the conclusions should be limited to that range.
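For example, we can compare the fitted chirp rates at a temperature inside the observed range, say 25° C (a quick sketch; output not shown):
# Predicted chirp rate at 25 C for each species.
predict(lm.fit2, newdata = data.frame(temp = 25, species = unique(crickets$species)))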
Next, let’s study the basics of using a linear model (possibly with interaction terms) to model data with categorical variables.
For categorical variables, since their values are labels rather than numbers, we have to convert their values to numbers before putting them into a linear model (or most other models).
There are three different situations here: binary variables, non-ordered categorical variables with more than two labels, and ordered variables.
We will use the function model_matrix in modelr to understand how this works. Let's first look at a binary variable.
bank_data <- read_csv("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets/BankChurners.csv")
unique(bank_data$Gender)
## [1] "M" "F"
So we see that in the bank customer data, there are only two values for Gender: “M” (male) and “F” (female). Say we want to model the continuous variable Credit_Limit with Gender; we can use the model_matrix function to see what values are used in the model equations.
model_data <- model_matrix(bank_data, Credit_Limit ~ Gender)
mutate(model_data, Gender = bank_data$Gender)
## # A tibble: 10,127 × 3
## `(Intercept)` GenderM Gender
## <dbl> <dbl> <chr>
## 1 1 1 M
## 2 1 0 F
## 3 1 1 M
## 4 1 0 F
## 5 1 1 M
## 6 1 1 M
## 7 1 1 M
## 8 1 1 M
## 9 1 1 M
## 10 1 1 M
## # ℹ 10,117 more rows
The data frame returned by model_matrix summarises the \(x\) values (not including \(y\)) used for model fitting. For a linear model, there is always a column of ones for the intercept \(\beta_0\), since \(y = \beta_0 \times 1 + \beta_1 \times x\). For GenderM, we see that R automatically encodes “M” as a value of one and “F” as a value of zero:
\[ \mathrm{Gender} = \cases{1, \quad \mathrm{for\;label\;"M"} \\ 0, \quad \mathrm{for\;label\;"F"}}\]
Here the name “GenderM” reflects the fact that it is the value “M” that is encoded as one.
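To see the corresponding coefficient, a minimal sketch (the fitted values depend on the data and are not shown here):
# The GenderM coefficient is the average difference in credit limit
# between customers labeled "M" and the baseline "F".
fit_gender <- lm(Credit_Limit ~ Gender, data = bank_data)
coef(fit_gender)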
If a non-ordered categorical variable has more than two labels, the most common encoding method is one-hot encoding. Let's use the mpg data set as an example:
unique(mpg$drv)
## [1] "f" "4" "r"
model_matrix(mpg, cty ~ drv) %>%
mutate(drv = mpg$drv) %>%
print(n=30)
## # A tibble: 234 × 4
## `(Intercept)` drvf drvr drv
## <dbl> <dbl> <dbl> <chr>
## 1 1 1 0 f
## 2 1 1 0 f
## 3 1 1 0 f
## 4 1 1 0 f
## 5 1 1 0 f
## 6 1 1 0 f
## 7 1 1 0 f
## 8 1 0 0 4
## 9 1 0 0 4
## 10 1 0 0 4
## 11 1 0 0 4
## 12 1 0 0 4
## 13 1 0 0 4
## 14 1 0 0 4
## 15 1 0 0 4
## 16 1 0 0 4
## 17 1 0 0 4
## 18 1 0 0 4
## 19 1 0 1 r
## 20 1 0 1 r
## 21 1 0 1 r
## 22 1 0 1 r
## 23 1 0 1 r
## 24 1 0 1 r
## 25 1 0 1 r
## 26 1 0 1 r
## 27 1 0 1 r
## 28 1 0 1 r
## 29 1 0 0 4
## 30 1 0 0 4
## # ℹ 204 more rows
Here drv (drive train) is a categorical variable with three possible labels: “f” (front-wheel), “r” (rear-wheel) and “4” (four-wheel drive). In this case, it is not a good idea to encode them as “1”, “2”, “3”, since there is no intrinsic order in the variable.
Instead, for a non-ordered categorical variable with \(n\) labels, we create \(n-1\) new binary dummy variables: the \(i\)-th dummy variable takes the value one when the observation has the \(i\)-th label, and zero otherwise. For the \(n\)-th label, all dummy variables are zero. In this example:
\[ \mathrm{drvf} = \cases{1, \quad \mathrm{for\;label\;"f"} \\ 0, \quad \mathrm{for\;labels\;"r"\;and\;"4"}} \] \[ \mathrm{drvr} = \cases{1, \quad \mathrm{for\;label\;"r"} \\ 0, \quad \mathrm{for\;labels\;"f"\;and\;"4"}} \]
So for label “4”, both drvf and drvr take the value of zero. This encoding system is called one-hot encoding.
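Fitting a linear model with this encoding works the same way; a quick sketch (output not shown):
# "4" is the baseline level, so the intercept is the mean cty for
# four-wheel-drive cars; drvf and drvr are offsets from that baseline.
fit_drv <- lm(cty ~ drv, data = mpg)
coef(fit_drv)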
For an ordered variable, things are more complicated. There are multiple ways of labeling an ordered variable, for example using “1”, “2”, “3”, “4”, and so on. R uses a relatively sophisticated encoding method called polynomial contrasts. The theory behind it is beyond the scope of this course, but we can glimpse what it looks like:
unique(diamonds$cut)
## [1] Ideal Premium Good Very Good Fair
## Levels: Fair < Good < Very Good < Premium < Ideal
model_matrix(diamonds, price ~ cut) %>%
mutate(cut = diamonds$cut)
## # A tibble: 53,940 × 6
## `(Intercept)` cut.L cut.Q cut.C `cut^4` cut
## <dbl> <dbl> <dbl> <dbl> <dbl> <ord>
## 1 1 6.32e- 1 0.535 3.16e- 1 0.120 Ideal
## 2 1 3.16e- 1 -0.267 -6.32e- 1 -0.478 Premium
## 3 1 -3.16e- 1 -0.267 6.32e- 1 -0.478 Good
## 4 1 3.16e- 1 -0.267 -6.32e- 1 -0.478 Premium
## 5 1 -3.16e- 1 -0.267 6.32e- 1 -0.478 Good
## 6 1 -3.29e-17 -0.535 9.64e-17 0.717 Very Good
## 7 1 -3.29e-17 -0.535 9.64e-17 0.717 Very Good
## 8 1 -3.29e-17 -0.535 9.64e-17 0.717 Very Good
## 9 1 -6.32e- 1 0.535 -3.16e- 1 0.120 Fair
## 10 1 -3.29e-17 -0.535 9.64e-17 0.717 Very Good
## # ℹ 53,930 more rows
In the diamonds data set, we have a few ordered variables: cut, color and clarity. If we model price against cut, we see that R creates four new numeric variables (since cut has five levels), named cut.L through cut^4 (“L”, “Q” and “C” stand for “linear”, “quadratic” and “cubic”, respectively). Each level is converted into a particular combination of numeric values for the linear fit. That's all we need to know for now.
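If you are curious, the contrast matrix itself can be inspected with contr.poly() from base R; its rows correspond to the five levels Fair through Ideal:
# Polynomial contrasts for a 5-level ordered factor: columns .L, .Q, .C, ^4.
contr.poly(5)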
Next, let’s see how to create linear models in different situations. First, let’s see when and how to include interaction terms between variables.
Recall that for a linear model with multiple predictors, the modeling function is:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\]
The coefficients \(\beta_1, \beta_2, \cdots, \beta_n\) give the increase (or decrease) in \(y\) for each unit increase in \(x_1, x_2, \cdots, x_n\), respectively, holding all other predictors constant.
Quite obviously, here we assume that the effect of each predictor is separable and there is no interaction between them. In other words, the value of \(x_2, x_3, \cdots, x_n\) has nothing to do with \(\beta_1\). This may not be true in real situations.
To study this effect, we are going to use an advertising data set from the textbook Introduction to Statistical Learning. The data set can be downloaded from: https://www.statlearning.com/s/Advertising.csv
The Advertising data set
The data set has four variables. sales is the target variable, measured in thousands of units sold for a particular product in each of 200 different markets. The three predictors TV, radio, and newspaper refer to the advertising expenditure (in thousands of dollars) in each of the three media for the different markets.
If we plot sales against each of the predictors, we obtain the following graphs:
So we see that sales increase with all three variables on average. Now let's model this problem, which may give us some different insights.
First, let's do simple linear regression with each of the three variables.
advertising <- read_csv("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets/Advertising.csv")
glimpse(advertising)
## Rows: 200
## Columns: 5
## $ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ TV <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.…
## $ radio <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6, 5.…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2, 2…
## $ sales <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6, 8.6…
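The scatterplots referred to above can be reproduced along these lines (a sketch; the exact appearance of the original figures may differ):
# Reshape to long format and plot sales against each medium's budget.
advertising %>%
  pivot_longer(c(TV, radio, newspaper), names_to = "medium", values_to = "budget") %>%
  ggplot(aes(budget, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ medium, scales = "free_x")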
attach(advertising)
lm.fit.tv <- lm(sales ~ TV)
summary(lm.fit.tv)
##
## Call:
## lm(formula = sales ~ TV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
To interpret the result: on average, an additional thousand dollars spent on TV advertising is associated with about 0.047 thousand, or 47, additional units sold. The p-value is very low, indicating that we can safely reject the null hypothesis that TV advertising has no effect (\(\beta_1 = 0\)).
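We can also attach a confidence interval to this estimate (output not shown):
# 95% confidence interval for the TV slope.
confint(lm.fit.tv, parm = "TV")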
Similarly, let’s now do this for the other two variables:
lm.fit.radio <- lm(sales ~ radio)
summary(lm.fit.radio)
##
## Call:
## lm(formula = sales ~ radio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7305 -2.1324 0.7707 2.7775 8.1810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.31164 0.56290 16.542 <2e-16 ***
## radio 0.20250 0.02041 9.921 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
## F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
lm.fit.news <- lm(sales ~ newspaper)
summary(lm.fit.news)
##
## Call:
## lm(formula = sales ~ newspaper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2272 -3.3873 -0.8392 3.5059 12.7751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.35141 0.62142 19.88 < 2e-16 ***
## newspaper 0.05469 0.01658 3.30 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
## F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
So we see that each additional thousand dollars spent on radio and newspaper advertising is associated with about 203 and 55 additional units sold, respectively. From these simple regressions alone, radio appears to be the most effective advertising medium by a large margin, while TV and newspaper have similar effects.
Now let's do the multiple linear regression; that is, study the effects of all three variables together.
lm.fit.all <- lm(sales ~ TV + radio + newspaper)
summary(lm.fit.all)
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
This result is different from those of the simple linear regressions. Now the coefficient of newspaper is very small, and its p-value suggests that there is no significant association between sales and newspaper.
Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite? In fact it does. Consider the correlation matrix for the three predictor variables and response variable:
cor(advertising)
## ...1 TV radio newspaper sales
## ...1 1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
## TV 0.01771469 1.00000000 0.05480866 0.05664787 0.78222442
## radio -0.11068044 0.05480866 1.00000000 0.35410375 0.57622257
## newspaper -0.15494414 0.05664787 0.35410375 1.00000000 0.22829903
## sales -0.05161625 0.78222442 0.57622257 0.22829903 1.00000000
Notice that the correlation between radio and newspaper is 0.35. This indicates that markets with high newspaper advertising tend to also have high radio advertising.
Now suppose that the multiple regression is correct and newspaper advertising is not associated with sales, but radio advertising is. Then in markets where we spend more on radio, our sales will tend to be higher, and, as the correlation matrix shows, we also tend to spend more on newspaper advertising in those same markets. Hence, in a simple linear regression that only examines sales versus newspaper, we will observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising is not directly associated with sales. Newspaper advertising acts as a surrogate for radio advertising; newspaper gets “credit” for the association between radio and sales.
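One quick numerical check of this explanation is a nested-model comparison: adding newspaper to a model that already contains radio should yield no significant improvement (a sketch; output not shown):
# F test comparing the nested models sales ~ radio and sales ~ radio + newspaper.
anova(lm(sales ~ radio, data = advertising),
      lm(sales ~ radio + newspaper, data = advertising))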
There is a rich theory behind the analysis above; here we only give a brief introduction to this example. For more details, you may refer to the textbook Introduction to Statistical Learning.
It is also worthwhile to note that regression models only capture association, not causation. So there is no guarantee that if we increase expenditure on radio advertising, more sales will be observed.
Now let's go back to the interaction effect. The statistical model for a multiple linear regression is:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon\] where \(\epsilon\) absorbs all factors related to \(y\) that are missing from the data set, which we therefore cannot model using \(\beta_1, \beta_2, \cdots, \beta_n\). This term is called the irreducible error, since it cannot be reduced no matter how good the model is.
In linear regression theory, \(\epsilon\) is assumed to satisfy two conditions: it has zero mean and constant variance (and is often further assumed to be normally distributed), and it is independent of the predictors.
In real problems, either assumption can be invalid. In particular, the second issue, residual structure that still depends on the predictors, can sometimes be resolved by introducing interaction terms and nonlinear terms, as shown below.
Now we already know that the association between sales and newspaper is weak if we consider radio. Therefore let's remove newspaper from our model.
lm.fit.tvradio <- lm(sales ~ TV + radio)
Now let’s check the residual plot to see whether there is any non-linear effect or heterogeneous variance:
plot(lm.fit.tvradio, which = 1)
So we see a relatively clear non-linear trend here, which suggests a potential non-linear relationship.
Now let's see whether the residuals depend on \(TV \times radio\).
advertising <- advertising %>%
  mutate(res_tvradio = lm.fit.tvradio$residuals)
ggplot(advertising, mapping = aes(TV*radio, res_tvradio)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red")
Here we first add the residuals of the last fit to the data set, and then plot them against \(TV \times radio\). We see a fairly clear increasing trend, so this term should indeed be included in the fit.
The product of two variables, such as \(TV \times radio\), is called an interaction term. Now the regression model becomes:
\[ sales = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_{12} \times TV \times radio\]
What is the meaning of this model? It simply means that when there is a unit increase in TV advertising expenditure, the increase in sales is no longer a constant but depends on the value of radio as well:
\[ sales = \beta_0 + (\beta_1 + \beta_{12} \times radio) \times TV + \beta_2 \times radio\]
Symmetrically, a unit increase in radio advertising expenditure also results in different changes in sales, depending on the value of TV:
\[ sales = \beta_0 + \beta_1 \times TV + (\beta_2 + \beta_{12} \times TV) \times radio\]
Such an effect is commonly seen in real life and is called a synergy effect: if the company runs TV and radio advertising at the same time, each medium helps the other become more effective.
In R, adding an interaction term can be done in a few ways:
lm(target ~ var1 + var2 + var1:var2)
lm(target ~ var1 * var2)
lm(target ~ (var1 + var2)^2)
All three ways of writing the formula give the same model:
\[ target = \beta_0 + \beta_1 \times var1 + \beta_2 \times var2 + \beta_{12} \times var1 \times var2\]
Note that in an R formula, (var1 + var2)^2 is not the same as the arithmetic expression \(var1^2 + var2^2 + 2 \times var1 \times var2\), nor does var1 * var2 mean the arithmetic product \(var1 \times var2\).
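To convince yourself the three formulas are equivalent, compare their coefficients (a quick sketch using the advertising data; f1, f2 and f3 are just illustrative names):
# All three fits have identical coefficients.
f1 <- lm(sales ~ TV + radio + TV:radio, data = advertising)
f2 <- lm(sales ~ TV * radio, data = advertising)
f3 <- lm(sales ~ (TV + radio)^2, data = advertising)
all.equal(coef(f1), coef(f2))
all.equal(coef(f1), coef(f3))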
Now let’s do the regression using the new model:
lm.fit.interact <- lm(sales ~ (TV + radio)^2)
plot(lm.fit.interact, which = 1)
We see that there is still some nonlinear effect left, but the magnitude of the residuals is smaller overall, so this is a better fit than the previous one (though a smaller residual error is not always better, because of the risk of overfitting).
Let’s look at the fit summary:
coef(lm.fit.interact)
## (Intercept) TV radio TV:radio
## 6.750220203 0.019101074 0.028860340 0.001086495
summary(lm.fit.interact)
##
## Call:
## lm(formula = sales ~ (TV + radio)^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
All three coefficients have low p-values, suggesting that we should reject the null hypothesis that they are zero. So all three terms have a non-negligible effect on sales in this model.
Next, let's see whether we can improve the model to account for the remaining nonlinear effect in the residual plot. Let's check which variable accounts for it:
plot(advertising$TV, lm.fit.interact$residuals)
plot(advertising$radio, lm.fit.interact$residuals)
plot(advertising$radio * advertising$TV, lm.fit.interact$residuals)
So it seems that the residuals are a quadratic function of TV. Then let's throw the \(TV^2\) term into the model.
lm.fit.interact2 <- lm(sales ~ (TV + radio)^2 + I(TV^2))
summary(lm.fit.interact2)
##
## Call:
## lm(formula = sales ~ (TV + radio)^2 + I(TV^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9949 -0.2969 -0.0066 0.3798 1.1686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.137e+00 1.927e-01 26.663 < 2e-16 ***
## TV 5.092e-02 2.232e-03 22.810 < 2e-16 ***
## radio 3.516e-02 5.901e-03 5.959 1.17e-08 ***
## I(TV^2) -1.097e-04 6.893e-06 -15.920 < 2e-16 ***
## TV:radio 1.077e-03 3.466e-05 31.061 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6238 on 195 degrees of freedom
## Multiple R-squared: 0.986, Adjusted R-squared: 0.9857
## F-statistic: 3432 on 4 and 195 DF, p-value: < 2.2e-16
To throw in a quadratic term, we use I(TV^2); the expression inside I() is interpreted arithmetically, exactly as written, rather than as formula syntax. Now let's look at our residual plots.
plot(advertising$TV, lm.fit.interact2$residuals)
plot(advertising$radio, lm.fit.interact2$residuals)
plot(advertising$radio * advertising$TV, lm.fit.interact2$residuals)
There is still some visible nonlinear relationship between the residuals and TV, so we can keep adding higher powers of TV to the model, such as \(TV^3\), \(TV^4\), \(TV^5\), until this association disappears.
lm.fit.interact5 <- lm(sales ~ (TV + radio)^2 + I(TV^2) + I(TV^3) + I(TV^4) + I(TV^5))
summary(lm.fit.interact5)
##
## Call:
## lm(formula = sales ~ (TV + radio)^2 + I(TV^2) + I(TV^3) + I(TV^4) +
## I(TV^5))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.96970 -0.17890 0.00147 0.20770 1.03572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.698e+00 2.070e-01 13.034 < 2e-16 ***
## TV 2.015e-01 1.271e-02 15.856 < 2e-16 ***
## radio 4.302e-02 3.906e-03 11.014 < 2e-16 ***
## I(TV^2) -2.621e-03 2.649e-04 -9.893 < 2e-16 ***
## I(TV^3) 1.743e-05 2.258e-06 7.719 6.27e-13 ***
## I(TV^4) -5.473e-08 8.381e-09 -6.530 5.74e-10 ***
## I(TV^5) 6.467e-11 1.124e-11 5.752 3.43e-08 ***
## TV:radio 1.036e-03 2.288e-05 45.272 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4087 on 192 degrees of freedom
## Multiple R-squared: 0.9941, Adjusted R-squared: 0.9939
## F-statistic: 4605 on 7 and 192 DF, p-value: < 2.2e-16
plot(lm.fit.interact5, which = 1)
plot(lm.fit.interact5, which = 2)
In principle, these models are still linear models, in the sense that the target variable is still linear in the predictors; the predictors themselves, however, are nonlinear functions of the original predictors. When those functions are power terms, the approach is called polynomial regression.
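As a side note, the same polynomial fit can be written more compactly with poly(). The sketch below (with raw = TRUE) spans the same set of predictors as lm.fit.interact5 above, so it gives the same fitted values:
# Equivalent to sales ~ (TV + radio)^2 + I(TV^2) + I(TV^3) + I(TV^4) + I(TV^5).
lm.fit.poly5 <- lm(sales ~ radio + TV:radio + poly(TV, 5, raw = TRUE),
                   data = advertising)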
Now we can visualise the goodness of fit against the original data. Since only two variables are involved (the other predictors are simply functions of TV and radio), it is still possible to visualise the data and our fitted surface in 3D. Here we use the function plotPlane from the package rockchalk.
library(rockchalk)
par(mfrow=c(1,2), mar=c(1,1,1,1))
plotPlane(lm.fit.interact5, "TV", "radio", pch=16, col=rgb(0,0,1,0.1), drawArrows=TRUE, alength=0, acol="red", alty=1,alwd=1, theta=315, phi=10)
plotPlane(lm.fit.interact5, "TV", "radio", pch=16, col=rgb(0,0,1,0.1), drawArrows=TRUE, alength=0, acol="red", alty=1,alwd=1, theta=315, phi=-10)
So we see that our fit is already quite accurate and the error is pretty small overall.
The diamonds data set
In the next example, let's model diamond price with the diamonds data set. First, let's study the relationship between price and carat. Visualizing our data gives us:
ggplot(diamonds, aes(carat, price)) +
geom_bin_2d(bins = 50)
It seems that price follows a non-linear relationship with carat. To fit a reasonable linear model, we can take the logarithm of both (which usually makes the relationship much more linear) and ignore diamonds of more than 3 carats, since they are rare.
diamonds1 <- diamonds %>%
filter(carat <= 3) %>%
mutate(log_price = log2(price), log_carat = log2(carat))
ggplot(diamonds1, aes(log_carat, log_price)) +
geom_bin_2d(bins = 50)
Now the relationship looks very linear, so let’s fit it and visualise it.
model_diamonds <- lm(log_price ~ log_carat, data = diamonds1)
plot(model_diamonds, which = 1)
plot(model_diamonds, which = 2)
grid <- diamonds1 %>%
data_grid(log_carat) %>%
add_predictions(model_diamonds)
ggplot(diamonds1) +
geom_bin_2d(aes(log_carat, log_price)) +
geom_line(aes(log_carat, pred), data = grid, colour = "red", size = 1)
We can also visualise it in the original scale:
ggplot(diamonds1) +
geom_bin_2d(aes(carat, price)) +
geom_line(aes(2^log_carat, 2^pred), data = grid, colour = "red", size = 1)
Let’s check the coefficients for the model:
model_diamonds$coefficients
## (Intercept) log_carat
## 12.190950 1.678231
So the model is
\[ \mathrm{log\_price} = 12.19 + 1.68 \times \mathrm{log\_carat}\]
or equivalently
\[ \mathrm{price} = 4672 \times \mathrm{carat} ^ {1.68} \]
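We can sanity-check this back-transformation directly (a quick sketch; both lines should return a value close to the 4672 quoted above):
# Back-transform the intercept; since log2(1 carat) = 0, this is also
# the predicted price of a 1-carat diamond.
2^coef(model_diamonds)[["(Intercept)"]]
2^predict(model_diamonds, newdata = data.frame(log_carat = log2(1)))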
Next, let's explore the relationship between carat and cut. As we analyzed earlier in the semester, the diamonds data set shows a “weird” trend that better cut quality results in overall lower price:
ggplot(diamonds, aes(cut, price)) + geom_boxplot()
We had explained this result by the relationship between carat and cut: better-cut diamonds are smaller on average in this data set. Let's now model this relationship.
diamonds2 <- diamonds %>%
filter(carat <= 3) %>%
mutate(cut = as.character(cut))
Let's ignore the ordered status of cut and simply make cut a non-ordered character variable. Let's then look at the model matrix first:
model_matrix(diamonds2, carat ~ cut)
## # A tibble: 53,908 × 5
## `(Intercept)` cutGood cutIdeal cutPremium `cutVery Good`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 0 0
## 2 1 0 0 1 0
## 3 1 1 0 0 0
## 4 1 0 0 1 0
## 5 1 1 0 0 0
## 6 1 0 0 0 1
## 7 1 0 0 0 1
## 8 1 0 0 0 1
## 9 1 0 0 0 0
## 10 1 0 0 0 1
## # ℹ 53,898 more rows
Now we see that the one-hot encoding is implemented. Let’s put this into a linear model.
model_diamonds2 <- lm(carat ~ cut, data = diamonds2)
model_diamonds2$coefficients
## (Intercept) cutGood cutIdeal cutPremium cutVery Good
## 1.0302687 -0.1824062 -0.3278925 -0.1405634 -0.2243366
We need to interpret this result correctly. Note that all dummy variables are zero for the Fair cut, so the intercept here is the average carat of Fair-cut diamonds. The coefficient of each dummy variable is the difference between that cut's average carat and the intercept; all are negative, indicating smaller diamonds on average.
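Because this dummy coding simply recovers group means, we can cross-check the coefficients against the group averages (output not shown):
# Mean carat by cut: the Fair mean equals the intercept, and each other
# mean equals the intercept plus the corresponding coefficient.
diamonds2 %>%
  group_by(cut) %>%
  summarise(mean_carat = mean(carat))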
We can visualise this model in the following way:
grid <- diamonds2 %>%
data_grid(cut) %>%
add_predictions(model_diamonds2)
grid
## # A tibble: 5 × 2
## cut pred
## <chr> <dbl>
## 1 Fair 1.03
## 2 Good 0.848
## 3 Ideal 0.702
## 4 Premium 0.890
## 5 Very Good 0.806
So the predicted values are simply the mean carat for each category:
diamonds2$cut <- fct_relevel(diamonds2$cut, "Fair", "Good", "Very Good", "Premium", "Ideal") # make the labels in the right order for plotting purpose since we removed the order previously
ggplot() +
geom_point(aes(cut, carat), data = diamonds2) +
geom_point(aes(cut, pred), data = grid, colour = "red", size = 4)
From the plot, it is obvious that Fair-cut diamonds are heavier than Ideal-cut diamonds on average.
Fitting price with carat and cut together
Fitting price with carat or cut alone is not particularly interesting. Now let's fit price with carat and cut at the same time, which should provide better insights.
When we have both numeric and categorical variables in \(x\), there are two models to use. One is without the interaction term:
\[ \mathrm{price} = \beta_0 + \beta_{carat} \times \mathrm{carat} + \sum \beta_i \times \mathrm{cut\_dummy_i}\]
In this case, for each cut group there is a linear relationship between price and carat with the same slope but different intercepts.
diamonds3 <- diamonds %>%
filter(carat <= 3) %>%
mutate(log_price = log2(price), log_carat = log2(carat)) %>%
mutate(cut = as.character(cut))
diamonds3$cut <- fct_relevel(diamonds3$cut, "Fair", "Good", "Very Good", "Premium", "Ideal")
model_diamonds3 <- lm(log_price ~ log_carat + cut, data = diamonds3)
grid <- diamonds3 %>%
data_grid(log_carat, cut) %>%
add_predictions(model_diamonds3)
grid
## # A tibble: 1,280 × 3
## log_carat cut pred
## <dbl> <fct> <dbl>
## 1 -2.32 Fair 7.89
## 2 -2.32 Good 8.12
## 3 -2.32 Very Good 8.24
## 4 -2.32 Premium 8.23
## 5 -2.32 Ideal 8.35
## 6 -2.25 Fair 8.01
## 7 -2.25 Good 8.24
## 8 -2.25 Very Good 8.36
## 9 -2.25 Premium 8.35
## 10 -2.25 Ideal 8.47
## # ℹ 1,270 more rows
ggplot(diamonds3) +
geom_point(aes(x = log_carat, y = log_price, colour = cut)) +
geom_line(aes(x = log_carat, y = pred, colour = cut), data = grid)
This may not be a good assumption, since it forces the linear model for every cut group to share the same slope. We can add interaction terms to resolve this:
\[ \mathrm{price} = \beta_0 + \beta_{carat} \times \mathrm{carat} + \sum \beta_i \times \mathrm{cut\_dummy_i} + \sum \beta_{c,i} \times \mathrm{carat} \times \mathrm{cut\_dummy_i}\]
By introducing the interaction terms, now for different cut groups the slopes can be different.
diamonds3 <- diamonds %>%
filter(carat <= 3) %>%
mutate(log_price = log2(price), log_carat = log2(carat)) %>%
mutate(cut = as.character(cut))
diamonds3$cut <- fct_relevel(diamonds3$cut, "Fair", "Good", "Very Good", "Premium", "Ideal")
model_diamonds4 <- lm(log_price ~ log_carat * cut, data = diamonds3)
grid <- diamonds3 %>%
data_grid(log_carat, cut) %>%
add_predictions(model_diamonds4)
grid
## # A tibble: 1,280 × 3
## log_carat cut pred
## <dbl> <fct> <dbl>
## 1 -2.32 Fair 8.30
## 2 -2.32 Good 8.05
## 3 -2.32 Very Good 8.18
## 4 -2.32 Premium 8.31
## 5 -2.32 Ideal 8.33
## 6 -2.25 Fair 8.41
## 7 -2.25 Good 8.17
## 8 -2.25 Very Good 8.31
## 9 -2.25 Premium 8.42
## 10 -2.25 Ideal 8.45
## # ℹ 1,270 more rows
ggplot(diamonds3) +
geom_point(aes(x = log_carat, y = log_price, colour = cut)) +
geom_line(aes(x = log_carat, y = pred, colour = cut), data = grid)
In this case, the interaction terms do not actually change the trend very much, but their p-values indicate that we should still keep them.
summary(model_diamonds4)
##
## Call:
## lm(formula = log_price ~ log_carat * cut, data = diamonds3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.26407 -0.23864 -0.00931 0.23171 2.01353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.818586 0.009230 1280.52 <2e-16 ***
## log_carat 1.515429 0.013897 109.05 <2e-16 ***
## cutGood 0.266001 0.010996 24.19 <2e-16 ***
## cutVery Good 0.376526 0.010042 37.49 <2e-16 ***
## cutPremium 0.341711 0.009852 34.68 <2e-16 ***
## cutIdeal 0.478839 0.009830 48.71 <2e-16 ***
## log_carat:cutGood 0.221837 0.015386 14.42 <2e-16 ***
## log_carat:cutVery Good 0.211954 0.014445 14.67 <2e-16 ***
## log_carat:cutPremium 0.144763 0.014351 10.09 <2e-16 ***
## log_carat:cutIdeal 0.192736 0.014232 13.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3647 on 53898 degrees of freedom
## Multiple R-squared: 0.9378, Adjusted R-squared: 0.9378
## F-statistic: 9.036e+04 on 9 and 53898 DF, p-value: < 2.2e-16
Overall, we see that after accounting for carat, we correctly find that, for the same carat value, the best cut quality has the highest average price and the worst cut quality has the lowest average price.
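As a final illustration, here is a sketch of how the interaction model can be used to compare the predicted price of a 1-carat diamond across cut levels (new_diamond is a hypothetical input tibble; predictions are back-transformed from the log2 scale):
# Predicted price of a 1-carat diamond for each cut level.
new_diamond <- tibble(
  cut = factor(c("Fair", "Good", "Very Good", "Premium", "Ideal"),
               levels = levels(diamonds3$cut)),
  log_carat = log2(1)  # a 1-carat diamond
)
mutate(new_diamond, pred_price = 2^predict(model_diamonds4, new_diamond))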