In today’s class, we’ll learn some extensions to the linear regression model. First of all, we can add more explanatory variables, say \(k\) of them:
\[ y_i=\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+...+\beta_kx_{k,i}+u_i \]
where:
- \(y_i\) is a quantitative continuous variable (or a discrete variable with a wide range of possible values)
- each \(x\) can be any kind of variable: continuous, discrete or qualitative. In every case, we need to know how to interpret the results.
Let us start with the following model from Week 3:
library(readr)
Advertising <- read_csv("Advertising.csv")
## New names:
## • `` -> `...1`
## Rows: 200 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): ...1, TV, radio, newspaper, sales
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(Advertising)
m1<-lm(sales~TV+radio+newspaper, data=Advertising)
summary(m1)
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
To interpret the coefficients, we need to know the units of measurement of each variable.
In the Advertising data set, sales are measured in thousands of units, while the TV, radio, and newspaper budgets are in thousands of dollars.
- For each additional thousand dollars spent on TV advertising (holding the other budgets constant), sales increase on average by about 45.8 units.
- For each additional thousand dollars spent on radio advertising (holding the other budgets constant), sales increase on average by about 188.5 units.
- For each additional thousand dollars spent on newspaper advertising (holding the other budgets constant), sales decrease on average by about 1.0 units, although this coefficient is not statistically different from zero (p-value 0.86).
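The unit conversion behind these statements can be checked with a few lines of R. This is only a sketch: the coefficients are typed in by hand from the `summary(m1)` output above, and the variable names `beta` and `units_per_1000_dollars` are introduced here for illustration.

```r
# Coefficients copied from the summary(m1) output above
# (thousands of units sold per thousand dollars of budget).
beta <- c(TV = 0.045765, radio = 0.188530, newspaper = -0.001037)

# Sales are recorded in thousands of units, so multiplying by 1000
# expresses the effect of an extra $1,000 in plain units.
units_per_1000_dollars <- beta * 1000

print(round(units_per_1000_dollars, 1))
```

With the fitted model in memory, `coef(m1)[-1] * 1000` should give the same numbers without retyping them.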
However, we can improve the fit of the model. Let’s review the plots we made in Week 3:
par(mfrow=c(2, 2))
plot(Advertising$TV,Advertising$sales)
plot(Advertising$radio,Advertising$sales)
plot(Advertising$newspaper,Advertising$sales)
As we can see, the relationship between TV advertising and sales does not look linear. Its shape resembles a logarithm (https://en.wikipedia.org/wiki/Logarithm).
So, we can fit this other model which is non-linear in the variables but linear in the parameters:
\[ sales_i=\beta_0+\beta_1 \log(TV_{i})+\beta_2 radio_i+\beta_3 newspaper_i+u_i \]
library(readr)
m2<-lm(sales~log(TV)+radio+newspaper, data=Advertising)
summary(m2)
##
## Call:
## lm(formula = sales ~ log(TV) + radio + newspaper, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2568 -0.9103 -0.2539 0.7834 4.9976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.098876 0.576919 -15.771 <2e-16 ***
## log(TV) 3.936179 0.113373 34.719 <2e-16 ***
## radio 0.206700 0.008203 25.199 <2e-16 ***
## newspaper -0.002531 0.005596 -0.452 0.652
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.606 on 196 degrees of freedom
## Multiple R-squared: 0.9067, Adjusted R-squared: 0.9052
## F-statistic: 634.7 on 3 and 196 DF, p-value: < 2.2e-16
but the new coefficient has a different interpretation, close to an elasticity:
- If we increase TV advertising by 1% (holding the other budgets constant), sales will increase on average by about 39.4 units (\(3.936/100\) thousand units).
In the next class, we explain why :)
Let us start by borrowing the elasticity formula from economics. Define a function of one variable (the argument works similarly for a function of several variables):
\[ y=f(x) \]
the elasticity is
\[ \epsilon_{y,x}=\frac{\frac{\Delta y}{y}}{\frac{\Delta x}{x}} \]
Using differential calculus, the elasticity can be approximated by
\[ \epsilon_{y,x} \approx f'(x)\frac{x}{y} \]
and it means:
If \(x\) increases by 1%, then \(y\) increases by \(\epsilon_{y,x}\%.\)
Also, remember that \[ \frac{\Delta y}{y}\times100 \]
is the percentage increase of \(y.\)
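To see the two versions of the elasticity side by side, here is a small numerical sketch. The function \(y=2\sqrt{x}\) and the point `x0 = 50` are hypothetical choices made only for illustration; for this function the elasticity is 0.5 at every \(x\).

```r
# Hypothetical function chosen for illustration: y = 2 * sqrt(x).
f <- function(x) 2 * x^0.5
f_prime <- function(x) x^(-0.5)   # analytic derivative of f

x0 <- 50                          # arbitrary evaluation point

# Definition: relative change in y divided by relative change in x,
# computed for a 1% increase in x.
x1 <- 1.01 * x0
elasticity_discrete <- ((f(x1) - f(x0)) / f(x0)) / ((x1 - x0) / x0)

# Calculus version: the derivative times x over y.
elasticity_calculus <- f_prime(x0) * x0 / f(x0)

print(c(discrete = elasticity_discrete, calculus = elasticity_calculus))
```

The two numbers are close but not identical, because the calculus version is an approximation that becomes exact only as the change in \(x\) shrinks to zero.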
A log-log regression model is stated as follows
\[ \log y_{i}=\beta_{0}+\beta_{1}\log x_{i}+u_{i} \]
If we apply the exponential (the inverse of the logarithm) to both sides and, for simplicity, ignore the error term, we get \(e^{\log y_{i}}=e^{\beta_{0}+\beta_{1}\log x_{i}}\), which can be rewritten as \(y=e^{\beta_{0}+\beta_{1}\log x}\).
Taking the derivative (using the chain rule):
\[ f'(x)=e^{\beta_{0}+\beta_{1}\log x}\beta_{1}\frac{1}{x} \]
Note that we can substitute \(y=e^{\beta_{0}+\beta_{1}\log x}\):
\[ f'(x)=y\beta_{1}\frac{1}{x} \]
Solving for \(\beta_{1}\):
\[ \beta_{1}=f'(x)\frac{x}{y} \]
So, \(\beta_{1}\) is directly the elasticity.
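A quick simulation can make this result tangible. The sketch below does not use the Advertising data: it generates artificial data with a known elasticity (all values, including `b1 = 0.35`, the seed and the sample size, are assumptions chosen for illustration) and checks that the log-log regression recovers it.

```r
set.seed(123)                       # arbitrary seed, for reproducibility
n  <- 500
x  <- runif(n, min = 1, max = 300)  # simulated regressor (hypothetical)
u  <- rnorm(n, sd = 0.05)           # simulated error term
b0 <- 0.6
b1 <- 0.35                          # true elasticity used in the simulation

# Data generated from the log-log model: log(y) = b0 + b1 * log(x) + u
y <- exp(b0 + b1 * log(x) + u)

# The slope on log(x) should be close to the true elasticity b1.
sim_fit <- lm(log(y) ~ log(x))
print(coef(sim_fit))
```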
library(readr)
m3<-lm(log(sales)~log(TV)+radio+newspaper, data=Advertising)
summary(m3)
##
## Call:
## lm(formula = log(sales) ~ log(TV) + radio + newspaper, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51337 -0.03416 0.00673 0.03237 0.19089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5812826 0.0246066 23.623 <2e-16 ***
## log(TV) 0.3570508 0.0048355 73.839 <2e-16 ***
## radio 0.0133395 0.0003499 38.129 <2e-16 ***
## newspaper 0.0001381 0.0002387 0.578 0.564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0685 on 196 degrees of freedom
## Multiple R-squared: 0.9731, Adjusted R-squared: 0.9727
## F-statistic: 2362 on 3 and 196 DF, p-value: < 2.2e-16
In this case, we say that for each 1% increase in TV advertising, sales will increase (holding the other variables constant) by about 0.36%.
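Reading the coefficient as a percentage is a first-order approximation. As a rough check, the sketch below compares it with the exact change implied by the fitted equation, where a 1% increase in TV multiplies predicted sales by \(1.01^{\beta_1}\). The coefficient is typed in from the output above, and the variable names are introduced here for illustration.

```r
# Elasticity estimate copied from the summary(m3) output above.
beta_log_tv <- 0.3570508

# Exact percentage change in fitted sales when TV increases by 1%:
# log(sales) changes by beta * log(1.01), so sales is multiplied by 1.01^beta.
tv_pct_exact <- 100 * (1.01^beta_log_tv - 1)

# First-order approximation: the coefficient itself, read as a percentage.
tv_pct_approx <- beta_log_tv

print(c(exact = tv_pct_exact, approx = tv_pct_approx))
```

For a change as small as 1%, the two values differ only in the third decimal place, so the simple reading is safe here.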
A log-level regression model is stated as follows:
\[ \log y_{i}=\beta_{0}+\beta_{1}x_{i}+u_{i} \]
Again, if we apply the exponential to both sides (and, for simplicity, we “forget” the error term)
\[ e^{\log y_{i}}=e^{\beta_{0}+\beta_{1}x_{i}} \]
we have
\[ y=e^{\beta_{0}+\beta_{1}x} \]
Doing the derivative,
\[ f'(x)=e^{\beta_{0}+\beta_{1}x}\times\beta_{1} \]
If we substitute,
\[ f'(x)=y\times\beta_{1} \]
Now, we need to recall a result called the “linear approximation of a function” or “first-order Taylor approximation”:
\[ \Delta y\approx f'(x)\Delta x, \] which we can also write as \[ \frac{\Delta y}{\Delta x}\approx f'(x). \]
Plugging this result into the previous equation \(f'(x)=y\times\beta_{1}\):
\[ \frac{\Delta y}{\Delta x}\approx y\times\beta_{1} \]
Rearranging terms and multiplying both sides by 100:
\[ \frac{\Delta y}{y}\times100\approx\beta_{1}\times100\,\Delta x \]
which can be interpreted as follows: \[ \beta_{1}\times100 \]
is the approximate percentage increase of \(y\) for each one-unit increase in \(x\).
library(readr)
m3<-lm(log(sales)~log(TV)+radio+newspaper, data=Advertising)
summary(m3)
##
## Call:
## lm(formula = log(sales) ~ log(TV) + radio + newspaper, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51337 -0.03416 0.00673 0.03237 0.19089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5812826 0.0246066 23.623 <2e-16 ***
## log(TV) 0.3570508 0.0048355 73.839 <2e-16 ***
## radio 0.0133395 0.0003499 38.129 <2e-16 ***
## newspaper 0.0001381 0.0002387 0.578 0.564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0685 on 196 degrees of freedom
## Multiple R-squared: 0.9731, Adjusted R-squared: 0.9727
## F-statistic: 2362 on 3 and 196 DF, p-value: < 2.2e-16
In this case, for radio, we can say: for each additional thousand dollars in the radio budget, sales will increase by about 1.3% (holding the other variables constant).
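Because the log-level reading relies on a linear approximation, it can be compared with the exact change implied by the fitted equation, \(100\,(e^{\beta_1}-1)\) for a one-unit increase. The following sketch uses the radio coefficient typed in from the output above; the variable names are assumptions introduced for illustration.

```r
# Radio coefficient copied from the summary(m3) output above.
beta_radio <- 0.0133395

# Approximate reading used in the text: 100 * beta percent per extra $1,000.
radio_pct_approx <- 100 * beta_radio

# Exact change in fitted sales: log(sales) rises by beta,
# so sales is multiplied by exp(beta).
radio_pct_exact <- 100 * (exp(beta_radio) - 1)

print(c(approx = radio_pct_approx, exact = radio_pct_exact))
```

The gap is negligible for a coefficient this small; it grows as \(\beta_1\Delta x\) gets larger, which is when the exact formula becomes worth using.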
Finally, the level-log is stated as:
\[ y_{i}=\beta_{0}+\beta_{1}\log x_{i}+u_{i} \]
Again, you can check (following steps similar to the ones above, with \(f'(x)=\beta_{1}\frac{1}{x}\) and \(\Delta y\approx f'(x)\Delta x\)) that
\[ \Delta y\approx\beta_{1}\frac{1}{x}\Delta x \] so, if we multiply and divide by 100:
\[ \Delta y\approx\frac{\beta_{1}}{100}\times\underset{\text{percentage increase}}{100\frac{\Delta x}{x}} \]
which can be interpreted as:
for each 1% increase in \(x\), \(y\) increases by approximately \(\frac{\beta_{1}}{100}\) units.
library(readr)
m4<-lm(sales~log(TV)+radio+newspaper, data=Advertising)
summary(m4)
##
## Call:
## lm(formula = sales ~ log(TV) + radio + newspaper, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2568 -0.9103 -0.2539 0.7834 4.9976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.098876 0.576919 -15.771 <2e-16 ***
## log(TV) 3.936179 0.113373 34.719 <2e-16 ***
## radio 0.206700 0.008203 25.199 <2e-16 ***
## newspaper -0.002531 0.005596 -0.452 0.652
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.606 on 196 degrees of freedom
## Multiple R-squared: 0.9067, Adjusted R-squared: 0.9052
## F-statistic: 634.7 on 3 and 196 DF, p-value: < 2.2e-16
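This is interpreted as follows: if the TV budget increases by 1% (holding the other variables constant), sales will increase by about \(3.936/100\approx0.0394\) thousand units, that is, about 39.4 units.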
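As a final check on the level-log reading, the sketch below compares the \(\beta_1/100\) rule with the exact change in fitted sales, \(\beta_1\log(1.01)\), both converted from thousands of units to units. The coefficient is typed in from the output above, and the variable names are introduced here for illustration.

```r
# log(TV) coefficient copied from the summary(m4) output above
# (measured in thousands of units of sales).
beta_log_tv_level <- 3.936179

# Rule from the text: beta / 100 thousand units per 1% increase in TV,
# multiplied by 1000 to express it in units.
approx_units <- beta_log_tv_level / 100 * 1000

# Exact change in fitted sales: beta * (log(1.01 * TV) - log(TV))
# = beta * log(1.01), again converted to units.
exact_units <- beta_log_tv_level * log(1.01) * 1000

print(c(approx = approx_units, exact = exact_units))
```

Both figures are around 39 units, so the simple rule is accurate enough for a 1% change.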