A linear relationship between \(y\) and \(x\) may not be appropriate in some applications.
By appropriately redefining the variables we can easily incorporate nonlinearities into the simple regression model.
The model will still be linear in parameters; we do not use nonlinear transformations of the parameters.
In practice, natural logarithmic transformations, \(\log(y)\) or \(\ln(y)\), are widely used.
Other transformations may also be used, e.g., adding quadratic or cubic terms, the inverse form, etc.
Remember that the linearity of the regression model is determined by linearity in the \(\beta\)'s, not in \(x\) and \(y\).
We can still use nonlinear transformations of \(x\) and \(y\), such as \(\log x\), \(\log y\), \(x^2\), \(\sqrt{x}\), \(\frac{1}{x}\), \(y^{1/4}\); the model is still linear in parameters.
But models that include nonlinear transformations of the \(\beta\)'s are not linear in parameters and cannot be analyzed within the OLS framework.
For example, the following models are not linear in parameters:
\[\begin{align} consumption &= \frac{1}{\beta_0 + \beta_1 income} + u \notag \\ y &=\beta_0 + \beta_1^2x + u \notag \\ y &= \beta_0 + e^{\beta_1x} + u \notag \end{align}\]
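Models like these require nonlinear estimation methods. As a minimal sketch (simulated data and made-up parameter values, not from any data set used in these notes), the first model can be fit by nonlinear least squares with R's nls():

# fit consumption = 1/(b0 + b1*income) + u by nonlinear least squares;
# the data and the "true" values b0 = 0.5, b1 = 0.05 are hypothetical
set.seed(1)
income <- runif(200, 1, 10)
consumption <- 1 / (0.5 + 0.05 * income) + rnorm(200, sd = 0.05)
nls(consumption ~ 1 / (b0 + b1 * income), start = list(b0 = 1, b1 = 0.1))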
The log-level (semi-logarithmic) model is
\[ \log y = \beta_0 + \beta_1 x + u \]
\[\begin{align} \Delta \log y &= \beta_1 \Delta x \notag \\ \% \Delta y &\approx 100\beta_1\, \Delta x \notag \end{align}\]
Interpretation: for a one-unit change in \(x\), \(y\) changes by approximately \((100\beta_1)\%\).
The relationship between \(x\) and \(y\) before the (natural) logarithmic transformation can be written as
\[ y = \exp(\beta_0 + \beta_1 x + u) \equiv e^{\beta_0 + \beta_1 x + u} \]
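Note that \(100\beta_1\) is only an approximation for small changes; the exact percentage change implied by the model is \(100(e^{\beta_1 \Delta x} - 1)\). A quick check with a hypothetical slope value:

# approximate vs. exact %-change in y for a hypothetical b1 = 0.08
b1 <- 0.08
dx <- 1                       # one-unit change in x
100 * b1 * dx                 # approximation: 8
100 * (exp(b1 * dx) - 1)      # exact: about 8.33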
The level-log model is
\[ y = \beta_0 + \beta_1 \log x + u \]
\[\begin{align} \Delta y &= \beta_1 \Delta \log x \notag \\ &= \left( \frac{\beta_1}{100} \right) \underbrace{100\, \Delta \log x}_{\approx\, \% \Delta x} \notag \end{align}\]
Interpretation: a 1% increase in \(x\) is associated with a \(\beta_1/100\) unit change in \(y\).
The log-log (constant elasticity) model is
\[ \log y = \beta_0 + \beta_1 \log x + u \]
\[\begin{align} \Delta \log y &= \beta_1 \Delta \log x \notag \\ \% \Delta y &= \beta_1\, \% \Delta x \notag \\ \beta_1 &= \frac{\% \Delta y}{\% \Delta x} \notag \end{align}\]
Interpretation: \(\beta_1\) is the elasticity of \(y\) with respect to \(x\).
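A quick numerical check of the constant-elasticity interpretation (the values of \(\beta_0\), \(\beta_1\), and \(x\) below are illustrative, not estimates):

# in y = exp(b0) * x^b1 (error term suppressed), a 1% increase in x
# changes y by approximately b1 percent
b0 <- 1; b1 <- 0.25; x <- 100
y     <- exp(b0) * x^b1
y_new <- exp(b0) * (1.01 * x)^b1   # raise x by 1%
100 * (y_new / y - 1)              # about 0.25, i.e. b1 * (%-change in x)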
Example (log-level): the return to education, using the wage1 data set from the wooldridge package.

library(wooldridge)
data(wage1)
# regress log hourly wage on years of education
wage.logl <- lm(log(wage) ~ educ, data = wage1)
summary(wage.logl)
##
## Call:
## lm(formula = log(wage) ~ educ, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.21158 -0.36393 -0.07263 0.29712 1.52339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.583773 0.097336 5.998 3.74e-09 ***
## educ 0.082744 0.007567 10.935 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4801 on 524 degrees of freedom
## Multiple R-squared: 0.1858, Adjusted R-squared: 0.1843
## F-statistic: 119.6 on 1 and 524 DF, p-value: < 2.2e-16
\[ \widehat{\log(wage)} = \underset{(0.097)}{0.584} + \underset{(0.008)}{0.083}\; educ \]
After multiplying the slope estimate by 100, it can be interpreted as a percentage change: an additional year of education is predicted to increase wages by about 8.3%. This is called the return to another year of education.
\(R^2 = 0.186\): Education explains about 18.6% of the variation in log wage.
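Because 8.3% relies on the approximation \(100\hat{\beta}_1\), the exact predicted change, \(100(e^{\hat{\beta}_1} - 1)\), can be recovered from the fitted model:

# exact percentage return to an additional year of education
b1 <- coef(wage.logl)["educ"]
100 * (exp(b1) - 1)    # about 8.63%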
plot(wage1$educ, wage1$lwage,
     col = "steelblue",
     pch = 20,
     main = "Log-level Regression",
     cex.main = 1,
     ylab = "Log Wage",
     xlab = "Education")
# add the fitted log-level regression line
abline(wage.logl,
       col = "red",
       lwd = 2)
Example (level-log): test scores and district income, using the CASchools data set from the AER package.
\[ Score = \beta_0 + \beta_1 \log(income) + u \]
library(AER)
data(CASchools)
# test score = average of the reading and math scores
CASchools$score <- (CASchools$read + CASchools$math) / 2
score.llog <- lm(score ~ log(income), data = CASchools)
summary(score.llog)
##
## Call:
## lm(formula = score ~ log(income), data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.256 -9.050 0.078 8.230 31.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 557.832 4.200 132.81 <2e-16 ***
## log(income) 36.420 1.571 23.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.62 on 418 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5615
## F-statistic: 537.4 on 1 and 418 DF, p-value: < 2.2e-16
\[ \widehat{\text{Score}} = \underset{(4.200)}{557.832} + \underset{(1.571)}{36.420}\; \log(income) \]
Interpretation: a 1% increase in district income is associated with a \(\left(\frac{36.42}{100}\right) = 0.3642\) point increase in test scores.
\(R^2 = 0.5625\): log(income) can explain about 56.25% of the variation in test scores.
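To see the level-log interpretation at work, consider a 10% increase in district income, say from 10 to 11 (thousand dollars); by the formula above the predicted score change is roughly \(36.42 \times \log(1.1) \approx 3.47\) points:

# predicted score change for a 10% increase in income (10 -> 11)
p <- predict(score.llog, newdata = data.frame(income = c(10, 11)))
diff(p)    # about 3.47 points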
plot(CASchools$income, CASchools$score,
     col = "steelblue",
     pch = 20,
     xlab = "District Income (thousands of dollars)",
     ylab = "Test Score",
     main = "Test Score vs. District Income",
     cex.main = 1)
# sort by income so the fitted level-log curve is drawn left to right
order_id <- order(CASchools$income)
lines(CASchools$income[order_id],
      fitted(score.llog)[order_id],
      col = "red",
      lwd = 2)
# simple linear (level-level) fit for comparison
abline(lm(score ~ income, data = CASchools),
       col = "blue",
       lwd = 2)
Example (log-log): CEO salary and firm sales, using the ceosal1 data set from the wooldridge package.
\[ \log(salary) = \beta_0 + \beta_1 \log(sales) + u \]

data(ceosal1)
salary.loglog <- lm(log(salary) ~ log(sales), data = ceosal1)
summary(salary.loglog)
##
## Call:
## lm(formula = log(salary) ~ log(sales), data = ceosal1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01038 -0.28140 -0.02723 0.21222 2.81128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.82200 0.28834 16.723 < 2e-16 ***
## log(sales) 0.25667 0.03452 7.436 2.7e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5044 on 207 degrees of freedom
## Multiple R-squared: 0.2108, Adjusted R-squared: 0.207
## F-statistic: 55.3 on 1 and 207 DF, p-value: 2.703e-12
\[ \widehat{\log(salary)} = \underset{(0.2883)}{4.822} + \underset{(0.0345)}{0.257}\; \log(sales) \]
Interpretation: a 1% increase in firm sales is predicted to increase CEO salary by about 0.257%. In other words, the elasticity of CEO salary with respect to sales is 0.257.
\(R^2 = 0.2108\): log(sales) can explain about 21.08% of variation in log(salary).
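As in the log-level case, the elasticity interpretation is an approximation for small changes. For a larger change, say a 10% increase in sales, the exact prediction follows from \(salary \propto sales^{\hat{\beta}_1}\):

# approximate vs. exact effect of a 10% increase in sales on salary
b1 <- coef(salary.loglog)["log(sales)"]
b1 * 10                  # approximation: about 2.57%
100 * (1.10^b1 - 1)      # exact: about 2.48%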
plot(ceosal1$sales, ceosal1$salary,
     col = "steelblue",
     pch = 20,
     main = "Salary vs. Sales",
     cex.main = 1,
     xlab = "Sales",
     ylab = "Salary")
# simple linear (level-level) fit for comparison
abline(lm(salary ~ sales,
          data = ceosal1),
       col = "red",
       lwd = 2)
plot(log(salary) ~ log(sales),
     col = "steelblue",
     pch = 20,
     data = ceosal1,
     main = "Log-Log Regression Fit",
     cex.main = 1)
# the log-log model is linear in (log(sales), log(salary)) space
abline(salary.loglog,
       col = "red",
       lwd = 2)