Robbie Beane
When the relationship between a predictor and the response appears to be non-linear, we can consider transforming one or both of the variables in hopes that the new variables will exhibit a linear relationship. In this lesson, we will consider two synthetic examples where \(Y\) seems to depend exponentially on \(X\). We will consider the following models for these examples.
Linear Model: \(Y = \beta_0 + \beta_1 X + \varepsilon\).
“Exponential” Model: \(Y = \beta_0 + \beta_1 e^X + \varepsilon\).
Log-Level Model: \(\ln(Y) = \beta_0 + \beta_1 X + \varepsilon\)
Log-Log Model: \(\ln(Y) = \beta_0 + \beta_1 \ln(X) + \varepsilon\)
We will load the example data from a tab separated file located at the path data/log_ex01.txt.
We can see from a plot of the data that \(X\) and \(Y\) seem to exhibit a non-linear relationship. We also see that the variance in \(Y\) seems to increase with \(X\).
We will begin by creating a fitted linear model of the form: \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)
##
## Call:
## lm(formula = y ~ x, data = ex01)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.97 -50.98 -10.01 33.54 318.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -221.754 26.707 -8.303 5.7e-13 ***
## x 66.240 4.746 13.958 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.69 on 98 degrees of freedom
## Multiple R-squared: 0.6653, Adjusted R-squared: 0.6619
## F-statistic: 194.8 on 1 and 98 DF, p-value: < 2.2e-16
We plot the fitted line on top of the scatter plot.
plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
abline(ex01_m1$coefficients, col="darkred", lwd=2)We can see strong evidence of a non-linear trend, as well as heteroskedasticity in the residual plot below.
res1 <- ex01_m1$residuals
plot(res1 ~ ex01$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)We can assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.
##
## Shapiro-Wilk normality test
##
## data: res1
## W = 0.92718, p-value = 3.454e-05
The Shapiro-Wilk test strongly suggests non-normality of the residuals. The histogram and QQ-plot reveal that the residuals are right-skewed.
We will now create a fitted model of the form \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 e^X\).
##
## Call:
## lm(formula = y ~ exp(x), data = ex01)
##
## Residuals:
## Min 1Q Median 3Q Max
## -173.688 -22.301 -5.473 21.888 202.380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.869993 7.814467 4.206 5.75e-05 ***
## exp(x) 0.153758 0.007461 20.607 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.14 on 98 degrees of freedom
## Multiple R-squared: 0.8125, Adjusted R-squared: 0.8106
## F-statistic: 424.7 on 1 and 98 DF, p-value: < 2.2e-16
We plot the fitted line on top of the scatter plot. We see that the fitted line does seem to capture the relationship between X and Y reasonably well.
ptn = seq(from=2,to=8, by=0.1)
plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
lines(ptn, predict(ex01_m2, data.frame(x=ptn)), col="darkred", lwd=2)The residual plot below shows some evidence of a very slight trend, as well as severe heteroskedasticity.
res2 <- ex01_m2$residuals
plot(res2 ~ ex01$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)The figure below shows a scatter plot of the data, along with the fitted curve and a 95% prediction band. We can see that the band is too wide in some places, and too narrow in others. This is a direct response of the heteroskedasticity in the residuals.
plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
p2 <- predict(ex01_m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p2[,"fit"], col="darkred", lwd=2)
lines(ptn, p2[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p2[,"upr"], col="darkorange", lwd=2)We can assess normality of the residuals using a Shapiro-Wilks test, a histogram, and a QQ-plot.
##
## Shapiro-Wilk normality test
##
## data: res2
## W = 0.89644, p-value = 9.656e-07
We now consider a fitted model of the form: \(\ln(\hat Y) = \hat{\beta}_0 + \hat{\beta}_1 X\).
We will start by plotting our new response, \(\ln(Y)\) against \(X\). This plot exhibits a strong linear relationship.
We will now create our model.
##
## Call:
## lm(formula = log(y) ~ x, data = ex01)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81094 -0.23060 0.03466 0.28640 0.67522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99771 0.11466 8.702 7.91e-14 ***
## x 0.62018 0.02037 30.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3507 on 98 degrees of freedom
## Multiple R-squared: 0.9043, Adjusted R-squared: 0.9034
## F-statistic: 926.5 on 1 and 98 DF, p-value: < 2.2e-16
The residual plot below displays no apparent trend or heteroskedasticity. Note that these are residuals of \(\ln(Y)\), and not \(Y\).
res3 <- ex01_m3$residuals
plot(res3 ~ ex01$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)We now assess the normality of the residuals with a Shapiro-Wilks test, a histogram, and a QQ-plot.
##
## Shapiro-Wilk normality test
##
## data: res3
## W = 0.98234, p-value = 0.2015
These plots suggest that the residuals were likey drawn from a normal (or perhaps very slightly left-skewed) distribution.
We will add the fitted curve and to the scatter plot of \(\ln(Y)\) against \(X\).
plot(log(y) ~ x, ex01, pch=21, col="black", bg="cyan")
p3 <- predict(ex01_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p3[,"fit"], col="darkred", lwd=2)
lines(ptn, p3[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p3[,"upr"], col="darkorange", lwd=2)We can exponentiate the values of \(\ln(Y)\) to transform the vertical axis back in terms of \(Y\). Notice that the fitted curve does seem to provide a good fit for the data, and the transformed prediction band seems to account for the heteroskedasticity.
plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
p3 <- predict(ex01_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)We now consider a fitted model of the form: \(\ln(\hat Y) = \hat{\beta}_0 + \hat{\beta}_1 \ln(X)\).
We will start by plotting \(\ln(Y)\) against \(\ln(X)\). This plot exhibits a slight non-linearity.
We will now create our model.
##
## Call:
## lm(formula = log(y) ~ log(x), data = ex01)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92268 -0.30417 0.01803 0.28857 0.93462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4137 0.1889 -2.19 0.0309 *
## log(x) 2.9236 0.1139 25.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.408 on 98 degrees of freedom
## Multiple R-squared: 0.8705, Adjusted R-squared: 0.8692
## F-statistic: 659 on 1 and 98 DF, p-value: < 2.2e-16
The previously observed non-linearity is readily apparent in the residual plot.
res4 <- ex01_m4$residuals
plot(res4 ~ ex01$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)We assess normality of the residuals using a Shapiro-Wilks test, a histogram, and a QQ-plot.
##
## Shapiro-Wilk normality test
##
## data: res4
## W = 0.98982, p-value = 0.6502
Of the models we have considered, Model 3 has the highest \(r^2\) value. Note, however, that since the response (\(Y\)) in the first two models is different from the response in the second pair of models (\(\ln(Y)\)), these \(r^2\) values are not directly comparable.
m1_r2 <- summary(ex01_m1)$r.squared
m2_r2 <- summary(ex01_m2)$r.squared
m3_r2 <- summary(ex01_m3)$r.squared
m4_r2 <- summary(ex01_m4)$r.squared
cbind(m1_r2, m2_r2, m3_r2, m4_r2)## m1_r2 m2_r2 m3_r2 m4_r2
## [1,] 0.6653207 0.8124973 0.9043473 0.8705464
If we are careful about performing the appropriate transformations in the log-level and log-log models, we can directly compare the SSE scores. However, keep in mind that these values only tell part of the story. As we see, Model 2 has the lowest SSE score, but as we noted, it has serious issues resulting from heteroskedasticity.
sse1 <- sum((ex01$y - ex01_m1$fitted.values)^2)
sse2 <- sum((ex01$y - ex01_m2$fitted.values)^2)
sse3 <- sum((ex01$y - exp(ex01_m3$fitted.values))^2)
sse4 <- sum((ex01$y - exp(ex01_m4$fitted.values))^2)
cbind(sse1, sse2, sse3, sse4)## sse1 sse2 sse3 sse4
## [1,] 653919.3 366355.6 383890.1 582747.8
We will use Model 3 going forward. It has a high \(r^2\) value, and seems to satisfy the Gauss-Markov assumptions.
When using a log-level model to create predictions for \(Y\), we need to first generate a prediction for \(\ln(Y)\), and then exponentiate. Let’s construct a 95% prediction interval for \(Y\) when \(X = 6.5\).
ln_yhat = predict(ex01_m3, data.frame(x=c(6.5)), interval="prediction", level=0.95)
yhat = exp(ln_yhat)
yhat## fit lwr upr
## 1 152.7606 75.78804 307.909
We will load the example data from a tab separated file located at the path data/log_ex02.txt.
We can see from a plot of the data that \(X\) and \(Y\) seem to exhibit a non-linear relationship. We also see that the variance in \(Y\) seems to increase with \(X\).
We will begin by creating a fitted linear model of the form: \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)
##
## Call:
## lm(formula = y ~ x, data = ex02)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2189.4 -671.4 -170.1 516.8 3842.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3286.97 351.15 -9.361 2.96e-15 ***
## x 1012.61 65.69 15.416 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1083 on 98 degrees of freedom
## Multiple R-squared: 0.708, Adjusted R-squared: 0.705
## F-statistic: 237.6 on 1 and 98 DF, p-value: < 2.2e-16
We plot the fitted line on top of the scatter plot.
plot(y ~ x, ex02, pch=21, col="black", bg="cyan")
abline(ex02_m1$coefficients, col="darkred", lwd=2)We can see strong evidence of a non-linear trend, as well as heteroskedasticity in the residual plot below.
res1 <- ex02_m1$residuals
plot(res1 ~ ex02$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)We can assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.
##
## Shapiro-Wilk normality test
##
## data: res1
## W = 0.92902, p-value = 4.378e-05
The Shapiro-Wilk test strongly suggests non-normality of the residuals. The histogram and QQ-plot reveal that the residuals are right-skewed.
Noting the apparent heteroskedasticity in the model, we will now consider a fitted model of the form: \(\ln(\hat{Y}) = \hat{\beta}_0 + \hat{\beta}_1 X\).
We will start by plotting the response \(\ln(Y)\) against \(X\). This plot exhibits a somewhat non-linear relationship.
A summary of the log-level model is shown below.
##
## Call:
## lm(formula = log(y) ~ x, data = ex02)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00249 -0.26117 0.03103 0.26751 0.99605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1290 0.1465 21.36 <2e-16 ***
## x 0.7335 0.0274 26.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4517 on 98 degrees of freedom
## Multiple R-squared: 0.8797, Adjusted R-squared: 0.8785
## F-statistic: 716.6 on 1 and 98 DF, p-value: < 2.2e-16
Let’s consider the residual plot. Note that these are residuals of \(\ln(Y)\), and not \(Y\).
The previously observed non-linearity is apparent in the residual plot.
res2 <- ex02_m2$residuals
plot(res2 ~ ex02$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)We now assess the normality of the residuals with a histogram and a QQ-plot.
##
## Shapiro-Wilk normality test
##
## data: res2
## W = 0.9872, p-value = 0.4512
The figure below shows a scatter plot of \(\ln(Y)\) against \(X\). It also displays the fitted curve, as well as a 95% prediction band.
plot(log(y) ~ x, ex02, pch=21, col="black", bg="cyan")
p2 <- predict(ex02_m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p2[,"fit"], col="darkred", lwd=2)
lines(ptn, p2[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p2[,"upr"], col="darkorange", lwd=2)We can exponentiate the values of \(\ln(Y)\) to transform the vertical axis back in terms of \(Y\).
plot(y ~ x, ex02, pch=21, col="black", bg="cyan")
p2 <- predict(ex02_m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p2[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p2[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p2[,"upr"]), col="darkorange", lwd=2)We now consider a fitted model of the form: \(\ln(\hat Y) = \hat{\beta}_0 + \hat{\beta}_1 \ln(X)\).
We start by plotting \(\ln(Y)\) against \(\ln(X)\). This plot exhibits a strong linear relationship.
A summary of the fitted model is shown below.
##
## Call:
## lm(formula = log(y) ~ log(x), data = ex02)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96888 -0.24472 -0.01224 0.23851 0.91477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2986 0.1855 7.001 3.2e-10 ***
## log(x) 3.5432 0.1154 30.699 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3997 on 98 degrees of freedom
## Multiple R-squared: 0.9058, Adjusted R-squared: 0.9048
## F-statistic: 942.4 on 1 and 98 DF, p-value: < 2.2e-16
The residual plot below displays no apparent trend or heteroskedasticity.
res3 <- ex02_m3$residuals
plot(res3 ~ ex02$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)There is no apparent heteroskedasticity or trend in this residual plot.
We now assess the normality of the residuals.
##
## Shapiro-Wilk normality test
##
## data: res3
## W = 0.99368, p-value = 0.9253
We will add the fitted curve and to the scatter plot of \(\ln(Y)\) against \(X\).
plot(log(y) ~ log(x), ex02, pch=21, col="black", bg="cyan")
p3 <- predict(ex02_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(log(ptn), p3[,"fit"], col="darkred", lwd=2)
lines(log(ptn), p3[,"lwr"], col="darkorange", lwd=2)
lines(log(ptn), p3[,"upr"], col="darkorange", lwd=2)We can exponentiate the values of \(ln(Y)\) to transform back to be in terms of \(Y\).
plot(y ~ x, ex02, pch=21, col="black", bg="cyan")
p3 <- predict(ex02_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)When using a log-log model to create predictions for \(Y\), we need to first generate a prediction for \(\ln(Y)\), and then exponentiate. Let’s construct a 95% prediction interval for \(Y\) when \(X = 6.5\).
ln_yhat = predict(ex02_m3, data.frame(x=c(6.5)), interval="prediction", level=0.95)
yhat = exp(ln_yhat)
yhat## fit lwr upr
## 1 2781.552 1249.637 6191.424
If we solve for \(Y\) in the log-level model, we see that the following forms for the model are equivalent:
\[\ln(Y) = \beta_0 + \beta_1 X + \varepsilon\]
\[Y = e^{\beta_0} e^{\beta_1 X} e^{\varepsilon}\] Since \(\beta_0\) is a constant, \(e^{\beta_1 X}\) is a positive constant. If we call this constant \(K\), we get the following expression for the log-level model:
\[Y = K e^{\beta_1 X} e^{\varepsilon}\]
If we solve for \(Y\) in the log-log model, we see that the following forms for the model are equivalent:
\[\ln(Y) = \beta_0 + \beta_1 \ln(X) + \varepsilon\]
\[Y = e^{\beta_0} X^{\beta_1} e^{\varepsilon}\]
Since \(\beta_0\) is a constant, \(e^{\beta_1 X}\) is a positive constant. If we call this constant \(K\), we get the following expression for the log-level model:
\[Y = K X^{\beta_1} e^{\varepsilon}\]