When the relationship between a predictor and the response appears to be non-linear, we can consider transforming one or both of the variables in hopes that the new variables will exhibit a linear relationship. In this lesson, we will consider two synthetic examples where \(Y\) seems to depend exponentially on \(X\). We will consider the following models for these examples; each can be fit with a single call to lm(), as sketched after the list.
Linear Model: \(Y = \beta_0 + \beta_1 X + \varepsilon\).
Exponential Model: \(Y = \beta_0 + \beta_1 e^X + \varepsilon\).
Log-Level Model: \(\ln(Y) = \beta_0 + \beta_1 X + \varepsilon\).
Log-Log Model: \(\ln(Y) = \beta_0 + \beta_1 \ln(X) + \varepsilon\).
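Each of these transformations can be applied directly inside the model formula. The sketch below assumes a placeholder data frame df with numeric columns x and y (both positive, so the logarithms are defined); in the rest of the lesson we fit these models to our actual data sets.
# Sketch: one lm() call per candidate model, transforming inside the formula.
m_linear    <- lm(y ~ x, df)            # Linear:      Y = b0 + b1*X
m_exp       <- lm(y ~ exp(x), df)       # Exponential: Y = b0 + b1*e^X
m_log_level <- lm(log(y) ~ x, df)       # Log-Level:   ln(Y) = b0 + b1*X
m_log_log   <- lm(log(y) ~ log(x), df)  # Log-Log:     ln(Y) = b0 + b1*ln(X)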
We will load the example data from a tab-separated file located at the path data/log_ex01.txt.
ex01 <- read.table("data/log_ex01.txt", sep="\t", header=TRUE)
summary(ex01)
       x               y
 Min.   :2.124   Min.   :  8.333
 1st Qu.:3.795   1st Qu.: 31.661
 Median :5.436   Median : 78.320
 Mean   :5.358   Mean   :133.155
 3rd Qu.:6.944   3rd Qu.:199.158
 Max.   :7.979   Max.   :624.952
We can see from a plot of the data that \(X\) and \(Y\) seem to exhibit a non-linear relationship. We also see that the variance in \(Y\) seems to increase with \(X\).
plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
We will begin by creating a fitted linear model of the form: \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)
ex01_m1 <- lm(y ~ x, ex01)
summary(ex01_m1)
Call:
lm(formula = y ~ x, data = ex01)
Residuals:
Min 1Q Median 3Q Max
-149.97 -50.98 -10.01 33.54 318.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -221.754 26.707 -8.303 5.7e-13 ***
x 66.240 4.746 13.958 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 81.69 on 98 degrees of freedom
Multiple R-squared: 0.6653, Adjusted R-squared: 0.6619
F-statistic: 194.8 on 1 and 98 DF, p-value: < 2.2e-16
We plot the fitted line on top of the scatter plot.
plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
abline(ex01_m1$coefficients, col="darkred", lwd=2)
We can see strong evidence of a non-linear trend, as well as heteroskedasticity in the residual plot below.
res1 <- ex01_m1$residuals
plot(res1 ~ ex01$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
We can assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.
shapiro.test(res1)
Shapiro-Wilk normality test
data: res1
W = 0.92718, p-value = 3.454e-05
The Shapiro-Wilk test strongly suggests non-normality of the residuals. The histogram and QQ-plot reveal that the residuals are right-skewed.
par(mfrow=c(1,2))
hist(res1, col='orchid')
qqnorm(res1)
qqline(res1)
par(mfrow=c(1,1))
We will now create a fitted model of the form \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 e^X\).
ex01_m2 <- lm(y ~ exp(x), ex01)
summary(ex01_m2)
Call:
lm(formula = y ~ exp(x), data = ex01)
Residuals:
Min 1Q Median 3Q Max
-173.688 -22.301 -5.473 21.888 202.380
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.869993 7.814467 4.206 5.75e-05 ***
exp(x) 0.153758 0.007461 20.607 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 61.14 on 98 degrees of freedom
Multiple R-squared: 0.8125, Adjusted R-squared: 0.8106
F-statistic: 424.7 on 1 and 98 DF, p-value: < 2.2e-16
We plot the fitted curve on top of the scatter plot. We see that the fitted curve does seem to capture the relationship between \(X\) and \(Y\) reasonably well.
ptn <- seq(from=2, to=8, by=0.1)
plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
lines(ptn, predict(ex01_m2, data.frame(x=ptn)), col="darkred", lwd=2)
The residual plot below shows some evidence of a very slight trend, as well as severe heteroskedasticity.
res2 <- ex01_m2$residuals
plot(res2 ~ ex01$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
The figure below shows a scatter plot of the data, along with the fitted curve and a 95% prediction band. We can see that the band is too wide in some places and too narrow in others. This is a direct consequence of the heteroskedasticity in the residuals.
plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
p2 <- predict(ex01_m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p2[,"fit"], col="darkred", lwd=2)
lines(ptn, p2[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p2[,"upr"], col="darkorange", lwd=2)
We can assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.
shapiro.test(res2)
Shapiro-Wilk normality test
data: res2
W = 0.89644, p-value = 9.656e-07
par(mfrow=c(1,2))
hist(res2, col='orchid')
qqnorm(res2)
qqline(res2)
par(mfrow=c(1,1))
We now consider a fitted model of the form: \(\ln(\hat Y) = \hat{\beta}_0 + \hat{\beta}_1 X\).
We will start by plotting our new response, \(\ln(Y)\), against \(X\). This plot exhibits a strong linear relationship.
plot(log(y) ~ x, ex01, pch=21, col="black", bg="cyan")
We will now create our model.
ex01_m3 <- lm(log(y) ~ x, ex01)
summary(ex01_m3)
Call:
lm(formula = log(y) ~ x, data = ex01)
Residuals:
Min 1Q Median 3Q Max
-0.81094 -0.23060 0.03466 0.28640 0.67522
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.99771 0.11466 8.702 7.91e-14 ***
x 0.62018 0.02037 30.439 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3507 on 98 degrees of freedom
Multiple R-squared: 0.9043, Adjusted R-squared: 0.9034
F-statistic: 926.5 on 1 and 98 DF, p-value: < 2.2e-16
The residual plot below displays no apparent trend or heteroskedasticity. Note that these are residuals of \(\ln(Y)\), and not \(Y\).
res3 <- ex01_m3$residuals
plot(res3 ~ ex01$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
We now assess the normality of the residuals with a Shapiro-Wilk test, a histogram, and a QQ-plot.
shapiro.test(res3)
Shapiro-Wilk normality test
data: res3
W = 0.98234, p-value = 0.2015
par(mfrow=c(1,2))
hist(res3, col='orchid')
qqnorm(res3)
qqline(res3)
par(mfrow=c(1,1))
These plots suggest that the residuals were likely drawn from a normal (or perhaps very slightly left-skewed) distribution.
We will add the fitted line and a 95% prediction band to the scatter plot of \(\ln(Y)\) against \(X\).
plot(log(y) ~ x, ex01, pch=21, col="black", bg="cyan")
p3 <- predict(ex01_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p3[,"fit"], col="darkred", lwd=2)
lines(ptn, p3[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p3[,"upr"], col="darkorange", lwd=2)
We can exponentiate the values of \(\ln(Y)\) to transform the vertical axis back in terms of \(Y\). Notice that the fitted curve does seem to provide a good fit for the data, and the transformed prediction band seems to account for the heteroskedasticity.
plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
p3 <- predict(ex01_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)
We now consider a fitted model of the form: \(\ln(\hat Y) = \hat{\beta}_0 + \hat{\beta}_1 \ln(X)\).
We will start by plotting \(\ln(Y)\) against \(\ln(X)\). This plot exhibits a slight non-linearity.
plot(log(y) ~ log(x), ex01, pch=21, col="black", bg="cyan")
We will now create our model.
ex01_m4 <- lm(log(y) ~ log(x), ex01)
summary(ex01_m4)
Call:
lm(formula = log(y) ~ log(x), data = ex01)
Residuals:
Min 1Q Median 3Q Max
-0.92268 -0.30417 0.01803 0.28857 0.93462
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4137 0.1889 -2.19 0.0309 *
log(x) 2.9236 0.1139 25.67 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.408 on 98 degrees of freedom
Multiple R-squared: 0.8705, Adjusted R-squared: 0.8692
F-statistic: 659 on 1 and 98 DF, p-value: < 2.2e-16
The previously observed non-linearity is readily apparent in the residual plot.
res4 <- ex01_m4$residuals
plot(res4 ~ ex01$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
We assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.
shapiro.test(res4)
Shapiro-Wilk normality test
data: res4
W = 0.98982, p-value = 0.6502
par(mfrow=c(1,2))
hist(res4, col='orchid')
qqnorm(res4)
qqline(res4)
par(mfrow=c(1,1))
Of the models we have considered, Model 3 has the highest \(r^2\) value. Note, however, that since the response (\(Y\)) in the first two models is different from the response in the second pair of models (\(\ln(Y)\)), these \(r^2\) values are not directly comparable.
m1_r2 <- summary(ex01_m1)$r.squared
m2_r2 <- summary(ex01_m2)$r.squared
m3_r2 <- summary(ex01_m3)$r.squared
m4_r2 <- summary(ex01_m4)$r.squared
cbind(m1_r2, m2_r2, m3_r2, m4_r2)
m1_r2 m2_r2 m3_r2 m4_r2
[1,] 0.6653207 0.8124973 0.9043473 0.8705464
If we are careful about performing the appropriate transformations in the log-level and log-log models, we can directly compare the SSE scores. However, keep in mind that these values only tell part of the story. As we see, Model 2 has the lowest SSE score, but as we noted, it has serious issues resulting from heteroskedasticity.
sse1 <- sum((ex01$y - ex01_m1$fitted.values)^2)
sse2 <- sum((ex01$y - ex01_m2$fitted.values)^2)
sse3 <- sum((ex01$y - exp(ex01_m3$fitted.values))^2)
sse4 <- sum((ex01$y - exp(ex01_m4$fitted.values))^2)
cbind(sse1, sse2, sse3, sse4)
sse1 sse2 sse3 sse4
[1,] 653919.3 366355.6 383890.1 582747.8
We will use Model 3 going forward. It has a high \(r^2\) value, and seems to satisfy the Gauss-Markov assumptions.
When using a log-level model to create predictions for \(Y\), we need to first generate a prediction for \(\ln(Y)\), and then exponentiate. Because the exponential function is increasing, exponentiating the endpoints of a prediction interval for \(\ln(Y)\) yields a valid prediction interval for \(Y\). Let’s construct a 95% prediction interval for \(Y\) when \(X = 6.5\).
ln_yhat <- predict(ex01_m3, data.frame(x=c(6.5)), interval="prediction", level=0.95)
yhat <- exp(ln_yhat)
yhat
fit lwr upr
1 152.7606 75.78804 307.909
We will load the example data from a tab-separated file located at the path data/log_ex02.txt.
ex02 <- read.table("data/log_ex02.txt", sep="\t", header=TRUE)
summary(ex02)
       x               y
 Min.   :2.283   Min.   :  46.79
 1st Qu.:3.767   1st Qu.: 381.51
 Median :4.985   Median : 953.84
 Mean   :5.085   Mean   :1862.38
 3rd Qu.:6.416   3rd Qu.:2888.49
 Max.   :7.918   Max.   :7982.39
We can see from a plot of the data that \(X\) and \(Y\) seem to exhibit a non-linear relationship. We also see that the variance in \(Y\) seems to increase with \(X\).
plot(y ~ x, ex02, pch=21, col="black", bg="cyan")
We will begin by creating a fitted linear model of the form: \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)
ex02_m1 <- lm(y ~ x, ex02)
summary(ex02_m1)
Call:
lm(formula = y ~ x, data = ex02)
Residuals:
Min 1Q Median 3Q Max
-2189.4 -671.4 -170.1 516.8 3842.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3286.97 351.15 -9.361 2.96e-15 ***
x 1012.61 65.69 15.416 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1083 on 98 degrees of freedom
Multiple R-squared: 0.708, Adjusted R-squared: 0.705
F-statistic: 237.6 on 1 and 98 DF, p-value: < 2.2e-16
We plot the fitted line on top of the scatter plot.
plot(y ~ x, ex02, pch=21, col="black", bg="cyan")
abline(ex02_m1$coefficients, col="darkred", lwd=2)
We can see strong evidence of a non-linear trend, as well as heteroskedasticity in the residual plot below.
res1 <- ex02_m1$residuals
plot(res1 ~ ex02$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
We can assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.
shapiro.test(res1)
Shapiro-Wilk normality test
data: res1
W = 0.92902, p-value = 4.378e-05
The Shapiro-Wilk test strongly suggests non-normality of the residuals. The histogram and QQ-plot reveal that the residuals are right-skewed.
par(mfrow=c(1,2))
hist(res1, col='orchid')
qqnorm(res1)
qqline(res1)
par(mfrow=c(1,1))
Noting the apparent heteroskedasticity in the linear model, we will now consider a fitted model of the form: \(\ln(\hat{Y}) = \hat{\beta}_0 + \hat{\beta}_1 X\).
We will start by plotting the response \(\ln(Y)\) against \(X\). This plot exhibits a somewhat non-linear relationship.
plot(log(y) ~ x, ex02, pch=21, col="black", bg="cyan")
A summary of the log-level model is shown below.
ex02_m2 <- lm(log(y) ~ x, ex02)
summary(ex02_m2)
Call:
lm(formula = log(y) ~ x, data = ex02)
Residuals:
Min 1Q Median 3Q Max
-1.00249 -0.26117 0.03103 0.26751 0.99605
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1290 0.1465 21.36 <2e-16 ***
x 0.7335 0.0274 26.77 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4517 on 98 degrees of freedom
Multiple R-squared: 0.8797, Adjusted R-squared: 0.8785
F-statistic: 716.6 on 1 and 98 DF, p-value: < 2.2e-16
Let’s consider the residual plot. Note that these are residuals of \(\ln(Y)\), and not \(Y\). The previously observed non-linearity is apparent in this plot.
res2 <- ex02_m2$residuals
plot(res2 ~ ex02$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
We now assess the normality of the residuals with a Shapiro-Wilk test.
shapiro.test(res2)
Shapiro-Wilk normality test
data: res2
W = 0.9872, p-value = 0.4512
The figure below shows a scatter plot of the data, along with the fitted curve and a 95% prediction band. Since the model predicts \(\ln(Y)\), we exponentiate the predicted values to transform the vertical axis back in terms of \(Y\).
plot(y ~ x, ex02, pch=21, col="black", bg="cyan")
p2 <- predict(ex02_m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p2[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p2[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p2[,"upr"]), col="darkorange", lwd=2)
We now consider a fitted model of the form: \(\ln(\hat Y) = \hat{\beta}_0 + \hat{\beta}_1 \ln(X)\).
We start by plotting \(\ln(Y)\) against \(\ln(X)\). This plot exhibits a strong linear relationship.
plot(log(y) ~ log(x), ex02, pch=21, col="black", bg="cyan")
A summary of the fitted model is shown below.
ex02_m3 <- lm(log(y) ~ log(x), ex02)
summary(ex02_m3)
Call:
lm(formula = log(y) ~ log(x), data = ex02)
Residuals:
Min 1Q Median 3Q Max
-0.96888 -0.24472 -0.01224 0.23851 0.91477
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2986 0.1855 7.001 3.2e-10 ***
log(x) 3.5432 0.1154 30.699 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3997 on 98 degrees of freedom
Multiple R-squared: 0.9058, Adjusted R-squared: 0.9048
F-statistic: 942.4 on 1 and 98 DF, p-value: < 2.2e-16
The residual plot below displays no apparent trend or heteroskedasticity.
res3 <- ex02_m3$residuals
plot(res3 ~ ex02$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
We now assess the normality of the residuals.
shapiro.test(res3)
Shapiro-Wilk normality test
data: res3
W = 0.99368, p-value = 0.9253
par(mfrow=c(1,2))
hist(res3, col='orchid')
qqnorm(res3)
qqline(res3)
par(mfrow=c(1,1))
We will add the fitted line and a 95% prediction band to the scatter plot of \(\ln(Y)\) against \(\ln(X)\).
plot(log(y) ~ log(x), ex02, pch=21, col="black", bg="cyan")
p3 <- predict(ex02_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(log(ptn), p3[,"fit"], col="darkred", lwd=2)
lines(log(ptn), p3[,"lwr"], col="darkorange", lwd=2)
lines(log(ptn), p3[,"upr"], col="darkorange", lwd=2)
We can exponentiate the values of \(\ln(Y)\) to transform the vertical axis back in terms of \(Y\).
plot(y ~ x, ex02, pch=21, col="black", bg="cyan")
p3 <- predict(ex02_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)
When using a log-log model to create predictions for \(Y\), we need to first generate a prediction for \(\ln(Y)\), and then exponentiate. Let’s construct a 95% prediction interval for \(Y\) when \(X = 6.5\).
ln_yhat <- predict(ex02_m3, data.frame(x=c(6.5)), interval="prediction", level=0.95)
yhat <- exp(ln_yhat)
yhat
fit lwr upr
1 2781.552 1249.637 6191.424
If we solve for \(Y\) in the log-level model, we see that the following forms for the model are equivalent:
\[\ln(Y) = \beta_0 + \beta_1 X + \varepsilon\]
\[Y = e^{\beta_0} e^{\beta_1 X} e^{\varepsilon}\] Since \(\beta_0\) is a constant, \(e^{\beta_0}\) is a positive constant. If we call this constant \(K\), we get the following expression for the log-level model:
\[Y = K e^{\beta_1 X} e^{\varepsilon}\]
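We can verify this identity numerically with the fitted log-level model from our first example. Exponentiating the intercept estimate gives \(K\), and \(K e^{\hat{\beta}_1 x}\) reproduces the back-transformed fitted values (which correspond to setting \(\varepsilon = 0\)).
# Numerical check of the multiplicative form using ex01_m3:
# exp(b0) * exp(b1 * x) should match the exponentiated fitted values of ln(Y).
b <- coef(ex01_m3)
K <- exp(b[1])                                   # K = e^(beta0-hat)
all.equal(unname(K * exp(b[2] * ex01$x)),
          unname(exp(ex01_m3$fitted.values)))    # TRUE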
If we solve for \(Y\) in the log-log model, we see that the following forms for the model are equivalent:
\[\ln(Y) = \beta_0 + \beta_1 \ln(X) + \varepsilon\]
\[Y = e^{\beta_0} X^{\beta_1} e^{\varepsilon}\]
Since \(\beta_0\) is a constant, \(e^{\beta_0}\) is a positive constant. If we call this constant \(K\), we get the following expression for the log-log model:
\[Y = K X^{\beta_1} e^{\varepsilon}\]
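The same numerical check applies to the fitted log-log model from our second example, where the fitted curve is \(K x^{\hat{\beta}_1}\) with \(K = e^{\hat{\beta}_0}\).
# Numerical check of the multiplicative form using ex02_m3:
# exp(b0) * x^b1 should match the exponentiated fitted values of ln(Y).
b <- coef(ex02_m3)
K <- exp(b[1])
all.equal(unname(K * ex02$x^b[2]),
          unname(exp(ex02_m3$fitted.values)))    # TRUE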