Lesson 3.6 - Logarithmic Transformations

Robbie Beane

Introduction

When the relationship between a predictor and the response appears to be non-linear, we can consider transforming one or both of the variables in hopes that the new variables will exhibit a linear relationship. In this lesson, we will consider two synthetic examples where \(Y\) seems to depend exponentially on \(X\). Across these examples, we will consider four types of fitted models: a linear model, an exponential model, a log-level model, and a log-log model.
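
To make this setup concrete, the following sketch simulates data with the kind of structure we have in mind: \(Y\) depends exponentially on \(X\) with multiplicative error. The seed and coefficient values are illustrative assumptions, not the process used to generate the lesson's data files.

set.seed(1)
x_sim <- runif(100, min=2, max=8)                    # predictor values
y_sim <- exp(1 + 0.6*x_sim + rnorm(100, sd=0.35))    # Y = e^(b0 + b1*x + eps)
plot(y_sim ~ x_sim, pch=21, col="black", bg="cyan")  # curvature with fanning variance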

Example 1: Load and Explore the Data

We will load the example data from a tab-separated file located at the path data/log_ex01.txt.

ex01 <- read.table("data/log_ex01.txt", sep="\t", header=TRUE)

We can see from a plot of the data that \(X\) and \(Y\) seem to exhibit a non-linear relationship. We also see that the variance in \(Y\) seems to increase with \(X\).

plot(y ~ x, ex01, pch=21, col="black", bg="cyan")

Model 1: Linear Model

We will begin by creating a fitted linear model of the form: \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)

ex01_m1 <- lm(y ~ x, ex01)
summary(ex01_m1) 
## 
## Call:
## lm(formula = y ~ x, data = ex01)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149.97  -50.98  -10.01   33.54  318.51 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -221.754     26.707  -8.303  5.7e-13 ***
## x             66.240      4.746  13.958  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.69 on 98 degrees of freedom
## Multiple R-squared:  0.6653, Adjusted R-squared:  0.6619 
## F-statistic: 194.8 on 1 and 98 DF,  p-value: < 2.2e-16

Model 1: Linear Model

We plot the fitted line on top of the scatter plot.

plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
abline(ex01_m1$coefficients, col="darkred", lwd=2)

Model 1: Linear Model

We can see strong evidence of a non-linear trend, as well as heteroskedasticity in the residual plot below.

res1 <- ex01_m1$residuals

plot(res1 ~ ex01$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

Model 1: Linear Model

We can assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.

shapiro.test(res1)
## 
##  Shapiro-Wilk normality test
## 
## data:  res1
## W = 0.92718, p-value = 3.454e-05

The Shapiro-Wilk test strongly suggests non-normality of the residuals. The histogram and QQ-plot reveal that the residuals are right-skewed.

par(mfrow=c(1,2))
hist(res1, col='orchid')
qqnorm(res1)
qqline(res1)

par(mfrow=c(1,1))

Model 2: Exponential Model

We will now create a fitted model of the form \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 e^X\).

ex01_m2 <- lm(y ~ exp(x), ex01)
summary(ex01_m2) 
## 
## Call:
## lm(formula = y ~ exp(x), data = ex01)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -173.688  -22.301   -5.473   21.888  202.380 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 32.869993   7.814467   4.206 5.75e-05 ***
## exp(x)       0.153758   0.007461  20.607  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.14 on 98 degrees of freedom
## Multiple R-squared:  0.8125, Adjusted R-squared:  0.8106 
## F-statistic: 424.7 on 1 and 98 DF,  p-value: < 2.2e-16

Model 2: Exponential Model

We plot the fitted curve on top of the scatter plot. We see that the fitted curve does seem to capture the relationship between \(X\) and \(Y\) reasonably well.

ptn <- seq(from=2, to=8, by=0.1)  # grid of x values for plotting fitted curves

plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
lines(ptn, predict(ex01_m2, data.frame(x=ptn)), col="darkred", lwd=2)

Model 2: Exponential Model

The residual plot below shows some evidence of a very slight trend, as well as severe heteroskedasticity.

res2 <- ex01_m2$residuals
plot(res2 ~ ex01$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

Model 2: Exponential Model

The figure below shows a scatter plot of the data, along with the fitted curve and a 95% prediction band. We can see that the band is too wide in some places, and too narrow in others. This is a direct consequence of the heteroskedasticity in the residuals.

plot(y ~ x, ex01, pch=21, col="black", bg="cyan")
p2 <- predict(ex01_m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p2[,"fit"], col="darkred", lwd=2)
lines(ptn, p2[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p2[,"upr"], col="darkorange", lwd=2)

Model 2: Exponential Model

We can assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.

shapiro.test(res2)
## 
##  Shapiro-Wilk normality test
## 
## data:  res2
## W = 0.89644, p-value = 9.656e-07
par(mfrow=c(1,2))
hist(res2, col='orchid')
qqnorm(res2)
qqline(res2)

par(mfrow=c(1,1))

Model 3: Log-Level Model

We now consider a fitted model of the form: \(\ln(\hat Y) = \hat{\beta}_0 + \hat{\beta}_1 X\).

We will start by plotting our new response, \(\ln(Y)\) against \(X\). This plot exhibits a strong linear relationship.

plot(log(y) ~ x, ex01, pch=21, col="black", bg="cyan")

Model 3: Log-Level Model

We will now create our model.

ex01_m3 <- lm(log(y) ~ x, ex01)
summary(ex01_m3) 
## 
## Call:
## lm(formula = log(y) ~ x, data = ex01)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81094 -0.23060  0.03466  0.28640  0.67522 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.99771    0.11466   8.702 7.91e-14 ***
## x            0.62018    0.02037  30.439  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3507 on 98 degrees of freedom
## Multiple R-squared:  0.9043, Adjusted R-squared:  0.9034 
## F-statistic: 926.5 on 1 and 98 DF,  p-value: < 2.2e-16

Model 3: Log-Level Model

The residual plot below displays no apparent trend or heteroskedasticity. Note that these are residuals of \(\ln(Y)\), and not \(Y\).

res3 <- ex01_m3$residuals
plot(res3 ~ ex01$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")

abline(h=0, col="darkred", lwd=2)

Model 3: Log-Level Model

We now assess the normality of the residuals with a Shapiro-Wilk test, a histogram, and a QQ-plot.

shapiro.test(res3)
## 
##  Shapiro-Wilk normality test
## 
## data:  res3
## W = 0.98234, p-value = 0.2015
par(mfrow=c(1,2))
hist(res3, col='orchid')
qqnorm(res3)
qqline(res3)

par(mfrow=c(1,1))

These plots suggest that the residuals were likely drawn from a normal (or perhaps very slightly left-skewed) distribution.

Model 3: Log-Level Model

We will add the fitted line and a 95% prediction band to the scatter plot of \(\ln(Y)\) against \(X\).

plot(log(y) ~ x, ex01, pch=21, col="black", bg="cyan")

p3 <- predict(ex01_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p3[,"fit"], col="darkred", lwd=2)
lines(ptn, p3[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p3[,"upr"], col="darkorange", lwd=2)

Model 3: Log-Level Model

We can exponentiate the values of \(\ln(Y)\) to transform the vertical axis back in terms of \(Y\). Notice that the fitted curve does seem to provide a good fit for the data, and the transformed prediction band seems to account for the heteroskedasticity.

plot(y ~ x, ex01, pch=21, col="black", bg="cyan")

p3 <- predict(ex01_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)

Model 4: Log-Log Model

We now consider a fitted model of the form: \(\ln(\hat Y) = \hat{\beta}_0 + \hat{\beta}_1 \ln(X)\).

We will start by plotting \(\ln(Y)\) against \(\ln(X)\). This plot exhibits a slight non-linearity.

plot(log(y) ~ log(x), ex01, pch=21, col="black", bg="cyan")

Model 4: Log-Log Model

We will now create our model.

ex01_m4 <- lm(log(y) ~ log(x), ex01)
summary(ex01_m4)
## 
## Call:
## lm(formula = log(y) ~ log(x), data = ex01)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92268 -0.30417  0.01803  0.28857  0.93462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.4137     0.1889   -2.19   0.0309 *  
## log(x)        2.9236     0.1139   25.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.408 on 98 degrees of freedom
## Multiple R-squared:  0.8705, Adjusted R-squared:  0.8692 
## F-statistic:   659 on 1 and 98 DF,  p-value: < 2.2e-16

Model 4: Log-Log Model

The previously observed non-linearity is readily apparent in the residual plot.

res4 <- ex01_m4$residuals
plot(res4 ~ ex01$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")

abline(h=0, col="darkred", lwd=2)

Model 4: Log-Log Model

We assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.

shapiro.test(res4)
## 
##  Shapiro-Wilk normality test
## 
## data:  res4
## W = 0.98982, p-value = 0.6502
par(mfrow=c(1,2))
hist(res4, col='orchid')
qqnorm(res4)
qqline(res4)

par(mfrow=c(1,1))

Comparison of Models

Of the models we have considered, Model 3 has the highest \(r^2\) value. Note, however, that since the response (\(Y\)) in the first two models is different from the response in the second pair of models (\(\ln(Y)\)), these \(r^2\) values are not directly comparable.

m1_r2 <- summary(ex01_m1)$r.squared
m2_r2 <- summary(ex01_m2)$r.squared
m3_r2 <- summary(ex01_m3)$r.squared
m4_r2 <- summary(ex01_m4)$r.squared

cbind(m1_r2, m2_r2, m3_r2, m4_r2)
##          m1_r2     m2_r2     m3_r2     m4_r2
## [1,] 0.6653207 0.8124973 0.9043473 0.8705464

Comparison of Models

If we back-transform the fitted values of the log-level and log-log models to the original scale of \(Y\), we can directly compare the SSE values of all four models. Keep in mind, however, that these values only tell part of the story. Model 2 has the lowest SSE, but as we noted, it has serious issues resulting from heteroskedasticity.

sse1 <- sum((ex01$y - ex01_m1$fitted.values)^2)
sse2 <- sum((ex01$y - ex01_m2$fitted.values)^2)
sse3 <- sum((ex01$y - exp(ex01_m3$fitted.values))^2)
sse4 <- sum((ex01$y - exp(ex01_m4$fitted.values))^2)

cbind(sse1, sse2, sse3, sse4)
##          sse1     sse2     sse3     sse4
## [1,] 653919.3 366355.6 383890.1 582747.8

We will use Model 3 going forward. It has a high \(r^2\) value, and seems to satisfy the Gauss-Markov assumptions.

Generating Predictions with the Log-Level Model

When using a log-level model to create predictions for \(Y\), we need to first generate a prediction for \(\ln(Y)\), and then exponentiate. Let’s construct a 95% prediction interval for \(Y\) when \(X = 6.5\).
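
Because the exponential function is strictly increasing, exponentiating the endpoints of a prediction interval for \(\ln(Y)\) produces an interval with the same coverage for \(Y\):

\[P\left(L \le \ln(Y) \le U\right) = 0.95 \implies P\left(e^{L} \le Y \le e^{U}\right) = 0.95\]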

ln_yhat <- predict(ex01_m3, data.frame(x=c(6.5)), interval="prediction", level=0.95)
yhat <- exp(ln_yhat)
yhat
##        fit      lwr     upr
## 1 152.7606 75.78804 307.909

Example 2: Load and Explore the Data

We will load the example data from a tab-separated file located at the path data/log_ex02.txt.

ex02 <- read.table("data/log_ex02.txt", sep="\t", header=TRUE)

We can see from a plot of the data that \(X\) and \(Y\) seem to exhibit a non-linear relationship. We also see that the variance in \(Y\) seems to increase with \(X\).

plot(y ~ x, ex02, pch=21, col="black", bg="cyan")

Model 1: Linear Model

We will begin by creating a fitted linear model of the form: \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)

ex02_m1 <- lm(y ~ x, ex02)
summary(ex02_m1) 
## 
## Call:
## lm(formula = y ~ x, data = ex02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2189.4  -671.4  -170.1   516.8  3842.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3286.97     351.15  -9.361 2.96e-15 ***
## x            1012.61      65.69  15.416  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1083 on 98 degrees of freedom
## Multiple R-squared:  0.708,  Adjusted R-squared:  0.705 
## F-statistic: 237.6 on 1 and 98 DF,  p-value: < 2.2e-16

Model 1: Linear Model

We plot the fitted line on top of the scatter plot.

plot(y ~ x, ex02, pch=21, col="black", bg="cyan")
abline(ex02_m1$coefficients, col="darkred", lwd=2)

Model 1: Linear Model

We can see strong evidence of a non-linear trend, as well as heteroskedasticity in the residual plot below.

res1 <- ex02_m1$residuals
plot(res1 ~ ex02$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")

abline(h=0, col="darkred", lwd=2)

Model 1: Linear Model

We can assess normality of the residuals using a Shapiro-Wilk test, a histogram, and a QQ-plot.

shapiro.test(res1)
## 
##  Shapiro-Wilk normality test
## 
## data:  res1
## W = 0.92902, p-value = 4.378e-05

The Shapiro-Wilk test strongly suggests non-normality of the residuals. The histogram and QQ-plot reveal that the residuals are right-skewed.

par(mfrow=c(1,2))
hist(res1, col='orchid')
qqnorm(res1)
qqline(res1)

par(mfrow=c(1,1))

Model 2: Log-Level Model

Noting the apparent heteroskedasticity in the model, we will now consider a fitted model of the form: \(\ln(\hat{Y}) = \hat{\beta}_0 + \hat{\beta}_1 X\).

We will start by plotting the response \(\ln(Y)\) against \(X\). This plot exhibits a somewhat non-linear relationship.

plot(log(y) ~ x, ex02, pch=21, col="black", bg="cyan")

Model 2: Log-Level Model

A summary of the log-level model is shown below.

ex02_m2 <- lm(log(y) ~ x, ex02)
summary(ex02_m2) 
## 
## Call:
## lm(formula = log(y) ~ x, data = ex02)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00249 -0.26117  0.03103  0.26751  0.99605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.1290     0.1465   21.36   <2e-16 ***
## x             0.7335     0.0274   26.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4517 on 98 degrees of freedom
## Multiple R-squared:  0.8797, Adjusted R-squared:  0.8785 
## F-statistic: 716.6 on 1 and 98 DF,  p-value: < 2.2e-16

Model 2: Log-Level Model

Let’s consider the residual plot. Note that these are residuals of \(\ln(Y)\), and not \(Y\).

The previously observed non-linearity is apparent in the residual plot.

res2 <- ex02_m2$residuals

plot(res2 ~ ex02$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")

abline(h=0, col="darkred", lwd=2)

Model 2: Log-Level Model

We now assess the normality of the residuals with a Shapiro-Wilk test. The large p-value gives us no reason to doubt the normality of the residuals.

shapiro.test(res2)
## 
##  Shapiro-Wilk normality test
## 
## data:  res2
## W = 0.9872, p-value = 0.4512

Model 2: Log-Level Model

The figure below shows a scatter plot of \(\ln(Y)\) against \(X\). It also displays the fitted line, along with a 95% prediction band (we reuse the grid ptn defined in Example 1).

plot(log(y) ~ x, ex02, pch=21, col="black", bg="cyan")

p2 <- predict(ex02_m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p2[,"fit"], col="darkred", lwd=2)
lines(ptn, p2[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p2[,"upr"], col="darkorange", lwd=2)

Model 2: Log-Level Model

We can exponentiate the values of \(\ln(Y)\) to transform the vertical axis back in terms of \(Y\).

plot(y ~ x, ex02, pch=21, col="black", bg="cyan")

p2 <- predict(ex02_m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p2[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p2[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p2[,"upr"]), col="darkorange", lwd=2)

Model 3: Log-Log Model

We now consider a fitted model of the form: \(\ln(\hat Y) = \hat{\beta}_0 + \hat{\beta}_1 \ln(X)\).

We start by plotting \(\ln(Y)\) against \(\ln(X)\). This plot exhibits a strong linear relationship.

plot(log(y) ~ log(x), ex02, pch=21, col="black", bg="cyan")

Model 3: Log-Log Model

A summary of the fitted model is shown below.

ex02_m3 <- lm(log(y) ~ log(x), ex02)
summary(ex02_m3) 
## 
## Call:
## lm(formula = log(y) ~ log(x), data = ex02)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96888 -0.24472 -0.01224  0.23851  0.91477 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2986     0.1855   7.001  3.2e-10 ***
## log(x)        3.5432     0.1154  30.699  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3997 on 98 degrees of freedom
## Multiple R-squared:  0.9058, Adjusted R-squared:  0.9048 
## F-statistic: 942.4 on 1 and 98 DF,  p-value: < 2.2e-16

Model 3: Log-Log Model

The residual plot below displays no apparent trend or heteroskedasticity.

res3 <- ex02_m3$residuals

plot(res3 ~ ex02$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")

abline(h=0, col="darkred", lwd=2)

Model 3: Log-Log Model

We now assess the normality of the residuals.

shapiro.test(res3)
## 
##  Shapiro-Wilk normality test
## 
## data:  res3
## W = 0.99368, p-value = 0.9253
par(mfrow=c(1,2))
hist(res3, col='orchid')
qqnorm(res3)
qqline(res3)

par(mfrow=c(1,1))

Model 3: Log-Log Model

We will add the fitted line and a 95% prediction band to the scatter plot of \(\ln(Y)\) against \(\ln(X)\).

plot(log(y) ~ log(x), ex02, pch=21, col="black", bg="cyan")

p3 <- predict(ex02_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(log(ptn), p3[,"fit"], col="darkred", lwd=2)
lines(log(ptn), p3[,"lwr"], col="darkorange", lwd=2)
lines(log(ptn), p3[,"upr"], col="darkorange", lwd=2)

Model 3: Log-Log Model

We can exponentiate the values of \(\ln(Y)\) to transform the vertical axis back in terms of \(Y\).

plot(y ~ x, ex02, pch=21, col="black", bg="cyan")

p3 <- predict(ex02_m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)

Generating Predictions with the Log-Log Model

When using a log-log model to create predictions for \(Y\), we need to first generate a prediction for \(\ln(Y)\), and then exponentiate. Let’s construct a 95% prediction interval for \(Y\) when \(X = 6.5\).

ln_yhat <- predict(ex02_m3, data.frame(x=c(6.5)), interval="prediction", level=0.95)
yhat <- exp(ln_yhat)
yhat
##        fit      lwr      upr
## 1 2781.552 1249.637 6191.424

Alternate Forms for Log-Level Model

If we solve for \(Y\) in the log-level model, we see that the following forms for the model are equivalent:

\[\ln(Y) = \beta_0 + \beta_1 X + \varepsilon\]

\[Y = e^{\beta_0} e^{\beta_1 X} e^{\varepsilon}\]

Since \(\beta_0\) is a constant, \(e^{\beta_0}\) is a positive constant. If we call this constant \(K\), we get the following expression for the log-level model:

\[Y = K e^{\beta_1 X} e^{\varepsilon}\]
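
As a quick sanity check of this equivalence, the sketch below simulates data from the multiplicative form with arbitrary, assumed values of \(K\) and \(\beta_1\), and confirms that regressing \(\ln(Y)\) on \(X\) recovers the parameters.

set.seed(42)
K <- 3; b1 <- 0.6
x_sim <- runif(200, 2, 8)
y_sim <- K * exp(b1*x_sim) * exp(rnorm(200, sd=0.3))  # Y = K e^(b1 X) e^(eps)
coef(lm(log(y_sim) ~ x_sim))  # intercept estimates ln(K) = ln(3), slope estimates b1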

Alternate Forms for Log-Log Model

If we solve for \(Y\) in the log-log model, we see that the following forms for the model are equivalent:

\[\ln(Y) = \beta_0 + \beta_1 \ln(X) + \varepsilon\]

\[Y = e^{\beta_0} X^{\beta_1} e^{\varepsilon}\]

Since \(\beta_0\) is a constant, \(e^{\beta_0}\) is a positive constant. If we call this constant \(K\), we get the following expression for the log-log model:

\[Y = K X^{\beta_1} e^{\varepsilon}\]
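
One consequence of this power form is that \(\beta_1\) can be interpreted as an elasticity:

\[\frac{d\ln(Y)}{d\ln(X)} = \beta_1\]

so a 1% increase in \(X\) is associated with approximately a \(\beta_1\)% change in \(Y\).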