If your model does not conform the LINE assupmtions (linearity, independence, normality, and equal variances), try the following:

Check for omitted predictor variables.
If the mean of the response is not a linear function of the predictors, transform the response and/or predictor variables.
If there are unequal error variances, transform the response and/or predictor variables, or use weighted least squares regression.
If there are one or more outliers, try a robust estimation procedure.
If error terms are not independent, fit a time series model.

This lesson focuses on transforming the response and predictor variables to correct non-linearity and/or unequal variances. In general, transforming the response variable corrects problems with the error terms (and may help the non-linearity), and transforming the predictor variables corrects non-linearity.

Transforming Predictor Variables

Transform predictor variables only if non-linearity is the only problem (independence, normality, and equal variance conditions are met).

Example

A study measured memory recall at various durations. Data set word (wordrecall.txt is a summary of n = 13 time periods:

y = percent of correct recall
time = time between learning and recall, in minutes

library(readr)
word <- read_tsv(file = url("https://newonlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/wordrecall/index.txt"))

print(word)

## # A tibble: 13 x 2
##     time   prop
##    <int>  <dbl>
##  1     1 0.840 
##  2     5 0.710 
##  3    15 0.610 
##  4    30 0.560 
##  5    60 0.540 
##  6   120 0.470 
##  7   240 0.450 
##  8   480 0.380 
##  9   720 0.360 
## 10  1440 0.260 
## 11  2880 0.200 
## 12  5760 0.160 
## 13 10080 0.0800

Formulate a regression model as prop ~ time. The relationship does not appear to be linear.

library(ggplot2)

ggplot(word, aes(y = prop, x = time)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("prop ~ time")

Estimate the model, and evaluate the LINE conditions.

word_lm <- lm(prop ~ time, data = word)

library(broom)  # for augment()
word_aug <- augment(word_lm)

The observations are independent because the subjects are chosen randomly. The residuals vs fits plot confirms the independence with correlation \(\rho \sim 0\) (-0.1366572). The residuals vs fits plot does not confirm the linearity condition however, because the residuals do not vary randomly around zero. Until we address the linearity violation, we cannot check the equal variances condition.

ggplot(data = word_aug, aes(y = .std.resid, x = .fitted)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2) +
  geom_hline(yintercept = 2) +
  ggtitle("Standardized Residuals vs Fits")

The last LINE condition to check is normality. The residuals normal probability plot is a plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles. It should be approximately linear. It is.

qqnorm(word_aug$.resid)
qqline(word_aug$.resid)

For a more quantitative analysis, use the Anderson-Darling normality test. The test p-value is the probability of calculating the test statistic if the distribution is normal. The p-value is .6426, so do not reject \(H_0\).

library(nortest)
ad.test(word_aug$.resid)

## 
##  Anderson-Darling normality test
## 
## data:  word_aug$.resid
## A = 0.26197, p-value = 0.6426

In summary, the only problem with the model is the violation of the linearity condition. This is a situation where transforming a predictor variable is appropriate.

The most common predictor variable transformation is the log transformation. \(log(x)\) has the effect of spreading changes in small values of \(x\) and compressing changes in large values of \(x\).

library(dplyr)
data.frame(x = 1:1000,
   y = log(1:1000)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_line() +
  ggtitle("Natural Log") +
  ylab("ln(x)")

Reformulate the regression model as prop ~ ln(time). The relationship now appears to be linear.

ggplot(word, aes(y = prop, x = log(time))) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("prop ~ ln(time)")

Estimate the model, and evaluate the LINE conditions.

word_lm <- lm(prop ~ log(time), data = word)

word_aug <- augment(word_lm)

The observations are independent because the subjects are chosen randomly. The residuals vs fits plot confirms the independence with correlation \(\rho \sim 0\) (0.0064634). The residuals vs fits plot confirms the linearity condition because the residuals vary randomly around zero. The residuals vs fits plot also confirms the equal variances condition because the residuals vary with a constant width \(\epsilon = 0\), with no fan shape at the low and high ends.

ggplot(data = word_aug, aes(y = .std.resid, x = .fitted)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2) +
  geom_hline(yintercept = 2) +
  ggtitle("Standardized Residuals vs Fits")

The last LINE condition to check is normality. The residuals normal probability plot should be approximately linear, and it is.

qqnorm(word_aug$.resid)
qqline(word_aug$.resid)

The Anderson-Darling normality test p-value is .4869, so do not reject the assumption of normality.

ad.test(word_aug$.resid)

## 
##  Anderson-Darling normality test
## 
## data:  word_aug$.resid
## A = 0.32157, p-value = 0.4869

Now that you have a good model for the data, use the model to answer the research questions.

Research Question 1. What is the nature of the association between time since memorized and the effectiveness of recall?

Answer. The proportion of correctly recalled words is negatively linearly related to the natural log of the time since the words were memorized. I.e., as the natural log of time increases, the proportion of recalled words decreases.

Research Question 2. Is there an association between time since memorized and effectiveness of recall?

Answer. Test the null hypothesis \(H_0: \beta_1 = 0\) using either the F-test or the t-test. The F-statistic for the model is 1076 (p < 0.001), and the t value for log(time) is -32.80 (p < .001). There is significant evidence at the 0.05 level to conclude that there is a linear association between the proportion of words recalled and the natural log of the time since memorized.

summary(word_lm)

## 
## Call:
## lm(formula = prop ~ log(time), data = word)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.036077 -0.015330 -0.006415  0.017967  0.037799 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.846415   0.014195   59.63 3.65e-15 ***
## log(time)   -0.079227   0.002416  -32.80 2.53e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02339 on 11 degrees of freedom
## Multiple R-squared:  0.9899, Adjusted R-squared:  0.989 
## F-statistic:  1076 on 1 and 11 DF,  p-value: 2.525e-12

Research Question 3. What is the expected proportion of words a randomly selected person would recall after 1000 minutes?

Answer. Use the predict.lm function. Notice that the model knows to transform time, so you do not need to explicitly tranform it. We can be 95% confident that after 1000 minutes a randomly selected person will recall between 24.5% and 35.3% of the words.

x_new <- data.frame(time = 1000)
predict.lm(object = word_lm, 
           newdata = data.frame(x_new), 
           interval = "prediction")

##         fit       lwr       upr
## 1 0.2991353 0.2449729 0.3532978

Research Question 4. How much does the expected recall change if time increases ten-fold?

Answer. A 10-fold increase in the predictor time is associated with a \(\beta_1 ln(10) = -0.079227 \times ln(10) = -0.182\) change in the mean of the response prop. Be careful to only consider changes in a predictor variable that make sense within the range of the predictor variable values. The 95% confidence interval for \(\beta_1\) is \(-0.079227 \pm 2.201(0.002416) = (-0.085, -0.074)\). Multiply these values by ln(10) for \((-0.170, -0.195)\). We can be 95% confident that the percentage of recalled words will decrease between 17.0% and 19.5% for each ten-fold increase in the time since memorization took place.

Transforming the Response Variable

Transform the response variable only if non-linearity and/or unequal variances are the only problem (independence and normality conditions are met).

Example

A study measured birthweight and length of gestation for various mammals. Data set mammgest (mammgest.txt is a summary of n = 11 mammals:

Gestation = length of gestation, in days
Birthwgt = birthweight, in kg

mammgest <- read_tsv(file = "./Data/mammgest.txt")

print(mammgest)

## # A tibble: 11 x 3
##    Mammal    Birthwgt Gestation
##    <chr>        <dbl>     <int>
##  1 Goat         2.75        155
##  2 Sheep        4.00        175
##  3 Deer         0.480       190
##  4 Porcupine    1.50        210
##  5 Bear         0.370       213
##  6 Hippo       50.0         243
##  7 Horse       30.0         340
##  8 Camel       40.0         380
##  9 Zebra       40.0         390
## 10 Giraffe     98.0         457
## 11 Elephant   113.          670

Formulate a regression model as Gestation ~ Birthwgt. The relationship does appears to be linear, but with unequal variances.

ggplot(mammgest, aes(y = Gestation, x = Birthwgt)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Gestation ~ Birthwgt")

Estimate the model, and evaluate the LINE conditions.

mammgest_lm <- lm(Gestation ~ Birthwgt, data = mammgest)

mammgest_aug <- augment(mammgest_lm)

The observations are independent because the subjects are chosen randomly. The residuals vs fits plot confirms the independence with correlation \(\rho \sim 0\) (0.0371882). The residuals vs fits plot confirms the linearity condition because the residuals vary randomly around zero. However, the residuals vs fits plot does not confirm the equal variances condition because the residuals fan out.

ggplot(data = mammgest_aug, aes(y = .std.resid, x = .fitted)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2) +
  geom_hline(yintercept = 2) +
  ggtitle("Standardized Residuals vs Fits")

The last LINE condition to check is normality. The residuals normal probability plot should be approximately linear. It is.

qqnorm(mammgest_aug$.resid)
qqline(mammgest_aug$.resid)

For a more quantitative analysis, use the Anderson-Darling normality test. The test p-value is .503, so do not reject \(H_0\).

ad.test(mammgest_aug$.resid)

## 
##  Anderson-Darling normality test
## 
## data:  mammgest_aug$.resid
## A = 0.31159, p-value = 0.503

In summary, the only problem with the model is the violation of the equal variances condition. This is a situation where transforming the response variable is appropriate.

One possible response variable transformation is the log transformation. Reformulate the regression model as ln(Gestation) ~ Birthwgt. The relationship still appears to be linear.

ggplot(mammgest, aes(y = log(Gestation), x = Birthwgt)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("ln(Gestation) ~ Birthwgt")

Estimate the model, and evaluate the LINE conditions.

mammgest_lm <- lm(log(Gestation) ~ Birthwgt, data = mammgest)

mammgest_aug <- augment(mammgest_lm)

The observations are independent because the subjects are chosen randomly. The residuals vs fits plot confirms the independence with correlation \(\rho \sim 0\) (-0.0041985). The residuals vs fits plot confirms the linearity condition because the residuals vary around zero. The residuals vs fits plot also confirms the equal variances condition because the residuals vary with a constant width, with no fan shape at the low and high ends.

ggplot(data = mammgest_aug, aes(y = .std.resid, x = .fitted)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2) +
  geom_hline(yintercept = 2) +
  ggtitle("Standardized Residuals vs Fits")

The last LINE condition to check is normality. The residuals normal probability plot should be approximately linear, and it is.

qqnorm(mammgest_aug$.resid)
qqline(mammgest_aug$.resid)

The Anderson-Darling normality test p-value is .4963, so do not reject the assumption of normality.

ad.test(mammgest_aug$.resid)

## 
##  Anderson-Darling normality test
## 
## data:  mammgest_aug$.resid
## A = 0.31355, p-value = 0.4963

Now that you have a good model for the data, use the model to answer the research questions.

Research Question 1. What is the nature of the association between mammalian birth weight and length of gestation?

Answer. The natural logarithm of the length of gestation is positively linearly related to birthweight. Equivalently, the percentage change in gestation is postively linearly related to the unit change in birthweight.

Research Question 2. Is there an association between mammalian birth weight and length of gestation?

Answer. Test the null hypothesis \(H_0: \beta_1 = 0\) using either the F-test or the t-test. The F-statistic for the model is 36.75 (p < 0.001), and the t value for Birthwgt is 6.062 (p < .001). There is significant evidence at the 0.05 level to conclude that there is a linear association between the natural log of the mammalian gestation time and birthweight.

summary(mammgest_lm)

## 
## Call:
## lm(formula = log(Gestation) ~ Birthwgt, data = mammgest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3063 -0.1650  0.0521  0.1582  0.2709 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.278817   0.088177  59.866  5.1e-13 ***
## Birthwgt    0.010410   0.001717   6.062 0.000188 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2163 on 9 degrees of freedom
## Multiple R-squared:  0.8033, Adjusted R-squared:  0.7814 
## F-statistic: 36.75 on 1 and 9 DF,  p-value: 0.0001878

Research Question 3. What is the expected gestation length of a 50 kg mammal?

Answer. Use the predict.lm function. We can be 95% confident that a 50kg mammal will gestate between 197 and 552 days with a point estimate of 330 days. Notice predict.lm() is wrapped in the exp() to convert the log back ito regular units.

x_new <- data.frame(Birthwgt = 50)
exp(predict.lm(object = mammgest_lm, 
           newdata = data.frame(x_new), 
           interval = "prediction"))

##        fit      lwr      upr
## 1 330.0781 197.3013 552.2092

Research Question 4. What is the expected change in length of gestation for each one pound increase in birth weight?

Answer. A one pound increase in the predictor Birthwgt is associated with a \(e^{\beta_1 \times 1} = e^{0.010410} = 1.014\) factor change in the mean of the gestation length. The change is a factor change, not an absolute change. Be careful to only consider changes in a predictor variable that make sense within the range of the predictor variable values. The 95% confidence interval for \(\beta_1\) is \(0.010410 \pm 2.226(0.001717) = (0.007, 0.014)\). Take the exponent of these values for \((1.007, 1.014)\). We can be 95% confident that the gestation time of a mammal increase between 0.7% and 1.4% for each pound increase in the birthweight.

Tranforming Predictor and Response Variables

Transform both the predictor variable and the response variable if the regression function is not linear and the error terms are not normal and have unequal variances.

Example

A study measured tree volume and diameter of trees. Data set shortleaf (shortleaf.txt is a summary of n = 70 shortleaf pines:

Diam = diameter of tree, in inches
Vol = volume of tree, in cubic feet

shortleaf <- read_tsv(file = url("https://newonlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/shortleaf/index.txt"))

head(shortleaf)

## # A tibble: 6 x 2
##    Diam   Vol
##   <dbl> <dbl>
## 1  4.40  2.00
## 2  4.60  2.20
## 3  5.00  3.00
## 4  5.10  4.30
## 5  5.10  3.00
## 6  5.20  2.90

Formulate a regression model as Vol ~ Diam. The relationship does not appear to be linear.

ggplot(shortleaf, aes(y = Vol, x = Diam)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Vol ~ Diam")

Estimate the model, and evaluate the LINE conditions.

shortleaf_lm <- lm(Vol ~ Diam, data = shortleaf)

shortleaf_aug <- augment(shortleaf_lm)

The observations are independent because the subjects are chosen randomly. The residuals vs fits plot confirms the independence with correlation \(\rho \sim 0\) (0.012313). The residuals vs fits plot further shows that the linearity condition does not hold because the residuals do not vary randomly around zero.

ggplot(data = shortleaf_aug, aes(y = .std.resid, x = .fitted)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2) +
  geom_hline(yintercept = 2) +
  ggtitle("Standardized Residuals vs Fits")

The last LINE condition to check is normality. The residuals normal probability plot should be approximately linear. It is not. One outlier point may be the problem.

qqnorm(shortleaf_aug$.resid)
qqline(shortleaf_aug$.resid)

For a more quantitative analysis, use the Anderson-Darling normality test. The test p-value is .012, so reject \(H_0\).

ad.test(shortleaf_aug$.resid)

## 
##  Anderson-Darling normality test
## 
## data:  shortleaf_aug$.resid
## A = 0.99132, p-value = 0.01215

In summary, the relationship between tree diameter and volume is not linear, and the error terms are not normally distributed. This is a situation where transforming the response variable is appropriate.

Start by addressing the nonlinearity with a log transformation of Diam. Reformulate the regression model as Vol ~ ln(Diam). The relationship still does not appear to be linear.

ggplot(shortleaf, aes(y = Vol, x = log(Diam))) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Vol ~ ln(Diam)")

Try also log transformationing Vol. Reformulate the regression model as ln(Vol) ~ ln(Diam). The relationship now appears to be linear.

ggplot(shortleaf, aes(y = log(Vol), x = log(Diam))) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("ln(Vol) ~ ln(Diam)")

Estimate the model, and evaluate the LINE conditions.

shortleaf_lm <- lm(log(Vol) ~ log(Diam), data = shortleaf)

shortleaf_aug <- augment(shortleaf_lm)

The observations are independent because the subjects are chosen randomly. The residuals vs fits plot confirms the independence with correlation \(\rho \sim 0\) (0.0029023). The residuals vs fits plot confirms the linearity condition because the residuals vary randomly around zero. The residuals vs fits plot also confirms the equal variances condition because the residuals vary with a constant width, with no fan shape at the low and high ends.

ggplot(data = shortleaf_aug, aes(y = .std.resid, x = .fitted)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2) +
  geom_hline(yintercept = 2) +
  ggtitle("Standardized Residuals vs Fits")

The last LINE condition to check is normality. The residuals normal probability plot should be approximately linear, and it is.

qqnorm(shortleaf_aug$.resid)
qqline(shortleaf_aug$.resid)

The Anderson-Darling normality test p-value is .1692, so do not reject the assumption of normality.

ad.test(shortleaf_aug$.resid)

## 
##  Anderson-Darling normality test
## 
## data:  shortleaf_aug$.resid
## A = 0.53086, p-value = 0.1692

In summary, the model ln(Vol) ~ ln(Diam) with the natural log of tree volume as the response and the natural log of tree diameter as the predictor works well. The relationship is linear and the error terms are independent and normally distributed with equal variances. Now that you have a good model, use it to answer the research questions.

Research Question 1. What is the nature of the association between diameter and volume of shortleaf pines?

Answer. The log of tree volume is positively linearly related to the log of tree diamater. Equivalently, the percentage change in tree volume is linearly related to teh percentage change in tree diameter.

Research Question 2. Is there an association between diameter and volume of shortleaf pines?

Answer. Test the null hypothesis \(H_0: \beta_1 = 0\) using either the F-test or the t-test. The F-statistic for the model is 2509 (p < 0.001), and the t value for Birthwgt is 50.09 (p < .001). There is significant evidence at the 0.01 level to conclude that there is a linear association between the natural logarithm of tree volume and the natural logarithm of tree diameter.

summary(shortleaf_lm)

## 
## Call:
## lm(formula = log(Vol) ~ log(Diam), data = shortleaf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3323 -0.1131  0.0267  0.1177  0.4280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.8718     0.1216  -23.63   <2e-16 ***
## log(Diam)     2.5644     0.0512   50.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1703 on 68 degrees of freedom
## Multiple R-squared:  0.9736, Adjusted R-squared:  0.9732 
## F-statistic:  2509 on 1 and 68 DF,  p-value: < 2.2e-16

Research Question 3. What is the “average” volume of all shortleaf pine trees that are 10" in diameter?

Answer. Use the predict.lm function. We can be 95% confident that a 10" diameter shortleaf pine tree will have a volume between 19.9 and 21.6 cf with a point estimate of 20.8 cf. Notice predict.lm() is wrapped in the exp() to convert the log back ito regular units.

x_new <- data.frame(Diam = 10)
exp(predict.lm(object = shortleaf_lm, 
           newdata = data.frame(x_new), 
           interval = "confidence"))

##        fit      lwr      upr
## 1 20.75934 19.92952 21.62372

Research Question 4. What is expected change in volume for a two-fold increase in diameter?

Answer. A two-fold increase in the predictor Diam is associated with a \(2e^{\beta_1 \times 1} = 2e^{2.5644} = 26.0\) factor change in the tree volume. The change is a factor change, not an absolute change. Be careful to only consider changes in a predictor variable that make sense within the range of the predictor variable values. The 95% confidence interval for \(\beta_1\) is \(2.5644 \pm 1.9955(0.0512) = (2.46, 2.67)\). Take the exponent of these values for \((5.50, 6.36)\). We can be 95% confident that the median volume will increase by a factor between 5.50 and 6.36 for each two-fold increase in diameter.

Choosing Transformations

If the primary problem with your model is non-linearity, look at a scatter plot of the residuals to suggest transformations that might help. Try transformations of the x-variable(s) and/or y-variable such as square root, logarithmic, or reciprocal transformations. One of these will often end up working out. If you transform the y-variable, you will change the variance of the y-variable and the errors. You may wish to try transformations of the y-variable (e.g., \(ln(y)\), \(\sqrt(y)\), \(y^{-1}\)) when there is evidence of nonnormality and/or nonconstant variance problems in one or more residual plots. Try transformations of the x-variable(s) (e.g., \(x^{-1}\), \(x^2\), \(ln(x)\)) when there are strong nonlinear trends in one or more residual plots. However, these guidelines don’t always work.

Quantitative Interactions

Typically, regression models that include interactions between quantitative predictors \(X_1 X_2\), \(X_1 X_2\) also include the main effects terms \(X1\) and \(X2\) even if the coefficients are significant.

References

PSU STAT501.

Multiple Linear Regression: Variable Transformations

Tranforming variables to conform to LINE assumptions

Michael Foley

2019-05-13

Transforming Predictor Variables

Example

Transforming the Response Variable

Example

Tranforming Predictor and Response Variables

Example

Choosing Transformations

Quantitative Interactions

References