KITADA

Lesson #19

Simple Linear Regression Analysis Involving a Transformation

Motivation:

Conclusions from a simple linear regression analysis are valid to a population of interest only if the conditions listed on the tan sheet (Steps for doing a Simple Linear Regression Analysis) are met. However, if the last two conditions (constant variation and normality conditions) aren't perfectly met, conclusions can still be valid to the population of interest. But, if one or both of those conditions are far from being met, then any conclusions from the analysis may not be valid to a population of interest. When one or more of the linearity, constant variation, and/or normality conditions aren't met, a strategy to use is to attempt a transformation to see if it helps satisfy the conditions better. The example in this lesson will show how to implement such a strategy.

What you need to know from this lesson:

After completing this lesson, you should be able to

recognize when conditions for performing a Simple Linear Regression Analysis are met and when they aren't met
perform a logarithmic transformation on one or both of the variables
explain the steps that need to be taken if a transformation is performed
identify which model (original data with no transformations or transformed data) best satisfies the conditions for performing a Simple Linear Regression analysis
interpret the results of a Simple Linear Regression analysis on a model involving transformed data

To accomplish the above “What You Need to Know”, do the following:

1. Attend lecture and answer the questions on the following pages of this lesson.
2. There is no section in the text that has examples of an analysis involving a transformation.
3. Do the Lesson 19 questions at the end of the lesson notes

The Lesson

The “Animal Gestation” example:

Do mammals with longer average life expectancies have longer gestation periods? The data on the next page are the average gestation period (in days) and average life expectancy in years for a sample of mammals. Perform an appropriate analysis to answer the question of interest (and other related questions).

1. What is the response and what is the explanatory variable? Are they quantitative or categorical?

Response: Gestation period (days) –> quantitative
Explanatory: Life expectancy (years) –> quantitative

2. What is the appropriate analysis to answer the question of interest?

We want to create a linear model to describe the relationship. Specifically, we want to assess whether or not there is a positive relationship between life expectancy and gestation period.

GESTATION[,1:3]

##          animal gestation longevity
## 1        baboon       187        20
## 2    black bear       219        18
## 3  grizzly bear       225        25
## 4    polar bear       240        20
## 5        beaver       122         5
## 6       buffalo       278        15
## 7         camel       406        12
## 8    chimpanzee       231        20
## 9           cat        63        12
## 10     chipmunk        31         6
## 11          cow       284        15
## 12         deer       201         8
## 13          dog        61        12
## 14       donkey       365        12
## 15     elephant       645        40
## 16          elk       250        15
## 17          fox        52         7
## 18      giraffe       425        10
## 19         goat       151         8
## 20      gorilla       257        20
## 21   guinea pig        68         4
## 22 hippopotamus       238        25
## 23        horse       330        20
## 24     kangaroo        42         7
## 25      leopard        98        12
## 26         lion       100        15
## 27       monkey       164        15
## 28        moose       240        12
## 29        mouse        21         3
## 30      opossum        15         1
## 31          pig       112        10
## 32         puma        90        12
## 33       rabbit        31         5
## 34   rhinoceros       450        15
## 35     sea lion       350        12
## 36        sheep       154        12
## 37     squirrel        44        10
## 38        tiger       105        16
## 39         wolf        63         5
## 40        zebra       365        15

Performing the analysis:

Step 1: Assess the conditions for conclusions from the analysis to be valid to the population of all mammals

### A NEW EXAMPLE: GESTATION ###
gest_mod <- with(GESTATION, lm(gestation~longevity))

## SCATTER PLOT ##
with(GESTATION, 
     plot(longevity, gestation, pch=16, main = "Animal Gestation",
          ylab="Average Gestation (days)", 
          xlab="Average Life Expectancy (years)")
)
abline(coefficients(gest_mod), 
       lwd=3, col="red", lty=2)

## RESIDUAL PLOT ##
plot(GESTATION$longevity, resid(gest_mod), pch=16, 
     xlab="Average Life Expectancy (years)",
     ylab="Residual",
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

## QQ PLOT ##
qqnorm(resid(gest_mod))
qqline(resid(gest_mod))

3. Is the sample representative of the population?

Comment on whether or not the sample taken is representative of the population of all mammals.

Only 40 animals are represented; however, the list does appear to be very diverse.

4. Linearity condition:

Based on the scatterplot and residual plot, is the relationship between average life expectancy and average gestation period linear?

It looks like there is a curved relationship.

5. Outlier condition (or other deviations):

Are there any potentially influential outliers? How might the outlier be influencing the analysis?

There appears to be one outlier around x=40. This might be influential in making the slope of the line less than it should be.

Identify the outlier. Is there a “non-statistical” reason to remove the outlier from the analysis? If not, what should be done?

There doesn't appear to be any “non-statistical” reason to remove the elephant from the data set.

The outlying observation is left in the analysis that follows. A separate analysis without the outlier should be performed as well but is not done so here. The procedure for doing the analysis without the outlier would be the same as what follows.
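The separate analysis without the outlier could be sketched as follows. This is only an illustration: the data frame name `gest_tbl` is a hypothetical stand-in for the lesson's `GESTATION` object, rebuilt here from the table printed above so the block runs on its own.

```r
# Data transcribed from the table printed earlier in this lesson
# (gest_tbl is a hypothetical name; the lesson's own object is GESTATION).
gest_tbl <- data.frame(
  gestation = c(187, 219, 225, 240, 122, 278, 406, 231, 63, 31,
                284, 201, 61, 365, 645, 250, 52, 425, 151, 257,
                68, 238, 330, 42, 98, 100, 164, 240, 21, 15,
                112, 90, 31, 450, 350, 154, 44, 105, 63, 365),
  longevity = c(20, 18, 25, 20, 5, 15, 12, 20, 12, 6,
                15, 8, 12, 12, 40, 15, 7, 10, 8, 20,
                4, 25, 20, 7, 12, 15, 15, 12, 3, 1,
                10, 12, 5, 15, 12, 12, 10, 16, 5, 15)
)

# The elephant is the only animal with a life expectancy of 40 years
no_elephant <- subset(gest_tbl, longevity < 40)

# Same model, fitted with and without the outlying observation
fit_all  <- lm(gestation ~ longevity, data = gest_tbl)
fit_drop <- lm(gestation ~ longevity, data = no_elephant)

# Intercepts and slopes side by side, to see how much one point moves the line
rbind(with_elephant = coef(fit_all), without_elephant = coef(fit_drop))
```

If the two rows differ noticeably, the elephant is influential and both sets of results should be reported.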

Are there any other deviations from the pattern (i.e. clusters of points)?

Not really

6. Independent observations:

Is it safe to say that the gestation period for one type of mammal is not being influenced by the gestation period of any other type of mammal?

Yes, they are independent.

7. Constant variation condition:

Based on the residual plot and scatterplot, do you feel that the spread in gestation periods is the same for all life expectancies? (Another way to ask this question: do you feel that the residuals have constant variation for all life expectancies?) Why or why not?

The residuals are “fan-shaped”, which suggests non-constant variation.

8. Normality condition:

Based on the normal probability plot of the residuals, do you feel that the residuals are normally distributed (and, therefore, the values of gestation period are normally distributed for each value of life expectancy)?

There are strong deviations from normality in the upper tail.

Step 2: Decide if a transformation is necessary.

If no transformation is necessary, move to Step 3.

If a transformation is necessary, do so and re-check conditions 2, 5, and 6 in Step 1 on the transformed data. If the conditions are satisfied after performing a transformation, move on to Step 3 using the transformed data. If the conditions are still not satisfied, perform another transformation and re-check conditions 2, 5, and 6 in Step 1 on the newly transformed data. Repeat until a model can be chosen that best satisfies the conditions. Use that model in Step 3.
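Because the same three plots are redrawn for every candidate model, the re-checking loop can be wrapped in a small helper. The sketch below is one possible way to do it, not the lesson's own code: `check_conditions()` and `gest_tbl` are hypothetical names, and the data are transcribed from the table printed earlier so the block is self-contained.

```r
# Data transcribed from the table printed earlier (gest_tbl is a hypothetical name)
gest_tbl <- data.frame(
  gestation = c(187, 219, 225, 240, 122, 278, 406, 231, 63, 31,
                284, 201, 61, 365, 645, 250, 52, 425, 151, 257,
                68, 238, 330, 42, 98, 100, 164, 240, 21, 15,
                112, 90, 31, 450, 350, 154, 44, 105, 63, 365),
  longevity = c(20, 18, 25, 20, 5, 15, 12, 20, 12, 6,
                15, 8, 12, 12, 40, 15, 7, 10, 8, 20,
                4, 25, 20, 7, 12, 15, 15, 12, 3, 1,
                10, 12, 5, 15, 12, 12, 10, 16, 5, 15)
)

# Hypothetical helper: scatterplot, residual plot, and QQ plot in one row
check_conditions <- function(model, x, x_label, y, y_label) {
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))
  plot(x, y, pch = 16, xlab = x_label, ylab = y_label, main = "Scatterplot")
  abline(reg = model, lwd = 2, lty = 2, col = "red")
  plot(x, resid(model), pch = 16, xlab = x_label, ylab = "Residual",
       main = "Residual Plot")
  abline(h = 0, lwd = 2, lty = 2, col = "blue")
  qqnorm(resid(model))
  qqline(resid(model))
  invisible(model)
}

# The three candidate models considered in this lesson
m_raw    <- lm(gestation ~ longevity, data = gest_tbl)
m_logy   <- lm(log(gestation) ~ longevity, data = gest_tbl)
m_loglog <- lm(log(gestation) ~ log(longevity), data = gest_tbl)

with(gest_tbl, {
  check_conditions(m_raw, longevity, "Life expectancy (years)",
                   gestation, "Gestation (days)")
  check_conditions(m_logy, longevity, "Life expectancy (years)",
                   log(gestation), "LN(gestation)")
  check_conditions(m_loglog, log(longevity), "LN(life expectancy)",
                   log(gestation), "LN(gestation)")
})
```

Each call produces one row of plots, so the three models can be compared condition by condition.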

9. Is a transformation necessary in this problem? Why or why not?

Yes, since there is a curved relationship we should consider using transformations.

A natural log transformation of the response variable was attempted first. Here are the plots for the analysis of ln(gestation) vs. life expectancy.

### GESTATION: LN OF RESPONSE ONLY ###
lngest_mod <- with(GESTATION, lm(log.gestation.~longevity))

## SCATTER PLOT ##
with(GESTATION, 
     plot(longevity, log.gestation., pch=16, main = "Animal Gestation",
          ylab="LN(Average Gestation) (days)", 
          xlab="Average Life Expectancy (years)")
)
abline(coefficients(lngest_mod), 
       lwd=3, col="red", lty=2)

## RESIDUAL PLOT ##
plot(GESTATION$longevity, resid(lngest_mod), pch=16, 
     xlab="Average Life Expectancy (years)",
     ylab="Residual",
     main="Residual Plot with LN MOD")

abline(h=0, lwd=2, lty=2, 
       col="blue")

## QQ PLOT ##
qqnorm(resid(lngest_mod))
qqline(resid(lngest_mod))

Step 1 revisited: Reassess the linearity, constant variation, and normality conditions on the transformed data

Condition 2: Linearity condition:

10. Is the relationship between life expectancy and ln(gestation period) linear? Is it “more linear” or “less linear” compared to the original data?

It looks more linear than the original data; however, there still appears to be some curvature.

Condition 5: constant variation condition:

11. Is the constant variation condition better met on the transformed data compared to the original data?

No, the variation is not constant.

Condition 6: normality condition:

12. Is the normality condition better met on the transformed data compared to the original data?

Again, there are deviations from normality; however, they are now considerable in both tails.

Step 2 revisited: Decide if further transformations are necessary.

13. Do you feel that another transformation should be performed to try to get the conditions “better satisfied”? If so, what type of transformation should be attempted next?

Yes, we should try another transformation in the x direction.

If it is decided to try another transformation, transforming both the response and explanatory variable should be attempted. (Since the constant variation condition and normality condition were not met on the original data, transforming only the explanatory variable would not help. Therefore, we’ll move right to transforming both.) Once again, we’ll need to revisit Step 1 after transforming both variables. Below are the plots after performing a natural log transformation on both gestation and life expectancy.

### GESTATION: LN OF BOTH VARIABLES ###
lnlngest_mod <- with(GESTATION, lm(log.gestation.~log.longevity.))

## SCATTER PLOT ##
with(GESTATION, 
     plot(log.longevity., log.gestation., pch=16, main = "Animal Gestation",
          ylab="LN(Average Gestation) (days)", 
          xlab="LN(Average Life Expectancy) (years)")
)
abline(coefficients(lnlngest_mod), 
       lwd=3, col="red", lty=2)

## RESIDUAL PLOT ##
plot(GESTATION$log.longevity., resid(lnlngest_mod), pch=16, 
     xlab="LN(Average Life Expectancy) (years)",
     ylab="Residual",
     main="Residual Plot with LN MOD")

abline(h=0, lwd=2, lty=2, 
       col="blue")

## QQ PLOT ##
qqnorm(resid(lnlngest_mod))
qqline(resid(lnlngest_mod))

Step 1 revisited: Reassess the linearity, constant variation, and normality conditions when both variables are transformed

Condition 2: Linearity condition:

14. Is the relationship between ln(life expectancy) and ln(gestation period) linear? Is it “more linear” or “less linear” compared to the original data? How about compared to the model when only gestation was transformed?

Yes, it looks much more linear.

Condition 5: constant variation condition:

15. Is the constant variation condition better met when both variables are transformed compared to when only gestation was transformed?

Roughly constant variation (if the outliers are excluded)

Condition 6: normality condition:

16. The normality conditions seemed satisfied when only gestation was transformed. Is the normality condition still met when both variables are transformed?

Yes, looks pretty good.

Step 2 revisited: Decide which model is the best.

We’ve now tried all of the transformations considered in this lesson and have to choose the model that best satisfies all of the conditions.

17. Which model would you choose as the one that best satisfies all of the conditions (the one based on the original data, the one after taking the log of gestation, or the one after taking the log of both variables)?

Taking the log of both variables seems to be best, considering the model conditions.

18. Do you feel comfortable saying that all of the conditions are met in the model you chose?

Yes, I feel pretty confident.

Step 3: Using the model that best satisfies the conditions, obtain the output from the simple linear regression analysis.

I decided to perform the analysis on the data after taking the natural log of both variables. Here is the output. Use the output to answer the questions that follow.

Remember that in R, the log() function takes the natural log of the variable you type inside the parentheses.
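A quick illustration of that reminder, and of where column names such as `log.gestation.` may come from. The lesson does not show how its log columns were created, so this is only one plausible route; the five rows are the first five animals in the table and are used purely for demonstration.

```r
# log() is the natural log; log10() is the base-10 log
log(exp(1))   # 1
log10(100)    # 2

# First five rows of the lesson's table, for illustration only
gestation <- c(187, 219, 225, 240, 122)
longevity <- c(20, 18, 25, 20, 5)

# data.frame() converts the expression "log(gestation)" into a syntactically
# valid column name by replacing each parenthesis with a dot
small <- data.frame(gestation, longevity, log(gestation), log(longevity))
names(small)
```

The resulting names are `gestation`, `longevity`, `log.gestation.`, and `log.longevity.`, which match the variable names in the model formulas used in this lesson.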

summary(lnlngest_mod)

## 
## Call:
## lm(formula = log.gestation. ~ log.longevity.)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04393 -0.50505 -0.03331  0.40151  1.22397 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.4040     0.3614   6.652  7.3e-08 ***
## log.longevity.   1.0528     0.1450   7.259  1.1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6025 on 38 degrees of freedom
## Multiple R-squared:  0.581,  Adjusted R-squared:   0.57 
## F-statistic: 52.69 on 1 and 38 DF,  p-value: 1.101e-08

19. Write the least-squares regression equation. Define the terms in the equation.

LN(GESTATION)=2.4040+1.0528*LN(LONGEVITY)
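For reference, the same equation can be rewritten on the original scale by exponentiating both sides. This is a worked restatement of the fitted line, not an additional model; on the original scale it describes the predicted median gestation period:

\( \widehat{\text{gestation}} = e^{2.4040} \cdot \text{longevity}^{1.0528} \approx 11.07 \cdot \text{longevity}^{1.0528} \)

Here 2.4040 is the estimated intercept and 1.0528 is the estimated slope, both on the natural log scale.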

20. Can life expectancy of mammals be used to predict their gestation period? Perform the appropriate hypothesis test to answer this question of interest:

Write the null and alternative hypotheses in notation and words

\( H_0: \beta_1 = 0 \) (There is no linear relationship between LN(LONGEVITY) and LN(GESTATION))

\( H_a: \beta_1 \neq 0 \) (There is a linear relationship between LN(LONGEVITY) and LN(GESTATION))

Calculate the t-statistic (with degrees of freedom)

## Estimate = 1.0528
## Std Err of Estimate = 0.1450

## Test Statistic = Estimate/Std Err of Estimate

1.0528/0.1450

## [1] 7.26069

## Degrees of Freedom = n-2
40-2

## [1] 38

Determine the p-value

## P-value <0.0001
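The p-value can also be computed directly rather than read from the output. A minimal sketch, assuming the estimate and standard error copied from the summary output above:

```r
estimate <- 1.0528
std_err  <- 0.1450
n        <- 40

t_stat <- estimate / std_err
df     <- n - 2

# Two-sided p-value: area in both tails beyond |t| under a t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_value
```

The result is far below 0.0001, in line with the `Pr(>|t|)` column of the summary output.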

Write a sentence answering the question in the context of the problem.

There is convincing evidence (p-value < 0.0001) of a linear relationship between LN(LONGEVITY) and LN(GESTATION), so we reject the null hypothesis. Life expectancy can be used to predict gestation period.

21. Predict gestation for a mammal with a life expectancy of 17 years.

## FIRST: TRANSFORM X
ln_x<-log(17)
ln_x

## [1] 2.833213

## SECOND: PREDICT WITH MODEL
ln_y<-2.4040+1.0528*ln_x
ln_y

## [1] 5.386807

## THIRD: BACK TRANSFORM
exp(ln_y)

## [1] 218.5046
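The three steps (transform, predict, back-transform) can be collected into one function. This is a sketch using the rounded coefficients from the output; `predict_gestation()` is a hypothetical helper, not part of the lesson's code.

```r
# Coefficients copied from the summary(lnlngest_mod) output
b0 <- 2.4040
b1 <- 1.0528

# Hypothetical helper: log-transform x, apply the fitted line, back-transform
predict_gestation <- function(longevity_years) {
  exp(b0 + b1 * log(longevity_years))
}

predict_gestation(17)            # about 218.5 days, matching the steps above
predict_gestation(c(5, 15, 40))  # works on several life expectancies at once
```

Because the back-transformation is applied to a prediction made on the log scale, the result is best read as a predicted median gestation period.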

22. What is the residual for the monkey?

## FIRST: TRANSFORM X FOR MONKEY
ln_x<-log(15)
ln_x

## [1] 2.70805

## SECOND: PREDICT WITH MODEL
ln_y<-2.4040+1.0528*ln_x
ln_y

## [1] 5.255035

## THIRD: BACK TRANSFORM
expected<-exp(ln_y)
expected

## [1] 191.5282

observed<-164

## RESIDUAL = OBSERVED - EXPECTED
observed-expected

## [1] -27.52824
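The residual above is expressed in days, on the original scale. The model itself was fitted on the log scale, so the residuals it works with are differences in ln(gestation). A short sketch showing both versions for the monkey, using the rounded coefficients from the output:

```r
b0 <- 2.4040
b1 <- 1.0528

observed_days <- 164   # monkey's gestation from the data table
longevity_yrs <- 15    # monkey's life expectancy from the data table

ln_expected <- b0 + b1 * log(longevity_yrs)

# Residual on the log scale (the scale the model was fitted on)
resid_log <- log(observed_days) - ln_expected

# Difference in days after back-transforming the prediction
resid_days <- observed_days - exp(ln_expected)

c(log_scale = resid_log, days = resid_days)
```

Both values are negative: the monkey's gestation period is shorter than the model predicts, by roughly 0.155 on the log scale or about 27.5 days.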

23. What percent of the variation in ln(gestation) is explained by this regression model?

## Multiple R-squared:  0.581

About 58.1% of the variation in ln(gestation) is explained by the regression on ln(longevity).
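The percent of variation explained is the Multiple R-squared value in the summary output. As a cross-check, in simple linear regression it can be recovered from the F-statistic and its residual degrees of freedom, since R-squared = F / (F + df). A minimal sketch with the values from the output:

```r
f_stat   <- 52.69   # F-statistic from the summary output
df_resid <- 38      # residual degrees of freedom (n - 2)

r_squared <- f_stat / (f_stat + df_resid)
round(100 * r_squared, 1)   # percent of variation in ln(gestation) explained
```

This gives about 58.1%, agreeing with the reported Multiple R-squared of 0.581.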

Interpretations when only the response variable is transformed.

Certain interpretations are difficult when both the response and explanatory variables are transformed, such as the interpretation of the slope and the construction and interpretation of a confidence interval for the slope. You will not be responsible for knowing these interpretations if both variables are transformed.

However, if only the response variable is transformed, you WILL be responsible for knowing the interpretation of the slope AND the construction and interpretation of a confidence interval for the slope. The following are illustrations of both of these using the Animal Gestation example, but with only the response variable (gestation) transformed.

1. If only the natural log of the response variable was taken, the least-squares regression equation is:

\( \log_e(\text{gestation}) = 3.8096 + 0.0856(\text{longevity}) \)

(output not shown)

The interpretation of the slope is as follows:

A one-year increase in life expectancy of an animal is associated with a multiplicative change in the median gestation period of \( e^{0.0856} \) (or about 1.09). In other words, the median gestation period for an animal with an average life expectancy of 17 years (for example) is predicted to be about 1.09 times longer than for an animal with an average life expectancy of 16 years.
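The multiplicative interpretation can be checked numerically. In this sketch, `median_gest()` is a hypothetical helper built from the equation in #1:

```r
b0 <- 3.8096
b1 <- 0.0856

# Predicted median gestation (days) when only the response is log-transformed
median_gest <- function(longevity_years) exp(b0 + b1 * longevity_years)

# Ratio of predictions one year apart; the intercept cancels out
ratio_17_vs_16 <- median_gest(17) / median_gest(16)
ratio_17_vs_16
exp(b1)
```

Both lines give the same value, about 1.09, and the ratio is the same for any pair of life expectancies that differ by one year.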

2. When doing a log transformation on just the response variable, a confidence interval for the slope can be started in the same way as if no transformation was done, but it is finished in a slightly different way. Let’s use the regression equation after performing a log transformation on gestation (equation given in #1 above). (Note: the standard error of the slope is 0.0152 – again, output is not shown.)

95% confidence interval for \( \beta_1 \):

\( 0.0856 \pm (2.030)(0.0152) = (0.0547, 0.1165) \)

To finish, exponentiate both endpoints: \( (e^{0.0547}, e^{0.1165}) = (1.06, 1.12) \)

The interpretation: we are 95% confident that the median gestation period will be between 1.06 and 1.12 times longer for every increase of one year in life expectancy.
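The interval can be reproduced in a few lines. This sketch uses the multiplier 2.030 from the calculation above; `qt()` is shown only for comparison, since it gives the exact multiplier for 38 degrees of freedom.

```r
est <- 0.0856
se  <- 0.0152

multiplier <- 2.030    # value used in the lesson's calculation
qt(0.975, df = 38)     # exact t multiplier for 38 df, about 2.024

# Interval for the slope on the log scale
ci_log <- est + c(-1, 1) * multiplier * se

# Exponentiate the endpoints to get the multiplicative change in the median
ci_ratio <- exp(ci_log)

round(ci_log, 4)
round(ci_ratio, 2)
```

Using the exact multiplier instead of 2.030 changes the endpoints only in the fourth decimal place on the log scale, so the back-transformed interval is the same to two decimals.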

3. The estimate of the standard deviation of the residuals (the residual standard error in the output) is also on the log scale. We won’t try to put it back on the original scale, since the constant variation condition didn’t appear to be met on the original scale.