KITADA
Lesson #19
Motivation:
Conclusions from a simple linear regression analysis are valid to a population of interest only if the conditions listed on the tan sheet (Steps for doing a Simple Linear Regression Analysis) are met. However, if the last two conditions (constant variation and normality conditions) aren't perfectly met, conclusions can still be valid to the population of interest. But, if one or both of those conditions are far from being met, then any conclusions from the analysis may not be valid to a population of interest. When one or more of the linearity, constant variation, and/or normality conditions aren't met, a strategy to use is to attempt a transformation to see if it helps satisfy the conditions better. The example in this lesson will show how to implement such a strategy.
What you need to know from this lesson:
After completing this lesson, you should be able to
To accomplish the above “What You Need to Know”, do the following:
The Lesson
The “Animal Gestation” example:
Do mammals with longer average life expectancies have longer gestation periods? The data below are the average gestation period (in days) and average life expectancy (in years) for a sample of mammals. Perform an appropriate analysis to answer the question of interest (and other related questions).
1. What is the response and what is the explanatory variable? Are they quantitative or categorical?
2. What is the appropriate analysis to answer the question of interest?
We want to create a linear model to describe the relationship. Specifically, we want to assess whether or not there is a positive relationship between life expectancy and gestation period.
GESTATION[,1:3]
## animal gestation longevity
## 1 baboon 187 20
## 2 black bear 219 18
## 3 grizzly bear 225 25
## 4 polar bear 240 20
## 5 beaver 122 5
## 6 buffalo 278 15
## 7 camel 406 12
## 8 chimpanzee 231 20
## 9 cat 63 12
## 10 chipmunk 31 6
## 11 cow 284 15
## 12 deer 201 8
## 13 dog 61 12
## 14 donkey 365 12
## 15 elephant 645 40
## 16 elk 250 15
## 17 fox 52 7
## 18 giraffe 425 10
## 19 goat 151 8
## 20 gorilla 257 20
## 21 guinea pig 68 4
## 22 hippopotamus 238 25
## 23 horse 330 20
## 24 kangaroo 42 7
## 25 leopard 98 12
## 26 lion 100 15
## 27 monkey 164 15
## 28 moose 240 12
## 29 mouse 21 3
## 30 opossum 15 1
## 31 pig 112 10
## 32 puma 90 12
## 33 rabbit 31 5
## 34 rhinoceros 450 15
## 35 sea lion 350 12
## 36 sheep 154 12
## 37 squirrel 44 10
## 38 tiger 105 16
## 39 wolf 63 5
## 40 zebra 365 15
Performing the analysis:
Step 1: Assess the conditions for conclusions from the analysis to be valid to the population of all mammals
### A NEW EXAMPLE: GESTATION ###
gest_mod <- with(GESTATION, lm(gestation~longevity))
## SCATTER PLOT ##
with(GESTATION,
plot(longevity, gestation, pch=16, main = "Animal Gestation",
ylab="Average Gestation (days)",
xlab="Average Life Expectancy (years)")
)
abline(coefficients(gest_mod),
lwd=3, col="red", lty=2)
## RESIDUAL PLOT ##
plot(GESTATION$longevity, resid(gest_mod), pch=16,
xlab="Average Life Expectancy (years)",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
## QQ PLOT ##
qqnorm(resid(gest_mod))
qqline(resid(gest_mod))
3. Is the sample representative of the population?
Comment on whether or not the sample taken is representative of the population of all mammals.
Only 40 animals are represented; however, the list does appear to be very diverse.
4. Linearity condition:
Based on the scatterplot and residual plot, is the relationship between average life expectancy and average gestation period linear?
It looks like there is a curved relationship.
5. Outlier condition (or other deviations):
Are there any potentially influential outliers? How might the outlier be influencing the analysis?
There appears to be one outlier around x=40. This might be influential, making the slope of the line less than it should be.
Identify the outlier. Is there a “non-statistical” reason to remove the outlier from the analysis? If not, what should be done?
There doesn't appear to be any “non-statistical” reason to remove the elephant from the data set.
The outlying observation is left in the analysis that follows. A separate analysis without the outlier should be performed as well, but it is not done here. The procedure for doing the analysis without the outlier would be the same as what follows.
Are there any other deviations from the pattern (i.e. clusters of points)?
Not really
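The separate analysis without the elephant mentioned in question 5 could be set up along the following lines. This is only a sketch: the data frame name `gest_df` and the model names are hypothetical, and the values are copied from the table in this lesson. It fits the same simple linear regression with and without the elephant so the two fitted lines can be compared.

```r
# Data copied from the table in this lesson; `gest_df` is a hypothetical name.
gest_df <- data.frame(
  animal = c("baboon", "black bear", "grizzly bear", "polar bear", "beaver",
             "buffalo", "camel", "chimpanzee", "cat", "chipmunk",
             "cow", "deer", "dog", "donkey", "elephant",
             "elk", "fox", "giraffe", "goat", "gorilla",
             "guinea pig", "hippopotamus", "horse", "kangaroo", "leopard",
             "lion", "monkey", "moose", "mouse", "opossum",
             "pig", "puma", "rabbit", "rhinoceros", "sea lion",
             "sheep", "squirrel", "tiger", "wolf", "zebra"),
  gestation = c(187, 219, 225, 240, 122, 278, 406, 231, 63, 31,
                284, 201, 61, 365, 645, 250, 52, 425, 151, 257,
                68, 238, 330, 42, 98, 100, 164, 240, 21, 15,
                112, 90, 31, 450, 350, 154, 44, 105, 63, 365),
  longevity = c(20, 18, 25, 20, 5, 15, 12, 20, 12, 6,
                15, 8, 12, 12, 40, 15, 7, 10, 8, 20,
                4, 25, 20, 7, 12, 15, 15, 12, 3, 1,
                10, 12, 5, 15, 12, 12, 10, 16, 5, 15)
)

# Fit the same model with all animals and with the elephant removed.
mod_all <- lm(gestation ~ longevity, data = gest_df)
no_elephant <- subset(gest_df, animal != "elephant")
mod_no_elephant <- lm(gestation ~ longevity, data = no_elephant)

# Show the fitted intercepts and slopes side by side.
rbind(all_animals = coef(mod_all), without_elephant = coef(mod_no_elephant))
```

If the two slopes differ noticeably, that would suggest the elephant is influential; the condition checks would then need to be repeated on the reduced data.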
6. Independent observations:
Is it safe to say that the gestation period for one type of mammal is not being influenced by the gestation period of any other type of mammal?
Yes, they are independent.
7. Constant variation condition:
Based on the residual plot and scatterplot, do you feel that the spread in gestation periods is the same for all life expectancies? (Another way to ask this question: do you feel that the residuals have constant variation for all life expectancies?) Why or why not?
The residuals are “fan-shaped”, which suggests non-constant variation.
8. Normality condition:
Based on the normal probability plot of the residuals, do you feel that the residuals are normally distributed (and, therefore, the values of gestation period are normally distributed for each value of life expectancy)?
There are strong deviations from normality in the upper tail.
Step 2: Decide if a transformation is necessary.
If no transformation is necessary, move to Step 3.
If a transformation is necessary, do so and re-check conditions 2, 5, and 6 in Step 1 on the transformed data. If the conditions are satisfied after performing a transformation, move on to Step 3 using the transformed data. If the conditions are still not satisfied, perform another transformation and re-check conditions 2, 5, and 6 in Step 1 on the newly transformed data. Repeat until a model can be chosen that best satisfies the conditions. Use that model in Step 3.
9. Is a transformation necessary in this problem? Why or why not?
Yes, since there is a curved relationship we should consider using transformations.
A natural log transformation of the response variable was attempted first. Here are the plots for the analysis of ln(gestation) vs. life expectancy.
### A NEW EXAMPLE: GESTATION ###
lngest_mod <- with(GESTATION, lm(log.gestation.~longevity))
## SCATTER PLOT ##
with(GESTATION,
plot(longevity, log.gestation., pch=16, main = "Animal Gestation",
ylab="LN(Average Gestation) (days)",
xlab="Average Life Expectancy (years)")
)
abline(coefficients(lngest_mod),
lwd=3, col="red", lty=2)
## RESIDUAL PLOT ##
plot(GESTATION$longevity, resid(lngest_mod), pch=16,
xlab="Average Life Expectancy (years)",
ylab="Residual",
main="Residual Plot with LN MOD")
abline(h=0, lwd=2, lty=2,
col="blue")
## QQ PLOT ##
qqnorm(resid(lngest_mod))
qqline(resid(lngest_mod))
Step 1 revisited: Reassess the linearity, constant variation, and normality conditions on the transformed data
Condition 2: Linearity condition:
10. Is the relationship between life expectancy and ln(gestation period) linear? Is it “more linear” or “less linear” compared to the original data?
It looks more linear than the original data; however, there still appears to be some curvature.
Condition 5: constant variation condition:
11. Is the constant variation condition better met on the transformed data compared to the original data?
No, the variation is not constant.
Condition 6: normality condition:
12. Is the normality condition better met on the transformed data compared to the original data?
Again, there are deviations from normality; however, they are now considerable in both tails.
Step 2 revisited: Decide if further transformations are necessary.
13. Do you feel that another transformation should be performed to try to get the conditions “better satisfied”? If so, what type of transformation should be attempted next?
Yes, we should try another transformation in the x direction.
If it is decided to try another transformation, transforming both the response and explanatory variable should be attempted. (Since the constant variation condition and normality condition were not met on the original data, transforming only the explanatory variable would not help. Therefore, we’ll move right to transforming both.) Once again, we’ll need to revisit Step 1 after transforming both variables. Below are the plots after performing a natural log transformation on both gestation and life expectancy.
### A NEW EXAMPLE: GESTATION ###
lnlngest_mod <- with(GESTATION, lm(log.gestation.~log.longevity.))
## SCATTER PLOT ##
with(GESTATION,
plot(log.longevity., log.gestation., pch=16, main = "Animal Gestation",
ylab="LN(Average Gestation) (days)",
xlab="LN(Average Life Expectancy) (years)")
)
abline(coefficients(lnlngest_mod),
lwd=3, col="red", lty=2)
## RESIDUAL PLOT ##
plot(GESTATION$log.longevity., resid(lnlngest_mod), pch=16,
xlab="LN(Average Life Expectancy) (years)",
ylab="Residual",
main="Residual Plot with LN-LN MOD")
abline(h=0, lwd=2, lty=2,
col="blue")
## QQ PLOT ##
qqnorm(resid(lnlngest_mod))
qqline(resid(lnlngest_mod))
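The lesson does not show how the `log.gestation.` and `log.longevity.` columns of `GESTATION` were created. A minimal sketch, assuming they are simply the natural logs of the original columns, is shown below on the first three rows of the table. The data frame name `gest_small` is hypothetical.

```r
# First three rows of the lesson's table; `gest_small` is a hypothetical name.
gest_small <- data.frame(
  animal    = c("baboon", "black bear", "grizzly bear"),
  gestation = c(187, 219, 225),
  longevity = c(20, 18, 25)
)

# log() is the natural log in R. The column names mirror those used in the
# lesson, but how the lesson actually built them is an assumption.
gest_small$log.gestation. <- log(gest_small$gestation)
gest_small$log.longevity. <- log(gest_small$longevity)

gest_small
```

An equivalent option is to transform inside the formula, for example `lm(log(gestation) ~ log(longevity), data = gest_small)`, which avoids storing extra columns.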
Step 1 revisited: Reassess the linearity, constant variation, and normality conditions when both variables are transformed
Condition 2: Linearity condition:
14. Is the relationship between ln(life expectancy) and ln(gestation period) linear? Is it “more linear” or “less linear” compared to the original data? How about compared to the model when only gestation was transformed?
Yes, it looks much more linear.
Condition 5: constant variation condition:
15. Is the constant variation condition better met when both variables are transformed compared to when only gestation was transformed?
The variation is roughly constant (if the outliers are excluded).
Condition 6: normality condition:
16. Is the normality condition better met when both variables are transformed compared to when only gestation was transformed?
Yes, looks pretty good.
Step 2 revisited: Decide which model is the best.
We’ve tried the transformations considered in this lesson and now have to choose the model that best satisfies all of the conditions.
17. Which model would you choose as the one that best satisfies all of the conditions (the one based on the original data, the one after taking the log of gestation, or the one after taking the log of both variables)?
Taking the log of both variables seems to be best, considering the model conditions.
18. Do you feel comfortable saying that all of the conditions are met in the model you chose?
Yes, I feel pretty confident.
Step 3: Using the model that best satisfies the conditions, obtain the output from the simple linear regression analysis.
I decided to perform the analysis on the data after taking the natural log of both variables. Here is the output. Use the output to answer the questions that follow.
Remember that in R, the log() function takes the natural log of the variable you type inside the parentheses.
summary(lnlngest_mod)
##
## Call:
## lm(formula = log.gestation. ~ log.longevity.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.04393 -0.50505 -0.03331 0.40151 1.22397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4040 0.3614 6.652 7.3e-08 ***
## log.longevity. 1.0528 0.1450 7.259 1.1e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6025 on 38 degrees of freedom
## Multiple R-squared: 0.581, Adjusted R-squared: 0.57
## F-statistic: 52.69 on 1 and 38 DF, p-value: 1.101e-08
19. Write the least-squares regression equation. Define the terms in the equation.
LN(GESTATION)=2.4040+1.0528*LN(LONGEVITY)
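It can help to see what this equation says on the original scale. Exponentiating both sides gives a power relationship between the two variables. The worked step below is a sketch; the back-transformed value is best read as a predicted median gestation period rather than a mean, and the constant 11.07 is a rounded value of \( e^{2.4040} \).

```latex
\[
\ln(\widehat{\text{gestation}}) = 2.4040 + 1.0528\,\ln(\text{longevity})
\quad\Longrightarrow\quad
\widehat{\text{gestation}} = e^{2.4040}\cdot\text{longevity}^{1.0528}
\approx 11.07\cdot\text{longevity}^{1.0528}
\]
```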
20. Can life expectancy of mammals be used to predict their gestation period? Perform the appropriate hypothesis test to answer this question of interest:
\( H_0: \beta_1 = 0 \) (There is no effect of LN(LONGEVITY) on LN(GESTATION))
\( H_a: \beta_1 \neq 0 \) (There is an effect of LN(LONGEVITY) on LN(GESTATION))
## Estimate = 1.0528
## Std Err of Estimate = 0.1450
## Test Statistic = Estimate/Std Err of Estimate
1.0528/0.1450
## [1] 7.26069
## Degrees of Freedom = n-2
40-2
## [1] 38
## P-value <0.0001
There is convincing evidence to suggest that there is an effect of LN(LONGEVITY) on LN(GESTATION) with a p-value <0.0001. Therefore, we will reject the null hypothesis.
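The p-value for this two-sided test can also be computed directly from the test statistic and its degrees of freedom. The sketch below uses the rounded estimate and standard error from the output; the variable names are chosen here for illustration.

```r
# Test statistic and degrees of freedom from the regression output.
t_stat <- 1.0528 / 0.1450
df <- 40 - 2

# Two-sided p-value: twice the upper-tail area beyond |t| under a t(38) curve.
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
p_value
```

The result should agree, up to rounding, with the `Pr(>|t|)` entry for `log.longevity.` in the summary output, which is far below 0.0001.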
21. Predict gestation for a mammal with a life expectancy of 17 years.
## FIRST: TRANSFORM X
ln_x<-log(17)
ln_x
## [1] 2.833213
## SECOND: PREDICT WITH MODEL
ln_y<-2.4040+1.0528*ln_x
ln_y
## [1] 5.386807
## THIRD: BACK TRANSFORM
exp(ln_y)
## [1] 218.5046
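The three steps above (transform x, predict on the ln scale, back-transform) can be wrapped in a small helper. This is a sketch only: `predict_gestation` is a hypothetical name, and the default coefficients are the rounded values from the summary output.

```r
# Hypothetical helper: predicted gestation (days) from life expectancy (years)
# using the ln-ln model with the rounded coefficients from the output.
predict_gestation <- function(longevity, intercept = 2.4040, slope = 1.0528) {
  ln_x <- log(longevity)            # FIRST: transform x
  ln_y <- intercept + slope * ln_x  # SECOND: predict with the model
  exp(ln_y)                         # THIRD: back-transform
}

predict_gestation(17)
predict_gestation(c(5, 17, 40))   # vectorized over several life expectancies
```

With a fitted model object available, `exp(predict(lnlngest_mod, newdata = data.frame(log.longevity. = log(17))))` would be the analogous call using unrounded coefficients.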
22. What is the residual for the monkey?
## FIRST: TRANSFORM X FOR MONKEY
ln_x<-log(15)
ln_x
## [1] 2.70805
## SECOND: PREDICT WITH MODEL
ln_y<-2.4040+1.0528*ln_x
ln_y
## [1] 5.255035
## THIRD: BACK TRANSFORM
expected<-exp(ln_y)
expected
## [1] 191.5282
observed<-164
## RESIDUAL = OBSERVED - EXPECTED
observed-expected
## [1] -27.52824
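The residual above is expressed in days, after back-transforming the prediction. The residuals that R reports for this model (for example in the `Residuals:` block of the summary output) are on the ln scale, since the model is fit to ln(gestation). As a hedged companion calculation, the ln-scale residual for the monkey could be computed as follows, using the rounded coefficients.

```r
# Observed and predicted values on the ln scale for the monkey
# (gestation = 164 days, longevity = 15 years).
ln_observed  <- log(164)
ln_predicted <- 2.4040 + 1.0528 * log(15)

# RESIDUAL (ln scale) = OBSERVED - EXPECTED
ln_residual <- ln_observed - ln_predicted
ln_residual
```

Both versions are negative, so either way the monkey's gestation period is shorter than the model predicts for its life expectancy.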
23. What percent of the variation in ln(gestation) is explained by this regression model?
## Multiple R-squared: 0.581
About 58.1% of the variation in ln(gestation) is explained by the regression on ln(life expectancy).
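The percentage asked for in question 23 is the Multiple R-squared value in the summary output (0.581). As a cross-check, in a simple linear regression R-squared can be recovered from the F-statistic and the residual degrees of freedom, since F = (R²/1) / ((1 − R²)/df). The sketch below uses the rounded values printed in the output.

```r
# Values taken from the last line of the summary output.
f_stat   <- 52.69
df_resid <- 38

# Rearranging F = R^2 / ((1 - R^2) / df) gives R^2 = F / (F + df).
r_squared <- f_stat / (f_stat + df_resid)
round(100 * r_squared, 1)   # percent of variation in ln(gestation) explained
```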
Interpretations when only the response variable is transformed.
Certain interpretations are difficult when both the response and explanatory variables are transformed, such as the interpretation of the slope and the construction and interpretation of a confidence interval for the slope. You will not be responsible for knowing these interpretations if both variables are transformed.
However, if only the response variable is transformed, you WILL be responsible for knowing the interpretation of the slope AND the construction and interpretation of a confidence interval for the slope. The following are illustrations of both of these using the Animal Gestation example, but with only the response variable (gestation) transformed.
\( \log_e(\text{gestation}) = 3.8096 + 0.0856\,(\text{longevity}) \)
(output not shown)
The interpretation of the slope is as follows:
A one-year increase in life expectancy of an animal is associated with a multiplicative change in the median gestation period of \( e^{0.0856} \) (or 1.09). In other words, the median gestation period for an animal with an average life expectancy of 17 years (for example) is predicted to be about 1.09 times as long as for an animal with an average life expectancy of 16 years.
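The multiplicative interpretation can be checked numerically: the ratio of back-transformed predictions one year apart equals e raised to the slope, whatever the starting life expectancy. The sketch below uses the coefficients of the ln(gestation) model quoted above; `median_gestation` is a hypothetical helper name.

```r
# Coefficients of the model with only the response transformed.
b0 <- 3.8096
b1 <- 0.0856

# Hypothetical helper: back-transformed (median) prediction in days.
median_gestation <- function(longevity) exp(b0 + b1 * longevity)

# Ratio of predictions for 17 vs. 16 years, next to exp(slope).
ratio <- median_gestation(17) / median_gestation(16)
c(ratio = ratio, exp_slope = exp(b1))
```

Replacing 17 and 16 with any other pair of values one year apart gives the same ratio, which is why the slope is interpreted as a constant multiplicative change.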
95% confidence interval for \( \beta_1 \):
\( 0.0856 \pm (2.030)(0.0152) = (0.0547, 0.1165) \)
To finish, exponentiate the endpoints: \( (e^{0.0547}, e^{0.1165}) = (1.06, 1.12) \)
The interpretation: we are 95% confident that the median gestation period is between 1.06 and 1.12 times as long for each one-year increase in life expectancy.
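The interval arithmetic can be reproduced in R. This sketch uses the estimate, standard error, and multiplier quoted in this lesson. The last line shows the multiplier R itself would give for 38 degrees of freedom, which is slightly smaller than 2.030; the interval rounds to the same endpoints either way.

```r
# Slope estimate and standard error for the ln(gestation) ~ longevity model.
estimate   <- 0.0856
std_err    <- 0.0152
multiplier <- 2.030   # t multiplier as used in the lesson

# Interval for the slope on the ln scale, then back-transformed endpoints.
ci_log   <- estimate + c(-1, 1) * multiplier * std_err
ci_ratio <- exp(ci_log)

round(ci_log, 4)
round(ci_ratio, 2)

# Multiplier computed by R for a 95% interval with 38 degrees of freedom.
qt(0.975, df = 38)
```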