KITADA

Lab #7

Transformations and Performing a Simple Linear Regression Analysis

Objectives:

Obtain and interpret appropriate graphical displays to assess conditions of the simple linear regression model
Explain when a transformation is needed by looking at appropriate graphical displays to assess conditions
Using R, perform an appropriate transformation on the response and/or explanatory variables
Using R, obtain output from a Simple Linear Regression analysis that best satisfies the conditions of the simple linear regression model.
- Using appropriate information in the output, write and interpret the least-squares regression equation
- Use the appropriate information in the output to perform a hypothesis test on the slope
- Using appropriate information in the output, report the estimate of the standard deviation of the residuals
- Using appropriate information in the output, report and interpret a confidence interval for the mean and a prediction interval for given values of the explanatory variable.

Example: Study on FEV of children

FEV (forced expiratory volume) is an index of pulmonary function that measures the volume of air expelled after one second of constant effort. The data set FEV contains determinations of FEV on 654 children ages 3-19 who were seen in the Childhood Respiratory Disease Study (CRD Study) in East Boston, Massachusetts. These data are part of a longitudinal study to follow the change in pulmonary function over time in children.

Here is the documentation for the data set:

1. List the conditions that need to be met for inference in a simple linear regression analysis. Next to each item in the list, list the graphical check (if one exists) to assess the condition. (Some may have more than one check.)

Representative?: Statement of problem
Linearity?: Scatterplot and residual plot
Outliers?: Scatterplot and residual plot
Independence?: Statement of problem
Normality?: QQ plot
Constant Variation?: Residual plot

2. Obtain the appropriate graphs to check the conditions on the original scale. Comment on each condition listed below:

Linearity condition

Obtain a scatterplot and a residual plot to examine the relationship between FEV and height.

### SCATTERPLOT ###
with(FEV, plot(Height, FEV, 
               main="Height vs FEV in Children",
               xlab="Height (in)",
               ylab="FEV (L)",
               pch=16))

fev_mod<-with(FEV,lm(FEV~Height))

abline(coefficients(fev_mod), 
       lwd=2, lty=2, col="red")


### RESIDUAL PLOT ###

with(FEV, plot(Height, resid(fev_mod), 
               main="Residual Plot",
               xlab="Height (in)",
               ylab="Residual",
               pch=16))

abline(h=0, 
       lwd=2, lty=2, col="blue")


### QQ PLOT ###
qqnorm(resid(fev_mod))
qqline(resid(fev_mod))
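
As a minimal sketch, base R can also draw a residual plot and a normal QQ plot directly from a fitted `lm` object using `plot()` with the `which` argument. The data below are simulated stand-ins, and the names `sim`, `x`, `y`, and `sim_mod` are assumptions for illustration only, not part of the FEV data set.

```r
# Simulated stand-in data; not the FEV data set
set.seed(1)
sim <- data.frame(x = runif(100, 45, 75))
sim$y <- -5 + 0.13 * sim$x + rnorm(100, sd = 0.4)

sim_mod <- lm(y ~ x, data = sim)

# which = 1: residuals vs fitted values
# which = 2: normal QQ plot of the standardized residuals
par(mfrow = c(1, 2))
plot(sim_mod, which = 1:2)
par(mfrow = c(1, 1))
```

Note that this residual plot puts fitted values on the horizontal axis rather than Height; in simple linear regression the two versions show the same pattern.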


Describe this relationship between FEV and height.

There appears to be a moderately strong, curved, positive relationship between FEV and Height.

Outlier “condition”

From the scatterplot and residual plot, are there any outliers that we need to be concerned about?

It doesn't appear that there are any outliers.

What is the strategy for dealing with an outlier?

If there is a legitimate reason to remove an outlier (i.e., it is due to error), we can perform the analysis without it.
If there is not a legitimate reason to remove an outlier, we should perform the complete analysis both with and without it.
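
The "both analyses" strategy can be sketched as follows. This is only an illustration on simulated data with one planted outlier; the names `sim`, `fit_all`, and `fit_drop` are assumptions and the numbers have nothing to do with the FEV data.

```r
# Simulated stand-in data with one planted outlier; not the FEV data set
set.seed(2)
sim <- data.frame(x = 1:30)
sim$y <- 2 + 0.5 * sim$x + rnorm(30, sd = 1)
sim$y[30] <- sim$y[30] + 15   # plant an outlier at the largest x

fit_all  <- lm(y ~ x, data = sim)          # analysis with every point
fit_drop <- lm(y ~ x, data = sim[-30, ])   # analysis without the outlier

# Compare intercept and slope side by side
rbind(all = coef(fit_all), dropped = coef(fit_drop))
```

If the two sets of estimates lead to the same conclusions, the outlier is not influential; if they differ, report both.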

Constant variation condition

Refer to the Residual Plot.

Do you feel the values of FEV have a similar spread for all values of height? Explain why or why not.

The residuals appear fan-shaped (i.e., there is non-constant variation).

Therefore, do you feel comfortable saying variation of the response variable is the same for all values of the explanatory variable in the population of interest? That is, is the variation of the residuals the same for all values of the explanatory variable? (What patterns in this graph would indicate a violation of this condition?)

No, I would not feel comfortable saying that all heights share a common variation.

Normality condition

Obtain a Normal Probability Plot of the Residuals.

Do the residuals appear to be normally distributed? Why or why not?

Both tails deviate strongly from normality.

Therefore, is it safe to say the residuals in the population of interest are normally distributed? Why or why not?

No, it would not be safe to say that the errors come from a normal distribution.

Independence condition

Do you feel the observations are independent of each other? That is, do you feel that one person’s FEV is independent of any other person’s FEV in the sample? Why or why not?

Yes, the patients should be independent of one another (unless some of them are siblings).

3. Is a transformation necessary? Why or why not?

Yes, a transformation is necessary because the relationship is curved and the residuals show non-constant variation.
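
To see why a log transformation of the response can help in a situation like this, here is a minimal sketch on simulated data whose errors are multiplicative. Everything in it (`sim`, `raw_mod`, `log_mod`, and the helper `spread_ratio`) is a hypothetical illustration, not the FEV analysis.

```r
# Simulated stand-in data with multiplicative error; not the FEV data set
set.seed(3)
sim <- data.frame(x = runif(300, 45, 75))
sim$y <- exp(-2.3 + 0.05 * sim$x + rnorm(300, sd = 0.15))

raw_mod <- lm(y ~ x, data = sim)        # original scale
log_mod <- lm(log(y) ~ x, data = sim)   # natural-log response

# Ratio of residual SD in the upper half of x to the lower half;
# a ratio near 1 suggests roughly constant variation
spread_ratio <- function(mod, x) {
  upper <- x > median(x)
  sd(resid(mod)[upper]) / sd(resid(mod)[!upper])
}

c(raw = spread_ratio(raw_mod, sim$x), log = spread_ratio(log_mod, sim$x))
```

A ratio well above 1 on the original scale corresponds to the fan shape in a residual plot. This numeric summary is only a rough companion to the graphs, not a replacement for them.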

4. Regardless of your answer to #3, perform a natural log transformation on the response variable (FEV).

Obtain a scatterplot, residual plot, and a normal probability plot of the residuals on the transformed data.

#### TRANSFORM ####

LN_FEV<-log(FEV$FEV)

### SCATTERPLOT ###
with(FEV, plot(Height, LN_FEV, 
               main="Height vs LN(FEV) in Children",
               xlab="Height (in)",
               ylab="LN(FEV) (L)",
               pch=16))

lnfev_mod<-with(FEV,lm(LN_FEV~Height))

abline(coefficients(lnfev_mod), 
       lwd=2, lty=2, col="red")


### RESIDUAL PLOT ###

with(FEV, plot(Height, resid(lnfev_mod), 
               main="Residual Plot",
               xlab="Height (in)",
               ylab="Residual",
               pch=16))

abline(h=0, 
       lwd=2, lty=2, col="blue")


### QQ PLOT ###
qqnorm(resid(lnfev_mod))
qqline(resid(lnfev_mod))


a. Describe the relationship between ln(FEV) and height.

There is a moderately strong, positive, linear relationship between height and ln(FEV).

b. Is the “constant variation” condition better met on the transformed data?

Yes, the residuals now show roughly constant variation across heights.

c. Is the “normality” condition met on the transformed data?

Yes, the residuals are now close to normally distributed.

5. On which is it more appropriate to perform a simple linear regression analysis: the original data or the transformed data?

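The transformed data are more appropriate, because the linearity, constant variation, and normality conditions are better met on the log scale.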

6. Using the more appropriate data, obtain the output from a simple linear regression analysis. Answer the following questions:

a. Write the least-squares regression line, defining the terms in the equation.

summary(lnfev_mod)

## 
## Call:
## lm(formula = LN_FEV ~ Height)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70208 -0.08986  0.01190  0.09337  0.43174 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.271312   0.063531  -35.75   <2e-16 ***
## Height       0.052119   0.001035   50.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1508 on 652 degrees of freedom
## Multiple R-squared:  0.7956, Adjusted R-squared:  0.7953 
## F-statistic:  2538 on 1 and 652 DF,  p-value: < 2.2e-16

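Predicted LN(FEV) = -2.271312 + 0.052119*HEIGHT, where LN(FEV) is the natural log of FEV (in liters) and HEIGHT is the child's height (in inches).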
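
Because the response is on the natural-log scale, the slope is easiest to interpret after exponentiating. The short sketch below uses the slope copied from the output above; the variable names are assumptions for illustration.

```r
# Slope copied from the summary(lnfev_mod) output
slope <- 0.052119

# On the log scale, a one-inch increase in height multiplies the
# back-transformed prediction of FEV by exp(slope)
multiplier <- exp(slope)
pct_change <- 100 * (multiplier - 1)

round(pct_change, 2)
```

That is, each additional inch of height is associated with roughly a 5.35% increase in back-transformed predicted FEV.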

b. Predict FEV for a child with a height of 55 inches. (Do by hand but also know how to use R to obtain a predicted (or “fitted”) value).

### PREDICT ###
newdata=data.frame(Height=c(55))
predict.lm(lnfev_mod, newdata, interval="prediction")

##         fit       lwr       upr
## 1 0.5952392 0.2986692 0.8918092

exp(0.5952392)

## [1] 1.813465
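
The "by hand" version of this prediction can be sketched as below, using the rounded coefficients from the summary output. Small differences from the `predict.lm` result are due to that rounding. The variable names are assumptions for illustration.

```r
# Coefficients copied (rounded) from the summary(lnfev_mod) output
b0 <- -2.271312
b1 <- 0.052119
height <- 55

ln_fev_hat <- b0 + b1 * height   # prediction on the natural-log scale
fev_hat <- exp(ln_fev_hat)       # back-transform to liters

# Back-transform the prediction interval endpoints reported by predict.lm
pi_log <- c(lwr = 0.2986692, upr = 0.8918092)
pi_liters <- exp(pi_log)

c(ln_fev_hat = ln_fev_hat, fev_hat = fev_hat)
pi_liters
```

Exponentiating the interval endpoints gives a prediction interval in liters that is not symmetric around the back-transformed fit.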

c. Use R to perform a t-test to answer the question of interest, “Does a person’s height help explain their FEV?” Based on the p-value from the t-test, answer the question of interest.

Test Statistic = 50.38
DF = 652
P-value < 0.0001

There is convincing evidence of a positive linear association between height and ln(FEV) (t = 50.38, df = 652, p-value < 0.0001). Therefore, we reject the null hypothesis that the slope is zero and conclude that height does help explain FEV.
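
The test statistic and p-value can be reproduced from the Height row of the coefficient table. This is a sketch using the rounded values printed in the output, so the recomputed statistic differs slightly from the reported 50.38; the variable names are assumptions.

```r
# Values copied from the Height row of the summary(lnfev_mod) output
estimate <- 0.052119
std_error <- 0.001035
df <- 652

t_stat <- estimate / std_error             # test statistic for H0: slope = 0
p_value <- 2 * pt(-abs(t_stat), df = df)   # two-sided p-value

c(t_stat = t_stat, p_value = p_value)
```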

7. For the following questions, suppose a transformation was NOT performed. Answer these questions on the original scale.

a. Suppose we wanted to predict FEV for a child with a height of 55 inches. Use R to obtain a 95% prediction interval for a height of 55 inches. (A prediction interval is an interval for the response of an individual new observation, not for the mean response.) Interpret this interval.

### PREDICT ###
newdata=data.frame(Height=c(55))
predict.lm(fev_mod, newdata, interval="prediction")

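##        fit       lwr      upr
## 1 1.825978 0.9789022 2.673053

We are 95% confident that the FEV of an individual child who is 55 inches tall is between about 0.98 and 2.67 liters.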

b. Suppose we wanted to know the average FEV for all children who are 55 inches tall. Use R to construct a 95% confidence interval for the average (called a confidence interval for the mean response). Interpret this interval.

### PREDICT ###
newdata=data.frame(Height=c(55))
predict.lm(fev_mod, newdata, interval="confidence")

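##        fit      lwr      upr
## 1 1.825978 1.777354 1.874601

We are 95% confident that the mean FEV of all children who are 55 inches tall is between about 1.78 and 1.87 liters.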
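
The prediction interval is much wider than the confidence interval because an individual observation carries its own variation in addition to the uncertainty in the estimated mean. As a rough sketch, the prediction interval can be rebuilt from the confidence interval and the original-scale residual standard error (0.4307, reported in part c). The variable names are assumptions, and the inputs are rounded values from the output.

```r
# Values copied from the predict.lm output and the residual standard error
ci <- c(lwr = 1.777354, upr = 1.874601)   # interval for the mean response
s <- 0.4307                               # residual standard error
df <- 652

t_star <- qt(0.975, df)
fit <- mean(ci)                            # both intervals are centered at the fit
se_mean <- (ci[["upr"]] - ci[["lwr"]]) / (2 * t_star)

# A new observation adds its own variance s^2 to the variance of the mean
se_pred <- sqrt(s^2 + se_mean^2)
pred_int <- c(lwr = fit - t_star * se_pred, upr = fit + t_star * se_pred)

round(pred_int, 3)
```

The result should agree with the `predict.lm` prediction interval up to rounding.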

c. Estimate \( \sigma \), the standard deviation of the residuals. (It must be constant for this value to make sense.)

## Residual standard error: 0.4307

But this doesn't make sense because the variation is not constant across heights.

Analyzing groups separately:

At times, it might be desirable and even necessary to analyze groups of observations separately. For example, it may be desirable to look at the relationship between FEV and Height separately for male children and female children to see if the relationship is the same for boys and girls.
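
One way to fit the regression separately by group is sketched below. The data are simulated, and the column names `Sex`, `Height`, and `FEV` are assumptions; check the data set documentation for the actual variable names and coding before adapting this.

```r
# Simulated stand-in data; the column names Sex, Height, and FEV are assumptions
set.seed(4)
n <- 200
sim <- data.frame(
  Sex = rep(c("Female", "Male"), each = n / 2),
  Height = runif(n, 45, 75)
)
sim$FEV <- exp(-2.3 + 0.05 * sim$Height + rnorm(n, sd = 0.15))

# Fit ln(FEV) ~ Height separately within each group
fits <- lapply(split(sim, sim$Sex), function(d) lm(log(FEV) ~ Height, data = d))

# One row of intercept and slope per group
t(sapply(fits, coef))
```

Comparing the rows shows whether the estimated intercepts and slopes look similar for the two groups; the same diagnostic plots should be checked for each group's fit.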