KITADA
Lab #7
Objectives:
Example: Study on FEV of children
FEV (forced expiratory volume) is an index of pulmonary function that measures the volume of air expelled after one second of constant effort. The data set FEV contains determinations of FEV on 654 children ages 3-19 who were seen in the Childhood Respiratory Disease Study (CRD Study) in East Boston, Massachusetts. These data are part of a longitudinal study to follow the change in pulmonary function over time in children.
Here is the documentation for the data set:
1. List the conditions that need to be met for inference in a simple linear regression analysis. Next to each item in the list, list the graphical check (if one exists) to assess the condition. (Some may have more than one check.)
2. Obtain the appropriate graphs to check the conditions on the original scale. Comment on each condition listed below:
Linearity condition
Obtain a scatterplot and a residual plot to examine the relationship between FEV and height.
### SCATTERPLOT ###
with(FEV, plot(Height, FEV,
main="Height vs FEV in Children",
xlab="Height (in)",
ylab="FEV (L)",
pch=16))
fev_mod<-with(FEV,lm(FEV~Height))
abline(coefficients(fev_mod),
lwd=2, lty=2, col="red")
### RESIDUAL PLOT ###
with(FEV, plot(Height, resid(fev_mod),
main="Residual Plot",
xlab="Height (in)",
ylab="Residual",
pch=16))
abline(h=0,
lwd=2, lty=2, col="blue")
### QQ PLOT ###
qqnorm(resid(fev_mod))
qqline(resid(fev_mod))
Describe this relationship between FEV and height.
There appears to be a moderately strong, positive, curved relationship between FEV and height.
Outlier “condition”
From the scatterplot and residual plot, are there any outliers that we need to be concerned about?
It doesn't appear that there are any outliers.
What is the strategy in dealing with the outlier?
Constant variation condition
Refer to the Residual Plot.
Do you feel the values of FEV have a similar spread for all values of height? Explain why or why not.
The residuals appear fan shaped (i.e., there is non-constant variation).
Therefore, do you feel comfortable saying variation of the response variable is the same for all values of the explanatory variable in the population of interest? That is, is the variation of the residuals the same for all values of the explanatory variable? (What patterns in this graph would indicate a violation of this condition?)
No, I would not feel comfortable saying that all heights share a common variation.
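The fan shape can also be checked numerically. The sketch below is not part of the lab: it uses synthetic data (the names `height`, `fev`, and `sim_mod` are made up for illustration, and the simulated values are not the CRD Study data) and compares the standard deviation of the residuals across four height bins. Clearly unequal standard deviations would point to the same violation seen in the residual plot.

```r
# Synthetic stand-in for the FEV data (NOT the real CRD Study data):
# the spread of the response grows with height, giving a fan shape.
set.seed(1)
height <- runif(500, min = 46, max = 74)
fev <- exp(-2.27 + 0.052 * height + rnorm(500, sd = 0.15))
sim_mod <- lm(fev ~ height)

# Split heights into four equal-count bins and compare residual SDs.
height_bin <- cut(height,
                  breaks = quantile(height, probs = seq(0, 1, 0.25)),
                  include.lowest = TRUE)
resid_sd <- tapply(resid(sim_mod), height_bin, sd)
print(round(resid_sd, 3))
```

With the lab data, the same idea would be applied to `resid(fev_mod)` and `FEV$Height`.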
Normality condition
Obtain a Normal Probability Plot of the Residuals.
Do the residuals appear to be normally distributed? Why or why not?
Both tails deviate strongly from normality.
Therefore, is it safe to say the residuals in the population of interest are normally distributed? Why or why not?
No, it would not be safe to say that the errors come from a normal distribution.
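As an optional supplement to the normal probability plot (the lab itself only asks for the plot), a Shapiro-Wilk test can be run on the residuals with `shapiro.test()` from base R. The sketch below uses synthetic data, not the CRD Study data, and the object names are made up for illustration. It fits the same simulated response on the raw scale and on the log scale so the two sets of residuals can be compared.

```r
# Synthetic stand-in for the FEV data (NOT the real CRD Study data).
set.seed(3)
height <- runif(500, min = 46, max = 74)
fev <- exp(-2.27 + 0.052 * height + rnorm(500, sd = 0.15))

raw_mod <- lm(fev ~ height)       # response on the original scale
log_mod <- lm(log(fev) ~ height)  # response on the natural log scale

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution.
sw_raw <- shapiro.test(resid(raw_mod))
sw_log <- shapiro.test(resid(log_mod))
print(c(raw = sw_raw$p.value, log = sw_log$p.value))
```

With samples this large the test can flag very small departures from normality, so the plot should remain the main check.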
Independence condition
Do you feel the observations are independent of each other? That is, do you feel that one person’s FEV is independent of any other person’s FEV in the sample? Why or why not?
Yes, the children should be independent of one another (unless some of them are siblings).
3. Is a transformation necessary? Why or why not?
Yes, a transformation is necessary because the relationship is curved and the residuals show non-constant variation.
4. Regardless of your answer to #3, perform a natural log transformation on the response variable (FEV).
Obtain a scatterplot, residual plot, and a normal probability plot of the residuals on the transformed data.
#### TRANSFORM ####
LN_FEV<-log(FEV$FEV)
### SCATTERPLOT ###
with(FEV, plot(Height, LN_FEV,
main="Height vs LN(FEV) in Children",
xlab="Height (in)",
ylab="LN(FEV) (L)",
pch=16))
lnfev_mod<-with(FEV,lm(LN_FEV~Height))
abline(coefficients(lnfev_mod),
lwd=2, lty=2, col="red")
### RESIDUAL PLOT ###
with(FEV, plot(Height, resid(lnfev_mod),
main="Residual Plot",
xlab="Height (in)",
ylab="Residual",
pch=16))
abline(h=0,
lwd=2, lty=2, col="blue")
### QQ PLOT ###
qqnorm(resid(lnfev_mod))
qqline(resid(lnfev_mod))
a. Describe the relationship between \( \log_e(\mathrm{FEV}) \) and height.
There is a moderately strong, positive, linear relationship between height and ln(FEV).
b. Is the “constant variation” condition better met on the transformed data?
Yes. On the transformed scale the residuals show roughly constant variation across heights.
c. Is the “normality” condition met on the transformed data?
Yes, approximately. The residuals are close to normally distributed.
5. On which is it more appropriate to perform a simple linear regression analysis: the original data or the transformed data?
The transformed data is more appropriate.
6. Using the more appropriate data, obtain the output from a simple linear regression analysis. Answer the following questions:
a. Write the least-squares regression line, defining the terms in the equation.
summary(lnfev_mod)
##
## Call:
## lm(formula = LN_FEV ~ Height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70208 -0.08986 0.01190 0.09337 0.43174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.271312 0.063531 -35.75 <2e-16 ***
## Height 0.052119 0.001035 50.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1508 on 652 degrees of freedom
## Multiple R-squared: 0.7956, Adjusted R-squared: 0.7953
## F-statistic: 2538 on 1 and 652 DF, p-value: < 2.2e-16
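Predicted LN(FEV) = -2.271312 + 0.052119*Height, where LN(FEV) is the natural log of FEV (FEV in liters) and Height is the child's height in inches.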
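Because the model is fit on the log scale, it can help to write the fitted line back on the original FEV scale. The following is a worked restatement of the equation above using only the coefficients from the R output; the hat notation is added here for clarity.

```latex
\[
\widehat{\ln(\mathrm{FEV})} = -2.271312 + 0.052119 \cdot \mathrm{Height}
\]
\[
\widehat{\mathrm{FEV}} = e^{-2.271312 + 0.052119 \cdot \mathrm{Height}}
= e^{-2.271312} \cdot \left(e^{0.052119}\right)^{\mathrm{Height}},
\qquad e^{0.052119} \approx 1.0535
\]
```

So each additional inch of height multiplies the predicted FEV by about 1.0535, an increase of roughly 5%, rather than adding a fixed amount.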
b. Predict FEV for a child with a height of 55 inches. (Do by hand but also know how to use R to obtain a predicted (or “fitted”) value).
### PREDICT ###
newdata=data.frame(Height=c(55))
predict.lm(lnfev_mod, newdata, interval="prediction")
## fit lwr upr
## 1 0.5952392 0.2986692 0.8918092
exp(0.5952392)
## [1] 1.813465
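The "by hand" version of this prediction can be sketched as follows. The coefficients and the interval endpoints are copied from the R output above; the variable names are made up for illustration. The last step back-transforms the log-scale prediction interval, which is one reasonable way (an assumption here, since the lab does not ask for it) to express the interval in liters.

```r
# Coefficients copied from the summary(lnfev_mod) output.
b0 <- -2.271312
b1 <- 0.052119
height <- 55

# Step 1: predict on the log scale.
ln_fev_hat <- b0 + b1 * height

# Step 2: back-transform to liters.
fev_hat <- exp(ln_fev_hat)

# Optional: back-transform the 95% prediction interval endpoints
# reported by predict.lm() on the log scale.
ln_pi <- c(lwr = 0.2986692, upr = 0.8918092)
fev_pi <- exp(ln_pi)

print(c(ln_fev_hat = ln_fev_hat, fev_hat = fev_hat))
print(fev_pi)
```

The hand calculation differs from the `predict.lm()` value only in the later decimal places, because the coefficients in the summary table are rounded.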
c. Use R to perform a t-test to answer the question of interest, “Does a person’s height help explain their FEV?” Based on the p-value from the t-test, answer the question of interest.
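The t-test of the null hypothesis that the slope is 0 gives t = 50.38 on 652 degrees of freedom, with a p-value < 2e-16, so we reject the null hypothesis. There is convincing evidence that height helps explain FEV: height has a positive linear association with ln(FEV).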
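The t statistic and p-value in the summary table can be reproduced from the slope estimate and its standard error. This is a minimal sketch using the rounded values printed in the output, so the result matches the table only up to rounding; the variable names are made up for illustration.

```r
# Values copied from the Height row of the coefficient table.
b1 <- 0.052119     # estimated slope
se_b1 <- 0.001035  # standard error of the slope
df <- 652          # residual degrees of freedom

# Test statistic for H0: slope = 0 versus Ha: slope != 0.
t_stat <- b1 / se_b1

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df)

print(c(t_stat = t_stat, p_value = p_value))
```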
7. For the following questions, suppose a transformation was NOT performed. Answer these questions on the original scale.
a. Suppose we wanted to predict FEV for a child with a height of 55 inches. Use R to obtain a 95% prediction interval for a height of 55 inches. (A prediction interval is a confidence interval for the predicted value.) Interpret this interval.
### PREDICT ###
newdata=data.frame(Height=c(55))
predict.lm(fev_mod, newdata, interval="prediction")
## fit lwr upr
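## 1 1.825978 0.9789022 2.673053
We predict with 95% confidence that an individual child who is 55 inches tall will have an FEV between about 0.98 and 2.67 L.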
b. Suppose we wanted to know the average FEV for all children who are 55 inches tall. Use R to construct a 95% confidence interval for the average (called a confidence interval for the mean response). Interpret this value.
### PREDICT ###
newdata=data.frame(Height=c(55))
predict.lm(fev_mod, newdata, interval="confidence")
## fit lwr upr
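## 1 1.825978 1.777354 1.874601
We are 95% confident that the mean FEV of all children who are 55 inches tall is between about 1.78 and 1.87 L.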
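The prediction interval is much wider than the confidence interval for the mean because it also accounts for the scatter of individual children around the line. The sketch below illustrates that relationship using only numbers printed in the R output for the original-scale fit (the fitted value, the confidence limits, and the residual standard error of 0.4307). The variable names are made up for illustration, and the result agrees with `predict.lm()` only up to rounding.

```r
# Values copied from the R output for fev_mod.
df <- 652                    # residual degrees of freedom
s <- 0.4307                  # residual standard error
fit <- 1.825978              # fitted FEV at Height = 55
ci <- c(1.777354, 1.874601)  # 95% CI for the mean response

t_crit <- qt(0.975, df)

# Standard error of the fitted mean, recovered from the CI width.
se_fit <- (ci[2] - ci[1]) / (2 * t_crit)

# A new observation adds the residual variance to the uncertainty.
se_pred <- sqrt(s^2 + se_fit^2)

pi_by_hand <- fit + c(-1, 1) * t_crit * se_pred
print(round(pi_by_hand, 3))
```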
c. Estimate \( \sigma \), the standard deviation of the residuals. (It must be constant for this value to make sense.)
## Residual standard error: 0.4307
But this doesn't make sense because the variation is not constant across heights.
Analyzing groups separately:
At times, it might be desirable and even necessary to analyze groups of observations separately. For example, it may be desirable to look at the relationship between FEV and Height separately for male children and female children to see if the relationship is the same for boys and girls.
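One way to fit the regression separately for each group is to split the data frame and fit one model per piece. The sketch below uses synthetic data, not the CRD Study data, and the column name `Sex` is a hypothetical stand-in, since the data set documentation is not reproduced here. Both simulated groups share the same true line, so the two sets of estimates should come out similar.

```r
# Synthetic stand-in for the FEV data (NOT the real CRD Study data).
# The column name Sex is hypothetical.
set.seed(2)
n <- 200
sim <- data.frame(
  Sex = rep(c("Female", "Male"), each = n / 2),
  Height = runif(n, min = 46, max = 74)
)
sim$FEV <- exp(-2.27 + 0.052 * sim$Height + rnorm(n, sd = 0.15))

# Fit ln(FEV) ~ Height separately within each group.
fits <- lapply(split(sim, sim$Sex),
               function(d) lm(log(FEV) ~ Height, data = d))

# Collect the intercepts and slopes side by side for comparison.
coefs <- sapply(fits, coef)
print(round(coefs, 4))
```

Noticeably different slopes or intercepts between the groups would suggest that the relationship between FEV and height is not the same for boys and girls.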