library("Hmisc")
library("ggplot2")  # provides qplot(), which is used for the plots below
getHdata(FEV)
  1. The FEV data set contains the outcome variable forced expiratory volume (fev) and additional variables: age in years (age), height in inches (height), male or female sex (sex), and current or non-current smoking status (smoke). I would expect to see an association between age and FEV, because it seems that lungs would get stronger as children age. For height, I would expect taller people to have greater lung volume and therefore higher FEV. Sex seems likely to be associated with FEV as well, since men tend to be stronger than women. Finally, I would expect smoking to be negatively associated with FEV, since smoking damages the lungs.

  2. I think that smoking and age could interact to influence FEV, because people who are older might have been smoking longer and would therefore have worse lung function. I also think that smoking and sex could interact to influence FEV, because men and women have different metabolisms and hormones. My understanding is that women are more vulnerable to lung carcinogens than men are, so I think that women’s FEV would be more heavily influenced by smoking.

  3. I do think that linear relationships would be sufficient to explain these associations. The graphs show a linear trend, and the variation of the points appears constant:

plot(FEV$age,FEV$fev,xlab="Age (Years)",ylab="Forced Expiratory Volume",main="Age and Forced Expiratory Volume",pch=16)

plot(FEV$height,FEV$fev,xlab="Height (Inches)",ylab="Forced Expiratory Volume",main="Height and Forced Expiratory Volume",pch=16)

qplot(sex,fev,data=FEV)
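
The linearity judgment above is made by eye. One way to back it up is to compare a straight-line fit with a fit that adds a squared term. The sketch below is not part of the original analysis: `check_curvature` is a hypothetical helper name, and the usage lines assume `FEV` has been loaded with `getHdata(FEV)` as above.

```r
# Hypothetical helper, not part of the original analysis: compares a
# straight-line fit with a quadratic fit for a single numeric predictor.
check_curvature <- function(data, outcome, predictor) {
  squared <- sprintf("I(%s^2)", predictor)
  linear_fit    <- lm(reformulate(predictor, response = outcome), data = data)
  quadratic_fit <- lm(reformulate(c(predictor, squared), response = outcome),
                      data = data)
  # Nested-model F test: a small p-value suggests the squared term adds
  # explanatory power beyond the straight line.
  anova(linear_fit, quadratic_fit)
}

# Usage, assuming FEV was loaded with getHdata(FEV) as above:
# check_curvature(FEV, "fev", "height")
# check_curvature(FEV, "fev", "age")
```

If the F test for the squared term gives a large p-value, that supports treating the relationship as linear; a small one suggests that a curved term or a transformation may be worth considering.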

  4. I created a few models to try to explain some of the variation in FEV. First, I included each of the variables without any interaction terms (fev ~ age + smoke + sex + height), but I suspected that this model might not be the best way to describe the data. Next, I incorporated some interaction terms, since some effect modification may be happening between variables. I made two models to test this: fev ~ sex + height + age * smoke and fev ~ age + height + sex * smoke. Finally, I created a model without the smoking variable, since its coefficient was not statistically significant in two of my previous models; I modeled fev ~ age + height * sex.
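
The four candidate models described above can be fitted and compared in one step. This is a minimal sketch rather than the code originally used: `compare_adj_r2` and the list names are hypothetical, and the usage line assumes `FEV` has been loaded with `getHdata(FEV)` as above.

```r
# Hypothetical helper, not part of the original analysis: fits each formula
# with lm() and returns the adjusted R-squared for side-by-side comparison.
compare_adj_r2 <- function(formulas, data) {
  sapply(formulas, function(f) summary(lm(f, data = data))$adj.r.squared)
}

# The four candidate models described above, written as R formulas.
candidates <- list(
  main_effects  = fev ~ age + smoke + sex + height,
  age_by_smoke  = fev ~ sex + height + age * smoke,
  sex_by_smoke  = fev ~ age + height + sex * smoke,
  height_by_sex = fev ~ age + height * sex
)

# Usage, assuming FEV was loaded with getHdata(FEV) as above:
# round(compare_adj_r2(candidates, FEV), 4)
```

Adjusted R-squared is used here because the models have different numbers of terms, and plain R-squared never decreases when a term is added.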

Here is the R code and output for my two best-fitting models, along with histograms of their residuals. I do not see a problematic pattern in either model’s residuals, which are centered near zero:

mlr1<-lm(fev~age+height+sex*smoke,data=FEV)
summary(mlr1)
## 
## Call:
## lm(formula = fev ~ age + height + sex * smoke, data = FEV)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.38404 -0.25547  0.00456  0.24666  1.93203 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -4.421924   0.222835 -19.844  < 2e-16 ***
## age                          0.065990   0.009465   6.972 7.73e-12 ***
## height                       0.103734   0.004750  21.840  < 2e-16 ***
## sexmale                      0.135409   0.034638   3.909 0.000102 ***
## smokecurrent smoker         -0.183075   0.074189  -2.468 0.013857 *  
## sexmale:smokecurrent smoker  0.234147   0.109605   2.136 0.033032 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4111 on 648 degrees of freedom
## Multiple R-squared:  0.7769, Adjusted R-squared:  0.7752 
## F-statistic: 451.4 on 5 and 648 DF,  p-value: < 2.2e-16
qplot(mlr1$residuals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
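
Because this model includes a sex-by-smoking interaction, the estimated smoking effect differs by sex. The arithmetic below works it out from the coefficients printed in the summary above. The variable names are introduced here for illustration only.

```r
# Coefficients copied from the summary(mlr1) output above.
b_smoke     <- -0.183075   # smokecurrent smoker
b_sex_smoke <-  0.234147   # sexmale:smokecurrent smoker

# Estimated difference in FEV, current smoker vs. non-current smoker,
# holding age and height fixed. Female is the reference level for sex.
smoke_effect_female <- b_smoke
smoke_effect_male   <- b_smoke + b_sex_smoke

round(c(female = smoke_effect_female, male = smoke_effect_male), 3)
```

The printed table gives a standard error for each coefficient but not for the male sum. Judging whether the male estimate differs from zero would also need the covariance between the two coefficients, which is available from `vcov(mlr1)`.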

mlr2<-lm(fev~age+height*sex,data=FEV)
summary(mlr2)
## 
## Call:
## lm(formula = fev ~ age + height * sex, data = FEV)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25462 -0.24277  0.00342  0.24299  1.85454 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.098624   0.328952  -9.420  < 2e-16 ***
## age             0.066850   0.008929   7.487 2.31e-13 ***
## height          0.081243   0.006304  12.888  < 2e-16 ***
## sexmale        -1.808288   0.360662  -5.014 6.89e-07 ***
## height:sexmale  0.032418   0.005913   5.483 6.00e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4037 on 649 degrees of freedom
## Multiple R-squared:  0.7846, Adjusted R-squared:  0.7833 
## F-statistic:   591 on 4 and 649 DF,  p-value: < 2.2e-16
qplot(mlr2$residuals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
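
A histogram shows the shape of the residual distribution, but it cannot reveal a trend across the range of predictions. As a complementary check, the residuals can be plotted against the fitted values. This is a sketch that was not part of the original analysis; `plot_resid_vs_fitted` is a hypothetical helper name.

```r
# Hypothetical helper, not part of the original analysis: plots residuals
# against fitted values for any lm fit and adds a lowess smoother.
plot_resid_vs_fitted <- function(fit, title = "Residuals vs fitted") {
  plot(fitted(fit), resid(fit),
       xlab = "Fitted FEV", ylab = "Residual", main = title, pch = 16)
  abline(h = 0, lty = 2)                               # reference line at zero
  lines(lowess(fitted(fit), resid(fit)), col = "red")  # smoothed trend
  invisible(fit)
}

# Usage with the two models fitted above:
# plot_resid_vs_fitted(mlr1, "fev ~ age + height + sex * smoke")
# plot_resid_vs_fitted(mlr2, "fev ~ age + height * sex")
```

A smoother that stays close to the dashed zero line, with roughly even vertical spread, would support the linearity and constant-variance assumptions. Base R's `plot(mlr2, which = 1)` draws a similar diagnostic.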

  5. Of all my models, I think fev ~ age + height * sex is the best fit. This model had the highest adjusted R-squared value of all my models, meaning that it explains more of the variation in FEV than the others do, and each of the model’s coefficients is statistically significant. To me, the most interesting aspect of this model is the interaction between height and sex, because it indicates that height has a different effect on FEV for men than it does for women. Here is a graph of this interaction:
interaction.plot(FEV$height, FEV$sex, FEV$fev)
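
The height-by-sex interaction can also be read directly from the coefficients of the fev ~ age + height * sex model reported earlier. The arithmetic below is a small worked example; the variable names are introduced here for illustration only.

```r
# Coefficients copied from the summary(mlr2) output above.
b_height      <-  0.081243   # height (slope for females, the reference level)
b_sexmale     <- -1.808288   # sexmale (male-female gap at height 0)
b_height_male <-  0.032418   # height:sexmale (extra slope for males)

# Estimated change in FEV per additional inch of height, at a fixed age.
slope_female <- b_height
slope_male   <- b_height + b_height_male

# Height at which the fitted male and female lines cross (same age).
crossover_height <- -b_sexmale / b_height_male

round(c(slope_female = slope_female, slope_male = slope_male,
        crossover_height = crossover_height), 3)
```

The large negative `sexmale` coefficient on its own is easy to misread: it is the male-female gap extrapolated to a height of zero inches. Taken together with the steeper male slope, the model predicts higher FEV for males only above the crossover height.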

If I were to collect similar data for a new study, I would collect more detailed information on smoking (e.g. packs smoked) to get a better picture of whether smoking influences FEV among children. I might also collect information on parents’ smoking status, since children are likely exposed to smoke if their parents are smokers. Finally, I might collect data on weight to see if weight or BMI influences FEV.