Exercise 1

I expect a strong, negative, linear relationship between FEV and age, because lung function should slowly decline as the body ages. I expect a moderate, positive, linear relationship between FEV and height, because taller people are generally larger and would likely have a greater lung capacity, producing a higher FEV. I also expect males to have slightly higher FEV values than females on average, since men are typically bigger than women and would likely have larger, more powerful lungs. Lastly, I expect smokers to have a significantly lower FEV than non-smokers on average, since smoking should leave their lungs much less healthy.

Exercise 2

One interaction that may exist in the relationship between FEV and height, once smoking status is also considered, is that FEV may rise more steeply with height for non-smokers than for smokers. As people get taller, their lungs tend to grow and can likely hold more air. If smoking dampens that gain, a non-smoker could have an even higher FEV than a smoker of the same height.

Another interaction that may exist in the relationship between FEV and gender, once smoking status is also considered, is that smoking may lower FEV more drastically in females than in males. Smoking greatly impairs lung function, and females are generally smaller than males and would likely have a smaller lung capacity. As a result, the difference in average FEV between female smokers and female non-smokers may be larger than the difference between male smokers and male non-smokers.
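Both of the interactions described above could be examined by adding product terms to a linear model. The following is only a minimal sketch on simulated stand-in data, not the class data set: the data frame `sim`, the height range, the level name "non-current smoker", and the numbers used to generate `fev` are all assumptions made for illustration. Only the variable names (`fev`, `height`, `sex`, `smoke`) mirror the ones used later in this report.

```r
# Simulated stand-in data (hypothetical; NOT the real FEV data set)
set.seed(1)
n <- 200
sim <- data.frame(
  height = runif(n, 46, 74),  # assumed range, assumed to be inches
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  smoke  = factor(sample(c("non-current smoker", "current smoker"),
                         n, replace = TRUE),
                  levels = c("non-current smoker", "current smoker"))
)
# Made-up generating model with no true interaction built in
sim$fev <- -5.4 + 0.13 * sim$height + rnorm(n, sd = 0.4)

# height * smoke expands to height + smoke + height:smoke,
# so the height slope is allowed to differ by smoking status
int_height <- lm(fev ~ height * smoke, data = sim)

# sex * smoke lets the smoker vs. non-smoker gap differ by sex
int_sex <- lm(fev ~ sex * smoke, data = sim)

# The rows whose names contain ":" are the interaction terms
coef(summary(int_height))
coef(summary(int_sex))
```

In a fit like this, the `height:smoke` row estimates how much the height slope changes for current smokers, and the `sex:smoke` row estimates how much the smoking gap changes for males relative to females.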

Exercise 3

Simple Plots

I believe that a linear relationship is enough to describe the Height vs. FEV data, since the scatterplot follows a very linear trend.

I believe that a linear relationship would not be useful for describing the Gender vs. FEV data, since gender is a categorical variable while FEV is a quantitative variable. Instead, comparing the medians and ranges of the box-and-whisker plots is a reasonable way to describe the data.
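The plots themselves are not reproduced here. As a rough sketch of how the two displays could be drawn in base R, the block below uses simulated stand-in data (the data frame `sim` and every number in it are hypothetical, not the class data): a scatterplot with a least-squares line for the quantitative predictor, and side-by-side boxplots with a five-number summary for the categorical one.

```r
# Simulated stand-in data (hypothetical; NOT the real FEV data set)
set.seed(2)
n <- 150
sim <- data.frame(
  height = runif(n, 46, 74),  # assumed range
  sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$fev <- -5.4 + 0.13 * sim$height + rnorm(n, sd = 0.4)

# Quantitative predictor: scatterplot plus the least-squares line
plot(sim$height, sim$fev, main = "FEV vs. Height",
     xlab = "Height", ylab = "FEV")
abline(lm(fev ~ height, data = sim))

# Categorical predictor: side-by-side boxplots instead of a fitted line
boxplot(fev ~ sex, data = sim, main = "FEV vs. Gender",
        xlab = "Gender", ylab = "FEV")

# Numeric companion to the boxplots: min, lower hinge, median,
# upper hinge, and max of FEV within each group
by_sex <- tapply(sim$fev, sim$sex, fivenum)
by_sex
```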

Exercise 4

First model:

The first model of interest has FEV as the response variable, predicted by both height and smoking status. The summary is shown for this model only; models 2, 3, and 4 were analyzed in the same fashion, but their summaries are not displayed.

lm1 <- lm(fev~1+height+smoke)
summary(lm1)
## 
## Call:
## lm(formula = fev ~ 1 + height + smoke)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7505 -0.2660 -0.0041  0.2447  2.1207 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.427620   0.187577 -28.935   <2e-16 ***
## height               0.131883   0.003081  42.808   <2e-16 ***
## smokecurrent smoker  0.006319   0.058686   0.108    0.914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.431 on 651 degrees of freedom
## Multiple R-squared:  0.7537, Adjusted R-squared:  0.7529 
## F-statistic: 995.9 on 2 and 651 DF,  p-value: < 2.2e-16

Second Model:

The second model of interest has FEV as the response variable, predicted by both age and smoking status.

lm2 <- lm(fev~1+age+smoke)

Third Model:

The third model of interest has FEV as the response variable, predicted by height, age, and smoking status.

lm3 <- lm(fev~1+age+height+smoke)

Fourth Model:

The fourth model of interest has FEV as the response variable, predicted by height, age, gender, and smoking status.

lm4 <- lm(fev~1+age+height+sex+smoke)

lm2 gave by far the lowest R-squared value, 0.575, so it was set aside; the coefficients of the other models all had small decimal values close to zero. The R-squared value was 0.753 for lm1, 0.767 for lm3, and 0.774 for lm4. Therefore, the top two models were lm3 and lm4. When comparing the residual plots from lm3 and lm4, both models appear to satisfy the regression conditions: the residuals are nearly normal, have constant variability, and are independent, and each variable is linearly related to the outcome. Below is one example of a Normal Q-Q Plot of the residuals from lm4:

qqnorm(resid(lm4))
qqline(resid(lm4))
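
The model comparison above rests on R-squared and adjusted R-squared. As a sketch of how these values could be tabulated for all four models at once, the block below refits the four formulas on simulated stand-in data. The data frame `sim` and its generating numbers are hypothetical, so the printed values will not match the ones reported above. It also recomputes adjusted R-squared by hand from the usual definition, 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of non-intercept coefficients, to show where the penalty for extra predictors comes from.

```r
# Simulated stand-in data (hypothetical; NOT the real FEV data set)
set.seed(3)
n <- 300
sim <- data.frame(age = sample(3:19, n, replace = TRUE))
sim$sex   <- factor(sample(c("female", "male"), n, replace = TRUE))
sim$smoke <- factor(ifelse(sim$age >= 10 & runif(n) < 0.2,
                           "current smoker", "non-current smoker"),
                    levels = c("non-current smoker", "current smoker"))
sim$height <- 45 + 1.5 * sim$age + rnorm(n, sd = 2.5)
sim$fev <- -5 + 0.11 * sim$height + 0.05 * sim$age +
  0.1 * (sim$sex == "male") + rnorm(n, sd = 0.4)

# The same four model structures as in this report
models <- list(
  lm1 = lm(fev ~ height + smoke, data = sim),
  lm2 = lm(fev ~ age + smoke, data = sim),
  lm3 = lm(fev ~ age + height + smoke, data = sim),
  lm4 = lm(fev ~ age + height + sex + smoke, data = sim)
)

# One row per model: R-squared, adjusted R-squared from summary(),
# and adjusted R-squared recomputed from its definition
comparison <- t(sapply(models, function(m) {
  s <- summary(m)
  p <- length(coef(m)) - 1  # predictors, excluding the intercept
  by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, adj_r2_by_hand = by_hand)
}))
round(comparison, 3)
```

Plain R-squared can never decrease when a predictor is added to a nested model, which is why the adjusted version is the fairer yardstick when the models differ in size.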

Exercise 5

The Best Model for This Data: lm4

I believe lm4 is the “best” model for this data, since its adjusted R-squared value is the highest of all the models, and since its residuals are nearly normal, have constant variability, and are independent, and each variable is linearly related to the outcome. The adjusted R-squared value is of the utmost importance, since it applies a penalty for the number of predictors included in the model. Even though lm4 had the most predictors of all the models, it still had the highest adjusted R-squared value after that penalty was applied, which is how we tend to choose the “best” models. One of the most interesting relationships between the covariates and FEV is the relationship between smoking status and FEV. I assumed, since smoking is detrimental to lung health, that non-smokers would have much higher FEV values than current smokers, since they would likely have healthier, more efficient lungs. However, when viewing the boxplot for FEV vs. Smoking Status, a different trend emerges:

plot(smoke, fev, main="FEV vs. Smoking Status",xlab="Smoking Status", ylab="FEV (Volume in 1 second)")

It appears that on average, smokers actually tend to have higher FEV values than non-smokers. There are some outliers among the non-smokers with much higher FEV values than the smokers, but for the most part the smokers tend to have higher FEV values. This may be because the age range for the data is quite young, with ages ranging only from less than 5 years old to around 20 years old. With this in mind, there are probably not many 5-year-olds who smoke cigarettes. They are therefore classified as non-smokers, but would likely have much lower FEV values than a 20-year-old who does smoke, simply because age seems to have a much stronger effect on FEV than smoking status does.
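
The explanation above, that age rather than smoking drives the apparent advantage of smokers, can be illustrated with a small simulation. In the made-up data below, smoking is given a small negative effect on FEV by construction (an assumption for illustration, not an estimate from the real data), and only subjects aged 10 or older can be smokers. The variables `age`, `smoker`, and `fev` are simulated here and are not the class data.

```r
# Simulated illustration of confounding by age (all numbers are made up)
set.seed(4)
n <- 2000
age <- sample(3:19, n, replace = TRUE)

# Only subjects aged 10 or older can be smokers in this simulation
smoker <- as.integer(age >= 10 & runif(n) < 0.3)

# FEV rises with age; smoking lowers it slightly by construction
fev <- 0.5 + 0.2 * age - 0.2 * smoker + rnorm(n, sd = 0.4)

# Raw comparison of group means, as a boxplot would show it:
# smokers look better only because they are older on average
raw_gap <- mean(fev[smoker == 1]) - mean(fev[smoker == 0])

# Comparison at the same age: the smoker coefficient in a model
# that also includes age
adj_gap <- coef(lm(fev ~ age + smoker))[["smoker"]]

c(raw = raw_gap, age_adjusted = adj_gap)
```

In this simulation the raw gap comes out positive while the age-adjusted gap comes out negative, which is the same pattern of reasoning used above to explain the boxplot.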

If I were to collect similar data in a new study, some additional variables I would be interested in collecting are Body Mass Index (BMI), the type of area the person is from (i.e., urban, suburban, or rural), and race. BMI would likely be a very useful variable to include, since people with lower BMIs tend to be healthier than those with higher BMIs, and someone who is healthier and less overweight would likely have better overall body and lung function than someone who is overweight. The type of area a person is from (urban, suburban, or rural) may also influence FEV, since those who live in more urban environments tend to be exposed to more air pollution, which decreases lung function, than those who live in rural environments. One last variable that would be interesting to investigate is race. I do not know if race would have any effect at all on FEV, but it would be very interesting to see if there were any major disparities in FEV between different racial groups who lived in the same area, or who both smoked, etc.