Exercise 1: I would expect FEV to decrease with age and with smoking. Both factors act directly on the lung tissue and could weaken the lungs. I think that a person’s height or sex would have little influence on FEV, because I don’t see a logical connection between lung strength or health and height or sex. However, it is possible that height and sex would influence lung capacity.

Exercise 2: Age and smoking status may interact in their effect on FEV. Perhaps the older you are, the more severely smoking affects you, because there may be more years of accumulated damage. Sex and height may also interact in their effect on FEV. Assuming men have larger lung capacities than women, there is more tissue to damage, so the same amount of damage would affect men more slowly than women.

Exercise 3:

Based on these plots, I think these relationships could be explained by a linear model, because age and FEV appear to have a linear relationship. However, the relationship between height and FEV could be either linear or exponential, depending on how the trend continues at greater heights.

Exercise 4: Possible models:

fev ~ age*smoke

fev ~ sex*height

## 
## Call:
## lm(formula = fev ~ sex * height, data = FEV)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54654 -0.25282  0.00649  0.25666  2.00491 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.318219   0.297637 -14.508  < 2e-16 ***
## sexmale        -1.545629   0.373843  -4.134 4.02e-05 ***
## height          0.112426   0.004928  22.815  < 2e-16 ***
## sexmale:height  0.027457   0.006119   4.487 8.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4204 on 650 degrees of freedom
## Multiple R-squared:  0.766,  Adjusted R-squared:  0.7649 
## F-statistic: 709.2 on 3 and 650 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = fev ~ age * smoke, data = FEV)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76645 -0.34947 -0.03364  0.33679  2.05990 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.253396   0.082651   3.066  0.00226 ** 
## age                      0.242558   0.008332  29.113  < 2e-16 ***
## smokecurrent smoker      1.943571   0.414285   4.691 3.31e-06 ***
## age:smokecurrent smoker -0.162703   0.030738  -5.293 1.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5537 on 650 degrees of freedom
## Multiple R-squared:  0.5941, Adjusted R-squared:  0.5922 
## F-statistic: 317.1 on 3 and 650 DF,  p-value: < 2.2e-16
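
To make the interaction terms easier to read, the sketch below turns the reported estimates into one fitted line per group. The coefficient values are copied from the summaries above; the variable names (`b_sh`, `b_as`, `lines_sex`, `lines_smoke`) are illustrative, and the group labels are assumptions, since the output only shows that `male` and `current smoker` are the non-reference levels.

```r
# Estimates copied from the two summary tables above.
b_sh <- c(intercept = -4.318219, sexmale = -1.545629,
          height = 0.112426, interaction = 0.027457)
b_as <- c(intercept = 0.253396, age = 0.242558,
          smoker = 1.943571, interaction = -0.162703)

# Intercept and slope of the fitted line for each group.
# Row 1 is the reference level; row 2 adds the dummy and interaction terms.
# The reference-level labels are assumptions, not taken from the output.
lines_sex <- data.frame(
  group     = c("reference (assumed female)", "male"),
  intercept = c(b_sh[["intercept"]], b_sh[["intercept"]] + b_sh[["sexmale"]]),
  slope     = c(b_sh[["height"]],    b_sh[["height"]] + b_sh[["interaction"]])
)
lines_smoke <- data.frame(
  group     = c("reference (assumed non-smoker)", "current smoker"),
  intercept = c(b_as[["intercept"]], b_as[["intercept"]] + b_as[["smoker"]]),
  slope     = c(b_as[["age"]],       b_as[["age"]] + b_as[["interaction"]])
)

# Predictor value at which the two fitted lines in each model cross.
cross_height <- -b_sh[["sexmale"]] / b_sh[["interaction"]]
cross_age    <- -b_as[["smoker"]]  / b_as[["interaction"]]

print(lines_sex)
print(lines_smoke)
print(c(cross_height = cross_height, cross_age = cross_age))
```

With these estimates, the fitted slope is about 0.140 per unit of height for males versus 0.112 for the reference group, and about 0.080 per year of age for current smokers versus 0.243 for the reference group. The two pairs of lines cross near a height of 56.3 and an age of 11.9. The large positive smoker coefficient is therefore the gap between the lines extrapolated to age 0, not evidence that smoking raises FEV.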

Based on the lecture notes, the model conditions are nearly normal residuals, constant variability in the residuals, independent residuals, and a linear relationship between each variable and the outcome. For both models, the first diagnostic plot, residuals vs. fitted, reveals no problems. The smooth red line is relatively flat and lies close to the dashed gray line, indicating that a linear model is a good choice here. The second plot, normal Q-Q, shows that both models have approximately normal residuals. The third plot, scale-location, shows whether a model is homoscedastic or heteroscedastic. Ideally, the model is homoscedastic. The sex*height model appears homoscedastic, but the age*smoke model may not be: there is a definite upward trend in its scale-location plot. Finally, the last plot depicts residuals versus leverage. Ideally, this plot would show no outliers with more influence than average, and in both cases this is what is seen. Therefore, the biggest problem is the heteroscedasticity in the age*smoke model. To attempt to resolve this, I applied a log transformation to age.
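
The four plots discussed above are the default output of `plot()` on an `lm` object. The sketch below shows how they can be produced and how the scale-location trend can be summarised with one number. It runs on simulated data, not the real FEV dataset: the data frame `sim`, its parameter values, the level name `non-current smoker`, and the helper `spread_trend` are all hypothetical and chosen only so that the raw-scale fit has error spread that grows with the fitted value.

```r
# Simulated stand-in for the FEV data (hypothetical values, NOT the real dataset).
set.seed(1)
n <- 400
sim <- data.frame(
  age   = sample(3:19, n, replace = TRUE),
  smoke = factor(
    sample(c("non-current smoker", "current smoker"), n,
           replace = TRUE, prob = c(0.9, 0.1)),
    levels = c("non-current smoker", "current smoker")
  )
)
# Multiplicative error: spread grows with the mean on the raw scale,
# but is constant on the log scale.
sim$fev <- 0.35 * sim$age^0.88 * exp(rnorm(n, sd = 0.2))

fit_raw <- lm(fev ~ age * smoke, data = sim)
fit_log <- lm(log(fev) ~ log(age) * smoke, data = sim)

# Default diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage.
par(mfrow = c(2, 2))
plot(fit_raw)
plot(fit_log)

# Rough numeric companion to the scale-location plot: correlation between
# absolute residuals and fitted values (near 0 suggests constant spread).
spread_trend <- function(fit) cor(abs(resid(fit)), fitted(fit))
spread_raw <- spread_trend(fit_raw)
spread_log <- spread_trend(fit_log)
print(c(raw = spread_raw, log_log = spread_log))
```

On data built this way, the raw-scale fit should show a clearly positive correlation and the log-log fit a value near zero, which is the pattern an upward scale-location trend and its removal would produce.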

I then transformed the data again by applying a log scale to both FEV and age. This improved the adjusted R-squared and changed the diagnostic plots in a positive way. The adjusted R-squared rose to 0.6387, from 0.5734 (log age only) and 0.5922 (no log applied). The normal Q-Q plot is a tighter fit. In the residuals vs. fitted, scale-location, and residuals vs. leverage plots, the smoothed lines all lie close to horizontal. In conclusion, the model using log(age) and log(fev) is the best solution to the problems outlined above.

## 
## Call:
## lm(formula = logfev ~ logage * smoke)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59173 -0.13370  0.00522  0.13817  0.55531 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -1.06624    0.06107 -17.460  < 2e-16 ***
## logage                      0.88380    0.02736  32.304  < 2e-16 ***
## smokecurrent smoker         1.32800    0.37972   3.497 0.000502 ***
## logage:smokecurrent smoker -0.53676    0.14697  -3.652 0.000281 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2003 on 650 degrees of freedom
## Multiple R-squared:  0.6404, Adjusted R-squared:  0.6387 
## F-statistic: 385.8 on 3 and 650 DF,  p-value: < 2.2e-16
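
Because both FEV and age are logged, the fitted model is a power law on the original scale: FEV = exp(intercept) * age^slope. The sketch below back-transforms the reported estimates. It assumes `logfev` and `logage` were created with the natural log (R's default `log()`), which the output does not show; `predict_fev` is a hypothetical helper, not part of the original analysis.

```r
# Estimates copied from the log-log summary above.
b <- c(intercept = -1.06624, logage = 0.88380,
       smoker = 1.32800, interaction = -0.53676)

# Back-transformed prediction on the original FEV scale.
# Assumes natural logs were used for both logfev and logage.
predict_fev <- function(age, smoker = FALSE) {
  log_fev <- b[["intercept"]] + b[["logage"]] * log(age) +
    smoker * (b[["smoker"]] + b[["interaction"]] * log(age))
  exp(log_fev)
}

# Exponent of age in each group's power law.
exponent_ref    <- b[["logage"]]
exponent_smoker <- b[["logage"]] + b[["interaction"]]

# Multiplicative change in predicted FEV when age doubles (reference group).
doubling_factor <- 2^exponent_ref

print(predict_fev(10))
print(c(reference = exponent_ref, smoker = exponent_smoker))
print(doubling_factor)
```

Under that assumption, the predicted FEV for a 10-year-old in the reference group is about 2.63, and doubling age multiplies predicted FEV by about 1.85 in the reference group. The age exponent falls from about 0.88 to about 0.35 for current smokers.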

Exercise 5: Despite the improvements to the age*smoke model, it is still not the best model. Based on the rationale given in Exercise 4 and the adjusted R-squared values (0.7649 for sex*height, versus 0.5922 for age*smoke and 0.6387 for its log-log version), the best model I created is sex*height, in which sex and height interact in their effect on FEV.
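
As a check on the numbers being compared, the adjusted R-squared values can be reproduced from the multiple R-squared values in the summaries using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below does this; `adj_r2` is a hypothetical helper, and n is inferred from the 650 residual degrees of freedom plus the 4 estimated coefficients.

```r
# Adjusted R-squared from multiple R-squared, sample size n,
# and number of predictors p (excluding the intercept).
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 650 + 4  # residual df + number of estimated coefficients
r2 <- c(sex_height = 0.766, age_smoke = 0.5941, log_age_smoke = 0.6404)

adj <- round(adj_r2(r2, n, p = 3), 4)
print(adj)
```

This returns 0.7649, 0.5922, and 0.6387, matching the summaries. The first two values are directly comparable because both models have FEV as the response. The log-log value measures variance explained in log(FEV), so its comparison with the raw-scale models is only a rough guide.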

Additional variables I would want to collect for this dataset include weight and any preexisting respiratory diseases. I would also want a more varied population overall, specifically by age. The mean age of participants is around 10 years, and a more diverse and representative population would be useful.