Exercise 1

## 
## Call:
## lm(formula = fev ~ age + smoke + height + sex, data = FEV)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37656 -0.25033  0.00894  0.25588  1.92047 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.456974   0.222839 -20.001  < 2e-16 ***
## age                  0.065509   0.009489   6.904 1.21e-11 ***
## smokecurrent smoker -0.087246   0.059254  -1.472    0.141    
## height               0.104199   0.004758  21.901  < 2e-16 ***
## sexmale              0.157103   0.033207   4.731 2.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4122 on 649 degrees of freedom
## Multiple R-squared:  0.7754, Adjusted R-squared:  0.774 
## F-statistic:   560 on 4 and 649 DF,  p-value: < 2.2e-16

The intercept is the predicted FEV when every variable equals zero: a female non-smoker of age 0 and height 0 would be expected to have an FEV (forced expiratory volume in one second) of -4.46. Clearly, this does not make sense in the real world. The age coefficient means that, all else held constant, people who are 1 year older tend to have an FEV about 0.07 higher. The smoke coefficient means that, all else held constant, current smokers tend to have an FEV about 0.09 lower than non-smokers. The height coefficient means that, all else held constant, people who are one inch taller tend to have an FEV about 0.10 higher. Lastly, the sex coefficient means that, all else held constant, males tend to have an FEV about 0.16 higher than females.
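
As a quick check on these interpretations, the fitted coefficients can be plugged in by hand. This is only an illustrative sketch: the helper `predict_fev` and the example child (10 years old, 60 inches tall, male) are assumptions made for the example, not part of the original analysis.

```r
# Coefficients copied from the regression summary above
b <- c(intercept = -4.456974, age = 0.065509, smoke = -0.087246,
       height = 0.104199, male = 0.157103)

# Hypothetical helper: predicted FEV for given covariate values
# (smoke and male are 0/1 indicators)
predict_fev <- function(age, smoke, height, male) {
  unname(b["intercept"] + b["age"] * age + b["smoke"] * smoke +
           b["height"] * height + b["male"] * male)
}

at_zero   <- predict_fev(age = 0, smoke = 0, height = 0, male = 0)
boy       <- predict_fev(age = 10, smoke = 0, height = 60, male = 1)
boy_smoke <- predict_fev(age = 10, smoke = 1, height = 60, male = 1)

at_zero          # the intercept: an extrapolation to age 0 and height 0
boy              # prediction for the hypothetical non-smoking boy
boy_smoke - boy  # the smoke coefficient, all else held constant
```

The last line shows what "all else held constant" means in practice: changing only the smoking indicator moves the prediction by exactly the smoke coefficient.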

Exercise 2

In these data, smokers have higher FEV values than non-smokers because they are much older. Being older, they have much larger lung capacity and can produce higher FEV values, even though smoking impairs their lungs. When smoking status is shuffled, the current-smoker boxplot looks much more like the non-current-smoker boxplot. It shifts down noticeably along the FEV axis, suggesting that FEV values for the shuffled smokers and non-smokers are very similar. Shuffling breaks the link between smoking status and age, so the age effect described above no longer comes into play as much.
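
What shuffling does to the age gap can be sketched with simulated data. Everything here is made up for illustration (the variable names, the sample size, and the rule that only children older than 12 can be smokers), and base R's `sample()` stands in for the `shuffle()` used in the assignment.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated (not real) data: only children older than 12 can be smokers
n     <- 300
age   <- sample(3:19, n, replace = TRUE)
smoke <- ifelse(age > 12 & runif(n) < 0.4, "smoker", "non-smoker")

# Permute the smoking labels: the group sizes stay the same,
# but the labels are no longer tied to age
smoke_shuffled <- sample(smoke)

# Mean age by group, before and after shuffling
tapply(age, smoke, mean)
tapply(age, smoke_shuffled, mean)
```

Before shuffling, the simulated smokers are older by construction. After shuffling, the labels are assigned at random, so any remaining age gap between the groups is due to chance alone.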

Exercise 3

The linear model was refit 50, 250, and 500 times, each time with smoke shuffled. The summary of the 500-shuffle fits is shown:

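library(mosaic)  # provides do() and shuffle()
# Refit the model with smoke shuffled; each row of the result holds one fit
s50 <- do(50)*lm(fev ~ age + shuffle(smoke) + height + sex, data=FEV)
s250 <- do(250)*lm(fev ~ age + shuffle(smoke) + height + sex, data=FEV)
s500 <- do(500)*lm(fev ~ age + shuffle(smoke) + height + sex, data=FEV)
summary(s500)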
##    Intercept           age          smoke.current.smoker     height      
##  Min.   :-4.497   Min.   :0.05920   Min.   :-0.1356066   Min.   :0.1036  
##  1st Qu.:-4.452   1st Qu.:0.06120   1st Qu.:-0.0347717   1st Qu.:0.1045  
##  Median :-4.449   Median :0.06136   Median : 0.0009334   Median :0.1046  
##  Mean   :-4.448   Mean   :0.06135   Mean   : 0.0026763   Mean   :0.1046  
##  3rd Qu.:-4.445   3rd Qu.:0.06150   3rd Qu.: 0.0396266   3rd Qu.:0.1046  
##  Max.   :-4.403   Max.   :0.06293   Max.   : 0.1480892   Max.   :0.1053  
##     sexmale           sigma          r.squared            F        
##  Min.   :0.1534   Min.   :0.4105   Min.   :0.7746   Min.   :557.6  
##  1st Qu.:0.1607   1st Qu.:0.4125   1st Qu.:0.7746   1st Qu.:557.7  
##  Median :0.1611   Median :0.4128   Median :0.7748   Median :558.1  
##  Mean   :0.1611   Mean   :0.4126   Mean   :0.7749   Mean   :558.7  
##  3rd Qu.:0.1616   3rd Qu.:0.4129   3rd Qu.:0.7751   3rd Qu.:559.0  
##  Max.   :0.1691   Max.   :0.4129   Max.   :0.7772   Max.   :566.0  
##      numdf       dendf    
##  Min.   :4   Min.   :649  
##  1st Qu.:4   1st Qu.:649  
##  Median :4   Median :649  
##  Mean   :4   Mean   :649  
##  3rd Qu.:4   3rd Qu.:649  
##  Max.   :4   Max.   :649

Exercise 4

ggplot() +
    geom_density(aes(x=smoke.current.smoker), fill="red", alpha=.5, data=s50) +
    geom_density(aes(x=smoke.current.smoker), fill="blue", alpha=.5, data=s250) +
    geom_density(aes(x=smoke.current.smoker), fill="green", alpha=.5, data=s500) +
    geom_vline(xintercept=coef(fm)["smokecurrent smoker"])

Code for p-value:

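y <- s500$smoke.current.smoker        # smoke coefficients from the 500 shuffled fits
x <- coef(fm)["smokecurrent smoker"]  # observed smoke coefficient from fm
z <- abs(y) > abs(x)                  # shuffles more extreme than observed (two-sided)
sum(z)/length(z)                      # proportion of such shuffles = p-value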
## [1] 0.102
Interpretation

The results of my test suggest that smoking does not have an effect on FEV after controlling for the other variables in the model. The p-value was about 0.1. Since p > 0.05, the conventional cutoff, there is only weak evidence that smoking has any effect on FEV.

Exercise 5

The p-value for smoke in the summary of fm is 0.141, which is close to the p-value obtained in Exercise 4. The Exercise 4 value (0.102) is slightly less than the value in fm. Some difference is expected, because the Exercise 4 p-value is estimated from 500 random shuffles, whereas the “given” p-value for fm is computed from the t distribution.
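
To make the comparison concrete, the sketch below recomputes the t-based p-value from the reported estimate and standard error, and adds a rough gauge of how much a shuffle-based p-value can vary from run to run. Treating the shuffle p-value as a binomial proportion is an assumption added here, not something taken from the assignment.

```r
# Values copied from the outputs above
estimate <- -0.087246   # smokecurrent smoker coefficient in fm
se       <- 0.059254    # its standard error
df       <- 649         # residual degrees of freedom
p_perm   <- 0.102       # shuffle-based p-value from Exercise 4
n_shuf   <- 500         # number of shuffles behind p_perm

# Two-sided p-value from the t distribution
t_stat <- estimate / se
p_t    <- 2 * pt(-abs(t_stat), df = df)

# Rough simulation error of p_perm, treating it as a binomial proportion
mc_se <- sqrt(p_perm * (1 - p_perm) / n_shuf)

round(c(t = t_stat, p_t = p_t, p_perm = p_perm, mc_se = mc_se), 4)
```

Rerunning the shuffles with a different random seed, or with more shuffles, would show directly how stable the estimate is.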

Exercise 6

There does not seem to be much evidence that smoking impacts FEV once the other variables are adjusted for. The p-value from fm is p = 0.141, which is greater than 0.05, so there is only weak evidence that smoking has any effect on FEV. The analysis also puts the smokecurrent smoker coefficient at -0.09 ± 0.12 with 95% confidence. Because twice the Std. Error (0.12) is larger than the magnitude of the estimate (0.09), this interval includes zero, which also indicates that smoking status does not have a significant effect on FEV. In the original analysis of fm, the smoking coefficient was read as saying that, all else held constant, people who smoke tend to have an FEV about 0.09 lower. That may still “seem” true, but given the p-value and confidence interval, the data do not show a clear difference in FEV between people who smoke and people who do not, since the evidence that smoking status impacts FEV is very weak.
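
The interval arithmetic can be checked directly from the reported estimate and standard error. This is a minimal sketch using only numbers from the regression summary. The "estimate ± 2 standard errors" rule is the one used in the text, and the t-quantile version is added here for comparison.

```r
# Values copied from the regression summary
estimate <- -0.087246   # smokecurrent smoker coefficient
se       <- 0.059254    # its standard error
df       <- 649         # residual degrees of freedom

# Rule-of-thumb interval: estimate +/- 2 standard errors
ci_rough <- estimate + c(-2, 2) * se

# Interval using the t quantile for 649 degrees of freedom
ci_t <- estimate + c(-1, 1) * qt(0.975, df) * se

round(rbind(rough = ci_rough, t_based = ci_t), 3)

# TRUE when the t-based interval contains zero
ci_t[1] < 0 && ci_t[2] > 0
```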

Extra Credit

Smoke has a different p-value in the two models because two different null hypotheses are involved, owing to the order in which the terms enter the model. In fm, the x-variables appear in the order age, smoke, height, sex. In fm1, the order is age, height, sex, smoke. In fm, the null is that smoke doesn’t contribute beyond the contribution already made by age. In fm1, the null is that smoke doesn’t contribute beyond the contribution already made by age, height, and sex. The two reports simply give different perspectives on smoke. The sum of squares attributed to smoke therefore differs between the models, which is why the p-value changes numerically.

While both models are “right”, I believe that the first model, fm, is better at showing the true association of smoking with FEV. In fm, smoke’s contribution is assessed after accounting only for age, whereas in fm1 it is assessed after accounting for all three other variables. In fm, smoke’s contribution is therefore much closer to being “by itself” (as it would be if it were the first variable listed), which would be the most accurate depiction of the association between smoking and FEV, so fm seems like the better model.
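
Assuming the two reports are sequential ANOVA tables (as `anova()` produces for an `lm` fit), the order dependence can be reproduced on simulated data. The data, variable names, and effect sizes below are invented for illustration and are not the FEV data.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Simulated (not real) data with correlated predictors
n      <- 200
age    <- runif(n, 3, 19)
height <- 45 + 1.5 * age + rnorm(n, sd = 3)
smoke  <- as.integer(age > 12 & runif(n) < 0.4)
y      <- -4 + 0.06 * age + 0.1 * height - 0.1 * smoke + rnorm(n, sd = 0.4)

# Same predictors, different order
m_early <- lm(y ~ age + smoke + height)  # smoke entered second
m_last  <- lm(y ~ age + height + smoke)  # smoke entered last

# Sequential F tests: each term is tested after the terms listed before it
p_early <- anova(m_early)["smoke", "Pr(>F)"]
p_last  <- anova(m_last)["smoke", "Pr(>F)"]

# The t test in summary() adjusts for all other terms regardless of order
p_t <- summary(m_last)$coefficients["smoke", "Pr(>|t|)"]

c(smoke_second = p_early, smoke_last = p_last, t_test = p_t)
```

In this sketch the p-value for smoke entered last matches the t test from `summary()`, since both adjust for every other term, while the p-value for smoke entered second adjusts only for age.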