##
## Call:
## lm(formula = fev ~ age + smoke + height + sex, data = FEV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.37656 -0.25033 0.00894 0.25588 1.92047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.456974 0.222839 -20.001 < 2e-16 ***
## age 0.065509 0.009489 6.904 1.21e-11 ***
## smokecurrent smoker -0.087246 0.059254 -1.472 0.141
## height 0.104199 0.004758 21.901 < 2e-16 ***
## sexmale 0.157103 0.033207 4.731 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4122 on 649 degrees of freedom
## Multiple R-squared: 0.7754, Adjusted R-squared: 0.774
## F-statistic: 560 on 4 and 649 DF, p-value: < 2.2e-16
The intercept is the predicted FEV (volume exhaled in one second) when every predictor equals zero: -4.46. A negative volume is impossible, and nobody has zero age and zero height, so this value has no real-world meaning. The age coefficient says that, all else held constant, people who are 1 year older tend to have an FEV about 0.07 higher. The smoke coefficient says that, all else held constant, current smokers tend to have an FEV about 0.09 lower than non-smokers. The height coefficient says that, all else held constant, people who are one inch taller tend to have an FEV about 0.10 higher. Lastly, the sex coefficient says that, all else held constant, males tend to have an FEV about 0.16 higher than females.
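To make the "all else held constant" reading concrete, the sketch below plugs the printed estimates into the regression equation by hand. The person described (10 years old, 60 inches tall, male, non-smoker) is a hypothetical example chosen for illustration, not a row from the data, and the vector names are my own.

```r
# Coefficients copied from the summary(fm) output above
b <- c(intercept = -4.456974, age = 0.065509, smoker = -0.087246,
       height = 0.104199, male = 0.157103)

# Hypothetical person: 10 years old, non-smoker, 60 inches tall, male
x <- c(intercept = 1, age = 10, smoker = 0, height = 60, male = 1)
pred <- sum(b * x)
pred

# Same person one year older: the prediction rises by the age coefficient
x_older <- replace(x, "age", 11)
pred_older <- sum(b * x_older)
pred_older - pred
```

The difference between the two predictions equals the age estimate, which is what the coefficient interpretation claims.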
In the raw data, smokers have higher FEV values than non-smokers because the smokers are, on average, much older. Being older, they have much larger lung capacity and produce higher FEV values even though smoking impairs their lungs. When smoking status is shuffled, the current-smoker boxplot looks much more like the non-smoker boxplot. It shifts down noticeably along the FEV axis, so the FEV values of the two groups are very similar. Shuffling breaks the link between smoking status and age, so the age effect described above no longer drives the comparison.
A linear model with smoke shuffled was fitted 50, 250, and 500 times. The summary of the 500 fits is shown:
s50 <- do(50)*lm(fev ~ age + shuffle(smoke) + height + sex, data=FEV)
s250 <- do(250)*lm(fev ~ age + shuffle(smoke) + height + sex, data=FEV)
s500 <- do(500)*lm(fev ~ age + shuffle(smoke) + height + sex, data=FEV)
summary(s500)
## Intercept age smoke.current.smoker height
## Min. :-4.497 Min. :0.05920 Min. :-0.1356066 Min. :0.1036
## 1st Qu.:-4.452 1st Qu.:0.06120 1st Qu.:-0.0347717 1st Qu.:0.1045
## Median :-4.449 Median :0.06136 Median : 0.0009334 Median :0.1046
## Mean :-4.448 Mean :0.06135 Mean : 0.0026763 Mean :0.1046
## 3rd Qu.:-4.445 3rd Qu.:0.06150 3rd Qu.: 0.0396266 3rd Qu.:0.1046
## Max. :-4.403 Max. :0.06293 Max. : 0.1480892 Max. :0.1053
## sexmale sigma r.squared F
## Min. :0.1534 Min. :0.4105 Min. :0.7746 Min. :557.6
## 1st Qu.:0.1607 1st Qu.:0.4125 1st Qu.:0.7746 1st Qu.:557.7
## Median :0.1611 Median :0.4128 Median :0.7748 Median :558.1
## Mean :0.1611 Mean :0.4126 Mean :0.7749 Mean :558.7
## 3rd Qu.:0.1616 3rd Qu.:0.4129 3rd Qu.:0.7751 3rd Qu.:559.0
## Max. :0.1691 Max. :0.4129 Max. :0.7772 Max. :566.0
## numdf dendf
## Min. :4 Min. :649
## 1st Qu.:4 1st Qu.:649
## Median :4 Median :649
## Mean :4 Mean :649
## 3rd Qu.:4 3rd Qu.:649
## Max. :4 Max. :649
ggplot() +
  geom_density(aes(x=smoke.current.smoker), fill="red", alpha=.5, data=s50) +
  geom_density(aes(x=smoke.current.smoker), fill="blue", alpha=.5, data=s250) +
  geom_density(aes(x=smoke.current.smoker), fill="green", alpha=.5, data=s500) +
  geom_vline(xintercept=coef(fm)["smokecurrent smoker"])
Code for p-value:
y <- s500$smoke.current.smoker
x <- coef(fm)["smokecurrent smoker"]
z <- abs(y) > abs(x)
sum(z)/length(z)
## [1] 0.102
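For readers without the shuffling helpers used above, the same shuffle-and-refit idea can be sketched in base R. This is only an illustration under stated assumptions: the FEV data are not reproduced here, so the built-in `mtcars` data stand in, with the binary `am` column playing the role of smoke. The seed and the object names are arbitrary, and the comparison uses `>=` rather than `>`.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Stand-in data: mtcars ships with R; `am` (0/1) plays the role of smoke
observed <- coef(lm(mpg ~ wt + am, data = mtcars))["am"]

# Shuffle the binary predictor, refit, and keep its coefficient each time
shuffled <- replicate(500, {
  d <- mtcars
  d$am <- sample(d$am)
  coef(lm(mpg ~ wt + am, data = d))["am"]
})

# Two-sided p-value: share of shuffles at least as extreme as the observed
p_perm <- mean(abs(shuffled) >= abs(observed))
p_perm
```

`mean()` of a logical vector gives the same proportion as `sum(z)/length(z)`.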
The results of my test suggest that smoking does not have a detectable effect on FEV after controlling for the other variables in the model. The p-value was about 0.1, which is greater than 0.05, so there is only weak evidence that smoking has any effect on FEV.
The p-value for smoke in the summary of fm is 0.141, which is close to the p-value of 0.102 obtained in Exercise 4. The Exercise 4 value is slightly smaller than the one reported for fm. Some difference is expected: the Exercise 4 value is estimated from 500 random shuffles, whereas the “given” p-value for fm comes from a t-test.
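One way to gauge how much a shuffle-based p-value can move from run to run is its simulation standard error. The sketch below treats each of the 500 shuffles as an independent yes/no draw, which is an assumption of mine; the variable names are also my own.

```r
p_hat <- 0.102     # shuffle-based p-value reported above
n_shuffles <- 500  # number of shuffles behind it

# Binomial standard error of a proportion estimated from n_shuffles draws
se_p <- sqrt(p_hat * (1 - p_hat) / n_shuffles)
se_p

# Rough 95% range for the shuffle-based p-value
c(lower = p_hat - 2 * se_p, upper = p_hat + 2 * se_p)
```

Under that assumption the simulation noise is on the order of 0.014, so a rerun could plausibly shift the p-value by a few hundredths. Any remaining gap from 0.141 may reflect the different assumptions behind the shuffle test and the t-test.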
There is not much evidence that smoking affects FEV once the other variables are adjusted for. The p-value from fm is 0.141, which is greater than 0.05, so there is only weak evidence that smoking has any effect on FEV. The analysis also puts the smokecurrent smoker coefficient at -0.09 ± 0.12 with 95% confidence. Because twice the Std. Error (0.12) exceeds the size of the estimate (0.09), the interval includes zero, which likewise indicates that smoking status does not have a statistically significant effect on FEV. In the original reading of fm, the coefficient suggested that, all else held constant, people who smoke tend to have an FEV about 0.09 lower. That may still “seem” true, but given the p-value and the confidence interval, the data do not show a clear difference in FEV between smokers and non-smokers, since the evidence that smoking status affects FEV is very weak.
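The interval quoted above can be reproduced by hand from the printed estimate, standard error, and residual degrees of freedom. The sketch shows both the "plus or minus 2 standard errors" rule of thumb and the version using the exact t quantile; the variable names are my own.

```r
est <- -0.087246   # smokecurrent smoker estimate from summary(fm)
se  <- 0.059254    # its standard error
df  <- 649         # residual degrees of freedom

# Rule-of-thumb interval: estimate plus or minus 2 standard errors
est + c(-2, 2) * se

# Interval using the exact t quantile instead of 2
ci <- est + c(-1, 1) * qt(0.975, df) * se
ci
```

Both intervals run from a negative lower bound to a positive upper bound, so zero is a plausible value for the coefficient.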
Smoke has a different p-value in the two models because the two reports test different null hypotheses, which depend on the order in which the terms are entered. In fm, the x-variables appear in the order age, smoke, height, sex. In fm1, the order is age, height, sex, smoke. In fm, the null is that smoke doesn’t contribute beyond the contribution already made by age. In fm1, the null is that smoke doesn’t contribute beyond the contribution already made by age, height, and sex. The two reports simply give different perspectives on smoke. The sum of squares attributed to smoke therefore differs between the models, which is why the p-value changes numerically.
While both models are “right”, I believe that the first model, fm, is better at showing the true association of smoking with FEV. In fm, smoke’s contribution is assessed after accounting only for age, whereas in fm1 it is assessed after accounting for all 3 other variables. In fm, smoke’s contribution is therefore much closer to being just “by itself” (as it would be if it were the first variable listed), which would be the most accurate depiction of the association between smoking and FEV, so fm seems like the better model.
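The order dependence described above can be seen in a small, self-contained sketch. The FEV data are not reproduced here, so the built-in `mtcars` data stand in, and the model names are my own. `anova()` on an `lm` fit reports sequential sums of squares, so each term is tested after the terms listed before it.

```r
# Stand-in data: mtcars ships with R (the FEV data are not reproduced here)
m_first <- lm(mpg ~ hp + wt, data = mtcars)   # hp listed first
m_last  <- lm(mpg ~ wt + hp, data = mtcars)   # hp listed last

# Sequential sums of squares for hp depend on its position in the formula
ss_first <- anova(m_first)["hp", "Sum Sq"]
ss_last  <- anova(m_last)["hp", "Sum Sq"]
c(first = ss_first, last = ss_last)

# When a term is listed last, its ANOVA p-value matches the t-test p-value
p_anova <- anova(m_last)["hp", "Pr(>F)"]
p_ttest <- summary(m_last)$coefficients["hp", "Pr(>|t|)"]
c(anova = p_anova, t_test = p_ttest)
```

The fitted coefficients are the same in both orderings. Only the sequential sums of squares, and the p-values built from them, change with the order of the terms.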