library("Hmisc")
getHdata(FEV)
fevmodel1 <-lm(fev ~ age + smoke + height + sex, data=FEV)
summary(fevmodel1)
##
## Call:
## lm(formula = fev ~ age + smoke + height + sex, data = FEV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.37656 -0.25033 0.00894 0.25588 1.92047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.456974 0.222839 -20.001 < 2e-16 ***
## age 0.065509 0.009489 6.904 1.21e-11 ***
## smokecurrent smoker -0.087246 0.059254 -1.472 0.141
## height 0.104199 0.004758 21.901 < 2e-16 ***
## sexmale 0.157103 0.033207 4.731 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4122 on 649 degrees of freedom
## Multiple R-squared: 0.7754, Adjusted R-squared: 0.774
## F-statistic: 560 on 4 and 649 DF, p-value: < 2.2e-16
currentsmoke<-coef(fevmodel1)[3]
library(mosaic)
qplot(smoke,fev,data=FEV)+geom_boxplot()
qplot(shuffle(smoke),fev,data=FEV)+geom_boxplot()
When we shuffle the smoke variable, the association between smoking and FEV disappears; the mean FEV for smokers and nonsmokers looks to be the same since “smoke” was randomly assigned. 3. We can repeat this process 5000 times to get an idea of what the smoke coefficients could be if there truly is no relationship between smoking and FEV:
nullsmoke=do(5000)*lm(fev~age + shuffle(smoke) + height + sex, data=FEV)
library(ggplot2)
hist(nullsmoke$smoke.current.smoker,main="Null Distribution, Current Smoker",xlab="Current Smoker Coefficient Value",col="skyblue4")
abline(v=currentsmoke,col="red")
We can use the following code to find the p-value of the current smoker coefficient from our original model:
set.seed(12)
tally(~smoke.current.smoker>abs(currentsmoke),nullsmoke)
##
## TRUE FALSE
## 290 4710
tally(~smoke.current.smoker<currentsmoke,nullsmoke)
##
## TRUE FALSE
## 263 4737
pvalcurrentsmoke<-(291+234)/5000
According to my null distribution hypothesis test, the p-value of the smoker coefficient from my original model is 0.105, meaning that if smoking is not related to FEV, we would expect to see a coefficient value at least as extreme as -0.087246 in 10.5% of our trials. Since we aim for a p-value >.05 in order to reject the null hypothesis, we will fail to reject the null hypothesis for the smoking coefficient in this model. After adjusting for other variables in the model, there is no statistically significant association between smoking and FEV.
summary(fevmodel1)
##
## Call:
## lm(formula = fev ~ age + smoke + height + sex, data = FEV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.37656 -0.25033 0.00894 0.25588 1.92047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.456974 0.222839 -20.001 < 2e-16 ***
## age 0.065509 0.009489 6.904 1.21e-11 ***
## smokecurrent smoker -0.087246 0.059254 -1.472 0.141
## height 0.104199 0.004758 21.901 < 2e-16 ***
## sexmale 0.157103 0.033207 4.731 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4122 on 649 degrees of freedom
## Multiple R-squared: 0.7754, Adjusted R-squared: 0.774
## F-statistic: 560 on 4 and 649 DF, p-value: < 2.2e-16
The p-value of the smoke variable according to R’s t-test is .141, which is higher than the p-value that I’ve calculated from my null distribution. Based on the t-test, there is no evidence that smoking influences FEV after adjusting for age, height, and sex.
## 2.5 % 97.5 %
## (Intercept) -4.89454680 -4.01940100
## age 0.04687736 0.08414129
## smokecurrent smoker -0.20359813 0.02910535
## height 0.09485705 0.11354180
## sexmale 0.09189669 0.22230917
## name lower upper level method estimate
## 1 Intercept -4.46836453 -4.42847738 0.95 stderr -4.45697390
## 2 age 0.06069765 0.06204733 0.95 stderr 0.06550932
## 3 smoke.current.smoker -0.10550224 0.10682377 0.95 stderr -0.08724639
## 4 height 0.10420190 0.10490794 0.95 stderr 0.10419943
## 5 sexmale 0.15852520 0.16369202 0.95 stderr 0.15710293
## 6 sigma 0.41167570 0.41349405 0.95 stderr 0.41221629
## 7 r.squared 0.77396917 0.77594923 0.95 stderr 0.77536138
## 8 F 555.53926052 561.92878088 0.95 stderr 560.02117831
## margin.of.error
## 1 0.0199435752
## 2 0.0006748371
## 3 0.1061630088
## 4 0.0003530211
## 5 0.0025834069
## 6 0.0009091748
## 7 0.0009900296
## 8 3.1947601803
By now I have calculated p-values and 95% conidence intervals for my ‘canned’ model (p=.141, 95% CI= -0.2036-0.0291) and my permutation-based distribution (p=.105, 95% CI=-0.1045-0.1066). Neither of these coefficients has a statistically significant relationship to FEV, but they are each slightly different from each other. The reason for these differences is that my ‘canned’ model’s confidence interval is based on Student’s t-distribution, and my second model is based on a permutation distribution.
Extra Credit: Though the two models are mathematically equivalent, the smoke variable is being added to the anova report at different times. The first anova is measuring the amount of variation in FEV that is explained by smoking after accounting for height, but the second anova is measuring how much of the variation of FEV is explained by smoking after accounting for age, height, smoking, and sex. I think that the first model better represents the relationship between smoking and FEV.