library("Hmisc")
getHdata(FEV)
fevmodel1 <-lm(fev ~ age + smoke + height + sex, data=FEV)
summary(fevmodel1)
## 
## Call:
## lm(formula = fev ~ age + smoke + height + sex, data = FEV)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37656 -0.25033  0.00894  0.25588  1.92047 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.456974   0.222839 -20.001  < 2e-16 ***
## age                  0.065509   0.009489   6.904 1.21e-11 ***
## smokecurrent smoker -0.087246   0.059254  -1.472    0.141    
## height               0.104199   0.004758  21.901  < 2e-16 ***
## sexmale              0.157103   0.033207   4.731 2.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4122 on 649 degrees of freedom
## Multiple R-squared:  0.7754, Adjusted R-squared:  0.774 
## F-statistic:   560 on 4 and 649 DF,  p-value: < 2.2e-16
currentsmoke<-coef(fevmodel1)[3]
  1. The intercept coefficient for this model indicates that when age and height equal zero and the person is a nonsmoking female, FEV= -4.457. The age coefficient of 0.066 means that when the other variables are held constant, each year increase in age is associated with a .066 increase in FEV. The current smoker variable means that when all else is held constant, smokers have a .087 unit decrease in FEV compared to smokers. The height variable coefficient indicates that when other variables are held constant, each one inch increase in height is associated with a .104 unit increase in FEV. Finally, the sexmale coefficient means that when all else is held constant, males have FEV scores .157 units higher than females.
library(mosaic)
qplot(smoke,fev,data=FEV)+geom_boxplot()

qplot(shuffle(smoke),fev,data=FEV)+geom_boxplot()

  1. The graph of smoking and FEV shows a positive association between the two variables; that is, holding all else constant, smokers have higher FEV scores than nonsmokers. This relationship might be surprising since smoking damages lungs, but smoking and FEV are also associated with age. I think that smokers in this sample have higher FEV because they are also older than nonsmokers.

When we shuffle the smoke variable, the association between smoking and FEV disappears; the mean FEV for smokers and nonsmokers looks to be the same since “smoke” was randomly assigned. 3. We can repeat this process 5000 times to get an idea of what the smoke coefficients could be if there truly is no relationship between smoking and FEV:

nullsmoke=do(5000)*lm(fev~age + shuffle(smoke) + height + sex, data=FEV)
  1. Here is a plot of the distribution of the smoking variable marked with the coefficient value that we got in our original model (where smoke had not been shuffled):
library(ggplot2)
hist(nullsmoke$smoke.current.smoker,main="Null Distribution, Current Smoker",xlab="Current Smoker Coefficient Value",col="skyblue4")
abline(v=currentsmoke,col="red")

We can use the following code to find the p-value of the current smoker coefficient from our original model:

set.seed(12)
tally(~smoke.current.smoker>abs(currentsmoke),nullsmoke)
## 
##  TRUE FALSE 
##   290  4710
tally(~smoke.current.smoker<currentsmoke,nullsmoke)
## 
##  TRUE FALSE 
##   263  4737
pvalcurrentsmoke<-(291+234)/5000

According to my null distribution hypothesis test, the p-value of the smoker coefficient from my original model is 0.105, meaning that if smoking is not related to FEV, we would expect to see a coefficient value at least as extreme as -0.087246 in 10.5% of our trials. Since we aim for a p-value >.05 in order to reject the null hypothesis, we will fail to reject the null hypothesis for the smoking coefficient in this model. After adjusting for other variables in the model, there is no statistically significant association between smoking and FEV.

summary(fevmodel1)
## 
## Call:
## lm(formula = fev ~ age + smoke + height + sex, data = FEV)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37656 -0.25033  0.00894  0.25588  1.92047 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.456974   0.222839 -20.001  < 2e-16 ***
## age                  0.065509   0.009489   6.904 1.21e-11 ***
## smokecurrent smoker -0.087246   0.059254  -1.472    0.141    
## height               0.104199   0.004758  21.901  < 2e-16 ***
## sexmale              0.157103   0.033207   4.731 2.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4122 on 649 degrees of freedom
## Multiple R-squared:  0.7754, Adjusted R-squared:  0.774 
## F-statistic:   560 on 4 and 649 DF,  p-value: < 2.2e-16
  1. The p-value of the smoke variable according to R’s t-test is .141, which is higher than the p-value that I’ve calculated from my null distribution. Based on the t-test, there is no evidence that smoking influences FEV after adjusting for age, height, and sex.

##                           2.5 %      97.5 %
## (Intercept)         -4.89454680 -4.01940100
## age                  0.04687736  0.08414129
## smokecurrent smoker -0.20359813  0.02910535
## height               0.09485705  0.11354180
## sexmale              0.09189669  0.22230917
##                   name        lower        upper level method     estimate
## 1            Intercept  -4.46836453  -4.42847738  0.95 stderr  -4.45697390
## 2                  age   0.06069765   0.06204733  0.95 stderr   0.06550932
## 3 smoke.current.smoker  -0.10550224   0.10682377  0.95 stderr  -0.08724639
## 4               height   0.10420190   0.10490794  0.95 stderr   0.10419943
## 5              sexmale   0.15852520   0.16369202  0.95 stderr   0.15710293
## 6                sigma   0.41167570   0.41349405  0.95 stderr   0.41221629
## 7            r.squared   0.77396917   0.77594923  0.95 stderr   0.77536138
## 8                    F 555.53926052 561.92878088  0.95 stderr 560.02117831
##   margin.of.error
## 1    0.0199435752
## 2    0.0006748371
## 3    0.1061630088
## 4    0.0003530211
## 5    0.0025834069
## 6    0.0009091748
## 7    0.0009900296
## 8    3.1947601803

By now I have calculated p-values and 95% conidence intervals for my ‘canned’ model (p=.141, 95% CI= -0.2036-0.0291) and my permutation-based distribution (p=.105, 95% CI=-0.1045-0.1066). Neither of these coefficients has a statistically significant relationship to FEV, but they are each slightly different from each other. The reason for these differences is that my ‘canned’ model’s confidence interval is based on Student’s t-distribution, and my second model is based on a permutation distribution.

Extra Credit: Though the two models are mathematically equivalent, the smoke variable is being added to the anova report at different times. The first anova is measuring the amount of variation in FEV that is explained by smoking after accounting for height, but the second anova is measuring how much of the variation of FEV is explained by smoking after accounting for age, height, smoking, and sex. I think that the first model better represents the relationship between smoking and FEV.