Exercise 1
fm <- lm(fev ~ age + smoke + height + sex, data=FEV)
This is a fitted model for FEV predicted by age, smoking status, height, and sex. Holding everything else constant, each one-year increase in age is associated with a 0.06551-unit increase in FEV, and each one-unit increase in height is associated with a 0.10420-unit increase in FEV. A current smoker is predicted to have an FEV 0.08725 units lower than someone who is not a current smoker, and a male is predicted to have an FEV 0.15710 units higher than a female.
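One way to check the "per one-unit increase" reading of a coefficient is to compare predictions for two people who differ only by one year of age. The sketch below uses simulated stand-in data, since the FEV data frame is not reproduced here: the column names follow the model formula, but the values, the factor level names, and the names `sim`, `fit`, and `person` are assumptions made for illustration.

```r
# Simulated stand-in for the FEV data. All values are invented; only the
# column names follow the model formula from Exercise 1.
set.seed(1)
n <- 200
sim <- data.frame(
  age    = sample(3:19, n, replace = TRUE),
  height = rnorm(n, mean = 61, sd = 5),
  smoke  = factor(sample(c("non-current smoker", "current smoker"), n, replace = TRUE),
                  levels = c("non-current smoker", "current smoker")),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$fev <- -4.5 + 0.065 * sim$age + 0.10 * sim$height + rnorm(n, sd = 0.4)

fit <- lm(fev ~ age + smoke + height + sex, data = sim)

# Two hypothetical people, identical except for one year of age.
person <- data.frame(
  age    = c(10, 11),
  height = 60,
  smoke  = factor("non-current smoker", levels = levels(sim$smoke)),
  sex    = factor("female", levels = levels(sim$sex))
)
pred <- predict(fit, newdata = person)

# The difference in predicted FEV equals the age coefficient.
age_effect <- unname(pred[2] - pred[1])
all.equal(age_effect, unname(coef(fit)["age"]))
```

The same comparison works for height, and for the two factor terms by switching the level instead of adding one unit.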
Exercise 2
In the real data, current smokers appear to have a higher typical FEV. Nonsmokers have a median of about 2.5, with a range from 2 to 3. Current smokers have a median of around 3.2, with a range from 2.75 to 3.75. Once the data are shuffled, the range for current smokers expands quite a lot and their typical FEV drops, while the box plot for nonsmokers shifts slightly upward. Nonsmokers now have a median of about 2.55, with a range of 2 to 3.2. Current smokers have a median of about 2.6 (still slightly higher than nonsmokers), with a range from 2.2 to 3.3.
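The effect of shuffling on the group comparison can be sketched numerically. This example assumes the shuffle permutes only the smoking labels, and it uses simulated data in which smokers tend to be older (and so have higher FEV). The data, the level names, and the variable names are all invented for illustration.

```r
# Simulated stand-in data: older children are more likely to smoke, and
# FEV rises with age, so smokers show a higher FEV before shuffling.
set.seed(4)
n <- 300
age   <- sample(3:19, n, replace = TRUE)
smoke <- factor(ifelse(runif(n) < plogis(-6 + 0.4 * age),
                       "current smoker", "non-current smoker"))
fev   <- 0.5 + 0.2 * age + rnorm(n, sd = 0.5)

# Median FEV by group, before and after shuffling the smoking labels.
real_medians     <- tapply(fev, smoke, median)
shuffled_medians <- tapply(fev, sample(smoke), median)

real_gap <- unname(real_medians["current smoker"] -
                   real_medians["non-current smoker"])
shuffled_gap <- unname(shuffled_medians["current smoker"] -
                       shuffled_medians["non-current smoker"])

c(real = real_gap, shuffled = shuffled_gap)
```

Shuffling breaks the link between the label and the response, so the two groups drift toward the same center and any remaining gap is chance variation.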
Exercise 3
I chose to shuffle the data 1500 times. This number gave a null distribution that was well centered on the null value and changed little from rerun to rerun. I arrived at it by trying several numbers of shuffles and comparing the resulting plots to see which had steeper sides and a more regular shape around the null value. The vertical line marks zero on the x-axis and is included for reference.
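A minimal sketch of such a shuffling loop is given here. It assumes that only the smoke column is permuted and that the coefficient of interest is named `smokecurrent smoker`, as in the model summary. The data are simulated stand-ins, and `n_shuffles`, `null_coefs`, and `sim` are illustrative names, not the ones used in the original analysis.

```r
# Simulated stand-in for the FEV data (invented values).
set.seed(2)
n <- 200
sim <- data.frame(
  age    = sample(3:19, n, replace = TRUE),
  height = rnorm(n, mean = 61, sd = 5),
  smoke  = factor(sample(c("non-current smoker", "current smoker"), n, replace = TRUE),
                  levels = c("non-current smoker", "current smoker")),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$fev <- -4.5 + 0.065 * sim$age + 0.10 * sim$height + rnorm(n, sd = 0.4)

# Refit the model on data with shuffled smoking labels and keep the
# smoking coefficient from each refit.
n_shuffles <- 1500
null_coefs <- numeric(n_shuffles)
for (i in seq_len(n_shuffles)) {
  shuffled <- sim
  shuffled$smoke <- sample(sim$smoke)  # breaks any link between smoke and fev
  refit <- lm(fev ~ age + smoke + height + sex, data = shuffled)
  null_coefs[i] <- coef(refit)["smokecurrent smoker"]
}

# Plot the null distribution with a reference line at zero.
if (interactive()) {
  hist(null_coefs, breaks = 40,
       main = "Null distribution of the smoking coefficient",
       xlab = "Coefficient after shuffling")
  abline(v = 0, lty = 2)
}
```

Rerunning the loop with different values of `n_shuffles` shows how quickly the shape of the histogram stabilizes.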
Exercise 4
## [1] 10.93333
About 9.4 percent of the null distribution is farther from zero than the observed smoking coefficient (the exact figure depends on the rerun of the loops; the run printed above gave about 10.9 percent). This corresponds to a p-value of about 0.094, which is larger than 0.05. Given that, we cannot say that smoking is a statistically significant predictor of FEV, so we fail to reject the null hypothesis.
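The percentage above is the share of shuffled coefficients at least as far from zero as the observed one. The calculation can be sketched with a small helper. The name `perm_p_value` and the toy numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```r
# Two-sided permutation p-value: the proportion of null coefficients whose
# distance from zero is at least that of the observed coefficient.
perm_p_value <- function(null_coefs, observed) {
  mean(abs(null_coefs) >= abs(observed))
}

# Toy null distribution with five shuffled coefficients.
toy_null <- c(-0.20, -0.10, 0.00, 0.10, 0.20)
p <- perm_p_value(toy_null, observed = -0.15)

p        # 0.4: two of the five values are at least 0.15 from zero
100 * p  # the same proportion expressed as a percentage
```

Because the shuffles are random, this proportion changes a little from run to run, which is why the reported percentage varies between reruns.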
Exercise 5
##
## Call:
## lm(formula = fev ~ age + smoke + height + sex, data = FEV)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.37656 -0.25033  0.00894  0.25588  1.92047
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         -4.456974   0.222839 -20.001  < 2e-16 ***
## age                  0.065509   0.009489   6.904 1.21e-11 ***
## smokecurrent smoker -0.087246   0.059254  -1.472    0.141
## height               0.104199   0.004758  21.901  < 2e-16 ***
## sexmale              0.157103   0.033207   4.731 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4122 on 649 degrees of freedom
## Multiple R-squared: 0.7754, Adjusted R-squared: 0.774
## F-statistic: 560 on 4 and 649 DF, p-value: < 2.2e-16
The p-value for smoking status is 0.141. This is close to the value I got in the last exercise, which was around 0.0933. Since both are greater than 0.05, smoking status cannot be interpreted as a statistically significant predictor of FEV. We fail to reject the null hypothesis.
Exercise 6
After adjusting for the other variables, smoking does not appear to affect FEV much. The 95% confidence interval for the smokecurrent smoker coefficient runs from -0.20359813 to 0.02910535. Because this interval includes 0, the model cannot tell whether smoking raises or lowers FEV. This agrees with the p-values above: we cannot say that smoking is a statistically significant predictor of FEV, and we fail to reject the null hypothesis.
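The interval and the t-test p-value can both be reproduced, up to rounding, from the estimate, standard error, and residual degrees of freedom printed in the Exercise 5 summary. This is only a sketch of the standard t-based calculation. The variable names are illustrative, and the inputs are the rounded values from the printed output.

```r
# Values copied from the model summary in Exercise 5.
estimate  <- -0.087246
std_error <- 0.059254
df_resid  <- 649

# 95% confidence interval: estimate +/- critical t value * standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
ci

# Two-sided p-value from the t statistic.
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)
p_value

# The interval contains zero exactly when the p-value exceeds 0.05.
contains_zero <- ci[1] < 0 && ci[2] > 0
contains_zero == (p_value > 0.05)
```

The interval and the t-test are two views of the same calculation, which is why they lead to the same conclusion here.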
Extra Credit
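In fm the p-value for smoke is 0.0004161 and in fm1 the p-value is 0.1414. These differ because anova() tests terms sequentially, in the order they appear in the model formula, so each term is adjusted only for the terms listed before it. The order in which the model is written therefore matters: it is better to put the variables you think are most important first, so that later terms are tested after adjusting for them. The fm1 value also matches the t-test p-value for smoking in Exercise 5 (0.141), which adjusts for every other variable. This is why I would trust fm1 over fm.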
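The order dependence can be illustrated with two fits that contain the same terms in a different order. The formula for fm1 is not shown above, so `fit_last`, which places smoke last, is a hypothetical reordering and not necessarily the model that was actually run. The data are simulated stand-ins.

```r
# Simulated stand-in data in which smoking is related to age (invented values).
set.seed(3)
n <- 300
age    <- sample(3:19, n, replace = TRUE)
height <- 45 + 1.2 * age + rnorm(n, sd = 3)
smoke  <- factor(ifelse(runif(n) < plogis(-6 + 0.4 * age),
                        "current smoker", "non-current smoker"),
                 levels = c("non-current smoker", "current smoker"))
sex    <- factor(sample(c("female", "male"), n, replace = TRUE))
fev    <- -4.5 + 0.065 * age + 0.10 * height + rnorm(n, sd = 0.4)
sim    <- data.frame(fev, age, height, smoke, sex)

# Same terms, different order.
fit_early <- lm(fev ~ age + smoke + height + sex, data = sim)  # smoke second
fit_last  <- lm(fev ~ age + height + sex + smoke, data = sim)  # smoke last

# anova() tests each term after adjusting only for the terms before it.
p_early <- anova(fit_early)["smoke", "Pr(>F)"]
p_last  <- anova(fit_last)["smoke", "Pr(>F)"]

# The coefficient t-test adjusts for every other term in the model.
p_t <- summary(fit_early)$coefficients["smokecurrent smoker", "Pr(>|t|)"]

c(anova_early = p_early, anova_last = p_last, t_test = p_t)
```

When smoke is the last term, its sequential F-test adjusts for every other variable and so gives the same p-value as the t-test in the coefficient table.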