Exercise 1:

Before looking at the data, what types of relationships do you expect to see between each of these variables and FEV? Justify your answers briefly.

I expect to see a strong, linear, inverse relationship between FEV and age: as age increases, FEV will decrease. This is because as people age so do their bodies, and they are more likely to have less healthy, weaker lungs. I also expect a strong negative association between FEV and smoking status, with people marked as “smoker” having lower FEV, because smoking usually leads to weakened and unhealthy lungs.

Exercise 2:

Think about the different measured variables that you have at your disposal. Hypothesize at least two possible interactions, and come up with a plausible justification about why such an interaction might exist.

It will be interesting to see how FEV relates to age, and then how that relationship changes when smoking status (smoker vs. nonsmoker) is used as a factor. Seeing those two relationships side by side would be intriguing to analyze. I hypothesize that both groups will show at least a weak inverse relationship, but that smokers will have a more negative slope than nonsmokers. Height vs. FEV could also be a worthwhile interaction to look at: it is possible that the taller someone is, the stronger their lungs are and the more air they can force out of them. I would also like to look at age and smoking status by gender to see what trends are present there. Without any prior knowledge, I hypothesize that more men smoke and that they start younger, because they can be more impressionable and because smoking has often been associated with “masculinity” or being “suave”.
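The first hypothesis (a steeper age slope for smokers) can be written as an interaction term in a linear model. The following is only a rough sketch on simulated data: the data frame `dat`, its column names, and every number in the simulation are assumptions made for illustration, not values from the actual dataset.

```r
# Simulated stand-in data; column names and effect sizes are made up.
set.seed(1)
n <- 200
age <- runif(n, 20, 70)
smoker <- factor(sample(c("no", "yes"), n, replace = TRUE))
# Assumption built into the simulation: FEV falls with age, faster for smokers.
fev <- 5 - 0.02 * age - 0.02 * age * (smoker == "yes") + rnorm(n, sd = 0.3)
dat <- data.frame(fev, age, smoker)

# age * smoker expands to age + smoker + age:smoker
m_int <- lm(fev ~ age * smoker, data = dat)
print(coef(m_int))

# Age slope within each group: the interaction coefficient is the
# difference between the smoker slope and the nonsmoker slope.
slope_non <- coef(m_int)[["age"]]
slope_smk <- slope_non + coef(m_int)[["age:smokeryes"]]
print(c(nonsmoker = slope_non, smoker = slope_smk))
```

If the hypothesis holds in real data, the `age:smokeryes` coefficient would be negative, which is the same thing as the smoker slope being lower than the nonsmoker slope.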

Exercise 3:

Generate a few simple plots of the data to evaluate the possible bivariate relationships that exist. Based on these plots, do you think that linear relationships will be sufficient to explain the data?
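One possible way to produce such plots in base R is sketched below. It runs on simulated data, so the data frame `dat`, the column names `fev`, `age`, `height`, and `smoker`, and the shape of the relationships are all assumptions for illustration and not the real dataset.

```r
# Simulated stand-in data; names and values are assumptions.
set.seed(2)
n <- 300
age <- runif(n, 5, 18)
height <- 45 + 1.5 * age + rnorm(n, sd = 3)
smoker <- factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
fev <- exp(-2 + 0.05 * height + rnorm(n, sd = 0.15))  # deliberately curved in height
dat <- data.frame(fev, age, height, smoker)

# Scatterplots with a lowess smoother: a smoother that bends away from a
# straight line hints that a purely linear fit may not be sufficient.
op <- par(mfrow = c(1, 3))
plot(fev ~ height, data = dat, main = "FEV vs height")
lines(lowess(dat$height, dat$fev), col = "red")
plot(fev ~ age, data = dat, main = "FEV vs age")
lines(lowess(dat$age, dat$fev), col = "red")
boxplot(fev ~ smoker, data = dat, main = "FEV by smoking status")
par(op)

# Numeric companion to the plots: pairwise correlations
print(round(cor(dat[, c("fev", "age", "height")]), 2))
```

The smoother is the useful part for judging linearity: a correlation coefficient can be high even when the relationship is visibly curved.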

Exercise 4:

m_height_vs_fev models FEV by height. The Adjusted R-squared is 0.7533 and the slope is 0.131976, meaning that for every one-unit increase in height, predicted FEV increases by 0.131976 units.

The second model, m_age_vs_fev, models FEV by age. Its Adjusted R-squared is 0.5716, which is worse than the first model. According to this model, each one-unit increase in age is associated with an increase of 0.222041 units in predicted FEV.

The third model, m_age_vs_fev_sex, models FEV by age plus sex. Its Adjusted R-squared is 0.6058, which is still worse than the first model but better than the second. Holding sex constant, each one-unit increase in age is associated with an increase of 0.220445 units in predicted FEV, and the sex term indicates that predicted FEV also differs between males and females.

The fourth model, mod_fev_height_plusAge_plusSex, models FEV by height plus age plus sex. Its Adjusted R-squared is 0.7736, the best of the models so far. Holding the other variables constant, a one-unit increase in height is associated with an increase of 0.104560 units in predicted FEV, a one-unit increase in age with an increase of 0.061364 units, and being male with a predicted FEV that is 0.161112 units higher.

The second-to-last model, mod_fev_height_plusAge_plusSex_heightAge, models FEV by height plus age plus sex plus height times age. Its Adjusted R-squared is 0.7922, better than the fourth model. The height coefficient is 0.047668 and the age coefficient is -0.373898. Because this model includes a height-by-age interaction, each of these coefficients describes the effect of that variable when the other equals zero, so they should not be read as overall slopes.

The last model, mod_fev_height_plusAge_plusSex_heightAge_heightSex, has the best Adjusted R-squared, 0.7949. The height coefficient is 0.0426603 and the age coefficient is -0.3047092, with the same caveat about interpreting coefficients in the presence of interaction terms. I ran a residuals plot and saw that most of the points are evenly distributed, so it seems that this last model will work well.
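To make the direction of a slope concrete, here is a minimal sketch of fitting a model of FEV by height and checking what the coefficient means. The data are simulated, so the data frame `dat`, the model name `m`, and the numbers are hypothetical and will not reproduce the values reported above.

```r
# Simulated stand-in data; names and values are assumptions.
set.seed(3)
n <- 300
height <- rnorm(n, mean = 60, sd = 6)
fev <- -5 + 0.13 * height + rnorm(n, sd = 0.4)
dat <- data.frame(fev, height)

# The response goes on the left of ~ and the predictor on the right.
m <- lm(fev ~ height, data = dat)
slope <- coef(m)[["height"]]
adj_r2 <- summary(m)$adj.r.squared
print(c(slope = slope, adj_r2 = adj_r2))

# The slope is the change in predicted FEV for a one-unit increase in height.
p <- predict(m, newdata = data.frame(height = c(60, 61)))
print(unname(diff(p)))  # equals the slope

# Residuals vs fitted values: an even band around zero with no curve or
# funnel shape is what a reasonable linear fit tends to look like.
plot(fitted(m), resid(m), xlab = "Fitted FEV", ylab = "Residual")
abline(h = 0, lty = 2)
```

Because FEV is the response, the coefficient is read as "change in FEV per unit of height", not the other way around.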

Exercise 5:

I chose to go with my last model, mod_fev_height_plusAge_plusSex_heightAge_heightSex, because its Adjusted R-squared, 0.7949, is the highest of all my models. I think that the most interesting relationships here are between FEV and height using sex as a factor.
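A comparison like this can be tabulated rather than read off separate summaries. The sketch below uses simulated data and hypothetical model names; it also adds AIC and a nested F-test as cross-checks, which were not part of the analysis above, since Adjusted R-squared alone can favor extra terms that improve the fit only slightly.

```r
# Simulated stand-in data; names and values are assumptions.
set.seed(4)
n <- 300
age <- runif(n, 5, 18)
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
height <- 45 + 1.5 * age + 2 * (sex == "male") + rnorm(n, sd = 3)
fev <- -4 + 0.1 * height + 0.05 * age + 0.15 * (sex == "male") + rnorm(n, sd = 0.4)
dat <- data.frame(fev, height, age, sex)

models <- list(
  height            = lm(fev ~ height, data = dat),
  height_age_sex    = lm(fev ~ height + age + sex, data = dat),
  with_interactions = lm(fev ~ height + age + sex + height:age + height:sex,
                         data = dat)
)

# One row per model: higher Adjusted R-squared and lower AIC are better.
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic    = sapply(models, AIC)
)
print(round(comparison, 4))

# F-test for whether the interaction terms improve on the additive model
print(anova(models$height_age_sex, models$with_interactions))
```

If the F-test shows little evidence for the interaction terms, the simpler additive model may be the easier one to interpret even when its Adjusted R-squared is marginally lower.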

Exercise 6:

Based on your thinking about this dataset and correlates of lung health more generally, if you were to collect similar data like this in a new study, what additional variables would you be interested in collecting and why?

If I were to collect similar data, I would like to add a variable for how long the person has been a smoker and a variable for how much they smoke (maybe cigarettes per day). I would also like to see a variable for whether they have ever smoked, because as it stands the data only distinguish current smokers from non-current smokers, and there is no way to tell if a non-current smoker has smoked in the past.