Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I chose the “Total Lung Capacity” data set from the Florida State University Department of Scientific Computing web site: http://people.sc.fsu.edu/~jburkardt/datasets/iswr/tlc.csv.

# bring in raw data
tlc <- read.csv("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/tlc.csv")
# display a few rows of the data
head(tlc)
##   age sex height  tlc
## 1  35   1    149 3.40
## 2  11   1    138 3.41
## 3  12   2    148 3.80
## 4  16   1    156 3.90
## 5  32   1    152 4.00
## 6  16   1    157 4.10
# use a pairs plot to inspect the pairwise relationships between variables
pairs(tlc, gap = 0.5)

The dependent (response) variable is Total Lung Capacity (“tlc”), and the chosen terms for the model are as follows:

Quadratic Term - age
Dichotomous Term - sex
Dichotomous vs. Quantitative Interaction Term - height * sex

# create quadratic term
age_squared <- (tlc$age)^2
# generate model
tlc.lm <- lm(tlc$tlc ~ age_squared + tlc$sex + tlc$height:tlc$sex)
# summary for model
summary(tlc.lm)
## 
## Call:
## lm(formula = tlc$tlc ~ age_squared + tlc$sex + tlc$height:tlc$sex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2517 -0.9033  0.1018  0.7559  2.5924 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0832203  0.8311250   6.116 1.34e-06 ***
## age_squared        -0.0005084  0.0003799  -1.338  0.19156    
## tlc$sex            -7.9314198  2.7202881  -2.916  0.00691 ** 
## tlc$sex:tlc$height  0.0525256  0.0145790   3.603  0.00121 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.172 on 28 degrees of freedom
## Multiple R-squared:  0.5297, Adjusted R-squared:  0.4793 
## F-statistic: 10.51 on 3 and 28 DF,  p-value: 8.409e-05
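
Because sex is read in as a number (1 or 2), lm() treats it as a quantitative predictor in the fit above. The following is a minimal sketch of an alternative specification, assuming the coding is 1 = Female and 2 = Male and that the data frame has the columns age, sex, height and tlc shown by head(). The helper name fit_tlc_factor is hypothetical. It recodes sex as a factor, builds the quadratic term inside the formula with I(), and passes the data frame through the data argument instead of using tlc$ prefixes.

```r
# Hypothetical helper: refit the model with sex as a labelled factor.
# Assumes columns age, sex (coded 1/2), height and tlc.
fit_tlc_factor <- function(d) {
  # Coding assumed here: 1 = Female, 2 = Male
  d$sex <- factor(d$sex, levels = c(1, 2), labels = c("Female", "Male"))
  # I(age^2) creates the quadratic term inside the formula;
  # sex:height gives each sex its own height slope
  lm(tlc ~ I(age^2) + sex + sex:height, data = d)
}
```

It could be called as summary(fit_tlc_factor(tlc)). This is a different model from the one reported above: with sex as a factor there are five coefficients instead of four, so the estimates would not match the table.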

The model is as follows: \(tlc = 5.083 - 0.000508 \cdot age^2 - 7.931 \cdot sex + 0.0525 \cdot sex \cdot height\)

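The coefficients can be interpreted as follows. The intercept, 5.083 (in thousands), is the predicted lung capacity when age, sex, and height are all zero; because sex is coded 1 = Female and 2 = Male, it serves as a starting point for the arithmetic and does not describe any real subject. The age coefficient lowers the prediction by 0.000508 for each unit of age squared, although this term is not statistically significant (p = 0.19). The sex coefficient lowers the prediction by 7.931 for each unit of the sex code, and the interaction coefficient raises it by 0.0525 for each unit of sex multiplied by height. Taken together, the height slope is 0.0525 for females (sex = 1) and 0.105 for males (sex = 2).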
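
Because sex takes only the values 1 and 2, the fitted equation can be split into one equation per sex, which may make the coefficients easier to read. Substituting the rounded coefficients from the summary, and assuming the coding 1 = Female, 2 = Male:

\[ \text{Female } (sex = 1): \quad tlc = (5.083 - 7.931) - 0.000508 \cdot age^2 + 0.0525 \cdot height = -2.848 - 0.000508 \cdot age^2 + 0.0525 \cdot height \]

\[ \text{Male } (sex = 2): \quad tlc = (5.083 - 2 \cdot 7.931) - 0.000508 \cdot age^2 + 2 \cdot 0.0525 \cdot height = -10.779 - 0.000508 \cdot age^2 + 0.1050 \cdot height \]

One consequence of entering sex as the number 1 or 2 is that the male height slope is forced to be exactly twice the female slope.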

Next, residual analysis and a Q-Q plot. The residuals are roughly symmetric about zero across the range of fitted values and are clustered fairly tightly, without any clear pattern. The Q-Q plot is mediocre and suggests skew (a short tail) at one end of the distribution.

plot(fitted(tlc.lm), resid(tlc.lm))
abline(0, 0)

qqnorm(tlc.lm$residuals)
qqline(tlc.lm$residuals)
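
The visual checks above could be supplemented with numeric ones. The following is a minimal sketch, not part of the original analysis: the helper name check_residuals is hypothetical, while shapiro.test() and rstandard() are base R functions. It returns a Shapiro-Wilk normality test on the residuals together with the standardized residuals, which can be scanned for unusually large values.

```r
# Hypothetical helper: numeric companions to the residual and Q-Q plots.
check_residuals <- function(fit) {
  list(
    # Shapiro-Wilk test; the null hypothesis is that residuals are normal
    shapiro = shapiro.test(resid(fit)),
    # Standardized residuals; values beyond about +/-2 deserve a closer look
    std_resid = rstandard(fit)
  )
}
```

It could be called as check_residuals(tlc.lm). With only 32 observations the test has limited power, so a non-significant result would not by itself confirm normality.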

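My sense is that the linear model could be improved. Conceptually, it does not make much sense to me for age to enter the model as a quadratic term, and the summary is consistent with that: age_squared is the only coefficient that is not statistically significant (p = 0.19), and the model explains only about half of the variance in tlc (multiple R-squared of 0.53, adjusted 0.48). My choice of terms was constrained by the assignment, which requires one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term.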
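
One way to probe that instinct would be to compare the quadratic-age model with an otherwise identical model that uses plain age. The following is a minimal sketch under that assumption: the linear-age model is not part of the original analysis, and the helper name compare_age_terms is hypothetical. It keeps sex numeric, as in the fit above, and returns the AIC of both models, where a lower value indicates the better trade-off between fit and complexity.

```r
# Hypothetical helper: compare quadratic vs. linear age by AIC.
# Assumes columns age, sex (numeric 1/2), height and tlc.
compare_age_terms <- function(d) {
  fit_quad <- lm(tlc ~ I(age^2) + sex + sex:height, data = d)
  fit_lin  <- lm(tlc ~ age + sex + sex:height, data = d)
  # Returns a data frame with one row per model (columns df and AIC)
  AIC(fit_quad, fit_lin)
}
```

It could be called as compare_age_terms(tlc). Both models have the same number of parameters, so the comparison comes down to which form of age fits the data more closely.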