Problem

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Solution

Data Gathering

Picture: Swallows | Audubon

For this example I will employ a data set provided by The Pennsylvania State University in which the researchers conducted a randomized experiment on 120 nestling bank swallows. In an underground burrow, they varied the percentage of oxygen at four different levels (13%, 15%, 17%, and 19%) and the percentage of carbon dioxide at five different levels (0%, 3%, 4.5%, 6%, and 9%). Under each of the resulting 5×4 = 20 experimental conditions, the researchers observed the total volume of air breathed per minute for each of 6 nestling bank swallows. They replicated the same randomized experiment on 120 adult bank swallows. In this way, they obtained the following data (allswallows.txt) on n = 240 swallows.

This data set can be found here.

The idea is as follows:

To have Response (y): percentage increase in “minute ventilation”, (Vent), i.e., total volume of air breathed per minute.
Potential predictor (x1): percentage of oxygen (O2) in the air the swallows breathe.
Potential predictor (x2): percentage of carbon dioxide (CO2)in the air the birds breathe.
Potential qualitative predictor (x3): (Type) 1 if bird is an adult, 0 if bird is a nestling.

Let’s take a look at some of the contents of the data set.

dim(allswallows.dat)

## [1] 240   5

Basically this data set contains 240 rows and 5 different variables.

str(allswallows.dat)

## 'data.frame':    240 obs. of  5 variables:
##  $ Bird: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Type: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Vent: int  -49 0 -98 148 49 49 -24 25 -123 222 ...
##  $ O2  : int  19 19 19 19 19 19 19 19 19 19 ...
##  $ CO2 : num  0 0 0 0 0 0 3 3 3 3 ...

As we can see, this data set contain 5 different variables: Bird, Type, Vent, O2 and CO3.

The description was presented above; but as you can see the dichotomous term is labeled Type which represents 1 if bird is an adult, 0 if the bird is a nestling.

Visualizations

Let’s create visualizations of the relationships in the data.

pairs(allswallows.dat, gap=0.5)

By looking at the above charts, we notice how it seems that there are no linear relationships in between variables.

Identifying Potential Predictors

For this, I will eliminate the variable Bird since it does not offer any information on the desired outcome; it’s just an incremental value with no valuable input into the model; by including it, it will add noise to our model.

The idea is to predict Vent; hence we will include it as our response variable; thus leaving us with Type, O2 and C02 as possible predictors of Vent.

For this, I will employ the The Backward Elimination Process as follows:

allswallows.lm <- lm(Vent ~ Type + O2 + CO2, 
                     data = allswallows.dat)
summary(allswallows.lm)

## 
## Call:
## lm(formula = Vent ~ Type + O2 + CO2, data = allswallows.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -385.62 -107.76  -12.18  108.59  427.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  136.767     79.334   1.724    0.086 .  
## Type           9.925     21.308   0.466    0.642    
## O2            -8.834      4.765  -1.854    0.065 .  
## CO2           32.258      3.551   9.084   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 165 on 236 degrees of freedom
## Multiple R-squared:  0.2675, Adjusted R-squared:  0.2581 
## F-statistic: 28.72 on 3 and 236 DF,  p-value: 7.219e-16

From the above, we notice how the Type has a significantly high value; and also the \(R^2\) value is considerable low; let’s remove this from our list of predictors and see what happens.

allswallows.lm <- update(allswallows.lm, .~. - Type, 
                         data=allswallows.dat)
summary(allswallows.lm)

## 
## Call:
## lm(formula = Vent ~ O2 + CO2, data = allswallows.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -380.66 -104.74   -8.32  113.50  422.79 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  141.730     78.485   1.806   0.0722 .  
## O2            -8.834      4.757  -1.857   0.0645 .  
## CO2           32.258      3.545   9.099   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 164.8 on 237 degrees of freedom
## Multiple R-squared:  0.2668, Adjusted R-squared:  0.2606 
## F-statistic: 43.12 on 2 and 237 DF,  p-value: < 2.2e-16

From the above, we notice how the \(R^2\) is still very low in comparison to our desired results. Also, as you can see, the C02 has a very good p-value; and O2 shows a slightly over our 0.05 defined threshold value but is not much higher so I “will not pay too much attention to it” since I would like to have a function in terms of O2 and CO2; that function will be as follows:

\(Vent = 141.73 - 8.834 \times O2 + 32.258 \times CO2\)

Please note that from the above collected information, we are starting to reject this model due to some statistical signs that indicate that this model might not be appropriate due to low R square value, “high” p-value for O2, the quantiles being quite apart from zero and by having a relatively high max and low min values.

Residual Analysis

First, let’s plot our model.

plot(allswallows.lm)

Let’s plot our residuals from the linear model from our original data set.

From the above we notice a big dispersion of the residual values data points not necessarily centered near zero.

Distribution Histogram

Let’s take a look at the distribution of our linear model to see what kind of insights we could get.

From here we can observe how the majority of values are on the left side of zero and also we notice again the spread of the values from -400 to 400; in theory these values should be smaller than that since this values are quite larger.

Q-Q plot

Let’s take a look at the Q-Q plot in order to have a better understanding of our data set.

Now, something interesting on this plot is that it seems that the residuals follow the indicated line. In this plot, we can see a bit more of a pattern and some obvious non-linearities, leading us to be slightly more cautious about concluding that the residuals are normally distributed. We should not necessarily reject the model based on this one test, but the results should serve as a reminder that all models are imperfect.

Conclusion

In conclusion, this model is not appropriate to model this data set; a more suitable linear model should be employed.