The Task

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

For this task I decided to work with a slightly modified version of the iris dataset.

Load the Data

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width.SQR  is.Setosa
         5.1          3.5           1.4             0.04       TRUE
         4.9          3.0           1.4             0.04       TRUE
         4.7          3.2           1.3             0.04       TRUE
         4.6          3.1           1.5             0.04       TRUE
         5.0          3.6           1.4             0.04       TRUE
         5.4          3.9           1.7             0.16       TRUE
         4.6          3.4           1.4             0.09       TRUE
         5.0          3.4           1.5             0.04       TRUE
         4.4          2.9           1.4             0.04       TRUE
         4.9          3.1           1.5             0.01       TRUE
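
The code that produced this table is not shown, so the following is only a sketch of one way to build it. It assumes the data frame `d` (the name used in the model call) comes from R's built-in `iris` data, with `Petal.Width` squared and `Species` collapsed to a logical `is.Setosa` column.

```r
# Assumption: d is derived from the built-in iris data frame.
d <- data.frame(
  Sepal.Length    = iris$Sepal.Length,
  Sepal.Width     = iris$Sepal.Width,
  Petal.Length    = iris$Petal.Length,
  Petal.Width.SQR = iris$Petal.Width^2,         # the squared column
  is.Setosa       = iris$Species == "setosa"    # TRUE for setosa, FALSE otherwise
)

# Show the first ten rows, as in the table above.
head(d, 10)
```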

Visualize the Data

In the pairwise plots of the data, most pairs of variables show a reasonably strong relationship. Most interestingly, the dichotomous “is.Setosa” variable really does split the other variables into two distinct groups.
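
A pairwise scatterplot matrix of this kind could be drawn roughly as follows. This is a sketch: the original plotting call is not shown, `d` is rebuilt from `iris` as an assumption, and the colours are an illustrative choice.

```r
# Assumption: d is rebuilt from the built-in iris data frame.
d <- data.frame(
  Sepal.Length    = iris$Sepal.Length,
  Sepal.Width     = iris$Sepal.Width,
  Petal.Length    = iris$Petal.Length,
  Petal.Width.SQR = iris$Petal.Width^2,
  is.Setosa       = iris$Species == "setosa"
)

# data.matrix() turns the logical column into 0/1 so it can be plotted.
m <- data.matrix(d)

# One panel per pair of columns; setosa points are drawn in red.
pairs(m, col = ifelse(d$is.Setosa, "red", "black"))
```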

Create the Model

We’re going to build a model that predicts Sepal.Length from all the other columns in the table.
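
The formula below is copied from the `Call:` line of the printed summary. The construction of `d` is an assumption (rebuilt from the built-in `iris` data), and `fit` is just an illustrative name.

```r
# Assumption: d is rebuilt from the built-in iris data frame.
d <- data.frame(
  Sepal.Length    = iris$Sepal.Length,
  Sepal.Width     = iris$Sepal.Width,
  Petal.Length    = iris$Petal.Length,
  Petal.Width.SQR = iris$Petal.Width^2,
  is.Setosa       = iris$Species == "setosa"
)

# Fit the multiple regression and print the coefficient table.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width.SQR +
            is.Setosa, data = d)
summary(fit)
```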

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width.SQR + 
##     is.Setosa, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86863 -0.21543  0.01155  0.19733  0.80887 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.45280    0.28958   5.017 1.51e-06 ***
## Sepal.Width      0.54886    0.08256   6.648 5.62e-10 ***
## Petal.Length     0.74708    0.06173  12.103  < 2e-16 ***
## Petal.Width.SQR -0.14469    0.03521  -4.109 6.62e-05 ***
## is.SetosaTRUE    0.58983    0.18120   3.255  0.00141 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3126 on 145 degrees of freedom
## Multiple R-squared:  0.8613, Adjusted R-squared:  0.8575 
## F-statistic: 225.1 on 4 and 145 DF,  p-value: < 2.2e-16

All four predictors are statistically significant: each p-value is below 0.01, and the model explains about 86% of the variance in Sepal.Length (adjusted R-squared 0.8575).
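
To make the coefficients concrete: each estimate is the expected change in Sepal.Length for a one-unit increase in that predictor with the others held fixed, and the is.SetosaTRUE estimate is the shift applied to setosa flowers. The sketch below applies the printed estimates by hand to the first row of the data table. The vector names are illustrative only.

```r
# Estimates copied from the printed coefficient table.
b <- c(intercept       =  1.45280,
       sepal_width     =  0.54886,
       petal_length    =  0.74708,
       petal_width_sqr = -0.14469,
       is_setosa       =  0.58983)

# First row of the data: Sepal.Width 3.5, Petal.Length 1.4,
# Petal.Width.SQR 0.04, is.Setosa TRUE (coded as 1). The leading 1
# multiplies the intercept.
x <- c(1, 3.5, 1.4, 0.04, 1)

predicted <- sum(b * x)      # roughly 5.00
observed  <- 5.1             # Sepal.Length in the first row
residual  <- observed - predicted

predicted
residual
```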

I’m a little surprised that this holds for Petal.Width.SQR, which I created by arbitrarily squaring the Petal.Width column before building the model.

It’s not surprising to me that is.Setosa has the weakest evidence of the four predictors (the largest p-value, 0.00141). I took a variable that originally had three factor levels, one per species, and made it binary. I suspect the fit would be much better if I had kept the levels intact, as in the original dataset.
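
One way to check that suspicion, and to add the dichotomous-by-quantitative interaction the task asks for, is to fit two variations and compare adjusted R-squared. This is a sketch under assumptions: `d` is rebuilt from the built-in `iris` data, the names `fit_species` and `fit_inter` are illustrative, and pairing is.Setosa with Petal.Length is an arbitrary choice of interaction.

```r
# Assumption: d is rebuilt from the built-in iris data frame,
# with Species kept alongside is.Setosa for comparison.
d <- data.frame(
  Sepal.Length    = iris$Sepal.Length,
  Sepal.Width     = iris$Sepal.Width,
  Petal.Length    = iris$Petal.Length,
  Petal.Width.SQR = iris$Petal.Width^2,
  is.Setosa       = iris$Species == "setosa",
  Species         = iris$Species
)

# Baseline: the model from the summary output.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width.SQR +
            is.Setosa, data = d)

# Variation 1: keep all three species levels instead of the binary flag.
fit_species <- lm(Sepal.Length ~ Sepal.Width + Petal.Length +
                    Petal.Width.SQR + Species, data = d)

# Variation 2: let the Petal.Length slope differ for setosa flowers.
# Petal.Length * is.Setosa expands to both main effects plus their interaction.
fit_inter <- lm(Sepal.Length ~ Sepal.Width + Petal.Length * is.Setosa +
                  Petal.Width.SQR, data = d)

# Compare the three fits by adjusted R-squared.
sapply(list(baseline = fit, species = fit_species, interaction = fit_inter),
       function(m) summary(m)$adj.r.squared)
```

In the interaction model, the interaction coefficient is the difference in the Petal.Length slope between setosa and non-setosa flowers.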

Examine the Residuals

Here the histogram shows residuals that are approximately normally distributed. The scatter plot shows no discernible pattern, and the qqnorm plot shows an excellent fit. I suspect that this model could be improved by undoing the modifications I made to the dataset (squaring one column and making another dichotomous), but I would be confident using this model as is to predict Sepal.Length.
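
The three residual plots described here could be produced with base R graphics along these lines. This is a sketch: the original plotting code is not shown, and `d` and `fit` are rebuilt under the assumption that the data comes from the built-in `iris` data frame.

```r
# Assumption: d is rebuilt from the built-in iris data frame.
d <- data.frame(
  Sepal.Length    = iris$Sepal.Length,
  Sepal.Width     = iris$Sepal.Width,
  Petal.Length    = iris$Petal.Length,
  Petal.Width.SQR = iris$Petal.Width^2,
  is.Setosa       = iris$Species == "setosa"
)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width.SQR +
            is.Setosa, data = d)

res <- resid(fit)

op <- par(mfrow = c(1, 3))   # three panels side by side

# Histogram: should look roughly bell-shaped and centred on zero.
hist(res, main = "Histogram of residuals", xlab = "Residual")

# Residuals vs. fitted values: should show no trend or funnel shape.
plot(fitted(fit), res, xlab = "Fitted Sepal.Length", ylab = "Residual",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points should fall close to the reference line.
qqnorm(res)
qqline(res)

par(op)                      # restore the previous plot layout
```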