Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
For this task I decided to work with a slightly modified version of the iris dataset.
library(datasets)
library(knitr)
data(iris)
d <- data.frame(iris)
#convert "Species" to dichotomous and rename
d$Species <- d$Species == "setosa"
colnames(d)[5] <- "is.Setosa"
#square Petal.Width to create the quadratic term (the next line renames the column)
d$Petal.Width <- d$Petal.Width^2
colnames(d)[4] <- "Petal.Width.SQR"
#examine the data
kable(head(d, 10))

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width.SQR | is.Setosa |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.04 | TRUE |
| 4.9 | 3.0 | 1.4 | 0.04 | TRUE |
| 4.7 | 3.2 | 1.3 | 0.04 | TRUE |
| 4.6 | 3.1 | 1.5 | 0.04 | TRUE |
| 5.0 | 3.6 | 1.4 | 0.04 | TRUE |
| 5.4 | 3.9 | 1.7 | 0.16 | TRUE |
| 4.6 | 3.4 | 1.4 | 0.09 | TRUE |
| 5.0 | 3.4 | 1.5 | 0.04 | TRUE |
| 4.4 | 2.9 | 1.4 | 0.04 | TRUE |
| 4.9 | 3.1 | 1.5 | 0.01 | TRUE |
When we visualize the data, the sub-plots show that most pairs of variables are clearly related. Most interestingly, the dichotomous “is.Setosa” variable really does split the other variables into two distinct groups.
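A minimal sketch of how such a set of sub-plots could be drawn, assuming a base-R scatterplot matrix was used (the original plotting call is not shown, and the colours here are an arbitrary choice). The data frame is rebuilt first so the snippet runs on its own.

```r
# Rebuild the modified data frame so this sketch is self-contained.
d <- data.frame(iris)
d$Species <- d$Species == "setosa"
colnames(d)[5] <- "is.Setosa"
d$Petal.Width <- d$Petal.Width^2
colnames(d)[4] <- "Petal.Width.SQR"

# Colour each observation by the dichotomous variable (assumed colours).
point.col <- ifelse(d$is.Setosa, "red", "grey40")

# Scatterplot matrix of all five columns; the logical column is plotted as 0/1.
pairs(d, col = point.col, pch = 19)
```

Colouring by is.Setosa makes it easier to judge whether the setosa flowers form a separate cluster in each panel.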
We’re going to try to build a model that predicts sepal length using all the other columns in the table.
m.lm <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width.SQR + is.Setosa, data = d)
summary(m.lm)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width.SQR +
## is.Setosa, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.86863 -0.21543  0.01155  0.19733  0.80887
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      1.45280    0.28958   5.017 1.51e-06 ***
## Sepal.Width      0.54886    0.08256   6.648 5.62e-10 ***
## Petal.Length     0.74708    0.06173  12.103  < 2e-16 ***
## Petal.Width.SQR -0.14469    0.03521  -4.109 6.62e-05 ***
## is.SetosaTRUE    0.58983    0.18120   3.255  0.00141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3126 on 145 degrees of freedom
## Multiple R-squared: 0.8613, Adjusted R-squared: 0.8575
## F-statistic: 225.1 on 4 and 145 DF, p-value: < 2.2e-16
We see that all four independent variables are statistically significant predictors of Sepal.Length: every p-value is below 0.01, and the model explains about 86% of the variance (adjusted R-squared 0.8575).
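To read the coefficients more concretely, it can help to put each estimate next to a confidence interval. The sketch below is one way to do that, not part of the original analysis; the name coef.table is my own. As a reading guide, the Sepal.Width estimate of 0.549 means that predicted sepal length rises by about 0.55 cm for each additional cm of sepal width when the other predictors are held fixed, and the is.SetosaTRUE estimate of 0.590 is the predicted shift in sepal length for setosa flowers relative to the other two species under the same condition.

```r
# Rebuild the data and refit the model so this sketch is self-contained.
d <- data.frame(iris)
d$Species <- d$Species == "setosa"
colnames(d)[5] <- "is.Setosa"
d$Petal.Width <- d$Petal.Width^2
colnames(d)[4] <- "Petal.Width.SQR"

m.lm <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width.SQR + is.Setosa,
           data = d)

# Point estimates alongside 95% confidence intervals.
coef.table <- cbind(Estimate = coef(m.lm), confint(m.lm, level = 0.95))
print(round(coef.table, 3))
```

An interval that excludes zero agrees with the small p-value reported for that term in the summary.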
I’m a little surprised that this holds for Petal.Width.SQR, a column that I squared arbitrarily before building the model.
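For comparison, a quadratic term is more commonly entered alongside its linear term using I() in the formula, and a dichotomous vs. quantitative interaction is written with a colon. The sketch below shows that form as an assumption about how the model could be extended; it is not the model fitted in this report, and the names d.alt and m.alt are hypothetical.

```r
# Start from the unmodified iris data and add only the dichotomous indicator.
d.alt <- data.frame(iris)
d.alt$is.Setosa <- d.alt$Species == "setosa"

# Linear and quadratic terms for Petal.Width, plus an interaction that lets
# the Petal.Length slope differ between setosa and non-setosa flowers.
m.alt <- lm(Sepal.Length ~ Sepal.Width + Petal.Length +
              Petal.Width + I(Petal.Width^2) +
              is.Setosa + is.Setosa:Petal.Length,
            data = d.alt)

summary(m.alt)
```

In a model of this shape, the is.Setosa coefficient is the shift in intercept for setosa flowers, and the interaction coefficient is the difference in the Petal.Length slope between the two groups.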
It’s not surprising to me that is.Setosa is the weakest of the four predictors, with the largest p-value (0.00141). I’ve taken a variable that originally had three factor levels and made it binary. I suspect that the model would fit better if I had kept the levels intact, as in the original dataset.
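That suspicion can be checked directly, because the binary model is nested inside a model that keeps all three Species levels. The following is a sketch of one such check using an F-test; the names d.cmp, m.binary, and m.factor are hypothetical, and I(Petal.Width^2) is used here to reproduce the squared column.

```r
# Keep the original Species factor and add the binary indicator beside it.
d.cmp <- data.frame(iris)
d.cmp$is.Setosa <- d.cmp$Species == "setosa"

# Same model as in the report, written with I() for the squared column.
m.binary <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + I(Petal.Width^2) + is.Setosa,
               data = d.cmp)

# Alternative that keeps all three Species levels.
m.factor <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + I(Petal.Width^2) + Species,
               data = d.cmp)

# F-test for whether separating versicolor from virginica improves the fit.
anova(m.binary, m.factor)
```

A small p-value in the anova table would support keeping the full factor; a large one would suggest the binary version loses little.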
Here we see from the histogram that the residuals are approximately normally distributed. The scatter plot shows no discernible pattern, and the qqnorm plot shows an excellent fit. I suspect that this model could be improved by undoing the modifications that I made to the dataset (squaring one column and making another dichotomous), but I would be confident using this model as is to predict Sepal.Length.
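The three residual plots described above can be produced along the following lines. This is a sketch assuming base-R graphics, since the original plotting code is not shown; the Shapiro-Wilk test at the end is an optional addition of mine and was not part of the original analysis.

```r
# Rebuild the data and refit the model so this sketch is self-contained.
d <- data.frame(iris)
d$Species <- d$Species == "setosa"
colnames(d)[5] <- "is.Setosa"
d$Petal.Width <- d$Petal.Width^2
colnames(d)[4] <- "Petal.Width.SQR"

m.lm <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width.SQR + is.Setosa,
           data = d)
res <- residuals(m.lm)

# Three diagnostic panels side by side.
op <- par(mfrow = c(1, 3))
hist(res, main = "Histogram of residuals", xlab = "Residual")
plot(fitted(m.lm), res, main = "Residuals vs. fitted",
     xlab = "Fitted Sepal.Length", ylab = "Residual")
abline(h = 0, lty = 2)   # reference line at zero
qqnorm(res)
qqline(res)              # reference line for the normal Q-Q plot
par(op)

# Optional formal check of normality (assumed addition).
sw <- shapiro.test(res)
print(sw)
```

A residuals vs. fitted panel with no curve or funnel shape, together with Q-Q points that follow the reference line, is what supports treating the linear model as appropriate.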