Question

Using the ‘iris’ dataset, build a multiple regression that predicts the sepal.length. Use all remaining variables including ‘species.’ Interpret the results and residuals.

Answer

First we’ll create the model using Sepal.Length as the response variable and all other variables in the data set as explanatory variables. From there we’ll display a histogram and scatterplot of the residuals from the model.

# Create the model
model = lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data=iris)
summary(model)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + 
##     Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79424 -0.21874  0.00899  0.20255  0.73103 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
## Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
## Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
## Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
## Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
## Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared:  0.8673, Adjusted R-squared:  0.8627 
## F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16
# Generate Histogram and Plot
hist(model$residuals, xlab = "Residuals", xlim = range(seq(-1,1,0.1)), main = "Histogram of Residuals")

plot(iris$Sepal.Length, model$residuals, xlab = "Sepal Length", ylab = "Residuals", main = "Scatterplot of Residuals")

As we see from the linear model the equation is:
\(Sepal\ Length=2.171+0.496*Sepal\ Width+0.829*Petal\ Length-0.315*Petal\ Width-0.724*Species:Versicolor-1.024*Species:Virginica\)
The histogram and the scatterplot show the residuals are distributed normally with no significant outliers. We see the \(R^{2}_{adj}=0.863\). Next, let’s plot the other variables against the residuals.

plot(iris$Sepal.Width, model$residuals, xlab = "Sepal Width", ylab = "Residuals", main = "Scatterplot of Sepal Width vs. Residuals")

plot(iris$Petal.Length, model$residuals, xlab = "Petal Length", ylab = "Residuals", main = "Scatterplot of Petal Length vs. Residuals")

plot(iris$Petal.Width, model$residuals, xlab = "Petal Width", ylab = "Residuals", main = "Scatterplot of Petal Width vs. Residuals")

qplot(data=iris, x=Species, y=model$residuals, geom="boxplot", main="Species vs. Residuals")

As we see with the residual plots the residuals are spread similarly across species and randomly distributed when looking at the other variables.