Using the ‘iris’ dataset, build a multiple regression that predicts the sepal.length. Use all remaining variables including ‘species.’ Interpret the results and residuals.
First we’ll create the model using Sepal.Length as the response variable and all other variables in the data set as explanatory variables. From there we’ll display a histogram and scatterplot of the residuals from the model.
# Create the model
model = lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data=iris)
summary(model)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width +
## Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
## Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Generate Histogram and Plot
hist(model$residuals, xlab = "Residuals", xlim = range(seq(-1,1,0.1)), main = "Histogram of Residuals")
plot(iris$Sepal.Length, model$residuals, xlab = "Sepal Length", ylab = "Residuals", main = "Scatterplot of Residuals")
As we see from the linear model the equation is:
\(Sepal\ Length=2.171+0.496*Sepal\ Width+0.829*Petal\ Length-0.315*Petal\ Width-0.724*Species:Versicolor-1.024*Species:Virginica\)
The histogram and the scatterplot show the residuals are distributed normally with no significant outliers. We see the \(R^{2}_{adj}=0.863\). Next, let’s plot the other variables against the residuals.
plot(iris$Sepal.Width, model$residuals, xlab = "Sepal Width", ylab = "Residuals", main = "Scatterplot of Sepal Width vs. Residuals")
plot(iris$Petal.Length, model$residuals, xlab = "Petal Length", ylab = "Residuals", main = "Scatterplot of Petal Length vs. Residuals")
plot(iris$Petal.Width, model$residuals, xlab = "Petal Width", ylab = "Residuals", main = "Scatterplot of Petal Width vs. Residuals")
qplot(data=iris, x=Species, y=model$residuals, geom="boxplot", main="Species vs. Residuals")
As we see with the residual plots the residuals are spread similarly across species and randomly distributed when looking at the other variables.