Modified from Introduction to Multiple Linear Regression in R by Michael Hahsler.
The famous iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
The data set contains 150 rows and 5 columns. The variables include: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
This project aims to predict Pedal.Width using the rest of the 4 variables as predictors.
library(tidyverse)# Preview the head of the data
head(iris)# Summary of the data
glimpse(iris)Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, ...
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, ...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, ...
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, ...
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, s...
summary(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
set.seed(200)Load and shuffle data (flowers are in order by species).
data(iris)
iris <- iris[sample(1:nrow(iris)),]
plot(iris, col = iris$Species)Make the data a little messy and add a useless feature.
iris[,1] <- iris[,1] + rnorm(nrow(iris))
iris[,2] <- iris[,2] + rnorm(nrow(iris))
iris[,3] <- iris[,3] + rnorm(nrow(iris))
iris <- cbind(iris[,-5], useless = mean(iris[,1]) + rnorm(nrow(iris)), Species = iris[,5])
plot(iris, col = iris$Species)summary(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width useless
Min. :2.989 Min. :0.6024 Min. :-0.9748 Min. :0.100 Min. :3.283
1st Qu.:4.942 1st Qu.:2.4882 1st Qu.: 2.1543 1st Qu.:0.300 1st Qu.:5.160
Median :5.777 Median :3.1525 Median : 4.0687 Median :1.300 Median :5.776
Mean :5.813 Mean :3.1597 Mean : 3.7479 Mean :1.199 Mean :5.835
3rd Qu.:6.726 3rd Qu.:3.7842 3rd Qu.: 5.2465 3rd Qu.:1.800 3rd Qu.:6.468
Max. :9.608 Max. :5.7160 Max. : 7.5299 Max. :2.500 Max. :9.087
Species
setosa :50
versicolor:50
virginica :50
Create some training and learning data.
train <- iris[1:100,]
test <- iris[101:150,]Let’s fit our first linear regression model. Here is the equation that we attempt to solve:
\[ Petal.Width = B_0 + B_1 \times Sepal.Length + B_2 \times Sepal.Width + B_3 \times Petal.Length + B_4 \times uesless \]
model1 <- lm(Petal.Width ~ . - Species, data = train)
summary(model1)
Call:
lm(formula = Petal.Width ~ . - Species, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.81597 -0.29078 -0.00143 0.24502 1.44318
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.23150 0.36042 0.642 0.522
Sepal.Length 0.01156 0.04052 0.285 0.776
Sepal.Width -0.02343 0.03471 -0.675 0.501
Petal.Length 0.28989 0.02277 12.733 <2e-16 ***
useless -0.01605 0.03971 -0.404 0.687
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4059 on 95 degrees of freedom
Multiple R-squared: 0.7224, Adjusted R-squared: 0.7107
F-statistic: 61.81 on 4 and 95 DF, p-value: < 2.2e-16
plot(model1, which = 1)We try simpler models.
model2 <- lm(Petal.Width ~ . - Species - useless, data = train)
summary(model2)
Call:
lm(formula = Petal.Width ~ . - Species - useless, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.81314 -0.28473 -0.00658 0.24585 1.44359
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.12267 0.23857 0.514 0.608
Sepal.Length 0.01512 0.03938 0.384 0.702
Sepal.Width -0.02516 0.03429 -0.734 0.465
Petal.Length 0.29003 0.02266 12.797 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4041 on 96 degrees of freedom
Multiple R-squared: 0.7219, Adjusted R-squared: 0.7132
F-statistic: 83.08 on 3 and 96 DF, p-value: < 2.2e-16
No intercept
model3 <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length - 1,
data = train)
summary(model3)
Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length -
1, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.75794 -0.28316 -0.01408 0.26425 1.43453
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Sepal.Length 0.03167 0.02262 1.40 0.165
Sepal.Width -0.01530 0.02832 -0.54 0.590
Petal.Length 0.28767 0.02211 13.01 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4026 on 97 degrees of freedom
Multiple R-squared: 0.921, Adjusted R-squared: 0.9185
F-statistic: 376.9 on 3 and 97 DF, p-value: < 2.2e-16
Very simple model.
model4 <- lm(Petal.Width ~ Petal.Length - 1,
data = train)
summary(model4)
Call:
lm(formula = Petal.Width ~ Petal.Length - 1, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.87189 -0.27505 0.02271 0.28920 1.45757
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Petal.Length 0.317804 0.009481 33.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4034 on 99 degrees of freedom
Multiple R-squared: 0.919, Adjusted R-squared: 0.9182
F-statistic: 1124 on 1 and 99 DF, p-value: < 2.2e-16
Compare models (Null hypothesis: all treatments=models have the same effect). This only works for nested models. Models are nested only if one model contains all the predictors of the other model.
anova(model1, model2, model3, model4)Analysis of Variance Table
Model 1: Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length + useless +
Species) - Species
Model 2: Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length + useless +
Species) - Species - useless
Model 3: Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length - 1
Model 4: Petal.Width ~ Petal.Length - 1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 95 15.649
2 96 15.676 -1 -0.02692 0.1634 0.6869
3 97 15.720 -1 -0.04317 0.2621 0.6099
4 99 16.110 -2 -0.39057 1.1855 0.3101
Models 1 is not significantly better than model 2. Model 2 is not significantly better than model 3. Model 3 is not significantly better than model 4! Use model 4 (simplest model)