Modified from Introduction to Multiple Linear Regression in R by Michael Hahsler.

Overview

The famous iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

The data set contains 150 rows and 5 columns. The variables include: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

This project aims to predict Pedal.Width using the rest of the 4 variables as predictors.

library(tidyverse)
# Preview the head of the data
head(iris)
# Summary of the data
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, ...
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, ...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, ...
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, ...
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, s...
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

A First Model

set.seed(200)

Load and shuffle data (flowers are in order by species).

data(iris)
iris <- iris[sample(1:nrow(iris)),]
plot(iris, col = iris$Species)

Make the data a little messy and add a useless feature.

iris[,1] <- iris[,1] + rnorm(nrow(iris))
iris[,2] <- iris[,2] + rnorm(nrow(iris))
iris[,3] <- iris[,3] + rnorm(nrow(iris))
iris <- cbind(iris[,-5], useless = mean(iris[,1]) + rnorm(nrow(iris)), Species = iris[,5])
plot(iris, col = iris$Species)

summary(iris)
  Sepal.Length    Sepal.Width      Petal.Length      Petal.Width       useless     
 Min.   :2.989   Min.   :0.6024   Min.   :-0.9748   Min.   :0.100   Min.   :3.283  
 1st Qu.:4.942   1st Qu.:2.4882   1st Qu.: 2.1543   1st Qu.:0.300   1st Qu.:5.160  
 Median :5.777   Median :3.1525   Median : 4.0687   Median :1.300   Median :5.776  
 Mean   :5.813   Mean   :3.1597   Mean   : 3.7479   Mean   :1.199   Mean   :5.835  
 3rd Qu.:6.726   3rd Qu.:3.7842   3rd Qu.: 5.2465   3rd Qu.:1.800   3rd Qu.:6.468  
 Max.   :9.608   Max.   :5.7160   Max.   : 7.5299   Max.   :2.500   Max.   :9.087  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Create some training and learning data.

train <- iris[1:100,]
test <- iris[101:150,]

Let’s fit our first linear regression model. Here is the equation that we attempt to solve:

\[ Petal.Width = B_0 + B_1 \times Sepal.Length + B_2 \times Sepal.Width + B_3 \times Petal.Length + B_4 \times uesless \]

model1 <- lm(Petal.Width ~ . - Species, data = train)
summary(model1)

Call:
lm(formula = Petal.Width ~ . - Species, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.81597 -0.29078 -0.00143  0.24502  1.44318 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.23150    0.36042   0.642    0.522    
Sepal.Length  0.01156    0.04052   0.285    0.776    
Sepal.Width  -0.02343    0.03471  -0.675    0.501    
Petal.Length  0.28989    0.02277  12.733   <2e-16 ***
useless      -0.01605    0.03971  -0.404    0.687    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4059 on 95 degrees of freedom
Multiple R-squared:  0.7224,    Adjusted R-squared:  0.7107 
F-statistic: 61.81 on 4 and 95 DF,  p-value: < 2.2e-16
plot(model1, which = 1)

Comparing Nested Models

We try simpler models.

model2 <- lm(Petal.Width ~ . - Species - useless, data = train)
summary(model2)

Call:
lm(formula = Petal.Width ~ . - Species - useless, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.81314 -0.28473 -0.00658  0.24585  1.44359 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.12267    0.23857   0.514    0.608    
Sepal.Length  0.01512    0.03938   0.384    0.702    
Sepal.Width  -0.02516    0.03429  -0.734    0.465    
Petal.Length  0.29003    0.02266  12.797   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4041 on 96 degrees of freedom
Multiple R-squared:  0.7219,    Adjusted R-squared:  0.7132 
F-statistic: 83.08 on 3 and 96 DF,  p-value: < 2.2e-16

No intercept

model3 <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length - 1,
             data = train)
summary(model3)

Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length - 
    1, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.75794 -0.28316 -0.01408  0.26425  1.43453 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
Sepal.Length  0.03167    0.02262    1.40    0.165    
Sepal.Width  -0.01530    0.02832   -0.54    0.590    
Petal.Length  0.28767    0.02211   13.01   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4026 on 97 degrees of freedom
Multiple R-squared:  0.921, Adjusted R-squared:  0.9185 
F-statistic: 376.9 on 3 and 97 DF,  p-value: < 2.2e-16

Very simple model.

model4 <- lm(Petal.Width ~ Petal.Length - 1,
             data = train)
summary(model4)

Call:
lm(formula = Petal.Width ~ Petal.Length - 1, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.87189 -0.27505  0.02271  0.28920  1.45757 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
Petal.Length 0.317804   0.009481   33.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4034 on 99 degrees of freedom
Multiple R-squared:  0.919, Adjusted R-squared:  0.9182 
F-statistic:  1124 on 1 and 99 DF,  p-value: < 2.2e-16

Compare models (Null hypothesis: all treatments=models have the same effect). This only works for nested models. Models are nested only if one model contains all the predictors of the other model.

anova(model1, model2, model3, model4)
Analysis of Variance Table

Model 1: Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length + useless + 
    Species) - Species
Model 2: Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length + useless + 
    Species) - Species - useless
Model 3: Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length - 1
Model 4: Petal.Width ~ Petal.Length - 1
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     95 15.649                           
2     96 15.676 -1  -0.02692 0.1634 0.6869
3     97 15.720 -1  -0.04317 0.2621 0.6099
4     99 16.110 -2  -0.39057 1.1855 0.3101

Models 1 is not significantly better than model 2. Model 2 is not significantly better than model 3. Model 3 is not significantly better than model 4! Use model 4 (simplest model)

LS0tCnRpdGxlOiAiSXJpcyDigJQgTGluZWFyIFJlZ3Jlc3Npb24iCm91dHB1dDogCiAgaHRtbF9ub3RlYm9vazoKICAgIGhpZ2hsaWdodDogcHlnbWVudHMKICAgIGNvZGVfZm9sZGluZzogbnVsbAotLS0KICAKKipNb2RpZmllZCBmcm9tIFtJbnRyb2R1Y3Rpb24gdG8gTXVsdGlwbGUgTGluZWFyIFJlZ3Jlc3Npb24gaW4gUl0oaHR0cDovL21pY2hhZWwuaGFoc2xlci5uZXQvU01VL0VNSVM3MzMxL1IvcmVncmVzc2lvbi5odG1sKSBieSBNaWNoYWVsIEhhaHNsZXIuKioKICAKICAKIyMgT3ZlcnZpZXcKClRoZSBmYW1vdXMgKippcmlzKiogZGF0YSBzZXQgZ2l2ZXMgdGhlIG1lYXN1cmVtZW50cyBpbiBjZW50aW1ldGVycyBvZiB0aGUgdmFyaWFibGVzIHNlcGFsIGxlbmd0aCBhbmQgd2lkdGggYW5kIHBldGFsIGxlbmd0aCBhbmQgd2lkdGgsIHJlc3BlY3RpdmVseSwgZm9yIDUwIGZsb3dlcnMgZnJvbSBlYWNoIG9mIDMgc3BlY2llcyBvZiBpcmlzLiBUaGUgc3BlY2llcyBhcmUgSXJpcyBzZXRvc2EsIHZlcnNpY29sb3IsIGFuZCB2aXJnaW5pY2EuCgpUaGUgZGF0YSBzZXQgY29udGFpbnMgMTUwIHJvd3MgYW5kIDUgY29sdW1ucy4gVGhlIHZhcmlhYmxlcyBpbmNsdWRlOiBgU2VwYWwuTGVuZ3RoYCwgYFNlcGFsLldpZHRoYCwgYFBldGFsLkxlbmd0aGAsIGBQZXRhbC5XaWR0aGAsIGFuZCBgU3BlY2llc2AuIAoKVGhpcyBwcm9qZWN0IGFpbXMgdG8gcHJlZGljdCBgUGVkYWwuV2lkdGhgIHVzaW5nIHRoZSByZXN0IG9mIHRoZSA0IHZhcmlhYmxlcyBhcyBwcmVkaWN0b3JzLgoKYGBge3J9CiMgTG9hZCBsaWJyYXJpZXMKbGlicmFyeSh0aWR5dmVyc2UpCmBgYAoKYGBge3J9CiMgUHJldmlldyB0aGUgaGVhZCBvZiB0aGUgZGF0YQpoZWFkKGlyaXMpCmBgYAoKYGBge3J9CiMgU3VtbWFyeSBvZiB0aGUgZGF0YQpnbGltcHNlKGlyaXMpCnN1bW1hcnkoaXJpcykKYGBgCiAgCiMjIEEgRmlyc3QgTW9kZWwKCmBgYHtyfQpzZXQuc2VlZCgyMDApCmBgYAoKTG9hZCBhbmQgc2h1ZmZsZSBkYXRhIChmbG93ZXJzIGFyZSBpbiBvcmRlciBieSBzcGVjaWVzKS4KCmBgYHtyfQpkYXRhKGlyaXMpCmlyaXMgPC0gaXJpc1tzYW1wbGUoMTpucm93KGlyaXMpKSxdCnBsb3QoaXJpcywgY29sID0gaXJpcyRTcGVjaWVzKQpgYGAKCk1ha2UgdGhlIGRhdGEgYSBsaXR0bGUgbWVzc3kgYW5kIGFkZCBhIHVzZWxlc3MgZmVhdHVyZS4KCmBgYHtyfQppcmlzWywxXSA8LSBpcmlzWywxXSArIHJub3JtKG5yb3coaXJpcykpCmlyaXNbLDJdIDwtIGlyaXNbLDJdICsgcm5vcm0obnJvdyhpcmlzKSkKaXJpc1ssM10gPC0gaXJpc1ssM10gKyBybm9ybShucm93KGlyaXMpKQppcmlzIDwtIGNiaW5kKGlyaXNbLC01XSwgdXNlbGVzcyA9IG1lYW4oaXJpc1ssMV0pICsgcm5vcm0obnJvdyhpcmlzKSksIFNwZWNpZXMgPSBpcmlzWyw1XSkKcGxvdChpcmlzLCBjb2wgPSBpcmlzJFNwZWNpZXMpCmBgYAoKYGBge3J9CnN1bW1hcnkoaXJpcykKYGBgCgpDcmVhdGUgc29tZSB0cmFpbmluZyBhbmQgbGVhcm5pbmcgZGF0YS4KCmBgYHtyfQp0cmFpbiA8LSBpcmlzWzE6MTAwLF0KdGVzdCA8LSBpcmlzWzEwMToxNTAsXQpgYGAKCkxldCdzIGZpdCBvdXIgZmlyc3QgbGluZWFyIHJlZ3Jlc3Npb24gbW9kZWwuIEhlcmUgaXMgdGhlIGVxdWF0aW9uIHRoYXQgd2UgYXR0ZW1wdCB0byBzb2x2ZToKClxbClBldGFsLldpZHRoID0gQl8wICsgQl8xIFx0aW1lcyBTZXBhbC5MZW5ndGggKyBCXzIgXHRpbWVzIFNlcGFsLldpZHRoICsgQl8zIFx0aW1lcyBQZXRhbC5MZW5ndGggKyBCXzQgXHRpbWVzIHVlc2xlc3MKXF0KCmBgYHtyfQptb2RlbDEgPC0gbG0oUGV0YWwuV2lkdGggfiAuIC0gU3BlY2llcywgZGF0YSA9IHRyYWluKQpzdW1tYXJ5KG1vZGVsMSkKYGBgCgpgYGB7cn0KcGxvdChtb2RlbDEsIHdoaWNoID0gMSkKYGBgCgojIyBDb21wYXJpbmcgTmVzdGVkIE1vZGVscwoKV2UgdHJ5IHNpbXBsZXIgbW9kZWxzLgoKYGBge3J9Cm1vZGVsMiA8LSBsbShQZXRhbC5XaWR0aCB+IC4gLSBTcGVjaWVzIC0gdXNlbGVzcywgZGF0YSA9IHRyYWluKQpzdW1tYXJ5KG1vZGVsMikKYGBgCgpObyBpbnRlcmNlcHQKCmBgYHtyfQptb2RlbDMgPC0gbG0oUGV0YWwuV2lkdGggfiBTZXBhbC5MZW5ndGggKyBTZXBhbC5XaWR0aCArIFBldGFsLkxlbmd0aCAtIDEsCiAgICAgICAgICAgICBkYXRhID0gdHJhaW4pCnN1bW1hcnkobW9kZWwzKQpgYGAKClZlcnkgc2ltcGxlIG1vZGVsLgoKYGBge3J9Cm1vZGVsNCA8LSBsbShQZXRhbC5XaWR0aCB+IFBldGFsLkxlbmd0aCAtIDEsCiAgICAgICAgICAgICBkYXRhID0gdHJhaW4pCnN1bW1hcnkobW9kZWw0KQpgYGAKCkNvbXBhcmUgbW9kZWxzIChOdWxsIGh5cG90aGVzaXM6IGFsbCB0cmVhdG1lbnRzPW1vZGVscyBoYXZlIHRoZSBzYW1lIGVmZmVjdCkuIFRoaXMgb25seSB3b3JrcyBmb3IgbmVzdGVkIG1vZGVscy4gTW9kZWxzIGFyZSBuZXN0ZWQgb25seSBpZiBvbmUgbW9kZWwgY29udGFpbnMgYWxsIHRoZSBwcmVkaWN0b3JzIG9mIHRoZSBvdGhlciBtb2RlbC4KCmBgYHtyfQphbm92YShtb2RlbDEsIG1vZGVsMiwgbW9kZWwzLCBtb2RlbDQpCmBgYAoKTW9kZWxzIDEgaXMgbm90IHNpZ25pZmljYW50bHkgYmV0dGVyIHRoYW4gbW9kZWwgMi4gTW9kZWwgMiBpcyBub3Qgc2lnbmlmaWNhbnRseSBiZXR0ZXIgdGhhbiBtb2RlbCAzLiBNb2RlbCAzIGlzIG5vdCBzaWduaWZpY2FudGx5IGJldHRlciB0aGFuIG1vZGVsIDQhIFVzZSBtb2RlbCA0IChzaW1wbGVzdCBtb2RlbCk=