Linear Regression Using the Iris Dataset

2026-04-12

Dataset `Iris`

data(iris)   # load the built-in iris data frame
head(iris)   # show the first six rows

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Linear Regression

Linear regression is a statistical method used to estimate the value of one variable based on the value of another.
The variable being predicted is the dependent variable, and the variable used for prediction is the independent variable.

Mathematically:

\[ \hat{y} = \beta_0 + \beta_1 x \]

Breakdown of the equation:

\(\hat{y}\) — predicted value of the dependent variable
\(x\) — independent (predictor) variable
\(\beta_1\) — slope coefficient
\(\beta_0\) — intercept term
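
To make the roles of the slope and intercept concrete, here is a minimal sketch of how the two coefficients can be estimated by ordinary least squares. For a single predictor, the slope estimate is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The choice of `Petal.Length` and `Sepal.Length` as x and y, and the names `x`, `y`, `b0`, `b1`, and `fit`, are assumptions made for illustration.

```r
# Closed-form least-squares estimates for y-hat = b0 + b1 * x
data(iris)
x <- iris$Petal.Length   # independent (predictor) variable
y <- iris$Sepal.Length   # dependent variable

b1 <- cov(x, y) / var(x)        # slope estimate
b0 <- mean(y) - b1 * mean(x)    # intercept estimate

# lm() solves the same least-squares problem, so the coefficients should agree
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
```

The final line compares the hand-computed values with those from `lm()` up to numerical tolerance.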

Dataset `Iris`: Sepal Length vs. Petal Length

Model \[ \text{Sepal.Length} = \beta_0 + \beta_1 \cdot \text{Petal.Length} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]

Fitted Model \[ \text{Sepal.Length} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Petal.Length} \]

\(\hat{\beta}_0 = b_0\) — estimate of the intercept
\(\hat{\beta}_1 = b_1\) — estimate of the slope
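
The fitted model can be obtained with `lm()`. The sketch below is one way to do it, not necessarily the code behind this report. The object names `fit_petal`, `new_flower`, and `pred` are hypothetical, and the 4 cm petal is an arbitrary example value.

```r
# Fit Sepal.Length = b0 + b1 * Petal.Length on the built-in iris data
data(iris)
fit_petal <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Extract the two estimates by name
b  <- coef(fit_petal)
b0 <- b[["(Intercept)"]]    # estimate of the intercept
b1 <- b[["Petal.Length"]]   # estimate of the slope

# Predicted sepal length for a hypothetical flower with a 4 cm petal
new_flower <- data.frame(Petal.Length = 4)
pred <- predict(fit_petal, newdata = new_flower)
pred
```

`predict()` applies the fitted equation, so `pred` equals `b0 + b1 * 4`.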

Sepal Length vs. Petal Length Graph
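
A scatterplot with a fitted line like this one can be sketched with ggplot2 as follows. The use of `method = "lm"`, the axis labels, and the object name `p` are assumptions. The report only indicates that `geom_smooth()` was called with the default formula.

```r
library(ggplot2)
data(iris)

# Scatterplot of the raw data with a straight-line fit overlaid
p <- ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm") +   # least-squares line with a confidence band
  labs(x = "Petal Length", y = "Sepal Length")

print(p)
```

The same pattern applies to the other two variable pairs by swapping the columns in `aes()`.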


Dataset `Iris`: Sepal Length vs. Sepal Width

Model \[ \text{Sepal.Length} = \beta_0 + \beta_1 \cdot \text{Sepal.Width} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]

Fitted Model \[ \text{Sepal.Length} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Sepal.Width} \]

\(\hat{\beta}_0 = b_0\) — estimate of the intercept
\(\hat{\beta}_1 = b_1\) — estimate of the slope
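
One way to judge whether Sepal Width carries useful information about Sepal Length is to look at the confidence interval for the slope and at R-squared. The sketch below assumes a 95% interval, the default for `confint()`. The names `fit_width`, `slope_ci`, and `r2_width` are hypothetical.

```r
# Fit Sepal.Length = b0 + b1 * Sepal.Width
data(iris)
fit_width <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# 95% confidence interval for the slope; an interval that spans zero
# means the data cannot rule out a flat line
slope_ci <- confint(fit_width)["Sepal.Width", ]

# Share of the variance in Sepal.Length explained by the model
r2_width <- summary(fit_width)$r.squared

slope_ci
r2_width
```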

Sepal Length vs. Sepal Width Graph


Dataset `Iris`: Petal Length vs. Petal Width

Model \[ \text{Petal.Length} = \beta_0 + \beta_1 \cdot \text{Petal.Width} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]

Fitted Model \[ \text{Petal.Length} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Petal.Width} \]

\(\hat{\beta}_0 = b_0\) — estimate of the intercept
\(\hat{\beta}_1 = b_1\) — estimate of the slope
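
The error term in the model can be examined through the residuals of the fit. As a rough sketch, the code below estimates \(\sigma\) and confirms that the residuals average to zero, which is expected for any least-squares fit with an intercept. The object names are hypothetical, and this is not a full check of the normality assumption.

```r
# Fit Petal.Length = b0 + b1 * Petal.Width
data(iris)
fit_pw <- lm(Petal.Length ~ Petal.Width, data = iris)

res <- residuals(fit_pw)     # observed minus fitted values
mean_res <- mean(res)        # zero up to rounding error

# Residual standard error, the estimate of sigma
sigma_hat <- sigma(fit_pw)

# The same quantity computed by hand from the residuals
sigma_manual <- sqrt(sum(res^2) / df.residual(fit_pw))

c(sigma_hat = sigma_hat, sigma_manual = sigma_manual)
```

A normal Q-Q plot of `res`, for example with `qqnorm(res)`, would be a natural next step for checking the normality assumption.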

Petal Length vs. Petal Width Graph


3D Iris Graph Code

library(plotly)  # provides plot_ly(), layout(), hide_colorbar(), and the %>% pipe

# Measurements mapped to the x, y, and z axes
x_vals <- iris$Sepal.Length
y_vals <- iris$Sepal.Width
z_vals <- iris$Petal.Length

# Species determines the point color; one color per species level
species_vals <- iris$Species
my_colors <- c("#6a659e", "#659e6a", "#9e6599")

plot_ly(
  x = x_vals,
  y = y_vals,
  z = z_vals,
  type = "scatter3d",
  mode = "markers",
  color = species_vals,
  colors = my_colors,
  marker = list(size = 4)
) %>%
  hide_colorbar() %>%
  layout(
    scene = list(
      xaxis = list(title = "Sepal Length"),
      yaxis = list(title = "Sepal Width"),
      zaxis = list(title = "Petal Length")
    )
  )

Sepal Length x Sepal Width x Petal Length

Linear Regression Conclusion

Petal Length shows a strong positive linear relationship with Sepal Length.
Points trend upward and stay close to the regression line, so the linear model fits well.

Sepal Width shows only a weak linear relationship with Sepal Length.
Points are widely scattered with no clear trend, so the regression line explains little of the variation.
This means the linear model:

\[ \text{Sepal.Length} = \beta_0 + \beta_1 \cdot \text{Sepal.Width} + \varepsilon \]

is a poor fit for this relationship.

Petal Length and Petal Width have a very strong positive relationship.
Points are tightly grouped along the regression line, showing an excellent model fit.

The 3D plot shows that Sepal Length increases with Petal Length, while Sepal Width varies less consistently. Petal Length is therefore the better predictor of Sepal Length.
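
The visual comparison above can be backed by a number. The sketch below fits the three models and collects their R-squared values, the share of variance in the response that each model explains. The list name `models` and its labels are chosen here for illustration.

```r
data(iris)

# The three simple regressions discussed in this report
models <- list(
  "Sepal.Length ~ Petal.Length" = Sepal.Length ~ Petal.Length,
  "Sepal.Length ~ Sepal.Width"  = Sepal.Length ~ Sepal.Width,
  "Petal.Length ~ Petal.Width"  = Petal.Length ~ Petal.Width
)

# Fit each model and extract its R-squared
r_squared <- sapply(models, function(f) summary(lm(f, data = iris))$r.squared)

# Largest first: values near 1 indicate a tight fit, values near 0 a poor one
round(sort(r_squared, decreasing = TRUE), 3)
```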

Species Patterns in the Data

Species form clear clusters in all scatterplots, even though species were not part of the regression models.

Setosa is clearly different from the other two species.
It has much smaller petal measurements and is overall a smaller iris.
This shows up as clusters of purple Setosa points on the lower end of the graphs.

Versicolor and Virginica overlap more, but still show separation.
Virginica tends to have the largest measurements.
Versicolor sits in the middle.
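
These size differences can be checked with per-species averages. The following is a small sketch using base R's `aggregate()`. The name `species_means` is hypothetical.

```r
data(iris)

# Mean of every measurement within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```

Each row of the result holds the average measurements, in centimeters, for one species.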

Petal Length separates species the most.
When Petal Length is compared with other variables, the clusters become more distinct.
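
One possible way to quantify how well each measurement separates the species is to regress it on `Species` alone and record R-squared, the share of that measurement's variance accounted for by species membership. This metric is an assumption introduced here and is not part of the original analysis. The names `measurements` and `species_r2` are hypothetical.

```r
data(iris)

measurements <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# For each measurement, fit measurement ~ Species and keep the R-squared
species_r2 <- sapply(measurements, function(v) {
  fit <- lm(reformulate("Species", response = v), data = iris)
  summary(fit)$r.squared
})

# Largest first: a higher value means species explains more of the variation
round(sort(species_r2, decreasing = TRUE), 3)
```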

Overall, species grouping explains some of the variation in the data and adds an interesting layer to the dataset.
