2026-06-08

The Iris Dataset

Iris is a famous data set that ships with R in the built-in datasets package. Its measurements were collected by the botanist Edgar Anderson. The data contain observations from 150 iris flowers, 50 from each of 3 species:

  • Iris setosa
  • Iris versicolor
  • Iris virginica.

For each flower, Anderson measured traits such as
petal length and petal width, recorded in centimeters.
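To get a feel for the data described above, the data set can be inspected directly in R. This is only a quick sketch; the variable name species_counts is my own choice for illustration.

```r
# iris ships with R in the 'datasets' package, so nothing needs to be installed
data(iris)

# Number of rows (flowers) and columns (4 measurements plus Species)
dim(iris)

# Number of flowers recorded for each species
species_counts <- table(iris$Species)
species_counts

# First few rows of the two petal measurements used in this analysis
head(iris[, c("Petal.Length", "Petal.Width", "Species")])
```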

Simple Linear Regression

Simple linear regression is a common method that statisticians use to discover and analyze possible relationships between two numerical variables.

  • x is called the predictor or independent variable. It is the input.
  • y is called the response or dependent variable. It is the output, and we analyze it to see whether x has any influence on it.
  • If a strong relationship exists, we can even predict y for inputs that are not in the data set.

Constructing a Model

The core idea of linear regression is to estimate the line that fits the given data as closely as possible.

  • The model behind this line is written as:
    \[Y = \beta_0 + \beta_1 X + \varepsilon\]

\(\beta_1\) is the slope of the line.
\(\beta_0\) is the y-intercept of the line.
\(\varepsilon\) is the random error, the part of Y that the line does not explain.

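The values of \(\beta_0\) and \(\beta_1\) are estimated from the data by
the least squares method, which picks the line that minimizes the sum of squared differences between the observed and fitted y values.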
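For a single predictor, least squares has a well-known closed form: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept makes the line pass through the point of means. The sketch below applies those formulas to the petal measurements; the names x, y, b0 and b1 are chosen here for illustration only.

```r
x <- iris$Petal.Width   # predictor
y <- iris$Petal.Length  # response

# Slope: sample covariance of x and y divided by the sample variance of x
b1 <- cov(x, y) / var(x)

# Intercept: chosen so the line passes through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)
```

These hand-computed values should agree with the coefficients that lm() reports for the same two variables.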

Luckily, in R, the fitted line or model can be generated with a single function call: lm(formula, data).

Visualizing the Iris Data Set

Before creating the simple regression model, I created a scatterplot of
every flower’s respective petal width and length using ggplot().
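A scatterplot like the one described could be produced roughly as follows. This is a minimal sketch that assumes the ggplot2 package is installed; the axis labels, title and the object name scatter are my own choices and not necessarily those of the original plot.

```r
library(ggplot2)

# Petal width on the x-axis (predictor), petal length on the y-axis (response)
scatter <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point() +
  labs(
    x = "Petal Width (cm)",
    y = "Petal Length (cm)",
    title = "Iris petal length vs. petal width"
  )

# Draw the plot
print(scatter)
```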

Already, we can notice a general trend between the two variables:

  • As petal width (x) increases, petal length (y) also increases, suggesting that they share a positive linear relationship.

Fitting the Model

Below is code in R to generate the fitted model.

# Generates the fitted model
model <- lm(Petal.Length ~ Petal.Width, data = iris)
# Print out the coefficients/values of the fitted model
coef(summary(model))
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 1.083558 0.07296696 14.84998 4.043318e-31
## Petal.Width 2.229940 0.05139623 43.38724 4.675004e-86

From this, we can gather that:

  • Estimated slope: \(\beta_1 = 2.229940\)
  • Estimated y-intercept: \(\beta_0 = 1.083558\)

Incorporating the Fitted Line

I plotted the fitted regression line alongside the original ggplot() graph to see how well the model fits the data points.
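One way to overlay the fitted line is to pass the model's coefficients to geom_abline(). The sketch below assumes ggplot2 is installed; the line colour and the names b and fit_plot are illustrative choices, and the original plot may have been built differently (for example with geom_smooth(method = "lm")).

```r
library(ggplot2)

model <- lm(Petal.Length ~ Petal.Width, data = iris)
b <- coef(model)  # named vector: "(Intercept)" and "Petal.Width"

# Scatterplot of the data with the fitted regression line drawn on top
fit_plot <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point() +
  geom_abline(
    intercept = b[["(Intercept)"]],
    slope = b[["Petal.Width"]],
    colour = "blue"
  )

print(fit_plot)
```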

Interpreting Results

Using the coefficients obtained in earlier sections, we can form the fitted line equation: \[Y = 1.084 + 2.230X\] Additionally, running the R command summary() on our fitted model yields \(R^2=0.9271\)

\(R^2\) is the proportion of the variability in y that can be accounted for by the model. In other words, it measures how well the linear regression model fits the data.

  • Thus, the high \(R^2\) value of 0.9271 means that about 92.7% of the variability in Petal Length is explained by Petal Width, indicating a strong linear relationship between the two.

Using this model, we could substitute any value for Petal Width (x) and generate an estimate of its Petal Length (y).
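The \(R^2\) value and the predictions mentioned above can be pulled from the model object directly. In this sketch the petal widths in new_flowers are hypothetical inputs picked for illustration, and the variable names are my own.

```r
model <- lm(Petal.Length ~ Petal.Width, data = iris)

# R-squared as reported by summary()
r2 <- summary(model)$r.squared
r2

# With one predictor, R-squared equals the squared correlation of x and y
r2_check <- cor(iris$Petal.Width, iris$Petal.Length)^2

# Estimated petal lengths for a few hypothetical petal widths (in cm)
new_flowers <- data.frame(Petal.Width = c(0.5, 1.0, 1.5))
predicted <- predict(model, newdata = new_flowers)
cbind(new_flowers, Petal.Length = predicted)
```

Note that predict() expects a data frame whose column name matches the predictor used in the formula, here Petal.Width.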

Checking our Model

To check how much error exists in our model, we can compare our model’s estimated value to the actual value in the data set.

From iris[1, c("Petal.Length","Petal.Width")],
we obtain that the first sample’s Petal Length is 1.4 and Petal Width is 0.2.

Plugging x = 0.2 into our fitted line equation, we obtain \(1.084 + 2.230(0.2) \approx 1.53\).

To calculate our error, we subtract our predicted value from the actual value. This error is called a residual.

\(\varepsilon = 1.4 - 1.53\)
\(\varepsilon = -0.13\)

Since \(\varepsilon\) is a small value that is close to 0, one could argue the model made a fairly accurate estimate for this specific observation.
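The same check can be reproduced in R instead of by hand. This is a small sketch; the names first, predicted_1 and residual_1 are my own.

```r
model <- lm(Petal.Length ~ Petal.Width, data = iris)

# Actual measurements of the first flower in the data set
first <- iris[1, c("Petal.Length", "Petal.Width")]

# Model's estimate of petal length for that flower's petal width
predicted_1 <- predict(model, newdata = first)

# Residual = actual value - predicted value
residual_1 <- first$Petal.Length - predicted_1

# residuals() stores the same quantity for every row; compare the first entry
c(by_hand = unname(residual_1), from_residuals = unname(residuals(model)[1]))
```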

An Interactive Plot of the Residuals

Using the R function residuals(), I calculated all residual values from our model.
Finally, using Plotly, I created an interactive graph below.
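A residual plot along these lines could be built as sketched here. This assumes the plotly package is installed. Plotting residuals against petal width and colouring points by species are my own assumptions about the layout, as are the names resid_df and resid_plot; the original graph may differ.

```r
library(plotly)

model <- lm(Petal.Length ~ Petal.Width, data = iris)

# One row per flower: its petal width, its residual and its species
resid_df <- data.frame(
  Petal.Width = iris$Petal.Width,
  Residual    = residuals(model),
  Species     = iris$Species
)

# Interactive scatterplot of residuals; hovering shows each point's values
resid_plot <- plot_ly(
  data = resid_df,
  x = ~Petal.Width,
  y = ~Residual,
  color = ~Species,
  type = "scatter",
  mode = "markers"
)
resid_plot <- layout(
  resid_plot,
  xaxis = list(title = "Petal Width"),
  yaxis = list(title = "Residual")
)

resid_plot
```

If the linear model is appropriate, the points should scatter around the horizontal line at 0 without an obvious pattern.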