2025-10-26

Introduction

The iris dataset is one of the best-known datasets in machine learning and statistical analysis. It contains the lengths and widths of the petals and sepals of flowers from three closely related iris species: Setosa, Versicolor, and Virginica.

This project attempts to build a regression model that accurately predicts the petal length of Versicolor flowers specifically.

Relationship between Petal Width and Petal Length

Linear Regression

How can we define the relationship between petal width and length? The answer is linear regression.

Linear regression is a machine learning technique that fits a straight line to the data points as closely as possible. The model chooses the line that minimizes its error, measured as the sum of the squared vertical distances between the observed points and the line.

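The formula for the slope of the line:

\[ b=\frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2)-(\sum x)^2} \]

The formula for the y-intercept of the line:

\[ a=\frac{\sum y \sum x^2-\sum x \sum x y}{n(\sum x^2)-(\sum x)^2} \]

Here x is petal width, y is petal length, n is the number of observations, and the fitted line is \( y = a + bx \).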
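As a check on the two formulas above, the sums can be computed directly in R and compared with the coefficients returned by `lm()`. This is a minimal sketch, assuming the Versicolor rows come from R's built-in `iris` data frame, with petal width as x and petal length as y; the variable names are illustrative.

```r
# Assumption: Versicolor rows taken from R's built-in iris data frame
versicolor <- subset(iris, Species == "versicolor")

x <- versicolor$Petal.Width   # predictor
y <- versicolor$Petal.Length  # response
n <- length(x)                # number of observations

# Shared denominator of both formulas
denom <- n * sum(x^2) - sum(x)^2

# Slope and intercept from the closed-form least-squares formulas
b <- (n * sum(x * y) - sum(x) * sum(y)) / denom
a <- (sum(y) * sum(x^2) - sum(x) * sum(x * y)) / denom

# Reference fit from R's built-in linear model function
fit <- lm(Petal.Length ~ Petal.Width, data = versicolor)

# Should print TRUE: both routes agree up to floating-point tolerance
print(isTRUE(all.equal(unname(coef(fit)), c(a, b))))
```

`coef(fit)` returns the intercept first and the slope second, which is why the comparison vector is ordered `c(a, b)`.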

Linear Regression applied to the dataset
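A fit of this kind could be produced along the following lines. This is a minimal sketch rather than the exact code behind the analysis: it assumes the Versicolor rows are taken from R's built-in `iris` data frame, and the name `simple_model` and the plot styling are illustrative choices.

```r
# Assumption: Versicolor rows taken from R's built-in iris data frame
versicolor <- subset(iris, Species == "versicolor")

# Fit petal length as a linear function of petal width
simple_model <- lm(Petal.Length ~ Petal.Width, data = versicolor)

# Scatter plot of the observations with the fitted line drawn on top
plot(versicolor$Petal.Width, versicolor$Petal.Length,
     xlab = "Petal width (cm)", ylab = "Petal length (cm)",
     main = "Versicolor: petal length vs. petal width")
abline(simple_model, col = "red", lwd = 2)

# Intercept and slope of the fitted line
print(coef(simple_model))
```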

Multiple Linear Regression

The previous result is good, and the line appears to fit the data well. But what if we wanted a deeper analysis, for example using sepal length as well as petal width to predict petal length? For that we need multiple linear regression.

Multiple linear regression works just like linear regression, but with more than one predictor variable. In this case there are two predictors plus the response, so the space is 3-dimensional and the result of our regression is a plane in 3D space.

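In general, the formula that you get back from multiple linear regression is of the form:

\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \]

where n is the number of predictor variables, \( \beta_0 \) is the intercept, and each \( \beta_i \) is the coefficient of the predictor \( x_i \).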

R Code for Multiple Linear Regression

We will use the following code to fit the regression model and compute its predictions over a grid of input values:

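# Keep only the Versicolor rows of R's built-in iris data
# (assumed here; the original definition of `versicolor` is not shown)
versicolor <- subset(iris, Species == "versicolor")

# Fit petal length as a linear function of petal width and sepal length
model <- lm(Petal.Length ~ Petal.Width + Sepal.Length, data = versicolor)

# 20 evenly spaced values across the observed range of each predictor
x_seq <- seq(min(versicolor$Petal.Width),
             max(versicolor$Petal.Width), length.out = 20)
y_seq <- seq(min(versicolor$Sepal.Length),
             max(versicolor$Sepal.Length), length.out = 20)

# Predict petal length at every combination (20 x 20 = 400 grid points)
grid <- expand.grid(Petal.Width = x_seq, Sepal.Length = y_seq)
grid$Petal.Length <- predict(model, newdata = grid)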
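To connect the fitted model back to the general formula, the coefficients can be extracted and applied by hand. The sketch below assumes the Versicolor rows come from R's built-in `iris` data frame; the names `beta` and `manual` are illustrative. It also prints R-squared as one possible measure of how well the plane fits.

```r
# Assumption: Versicolor rows taken from R's built-in iris data frame
versicolor <- subset(iris, Species == "versicolor")
model <- lm(Petal.Length ~ Petal.Width + Sepal.Length, data = versicolor)

# Estimated coefficients: intercept, then one slope per predictor
beta <- coef(model)
print(beta)

# Apply y = beta_0 + beta_1 * x_1 + beta_2 * x_2 by hand to every flower
manual <- beta[["(Intercept)"]] +
  beta[["Petal.Width"]]  * versicolor$Petal.Width +
  beta[["Sepal.Length"]] * versicolor$Sepal.Length

# Should print TRUE: hand-computed values match predict() on the training data
print(isTRUE(all.equal(manual, unname(predict(model)))))

# R-squared: share of the variance in petal length explained by the model
r_squared <- summary(model)$r.squared
print(r_squared)
```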

Multiple Linear Regression Plot
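One way to draw the fitted plane is with base R's `persp()`, reshaping the grid predictions into a matrix first. This is a sketch under assumptions, not necessarily the plotting method used for the original figure: it assumes the Versicolor rows come from R's built-in `iris` data frame, and the viewing angles and colors are arbitrary choices.

```r
# Assumption: Versicolor rows taken from R's built-in iris data frame
versicolor <- subset(iris, Species == "versicolor")
model <- lm(Petal.Length ~ Petal.Width + Sepal.Length, data = versicolor)

# Same 20 x 20 prediction grid as in the code above
x_seq <- seq(min(versicolor$Petal.Width),
             max(versicolor$Petal.Width), length.out = 20)
y_seq <- seq(min(versicolor$Sepal.Length),
             max(versicolor$Sepal.Length), length.out = 20)
grid <- expand.grid(Petal.Width = x_seq, Sepal.Length = y_seq)
grid$Petal.Length <- predict(model, newdata = grid)

# expand.grid() varies its first argument fastest, so filling a matrix
# column by column gives z[i, j] = prediction at (x_seq[i], y_seq[j])
z <- matrix(grid$Petal.Length, nrow = length(x_seq), ncol = length(y_seq))

# Draw the fitted plane; persp() returns the 3D-to-2D viewing transform
view <- persp(x_seq, y_seq, z, theta = 30, phi = 20, ticktype = "detailed",
              xlab = "Petal width", ylab = "Sepal length",
              zlab = "Petal length", col = "lightblue")

# Project the observed flowers into the same view and overlay them
pts <- trans3d(versicolor$Petal.Width, versicolor$Sepal.Length,
               versicolor$Petal.Length, pmat = view)
points(pts, pch = 16, col = "red")
```

Overlaying the observed points on the plane makes it easier to judge visually how far individual flowers sit from the model's predictions.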

Conclusions

The plots show that, for the Versicolor species, petal length is strongly correlated with both sepal length and petal width. Simple and multiple linear regression let us build models that predict petal length from those inputs, and the fitted line and plane both follow the data closely.
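The visual impression of correlation can also be checked numerically. The following is a minimal sketch, assuming the Versicolor rows come from R's built-in `iris` data frame; `cor_width` and `cor_sepal` are illustrative names.

```r
# Assumption: Versicolor rows taken from R's built-in iris data frame
versicolor <- subset(iris, Species == "versicolor")

# Pearson correlation of petal length with each predictor
cor_width <- cor(versicolor$Petal.Length, versicolor$Petal.Width)
cor_sepal <- cor(versicolor$Petal.Length, versicolor$Sepal.Length)
print(c(petal_width = cor_width, sepal_length = cor_sepal))
```

Values close to 1 indicate a strong positive linear relationship, while values near 0 indicate little or none.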