This project explores the classic Iris flower dataset to demonstrate fundamental data analysis techniques. We will focus on two key variables: Petal Length and Sepal Length.
2025-09-27
We will tackle two distinct problems:
- Regression: predict Sepal.Length based on Petal.Length.
- Classification: classify a flower’s petal as “Long” or “Short” using Sepal.Length.
Before modeling, we start with a basic summary of the data’s characteristics.
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
The summary reports no missing values and gives us a sense of the range of each variable.
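The statistics in the summary above can be computed by hand. The following is a minimal Python sketch, not the code that produced the output above: `summarize` is a hypothetical helper name and the `sample` values are made up for illustration, not taken from the Iris data. It relies on `statistics.quantiles` with `method="inclusive"`, which to our understanding interpolates the same way as R’s default quantile method.

```python
import statistics

def summarize(values):
    """Return the six statistics shown in the summary table for one variable."""
    # n=4 asks for the three quartile cut points; "inclusive" treats the
    # smallest and largest values as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }

# Made-up illustrative values (cm), NOT taken from the Iris data.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
for name, value in summarize(sample).items():
    print(f"{name:>6}: {value:.3f}")
```

Applying the helper to each of the four measurement columns in turn would give one column of the table.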
Histograms help us visualize the distribution of each variable.
The Petal Length distribution is bimodal (two peaks), suggesting two natural subgroups of flowers within our data, which is a key insight.
Next, we visualize the relationship between our two variables.
The plot clearly shows a strong, positive, linear relationship. As Petal Length increases, Sepal Length tends to increase as well.
The correlation of 0.87 confirms this strong association.
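For readers who want to see what the correlation coefficient measures, here is a minimal Python sketch of the Pearson formula. The function name `pearson_correlation` and the toy numbers are illustrative assumptions; they are not the Iris measurements and will not reproduce the 0.87 figure.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up toy values, NOT Iris data.
petal = [1.0, 2.0, 3.0]
sepal = [1.0, 3.0, 2.0]
print(pearson_correlation(petal, sepal))
```

A value near +1 means the points lie close to an upward-sloping line, a value near -1 a downward-sloping one, and a value near 0 no linear trend.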
A critical step before modeling is to split our data to test our model on unseen data.
We use 80% of the data for training and 20% for testing.
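An 80/20 split can be sketched in a few lines of Python. This is an assumption about the mechanics, not the procedure actually used in this analysis (which may, for example, stratify the split); `split_train_test` and the seed value are hypothetical.

```python
import random

def split_train_test(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Row indices stand in for the 150 flowers in the dataset.
train_rows, test_rows = split_train_test(range(150))
print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters because the rows of a dataset are often sorted (here, by species), and an unshuffled cut would leave whole groups out of training.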
Our first goal is to predict the exact Sepal.Length given the Petal.Length.
Linear Regression’s goal is to find the single straight line (the “line of best fit”) that best describes the data.
It finds the line that minimizes the sum of the squared vertical distances (the “residuals”) from each data point to the line.
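For a single predictor, the line that minimizes the sum of squared residuals has a closed form: the slope is the sum of cross-products divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. Below is a minimal Python sketch; `fit_line` is a hypothetical name and the data are made-up toy values, not Iris measurements.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # line passes through (mean_x, mean_y)
    return intercept, slope

# Made-up toy values lying exactly on y = 1 + 2x, NOT Iris data.
intercept, slope = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(intercept, slope)
```

A prediction for a new flower is then `intercept + slope * petal_length`.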
Performance Metrics:
- RMSE: 0.439 (the typical size of a prediction error, in cm).
- R²: 0.691 (our model explains about 69.1% of the variance).
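Both metrics are short formulas. The Python sketch below uses one common definition of R² (one minus the ratio of residual to total sum of squares); this is an assumption, since some tools instead report the squared correlation between predictions and actual values, which can give a slightly different number. The function names and toy values are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """1 - SSE/SST: the fraction of variance in `actual` that is explained."""
    mean_a = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - sse / sst

# Made-up toy values, NOT Iris data.
actual = [5.0, 6.0, 7.0]
predicted = [5.5, 6.0, 6.5]
print(rmse(actual, predicted), r_squared(actual, predicted))
```

Because R² is a fraction, multiplying it by 100 gives the percentage of variance explained.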
The results show a reasonably good fit: an error of about 0.44 cm is small relative to the 4.3 to 7.9 cm range of Sepal.Length.
We must check if our model’s assumptions are met. These plots help assess the model’s health.
The diagnostic plots confirm that our linear model is a valid fit for this data.
Random Forest is a more complex ensemble model. It combines many individual decision trees to create one stronger model.
It works by building hundreds of decision trees, each one trained on a bootstrap sample of the training data (a random sample drawn with replacement).
To make a prediction, the model collects a prediction from every tree in the “forest”. For regression it returns their average; for classification it takes a majority vote. This “wisdom of the crowd” approach makes it very powerful.
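The bootstrap-and-average idea can be sketched in plain Python. This is a deliberately simplified illustration, not a full Random Forest and not the model fitted in this analysis: each “tree” here is a one-split stump, and the per-split feature sampling that a real Random Forest performs is omitted because there is only one predictor. All names and data values are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """One-split regression tree: pick the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue  # a valid split needs points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Every x in this sample is identical: always predict the overall mean.
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_bagged_stumps(xs, ys, n_trees=200, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Regression prediction: the average of all tree predictions."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)

# Made-up toy values, NOT Iris data.
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
toy_y = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
forest = fit_bagged_stumps(toy_x, toy_y)
print(predict_forest(forest, 1.0), predict_forest(forest, 8.0))
```

Because every stump sees a slightly different sample, the thresholds vary from tree to tree, and averaging smooths out the individual trees’ errors.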
Performance Metrics:
- RMSE: 0.456
- R²: 0.684
Analysis: The simpler Linear Regression model had a slightly lower RMSE, showing that for this highly linear data, a simple model was more effective.
Now we switch to our second goal: classifying a flower as “Long” or “Short”.
First, we create our “Long” and “Short” categories using the median petal length as our threshold.
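Creating the two categories amounts to comparing each petal length with the median. A minimal Python sketch follows; the helper name `label_by_median` is hypothetical, the values are made up, and treating values exactly equal to the median as “Short” is an assumption, since the text does not say which side ties fall on.

```python
import statistics

def label_by_median(petal_lengths):
    """Label each value "Long" if it exceeds the median, otherwise "Short"."""
    threshold = statistics.median(petal_lengths)
    # Assumption: ties with the median are labelled "Short".
    labels = ["Long" if value > threshold else "Short" for value in petal_lengths]
    return threshold, labels

# Made-up toy values (cm), NOT Iris data.
threshold, labels = label_by_median([1.4, 1.5, 4.5, 5.1])
print(threshold, labels)
```

Splitting at the median produces two classes of roughly equal size, which keeps accuracy easy to interpret.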
Logistic Regression is an effective model for binary classification that predicts the probability an observation belongs to a class.
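To show how the probability is produced, here is a minimal Python sketch that fits a one-predictor logistic regression by gradient descent. It is an illustration under stated assumptions, not the fitting routine used in this analysis: the learning rate, epoch count, function names, and data are all made up, and the label 1 stands for “Long”.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, learning_rate=0.5, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(intercept + slope * x); ys must be 0 or 1.

    x is centered internally so plain gradient descent converges quickly;
    the returned coefficients apply to the original, uncentered x.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * (x - mean_x)) - y
            grad0 += error
            grad1 += error * (x - mean_x)
        b0 -= learning_rate * grad0 / n   # step against the log-loss gradient
        b1 -= learning_rate * grad1 / n
    return b0 - b1 * mean_x, b1

def predict_probability(intercept, slope, x):
    return sigmoid(intercept + slope * x)

# Made-up toy sepal lengths (cm) and labels (1 = "Long"), NOT Iris data.
toy_x = [4.6, 5.0, 5.4, 5.8, 6.1, 5.6, 6.0, 6.4, 6.8, 7.2]
toy_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
intercept, slope = fit_logistic(toy_x, toy_y)
print(predict_probability(intercept, slope, 4.6),
      predict_probability(intercept, slope, 7.2))
```

A flower is classified as “Long” when its predicted probability exceeds a cutoff, commonly 0.5.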
Performance Metrics:
- Accuracy: 0.138 (the model was correct on 13.8% of the test set).
Next, we test the more complex Random Forest model for classification.
Performance Metrics:
- Accuracy: 0.828
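The reported accuracies differ sharply: 0.138 for Logistic Regression versus 0.828 for Random Forest. An accuracy far below 0.5 on a two-class problem usually points to inverted class labels rather than a weak model; if that is the cause here, the Logistic Regression accuracy would be 1 − 0.138 = 0.862, close to the Random Forest’s.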
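Accuracy is the share of predictions that match the reference labels. The Python sketch below uses made-up labels and hypothetical helper names; it also demonstrates a property that helps when reading two-class results, namely that swapping every predicted label turns an accuracy of a into 1 − a.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

def flip_labels(labels, first="Long", second="Short"):
    """Swap the two class labels, as happens when factor levels are inverted."""
    return [second if label == first else first for label in labels]

# Made-up labels, NOT results from this analysis.
actual = ["Long", "Short", "Short", "Long"]
predicted = ["Long", "Short", "Long", "Long"]
print(accuracy(predicted, actual))
print(accuracy(flip_labels(predicted), actual))
```

An accuracy well below one half on a balanced two-class task is therefore a prompt to check how the labels are coded before judging the model.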
The ROC Curve visualizes how well a classifier can distinguish between classes.
The Area Under the Curve (AUC) is 0.894. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so this indicates a strong classifier.
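The AUC has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The Python sketch below computes it that way by comparing every positive/negative pair; the function name and the scores are made up for illustration and are unrelated to the 0.894 reported here.

```python
def auc_from_scores(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up predicted probabilities and 0/1 labels, NOT results from this analysis.
print(auc_from_scores([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

This pairwise form is quadratic in the number of examples, which is fine for a test set of a few dozen flowers.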
- Petal.Length is a strong predictor of Sepal.Length. The simple Linear Regression model was slightly more accurate.
- Sepal.Length was an effective feature for classifying petals as “Long” or “Short”: the Random Forest reached 0.828 accuracy, and the ROC curve gave an AUC of 0.894.
- Adding Petal.Width could add more predictive power.
Main Takeaway: More complex models are not always better. On the regression task, the simpler model was the more accurate one.