DAT301_Project1

2025-09-27

Introduction

This project explores the classic Iris flower dataset to demonstrate fundamental data analysis techniques. We will focus on two key variables: Petal Length and Sepal Length.

Project Goals

We will tackle two distinct problems:

Regression: Predicting the continuous value of Sepal.Length based on Petal.Length.
Classification: Predicting the category of “Long” or “Short” petals based on Sepal.Length.

Exploratory Data Analysis: Statistical Summary

Before modeling, we start with a basic summary of the data’s characteristics.

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

No NA counts appear in the summary, so there are no missing values, and the output gives us a sense of the range of each variable (all measurements are in cm).
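As a rough illustration of what such a summary computes, here is a minimal Python sketch using only the standard library. The sample values and the function name `six_number_summary` are made up for this example and are not the full Iris data. The `inclusive` quantile method is chosen on the assumption that it matches the default quartile rule behind the table above.

```python
import statistics

# Illustrative petal lengths in cm (a small made-up sample, not the real dataset).
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]

def six_number_summary(values):
    """Return min, quartiles, mean and max for a list of numbers."""
    # method="inclusive" treats the data as the whole population and
    # interpolates linearly between neighbouring sorted values.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }

summary = six_number_summary(petal_length)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```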

Data Visualization: Histograms

Histograms help us visualize the distribution of each variable.

Interpreting the Histograms

The Petal Length distribution is bimodal (two peaks), suggesting two natural subgroups of flowers within our data, which is a key insight.
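To make the idea of a bimodal histogram concrete, the sketch below bins a few made-up petal lengths and counts how many fall in each bin. The helper `histogram_counts`, the bin width, and the sample values are all assumptions for illustration only.

```python
def histogram_counts(values, bin_width, start):
    """Count values per bin; each bin is keyed by its left edge."""
    counts = {}
    for v in values:
        bin_index = int((v - start) // bin_width)
        left_edge = start + bin_width * bin_index
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

# Illustrative petal lengths in cm (made up, not the real dataset).
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]
counts = histogram_counts(petal_length, bin_width=1.0, start=1.0)

# Bins with no observations are simply absent; a run of missing bins
# between two populated groups is what a bimodal shape looks like.
for left_edge, count in counts.items():
    print(f"[{left_edge:.1f}, {left_edge + 1.0:.1f}): {'#' * count}")
```

In this toy sample the bins starting at 2.0 and 3.0 are empty, which separates a short-petal group from a long-petal group.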

Data Visualization: Scatter Plot

Next, we visualize the relationship between our two variables.

Interpreting the Scatter Plot

The plot clearly shows a strong, positive, linear relationship. As Petal Length increases, Sepal Length also consistently increases.

The correlation of 0.87 confirms this strong association.
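For readers who want to see how a correlation coefficient is obtained, here is a small Python sketch of the Pearson formula. It assumes the reported value is a Pearson correlation; the function name `pearson_r` and the sample pairs are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Illustrative (petal length, sepal length) values in cm, made up for this sketch.
petal = [1.4, 1.5, 4.5, 4.7, 5.1, 6.0]
sepal = [4.9, 5.0, 6.4, 6.1, 7.0, 6.8]
r = pearson_r(petal, sepal)
print(f"r = {r:.3f}")
```

A value near +1 means the points lie close to an upward-sloping line, near -1 a downward-sloping one, and near 0 no linear trend.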

Data Preparation: Train/Test Split

A critical step before modeling is to split the data so that each model can be evaluated on observations it did not see during training.

We use 80% of the data for training and 20% for testing.
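One simple way to produce such a split is to shuffle the rows and cut the list at the 80% mark. The sketch below is an assumption about how this could be done, not the code used for this project; `train_test_split`, the seed, and the stand-in row indices are hypothetical.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for a 150-row dataset: row indices 0..149.
rows = list(range(150))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed matters when reporting metrics, because a different random split will generally give slightly different test results.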

Part 1: Regression Models

Our first goal is to predict the exact Sepal.Length given the Petal.Length.

How Linear Regression Works

Linear Regression’s goal is to find the single straight line (the “line of best fit”) that best describes the data.

It finds the line that minimizes the sum of the squared vertical distances (the “residuals”) from each data point to the line.
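With a single predictor, the least-squares line has a closed-form solution, sketched below in Python. The function names and the sample values are hypothetical; the sketch only illustrates the minimization described above.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances from each point to the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Illustrative (petal length, sepal length) values in cm, made up for this sketch.
petal = [1.4, 4.5, 4.7, 5.1, 6.0]
sepal = [4.9, 6.4, 6.6, 7.0, 7.3]

intercept, slope = fit_line(petal, sepal)
best = sum_squared_residuals(petal, sepal, intercept, slope)
# Any other line, such as one with a slightly steeper slope, does worse.
nudged = sum_squared_residuals(petal, sepal, intercept, slope + 0.05)
print(best < nudged)
```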

Model 1: Linear Regression Results

Performance Metrics:
RMSE: 0.439 (the typical size of a prediction error, in cm).
R²: 0.691 (the model explains about 69.1% of the variance in Sepal.Length).

The results show a reasonably good fit: a typical error of about 0.44 cm is small relative to the 4.3 to 7.9 cm range of Sepal.Length.
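Both metrics can be computed from the actual and predicted values alone. The Python sketch below assumes the common definition R² = 1 - SS_res / SS_tot; some tools instead report the squared correlation between predictions and observations, so this may not be the exact definition behind the figures above. The function names are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """Share of variance explained: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up sepal lengths (cm) and model predictions for illustration.
actual = [5.0, 5.8, 6.4, 7.1]
predicted = [5.2, 5.6, 6.5, 6.8]
print(f"RMSE = {rmse(actual, predicted):.3f}, R² = {r_squared(actual, predicted):.3f}")
```

Because the errors are squared before averaging, RMSE weights large misses more heavily than a plain average of absolute errors would.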

Linear Regression: Diagnostics

We must check if our model’s assumptions are met. These plots help assess the model’s health.

Interpreting Diagnostics

Residuals vs. Fitted: We see no clear pattern, which is good.
Normal Q-Q: The points fall close to the dashed line, which is consistent with approximately normally distributed residuals.

The diagnostic plots support the use of a linear model for this data.

How Random Forest Works

Random Forest is a more complex ensemble model. It combines many individual decision trees into a single, more stable predictor.

It works by building hundreds of decision trees, each one trained on a bootstrap sample of the training data (a random sample drawn with replacement).

Random Forest: The “Wisdom of the Crowd”

To make a prediction, the model collects an output from every tree in the “forest”. For regression it averages the trees’ predictions; for classification it takes a majority vote. This “wisdom of the crowd” approach reduces the variance of any single tree.
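The bootstrap-and-average idea can be sketched for regression in a few lines of Python. This is a deliberately simplified stand-in: each "tree" here is a one-split stump, whereas a real Random Forest grows much deeper trees and also samples candidate features at each split. All function names and data values are hypothetical.

```python
import random

def fit_stump(x, y):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(x)):
        left = [b for a, b in zip(x, y) if a <= threshold]
        right = [b for a, b in zip(x, y) if a > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((b - left_mean) ** 2 for b in left)
               + sum((b - right_mean) ** 2 for b in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Every x in the sample is identical: predict the overall mean.
        mean_y = sum(y) / len(y)
        return (float("inf"), mean_y, mean_y)
    return best[1:]

def predict_stump(stump, value):
    threshold, left_mean, right_mean = stump
    return left_mean if value <= threshold else right_mean

def fit_forest(x, y, n_trees=50, seed=0):
    """Fit each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict_forest(forest, value):
    """Regression forests average the predictions of all trees."""
    return sum(predict_stump(s, value) for s in forest) / len(forest)

# Illustrative (petal length, sepal length) values in cm, made up for this sketch.
petal = [1.3, 1.4, 1.5, 4.5, 4.7, 4.9, 5.1, 5.9, 6.0]
sepal = [4.7, 4.9, 5.0, 6.4, 6.6, 6.9, 7.0, 7.1, 7.3]
forest = fit_forest(petal, sepal)
print(predict_forest(forest, 1.4), predict_forest(forest, 6.0))
```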

Model 2: Random Forest Regression

Performance Metrics:
RMSE: 0.456
R²: 0.684

Analysis: The simpler Linear Regression model had a slightly lower RMSE, showing that for this highly linear data, a simple model was more effective.

Part 2: Classification Models

Now we switch to our second goal: classifying a flower as “Long” or “Short”.

Classification Setup

First, we create our “Long” and “Short” categories using the median petal length as our threshold.
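A median split of this kind might look like the Python sketch below. How values exactly equal to the median are labeled is not stated here, so treating them as "Short" is an assumption; the sample values are also made up.

```python
import statistics

# Illustrative petal lengths in cm (made up, not the real dataset).
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]

threshold = statistics.median(petal_length)

# Assumption: strictly above the median is "Long", everything else is "Short".
labels = ["Long" if value > threshold else "Short" for value in petal_length]

print(threshold, labels)
```

Splitting at the median gives two classes of roughly equal size, which keeps accuracy a meaningful metric.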

Model 3: Logistic Regression Results

Logistic Regression is an effective model for binary classification that predicts the probability an observation belongs to a class.
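To show where the predicted probability comes from, here is a minimal Python sketch that fits a one-feature logistic regression by gradient descent. Statistical packages normally use a different fitting algorithm, so this is only an illustration; the function names, learning rate, epoch count, and data are all assumptions.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, learning_rate=0.5, epochs=2000):
    """Fit P(y=1) = sigmoid(b0 + b1 * (x - center)) by gradient descent."""
    center = sum(x) / len(x)  # centering x helps the descent converge
    b0 = b1 = 0.0
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for xi, yi in zip(x, y):
            error = sigmoid(b0 + b1 * (xi - center)) - yi
            grad0 += error
            grad1 += error * (xi - center)
        b0 -= learning_rate * grad0 / len(x)
        b1 -= learning_rate * grad1 / len(x)
    return b0, b1, center

def predict_proba(model, value):
    b0, b1, center = model
    return sigmoid(b0 + b1 * (value - center))

# Made-up sepal lengths (cm) with labels: 1 = "Long" petal, 0 = "Short" petal.
sepal = [4.7, 4.9, 5.0, 5.1, 5.8, 6.0, 6.4, 6.6, 6.9, 7.0]
is_long = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

model = fit_logistic(sepal, is_long)
print(f"P(Long | 4.8 cm) = {predict_proba(model, 4.8):.2f}")
print(f"P(Long | 6.8 cm) = {predict_proba(model, 6.8):.2f}")
```

A flower is then classified as "Long" when its predicted probability exceeds a cutoff, commonly 0.5.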

Performance Metrics:
Accuracy: 0.933 (the model was correct 93.3% of the time).
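Accuracy is the share of predictions that match the true label. A minimal Python sketch follows, with a hypothetical function name and made-up labels.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that equal the true label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Made-up true and predicted labels for four flowers.
actual = ["Long", "Short", "Long", "Short"]
predicted = ["Long", "Short", "Short", "Short"]
score = accuracy(actual, predicted)
print(score)
```

An accuracy reported as a proportion and the same figure as a percentage should agree, for example 0.75 and 75%.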

Model 4: Random Forest Classification

Next, we test the more complex Random Forest model for classification.

Performance Metrics:
Accuracy: 0.828

Random Forest classified 82.8% of the test flowers correctly, below the 93.3% achieved by Logistic Regression, so the simpler model again held its own.

Visualizing Classifier Performance: ROC Curve

The ROC Curve visualizes how well a classifier can distinguish between classes.

Interpreting the ROC Curve

The Area Under the Curve (AUC) is 0.894. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so a value of 0.894 indicates that the classifier distinguishes the two classes well.
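AUC has a useful interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The Python sketch below computes it directly from that definition; the function name and the example scores are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = "Long", 0 = "Short") and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))
```

In this toy example three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75.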

Conclusion

Regression: Petal.Length is a good predictor of Sepal.Length. The simple Linear Regression model was slightly more accurate than Random Forest (RMSE 0.439 vs. 0.456).
Classification: Sepal.Length was an effective feature for classifying petals, with Logistic Regression reaching 93.3% accuracy and Random Forest 82.8%.

Limitations & Future Work

Limitations: Our analysis only used two variables. Other features like Petal.Width could add more predictive power.
Future Work: A next step would be to build multivariate models using all available features.

Main Takeaway: More complex models are not always better. For this dataset, simple models were highly effective.