Exploring the Iris flower dataset

Rudaiba Tarannum
YWCA Higher Secondary Girls’ School
February 27, 2024

Part-1

Introduction

In this notebook, we explore the Iris dataset and apply statistical and machine-learning techniques to uncover the relationships between the features and the species of the flowers. We also use the dataset to build and evaluate a classifier that predicts the species of an iris flower from its measurements. The dataset was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper, “The use of multiple measurements in taxonomic problems.”

Iris Dataset

This dataset records sepal and petal measurements for three types of irises: Setosa, Versicolor, and Virginica. It contains 150 samples, each with 4 measurements: Sepal Length, Sepal Width, Petal Length, and Petal Width (all in centimeters). The rows represent the individual samples, while the columns hold the measured attributes.

glimpse(data)

Rows: 150
Columns: 5
$ sepal.length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1…
$ sepal.width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8…
$ petal.length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5…
$ petal.width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3, 0.3…
$ variety      <chr> "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa"…

The target variable for this dataset is “variety”. All other variables are of type <dbl>, which in R represents numeric values with double precision.
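The structure shown above can be reproduced without the original CSV file. The following is a minimal sketch, assuming that R's built-in `iris` data frame holds the same 150 samples as the notebook's `data`; the renaming and relabelling steps are our assumption, made only so that the column names and variety labels match the output above.

```r
# Assumption: the built-in `iris` data frame stands in for the notebook's `data`.
data <- iris
names(data) <- c("sepal.length", "sepal.width", "petal.length",
                 "petal.width", "variety")

# The notebook's labels are capitalised ("Setosa"), so relabel the levels
# (the built-in order is setosa, versicolor, virginica).
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")
data$variety <- as.character(data$variety)  # the notebook shows <chr>

str(data)            # base-R counterpart of dplyr::glimpse(data)
table(data$variety)  # number of samples per variety
```

`str()` is used here only to avoid a package dependency; with dplyr loaded, `glimpse(data)` prints the layout shown in the notebook.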

Variable      Description
sepal.length  length of the sepal
sepal.width   width of the sepal
petal.length  length of the petal
petal.width   width of the petal
variety       species of the flower

Apart from the target variable, the dataset contains no categorical variables, so no variable needs to be removed before the analysis.

Data exploration

The Data:

Let’s start analyzing the dataset to understand the relationship between various features and flower species. We will be examining the dataset using different plots, and for each plot, a brief explanation will be provided to prevent any misunderstandings and make it easier to comprehend the figure.

Let’s take a look at the data itself. Let’s see the first 5 rows of data for each class:

1)Setosa:

2)Versicolor:

3)Virginica:
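One way to list the first five rows of each class is to split the data frame by variety and take the head of each piece. This is a sketch under the assumption that the built-in `iris` data frame matches the notebook's `data`; the name `first_rows` is ours.

```r
# Assumption: the built-in `iris` data frame stands in for the notebook's `data`.
data <- iris
names(data) <- c("sepal.length", "sepal.width", "petal.length",
                 "petal.width", "variety")
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")

# Split into one data frame per variety, then keep the first 5 rows of each.
first_rows <- lapply(split(data, data$variety), head, n = 5)

first_rows$Setosa
first_rows$Versicolor
first_rows$Virginica
```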

Exploratory Data Analysis:

We want to learn more about the data. We can calculate basic statistics on each of the data frame’s columns with summary:

  sepal.length    sepal.width     petal.length    petal.width      variety         
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   Length:150        
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   Class :character  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   Mode  :character  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                     
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                     
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                     

Numbers can tell a lot, but sometimes it is better to see the statistics with different plots.

Static plots:

Scatter plot:

1)Sepal Length vs. Sepal Width:

This scatterplot shows the correlation between sepal length and sepal width within each of the three iris varieties, Setosa, Versicolor, and Virginica. Setosa shows the strongest positive correlation (R = 0.74, p = 6.7e-10), so longer Setosa sepals are also reliably wider. Versicolor shows a moderate positive correlation (R = 0.53, p = 8.8e-05). Virginica shows the weakest of the three (R = 0.46, p = 0.00084), although the relationship is still positive and statistically significant.

2)Petal Length vs. Petal Width:

This scatterplot shows the correlation between petal length and petal width within each of the three iris varieties, Setosa, Versicolor, and Virginica. Setosa shows a modest positive correlation (R = 0.33, p = 0.019). Versicolor shows a strong positive correlation (R = 0.79, p = 1.3e-11), so its petal length and width increase together quite consistently. Virginica shows a modest positive correlation (R = 0.32, p = 0.023), almost the same as Setosa.

3)Sepal Length vs. Petal Length:

This scatterplot shows the correlation between sepal length and petal length within each of the three iris varieties, Setosa, Versicolor, and Virginica. Virginica shows a strong positive correlation (R = 0.86, p = 6.3e-16), and Versicolor is similarly strong (R = 0.75, p = 2.6e-10). In contrast, Setosa shows only a weak positive correlation (R = 0.27, p = 0.061), which is not statistically significant at the 5% level.
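The per-variety R and p values quoted for the scatterplots can be checked numerically. The sketch below assumes the built-in `iris` data frame matches the notebook's `data` and that the plots report Pearson correlations; the helper `cor_by_variety` is a hypothetical name introduced for illustration.

```r
# Assumption: the built-in `iris` data frame stands in for the notebook's `data`.
data <- iris
names(data) <- c("sepal.length", "sepal.width", "petal.length",
                 "petal.width", "variety")
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")

# Hypothetical helper: Pearson r and p-value for columns x and y,
# computed separately within each variety.
cor_by_variety <- function(df, x, y) {
  t(sapply(split(df, df$variety), function(g) {
    ct <- cor.test(g[[x]], g[[y]])  # Pearson correlation by default
    c(r = unname(ct$estimate), p = ct$p.value)
  }))
}

sepal_lw <- cor_by_variety(data, "sepal.length", "sepal.width")
petal_lw <- cor_by_variety(data, "petal.length", "petal.width")
sepal_petal <- cor_by_variety(data, "sepal.length", "petal.length")

signif(sepal_lw, 2)
signif(petal_lw, 2)
signif(sepal_petal, 2)
```

Each result is a small matrix with one row per variety, which makes it easy to compare the printed numbers against the labels on the plots.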

Box plot:

1)Sepal Length:

Setosa has the shortest sepals of the three varieties. Versicolor is in the middle, with most values between about 5.5 and 6.5 cm. Virginica has the longest sepals, apart from one outlying observation.

2)Sepal Width:

Setosa has the widest sepals of the three species. Versicolor and Virginica have narrower sepals, but they follow fairly closely behind Setosa.

3)Petal Length:

Among the three species, there is a noticeable difference in the length of their petals. Specifically, Setosa stands out with the smallest petal lengths.

4)Petal Width:

Petal width varies markedly among the three species of iris. Setosa has the smallest petal width, while Virginica shows the greatest spread in petal width.
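The numbers behind a box plot can be inspected directly, which helps when reading medians and outliers off the figure. This is a minimal sketch for sepal length, assuming the built-in `iris` data frame matches the notebook's `data`; the notebook's own plotting code is not shown, so base R's `boxplot()` is used here as a stand-in.

```r
# Assumption: the built-in `iris` data frame stands in for the notebook's `data`.
data <- iris
names(data) <- c("sepal.length", "sepal.width", "petal.length",
                 "petal.width", "variety")
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")

# plot = FALSE returns the box plot statistics without drawing;
# drop that argument to draw the figure itself.
bp <- boxplot(sepal.length ~ variety, data = data, plot = FALSE)
colnames(bp$stats) <- bp$names
rownames(bp$stats) <- c("lower whisker", "lower hinge", "median",
                        "upper hinge", "upper whisker")

bp$stats              # five-number summary per variety
bp$out                # values that would be drawn as outlier points
bp$names[bp$group]    # the variety each outlier belongs to
```

The same call with `sepal.width`, `petal.length`, or `petal.width` on the left of the formula gives the statistics for the other three box plots.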

Violin plot:

1)Sepal Length:

The three varieties, Setosa, Versicolor, and Virginica, differ in sepal length. Setosa has the shortest sepals. Versicolor has moderate sepal lengths, mostly between about 5.5 and 6.5 cm. Virginica usually has the longest sepals, except for one observation that falls well below the rest.

2)Sepal Width:

Setosa again stands out with the widest sepals of the three species. Versicolor and Virginica have narrower sepals, but they follow fairly closely behind Setosa.

3)Petal Length:

Petal length differs clearly among the three species. Setosa stands out with very short petals, markedly shorter than those of the other two species.

4)Petal Width:

Petal width varies markedly among the three species of iris. Setosa has the smallest petal width, while Virginica shows the greatest spread in petal width.

Part-2

The plots we have been using for data visualization have been helpful so far. They have helped us understand the data better than before. However, to delve deeper into the dataset, we need to interact with the data. Therefore, let us explore some interactive plots.

Interactive plots:

This dataset contains 4 features: Sepal length, Sepal Width, Petal Length, and Petal Width. To display the data, we will use two types of interactive plots: violin plots and box plots. We will begin by displaying the Sepal length and the Sepal Width using the interactive violin plots, followed by the display of the Petal length and the Petal Width using the interactive Box plot.

Interactive Violin plot:

1)Sepal Length:

2)Sepal Width:

Interactive Box plot:

1)Petal Length:

2)Petal Width:

Correlation:

1)Calculating the correlation matrix:

             sepal.length sepal.width petal.length petal.width
sepal.length    1.0000000  -0.1175698    0.8717538   0.8179411
sepal.width    -0.1175698   1.0000000   -0.4284401  -0.3661259
petal.length    0.8717538  -0.4284401    1.0000000   0.9628654
petal.width     0.8179411  -0.3661259    0.9628654   1.0000000

The correlation matrix shows strong positive correlations between sepal length and both petal length (0.87) and petal width (0.82). There is also a very strong positive correlation between petal length and petal width (0.96). These results suggest roughly linear relationships between these attributes, as they tend to increase together. On the other hand, there is a weak negative correlation between sepal length and sepal width (-0.12): sepal width may decrease slightly as sepal length increases, but the relationship is not strong.
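The matrix above is the Pearson correlation of the four numeric columns across all 150 samples. A minimal sketch, assuming the built-in `iris` data frame matches the notebook's `data` (the names `cor_matrix` and `pair_table` are ours):

```r
# Assumption: the built-in `iris` data frame stands in for the notebook's `data`.
data <- iris
names(data) <- c("sepal.length", "sepal.width", "petal.length",
                 "petal.width", "variety")

# Pearson correlation matrix of the four measurements (variety excluded).
cor_matrix <- cor(data[, 1:4])
round(cor_matrix, 2)

# List each pair of variables once, ordered from strongest to weakest.
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_matrix)[idx[, 1]],
  var2 = colnames(cor_matrix)[idx[, 2]],
  r    = cor_matrix[idx]
)
pair_table[order(-abs(pair_table$r)), ]
```

Sorting by absolute value puts the petal length and petal width pair first and the sepal length and sepal width pair last, which mirrors the reading of the matrix given in the text.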

2)Correlation matrix plot:

Pair plot:

Principal Component Analysis(PCA):

Let’s start by assessing what we have at our disposal.

Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
sepal.length  0.5210659 -0.37741762  0.7195664  0.2612863
sepal.width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
petal.length  0.5804131 -0.02449161 -0.1421264 -0.8014492
petal.width   0.5648565 -0.06694199 -0.6342727  0.5235971

Let’s summarise the numbers,

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

When interpreting the proportion of variance, we want to know which components hold the most information and what percentage of the original data’s variability they capture. In this analysis, Components 1 and 2 together explain 95.81% of the variance: Component 1 alone accounts for 72.96%, and Component 2 contributes an additional 22.85%. The first two components therefore capture most of the variability present in the dataset. Components 3 and 4 explain much smaller proportions of variance (3.669% and 0.518% respectively), so their contribution to the overall variability of the data is minor.
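The proportions in the summary follow directly from the component standard deviations: each component's variance divided by the total variance. The sketch below assumes the built-in `iris` data frame matches the notebook's `data` and that the PCA was run with `prcomp()` on centred and scaled variables, which is consistent with the standard deviations printed above but is not confirmed by the notebook.

```r
# Assumption: the built-in `iris` data frame stands in for the notebook's `data`.
data <- iris
names(data) <- c("sepal.length", "sepal.width", "petal.length",
                 "petal.width", "variety")

# Assumption: PCA on the four measurements, centred and scaled to unit variance.
pca <- prcomp(data[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance = squared standard deviation / total variance.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
names(var_explained) <- paste0("PC", seq_along(var_explained))

round(var_explained, 4)          # per-component share
round(cumsum(var_explained), 4)  # cumulative share
summary(pca)                     # the same numbers in prcomp's own summary
```

The signs of the loadings in `pca$rotation` are arbitrary and can differ between platforms, so only the variance figures should be compared with the output above. `screeplot(pca)` draws these variances as a bar plot.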

1)Bar plot of PCAs:

The scree plot for the principal component analysis (PCA) shows the explained variance dropping sharply from 73% for the first component to 22.9% for the second, with the remaining two components contributing very little. The first two components therefore account for most of the variance in the data.

2) Contribution plot for PCs:

3) Cluster plot after PCA:

SVM model:

1)SVM model:

We will now create a machine learning model to predict iris variety based on petal and sepal measurements.

train_data
test_data
predict(svm_model, test_data[15,-5])
        76 
Versicolor 
Levels: Setosa Versicolor Virginica

This machine learning model has been trained using the train_data. Its purpose is to predict the variety of a flower from the values of its attributes. During training, the SVM learns decision boundaries that separate the three varieties in the space of the four measurements, placing each boundary so that the margin to the nearest training samples is as large as possible. When we give it a row from the test_data without the target variable, it determines on which side of these boundaries the row falls and returns the corresponding variety. In this specific case, the model predicted that test_data[15, -5] (the row labelled 76 in the output) belongs to the Versicolor variety, which is correct.
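The notebook's code for the split and the model is not shown, so the following is only a sketch of one way to obtain a comparable setup. It assumes the built-in `iris` data frame matches the notebook's `data`, a stratified hold-out of 10 samples per variety (as the confusion matrix suggests), and the `svm()` function from the e1071 package with its default settings; the seed is arbitrary, so the rows selected will differ from the notebook's.

```r
# Assumption: the built-in `iris` data frame stands in for the notebook's `data`.
data <- iris
names(data) <- c("sepal.length", "sepal.width", "petal.length",
                 "petal.width", "variety")
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")

set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Stratified split: hold out 10 randomly chosen rows of each variety.
rows_by_variety <- split(seq_len(nrow(data)), data$variety)
test_idx <- unlist(lapply(rows_by_variety, sample, size = 10))
test_data <- data[test_idx, ]
train_data <- data[-test_idx, ]

# Assumption: e1071 is the SVM implementation; skipped if not installed.
if (requireNamespace("e1071", quietly = TRUE)) {
  svm_model <- e1071::svm(variety ~ ., data = train_data)
  predicted <- predict(svm_model, test_data[, -5])  # column 5 is the target
  print(table(Prediction = predicted, Reference = test_data$variety))
}
```

Holding out the same number of samples from each variety keeps the test set balanced, so the accuracy is not dominated by any one class.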

2)Confusion matrix:

Confusion Matrix and Statistics

            Reference
Prediction   Setosa Versicolor Virginica
  Setosa         10          0         0
  Versicolor      0         10         0
  Virginica       0          0        10

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : 4.857e-15  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: Setosa Class: Versicolor Class: Virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000
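The headline statistics in this output can be recomputed from the 3 x 3 table alone. The sketch below re-enters the confusion matrix shown above and derives the accuracy, its exact binomial confidence interval, the no-information rate, and the per-class sensitivity and specificity in base R; the variable names are ours, and the use of `binom.test()` for the interval and p-value is our reconstruction of how these figures arise.

```r
classes <- c("Setosa", "Versicolor", "Virginica")

# Confusion matrix copied from the output above (rows = prediction).
conf_mat <- matrix(c(10,  0,  0,
                      0, 10,  0,
                      0,  0, 10),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Prediction = classes, Reference = classes))

n <- sum(conf_mat)
correct <- sum(diag(conf_mat))

accuracy <- correct / n
ci <- binom.test(correct, n)$conf.int   # exact 95% interval for the accuracy
nir <- max(colSums(conf_mat)) / n       # accuracy of always guessing the largest class
p_acc_gt_nir <- binom.test(correct, n, p = nir,
                           alternative = "greater")$p.value

# Per-class counts: true/false positives and negatives.
tp <- diag(conf_mat)
fp <- rowSums(conf_mat) - tp
fn <- colSums(conf_mat) - tp
tn <- n - tp - fp - fn

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

accuracy; round(ci, 4); round(nir, 4); signif(p_acc_gt_nir, 4)
rbind(sensitivity, specificity)
```

With only 30 test samples, even a perfect score leaves a lower confidence bound of about 0.88, which is a useful reminder that the accuracy of 1 is an estimate from a small test set.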

US Admission Dataset

a)Removing the Serial No column:

b)Pair plot:

c)Linear regression:

1) GRE Score:

2) TOEFL Score:

3) University rating:

4) SOP:

5) LOR:

6) CGPA:

7) Research:

The plots show a binary dependent variable, which makes linear regression unsuitable: a straight line can predict values below 0 or above 1. Logistic regression is a better fit, because it estimates the probability of admission. All factors examined show a positive association with admission, so higher values of each factor go with a higher probability of admission.

d)Polynomial regression:

1) GRE Score:

2) TOEFL Score:

3) University rating:

4) SOP:

5) LOR:

6) CGPA:

7) Research:

The polynomial fits plotted above indicate that higher GRE and TOEFL scores, stronger recommendation letters, a higher CGPA, and research experience are each associated with a higher chance of admission to US universities. These findings underline the importance of academic performance, language proficiency, strong endorsements, consistent grades, and research engagement in securing admission.

Conclusion

To sum up, analyzing the Iris flower dataset gave us the insights we needed to build an accurate machine learning model that predicts the species of an iris from its sepal and petal measurements. This project has demonstrated the power of data analysis and machine learning and how they can be applied to real-world problems.

Reference

1. Exploring the Iris flower dataset, Emine Bozkus, Medium - eminebozkus.medium.com/exploring-the-iris-flower-dataset-4e000bcc266c

2. Data Science Example - Iris dataset, lac.inpe - http://www.lac.inpe.br/~rafael.santos/Docs/CAP394/WholeStory-Iris.html

3. Iris flower data set, Wikipedia - https://en.wikipedia.org/wiki/Iris_flower_data_set

4. Explore diabetes-related factors and predictive modeling of admittance into a masters graduate program at the United States University: Machine learning approaches, Nasif Hossain, Department of Global Health Policy, Graduate School of Medicine, The University of Tokyo - https://rpubs.com/Nasif/1136525

