In this notebook, we will take a deep dive into the Iris dataset and apply an array of statistical and machine-learning techniques to uncover the relationships between the various features and the species of the flowers. Additionally, we will use this dataset to construct and evaluate a classifier that predicts the species of an iris flower based on its measurements. Get ready to explore and understand the Iris flower dataset like never before! The dataset was introduced by British statistician and biologist Ronald Fisher in his 1936 paper, “The use of multiple measurements in taxonomic problems.”
This data set includes information on the sepal and petal measurements of three different types of irises: Setosa, Versicolor, and Virginica. It contains a total of 150 samples, each with 4 measurements: Sepal Length, Sepal Width, Petal Length, and Petal Width (all in centimeters). The rows represent the individual samples, while the columns contain the measurements of the respective attributes.
glimpse(data)
Rows: 150
Columns: 5
$ sepal.length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1…
$ sepal.width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8…
$ petal.length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5…
$ petal.width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3, 0.3…
$ variety <chr> "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa", "Setosa"…
The target variable for this dataset is “variety”.
Additionally, it is worth noting that all other variables are of type double (<dbl>), i.e. numeric measurements.
| Variable | Description |
|---|---|
| sepal.length | Length of the sepal (cm) |
| sepal.width | Width of the sepal (cm) |
| petal.length | Length of the petal (cm) |
| petal.width | Width of the petal (cm) |
| variety | Species of the flower |
Since the only categorical variable in the dataset is the target variable itself, there is no need to remove any variable from the dataset.
Let’s start analyzing the dataset to understand the relationship between various features and flower species. We will be examining the dataset using different plots, and for each plot, a brief explanation will be provided to prevent any misunderstandings and make it easier to comprehend the figure.
Let’s take a look at the data itself. Let’s see the first 5 rows of data for each class:
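The call that produced this view is not shown, so here is a minimal base-R sketch of one way to get the first 5 rows per class. It assumes `data` can be rebuilt from R's built-in `iris` data set with the column names used in this notebook; `first_rows` is a hypothetical name.

```r
# Assumption: rebuild the notebook's `data` from the built-in iris data set,
# renaming columns and species labels to match those shown by glimpse(data).
data <- setNames(iris, c("sepal.length", "sepal.width", "petal.length",
                         "petal.width", "variety"))
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")

# Split by variety, keep the first 5 rows of each group, and stack the pieces.
first_rows <- do.call(rbind, lapply(split(data, data$variety), head, n = 5))
print(first_rows)
```

Splitting first and calling `head` on each piece keeps the example free of extra packages; a grouped `dplyr` pipeline would be an equally reasonable choice.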
We want to learn more about the data. We can calculate basic statistics on each of the data frame’s columns with summary:
sepal.length sepal.width petal.length petal.width variety
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 Length:150
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 Class :character
Median :5.800 Median :3.000 Median :4.350 Median :1.300 Mode :character
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Numbers can tell a lot, but sometimes it is better to see the statistics with different plots.
This scatterplot presents the correlation between sepal length and sepal width across three iris varieties, Setosa, Versicolor, and Virginica. Setosa displays the strongest positive correlation (R = 0.74, p = 6.7e-10), indicating that longer sepals tend to be noticeably wider in this species. Versicolor exhibits a moderate positive correlation (R = 0.53, p = 8.8e-05). Virginica shows the weakest, though still statistically significant, positive correlation (R = 0.46, p = 0.00084).
This scatterplot presents the correlation between petal length and petal width across three iris varieties, Setosa, Versicolor, and Virginica. Versicolor displays a strong positive correlation (R = 0.79, p = 1.3e-11), indicating a pronounced relationship between these attributes. Setosa (R = 0.33, p = 0.019) and Virginica (R = 0.32, p = 0.023) show only weak positive correlations of nearly identical strength.
This scatterplot presents the correlation between sepal length and petal length across three iris varieties, Setosa, Versicolor, and Virginica. Virginica shows a strong positive correlation (R = 0.86, p = 6.3e-16), indicating a robust association between sepal and petal lengths in this species. Versicolor demonstrates a similarly strong positive correlation (R = 0.75, p = 2.6e-10). In contrast, Setosa exhibits a weak positive correlation (R = 0.27, p = 0.061) that is not statistically significant at the 5% level.
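The R and p values quoted for the three scatterplots can be recomputed per variety with `cor.test`. This is a minimal sketch, assuming the plots used Pearson correlation and that `data` matches R's built-in `iris`; `pairs_to_test` and `results` are hypothetical names.

```r
# Assumption: rebuild the notebook's `data` from the built-in iris data set.
data <- setNames(iris, c("sepal.length", "sepal.width", "petal.length",
                         "petal.width", "variety"))
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")

# The three feature pairs discussed in the scatterplots.
pairs_to_test <- list(c("sepal.length", "sepal.width"),
                      c("petal.length", "petal.width"),
                      c("sepal.length", "petal.length"))

# For every variety and every pair, run a Pearson correlation test
# and collect the estimate and its p-value in one data frame.
results <- do.call(rbind, lapply(split(data, data$variety), function(d) {
  do.call(rbind, lapply(pairs_to_test, function(p) {
    ct <- cor.test(d[[p[1]]], d[[p[2]]])  # Pearson is the default method
    data.frame(variety = as.character(d$variety[1]), x = p[1], y = p[2],
               R = unname(ct$estimate), p.value = ct$p.value)
  }))
}))
rownames(results) <- NULL
print(results, digits = 2)
```

Printing one row per variety and pair makes it easy to confirm which coefficient belongs to which species, which is hard to read from overlapping plot labels.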
Setosa has the smallest sepal length among the three varieties, with Versicolor in the middle, mostly ranging from 5.5 to 6.5 cm. Virginica has the longest sepal length, apart from one observation that deviates from the rest.
Among the three species, Setosa has the widest sepals. Versicolor and Virginica follow closely behind, with sepals that are also fairly wide.
Among the three species, there is a noticeable difference in the length of their petals. Specifically, Setosa stands out with the smallest petal lengths.
Petal width varies considerably among the three species of iris. In particular, Setosa stands out for having the smallest petal width, while Virginica shows the widest spread in petal width.
The three varieties of plants - Setosa, Versicolor, and Virginica - have different sepal lengths. Setosa has the smallest sepal length, typically measuring shorter than the other two varieties. The Versicolor variety has a moderate sepal length, ranging from 5.5 to 6.5 cm. On the other hand, the Virginica variety usually has the longest sepal length, except for one observation where it deviates from the norm.
Among the three species, Setosa stands out for its sepal width. Its sepals are wider than those of Versicolor and Virginica, giving it a distinctive appearance. Versicolor and Virginica do not reach the same width, but their sepals are still quite broad and follow closely behind Setosa.
Out of the three species, it is worth noting that there is a distinct variation in the length of their petals. In particular, Setosa species stands out with its remarkably small petals, which are significantly shorter than those of the other two species.
Among the three species of iris, there is considerable variation in petal width. Setosa has the smallest petal width, while Virginica exhibits the widest spread in petal width.
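The group differences described above can also be checked numerically. The sketch below computes the five-number summary that underlies a box plot for sepal length, then lists the points a box plot would flag as outliers. It assumes `data` matches R's built-in `iris`; `by_variety`, `five_num`, and `outliers` are hypothetical names.

```r
# Assumption: rebuild the notebook's `data` from the built-in iris data set.
data <- setNames(iris, c("sepal.length", "sepal.width", "petal.length",
                         "petal.width", "variety"))
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")

# Sepal length values grouped by variety.
by_variety <- split(data$sepal.length, data$variety)

# Tukey's five-number summary per variety (one column per variety).
five_num <- sapply(by_variety, fivenum)
rownames(five_num) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(five_num)

# Points lying beyond 1.5 times the hinge spread, as a box plot would mark them.
outliers <- lapply(by_variety, function(x) boxplot.stats(x)$out)
print(outliers)
```

Swapping `sepal.length` for any of the other three columns gives the corresponding numbers for the remaining plots.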
The plots we have been using for data visualization have been helpful so far. They have helped us understand the data better than before. However, to delve deeper into the dataset, we need to interact with the data. Therefore, let us explore some interactive plots.
This dataset contains 4 features: sepal length, sepal width, petal length, and petal width. To display the data, we will use two types of interactive plots: violin plots and box plots. We will begin with interactive violin plots of sepal length and sepal width, followed by interactive box plots of petal length and petal width.
sepal.length sepal.width petal.length petal.width
sepal.length 1.0000000 -0.1175698 0.8717538 0.8179411
sepal.width -0.1175698 1.0000000 -0.4284401 -0.3661259
petal.length 0.8717538 -0.4284401 1.0000000 0.9628654
petal.width 0.8179411 -0.3661259 0.9628654 1.0000000
The correlation matrix shows strong positive correlations between sepal length and both petal length (0.87) and petal width (0.82), and an even stronger one between petal length and petal width (0.96). These attributes tend to increase together, which suggests approximately linear relationships between them. Sepal width, by contrast, is moderately negatively correlated with petal length (-0.43) and petal width (-0.37), and only weakly negatively correlated with sepal length (-0.12): sepal width may decrease slightly as sepal length increases, but this relationship is not strong.
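A matrix like the one above can be obtained with base R's `cor` on the four numeric columns. This is a minimal sketch, assuming `data` matches R's built-in `iris`; `cor_matrix` is a hypothetical name.

```r
# Assumption: rebuild the notebook's `data` from the built-in iris data set.
data <- setNames(iris, c("sepal.length", "sepal.width", "petal.length",
                         "petal.width", "variety"))

# Pearson correlation between every pair of numeric features
# (column 5, the variety, is excluded).
cor_matrix <- cor(data[, 1:4])
print(round(cor_matrix, 2))
```

Note that this matrix pools all three varieties, so its coefficients can differ in size and even in sign from correlations computed within a single variety.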
Let’s start by assessing what we have at our disposal.
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
sepal.length 0.5210659 -0.37741762 0.7195664 0.2612863
sepal.width -0.2693474 -0.92329566 -0.2443818 -0.1235096
petal.length 0.5804131 -0.02449161 -0.1421264 -0.8014492
petal.width 0.5648565 -0.06694199 -0.6342727 0.5235971
Let’s summarise the numbers:
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
When interpreting the proportion of variance, the question is which components hold the most information and what share of the original data’s variability they capture. In this analysis, Components 1 and 2 together explain 95.81% of the variance: Component 1 alone accounts for 72.96%, and Component 2 contributes an additional 22.85%. The first two components therefore capture the majority of the variability present in the dataset. Components 3 and 4 explain much smaller proportions of variance (3.669% and 0.518% respectively), so their contribution to the overall variability is minor.
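The proportions in the summary follow directly from the component standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The sketch below assumes the PCA was run with `prcomp` on centred and scaled features (the reported standard deviations square to a total of about 4, which is consistent with scaling) and that `data` matches R's built-in `iris`; `pca`, `prop`, and `cum` are hypothetical names.

```r
# Assumption: rebuild the notebook's `data` from the built-in iris data set.
data <- setNames(iris, c("sepal.length", "sepal.width", "petal.length",
                         "petal.width", "variety"))

# PCA on the four numeric features, centred and scaled to unit variance.
pca <- prcomp(data[, 1:4], center = TRUE, scale. = TRUE)

# Variance of each component is the square of its standard deviation.
variance <- pca$sdev^2
prop <- variance / sum(variance)  # proportion of variance per component
cum <- cumsum(prop)               # cumulative proportion

print(round(rbind(proportion = prop, cumulative = cum), 4))
```

Because the features are scaled, the total variance equals the number of features, so the first proportion is simply 1.7084² / 4.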
The scree plot for the principal component analysis (PCA) shows the explained variance dropping from 73% for the first component to 22.9% for the second, with the remaining two components contributing very little. This suggests that the first two components account for most of the variance in the data.
We will now create a machine learning model to predict iris variety based on petal and sepal measurements.
train_data
test_data
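The code that created `train_data` and `test_data` is not shown. The confusion matrix later in the notebook contains 10 test samples per variety, which is consistent with a stratified 80/20 split. Here is a minimal base-R sketch of such a split, assuming `data` matches R's built-in `iris`; the seed and the name `test_idx` are hypothetical.

```r
# Assumption: rebuild the notebook's `data` from the built-in iris data set.
data <- setNames(iris, c("sepal.length", "sepal.width", "petal.length",
                         "petal.width", "variety"))
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")

set.seed(42)  # hypothetical seed, only so the sketch is reproducible

# Draw 10 row indices at random from each variety (stratified sampling).
rows_by_variety <- split(seq_len(nrow(data)), data$variety)
test_idx <- sort(unname(unlist(lapply(rows_by_variety, sample, size = 10))))

test_data <- data[test_idx, ]    # 30 rows, 10 per variety
train_data <- data[-test_idx, ]  # remaining 120 rows, 40 per variety

print(table(train_data$variety))
print(table(test_data$variety))
```

Stratifying keeps the class proportions identical in both sets, which matters for a data set this small.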
predict(svm_model, test_data[15,-5])
76
Versicolor
Levels: Setosa Versicolor Virginica
This machine learning model (a support vector machine, as the name svm_model indicates) has been trained using the train_data. Its purpose is to predict the variety of a flower based on the values of its attributes. During training, the model learns decision boundaries that separate the three varieties in the space of the four measurements. When we give it a row from the test_data without the target variable, it determines on which side of those boundaries the observation falls and returns the corresponding variety. In this specific case, the model predicted that test_data[15, -5] (row 76 of the original data) belongs to the Versicolor variety, which is correct.
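The fitting code is not shown, so the following is only a sketch of how a model like `svm_model` could be trained and queried. It assumes the `e1071` package and its `svm` function with default settings (a radial kernel), a stratified 80/20 split, and a hypothetical seed; the notebook's actual settings may differ.

```r
# Assumption: rebuild the notebook's `data` from the built-in iris data set.
data <- setNames(iris, c("sepal.length", "sepal.width", "petal.length",
                         "petal.width", "variety"))
levels(data$variety) <- c("Setosa", "Versicolor", "Virginica")

set.seed(42)  # hypothetical seed
rows_by_variety <- split(seq_len(nrow(data)), data$variety)
test_idx <- sort(unname(unlist(lapply(rows_by_variety, sample, size = 10))))
test_data <- data[test_idx, ]
train_data <- data[-test_idx, ]

# Assumption: e1071 is installed; the sketch is skipped otherwise.
if (requireNamespace("e1071", quietly = TRUE)) {
  # Fit an SVM classifier predicting variety from the four measurements.
  svm_model <- e1071::svm(variety ~ ., data = train_data)

  # Predict every test row, passing only the feature columns.
  pred <- predict(svm_model, test_data[, -5])
  accuracy <- mean(pred == test_data$variety)

  print(predict(svm_model, test_data[15, -5]))  # prediction for a single row
  print(accuracy)
} else {
  message("Install e1071 to run this sketch: install.packages('e1071')")
}
```

Dropping column 5 before predicting mirrors the call `predict(svm_model, test_data[15,-5])` shown above, so the model never sees the true label.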
Confusion Matrix and Statistics
Reference
Prediction Setosa Versicolor Virginica
Setosa 10 0 0
Versicolor 0 10 0
Virginica 0 0 10
Overall Statistics
Accuracy : 1
95% CI : (0.8843, 1)
No Information Rate : 0.3333
P-Value [Acc > NIR] : 4.857e-15
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Setosa Class: Versicolor Class: Virginica
Sensitivity 1.0000 1.0000 1.0000
Specificity 1.0000 1.0000 1.0000
Pos Pred Value 1.0000 1.0000 1.0000
Neg Pred Value 1.0000 1.0000 1.0000
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3333 0.3333
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 1.0000 1.0000
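The headline figures in this report can be recomputed from the 3 x 3 table with base R alone. The sketch below assumes the confidence interval and the "Acc > NIR" p-value come from exact binomial tests, which reproduces the numbers printed above; `cm` and the other names are hypothetical.

```r
# The confusion matrix as reported: rows are predictions, columns are references.
varieties <- c("Setosa", "Versicolor", "Virginica")
cm <- matrix(c(10, 0, 0,
               0, 10, 0,
               0, 0, 10),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = varieties, Reference = varieties))

n <- sum(cm)              # number of test samples
correct <- sum(diag(cm))  # correctly classified samples
accuracy <- correct / n

# No Information Rate: accuracy of always guessing the largest class.
nir <- max(colSums(cm)) / n

# Exact binomial 95% confidence interval for the accuracy.
ci <- binom.test(correct, n)$conf.int

# One-sided test of whether the accuracy exceeds the No Information Rate.
p_value <- binom.test(correct, n, p = nir, alternative = "greater")$p.value

# Per-class sensitivity: correct predictions divided by reference totals.
sensitivity <- diag(cm) / colSums(cm)

print(c(accuracy = accuracy, nir = nir, lower = ci[1], upper = ci[2]))
print(p_value)
print(sensitivity)
```

The lower confidence bound of about 0.88 is a reminder that a perfect score on 30 samples still leaves real uncertainty about accuracy on new flowers.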
To sum up, analyzing the Iris flower dataset gave us the insights needed to build a machine learning model that predicts the species of an iris from its sepal and petal measurements; it classified all 30 test samples correctly. This project has demonstrated how data analysis and machine learning can be combined to solve a classification problem.
1. Exploring the Iris flower dataset, Emine Bozkus, Medium - eminebozkus.medium.com/exploring-the-iris-flower-dataset-4e000bcc266c
3. Iris flower data set, Wikipedia - https://en.wikipedia.org/wiki/Iris_flower_data_set
4. Explore diabetes-related factors and predictive modeling of admittance into a masters graduate program at the United States University: Machine learning approaches, Nasif Hossain, Department of Global Health Policy, Graduate School of Medicine, The University of Tokyo - https://rpubs.com/Nasif/1136525