Descriptive statistics of the Iris Flower dataset

iris_data <- read.table("./IRIS.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".")

head(iris_data)
##   sepal_length sepal_width petal_length petal_width     species
## 1          5.1         3.5          1.4         0.2 Iris-setosa
## 2          4.9         3.0          1.4         0.2 Iris-setosa
## 3          4.7         3.2          1.3         0.2 Iris-setosa
## 4          4.6         3.1          1.5         0.2 Iris-setosa
## 5          5.0         3.6          1.4         0.2 Iris-setosa
## 6          5.4         3.9          1.7         0.4 Iris-setosa

Description of the data set used in the analysis:

Unit of observation: an individual iris flower. Each row of the dataset represents a single flower, with measurements of its sepal length, sepal width, petal length, and petal width, together with the species it belongs to (Setosa, Versicolor, or Virginica).

Sample size: 150 individual iris flowers
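
The stated sample size can be checked directly once the data are loaded. The sketch below is a minimal check that uses R's built-in iris data frame as a stand-in for IRIS.csv (an assumption, so that the snippet runs without the file), renamed to the CSV's column names. With the real file, the iris_data object returned by read.table() would be used instead; the name iris_check is chosen only for illustration.

```r
# Assumption: R's built-in iris data stands in for IRIS.csv,
# with columns renamed to match the CSV header.
iris_check <- setNames(datasets::iris,
                       c("sepal_length", "sepal_width", "petal_length",
                         "petal_width", "species"))

n_obs <- nrow(iris_check)                    # number of flowers (rows)
species_counts <- table(iris_check$species)  # flowers per species

print(n_obs)
print(species_counts)
```

A balanced count per species is worth confirming before comparing group means, since unequal group sizes would change how the summaries should be read.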

Definition of variables:

  • sepal_length: sepal length in cm
  • sepal_width: sepal width in cm
  • petal_length: petal length in cm
  • petal_width: petal width in cm
  • species: species the flower belongs to (Iris-setosa, Iris-versicolor, Iris-virginica)

Source: Kaggle

Convert Categorical Variables to Factors:

In the dataset, the species column is a categorical variable. I convert it to a factor, stored in a new column species_f, so that it has a fixed set of levels with short, readable labels for grouping and plotting.

iris_data$species_f <- factor(iris_data$species,
                         levels = c("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
                         labels = c("Setosa", "Versicolor", "Virginica"))

head(iris_data)
##   sepal_length sepal_width petal_length petal_width     species species_f
## 1          5.1         3.5          1.4         0.2 Iris-setosa    Setosa
## 2          4.9         3.0          1.4         0.2 Iris-setosa    Setosa
## 3          4.7         3.2          1.3         0.2 Iris-setosa    Setosa
## 4          4.6         3.1          1.5         0.2 Iris-setosa    Setosa
## 5          5.0         3.6          1.4         0.2 Iris-setosa    Setosa
## 6          5.4         3.9          1.7         0.4 Iris-setosa    Setosa
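
One detail of factor() worth keeping in mind is that values which do not match any entry in levels are silently turned into NA. The sketch below illustrates this on a small hypothetical vector (the misspelled "iris-setosa" entry is invented for the example) and shows a quick check that could be run after the conversion.

```r
# Hypothetical raw values; the last one has a lowercase typo on purpose.
raw_species <- c("Iris-setosa", "Iris-virginica", "Iris-versicolor",
                 "Iris-setosa", "iris-setosa")

species_f <- factor(raw_species,
                    levels = c("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
                    labels = c("Setosa", "Versicolor", "Virginica"))

# The short labels, in the order given to `labels`
print(levels(species_f))

# Number of values that matched none of the levels and became NA
n_unmatched <- sum(is.na(species_f))
print(n_unmatched)

# Counts per label, with an extra NA column when something was unmatched
print(table(species_f, useNA = "ifany"))
```

Running the same sum(is.na(...)) check on iris_data$species_f would confirm that every row of the CSV was mapped to one of the three labels.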

Descriptive statistics:

I pipe the iris_data dataset into group_by() to group it by the species_f column. The grouped data are then piped into summarise(), which collapses each group to a single row using the function I provide. To avoid repetition, I process all four variables in one call with across(). Its first argument is a vector of the column names to summarise (petal_width, petal_length, sepal_width, sepal_length). The second argument is the aggregation function of my choice, mean, which is applied to each of those columns separately. The third argument, .names, sets the naming convention for the output columns: "mean_{.col}" prepends the prefix mean_ to each processed column name, marking its contents. The result is saved in mean_variables_per_species.

library(dplyr)  # provides %>%, group_by(), summarise() and across()

mean_variables_per_species <- iris_data %>%
  group_by(species_f) %>%
  summarise(across(c(petal_width, petal_length, sepal_width, sepal_length),
                   mean, .names = "mean_{.col}"))

show(mean_variables_per_species)
## # A tibble: 3 × 5
##   species_f  mean_petal_width mean_petal_length mean_sepal_width
##   <fct>                 <dbl>             <dbl>            <dbl>
## 1 Setosa                0.244              1.46             3.42
## 2 Versicolor            1.33               4.26             2.77
## 3 Virginica             2.03               5.55             2.97
## # ℹ 1 more variable: mean_sepal_length <dbl>
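
The same across() pattern extends to several summary functions at once, and the column truncated in the printout above can be shown by widening the print. The following is a sketch under assumptions: it uses R's built-in iris data as a stand-in for IRIS.csv (its values may differ slightly from CSV copies of the dataset, so the numbers need not match the table above exactly), and the names iris_demo and summary_per_species are chosen for illustration.

```r
library(dplyr)

# Assumption: built-in iris data stands in for IRIS.csv.
iris_demo <- setNames(datasets::iris,
                      c("sepal_length", "sepal_width", "petal_length",
                        "petal_width", "species_f"))

# A named list of functions yields one output column per (function, column) pair
summary_per_species <- iris_demo %>%
  group_by(species_f) %>%
  summarise(across(c(petal_width, petal_length, sepal_width, sepal_length),
                   list(mean = mean, sd = sd),
                   .names = "{.fn}_{.col}"))

# width = Inf prints every column instead of truncating the tibble
print(summary_per_species, width = Inf)
```

Here "{.fn}" in .names is replaced by the name given to each function in the list, so the mean columns keep the mean_ prefix used in the analysis and the standard deviations get an sd_ prefix.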

Graphical visualisation of distribution of the variables using scatterplots

I use scatterplots to visualise the petal and sepal characteristics of the different species. With the ggplot2 library, I plot how the dataset entries are distributed and how they cluster according to these variables.

The first scatterplot, Iris Dataset: Petal Width vs. Petal Height, shows each dataset entry by its petal_width and petal_length values (the axis label "Petal Height" refers to petal_length), colored by the factor species_f. Virginica irises have the longest and widest petals of the three species and cluster toward the top-right corner of the plot. Versicolor irises cluster around the center, indicating intermediate petal length and width. Setosa irises cluster in the bottom-left corner, showing that they have the smallest petals of all species in the dataset.

library(ggplot2)  # provides ggplot(), aes(), geom_point(), labs(), theme_minimal()

ggplot(iris_data, aes(x = petal_width, y = petal_length, color = species_f)) +
  geom_point() +
  labs(x = "Petal Width", y = "Petal Height", title = "Iris Dataset: Petal Width vs. Petal Height") +
  theme_minimal()
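
The visual separation of Setosa in the petal plot can also be checked numerically by comparing the range of petal lengths per species. A minimal sketch, again assuming R's built-in iris data as a stand-in for IRIS.csv; the names petal_ranges and setosa_separated are chosen for illustration.

```r
# Assumption: built-in iris data stands in for IRIS.csv.
petal_length <- datasets::iris$Petal.Length
species <- datasets::iris$Species

# One column per species, holding the minimum and maximum petal length
petal_ranges <- sapply(split(petal_length, species), range)
rownames(petal_ranges) <- c("min", "max")
print(petal_ranges)

# TRUE when the longest Setosa petal is shorter than the shortest petal
# of both other species, i.e. the Setosa cluster does not overlap them
setosa_separated <- petal_ranges["max", "setosa"] <
  min(petal_ranges["min", c("versicolor", "virginica")])
print(setosa_separated)
```

The same comparison on petal width, or between Versicolor and Virginica, would show where the clusters start to overlap.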

The second scatterplot, Iris Dataset: Sepal Width vs. Sepal Height, is built the same way from the sepal_width and sepal_length variables, again colored by the factor species_f. Virginica and Versicolor irises have similar sepal length and width, since their points overlap considerably, although Virginica sepals are on average longer and wider than those of Versicolor. Setosa irises cluster in the lower part of the plot: their sepals are shorter, but on average wider, than those of the other two species.

ggplot(iris_data, aes(x = sepal_width, y = sepal_length, color = species_f)) +
  geom_point() +
  labs(x = "Sepal Width", y = "Sepal Height", title = "Iris Dataset: Sepal Width vs. Sepal Height") +
  theme_minimal()
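
The reading of the sepal plot can be cross-checked with per-species medians. A minimal sketch in base R, assuming the built-in iris data as a stand-in for IRIS.csv (so it uses that data frame's own column names); the names sepal_medians, shortest and widest are chosen for illustration.

```r
# Assumption: built-in iris data stands in for IRIS.csv.
sepal_medians <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                           data = datasets::iris, FUN = median)
print(sepal_medians)

# Species with the shortest and with the widest sepals, judged by the median
shortest <- as.character(
  sepal_medians$Species[which.min(sepal_medians$Sepal.Length)])
widest <- as.character(
  sepal_medians$Species[which.max(sepal_medians$Sepal.Width)])

print(c(shortest = shortest, widest = widest))
```

Medians are used here rather than means so that a few unusually large or small flowers do not drive the comparison.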