iris_data <- read.table("./IRIS.csv",
                        header = TRUE,
                        sep = ",",
                        dec = ".")
head(iris_data)
## sepal_length sepal_width petal_length petal_width species
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
Unit of observation: An individual iris flower. Each row in the dataset represents a single flower, with measurements of its sepal length, sepal width, petal length, and petal width, plus the species it belongs to (Setosa, Versicolor, or Virginica).
Sample size: 150 individual iris flowers.
Definition of variables:
sepal_length: sepal length in cm
sepal_width: sepal width in cm
petal_length: petal length in cm
petal_width: petal width in cm
species: species the flower belongs to (Iris-setosa, Iris-versicolor, Iris-virginica)
Source: Kaggle
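The stated sample size and species breakdown can be checked directly. The sketch below is only an illustration: since IRIS.csv is not bundled here, it assumes R's built-in `iris` data as a stand-in and renames its columns to match the CSV layout. The built-in data may differ slightly from the Kaggle file in a few values.

```r
# Assumption: the built-in `iris` data stands in for IRIS.csv,
# with columns renamed to match the CSV layout described above.
iris_data <- data.frame(
  sepal_length = iris$Sepal.Length,
  sepal_width  = iris$Sepal.Width,
  petal_length = iris$Petal.Length,
  petal_width  = iris$Petal.Width,
  species      = paste0("Iris-", iris$Species)
)

# One row per flower, so the row count is the sample size
n_obs <- nrow(iris_data)
print(n_obs)

# Number of flowers per species
species_counts <- table(iris_data$species)
print(species_counts)

# Column types: four numeric measurements and one character column
str(iris_data)
```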
Convert Categorical Variables to Factors:
In the dataset, the species column is a categorical
variable. I will convert it to a factor for better analysis and
visualization.
iris_data$species_f <- factor(iris_data$species,
                              levels = c("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
                              labels = c("Setosa", "Versicolor", "Virginica"))
head(iris_data)
## sepal_length sepal_width petal_length petal_width species species_f
## 1 5.1 3.5 1.4 0.2 Iris-setosa Setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa Setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa Setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa Setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa Setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa Setosa
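One thing worth checking after such a conversion: `factor()` silently turns any value that is not listed in `levels` into `NA`, so a typo in a level name would go unnoticed. A minimal sketch of that check follows, again assuming the built-in `iris` data as a stand-in for IRIS.csv.

```r
# Assumption: the built-in `iris` data stands in for IRIS.csv
iris_data <- data.frame(
  petal_length = iris$Petal.Length,
  species      = paste0("Iris-", iris$Species)
)

iris_data$species_f <- factor(iris_data$species,
                              levels = c("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
                              labels = c("Setosa", "Versicolor", "Virginica"))

# Values of `species` missing from `levels` would have become NA
n_missing <- sum(is.na(iris_data$species_f))
print(n_missing)

# Cross-tabulate original strings against the new labels
print(table(iris_data$species, iris_data$species_f))
```

If `n_missing` is greater than zero, the `levels` vector does not cover every spelling present in the `species` column.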
I pipe the iris_data dataset into the group_by function to group it
by the species_f column. The grouped dataset is then piped into the
summarise function, which aggregates each group using the function that
I provide. Additionally, to avoid repetition, I process all variables in
one go using the across function. Its first argument is the vector of
column names (petal_width, petal_length, sepal_width, sepal_length)
that I want to process separately. The second argument is the
aggregation function of my choice, mean, which is applied to each column
named in the vector. The third argument, .names, is the naming
convention for the output columns. I pass "mean_{.col}" to prepend the
mean_ prefix to each processed column name, thus marking its contents.
All aggregated columns are then saved in the
mean_variables_per_species dataset.
library(dplyr)  # provides %>%, group_by(), summarise(), and across()

mean_variables_per_species <- iris_data %>%
  group_by(species_f) %>%
  summarise(across(c(petal_width, petal_length, sepal_width, sepal_length),
                   mean, .names = "mean_{.col}"))
show(mean_variables_per_species)
## # A tibble: 3 × 5
## species_f mean_petal_width mean_petal_length mean_sepal_width
## <fct> <dbl> <dbl> <dbl>
## 1 Setosa 0.244 1.46 3.42
## 2 Versicolor 1.33 4.26 2.77
## 3 Virginica 2.03 5.55 2.97
## # ℹ 1 more variable: mean_sepal_length <dbl>
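The printed tibble hides mean_sepal_length because the console is too narrow. As a hedged sketch, the same summary can be printed with `width = Inf` to show every column. This variant also writes the aggregation as an anonymous function so that `na.rm = TRUE` can be passed; that is an addition for illustration, not part of the original call, and the `\(x)` syntax requires R 4.1 or later. The built-in `iris` data is assumed as a stand-in for IRIS.csv, so the means may differ slightly from those shown above.

```r
library(dplyr)

# Assumption: the built-in `iris` data stands in for IRIS.csv
iris_data <- data.frame(
  sepal_length = iris$Sepal.Length,
  sepal_width  = iris$Sepal.Width,
  petal_length = iris$Petal.Length,
  petal_width  = iris$Petal.Width,
  species_f    = factor(iris$Species,
                        levels = c("setosa", "versicolor", "virginica"),
                        labels = c("Setosa", "Versicolor", "Virginica"))
)

mean_variables_per_species <- iris_data %>%
  group_by(species_f) %>%
  summarise(across(
    c(petal_width, petal_length, sepal_width, sepal_length),
    \(x) mean(x, na.rm = TRUE),   # anonymous function, ignores missing values
    .names = "mean_{.col}"
  ))

# width = Inf prints all columns instead of truncating the tibble
print(mean_variables_per_species, width = Inf)
```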
Using scatterplots, I decided to visualise the petal and sepal
characteristics of the different species. Using the ggplot2 library,
I visualised the distribution and clustering of the dataset entries
based on these variables.
In the first scatterplot,
Iris Dataset: Petal Width vs. Petal Height, I visualised
each dataset entry based on its petal_width and
petal_length variables and colored the points by the factor
species_f. The plot labels petal_length as Petal Height. In this
visualisation we can clearly see that Virginica irises have the longest
and widest petals of all species and are clustered around the top-right
corner of the graph. Versicolor irises are clustered around the center,
indicating intermediate petal length and width. The smallest ones,
Setosa irises, are clustered around the bottom-left corner, indicating
that they have the smallest petals of all species present in the
dataset.
library(ggplot2)  # provides ggplot(), aes(), geom_point(), labs(), theme_minimal()

ggplot(iris_data, aes(x = petal_width, y = petal_length, color = species_f)) +
  geom_point() +
  labs(x = "Petal Width", y = "Petal Height", title = "Iris Dataset: Petal Width vs. Petal Height") +
  theme_minimal()
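The clustering visible in the petal plot can also be checked numerically. The sketch below compares per-species ranges of petal_length; it assumes the built-in `iris` data as a stand-in for IRIS.csv, and the variable names `setosa_separated` and `versicolor_virginica_overlap` are introduced here purely for illustration.

```r
# Assumption: the built-in `iris` data stands in for IRIS.csv
iris_data <- data.frame(
  petal_length = iris$Petal.Length,
  species_f    = factor(iris$Species,
                        levels = c("setosa", "versicolor", "virginica"),
                        labels = c("Setosa", "Versicolor", "Virginica"))
)

# Minimum and maximum petal length per species (rows = species)
petal_by_species <- split(iris_data$petal_length, iris_data$species_f)
petal_length_range <- t(sapply(petal_by_species, range))
colnames(petal_length_range) <- c("min", "max")
print(petal_length_range)

# Setosa forms a separate cluster if its longest petal is shorter
# than the shortest Versicolor petal
setosa_separated <-
  petal_length_range["Setosa", "max"] < petal_length_range["Versicolor", "min"]

# Versicolor and Virginica touch if their ranges overlap
versicolor_virginica_overlap <-
  petal_length_range["Versicolor", "max"] >= petal_length_range["Virginica", "min"]

print(setosa_separated)
print(versicolor_virginica_overlap)
```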
In the second scatterplot,
Iris Dataset: Sepal Width vs. Sepal Height, I similarly visualised
each dataset entry based on its sepal_width and
sepal_length variables and colored the points by the factor
species_f. In this visualisation we can see that
Virginica and Versicolor irises possess very similar sepal length and
width, since their points are clustered together. Overall, though,
Virginica sepals are longer and wider than those of Versicolor. Setosa
irises are clustered around the bottom-center of the plot, which shows
that their sepals are shorter but overall wider than those of the other
two species.
ggplot(iris_data, aes(x = sepal_width, y = sepal_length, color = species_f)) +
  geom_point() +
  labs(x = "Sepal Width", y = "Sepal Height", title = "Iris Dataset: Sepal Width vs. Sepal Height") +
  theme_minimal()
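As a rough numeric companion to the sepal plot, per-species medians can confirm which species has the shortest and which the widest sepals. This is a sketch only: it assumes the built-in `iris` data as a stand-in for IRIS.csv, and it uses the median rather than the mean simply to reduce the influence of outlying points.

```r
# Assumption: the built-in `iris` data stands in for IRIS.csv
iris_data <- data.frame(
  sepal_length = iris$Sepal.Length,
  sepal_width  = iris$Sepal.Width,
  species_f    = factor(iris$Species,
                        levels = c("setosa", "versicolor", "virginica"),
                        labels = c("Setosa", "Versicolor", "Virginica"))
)

# Median sepal length and width for each species
sepal_medians <- aggregate(cbind(sepal_length, sepal_width) ~ species_f,
                           data = iris_data, FUN = median)
print(sepal_medians)

# Species with the shortest and with the widest sepals
shortest_sepal <- as.character(
  sepal_medians$species_f[which.min(sepal_medians$sepal_length)])
widest_sepal <- as.character(
  sepal_medians$species_f[which.max(sepal_medians$sepal_width)])
print(shortest_sepal)
print(widest_sepal)
```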