The Iris dataset is one of the best-known built-in datasets in R.
It contains measurements of four features (sepal length, sepal width, petal length, and petal width) of iris flowers from three species: setosa, versicolor, and virginica.
In this document, we perform five descriptive analyses and five visualizations to understand the dataset better.
# Load essential packages
library(dplyr)
library(ggplot2)
# Load the inbuilt dataset
data("iris")
# Display first few rows
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
# cat() already separates its arguments with a space, so the labels need no trailing space
cat("Number of Rows:", nrow(iris), "\n")
## Number of Rows: 150
cat("Number of Columns:", ncol(iris), "\n")
## Number of Columns: 5
iris %>%
  summarise(
    Mean_Sepal_Length   = mean(Sepal.Length),
    Median_Sepal_Length = median(Sepal.Length),
    SD_Sepal_Length     = sd(Sepal.Length)
  )
##   Mean_Sepal_Length Median_Sepal_Length SD_Sepal_Length
## 1          5.843333                 5.8       0.8280661
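The same three statistics can be computed for every numeric column at once. The following is a minimal base R sketch, not part of the original analysis; `numeric_cols`, `describe`, and `stats_table` are illustrative names introduced here.

```r
# Load the built-in dataset
data("iris")

# Keep only the numeric measurement columns (drops the Species factor)
numeric_cols <- iris[, sapply(iris, is.numeric)]

# Illustrative helper: mean, median, and standard deviation of one vector
describe <- function(x) c(Mean = mean(x), Median = median(x), SD = sd(x))

# Apply the helper to each column; the result is a 3 x 4 matrix
stats_table <- sapply(numeric_cols, describe)
round(stats_table, 3)
```

The `Sepal.Length` column of `stats_table` should agree with the `summarise()` result shown above.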
iris %>%
  group_by(Species) %>%
  summarise(
    Mean_Sepal_Length = mean(Sepal.Length),
    Mean_Sepal_Width  = mean(Sepal.Width),
    Mean_Petal_Length = mean(Petal.Length),
    Mean_Petal_Width  = mean(Petal.Width)
  )
## # A tibble: 3 × 5
##   Species  Mean_Sepal_Length Mean_Sepal_Width Mean_Petal_Length Mean_Petal_Width
##   <fct>                <dbl>            <dbl>             <dbl>            <dbl>
## 1 setosa                5.01             3.43              1.46            0.246
## 2 versico…              5.94             2.77              4.26            1.33
## 3 virgini…              6.59             2.97              5.55            2.03
cor(iris[, 1:4])
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
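The matrix above pools all three species, so a coefficient such as the weak negative one between Sepal.Length and Sepal.Width may not reflect what happens inside each species. The following base R sketch, which is not part of the original analysis, computes that one correlation separately per species; `overall` and `within_species` are illustrative names.

```r
# Load the built-in dataset
data("iris")

# Pooled correlation across all 150 flowers
overall <- cor(iris$Sepal.Length, iris$Sepal.Width)

# Same correlation, computed separately for each species
within_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)

round(overall, 3)
round(within_species, 3)
```

Within each species the two sepal measurements are positively correlated, even though the pooled coefficient is slightly negative. This is one reason to read pooled correlations alongside the grouped summaries.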
Visual exploration helps identify patterns and relationships among variables.
# Histogram of sepal length, with bars stacked by species
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(bins = 20, color = "black", alpha = 0.7) +
  labs(title = "Distribution of Sepal Length",
       x = "Sepal Length", y = "Frequency") +
  theme_minimal()
# Box plot comparing sepal length across the three species
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.8) +
  labs(title = "Sepal Length Comparison Across Species",
       x = "Species", y = "Sepal Length") +
  theme_minimal()
# Scatter plot of sepal length against petal length, colored by species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(title = "Relationship between Sepal Length and Petal Length",
       x = "Sepal Length", y = "Petal Length") +
  theme_minimal()
# Base R pair plot of the four numeric features; point fill encodes species
pairs(iris[1:4], main = "Pair Plot of Iris Numeric Features",
      pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
# Overlaid density curves of petal width, one per species
ggplot(iris, aes(x = Petal.Width, fill = Species)) +
  geom_density(alpha = 0.6) +
  labs(title = "Density Plot of Petal Width by Species",
       x = "Petal Width", y = "Density") +
  theme_minimal()
This document demonstrates how to perform descriptive and visual analysis of the Iris dataset in R.
The same methods can be applied to other datasets for quick exploratory data analysis (EDA) and visualization.
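As one way to reuse these steps on another dataset, the descriptive part can be wrapped in a small function. The sketch below uses base R only; `quick_eda` is a hypothetical helper introduced here for illustration, and `mtcars` is simply another built-in dataset chosen as an example.

```r
# Hypothetical helper (not part of the original analysis): bundles the basic
# descriptive steps so they can be run on any data frame.
quick_eda <- function(df) {
  # Keep numeric columns only; drop = FALSE preserves the data frame shape
  numeric_cols <- df[, sapply(df, is.numeric), drop = FALSE]
  list(
    dimensions   = c(rows = nrow(df), columns = ncol(df)),
    summary      = summary(df),
    # A correlation matrix needs at least two numeric columns
    correlations = if (ncol(numeric_cols) > 1) cor(numeric_cols) else NULL
  )
}

# Usage on another built-in dataset
data("mtcars")
result <- quick_eda(mtcars)
result$dimensions
round(result$correlations["mpg", "wt"], 3)
```

Returning a list keeps each piece addressable by name, so individual results can be printed or passed to plotting code as needed.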