In this analysis, we will explore the built-in Iris dataset, which contains measurements of sepal length, sepal width, petal length, and petal width for three species of Iris flowers: setosa, versicolor, and virginica.
The primary goal of this analysis is to understand the basic structure of the data, look for patterns, visualize relationships between variables, and summarize the dataset.
The iris
dataset is available by default in R. Here’s a
preview of the first few rows:
# View the first few rows of the dataset
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Summary statistics for the iris dataset
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# Check the structure of the dataset
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Pair plot for numerical features
library(ggplot2)
pairs(iris[, 1:4], main = "Pair Plot of Iris Features", pch = 21, bg = c("red", "green3", "blue")[iris$Species])
# Boxplot of Sepal Length by Species
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot of Sepal Length by Species", y = "Sepal Length", x = "Species")
# Scatter plot for Petal Length vs Sepal Length
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point(size = 3) +
theme_minimal() +
labs(title = "Petal Length vs Sepal Length", x = "Sepal Length", y = "Petal Length")
# Correlation matrix of numerical features
cor_matrix <- cor(iris[, 1:4])
cor_matrix
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
# Visualizing the correlation matrix
library(corrplot)
## corrplot 0.95 loaded
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.8, tl.col = "black")