IRIS Dataset

## About this data

The Iris dataset is one of the best-known and most widely used datasets in data science and machine learning. The British statistician and biologist Ronald A. Fisher introduced it in his 1936 paper “The Use of Multiple Measurements in Taxonomic Problems.” It is small and has no missing values, yet it has enough structure to make it a good starting point for exploratory data analysis (EDA), visualization, and classification algorithms.

The dataset contains 150 rows (observations), 50 for each of three species of Iris flowers: setosa, versicolor, and virginica. Each row records four measurements in centimeters (sepal length, sepal width, petal length, and petal width) along with the species label. The goal is to see whether these measurements can distinguish between the species.
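
The structure described above can be checked directly before any plotting. The following is a minimal sketch using only base R; the names `species_counts` and `species_means` are illustrative and not part of the original analysis.

```r
# Load the built-in dataset (it ships with base R's datasets package)
data(iris)

# Rows and columns: four measurements plus the Species label
dim(iris)

# Number of observations recorded for each species
species_counts <- table(iris$Species)
print(species_counts)

# Mean of each measurement within each species, as a first look at
# whether the measurements differ between species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

If the per-species means differ noticeably for a measurement, that measurement is a plausible candidate for telling the species apart.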

Analyze how sepal and petal measurements are correlated.

data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.2

# Scatter plot with species-wise coloring
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3) +
  labs(title = "Sepal Dimensions by Species", x = "Sepal Length", y = "Sepal Width") +
  theme_minimal()

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

# Pair plot with correlation values
ggpairs(iris, aes(color = Species, alpha = 0.5)) +
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(ggridges)

# Ridge plot for Sepal.Length by Species
ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) +
  geom_density_ridges(alpha = 0.7) +
  labs(title = "Distribution of Sepal Length by Species", x = "Sepal Length", y = "") +
  theme_ridges() +
  theme(legend.position = "none")

## Picking joint bandwidth of 0.181

# Compute the correlation matrix
cor_matrix <- cor(iris[, 1:4])
print(cor_matrix)

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
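
The matrix above pools all three species, which can hide patterns that hold within each group. As a rough sketch (base R only; `by_species` and `sepal_cor` are illustrative names, not part of the original analysis), the sepal length and sepal width correlation can be recomputed separately for each species:

```r
data(iris)

# Split the four numeric columns into one data frame per species
by_species <- split(iris[, 1:4], iris$Species)

# Correlation between sepal length and sepal width within each species
sepal_cor <- sapply(by_species, function(d) cor(d$Sepal.Length, d$Sepal.Width))
print(round(sepal_cor, 3))

# Pooled correlation across all species, for comparison
pooled_cor <- cor(iris$Sepal.Length, iris$Sepal.Width)
print(round(pooled_cor, 3))
```

For this dataset the within-species values come out positive even though the pooled value is slightly negative, so the pooled figure should be read with the species grouping in mind.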

library(reshape2)

## Warning: package 'reshape2' was built under R version 4.4.2

# Melt the correlation matrix for plotting
cor_data <- melt(cor_matrix)

# Heatmap of correlations
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limits = c(-1, 1), space = "Lab", name = "Correlation") +
  theme_minimal() +
  labs(title = "Correlation Heatmap of Iris Dataset", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
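
The heatmap relies on the correlation matrix being reshaped into long format, with one row per pair of variables. As a hedged alternative sketch that avoids the `reshape2` dependency, base R can produce the same shape; `cor_long` is an illustrative name, and the `value` column is renamed here only to match what the plotting code above expects.

```r
data(iris)
cor_matrix <- cor(iris[, 1:4])

# as.table() followed by as.data.frame() yields one row per matrix cell,
# with columns Var1, Var2, and Freq
cor_long <- as.data.frame(as.table(cor_matrix))

# Rename the value column so it matches the aes(fill = value) mapping
names(cor_long)[names(cor_long) == "Freq"] <- "value"

head(cor_long)
```

The resulting data frame should be usable in place of `cor_data` in the `ggplot()` call above.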

Thanga mari

2024-11-16