This report analyzes the well-known Iris dataset.
The analysis includes statistical summaries, histograms, a scatter plot matrix,
and box plots comparing the three species, drawn with the
ggplot2 package.
The iris data frame ships with R in the datasets package, so no separate loading step is needed. We start by summarizing its key statistical properties.
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
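The summary above pools all three species. As a small sketch of how the same kind of statistic can be broken down by group, the base R call below computes the mean of each measurement per species. The object name `species_means` is chosen here for illustration and is not part of the original report.

```r
# Mean of every numeric column, grouped by Species (base R only)
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Replacing `mean` with `median` or `sd` in the `FUN` argument gives the corresponding per-species statistic.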
Histograms allow us to understand the distribution of each numerical variable in the dataset.
# Arrange the four histograms in a 2 x 2 grid, saving the previous settings
op <- par(mfrow = c(2, 2))
hist(iris$Sepal.Length, main = "Histogram of Sepal Length", xlab = "Sepal Length", col = "skyblue")
hist(iris$Sepal.Width, main = "Histogram of Sepal Width", xlab = "Sepal Width", col = "lightgreen")
hist(iris$Petal.Length, main = "Histogram of Petal Length", xlab = "Petal Length", col = "lightcoral")
hist(iris$Petal.Width, main = "Histogram of Petal Width", xlab = "Petal Width", col = "lightgoldenrod")
par(op)  # restore the previous graphics settings
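To read exact bin counts rather than estimating them from the bars, `hist()` can be called with `plot = FALSE`, which returns the bin boundaries and counts without drawing anything. The sketch below does this for petal length; the names `h` and `bin_table` are illustrative and not from the original report.

```r
# Compute the histogram bins for petal length without drawing the plot
h <- hist(iris$Petal.Length, plot = FALSE)

# One row per bin: lower edge, upper edge, and number of flowers in the bin
bin_table <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = h$counts
)
print(bin_table)
```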
A scatter plot matrix shows the relationship between every pair of the four numerical variables.
pairs(iris[,1:4], main="Scatter plot matrix")
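The relationships visible in the scatter plot matrix can also be expressed as numbers. The sketch below computes the Pearson correlation matrix for the four measurements and then redraws the matrix with points colored by species. The color assignment in `species_cols` is an arbitrary choice made for this example, and `cor_matrix` is an illustrative name.

```r
# Pearson correlations between the four measurements, rounded for readability
cor_matrix <- round(cor(iris[, 1:4]), 2)
print(cor_matrix)

# Same scatter plot matrix, with points colored by species
# (the colors are an arbitrary choice for this sketch)
species_cols <- c(setosa = "skyblue", versicolor = "lightgreen", virginica = "lightcoral")
pairs(iris[, 1:4], main = "Scatter plot matrix by species",
      col = species_cols[as.character(iris$Species)], pch = 19)
```

Coloring by species helps distinguish a trend that holds within each species from one that only appears because the species form separate clusters.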
Using ggplot2, we can visualize the differences in
measurements among the Iris species.
library(ggplot2)  # required for ggplot(), geom_boxplot(), theme_minimal(), labs()
# One box plot per measurement, grouped and colored by species
ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Sepal Length across Species", y = "Sepal Length", x = "Species")
ggplot(iris, aes(x = Species, y = Sepal.Width, color = Species)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Sepal Width across Species", y = "Sepal Width", x = "Species")
ggplot(iris, aes(x = Species, y = Petal.Length, color = Species)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Petal Length across Species", y = "Petal Length", x = "Species")
ggplot(iris, aes(x = Species, y = Petal.Width, color = Species)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Petal Width across Species", y = "Petal Width", x = "Species")
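The four box plots above repeat the same code with a different column each time. As an alternative sketch, the data can be reshaped to long format and drawn as a single faceted figure. The names `iris_long`, `Value`, and `Measurement` are introduced here for illustration, and the plotting step assumes ggplot2 is installed.

```r
# Reshape to long format with base R: one row per (flower, measurement) pair.
# stack() appends the four columns one after another, so Species repeats 4 times.
iris_long <- stack(iris[, 1:4])
names(iris_long) <- c("Value", "Measurement")
iris_long$Species <- rep(iris$Species, times = 4)

# Draw all four box plots in one faceted figure (assumes ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(iris_long, aes(x = Species, y = Value, color = Species)) +
    geom_boxplot() +
    facet_wrap(~ Measurement, scales = "free_y") +
    theme_minimal() +
    labs(title = "Measurements across Species", y = "Measurement (cm)", x = "Species")
  print(p)
}
```

Setting `scales = "free_y"` gives each panel its own vertical axis, so the narrow range of petal width is not compressed by the wider range of sepal length.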
The Iris dataset is a convenient example for practicing basic
data analysis techniques. The statistical summaries, histograms, and
scatter plot matrix describe the distribution of each variable and the
relationships between variables. The ggplot2 box plots show how the
four measurements differ among the three Iris species.
This analysis lays the groundwork for further statistical modeling and
classification tasks.