Introduction

This report explores the classic Iris dataset: 150 flowers, 50 from each of three species, with four measurements per flower. It covers statistical summaries, histograms, a scatter plot matrix, and per-species box plots drawn with the ggplot2 package.

Data Loading and Summary

The iris data frame ships with R's built-in datasets package, so it needs no explicit loading step. We start by summarizing the key statistical properties of each column.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
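The summary above pools all three species together. As a minimal sketch of how the same data could be summarized per species, the base R function aggregate() can compute the mean of each measurement within each group. The name species_means is an assumption chosen for illustration and is not part of the original report.

```r
# Mean of each of the four measurements within each species (base R only).
# The formula `. ~ Species` means "every other column, grouped by Species".
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Swapping FUN = mean for median or sd gives the corresponding per-species statistic.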

Data Visualization

Histograms

Histograms show how the values of each numerical variable in the dataset are distributed.

# Draw the four histograms in a 2 x 2 grid
par(mfrow=c(2,2))
hist(iris$Sepal.Length, main="Histogram of Sepal Length", xlab="Sepal Length", col="skyblue")
hist(iris$Sepal.Width, main="Histogram of Sepal Width", xlab="Sepal Width", col="lightgreen")
hist(iris$Petal.Length, main="Histogram of Petal Length", xlab="Petal Length", col="lightcoral")
hist(iris$Petal.Width, main="Histogram of Petal Width", xlab="Petal Width", col="lightgoldenrod")
# Restore the default single-plot layout for later figures
par(mfrow=c(1,1))
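To read exact bin counts rather than estimate them from bar heights, hist() can be called with plot = FALSE, which returns the bins without drawing anything. The following is a sketch for one variable. The names petal_hist and bin_table are assumptions introduced for illustration.

```r
# Compute the histogram bins for petal length without drawing a plot.
petal_hist <- hist(iris$Petal.Length, plot = FALSE)

# Pair each bin's lower and upper edge with the number of flowers in it.
bin_table <- data.frame(
  lower = head(petal_hist$breaks, -1),
  upper = tail(petal_hist$breaks, -1),
  count = petal_hist$counts
)
print(bin_table)
```

A table like this makes it easy to check whether an apparent gap in a plotted histogram corresponds to bins that are actually empty.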

Scatter Plots

Scatter plots help visualize relationships between pairs of variables.

pairs(iris[,1:4], main="Scatter plot matrix")
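The relationships shown in the scatter plot matrix can also be expressed numerically. As a hedged sketch, cor() computes the Pearson correlation for every pair of the four measurements. The name measurement_cor is an assumption used only for this example.

```r
# Pearson correlation between every pair of the four numeric columns.
measurement_cor <- cor(iris[, 1:4])

# Round for display; the diagonal is 1 because each variable
# correlates perfectly with itself.
print(round(measurement_cor, 2))
```

Note that these correlations are computed over all species pooled together, so they can differ from the correlations within any single species.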

Differences Among Species

Using ggplot2, we can visualize the differences in measurements among the Iris species.

library(ggplot2)  # provides ggplot(), geom_boxplot(), theme_minimal(), and labs()

ggplot(iris, aes(x=Species, y=Sepal.Length, color=Species)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title="Sepal Length across Species", y="Sepal Length", x="Species")

ggplot(iris, aes(x=Species, y=Sepal.Width, color=Species)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title="Sepal Width across Species", y="Sepal Width", x="Species")

ggplot(iris, aes(x=Species, y=Petal.Length, color=Species)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title="Petal Length across Species", y="Petal Length", x="Species")

ggplot(iris, aes(x=Species, y=Petal.Width, color=Species)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title="Petal Width across Species", y="Petal Width", x="Species")
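The four box plots above differ only in the measurement plotted. One possible alternative, sketched here under the assumption that a single combined figure is acceptable, reshapes the data to long format with base R's stack() and lets facet_wrap() draw one panel per measurement. The names iris_long and species_boxplots are hypothetical, and the values and ind columns are the defaults that stack() produces.

```r
library(ggplot2)

# stack() places the four measurement columns end to end:
# 'values' holds the numbers, 'ind' holds the original column name.
iris_long <- stack(iris[, 1:4])

# Columns are stacked whole, one after another, so repeating Species
# four times keeps each value aligned with its flower's species.
iris_long$Species <- rep(iris$Species, times = 4)

# One box plot panel per measurement, each with its own y-axis range.
species_boxplots <- ggplot(iris_long, aes(x = Species, y = values, color = Species)) +
  geom_boxplot() +
  facet_wrap(~ ind, scales = "free_y") +
  theme_minimal() +
  labs(title = "Measurements across Species", y = "Measurement", x = "Species")

print(species_boxplots)
```

Separate plots, as in the original code, remain the better choice when each figure needs its own title or is discussed individually.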

Conclusion

The Iris dataset is a compact test bed for basic data analysis techniques. The statistical summary, histograms, and scatter plot matrix describe the distribution of each measurement and the relationships between pairs of measurements. The ggplot2 box plots compare each measurement across the three species and show where the species differ. These exploratory results are a starting point for further statistical modeling and classification tasks.