This is a basic exploratory data analysis (EDA) of the classic Iris dataset, and my first mini project in R.
I downloaded the Iris Flower dataset from Kaggle.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read.csv("IRIS.csv")
Let’s start analysing the data.
str(data)
## 'data.frame': 150 obs. of 5 variables:
## $ sepal_length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ sepal_width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ petal_length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ petal_width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ species : chr "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
subset(data, species == 'Iris-setosa')[1:8,]
## sepal_length sepal_width petal_length petal_width species
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
## 7 4.6 3.4 1.4 0.3 Iris-setosa
## 8 5.0 3.4 1.5 0.2 Iris-setosa
summary(data)
## sepal_length sepal_width petal_length petal_width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## species
## Length:150
## Class :character
## Mode :character
##
##
##
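Because species was read in as character, summary() reports only its length and class. Here is a minimal sketch of converting it to a factor so that summary() shows a count per species. IRIS.csv is not bundled here, so R's built-in iris data stands in for it, renamed to match the CSV's columns. That stand-in is an assumption, and its values may differ slightly from the Kaggle file.

```r
# Stand-in for IRIS.csv (assumption): built-in iris, renamed to match the CSV.
data <- setNames(iris, c("sepal_length", "sepal_width", "petal_length",
                         "petal_width", "species"))
data$species <- paste0("Iris-", data$species)  # character, as read.csv() gave

# Convert the label column to a factor so summaries report counts per level.
data$species <- factor(data$species)
species_counts <- table(data$species)

summary(data$species)
print(species_counts)
```

With the real file, the same factor() call can be applied right after read.csv().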
Now I’m checking for any missing values in the data.
missing_values <- sum(is.na(data))
print(paste("Number of Missing Values:", missing_values))
## [1] "Number of Missing Values: 0"
There are no missing values in the dataset.
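The single total above does not say which column a missing value would come from, and is.na() does not flag empty strings. Here is a small sketch of a per-column check. It uses R's built-in iris data as a stand-in for IRIS.csv (an assumption, renamed to match the CSV's columns).

```r
# Stand-in for IRIS.csv (assumption): built-in iris, renamed to match the CSV.
data <- setNames(iris, c("sepal_length", "sepal_width", "petal_length",
                         "petal_width", "species"))
data$species <- paste0("Iris-", data$species)

# Count NA values in each column separately.
na_per_column <- colSums(is.na(data))
print(na_per_column)

# Count rows that have no NA in any column.
complete_rows <- sum(complete.cases(data))

# Empty or whitespace-only labels are not NA, so check them explicitly.
blank_species <- sum(trimws(data$species) == "")
```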
Univariate Analysis
Histogram for Sepal Length
ggplot(data, aes(x = sepal_length)) +
  geom_histogram(binwidth = 0.2, fill = "blue", color = "black") +
  labs(title = "Distribution of Sepal Length")
This histogram shows the distribution of sepal lengths in the dataset. Each bar covers a range of sepal lengths 0.2 wide (set by binwidth = 0.2), and its height is the number of observations falling within that range.
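To make the idea of bars as counts concrete, here is a rough sketch that tabulates the counts per 0.2-wide bin by hand. The break points are my own choice, so the counts need not match the bars drawn by geom_histogram(), which aligns its bins differently. The data is R's built-in iris as a stand-in for IRIS.csv (an assumption).

```r
# Stand-in for IRIS.csv (assumption): built-in iris, renamed to match the CSV.
data <- setNames(iris, c("sepal_length", "sepal_width", "petal_length",
                         "petal_width", "species"))

# Break points every 0.2 from 4.2 to 8.0 (built from integers to avoid
# floating-point drift); this range covers every sepal length in the data.
breaks <- seq(42, 80, by = 2) / 10

# Assign each observation to a bin, then count observations per bin.
bins <- cut(data$sepal_length, breaks = breaks, include.lowest = TRUE)
bin_counts <- table(bins)
print(bin_counts)
```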
Bivariate Analysis
Scatter plot for Sepal Length vs. Sepal Width
ggplot(data, aes(x = sepal_length, y = sepal_width, color = species)) +
  geom_point() +
  labs(title = "Scatter Plot of Sepal Length vs. Sepal Width")
This scatter plot shows the relationship between sepal length and sepal width. Each point is one observation in the dataset, and its color indicates the species.
Multivariate Analysis
Boxplot of Sepal Length by Species
ggplot(data, aes(x = species, y = sepal_length, fill = species)) +
  geom_boxplot() +
  labs(title = "Boxplot of Sepal Length by Species")
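The plot above covers only sepal length. To compare all four numeric variables at once, one option is to reshape the data to long format and facet by variable. This is a sketch rather than the only way to do it. It uses R's built-in iris data as a stand-in for IRIS.csv (an assumption), and the names long, measurement and value are my own.

```r
library(ggplot2)
library(tidyr)

# Stand-in for IRIS.csv (assumption): built-in iris, renamed to match the CSV.
data <- setNames(iris, c("sepal_length", "sepal_width", "petal_length",
                         "petal_width", "species"))
data$species <- paste0("Iris-", data$species)

# One row per (observation, measurement) pair instead of one column each.
long <- pivot_longer(data, cols = -species,
                     names_to = "measurement", values_to = "value")

# One panel per measurement, each with its own y scale.
p <- ggplot(long, aes(x = species, y = value, fill = species)) +
  geom_boxplot() +
  facet_wrap(~measurement, scales = "free_y") +
  labs(title = "Boxplots of All Measurements by Species")
print(p)
```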
Boxplot of Sepal Width by Species, Faceted by Petal Width
ggplot(data, aes(x = species, y = sepal_width, fill = species)) +
  geom_boxplot() +
  facet_wrap(~petal_width, scales = "free_y") +
  labs(title = "Boxplot of Sepal Width by Species and Petal Width")
Here is a faceted boxplot that compares the distribution of sepal widths among species, with one panel per petal width. Because “petal_width” is a continuous variable, facet_wrap() draws a separate panel for every distinct value, which gives many small panels, most of them holding only a handful of observations. Within a panel, each box summarizes the central tendency and spread of the “sepal_width” variable for one species.
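Faceting on the raw numeric column gives one panel per distinct petal width. One way to get fewer, fuller panels is to group petal width into a few bins first. In this sketch the cut points (0.8 and 1.6) and the column name petal_width_bin are arbitrary choices of mine, and the data is R's built-in iris as a stand-in for IRIS.csv (an assumption).

```r
library(ggplot2)

# Stand-in for IRIS.csv (assumption): built-in iris, renamed to match the CSV.
data <- setNames(iris, c("sepal_length", "sepal_width", "petal_length",
                         "petal_width", "species"))
data$species <- paste0("Iris-", data$species)

# Group petal width into three bins; the cut points are illustrative only.
data$petal_width_bin <- cut(data$petal_width, breaks = c(0, 0.8, 1.6, 2.5))

# One panel per bin instead of one panel per distinct petal width.
p <- ggplot(data, aes(x = species, y = sepal_width, fill = species)) +
  geom_boxplot() +
  facet_wrap(~petal_width_bin) +
  labs(title = "Boxplot of Sepal Width by Species and Petal Width Bin")
print(p)
```

Some species do not occur in every bin, so a panel can show fewer than three boxes.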
Scatterplot Matrix
pairs(data[, 1:4], col = as.integer(data$petal_length), pch = 16, main = "Scatterplot Matrix with Color Gradient")
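This is a scatterplot matrix where each point represents an observation in the dataset. The color of the points is determined by the “petal_length” variable: as.integer() drops the decimal part, and the resulting whole number (1 to 6) picks a color from R's default palette. The points therefore fall into a few discrete color groups rather than a smooth gradient.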
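For a smoother gradient, one approach is to build a color ramp and map each petal length to a position along it. This is a sketch under my own assumptions: the two end colors, the 100 steps, and the names ramp and point_cols are illustrative, and the data is R's built-in iris as a stand-in for IRIS.csv.

```r
# Stand-in for IRIS.csv (assumption): built-in iris, renamed to match the CSV.
data <- setNames(iris, c("sepal_length", "sepal_width", "petal_length",
                         "petal_width", "species"))

# 100 colors interpolated between two end colors.
ramp <- colorRampPalette(c("lightblue", "darkred"))(100)

# Map each petal length to one of 100 equal-width intervals (1 = shortest).
idx <- cut(data$petal_length, breaks = 100, labels = FALSE)
point_cols <- ramp[idx]

pairs(data[, 1:4], col = point_cols, pch = 16,
      main = "Scatterplot Matrix Colored by Petal Length")
```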
And finally, a correlation matrix.
cor_matrix <- cor(data[, 1:4])
print(cor_matrix)
## sepal_length sepal_width petal_length petal_width
## sepal_length 1.0000000 -0.1093692 0.8717542 0.8179536
## sepal_width -0.1093692 1.0000000 -0.4205161 -0.3565441
## petal_length 0.8717542 -0.4205161 1.0000000 0.9627571
## petal_width 0.8179536 -0.3565441 0.9627571 1.0000000
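The correlation of about -0.11 between sepal length and sepal width pools all three species together. Here is a small sketch of computing the same correlation within each species, using R's built-in iris data as a stand-in for IRIS.csv (an assumption, so the exact values may differ slightly from the matrix above).

```r
# Stand-in for IRIS.csv (assumption): built-in iris, renamed to match the CSV.
data <- setNames(iris, c("sepal_length", "sepal_width", "petal_length",
                         "petal_width", "species"))
data$species <- paste0("Iris-", data$species)

# Correlation over all 150 rows together.
overall <- cor(data$sepal_length, data$sepal_width)

# The same correlation computed separately for each species.
by_species <- sapply(split(data, data$species),
                     function(d) cor(d$sepal_length, d$sepal_width))

print(overall)
print(by_species)
```

In the built-in data, the correlation within each species is positive even though the pooled one is negative, which suggests the pooled value mostly reflects differences between species.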
So that was it!
Will be back with another project soon.