IRIS Exploratory Data Analysis

This is a basic exploratory data analysis of the one and only Iris dataset, my first mini project in R.

I downloaded the Iris Flower dataset from Kaggle.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data <- read.csv('IRIS.csv')

Let’s start analysing the data.

str(data)

## 'data.frame':    150 obs. of  5 variables:
##  $ sepal_length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ sepal_width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ petal_length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ petal_width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ species     : chr  "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...

subset(data, species == 'Iris-setosa')[1:8,]

##   sepal_length sepal_width petal_length petal_width     species
## 1          5.1         3.5          1.4         0.2 Iris-setosa
## 2          4.9         3.0          1.4         0.2 Iris-setosa
## 3          4.7         3.2          1.3         0.2 Iris-setosa
## 4          4.6         3.1          1.5         0.2 Iris-setosa
## 5          5.0         3.6          1.4         0.2 Iris-setosa
## 6          5.4         3.9          1.7         0.4 Iris-setosa
## 7          4.6         3.4          1.4         0.3 Iris-setosa
## 8          5.0         3.4          1.5         0.2 Iris-setosa

summary(data)

##   sepal_length    sepal_width     petal_length    petal_width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##    species         
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Now I’m checking for any missing values in the data.

missing_values <- sum(is.na(data))
print(paste("Number of Missing Values:", missing_values))

## [1] "Number of Missing Values: 0"

There are no missing values in the dataset.
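A single total does not show which column a missing value would sit in if there were one. A per-column count is a small extension. This is only a sketch: so that it runs without the CSV, it assumes R's built-in `iris` data as a stand-in for the Kaggle file, with the columns renamed to match (the names `missing_by_column` and `complete_rows` are mine).

```r
# Stand-in for IRIS.csv (assumption): R's built-in iris, columns renamed to match
data <- setNames(iris, c("sepal_length", "sepal_width",
                         "petal_length", "petal_width", "species"))
data$species <- paste0("Iris-", data$species)

# Count missing values per column instead of one grand total
missing_by_column <- colSums(is.na(data))
print(missing_by_column)

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(data))
print(paste("Complete rows:", complete_rows, "of", nrow(data)))
```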

Univariate Analysis

Histogram for Sepal Length

ggplot(data, aes(x = sepal_length)) + 
  geom_histogram(binwidth = 0.2, fill = "blue", color = "black") +
  labs(title = "Distribution of Sepal Length")

This histogram visualizes the distribution of sepal lengths in the dataset. Each bar represents a range of sepal lengths 0.2 wide (the binwidth), and the height of the bar is the count of observations falling within that range.
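To see the numbers behind the bars, the counts can be tabulated directly. The sketch below again assumes the built-in `iris` data as a stand-in for the CSV, and the bin edges are my own choice; ggplot2 may place its bin edges differently, so the counts need not match the plot bar for bar.

```r
# Stand-in for IRIS.csv (assumption): R's built-in iris, columns renamed to match
data <- setNames(iris, c("sepal_length", "sepal_width",
                         "petal_length", "petal_width", "species"))

# Bin edges every 0.2 from 4.2 to 8.0 (built from integers to avoid rounding drift)
breaks <- seq(42, 80, by = 2) / 10

# Assign each sepal length to a bin and count the observations per bin
bins <- cut(data$sepal_length, breaks = breaks)
bin_counts <- table(bins)
print(bin_counts)
```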

Bivariate Analysis

Scatter plot for Sepal Length vs. Sepal Width

ggplot(data, aes(x = sepal_length, y = sepal_width, color = species)) + 
  geom_point() +
  labs(title = "Scatter Plot of Sepal Length vs. Sepal Width")

This is a scatter plot that visualizes the relationship between sepal length and sepal width. Each point in the plot represents an observation in the dataset. The color of the points distinguishes different species.
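Since the colors separate the species, it can help to put a number on the relationship within each group as well as overall. A minimal sketch, assuming the built-in `iris` data as a stand-in for the CSV (so the values may differ slightly from the Kaggle file); `cor_by_species` and `cor_overall` are names I made up for illustration.

```r
# Stand-in for IRIS.csv (assumption): R's built-in iris, columns renamed to match
data <- setNames(iris, c("sepal_length", "sepal_width",
                         "petal_length", "petal_width", "species"))
data$species <- paste0("Iris-", data$species)

# Pearson correlation of sepal length and sepal width within each species
by_species <- split(data, data$species)
cor_by_species <- sapply(by_species, function(d) {
  cor(d$sepal_length, d$sepal_width)
})
print(round(cor_by_species, 2))

# The same correlation with species ignored, for comparison
cor_overall <- cor(data$sepal_length, data$sepal_width)
print(round(cor_overall, 2))
```

Comparing the two outputs shows whether the overall figure hides a different pattern inside each species.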

Multivariate Analysis

Boxplot of Sepal Length by Species

ggplot(data, aes(x = species, y = sepal_length, fill = species)) +
  geom_boxplot() +
  labs(title = "Boxplot of Sepal Length by Species")
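
The same comparison can be extended to all four measurements. This is a rough sketch in base R rather than ggplot2, assuming the built-in `iris` data as a stand-in for the CSV; `numeric_cols` and `five_num` are illustrative names.

```r
# Stand-in for IRIS.csv (assumption): R's built-in iris, columns renamed to match
data <- setNames(iris, c("sepal_length", "sepal_width",
                         "petal_length", "petal_width", "species"))
data$species <- paste0("Iris-", data$species)

numeric_cols <- c("sepal_length", "sepal_width", "petal_length", "petal_width")

# Tukey five-number summary (min, lower hinge, median, upper hinge, max)
# per species, for each measurement
five_num <- lapply(data[numeric_cols], function(x) {
  sapply(split(x, data$species), fivenum)
})
print(five_num$sepal_length)

# One base-graphics panel per measurement, boxes grouped by species
op <- par(mfrow = c(2, 2))
for (v in numeric_cols) {
  boxplot(data[[v]] ~ data$species, main = v, xlab = "species", ylab = v)
}
par(op)
```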

Another Boxplot

ggplot(data, aes(x = species, y = sepal_width, fill = species)) +
  geom_boxplot() +
  facet_wrap(~petal_width, scales = "free_y") +
  labs(title = "Boxplot of Sepal Width by Species and Petal Width")

Here is a faceted boxplot that compares the distribution of sepal widths among species across values of petal width. Because “petal_width” is a continuous variable, facet_wrap() draws one panel for every distinct value, so each panel holds only the few observations that share that exact petal width. Within a panel, each box summarizes the central tendency and spread of the “sepal_width” variable for one species.
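
One way to get fewer, fuller panels is to bin petal width before faceting. The sketch below is an assumption on my part, not part of the original analysis: it uses the built-in `iris` data as a stand-in for the CSV, and both the column name `petal_width_bin` and the choice of three bins are arbitrary.

```r
# Stand-in for IRIS.csv (assumption): R's built-in iris, columns renamed to match
data <- setNames(iris, c("sepal_length", "sepal_width",
                         "petal_length", "petal_width", "species"))
data$species <- paste0("Iris-", data$species)

# Cut petal width into three equal-width bins (the bin count is an arbitrary choice)
data$petal_width_bin <- cut(data$petal_width, breaks = 3)

# How many observations of each species land in each bin
print(table(data$species, data$petal_width_bin))

# Facet on the binned column: three panels instead of one per distinct value.
# The guard only skips the plot when ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data, aes(x = species, y = sepal_width, fill = species)) +
    geom_boxplot() +
    facet_wrap(~petal_width_bin) +
    labs(title = "Boxplot of Sepal Width by Species and Petal Width Bin")
  print(p)
}
```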

Scatterplot Matrix

pairs(data[, 1:4], col = as.integer(data$petal_length), pch = 16, main = "Scatterplot Matrix with Color Gradient")

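This is a scatterplot matrix where each point represents an observation in the dataset. The color of each point is set by the “petal_length” variable: as.integer() truncates each value to a whole number, which R then uses as an index into its default color palette. The result is a handful of discrete color bands rather than a smooth gradient.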
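
For a continuous gradient, the petal lengths can be mapped onto a finer palette. A minimal sketch, assuming the built-in `iris` data as a stand-in for the CSV; the blue-to-red palette, the 100 steps, and the names `gradient` and `point_colors` are my own choices.

```r
# Stand-in for IRIS.csv (assumption): R's built-in iris, columns renamed to match
data <- setNames(iris, c("sepal_length", "sepal_width",
                         "petal_length", "petal_width", "species"))

# Build a 100-step palette running from blue (short petals) to red (long petals)
palette_fn <- colorRampPalette(c("blue", "red"))
gradient <- palette_fn(100)

# Map each petal length onto a palette index between 1 and 100
color_index <- cut(data$petal_length, breaks = 100, labels = FALSE)
point_colors <- gradient[color_index]

pairs(data[, 1:4], col = point_colors, pch = 16,
      main = "Scatterplot Matrix Colored by Petal Length")
```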

And finally, a correlation matrix.

cor_matrix <- cor(data[, 1:4])
print(cor_matrix)

##              sepal_length sepal_width petal_length petal_width
## sepal_length    1.0000000  -0.1093692    0.8717542   0.8179536
## sepal_width    -0.1093692   1.0000000   -0.4205161  -0.3565441
## petal_length    0.8717542  -0.4205161    1.0000000   0.9627571
## petal_width     0.8179536  -0.3565441    0.9627571   1.0000000
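
The matrix lists every pair twice, so it can be easier to read as a ranked list. This is a sketch only, assuming the built-in `iris` data as a stand-in for the CSV (the coefficients may differ slightly from the ones printed above); `cor_pairs` is an illustrative name.

```r
# Stand-in for IRIS.csv (assumption): R's built-in iris, columns renamed to match
data <- setNames(iris, c("sepal_length", "sepal_width",
                         "petal_length", "petal_width", "species"))

cor_matrix <- cor(data[, 1:4])

# Keep each pair once by taking only the upper triangle of the matrix
upper <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[upper[, "row"]],
  var2 = colnames(cor_matrix)[upper[, "col"]],
  r = cor_matrix[upper]
)

# Sort by absolute correlation, strongest pair first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```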

So that was it!

Will be back with another project soon.


Aparna RVM

2023-10-12