Analysis of the Iris Dataset

Section 1 - Data Manipulation of the Iris Dataset with dplyr

In this introductory section, I demonstrate how to use four dplyr verbs (‘select’, ‘filter’, ‘arrange’, and ‘mutate’) to manipulate the iris dataset. Before demonstrating the capabilities of dplyr, however, I briefly explore the basic structure of the dataset. To follow along with my analyses, you will need to install and then load the following packages: (1) “tidyverse”; (2) “knitr”; (3) “psych”; (4) “tibble”; and (5) “rmarkdown”. To install a package, pass its name in double quotes to install.packages(), for instance, install.packages("tidyverse").
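The installation step can be sketched as follows. This is a minimal example rather than a required setup script: the object names `pkgs`, `is_installed`, and `missing_pkgs` are illustrative, and the `interactive()` guard is an assumption made here so that nothing is downloaded when the document is knitted non-interactively.

```r
# Packages used in this document
pkgs <- c("tidyverse", "knitr", "psych", "tibble", "rmarkdown")

# Check which packages are already available (TRUE/FALSE per package)
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of packages that are not yet installed
missing_pkgs <- pkgs[!is_installed]

# Install the missing packages, but only when working interactively
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```

Checking first avoids reinstalling packages that are already present each time the script is run.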

Load required R packages:

library(tidyverse) # load the 'tidyverse' package from library
library(knitr) # load the 'knitr' package from library
library(psych) # load the 'psych' package from library
library(tibble) # load the 'tibble' package from the library
library(rmarkdown) # load the 'rmarkdown' package from the library

options(digits = 2) # print numbers to two significant digits

Basic structure of the iris dataset:

nrow(iris) # outputs the number of rows (observations) in the iris dataset

## [1] 150

ncol(iris) # outputs the number of columns (fields) in the iris dataset

## [1] 5

class(iris) # outputs the class of the iris dataset

## [1] "data.frame"

names(iris) # outputs the names of all the variables in the iris dataset

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

However, a single function call can extract the above information (and more) from the iris dataset:

str(iris) # outputs more in-depth information about the structure of the iris dataset

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

glimpse(iris) # an alternative method for extracting more in-depth iris information

## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.~
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.~
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.~
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.~
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~

Extracting specific columns and rows from the iris dataset:

as_tibble(head(iris)) # extract the first six rows of the iris dataset

## # A tibble: 6 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa

as_tibble(tail(iris)) # extract the last six rows of the iris dataset

## # A tibble: 6 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>    
## 1          6.7         3.3          5.7         2.5 virginica
## 2          6.7         3            5.2         2.3 virginica
## 3          6.3         2.5          5           1.9 virginica
## 4          6.5         3            5.2         2   virginica
## 5          6.2         3.4          5.4         2.3 virginica
## 6          5.9         3            5.1         1.8 virginica

The kable() function from the knitr package renders a data frame as a presentable table. Combined with base R indexing, it can display sequences of rows (or a single row) and specific columns (or a single column) from the iris dataset:

kable(iris[19:24, ], caption = "Knitr kable showing rows 19 to 24 of the iris dataset:")

Knitr kable showing rows 19 to 24 of the iris dataset:
	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
19	5.7	3.8	1.7	0.3	setosa
20	5.1	3.8	1.5	0.3	setosa
21	5.4	3.4	1.7	0.2	setosa
22	5.1	3.7	1.5	0.4	setosa
23	4.6	3.6	1.0	0.2	setosa
24	5.1	3.3	1.7	0.5	setosa

kable(head(iris[, 1:3]), caption = "Knitr kable showing rows 1 to 6 and columns 1 to 3 of the iris dataset:")

Knitr kable showing rows 1 to 6 and columns 1 to 3 of the iris dataset:
Sepal.Length	Sepal.Width	Petal.Length
5.1	3.5	1.4
4.9	3.0	1.4
4.7	3.2	1.3
4.6	3.1	1.5
5.0	3.6	1.4
5.4	3.9	1.7

kable(tail(iris[, 3:5]), caption = "Knitr kable showing rows 145 to 150 and columns 3 to 5 of the iris dataset:")

Knitr kable showing rows 145 to 150 and columns 3 to 5 of the iris dataset:
	Petal.Length	Petal.Width	Species
145	5.7	2.5	virginica
146	5.2	2.3	virginica
147	5.0	1.9	virginica
148	5.2	2.0	virginica
149	5.4	2.3	virginica
150	5.1	1.8	virginica
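The row and column subsets above use base R indexing. As a rough sketch, the same subsets can be built with dplyr before being passed to kable(). The object names `rows_19_24`, `first_cols`, and `last_cols` are illustrative, and note that slice() renumbers the rows from 1 rather than keeping the original row labels.

```r
library(dplyr)

# Rows 19 to 24, all columns (base R equivalent: iris[19:24, ])
rows_19_24 <- iris %>% slice(19:24)

# First six rows of columns 1 to 3 (base R equivalent: head(iris[, 1:3]))
first_cols <- iris %>% select(1:3) %>% head()

# Last six rows of columns 3 to 5 (base R equivalent: tail(iris[, 3:5]))
last_cols <- iris %>% select(Petal.Length:Species) %>% tail()

# Any of these can then be rendered as a table, e.g. knitr::kable(rows_19_24)
```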

Data Manipulation with dplyr

Filtering the iris data set:

# Filter for the species "setosa" with sepal length greater than 5.3:
as_tibble(iris %>% 
  filter(Species == "setosa", Sepal.Length > 5.3))

## # A tibble: 10 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.4         3.9          1.7         0.4 setosa 
##  2          5.4         3.7          1.5         0.2 setosa 
##  3          5.8         4            1.2         0.2 setosa 
##  4          5.7         4.4          1.5         0.4 setosa 
##  5          5.4         3.9          1.3         0.4 setosa 
##  6          5.7         3.8          1.7         0.3 setosa 
##  7          5.4         3.4          1.7         0.2 setosa 
##  8          5.4         3.4          1.5         0.4 setosa 
##  9          5.5         4.2          1.4         0.2 setosa 
## 10          5.5         3.5          1.3         0.2 setosa

# Filter for the species "versicolor" with sepal length greater than 6.3:
as_tibble(iris %>% 
  filter(Species == "versicolor", Sepal.Length > 6.3))

## # A tibble: 11 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
##  1          7           3.2          4.7         1.4 versicolor
##  2          6.4         3.2          4.5         1.5 versicolor
##  3          6.9         3.1          4.9         1.5 versicolor
##  4          6.5         2.8          4.6         1.5 versicolor
##  5          6.6         2.9          4.6         1.3 versicolor
##  6          6.7         3.1          4.4         1.4 versicolor
##  7          6.4         2.9          4.3         1.3 versicolor
##  8          6.6         3            4.4         1.4 versicolor
##  9          6.8         2.8          4.8         1.4 versicolor
## 10          6.7         3            5           1.7 versicolor
## 11          6.7         3.1          4.7         1.5 versicolor

# Filter for the species "virginica" with sepal length less than 6.0:
as_tibble(iris %>% 
  filter(Species == "virginica", Sepal.Length < 6.0))

## # A tibble: 7 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>    
## 1          5.8         2.7          5.1         1.9 virginica
## 2          4.9         2.5          4.5         1.7 virginica
## 3          5.7         2.5          5           2   virginica
## 4          5.8         2.8          5.1         2.4 virginica
## 5          5.6         2.8          4.9         2   virginica
## 6          5.8         2.7          5.1         1.9 virginica
## 7          5.9         3            5.1         1.8 virginica

# Filter for the species "setosa" with sepal width greater than 3.5 and sepal length greater than or equal to 5.3:
as_tibble(iris %>% 
  filter(Species == "setosa", Sepal.Width > 3.5, Sepal.Length >= 5.3))

## # A tibble: 8 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.4         3.9          1.7         0.4 setosa 
## 2          5.4         3.7          1.5         0.2 setosa 
## 3          5.8         4            1.2         0.2 setosa 
## 4          5.7         4.4          1.5         0.4 setosa 
## 5          5.4         3.9          1.3         0.4 setosa 
## 6          5.7         3.8          1.7         0.3 setosa 
## 7          5.5         4.2          1.4         0.2 setosa 
## 8          5.3         3.7          1.5         0.2 setosa
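In the filter() calls above, conditions separated by commas are combined with a logical AND, so a row is kept only if it satisfies every condition. A minimal sketch of the equivalent forms follows; the object names are illustrative.

```r
library(dplyr)

# Comma-separated conditions: every condition must hold
comma_form <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5.3)

# The same filter written with an explicit logical AND
ampersand_form <- iris %>%
  filter(Species == "setosa" & Sepal.Length > 5.3)

# Base R equivalent using subset()
base_form <- subset(iris, Species == "setosa" & Sepal.Length > 5.3)

# Both dplyr forms return the same rows
all.equal(comma_form, ampersand_form)
```

To keep rows that satisfy either condition instead, the conditions would be joined with `|` inside a single filter() argument.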

Selecting and filtering data with the “select” and “filter” verbs:

# Filter for the versicolor species with sepal width greater than or equal to 3.1, then select the columns Sepal.Length, Sepal.Width, and Petal.Width (filtering first means 'Species' is still available inside the pipeline):
as_tibble(iris %>% 
  filter(Species == "versicolor", Sepal.Width >= 3.1) %>% 
  select(Sepal.Length, Sepal.Width, Petal.Width))

## # A tibble: 8 x 3
##   Sepal.Length Sepal.Width Petal.Width
##          <dbl>       <dbl>       <dbl>
## 1          7           3.2         1.4
## 2          6.4         3.2         1.5
## 3          6.9         3.1         1.5
## 4          6.3         3.3         1.6
## 5          6.7         3.1         1.4
## 6          5.9         3.2         1.8
## 7          6           3.4         1.6
## 8          6.7         3.1         1.5
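A note on verb order when combining select() and filter(): once select() has dropped a column, later verbs in the same pipeline can no longer refer to it by name. The sketch below filters on Species before dropping it, and then shows the error raised when the order is reversed. The object names `kept` and `msg` are illustrative.

```r
library(dplyr)

# Filter on Species while the column is still present, then drop it
kept <- iris %>%
  filter(Species == "versicolor", Sepal.Width >= 3.1) %>%
  select(Sepal.Length, Sepal.Width, Petal.Width)

# Reversing the order fails, because Species no longer exists after select()
msg <- tryCatch(
  iris %>%
    select(Sepal.Length, Sepal.Width, Petal.Width) %>%
    filter(Species == "versicolor"),
  error = function(e) conditionMessage(e)
)

nrow(kept) # number of versicolor rows with sepal width >= 3.1
msg        # the error message from the reversed pipeline
```

Referring to the original data frame from inside filter() (for example, iris$Species) works only while the row order and row count are unchanged, so filtering before selecting is the more robust habit.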

Filtering and arranging the iris data set:

# Filter for the species "setosa" with sepal width greater than 3.5 and sepal length less than or equal to 5.0. Next, arrange sepal length in descending order:
as_tibble(iris %>% 
  filter(Species == "setosa", Sepal.Width > 3.5, Sepal.Length <= 5.0) %>% 
  arrange(desc(Sepal.Length)))

## # A tibble: 3 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5           3.6          1.4         0.2 setosa 
## 2          4.9         3.6          1.4         0.1 setosa 
## 3          4.6         3.6          1           0.2 setosa

Selecting, filtering, and arranging the iris data set:

# Filter for the versicolor species with sepal width greater than or equal to 3.1, select the columns Sepal.Length and Sepal.Width, then arrange the data by sepal length in ascending order:
as_tibble(iris %>% 
  filter(Species == "versicolor", Sepal.Width >= 3.1) %>% 
  select(Sepal.Length, Sepal.Width) %>%
  arrange(Sepal.Length))

## # A tibble: 8 x 2
##   Sepal.Length Sepal.Width
##          <dbl>       <dbl>
## 1          5.9         3.2
## 2          6           3.4
## 3          6.3         3.3
## 4          6.4         3.2
## 5          6.7         3.1
## 6          6.7         3.1
## 7          6.9         3.1
## 8          7           3.2

# Filter for the setosa species with sepal width greater than or equal to 3.9, select the column Sepal.Length, then arrange the data by sepal length in descending order:
as_tibble(iris %>% 
  filter(Species == "setosa", Sepal.Width >= 3.9) %>% 
  select(Sepal.Length) %>%
  arrange(desc(Sepal.Length)))

## # A tibble: 6 x 1
##   Sepal.Length
##          <dbl>
## 1          5.8
## 2          5.7
## 3          5.5
## 4          5.4
## 5          5.4
## 6          5.2

Mutating, filtering, and arranging the iris data set:

# Use mutate to overwrite the 'Petal.Length' field with the petal length/width ratio, filter the data for the virginica species and sepal length greater than 7.0, then arrange the data by that ratio in descending order:
as_tibble(iris %>%
  mutate(Petal.Length = Petal.Length / Petal.Width) %>% 
  filter(Species == "virginica", Sepal.Length > 7.0) %>% 
  arrange(desc(Petal.Length)))

## # A tibble: 12 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>    
##  1          7.2         3           3.62         1.6 virginica
##  2          7.3         2.9         3.5          1.8 virginica
##  3          7.7         2.8         3.35         2   virginica
##  4          7.2         3.2         3.33         1.8 virginica
##  5          7.4         2.8         3.21         1.9 virginica
##  6          7.9         3.8         3.2          2   virginica
##  7          7.6         3           3.14         2.1 virginica
##  8          7.7         3.8         3.05         2.2 virginica
##  9          7.7         2.6         3            2.3 virginica
## 10          7.1         3           2.81         2.1 virginica
## 11          7.7         3           2.65         2.3 virginica
## 12          7.2         3.6         2.44         2.5 virginica
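Because the mutate() call above writes the ratio into Petal.Length, the original petal lengths are no longer visible in the output. As an alternative sketch, the ratio can be stored in a new column; the name `Petal.Ratio` is hypothetical and chosen here purely for illustration.

```r
library(dplyr)
library(tibble)

ratio_tbl <- iris %>%
  # Store the ratio in a new column so Petal.Length keeps its original values
  mutate(Petal.Ratio = Petal.Length / Petal.Width) %>%
  filter(Species == "virginica", Sepal.Length > 7.0) %>%
  # Sort from the largest ratio to the smallest
  arrange(desc(Petal.Ratio)) %>%
  as_tibble()

ratio_tbl
```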

Mutating, selecting, filtering, and arranging the iris data set:

# Use mutate to overwrite the 'Petal.Length' field with the petal length/width ratio, filter the data for the versicolor species and sepal width equal to or greater than 3.1, select the variables Sepal.Length, Sepal.Width, and Petal.Length, then arrange the data by sepal length in ascending order:
as_tibble(iris %>% 
  mutate(Petal.Length = Petal.Length / Petal.Width) %>%
  filter(Species == "versicolor", Sepal.Width >= 3.1) %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length) %>% 
  arrange(Sepal.Length))

## # A tibble: 8 x 3
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.9         3.2         2.67
## 2          6           3.4         2.81
## 3          6.3         3.3         2.94
## 4          6.4         3.2         3   
## 5          6.7         3.1         3.14
## 6          6.7         3.1         3.13
## 7          6.9         3.1         3.27
## 8          7           3.2         3.36

Section 2 - Basic Summary Statistics

In this section, I calculate some basic summary statistics relating to the iris dataset.

Summary statistics for the variable Sepal.Length:

describe(iris$Sepal.Length)

##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 150  5.8 0.83    5.8     5.8   1 4.3 7.9   3.6 0.31    -0.61 0.07

Summary statistics for the variable Sepal.Width:

describe(iris$Sepal.Width)

##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 150  3.1 0.44      3       3 0.44   2 4.4   2.4 0.31     0.14 0.04

Summary statistics for the variable Petal.Length:

describe(iris$Petal.Length)

##    vars   n mean  sd median trimmed mad min max range  skew kurtosis   se
## X1    1 150  3.8 1.8    4.3     3.8 1.8   1 6.9   5.9 -0.27     -1.4 0.14

Summary statistics for the variable Petal.Width:

describe(iris$Petal.Width)

##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 150  1.2 0.76    1.3     1.2   1 0.1 2.5   2.4 -0.1     -1.4 0.06
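The summaries above pool all three species. Since the later sections compare species, a per-species breakdown may also be useful. The following is a minimal sketch using dplyr; the column names `n`, `mean_sepal_length`, and `sd_sepal_length` are illustrative.

```r
library(dplyr)

species_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    n = n(),                                # observations per species
    mean_sepal_length = mean(Sepal.Length), # group mean
    sd_sepal_length = sd(Sepal.Length),     # group standard deviation
    .groups = "drop"
  )

species_summary
```

The psych package also provides describeBy() for grouped descriptive statistics, which may be preferable if the full set of describe() columns is wanted for each group.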

Section 3 - Inferential Analysis of the Iris Dataset

In this section, I provide a variety of inferential analyses of specific variables from the iris data set, including one-way ANOVAs, posthoc tests (specifically, Tukey’s HSD test), tests of normality (specifically, the Shapiro-Wilk test), and tests of homogeneity of variance (specifically, Bartlett’s parametric test).

For the uninitiated, a significant one-way ANOVA indicates only that at least one group mean differs from the others; it does not indicate where the difference(s) lie. To determine which pairs of group means differ significantly, one can perform a Tukey posthoc test via the TukeyHSD() function.
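The ANOVA-then-Tukey workflow can be sketched as follows. This is an illustrative outline in which the model is fitted once and reused; the object names `fit` and `tukey` are not part of the original analysis.

```r
# Fit the one-way ANOVA once and store the model object
fit <- aov(Sepal.Length ~ Species, data = iris)

# Omnibus test: does at least one species mean differ?
summary(fit)

# Pairwise comparisons: which species means differ?
tukey <- TukeyHSD(fit)

# Adjusted p-values for each pair of species
tukey$Species[, "p adj"]
```

Storing the fitted model avoids refitting it for each follow-up test and makes it available for later diagnostics.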

Shapiro-Wilk test of normality (Sepal.Length):

shapiro.test(iris$Sepal.Length)

## 
##  Shapiro-Wilk normality test
## 
## data:  iris$Sepal.Length
## W = 1, p-value = 0.01

Bartlett’s (parametric) test of homogeneity of variance (Sepal.Length):

bartlett.test(iris$Sepal.Length, iris$Species)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  iris$Sepal.Length and iris$Species
## Bartlett's K-squared = 16, df = 2, p-value = 3e-04
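The Shapiro-Wilk test above is applied to Sepal.Length pooled across all three species. The normality assumption of a one-way ANOVA concerns the residuals within groups, so a complementary check, sketched here with illustrative object names, is to test the residuals of the fitted model. The same sketch also shows the formula interface to bartlett.test(), which gives the same test as the two-argument call.

```r
# Fit the one-way ANOVA model
fit <- aov(Sepal.Length ~ Species, data = iris)

# Shapiro-Wilk test on the model residuals rather than the pooled variable
resid_normality <- shapiro.test(residuals(fit))
resid_normality

# Bartlett's test written with the formula interface
variance_check <- bartlett.test(Sepal.Length ~ Species, data = iris)
variance_check
```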

One-way ANOVA for the factor “Species” and the variable Sepal.Length:

summary(aov(Sepal.Length ~ Species, data = iris))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2   63.2   31.61     119 <2e-16 ***
## Residuals   147   39.0    0.27                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tukey HSD to determine where differences (if any) lie in relation to Sepal.Length group means:

TukeyHSD(aov(Sepal.Length ~ Species, data = iris))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
## 
## $Species
##                      diff  lwr upr p adj
## versicolor-setosa    0.93 0.69 1.2     0
## virginica-setosa     1.58 1.34 1.8     0
## virginica-versicolor 0.65 0.41 0.9     0

Shapiro-Wilk test of normality (Petal.Length):

shapiro.test(iris$Petal.Length)

## 
##  Shapiro-Wilk normality test
## 
## data:  iris$Petal.Length
## W = 0.9, p-value = 7e-10

Bartlett’s (parametric) test of homogeneity of variance (Petal.Length):

bartlett.test(iris$Petal.Length, iris$Species)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  iris$Petal.Length and iris$Species
## Bartlett's K-squared = 55, df = 2, p-value = 9e-13

One-way ANOVA for the factor “Species” and the variable Petal.Length:

summary(aov(Petal.Length ~ Species, data = iris))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2    437   218.6    1180 <2e-16 ***
## Residuals   147     27     0.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tukey HSD to determine where differences (if any) lie in relation to Petal.Length group means:

TukeyHSD(aov(Petal.Length ~ Species, data = iris))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Petal.Length ~ Species, data = iris)
## 
## $Species
##                      diff lwr upr p adj
## versicolor-setosa     2.8 2.6 3.0     0
## virginica-setosa      4.1 3.9 4.3     0
## virginica-versicolor  1.3 1.1 1.5     0

Section 4 - Data Visualisations of the Iris Dataset

In this section, I provide a variety of data visualisations of specific variables from the iris data set.

Base R plots:

plot(iris$Sepal.Length, iris$Petal.Length)

plot(iris$Sepal.Length, iris$Petal.Width)

qqnorm(iris$Sepal.Length)
qqline(iris$Sepal.Length)

boxplot(iris$Sepal.Length ~ iris$Species)

boxplot(iris$Sepal.Width ~ iris$Species)

hist(iris$Sepal.Length, col = "dark blue", border = "yellow")

hist(iris$Sepal.Width, col = "dark red", border = "green")

ggplot2 histograms showing the frequency count of the variable Sepal.Length:

ggplot(data = iris, aes(x = Sepal.Length)) + 
  geom_histogram(bins = 10, colour = "black", fill = "white")

ggplot(data = iris, aes(x = Sepal.Length)) + 
  geom_histogram(bins = 10, colour = "black", fill = "dark grey") +
  coord_flip()

ggplot2 histograms showing the frequency count of the variable Petal.Length:

ggplot(data = iris, aes(x = Petal.Length)) + 
  geom_histogram(bins = 10)

ggplot(data = iris, aes(x = Petal.Length)) + 
  geom_histogram(bins = 10) +
  coord_flip()

ggplot2 histograms showing the frequency count of the variable Sepal.Length (and, in the final faceted plot, Sepal.Width) grouped by the factor “Species”:

ggplot(iris) +
  aes(Sepal.Length, colour = Species) +
  geom_histogram(bins = 10, fill = "white") +
  scale_color_hue()

ggplot(iris) +
  aes(Sepal.Length, colour = Species) +
  geom_histogram(bins = 10, fill = "grey") +
  scale_color_hue() +
  coord_flip()

ggplot(iris, aes(Sepal.Width, fill = Species)) +
  geom_histogram(binwidth = 0.30) +
  facet_wrap(~ Species)
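In the first two grouped histograms above, ggplot2 stacks the bars for the three species by default, so each bar's total height is the combined count for that bin. As a sketch of one alternative, the bars can be overlaid with partial transparency so that each species' distribution is read from the same baseline. The object name `overlaid` and the alpha value are illustrative choices.

```r
library(ggplot2)

overlaid <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  # position = "identity" draws each species from zero instead of stacking;
  # alpha makes the overlapping bars see-through
  geom_histogram(bins = 10, position = "identity", alpha = 0.4, colour = "black")

overlaid
```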

Density plot of the variable Sepal.Length:

ggplot(iris, aes(x = Sepal.Length)) +
  geom_density()

Density plot of the variable Petal.Length:

ggplot(iris, aes(x = Petal.Length)) +
  geom_density()

Density plot of the variable Sepal.Length grouped by the factor “Species”:

ggplot(iris, aes(x = Sepal.Length, colour = Species)) +
  geom_density()

Density plot of the variable Petal.Length grouped by the factor “Species”:

ggplot(iris, aes(x = Petal.Length, colour = Species)) +
  geom_density()

ggplot2 boxplots showing the variable Sepal.Length across three levels of the factor “Species”:

ggplot(iris) +
  aes(x = Species, y = Sepal.Length) +
  geom_boxplot()

ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(colour = "dark blue", fill = "orange", alpha = 0.3)

iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.2)

ggplot2 boxplots showing the variable Petal.Length across three levels of the factor “Species”:

ggplot(iris) +
  aes(x = Species, y = Petal.Length) +
  geom_boxplot()

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot(colour = "dark blue", fill = "orange", alpha = 0.3)

iris %>% 
  ggplot(aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot() + 
  geom_jitter(width = 0.1, alpha = 0.2)

ggplot2 scatterplot showing the variables Sepal.Length and Petal.Length grouped by the three-level factor “Species”:

ggplot(iris) +
  aes(x = Sepal.Length, y = Petal.Length, col = Species) +
  geom_point() +
  scale_color_hue()

Faceting of the factor “Species” in relation to the variables Sepal.Length and Sepal.Width:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point() +
  facet_wrap(~ Species)

ggplot(iris, aes(Sepal.Length, Sepal.Width, col = Species)) +
  geom_point() +
  facet_wrap(~ Species)

Conclusion - In section 1, I employed a number of verbs from the dplyr package to manipulate the iris dataset. In section 2, I reported summary statistics for the four numeric variables in the iris dataset. In section 3, I reported inferential statistics for Sepal.Length and Petal.Length across the three species. In section 4, I provided a variety of data visualisations of specific variables from the iris dataset.

About the Author - Gary Lavery is a research psychologist, cognitive scientist, and aspiring data analyst. He graduated with a research PhD from Queen’s University Belfast in June 2019.

Document Format - The current document was constructed in rmarkdown.

Author Contact Details - email: glavery05@qub.ac.uk; LinkedIn profile: https://www.linkedin.com/in/gary-lavery-73369961/; Academia.edu Profile: https://qub.academia.edu/GaryLavery


Authored by Gary Lavery
