The iris dataset is a built-in R dataset that contains measurements of 4 attributes (in centimeters) for 50 flowers from each of 3 species, 150 flowers in total.

In 1935, Edgar Anderson collected data to quantify the geographic variation of iris flowers. The data set consists of 50 samples from each of three species (Iris setosa, Iris virginica, and Iris versicolor).

# iris
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

attributes(iris)

## $names
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
## 
## $class
## [1] "data.frame"
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

dim(iris)

## [1] 150   5

# display column names
names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

The iris data set has 150 rows and 5 columns. Species is the only categorical variable; the other four columns are numeric.
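
Both statements can be checked directly. The following is a minimal sketch using only base R; the variable names (`n_rows`, `n_cols`, `col_classes`, `factor_cols`) are chosen here for illustration and are not part of the original analysis.

```r
# Dimensions: rows (flowers) and columns (variables)
n_rows <- nrow(iris)
n_cols <- ncol(iris)
print(c(rows = n_rows, columns = n_cols))

# Class of every column; a factor is R's type for categorical data
col_classes <- sapply(iris, class)
print(col_classes)

# Names of the categorical (factor) columns
factor_cols <- names(iris)[sapply(iris, is.factor)]
print(factor_cols)

# Levels of the species factor
print(levels(iris$Species))
```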

df <- iris[, 1:4]
boxplot(df)

pairs(df)

stars(df)

PL <- df$Petal.Length
barplot(PL)

hist(PL)

SP <- iris$Species
pie(table(SP))

boxplot(PL ~ SP)

summary(aov(PL ~ SP))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## SP            2  437.1  218.55    1180 <2e-16 ***
## Residuals   147   27.2    0.19                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

PW <- df$Petal.Width
plot(PL, PW, col = SP)
abline(lm(PW ~ PL))

summary(PL)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.600   4.350   3.758   5.100   6.900

This produces a lot of useful information about the distribution of petal length:

The minimum is 1.000, and the maximum is 6.900.

Average petal length is 3.758.

The mid-point, or median, is 4.350, as about half of the numbers are smaller than 4.350. Why is the median different from the mean? What happens if there is a typo and one number is entered as 340 cm instead of 3.40 cm?

The 3rd quartile, or 75th percentile, is 5.100, as 75% of the flowers have petals shorter than 5.100. If a student's GPA ranks 5th in a class of 25, that student is at the 80th percentile, since 20 of the 25 students rank lower. The minimum is the 0th percentile and the maximum is the 100th percentile.

The 1st quartile, or 25th percentile, is 1.600. Only 25% of the flowers have petals shorter than 1.600. These summary statistics are graphically represented as a boxplot. Boxplots are more useful when multiple sets of numbers are compared.
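
The percentiles listed above can be reproduced with `quantile()`, and the typo question can be tried out on a copy of the data. This is a sketch under one assumption: the typo is simulated by multiplying the first petal length (1.4) by 100, which is a hypothetical error and not something present in the data set. `PL_typo` is a name introduced here for illustration.

```r
PL <- iris$Petal.Length

# Percentiles quoted above: 0%, 25%, 50%, 75%, 100%
q <- quantile(PL, probs = c(0, 0.25, 0.5, 0.75, 1))
print(q)

# Hypothetical data-entry error: the first petal length loses its decimal point
PL_typo <- PL
PL_typo[1] <- PL[1] * 100

# Compare mean and median before and after the simulated typo
print(c(mean = mean(PL), median = median(PL)))
print(c(mean = mean(PL_typo), median = median(PL_typo)))
```

A single wrong value pulls the mean up by (140 - 1.4) / 150, or about 0.92 cm, while the median can only move to a neighboring value in the sorted data. This sensitivity to extreme values is one reason the mean and the median differ when a distribution is not symmetric.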

# View(iris) # show as a spreadsheet
SL <- df$Sepal.Length
# freq=FALSE plots densities, so the y-axis is labeled 'Density'
hist(SL, main="Histogram of Sepal Length", xlab="Sepal Length",
     ylab='Density', xlim=c(4,8), col="blue", freq=FALSE)

SW <- df$Sepal.Width

# freq=FALSE plots densities, so the y-axis is labeled 'Density'
hist(SW, main="Histogram of Sepal Width", xlab="Sepal Width",
     ylab='Density', xlim=c(2,5), col="darkorchid", freq=FALSE)

A box plot shows the five-number summary: the minimum, the 25th percentile, the median, the 75th percentile, and the maximum. It is thus useful for visualizing the spread of the data and drawing inferences from it.

irisVer <- subset(iris, Species == "versicolor")
irisSet <- subset(iris, Species == "setosa")
irisVir <- subset(iris, Species == "virginica")
par(mfrow=c(1,3),mar=c(6,3,2,1))
boxplot(irisVer[,1:4], main="Versicolor, Rainbow Palette",ylim = c(0,8),las=2, col=rainbow(4))
boxplot(irisSet[,1:4], main="Setosa, Heat color Palette",ylim = c(0,8),las=2, col=heat.colors(4))
boxplot(irisVir[,1:4], main="Virginica, Topo colors Palette",ylim = c(0,8),las=2, col=topo.colors(4))
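
The numbers behind a box plot can also be inspected without drawing it. The sketch below uses base R's `fivenum()` and `boxplot.stats()` on sepal width; the choice of that column and the names `five` and `bs` are illustrative assumptions. With the default setting (`coef = 1.5`), the whiskers stop at the most extreme data point within 1.5 times the box length of the box, so the whisker ends match the minimum and maximum only when there are no outliers.

```r
SW <- iris$Sepal.Width

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(SW)
print(five)

# What boxplot() draws: whisker ends, hinges and median are in $stats,
# and any points beyond the whiskers are in $out
bs <- boxplot.stats(SW)
print(bs$stats)
print(bs$out)
```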

# create boxplot of sepal length by species
boxplot(Sepal.Length~Species,
        data=iris,
        main='Sepal Length by Species',
        xlab='Species',
        ylab='Sepal Length',
        col='steelblue',
        border='black')

# create scatterplot of sepal width vs. sepal length
plot(iris$Sepal.Width, iris$Sepal.Length,
     col='steelblue',
     main='Scatterplot',
     xlab='Sepal Width',
     ylab='Sepal Length',
     pch=19)

Scatterplot Matrices

When we have more than two variables and want to see how each one relates to the others, we use a scatterplot matrix.

The pairs() function creates a matrix of scatterplots.

x <- iris[, 1:4]
pairs(x)
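
As a numeric companion to the scatterplot matrix, the pairwise correlations can be computed with `cor()`. This is a minimal sketch, not part of the original analysis; `cor_mat` is a name introduced here, and coloring the points by species through integer codes is one possible choice among several.

```r
x <- iris[, 1:4]

# Pearson correlation for every pair of numeric columns
cor_mat <- cor(x)
print(round(cor_mat, 2))

# Same scatterplot matrix, with points colored by species (integer codes 1 to 3)
pairs(x, col = as.integer(iris$Species))
```

Each cell of the correlation matrix corresponds to one panel of the scatterplot matrix, so panels showing a tight, rising band of points line up with coefficients close to 1.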

sessionInfo()

## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] bookdown_0.28    codetools_0.2-18 digest_0.6.29    R6_2.5.1        
##  [5] jsonlite_1.8.0   magrittr_2.0.3   evaluate_0.15    rmdformats_1.0.4
##  [9] highr_0.9        stringi_1.7.8    cachem_1.0.6     rlang_1.0.4     
## [13] cli_3.3.0        rstudioapi_0.13  jquerylib_0.1.4  bslib_0.4.0     
## [17] rmarkdown_2.14   tools_4.2.1      stringr_1.4.0    xfun_0.31       
## [21] yaml_2.3.5       fastmap_1.1.0    compiler_4.2.1   htmltools_0.5.3 
## [25] knitr_1.39       sass_0.4.2

The Iris Dataset EDA

Ing. Ernesto Ibanez

2022-08-27
