Discussion on Iris Dataset

In OpenStats Chapter 1, Exercises, Problem 9, there is a reference to Fisher’s iris data. Discuss the solutions to this problem, and then conduct a descriptive analysis of the data which are conveniently available in R. To access the data in R, simply type “iris.” Investigate any additional R libraries that might help support analysis of this data (e.g., psych package). Share your code and analysis in the discussion forum.

Fisher’s irises: Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor and virginica). There were 50 flowers from each species in the data set.

QUESTIONS

How many cases were included in the data?
How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.
How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).

SOLUTIONS

n=150
4 numerical variables are continuous (sepal.length, sepal.width, petal.length, petal.width).
1 categorical variable (species). Factors are setosa, versicolor and virginica.

# Clear the workspace
  rm(list = ls()) # Clear all files from your environment
# gc()            # Clear unused memory
  cat("\f")       # Clear the console

  graphics.off()  # Clear all graphs

df <- iris

# a
nrow(df)

## [1] 150

# b
str(df)  # internal structure function str() suggests four numerical variables and one categorical variable within dataset.

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# c
?unique # unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
unique(df$Species)

## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica

?levels # levels provides access to the levels attribute of a variable. The first form returns the value of the levels of its argument and the second sets the attribute.
levels(iris$Species)

## [1] "setosa"     "versicolor" "virginica"

It is never wise to print the entire data, especially if it is very large.

Print a subset of the data only. Some commands to help you -

head(iris)   # see the first few rows of data

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris,
     n=9)   # see the last few rows of data

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

iris[80:90,] # print 80th to 90th observation of the dataset in the console

##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 80          5.7         2.6          3.5         1.0 versicolor
## 81          5.5         2.4          3.8         1.1 versicolor
## 82          5.5         2.4          3.7         1.0 versicolor
## 83          5.8         2.7          3.9         1.2 versicolor
## 84          6.0         2.7          5.1         1.6 versicolor
## 85          5.4         3.0          4.5         1.5 versicolor
## 86          6.0         3.4          4.5         1.6 versicolor
## 87          6.7         3.1          4.7         1.5 versicolor
## 88          6.3         2.3          4.4         1.3 versicolor
## 89          5.6         3.0          4.1         1.3 versicolor
## 90          5.5         2.5          4.0         1.3 versicolor

Do basic summary statistics of the data -

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

# more informative
by(data = iris, 
   INDICES = iris$Species, 
   FUN = summary
   )

## iris$Species: setosa
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
##                 
##                 
##                 
## ------------------------------------------------------------ 
## iris$Species: versicolor
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
## ------------------------------------------------------------ 
## iris$Species: virginica
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    : 0  
##  versicolor: 0  
##  virginica :50  
##                 
##                 
##

1 Alternative ways to get some of the answer above.

In R you can get the same answer in multiple ways.

?levels
levels(iris$Species)

## [1] "setosa"     "versicolor" "virginica"

for (j in levels(iris$Species)) {
  print(j)
}

## [1] "setosa"
## [1] "versicolor"
## [1] "virginica"

2 Descriptive Statistics

Multiple ways to do this - lots of different commands. Experiment and find your own preferred commands. Some might prefer commands in base R package, while others may prefer commands in user written packages (will require both an initial install and calling the library for every session). I provide an example of two user written commands that are popular - describe() in psych package and stargazer() in stargazer package (can be used for regression tables too, apart from summary statistics tables, and used in professional publications).

2.1 Psych package, describe() command

Do not apply different summary statistics functions on a variable. We have more efficient ways like psych package to create a summary statistics table.

## measures of center 
mean(iris$Sepal.Length)

## [1] 5.843333

median(iris$Sepal.Length)

## [1] 5.8

## measures of spread
max(iris$Sepal.Length)

## [1] 7.9

min(iris$Sepal.Length)

## [1] 4.3

range(iris$Sepal.Length)

## [1] 4.3 7.9

quantile(iris$Sepal.Length)

##   0%  25%  50%  75% 100% 
##  4.3  5.1  5.8  6.4  7.9

## measures of dispersion
sd(iris$Sepal.Length)

## [1] 0.8280661

var(iris$Sepal.Length)

## [1] 0.6856935

## higher order moments 
# install.packages("moments")
library(moments) 

?skewness  # calculate skewness in r
skewness(iris$Sepal.Length)

## [1] 0.3117531

?kurtosis  # calculate kurtosis in r (compare with 3)
kurtosis(iris$Sepal.Length)

## [1] 2.426432

Read up on psych package here. Lets apply it on the entire variable, or dataset.

# install.packages("psych") - required one time only, commented out to save running time

library(psych) # load the library to use the commands in the package

describe(iris)

##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07

2.2 Stargazer Package, stargazer() command

To create more professional looking tables, see the stargazer package examples.

# install.packages("stargzer") - required one time only, commented out to save running time

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

## ALL VARIABLES IN THE ORIGINAL DATAFRAME
stargazer(df, 
          type = "text", # html, latex
          # summary.stat = 
          covariate.labels = c("Sepal Length", "Sepal Width", "Petal Length", "Petal Width"),
          digits = 2)

## 
## ========================================
## Statistic     N  Mean St. Dev. Min  Max 
## ----------------------------------------
## Sepal Length 150 5.84   0.83   4.30 7.90
## Sepal Width  150 3.06   0.44   2.00 4.40
## Petal Length 150 3.76   1.77   1.00 6.90
## Petal Width  150 1.20   0.76   0.10 2.50
## ----------------------------------------

Do not apply functions one by one on variables in the dataset iris. If you did want to do that, use a loop. Or sapply function.

?sapply
# apply a function on all elements of your object X
sapply(X   = iris, 
       FUN = is.numeric
       )

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##         TRUE         TRUE         TRUE         TRUE        FALSE

?is.numeric
is.numeric(iris)

## [1] FALSE

3 Including Plots

You can also embed plots, for example (these need to be properly embellished with titles, axis names, et cetra):

3.0.1 Color names in R

Color Palette in R

3.1 Scatterplot

?plot

## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   graphics              /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##   base                  /Library/Frameworks/R.framework/Resources/library
## 
## 
## Using the first match ...

# bare bones command - ensure you have an output 
plot(iris$Sepal.Length)

# basic command - ensure you labelled the axis and have an informative title 
plot(iris$Sepal.Length, 
     ylab  = "Sepal Length",    # y axis label
     xlab  = "Species ID",      # x axis label
     main  = "Scatter Plot"     # main title of the chart
     )

Lots of options to play with in the base plot command - read the help file for The Default Scatterplot Function. You can find snippets on online blogs too.

# fancier commands - can play with a million settings
plot(iris$Sepal.Length, 
     ylab  = "Sepal Length",    # y axis label
     xlab  = "Species ID",      # x axis label
     main  = "Scatter Plot",    # main title of the chart
     pch   =  as.numeric(iris$Species),
     col   =  as.numeric(iris$Species)
     )

# plot(iris$Sepal.Length, iris$Sepal.Width)

3.2 Histogram

?hist
hist(iris$Sepal.Length,     # required option 
     main = "Histogram",    # optional:  for title of graph
     xlab = "Sepal Length", # optional:  for x axis label. axis limits can be controlled too
     col  = "gold1"         # optional:  for coloring the bars
     )

3.3 Boxplot

?boxplot
# bare bones command
boxplot (iris$Sepal.Length, main = "Boxplot", col = "cyan")

plot(iris[1:4], col=iris$Species)

describeBy(x     = iris, 
           group = iris$Species
           )

## 
##  Descriptive statistics by group 
## group: setosa
##              vars  n mean   sd median trimmed  mad min max range skew kurtosis
## Sepal.Length    1 50 5.01 0.35    5.0    5.00 0.30 4.3 5.8   1.5 0.11    -0.45
## Sepal.Width     2 50 3.43 0.38    3.4    3.42 0.37 2.3 4.4   2.1 0.04     0.60
## Petal.Length    3 50 1.46 0.17    1.5    1.46 0.15 1.0 1.9   0.9 0.10     0.65
## Petal.Width     4 50 0.25 0.11    0.2    0.24 0.00 0.1 0.6   0.5 1.18     1.26
## Species*        5 50 1.00 0.00    1.0    1.00 0.00 1.0 1.0   0.0  NaN      NaN
##                se
## Sepal.Length 0.05
## Sepal.Width  0.05
## Petal.Length 0.02
## Petal.Width  0.01
## Species*     0.00
## ------------------------------------------------------------ 
## group: versicolor
##              vars  n mean   sd median trimmed  mad min max range  skew kurtosis
## Sepal.Length    1 50 5.94 0.52   5.90    5.94 0.52 4.9 7.0   2.1  0.10    -0.69
## Sepal.Width     2 50 2.77 0.31   2.80    2.78 0.30 2.0 3.4   1.4 -0.34    -0.55
## Petal.Length    3 50 4.26 0.47   4.35    4.29 0.52 3.0 5.1   2.1 -0.57    -0.19
## Petal.Width     4 50 1.33 0.20   1.30    1.32 0.22 1.0 1.8   0.8 -0.03    -0.59
## Species*        5 50 2.00 0.00   2.00    2.00 0.00 2.0 2.0   0.0   NaN      NaN
##                se
## Sepal.Length 0.07
## Sepal.Width  0.04
## Petal.Length 0.07
## Petal.Width  0.03
## Species*     0.00
## ------------------------------------------------------------ 
## group: virginica
##              vars  n mean   sd median trimmed  mad min max range  skew kurtosis
## Sepal.Length    1 50 6.59 0.64   6.50    6.57 0.59 4.9 7.9   3.0  0.11    -0.20
## Sepal.Width     2 50 2.97 0.32   3.00    2.96 0.30 2.2 3.8   1.6  0.34     0.38
## Petal.Length    3 50 5.55 0.55   5.55    5.51 0.67 4.5 6.9   2.4  0.52    -0.37
## Petal.Width     4 50 2.03 0.27   2.00    2.03 0.30 1.4 2.5   1.1 -0.12    -0.75
## Species*        5 50 3.00 0.00   3.00    3.00 0.00 3.0 3.0   0.0   NaN      NaN
##                se
## Sepal.Length 0.09
## Sepal.Width  0.05
## Petal.Length 0.08
## Petal.Width  0.04
## Species*     0.00

# alternatively, if you do not want to use describe by command, create three new data frames and summarize them
irisSetosa     <- subset(iris, Species == "setosa"   )
irisVersicolor <- subset(iris, Species == "versicolor")
irisVirginica  <- subset(iris, Species == "virginica" )

list (summary(irisSetosa    [,1:4] ) )

## [[1]]
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600

list (summary(irisVersicolor[,1:4] ) )

## [[1]]
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800

list (summary(irisVirginica [,1:4] ) )

## [[1]]
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

par(mfrow = c(1,3) ,mar = c(6,3,2,1) )
# las=2 ensures the x axis labels are vertical and not horizontal ( las=1 )
boxplot(irisSetosa[,1:4]    , main = "Setosa",    ylim = c(0,8), las = 2 , col = "blue")
boxplot(irisVersicolor[,1:4], main = "Versicolor",ylim = c(0,8), las = 2,  col = "red")
boxplot(irisVirginica[,1:4] , main = "Virginica", ylim = c(0,8), las = 2,  col = "green")

4 K-means clustering

Omit code chunk output below by setting eval=FALSE and include=FALSE.

5 R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button in RStudio after setting up a markdown file, a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk in your markdown (.Rmd extension) file, different from script file (.R extension), within triple back quotes, and also control what code shows up in the final output file and also organise your input/code file within sections/chunks.

R markdown is great for generating quick, short reports to be shared with stakeholders.

Make sure to read up on its cheatsheets in the class Dropbox folder.

Iris Data

Arvind Sharma

2022-08-10