1 Discussion on Iris Dataset

In OpenStats Chapter 1, Exercises, Problem 9, there is a reference to Fisher’s iris data. Discuss the solutions to this problem, and then conduct a descriptive analysis of the data which are conveniently available in R. To access the data in R, simply type “iris”. Investigate any additional R libraries that might help support analysis of this data (e.g., psych package). Share your code and analysis in the discussion forum.

Fisher’s irises: Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor and virginica). There were 50 flowers from each species in the data set.

How many cases were included in the data?
How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.
How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).

n=150
- Better to use a command instead of eyeballing (with larger datasets).
4 numerical variables are continuous (sepal.length, sepal.width, petal.length, petal.width).
- Summary Statistics is usually a good way to find the type of variable and level of measurement.
1 categorical variable (species).
- Factors are setosa, versicolor and virginica.

# Clear the workspace
  rm(list = ls()) # Clear all files from your environment
# gc()            # Clear unused memory
  cat("\f")       # Clear the console

  graphics.off()  # Clear all graphs

df <- iris

# a
nrow(df)

## [1] 150

# b
str(df)  # internal structure function str() suggests four numerical variables and one categorical variable within dataset.

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# c
?unique # unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
unique(df$Species)

## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica

?levels # levels provides access to the levels attribute of a variable. The first form returns the value of the levels of its argument and the second sets the attribute.
levels(iris$Species)

## [1] "setosa"     "versicolor" "virginica"

If you wanted to look at the first few, last few (or specific) observations -

head(iris)      # see the first few rows of data

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris)      # see the last few rows of data

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

iris[80:90,]    # print 80th to 90th observation of the dataset in the console

##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 80          5.7         2.6          3.5         1.0 versicolor
## 81          5.5         2.4          3.8         1.1 versicolor
## 82          5.5         2.4          3.7         1.0 versicolor
## 83          5.8         2.7          3.9         1.2 versicolor
## 84          6.0         2.7          5.1         1.6 versicolor
## 85          5.4         3.0          4.5         1.5 versicolor
## 86          6.0         3.4          4.5         1.6 versicolor
## 87          6.7         3.1          4.7         1.5 versicolor
## 88          6.3         2.3          4.4         1.3 versicolor
## 89          5.6         3.0          4.1         1.3 versicolor
## 90          5.5         2.5          4.0         1.3 versicolor

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

You can use user written packages for basic summary statistics/part b to get to know your data a bit better -

??dfSummary              # for global search 
# install.packages(summarytools)
library(summarytools)
dfSummary(iris)

## Data Frame Summary  
## iris  
## Dimensions: 150 x 5  
## Duplicates: 1  
## 
## -----------------------------------------------------------------------------------------------------------
## No   Variable       Stats / Values          Freqs (% of Valid)   Graph                 Valid      Missing  
## ---- -------------- ----------------------- -------------------- --------------------- ---------- ---------
## 1    Sepal.Length   Mean (sd) : 5.8 (0.8)   35 distinct values     . . : :             150        0        
##      [numeric]      min < med < max:                               : : : :             (100.0%)   (0.0%)   
##                     4.3 < 5.8 < 7.9                                : : : : :                               
##                     IQR (CV) : 1.3 (0.1)                           : : : : :                               
##                                                                  : : : : : : : :                           
## 
## 2    Sepal.Width    Mean (sd) : 3.1 (0.4)   23 distinct values           :             150        0        
##      [numeric]      min < med < max:                                     :             (100.0%)   (0.0%)   
##                     2 < 3 < 4.4                                        . :                                 
##                     IQR (CV) : 0.5 (0.1)                             : : : :                               
##                                                                  . . : : : : : :                           
## 
## 3    Petal.Length   Mean (sd) : 3.8 (1.8)   43 distinct values   :                     150        0        
##      [numeric]      min < med < max:                             :         . :         (100.0%)   (0.0%)   
##                     1 < 4.3 < 6.9                                :         : : .                           
##                     IQR (CV) : 3.5 (0.5)                         : :       : : : .                         
##                                                                  : :   . : : : : : .                       
## 
## 4    Petal.Width    Mean (sd) : 1.2 (0.8)   22 distinct values   :                     150        0        
##      [numeric]      min < med < max:                             :                     (100.0%)   (0.0%)   
##                     0.1 < 1.3 < 2.5                              :       . .   :                           
##                     IQR (CV) : 1.5 (0.6)                         :       : :   :   .                       
##                                                                  : :   : : : . : : :                       
## 
## 5    Species        1. setosa               50 (33.3%)           IIIIII                150        0        
##      [factor]       2. versicolor           50 (33.3%)           IIIIII                (100.0%)   (0.0%)   
##                     3. virginica            50 (33.3%)           IIIIII                                    
## -----------------------------------------------------------------------------------------------------------

2 R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button in RStudio after setting up a markdown file, a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk in your markdown (.Rmd extension) file, different from script file (.R extension), within triple back quotes, and also control what code shows up in the final output file and also organise your input/code file within sections/chunks.

R markdown is great for generating quick, short reports to be shared with stakeholders.

3 Descriptive Statistics

Multiple ways to do this - lots of different commands. Experiment and find your own preferred commands. Some might prefer commands in base R package, while others may prefer commands in user written packages (will require both an initial install and calling the library for every session). I provide an example of two user written commands that are popular - describe in psych package and stargazer in stargazer package (can be used for regression tables too, apart from summary statistics tables, and used in professional publications).

3.1 Psych package, describe() command

Psych package intro

# install.packages("psych") - required one time only, commented out to save running time
library(psych) # load the library to use the commands in the package
describe(iris)

##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07

# install.packages("stargzer") - required one time only, commented out to save running time

3.2 Stargazer Package, stargazer() command

Stargazer package examples

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

## ALL VARIABLES IN THE ORIGINAL DATAFRAME
stargazer(df, 
          type = "text", # html, latex
          # out =
          # summary.stat = 
          covariate.labels = c("Sepal Length", "Sepal Width", "Petal Length", "Petal Width"),
          digits = 2)

## 
## ========================================
## Statistic     N  Mean St. Dev. Min  Max 
## ----------------------------------------
## Sepal Length 150 5.84   0.83   4.30 7.90
## Sepal Width  150 3.06   0.44   2.00 4.40
## Petal Length 150 3.76   1.77   1.00 6.90
## Petal Width  150 1.20   0.76   0.10 2.50
## ----------------------------------------

?levels
levels(iris$Species)

## [1] "setosa"     "versicolor" "virginica"

for (j in levels(iris$Species)) {
  print(j)
}

## [1] "setosa"
## [1] "versicolor"
## [1] "virginica"

Do not apply functions one by one on variables in the dataset iris. If you did want to do that, use a loop.

mean(iris$Sepal.Length)

## [1] 5.843333

median(iris$Sepal.Length)

## [1] 5.8

max(iris$Sepal.Length)

## [1] 7.9

min(iris$Sepal.Length)

## [1] 4.3

range(iris$Sepal.Length)

## [1] 4.3 7.9

quantile(iris$Sepal.Length)

##   0%  25%  50%  75% 100% 
##  4.3  5.1  5.8  6.4  7.9

sd(iris$Sepal.Length)

## [1] 0.8280661

var(iris$Sepal.Length)

## [1] 0.6856935

skew(iris$Sepal.Length)

## [1] 0.3086407

kurtosi(iris$Sepal.Length)

## [1] -0.6058125

4 Including Plots

You can also embed plots, for example (these need to be properly embellished with titles, axis names, et cetra):

4.0.1 Color names in R

Color Palette in R

4.1 Scatterplot

?plot

## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   graphics              /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##   base                  /Library/Frameworks/R.framework/Resources/library
## 
## 
## Using the first match ...

# bare bones command - ensure you have an output 
plot(iris$Sepal.Length)

# basic command - ensure you labelled the axis and have an informative title 
plot(iris$Sepal.Length, 
     ylab  = "Sepal Length",    # y axis label
     xlab  = "Species ID",      # x axis label
     main  = "Scatter Plot"     # main title of the chart
     )

Lots of options to play with in the base plot command - read the help file for The Default Scatterplot Function. You can find snippets on online blogs too.

# fancier commands - can play with a million settings
plot(iris$Sepal.Length, 
     ylab  = "Sepal Length",    # y axis label
     xlab  = "Species ID",      # x axis label
     main  = "Scatter Plot",    # main title of the chart
     pch   =  as.numeric(iris$Species),
     col   =  as.numeric(iris$Species)
     )

# plot(iris$Sepal.Length, iris$Sepal.Width)

4.2 Histogram

?hist
hist(iris$Sepal.Length,     # required option 
     main = "Histogram",    # optional:  for title of graph
     xlab = "Sepal Length", # optional:  for x axis label. axis limits can be controlled too
     col  = "gold1"         # optional:  for coloring the bars
     )

4.3 Boxplot

Boxplots are really useful in descriptive statistics and are often underused (mostly because it is not well understood by the public). A boxplot graphically represents the distribution of a quantitative variable by visually displaying five common location summary (minimum, median, first/third quartiles and maximum) and any observation that was classified as a suspected outlier using the interquartile range (IQR) criterion.

?boxplot

# bare bones command
boxplot (iris$Sepal.Length, 
         main = "Boxplot", 
         col = "cyan"
         )

plot(iris[1:4], col=iris$Species)

describeBy(x     = iris, 
           group = iris$Species
           )

## 
##  Descriptive statistics by group 
## group: setosa
##              vars  n mean   sd median trimmed  mad min max range skew kurtosis
## Sepal.Length    1 50 5.01 0.35    5.0    5.00 0.30 4.3 5.8   1.5 0.11    -0.45
## Sepal.Width     2 50 3.43 0.38    3.4    3.42 0.37 2.3 4.4   2.1 0.04     0.60
## Petal.Length    3 50 1.46 0.17    1.5    1.46 0.15 1.0 1.9   0.9 0.10     0.65
## Petal.Width     4 50 0.25 0.11    0.2    0.24 0.00 0.1 0.6   0.5 1.18     1.26
## Species*        5 50 1.00 0.00    1.0    1.00 0.00 1.0 1.0   0.0  NaN      NaN
##                se
## Sepal.Length 0.05
## Sepal.Width  0.05
## Petal.Length 0.02
## Petal.Width  0.01
## Species*     0.00
## ------------------------------------------------------------ 
## group: versicolor
##              vars  n mean   sd median trimmed  mad min max range  skew kurtosis
## Sepal.Length    1 50 5.94 0.52   5.90    5.94 0.52 4.9 7.0   2.1  0.10    -0.69
## Sepal.Width     2 50 2.77 0.31   2.80    2.78 0.30 2.0 3.4   1.4 -0.34    -0.55
## Petal.Length    3 50 4.26 0.47   4.35    4.29 0.52 3.0 5.1   2.1 -0.57    -0.19
## Petal.Width     4 50 1.33 0.20   1.30    1.32 0.22 1.0 1.8   0.8 -0.03    -0.59
## Species*        5 50 2.00 0.00   2.00    2.00 0.00 2.0 2.0   0.0   NaN      NaN
##                se
## Sepal.Length 0.07
## Sepal.Width  0.04
## Petal.Length 0.07
## Petal.Width  0.03
## Species*     0.00
## ------------------------------------------------------------ 
## group: virginica
##              vars  n mean   sd median trimmed  mad min max range  skew kurtosis
## Sepal.Length    1 50 6.59 0.64   6.50    6.57 0.59 4.9 7.9   3.0  0.11    -0.20
## Sepal.Width     2 50 2.97 0.32   3.00    2.96 0.30 2.2 3.8   1.6  0.34     0.38
## Petal.Length    3 50 5.55 0.55   5.55    5.51 0.67 4.5 6.9   2.4  0.52    -0.37
## Petal.Width     4 50 2.03 0.27   2.00    2.03 0.30 1.4 2.5   1.1 -0.12    -0.75
## Species*        5 50 3.00 0.00   3.00    3.00 0.00 3.0 3.0   0.0   NaN      NaN
##                se
## Sepal.Length 0.09
## Sepal.Width  0.05
## Petal.Length 0.08
## Petal.Width  0.04
## Species*     0.00

# alternatively, if you do not want to use describe by command, create three new data frames and summarize them
irisSetosa     <- subset(iris, Species == "setosa"   )
irisVersicolor <- subset(iris, Species == "versicolor")
irisVirginica  <- subset(iris, Species == "virginica" )

list (summary(irisSetosa    [,1:4] ) )

## [[1]]
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600

list (summary(irisVersicolor[,1:4] ) )

## [[1]]
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800

list (summary(irisVirginica [,1:4] ) )

## [[1]]
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

par(mfrow = c(1,3) ,mar = c(6,3,2,1) )
# las=2 ensures the x axis labels are vertical and not horizontal ( las=1 )
boxplot(irisSetosa[,1:4]    , main = "Setosa",    ylim = c(0,8), las = 2 , col = "blue")
boxplot(irisVersicolor[,1:4], main = "Versicolor",ylim = c(0,8), las = 2,  col = "red")
boxplot(irisVirginica[,1:4] , main = "Virginica", ylim = c(0,8), las = 2,  col = "green")

5 K-means clustering

Omit code chunk output below by setting eval=FALSE and do not even print it by setting include=FALSE.
- Useful when you want to develop some code but not show it to end user.
Omit code chunk output below by setting eval=FALSE but print the commands by setting include=TRUE.
- Useful if you want to show some interim steps but the output is not important. EG data cleaning steps.

attach(iris) # The database is attached to the R search path. This means that the database is searched by R when evaluating a variable, so objects in the database can be accessed by simply giving their names.

iris.features = iris
iris.features$Species <- NULL
report <- kmeans(iris.features, 3)
report # K-means clustering with 3 clusters of sizes 50, 38, 62.  

# install.packages("ResourceSelection")
library(ResourceSelection)
kdepairs(iris[,1:4])

Print code chunk commands and output below by setting eval=TRUE (the default value).
- Typically what you will show in this course

attach(iris) # The database is attached to the R search path. This means that the database is searched by R when evaluating a variable, so objects in the database can be accessed by simply giving their names.

iris.features = iris
iris.features$Species <- NULL
report <- kmeans(iris.features, 3)
report # K-means clustering with 3 clusters of sizes 50, 38, 62.

## K-means clustering with 3 clusters of sizes 50, 62, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     5.901613    2.748387     4.393548    1.433871
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# install.packages("ResourceSelection")
library(ResourceSelection)

## ResourceSelection 0.3-5   2019-07-22

kdepairs(iris[,1:4])

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

Print code chunk output but no commands below by setting eval=TRUE (the default value) and echo=FALSE.
- Can control output more with hiding messages or warning with message=FALSE and warning=FALSE.
- Useful in a professional report where you only show the client the final output - professionla looking charts and text describing it.

## K-means clustering with 3 clusters of sizes 33, 21, 96
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.175758    3.624242     1.472727   0.2727273
## 2     4.738095    2.904762     1.790476   0.3523810
## 3     6.314583    2.895833     4.973958   1.7031250
## 
## Clustering vector:
##   [1] 1 2 2 2 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 1 1 2 1 1
##  [38] 1 2 1 1 2 2 1 1 2 1 2 1 1 3 3 3 3 3 3 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3
## 
## Within cluster sum of squares by cluster:
## [1]   6.432121  17.669524 118.651875
##  (between_SS / total_SS =  79.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Iris Data

Arvind Sharma

2022-08-10