Discussion on Iris Dataset
In OpenStats Chapter 1, Exercises, Problem 9, there is a reference to Fisher’s iris data. Discuss the solutions to this problem, and then conduct a descriptive analysis of the data which are conveniently available in R. To access the data in R, simply type “iris.” Investigate any additional R libraries that might help support analysis of this data (e.g., psych package). Share your code and analysis in the discussion forum.
Fisher’s irises: Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor and virginica). There were 50 flowers from each species in the data set.
QUESTIONS
How many cases were included in the data?
How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.
How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).
SOLUTIONS
n=150
4 numerical variables are continuous (sepal.length, sepal.width, petal.length, petal.width).
1 categorical variable (species). Factors are
setosa, versicolor and
virginica.
# Clear the workspace
rm(list = ls()) # Clear all files from your environment
# gc() # Clear unused memory
cat("\f") # Clear the console
graphics.off() # Clear all graphs
df <- iris
# a
nrow(df)
## [1] 150
# b
str(df) # internal structure function str() suggests four numerical variables and one categorical variable within dataset.
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# c
?unique # unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
unique(df$Species)
## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
?levels # levels provides access to the levels attribute of a variable. The first form returns the value of the levels of its argument and the second sets the attribute.
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
It is never wise to print the entire data, especially if it is very large.
Print a subset of the data only. Some commands to help you -
head(iris) # see the first few rows of data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(iris,
n=9) # see the last few rows of data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
iris[80:90,] # print 80th to 90th observation of the dataset in the console
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
Do basic summary statistics of the data -
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# more informative
by(data = iris,
INDICES = iris$Species,
FUN = summary
)
## iris$Species: setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
## Species
## setosa :50
## versicolor: 0
## virginica : 0
##
##
##
## ------------------------------------------------------------
## iris$Species: versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
## Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
## ------------------------------------------------------------
## iris$Species: virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
## Species
## setosa : 0
## versicolor: 0
## virginica :50
##
##
##
In R you can get the same answer in multiple ways.
?levels
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
for (j in levels(iris$Species)) {
print(j)
}
## [1] "setosa"
## [1] "versicolor"
## [1] "virginica"
Multiple ways to do this - lots of different commands. Experiment and
find your own preferred commands. Some might prefer commands in base R
package, while others may prefer commands in user written packages (will
require both an initial install and calling the library for every
session). I provide an example of two user written commands that are
popular - describe() in psych
package and stargazer() in
stargazer package (can be used for regression tables too,
apart from summary statistics tables, and used in professional
publications).
Do not apply different summary statistics functions on a variable. We
have more efficient ways like psych package to create a
summary statistics table.
## measures of center
mean(iris$Sepal.Length)
## [1] 5.843333
median(iris$Sepal.Length)
## [1] 5.8
## measures of spread
max(iris$Sepal.Length)
## [1] 7.9
min(iris$Sepal.Length)
## [1] 4.3
range(iris$Sepal.Length)
## [1] 4.3 7.9
quantile(iris$Sepal.Length)
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
## measures of dispersion
sd(iris$Sepal.Length)
## [1] 0.8280661
var(iris$Sepal.Length)
## [1] 0.6856935
## higher order moments
# install.packages("moments")
library(moments)
?skewness # calculate skewness in r
skewness(iris$Sepal.Length)
## [1] 0.3117531
?kurtosis # calculate kurtosis in r (compare with 3)
kurtosis(iris$Sepal.Length)
## [1] 2.426432
Read up on psych package here. Lets apply it on the entire variable, or dataset.
# install.packages("psych") - required one time only, commented out to save running time
library(psych) # load the library to use the commands in the package
describe(iris)
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6 0.31
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4 0.31
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9 -0.27
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4 -0.10
## Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.0 3.0 2.0 0.00
## kurtosis se
## Sepal.Length -0.61 0.07
## Sepal.Width 0.14 0.04
## Petal.Length -1.42 0.14
## Petal.Width -1.36 0.06
## Species* -1.52 0.07
To create more professional looking tables, see the stargazer package examples.
# install.packages("stargzer") - required one time only, commented out to save running time
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
## ALL VARIABLES IN THE ORIGINAL DATAFRAME
stargazer(df,
type = "text", # html, latex
# summary.stat =
covariate.labels = c("Sepal Length", "Sepal Width", "Petal Length", "Petal Width"),
digits = 2)
##
## ========================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------
## Sepal Length 150 5.84 0.83 4.30 7.90
## Sepal Width 150 3.06 0.44 2.00 4.40
## Petal Length 150 3.76 1.77 1.00 6.90
## Petal Width 150 1.20 0.76 0.10 2.50
## ----------------------------------------
Do not apply functions one by one on variables in the dataset
iris. If you did want to do that, use a loop. Or
sapply function.
?sapply
# apply a function on all elements of your object X
sapply(X = iris,
FUN = is.numeric
)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## TRUE TRUE TRUE TRUE FALSE
?is.numeric
is.numeric(iris)
## [1] FALSE
You can also embed plots, for example (these need to be properly embellished with titles, axis names, et cetra):
?plot
## Help on topic 'plot' was found in the following packages:
##
## Package Library
## graphics /Library/Frameworks/R.framework/Versions/4.2/Resources/library
## base /Library/Frameworks/R.framework/Resources/library
##
##
## Using the first match ...
# bare bones command - ensure you have an output
plot(iris$Sepal.Length)
# basic command - ensure you labelled the axis and have an informative title
plot(iris$Sepal.Length,
ylab = "Sepal Length", # y axis label
xlab = "Species ID", # x axis label
main = "Scatter Plot" # main title of the chart
)
Lots of options to play with in the base plot command - read the help file for The Default Scatterplot Function. You can find snippets on online blogs too.
# fancier commands - can play with a million settings
plot(iris$Sepal.Length,
ylab = "Sepal Length", # y axis label
xlab = "Species ID", # x axis label
main = "Scatter Plot", # main title of the chart
pch = as.numeric(iris$Species),
col = as.numeric(iris$Species)
)
# plot(iris$Sepal.Length, iris$Sepal.Width)
?hist
hist(iris$Sepal.Length, # required option
main = "Histogram", # optional: for title of graph
xlab = "Sepal Length", # optional: for x axis label. axis limits can be controlled too
col = "gold1" # optional: for coloring the bars
)
?boxplot
# bare bones command
boxplot (iris$Sepal.Length, main = "Boxplot", col = "cyan")
plot(iris[1:4], col=iris$Species)
describeBy(x = iris,
group = iris$Species
)
##
## Descriptive statistics by group
## group: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 5.01 0.35 5.0 5.00 0.30 4.3 5.8 1.5 0.11 -0.45
## Sepal.Width 2 50 3.43 0.38 3.4 3.42 0.37 2.3 4.4 2.1 0.04 0.60
## Petal.Length 3 50 1.46 0.17 1.5 1.46 0.15 1.0 1.9 0.9 0.10 0.65
## Petal.Width 4 50 0.25 0.11 0.2 0.24 0.00 0.1 0.6 0.5 1.18 1.26
## Species* 5 50 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN
## se
## Sepal.Length 0.05
## Sepal.Width 0.05
## Petal.Length 0.02
## Petal.Width 0.01
## Species* 0.00
## ------------------------------------------------------------
## group: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 5.94 0.52 5.90 5.94 0.52 4.9 7.0 2.1 0.10 -0.69
## Sepal.Width 2 50 2.77 0.31 2.80 2.78 0.30 2.0 3.4 1.4 -0.34 -0.55
## Petal.Length 3 50 4.26 0.47 4.35 4.29 0.52 3.0 5.1 2.1 -0.57 -0.19
## Petal.Width 4 50 1.33 0.20 1.30 1.32 0.22 1.0 1.8 0.8 -0.03 -0.59
## Species* 5 50 2.00 0.00 2.00 2.00 0.00 2.0 2.0 0.0 NaN NaN
## se
## Sepal.Length 0.07
## Sepal.Width 0.04
## Petal.Length 0.07
## Petal.Width 0.03
## Species* 0.00
## ------------------------------------------------------------
## group: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 6.59 0.64 6.50 6.57 0.59 4.9 7.9 3.0 0.11 -0.20
## Sepal.Width 2 50 2.97 0.32 3.00 2.96 0.30 2.2 3.8 1.6 0.34 0.38
## Petal.Length 3 50 5.55 0.55 5.55 5.51 0.67 4.5 6.9 2.4 0.52 -0.37
## Petal.Width 4 50 2.03 0.27 2.00 2.03 0.30 1.4 2.5 1.1 -0.12 -0.75
## Species* 5 50 3.00 0.00 3.00 3.00 0.00 3.0 3.0 0.0 NaN NaN
## se
## Sepal.Length 0.09
## Sepal.Width 0.05
## Petal.Length 0.08
## Petal.Width 0.04
## Species* 0.00
# alternatively, if you do not want to use describe by command, create three new data frames and summarize them
irisSetosa <- subset(iris, Species == "setosa" )
irisVersicolor <- subset(iris, Species == "versicolor")
irisVirginica <- subset(iris, Species == "virginica" )
list (summary(irisSetosa [,1:4] ) )
## [[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
list (summary(irisVersicolor[,1:4] ) )
## [[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
## Median :5.900 Median :2.800 Median :4.35 Median :1.300
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
list (summary(irisVirginica [,1:4] ) )
## [[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
par(mfrow = c(1,3) ,mar = c(6,3,2,1) )
# las=2 ensures the x axis labels are vertical and not horizontal ( las=1 )
boxplot(irisSetosa[,1:4] , main = "Setosa", ylim = c(0,8), las = 2 , col = "blue")
boxplot(irisVersicolor[,1:4], main = "Versicolor",ylim = c(0,8), las = 2, col = "red")
boxplot(irisVirginica[,1:4] , main = "Virginica", ylim = c(0,8), las = 2, col = "green")
Omit code chunk output below by setting eval=FALSE and include=FALSE.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button in RStudio after setting up a markdown file, a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk in your markdown (.Rmd extension) file, different from script file (.R extension), within triple back quotes, and also control what code shows up in the final output file and also organise your input/code file within sections/chunks.
R markdown is great for generating quick, short reports to be shared with stakeholders.
Make sure to read up on its cheatsheets in the class Dropbox folder.