In OpenStats Chapter 1, Exercises, Problem 9, there is a
reference to Fisher’s iris data. Discuss the solutions to this problem,
and then conduct a descriptive analysis of the data which are
conveniently available in R. To access the data in R, simply type
“iris
”. Investigate any additional R libraries that might
help support analysis of this data (e.g., psych
package).
Share your code and analysis in the discussion forum.
Fisher’s irises: Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor and virginica). There were 50 flowers from each species in the data set.
How many cases were included in the data?
How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.
How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).
sepal.length
,
sepal.width
, petal.length
,
petal.width
).
species
).
setosa
, versicolor
and
virginica
.# Clear the workspace
rm(list = ls()) # Clear all files from your environment
# gc() # Clear unused memory
cat("\f") # Clear the console
graphics.off() # Clear all graphs
df <- iris
# a
nrow(df)
## [1] 150
# b
str(df) # internal structure function str() suggests four numerical variables and one categorical variable within dataset.
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# c
?unique # unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
unique(df$Species)
## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
?levels # levels provides access to the levels attribute of a variable. The first form returns the value of the levels of its argument and the second sets the attribute.
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
head(iris) # see the first few rows of data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(iris) # see the last few rows of data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
iris[80:90,] # print 80th to 90th observation of the dataset in the console
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
??dfSummary # for global search
# install.packages(summarytools)
library(summarytools)
dfSummary(iris)
## Data Frame Summary
## iris
## Dimensions: 150 x 5
## Duplicates: 1
##
## -----------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- -------------- ----------------------- -------------------- --------------------- ---------- ---------
## 1 Sepal.Length Mean (sd) : 5.8 (0.8) 35 distinct values . . : : 150 0
## [numeric] min < med < max: : : : : (100.0%) (0.0%)
## 4.3 < 5.8 < 7.9 : : : : :
## IQR (CV) : 1.3 (0.1) : : : : :
## : : : : : : : :
##
## 2 Sepal.Width Mean (sd) : 3.1 (0.4) 23 distinct values : 150 0
## [numeric] min < med < max: : (100.0%) (0.0%)
## 2 < 3 < 4.4 . :
## IQR (CV) : 0.5 (0.1) : : : :
## . . : : : : : :
##
## 3 Petal.Length Mean (sd) : 3.8 (1.8) 43 distinct values : 150 0
## [numeric] min < med < max: : . : (100.0%) (0.0%)
## 1 < 4.3 < 6.9 : : : .
## IQR (CV) : 3.5 (0.5) : : : : : .
## : : . : : : : : .
##
## 4 Petal.Width Mean (sd) : 1.2 (0.8) 22 distinct values : 150 0
## [numeric] min < med < max: : (100.0%) (0.0%)
## 0.1 < 1.3 < 2.5 : . . :
## IQR (CV) : 1.5 (0.6) : : : : .
## : : : : : . : : :
##
## 5 Species 1. setosa 50 (33.3%) IIIIII 150 0
## [factor] 2. versicolor 50 (33.3%) IIIIII (100.0%) (0.0%)
## 3. virginica 50 (33.3%) IIIIII
## -----------------------------------------------------------------------------------------------------------
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button in RStudio after setting up a markdown file, a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk in your markdown (.Rmd extension) file, different from script file (.R extension), within triple back quotes, and also control what code shows up in the final output file and also organise your input/code file within sections/chunks.
R markdown is great for generating quick, short reports to be shared with stakeholders.
Multiple ways to do this - lots of different commands. Experiment and
find your own preferred commands. Some might prefer commands in base R
package, while others may prefer commands in user written packages (will
require both an initial install and calling the library for every
session). I provide an example of two user written commands that are
popular - describe
in
psych
package and
stargazer
in
stargazer
package (can be used for regression
tables too, apart from summary statistics tables, and used in
professional publications).
# install.packages("psych") - required one time only, commented out to save running time
library(psych) # load the library to use the commands in the package
describe(iris)
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6 0.31
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4 0.31
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9 -0.27
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4 -0.10
## Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.0 3.0 2.0 0.00
## kurtosis se
## Sepal.Length -0.61 0.07
## Sepal.Width 0.14 0.04
## Petal.Length -1.42 0.14
## Petal.Width -1.36 0.06
## Species* -1.52 0.07
# install.packages("stargzer") - required one time only, commented out to save running time
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
## ALL VARIABLES IN THE ORIGINAL DATAFRAME
stargazer(df,
type = "text", # html, latex
# out =
# summary.stat =
covariate.labels = c("Sepal Length", "Sepal Width", "Petal Length", "Petal Width"),
digits = 2)
##
## ========================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------
## Sepal Length 150 5.84 0.83 4.30 7.90
## Sepal Width 150 3.06 0.44 2.00 4.40
## Petal Length 150 3.76 1.77 1.00 6.90
## Petal Width 150 1.20 0.76 0.10 2.50
## ----------------------------------------
?levels
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
for (j in levels(iris$Species)) {
print(j)
}
## [1] "setosa"
## [1] "versicolor"
## [1] "virginica"
Do not apply functions one by one on variables in the dataset
iris
. If you did want to do that, use a loop.
mean(iris$Sepal.Length)
## [1] 5.843333
median(iris$Sepal.Length)
## [1] 5.8
max(iris$Sepal.Length)
## [1] 7.9
min(iris$Sepal.Length)
## [1] 4.3
range(iris$Sepal.Length)
## [1] 4.3 7.9
quantile(iris$Sepal.Length)
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
sd(iris$Sepal.Length)
## [1] 0.8280661
var(iris$Sepal.Length)
## [1] 0.6856935
skew(iris$Sepal.Length)
## [1] 0.3086407
kurtosi(iris$Sepal.Length)
## [1] -0.6058125
You can also embed plots, for example (these need to be properly embellished with titles, axis names, et cetra):
?plot
## Help on topic 'plot' was found in the following packages:
##
## Package Library
## graphics /Library/Frameworks/R.framework/Versions/4.2/Resources/library
## base /Library/Frameworks/R.framework/Resources/library
##
##
## Using the first match ...
# bare bones command - ensure you have an output
plot(iris$Sepal.Length)
# basic command - ensure you labelled the axis and have an informative title
plot(iris$Sepal.Length,
ylab = "Sepal Length", # y axis label
xlab = "Species ID", # x axis label
main = "Scatter Plot" # main title of the chart
)
Lots of options to play with in the base plot command - read the help file for The Default Scatterplot Function. You can find snippets on online blogs too.
# fancier commands - can play with a million settings
plot(iris$Sepal.Length,
ylab = "Sepal Length", # y axis label
xlab = "Species ID", # x axis label
main = "Scatter Plot", # main title of the chart
pch = as.numeric(iris$Species),
col = as.numeric(iris$Species)
)
# plot(iris$Sepal.Length, iris$Sepal.Width)
?hist
hist(iris$Sepal.Length, # required option
main = "Histogram", # optional: for title of graph
xlab = "Sepal Length", # optional: for x axis label. axis limits can be controlled too
col = "gold1" # optional: for coloring the bars
)
Boxplots are really useful in descriptive statistics and are often underused (mostly because it is not well understood by the public). A boxplot graphically represents the distribution of a quantitative variable by visually displaying five common location summary (minimum, median, first/third quartiles and maximum) and any observation that was classified as a suspected outlier using the interquartile range (IQR) criterion.
?boxplot
# bare bones command
boxplot (iris$Sepal.Length,
main = "Boxplot",
col = "cyan"
)
plot(iris[1:4], col=iris$Species)
describeBy(x = iris,
group = iris$Species
)
##
## Descriptive statistics by group
## group: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 5.01 0.35 5.0 5.00 0.30 4.3 5.8 1.5 0.11 -0.45
## Sepal.Width 2 50 3.43 0.38 3.4 3.42 0.37 2.3 4.4 2.1 0.04 0.60
## Petal.Length 3 50 1.46 0.17 1.5 1.46 0.15 1.0 1.9 0.9 0.10 0.65
## Petal.Width 4 50 0.25 0.11 0.2 0.24 0.00 0.1 0.6 0.5 1.18 1.26
## Species* 5 50 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN
## se
## Sepal.Length 0.05
## Sepal.Width 0.05
## Petal.Length 0.02
## Petal.Width 0.01
## Species* 0.00
## ------------------------------------------------------------
## group: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 5.94 0.52 5.90 5.94 0.52 4.9 7.0 2.1 0.10 -0.69
## Sepal.Width 2 50 2.77 0.31 2.80 2.78 0.30 2.0 3.4 1.4 -0.34 -0.55
## Petal.Length 3 50 4.26 0.47 4.35 4.29 0.52 3.0 5.1 2.1 -0.57 -0.19
## Petal.Width 4 50 1.33 0.20 1.30 1.32 0.22 1.0 1.8 0.8 -0.03 -0.59
## Species* 5 50 2.00 0.00 2.00 2.00 0.00 2.0 2.0 0.0 NaN NaN
## se
## Sepal.Length 0.07
## Sepal.Width 0.04
## Petal.Length 0.07
## Petal.Width 0.03
## Species* 0.00
## ------------------------------------------------------------
## group: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 6.59 0.64 6.50 6.57 0.59 4.9 7.9 3.0 0.11 -0.20
## Sepal.Width 2 50 2.97 0.32 3.00 2.96 0.30 2.2 3.8 1.6 0.34 0.38
## Petal.Length 3 50 5.55 0.55 5.55 5.51 0.67 4.5 6.9 2.4 0.52 -0.37
## Petal.Width 4 50 2.03 0.27 2.00 2.03 0.30 1.4 2.5 1.1 -0.12 -0.75
## Species* 5 50 3.00 0.00 3.00 3.00 0.00 3.0 3.0 0.0 NaN NaN
## se
## Sepal.Length 0.09
## Sepal.Width 0.05
## Petal.Length 0.08
## Petal.Width 0.04
## Species* 0.00
# alternatively, if you do not want to use describe by command, create three new data frames and summarize them
irisSetosa <- subset(iris, Species == "setosa" )
irisVersicolor <- subset(iris, Species == "versicolor")
irisVirginica <- subset(iris, Species == "virginica" )
list (summary(irisSetosa [,1:4] ) )
## [[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
list (summary(irisVersicolor[,1:4] ) )
## [[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
## Median :5.900 Median :2.800 Median :4.35 Median :1.300
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
list (summary(irisVirginica [,1:4] ) )
## [[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
par(mfrow = c(1,3) ,mar = c(6,3,2,1) )
# las=2 ensures the x axis labels are vertical and not horizontal ( las=1 )
boxplot(irisSetosa[,1:4] , main = "Setosa", ylim = c(0,8), las = 2 , col = "blue")
boxplot(irisVersicolor[,1:4], main = "Versicolor",ylim = c(0,8), las = 2, col = "red")
boxplot(irisVirginica[,1:4] , main = "Virginica", ylim = c(0,8), las = 2, col = "green")
eval=FALSE
and
do not even print it by setting include=FALSE
.
eval=FALSE
but
print the commands by setting include=TRUE
.
attach(iris) # The database is attached to the R search path. This means that the database is searched by R when evaluating a variable, so objects in the database can be accessed by simply giving their names.
iris.features = iris
iris.features$Species <- NULL
report <- kmeans(iris.features, 3)
report # K-means clustering with 3 clusters of sizes 50, 38, 62.
# install.packages("ResourceSelection")
library(ResourceSelection)
kdepairs(iris[,1:4])
eval=TRUE
(the default value).
attach(iris) # The database is attached to the R search path. This means that the database is searched by R when evaluating a variable, so objects in the database can be accessed by simply giving their names.
iris.features = iris
iris.features$Species <- NULL
report <- kmeans(iris.features, 3)
report # K-means clustering with 3 clusters of sizes 50, 38, 62.
## K-means clustering with 3 clusters of sizes 50, 62, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.006000 3.428000 1.462000 0.246000
## 2 5.901613 2.748387 4.393548 1.433871
## 3 6.850000 3.073684 5.742105 2.071053
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
##
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# install.packages("ResourceSelection")
library(ResourceSelection)
## ResourceSelection 0.3-5 2019-07-22
kdepairs(iris[,1:4])
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
eval=TRUE
(the default value) and echo=FALSE
.
message=FALSE
and warning=FALSE
.## K-means clustering with 3 clusters of sizes 33, 21, 96
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.175758 3.624242 1.472727 0.2727273
## 2 4.738095 2.904762 1.790476 0.3523810
## 3 6.314583 2.895833 4.973958 1.7031250
##
## Clustering vector:
## [1] 1 2 2 2 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 1 1 2 1 1
## [38] 1 2 1 1 2 2 1 1 2 1 2 1 1 3 3 3 3 3 3 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3
##
## Within cluster sum of squares by cluster:
## [1] 6.432121 17.669524 118.651875
## (between_SS / total_SS = 79.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"