Introduction

Numerous guides have been written on the exploration of this widely known dataset. Iris, introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems, contains three plant species (setosa, virginica, versicolor) and four features measured for each sample. These quantify the morphologic variation of the iris flower in its three species, all measurements given in centimeters.

Any comments within our code have to be preceded by the pound sign to notify the compiler to ignore them. # comments appear like this in code

Load the iris dataset

# The datasets package needs to be loaded to access our data 
# For a full list of these datasets, type library(help = "datasets")
library(datasets)
data(iris)
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

The summary() function gives summary statistics for any dataset. It can also be called on one variable instead of on the whole dataset. Try summary(iris$Sepal.Length) and compare that with the above summaries.

Alternatively, you may only want to know the column names of your dataset, in which case you can use names(NameOfdataset), which in our case would look like names(iris). Also notice that each coloumn name in the iris dataset has some upper case letters, which might be inconvenient to work with. You can then call the tolower() function on names(iris) to make this change. For those who might prefer upper case column names, the toupper() function will instead, be useful.

Written packages make it easier to work with datasets than regular baseR functions. They have been optimized to be faster and more intuitive than baseR functions, therefore reducing the steepness of the R learning curve. Let’s take a look;

The dplyr package

Use install.packages("dplyr") in your console to install this package. Note that you must be connected to the internet. If you’ve opened a new R script file, you will need to use the keys CTRL+Enter [PC] or Cmd+Enter [Mac] to run the commands.

filter()

names(iris) <- tolower(names(iris))
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# filter() the data for species virginica
virginica <- filter(iris, species == "virginica")
head(virginica) # This dispalys the first six rows

##   sepal.length sepal.width petal.length petal.width   species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          5.8         2.7          5.1         1.9 virginica
## 3          7.1         3.0          5.9         2.1 virginica
## 4          6.3         2.9          5.6         1.8 virginica
## 5          6.5         3.0          5.8         2.2 virginica
## 6          7.6         3.0          6.6         2.1 virginica

Notice that we use the logical double equal sign as in species == "virginica", and quotations around virginica since this value is of a char (character) data type. The equivalent base command for filter() would be subset(), with all the inner arguments being exactly the same. We can also filter for multiple conditions within our function.

sepalLength6 <- filter(iris, species == "virginica", sepal.length > 6)
tail(sepalLength6) # compare this to head()

##    sepal.length sepal.width petal.length petal.width   species
## 36          6.8         3.2          5.9         2.3 virginica
## 37          6.7         3.3          5.7         2.5 virginica
## 38          6.7         3.0          5.2         2.3 virginica
## 39          6.3         2.5          5.0         1.9 virginica
## 40          6.5         3.0          5.2         2.0 virginica
## 41          6.2         3.4          5.4         2.3 virginica

The syntax for using subset() would be subset(iris, species == "virginica" & sepal.length > 6) and using <- to assign it to a variable of your choice, which in our case is sepalLength6

select()

This function selects data by column name. You can select any number of columns in a few different ways.

# select() the specified columns
selected <- select(iris, sepal.length, sepal.width, petal.length)
# select all columns from sepal.length to petal.length
selected2 <- select(iris, sepal.length:petal.length)
head(selected, 3)

##   sepal.length sepal.width petal.length
## 1          5.1         3.5          1.4
## 2          4.9         3.0          1.4
## 3          4.7         3.2          1.3

# selected and selected2 are exactly the same
identical(selected, selected2)

## [1] TRUE

mutate()

Create new columns using this function

# create a new column that stores logical values for sepal.width greater than half of sepal.length
newCol <- mutate(iris, greater.half = sepal.width > 0.5 * sepal.length)
tail(newCol)

##     sepal.length sepal.width petal.length petal.width   species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
##     greater.half
## 145        FALSE
## 146        FALSE
## 147        FALSE
## 148        FALSE
## 149         TRUE
## 150         TRUE

Challenge: Out of the 150 flowers, find how many satisfy this condition. Hint: use the sum() function on newCol$greater.half

Other functions

# arrange()
newCol <- arrange(newCol, petal.width)
head(newCol)

##   sepal.length sepal.width petal.length petal.width species greater.half
## 1          4.9         3.1          1.5         0.1  setosa         TRUE
## 2          4.8         3.0          1.4         0.1  setosa         TRUE
## 3          4.3         3.0          1.1         0.1  setosa         TRUE
## 4          5.2         4.1          1.5         0.1  setosa         TRUE
## 5          4.9         3.6          1.4         0.1  setosa         TRUE
## 6          5.1         3.5          1.4         0.2  setosa         TRUE

# The chain operator, or the pipeline %>%
# This will first filter, and then arrange our data. Note that here the order in which you call functions does not matter, but in other cases it might
arr.virg <- newCol %>% filter(species == "virginica") %>%
  arrange(sepal.width)
arr.virg[30:35,] # will show us rows 30 through 35 and all columns

##    sepal.length sepal.width petal.length petal.width   species
## 30          6.8         3.0          5.5         2.1 virginica
## 31          6.5         3.0          5.8         2.2 virginica
## 32          7.7         3.0          6.1         2.3 virginica
## 33          6.7         3.0          5.2         2.3 virginica
## 34          6.4         3.1          5.5         1.8 virginica
## 35          6.9         3.1          5.4         2.1 virginica
##    greater.half
## 30        FALSE
## 31        FALSE
## 32        FALSE
## 33        FALSE
## 34        FALSE
## 35        FALSE

# You can also arrange in descending order using desc() on what you arrange by
# arrange(desc(sepal.width))
# summarise()
summarise(arr.virg, mean.length = mean(sepal.length, na.rm = TRUE))

##   mean.length
## 1       6.588

This is the mean sepal.length for the virginica species. Challenge2: The standard deviation gives how much individual values vary from the mean. Find the standard deviation of sepal.length using summarise() and sd()

Visualization

Any powerful analysis will visualize the data to give a better picture (wink wink) of the data. Below is a general plot of the iris dataset:

plot(iris)

If we’re looking to plot specific variables, we can use plot(x,y) where x and y are the variables we’re interested in. hist() is another useful function

# use ?plot to read more about other arguments
plot(iris$sepal.width, iris$sepal.length)

# ?hist will give you details on more arguments
hist(iris$sepal.width)

Notice that in flowers with greater sepal widths tend to have shorter sepal lengths.

For more resources on R, visit:

Learning R on iris

Stephen Malefetsane Moerane

29 August 2016