1 The example data

To explore the basic data manipulation in R, we’ll use some fishy data. The dataset contains measurments of 123252 fish of a particular species.

To do: Describe in more detail …

Needed libraries:

library(dplyr)
library(ggplot2)

Read in the data and take a little peak:

fish <- read.csv("http://www.hafro.is/~einarhj/data/fish.csv") %>%
  tbl_df()
dim(fish)
#> [1] 123252     17
glimpse(fish)
#> Observations: 123,252
#> Variables: 17
#> $ sample.id       (int) 36091, 36091, 36091, 36091, 36091, 36091, 3609...
#> $ year            (int) 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987...
#> $ month           (int) 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
#> $ day             (int) 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11...
#> $ station.id      (int) 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22...
#> $ lat             (dbl) 66.2167, 66.2167, 66.2167, 66.2167, 66.2167, 6...
#> $ lon             (dbl) -19.1583, -19.1583, -19.1583, -19.1583, -19.15...
#> $ depth           (dbl) 103.5, 103.5, 103.5, 103.5, 103.5, 103.5, 103....
#> $ sex             (fctr) F, M, M, M, M, F, M, M, F, M, F, F, F, F, M, ...
#> $ maturity_stage  (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1...
#> $ age             (int) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#> $ length          (int) 55, 38, 45, 21, 46, 18, 32, 31, 32, 72, 44, 51...
#> $ ungutted.weight (dbl) 1325, 490, 790, 70, 900, 45, 245, 225, 270, 31...
#> $ gutted.weight   (dbl) 1190, 445, 705, 65, 810, 30, 230, 210, 245, 28...
#> $ liver.weight    (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#> $ gonad.weight    (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#> $ species         (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

2 Some basic commands

min(fish$length)
#> [1] 7
max(fish$length)
#> [1] 148
range(fish$length)
#> [1]   7 148
median(fish$length)
#> [1] 59
quantile(fish$length)
#>   0%  25%  50%  75% 100% 
#>    7   43   59   73  148
mean(fish$length)
#> [1] 58.292
summary(fish$length)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    7.00   43.00   59.00   58.29   73.00  148.00
sd(fish$length)
#> [1] 21.73653
var(fish$length)
#> [1] 472.4766

3 Some ggplot drills

Because the fish data set is rather large, we will here work with only a subset of the data, lets take e.g. only data from year 2015 (the function filter will be covered in more details later in this document):

d <- fish %>% filter(year == 2015)
dim(d)
#> [1] 5734   17

3.1 Basics

3.1.1 Scatterplot

## create a empty plot
p <- ggplot()
## add layer to the plot
p + layer(data=d, mapping=aes(x=length, y=ungutted.weight),
          geom = 'point', stat='identity', position = 'identity')
#> Warning: Removed 12 rows containing missing values (geom_point).

The full specification of a layer is rather cumbersome to write out as each component of a layer needs to specified. It is therefor more common to use a shorthand for common types of plots. Equivalent plot to the above can be obtained using the following code:

p + geom_point(data = d, aes(length, ungutted.weight))

There are numerous ways to achieved the same results as above:

ggplot() + geom_point(data = d, aes(x = length, y = ungutted.weight))
ggplot(data = d) + geom_point(aes(x = length, y = ungutted.weight))
ggplot(d) + geom_point(aes(length, ungutted.weight))

p <- ggplot(d)
p + geom_point(aes(length, ungutted.weight))

Some basic controls:

p <- ggplot(d, aes(length, ungutted.weight))
p1 <- p + geom_point(colour = "red",
                     size = 2,
                     alpha = 0.1)
p1

Adding labels:

p2 <- p1 +  labs(x = "Length [cm]",
           y = "Weight [g]",
           title = "Length weight relationship")
p2
#> Warning: Removed 12 rows containing missing values (geom_point).

Add a smoother:

p2 + stat_smooth(aes(length, ungutted.weight), lwd = 1)
#> geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
#> Warning: Removed 12 rows containing missing values (stat_smooth).
#> Warning: Removed 12 rows containing missing values (geom_point).

3.1.2 Histograms

p <- ggplot(d, aes(length))
p1 <- p + geom_histogram(binwidth = 1)
p1

p1 + geom_histogram(aes(fill = factor(age)), binwidth = 1)

Plotting as facet:

p1 + geom_histogram(binwidth = 1) + facet_grid(age ~ .)

p1 + geom_histogram(binwidth = 1) + facet_grid(age ~ ., scale="free_y")

p1 + geom_histogram(binwidth = 1) + facet_grid(. ~ age) +
  coord_flip()

3.1.3 Boxplots

d <- d %>% filter(age <= 10)
p <- ggplot(d, aes(factor(age), length)) + geom_boxplot()
p

# add the raw data
p + geom_point()

p1 <- p + geom_jitter(alpha = 0.1, colour = "red")
p1

p1 + geom_violin(alpha = 0.2, fill = "blue", scale = "width")

4 Basic dataframe manipulations

In the following tutorial we are using functions in the dplyr-package. As is stated in the Introduction to dplyr:

When working with data you must:

  • Figure out what you want to do.

  • Describe those tasks in the form of a computer program.

  • Execute the program.

The dplyr package makes these steps fast and easy:

  • By constraining your options, it simplifies how you can think about common data manipulation tasks.

  • It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code.

  • It uses efficient data storage backends, so you spend less time waiting for the computer.

dplyr can work with data frames as is, but if you’re dealing with large data, it’s worthwhile to convert them to a tbl_df (as done above): this is a wrapper around a data frame that won’t accidentally print a lot of data to the screen.

The dplyr-package aims to provide a function for each basic verb of data manipulation:

The following structure of the text is inspired by the above cited dplyr-tutorial. Any sentences that are taken directly from that text are quoted.

4.1 Select columns with select()

“Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset.”

Select columns by name:

select(fish, year, length, sex)
#> Source: local data frame [123,252 x 3]
#> 
#>     year length    sex
#>    (int)  (int) (fctr)
#> 1   1987     55      F
#> 2   1987     38      M
#> 3   1987     45      M
#> 4   1987     21      M
#> ..   ...    ...    ...

Note that in the above code the results is just passed to the console. I.e. the subset of the fish data is not stored as object in your working environment. For that one needs to do:

fish_subset <- select(fish, year, length, sex)

In this document we will only resort to the latter when needed.

Select all columns between length and ungutted weight (inclusive)

select(fish, sex:ungutted.weight)
#> Source: local data frame [123,252 x 5]
#> 
#>       sex maturity_stage   age length ungutted.weight
#>    (fctr)          (int) (int)  (int)           (dbl)
#> 1       F              1    NA     55            1325
#> 2       M              1    NA     38             490
#> 3       M              1    NA     45             790
#> 4       M              1    NA     21              70
#> ..    ...            ...   ...    ...             ...

Select all columns except those from sex to species (inclusive)

select(fish, -(sex:species))
#> Source: local data frame [123,252 x 8]
#> 
#>    sample.id  year month   day station.id     lat      lon depth
#>        (int) (int) (int) (int)      (int)   (dbl)    (dbl) (dbl)
#> 1      36091  1987     3    11         22 66.2167 -19.1583 103.5
#> 2      36091  1987     3    11         22 66.2167 -19.1583 103.5
#> 3      36091  1987     3    11         22 66.2167 -19.1583 103.5
#> 4      36091  1987     3    11         22 66.2167 -19.1583 103.5
#> ..       ...   ...   ...   ...        ...     ...      ...   ...
A note on piping %>%

In the select function the first argument is the dataframe you want to manipulate (here the fish dataframe). Using the piping function (%>%) one could have written the above as:

fish %>% select(year, length, sex)
fish %>% select(sex:ungutted.weight)
fish %>% select(-(sex:species))

We strongly urge you to adapt the pipe convention. And this will be adhered to as much as possible in what follows

“There are a number of helper functions you can use within select(), like starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.”

4.2 Rename columns with rename()

Renaming columns (results not shown):

fish %>% rename(unguttedWeightInGrams_becauseILikeLongNames = ungutted.weight)

Combining the rename and the select function:

fish %>%
  rename(unguttedWeightInGrams_becauseILikeLongNames = ungutted.weight) %>%
  select(sample.id, length, unguttedWeightInGrams_becauseILikeLongNames)
#> Source: local data frame [123,252 x 3]
#> 
#>    sample.id length unguttedWeightInGrams_becauseILikeLongNames
#>        (int)  (int)                                       (dbl)
#> 1      36091     55                                        1325
#> 2      36091     38                                         490
#> 3      36091     45                                         790
#> 4      36091     21                                          70
#> ..       ...    ...                                         ...

In the above case you had in the second line to type in the long name for ungutted weight. If you are a minimalist you would do:

fish %>%
  rename(uW = ungutted.weight) %>%
  select(length, uW)
#> Source: local data frame [123,252 x 2]
#> 
#>    length    uW
#>     (int) (dbl)
#> 1      55  1325
#> 2      38   490
#> 3      45   790
#> 4      21    70
#> ..    ...   ...

4.3 Filter rows with filter()

filter() allows you to select a subset of rows in a data frame. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame:

For example we can select all data from year 2003 with:

fish %>% filter(year == 2003)
#> Source: local data frame [4,132 x 17]
#> 
#>    sample.id  year month   day station.id     lat      lon depth    sex
#>        (int) (int) (int) (int)      (int)   (dbl)    (dbl) (dbl) (fctr)
#> 1     203413  2003     3     4          1 66.7425 -23.2717   126      M
#> 2     203413  2003     3     4          1 66.7425 -23.2717   126      M
#> 3     203413  2003     3     4          1 66.7425 -23.2717   126      M
#> 4     203413  2003     3     4          1 66.7425 -23.2717   126      F
#> ..       ...   ...   ...   ...        ...     ...      ...   ...    ...
#> Variables not shown: maturity_stage (int), age (int), length (int),
#>   ungutted.weight (dbl), gutted.weight (dbl), liver.weight (dbl),
#>   gonad.weight (dbl), species (int)

This is equivalent to (results not shown):

fish %>% filter(year %in% c(2003))

If we want to select for more years we could do (here both year 2002 and 2004 (results not shown):

fish %>% filter(year == 2002 | year == 2003)

The | stands for or. An alternative could be (results not shown):

fish %>% filter(year %in% c(2002:2003))

Another example, we can select all female fish from 2003 with:

fish %>% filter(year == 2003, sex == "F")
#> Source: local data frame [1,976 x 17]
#> 
#>    sample.id  year month   day station.id     lat      lon depth    sex
#>        (int) (int) (int) (int)      (int)   (dbl)    (dbl) (dbl) (fctr)
#> 1     203413  2003     3     4          1 66.7425 -23.2717 126.0      F
#> 2     203415  2003     3     4          2 66.8875 -23.0175 209.5      F
#> 3     203415  2003     3     4          2 66.8875 -23.0175 209.5      F
#> 4     203657  2003     3     4          6 66.9666 -23.5927 214.5      F
#> ..       ...   ...   ...   ...        ...     ...      ...   ...    ...
#> Variables not shown: maturity_stage (int), age (int), length (int),
#>   ungutted.weight (dbl), gutted.weight (dbl), liver.weight (dbl),
#>   gonad.weight (dbl), species (int)

filter() works similarly to subset() except that you can give it any number of filtering conditions, which are joined together with & (not && which is easy to do accidentally!). You can also use other boolean operators:

x <- fish %>% filter(year == c(2001:2004) | sex == "F")

The | stands for or, so in the above we select all data for years 2001 to 2004 but only data for females for the other years. We can visualize this by using the table function:

table(x$year, x$sex)
#>       
#>           F    M
#>   1985 1369    0
#>   1986 1570    0
#>   1987 1494    0
#>   1988 1941    0
#>   1989 2107    0
#>   1990 1534    0
#>   1991 1701    0
#>   1992 1432    0
#>   1993 1530    0
#>   1994 1386    0
#>   1995 1564    0
#>   1996 2356    0
#>   1997 2070    0
#>   1998 2001    0
#>   1999 2029    0
#>   2000 2024    0
#>   2001 1456  360
#>   2002 2020  542
#>   2003 1976  495
#>   2004 2051  522
#>   2005 2060    0
#>   2006 1894    0
#>   2007 1813    0
#>   2008 2071    0
#>   2009 2222    0
#>   2010 2195    0
#>   2011 2299    0
#>   2012 2396    0
#>   2013 2546    0
#>   2014 2317    0
#>   2015 2619    0

4.4 Filter rows with slice()

To select rows by position, use slice():

fish %>% slice(1001:1002)
#> Source: local data frame [2 x 17]
#> 
#>   sample.id  year month   day station.id    lat     lon depth    sex
#>       (int) (int) (int) (int)      (int)  (dbl)   (dbl) (dbl) (fctr)
#> 1     43642  1989     3     5         22 64.627 -22.772    59      M
#> 2     43642  1989     3     5         22 64.627 -22.772    59      M
#> Variables not shown: maturity_stage (int), age (int), length (int),
#>   ungutted.weight (dbl), gutted.weight (dbl), liver.weight (dbl),
#>   gonad.weight (dbl), species (int)

The above selects rows 1001 and 1002 from the total rows of 123252. Have not found a lot of use for this yet.

4.5 Reorder rows with arrange()

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them (equivalent to sort in Excel). It takes a data frame, and a set of column names (or more complicated expressions) to order by.

fish %>% arrange(length)
#> Source: local data frame [123,252 x 17]
#> 
#>    sample.id  year month   day station.id     lat      lon depth    sex
#>        (int) (int) (int) (int)      (int)   (dbl)    (dbl) (dbl) (fctr)
#> 1      30096  1985     3    20         88 65.1839 -12.8369 157.0     NA
#> 2      56504  1993     3     5         11 65.1910 -23.3042 108.0     NA
#> 3      56723  1993     3    16        117 66.2662 -14.6382  78.5     NA
#> 4     140806  2000     3     8          8 65.5108 -12.8134 129.5     NA
#> ..       ...   ...   ...   ...        ...     ...      ...   ...    ...
#> Variables not shown: maturity_stage (int), age (int), length (int),
#>   ungutted.weight (dbl), gutted.weight (dbl), liver.weight (dbl),
#>   gonad.weight (dbl), species (int)

Here the data are reordered by length. Hence the top 4 lines in the output “shows” the smallest fish measured. Actually you do not “see” the smallest length here (because the variable length is included in “Variables not shown”). If we combine arrange with select we get a better picture:

fish %>% arrange(length) %>%
  select(year, month, day, length)
#> Source: local data frame [123,252 x 4]
#> 
#>     year month   day length
#>    (int) (int) (int)  (int)
#> 1   1985     3    20      7
#> 2   1993     3     5      8
#> 3   1993     3    16      8
#> 4   2000     3     8      8
#> ..   ...   ...   ...    ...

So the smallest fish measured is 7 cm, measured on the date 20.3.1985. And then we have a bunch of 8 cm fish.

Use desc() to order a column in descending order:

fish %>% arrange(desc(length)) %>%
  select(year, month, day, length)
#> Source: local data frame [123,252 x 4]
#> 
#>     year month   day length
#>    (int) (int) (int)  (int)
#> 1   1989     3    14    148
#> 2   1986     3     9    147
#> 3   1994     3     5    145
#> 4   1998     3     7    145
#> ..   ...   ...   ...    ...

This shows the top 4 largest fish measured by date. By the way check out the help file for desc by typing ?desc.

So instead of using desc we could instead put a minus sign in front of the variable length to get the same results (not shown):

fish %>% arrange(-length) %>%
  select(year, month, day, length)

4.6 Select distinct/unique rows distinct()

" A common use of select() is to find the values of a set of variables. This is particularly useful in conjunction with the distinct() verb which only returns the unique values in a table."

x <- fish %>%
  select(year) %>%
  distinct() %>%
  arrange()
dim(x)
#> [1] 31  1
x$year
#>  [1] 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
#> [15] 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
#> [29] 2015 1985 1986

In the above we have selected column ‘year’ and extracted only the unique values of year (31 unique year values are in the dataframe fish) and sorted the data in an ascending order (1987 - 2015). Similarily we could select the unique dates:

fish %>%
  select(year, month, day) %>%
  distinct() %>%
  arrange(-year, -month, -day)

Here, after selecting distinct dates we arrange the data in a descending order and hence the top lines show the last dates of the survey.

4.7 Add new variables and preserves existing varibles with mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate(). Lets calculate the weights from length using some constants:

d <- fish %>%
  select(length, ungutted.weight) %>%
  mutate(derived.weight = 0.01 * length^3)
d
#> Source: local data frame [123,252 x 3]
#> 
#>    length ungutted.weight derived.weight
#>     (int)           (dbl)          (dbl)
#> 1      55            1325        1663.75
#> 2      38             490         548.72
#> 3      45             790         911.25
#> 4      21              70          92.61
#> ..    ...             ...            ...
ggplot(d, aes(x = length)) +
  geom_point(aes(y = ungutted.weight), colour = "blue", size = 2, alpha = 0.2) + 
  geom_line(aes(y = derived.weight), colour = "red", lwd = 1) +
  labs(x = "Length [cm]", y = "Ungutted weight", title = "Observed and derived weights")
#> Warning: Removed 24533 rows containing missing values (geom_point).

4.8 Add new variable and drop exisiting variables with transmute()

pending …

4.9 Summarise multiple values to a single value summarise()

pending …

4.10 Sample n rows from a table with sample_n()

pending …

4.11 Sample n rows from a table with sample_frac()

pending …