select()rename()filter()slice()arrange()distinct()mutate()transmute()summarise()sample_n()sample_frac()To explore the basic data manipulation in R, we’ll use some fishy data. The dataset contains measurments of 123252 fish of a particular species.
To do: Describe in more detail …
Needed libraries:
library(dplyr)
library(ggplot2)
Read in the data and take a little peak:
fish <- read.csv("http://www.hafro.is/~einarhj/data/fish.csv") %>%
tbl_df()
dim(fish)
#> [1] 123252 17
glimpse(fish)
#> Observations: 123,252
#> Variables: 17
#> $ sample.id (int) 36091, 36091, 36091, 36091, 36091, 36091, 3609...
#> $ year (int) 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987...
#> $ month (int) 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
#> $ day (int) 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11...
#> $ station.id (int) 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22...
#> $ lat (dbl) 66.2167, 66.2167, 66.2167, 66.2167, 66.2167, 6...
#> $ lon (dbl) -19.1583, -19.1583, -19.1583, -19.1583, -19.15...
#> $ depth (dbl) 103.5, 103.5, 103.5, 103.5, 103.5, 103.5, 103....
#> $ sex (fctr) F, M, M, M, M, F, M, M, F, M, F, F, F, F, M, ...
#> $ maturity_stage (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1...
#> $ age (int) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#> $ length (int) 55, 38, 45, 21, 46, 18, 32, 31, 32, 72, 44, 51...
#> $ ungutted.weight (dbl) 1325, 490, 790, 70, 900, 45, 245, 225, 270, 31...
#> $ gutted.weight (dbl) 1190, 445, 705, 65, 810, 30, 230, 210, 245, 28...
#> $ liver.weight (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#> $ gonad.weight (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#> $ species (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
min(fish$length)
#> [1] 7
max(fish$length)
#> [1] 148
range(fish$length)
#> [1] 7 148
median(fish$length)
#> [1] 59
quantile(fish$length)
#> 0% 25% 50% 75% 100%
#> 7 43 59 73 148
mean(fish$length)
#> [1] 58.292
summary(fish$length)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 7.00 43.00 59.00 58.29 73.00 148.00
sd(fish$length)
#> [1] 21.73653
var(fish$length)
#> [1] 472.4766
Because the fish data set is rather large, we will here work with only a subset of the data, lets take e.g. only data from year 2015 (the function filter will be covered in more details later in this document):
d <- fish %>% filter(year == 2015)
dim(d)
#> [1] 5734 17
## create a empty plot
p <- ggplot()
## add layer to the plot
p + layer(data=d, mapping=aes(x=length, y=ungutted.weight),
geom = 'point', stat='identity', position = 'identity')
#> Warning: Removed 12 rows containing missing values (geom_point).
The full specification of a layer is rather cumbersome to write out as each component of a layer needs to specified. It is therefor more common to use a shorthand for common types of plots. Equivalent plot to the above can be obtained using the following code:
p + geom_point(data = d, aes(length, ungutted.weight))
There are numerous ways to achieved the same results as above:
ggplot() + geom_point(data = d, aes(x = length, y = ungutted.weight))
ggplot(data = d) + geom_point(aes(x = length, y = ungutted.weight))
ggplot(d) + geom_point(aes(length, ungutted.weight))
p <- ggplot(d)
p + geom_point(aes(length, ungutted.weight))
Some basic controls:
p <- ggplot(d, aes(length, ungutted.weight))
p1 <- p + geom_point(colour = "red",
size = 2,
alpha = 0.1)
p1
Adding labels:
p2 <- p1 + labs(x = "Length [cm]",
y = "Weight [g]",
title = "Length weight relationship")
p2
#> Warning: Removed 12 rows containing missing values (geom_point).
Add a smoother:
p2 + stat_smooth(aes(length, ungutted.weight), lwd = 1)
#> geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
#> Warning: Removed 12 rows containing missing values (stat_smooth).
#> Warning: Removed 12 rows containing missing values (geom_point).
p <- ggplot(d, aes(length))
p1 <- p + geom_histogram(binwidth = 1)
p1
p1 + geom_histogram(aes(fill = factor(age)), binwidth = 1)
Plotting as facet:
p1 + geom_histogram(binwidth = 1) + facet_grid(age ~ .)
p1 + geom_histogram(binwidth = 1) + facet_grid(age ~ ., scale="free_y")
p1 + geom_histogram(binwidth = 1) + facet_grid(. ~ age) +
coord_flip()
d <- d %>% filter(age <= 10)
p <- ggplot(d, aes(factor(age), length)) + geom_boxplot()
p
# add the raw data
p + geom_point()
p1 <- p + geom_jitter(alpha = 0.1, colour = "red")
p1
p1 + geom_violin(alpha = 0.2, fill = "blue", scale = "width")
In the following tutorial we are using functions in the dplyr-package. As is stated in the Introduction to dplyr:
When working with data you must:
Figure out what you want to do.
Describe those tasks in the form of a computer program.
Execute the program.
The dplyr package makes these steps fast and easy:
By constraining your options, it simplifies how you can think about common data manipulation tasks.
It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code.
It uses efficient data storage backends, so you spend less time waiting for the computer.
dplyr can work with data frames as is, but if you’re dealing with large data, it’s worthwhile to convert them to a
tbl_df(as done above): this is a wrapper around a data frame that won’t accidentally print a lot of data to the screen.
The dplyr-package aims to provide a function for each basic verb of data manipulation:
select(): Selects variable(s) (column(s)) by namerename(): Renames variable(s) (column(s))filter(): Returns row(s) with matching conditionsslice(): Selects row(s) by position - normally not very usefularrange(): Arrange rows by variables. Equivalent to “sort”distinct(): Select distinct/unique rowsmutate(): Add new variables and preserves existing variblestransmute(): Add new variable but drop exisiting variablessummarise(): Summarise multiple values to a single valuesample_n(): Sample n rows from a tablesample_frac(): Sample n rows from a tableThe following structure of the text is inspired by the above cited dplyr-tutorial. Any sentences that are taken directly from that text are quoted.
select()“Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset.”
Select columns by name:
select(fish, year, length, sex)
#> Source: local data frame [123,252 x 3]
#>
#> year length sex
#> (int) (int) (fctr)
#> 1 1987 55 F
#> 2 1987 38 M
#> 3 1987 45 M
#> 4 1987 21 M
#> .. ... ... ...
Note that in the above code the results is just passed to the console. I.e. the subset of the fish data is not stored as object in your working environment. For that one needs to do:
fish_subset <- select(fish, year, length, sex)
In this document we will only resort to the latter when needed.
Select all columns between length and ungutted weight (inclusive)
select(fish, sex:ungutted.weight)
#> Source: local data frame [123,252 x 5]
#>
#> sex maturity_stage age length ungutted.weight
#> (fctr) (int) (int) (int) (dbl)
#> 1 F 1 NA 55 1325
#> 2 M 1 NA 38 490
#> 3 M 1 NA 45 790
#> 4 M 1 NA 21 70
#> .. ... ... ... ... ...
Select all columns except those from sex to species (inclusive)
select(fish, -(sex:species))
#> Source: local data frame [123,252 x 8]
#>
#> sample.id year month day station.id lat lon depth
#> (int) (int) (int) (int) (int) (dbl) (dbl) (dbl)
#> 1 36091 1987 3 11 22 66.2167 -19.1583 103.5
#> 2 36091 1987 3 11 22 66.2167 -19.1583 103.5
#> 3 36091 1987 3 11 22 66.2167 -19.1583 103.5
#> 4 36091 1987 3 11 22 66.2167 -19.1583 103.5
#> .. ... ... ... ... ... ... ... ...
%>%
In the select function the first argument is the dataframe you want to manipulate (here the fish dataframe). Using the piping function (%>%) one could have written the above as:
fish %>% select(year, length, sex)
fish %>% select(sex:ungutted.weight)
fish %>% select(-(sex:species))
We strongly urge you to adapt the pipe convention. And this will be adhered to as much as possible in what follows
“There are a number of helper functions you can use within select(), like starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.”
rename()Renaming columns (results not shown):
fish %>% rename(unguttedWeightInGrams_becauseILikeLongNames = ungutted.weight)
Combining the rename and the select function:
fish %>%
rename(unguttedWeightInGrams_becauseILikeLongNames = ungutted.weight) %>%
select(sample.id, length, unguttedWeightInGrams_becauseILikeLongNames)
#> Source: local data frame [123,252 x 3]
#>
#> sample.id length unguttedWeightInGrams_becauseILikeLongNames
#> (int) (int) (dbl)
#> 1 36091 55 1325
#> 2 36091 38 490
#> 3 36091 45 790
#> 4 36091 21 70
#> .. ... ... ...
In the above case you had in the second line to type in the long name for ungutted weight. If you are a minimalist you would do:
fish %>%
rename(uW = ungutted.weight) %>%
select(length, uW)
#> Source: local data frame [123,252 x 2]
#>
#> length uW
#> (int) (dbl)
#> 1 55 1325
#> 2 38 490
#> 3 45 790
#> 4 21 70
#> .. ... ...
filter()filter() allows you to select a subset of rows in a data frame. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame:
For example we can select all data from year 2003 with:
fish %>% filter(year == 2003)
#> Source: local data frame [4,132 x 17]
#>
#> sample.id year month day station.id lat lon depth sex
#> (int) (int) (int) (int) (int) (dbl) (dbl) (dbl) (fctr)
#> 1 203413 2003 3 4 1 66.7425 -23.2717 126 M
#> 2 203413 2003 3 4 1 66.7425 -23.2717 126 M
#> 3 203413 2003 3 4 1 66.7425 -23.2717 126 M
#> 4 203413 2003 3 4 1 66.7425 -23.2717 126 F
#> .. ... ... ... ... ... ... ... ... ...
#> Variables not shown: maturity_stage (int), age (int), length (int),
#> ungutted.weight (dbl), gutted.weight (dbl), liver.weight (dbl),
#> gonad.weight (dbl), species (int)
This is equivalent to (results not shown):
fish %>% filter(year %in% c(2003))
If we want to select for more years we could do (here both year 2002 and 2004 (results not shown):
fish %>% filter(year == 2002 | year == 2003)
The | stands for or. An alternative could be (results not shown):
fish %>% filter(year %in% c(2002:2003))
Another example, we can select all female fish from 2003 with:
fish %>% filter(year == 2003, sex == "F")
#> Source: local data frame [1,976 x 17]
#>
#> sample.id year month day station.id lat lon depth sex
#> (int) (int) (int) (int) (int) (dbl) (dbl) (dbl) (fctr)
#> 1 203413 2003 3 4 1 66.7425 -23.2717 126.0 F
#> 2 203415 2003 3 4 2 66.8875 -23.0175 209.5 F
#> 3 203415 2003 3 4 2 66.8875 -23.0175 209.5 F
#> 4 203657 2003 3 4 6 66.9666 -23.5927 214.5 F
#> .. ... ... ... ... ... ... ... ... ...
#> Variables not shown: maturity_stage (int), age (int), length (int),
#> ungutted.weight (dbl), gutted.weight (dbl), liver.weight (dbl),
#> gonad.weight (dbl), species (int)
filter() works similarly to subset() except that you can give it any number of filtering conditions, which are joined together with & (not && which is easy to do accidentally!). You can also use other boolean operators:
x <- fish %>% filter(year == c(2001:2004) | sex == "F")
The | stands for or, so in the above we select all data for years 2001 to 2004 but only data for females for the other years. We can visualize this by using the table function:
table(x$year, x$sex)
#>
#> F M
#> 1985 1369 0
#> 1986 1570 0
#> 1987 1494 0
#> 1988 1941 0
#> 1989 2107 0
#> 1990 1534 0
#> 1991 1701 0
#> 1992 1432 0
#> 1993 1530 0
#> 1994 1386 0
#> 1995 1564 0
#> 1996 2356 0
#> 1997 2070 0
#> 1998 2001 0
#> 1999 2029 0
#> 2000 2024 0
#> 2001 1456 360
#> 2002 2020 542
#> 2003 1976 495
#> 2004 2051 522
#> 2005 2060 0
#> 2006 1894 0
#> 2007 1813 0
#> 2008 2071 0
#> 2009 2222 0
#> 2010 2195 0
#> 2011 2299 0
#> 2012 2396 0
#> 2013 2546 0
#> 2014 2317 0
#> 2015 2619 0
slice()To select rows by position, use slice():
fish %>% slice(1001:1002)
#> Source: local data frame [2 x 17]
#>
#> sample.id year month day station.id lat lon depth sex
#> (int) (int) (int) (int) (int) (dbl) (dbl) (dbl) (fctr)
#> 1 43642 1989 3 5 22 64.627 -22.772 59 M
#> 2 43642 1989 3 5 22 64.627 -22.772 59 M
#> Variables not shown: maturity_stage (int), age (int), length (int),
#> ungutted.weight (dbl), gutted.weight (dbl), liver.weight (dbl),
#> gonad.weight (dbl), species (int)
The above selects rows 1001 and 1002 from the total rows of 123252. Have not found a lot of use for this yet.
arrange()arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them (equivalent to sort in Excel). It takes a data frame, and a set of column names (or more complicated expressions) to order by.
fish %>% arrange(length)
#> Source: local data frame [123,252 x 17]
#>
#> sample.id year month day station.id lat lon depth sex
#> (int) (int) (int) (int) (int) (dbl) (dbl) (dbl) (fctr)
#> 1 30096 1985 3 20 88 65.1839 -12.8369 157.0 NA
#> 2 56504 1993 3 5 11 65.1910 -23.3042 108.0 NA
#> 3 56723 1993 3 16 117 66.2662 -14.6382 78.5 NA
#> 4 140806 2000 3 8 8 65.5108 -12.8134 129.5 NA
#> .. ... ... ... ... ... ... ... ... ...
#> Variables not shown: maturity_stage (int), age (int), length (int),
#> ungutted.weight (dbl), gutted.weight (dbl), liver.weight (dbl),
#> gonad.weight (dbl), species (int)
Here the data are reordered by length. Hence the top 4 lines in the output “shows” the smallest fish measured. Actually you do not “see” the smallest length here (because the variable length is included in “Variables not shown”). If we combine arrange with select we get a better picture:
fish %>% arrange(length) %>%
select(year, month, day, length)
#> Source: local data frame [123,252 x 4]
#>
#> year month day length
#> (int) (int) (int) (int)
#> 1 1985 3 20 7
#> 2 1993 3 5 8
#> 3 1993 3 16 8
#> 4 2000 3 8 8
#> .. ... ... ... ...
So the smallest fish measured is 7 cm, measured on the date 20.3.1985. And then we have a bunch of 8 cm fish.
Use desc() to order a column in descending order:
fish %>% arrange(desc(length)) %>%
select(year, month, day, length)
#> Source: local data frame [123,252 x 4]
#>
#> year month day length
#> (int) (int) (int) (int)
#> 1 1989 3 14 148
#> 2 1986 3 9 147
#> 3 1994 3 5 145
#> 4 1998 3 7 145
#> .. ... ... ... ...
This shows the top 4 largest fish measured by date. By the way check out the help file for desc by typing ?desc.
So instead of using desc we could instead put a minus sign in front of the variable length to get the same results (not shown):
fish %>% arrange(-length) %>%
select(year, month, day, length)
distinct()" A common use of select() is to find the values of a set of variables. This is particularly useful in conjunction with the distinct() verb which only returns the unique values in a table."
x <- fish %>%
select(year) %>%
distinct() %>%
arrange()
dim(x)
#> [1] 31 1
x$year
#> [1] 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
#> [15] 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
#> [29] 2015 1985 1986
In the above we have selected column ‘year’ and extracted only the unique values of year (31 unique year values are in the dataframe fish) and sorted the data in an ascending order (1987 - 2015). Similarily we could select the unique dates:
fish %>%
select(year, month, day) %>%
distinct() %>%
arrange(-year, -month, -day)
Here, after selecting distinct dates we arrange the data in a descending order and hence the top lines show the last dates of the survey.
mutate()Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate(). Lets calculate the weights from length using some constants:
d <- fish %>%
select(length, ungutted.weight) %>%
mutate(derived.weight = 0.01 * length^3)
d
#> Source: local data frame [123,252 x 3]
#>
#> length ungutted.weight derived.weight
#> (int) (dbl) (dbl)
#> 1 55 1325 1663.75
#> 2 38 490 548.72
#> 3 45 790 911.25
#> 4 21 70 92.61
#> .. ... ... ...
ggplot(d, aes(x = length)) +
geom_point(aes(y = ungutted.weight), colour = "blue", size = 2, alpha = 0.2) +
geom_line(aes(y = derived.weight), colour = "red", lwd = 1) +
labs(x = "Length [cm]", y = "Ungutted weight", title = "Observed and derived weights")
#> Warning: Removed 24533 rows containing missing values (geom_point).
transmute()pending …
summarise()pending …
sample_n()pending …
sample_frac()pending …