Ben Bond-Lamberty
October 2018
A short workshop covering reproducibility and data management; data reshaping; and summarizing and manipulation.
University of Delaware
gapminder
dataset)babynames
dataset)Feedback: bondlamberty@pnnl.gov or @BenBondLamberty.
This workshop assumes an intermediate knowledge of R.
If you want to do the hands-on exercises (encouraged!), make sure up-to-date versions of the following packages are installed:
dplyr
tidyr
gapminder
babynames
Note this is effectively a particular dialect of R, but principles are broadly applicable.
We are in the era of collaborative 'big data', but even if you work by yourself with 'little data' you have to have some skills to deal with those data.
Most fundamentally, your results have to be reproducible.
Your most important collaborator is your future self. It’s important to make a workflow that you can use time and time again, and even pass on to others in such a way that you don’t have to be there to walk them through it. Source
Even if you don't buy it, prepare yourself for the future. Funders, journals, governments, colleagues are all pushing for more reproducibility and openness. It's a slow but steady ratchet.
NSF, DOE, Wellcome, Gates, etc. are increasingly requiring data management plans; data deposition; publication in open-access journals.
Please ensure the data shown in all figures, and supporting all main results, is publicly available, describing this is in the text. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.
Reproducibility generally means scripts tied to open source software with effective data management and archiving.
Scripts provide an auditable, reproducible record of what you did.
…what doesn't exist.
Gozilla ate my computer!
Godzilla destroyed my office!!!!!!
…what you've lost. What if you need access to a file as it existed 1, 10, or 100, or 1000 days ago?
Git (and website GitHub) are the most popular version control tools for use with R, and many other languages:
As a recent paper noted, version control is not easy enough yet. It needs to get better.
Version control and scripts address two of the biggest problems with managing data: tracking changes over time, and understanding/reproducing analytical steps.
Ideally, every step in your analysis is programmatic–done by a script–so it can be 'read': understood and reproduced later.
From Sandve et al. (2013), Ten Simple Rules for Reproducible Computational Research.
Don't let the perfect be the enemy of the good. Upgrade and improve your workflow and skills over time.
Organizing analyses so that they are reproducible is not easy. It requires diligence and a considerable investment of time: to learn new computational tools, and to organize and document analyses as you go.
But partially reproducible is better than not at all reproducible. Just try to make your next paper or project better organized than the last.
A great and practical guide: http://kbroman.org/steps2rr/
A typical project/paper directory for me:
1-download.R
2-prepdata.R
3-analyze_data.R
4-make_graphs.R
5-manuscript.R (perhaps)
logs/
output/
rawdata/
This directory contains scripts that are backed up both locally and remotely. It is under version control, so it's easy to track changes over time.
Mon Mar 6 09:12:49 2017 Opening outputs//2-prepdata/2-prepdata.R.log.txt
Mon Mar 6 09:12:49 2017 Welcome to 2-prepdata.R
...
Mon Mar 6 12:24:21 2017 All done 2-prepdata.R
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] babynames_0.3.0
loaded via a namespace (and not attached):
[1] assertthat_0.1 tools_3.3.3 tibble_1.2 Rcpp_0.12.9
Vines et al. (2014) published a shocking finding, based on a survey of 516 biology articles from 2 to 22 years old:
The odds of a data set being available post-publication fell by 17% each year, and the chances that the contact author’s email address still worked declined by 7% per year.
Data loss hits ecosystem, soil, and global change ecology particularly hard: climate changes make ecological data effectively irreproducible.
In my opinion, many repositories and archives have made it way too hard to deposit data.
Please fill out this 100-item questionnaire, entering carbon flux numbers in some units you've never used before. Also, you need to put your data into our required format, which is going to be a complete PITA, and write up a 1000-word metadata file in Aramaic.
I exaggerate only slightly.
Again…don't let the perfect be the enemy of the good.
Be aware of 'unstructured' data repositories like GitHub (intended primarily for code development, not permanent deposition), figshare (super easy, gives instant DOIs), etc.
Far better the data be available permanently, however imperfectly they're formatted or described (though those things are good), than lost forever.
This has the promise to
See for example BAAD, SRDB, TRY, FRED, FLUXNET!
These vary in their degree of structure, centralization, and openness, but are all hugely better than nothing.
If you're doing the exercises and problems, you'll need these packages:
dplyr
- fast, flexible tool for working with data framestidyr
- reshaping and cleaning dataggplot2
- popular package for visualizing dataWe'll also use these data package:
babynames
- names provided to the SSA 1880-2013gapminder
- life expectancy, GDP per capita, and population for 142 countriesIn honor of the late Hans Rosling, we'll use the gapminder
dataset today.
library(dplyr)
library(gapminder)
gapminder
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779
2 Afghanistan Asia 1957 30.3 9240934 821
3 Afghanistan Asia 1962 32.0 10267083 853
4 Afghanistan Asia 1967 34.0 11537966 836
5 Afghanistan Asia 1972 36.1 13079460 740
6 Afghanistan Asia 1977 38.4 14880372 786
7 Afghanistan Asia 1982 39.9 12881816 978
8 Afghanistan Asia 1987 40.8 13867957 852
9 Afghanistan Asia 1992 41.7 16317921 649
10 Afghanistan Asia 1997 41.8 22227415 635
# ... with 1,694 more rows
The magrittr
package (used by both dplyr
and tidyr
) provides the %>%
operator, which allows us to pipe an object forward into a function or call expression.
Note that x %>% f
is usually equivalent to f(x)
.
print(gapminder)
gapminder %>% print
gagminder %>% head
gapminder %>% head(n=20)
gapminder %>%
print %>%
summary # what is non-piped equivalent?
summary(print(gapminder))
By default, the left hand expression is put in as the first argument of the right hand expression (rhs). But you can put it into any other position too:
library(ggplot2)
gapminder %>% qplot(gdpPercap, lifeExp, data=., log="xy")
gapminder %>% qplot(gdpPercap, lifeExp, data=., log="xy", color=continent, size=pop)
gapminder %>% qplot(gdpPercap, lifeExp, data=., log="xy", color=year, size=pop)
help("%>%")
Notice when we print gapminder
only a bit of the data frame prints, whereas if we type cars
(a built-in dataset) everything scrolls off the screen. Examine:
class(cars)
[1] "data.frame"
class(gapminder)
[1] "tbl_df" "tbl" "data.frame"
Tibbles are a re-imagining of R's venerable data.frame
for more convenience and speed:
For more information, see ?tibble::tibble
.
The dplyr
package uses verbs (functions) to operate on tibbles (data frames).
Very commonly used.
gapminder %>% filter(country == "Egypt")
gapminder %>% filter(country == "Egypt", year > 2000)
IMPORTANT NOTE AT THIS POINT. We have dplyr::filter
but there's also stats::filter
in base R; similar confusing overlaps can exist for other dplyr
verbs too. (This is a more general problem, that of namespace collision.)
Either load dplyr
last or specify (::
) which function you want to use.
Also extremely useful. Note different notations for selecting columns:
select(gapminder, pop, year)
gapminder %>% select(pop, year)
gapminder %>% select(-lifeExp, -gdpPercap)
gapminder %>% select(-1)
There are lots of other cool ways to select columns–see ?select
.
Let's focus on a single country's data for a bit. Write a pipeline that picks out Egypt data only, removes the continent and country columns, and assigns the result to a variable Egypt
.
gapminder %>%
filter(country == "Egypt") %>%
select(-continent, -country) ->
Egypt
Put this into long format–where every row is a different observation. For this we use tidyr::gather
, which asks: what's the data source? Name of variable column (i.e. that will get old names of columns)? Name of data column? And what columns to operate on?
library(tidyr)
Egypt %>% gather(variable, value, lifeExp, pop, gdpPercap)
# A tibble: 36 x 3
year variable value
<int> <chr> <dbl>
1 1952 lifeExp 41.9
2 1957 lifeExp 44.4
3 1962 lifeExp 47.0
4 1967 lifeExp 49.3
5 1972 lifeExp 51.1
6 1977 lifeExp 53.3
7 1982 lifeExp 56.0
8 1987 lifeExp 59.8
9 1992 lifeExp 63.7
10 1997 lifeExp 67.2
# ... with 26 more rows
library(ggplot2)
Egypt %>%
gather(variable, value, -year) %>%
qplot(year, value, data=., geom="line") +
facet_wrap(~variable, scales="free")
Experiment. Why do these do what they do?
Egypt %>% gather(variable, value, lifeExp)
Egypt %>% gather(variable, value, -lifeExp)
Why?
We can also spread our data out into a table form, like what you'd see in a spreadsheet, using spread
:
Egypt %>%
gather(variable, value, -year) %>%
spread(year, value)
spread
is easy. It asks,
These functions can be very useful.
gapminder %>% unite(coco, country, continent)
gapminder %>%
unite(coco, country, continent) %>%
separate(coco, into = c("country", "continent"), sep="_", extra="merge")
gapminder %>% mutate(logpop = log(pop))
gapminder %>% rename(population = pop)
Thinking back to the typical data pipeline, we often want to summarize data by groups as an intermediate or final step. For example, for each subgroup we might want to:
n
->1)n
->n
)Specific examples:
gapminder
: what's the year of maximum GDP for each country?babynames
: what's the most common name over time?These are generally known as split-apply-combine problems.
The newer dplyr
package specializes in data frames, but also allows you to work with remote, out-of-memory databases, using exactly the same tools, because it abstracts away how your data is stored.
dplyr
is extremely fast for most, though not all, operations on data frames.
dplyr
provides functions for each basic verb of data manipulation. These tend to have analogues in base R, but use a consistent, compact syntax, and are very high performance.
filter()
- subset rows; like base::subset()
arrange()
- reorder rows; like order()
select()
- select columnsmutate()
- add new columnssummarise()
- like aggregate
dplyr
is ~10x faster than the older plyr
package. (And plyr
was ~10x faster than base R.)Why not?
data.table
package is also worth checking out for its speed.dplyr
verbs become particularly powerful when used in conjunction with groups we define in the dataset. This doesn't change the data but instead groups it, ready for the next operation we perform.
library(dplyr)
gapminder %>%
group_by(country)
# A tibble: 1,704 x 6
# Groups: country [142]
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779
2 Afghanistan Asia 1957 30.3 9240934 821
3 Afghanistan Asia 1962 32.0 10267083 853
4 Afghanistan Asia 1967 34.0 11537966 836
5 Afghanistan Asia 1972 36.1 13079460 740
6 Afghanistan Asia 1977 38.4 14880372 786
7 Afghanistan Asia 1982 39.9 12881816 978
8 Afghanistan Asia 1987 40.8 13867957 852
9 Afghanistan Asia 1992 41.7 16317921 649
10 Afghanistan Asia 1997 41.8 22227415 635
# ... with 1,694 more rows
gapminder %>%
group_by(country) %>%
summarise(max(pop))
# A tibble: 142 x 2
country `max(pop)`
<fct> <dbl>
1 Afghanistan 31889923
2 Albania 3600523
3 Algeria 33333216
4 Angola 12420476
5 Argentina 40301927
6 Australia 20434176
7 Austria 8199783
8 Bahrain 708573
9 Bangladesh 150448339
10 Belgium 10392226
# ... with 132 more rows
gapminder %>%
group_by(country) %>%
summarise(maxpop = max(pop))
# A tibble: 142 x 2
country maxpop
<fct> <dbl>
1 Afghanistan 31889923
2 Albania 3600523
3 Algeria 33333216
4 Angola 12420476
5 Argentina 40301927
6 Australia 20434176
7 Austria 8199783
8 Bahrain 708573
9 Bangladesh 150448339
10 Belgium 10392226
# ... with 132 more rows
We can apply a function to multiple columns, or multiple functions to a column (or both):
gapminder %>%
select(-continent, -year) %>%
group_by(country) %>%
summarise_all(max)
gapminder %>%
select(country, pop) %>%
group_by(country) %>%
summarise_all(max)
gapminder %>%
group_by(country) %>%
summarise_if(is.numeric, max)
We can apply a function to multiple columns, or multiple functions to a column (or both):
gapminder %>%
select(country, pop) %>%
group_by(country) %>%
summarise_all(funs(min, max, mean))
gapminder %>%
select(-year) %>%
group_by(country) %>%
summarise_if(is.numeric, funs(min, max, mean))
We can build up a long pipeline to, e.g., summarise min, mean, max for all numeric variables and end up with a table with min-mean-max as columns headers, and variable (gdpPercap, lifeExp, pop) rows.
gapminder %>%
select(-year) %>%
group_by(country) %>%
summarise_if(is.numeric, funs(min, max, mean)) %>%
gather(variable, value, -country) %>%
separate(variable, into=c("variable", "stat")) %>%
spread(stat, value)
Explore babynames
a bit. How many rows, columns does it have? How many unique names?
library(babynames)
babynames
# A tibble: 1,858,689 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
5 1880 F Minnie 1746 0.0179
6 1880 F Margaret 1578 0.0162
7 1880 F Ida 1472 0.0151
8 1880 F Alice 1414 0.0145
9 1880 F Bertha 1320 0.0135
10 1880 F Sarah 1288 0.0132
# ... with 1,858,679 more rows
What does this calculate?
babynames %>%
group_by(year, sex) %>%
summarise(prop = max(prop),
name = name[which.max(prop)])
# A tibble: 272 x 4
# Groups: year [?]
year sex prop name
<dbl> <chr> <dbl> <chr>
1 1880 F 0.0724 Mary
2 1880 M 0.0815 John
3 1881 F 0.0700 Mary
4 1881 M 0.0810 John
5 1882 F 0.0704 Mary
6 1882 M 0.0783 John
7 1883 F 0.0667 Mary
8 1883 M 0.0791 John
9 1884 F 0.0670 Mary
10 1884 M 0.0765 John
# ... with 262 more rows
Load the dataset using library(babynames)
.
Read its help page. Look at its structure (rows, columns, summary).
Use dplyr
to calculate the total number of names in the SSA database for each year. Hint: n()
.
Make a graph or table showing how popular YOUR name has been over time (either its proportion, or rank).
babynames %>%
filter(name == "Benjamin") %>%
qplot(year, n, color = sex, data = .)
babynames %>%
group_by(year, sex) %>%
mutate(rank = row_number(desc(n))) %>%
filter(name == "Benjamin") %>%
qplot(year, rank, color = sex, data = .)
The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians.
– Bow Cowgill
All the source code for this presentation is available at https://github.com/bpbond/R-data-picarro (under the delaware
branch)
R has many contributed packages across a wide variety of scientific fields. Almost anything you want to do will have packages to support it.
CRAN also provides “Task Views”. For example: