R: Core Skills for Reproducible Research

Maja Založnik

  1. & 8. March 2017

Outline

  • Reproducible Research
  • RStudio and project management
  • Good coding practices in R
  • Importing and cleaning data
  • Standard control structures
  • Vectorisation and apply functions
  • Writing your own functions
  • Data manipulation with dplyr
  • Piping/chaining commands
  • knitting

Rules

  • Don't forget to mention the rules!

Preparation

Go to this url: http://tinyurl.com/RCSRepRes

  • click on the green button Clone or download
  • select Download zip
  • save it to your PC
  • right click on the file and select Extract all
  • then open the RepResCoreSkillsR.Rproj file which should launch RStudio

RStudio

  • Best IDE for R
  • Project management
  • Version control
  • knitting and direct publishing
  • Interactive graphics

Project management

R project-level files:

  • project – .Rproj
  • workspace – .RData
  • history – .Rhistory
  • settings – .Rprofile

Scripts – .R

Data – .csv, .xls, .dat …

Figures

Presentation – .pdf, .doc, .html, (.tex)

Good Coding Practice

Janitor work

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

source: NY Times

  • Data formats and sources
  • tidyr package
    • spread() and gather()
    • separate() and unite()

Tidy Data

Tidy datasets are all alike but every messy dataset is messy in its own way. - Hadley Wickham

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Tidy Data - spread()

    country year   var.name   var.value
1    Norway 2010 population  4891300.00
2    Norway 2010    density       16.07
3    Norway 2050 population  6364008.00
4    Norway 2050    density       20.91
5  Slovenia 2010 population  2003136.00
6  Slovenia 2010    density       99.41
7  Slovenia 2050 population  1596947.00
8  Slovenia 2050    density       79.25
9        UK 2010 population 62348447.00
10       UK 2010    density      257.71
11       UK 2050 population 71153797.00
12       UK 2050    density      294.11

Tidy Data - spread()

tidy.02 <- spread(messy.02, key = var.name, value = var.value)
tidy.02
   country year density population
1   Norway 2010   16.07    4891300
2   Norway 2050   20.91    6364008
3 Slovenia 2010   99.41    2003136
4 Slovenia 2050   79.25    1596947
5       UK 2010  257.71   62348447
6       UK 2050  294.11   71153797

Tidy Data -- key : value pairs

Country: Norway
Country: Slovenia 
...
Year: 2010
Year: 2050

Population: 4891300
Population: 2003136
...
Density: 16.07489
Density: 20.91484
...

Tidy data:

  • each column name is a key
  • each cell is a value

Tidy Data - gather()

   country    X2010    X2050
1   Norway  4891300  6364008
2 Slovenia  2003136  1596947
3       UK 62348447 71153797
tidy.01 <- gather(messy.01, year, population, 2:3)
tidy.01
   country year population
1   Norway 2010    4891300
2 Slovenia 2010    2003136
3       UK 2010   62348447
4   Norway 2050    6364008
5 Slovenia 2050    1596947
6       UK 2050   71153797

Tidy Data - separate()

    country      double.key       value
1    Norway 2010_population  4891300.00
2    Norway    2010_density       16.07
3    Norway 2050_population  6364008.00
4    Norway    2050_density       20.91
5  Slovenia 2010_population  2003136.00
6  Slovenia    2010_density       99.41
7  Slovenia 2050_population  1596947.00
8  Slovenia    2050_density       79.25
9        UK 2010_population 62348447.00
10       UK    2010_density      257.71
11       UK 2050_population 71153797.00
12       UK    2050_density      294.11

Tidy Data - separate()

tidy.03 <- separate(messy.03, double.key, c("year", "key"))
tidy.03
    country year        key       value
1    Norway 2010 population  4891300.00
2    Norway 2010    density       16.07
3    Norway 2050 population  6364008.00
4    Norway 2050    density       20.91
5  Slovenia 2010 population  2003136.00
6  Slovenia 2010    density       99.41
7  Slovenia 2050 population  1596947.00
8  Slovenia 2050    density       79.25
9        UK 2010 population 62348447.00
10       UK 2010    density      257.71
11       UK 2050 population 71153797.00
12       UK 2050    density      294.11

Tidy Data - unite()

messy.again <- unite(tidy.03, new.double.key, key, year,  
                     sep = " in the year ")
messy.again
    country              new.double.key       value
1    Norway population in the year 2010  4891300.00
2    Norway    density in the year 2010       16.07
3    Norway population in the year 2050  6364008.00
4    Norway    density in the year 2050       20.91
5  Slovenia population in the year 2010  2003136.00
6  Slovenia    density in the year 2010       99.41
7  Slovenia population in the year 2050  1596947.00
8  Slovenia    density in the year 2050       79.25
9        UK population in the year 2010 62348447.00
10       UK    density in the year 2010      257.71
11       UK population in the year 2050 71153797.00
12       UK    density in the year 2050      294.11

Tidy Data - Resources

For an excellent write-up of the main tidyr functions see Garrett Grolemund's post here http://garrettgman.github.io/tidying/.

For a quick tidyr cheat-sheet stick this to your wall: Data Wrangling Cheatsheet, also available in the literature folder of this course's repository.

Practical I. - project setup and data tidying

P1.i Start a new project and organize the folder structure

P1.ii Import and clean some data

  • Comma separated values (.csv)
  • Excel files
  • Unzipping an SPSS file

P1.iii Data Tidying

Today: only .csv files, then run load("data/economic.situation.RData"), then data tidying.

Efficient Coding Practices

  • Standard control structures
    • Conditional execution
    • Looping
  • Vectorisation and apply family of functions
  • Writing your own functions
  • Data manipulation with dplyr
    • Sub-setting
    • Grouping
    • Making new variables
    • Piping/chaining daisies

Conditional execution

if (condition) {
  # do something
} else {
  # do something else
}
x <- runif(1) # randum number from uniform distribution [0,1]
if (x >= 0.6) {
  print("Good")
} else {
  if (x <= 0.4) {
    print("Bad")
  } else {
    print("Not Sure")}
}
[1] "Good"

Conditional execution - conditions and operators

x == y   # x is equal to y
x != y   # x is not equal to y
x > y    # x is greater than y
x < y    # x is less than y
x <= y   # x is less than or equal to y
x >= y   # x is greater than or equal to y
x %in% y  # x is located in y
TRUE     # 
FALSE    #
!x        # NOT
x & y     # AND
x | y     # OR
xor(x, y)  # exclusive OR

Conditional execution - vectorised

ifelse(condition, yes, no)
x <- runif(20) # 20 randum numbers from uniform distribution [0,1]
ifelse(x >= 0.6, "G", 
       ifelse(x<=0.4, "B", "N"))
 [1] "G" "G" "G" "B" "N" "B" "G" "N" "B" "B" "G" "B" "G" "G" "B" "G" "B"
[18] "N" "B" "G"

Looping - for() loop

for (i in seq) expr
mat <- matrix(NA, nrow=3, ncol=3)
for (i in 1:3){
  for (j in 1:3){
    mat[i,j] <- paste(i, j, sep="-")
  }
} 
mat
     [,1]  [,2]  [,3] 
[1,] "1-1" "1-2" "1-3"
[2,] "2-1" "2-2" "2-3"
[3,] "3-1" "3-2" "3-3"

Vectorisation - apply()

apply(X, MARGIN, FUN, ...)
mat <- matrix(1:9, 3,3)
# row totals
apply(mat, 1, sum)
[1] 12 15 18
# column totals
apply(mat, 2, sum)
[1]  6 15 24

Vectorisation - apply() vs for()

mat <- matrix(sample(1:100, 25), 5,5)
mat
     [,1] [,2] [,3] [,4] [,5]
[1,]   67   35   22   62   91
[2,]   99   65   31   71   66
[3,]   70   41   56   39   87
[4,]   64   49   17   80   13
[5,]   58   29   78   18   94
# find the second largest value in each row
out <- vector()
for (i in 1:nrow(mat)) {
  out[i] <- sort(mat[i,], decreasing = TRUE)[2]
}

Vectorisation - apply() vs for()

for (i in 1:nrow(mat)) {
  out[i] <- sort(mat[i,], decreasing = TRUE)[2]
}
out
[1] 67 71 70 64 78
# using apply()
apply(mat, 1, function(x) sort(x, decreasing = TRUE)[2])
[1] 67 71 70 64 78

Vectorisation - lapply() and sapply()

# a list of elements with different lengths:
test <- list(a = 1:5, b = 20:100, c = 17234) 
lapply(test, min)
$a
[1] 1

$b
[1] 20

$c
[1] 17234
sapply(test, min)
    a     b     c 
    1    20 17234 

Vectorisation - lapply() and sapply()

# a data frame (list of three vectors of equal length):
test <- data.frame(a = 1:5, b = 6:10, c = 11:15) 
lapply(test, mean)
$a
[1] 3

$b
[1] 8

$c
[1] 13
sapply(test, mean)
 a  b  c 
 3  8 13 

Writing your own functions

function.name <- function(arguments, ...) {
  expression
  (return value)
}

Writing your own functions

FunSecondLargest <-function(x) {
  r <- sort(x, decreasing = TRUE)[2]
  return(r)
}
# now let's try it out with a sample vector
test.vector <- tidy.population2010$population

FunSecondLargest(test.vector)  
[1] 14642884
apply(mat, 1, FunSecondLargest)
[1] 67 71 70 64 78

Writing your own functions

# Function for extracting the n-th largest value from a vector
# Agruments: x - vector,  n - optional integer value for rank
# Output:returns single value 
FunNthLargest <-function(x, n=1) {
  r <- sort(x, decreasing = TRUE)[n]
  return(r)
}
FunNthLargest(test.vector)  
[1] 15120232
FunNthLargest(test.vector, n=2)  
[1] 14642884
FunNthLargest(test.vector, n=3)  
[1] 13601669

Practical II

###########################################################
## CONDITIONAL EXPR, LOOPS, APPLY AND FUNCTIONS
###########################################################
## 3.1 Conditional expressions and logical operators 
## 3.2 For() loops
## 3.3 Funciton writing
## 3.4 Use your function inside an apply statement
## 3.5 Bar-Plotting function
###########################################################

Data manipulation with `dplyr`

  • Sub-setting:
    • filter()
    • select()
  • Calculating:
    • mutate()
    • summarise()
  • Helpers:
    • group_by() and ungroup()
    • arrange()

Filtering - rows

filter(data, criteria)
filter(tidy.population2010, 
       AREA_KM2 < 1000 & population > 10000 & AGE == 0)
  AGE AREA_KM2       NAME FIPS    sex population
1   0      360 Gaza Strip   GZ   male      29209
2   0      687  Singapore   SN   male      18632
3   0      360 Gaza Strip   GZ female      27616
4   0      687  Singapore   SN female      17438

Selecting - columns

select(data, list)
select(tidy.economic.situation, bad, good, neutral)
   bad good neutral
1   25   28      47
2   14   44      42
3   11   54      35
4    1   64      36
5    3   63      34
6   18   40      43
7   76    8      17
8   80    5      15
9   83    4      13
10  92    4       4
11  89    0      11
12  80    6      15
13  77    6      17
14  79    3      18
15  85    2      12
16  91    1       8
17  92    0       8
18  80    4      16

Selecting - columns

# all the columns between bad and neutral
select(tidy.economic.situation, bad:neutral)
# all but the perception column
select(tidy.economic.situation, -perception)
# contains a dot in the name:
select(tidy.economic.situation, contains("."))
# starts with the letter p
select(tidy.economic.situation, starts_with("p"))
# ends with the letter d
select(tidy.economic.situation, ends_with("d"))
# contains the text "n"
select(tidy.economic.situation, contains("n"))
# change order of columns by name
select(tidy.economic.situation,income.group:bad, neutral, good:perception), n=3
# we can also do this using the columns' respective indices instead
select(tidy.economic.situation, 1,2,4,3,5), n=3

Calculating - new variables (columns)

# scale the values so they all sum up to 100
mutate(tidy.economic.situation, 
       total = bad+neutral+good, 
       bad.s = bad/total*100,
       good.s = good/total*100,
       neutral.s = neutral/total*100
)
   income.group perception   bad.s good.s neutral.s
1     inc.lt.20         HH 25.0000 28.000     47.00
2  inc.20.to.39         HH 14.0000 44.000     42.00
3  inc.40.to.59         HH 11.0000 54.000     35.00
4  inc.60.to.99         HH  0.9901 63.366     35.64
5    inc.gt.100         HH  3.0000 63.000     34.00
6           all         HH 17.8218 39.604     42.57
7     inc.lt.20         UK 75.2475  7.921     16.83
8  inc.20.to.39         UK 80.0000  5.000     15.00
9  inc.40.to.59         UK 83.0000  4.000     13.00
10 inc.60.to.99         UK 92.0000  4.000      4.00
11   inc.gt.100         UK 89.0000  0.000     11.00
12          all         UK 79.2079  5.941     14.85
13    inc.lt.20          W 77.0000  6.000     17.00
14 inc.20.to.39          W 79.0000  3.000     18.00
15 inc.40.to.59          W 85.8586  2.020     12.12
16 inc.60.to.99          W 91.0000  1.000      8.00
17   inc.gt.100          W 92.0000  0.000      8.00
18          all          W 80.0000  4.000     16.00

Calculating - new observations (rows)

summarise(data, new.var = summary.function(column))
summarise(tidy.population2010, 
          av.pop = mean(population), 
          av.area = mean(AREA_KM2), 
          count = n(), 
          second = FunSecondLargest(population))
  av.pop av.area count   second
1 149087  578149 46056 14642884

Calculating - new observations (rows) - grouped

gr <- group_by(tidy.population2010, AGE)
summarise(gr, pop = mean(population), 
          area = mean(AREA_KM2), 
          count = n(), 
          second = FunSecondLargest(population))
# A tibble: 101 × 5
     AGE    pop   area count   second
   <int>  <dbl>  <dbl> <int>    <int>
1      0 282738 578149   456 11355900
2      1 278558 578149   456 11177984
3      2 275981 578149   456 11082923
4      3 272960 578149   456 11021599
5      4 270025 578149   456 11002862
6      5 267803 578149   456 11022271
7      6 266032 578149   456 11037275
8      7 264164 578149   456 11040174
9      8 262954 578149   456 11030910
10     9 262510 578149   456 11016559
# ... with 91 more rows

Sorting data

# `arrange()` for data sorting on columns
arrange(tidy.economic.situation, perception, desc(bad.s))
   income.group perception   bad.s good.s neutral.s
1     inc.lt.20         HH 25.0000 28.000     47.00
2           all         HH 17.8218 39.604     42.57
3  inc.20.to.39         HH 14.0000 44.000     42.00
4  inc.40.to.59         HH 11.0000 54.000     35.00
5    inc.gt.100         HH  3.0000 63.000     34.00
6  inc.60.to.99         HH  0.9901 63.366     35.64
7  inc.60.to.99         UK 92.0000  4.000      4.00
8    inc.gt.100         UK 89.0000  0.000     11.00
9  inc.40.to.59         UK 83.0000  4.000     13.00
10 inc.20.to.39         UK 80.0000  5.000     15.00
11          all         UK 79.2079  5.941     14.85
12    inc.lt.20         UK 75.2475  7.921     16.83
13   inc.gt.100          W 92.0000  0.000      8.00
14 inc.60.to.99          W 91.0000  1.000      8.00
15 inc.40.to.59          W 85.8586  2.020     12.12
16          all          W 80.0000  4.000     16.00
17 inc.20.to.39          W 79.0000  3.000     18.00
18    inc.lt.20          W 77.0000  6.000     17.00

Joining tables

  • left_join(a, b) – keeps all rows in first table
  • right_join(a, b) – keeps all rows in second table
  • inner_join(a, b) – only keeps rows present in both a and b
  • full_join(a, b) – keeps all rows

Piping

Without Piping

data.subset <-  filter(tidy.population2010, AREA_KM2 < 2000 & population > 15000 & AGE == 0)
data.subset2 <- select(data.subset, -AGE) 
data.subset3 <-  mutate(data.subset2, density = population/AREA_KM2) 
data.grouped <-   group_by(data.subset3, NAME)
summarise(data.grouped, count = n(), mean.density = mean(density)) 
# A tibble: 3 × 3
        NAME count mean.density
      <fctr> <int>        <dbl>
1 Gaza Strip     2        78.92
2  Hong Kong     2        28.07
3  Singapore     2        26.25

With Piping

tidy.population2010 %>%
  filter(AREA_KM2 < 2000 & population > 15000 & AGE == 0) %>%
  select(-AGE) %>%
  mutate(density = population/AREA_KM2) %>%
  group_by(NAME) %>%
  summarise(count = n(), mean.density = mean(density)) -> final.table
final.table
# A tibble: 3 × 3
        NAME count mean.density
      <fctr> <int>        <dbl>
1 Gaza Strip     2        78.92
2  Hong Kong     2        28.07
3  Singapore     2        26.25

PRACTICAL III

##########################################################
## PIPING and POPULATION PYRAMIDS
##########################################################
## 4.0 Practice piping on population2010
## 4.1 Write a function to extract population pyramid data
## 4.2 Write a function to plot the output of 4.1
##########################################################

PRACTICAL III.i

##########################################################
## 4.0 Practice piping on population2010
##########################################################

Using the population2010.csv file find the answers to the following:

  • How many 20 year-old males were there in Tanzania in 2010
  • Which country has the lowest total population?
  • In which country do women outnumber men in the most age groups?

PRACTICAL III.I solutions

# How many 20 year-old males were there in Tanzania in 2010
tidy.population2010 %>%
  filter(FIPS == "TZ" & AGE == 20 & sex == "male")
  AGE AREA_KM2     NAME FIPS  sex population
1  20   885800 Tanzania   TZ male     424839

PRACTICAL III.I solutions

# Which country has the lowest total population?
tidy.population2010 %>%
  group_by(NAME) %>%
  summarise( population = sum(population)) %>%
  arrange(population)
# A tibble: 228 × 2
                        NAME population
                      <fctr>      <int>
1                 Montserrat       5118
2  Saint Pierre and Miquelon       5943
3           Saint Barthelemy       7406
4               Saint Helena       7670
5                      Nauru       9267
6                     Tuvalu      10472
7               Cook Islands      11488
8                   Anguilla      14766
9          Wallis and Futuna      15343
10                     Palau      20879
# ... with 218 more rows

PRACTICAL III.I solutions

# In which country do women outnuber men in the most age groups?
tidy.population2010 %>%
  spread(sex, population) %>%
  mutate(morewomen = ifelse(female > male, 1, 0)) %>%
  filter(morewomen == 1) %>%
  group_by(NAME) %>%
  summarise(most.ages = sum(morewomen)) %>%
  arrange(desc(most.ages))
# A tibble: 228 × 2
                   NAME most.ages
                 <fctr>     <dbl>
1          Sierra Leone        98
2              Cambodia        92
3               Comoros        92
4                Malawi        91
5           Gambia, The        90
6            Mauritania        90
7  Virgin Islands, U.S.        90
8            Mozambique        88
9                Uganda        88
10             Djibouti        87
# ... with 218 more rows

Literate programming with `knitr`

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

Donald Knuth (1984)

  • Pure code:
    • what and how BUT not why
  • Pure text:
    • what and why BUT not how

Literate programming with `knitr`

a text file that combines:

  • text (with formatting instructions)

    • LaTeX
    • Markdown
  • code to be executed

    • R
    • Python, C, Fortran, Ruby…

And can be knitted to output into:

  • html, docx, pdf
  • shiny applications, dashboards, books, websites.. gallery

Markdown

A lightweight markup language for writing formatting syntax but keeping your text human readable.

RStudio will automatically recognize and render markdown if you save it with the .md extension.

But if you save it as .Rmd then it will not only render the markdown, but also execute any code chunks.

Code chunks and in-line code

```{r chunk_name, chunk_options}
print('hello world!')
1 + 2
```
[1] "hello world!"
[1] 3

You can also use code in-line like this: `r 1 + 1`

Which produces:

You can also use code in-line like this: 2

Code chunk options

  • chunk_name - optional
  • echo = TRUE
  • eval = TRUE
  • message = TRUE
  • warning = TRUE
  • results = 'markup'
  • fig.show = 'asis'
  • fig.height, fig.width = 7

Setup code chunk:

Global options:

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=10, fig.height=5, 
                  echo=FALSE, warning=FALSE, message=FALSE)
```
  • Load libraries
  • source other .R scripts

Images in Rmarkdown

R Markdown version:

![Image Title](path/to/your/image)

As code chunk:

```{r fig.width=4, fig.height=2,echo=FALSE}
library(png)
library(grid)
img <- readPNG("path/to/your/image")
grid.raster(img)

html solution (also works in RPresentaitons):

<div align="center">
<img src="path/to/your/image" width=300>
</div>

YAML headers

---
title: "Test knitted document"
author: "me"
output: html_document
date: "7 March 2017"
---

R Presentations

Everything we just said, but save as .Rpres

and just add:

Title of slide 
================================================

to make a new slide.

R Presentations

Slide with options
================================================
incremental: true
transition: none
autosize: true
  • Transitions
  • Navigations
  • Hierarchy/sectioning
  • Slide IDs

Publishing to RPubs

Register a new account at https://rpubs.com/users/new

Use the Publish icon from an .Rpres or .Rmd file to publish it on-line.

RPubs is public - but not searchable.

If you want to replace a presentation with a more updated version you must first delete the old one.

PRACTICAL IV Bring it all together

##########################################################
## Creating a report and presentation 
##########################################################
## 4.0 Clear project structure: data, functions, report
## 4.1 Load and source them all from a .Rmd or .Rpres file
## 4.2 Create report with multiple populaiton pyramids
## 4.3 
##########################################################