Data carpentry

Why data carpentry?

Data analysts and 'scientists': don't wrangle, munge or 'hack' your valuable datasets. Use #datacarpentry: https://t.co/gXrlIJH91R
— Robin Lovelace (@robinlovelace) February 20, 2017

Code that 'hacks' or 'munges' data won't scale to larger datasets
So ultimately data carpentry is about being able to handle Big Data
We'll cover the basics of data frames and tibbles
And the basics of dplyr, an excellent package for data carpentry
- dplyr verbs also work with the sf package for spatial data

The data frame

The humble data frame is at the heart of most analysis projects:

d = data.frame(x = 1:3, y = c("A", "B", "C"))
d

##   x y
## 1 1 A
## 2 2 B
## 3 3 C

Under the hood a data frame is a list of equal-length columns, so functions such as summary() work column by column:

summary(d)

##        x       y    
##  Min.   :1.0   A:1  
##  1st Qu.:1.5   B:1  
##  Median :2.0   C:1  
##  Mean   :2.0        
##  3rd Qu.:2.5        
##  Max.   :3.0

plot(d)
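
The list structure can be checked directly. This is a minimal sketch in base R that reuses the d object from above; the name col_classes is introduced here for illustration only:

```r
d = data.frame(x = 1:3, y = c("A", "B", "C"))

is.list(d)   # TRUE: a data frame is a list of columns
length(d)    # 2: one list element per column

# sapply() visits each list element (column) in turn
col_classes = sapply(d, class)
col_classes
```

The class reported for y depends on the R version: data.frame() converted strings to factors by default before R 4.0.0, which is why summary(d) above tabulates y as a factor.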

Subsetting

In base R, there are many ways to subset:

d[1,] # the first row

##   x y
## 1 1 A

d[,1] # the first column

## [1] 1 2 3

d$x # the first column

## [1] 1 2 3

d[1] # the first column, as a data frame

##   x
## 1 1
## 2 2
## 3 3
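
The practical difference between these forms is the type of object returned. The sketch below, in base R only, makes that explicit; the names v1, v2, df1 and df2 are illustrative and not part of the original material:

```r
d = data.frame(x = 1:3, y = c("A", "B", "C"))

v1  = d[, 1]                 # bare vector: the data frame is dropped
v2  = d[["x"]]               # the same vector, selected by name
df1 = d[, 1, drop = FALSE]   # drop = FALSE keeps a one-column data frame
df2 = d[1]                   # list-style subsetting also keeps a data frame

is.data.frame(v1)    # FALSE
is.data.frame(df1)   # TRUE
is.data.frame(df2)   # TRUE
```

Code that assumes a data frame can break when a single-column selection silently returns a vector, so drop = FALSE is worth remembering.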

The tibble

The tibble package extends the data frame:

library("tibble")
dt = tibble(x = 1:3, y = c("A", "B", "C"))
dt

## # A tibble: 3 × 2
##       x     y
##   <int> <chr>
## 1     1     A
## 2     2     B
## 3     3     C

Advantages of the tibble

It comes down to efficiency and usability:

When printed, a tibble reports the class of each column
Character vectors are not coerced into factors (base data.frame() did this by default before R 4.0.0)
When a tibble is printed to screen, only the first ten rows are displayed
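
These points can be tried side by side. The following is a rough sketch assuming the tibble package is installed; df, tb and big are illustrative names:

```r
library("tibble")

df = data.frame(x = 1:3, y = c("A", "B", "C"))
tb = tibble(x = 1:3, y = c("A", "B", "C"))

class(tb)          # "tbl_df" "tbl" "data.frame": still a data frame
class(tb$y)        # "character": no coercion to factor

# Single-column subsetting: the data frame simplifies, the tibble does not
class(df[, 1])     # "integer"
is_tibble(tb[, 1]) # TRUE

# A longer tibble only prints its first ten rows
big = tibble(id = 1:100)
print(big)
```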

dplyr

Like tibbles, dplyr has advantages over older ways of doing things:

Type stability (data frame in, data frame out)
Consistent functions: named verbs, rather than [, do everything
Piping makes complex operations easy to read

ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  group_by(Country) %>% 
  summarise(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth) %>%
  arrange(desc(Growth))

# dplyr must be loaded before running the code above:
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
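
In the ghg_ems pipeline above, Growth is computed as diff(range(Transportation)). A small base R sketch shows what that expression does; the transport values are made up for illustration and are not taken from ghg_ems:

```r
transport = c(12.5, 14.0, 13.2, 18.1)  # hypothetical emissions values

range(transport)                  # c(min, max) of the vector
growth = diff(range(transport))   # max minus min

# Same quantity written out by hand
isTRUE(all.equal(growth, max(transport) - min(transport)))
```

Note that this is the spread between the lowest and highest value, whatever order they occur in, so it equals growth over time only when the series rises from its minimum to its maximum.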

Why pipes?

wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)

# the same operation without pipes, read from the inside out:
top_n(
  arrange(
    summarise(
      group_by(
        filter(wb_ineq, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)
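
That the piped and nested forms do the same work can be checked on a small example. This sketch assumes dplyr is installed and uses a made-up data frame called toy in place of wb_ineq, with the final top_n() step left out for brevity:

```r
library("dplyr")

# Hypothetical toy data, standing in for wb_ineq
toy = data.frame(
  Country = c("Togo", "Togo", "Uganda", "Uganda", "Peru", "Peru"),
  Year    = c(2000, 2001, 2000, 2001, 2000, 2001),
  gini    = c(39, 43, 40, NA, 50, 48)
)

# Piped: steps read top to bottom in the order they run
piped = toy %>%
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm = TRUE)) %>%
  arrange(desc(gini))

# Nested: the first step is buried in the innermost call
nested = arrange(
  summarise(
    group_by(filter(toy, grepl("g", Country)), Year),
    gini = mean(gini, na.rm = TRUE)),
  desc(gini))

isTRUE(all.equal(piped, nested))
```

Both versions return the same table; the pipe only changes how the code reads. From R 4.1.0 the native pipe |> can be used in place of %>% for simple chains like this one.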

Subsetting with dplyr

Each operation has one dedicated verb, which makes life simpler:

select(dt, x) # select columns

## # A tibble: 3 × 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3

slice(dt, 2) # 'slice' rows

## # A tibble: 1 × 2
##       x     y
##   <int> <chr>
## 1     2     B
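
A few related verbs follow the same pattern. The sketch below assumes tibble and dplyr are installed and reuses the dt object from earlier; cols, rows and vec are illustrative names:

```r
library("tibble")
library("dplyr")

dt = tibble(x = 1:3, y = c("A", "B", "C"))

cols = select(dt, x)      # columns by name: still a tibble
rows = filter(dt, x > 1)  # rows by condition rather than position
vec  = pull(dt, x)        # explicit request for a bare vector

is_tibble(cols)  # TRUE: data frame in, data frame out
nrow(rows)       # 2
vec
```

Because a vector is only returned when pull() asks for one, the result type never depends on how many columns happen to be selected.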

Practical prerequisites

You'll need the efficient package installed:

devtools::install_github("csgillespie/efficient")

These are the packages we'll be using:

library("tibble")
library("tidyr")
library("stringr")
library("readr")
library("dplyr")

Practical

Test out tibbles by working through Section 6.2 of Efficient R Programming (ERP) (15 minutes)
Tidy data is vital for good data carpentry. Work through the tidying code in Section 6.3 (30 minutes)
Work through Section 6.4 on dplyr in detail
Work through the code examples in Efficient R Programming (45 minutes)
Bonus (if time allows): look at how to merge datasets with left_join() in Section 6.5 (15 minutes)
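
As a warm-up for the tidying and joining exercises, here is a rough sketch on made-up data, assuming tibble, tidyr (version 1.0.0 or later, for pivot_longer()) and dplyr are installed. The tables wide and regions are hypothetical, and the book sections may use different functions, such as the older gather():

```r
library("tibble")
library("tidyr")
library("dplyr")

# Hypothetical wide table: one column per year
wide = tibble(
  Country = c("Togo", "Peru"),
  `2000`  = c(39, 50),
  `2001`  = c(43, 48)
)

# Tidy form: one row per country-year observation
long = pivot_longer(wide, cols = c(`2000`, `2001`),
                    names_to = "Year", values_to = "gini")

# Hypothetical lookup table; Peru deliberately has no match
regions = tibble(Country = c("Togo", "Chad"),
                 Region  = c("Africa", "Africa"))

# left_join() keeps every row of long and fills unmatched rows with NA
joined = left_join(long, regions, by = "Country")
joined
```

The join keeps all four rows of long; the two Peru rows get NA for Region, and Chad, which appears only in the lookup table, is dropped.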