Slides: rpubs.com/RobinLovelace

Why data carpentry?

  • If you 'hack' or 'munge' data, it won't scale
  • So ultimately it's about being able to handle Big Data
  • We'll cover the basics of data frames and tibbles
  • And the basics of dplyr, an excellent package for data carpentry
    • dplyr is also compatible with the sf package

The data frame

The humble data frame is at the heart of most analysis projects:

d = data.frame(x = 1:3, y = c("A", "B", "C"))
d
##   x y
## 1 1 A
## 2 2 B
## 3 3 C

In reality this is a list of columns, so functions work on each column:

summary(d)
##        x       y    
##  Min.   :1.0   A:1  
##  1st Qu.:1.5   B:1  
##  Median :2.0   C:1  
##  Mean   :2.0        
##  3rd Qu.:2.5        
##  Max.   :3.0
plot(d)
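The list structure can be checked directly. A minimal sketch using only base R; the name col_classes is introduced here for illustration:

```r
d <- data.frame(x = 1:3, y = c("A", "B", "C"))

is.list(d)   # TRUE: a data frame is a list of equal-length columns
length(d)    # 2: one list element per column

# sapply() visits each list element, i.e. each column, in turn.
# The class reported for y depends on the R version
# (factor before R 4.0, character from R 4.0 onwards).
col_classes <- sapply(d, class)
col_classes
```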

Subsetting

In base R, there are many ways to subset:

d[1,] # the first row
##   x y
## 1 1 A
d[,1] # the first column
## [1] 1 2 3
d$x # the first column
## [1] 1 2 3
d[1] # the first column, as a data frame
##   x
## 1 1
## 2 2
## 3 3
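These forms differ in what they return, which is a common source of bugs. A short sketch in base R showing which results stay data frames and which are simplified to vectors:

```r
d <- data.frame(x = 1:3, y = c("A", "B", "C"))

is.data.frame(d[1, ])                # TRUE: a row subset stays a data frame
is.data.frame(d[, 1])                # FALSE: a single column is simplified to a vector
is.data.frame(d[, 1, drop = FALSE])  # TRUE: drop = FALSE switches off the simplification
is.data.frame(d[1])                  # TRUE: list-style subsetting keeps the data frame
identical(d$x, d[["x"]])             # TRUE: $ and [[ both extract the column itself
```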

The tibble

Recently the data frame has been extended:

library("tibble")
dt = tibble(x = 1:3, y = c("A", "B", "C"))
dt
## # A tibble: 3 × 2
##       x     y
##   <int> <chr>
## 1     1     A
## 2     2     B
## 3     3     C

Advantages of the tibble

It comes down to efficiency and usability

  • When printed, a tibble reports the class of each column
  • Character vectors are not coerced into factors
  • When printing a tibble to screen, only the first ten rows are displayed
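A small sketch of these differences, assuming the tibble package is installed. The objects df, tb and big are made up for illustration:

```r
library("tibble")

df <- data.frame(x = 1:3, y = c("A", "B", "C"))
tb <- tibble(x = 1:3, y = c("A", "B", "C"))

is.character(tb$y)              # TRUE: strings stay as character vectors
is.data.frame(df[, "x"])        # FALSE: base R simplifies a single column to a vector
inherits(tb[, "x"], "tbl_df")   # TRUE: subsetting a tibble with [ returns a tibble

big <- tibble(id = 1:1000)
print(big)                      # shows the first ten rows, not all 1000
```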

dplyr

Like tibbles, dplyr has advantages over historic ways of doing things

  • Type stability (data frame in, data frame out)
  • Consistent functions: named functions, not [, do everything
  • Piping makes complex operations easy
# dplyr must be loaded first
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# ghg_ems is presumably the dataset shipped with the efficient package
ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  group_by(Country) %>% 
  summarise(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth) %>%
  arrange(desc(Growth))
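The ghg_ems pipeline above needs the dataset to be loaded. To try the same chain of verbs without it, here is a self-contained sketch on a toy data frame. The toy_ems object and its values are made up for illustration and are not real emissions data:

```r
library(dplyr)

# Hypothetical stand-in for ghg_ems (values are invented)
toy_ems <- data.frame(
  Country = c("A", "A", "B", "B", "World", "World"),
  Transportation = c(1, 3, 2, 7, 10, 20)
)

res <- toy_ems %>%
  filter(!grepl("World|Europe", Country)) %>%   # drop aggregate rows
  group_by(Country) %>%                          # one group per country
  summarise(Mean = mean(Transportation),         # average per country
            Growth = diff(range(Transportation))) %>%  # max minus min
  arrange(desc(Growth))                          # largest growth first
res
```

Each step takes a data frame and returns a data frame, which is what makes the verbs chainable.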

Why pipes?

wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)

vs

top_n(
  arrange(
    summarise(
      group_by(
        filter(wb_ineq, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)
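The pipe simply passes the left-hand side as the first argument of the function on the right, so x %>% f(y) is f(x, y). A tiny sketch with made-up numbers showing that both styles compute the same thing:

```r
library(dplyr)  # dplyr makes the %>% operator available

x <- c(4, 9, 16)

piped  <- x %>% sqrt() %>% sum()   # read left to right: take x, then sqrt, then sum
nested <- sum(sqrt(x))             # read inside out: the same computation

identical(piped, nested)           # TRUE
```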

Subsetting with dplyr

There is only one way to do each task, which makes life simpler:

select(dt, x) # select columns
## # A tibble: 3 × 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3
slice(dt, 2) # 'slice' rows
## # A tibble: 1 × 2
##       x     y
##   <int> <chr>
## 1     2     B
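The same one-verb-per-task pattern covers other subsetting jobs. A short sketch, assuming dplyr and tibble are installed, using the filter(), select() and pull() verbs on the same dt object:

```r
library(dplyr)
library(tibble)

dt <- tibble(x = 1:3, y = c("A", "B", "C"))

big_x  <- filter(dt, x > 1)   # rows that meet a condition (here rows 2 and 3)
no_y   <- select(dt, -y)      # every column except y
x_only <- pull(dt, x)         # extract x as a plain vector, like dt$x
```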

Practical prerequisites

You'll need the efficient package installed:

devtools::install_github("csgillespie/efficient")

These are the packages we'll be using:

library("tibble")
library("tidyr")
library("stringr")
library("readr")
library("dplyr")

Practical

  • Test out tibbles with Section 6.2 of Efficient R Programming (ERP) (15 minutes)
  • Tidy data is vital for good data carpentry. Work through the tidying code in Section 6.3 (30 minutes)
  • Work through Section 6.4 on dplyr in detail
  • Work through the code examples in Efficient R Programming (45 minutes)
  • Bonus (if time allows): look at how to merge datasets with left_join() in Section 6.5 (15 minutes)
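As a warm-up for the bonus task, here is a minimal left_join() sketch. The people and cities tables are made up for illustration and are not taken from the book:

```r
library(dplyr)

# Hypothetical toy tables (names and values are invented)
people <- data.frame(name = c("Ann", "Bob", "Cat"), city_id = c(1, 2, 3))
cities <- data.frame(city_id = c(1, 2), city = c("Leeds", "York"))

# Keep every row of people and add matching columns from cities
joined <- left_join(people, cities, by = "city_id")
joined  # all three people are kept; Cat's city is NA because id 3 has no match
```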