Data tidying and transforming

Carlos Zelada
Octubre 2017

80% of data analysis is spent on the process of cleaning and preparing the data.“ Dasu and Johnson 2003

"Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.” Forbes.com 2016

Ref: Cleaning Big Data: Most Time-Consuming …

Tidy Data

Package: Tidyr

Tidy Data

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

Tidy Data

In tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each data set contains information information about a single observational unit of analysis (e.g., families, participants, participant visits)

Tidy Data

Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset.

alt text alt text

Ejemplo

pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F,
  encoding = "UTF-8"
)
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k Don't know/refused
Agnostic 27 34 60 81 76 137 122 109 84 96
Atheist 12 27 37 52 35 70 73 59 74 76
Buddhist 27 21 30 34 33 58 62 39 53 54
Catholic 418 617 732 670 638 1116 949 792 633 1489
Don’t know/refused 15 14 15 11 10 35 21 17 18 116
Evangelical Prot 575 869 1064 982 881 1486 949 723 414 1529

gather()

tidytable<-gather(pew,income,frequency, -religion)

gather()

tidytable<-gather(pew,income,frequency, -religion)
kable(head(tidytable))
religion income frequency
Agnostic <$10k 27
Atheist <$10k 12
Buddhist <$10k 27
Catholic <$10k 418
Don’t know/refused <$10k 15
Evangelical Prot <$10k 575

Exploration and Transformation

Package: DPLYR

dplyr

Why use dplyr?

  • Great for data exploration and transformation.
  • Intuitive to write and easy to read.
  • Fast on data frames.

dplyr functionality

  • Five basic verbs
    • filter
    • select
    • arrange
    • mutate
    • summarise along side with group_by.
  • Can work with data stored in databases and data tables.
  • Joins: inner join, left join, semi-join, anti-join.
  • Window functions for calculating ranking, offsets, and more.

Let´s start

library(dplyr)
library(nycflights13)

Filter

alt text

alt text

Find all flights:

  1. To SFO or OAK.
  2. In January.
  3. Delayed by more than an hour.
  4. That departed between midnight and five am.
  5. Where the arrival delay was more than twice the departure delay.

Answers

## (1)
filter(flights, dest %in% c("SFO", "OAK")) 
filter(flights, dest == "SFO" | dest == "OAK")
filter(flights, dest == "SFO" | "OAK") ##Error

## (2)
filter(flights, month == 1)

## (3)
filter(flights, dep_delay > 60)

## (4)
filter(flights, hour >= 0, hour <= 5) 
filter(flights, hour >= 0 & hour <= 5)

## (5)
filter(flights, arr_delay > 2 * dep_delay)

Select

alt text

alt text

Your turn

  1. Read the help for select(). What other ways can you select variables?
  2. Write down three ways to select the two delay variables.

Answers

select(flights, arr_delay, dep_delay)
select(flights, arr_delay:dep_delay)
select(flights, ends_with("delay"))
select(flights, contains("delay"))

Arrange

alt text

alt text

Your Turn

  1. Order the flights by departure date and time.
  2. Which flights were most delayed?

Answers

##(1)
arrange(flights, year ,month, day, hour, minute)
##(2)
arrange(flights, desc(dep_delay)) 
arrange(flights, desc(arr_delay))

Mutate

alt text

alt text

Your turn

  1. Compute speed in mph from time (in minutes) and distance (in miles). Which flight flew the fastest?
  2. Add a new variable that shows how much time was made up or lost in flight.
  3. How did I compute hour and minute from dep_time?

hint: you may need to use select() or View() to see your new variable

Answers

## (1)
fl <- mutate(flights, speed = distance / (air_time / 60))
arrange(fl, desc(speed))

## (2)
mutate(fl, delta = dep_delay - arr_delay)

## (3)
mutate(fl,
 hour = dep_time %/% 100,
 minute = dep_time %% 100)

Summarise and Group_by

alt text

alt text

Examples

by_month <- group_by(flights, month)
kable(summarise(by_month,Number_flights=n()))
month Number_flights
1 27004
2 24951
3 28834
4 28330
5 28796
6 28243
7 29425
8 29327
9 27574
10 28889
11 27268
12 28135
by_month <- group_by(flights, month)
kable(summarise(by_month,
                mean_air_time=mean(air_time,
                                   na.rm=TRUE),
                sd_air_time=
                        sd(air_time,
                           na.rm = TRUE),
                flgs=n())
      )
month mean_air_time sd_air_time planes
1 154.1874 95.24514 27004
2 151.3464 92.26271 24951
3 149.0770 91.83097 28834
4 153.1011 94.53309 28330
5 145.7275 92.24838 28796
6 150.3252 93.95354 28243
7 146.7283 91.14821 29425
8 148.1604 92.88184 29327
9 143.4712 91.90547 27574
10 148.8861 93.58280 28889
11 155.4686 95.94203 27268
12 162.5914 97.13239 28135

Your turn

How might you summarise dep_delay for each day?

hint: mean, sd, median, max, min, quantile, sum, n

Answers

by_date <- group_by(flights, month, day)
delays <- summarise(by_date,
  mean = mean(dep_delay, na.rm = TRUE),
  median = median(dep_delay, na.rm = TRUE),
  q75 = quantile(dep_delay, 0.75, na.rm = TRUE),
  over_15 = mean(dep_delay > 15, na.rm = TRUE),
  over_30 = mean(dep_delay > 30, na.rm = TRUE),
  over_60 = mean(dep_delay > 60, na.rm = TRUE)
)

What does this code do?

hourly_delay <- filter(
  summarise(
    group_by(
      filter(
flights,
        !is.na(dep_delay)
      ),
month,day, hour ),
    delay = mean(dep_delay),
n = n() ),
n > 10 )

Pipe operator

alt text

\[ \LARGE x \; \text{%>%} \; f(y) \rightarrow f(x,y) \]

\[ \LARGE y \; \text{%>%} \; f(x) \rightarrow f(y,x) \]

rewrite

hourly_delay <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(month,day, hour) %>%
  summarise(delay = mean(dep_delay), n = n()) %>%
  filter(n > 10)

Hint: pronounce %>% as then

Your turn

Create data pipelines to answer the following questions:

  1. Which destinations have the highest average delays?
  2. Which flights (i.e. carrier + flight) happen every day?
    • Where do they fly to?
  3. On average, how do delays vary over the course of a day?

(Hint: hour + minute / 60)

Answers (1)

flights %>%
  group_by(dest) %>%
  summarise(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    n = n()) %>%
  arrange(desc(arr_delay))

Answers (2)

flights %>%
  group_by(carrier, flight, dest) %>%
  tally(sort = TRUE) %>% 
  filter(n == 365)

OR

flights %>%
  group_by(carrier, flight, dest) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  filter(n == 365)

Answers (3)

per_hour <- flights %>%
  mutate(time = hour + minute / 60) %>%
  group_by(time) %>%
  summarise(
    arr_delay = mean(arr_delay, na.rm = TRUE),n = n() )

Grouped mutate/filter

  • Creating new variables within a group is also often useful.
planes <- flights %>%
        filter(!is.na(arr_delay)) %>%
        group_by(tail_num) %>%
        filter(n() > 30)
planes %>%
  mutate(z_delay =
    (arr_delay - mean(arr_delay)) / sd(arr_delay)) %>%
  filter(z_delay > 5)

Grouped mutate/filter (2)

planes %>% filter(min_rank(arr_delay) < 5)

Your turn

What’s the difference between min_rank(), row_number() and dense_rank()?

For each plane, find the two most delayed flights. Which of the three rank functions is most appropriate?

Two tables verbs

alt text

inner_join

alt text

left_join

alt text

semi_join

alt text

anti_join

alt text

Two verb summary

Type Action
inner Include only rows in both x and y
left Include all of x, and matching rows of y
semi Include rows of x that match y
anti Include rows of x that don’t match y

Example

hourly_delay <- flights %>%
        group_by(month, day, hour) %>%
        filter(!is.na(dep_delay)) %>%
        summarise(
                delay = mean(dep_delay),
                n = n() ) %>%
        filter

delay_weather <- hourly_delay %>% left_join(weather)

Your turn

Are older planes more likely to be delayed?

Biblography