Data tidying and transforming

Carlos Zelada
Octubre 2017

80% of data analysis is spent on the process of cleaning and preparing the data.“ Dasu and Johnson 2003

"Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.” Forbes.com 2016

Ref: Cleaning Big Data: Most Time-Consuming …

Tidy Data

Package: Tidyr

Tidy Data

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

Tidy Data

In tidy data:

Each variable forms a column
Each observation forms a row
Each data set contains information information about a single observational unit of analysis (e.g., families, participants, participant visits)

Tidy Data

Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset.

alt text

Ejemplo

pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F,
  encoding = "UTF-8"
)

religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k	Don't know/refused
Agnostic	27	34	60	81	76	137	122	109	84	96
Atheist	12	27	37	52	35	70	73	59	74	76
Buddhist	27	21	30	34	33	58	62	39	53	54
Catholic	418	617	732	670	638	1116	949	792	633	1489
Don�t know/refused	15	14	15	11	10	35	21	17	18	116
Evangelical Prot	575	869	1064	982	881	1486	949	723	414	1529

gather()

tidytable<-gather(pew,income,frequency, -religion)

gather()

tidytable<-gather(pew,income,frequency, -religion)

kable(head(tidytable))

religion	income	frequency
Agnostic	<$10k	27
Atheist	<$10k	12
Buddhist	<$10k	27
Catholic	<$10k	418
Don�t know/refused	<$10k	15
Evangelical Prot	<$10k	575

Exploration and Transformation

Package: DPLYR

dplyr

Why use dplyr?

Great for data exploration and transformation.
Intuitive to write and easy to read.
Fast on data frames.

dplyr functionality

Five basic verbs
- filter
- select
- arrange
- mutate
- summarise along side with group_by.
Can work with data stored in databases and data tables.
Joins: inner join, left join, semi-join, anti-join.
Window functions for calculating ranking, offsets, and more.

Let´s start

library(dplyr)
library(nycflights13)

Filter

alt text

Find all flights:

To SFO or OAK.
In January.
Delayed by more than an hour.
That departed between midnight and five am.
Where the arrival delay was more than twice the departure delay.

Answers

## (1)
filter(flights, dest %in% c("SFO", "OAK")) 
filter(flights, dest == "SFO" | dest == "OAK")
filter(flights, dest == "SFO" | "OAK") ##Error

## (2)
filter(flights, month == 1)

## (3)
filter(flights, dep_delay > 60)

## (4)
filter(flights, hour >= 0, hour <= 5) 
filter(flights, hour >= 0 & hour <= 5)

## (5)
filter(flights, arr_delay > 2 * dep_delay)

Select

alt text

Your turn

Read the help for select(). What other ways can you select variables?
Write down three ways to select the two delay variables.

Answers

select(flights, arr_delay, dep_delay)
select(flights, arr_delay:dep_delay)
select(flights, ends_with("delay"))
select(flights, contains("delay"))

Arrange

alt text

Your Turn

Order the flights by departure date and time.
Which flights were most delayed?

Answers

##(1)
arrange(flights, year ,month, day, hour, minute)
##(2)
arrange(flights, desc(dep_delay)) 
arrange(flights, desc(arr_delay))

Mutate

alt text

Your turn

Compute speed in mph from time (in minutes) and distance (in miles). Which flight flew the fastest?
Add a new variable that shows how much time was made up or lost in flight.
How did I compute hour and minute from dep_time?

hint: you may need to use select() or View() to see your new variable

Answers

## (1)
fl <- mutate(flights, speed = distance / (air_time / 60))
arrange(fl, desc(speed))

## (2)
mutate(fl, delta = dep_delay - arr_delay)

## (3)
mutate(fl,
 hour = dep_time %/% 100,
 minute = dep_time %% 100)

Summarise and Group_by

alt text

Examples

by_month <- group_by(flights, month)
kable(summarise(by_month,Number_flights=n()))

month	Number_flights
1	27004
2	24951
3	28834
4	28330
5	28796
6	28243
7	29425
8	29327
9	27574
10	28889
11	27268
12	28135

by_month <- group_by(flights, month)
kable(summarise(by_month,
                mean_air_time=mean(air_time,
                                   na.rm=TRUE),
                sd_air_time=
                        sd(air_time,
                           na.rm = TRUE),
                flgs=n())
      )

month	mean_air_time	sd_air_time	planes
1	154.1874	95.24514	27004
2	151.3464	92.26271	24951
3	149.0770	91.83097	28834
4	153.1011	94.53309	28330
5	145.7275	92.24838	28796
6	150.3252	93.95354	28243
7	146.7283	91.14821	29425
8	148.1604	92.88184	29327
9	143.4712	91.90547	27574
10	148.8861	93.58280	28889
11	155.4686	95.94203	27268
12	162.5914	97.13239	28135

Your turn

How might you summarise dep_delay for each day?

hint: mean, sd, median, max, min, quantile, sum, n

Answers

by_date <- group_by(flights, month, day)
delays <- summarise(by_date,
  mean = mean(dep_delay, na.rm = TRUE),
  median = median(dep_delay, na.rm = TRUE),
  q75 = quantile(dep_delay, 0.75, na.rm = TRUE),
  over_15 = mean(dep_delay > 15, na.rm = TRUE),
  over_30 = mean(dep_delay > 30, na.rm = TRUE),
  over_60 = mean(dep_delay > 60, na.rm = TRUE)
)

What does this code do?

hourly_delay <- filter(
  summarise(
    group_by(
      filter(
flights,
        !is.na(dep_delay)
      ),
month,day, hour ),
    delay = mean(dep_delay),
n = n() ),
n > 10 )

Pipe operator

alt text

\[ \LARGE x \; \text{%>%} \; f(y) \rightarrow f(x,y) \]

\[ \LARGE y \; \text{%>%} \; f(x) \rightarrow f(y,x) \]

rewrite

hourly_delay <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(month,day, hour) %>%
  summarise(delay = mean(dep_delay), n = n()) %>%
  filter(n > 10)

Hint: pronounce %>% as then

Your turn

Create data pipelines to answer the following questions:

Which destinations have the highest average delays?
Which flights (i.e. carrier + flight) happen every day?
- Where do they fly to?
On average, how do delays vary over the course of a day?

(Hint: hour + minute / 60)

Answers (1)

flights %>%
  group_by(dest) %>%
  summarise(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    n = n()) %>%
  arrange(desc(arr_delay))

Answers (2)

flights %>%
  group_by(carrier, flight, dest) %>%
  tally(sort = TRUE) %>% 
  filter(n == 365)

flights %>%
  group_by(carrier, flight, dest) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  filter(n == 365)

Answers (3)

per_hour <- flights %>%
  mutate(time = hour + minute / 60) %>%
  group_by(time) %>%
  summarise(
    arr_delay = mean(arr_delay, na.rm = TRUE),n = n() )

Grouped mutate/filter

Creating new variables within a group is also often useful.

planes <- flights %>%
        filter(!is.na(arr_delay)) %>%
        group_by(tail_num) %>%
        filter(n() > 30)

planes %>%
  mutate(z_delay =
    (arr_delay - mean(arr_delay)) / sd(arr_delay)) %>%
  filter(z_delay > 5)

Grouped mutate/filter (2)

planes %>% filter(min_rank(arr_delay) < 5)

Your turn

What’s the difference between min_rank(), row_number() and dense_rank()?

For each plane, find the two most delayed flights. Which of the three rank functions is most appropriate?

Two tables verbs

alt text

inner_join

alt text

left_join

semi_join

alt text

anti_join

alt text

Two verb summary

Type	Action
inner	Include only rows in both x and y
left	Include all of x, and matching rows of y
semi	Include rows of x that match y
anti	Include rows of x that don’t match y

Example

hourly_delay <- flights %>%
        group_by(month, day, hour) %>%
        filter(!is.na(dep_delay)) %>%
        summarise(
                delay = mean(dep_delay),
                n = n() ) %>%
        filter

delay_weather <- hourly_delay %>% left_join(weather)

Your turn

Are older planes more likely to be delayed?

Biblography

Introduction to dplyr https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Data Processing with dplyr & tidyr https://rpubs.com/bradleyboehmke/data_wrangling

Data manipulation with tidyr http://datascienceplus.com/data-manipulation-with-tidyr/