library(tidyverse)
library(nycflights13)

For the examples we use a set of all flights from NYC in 2013. The dataset consists of several tables:

  • Airlines: Names of the airlines
  • Airports: Information about the various airports
  • Flighs : Information about the flights such as origin, destination, delays etc.
  • Planes : Plane information, make, manufacture year etc.
  • Weather : Weather informaton for the origin airports

Select,Filter,Sort & Mutate

Filter rows with filter()

Find all flights that departed 22 december 2013 from JFK.

filter(nycflights13::flights,month == 12 & day == 22 & origin == "JFK")


Select rows with select()

Select specific rows

select(nycflights13::flights,dest,origin)

All rows between two variables

select(nycflights13::flights,year:day)

There are a number of helper functions you can use within select():

  • starts_with(“abc”): matches names that begin with “abc”.
  • ends_with(“xyz”): matches names that end with “xyz”.
  • contains(“ijk”): matches names that contain “ijk”.
  • matches(“(.)\”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
  • num_range(“x”, 1:3) matches x1, x2 and x3.


Add new variables using mutate(), using the pipe operator %>% to pass on data

flights_subset <-  select(nycflights13::flights,dest,tailnum,contains("delay"),distance,air_time) %>%
      mutate(
           air_speed=distance/air_time*60, #Mph
           cumsum_delay=cumsum(arr_delay),
           lagged_delay2=lag(x = cumsum_delay,2)) 
flights_subset

Summarizing data with summarise()

summarise(flights_subset,mean_speed=mean(air_speed,na.rm=T))

Combine with group_by for more usefulness.

Here we get the top10 fastest tailnums pr. destination.

group_by(flights_subset,dest,tailnum) %>% 
    summarise(mean_speed_dest = mean(air_speed,na.rm=T),n=n()) %>%  #Calculate speed for pr. tailnum pr dest
          arrange(dest,desc(mean_speed_dest)) %>%                   #Sort by destination and tailnum
            mutate(rown=row_number()) %>%                           #Get the (intra group) row number (rank)
              filter(rown <= 10) %>%                                #Keep only the 10 fastest tailnums pr dest
                arrange(dest,desc(mean_speed_dest))                 #Arrange by destination and fastest tailnums

Another example: For each destination, compute the total minutes of delay. For each, flight, compute the proportion of the total delay for its destination.

group_by(flights_subset,dest) %>%
  mutate(dest_delay=sum(arr_delay,na.rm=T),prop_delay=arr_delay/dest_delay) %>%
    arrange(dest,desc(prop_delay)) %>%
      select(dest,tailnum,arr_delay,dest_delay,prop_delay)

Tidy data with spread, gather, separate and unite

Gathering

We can use the gather function to gather data where the columns contain values of a variable rather than variables:

table4a

Here the two columns 1999 and 2000 are not variables but values of a variable indicating time of observation. We can use gather to put these variables in to a column called year instead:

table4a %>% 
  gather(`1999`,`2000`, key="Year",value="Cases") %>% 
    arrange(country) 

Gather makes long data wider and shorter.

Spreading

Spread is used when an observation is located over several rows:

table2

Here each observation (a country in a year) is spread across two rows and indicated by the type variable. We can use spread to fix this:

table2 %>% spread(key=type,value=count)

Separate

We use separate to divide variables that are put in to one column when they belong in several columns:

table3

We use separate() to put the components of the rate into two new columns: Cases & Population

table3 %>% separate(rate,sep="/",into = c("Cases","Population"))
LS0tDQp0aXRsZTogIkRhdGEgbWFuaXB1bGF0aW9uIHdpdGggdGlkeXZlcnNlIChleGFtcGxlcyBmcm9tIHI0ZHMpIg0Kb3V0cHV0Og0KICBodG1sX25vdGVib29rOg0KICAgIHRoZW1lOiBmbGF0bHkNCiAgICBoaWdobGlnaHQ6IHplbmJ1cm4NCiAgICBjc3M6IGN1c3RvbS5jc3MNCiAgICB0b2M6IHRydWUNCiAgICB0b2NfZGVwdGg6IDQNCiAgICB0b2NfZmxvYXQ6DQogICAgICBjb2xsYXBzZWQ6IGZhbHNlDQogICAgICBzbW9vdGhfc2Nyb2xsOiBmYWxzZQ0KICAgIA0KICAgIA0KLS0tDQoNCmBgYHtyLHdhcm5pbmc9RixtZXNzYWdlPUZ9DQpsaWJyYXJ5KHRpZHl2ZXJzZSkNCmxpYnJhcnkobnljZmxpZ2h0czEzKQ0KDQpgYGANCg0KRm9yIHRoZSBleGFtcGxlcyB3ZSB1c2UgYSBzZXQgb2YgYWxsIGZsaWdodHMgZnJvbSBOWUMgaW4gMjAxMy4gVGhlIGRhdGFzZXQgY29uc2lzdHMgb2Ygc2V2ZXJhbCB0YWJsZXM6DQoNCiogQWlybGluZXM6IE5hbWVzIG9mIHRoZSBhaXJsaW5lcw0KKiBBaXJwb3J0czogSW5mb3JtYXRpb24gYWJvdXQgdGhlIHZhcmlvdXMgYWlycG9ydHMNCiogRmxpZ2hzICA6IEluZm9ybWF0aW9uIGFib3V0IHRoZSBmbGlnaHRzIHN1Y2ggYXMgb3JpZ2luLCBkZXN0aW5hdGlvbiwgZGVsYXlzIGV0Yy4NCiogUGxhbmVzICA6IFBsYW5lIGluZm9ybWF0aW9uLCBtYWtlLCBtYW51ZmFjdHVyZSB5ZWFyIGV0Yy4NCiogV2VhdGhlciA6IFdlYXRoZXIgaW5mb3JtYXRvbiBmb3IgdGhlIG9yaWdpbiBhaXJwb3J0cw0KPGIvPg0KDQoNCiMjU2VsZWN0LEZpbHRlcixTb3J0ICYgTXV0YXRlIA0KDQojIyNGaWx0ZXIgcm93cyB3aXRoIGZpbHRlcigpDQoNCkZpbmQgYWxsIGZsaWdodHMgdGhhdCBkZXBhcnRlZCAyMiBkZWNlbWJlciAyMDEzIGZyb20gSkZLLg0KYGBge3J9DQpmaWx0ZXIobnljZmxpZ2h0czEzOjpmbGlnaHRzLG1vbnRoID09IDEyICYgZGF5ID09IDIyICYgb3JpZ2luID09ICJKRksiKQ0KYGBgDQoNCjxici8+DQoNCiMjI1NlbGVjdCByb3dzIHdpdGggc2VsZWN0KCkNCg0KU2VsZWN0IHNwZWNpZmljIHJvd3MNCg0KYGBge3J9DQpzZWxlY3QobnljZmxpZ2h0czEzOjpmbGlnaHRzLGRlc3Qsb3JpZ2luKQ0KYGBgDQoNCkFsbCByb3dzIGJldHdlZW4gdHdvIHZhcmlhYmxlcw0KDQpgYGB7cn0NCnNlbGVjdChueWNmbGlnaHRzMTM6OmZsaWdodHMseWVhcjpkYXkpDQpgYGANCg0KVGhlcmUgYXJlIGEgbnVtYmVyIG9mIGhlbHBlciBmdW5jdGlvbnMgeW91IGNhbiB1c2Ugd2l0aGluIHNlbGVjdCgpOg0KDQoqIHN0YXJ0c193aXRoKCJhYmMiKTogbWF0Y2hlcyBuYW1lcyB0aGF0IGJlZ2luIHdpdGggImFiYyIuDQoqIGVuZHNfd2l0aCgieHl6Iik6IG1hdGNoZXMgbmFtZXMgdGhhdCBlbmQgd2l0aCAieHl6Ii4NCiogY29udGFpbnMoImlqayIpOiBtYXRjaGVzIG5hbWVzIHRoYXQgY29udGFpbiAiaWprIi4NCiogbWF0Y2hlcygiKC4pXFxcMSIpOiBzZWxlY3RzIHZhcmlhYmxlcyB0aGF0IG1hdGNoIGEgcmVndWxhciBleHByZXNzaW9uLiBUaGlzIG9uZSBtYXRjaGVzIGFueSB2YXJpYWJsZXMgdGhhdCBjb250YWluIHJlcGVhdGVkIGNoYXJhY3RlcnMuIFlvdSdsbCBsZWFybiBtb3JlIGFib3V0IHJlZ3VsYXIgZXhwcmVzc2lvbnMgaW4gc3RyaW5ncy4NCiogbnVtX3JhbmdlKCJ4IiwgMTozKSBtYXRjaGVzIHgxLCB4MiBhbmQgeDMuDQoNCjxici8+DQoNCg0KIyMjQWRkIG5ldyB2YXJpYWJsZXMgdXNpbmcgbXV0YXRlKCksIHVzaW5nIHRoZSBwaXBlIG9wZXJhdG9yICU+JSB0byBwYXNzIG9uIGRhdGENCmBgYHtyfQ0KZmxpZ2h0c19zdWJzZXQgPC0gIHNlbGVjdChueWNmbGlnaHRzMTM6OmZsaWdodHMsZGVzdCx0YWlsbnVtLGNvbnRhaW5zKCJkZWxheSIpLGRpc3RhbmNlLGFpcl90aW1lKSAlPiUNCg0KICAgICAgbXV0YXRlKA0KICAgICAgICAgICBhaXJfc3BlZWQ9ZGlzdGFuY2UvYWlyX3RpbWUqNjAsICNNcGgNCiAgICAgICAgICAgY3Vtc3VtX2RlbGF5PWN1bXN1bShhcnJfZGVsYXkpLA0KICAgICAgICAgICBsYWdnZWRfZGVsYXkyPWxhZyh4ID0gY3Vtc3VtX2RlbGF5LDIpKSANCg0KZmxpZ2h0c19zdWJzZXQNCg0KYGBgDQoNCg0KIyMjU3VtbWFyaXppbmcgZGF0YSB3aXRoIHN1bW1hcmlzZSgpDQpgYGB7cn0NCnN1bW1hcmlzZShmbGlnaHRzX3N1YnNldCxtZWFuX3NwZWVkPW1lYW4oYWlyX3NwZWVkLG5hLnJtPVQpKQ0KYGBgDQoNCkNvbWJpbmUgd2l0aCBncm91cF9ieSBmb3IgbW9yZSB1c2VmdWxuZXNzLiANCg0KSGVyZSB3ZSBnZXQgdGhlIHRvcDEwIGZhc3Rlc3QgdGFpbG51bXMgcHIuIGRlc3RpbmF0aW9uLg0KDQpgYGB7cn0NCmdyb3VwX2J5KGZsaWdodHNfc3Vic2V0LGRlc3QsdGFpbG51bSkgJT4lIA0KICAgIHN1bW1hcmlzZShtZWFuX3NwZWVkX2Rlc3QgPSBtZWFuKGFpcl9zcGVlZCxuYS5ybT1UKSxuPW4oKSkgJT4lICAjQ2FsY3VsYXRlIHNwZWVkIGZvciBwci4gdGFpbG51bSBwciBkZXN0DQogICAgICAgICAgYXJyYW5nZShkZXN0LGRlc2MobWVhbl9zcGVlZF9kZXN0KSkgJT4lICAgICAgICAgICAgICAgICAgICNTb3J0IGJ5IGRlc3RpbmF0aW9uIGFuZCB0YWlsbnVtDQogICAgICAgICAgICBtdXRhdGUocm93bj1yb3dfbnVtYmVyKCkpICU+JSAgICAgICAgICAgICAgICAgICAgICAgICAgICNHZXQgdGhlIChpbnRyYSBncm91cCkgcm93IG51bWJlciAocmFuaykNCiAgICAgICAgICAgICAgZmlsdGVyKHJvd24gPD0gMTApICU+JSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgI0tlZXAgb25seSB0aGUgMTAgZmFzdGVzdCB0YWlsbnVtcyBwciBkZXN0DQogICAgICAgICAgICAgICAgYXJyYW5nZShkZXN0LGRlc2MobWVhbl9zcGVlZF9kZXN0KSkgICAgICAgICAgICAgICAgICNBcnJhbmdlIGJ5IGRlc3RpbmF0aW9uIGFuZCBmYXN0ZXN0IHRhaWxudW1zDQpgYGANCg0KQW5vdGhlciBleGFtcGxlOiBGb3IgZWFjaCBkZXN0aW5hdGlvbiwgY29tcHV0ZSB0aGUgdG90YWwgbWludXRlcyBvZiBkZWxheS4gRm9yIGVhY2gsIGZsaWdodCwgY29tcHV0ZSB0aGUgcHJvcG9ydGlvbiBvZiB0aGUgdG90YWwgZGVsYXkgZm9yIGl0cyBkZXN0aW5hdGlvbi4NCg0KYGBge3J9DQpncm91cF9ieShmbGlnaHRzX3N1YnNldCxkZXN0KSAlPiUNCiAgbXV0YXRlKGRlc3RfZGVsYXk9c3VtKGFycl9kZWxheSxuYS5ybT1UKSxwcm9wX2RlbGF5PWFycl9kZWxheS9kZXN0X2RlbGF5KSAlPiUNCiAgICBhcnJhbmdlKGRlc3QsZGVzYyhwcm9wX2RlbGF5KSkgJT4lDQogICAgICBzZWxlY3QoZGVzdCx0YWlsbnVtLGFycl9kZWxheSxkZXN0X2RlbGF5LHByb3BfZGVsYXkpDQpgYGANCg0KDQoNCiMjVGlkeSBkYXRhIHdpdGggc3ByZWFkLCBnYXRoZXIsIHNlcGFyYXRlIGFuZCB1bml0ZQ0KDQojIyNHYXRoZXJpbmcNCg0KV2UgY2FuIHVzZSB0aGUgZ2F0aGVyIGZ1bmN0aW9uIHRvICoqZ2F0aGVyKiogZGF0YSB3aGVyZSB0aGUgY29sdW1ucyBjb250YWluIHZhbHVlcyBvZiBhIHZhcmlhYmxlIHJhdGhlciB0aGFuIHZhcmlhYmxlczoNCg0KYGBge3J9DQp0YWJsZTRhDQpgYGANCg0KSGVyZSB0aGUgdHdvIGNvbHVtbnMgMTk5OSBhbmQgMjAwMCBhcmUgbm90IHZhcmlhYmxlcyBidXQgdmFsdWVzIG9mIGEgdmFyaWFibGUgaW5kaWNhdGluZyB0aW1lIG9mIG9ic2VydmF0aW9uLiBXZSBjYW4gdXNlIGdhdGhlciB0byBwdXQgdGhlc2UgdmFyaWFibGVzIGluIHRvIGEgY29sdW1uIGNhbGxlZCB5ZWFyIGluc3RlYWQ6DQoNCmBgYHtyfQ0KdGFibGU0YSAlPiUgDQogIGdhdGhlcihgMTk5OWAsYDIwMDBgLCBrZXk9IlllYXIiLHZhbHVlPSJDYXNlcyIpICU+JSANCiAgICBhcnJhbmdlKGNvdW50cnkpIA0KYGBgDQoNCkdhdGhlciBtYWtlcyBsb25nIGRhdGEgd2lkZXIgYW5kIHNob3J0ZXIuICANCg0KDQojIyNTcHJlYWRpbmcNCg0KU3ByZWFkIGlzIHVzZWQgd2hlbiBhbiBvYnNlcnZhdGlvbiBpcyBsb2NhdGVkIG92ZXIgc2V2ZXJhbCByb3dzOg0KDQpgYGB7cn0NCnRhYmxlMg0KYGBgDQoNCkhlcmUgZWFjaCBvYnNlcnZhdGlvbiAoYSBjb3VudHJ5IGluIGEgeWVhcikgaXMgc3ByZWFkIGFjcm9zcyB0d28gcm93cyBhbmQgaW5kaWNhdGVkIGJ5IHRoZSB0eXBlIHZhcmlhYmxlLiANCldlIGNhbiB1c2UgKipzcHJlYWQqKiB0byBmaXggdGhpczoNCg0KYGBge3J9DQp0YWJsZTIgJT4lIHNwcmVhZChrZXk9dHlwZSx2YWx1ZT1jb3VudCkNCmBgYA0KDQojIyNTZXBhcmF0ZQ0KDQpXZSB1c2Ugc2VwYXJhdGUgdG8gZGl2aWRlIHZhcmlhYmxlcyB0aGF0IGFyZSBwdXQgaW4gdG8gb25lIGNvbHVtbiB3aGVuIHRoZXkgYmVsb25nIGluIHNldmVyYWwgY29sdW1uczoNCg0KYGBge3J9DQp0YWJsZTMNCmBgYA0KDQpXZSB1c2Ugc2VwYXJhdGUoKSB0byBwdXQgdGhlIGNvbXBvbmVudHMgb2YgdGhlIHJhdGUgaW50byB0d28gbmV3IGNvbHVtbnM6IENhc2VzICYgUG9wdWxhdGlvbg0KDQpgYGB7cn0NCnRhYmxlMyAlPiUgc2VwYXJhdGUocmF0ZSxzZXA9Ii8iLGludG8gPSBjKCJDYXNlcyIsIlBvcHVsYXRpb24iKSkNCmBgYA0KDQo=