library(tidyverse)
library(nycflights13)
For the examples we use a set of all flights from NYC in 2013. The dataset consists of several tables:
- Airlines: Names of the airlines
- Airports: Information about the various airports
- Flighs : Information about the flights such as origin, destination, delays etc.
- Planes : Plane information, make, manufacture year etc.
- Weather : Weather informaton for the origin airports
Select,Filter,Sort & Mutate
Filter rows with filter()
Find all flights that departed 22 december 2013 from JFK.
filter(nycflights13::flights,month == 12 & day == 22 & origin == "JFK")
Select rows with select()
Select specific rows
select(nycflights13::flights,dest,origin)
All rows between two variables
select(nycflights13::flights,year:day)
There are a number of helper functions you can use within select():
- starts_with(“abc”): matches names that begin with “abc”.
- ends_with(“xyz”): matches names that end with “xyz”.
- contains(“ijk”): matches names that contain “ijk”.
- matches(“(.)\”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
- num_range(“x”, 1:3) matches x1, x2 and x3.
Add new variables using mutate(), using the pipe operator %>% to pass on data
flights_subset <- select(nycflights13::flights,dest,tailnum,contains("delay"),distance,air_time) %>%
mutate(
air_speed=distance/air_time*60, #Mph
cumsum_delay=cumsum(arr_delay),
lagged_delay2=lag(x = cumsum_delay,2))
flights_subset
Summarizing data with summarise()
summarise(flights_subset,mean_speed=mean(air_speed,na.rm=T))
Combine with group_by for more usefulness.
Here we get the top10 fastest tailnums pr. destination.
group_by(flights_subset,dest,tailnum) %>%
summarise(mean_speed_dest = mean(air_speed,na.rm=T),n=n()) %>% #Calculate speed for pr. tailnum pr dest
arrange(dest,desc(mean_speed_dest)) %>% #Sort by destination and tailnum
mutate(rown=row_number()) %>% #Get the (intra group) row number (rank)
filter(rown <= 10) %>% #Keep only the 10 fastest tailnums pr dest
arrange(dest,desc(mean_speed_dest)) #Arrange by destination and fastest tailnums
Another example: For each destination, compute the total minutes of delay. For each, flight, compute the proportion of the total delay for its destination.
group_by(flights_subset,dest) %>%
mutate(dest_delay=sum(arr_delay,na.rm=T),prop_delay=arr_delay/dest_delay) %>%
arrange(dest,desc(prop_delay)) %>%
select(dest,tailnum,arr_delay,dest_delay,prop_delay)
Tidy data with spread, gather, separate and unite
Gathering
We can use the gather function to gather data where the columns contain values of a variable rather than variables:
Here the two columns 1999 and 2000 are not variables but values of a variable indicating time of observation. We can use gather to put these variables in to a column called year instead:
table4a %>%
gather(`1999`,`2000`, key="Year",value="Cases") %>%
arrange(country)
Gather makes long data wider and shorter.
Spreading
Spread is used when an observation is located over several rows:
Here each observation (a country in a year) is spread across two rows and indicated by the type variable. We can use spread to fix this:
table2 %>% spread(key=type,value=count)
Separate
We use separate to divide variables that are put in to one column when they belong in several columns:
We use separate() to put the components of the rate into two new columns: Cases & Population
table3 %>% separate(rate,sep="/",into = c("Cases","Population"))
LS0tDQp0aXRsZTogIkRhdGEgbWFuaXB1bGF0aW9uIHdpdGggdGlkeXZlcnNlIChleGFtcGxlcyBmcm9tIHI0ZHMpIg0Kb3V0cHV0Og0KICBodG1sX25vdGVib29rOg0KICAgIHRoZW1lOiBmbGF0bHkNCiAgICBoaWdobGlnaHQ6IHplbmJ1cm4NCiAgICBjc3M6IGN1c3RvbS5jc3MNCiAgICB0b2M6IHRydWUNCiAgICB0b2NfZGVwdGg6IDQNCiAgICB0b2NfZmxvYXQ6DQogICAgICBjb2xsYXBzZWQ6IGZhbHNlDQogICAgICBzbW9vdGhfc2Nyb2xsOiBmYWxzZQ0KICAgIA0KICAgIA0KLS0tDQoNCmBgYHtyLHdhcm5pbmc9RixtZXNzYWdlPUZ9DQpsaWJyYXJ5KHRpZHl2ZXJzZSkNCmxpYnJhcnkobnljZmxpZ2h0czEzKQ0KDQpgYGANCg0KRm9yIHRoZSBleGFtcGxlcyB3ZSB1c2UgYSBzZXQgb2YgYWxsIGZsaWdodHMgZnJvbSBOWUMgaW4gMjAxMy4gVGhlIGRhdGFzZXQgY29uc2lzdHMgb2Ygc2V2ZXJhbCB0YWJsZXM6DQoNCiogQWlybGluZXM6IE5hbWVzIG9mIHRoZSBhaXJsaW5lcw0KKiBBaXJwb3J0czogSW5mb3JtYXRpb24gYWJvdXQgdGhlIHZhcmlvdXMgYWlycG9ydHMNCiogRmxpZ2hzICA6IEluZm9ybWF0aW9uIGFib3V0IHRoZSBmbGlnaHRzIHN1Y2ggYXMgb3JpZ2luLCBkZXN0aW5hdGlvbiwgZGVsYXlzIGV0Yy4NCiogUGxhbmVzICA6IFBsYW5lIGluZm9ybWF0aW9uLCBtYWtlLCBtYW51ZmFjdHVyZSB5ZWFyIGV0Yy4NCiogV2VhdGhlciA6IFdlYXRoZXIgaW5mb3JtYXRvbiBmb3IgdGhlIG9yaWdpbiBhaXJwb3J0cw0KPGIvPg0KDQoNCiMjU2VsZWN0LEZpbHRlcixTb3J0ICYgTXV0YXRlIA0KDQojIyNGaWx0ZXIgcm93cyB3aXRoIGZpbHRlcigpDQoNCkZpbmQgYWxsIGZsaWdodHMgdGhhdCBkZXBhcnRlZCAyMiBkZWNlbWJlciAyMDEzIGZyb20gSkZLLg0KYGBge3J9DQpmaWx0ZXIobnljZmxpZ2h0czEzOjpmbGlnaHRzLG1vbnRoID09IDEyICYgZGF5ID09IDIyICYgb3JpZ2luID09ICJKRksiKQ0KYGBgDQoNCjxici8+DQoNCiMjI1NlbGVjdCByb3dzIHdpdGggc2VsZWN0KCkNCg0KU2VsZWN0IHNwZWNpZmljIHJvd3MNCg0KYGBge3J9DQpzZWxlY3QobnljZmxpZ2h0czEzOjpmbGlnaHRzLGRlc3Qsb3JpZ2luKQ0KYGBgDQoNCkFsbCByb3dzIGJldHdlZW4gdHdvIHZhcmlhYmxlcw0KDQpgYGB7cn0NCnNlbGVjdChueWNmbGlnaHRzMTM6OmZsaWdodHMseWVhcjpkYXkpDQpgYGANCg0KVGhlcmUgYXJlIGEgbnVtYmVyIG9mIGhlbHBlciBmdW5jdGlvbnMgeW91IGNhbiB1c2Ugd2l0aGluIHNlbGVjdCgpOg0KDQoqIHN0YXJ0c193aXRoKCJhYmMiKTogbWF0Y2hlcyBuYW1lcyB0aGF0IGJlZ2luIHdpdGggImFiYyIuDQoqIGVuZHNfd2l0aCgieHl6Iik6IG1hdGNoZXMgbmFtZXMgdGhhdCBlbmQgd2l0aCAieHl6Ii4NCiogY29udGFpbnMoImlqayIpOiBtYXRjaGVzIG5hbWVzIHRoYXQgY29udGFpbiAiaWprIi4NCiogbWF0Y2hlcygiKC4pXFxcMSIpOiBzZWxlY3RzIHZhcmlhYmxlcyB0aGF0IG1hdGNoIGEgcmVndWxhciBleHByZXNzaW9uLiBUaGlzIG9uZSBtYXRjaGVzIGFueSB2YXJpYWJsZXMgdGhhdCBjb250YWluIHJlcGVhdGVkIGNoYXJhY3RlcnMuIFlvdSdsbCBsZWFybiBtb3JlIGFib3V0IHJlZ3VsYXIgZXhwcmVzc2lvbnMgaW4gc3RyaW5ncy4NCiogbnVtX3JhbmdlKCJ4IiwgMTozKSBtYXRjaGVzIHgxLCB4MiBhbmQgeDMuDQoNCjxici8+DQoNCg0KIyMjQWRkIG5ldyB2YXJpYWJsZXMgdXNpbmcgbXV0YXRlKCksIHVzaW5nIHRoZSBwaXBlIG9wZXJhdG9yICU+JSB0byBwYXNzIG9uIGRhdGENCmBgYHtyfQ0KZmxpZ2h0c19zdWJzZXQgPC0gIHNlbGVjdChueWNmbGlnaHRzMTM6OmZsaWdodHMsZGVzdCx0YWlsbnVtLGNvbnRhaW5zKCJkZWxheSIpLGRpc3RhbmNlLGFpcl90aW1lKSAlPiUNCg0KICAgICAgbXV0YXRlKA0KICAgICAgICAgICBhaXJfc3BlZWQ9ZGlzdGFuY2UvYWlyX3RpbWUqNjAsICNNcGgNCiAgICAgICAgICAgY3Vtc3VtX2RlbGF5PWN1bXN1bShhcnJfZGVsYXkpLA0KICAgICAgICAgICBsYWdnZWRfZGVsYXkyPWxhZyh4ID0gY3Vtc3VtX2RlbGF5LDIpKSANCg0KZmxpZ2h0c19zdWJzZXQNCg0KYGBgDQoNCg0KIyMjU3VtbWFyaXppbmcgZGF0YSB3aXRoIHN1bW1hcmlzZSgpDQpgYGB7cn0NCnN1bW1hcmlzZShmbGlnaHRzX3N1YnNldCxtZWFuX3NwZWVkPW1lYW4oYWlyX3NwZWVkLG5hLnJtPVQpKQ0KYGBgDQoNCkNvbWJpbmUgd2l0aCBncm91cF9ieSBmb3IgbW9yZSB1c2VmdWxuZXNzLiANCg0KSGVyZSB3ZSBnZXQgdGhlIHRvcDEwIGZhc3Rlc3QgdGFpbG51bXMgcHIuIGRlc3RpbmF0aW9uLg0KDQpgYGB7cn0NCmdyb3VwX2J5KGZsaWdodHNfc3Vic2V0LGRlc3QsdGFpbG51bSkgJT4lIA0KICAgIHN1bW1hcmlzZShtZWFuX3NwZWVkX2Rlc3QgPSBtZWFuKGFpcl9zcGVlZCxuYS5ybT1UKSxuPW4oKSkgJT4lICAjQ2FsY3VsYXRlIHNwZWVkIGZvciBwci4gdGFpbG51bSBwciBkZXN0DQogICAgICAgICAgYXJyYW5nZShkZXN0LGRlc2MobWVhbl9zcGVlZF9kZXN0KSkgJT4lICAgICAgICAgICAgICAgICAgICNTb3J0IGJ5IGRlc3RpbmF0aW9uIGFuZCB0YWlsbnVtDQogICAgICAgICAgICBtdXRhdGUocm93bj1yb3dfbnVtYmVyKCkpICU+JSAgICAgICAgICAgICAgICAgICAgICAgICAgICNHZXQgdGhlIChpbnRyYSBncm91cCkgcm93IG51bWJlciAocmFuaykNCiAgICAgICAgICAgICAgZmlsdGVyKHJvd24gPD0gMTApICU+JSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgI0tlZXAgb25seSB0aGUgMTAgZmFzdGVzdCB0YWlsbnVtcyBwciBkZXN0DQogICAgICAgICAgICAgICAgYXJyYW5nZShkZXN0LGRlc2MobWVhbl9zcGVlZF9kZXN0KSkgICAgICAgICAgICAgICAgICNBcnJhbmdlIGJ5IGRlc3RpbmF0aW9uIGFuZCBmYXN0ZXN0IHRhaWxudW1zDQpgYGANCg0KQW5vdGhlciBleGFtcGxlOiBGb3IgZWFjaCBkZXN0aW5hdGlvbiwgY29tcHV0ZSB0aGUgdG90YWwgbWludXRlcyBvZiBkZWxheS4gRm9yIGVhY2gsIGZsaWdodCwgY29tcHV0ZSB0aGUgcHJvcG9ydGlvbiBvZiB0aGUgdG90YWwgZGVsYXkgZm9yIGl0cyBkZXN0aW5hdGlvbi4NCg0KYGBge3J9DQpncm91cF9ieShmbGlnaHRzX3N1YnNldCxkZXN0KSAlPiUNCiAgbXV0YXRlKGRlc3RfZGVsYXk9c3VtKGFycl9kZWxheSxuYS5ybT1UKSxwcm9wX2RlbGF5PWFycl9kZWxheS9kZXN0X2RlbGF5KSAlPiUNCiAgICBhcnJhbmdlKGRlc3QsZGVzYyhwcm9wX2RlbGF5KSkgJT4lDQogICAgICBzZWxlY3QoZGVzdCx0YWlsbnVtLGFycl9kZWxheSxkZXN0X2RlbGF5LHByb3BfZGVsYXkpDQpgYGANCg0KDQoNCiMjVGlkeSBkYXRhIHdpdGggc3ByZWFkLCBnYXRoZXIsIHNlcGFyYXRlIGFuZCB1bml0ZQ0KDQojIyNHYXRoZXJpbmcNCg0KV2UgY2FuIHVzZSB0aGUgZ2F0aGVyIGZ1bmN0aW9uIHRvICoqZ2F0aGVyKiogZGF0YSB3aGVyZSB0aGUgY29sdW1ucyBjb250YWluIHZhbHVlcyBvZiBhIHZhcmlhYmxlIHJhdGhlciB0aGFuIHZhcmlhYmxlczoNCg0KYGBge3J9DQp0YWJsZTRhDQpgYGANCg0KSGVyZSB0aGUgdHdvIGNvbHVtbnMgMTk5OSBhbmQgMjAwMCBhcmUgbm90IHZhcmlhYmxlcyBidXQgdmFsdWVzIG9mIGEgdmFyaWFibGUgaW5kaWNhdGluZyB0aW1lIG9mIG9ic2VydmF0aW9uLiBXZSBjYW4gdXNlIGdhdGhlciB0byBwdXQgdGhlc2UgdmFyaWFibGVzIGluIHRvIGEgY29sdW1uIGNhbGxlZCB5ZWFyIGluc3RlYWQ6DQoNCmBgYHtyfQ0KdGFibGU0YSAlPiUgDQogIGdhdGhlcihgMTk5OWAsYDIwMDBgLCBrZXk9IlllYXIiLHZhbHVlPSJDYXNlcyIpICU+JSANCiAgICBhcnJhbmdlKGNvdW50cnkpIA0KYGBgDQoNCkdhdGhlciBtYWtlcyBsb25nIGRhdGEgd2lkZXIgYW5kIHNob3J0ZXIuICANCg0KDQojIyNTcHJlYWRpbmcNCg0KU3ByZWFkIGlzIHVzZWQgd2hlbiBhbiBvYnNlcnZhdGlvbiBpcyBsb2NhdGVkIG92ZXIgc2V2ZXJhbCByb3dzOg0KDQpgYGB7cn0NCnRhYmxlMg0KYGBgDQoNCkhlcmUgZWFjaCBvYnNlcnZhdGlvbiAoYSBjb3VudHJ5IGluIGEgeWVhcikgaXMgc3ByZWFkIGFjcm9zcyB0d28gcm93cyBhbmQgaW5kaWNhdGVkIGJ5IHRoZSB0eXBlIHZhcmlhYmxlLiANCldlIGNhbiB1c2UgKipzcHJlYWQqKiB0byBmaXggdGhpczoNCg0KYGBge3J9DQp0YWJsZTIgJT4lIHNwcmVhZChrZXk9dHlwZSx2YWx1ZT1jb3VudCkNCmBgYA0KDQojIyNTZXBhcmF0ZQ0KDQpXZSB1c2Ugc2VwYXJhdGUgdG8gZGl2aWRlIHZhcmlhYmxlcyB0aGF0IGFyZSBwdXQgaW4gdG8gb25lIGNvbHVtbiB3aGVuIHRoZXkgYmVsb25nIGluIHNldmVyYWwgY29sdW1uczoNCg0KYGBge3J9DQp0YWJsZTMNCmBgYA0KDQpXZSB1c2Ugc2VwYXJhdGUoKSB0byBwdXQgdGhlIGNvbXBvbmVudHMgb2YgdGhlIHJhdGUgaW50byB0d28gbmV3IGNvbHVtbnM6IENhc2VzICYgUG9wdWxhdGlvbg0KDQpgYGB7cn0NCnRhYmxlMyAlPiUgc2VwYXJhdGUocmF0ZSxzZXA9Ii8iLGludG8gPSBjKCJDYXNlcyIsIlBvcHVsYXRpb24iKSkNCmBgYA0KDQo=