The content presented here is somewhat based on the tutorial on data manipulation by Hadley Wickham (author of dplyr, tidyr, ggplot, etc).
Run the following script to make sure all necessary packages are installed
source("http://www.ucl.ac.uk/~uctqiax/PUBLG100/setup.R")
## [1] "Checking package: readr"
## [1] "Checking package: tidyr"
## [1] "Checking package: dplyr"
Now load these packages
library(readr)
library(tidyr)
library(dplyr)
And clear the environment to make sure there are no leftover variables
rm(list = ls())
| Dataset | Filename | Description |
|---|---|---|
| flights | http://ucl.ac.uk/~uctqiax/data/flights.csv | Flight departures from Houston in 2011 |
| weather | http://ucl.ac.uk/~uctqiax/data/weather.csv | Hourly weather |
| planes | http://ucl.ac.uk/~uctqiax/data/planes.csv | Metadata for planes |
| airports | http://ucl.ac.uk/~uctqiax/data/airports.csv | Metadata for airports |
We’re going to use the readr package which provides improved functions for reading datasets from files.
flights <- read_csv("http://ucl.ac.uk/~uctqiax/data/flights.csv")
weather <- read_csv("http://ucl.ac.uk/~uctqiax/data/weather.csv")
planes <- read_csv("http://ucl.ac.uk/~uctqiax/data/planes.csv")
airports <- read_csv("http://ucl.ac.uk/~uctqiax/data/airports.csv")
## Warning: 2 parsing failures.
## row col expected actual
## 1252 airport delimiter or quote B
## 1252 airport delimiter or quote
You can ignore the warnings from the airport dataset.
Dplyr makes it easy to “chain” functions together using the pipe operator %>%. The following diagram illustrates the general concept of pipes where data flows from one pipe to another until all the processing is completed.
The syntax of the pipe operator %>% might appear unusual at first, but once you get used to it you’ll start to appreciate it’s power and flexibility.
Let’s start off by examining the flights data
flights
## Source: local data frame [227,496 x 14]
##
## date hour minute dep arr dep_delay arr_delay
## (time) (int) (int) (int) (int) (int) (int)
## 1 2011-01-01 12:00:00 14 0 1400 1500 0 -10
## 2 2011-01-02 12:00:00 14 1 1401 1501 1 -9
## 3 2011-01-03 12:00:00 13 52 1352 1502 -8 -8
## 4 2011-01-04 12:00:00 14 3 1403 1513 3 3
## 5 2011-01-05 12:00:00 14 5 1405 1507 5 -3
## 6 2011-01-06 12:00:00 13 59 1359 1503 -1 -7
## 7 2011-01-07 12:00:00 13 59 1359 1509 -1 -1
## 8 2011-01-08 12:00:00 13 55 1355 1454 -5 -16
## 9 2011-01-09 12:00:00 14 43 1443 1554 43 44
## 10 2011-01-10 12:00:00 14 43 1443 1553 43 43
## .. ... ... ... ... ... ... ...
## Variables not shown: carrier (chr), flight (int), dest (chr), plane (chr),
## cancelled (int), time (int), dist (int)
weather
## Source: local data frame [8,723 x 14]
##
## date hour temp dew_point humidity pressure visibility wind_dir
## (date) (int) (dbl) (dbl) (int) (dbl) (dbl) (chr)
## 1 2011-01-01 0 59.0 28.9 32 29.86 10 NNE
## 2 2011-01-01 1 57.2 28.4 33 29.88 10 NNE
## 3 2011-01-01 2 55.4 28.4 36 29.93 10 NNW
## 4 2011-01-01 3 53.6 28.4 38 29.94 10 North
## 5 2011-01-01 4 NA NA NA 29.99 10 NNW
## 6 2011-01-01 5 NA NA NA 30.02 10 North
## 7 2011-01-01 6 53.1 17.1 24 30.05 10 North
## 8 2011-01-01 7 53.1 16.0 23 30.07 10 North
## 9 2011-01-01 8 54.0 18.0 24 30.09 10 North
## 10 2011-01-01 9 55.4 17.6 23 30.09 10 NNE
## .. ... ... ... ... ... ... ... ...
## Variables not shown: wind_dir2 (int), wind_speed (dbl), gust_speed (dbl),
## precip (dbl), conditions (chr), events (chr)
planes
## Source: local data frame [2,853 x 9]
##
## plane year mfr model no.eng no.seats speed
## (chr) (int) (chr) (chr) (int) (int) (int)
## 1 N576AA 1991 MCDONNELL DOUGLAS DC-9-82(MD-82) 2 172 NA
## 2 N557AA 1993 MARZ BARRY KITFOX IV 1 2 NA
## 3 N403AA 1974 RAVEN S55A NA 1 60
## 4 N492AA 1989 MCDONNELL DOUGLAS DC-9-82(MD-82) 2 172 NA
## 5 N262AA 1985 MCDONNELL DOUGLAS DC-9-82(MD-82) 2 172 NA
## 6 N493AA 1989 MCDONNELL DOUGLAS DC-9-82(MD-82) 2 172 NA
## 7 N477AA 1988 MCDONNELL DOUGLAS DC-9-82(MD-82) 2 172 NA
## 8 N476AA 1988 MCDONNELL DOUGLAS DC-9-82(MD-82) 2 172 NA
## 9 N504AA NA AUTHIER ANTHONY P TIERRA II 1 2 NA
## 10 N565AA 1987 MCDONNELL DOUGLAS DC-9-83(MD-83) 2 172 NA
## .. ... ... ... ... ... ... ...
## Variables not shown: engine (chr), type (chr)
airports
## Source: local data frame [3,376 x 7]
##
## iata airport city state country lat
## (chr) (chr) (chr) (chr) (chr) (dbl)
## 1 00M Thigpen Bay Springs MS USA 31.95376
## 2 00R Livingston Municipal Livingston TX USA 30.68586
## 3 00V Meadow Lake Colorado Springs CO USA 38.94575
## 4 01G Perry-Warsaw Perry NY USA 42.74135
## 5 01J Hilliard Airpark Hilliard FL USA 30.68801
## 6 01M Tishomingo County Belmont MS USA 34.49167
## 7 02A Gragg-Wade Clanton AL USA 32.85049
## 8 02C Capitol Brookfield WI USA 43.08751
## 9 02G Columbiana County East Liverpool OH USA 40.67331
## 10 03D Memphis Memorial Memphis MO USA 40.44726
## .. ... ... ... ... ... ...
## Variables not shown: long (dbl)
Notice that because we used read_csv(), the data frame we received now prints nicely without having to use the head() function and does not clutter your screen.
Now let’s do some simple tasks with dplyr using the pipe operator
dest, time and dist)flights %>%
select(dest, time, dist)
## Source: local data frame [227,496 x 3]
##
## dest time dist
## (chr) (int) (int)
## 1 DFW 40 224
## 2 DFW 45 224
## 3 DFW 48 224
## 4 DFW 39 224
## 5 DFW 44 224
## 6 DFW 45 224
## 7 DFW 43 224
## 8 DFW 40 224
## 9 DFW 41 224
## 10 DFW 45 224
## .. ... ... ...
Add the arrival delay (arr_delay) and departure delay (dep_delay) columns as well.
flights %>%
select(dest, time, dist, arr_delay, dep_delay)
## Source: local data frame [227,496 x 5]
##
## dest time dist arr_delay dep_delay
## (chr) (int) (int) (int) (int)
## 1 DFW 40 224 -10 0
## 2 DFW 45 224 -9 1
## 3 DFW 48 224 -8 -8
## 4 DFW 39 224 3 3
## 5 DFW 44 224 -3 5
## 6 DFW 45 224 -7 -1
## 7 DFW 43 224 -1 -1
## 8 DFW 40 224 -16 -5
## 9 DFW 41 224 44 43
## 10 DFW 45 224 43 43
## .. ... ... ... ... ...
Other ways to do the same
flights %>%
select(dest, time, dist, ends_with("delay"))
## Source: local data frame [227,496 x 5]
##
## dest time dist dep_delay arr_delay
## (chr) (int) (int) (int) (int)
## 1 DFW 40 224 0 -10
## 2 DFW 45 224 1 -9
## 3 DFW 48 224 -8 -8
## 4 DFW 39 224 3 3
## 5 DFW 44 224 5 -3
## 6 DFW 45 224 -1 -7
## 7 DFW 43 224 -1 -1
## 8 DFW 40 224 -5 -16
## 9 DFW 41 224 43 44
## 10 DFW 45 224 43 43
## .. ... ... ... ... ...
and …
flights %>%
select(dest, time, dist, contains("delay"))
## Source: local data frame [227,496 x 5]
##
## dest time dist dep_delay arr_delay
## (chr) (int) (int) (int) (int)
## 1 DFW 40 224 0 -10
## 2 DFW 45 224 1 -9
## 3 DFW 48 224 -8 -8
## 4 DFW 39 224 3 3
## 5 DFW 44 224 5 -3
## 6 DFW 45 224 -1 -7
## 7 DFW 43 224 -1 -1
## 8 DFW 40 224 -5 -16
## 9 DFW 41 224 43 44
## 10 DFW 45 224 43 43
## .. ... ... ... ... ...
Select all columns from date to arr
flights %>%
select(date:arr)
## Source: local data frame [227,496 x 5]
##
## date hour minute dep arr
## (time) (int) (int) (int) (int)
## 1 2011-01-01 12:00:00 14 0 1400 1500
## 2 2011-01-02 12:00:00 14 1 1401 1501
## 3 2011-01-03 12:00:00 13 52 1352 1502
## 4 2011-01-04 12:00:00 14 3 1403 1513
## 5 2011-01-05 12:00:00 14 5 1405 1507
## 6 2011-01-06 12:00:00 13 59 1359 1503
## 7 2011-01-07 12:00:00 13 59 1359 1509
## 8 2011-01-08 12:00:00 13 55 1355 1454
## 9 2011-01-09 12:00:00 14 43 1443 1554
## 10 2011-01-10 12:00:00 14 43 1443 1553
## .. ... ... ... ... ...
Select all except plane column using the minus sign
flights %>%
select(-plane)
## Source: local data frame [227,496 x 13]
##
## date hour minute dep arr dep_delay arr_delay
## (time) (int) (int) (int) (int) (int) (int)
## 1 2011-01-01 12:00:00 14 0 1400 1500 0 -10
## 2 2011-01-02 12:00:00 14 1 1401 1501 1 -9
## 3 2011-01-03 12:00:00 13 52 1352 1502 -8 -8
## 4 2011-01-04 12:00:00 14 3 1403 1513 3 3
## 5 2011-01-05 12:00:00 14 5 1405 1507 5 -3
## 6 2011-01-06 12:00:00 13 59 1359 1503 -1 -7
## 7 2011-01-07 12:00:00 13 59 1359 1509 -1 -1
## 8 2011-01-08 12:00:00 13 55 1355 1454 -5 -16
## 9 2011-01-09 12:00:00 14 43 1443 1554 43 44
## 10 2011-01-10 12:00:00 14 43 1443 1553 43 43
## .. ... ... ... ... ... ... ...
## Variables not shown: carrier (chr), flight (int), dest (chr), cancelled
## (int), time (int), dist (int)
The filter() function returns rows with matching conditions. Let’s build on the previous exercise and find all flights to Boston (BOS):
flights %>%
select(dest, time, dist) %>%
filter(dest == "BOS")
## Source: local data frame [1,752 x 3]
##
## dest time dist
## (chr) (int) (int)
## 1 BOS 195 1597
## 2 BOS 188 1597
## 3 BOS 190 1597
## 4 BOS 188 1597
## 5 BOS 180 1597
## 6 BOS 190 1597
## 7 BOS 185 1597
## 8 BOS 198 1597
## 9 BOS 194 1597
## 10 BOS 203 1597
## .. ... ... ...
Now let’s do the filter first and then select the columns
flights %>%
filter(dest == "BOS") %>%
select(dest, time, dist)
## Source: local data frame [1,752 x 3]
##
## dest time dist
## (chr) (int) (int)
## 1 BOS 195 1597
## 2 BOS 188 1597
## 3 BOS 190 1597
## 4 BOS 188 1597
## 5 BOS 180 1597
## 6 BOS 190 1597
## 7 BOS 185 1597
## 8 BOS 198 1597
## 9 BOS 194 1597
## 10 BOS 203 1597
## .. ... ... ...
In this case the order doesn’t matter, but when using pipes make sure you understand that each function is executed in sequence and the results are then fed to the next one.
Find all flights that match the following conditions:
Here’s a brief summary of operators you can use
Source: Hadley Wickham’s Dplyr Tutorial
The arrange() function is used to sort the rows based on one or more columns
flights %>%
arrange(dest)
## Source: local data frame [227,496 x 14]
##
## date hour minute dep arr dep_delay arr_delay
## (time) (int) (int) (int) (int) (int) (int)
## 1 2011-01-31 12:00:00 17 33 1733 1901 -2 -4
## 2 2011-01-30 12:00:00 17 50 1750 1913 15 8
## 3 2011-01-29 12:00:00 17 32 1732 1837 -3 -23
## 4 2011-01-28 12:00:00 17 33 1733 1848 -2 -17
## 5 2011-01-27 12:00:00 17 41 1741 1854 6 -11
## 6 2011-01-26 12:00:00 17 32 1732 1853 -3 -12
## 7 2011-01-25 12:00:00 17 29 1729 1858 -6 -7
## 8 2011-01-24 12:00:00 17 34 1734 1845 -1 -20
## 9 2011-01-23 12:00:00 17 35 1735 1853 0 -12
## 10 2011-01-22 12:00:00 17 33 1733 1843 -2 -17
## .. ... ... ... ... ... ... ...
## Variables not shown: carrier (chr), flight (int), dest (chr), plane (chr),
## cancelled (int), time (int), dist (int)
The mutate() function is used to create new variables.
Up until now we’ve only been examining the dataset but haven’t made any changes to it. All our functions so far have simply displayed the results on screen but haven’t created or modified existing variables. Let’s see how we can create a new variable called speed based on the distance and duration in the flights dataframe.
In this exercise we’re adding a new variable to an existing dataframe so we’ll just overwrite the flights variable with the one that has a speed column
flights <- flights %>%
mutate(speed = dist / (time / 60))
Let’s count the number of flights departing each day.
flights %>%
group_by(date) %>%
summarise(count = n())
## Source: local data frame [365 x 2]
##
## date count
## (time) (int)
## 1 2011-01-01 12:00:00 552
## 2 2011-01-02 12:00:00 678
## 3 2011-01-03 12:00:00 702
## 4 2011-01-04 12:00:00 583
## 5 2011-01-05 12:00:00 590
## 6 2011-01-06 12:00:00 660
## 7 2011-01-07 12:00:00 661
## 8 2011-01-08 12:00:00 500
## 9 2011-01-09 12:00:00 602
## 10 2011-01-10 12:00:00 659
## .. ... ...
Here’s a nice little trick. You can use View() to look at the results of a pipe operation without creating new variables.
flights %>%
group_by(date) %>%
summarise(count = n()) %>%
View()
Of course, often times we’d want to save the summary in a variable for further analysis.
Let’s find the average departure delay for each destination
delays <- flights %>%
group_by(dest) %>%
summarise(mean = mean(dep_delay))
delays
## Source: local data frame [116 x 2]
##
## dest mean
## (chr) (dbl)
## 1 ABQ NA
## 2 AEX NA
## 3 AGS 10.000
## 4 AMA NA
## 5 ANC 24.952
## 6 ASE NA
## 7 ATL NA
## 8 AUS NA
## 9 AVL NA
## 10 BFL NA
## .. ... ...
hourly <- flights %>%
filter(cancelled == 0) %>%
mutate(time = hour + minute / 60) %>%
group_by(time) %>%
summarise(
arr_delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
plot(arr_delay ~ time, data = hourly, xaxt = 'n')
axis(1, at = seq(0, 24), las=2)
What if we wanted to see airport delays based on a state or city? We have the airports data that gives us the location of the airport so we can just merge it with the flights data.
flights %>%
group_by(dest) %>%
filter(!is.na(arr_delay)) %>%
summarise(delay = mean(arr_delay)) %>%
arrange(desc(delay)) %>%
left_join(airports, by = c("dest" = "iata"))
## Source: local data frame [116 x 8]
##
## dest delay airport city
## (chr) (dbl) (chr) (chr)
## 1 ANC 26.08065 Ted Stevens Anchorage International Anchorage
## 2 CID 17.80049 Eastern Iowa Cedar Rapids
## 3 DSM 15.95110 Des Moines International Des Moines
## 4 SFO 14.89036 San Francisco International San Francisco
## 5 BPT 14.33333 Southeast Texas Regional Beaumont/Port Arthur
## 6 GRR 13.71729 Kent County International Grand Rapids
## 7 DAY 13.67117 James M Cox Dayton Intl Dayton
## 8 VPS 12.45718 Eglin Air Force Base Valparaiso
## 9 ECP 12.42222 NA NA
## 10 SAV 12.33137 Savannah International Savannah
## .. ... ... ... ...
## Variables not shown: state (chr), country (chr), lat (dbl), long (dbl)
Remember the wonderful WDI Dataset from Term 1? It’s fairly common for datasets from public sources to come in wide formats that need to be reshaped. Let’s go over the WDI example step by step to see how that’s done.
We’re using a tiny sample of the original WDI dataset here to simplify the tasks. Let’s load the dataset and see what it looks like.
wdi <- read_csv("http://ucl.ac.uk/~uctqiax/data/wdi.csv", na = "..")
wdi
## Source: local data frame [5 x 7]
##
## ¬Series.Name Series.Code Country.Name Country.Code X1995.YR1995
## (chr) (chr) (chr) (chr) (dbl)
## 1 Maternal mortality SH.STA.MMRT France FRA 15.000000
## 2 Maternal mortality SH.STA.MMRT Spain ESP 6.000000
## 3 Maternal mortality SH.STA.MMRT NA
## 4 Health expenditure SH.XPD.TOTL.ZS France FRA 10.355906
## 5 Health expenditure SH.XPD.TOTL.ZS Spain ESP 7.444592
## Variables not shown: X2000.YR2000 (dbl), X2005.YR2005 (dbl)
But ideally, we’d like our data to look something like this:
## Source: local data frame [6 x 5]
##
## CountryCode CountryName Year MaternalMortality HealthExpenditure
## (chr) (chr) (dbl) (dbl) (dbl)
## 1 ESP Spain 1995 6 7.444592
## 2 ESP Spain 2000 5 7.214756
## 3 ESP Spain 2005 5 8.288271
## 4 FRA France 1995 15 10.355906
## 5 FRA France 2000 12 10.084833
## 6 FRA France 2005 10 10.932626
We can see that some country names and codes are blank, so let’s get rid of them first
wdi %>%
filter(Country.Code != "")
## Source: local data frame [4 x 7]
##
## ¬Series.Name Series.Code Country.Name Country.Code X1995.YR1995
## (chr) (chr) (chr) (chr) (dbl)
## 1 Maternal mortality SH.STA.MMRT France FRA 15.000000
## 2 Maternal mortality SH.STA.MMRT Spain ESP 6.000000
## 3 Health expenditure SH.XPD.TOTL.ZS France FRA 10.355906
## 4 Health expenditure SH.XPD.TOTL.ZS Spain ESP 7.444592
## Variables not shown: X2000.YR2000 (dbl), X2005.YR2005 (dbl)
So far so good. Note that we’re not making any changes yet so we can just add one function at a time to the pipeline and check the results. Once we’re satisfied with the results we save them to a variable.
We need to gather all columns that start with “X” that contain per-year values for each series (for example X1960..YR1960)
wdi %>%
filter(Country.Code != "") %>%
gather(Year, Value, starts_with("X"))
## Source: local data frame [12 x 6]
##
## ¬Series.Name Series.Code Country.Name Country.Code
## (chr) (chr) (chr) (chr)
## 1 Maternal mortality SH.STA.MMRT France FRA
## 2 Maternal mortality SH.STA.MMRT Spain ESP
## 3 Health expenditure SH.XPD.TOTL.ZS France FRA
## 4 Health expenditure SH.XPD.TOTL.ZS Spain ESP
## 5 Maternal mortality SH.STA.MMRT France FRA
## 6 Maternal mortality SH.STA.MMRT Spain ESP
## 7 Health expenditure SH.XPD.TOTL.ZS France FRA
## 8 Health expenditure SH.XPD.TOTL.ZS Spain ESP
## 9 Maternal mortality SH.STA.MMRT France FRA
## 10 Maternal mortality SH.STA.MMRT Spain ESP
## 11 Health expenditure SH.XPD.TOTL.ZS France FRA
## 12 Health expenditure SH.XPD.TOTL.ZS Spain ESP
## Variables not shown: Year (chr), Value (dbl)
Now all values are in the Value column, so we need to spread them out to individual columns based on the Series.Code. We have to make sure that we only keep the columns that make the country-year observations unique. We use select() to keep Country.Code, Country.Name, Year, plus the two columns (Series.Code and Value) that will make up the key-value pair for the spread() function.
wdi %>%
filter(Country.Code != "") %>%
gather(Year, Value, starts_with("X")) %>%
select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
spread(Series.Code, Value)
## Source: local data frame [6 x 5]
##
## Country.Code Country.Name Year SH.STA.MMRT SH.XPD.TOTL.ZS
## (chr) (chr) (chr) (dbl) (dbl)
## 1 ESP Spain X1995.YR1995 6 7.444592
## 2 ESP Spain X2000.YR2000 5 7.214756
## 3 ESP Spain X2005.YR2005 5 8.288271
## 4 FRA France X1995.YR1995 15 10.355906
## 5 FRA France X2000.YR2000 12 10.084833
## 6 FRA France X2005.YR2005 10 10.932626
It looks good, so we can rename the variables to something meaningful.
wdi %>%
filter(Country.Code != "") %>%
gather(Year, Value, starts_with("X")) %>%
select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
spread(Series.Code, Value) %>%
rename(CountryName = Country.Name,
CountryCode = Country.Code,
MaternalMortality = SH.STA.MMRT,
HealthExpenditure = SH.XPD.TOTL.ZS)
## Source: local data frame [6 x 5]
##
## CountryCode CountryName Year MaternalMortality HealthExpenditure
## (chr) (chr) (chr) (dbl) (dbl)
## 1 ESP Spain X1995.YR1995 6 7.444592
## 2 ESP Spain X2000.YR2000 5 7.214756
## 3 ESP Spain X2005.YR2005 5 8.288271
## 4 FRA France X1995.YR1995 15 10.355906
## 5 FRA France X2000.YR2000 12 10.084833
## 6 FRA France X2005.YR2005 10 10.932626
Now we just need to extract the 4-digit year from the Year column. The Year column is formatted as X1995.YR1995 which means that the 4-digits for the year are in position 2,3,4, and 5. We can use the substring() function to take all the characters from position 2 to 5 and assign it back to the Year column.
Since this is the last step we might as well assign the results to a new variable.
wdi_long <- wdi %>%
filter(Country.Code != "") %>%
gather(Year, Value, starts_with("X")) %>%
select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
spread(Series.Code, Value) %>%
rename(CountryName = Country.Name,
CountryCode = Country.Code,
MaternalMortality = SH.STA.MMRT,
HealthExpenditure = SH.XPD.TOTL.ZS) %>%
mutate(Year = as.numeric(substring(Year, 2, 5)))
wdi_long
## Source: local data frame [6 x 5]
##
## CountryCode CountryName Year MaternalMortality HealthExpenditure
## (chr) (chr) (dbl) (dbl) (dbl)
## 1 ESP Spain 1995 6 7.444592
## 2 ESP Spain 2000 5 7.214756
## 3 ESP Spain 2005 5 8.288271
## 4 FRA France 1995 15 10.355906
## 5 FRA France 2000 12 10.084833
## 6 FRA France 2005 10 10.932626
You can assign it back to wdi if you want, but we’re using a different name in case we make a mistake and have to start again. This way we would’ve have to reload the file all over again.