Data Manipulation in R

The content presented here is somewhat based on the tutorial on data manipulation by Hadley Wickham (author of dplyr, tidyr, ggplot, etc).

Setup

Run the following script to make sure all necessary packages are installed

source("http://www.ucl.ac.uk/~uctqiax/PUBLG100/setup.R")

## [1] "Checking package: readr"
## [1] "Checking package: tidyr"
## [1] "Checking package: dplyr"

Now load these packages

library(readr)
library(tidyr)
library(dplyr)

And clear the environment to make sure there are no leftover variables

rm(list = ls())

Flights Data

Dataset	Filename	Description
flights	http://ucl.ac.uk/~uctqiax/data/flights.csv	Flight departures from Houston in 2011
weather	http://ucl.ac.uk/~uctqiax/data/weather.csv	Hourly weather
planes	http://ucl.ac.uk/~uctqiax/data/planes.csv	Metadata for planes
airports	http://ucl.ac.uk/~uctqiax/data/airports.csv	Metadata for airports

We’re going to use the readr package which provides improved functions for reading datasets from files.

flights <- read_csv("http://ucl.ac.uk/~uctqiax/data/flights.csv")
weather <- read_csv("http://ucl.ac.uk/~uctqiax/data/weather.csv")
planes <- read_csv("http://ucl.ac.uk/~uctqiax/data/planes.csv")
airports <- read_csv("http://ucl.ac.uk/~uctqiax/data/airports.csv")

## Warning: 2 parsing failures.
##  row     col           expected actual
## 1252 airport delimiter or quote      B
## 1252 airport delimiter or quote

You can ignore the warnings from the airport dataset.

Data pipelines

Dplyr makes it easy to “chain” functions together using the pipe operator %>%. The following diagram illustrates the general concept of pipes where data flows from one pipe to another until all the processing is completed.

The syntax of the pipe operator %>% might appear unusual at first, but once you get used to it you’ll start to appreciate it’s power and flexibility.

Let’s start off by examining the flights data

flights

## Source: local data frame [227,496 x 14]
## 
##                   date  hour minute   dep   arr dep_delay arr_delay
##                 (time) (int)  (int) (int) (int)     (int)     (int)
## 1  2011-01-01 12:00:00    14      0  1400  1500         0       -10
## 2  2011-01-02 12:00:00    14      1  1401  1501         1        -9
## 3  2011-01-03 12:00:00    13     52  1352  1502        -8        -8
## 4  2011-01-04 12:00:00    14      3  1403  1513         3         3
## 5  2011-01-05 12:00:00    14      5  1405  1507         5        -3
## 6  2011-01-06 12:00:00    13     59  1359  1503        -1        -7
## 7  2011-01-07 12:00:00    13     59  1359  1509        -1        -1
## 8  2011-01-08 12:00:00    13     55  1355  1454        -5       -16
## 9  2011-01-09 12:00:00    14     43  1443  1554        43        44
## 10 2011-01-10 12:00:00    14     43  1443  1553        43        43
## ..                 ...   ...    ...   ...   ...       ...       ...
## Variables not shown: carrier (chr), flight (int), dest (chr), plane (chr),
##   cancelled (int), time (int), dist (int)

weather

## Source: local data frame [8,723 x 14]
## 
##          date  hour  temp dew_point humidity pressure visibility wind_dir
##        (date) (int) (dbl)     (dbl)    (int)    (dbl)      (dbl)    (chr)
## 1  2011-01-01     0  59.0      28.9       32    29.86         10      NNE
## 2  2011-01-01     1  57.2      28.4       33    29.88         10      NNE
## 3  2011-01-01     2  55.4      28.4       36    29.93         10      NNW
## 4  2011-01-01     3  53.6      28.4       38    29.94         10    North
## 5  2011-01-01     4    NA        NA       NA    29.99         10      NNW
## 6  2011-01-01     5    NA        NA       NA    30.02         10    North
## 7  2011-01-01     6  53.1      17.1       24    30.05         10    North
## 8  2011-01-01     7  53.1      16.0       23    30.07         10    North
## 9  2011-01-01     8  54.0      18.0       24    30.09         10    North
## 10 2011-01-01     9  55.4      17.6       23    30.09         10      NNE
## ..        ...   ...   ...       ...      ...      ...        ...      ...
## Variables not shown: wind_dir2 (int), wind_speed (dbl), gust_speed (dbl),
##   precip (dbl), conditions (chr), events (chr)

planes

## Source: local data frame [2,853 x 9]
## 
##     plane  year               mfr          model no.eng no.seats speed
##     (chr) (int)             (chr)          (chr)  (int)    (int) (int)
## 1  N576AA  1991 MCDONNELL DOUGLAS DC-9-82(MD-82)      2      172    NA
## 2  N557AA  1993        MARZ BARRY      KITFOX IV      1        2    NA
## 3  N403AA  1974             RAVEN           S55A     NA        1    60
## 4  N492AA  1989 MCDONNELL DOUGLAS DC-9-82(MD-82)      2      172    NA
## 5  N262AA  1985 MCDONNELL DOUGLAS DC-9-82(MD-82)      2      172    NA
## 6  N493AA  1989 MCDONNELL DOUGLAS DC-9-82(MD-82)      2      172    NA
## 7  N477AA  1988 MCDONNELL DOUGLAS DC-9-82(MD-82)      2      172    NA
## 8  N476AA  1988 MCDONNELL DOUGLAS DC-9-82(MD-82)      2      172    NA
## 9  N504AA    NA AUTHIER ANTHONY P      TIERRA II      1        2    NA
## 10 N565AA  1987 MCDONNELL DOUGLAS DC-9-83(MD-83)      2      172    NA
## ..    ...   ...               ...            ...    ...      ...   ...
## Variables not shown: engine (chr), type (chr)

airports

## Source: local data frame [3,376 x 7]
## 
##     iata              airport             city state country      lat
##    (chr)                (chr)            (chr) (chr)   (chr)    (dbl)
## 1    00M              Thigpen      Bay Springs    MS     USA 31.95376
## 2    00R Livingston Municipal       Livingston    TX     USA 30.68586
## 3    00V          Meadow Lake Colorado Springs    CO     USA 38.94575
## 4    01G         Perry-Warsaw            Perry    NY     USA 42.74135
## 5    01J     Hilliard Airpark         Hilliard    FL     USA 30.68801
## 6    01M    Tishomingo County          Belmont    MS     USA 34.49167
## 7    02A           Gragg-Wade          Clanton    AL     USA 32.85049
## 8    02C              Capitol       Brookfield    WI     USA 43.08751
## 9    02G    Columbiana County   East Liverpool    OH     USA 40.67331
## 10   03D     Memphis Memorial          Memphis    MO     USA 40.44726
## ..   ...                  ...              ...   ...     ...      ...
## Variables not shown: long (dbl)

Notice that because we used read_csv(), the data frame we received now prints nicely without having to use the head() function and does not clutter your screen.

Now let’s do some simple tasks with dplyr using the pipe operator

Select the destination, duration and distance columns (dest, time and dist)

flights %>%
  select(dest, time, dist)

## Source: local data frame [227,496 x 3]
## 
##     dest  time  dist
##    (chr) (int) (int)
## 1    DFW    40   224
## 2    DFW    45   224
## 3    DFW    48   224
## 4    DFW    39   224
## 5    DFW    44   224
## 6    DFW    45   224
## 7    DFW    43   224
## 8    DFW    40   224
## 9    DFW    41   224
## 10   DFW    45   224
## ..   ...   ...   ...

Add the arrival delay (arr_delay) and departure delay (dep_delay) columns as well.

flights %>%
  select(dest, time, dist, arr_delay, dep_delay)

## Source: local data frame [227,496 x 5]
## 
##     dest  time  dist arr_delay dep_delay
##    (chr) (int) (int)     (int)     (int)
## 1    DFW    40   224       -10         0
## 2    DFW    45   224        -9         1
## 3    DFW    48   224        -8        -8
## 4    DFW    39   224         3         3
## 5    DFW    44   224        -3         5
## 6    DFW    45   224        -7        -1
## 7    DFW    43   224        -1        -1
## 8    DFW    40   224       -16        -5
## 9    DFW    41   224        44        43
## 10   DFW    45   224        43        43
## ..   ...   ...   ...       ...       ...

Other ways to do the same

flights %>%
  select(dest, time, dist, ends_with("delay"))

## Source: local data frame [227,496 x 5]
## 
##     dest  time  dist dep_delay arr_delay
##    (chr) (int) (int)     (int)     (int)
## 1    DFW    40   224         0       -10
## 2    DFW    45   224         1        -9
## 3    DFW    48   224        -8        -8
## 4    DFW    39   224         3         3
## 5    DFW    44   224         5        -3
## 6    DFW    45   224        -1        -7
## 7    DFW    43   224        -1        -1
## 8    DFW    40   224        -5       -16
## 9    DFW    41   224        43        44
## 10   DFW    45   224        43        43
## ..   ...   ...   ...       ...       ...

and …

flights %>%
  select(dest, time, dist, contains("delay"))

## Source: local data frame [227,496 x 5]
## 
##     dest  time  dist dep_delay arr_delay
##    (chr) (int) (int)     (int)     (int)
## 1    DFW    40   224         0       -10
## 2    DFW    45   224         1        -9
## 3    DFW    48   224        -8        -8
## 4    DFW    39   224         3         3
## 5    DFW    44   224         5        -3
## 6    DFW    45   224        -1        -7
## 7    DFW    43   224        -1        -1
## 8    DFW    40   224        -5       -16
## 9    DFW    41   224        43        44
## 10   DFW    45   224        43        43
## ..   ...   ...   ...       ...       ...

Select all columns from date to arr

flights %>%
  select(date:arr)

## Source: local data frame [227,496 x 5]
## 
##                   date  hour minute   dep   arr
##                 (time) (int)  (int) (int) (int)
## 1  2011-01-01 12:00:00    14      0  1400  1500
## 2  2011-01-02 12:00:00    14      1  1401  1501
## 3  2011-01-03 12:00:00    13     52  1352  1502
## 4  2011-01-04 12:00:00    14      3  1403  1513
## 5  2011-01-05 12:00:00    14      5  1405  1507
## 6  2011-01-06 12:00:00    13     59  1359  1503
## 7  2011-01-07 12:00:00    13     59  1359  1509
## 8  2011-01-08 12:00:00    13     55  1355  1454
## 9  2011-01-09 12:00:00    14     43  1443  1554
## 10 2011-01-10 12:00:00    14     43  1443  1553
## ..                 ...   ...    ...   ...   ...

Select all except plane column using the minus sign

flights %>%
  select(-plane)

## Source: local data frame [227,496 x 13]
## 
##                   date  hour minute   dep   arr dep_delay arr_delay
##                 (time) (int)  (int) (int) (int)     (int)     (int)
## 1  2011-01-01 12:00:00    14      0  1400  1500         0       -10
## 2  2011-01-02 12:00:00    14      1  1401  1501         1        -9
## 3  2011-01-03 12:00:00    13     52  1352  1502        -8        -8
## 4  2011-01-04 12:00:00    14      3  1403  1513         3         3
## 5  2011-01-05 12:00:00    14      5  1405  1507         5        -3
## 6  2011-01-06 12:00:00    13     59  1359  1503        -1        -7
## 7  2011-01-07 12:00:00    13     59  1359  1509        -1        -1
## 8  2011-01-08 12:00:00    13     55  1355  1454        -5       -16
## 9  2011-01-09 12:00:00    14     43  1443  1554        43        44
## 10 2011-01-10 12:00:00    14     43  1443  1553        43        43
## ..                 ...   ...    ...   ...   ...       ...       ...
## Variables not shown: carrier (chr), flight (int), dest (chr), cancelled
##   (int), time (int), dist (int)

Filter

The filter() function returns rows with matching conditions. Let’s build on the previous exercise and find all flights to Boston (BOS):

flights %>%
  select(dest, time, dist) %>%
  filter(dest == "BOS")

## Source: local data frame [1,752 x 3]
## 
##     dest  time  dist
##    (chr) (int) (int)
## 1    BOS   195  1597
## 2    BOS   188  1597
## 3    BOS   190  1597
## 4    BOS   188  1597
## 5    BOS   180  1597
## 6    BOS   190  1597
## 7    BOS   185  1597
## 8    BOS   198  1597
## 9    BOS   194  1597
## 10   BOS   203  1597
## ..   ...   ...   ...

Now let’s do the filter first and then select the columns

flights %>%
  filter(dest == "BOS") %>%
  select(dest, time, dist)

## Source: local data frame [1,752 x 3]
## 
##     dest  time  dist
##    (chr) (int) (int)
## 1    BOS   195  1597
## 2    BOS   188  1597
## 3    BOS   190  1597
## 4    BOS   188  1597
## 5    BOS   180  1597
## 6    BOS   190  1597
## 7    BOS   185  1597
## 8    BOS   198  1597
## 9    BOS   194  1597
## 10   BOS   203  1597
## ..   ...   ...   ...

In this case the order doesn’t matter, but when using pipes make sure you understand that each function is executed in sequence and the results are then fed to the next one.

Exercise

Find all flights that match the following conditions:

To SFO or OAK
In January
Delayed by more than an hour
Departed between midnight and 5am
Arrival delay more than twice the departure delay

Here’s a brief summary of operators you can use

Source: Hadley Wickham’s Dplyr Tutorial

Arrange

The arrange() function is used to sort the rows based on one or more columns

flights %>%
  arrange(dest)

## Source: local data frame [227,496 x 14]
## 
##                   date  hour minute   dep   arr dep_delay arr_delay
##                 (time) (int)  (int) (int) (int)     (int)     (int)
## 1  2011-01-31 12:00:00    17     33  1733  1901        -2        -4
## 2  2011-01-30 12:00:00    17     50  1750  1913        15         8
## 3  2011-01-29 12:00:00    17     32  1732  1837        -3       -23
## 4  2011-01-28 12:00:00    17     33  1733  1848        -2       -17
## 5  2011-01-27 12:00:00    17     41  1741  1854         6       -11
## 6  2011-01-26 12:00:00    17     32  1732  1853        -3       -12
## 7  2011-01-25 12:00:00    17     29  1729  1858        -6        -7
## 8  2011-01-24 12:00:00    17     34  1734  1845        -1       -20
## 9  2011-01-23 12:00:00    17     35  1735  1853         0       -12
## 10 2011-01-22 12:00:00    17     33  1733  1843        -2       -17
## ..                 ...   ...    ...   ...   ...       ...       ...
## Variables not shown: carrier (chr), flight (int), dest (chr), plane (chr),
##   cancelled (int), time (int), dist (int)

Exercise

Order flights by departure date and time
Which flights were most delayed?
Which flights caught up the most time during flight?

Mutate

The mutate() function is used to create new variables.

Up until now we’ve only been examining the dataset but haven’t made any changes to it. All our functions so far have simply displayed the results on screen but haven’t created or modified existing variables. Let’s see how we can create a new variable called speed based on the distance and duration in the flights dataframe.

In this exercise we’re adding a new variable to an existing dataframe so we’ll just overwrite the flights variable with the one that has a speed column

flights <- flights %>%
  mutate(speed = dist / (time / 60))

Exercise

Add a variable to show how much time was made up (or lost) during flight

Groups

Let’s count the number of flights departing each day.

flights %>%
  group_by(date) %>%
  summarise(count = n())

## Source: local data frame [365 x 2]
## 
##                   date count
##                 (time) (int)
## 1  2011-01-01 12:00:00   552
## 2  2011-01-02 12:00:00   678
## 3  2011-01-03 12:00:00   702
## 4  2011-01-04 12:00:00   583
## 5  2011-01-05 12:00:00   590
## 6  2011-01-06 12:00:00   660
## 7  2011-01-07 12:00:00   661
## 8  2011-01-08 12:00:00   500
## 9  2011-01-09 12:00:00   602
## 10 2011-01-10 12:00:00   659
## ..                 ...   ...

Here’s a nice little trick. You can use View() to look at the results of a pipe operation without creating new variables.

flights %>%
  group_by(date) %>%
  summarise(count = n()) %>%
  View()

Of course, often times we’d want to save the summary in a variable for further analysis.

Let’s find the average departure delay for each destination

delays <- flights %>%
    group_by(dest) %>%
    summarise(mean = mean(dep_delay))

delays

## Source: local data frame [116 x 2]
## 
##     dest   mean
##    (chr)  (dbl)
## 1    ABQ     NA
## 2    AEX     NA
## 3    AGS 10.000
## 4    AMA     NA
## 5    ANC 24.952
## 6    ASE     NA
## 7    ATL     NA
## 8    AUS     NA
## 9    AVL     NA
## 10   BFL     NA
## ..   ...    ...

Exercise

What’s wrong with the results above, and how would you fix the problem?
Can you think of using filter to solve the problem?
Use help to find out two other ways to do summarize/n combination in dplyr.
How many different destinations can you fly to from Houston?
Which destinations have the highest average delays?
Which flights (carrier + flight number) happen everyday and where do they fly?
How do delays (of non-cancelled flights) vary over the course of a day?

Plotting Results

hourly <- flights %>%
  filter(cancelled == 0) %>%
  mutate(time = hour + minute / 60) %>%
  group_by(time) %>%
  summarise(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

plot(arr_delay ~ time, data = hourly, xaxt = 'n')
axis(1, at = seq(0, 24), las=2)

Merging Multiple Datasets

Tutorial on Merging Datasets

What if we wanted to see airport delays based on a state or city? We have the airports data that gives us the location of the airport so we can just merge it with the flights data.

flights %>%
  group_by(dest) %>%
  filter(!is.na(arr_delay)) %>%
  summarise(delay = mean(arr_delay)) %>%
  arrange(desc(delay)) %>%
  left_join(airports, by = c("dest" = "iata"))

## Source: local data frame [116 x 8]
## 
##     dest    delay                             airport                 city
##    (chr)    (dbl)                               (chr)                (chr)
## 1    ANC 26.08065 Ted Stevens Anchorage International            Anchorage
## 2    CID 17.80049                        Eastern Iowa         Cedar Rapids
## 3    DSM 15.95110            Des Moines International           Des Moines
## 4    SFO 14.89036         San Francisco International        San Francisco
## 5    BPT 14.33333            Southeast Texas Regional Beaumont/Port Arthur
## 6    GRR 13.71729           Kent County International         Grand Rapids
## 7    DAY 13.67117             James M Cox Dayton Intl               Dayton
## 8    VPS 12.45718                Eglin Air Force Base           Valparaiso
## 9    ECP 12.42222                                  NA                   NA
## 10   SAV 12.33137              Savannah International             Savannah
## ..   ...      ...                                 ...                  ...
## Variables not shown: state (chr), country (chr), lat (dbl), long (dbl)

Exercise

Are some types of planes more likely to be delayed?
Merge the planes dataset with the flights dataset to see if the type/age of plane is associated with delays
Join the flights dataset with weather dataset to see how different weather conditions affect flight delay.

Reshaping

Remember the wonderful WDI Dataset from Term 1? It’s fairly common for datasets from public sources to come in wide formats that need to be reshaped. Let’s go over the WDI example step by step to see how that’s done.

We’re using a tiny sample of the original WDI dataset here to simplify the tasks. Let’s load the dataset and see what it looks like.

wdi <- read_csv("http://ucl.ac.uk/~uctqiax/data/wdi.csv", na = "..")

wdi

## Source: local data frame [5 x 7]
## 
##         ¬Series.Name    Series.Code Country.Name Country.Code X1995.YR1995
##                (chr)          (chr)        (chr)        (chr)        (dbl)
## 1 Maternal mortality    SH.STA.MMRT       France          FRA    15.000000
## 2 Maternal mortality    SH.STA.MMRT        Spain          ESP     6.000000
## 3 Maternal mortality    SH.STA.MMRT                                     NA
## 4 Health expenditure SH.XPD.TOTL.ZS       France          FRA    10.355906
## 5 Health expenditure SH.XPD.TOTL.ZS        Spain          ESP     7.444592
## Variables not shown: X2000.YR2000 (dbl), X2005.YR2005 (dbl)

But ideally, we’d like our data to look something like this:

## Source: local data frame [6 x 5]
## 
##   CountryCode CountryName  Year MaternalMortality HealthExpenditure
##         (chr)       (chr) (dbl)             (dbl)             (dbl)
## 1         ESP       Spain  1995                 6          7.444592
## 2         ESP       Spain  2000                 5          7.214756
## 3         ESP       Spain  2005                 5          8.288271
## 4         FRA      France  1995                15         10.355906
## 5         FRA      France  2000                12         10.084833
## 6         FRA      France  2005                10         10.932626

We can see that some country names and codes are blank, so let’s get rid of them first

wdi %>%
  filter(Country.Code != "")

## Source: local data frame [4 x 7]
## 
##         ¬Series.Name    Series.Code Country.Name Country.Code X1995.YR1995
##                (chr)          (chr)        (chr)        (chr)        (dbl)
## 1 Maternal mortality    SH.STA.MMRT       France          FRA    15.000000
## 2 Maternal mortality    SH.STA.MMRT        Spain          ESP     6.000000
## 3 Health expenditure SH.XPD.TOTL.ZS       France          FRA    10.355906
## 4 Health expenditure SH.XPD.TOTL.ZS        Spain          ESP     7.444592
## Variables not shown: X2000.YR2000 (dbl), X2005.YR2005 (dbl)

So far so good. Note that we’re not making any changes yet so we can just add one function at a time to the pipeline and check the results. Once we’re satisfied with the results we save them to a variable.

We need to gather all columns that start with “X” that contain per-year values for each series (for example X1960..YR1960)

wdi %>%
  filter(Country.Code != "") %>% 
  gather(Year, Value, starts_with("X"))

## Source: local data frame [12 x 6]
## 
##          ¬Series.Name    Series.Code Country.Name Country.Code
##                 (chr)          (chr)        (chr)        (chr)
## 1  Maternal mortality    SH.STA.MMRT       France          FRA
## 2  Maternal mortality    SH.STA.MMRT        Spain          ESP
## 3  Health expenditure SH.XPD.TOTL.ZS       France          FRA
## 4  Health expenditure SH.XPD.TOTL.ZS        Spain          ESP
## 5  Maternal mortality    SH.STA.MMRT       France          FRA
## 6  Maternal mortality    SH.STA.MMRT        Spain          ESP
## 7  Health expenditure SH.XPD.TOTL.ZS       France          FRA
## 8  Health expenditure SH.XPD.TOTL.ZS        Spain          ESP
## 9  Maternal mortality    SH.STA.MMRT       France          FRA
## 10 Maternal mortality    SH.STA.MMRT        Spain          ESP
## 11 Health expenditure SH.XPD.TOTL.ZS       France          FRA
## 12 Health expenditure SH.XPD.TOTL.ZS        Spain          ESP
## Variables not shown: Year (chr), Value (dbl)

Now all values are in the Value column, so we need to spread them out to individual columns based on the Series.Code. We have to make sure that we only keep the columns that make the country-year observations unique. We use select() to keep Country.Code, Country.Name, Year, plus the two columns (Series.Code and Value) that will make up the key-value pair for the spread() function.

wdi %>%
  filter(Country.Code != "") %>% 
  gather(Year, Value, starts_with("X")) %>% 
  select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
  spread(Series.Code, Value)

## Source: local data frame [6 x 5]
## 
##   Country.Code Country.Name         Year SH.STA.MMRT SH.XPD.TOTL.ZS
##          (chr)        (chr)        (chr)       (dbl)          (dbl)
## 1          ESP        Spain X1995.YR1995           6       7.444592
## 2          ESP        Spain X2000.YR2000           5       7.214756
## 3          ESP        Spain X2005.YR2005           5       8.288271
## 4          FRA       France X1995.YR1995          15      10.355906
## 5          FRA       France X2000.YR2000          12      10.084833
## 6          FRA       France X2005.YR2005          10      10.932626

It looks good, so we can rename the variables to something meaningful.

wdi %>%
  filter(Country.Code != "") %>% 
  gather(Year, Value, starts_with("X")) %>% 
  select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
  spread(Series.Code, Value) %>% 
  rename(CountryName = Country.Name,
         CountryCode = Country.Code,
         MaternalMortality = SH.STA.MMRT,
         HealthExpenditure = SH.XPD.TOTL.ZS)

## Source: local data frame [6 x 5]
## 
##   CountryCode CountryName         Year MaternalMortality HealthExpenditure
##         (chr)       (chr)        (chr)             (dbl)             (dbl)
## 1         ESP       Spain X1995.YR1995                 6          7.444592
## 2         ESP       Spain X2000.YR2000                 5          7.214756
## 3         ESP       Spain X2005.YR2005                 5          8.288271
## 4         FRA      France X1995.YR1995                15         10.355906
## 5         FRA      France X2000.YR2000                12         10.084833
## 6         FRA      France X2005.YR2005                10         10.932626

Now we just need to extract the 4-digit year from the Year column. The Year column is formatted as X1995.YR1995 which means that the 4-digits for the year are in position 2,3,4, and 5. We can use the substring() function to take all the characters from position 2 to 5 and assign it back to the Year column.

Since this is the last step we might as well assign the results to a new variable.

wdi_long <- wdi %>%
  filter(Country.Code != "") %>% 
  gather(Year, Value, starts_with("X")) %>% 
  select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
  spread(Series.Code, Value) %>% 
  rename(CountryName = Country.Name,
         CountryCode = Country.Code,
         MaternalMortality = SH.STA.MMRT,
         HealthExpenditure = SH.XPD.TOTL.ZS) %>%
  mutate(Year = as.numeric(substring(Year, 2, 5)))

wdi_long

## Source: local data frame [6 x 5]
## 
##   CountryCode CountryName  Year MaternalMortality HealthExpenditure
##         (chr)       (chr) (dbl)             (dbl)             (dbl)
## 1         ESP       Spain  1995                 6          7.444592
## 2         ESP       Spain  2000                 5          7.214756
## 3         ESP       Spain  2005                 5          8.288271
## 4         FRA      France  1995                15         10.355906
## 5         FRA      France  2000                12         10.084833
## 6         FRA      France  2005                10         10.932626

You can assign it back to wdi if you want, but we’re using a different name in case we make a mistake and have to start again. This way we would’ve have to reload the file all over again.

Data Manipulation in R

Setup

Flights Data

Data pipelines

Filter

Exercise

Arrange

Exercise

Mutate

Exercise

Groups

Exercise

Plotting Results

Merging Multiple Datasets

Exercise

Reshaping

Resources

Online Books

Cheat Sheets