Relational Data with Dplyr

Three families of verbs designed to work with relational data:
  • Mutating Joins
  • Filtering Joins
  • Set Operations
  • Pre requisites

    library(tidyverse)
    library(nycflights13)

    Primay keys should be unique. This is how to verify that there is only one record per key

    library(tidyverse)
    package <U+393C><U+3E31>tidyverse<U+393C><U+3E32> was built under R version 3.3.3Loading tidyverse: ggplot2
    Loading tidyverse: tibble
    Loading tidyverse: tidyr
    Loading tidyverse: readr
    Loading tidyverse: purrr
    Loading tidyverse: dplyr
    Conflicts with tidy packages ------------------------------------------------------------------------------------
    filter(): dplyr, stats
    lag():    dplyr, stats
    library(nycflights13)
    package <U+393C><U+3E31>nycflights13<U+393C><U+3E32> was built under R version 3.3.3
    planes %>%
      count(tailnum) %>%
      filter(n > 1)
    weather %>%
      count(year,month,day, hour, origin) %>%
      filter(n > 1)
    When a table lacks a unique key, it might be useful to create a surrogate key from mutate() and row_number() to make it easier to match observatins if you’ve done some filtering and want to check back in with the original data.

    # excercie add a surrogate key to flights
    flights %>% 
      arrange(year, month, day, sched_dep_time, carrier, flight) %>%
      mutate(flight_id = row_number()) %>%
      glimpse()
    Observations: 336,776
    Variables: 20
    $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
    $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
    $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
    $ dep_time       <int> 517, 533, 542, 544, 554, 559, 558, 559, 558, 558, 557, NA, 601, 600, 555, 554, 557, 608,...
    $ sched_dep_time <int> 515, 529, 540, 545, 558, 559, 600, 600, 600, 600, 600, 600, 600, 600, 600, 600, 600, 600...
    $ dep_delay      <dbl> 2, 4, 2, -1, -4, 0, -2, -1, -2, -2, -3, NA, 1, 0, -5, -6, -3, 8, 0, -2, 11, -2, -1, -3, ...
    $ arr_time       <int> 830, 850, 923, 1004, 740, 702, 753, 941, 849, 853, 838, NA, 844, 851, 913, 812, 709, 807...
    $ sched_arr_time <int> 819, 830, 850, 1022, 728, 706, 745, 910, 851, 856, 846, 901, 850, 858, 854, 837, 723, 73...
    $ arr_delay      <dbl> 11, 20, 33, -18, 12, -4, 8, 31, -2, -3, -8, NA, -6, -7, 19, -25, -14, 32, 12, 7, 14, -14...
    $ carrier        <chr> "UA", "UA", "AA", "B6", "UA", "B6", "AA", "AA", "B6", "B6", "B6", "B6", "B6", "B6", "B6"...
    $ flight         <int> 1545, 1714, 1141, 725, 1696, 1806, 301, 707, 49, 71, 79, 125, 343, 371, 507, 461, 5708, ...
    $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N39463", "N708JB", "N3ALAA", "N3DUAA", "N793JB"...
    $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "EWR", "JFK", "LGA", "LGA", "JFK", "JFK", "JFK", "JFK", "EWR...
    $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ORD", "BOS", "ORD", "DFW", "PBI", "TPA", "MCO", "FLL", "PBI...
    $ air_time       <dbl> 227, 227, 160, 183, 150, 44, 138, 257, 149, 158, 140, NA, 147, 152, 158, 116, 53, 139, 1...
    $ distance       <dbl> 1400, 1416, 1089, 1576, 719, 187, 733, 1389, 1028, 1005, 944, 1069, 1023, 1076, 1065, 76...
    $ hour           <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
    $ minute         <dbl> 15, 29, 40, 45, 58, 59, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 7, 8, 10, ...
    $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 201...
    $ flight_id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2...
    Lahman::Batting %>%
    group_by(playerID, yearID) %>%
      filter(n() > 1) 
    # since there are multiple records above, we need to add one more variable to make n() >1 return no records.
    Lahman::Batting %>%
    group_by(playerID, yearID, stint) %>%
      filter(n() > 1)

    Mutating Joins

    Like mutate() the join functions add variables to the right. First we create a narrower dataset.

    flights2 <- flights %>%
      select(year:day, hour, origin, dest, tailnum, carrier)
    flights2

    Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and the flights 2 data framew ith left_join()

    # look at what airlines table contain
    View(airlines)
    # then combine it with the flights2 table, but since we don't need origin and dest 
    # we take it out with -origin, -dest
    flights2 %>%
      select(-origin, -dest) %>%
      left_join(airlines, by = "carrier")

    Notice the name of the airline is now added as a new column., Another way is to use mutate()

    flights2 %>%
      select(-origin, -dest) %>%
      mutate(name = airlines$name[match(carrier,airlines$carrier)])

    Understanding Joins

    x <-  tribble( ~key, ~val_x,
                   1,"x1",
                   2,"x2",
                   3,"x3"
                   )
    x
    y <-  tribble(
      ~key, ~val_y,
      1,"y1",
      2,"y2",
      4,"y3"
    )
    y

    A join is a way of connecting each row in x to zero, one or more rows in y.

    Inner Join

    x %>%
      inner_join(y, by ="key")

    Outer Joins

    A left join keeps all observations in x.
    A right join keeps all observations in y.
    A full join keeps all observationsin x and y

    Duplicate Keys

    x <-  tribble( ~key, ~val_x,
                   1,"x1",
                   2,"x2",
                   2,"x3",
                   1, "x4"
                   )
    x
    y <-  tribble(
      ~key, ~val_y,
      1,"y1",
      2,"y2"
    )
    y
    left_join(x,y, by="key")
    x <-  tribble( ~key, ~val_x,
                   1,"x1",
                   2,"x2",
                   2,"x3",
                   3, "x4"
                   )
    x
    y <-  tribble(
      ~key, ~val_y,
      1,"y1",
      2,"y2",
      2,"y3",
      3,"y4"
    )
    y
    left_join(x,y, by="key")

    Defining Key Columns

    The default join by=NULL uses all variables that appear in both tables, the so-called natural jin. for example, the flights and weeather tables match on their common variables, year, month, day, hour and origin

    flights2 %>%
      left_join(weather)
    Joining, by = c("year", "month", "day", "hour", "origin")

    Join using a character vecto by =“x” This is like a natural join, but uses only some of the common variables. For example, flights and planes have year variables, but they mean different things so we only want to join by tailnum

    flights2 %>% 
      left_join(planes, by = "tailnum")

    Using a named character vector (variables names do not match ) bu = c(“a” =“b”)

    flights2 %>%
      left_join(airports, c("dest" = "faa"))

    Exercise: Compute the average delay by destination, then join on the air ports data frame so you can show the spatial distribution of delays.

    View(flights)
    airports %>%
      semi_join(flights, c("faa" = "dest")) %>%
      ggplot(aes(lon,lat)) +
      borders("state") +
      geom_point() +
      coord_quickmap()

    avg_dest_delays <-  flights %>%
      group_by(dest) %>%
      summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
      inner_join(airports, by = c(dest = "faa"))
    avg_dest_delays %>%
      ggplot(aes(lon,lat, colour = delay)) +
      borders("state") +
      geom_point() +
      coord_quickmap()

    Add location of the origin and destination to flights

    flights %>%
      left_join(airports, by = c(dest = "faa")) %>%
      left_join(airports, by = c(origin = "faa")) %>%
      head()

    Is there a relationship between the age of the plane and its delays?

    plane_ages <- 
      planes %>%
      mutate(age = 2013 - year) %>%
      select(tailnum,age)
    flights %>%
      inner_join(plane_ages, by ="tailnum") %>%
      group_by(age) %>%
      filter(!is.na(dep_delay)) %>%
      summarise(delay = mean(dep_delay)) %>%
      ggplot(aes(x =age, y =delay)) + 
      geom_point()+
      geom_line()

    What weather conditions make it more likely to see a delay?

    flight_weather <-
      flights %>%
      inner_join(weather, by = c("origin" = "origin",
                                "year" = "year",
                                "month" = "month",
                                "day" = "day",
                                "hour" = "hour"))
    flight_weather %>%
      group_by(precip) %>%
      summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
      ggplot(aes(x = precip, y = delay)) +
        geom_line() + geom_point()

    What happened on June 13, 2013 display the spatial pattern of delays and then use Google to cross-reference with the weather.

    flights %>%
      filter(year == 2013, month == 6, day == 13) %>%
      group_by(dest) %>%
      summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
      inner_join(airports, by = c("dest" = "faa")) %>%
      ggplot(aes(y = lat, x = lon, size = delay, colour = delay)) +
      borders("state") +
      geom_point() +
      coord_quickmap() + 
    Error: 

    Filtering Joins

    Filtering Joins match observations in the same way as mutating joins, but affect the number of observations, not the variables. There are two types:

    semi_join(x,y) keeps all observations in x that have a match in y.
    anti_join(x,y) drops all observation in x that have a match in y.

    Why semi_joins are important: imagine you’ve found the top 10 most popular destinations:

    top_dest <-  flights %>%
      count(dest, sort = TRUE) %>%
      head(10)
    top_dest

    Now you want to find each flight that went to one of those destinations. You could contrsuct a filter yourself:

    flights %>%
      filter(dest %in% top_dest$dest)

    It is difficult to extend that approach to multiple variables. How would you construct the filter statement that used year, month, and day to match it back to flights??

    flights %>%
      semi_join(top_dest)
    Joining, by = "dest"

    Anti join

    Anti joins are useful for diagnosing join mismatches. For example, when connecting flights and planes, you might be interested to know that there are many flights that don’t have a match in planes:

    flights %>%
      anti_join(planes, by = "tailnum") %>%
                count(tailnum, sort=TRUE)

    Exercises:

    What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in plances have in common? (Hint: one variable explains ~90% of the problems)

    flights %>%
      anti_join(planes, by = "tailnum") %>%
      count(carrier, sort = TRUE)

    So we find that Envoy Airlines (MQ) and American Airlines(AA) don’t report tail numbers.

    Filter flights to show flights with planes that have flown at least 100 flights

    planes_gt100 <- 
      filter(flights) %>%
      group_by(tailnum) %>%
      count() %>%
      filter(n > 100)
    flights %>%
      semi_join(planes_gt100, by = "tailnum")

    Join Problems

    Your own data is unlikely to be so nice. So there are a few things that you should do with your own data to make your joins go smoothly:

    1. Start by identifying the variables that form the primary key in each table. You should usually do this based on your understanding of the data. Not empirically by looking for a combination of variables that give a unique identifier.
    2. Check that none of the variables in the primary key are missing. If a value is missing then it can’t identify an observation!
    3. Check that your foreign keys match primary keys in another table. The best way to do this is with an anti_join()

    Set Operations

    Least used of the 3 types of verbs but occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the x an dy inputs to have the same variables, and treat the observations like sets:

    intersect(x,y) returns only observations in both x and y.
    union(x,y) return unique observations in x and y
    setdiff(x,y) return observations in x, but not in y.

    df1 <-  tribble(~x, ~y,
                    1,1,
                    2,1)
    df2 <-  tribble( ~x, ~y,
                     1,1,
                     1,2)
    df1
    df2

    The four possibilities are:

    # intersect(x,y)
    intersect(df1,df2)
    # union 
    union(df1,df2)
    # setdiff
    # return observations in x but not in y
    setdiff(df1,df2)
    setdiff(df2,df1)
    ---
title: "R For Data Science Chapter 10 "
output: html_notebook
---

<h1> Relational Data with Dplyr </h1>

Three families of verbs designed to work with relational data: </br>
<li> Mutating Joins </li>
<li> Filtering Joins </li>
<li> Set Operations</li>

<h2> Pre requisites </h2>
library(tidyverse) </br>
library(nycflights13) </br>

<img src="http://r4ds.had.co.nz/diagrams/relational-nycflights.png"> </img>

Primay keys should be unique. This is how to verify that there is only one record per key

```{r}
library(tidyverse)
library(nycflights13)

planes %>%
  count(tailnum) %>%
  filter(n > 1)

weather %>%
  count(year,month,day, hour, origin) %>%
  filter(n > 1)



```

When a table lacks a unique key, it might be useful to create a surrogate key from mutate() and row_number() to make it easier to match observatins if you've done some filtering and want to check back in with the original data. </p>


```{r}
# excercie add a surrogate key to flights
flights %>% 
  arrange(year, month, day, sched_dep_time, carrier, flight) %>%
  mutate(flight_id = row_number()) %>%
  glimpse()
```

```{r}
# finding the unique primary key
Lahman::Batting %>%
group_by(playerID, yearID) %>%
  filter(n() > 1) 
``` 

```{r}
# since there are multiple records above, we need to add one more variable to make n() >1 return no records.
Lahman::Batting %>%
group_by(playerID, yearID, stint) %>%
  filter(n() > 1)

```

<h2> Mutating Joins </h2>

Like mutate() the join functions add variables to the right. First we create a narrower dataset.

```{r}
flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2

```

Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and the flights 2 data framew ith left_join()

```{r}
# look at what airlines table contain
View(airlines)
# then combine it with the flights2 table, but since we don't need origin and dest 
# we take it out with -origin, -dest



```



```{r}
flights2 %>%
  select(-origin, -dest) %>%
  left_join(airlines, by = "carrier")

```

Notice the name of the airline is now added as a new column., Another way is to use mutate() 

```{r}
flights2 %>%
  select(-origin, -dest) %>%
  mutate(name = airlines$name[match(carrier,airlines$carrier)])

```
<h2> Understanding Joins </h2>

```{r}
x <-  tribble( ~key, ~val_x,
               1,"x1",
               2,"x2",
               3,"x3"
               )

x
```
```{r}
y <-  tribble(
  ~key, ~val_y,
  1,"y1",
  2,"y2",
  4,"y3"
)
y

```

A join is a way of connecting each row in x to zero, one or more rows in y.

<h3> Inner Join </h3>

```{r}
# unmatched rows are not included in the result
x %>%
  inner_join(y, by ="key")

```

<h3> Outer Joins </h3>
 A left join keeps all observations in x. </br>
 A right join keeps all observations in y.  </br>
 A full join keeps all observationsin x and y </p>


<h2> Duplicate Keys </h2>

```{r}

x <-  tribble( ~key, ~val_x,
               1,"x1",
               2,"x2",
               2,"x3",
               1, "x4"
               )

x


y <-  tribble(
  ~key, ~val_y,
  1,"y1",
  2,"y2"
)
y
```

```{r}
left_join(x,y, by="key")
```

```{r}
x <-  tribble( ~key, ~val_x,
               1,"x1",
               2,"x2",
               2,"x3",
               3, "x4"
               )

x

y <-  tribble(
  ~key, ~val_y,
  1,"y1",
  2,"y2",
  2,"y3",
  3,"y4"
)
y

left_join(x,y, by = "key")

```

<h2> Defining Key Columns </h2>

The default join by=NULL uses all variables that appear in both tables, the so-called natural jin. for example, the flights and weeather tables match on their common variables, year, month, day, hour and origin

```{r}
flights2 %>%
  left_join(weather)
```

Join using a character vecto by ="x" This is like a natural join, but uses only some of the common variables. For example, flights and planes have year variables, but they mean different things so we only want to join by tailnum

```{r}
flights2 %>% 
  left_join(planes, by = "tailnum")

```


Using a named character vector (variables names do not match ) bu = c("a" ="b")

```{r}
flights2 %>%
  left_join(airports, c("dest" = "faa"))
```

Exercise:
Compute the average delay by destination, then join on the air ports data frame so you can show the spatial distribution of delays. 


```{r}
View(flights)
```

```{r}
airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon,lat)) +
  borders("state") +
  geom_point() +
  coord_quickmap()

```

```{r}
avg_dest_delays <-  flights %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
  inner_join(airports, by = c(dest = "faa"))

avg_dest_delays %>%
  ggplot(aes(lon,lat, colour = delay)) +
  borders("state") +
  geom_point() +
  coord_quickmap()
```

Add location of the origin and destination to flights

```{r}

flights %>%
  left_join(airports, by = c(dest = "faa")) %>%
  left_join(airports, by = c(origin = "faa")) %>%

  head()

```

Is there a relationship between the age of the plane and its delays?

```{r}
plane_ages <- 
  planes %>%
  mutate(age = 2013 - year) %>%
  select(tailnum,age)

flights %>%
  inner_join(plane_ages, by ="tailnum") %>%
  group_by(age) %>%
  filter(!is.na(dep_delay)) %>%
  summarise(delay = mean(dep_delay)) %>%
  ggplot(aes(x =age, y =delay)) + 
  geom_point()+
  geom_line()

```

What weather conditions make it more likely to see a delay?

```{r}
flight_weather <-
  flights %>%
  inner_join(weather, by = c("origin" = "origin",
                            "year" = "year",
                            "month" = "month",
                            "day" = "day",
                            "hour" = "hour"))

flight_weather %>%
  group_by(precip) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = precip, y = delay)) +
    geom_line() + geom_point()
```


What happened on June 13, 2013 display the spatial pattern of delays and then use Google to cross-reference with the weather. 

```{r}
flights %>%
  filter(year == 2013, month == 6, day == 13) %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  ggplot(aes(y = lat, x = lon, size = delay, colour = delay)) +
  borders("state") +
  geom_point() +
  coord_quickmap() + 



```

<h2> Filtering Joins </h2>

Filtering Joins match observations in the same way as mutating joins, but affect the number of observations, not the variables. There are two types: </br>

semi_join(x,y) keeps all observations in x that have a match in y. </br>
anti_join(x,y) drops all observation in x that have a match in y. </p>


Why semi_joins are important: imagine you've found the top 10 most popular destinations:

```{r}
top_dest <-  flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest
```

Now you want to find each flight that went to one of those destinations. You could contrsuct a filter yourself:

```{r}
flights %>%
  filter(dest %in% top_dest$dest)
```

It is difficult to extend that approach to multiple variables.  How would you construct the filter statement that used year, month, and day to match it back to flights??


```{r}
flights %>%
  semi_join(top_dest)
```


<h3> Anti join </h3>

Anti joins are useful for diagnosing join mismatches. For example, when connecting flights and planes, you might be interested to know that there are many flights that don't have a match in planes:

```{r}
flights %>%
  anti_join(planes, by = "tailnum") %>%
            count(tailnum, sort=TRUE)
```

Exercises:

What does it mean for a flight to have a missing tailnum? What do the tail numbers that don't have a matching record in plances have in common? (Hint: one variable explains ~90% of the problems)

```{r}
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(carrier, sort = TRUE)

```

So we find that Envoy Airlines (MQ) and American Airlines(AA) don't report tail numbers. 


Filter flights to show flights with planes that have flown at least 100 flights

```{r}
planes_gt100 <- 
  filter(flights) %>%
  group_by(tailnum) %>%
  count() %>%
  filter(n > 100)

flights %>%
  semi_join(planes_gt100, by = "tailnum")


```

<h2> Join Problems </h2>

Your own data is unlikely to be so nice. So there are a few things that you should do with your own data to make your joins go smoothly: </p>
1. Start by identifying the variables that form the primary key in each table. You should usually do this based on your understanding of the data. Not empirically by looking for a combination of variables that give a unique identifier. </br>
2. Check that none of the variables in the primary key are missing.  If a value is missing then it can't identify an observation! </br>
3. Check that your foreign keys match primary keys in another table. The best way to do this is with an anti_join() </br>

<h2> Set Operations </h2>
Least used of the 3 types of verbs but occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the x an dy inputs to have the same variables, and treat the observations like sets: </br>

intersect(x,y) returns only observations in both x and y. </br>
union(x,y) return unique observations in x and y </br>
setdiff(x,y) return observations in x, but not in y. </p>

```{r}
df1 <-  tribble(~x, ~y,
                1,1,
                2,1)

df2 <-  tribble( ~x, ~y,
                 1,1,
                 1,2)

df1
df2
```


The four possibilities are:

```{r}
# intersect(x,y)
# returns only observeation in both df1 and df2
intersect(df1,df2)
```

```{r}
# union returns unique observations in df1 and df2
# so 1,1, shows up once 

union(df1,df2)

```


```{r}
# setdiff
# return observations in x but not in y
setdiff(df1,df2)
setdiff(df2,df1)

```

























