Extra Credit

Mutating Joins

You use join functions when two tables share some records or columns but differ in others. As long as they have at least one variable in common (a key), the tables can be matched on it and joined together. A mutating join adds the columns of one table to the matching rows of the other.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(nycflights13)


# Drop unimportant variables so it's easier to understand the join results.
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

#view(flights2)

flights2 %>% 
  left_join(airlines)

## Joining with `by = join_by(carrier)`

## # A tibble: 336,776 × 9
##     year month   day  hour origin dest  tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # ℹ 336,766 more rows

# This joins the airlines data onto flights2, adding the full airline name for each carrier code.

Interpretation of the code

This code uses the nycflights13 data set. flights2 is a subset of flights that keeps only the columns year:day, hour, origin, dest, tailnum, and carrier. We want to add the airline data to it.

To do that, we take the airlines table and left-join it onto flights2. The two tables are matched on their common variable, carrier, and the airline name is added as a new column on the right.
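The reason this is called a mutating join can be sketched on a small example: the join adds a column much as mutate() would with a lookup. The tables flights_toy and airlines_toy below are made-up stand-ins, not the nycflights13 data, and the key is stated explicitly with by = "carrier" instead of being guessed.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables standing in for flights2 and airlines
flights_toy <- tibble(carrier = c("UA", "AA", "B6", "UA"))
airlines_toy <- tibble(
  carrier = c("AA", "B6", "UA"),
  name = c("American Airlines Inc.", "JetBlue Airways", "United Air Lines Inc.")
)

# Left join with the key stated explicitly
joined <- flights_toy %>% left_join(airlines_toy, by = "carrier")

# The same column built by hand with mutate() and a lookup via match()
by_hand <- flights_toy %>%
  mutate(name = airlines_toy$name[match(carrier, airlines_toy$carrier)])

joined
```

Both approaches give the same name column here; the join is easier to read and extends to several keys.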

Controlling how tables are matched

flights2 %>% left_join(weather)

## Joining with `by = join_by(year, month, day, hour, origin)`

## # A tibble: 336,776 × 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # ℹ 336,766 more rows
## # ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

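Here dplyr performs a natural join: because no by argument is given, it matches flights2 and weather on every variable the two tables have in common (year, month, day, hour, and origin). Because this is a left join, every row of flights2 is kept and nothing is excluded; flights without a matching weather record get NA in the weather columns.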
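A natural join can also be written with the keys spelled out, which protects against a shared column being picked up by accident. This is a minimal sketch on made-up toy tables (flights_toy and weather_toy are hypothetical and have only two shared columns):

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables that share the columns day and origin
flights_toy <- tibble(
  day = c(1, 1, 2),
  origin = c("EWR", "JFK", "EWR"),
  dest = c("IAH", "MIA", "ORD")
)
weather_toy <- tibble(
  day = c(1, 1),
  origin = c("EWR", "JFK"),
  temp = c(39.0, 39.9)
)

# Natural join: dplyr picks all shared columns and prints a message saying so
natural <- flights_toy %>% left_join(weather_toy)

# Same join with the keys written out
explicit <- flights_toy %>% left_join(weather_toy, by = c("day", "origin"))

explicit
```

The third flight has no weather record in the toy data, so its temp is NA, but the row is still kept.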

flights2 %>% left_join(planes, by = "tailnum")

## # A tibble: 336,776 × 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type            
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
## # ℹ 336,766 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## #   seats <int>, speed <int>, engine <chr>

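Here we have a new argument, by = "tailnum". It tells dplyr to match the tables on that one variable only. flights2 and planes have two variables in common, tailnum and year, but year means something different in each table (the year of the flight versus the year the plane was built). Because year is not used as a key, both copies are kept in the output and renamed year.x (from flights2) and year.y (from planes).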
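The default .x and .y suffixes can be replaced with more descriptive ones through the suffix argument of left_join(). A minimal sketch, assuming made-up toy tables (the seats value and the suffix names are illustrative choices, not taken from the data above):

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables: both have a year column with a different meaning
flights_toy <- tibble(year = c(2013L, 2013L), tailnum = c("N14228", "N3ALAA"))
planes_toy <- tibble(year = 1999L, tailnum = "N14228", seats = 149L)

# Join on tailnum only and label the two year columns
labelled <- flights_toy %>%
  left_join(planes_toy, by = "tailnum", suffix = c("_flight", "_built"))

names(labelled)
```

The plane that is missing from planes_toy keeps its flight row and gets NA for year_built and seats.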

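In the next two examples, by is a named character vector, c("dest" = "faa") or c("origin" = "faa"). This matches a variable in flights2 with a differently named variable in airports: the first join matches dest to faa and adds the details of the destination airport, and the second matches origin to faa and adds the details of the origin airport.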

flights2 %>% 
  left_join(airports, c("dest" = "faa"))

## # A tibble: 336,776 × 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Georg…  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Georg…  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami…  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts…  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chica…  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort …  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi…  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan…  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica…  42.0 -87.9   668
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

flights2 %>% 
  left_join(airports, c("origin" = "faa"))

## # A tibble: 336,776 × 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Newar…  40.7 -74.2    18
##  2  2013     1     1     5 LGA    IAH   N24211  UA      La Gu…  40.8 -73.9    22
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      John …  40.6 -73.8    13
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      John …  40.6 -73.8    13
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      La Gu…  40.8 -73.9    22
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Newar…  40.7 -74.2    18
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Newar…  40.7 -74.2    18
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      La Gu…  40.8 -73.9    22
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      John …  40.6 -73.8    13
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La Gu…  40.8 -73.9    22
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
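Both airport lookups can be combined in one pipeline if the added columns are renamed so they do not clash. This is a sketch on made-up toy tables; flights_toy, airports_toy, and the column names origin_name and dest_name are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables standing in for flights2 and airports
flights_toy <- tibble(origin = c("EWR", "JFK"), dest = c("IAH", "BQN"))
airports_toy <- tibble(
  faa = c("EWR", "JFK", "IAH"),
  name = c("Newark Liberty Intl", "John F Kennedy Intl", "George Bush Intercontinental")
)

# Look up the airport name twice: once for origin, once for dest
both <- flights_toy %>%
  left_join(rename(airports_toy, origin_name = name), by = c("origin" = "faa")) %>%
  left_join(rename(airports_toy, dest_name = name), by = c("dest" = "faa"))

both
```

BQN is not in the toy airports table, so its dest_name is NA, just as the BQN row has NA values in the output above.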

Types of Join

Creating data frames

df1 <- tibble(x = c(1, 2), y = 2:1)

df2 <- tibble(x = c(3, 1), a = 10, b = "a")

Inner Join

An inner join keeps only the observations whose key values appear in both tables.

Suppose you have two data sets with some observations in common. You can use an inner join to combine them. For example, say you have a medical history data set and a drug treatment data set that cover the same patients. If each patient can be identified in both data sets, for instance by a patient number, you can join the two to build a fuller patient profile that holds both the medical history and the drug treatments they have had in the past.

df1 %>% inner_join(df2) %>% knitr::kable()

## Joining with `by = join_by(x)`

x	y	a	b
1	2	10	a
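The patient example can be sketched as follows. The tables patients and treatments, and all of their values, are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical patient tables that share the key patient_id
patients <- tibble(patient_id = c(1, 2, 3), height_cm = c(170, 165, 180))
treatments <- tibble(patient_id = c(2, 3, 4), drug = c("drug A", "drug B", "drug C"))

# Only patients present in both tables survive the inner join
profile <- patients %>% inner_join(treatments, by = "patient_id")

profile
```

Patient 1 has no treatment record and patient 4 has no entry in patients, so neither appears in the result.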

Left Join

A left join keeps every observation from the left-hand table, whether or not it has a match, so you never lose rows from it. Here df2 is joined onto df1, the left-hand table. The output contains all rows of df1 plus the columns of df2 for the rows that match; rows without a match get NA in those columns.

df1 %>% left_join(df2)

## Joining with `by = join_by(x)`

## # A tibble: 2 × 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA>

Right Join

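A right join is similar to a left join, but it keeps every observation from the right-hand table instead. The output contains all rows of df2 plus the matching values from df1; rows of df2 without a match get NA in the df1 columns. The second call below, df2 %>% left_join(df1), returns the same observations with the rows and columns in a different order.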

df1 %>% right_join(df2)

## Joining with `by = join_by(x)`

## # A tibble: 2 × 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     3    NA    10 a

df2 %>% left_join(df1)

## Joining with `by = join_by(x)`

## # A tibble: 2 × 4
##       x     a b         y
##   <dbl> <dbl> <chr> <int>
## 1     3    10 a        NA
## 2     1    10 a         2
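The two outputs above hold the same observations. A quick check, written as a sketch that reorders rows and columns before comparing:

```r
library(dplyr)
library(tibble)

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")

# Right join, sorted by the key
right <- df1 %>%
  right_join(df2, by = "x") %>%
  arrange(x)

# Left join with the tables swapped, then put columns and rows in the same order
left_swapped <- df2 %>%
  left_join(df1, by = "x") %>%
  select(x, y, a, b) %>%
  arrange(x)

# Compare as plain data frames
same <- isTRUE(all.equal(as.data.frame(right), as.data.frame(left_swapped)))
same
```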

Full Join

A full join keeps all observations from both data frames. Rows of df1 and df2 that match are combined, and rows that have no match in the other table are still kept, with NA filling in the missing values.

df1 %>% full_join(df2)

## Joining with `by = join_by(x)`

## # A tibble: 3 × 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA> 
## 3     3    NA    10 a
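The four join types can be compared side by side by counting the rows each one returns for the same df1 and df2:

```r
library(dplyr)
library(tibble)

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")

# Number of rows returned by each mutating join on the key x
counts <- c(
  inner = nrow(inner_join(df1, df2, by = "x")),
  left  = nrow(left_join(df1, df2, by = "x")),
  right = nrow(right_join(df1, df2, by = "x")),
  full  = nrow(full_join(df1, df2, by = "x"))
)

counts
```

Only x = 1 appears in both tables, so the inner join returns one row, the left and right joins return two each, and the full join returns three.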

Observations

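In this code the data frames contain the variables x, y, and z.

The key x is not unique in either data frame: the value 1 appears twice in both. The join therefore returns every combination of the matching rows, which creates more observations than either data frame had to begin with, and dplyr warns about the many-to-many relationship.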

df1 <- tibble(x = c(1, 1, 2), y = 1:3)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))

df1 %>% left_join(df2)

## Joining with `by = join_by(x)`

## Warning in left_join(., df2): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

## # A tibble: 5 × 3
##       x     y z    
##   <dbl> <int> <chr>
## 1     1     1 a    
## 2     1     1 b    
## 3     1     2 a    
## 4     1     2 b    
## 5     2     3 a
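If the duplicated keys are intentional, the warning itself suggests setting relationship = "many-to-many". A minimal sketch, assuming dplyr 1.1.0 or later (the version that introduced this argument):

```r
library(dplyr)
library(tibble)

df1 <- tibble(x = c(1, 1, 2), y = 1:3)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))

# Declare that many-to-many matches are expected, which silences the warning
expanded <- df1 %>%
  left_join(df2, by = "x", relationship = "many-to-many")

nrow(expanded)
```

The result has the same five rows as before; only the warning goes away.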

Filtering Join

semi_join(x, y) keeps all observations in x that have a match in y.

anti_join(x, y) drops all observations in x that have a match in y.

Filtering joins never duplicate observations, and they do not add columns from y.

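A filtering join such as semi_join() is useful when you want to keep only the observations of one data frame that also appear in another. For example, given sports data frames from two different teams that share a player, you could pull just that player's observations.

An anti join is useful for finding mismatches. The code below uses it on the tailnum variable to find flights whose plane has no match in planes.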

library("nycflights13")
flights %>% 
  anti_join(planes, by = "tailnum") %>% 
  count(tailnum, sort = TRUE)

## # A tibble: 722 × 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # ℹ 712 more rows
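The first row of the output shows that flights with a missing tailnum are counted as unmatched too. One way to leave them out is to filter on is.na() before the anti join. This is a sketch on made-up toy tables (flights_toy and planes_toy are hypothetical), not a rerun of the query above:

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; one flight has no tail number recorded
flights_toy <- tibble(tailnum = c("N14228", "N725MQ", "N725MQ", NA))
planes_toy <- tibble(tailnum = "N14228")

# Drop missing tail numbers first, then find planes with no match
unmatched <- flights_toy %>%
  filter(!is.na(tailnum)) %>%
  anti_join(planes_toy, by = "tailnum") %>%
  count(tailnum, sort = TRUE)

unmatched
```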

df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))

# Four rows to start with:
df1 %>% nrow()

## [1] 4

# And we get four rows after the join
df1 %>% inner_join(df2, by = "x") %>% nrow()

## Warning in inner_join(., df2, by = "x"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

## [1] 4

# But only two rows actually match
df1 %>% semi_join(df2, by = "x") %>% nrow()

## [1] 2
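semi_join() and anti_join() split the rows of the first table into two groups that do not overlap. A short sketch with the same df1 and df2:

```r
library(dplyr)
library(tibble)

df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))

# Rows of df1 with a match in df2
kept <- df1 %>% semi_join(df2, by = "x")

# Rows of df1 without a match in df2
dropped <- df1 %>% anti_join(df2, by = "x")

# Together they account for every row of df1
nrow(kept) + nrow(dropped) == nrow(df1)
```

Neither call warns about the many-to-many relationship, because filtering joins only check whether a match exists and never duplicate rows.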

Amit Singh

2023-08-19
