This tuturial is the part of the dplyr training series. Here is the YouTube Video link for this tutuorial.

https://youtu.be/COhCWuB80Lg

Here is the link to the complete series on DPLYR

https://www.youtube.com/playlist?list=PLkHcMTpvAaXVJzyRSytUn3nSK92TJphxR

Why dplyr

dplyr is a great tool to use in R.

The commands may look long and overwhelming to someone not using dplyr but that is not the case.

Once you learn the basics of it then it is very intuitive to use. Just like making a sentence once you have learnt the basic words of a language.

Audience

For beginners or experienced R users wanting to learn various commands of dplyr.

DPLYR : Joining Tables in R table joins

We will be covering all practical aspects of dplyr::case_when command in this. This tutorial is part of a series of tutorials on all practical aspects of dplyr All youtube videos are available in a single playlist on YouTube.

https://www.youtube.com/playlist?list=PLkHcMTpvAaXVJzyRSytUn3nSK92TJphxR

Create sample dataset

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3

Sample dataset.

Let us create some data for patient ages.

First dataset

Contain the age and outcome of the patients. PatientID is the unique key in this dataset.

d1 <- data.frame( PatientID  = c('P001','P002','P003')
                , Age           = c(23,12,5)
                , Outcome    = c('Died','Died',NA)   )

d1

Second dataset

Contain the Systolic blood pressure (SBP) and diastolic blood pressure (DBP) of the patients. PatientID is the unique key in this dataset.

d2 <- data.frame(PatientID   = c('P001','P002','P004')
                    , SBP       = c(120,    80, 45)
                     , DBP       = c(80,    70, 30)
 )

d2

Third dataset

Contain the weight of the patients. PatientID is the unique key in this dataset.

d3 <- data.frame(PatientID   = c('P001','P005','P006')
                   , Weight   = c(80, 65, 78)
)

d3

Fourth dataset

Contain the hear rate reading of the patients. PatID is the unique key in this dataset. Notice that in the first three datasets we had the PatientID as the key and in this dataset PatID is the key.

d4 <- data.frame(PatID   = c('P001','P004','P007')
                   , HeartRate   = c(65, 65, 72)
)

d4

First attempt of doing a left hand join

This is the simple syntax using dplyr. You provide the names of your datasets.

The position of the first and second table is important. The first table is the left side table and all the rows of this table will appear in your joined results. The second table on the rigth side will have only the matching rows data shown the in the final joined results.

As in both the datasets “PatientID” is the key so we can just specify it once saying by =“PatientID”. If the keys are different in both the cases then you have to specify both the keys as shown in subsequent examples in this document.

# Left Join

p_left <-  dplyr::left_join(d1,d2,  by  ="PatientID")

My preferred syntax

Use this syntax when joining more than two tables

# dplyr
p_left2 <-  d1%>%
            dplyr::left_join(d2,  by  = "PatientID")%>%
            dplyr::left_join(d3,  by  = "PatientID")
p_left2

Use this syntax when joining more than two tables

Notice the for d2 and d3 we specified the keys as by = “PatientID” which is the same as saying by =c(“PatientID” = “PatientID”)

But for d4 we had to say by =c(“PatientID” = “PatID”)

# dplyr
p_left3 <-  d1%>%
          dplyr::left_join(d2,  by  = "PatientID")%>%
          dplyr::left_join(d3,  by  = "PatientID")%>%
          dplyr::left_join(d4,  by  =c("PatientID" = "PatID"))
p_left3

What if your datasets had multiple keys for each dataset

dx <- data.frame( PatientID  = c('P001','P001','P002', 'P003')
                ,  VisitID    = c(1,2,3,4)
                , Age            = c(23,12,5,6)
                , Outcome      = c('ALIVE','DIED','ALIVE', 'ALIVE')   )

dx
dy <- data.frame( PatID = c('P001','P002','P003','P003')
                , VstID   = c(1,2,3,4)
                , Age           = c(23,12,5,6)
                , cost      = c(1000,2000,1000, 1010)   )


dy
p_left4 <-  dx%>%
  dplyr::left_join(dy,  by  =c('PatientID' = 'PatID'
                               ,'VisitID'   = 'VstID') )

p_left4

Remove same named columns

In case you wanted to remove the Age column from appearing twice, you can remove it from your second table as shown below.

p_left5 <-  dx%>%
           dplyr::left_join(dy%>%dplyr::select(-Age)
                  ,  by  =c('PatientID' = 'PatID'
                           ,'VisitID'   = 'VstID') )

p_left5

Watch our complete tutorial on all aspects of DPLYR.

https://www.youtube.com/playlist?list=PLkHcMTpvAaXVJzyRSytUn3nSK92TJphxR