Session 5: Advanced dplyr Functions

Let’s continue exploring dplyr functions while also building our literacy in R language/coding. We will further explore additional join() types and as well as building upon our mutate() skills, in combination with the useful if_else() function.

ls() #check out the objects in the environment
rm(list=ls()) ## here we are telling R the command *remove* (or rather, clear) all of the objects from the workspace to be able to start with a clean environment.  

setwd("~/Directories/Practice Directory")

library(dplyr)
library(readr)
library(stringr)
library(lubridate)

Load the Data

###load our data
active<- read_csv("active.csv")
acs <- read_csv("acs.csv")
cm2<- read_csv("cm2.csv")
dems <- read_csv("dems.csv")
intake <- read_csv("intake.modified.csv")

1. What’s the point?

We have begun to explore joins() which allow us to combine two different tables together in the R environment. Base R contains some useful functions to join together tables, for example rbind() (row bind) which simply stacks the rows of two tables on top of each other whilst creating a new object (table). While we tend to use rbind() prolifically in our reporting, it has limitations; for example, you must have the exact same column names and type of column for this base function to operate. We also could try concatenating two tables together, however, it will not work if the order of observations within a variable we are trying to join by is not in the exact same order. We can see examples of both rbind() and concatenate (using “identical”) below

df1 <- rbind(active, cm2)

identical(active$`Program Name`, cm2$ProgramName)

## Warning: Unknown or uninitialised column: `Program Name`.

## [1] FALSE

Joins solve for this type of situation.

Joins take another approach, where we can match data from one table to corresponding data in another table. There are many ways to approach these types of joins, and each join() serves specific purposes. We will first go over the most common ways we join tables in wrangling for reporting purposes. Then, we will review the remaining types of joins, as you may find a need for them at some point in your reporting or analysis projects.

2. Left_Joins - “Tried and True”

Left joins take all the rows in the first table, and match to their corresponding row in the other table. Let’s try out left_join() with a specific example. Suppose we want to take all of our residential clients who had CMA 2’s last quarter and add on all intake data we have for those clients. We indicate which column we want to match using the by argument. We only need to maintain the first 6 columns of the original CMA 2 table, and then we can join tables.

## first, subset the data frame to retain only the columns we need

cm2.2 <- cm2[, (1:6)] |> distinct(ParticipantID, .keep_all = TRUE)

df1 <- left_join(cm2.2, intake, by = "ParticipantID") ## rememeber, if the variable we are joining the two tables together by is text, it must be named within parentheses

We have received a warning; this is because our first table has multiple entries in the second table (remember, we wanted to match ALL intake assessment data available for each client). We must define the relationship to not receive a warning.

df1<- left_join(cm2.2, intake, by = "ParticipantID", relationship = "many-to-many")

If a row (observation) in the first table has multiple entries in the second table, the new data frame will create additional rows for each of those entries. The columns that came from the first table will contain replicated data, and the columns from the second table will be unique per each matching row/observation. We can observe this by isolating some clients that have multiple entries.

First, we will find the clients/rows which have multiple entries, and then we will view the data.

df1 |> group_by(ParticipantID) |> filter(n()!=1)

## # A tibble: 590 × 120
## # Groups:   ParticipantID [226]
##    ParticipantID DateOfBirth.x ProgramName   DateTaken.x AuditDate.x Version    
##            <dbl> <chr>         <chr>         <chr>       <chr>       <chr>      
##  1         20757 6/22/2002     Caminos       9/30/2024   9/30/2024   Final / At…
##  2         20757 6/22/2002     Caminos       9/30/2024   9/30/2024   Final / At…
##  3         20757 6/22/2002     Caminos       9/30/2024   9/30/2024   Final / At…
##  4         22914 10/5/2003     Journeys      9/30/2024   9/30/2024   Reassessme…
##  5         22914 10/5/2003     Journeys      9/30/2024   9/30/2024   Reassessme…
##  6         22914 10/5/2003     Journeys      9/30/2024   9/30/2024   Reassessme…
##  7         22914 10/5/2003     Journeys      9/30/2024   9/30/2024   Reassessme…
##  8         22914 10/5/2003     Journeys      9/30/2024   9/30/2024   Reassessme…
##  9         18703 9/17/1998     Healthy Paths 9/30/2024   9/30/2024   Reassessme…
## 10         18703 9/17/1998     Healthy Paths 9/30/2024   9/30/2024   Reassessme…
## # ℹ 580 more rows
## # ℹ 114 more variables: DateOfBirth.y <chr>, ResponseID <dbl>,
## #   `Completing Program Unique Identifier_6` <dbl>, DateTaken.y <chr>,
## #   AuditDate.y <chr>, HearAboutLarkin <chr>, HearAboutLarkinOther <chr>,
## #   FormalReferral <chr>, ReferralSource <chr>, LarkinOutreachContact <chr>,
## #   LarkinOutreachContactHood <chr>, IncreaseIncome <lgl>, LSAReferral <lgl>,
## #   WhyLSANo <lgl>, OtherLSANo <lgl>, SeekingAssistanceWith <chr>, …

df1.ex <- df1 |> filter(ParticipantID == 22914)
##view df1.ex

In other cases, there may not be a corresponding row in the second table. In those cases, the columns that begin with the joined second table will contain NA values.

df1.ex2<- df1 |> filter(is.na(DateTaken.y))
##view(df1_ex2)

In many cases, we want to see all possible entries per observation in the first table, and then we can reduce those entries by filtering what we need for a metric. Or, we can count entries, to achieve other results.
In different reports, we may need to know total NA’s to understand how many clients, for example, did not have a type of service or touchpoint during a given time period. So, left_joins are all inclusive and can be very useful. We will practice more ways to use data that includes NA’s or multiple entries later.

3. Semi_Joins - The Ultimate Filter

We often want to find which clients from one table have an entry in the second table. In these cases, we don’t necessarily need to know all the data from the second table, we ONLY need to know if there is a match in that table. For example, we want to know how many of our residential clients participated in LSA Core Curriculum groups or individual sessions.

The semi_join() functions solves for this situation. It allows to keep the part of the first table for which we have information in the second. It does NOT add columns of the second table to the new data frame (because we don’t need the detailed information, only that there was a match), and it does NOT include any rows from the first table that do NOT have a match.

We will first create a data frame that includes only the specified touchpoints, then, we will join with our active clients table to find the matches.

lsa.cc<- acs |> filter(ServiceID %in% c(111, 110)) ##110 : LSA Ind. CC, ##111 : LSA Group CC
active<- active |> distinct(ParticipantID, .keep_all = TRUE)

df2<- semi_join(active, lsa.cc, by = "ParticipantID")
dim(df2)

## [1] 229   5

We can now report that out of the 683 clients within the report period, 229 have participated in LSA Group Services: Core Curriculum or LSA Individual Services: Core Curriculum. However, if we wanted to count how many residential clients participated specifically in LSA CC Groups or LSA CC Individual, or both, semi_join() would not help because we didn’t retain the data from the second table lsa.cc which provided that information. Here is where inner_join() comes in to help.

4. Inner_Joins - The Most Specific Detail

Inner_join() let’s us keep only the rows that have information in both tables. It is an intersection of both tables, retaining only the data that is useful.

df3<- inner_join(active, lsa.cc, by = "ParticipantID") 
### note that this matches ALL possible entries, so we have a large number of matches

df3 |> filter(is.na(ServiceID))

## # A tibble: 0 × 12
## # ℹ 12 variables: ParticipantID <dbl>, Date of Birth <chr>,
## #   Program Start Date <chr>, Program End Date <chr>, Month of Export <lgl>,
## #   DateOfBirth <chr>, ServiceID <dbl>, ServiceType <chr>, ServiceDate <chr>,
## #   ResponseID <dbl>, Staff <chr>, StaffID <dbl>

Of course, when we look for any rows from the first table that do not have matches in the second table, none exist. This of course is because inner_join() ONLY finds matches and combines all the data. Now, we can count how many clients had each type of service, or both. And then, we can reduce the data to only unique ID’s (UDC) and see that the number of observations matches df2, as it should.

df3 |> count(ServiceID == 110) ##LSA Ind. CC

## # A tibble: 2 × 2
##   `ServiceID == 110`     n
##   <lgl>              <int>
## 1 FALSE               1491
## 2 TRUE                1239

df3 |> count (ServiceID==111) ## LSA Group CC.

## # A tibble: 2 × 2
##   `ServiceID == 111`     n
##   <lgl>              <int>
## 1 FALSE               1239
## 2 TRUE                1491

df3<- data.frame(df3)
df3 |> count(ServiceID ==110 & ServiceID==111) ##both services per client

##   ServiceID == 110 & ServiceID == 111    n
## 1                               FALSE 2730

df3 <- df3 |> distinct(ParticipantID, .keep_all = TRUE)

Anti_Join - how to find what is NOT there

Often, we need to find when clients in one data set do NOT have a match in another data set. This most often comes into play with outreach services and metrics which we need to provide for various public funders. Therefore, we have to report clients that meet certain matches (such as participation in Core Curriculum Groups or Case Management Sessions), however, those clients can NOT be living within residential programming or sometimes shelter.

The function anti_join() does the opposite of semi_join(); it retains the elements of the first table for which there is NO information in the second table. Let’s try this with the table that lists all clients with LSA Core Curriculum services lsa.cc. We will first reduce that table to the total UDC, and then perform an anti_join() with active clients (all of which are housed in some fashion).

lsa.cc.2 <- lsa.cc |> distinct(ParticipantID, .keep_all=TRUE)

no.res<- anti_join(lsa.cc.2, active, by = "ParticipantID")
dim(no.res)

## [1] 173   8

We can see that out of the clients who participated in the LSA Core Curriculum activities, 173 were unhoused during the report period that active covered. Later, we will do an activity where we can get more detail about those clients, also using the mutate() function. First, however, we will review the remaining types of joins that exist.

Other Joins

Here is a brief review of other types of joins, which you may find useful in the future.
Credits for this part of lesson via Rafalab from Harvard Ex

Right_Join

If instead of a table with the same rows as the first table, we want one with only the same rows as the second table, we can use right_join

cm2.3 <- cm2
dems.2<- dems
dems.2|> right_join(cm2.3, by = "ParticipantID")

## # A tibble: 397 × 202
##    ParticipantID DateOfBirth.x ChartNumberACAC EnglishFluencyReading
##            <dbl> <chr>                   <dbl> <chr>                
##  1         11933 8/16/1993                  NA Yes                  
##  2         12226 4/30/1994                   0 Yes                  
##  3         12376 5/14/1991                  NA Yes                  
##  4         12659 12/5/1991                  NA Yes                  
##  5         13884 10/20/1999                 NA Yes                  
##  6         14048 6/23/1993                  NA Yes                  
##  7         14089 5/3/1997                    0 No                   
##  8         14108 10/25/1993                 NA Yes                  
##  9         14328 8/13/1997                  NA Yes                  
## 10         14651 7/19/1998                  NA Yes                  
## # ℹ 387 more rows
## # ℹ 198 more variables: EnglishFluencySpoken <chr>,
## #   EnglishFluencyWritten <chr>, Ethnicity <chr>, EthnicitySpecify <chr>,
## #   Latino <chr>, Gender <chr>, GenderIDSpecify <chr>, AssignedSex <chr>,
## #   GenderSpecify <chr>, SexualOrientation <chr>, OrientationSpecify <chr>,
## #   MaritalStatus <chr>, Veteran <chr>, VeteranService <dbl>,
## #   ActiveCombat <chr>, Language <chr>, LanguageSpecify <chr>, …

Notice we set up the join a little different here. Here, we only retained the rowscolumns from dems.2. You can also achieve the same effect by switching the order of your tables in a semi_join(). However, there are specific occasions when using right_join() is the best case scenario. I, however, do not know of any of those occasions. If you find out, please let me know.

Full_Join

If we want to keep all rows, and fill the missing parts with NAs, we can full_join(). You can think of this as a union

full_join(cm2, intake, by = "ParticipantID") ##here we didn't create a new object

## Warning in full_join(cm2, intake, by = "ParticipantID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 41 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

## # A tibble: 5,883 × 272
##    ParticipantID DateOfBirth.x ProgramName DateTaken.x AuditDate.x Version      
##            <dbl> <chr>         <chr>       <chr>       <chr>       <chr>        
##  1         20757 6/22/2002     Caminos     9/30/2024   9/30/2024   Final / At E…
##  2         20757 6/22/2002     Caminos     9/30/2024   9/30/2024   Final / At E…
##  3         20757 6/22/2002     Caminos     9/30/2024   9/30/2024   Final / At E…
##  4         22914 10/5/2003     Journeys    9/30/2024   9/30/2024   Reassessment…
##  5         22914 10/5/2003     Journeys    9/30/2024   9/30/2024   Reassessment…
##  6         22914 10/5/2003     Journeys    9/30/2024   9/30/2024   Reassessment…
##  7         22914 10/5/2003     Journeys    9/30/2024   9/30/2024   Reassessment…
##  8         22914 10/5/2003     Journeys    9/30/2024   9/30/2024   Reassessment…
##  9         23104 8/15/2000     Revive      9/30/2024   9/30/2024   Reassessment…
## 10         22258 3/11/2003     Revive      9/30/2024   9/30/2024   Reassessment…
## # ℹ 5,873 more rows
## # ℹ 266 more variables: CMandYouthMet <chr>, DateLastMet <chr>, CMNotes <chr>,
## #   DestinationAtExit <chr>, LarkinStreetProgram <chr>, SubsidyLocation <chr>,
## #   HousingPlanDetails <chr>, DestinationLocation <chr>, DestinationSafe <chr>,
## #   DestinationTime <chr>, OwnBed <chr>, WhereSleeping <lgl>, WhoDecides <chr>,
## #   `NameOnYouth Matters` <chr>, ObstaclesToStableHousing <chr>,
## #   OtherObstacles <chr>, RentPaymentStatus <chr>, BenefitEligibility <chr>, …

The handy image below is useful when determining what type of join will serve your reporting needs.

Types of Joins

Exercises

Let’s isolate clients and learn more information about them by using a combination of joins and mutate.

Exercise 1 - Unhoused Clients

Once again, we will find clients that were unhoused during a report period. We also have to do a separate operation, where we isolate clients that did find housing, and determine if there service date was earlier then their entry in housing/shelter. If that is their case, we want to add them back into the data set. Finally, we will create a binary variable to determine whether they were under the age of 23 at the time of their service.

Details

Create a data frame that includes only clients who had an “Outreach” touchpoint (Service ID== 157). Isolate to the earliest date of the touchpoint.
Out of those clients, create 2 data frames:
- df1 = clients who were unhoused
- df2 = clients who were housed, but actually unhoused at the time of service
  - hint #1: this is a multi step operation including changing class of variables
  - hint #2: what type of join do you want to use here?
Combine df2 clients back with df1 clients (using rbind)
- hint #3: rbind requires the same number and types of columns to operate
Create an Age column, and then determine if they were under the age of 23 at the time they received their Outreach touchpoint using a binary variable
Count how many clients were under the age of 23 at the time of service

out<- acs |> filter(ServiceID==157)
out$ServiceDate<- as.Date(out$ServiceDate, format = "%m/%d/%Y")
out<- out |> group_by(ParticipantID) |> slice_min(ServiceDate) |> distinct(ParticipantID, .keep_all = TRUE)

df1<- anti_join(out, active, by = "ParticipantID")
df2<- inner_join (out, active, by = "ParticipantID") ## number of observations should be the remainder of DF out - df1.
class(df2$`Program Start Date`)

## [1] "character"

df2$`Program Start Date`<- as.Date(df2$`Program Start Date`, format = "%m/%d/%Y")
df2 <- df2 |> filter(ServiceDate<`Program Start Date`)
names(df2)

##  [1] "ParticipantID"      "DateOfBirth"        "ServiceID"         
##  [4] "ServiceType"        "ServiceDate"        "ResponseID"        
##  [7] "Staff"              "StaffID"            "Date of Birth"     
## [10] "Program Start Date" "Program End Date"   "Month of Export"

names(df1)

## [1] "ParticipantID" "DateOfBirth"   "ServiceID"     "ServiceType"  
## [5] "ServiceDate"   "ResponseID"    "Staff"         "StaffID"

df2<- df2[, c(1:8)] ## subset df2 to match columns from df1

out<- rbind(df1, df2)

out$DateOfBirth<- as.Date(out$DateOfBirth, format = "%m/%d/%Y")
out<- out |> mutate(Age = year(ServiceDate)-year(DateOfBirth))

out<- out |> mutate(under23 = if_else(Age < 23, 1, 0))

out<-data.frame(out)
out |> count(under23)

##   under23   n
## 1       0  89
## 2       1 124
## 3      NA   1

Exercise 2 - Intake Data betwen specific dates

Details

Create a data frame that joins all housed clients (UDC) to Intake Assessment data for those clients
Isolate to their first intake date
- hint: this can be achieved after joining OR before
Determine how many clients had their first Intake Assessment in Q4 (FY2324)
Of those clients, create a new binary variable to determine if they have ever experienced the following:
- Labor Controlled Money
- Labor Tricked or Forced
- Labor Threatened
Count how many clients have experienced Forced Labor of some kind

## 1st method 
int<- left_join(active, intake, by = "ParticipantID", relationship = "many-to-many")
int$DateTaken<- as.Date(int$DateTaken, format = "%m/%d/%Y")
int<-int |> group_by(ParticipantID) |> slice_min(DateTaken) |> distinct(ParticipantID, .keep_all = TRUE)
int<- int |> filter(DateTaken>= "2024-04-01" & DateTaken<= "2024-07-01")

#### 2nd method, alternatively
intake.2<- intake
intake.2$DateTaken<- as.Date(intake.2$DateTaken, format = "%m/%d/%Y")
intake.2<- intake.2 |> group_by(ParticipantID) |> slice_min(DateTaken) |> distinct(ParticipantID, .keep_all = TRUE)
intake.2<- intake.2 |> filter(DateTaken>= "2024-04-01" & DateTaken<= "2024-07-01")

int.2<- inner_join(active, intake.2, by = "ParticipantID") ##same total as df "int"

############

int<- int |> mutate(ForcedLabor = ifelse(LaborThreatened == "Yes" |LaborTrickedOrForced == "Yes" | 
                                                           LaborControlledMoney =="Yes", 1, 0))
int<- data.frame(int)
int |> count(ForcedLabor)

##   ForcedLabor  n
## 1           0  4
## 2           1 14
## 3          NA 37

Session 5: Advanced dplyr Functions

Alexandra Candia

2025-01-03

Load the Data

1. What’s the point?

2. Left_Joins - “Tried and True”

3. Semi_Joins - The Ultimate Filter

4. Inner_Joins - The Most Specific Detail

Anti_Join - how to find what is NOT there

Other Joins

Right_Join

Full_Join

The handy image below is useful when determining what type of join will serve your reporting needs.

Exercises

Exercise 1 - Unhoused Clients

Details

Exercise 2 - Intake Data betwen specific dates

Details