Let’s continue exploring dplyr functions while also building our literacy in R language/coding. We will further explore additional join() types and as well as building upon our mutate() skills, in combination with the useful if_else() function.
ls() #check out the objects in the environment
rm(list=ls()) ## here we are telling R the command *remove* (or rather, clear) all of the objects from the workspace to be able to start with a clean environment.
setwd("~/Directories/Practice Directory")
library(dplyr)
library(readr)
library(stringr)
library(lubridate)
###load our data
active<- read_csv("active.csv")
acs <- read_csv("acs.csv")
cm2<- read_csv("cm2.csv")
dems <- read_csv("dems.csv")
intake <- read_csv("intake.modified.csv")
We have begun to explore joins() which allow us to combine two different tables together in the R environment. Base R contains some useful functions to join together tables, for example rbind() (row bind) which simply stacks the rows of two tables on top of each other whilst creating a new object (table). While we tend to use rbind() prolifically in our reporting, it has limitations; for example, you must have the exact same column names and type of column for this base function to operate. We also could try concatenating two tables together, however, it will not work if the order of observations within a variable we are trying to join by is not in the exact same order. We can see examples of both rbind() and concatenate (using “identical”) below
df1 <- rbind(active, cm2)
identical(active$`Program Name`, cm2$ProgramName)
## Warning: Unknown or uninitialised column: `Program Name`.
## [1] FALSE
Joins solve for this type of situation.
Joins take another approach, where we can match data from one table to corresponding data in another table. There are many ways to approach these types of joins, and each join() serves specific purposes. We will first go over the most common ways we join tables in wrangling for reporting purposes. Then, we will review the remaining types of joins, as you may find a need for them at some point in your reporting or analysis projects.
Left joins take all the rows in the first table, and match to their corresponding row in the other table. Let’s try out left_join() with a specific example. Suppose we want to take all of our residential clients who had CMA 2’s last quarter and add on all intake data we have for those clients. We indicate which column we want to match using the by argument. We only need to maintain the first 6 columns of the original CMA 2 table, and then we can join tables.
## first, subset the data frame to retain only the columns we need
cm2.2 <- cm2[, (1:6)] |> distinct(ParticipantID, .keep_all = TRUE)
df1 <- left_join(cm2.2, intake, by = "ParticipantID") ## rememeber, if the variable we are joining the two tables together by is text, it must be named within parentheses
We have received a warning; this is because our first table has multiple entries in the second table (remember, we wanted to match ALL intake assessment data available for each client). We must define the relationship to not receive a warning.
df1<- left_join(cm2.2, intake, by = "ParticipantID", relationship = "many-to-many")
If a row (observation) in the first table has multiple entries in the second table, the new data frame will create additional rows for each of those entries. The columns that came from the first table will contain replicated data, and the columns from the second table will be unique per each matching row/observation. We can observe this by isolating some clients that have multiple entries.
First, we will find the clients/rows which have multiple entries, and then we will view the data.
df1 |> group_by(ParticipantID) |> filter(n()!=1)
## # A tibble: 590 × 120
## # Groups: ParticipantID [226]
## ParticipantID DateOfBirth.x ProgramName DateTaken.x AuditDate.x Version
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 20757 6/22/2002 Caminos 9/30/2024 9/30/2024 Final / At…
## 2 20757 6/22/2002 Caminos 9/30/2024 9/30/2024 Final / At…
## 3 20757 6/22/2002 Caminos 9/30/2024 9/30/2024 Final / At…
## 4 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessme…
## 5 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessme…
## 6 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessme…
## 7 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessme…
## 8 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessme…
## 9 18703 9/17/1998 Healthy Paths 9/30/2024 9/30/2024 Reassessme…
## 10 18703 9/17/1998 Healthy Paths 9/30/2024 9/30/2024 Reassessme…
## # ℹ 580 more rows
## # ℹ 114 more variables: DateOfBirth.y <chr>, ResponseID <dbl>,
## # `Completing Program Unique Identifier_6` <dbl>, DateTaken.y <chr>,
## # AuditDate.y <chr>, HearAboutLarkin <chr>, HearAboutLarkinOther <chr>,
## # FormalReferral <chr>, ReferralSource <chr>, LarkinOutreachContact <chr>,
## # LarkinOutreachContactHood <chr>, IncreaseIncome <lgl>, LSAReferral <lgl>,
## # WhyLSANo <lgl>, OtherLSANo <lgl>, SeekingAssistanceWith <chr>, …
df1.ex <- df1 |> filter(ParticipantID == 22914)
##view df1.ex
In other cases, there may not be a corresponding row in the second table. In those cases, the columns that begin with the joined second table will contain NA values.
df1.ex2<- df1 |> filter(is.na(DateTaken.y))
##view(df1_ex2)
In many cases, we want to see all possible entries per observation in
the first table, and then we can reduce those entries by filtering what
we need for a metric. Or, we can count entries, to achieve other
results.
In different reports, we may need to know total NA’s to understand how
many clients, for example, did not have a type of service or touchpoint
during a given time period. So, left_joins are all inclusive and can be
very useful. We will practice more ways to use data that includes NA’s
or multiple entries later.
We often want to find which clients from one table have an entry in the second table. In these cases, we don’t necessarily need to know all the data from the second table, we ONLY need to know if there is a match in that table. For example, we want to know how many of our residential clients participated in LSA Core Curriculum groups or individual sessions.
The semi_join() functions solves for this situation. It allows to keep the part of the first table for which we have information in the second. It does NOT add columns of the second table to the new data frame (because we don’t need the detailed information, only that there was a match), and it does NOT include any rows from the first table that do NOT have a match.
We will first create a data frame that includes only the specified touchpoints, then, we will join with our active clients table to find the matches.
lsa.cc<- acs |> filter(ServiceID %in% c(111, 110)) ##110 : LSA Ind. CC, ##111 : LSA Group CC
active<- active |> distinct(ParticipantID, .keep_all = TRUE)
df2<- semi_join(active, lsa.cc, by = "ParticipantID")
dim(df2)
## [1] 229 5
We can now report that out of the 683 clients within the report period, 229 have participated in LSA Group Services: Core Curriculum or LSA Individual Services: Core Curriculum. However, if we wanted to count how many residential clients participated specifically in LSA CC Groups or LSA CC Individual, or both, semi_join() would not help because we didn’t retain the data from the second table lsa.cc which provided that information. Here is where inner_join() comes in to help.
Inner_join() let’s us keep only the rows that have information in both tables. It is an intersection of both tables, retaining only the data that is useful.
df3<- inner_join(active, lsa.cc, by = "ParticipantID")
### note that this matches ALL possible entries, so we have a large number of matches
df3 |> filter(is.na(ServiceID))
## # A tibble: 0 × 12
## # ℹ 12 variables: ParticipantID <dbl>, Date of Birth <chr>,
## # Program Start Date <chr>, Program End Date <chr>, Month of Export <lgl>,
## # DateOfBirth <chr>, ServiceID <dbl>, ServiceType <chr>, ServiceDate <chr>,
## # ResponseID <dbl>, Staff <chr>, StaffID <dbl>
Of course, when we look for any rows from the first table that do not have matches in the second table, none exist. This of course is because inner_join() ONLY finds matches and combines all the data. Now, we can count how many clients had each type of service, or both. And then, we can reduce the data to only unique ID’s (UDC) and see that the number of observations matches df2, as it should.
df3 |> count(ServiceID == 110) ##LSA Ind. CC
## # A tibble: 2 × 2
## `ServiceID == 110` n
## <lgl> <int>
## 1 FALSE 1491
## 2 TRUE 1239
df3 |> count (ServiceID==111) ## LSA Group CC.
## # A tibble: 2 × 2
## `ServiceID == 111` n
## <lgl> <int>
## 1 FALSE 1239
## 2 TRUE 1491
df3<- data.frame(df3)
df3 |> count(ServiceID ==110 & ServiceID==111) ##both services per client
## ServiceID == 110 & ServiceID == 111 n
## 1 FALSE 2730
df3 <- df3 |> distinct(ParticipantID, .keep_all = TRUE)
Often, we need to find when clients in one data set do NOT have a match in another data set. This most often comes into play with outreach services and metrics which we need to provide for various public funders. Therefore, we have to report clients that meet certain matches (such as participation in Core Curriculum Groups or Case Management Sessions), however, those clients can NOT be living within residential programming or sometimes shelter.
The function anti_join() does the opposite of semi_join(); it retains the elements of the first table for which there is NO information in the second table. Let’s try this with the table that lists all clients with LSA Core Curriculum services lsa.cc. We will first reduce that table to the total UDC, and then perform an anti_join() with active clients (all of which are housed in some fashion).
lsa.cc.2 <- lsa.cc |> distinct(ParticipantID, .keep_all=TRUE)
no.res<- anti_join(lsa.cc.2, active, by = "ParticipantID")
dim(no.res)
## [1] 173 8
We can see that out of the clients who participated in the LSA Core Curriculum activities, 173 were unhoused during the report period that active covered. Later, we will do an activity where we can get more detail about those clients, also using the mutate() function. First, however, we will review the remaining types of joins that exist.
Here is a brief review of other types of joins, which you may find
useful in the future.
Credits for this part of lesson via Rafalab
from Harvard Ex
If instead of a table with the same rows as the first table, we want one with only the same rows as the second table, we can use right_join
cm2.3 <- cm2
dems.2<- dems
dems.2|> right_join(cm2.3, by = "ParticipantID")
## # A tibble: 397 × 202
## ParticipantID DateOfBirth.x ChartNumberACAC EnglishFluencyReading
## <dbl> <chr> <dbl> <chr>
## 1 11933 8/16/1993 NA Yes
## 2 12226 4/30/1994 0 Yes
## 3 12376 5/14/1991 NA Yes
## 4 12659 12/5/1991 NA Yes
## 5 13884 10/20/1999 NA Yes
## 6 14048 6/23/1993 NA Yes
## 7 14089 5/3/1997 0 No
## 8 14108 10/25/1993 NA Yes
## 9 14328 8/13/1997 NA Yes
## 10 14651 7/19/1998 NA Yes
## # ℹ 387 more rows
## # ℹ 198 more variables: EnglishFluencySpoken <chr>,
## # EnglishFluencyWritten <chr>, Ethnicity <chr>, EthnicitySpecify <chr>,
## # Latino <chr>, Gender <chr>, GenderIDSpecify <chr>, AssignedSex <chr>,
## # GenderSpecify <chr>, SexualOrientation <chr>, OrientationSpecify <chr>,
## # MaritalStatus <chr>, Veteran <chr>, VeteranService <dbl>,
## # ActiveCombat <chr>, Language <chr>, LanguageSpecify <chr>, …
Notice we set up the join a little different here. Here, we only retained the rowscolumns from dems.2. You can also achieve the same effect by switching the order of your tables in a semi_join(). However, there are specific occasions when using right_join() is the best case scenario. I, however, do not know of any of those occasions. If you find out, please let me know.
If we want to keep all rows, and fill the missing parts with NAs, we can full_join(). You can think of this as a union
full_join(cm2, intake, by = "ParticipantID") ##here we didn't create a new object
## Warning in full_join(cm2, intake, by = "ParticipantID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 41 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 5,883 × 272
## ParticipantID DateOfBirth.x ProgramName DateTaken.x AuditDate.x Version
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 20757 6/22/2002 Caminos 9/30/2024 9/30/2024 Final / At E…
## 2 20757 6/22/2002 Caminos 9/30/2024 9/30/2024 Final / At E…
## 3 20757 6/22/2002 Caminos 9/30/2024 9/30/2024 Final / At E…
## 4 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessment…
## 5 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessment…
## 6 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessment…
## 7 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessment…
## 8 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessment…
## 9 23104 8/15/2000 Revive 9/30/2024 9/30/2024 Reassessment…
## 10 22258 3/11/2003 Revive 9/30/2024 9/30/2024 Reassessment…
## # ℹ 5,873 more rows
## # ℹ 266 more variables: CMandYouthMet <chr>, DateLastMet <chr>, CMNotes <chr>,
## # DestinationAtExit <chr>, LarkinStreetProgram <chr>, SubsidyLocation <chr>,
## # HousingPlanDetails <chr>, DestinationLocation <chr>, DestinationSafe <chr>,
## # DestinationTime <chr>, OwnBed <chr>, WhereSleeping <lgl>, WhoDecides <chr>,
## # `NameOnYouth Matters` <chr>, ObstaclesToStableHousing <chr>,
## # OtherObstacles <chr>, RentPaymentStatus <chr>, BenefitEligibility <chr>, …
Let’s isolate clients and learn more information about them by using a combination of joins and mutate.
Once again, we will find clients that were unhoused during a report period. We also have to do a separate operation, where we isolate clients that did find housing, and determine if there service date was earlier then their entry in housing/shelter. If that is their case, we want to add them back into the data set. Finally, we will create a binary variable to determine whether they were under the age of 23 at the time of their service.
out<- acs |> filter(ServiceID==157)
out$ServiceDate<- as.Date(out$ServiceDate, format = "%m/%d/%Y")
out<- out |> group_by(ParticipantID) |> slice_min(ServiceDate) |> distinct(ParticipantID, .keep_all = TRUE)
df1<- anti_join(out, active, by = "ParticipantID")
df2<- inner_join (out, active, by = "ParticipantID") ## number of observations should be the remainder of DF out - df1.
class(df2$`Program Start Date`)
## [1] "character"
df2$`Program Start Date`<- as.Date(df2$`Program Start Date`, format = "%m/%d/%Y")
df2 <- df2 |> filter(ServiceDate<`Program Start Date`)
names(df2)
## [1] "ParticipantID" "DateOfBirth" "ServiceID"
## [4] "ServiceType" "ServiceDate" "ResponseID"
## [7] "Staff" "StaffID" "Date of Birth"
## [10] "Program Start Date" "Program End Date" "Month of Export"
names(df1)
## [1] "ParticipantID" "DateOfBirth" "ServiceID" "ServiceType"
## [5] "ServiceDate" "ResponseID" "Staff" "StaffID"
df2<- df2[, c(1:8)] ## subset df2 to match columns from df1
out<- rbind(df1, df2)
out$DateOfBirth<- as.Date(out$DateOfBirth, format = "%m/%d/%Y")
out<- out |> mutate(Age = year(ServiceDate)-year(DateOfBirth))
out<- out |> mutate(under23 = if_else(Age < 23, 1, 0))
out<-data.frame(out)
out |> count(under23)
## under23 n
## 1 0 89
## 2 1 124
## 3 NA 1
## 1st method
int<- left_join(active, intake, by = "ParticipantID", relationship = "many-to-many")
int$DateTaken<- as.Date(int$DateTaken, format = "%m/%d/%Y")
int<-int |> group_by(ParticipantID) |> slice_min(DateTaken) |> distinct(ParticipantID, .keep_all = TRUE)
int<- int |> filter(DateTaken>= "2024-04-01" & DateTaken<= "2024-07-01")
#### 2nd method, alternatively
intake.2<- intake
intake.2$DateTaken<- as.Date(intake.2$DateTaken, format = "%m/%d/%Y")
intake.2<- intake.2 |> group_by(ParticipantID) |> slice_min(DateTaken) |> distinct(ParticipantID, .keep_all = TRUE)
intake.2<- intake.2 |> filter(DateTaken>= "2024-04-01" & DateTaken<= "2024-07-01")
int.2<- inner_join(active, intake.2, by = "ParticipantID") ##same total as df "int"
############
int<- int |> mutate(ForcedLabor = ifelse(LaborThreatened == "Yes" |LaborTrickedOrForced == "Yes" |
LaborControlledMoney =="Yes", 1, 0))
int<- data.frame(int)
int |> count(ForcedLabor)
## ForcedLabor n
## 1 0 4
## 2 1 14
## 3 NA 37