Let’s continue exploring dplyr functions while also building our literacy in R language/coding.We will continue to learn new ways to subset our data, aggregate and group data and analyze the results.

Basic Setup

ls() #check out the objects in the environment
rm(list=ls()) ## here we are telling R the command *remove* (or rather, clear) all of the objects from the workspace to be able to start with a clean environment.  

setwd("~/Directories/Practice Directory")
library(dplyr)
library(readr)
library(stringr)
library(lubridate)
###load our data
cm2<- read_csv("cm2.csv")
dems <- read_csv("dems.csv")
intake <- read_csv("intake.modified.csv")

Review: Filter and Select

Last week we reviewed how to subset our data using filter() and select(). We were able create new data frames or simply review the data with immediate resultsUsi.
Using filter() we can get very specific by concatenating specific conditions we want our data frame to meet. As we learned previously, we can separate conditions we want filter() to meet with “&” or with commas. However, when we have multiple selections under the same variable, we want to concatenate.
Imagine: you want to select 5 program names, and you don’t want to write “ProgramName ==”program name” five times > we can use “c”, or, concatenate.

df1 <- cm2 |> filter(ProgramName %in% c("Journeys", "Creative Courses", "Casa Alma", "Revive", "Temporary Transitions"))
## we use the combination '%in%" plus a "c" before parentheses(), which include all the conditions we want to meet.
##be sure to always have each concatenated entry in quotation marks or the command will fail.

Count Function

R conveniently has many functions used for counting different entries in data frames, and the most useful choice is the count() function, part of the dplyr package. At it simplest, you can simply call the dataframe or tibble name, and then use pipes to count whatever entry you are interested in.

df1 |> count(ProgramName)##simple version
## # A tibble: 5 × 2
##   ProgramName               n
##   <chr>                 <int>
## 1 Casa Alma                21
## 2 Creative Courses         19
## 3 Journeys                 11
## 4 Revive                   60
## 5 Temporary Transitions    12
df1 |> count(BankAccount)
## # A tibble: 4 × 2
##   BankAccount                                     n
##   <chr>                                       <int>
## 1 No - does not have a bank account              34
## 2 Yes - has a checking account                   40
## 3 Yes - has a savings account                    10
## 4 Yes - has both savings and checking account    39

You can also add conditions to your count function, and R will return a TRUE or FALSE with the corresponding number.

df1 |> count(DestinationAtExitWeight==1)
## # A tibble: 2 × 2
##   `DestinationAtExitWeight == 1`     n
##   <lgl>                          <int>
## 1 FALSE                             12
## 2 NA                               111
df1 |> count(DestinationSafe=="Yes, both staff and client consider destination safe")
## # A tibble: 3 × 2
##   DestinationSafe == "Yes, both staff and client consider destination sa…¹     n
##   <lgl>                                                                    <int>
## 1 FALSE                                                                        1
## 2 TRUE                                                                        11
## 3 NA                                                                         111
## # ℹ abbreviated name:
## #   ¹​`DestinationSafe == "Yes, both staff and client consider destination safe"`

Finally, we can get a little more complicate with our count, when we want to filter for age factors. First, we have to change the column “DateOfBirth” to a date format. Then, we can find all the clients who are under 21.

df1$DateOfBirth <- as.Date(df1$DateOfBirth, format = "%m/%d/%Y")
df1 |> count(DateOfBirth <= "2003-12-03")
## # A tibble: 2 × 2
##   `DateOfBirth <= "2003-12-03"`     n
##   <lgl>                         <int>
## 1 FALSE                            14
## 2 TRUE                            109

While we can use count() on its own, it is often more useful to combine with the filter() command and the group_by() command, also a part of the tidyverse (dplyr). Here we will use the filter() command with the ubiquitous and notorious NA…..(NA NA NA notorious).

df1 |> filter(!is.na(CMandYouthMet)) |> group_by(ProgramName) |> count(CMandYouthMet=="Yes")
## # A tibble: 7 × 3
## # Groups:   ProgramName [4]
##   ProgramName           `CMandYouthMet == "Yes"`     n
##   <chr>                 <lgl>                    <int>
## 1 Creative Courses      TRUE                        13
## 2 Journeys              FALSE                        1
## 3 Journeys              TRUE                         8
## 4 Revive                FALSE                        6
## 5 Revive                TRUE                        26
## 6 Temporary Transitions FALSE                        1
## 7 Temporary Transitions TRUE                        11
##first, we filter the data frame to remove NA's. Remember, ! in front of anything means does not equal, or "is not" so,  !is.na means, "is not NA in the column: CM and Youth Met. Then, we group by Program Name, and count how many answered "Yes" per program.

Date Filtering Cont’d

On the topic of filtering, we can now pivot to look at another way to filter dates. Often when performing data analysis for reporting, we need to isolate dates. If we use actual dates as parameters, for example, we only want CMA’s from the latest quarter, we may under report or miss some data that was entered into ETO outside of the dates we indicate. Another good example is our “Intake Assessment” data. Some reports, we need to isolate the earliest intake assessment they received, to find out say, their age at entry to Larkin Street, or how long they had been experiencing homelessness at that point. Other reports, we need to know the latest information of a client, in regards to forms of victimization or being trafficked. This is where the incredibly useful commands slice_min and slice_max comes into play.

Intuitively, slice_min effectively “slices” our data down to the earliest (minimum) date we can find, based on another variable. Slice_max “slices” our data frame to the latest (maximum) date we can find, based on another variable. Both commands use the logic of dates of having values, and then, based on those values, we can isolate the maximum or minimum value.

We use these commands in conjunction with the group_by(), which tells the slice_min() command what variable to use as base variable to find the minimum or maximum value associated with.

##First change the date class to "as.Date"

intake$DateTaken<- as.Date(intake$DateTaken, format = "%m/%d/%Y")
intake.1<- intake |> group_by(ParticipantID) |> slice_min(DateTaken) ### we "group" all the Participant ID's in their own groups, and then within those groups, we find the minimum date value and isolate in a new dataframe/tibble.

Note: our “slice” commands work far beyond isolating dates. Any variable that is numeric, or alternatively, has a “value” associated with it, can be sliced.
Examples we would find in data that we use at LSYS:

  1. Weight of Exit Destination
  2. Any income variables (GA, Earnings Amounts, Financial Aid)
  3. Hourly Wage
  4. OR: any categorical variable that we have assigned a weight/value to

Mutate

The mutate() function of dplyr is probably one of the most useful commands in the dplyr/tidyverse universe. Data wrangling, and creating “tidy data” means making large data sets useful, and with variables that we can find meaning from. Often times, variables entered in columns as is may not say a lot, but combined with other variables or manipulated in some fashion, we can deduce greater meaning. Most often, in data wrangling for reporting, I use mutate() in combination with another command we will learn a bit later, if_else. However, there are plenty of uses on its own.

Here, we want to create a new variable that captures the age of our clients. While this measure cannot be 100% accurate, since we can only use the latest data we have in ETO, we can get pretty close. We will calculate a clients age with the following formula based on variables in our CMA 2:

  • (Latest) Date Taken - Date of Birth
  • This formula only works if both of the variables have been turned into “values” (as.Date())

Note: by rule of thumb, we would usually use Active Clients or Program Enrollments to get the most accurate enrollment dates; additionally, I would first create a “dummy variable” using mutate in combination with “if_else” to create a column that shows the latest possible date they were enrolled in the quarter to get their most up to date age. But for this exercise, we will keep it simple

df2 <- cm2
df2$DateTaken <- as.Date(df2$DateTaken, format = "%m/%d/%Y")
df2$DateOfBirth <- as.Date(df2$DateOfBirth, format = "%m/%d/%Y")
#first we have to find their latest intake date, using slice_max

df2<-df2 |> group_by(ParticipantID) |> slice_max(DateTaken)

df2 <- df2 |> mutate(Age = year(DateTaken)- year(DateOfBirth))
df2$Age
##   [1] 31 30 33 33 25 31 27 31 27 26 29 33 30 29 29 28 27 27 28 22 26 32 23 24 27
##  [26] 25 26 26 29 28 28 23 26 26 23 26 22 23 24 26 24 26 23 25 25 24 25 26 25 23
##  [51] 23 22 26 25 29 29 22 25 23 25 22 27 25 26 24 25 23 23 27 26 24 23 29 22 26
##  [76] 23 26 25 25 27 26 28 26 23 27 28 25 27 27 27 25 27 22 19 26 23 25 23 23 22
## [101] 25 26 25 26 27 24 28 26 25 22 24 27 22 22 22 24 23 21 20 22 27 23 23 24 26
## [126] 24 20 23 24 23 25 24 22 21 22 22 21 23 24 22 21 23 26 26 26 26 25 25 26 22
## [151] 24 25 23 26 23 22 20 22 24 25 23 25 25 22 21 26 23 26 19 23 25 23 21 21 25
## [176] 26 23 24 24 23 23 23 23 20 20 26 23 23 25 23 23 24 23 21 20 21 23 26 23 26
## [201] 24 21 24 23 20 24 21 20 23 22 24 19 23 24 25 26 25 23 22 24 20 25 24 24 24
## [226] 21 21 21 26 22 23 22 25 22 21 22 22 24 22 25 22 25 23 26 19 24 18 22 23 24
## [251] 22 21 22 24 22 22 21 24 22 24 20 23 21 19 21 25 21 23 22 22 21 23 23 26 20
## [276] 21 19 23 24 22 20 22 19 22 19 24 21 21 22 24 22 19 19 20 22 21 25 23 20 21
## [301] 22 25 22 23 23 19 23 21 26 25 23 22 20 24 22 20 24 19 21 23 22 24 19 26 21
## [326] 25 21 24 24 23 23 21 19 23 25 21 25 23 21 21 21 24 22 24 24 22 20 18 21 24
## [351] 21 18 21 23 23 18 23 18 25 23 23
##Other ways to calculate date
# df2<- df2 |> mutate(Age2=floor((DateTaken-DateOfBirth)/365))
# df2$Age2

Beginning Joins

The final dplyr command we will explore is the join() command. The are several types of joins, all have their own purpose, and almost any report processed through R will require one of the join function at least once. What is a join? The join functions in the dplyr package make sure that two tables (df’s, tibbles) are combined so that matching rows are combined together (this is very similar to SQL syntax as well, if you are familiar). One needs to identify one or more columns that serve to match up the columns of each dataframe.

left_join is the most common join and we will apply that here.

df3 <- left_join(df2, dems, by = "ParticipantID") |> select(-MiddleName, -AKA)
## 1) list the tables in order you want the joins
## 2) Identify the variable (or 2 variables) you want to join the tables by, and ensure they are in quotation marks
## 3) Removed some columns I don't want, using select and the " - " symbol (just to show another way to use the select() function)

##note this give a HUGE data set; normally I would reduce columns/clean up my df's before joining together, but this is just an example.

Putting it all together

Now, let’s combine what we know about count(), filter(), slicing, mutate() and joins to deduce some key data points regarding our client base.

We want to isolate clients from 2 specific programs who have a CMA assessment and then want to match it up with their latest intake assessment and the demographics report.

Please complete the following steps:

  1. Create a new data frame (called df4) from CMA 2, including clients from Caminos, Casa Alma, Youth Matters, and New Horizons; only include Reassessments and Initial Assessments.
  2. Create an Age Column based on their latest Assessment Date
  3. Match/join this new DF with their latest Intake Assessment, include only the columns: Physical Disability, Development Disability, Chronic Health Condition, and Labor Scared Or Unsafe.
  4. Join this new data frame also with Demographic data frame.
  5. Report the following numbers # of clients with any PRIOR disability/health condition # of clients who were born outside of the US # Breakdown of clients who have ANY income, by Program Name. # Total demographic breakdown of Gender and Birth Country for all clients between the ages of 21 and 24 # Total female clients who were born inside the US by Program Name
##subset the data
df4 <- cm2 |> filter(ProgramName %in% c("Caminos", "Casa Alma", "Youth Matters", "New Horizons") & Version %in% c("Reassessment (Quarterly)", "Initial Assessment"))
## create an Age column
## change the column type first

df4$DateOfBirth<- as.Date(df4$DateOfBirth, format = "%m/%d/%Y")
df4$DateTaken <- as.Date(df4$DateTaken, format= "%m/%d/%Y")
df4<- df4 |> mutate(Age = year(DateTaken)- year(DateOfBirth))

### we should also slice to the latest entry date per client
df4<- df4 |> group_by(ParticipantID) |> slice_max(DateTaken) |> distinct(ParticipantID, .keep_all = TRUE)
##added in a new command here, will discuss distinct in a later lesson


###join with the latest intake date
## first clean the data set
intake.2<- intake |> group_by(ParticipantID) |> slice_max(DateTaken) |> distinct(ParticipantID, .keep_all = TRUE) ##remember we already changed the date class earlier in the lesson

##isolate the columns we want
intake.2<- intake.2 |> select(ParticipantID, PhysicalDisability, DevelopmentalDisability, ChronicHealthCondition, LaborScaredOrUnsafe)
###don't forget Participant ID, since we need to be able to join these columns with the first dataframe
df4<- left_join(df4, intake.2, by = "ParticipantID")

## we could slice our Intake Assessments after we join, but it is a MUCH cleaner process to do so before.

df4<-left_join(df4, dems, by = "ParticipantID")

We have now prepped a new data frame and can analyze its contents. You may have to use the unique() function to get the exact qualitative entries in the variable columns.

# of clients with ANY PRIOR disability/health condition

df4 |> count(PhysicalDisability == "Yes – Prior Treatment, Not Current"|DevelopmentalDisability =="Yes – Prior Treatment, Not Current"| ChronicHealthCondition == "Yes – Prior Treatment, Not Current")
## # A tibble: 74 × 3
## # Groups:   ParticipantID [74]
##    ParticipantID `|...`     n
##            <dbl> <lgl>  <int>
##  1         17021 FALSE      1
##  2         17764 FALSE      1
##  3         17779 FALSE      1
##  4         17965 FALSE      1
##  5         18387 FALSE      1
##  6         18796 FALSE      1
##  7         19162 TRUE       1
##  8         19228 FALSE      1
##  9         19316 FALSE      1
## 10         20577 FALSE      1
## # ℹ 64 more rows
###it is showing only a list of responses per entry/observation. This is not helpful.  The reason we are not getting "tidy" results is due to the fact that because of all the joins, the data frame has many formats.  Let's look at the formats:
class(df4)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
#our data frame is many things!! that is okay, however, we want to TELL R that is first and foremost a "data.frame".  The code below force changes the format of the data frame. 
df4<-data.frame(df4)

##noww, let's run again!
df4 |> count(PhysicalDisability == "Yes – Prior Treatment, Not Current"|DevelopmentalDisability =="Yes – Prior Treatment, Not Current"| ChronicHealthCondition == "Yes – Prior Treatment, Not Current")
##    |...  n
## 1 FALSE 64
## 2  TRUE  6
## 3    NA  4
# of clients who were born outside of the US
df4 |> count(BirthCountry != "United States")
##   BirthCountry != "United States"  n
## 1                           FALSE 52
## 2                            TRUE 22
# Breakdown of clients who have a ANY Income, by Program Name.

df4 |> group_by(ProgramName) |> count(AnyIncome == "Yes")
## # A tibble: 7 × 3
## # Groups:   ProgramName [4]
##   ProgramName   `AnyIncome == "Yes"`     n
##   <chr>         <lgl>                <int>
## 1 Caminos       FALSE                    6
## 2 Caminos       TRUE                    21
## 3 Casa Alma     FALSE                    9
## 4 Casa Alma     TRUE                    12
## 5 New Horizons  TRUE                    11
## 6 Youth Matters FALSE                    3
## 7 Youth Matters TRUE                    12
# Total demographic breakdown of Gender and Birth Country for all clients between the ages of 21 and 24
df4 |> filter(Age<25 & Age>= 21) |> count(Gender)
##                            Gender  n
## 1                   Cisgender Man 27
## 2                 Cisgender Woman 16
## 3 Gender Nonconforming/Non-Binary  1
## 4               Transgender Woman  2
df4 |> filter(Age<25 & Age>= 21) |> count(BirthCountry)
##     BirthCountry  n
## 1       Colombia  2
## 2     Costa Rica  1
## 3          Gabon  1
## 4          Ghana  1
## 5       Honduras  6
## 6         Mexico  1
## 7      Nicaragua  3
## 8    Philippines  1
## 9  United States 29
## 10     Venezuela  1
# Total female clients who were born inside the US by Program Name

df4 |> filter(Gender== "Cisgender Woman") |> group_by(ProgramName) |> count(BirthCountry == "United States")
## # A tibble: 4 × 3
## # Groups:   ProgramName [4]
##   ProgramName   `BirthCountry == "United States"`     n
##   <chr>         <lgl>                             <int>
## 1 Caminos       TRUE                                 12
## 2 Casa Alma     FALSE                                 3
## 3 New Horizons  TRUE                                  1
## 4 Youth Matters TRUE                                 10