Let’s continue exploring dplyr functions while also building our literacy in R language/coding.We will continue to learn new ways to subset our data, aggregate and group data and analyze the results.
ls() #check out the objects in the environment
rm(list=ls()) ## here we are telling R the command *remove* (or rather, clear) all of the objects from the workspace to be able to start with a clean environment.
setwd("~/Directories/Practice Directory")
library(dplyr)
library(readr)
library(stringr)
library(lubridate)
###load our data
cm2<- read_csv("cm2.csv")
dems <- read_csv("dems.csv")
intake <- read_csv("intake.modified.csv")
Last week we reviewed how to subset our data using
filter() and select(). We were able
create new data frames or simply review the data with immediate
resultsUsi.
Using filter() we can get very specific by
concatenating specific conditions we want our data frame to
meet. As we learned previously, we can separate conditions we want
filter() to meet with “&” or with commas. However,
when we have multiple selections under the same variable, we want to
concatenate.
Imagine: you want to select 5 program names, and you don’t want to write
“ProgramName ==”program name” five times > we can
use “c”, or, concatenate.
df1 <- cm2 |> filter(ProgramName %in% c("Journeys", "Creative Courses", "Casa Alma", "Revive", "Temporary Transitions"))
## we use the combination '%in%" plus a "c" before parentheses(), which include all the conditions we want to meet.
##be sure to always have each concatenated entry in quotation marks or the command will fail.
R conveniently has many functions used for counting different entries in data frames, and the most useful choice is the count() function, part of the dplyr package. At it simplest, you can simply call the dataframe or tibble name, and then use pipes to count whatever entry you are interested in.
df1 |> count(ProgramName)##simple version
## # A tibble: 5 × 2
## ProgramName n
## <chr> <int>
## 1 Casa Alma 21
## 2 Creative Courses 19
## 3 Journeys 11
## 4 Revive 60
## 5 Temporary Transitions 12
df1 |> count(BankAccount)
## # A tibble: 4 × 2
## BankAccount n
## <chr> <int>
## 1 No - does not have a bank account 34
## 2 Yes - has a checking account 40
## 3 Yes - has a savings account 10
## 4 Yes - has both savings and checking account 39
You can also add conditions to your count function, and R will return a TRUE or FALSE with the corresponding number.
df1 |> count(DestinationAtExitWeight==1)
## # A tibble: 2 × 2
## `DestinationAtExitWeight == 1` n
## <lgl> <int>
## 1 FALSE 12
## 2 NA 111
df1 |> count(DestinationSafe=="Yes, both staff and client consider destination safe")
## # A tibble: 3 × 2
## DestinationSafe == "Yes, both staff and client consider destination sa…¹ n
## <lgl> <int>
## 1 FALSE 1
## 2 TRUE 11
## 3 NA 111
## # ℹ abbreviated name:
## # ¹`DestinationSafe == "Yes, both staff and client consider destination safe"`
Finally, we can get a little more complicate with our count, when we want to filter for age factors. First, we have to change the column “DateOfBirth” to a date format. Then, we can find all the clients who are under 21.
df1$DateOfBirth <- as.Date(df1$DateOfBirth, format = "%m/%d/%Y")
df1 |> count(DateOfBirth <= "2003-12-03")
## # A tibble: 2 × 2
## `DateOfBirth <= "2003-12-03"` n
## <lgl> <int>
## 1 FALSE 14
## 2 TRUE 109
While we can use count() on its own, it is often more useful to combine with the filter() command and the group_by() command, also a part of the tidyverse (dplyr). Here we will use the filter() command with the ubiquitous and notorious NA…..(NA NA NA notorious).
df1 |> filter(!is.na(CMandYouthMet)) |> group_by(ProgramName) |> count(CMandYouthMet=="Yes")
## # A tibble: 7 × 3
## # Groups: ProgramName [4]
## ProgramName `CMandYouthMet == "Yes"` n
## <chr> <lgl> <int>
## 1 Creative Courses TRUE 13
## 2 Journeys FALSE 1
## 3 Journeys TRUE 8
## 4 Revive FALSE 6
## 5 Revive TRUE 26
## 6 Temporary Transitions FALSE 1
## 7 Temporary Transitions TRUE 11
##first, we filter the data frame to remove NA's. Remember, ! in front of anything means does not equal, or "is not" so, !is.na means, "is not NA in the column: CM and Youth Met. Then, we group by Program Name, and count how many answered "Yes" per program.
On the topic of filtering, we can now pivot to look at another way to filter dates. Often when performing data analysis for reporting, we need to isolate dates. If we use actual dates as parameters, for example, we only want CMA’s from the latest quarter, we may under report or miss some data that was entered into ETO outside of the dates we indicate. Another good example is our “Intake Assessment” data. Some reports, we need to isolate the earliest intake assessment they received, to find out say, their age at entry to Larkin Street, or how long they had been experiencing homelessness at that point. Other reports, we need to know the latest information of a client, in regards to forms of victimization or being trafficked. This is where the incredibly useful commands slice_min and slice_max comes into play.
Intuitively, slice_min effectively “slices” our data down to the earliest (minimum) date we can find, based on another variable. Slice_max “slices” our data frame to the latest (maximum) date we can find, based on another variable. Both commands use the logic of dates of having values, and then, based on those values, we can isolate the maximum or minimum value.
We use these commands in conjunction with the group_by(), which tells the slice_min() command what variable to use as base variable to find the minimum or maximum value associated with.
##First change the date class to "as.Date"
intake$DateTaken<- as.Date(intake$DateTaken, format = "%m/%d/%Y")
intake.1<- intake |> group_by(ParticipantID) |> slice_min(DateTaken) ### we "group" all the Participant ID's in their own groups, and then within those groups, we find the minimum date value and isolate in a new dataframe/tibble.
Note: our “slice” commands work far beyond isolating dates. Any
variable that is numeric, or alternatively, has a “value” associated
with it, can be sliced.
Examples we would find in data that we use at LSYS:
The mutate() function of dplyr is probably one of the most useful commands in the dplyr/tidyverse universe. Data wrangling, and creating “tidy data” means making large data sets useful, and with variables that we can find meaning from. Often times, variables entered in columns as is may not say a lot, but combined with other variables or manipulated in some fashion, we can deduce greater meaning. Most often, in data wrangling for reporting, I use mutate() in combination with another command we will learn a bit later, if_else. However, there are plenty of uses on its own.
Here, we want to create a new variable that captures the age of our clients. While this measure cannot be 100% accurate, since we can only use the latest data we have in ETO, we can get pretty close. We will calculate a clients age with the following formula based on variables in our CMA 2:
Note: by rule of thumb, we would usually use Active Clients or Program Enrollments to get the most accurate enrollment dates; additionally, I would first create a “dummy variable” using mutate in combination with “if_else” to create a column that shows the latest possible date they were enrolled in the quarter to get their most up to date age. But for this exercise, we will keep it simple
df2 <- cm2
df2$DateTaken <- as.Date(df2$DateTaken, format = "%m/%d/%Y")
df2$DateOfBirth <- as.Date(df2$DateOfBirth, format = "%m/%d/%Y")
#first we have to find their latest intake date, using slice_max
df2<-df2 |> group_by(ParticipantID) |> slice_max(DateTaken)
df2 <- df2 |> mutate(Age = year(DateTaken)- year(DateOfBirth))
df2$Age
## [1] 31 30 33 33 25 31 27 31 27 26 29 33 30 29 29 28 27 27 28 22 26 32 23 24 27
## [26] 25 26 26 29 28 28 23 26 26 23 26 22 23 24 26 24 26 23 25 25 24 25 26 25 23
## [51] 23 22 26 25 29 29 22 25 23 25 22 27 25 26 24 25 23 23 27 26 24 23 29 22 26
## [76] 23 26 25 25 27 26 28 26 23 27 28 25 27 27 27 25 27 22 19 26 23 25 23 23 22
## [101] 25 26 25 26 27 24 28 26 25 22 24 27 22 22 22 24 23 21 20 22 27 23 23 24 26
## [126] 24 20 23 24 23 25 24 22 21 22 22 21 23 24 22 21 23 26 26 26 26 25 25 26 22
## [151] 24 25 23 26 23 22 20 22 24 25 23 25 25 22 21 26 23 26 19 23 25 23 21 21 25
## [176] 26 23 24 24 23 23 23 23 20 20 26 23 23 25 23 23 24 23 21 20 21 23 26 23 26
## [201] 24 21 24 23 20 24 21 20 23 22 24 19 23 24 25 26 25 23 22 24 20 25 24 24 24
## [226] 21 21 21 26 22 23 22 25 22 21 22 22 24 22 25 22 25 23 26 19 24 18 22 23 24
## [251] 22 21 22 24 22 22 21 24 22 24 20 23 21 19 21 25 21 23 22 22 21 23 23 26 20
## [276] 21 19 23 24 22 20 22 19 22 19 24 21 21 22 24 22 19 19 20 22 21 25 23 20 21
## [301] 22 25 22 23 23 19 23 21 26 25 23 22 20 24 22 20 24 19 21 23 22 24 19 26 21
## [326] 25 21 24 24 23 23 21 19 23 25 21 25 23 21 21 21 24 22 24 24 22 20 18 21 24
## [351] 21 18 21 23 23 18 23 18 25 23 23
##Other ways to calculate date
# df2<- df2 |> mutate(Age2=floor((DateTaken-DateOfBirth)/365))
# df2$Age2
The final dplyr command we will explore is the join() command. The are several types of joins, all have their own purpose, and almost any report processed through R will require one of the join function at least once. What is a join? The join functions in the dplyr package make sure that two tables (df’s, tibbles) are combined so that matching rows are combined together (this is very similar to SQL syntax as well, if you are familiar). One needs to identify one or more columns that serve to match up the columns of each dataframe.
left_join is the most common join and we will apply that here.
df3 <- left_join(df2, dems, by = "ParticipantID") |> select(-MiddleName, -AKA)
## 1) list the tables in order you want the joins
## 2) Identify the variable (or 2 variables) you want to join the tables by, and ensure they are in quotation marks
## 3) Removed some columns I don't want, using select and the " - " symbol (just to show another way to use the select() function)
##note this give a HUGE data set; normally I would reduce columns/clean up my df's before joining together, but this is just an example.
Now, let’s combine what we know about count(), filter(), slicing, mutate() and joins to deduce some key data points regarding our client base.
We want to isolate clients from 2 specific programs who have a CMA assessment and then want to match it up with their latest intake assessment and the demographics report.
Please complete the following steps:
##subset the data
df4 <- cm2 |> filter(ProgramName %in% c("Caminos", "Casa Alma", "Youth Matters", "New Horizons") & Version %in% c("Reassessment (Quarterly)", "Initial Assessment"))
## create an Age column
## change the column type first
df4$DateOfBirth<- as.Date(df4$DateOfBirth, format = "%m/%d/%Y")
df4$DateTaken <- as.Date(df4$DateTaken, format= "%m/%d/%Y")
df4<- df4 |> mutate(Age = year(DateTaken)- year(DateOfBirth))
### we should also slice to the latest entry date per client
df4<- df4 |> group_by(ParticipantID) |> slice_max(DateTaken) |> distinct(ParticipantID, .keep_all = TRUE)
##added in a new command here, will discuss distinct in a later lesson
###join with the latest intake date
## first clean the data set
intake.2<- intake |> group_by(ParticipantID) |> slice_max(DateTaken) |> distinct(ParticipantID, .keep_all = TRUE) ##remember we already changed the date class earlier in the lesson
##isolate the columns we want
intake.2<- intake.2 |> select(ParticipantID, PhysicalDisability, DevelopmentalDisability, ChronicHealthCondition, LaborScaredOrUnsafe)
###don't forget Participant ID, since we need to be able to join these columns with the first dataframe
df4<- left_join(df4, intake.2, by = "ParticipantID")
## we could slice our Intake Assessments after we join, but it is a MUCH cleaner process to do so before.
df4<-left_join(df4, dems, by = "ParticipantID")
We have now prepped a new data frame and can analyze its contents. You may have to use the unique() function to get the exact qualitative entries in the variable columns.
# of clients with ANY PRIOR disability/health condition
df4 |> count(PhysicalDisability == "Yes – Prior Treatment, Not Current"|DevelopmentalDisability =="Yes – Prior Treatment, Not Current"| ChronicHealthCondition == "Yes – Prior Treatment, Not Current")
## # A tibble: 74 × 3
## # Groups: ParticipantID [74]
## ParticipantID `|...` n
## <dbl> <lgl> <int>
## 1 17021 FALSE 1
## 2 17764 FALSE 1
## 3 17779 FALSE 1
## 4 17965 FALSE 1
## 5 18387 FALSE 1
## 6 18796 FALSE 1
## 7 19162 TRUE 1
## 8 19228 FALSE 1
## 9 19316 FALSE 1
## 10 20577 FALSE 1
## # ℹ 64 more rows
###it is showing only a list of responses per entry/observation. This is not helpful. The reason we are not getting "tidy" results is due to the fact that because of all the joins, the data frame has many formats. Let's look at the formats:
class(df4)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
#our data frame is many things!! that is okay, however, we want to TELL R that is first and foremost a "data.frame". The code below force changes the format of the data frame.
df4<-data.frame(df4)
##noww, let's run again!
df4 |> count(PhysicalDisability == "Yes – Prior Treatment, Not Current"|DevelopmentalDisability =="Yes – Prior Treatment, Not Current"| ChronicHealthCondition == "Yes – Prior Treatment, Not Current")
## |... n
## 1 FALSE 64
## 2 TRUE 6
## 3 NA 4
# of clients who were born outside of the US
df4 |> count(BirthCountry != "United States")
## BirthCountry != "United States" n
## 1 FALSE 52
## 2 TRUE 22
# Breakdown of clients who have a ANY Income, by Program Name.
df4 |> group_by(ProgramName) |> count(AnyIncome == "Yes")
## # A tibble: 7 × 3
## # Groups: ProgramName [4]
## ProgramName `AnyIncome == "Yes"` n
## <chr> <lgl> <int>
## 1 Caminos FALSE 6
## 2 Caminos TRUE 21
## 3 Casa Alma FALSE 9
## 4 Casa Alma TRUE 12
## 5 New Horizons TRUE 11
## 6 Youth Matters FALSE 3
## 7 Youth Matters TRUE 12
# Total demographic breakdown of Gender and Birth Country for all clients between the ages of 21 and 24
df4 |> filter(Age<25 & Age>= 21) |> count(Gender)
## Gender n
## 1 Cisgender Man 27
## 2 Cisgender Woman 16
## 3 Gender Nonconforming/Non-Binary 1
## 4 Transgender Woman 2
df4 |> filter(Age<25 & Age>= 21) |> count(BirthCountry)
## BirthCountry n
## 1 Colombia 2
## 2 Costa Rica 1
## 3 Gabon 1
## 4 Ghana 1
## 5 Honduras 6
## 6 Mexico 1
## 7 Nicaragua 3
## 8 Philippines 1
## 9 United States 29
## 10 Venezuela 1
# Total female clients who were born inside the US by Program Name
df4 |> filter(Gender== "Cisgender Woman") |> group_by(ProgramName) |> count(BirthCountry == "United States")
## # A tibble: 4 × 3
## # Groups: ProgramName [4]
## ProgramName `BirthCountry == "United States"` n
## <chr> <lgl> <int>
## 1 Caminos TRUE 12
## 2 Casa Alma FALSE 3
## 3 New Horizons TRUE 1
## 4 Youth Matters TRUE 10