Distinct, If_else, and tidying our data

We will continue with essential dplyr function in this session, and provide a breakdown of the key function distinct(). In this session we will explore the various methods to create binary variable, mostly using if_else() and also case_when(). We will also examine ways to achieve simple statistics such as averages based on variables within data sets.

library(dplyr)
library(readr)
library(stringr)
library(lubridate)

###load our data
active<- read_csv("active.1yr.csv")
cm2<- read_csv("cm2.csv")
dems <- read_csv("dems.csv")
intake <- read_csv("intake.modified.csv")

Distinct and Other Purposes

Distinct() is a useful function that essentially keeps only unique/distinct rows from a data frame. To note, we can achieve the same function using base R, by the following: unique.data.frame(), however, distinct() is considerably faster.

For our reporting purposes, we can also think of distinct() as similar to Excel’s version of “deduplicate”, however, there are many more uses and purposes to using distinct(), other than reducing a data set to an unduplicated client number. To begin with a basic example, let’s think about a common metric such as average age of LSYS clients. We can use mutate() to create an age column based on their date of birth and date of service or perhaps program start/end date. However, if we have not applied distinct() to the dataset, then we could skew our mean (average) age.
Let’s see how this can happen:

##create a duplicate data set
active.1<- active
## change class of both date variables we need to work with (DOB & Program End Date)
active.1$`Date of Birth`<- as.Date(active.1$`Date of Birth`, format = "%m/%d/%Y")
active.1$`Program End Date`<- as.Date(active.1$`Program End Date`, format= "%m/%d/%Y")

##create a dummy variable for last day in program/last day of the quarter.

active.1 <- active.1 |> mutate(End.date = if_else(is.na(`Program End Date`), ymd("2024-06-30"), `Program End Date`)) ##mutate with if_else statement

active.1<- active.1 |> mutate(Age= year(`End.date`)-year(`Date of Birth`))

We have now successfully created an age column based on Date of Birth and either last day in the program or last day of the reporting period (6/30/2024).
Our if_else statement looked a little different than the statements we have created in previous training. We will explore the nuances of if_else() statements in a little bit. However, our statement:

if_else(is.na(Program End Date), ymd(“2024-06-30”), Program End Date))

is simply saying the following:

If program end date is NA (i.e., they are an active client without a program end date YET),
input the year/month/date: 2024-06-30, if it is NOT NA, use the program end date provided

Now, let’s learn how to calculate the mean of a column. We start by assigning the variable/column we want to average to a vector (or, a value).

c <- active.1$Age

Base R provides simple arithmetic “operators”. These can only be applied to vectors in R.
What is a vector? A vector is essentially a list of specific variables, and the simplest data structure in R. A vector can be numbers, arithmetic expression, logical values or character strings, for example.

You can create vectors without even having any data loaded. For example, we could create a vector call “odd” and assign it odd numbers up to 15.

odd = c(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)

We can also create (and later use if needed) a simple variable called x, with assigned elements. We assign elements by using either the <- or = operators.

x <- 5*7 + 6 * 3*8
x = 5*7 + 6 * 3*8
x

## [1] 179

Now we can create another variable called “y” that contains a complex or basic operation containing the value of “x”.

y <- x ^4-4*x +6
y

## [1] 1026624971

But enough of that. Vectors are useful to know and can be used in more complex data analysis or at times, wrangling. But for our purposes, let’s stay on track. We created a vector called “c” that contains ALL the values of our “Age” column. Now, we can apply a mathmatical operator to that vector.

mean(c, na.rm = TRUE)

## [1] 22.4284

We input the na.rm= TRUE to tell R specifically what to do with any na’s it may encounter in the vector.
It says essentially: Remove NA’s= YES
While it is unlikely that there are any na’s in this particular vector, built created from the Age column (why is this unlikely? because we already told R what to do with NA’s when we created the if_else statement for “End.Date”, remember? **if.na(“2024-06-30”). However, is is always good to be cautious). If there were any NA’s in the data, our operator would either fail to work, or give an incorrect answer.
Alternatively, if you wanted to keep NA’s (if there were NA’s ;)), you simpy would omit the statement “na.rm= TRUE”. The default in the mean() operator is to include everything. You would not need to write “na.rm = FALSE”.

We have now found the mean of our active clients in residential and shelter. However, we know from experience that there are often clients who move within housing and shelter programs during any given report period. That likelihood increase the longer the report period (1 year vs. 1 quarter vs. 1 month). We want to remove any clients that appear in the data set so that we can get an accurate average age. Let’s do this now, by assigning a new object in our environment, a copy of active.1. Then, we will apply the distinct() call to remove any duplicates. Afterward, we will average our new mean.

##let's rename our ID column for future uses
## a simple way to do that is in base R, particularly when we are only renaming one column in a data set with minimal columns.
active.2<- active
colnames(active.2)[1]<- "ParticipantID"
active.2 <- active.2 |> distinct(ParticipantID, .keep_all = TRUE)

c.2 <- active.2$Age

## Warning: Unknown or uninitialised column: `Age`.

mean(c.2, na.rm=TRUE)

## Warning in mean.default(c.2, na.rm = TRUE): argument is not numeric or logical:
## returning NA

## [1] NA

##21.86

#let's compare to our original mean

mean(c, na.rm=TRUE)

## [1] 22.4284

##21.57

Applications of distinct() with more than one condition

There are times when we want to know an unduplicated number of clients, but within several cohorts of our data set. The example we will use involves the various residential programs at LSYS. We have many reports, both internal and external, where we need to know the total number of unduplicated clients per program. IF we were to apply distinct() with the specification of “ParticipantID” to our entire dataset of “Active Clients” we would invariably miscount our program numbers.

This is because unique will remove all the rows where participant ID shows up more than once (i.e. program enrollments). Generally, with only one condition, distinct() will retain only the highest row of a duplicate entry. This can be useful when we have exports that are organized by Service Date. **Why? In the cases where we want to see the latest service date entry (such as income in CMA 2, or outreach touchpoint in the outreach export) we can bypass the multiple commands required of “slice_max” and “group_by” and just use the “distinct()” command, as that by itself will give us the highest date entered (“highest” of course in terms of dates with values).

This is NOT useful with active clients, since that export is organized by program name. So any client with more than one enrollment, would only retain the entry/program association of the program with the highest alphabetical order. NOT USEFUL.

So, all this rhetoric only to say that, we need to add in another condition to our distinct() command. Here, we want to see unique rows for each program, but we do not need to reduce the data set to unduplicated clients completely. After we do this, we will use some handy “tidy” commands to create a neat comparison data frame what the program enrollment numbers look like when we use distinct to only one variable, versus distinct with 2 variables.

To jog your memory we have already created 2 tibbles with active clients data. active.1 = all clients, no deduplication application active.2 = all unique clients housed in residential or shelter We will start with active.1, which contains more clients.

colnames(active.1)[1]<- "ParticipantID" ### rename the ID column

active.1<- active.1 |> distinct(ParticipantID, `Program Name`, .keep_all = TRUE)

##790 clients

active.1 |> count(`Program Name`)

## # A tibble: 18 × 2
##    `Program Name`              n
##    <chr>                   <int>
##  1 B-Home                     60
##  2 Caminos                    27
##  3 Casa Alma                  32
##  4 Henry IV                   25
##  5 Journeys                   23
##  6 New Horizons SH            30
##  7 New Horizons TLP           18
##  8 Nurtured Support - Polk    11
##  9 Nurtured Support - Turk    24
## 10 Pajaritos                  47
## 11 Positive Hope              68
## 12 Revive                     49
## 13 Sparrow Lodge             203
## 14 Suenos                     27
## 15 Team Unity 1               35
## 16 Team Unity 2               13
## 17 Uprising                   84
## 18 Youth Matters              14

###now, let's see what the numbers would have looked like if we wanted to count program enrollments DF active.2, which has only unduplicated clients numbers (#683 vs. #790). 
##We will start by assigning these program counts to new tibbles.

correct<- active.1 |> count(`Program Name`)
incorrect <- active.2 |> count(`Program Name`)

##now, let's create a new dataframe with both of these lists.

enroll<- left_join(correct, incorrect, by = "Program Name")

We have now created a neat table which shows enrollment numbers based on two different uses of the distinct() command. However, the variable names of the columns are not entirely useful. You may have noted that when we created the tibbles of program counts, the counts were places in a column that R named “n”. R often names columns “n” when they are built from the count() command. N = Number of operations, so it is intuitive. However, when we combined two tibbles with the same column name in their resective data, R has to differentiate. It has added in the .x and .y to do just that. However, if we needed to share this table with someone else in the department, they may not understand what those column names mean.

While yes, the boring and more time intensive process could be to export the table and manually rename the columns to be something useful…. why would we do all that when we can just code the new names?? We will use yet another command from the tidyverse called rename() to rename our columns.

enroll<- rename(enroll, Correct_Count = n.x, Incorrect_Count = n.y)
#View(enroll)

The tidyverse also allows us other important steps if we want to clean up our table. Let’s move the order of the columns.
We can move around columns any time we wish, but generally, we do it when we know we will export a table to share with other staff or to analyze outside of R. The relocate() command allows us to do this.

enroll <- enroll |> relocate(Incorrect_Count, .before = Correct_Count)
#View(enroll)

relocate() requires conditions of the relocation. You can use either

.before
.after

Finally, we can look at how much our counts were off and have that visualized in the data. We will apply a simple mathmatical operator to our counts to produce a new column using mutate()

enroll<- enroll |> mutate(Difference = Correct_Count - Incorrect_Count)
##lastly, let's rearrange our Correct Count to be in the 2nd column, since that is visually easier to read.

enroll <- enroll |> relocate(Correct_Count, .after = `Program Name`)
#View(enroll)

Now, we can visualize and understand why it is important to use distinct() in some cases with two conditions. Here, it can dramatically affect program counts.

Distinct with complicated if_else statements

Continuing with using the distinct() command with multiple conditions, we will look at a scenario where we have to provide unduplicated counts of clients per Housing Type/Category. This will involve using mutate in conjunction with a more complex if_else statement. Active Clients does not provide a program group variable/column. Therefore, we must create our own program group variable. To do this we must first know what our program groups are. Since we are workign with a data set with dummy program name variables (for privacy purposes, since these training are public domain), for your convenience, we have a list of program designations below.

Transitional Living Programs (TLP’s) Caminos Nurtured Support - Polk Nurtured Support - Turk Revive B-Home Journey’s New Horizons TLP

Permanent Supportive Housing (PSH) Team Unity 1 Team Unity 2 Uprising New Horizons SH

Rapid Rehousing (RRH) Casa Alma Pajaritos Suenos Henry IV

Shelter Positive Hope Sparrow Lodge

Now, we want to apply a more indepth if_else statement to create program categories. Previously, we have only used if_else to create variables based on TRUE or FALSE conditions, i.e. binary variables.

Examples: if a client makes over $1000 in income, 1, if NOT, 0 ** if_else(Sum1 >= 1000, 1, 0)*

if a client is under a specific age, TRUE, if NOT FALSE

if_else (Age < 21, “TRUE”, if not, “FALSE”)

Now, we will move past creating only binary responses to if_else statements, to categorical responses to ifelse statements. We could also think of this as creating “dummy variables” based on sets of conditions.

For example, we could create income levels of 3 types, based on the amounts in the (made up) column of Sum1, from above’s example.

Example:

# if_else(Sum1 > 1000 & Sum1 <=2000, "Med.Income",
#        if_else(Sum1 > 2000, "High Income",
#        "LowORNO.income"))*

You will notice in the above example, we have only had to write out conditions of 2 of the 3 categories we want to create. Why is this? Because, we have provided 2 different “TRUE” cases, the cases where the dummy variables of either Medium Income or High Income are designated. Intuitively, any thing that is left over, OR, any rows that did not meet this condition, are our FALSE cases. We have labeled these FALSE cases (cases where the above two statements are NOT TRUE) to be Low or NO Income.

We will take this concept and apply it to housing categories. Afterward, we can apply the distinct() call to see total unduplicated clients per housing group.

# Caminos
# Nurtured Support - Polk
# Nurtured Support - Turk
# Revive
# B-Home
# Journey's
# New Horizons TLP

# Permanent Supportive Housing (PSH)
# Team Unity 1
# Team Unity 2
# Uprising
# New Horizons SH
# 
# Rapid Rehousing (RRH)
# Casa Alma
# Pajaritos
# Suenos
# Henry IV
# 
# Shelter
# Positive Hope
# Sparrow Lodge



active.3 <- active |> mutate(ProgramGroup = ifelse(`Program Name` %in% c("Caminos", "Nurtured Support - Polk", "Nurtured Support - Turk", "Revive", "B-Home", "Journeys", "New Horizons TLP"), "TLP", if_else(`Program Name` %in% c("Team Unity 1", "Team Unity 2", "Uprising", "New Horizons SH"), "PSH", if_else(`Program Name` %in% c("Positive Hope", "Sparrow Lodge"), "Shelt", "RRH"))))


active.3 |> count(ProgramGroup)

## # A tibble: 4 × 2
##   ProgramGroup     n
##   <chr>        <int>
## 1 PSH            162
## 2 RRH            145
## 3 Shelt          308
## 4 TLP            216

Case_when

Sometimes, it makes more sense to use the case_when() command. If_else is super handy but it only allows for 2 possibilities, whether something (or things) are TRUE or FALSE. What if we have lots of different values that we would like to create based on multiple conditions. This is where case_when() comes in handy. We will take the example of creating age categories.

The syntax we use here is also different. See below:

*case_when(condition ~ result) Example: you want to turn all values of NAS into 0

case_when(variable == NA ~ 0)

Example: you can put multiple conditions and results into a single case_when() function, which we will for our practice. case_when(variable == NA ~ 0, variable ==2 ~ “two years”, etc.)

We will create Age Categories, as we often must provide categories of ages groups to public funding reports. This is much more accurate and efficient than simply listing age totals and groups different sets of ages together, and then of course, counting those total up.

active.1 <- active.1 |> mutate(AgeCats = case_when(Age < 18 ~ "Under 18",
                                           Age >= 18 & Age < 21 ~ "18-20",
                                           Age >= 21 & Age < 25 ~ "21-24",
                                           Age > 24 ~ "25 +"))

active.1 |> count(AgeCats)

## # A tibble: 4 × 2
##   AgeCats      n
##   <chr>    <int>
## 1 18-20      133
## 2 21-24      392
## 3 25 +       213
## 4 Under 18    52

There are no practices with this lesson, as the training was long enough as is. However, for fun, let’s overwrite the file “active.1yr.csv” with the original version of active clients we loaded, however, with the Participant ID column renamed to participant ID.

Distinct, If_else, and tidying our data

Alexandra Candia

2025-02-05

Distinct and Other Purposes

Applications of distinct() with more than one condition

Distinct with complicated if_else statements

Case_when