You can think of the package “dplyr” (part of the tidyverse) as one of the foundations of data wrangling for reporting purposes. There is so much that we can do within dplyr that we are breaking it up into 2 sessions. Today, we will review ways in which to select specific columns and rows of the dataset, and learn even more useful tools to also specify rows of data that meet certain conditions. We will also learn how to combine multiple commands into one single command (i.e. a single command). Let’s start!

Basic Setup

ls() #check out the objects in the environment
rm(list=ls()) ## here we are telling R the command *remove* (or rather, clear) all of the objects from thew orkspace to be able to start with a clean enviornment.  

setwd("~/Directories/Practice Directory") 
library(dplyr)
library(readr)
library(stringr)

###load our data
cm2<- read_csv("cm2.csv")

Brief Review of Subsetting the data

We learned last week how to subset the data frame by assigning the altered data frame to a new object.
Additionally we learned how to subset the data using two method: R Base Package commands and the Tidyverse As a quick review, let’s subset using base R. Let’s make a new data frame with only the rows 3 through 16, and all the columns EXCEPT the 15th, 20th, and 31-35th column.

cm2.1<- cm2[3:16, -c(15, 20, 31:35)] ## remember, rows are always listed before columns, which are indicated after the comma

Now we can try using tidyverse, where we can select() the columns we want based on column names. Remember, we cannot use tidyverse select() to subset rows of the data frame. Using what we know now, if we want to both select rows and columns, we must do them in different code chunks.

Let’s again make a new data frame with the following parameters:

  1. Rows 3 through 16

  2. Columns: Participant ID, Program Name, Earnings Amount through OtherIncomeSource, Destination At Exit

cm2.2<- cm2 |> select(ParticipantID, ProgramName, EarningsAmount:OtherIncomeSource, DestinationAtExit)

cm2.2<- cm2[3:16, ]##remember you must still leave space for the columns, even if you are not subsetting any currently

We assign the new object from the original object (tibble/data frame) and then we use PIPES to tell R what to do next (remember |> is another way to say, and then). The above line of code basically says: 1. We are creating a new object using our original object cm2 as a the foundation. And then, we are selecting the columns we want. After we have created a new object, we tell R to do the following: 2. We are assigning the same object we just created (in effect, overwriting our new object) and then defining the rows we want.

Combine commands using pipes

This is more simply achieved by combining the processes together, using pipes. The order is important here, because we have to call the object we are assigning the new object from. So, subsetting using base R allows us to achieve that by doing two thing at once: calling the object (cm2) AND subsetting the rows that we want.

cm2.2<- cm2[3:16, ] |> select(ParticipantID, ProgramName, EarningsAmount, DestinationAtExit)

We don’t have to use base R functions + dplyr/tidyverse to subset columns and rows. We will ultimately shift to completing most of our tasks using dplyr function — and filter() allows us to choose rows – we will get to that later. It is, however, good practice to learn how to do this in base R.**

##Looking at the results, without making a new object We often want to look at subsets of the data, but we don’t necessarily need to make a new data frame
(reasons why: just to take a quick glance at results, we don’t want to clutter up our environment, we want to see if a code we ran, for example to create a new column, gave the types of results we want to see) To do this, we do NOT assign a new object, we simply use pipes. In using pipes, we name the object we want to see results from, and then follow the named object by pipes and the subset of data we want to see.

cm2.2 |> select(ProgramName, EarningsAmount)
## # A tibble: 14 × 2
##    ProgramName   EarningsAmount
##    <chr>                  <dbl>
##  1 Revive                    NA
##  2 Revive                     0
##  3 Revive                  1500
##  4 New Horizons               0
##  5 SH Home                    0
##  6 Healthy Paths            600
##  7 Casa Alma                 NA
##  8 Casa Alma                800
##  9 Casa Alma                 NA
## 10 Casa Alma                 NA
## 11 Casa Alma               1500
## 12 Casa Alma                300
## 13 Casa Alma                900
## 14 Casa Alma                 NA
##alternatively
cm2.2[, 2:3] ##remember, if we know the location of our columns (i.e. number), we can also just name the number. This is often useful when only keeping the Participant ID (first column generally in ETO exports)
## # A tibble: 14 × 2
##    ProgramName   EarningsAmount
##    <chr>                  <dbl>
##  1 Revive                    NA
##  2 Revive                     0
##  3 Revive                  1500
##  4 New Horizons               0
##  5 SH Home                    0
##  6 Healthy Paths            600
##  7 Casa Alma                 NA
##  8 Casa Alma                800
##  9 Casa Alma                 NA
## 10 Casa Alma                 NA
## 11 Casa Alma               1500
## 12 Casa Alma                300
## 13 Casa Alma                900
## 14 Casa Alma                 NA

Sample exercise

  1. create a tibble containing data in 100th row of the cm2 dataset

Now continue using cm2 for each of the following activities

  1. Use **nrow() to determine number of rows in the tibble.
  2. Using the number of rows determined in Question 2, extract/create a tibble that begins in the middle (row) of the dataset and ends with the last row.
  3. Using dplyr select() to include the columns 15, 35, and 50. Note that you must do this by column name, not number, and so, you must look up the column names to determine what they are.

Formatting Dates

Working with dates in R can be challenging, for new and experienced users alike. Being able to convert dates (and time) information into a variable that is appropriate and usable during analyses can be a recurring process, especially in data wrangling. We use the package lubridate which is a part of the tidyverse install and should be loaded by default. If necessary (if you are dealing with an older version of tidyverse -2022 and earlier), we can manually load it.

library(lubridate)

Usually, when we import csv’s that we have exported from ETO, columns containing dates are not in the date format. We can see the structure of our columns by using the command str(). This command is useful versus the other command that we have used in the past class because it displays the internal structure of all columns, easily read and listed.

str(cm2)

Another way to isolate the variable we want to know the class/type/structure of is:

class(cm2$DateOfBirth) ## class(DataframeName$Name of Variable)
## [1] "character"
class(cm2$DateTaken)
## [1] "character"

We see in the list that the variables DateOfBirth and DateTaken, two very useful variable that are often used in wrangling for reporting purposes, are listed as characters; the list then goes on to show sample data for the first 5 rows of data.

If our dates stay as a type of variable “character”, we are not able to manipulate the data, such as creating new variables based on time in the program or age (amongst countless options) nor can we filter based on date based conditions (more on filtering later in this lesson and more on creating new variable in the next few sessions). Therefore, we want to convert several columns into usable date formats.

Converting Variable Types: Dates

In R, you can always change the type of variable you have. R will always force it with the command. That being said, it often may not be usable, unless you have applied the command with the correct coding and applied under the perfect conditions.
Example, you can forcibly convert any character column into a numeric (as.numeric) or an integer (as.integer), but it will often change the content of the variable column and again not be usable.

note: it is often useful to turn character variables into factors, specifically when we have categorical data. FActors store categorical data as integers with labeled which can be ordered (ordinal) or unordered (nominal). When we convert character variables into factors, we essentially create levels to store and analyze our categorical responses. We will have a separate (shorter) training focused specifically on factors, because they are so important to know.

Our general scenario when data wrangling for reporting is to import our data set, and convert our dates as soon as possible to be able to use the date values throughout the analysis. The simplest way to convert our dates is to use as.Date() and it will change the class of our variable.

To convert our necessary columns into dates, use the following:

##first look at the data in the column of choice first, to observe the format
cm2 |> select(DateOfBirth)
## # A tibble: 397 × 1
##    DateOfBirth
##    <chr>      
##  1 6/22/2002  
##  2 10/5/2003  
##  3 8/15/2000  
##  4 3/11/2003  
##  5 9/5/2004   
##  6 9/20/1998  
##  7 3/9/2002   
##  8 9/17/1998  
##  9 11/11/2001 
## 10 11/25/2001 
## # ℹ 387 more rows
##now, apply the as.Date command with the format specified
cm2$DateOfBirth <- as.Date(cm2$DateOfBirth, format = "%m/%d/%Y")

##look at the dates now
cm2 |> select(DateOfBirth)
## # A tibble: 397 × 1
##    DateOfBirth
##    <date>     
##  1 2002-06-22 
##  2 2003-10-05 
##  3 2000-08-15 
##  4 2003-03-11 
##  5 2004-09-05 
##  6 1998-09-20 
##  7 2002-03-09 
##  8 1998-09-17 
##  9 2001-11-11 
## 10 2001-11-25 
## # ℹ 387 more rows
##we can see that the column type is <date> and the dates are in the year-month-day format

There is a WHOLE lot we can do with dates, but for now, it is important to understand (and practice) converting dates into usable formats. Now, we can filter based on dates.

Filtering the Dataset

We can use the filter() function in dplyr to choose rows based on specific criteria. We make an argument after the dataframe with a condition we want our final data frame to adhere to. In the case of reporting for our organization’s needs, let’s use **“Destination At Exit Weight”, and we want to see only the rows where the weight = 1.

cm2 |> filter(DestinationAtExitWeight == 3)
## # A tibble: 24 × 158
##    ParticipantID DateOfBirth ProgramName             DateTaken AuditDate Version
##            <dbl> <date>      <chr>                   <chr>     <chr>     <chr>  
##  1         20757 2002-06-22  Caminos                 9/30/2024 9/30/2024 Final …
##  2         22609 2000-08-24  Revive                  9/30/2024 10/2/2024 Final …
##  3         23990 2004-11-10  Dreams Come True - Hugo 9/28/2024 9/28/2024 Final …
##  4         24135 2003-08-29  Dreams Come True - Hyde 9/28/2024 9/29/2024 Final …
##  5         22471 1999-03-08  Pajaritos               9/27/2024 9/27/2024 Final …
##  6         24343 2003-12-16  Positive Hope           9/27/2024 9/28/2024 Final …
##  7         24525 2000-07-25  Creative Courses        9/6/2024  9/6/2024  Final …
##  8         21396 1997-09-21  New Horizons            9/6/2024  9/6/2024  Final …
##  9         21631 1999-07-06  Pajaritos               9/3/2024  9/4/2024  Final …
## 10         23471 2002-06-15  Team Unity              8/16/2024 8/16/2024 Final …
## # ℹ 14 more rows
## # ℹ 152 more variables: CMandYouthMet <chr>, DateLastMet <chr>, CMNotes <chr>,
## #   DestinationAtExit <chr>, LarkinStreetProgram <chr>, SubsidyLocation <chr>,
## #   HousingPlanDetails <chr>, DestinationLocation <chr>, DestinationSafe <chr>,
## #   DestinationTime <chr>, OwnBed <chr>, WhereSleeping <lgl>, WhoDecides <chr>,
## #   `NameOnYouth Matters` <chr>, ObstaclesToStableHousing <chr>,
## #   OtherObstacles <chr>, RentPaymentStatus <chr>, BenefitEligibility <chr>, …

The tibble returns all observations that met the criteria.

note: in R, the equal sign has different implications than expected. So, when applying filters, we must use the double == to indicate our parameters. Using only a single = sign will return an error. Why is this? Because in R language, a single = is only used when we are applying a rule. You can learn more about rules in certain commands by typing ?nameofcommand, and the result will come up in the Help Tab.

Examples:

  • converting dates, as.Date….format = “%m/%d/%Y”
  • ignoring NA’s after a specific command (na.rm = TRUE)
  • importing data with base r command read.csv(“datasetname.csv”, colnames = TRUE) just some examples You will learn many other examples as you become more proficient in R programming.

Let’s get more specific with our filter, and also, let’s learn another way to call the data set. We have been learning via assigning objects and using pipes. These are not always necessary. You can also call the data set directly after the command, by naming the data set within parentheses, followed by a comma and the rule/s you want the command to adhere to. Let’s try.

unique(cm2$Version) ### see the various reponses within the dataset for a variable. 
## [1] "Final / At Exit"                     
## [2] "Reassessment (Quarterly)"            
## [3] "Initial Assessment"                  
## [4] "Transfer to Other LS Housing Program"
filter(cm2, Version == "Final / At Exit" & DestinationAtExitWeight==3) ### variable responses that we want to set conditions for with the filter() command must be case and space sensitive and contained within quotations marks.  If variable responses are numbers, we do not need quotation marks.
## # A tibble: 24 × 158
##    ParticipantID DateOfBirth ProgramName             DateTaken AuditDate Version
##            <dbl> <date>      <chr>                   <chr>     <chr>     <chr>  
##  1         20757 2002-06-22  Caminos                 9/30/2024 9/30/2024 Final …
##  2         22609 2000-08-24  Revive                  9/30/2024 10/2/2024 Final …
##  3         23990 2004-11-10  Dreams Come True - Hugo 9/28/2024 9/28/2024 Final …
##  4         24135 2003-08-29  Dreams Come True - Hyde 9/28/2024 9/29/2024 Final …
##  5         22471 1999-03-08  Pajaritos               9/27/2024 9/27/2024 Final …
##  6         24343 2003-12-16  Positive Hope           9/27/2024 9/28/2024 Final …
##  7         24525 2000-07-25  Creative Courses        9/6/2024  9/6/2024  Final …
##  8         21396 1997-09-21  New Horizons            9/6/2024  9/6/2024  Final …
##  9         21631 1999-07-06  Pajaritos               9/3/2024  9/4/2024  Final …
## 10         23471 2002-06-15  Team Unity              8/16/2024 8/16/2024 Final …
## # ℹ 14 more rows
## # ℹ 152 more variables: CMandYouthMet <chr>, DateLastMet <chr>, CMNotes <chr>,
## #   DestinationAtExit <chr>, LarkinStreetProgram <chr>, SubsidyLocation <chr>,
## #   HousingPlanDetails <chr>, DestinationLocation <chr>, DestinationSafe <chr>,
## #   DestinationTime <chr>, OwnBed <chr>, WhereSleeping <lgl>, WhoDecides <chr>,
## #   `NameOnYouth Matters` <chr>, ObstaclesToStableHousing <chr>,
## #   OtherObstacles <chr>, RentPaymentStatus <chr>, BenefitEligibility <chr>, …

We can continue to specify multiple conditions withint the filter() function. We can combine conditions using “and” or “or” statements.
In “and” statements, an observations must meet every condition we have specified. We pass our desired conditions as arguments separated by commas OR “&” signs

filter(cm2, Version== "Final / At Exit", DestinationAtExitWeight >= 1, BenefitEligibility == "Eligible, Receiving")## Version must be "Final / At Exit" AND Destination Weight is >= 1 AND Benefit eligibility is "Eligible, Receiving
filter(cm2, Version== "Final / At Exit", DestinationAtExitWeight >= 1 |BenefitEligibility == "Eligible, Receiving") ## Version must be "Final / At Exit", and either Destination at Exit WEight >= 1 OR Benefit Eligibility == "Eligible, REceiving
filter(cm2, Version== "Final / At Exit", DestinationAtExitWeight >= 1 & BenefitEligibility == "Eligible, Receiving"|BenefitEligibility =="Eligible, Application Pending") ##Versions is "Final / At Exit" AN Destination At Exit Weight >= 1 and EITHER Benefit Eligibility is "Eligible, Receiving OR "Eligible, Application Pending

Putting it all together

Let’s create a new object using filter() and select() to get the responses we are interested in seeing. Here, let’s add in specific dates as a filter. This is common in reporting when we have to parse through exports of assessments based on date conditions of a report period. Remember, the date format must be changed before we apply a date filter.
Our new data set must meet the following conditions:

  1. Include dates only for August plus data close allowances (+ 4 days into September)
  2. Include observations that have ANY response for Benefit Eligibility.
    Here we are using a new sign combination that means the opposite “==”.
    • “!=” in R language means, “does not equal”.
  3. Create a data set that only includes columns:
    • Participant ID through Date Taken, Benefit Eligibility and all public benefit columns within the data set. We can do this with one line of code.
##check the class of the variable
class(cm2$AuditDate)
## [1] "character"
cm2$AuditDate <- as.Date(cm2$AuditDate, format = "%m/%d/%Y") ##change the class of the variable
cm2.ben <- cm2 |> filter(AuditDate >= "2024-08-01" & AuditDate <= "2024-09-04" & BenefitEligibility != "") |> select(ParticipantID:DateTaken, BenefitEligibility, GAAmount:SNAPAmount, CAAPAmount:HealthInsurance)
### if we want to indicate that the cell is not empty, we use "not equal" != " " a pair of parentheses with nothing in between.  

###if you want to write this new data frame to a csv in our data output folder.

note: it is generally better to use pipes to complete a more complicated command. However, there are times when we want to see results of the filter, BEFORE selecting columns that we want. You simply would break up the above command into two separate commands, each one assigning the data object in the beginning.

See below:

cm2.ben <- cm2 |> filter(AuditDate >= "2024-08-01" & AuditDate <= "2024-09-04" & BenefitEligibility != "") 
cm2.specific <- cm2.ben |> select(ParticipantID:DateTaken, BenefitEligibility, GAAmount:SNAPAmount, CAAPAmount:HealthInsurance)