Basic Set Up for Analysis

As always, ensure that you have the latest version of R installed. (Note: R is helpful in that it will present a pop-up when you open it, reminding you if an update is required. Author’s note: ALWAYS update when prompted. Don’t leave it until later or another more convenient time. R has a funny way of forcing you to update if you try to ignore the prompt: commands stop running properly and funky things start happening with your output. Take my advice: JUST.DO.IT.)

Barring a required update, of course we will always clear our environment (1st) and then set our working directory (2nd).

After those two steps, load the packages you need for your analysis. For today, we will be loading tidyr, dplyr and readr. These three packages are all part of the larger tidyverse package, a kind of “meta-package” that bundles many packages with different functionalities, all focused on data wrangling. Why don’t we just load tidyverse, then, rather than loading the packages individually?
1. It is important to get familiar with different individual (smaller) packages and understand their purposes and functions.
2. The tidyverse attaches 9 core packages (as of version 2.0.0) and installs even more non-core packages. It is best practice to limit our workspace to what we need and avoid namespace conflicts.
3. Know the functions and their associated packages. This is particularly important when debugging potential conflicts.
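
To make point 3 concrete, here is a minimal sketch using a made-up data frame (the object scores and its columns are hypothetical, not part of the CMA 2 data). It shows how writing package::function() makes explicit which filter() is meant when two loaded packages share a function name.

```r
library(dplyr)

# Toy data frame for illustration only (not LSYS data)
scores <- data.frame(id = 1:4, score = c(10, 25, 30, 5))

# dplyr::filter() keeps the rows that meet a condition
high <- dplyr::filter(scores, score > 20)

# stats::filter() is a different function entirely:
# it applies a moving (linear) filter to a numeric series
smoothed <- stats::filter(scores$score, rep(1 / 2, 2))

nrow(high)   # number of rows with score above 20
```

Prefixing the package name is optional once a package is loaded, but it is a handy habit when debugging a script that loads several packages.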

Of course, none of these packages will load properly unless they have been installed first, which we did in our last session by installing tidyverse. If you find that it is not installed, use the command install.packages("tidyverse"). Note that the package name must be in quotation marks.
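
One way to check for missing packages before loading them is sketched below. The object names needed and missing_pkgs are made up for this example, and the install line is left commented out so that nothing is installed without you choosing to.

```r
# Packages this session relies on
needed <- c("tidyr", "dplyr", "readr")

# requireNamespace() returns TRUE when a package is installed, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)   # uncomment to install
}
```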

ls() #check out the objects in the environment
## character(0)
rm(list=ls()) ## here we are telling R to *remove* (or rather, clear) all of the objects from the workspace so that we start with a clean environment.

setwd("~/Directories/Practice Directory") 

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(dplyr)

library(readr)
library(stringr)

Importing the Data

You are going to load the data into R’s memory using the function read_csv. As you may recall, we could use read.csv to import our data set, which is a base R function. However, we prefer read_csv because it reads the data in as a tibble and is easy to use with other tidyverse commands. We have to assign our data set to an object, using the assignment operator <-. You may recall that <- can be thought of as an = sign. To work with a more interesting data set for exploratory purposes, we will load the CSV of LSYS Case Management Assessment 2 (CMA 2).

cm2 <- read_csv("cm2.csv")
## Rows: 397 Columns: 158
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (51): DateOfBirth, ProgramName, DateTaken, AuditDate, Version, CMandYout...
## dbl (43): ParticipantID, UnemploymentInsuranceAmount, PAESAmount, TANFAmount...
## num (15): EarningsAmount, NonLegalIncomeAmount, GAAmount, SSIAmount, RentalA...
## lgl (49): WhereSleeping, OLDInteractsComm, OLDListens, OLDMediates, OLDCommR...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If you were to type in the code above, it is likely that the read.csv() function would appear in the automatically populated list of functions. This function is different from the read_csv() function, as it is included in the “base” packages that come pre-installed with R. Overall, read.csv() behaves similarly to read_csv(), with a few notable differences. First, read.csv() coerces column names with spaces and/or special characters to different names (e.g. interview date becomes interview.date). Second, read.csv() stores data as a data.frame, whereas read_csv() stores data as a different kind of data frame called a tibble. We prefer tibbles because they have nice printing properties, among other desirable qualities. (Source: https://datacarpentry.org/r-socialsci/02-starting-with-data.html#importing-data)
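
The two differences described above can be seen with a small sketch. The file here is a temporary, made-up CSV (not the CMA 2 export), written only so that the example runs on its own.

```r
library(readr)

# Write a tiny throwaway CSV whose first column name contains a space
tmp <- tempfile(fileext = ".csv")
writeLines(c("interview date,score",
             "2024-09-30,4",
             "2024-10-01,5"), tmp)

base_df <- read.csv(tmp)                          # base R
tidy_df <- read_csv(tmp, show_col_types = FALSE)  # readr

names(base_df)   # the space is replaced with a dot: "interview.date"
names(tidy_df)   # the original name is kept: "interview date"

class(base_df)   # a plain data.frame
class(tidy_df)   # a tibble (tbl_df), which is also a data.frame
```

Setting show_col_types = FALSE only silences the column specification message; it does not change how the data is read.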

Viewing the Data

R makes it easy to look at the data we have just imported. The first step to exploring our data, however, is understanding it. We should always know what type of data we are looking at. The command class(data) tells us specifically what type of data (frame) we have just loaded. Note, you can also see this in the Console when you load the data.
Note that read_csv() actually loads the data as a tibble. A tibble is an extension of R data frames used by the tidyverse. When the data is read using read_csv(), it is stored in an object of class tbl_df, tbl, and data.frame. You can see the class of an object with class():

class(cm2)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
cm2 ## simply "call" the tibble by its name and it will display the first 10 rows of the data and as many columns as fit on the screen.
## # A tibble: 397 × 158
##    ParticipantID DateOfBirth ProgramName   DateTaken AuditDate Version          
##            <dbl> <chr>       <chr>         <chr>     <chr>     <chr>            
##  1         20757 6/22/2002   Caminos       9/30/2024 9/30/2024 Final / At Exit  
##  2         22914 10/5/2003   Journeys      9/30/2024 9/30/2024 Reassessment (Qu…
##  3         23104 8/15/2000   Revive        9/30/2024 9/30/2024 Reassessment (Qu…
##  4         22258 3/11/2003   Revive        9/30/2024 9/30/2024 Reassessment (Qu…
##  5         22946 9/5/2004    Revive        9/30/2024 9/30/2024 Reassessment (Qu…
##  6         21182 9/20/1998   New Horizons  9/30/2024 9/30/2024 Reassessment (Qu…
##  7         21599 3/9/2002    SH Home       9/30/2024 9/30/2024 Reassessment (Qu…
##  8         18703 9/17/1998   Healthy Paths 9/30/2024 9/30/2024 Reassessment (Qu…
##  9         20635 11/11/2001  Casa Alma     9/30/2024 9/30/2024 Reassessment (Qu…
## 10         22875 11/25/2001  Casa Alma     9/30/2024 9/30/2024 Reassessment (Qu…
## # ℹ 387 more rows
## # ℹ 152 more variables: CMandYouthMet <chr>, DateLastMet <chr>, CMNotes <chr>,
## #   DestinationAtExit <chr>, LarkinStreetProgram <chr>, SubsidyLocation <chr>,
## #   HousingPlanDetails <chr>, DestinationLocation <chr>, DestinationSafe <chr>,
## #   DestinationTime <chr>, OwnBed <chr>, WhereSleeping <lgl>, WhoDecides <chr>,
## #   `NameOnYouth Matters` <chr>, ObstaclesToStableHousing <chr>,
## #   OtherObstacles <chr>, RentPaymentStatus <chr>, BenefitEligibility <chr>, …

If we want to quickly see the data, we can also use head().

#view(cm2) #the command view also does the same thing.
head(cm2) ##shows us just the top lines of the data, see below
## # A tibble: 6 × 158
##   ParticipantID DateOfBirth ProgramName  DateTaken AuditDate Version            
##           <dbl> <chr>       <chr>        <chr>     <chr>     <chr>              
## 1         20757 6/22/2002   Caminos      9/30/2024 9/30/2024 Final / At Exit    
## 2         22914 10/5/2003   Journeys     9/30/2024 9/30/2024 Reassessment (Quar…
## 3         23104 8/15/2000   Revive       9/30/2024 9/30/2024 Reassessment (Quar…
## 4         22258 3/11/2003   Revive       9/30/2024 9/30/2024 Reassessment (Quar…
## 5         22946 9/5/2004    Revive       9/30/2024 9/30/2024 Reassessment (Quar…
## 6         21182 9/20/1998   New Horizons 9/30/2024 9/30/2024 Reassessment (Quar…
## # ℹ 152 more variables: CMandYouthMet <chr>, DateLastMet <chr>, CMNotes <chr>,
## #   DestinationAtExit <chr>, LarkinStreetProgram <chr>, SubsidyLocation <chr>,
## #   HousingPlanDetails <chr>, DestinationLocation <chr>, DestinationSafe <chr>,
## #   DestinationTime <chr>, OwnBed <chr>, WhereSleeping <lgl>, WhoDecides <chr>,
## #   `NameOnYouth Matters` <chr>, ObstaclesToStableHousing <chr>,
## #   OtherObstacles <chr>, RentPaymentStatus <chr>, BenefitEligibility <chr>,
## #   AnyIncome <chr>, EarningsAmount <dbl>, NonLegalIncomeAmount <dbl>, …
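
By default head() returns the first six rows. As a small sketch on a made-up data frame (toy is hypothetical, not the CMA 2 data), the n argument changes how many rows are shown, and tail() does the same from the bottom of the data:

```r
# Toy data frame with 20 rows, for illustration only
toy <- data.frame(id = 1:20, score = seq(5, 100, by = 5))

first_three <- head(toy, n = 3)   # first 3 rows
last_three  <- tail(toy, n = 3)   # last 3 rows

first_three
last_three
```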

Describing the Data

More than viewing the data, we want to understand what type of data we are working with (beyond just the “class” of the data set). This means knowing the size or dimensions of our data, learning what types of variables exist (class and names), and even having a grasp of which variables have the most NA’s, which could impact our analysis. First, let’s look at the overall size of the tibble.

dim(cm2)
## [1] 397 158

We can see from the results that the object “cm2” has 397 rows and 158 columns. Note that R always lists rows first, then columns. This will be helpful when we want to modify our data sets by removing rows or columns (or both). Applied to the type of analysis we do within R & E, we know that we have 397 individual CMA 2’s loaded in this tibble; in data speak, we call them “observations” rather than clients or participants.

If we are just interested in the number of rows or the number of columns, we can use the following commands:

nrow(cm2)
## [1] 397
ncol(cm2)
## [1] 158

We can learn about the various columns (variables) we are working with and get a general summary of the data with the following:

summary(cm2)

Summary is great because we can see summary statistics of our tibble/data frame/vector. For each numeric column it provides the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. For character columns it reports only the length, class, and mode.

Note: if there are any missing values (NA) in a numeric vector, the summary() function will automatically exclude them when calculating the summary statistics. The summary() function will also tell us the class or type of each variable and, for numeric and logical columns, list the number of NA’s. We can see that some of our variables have too many NA’s to work with. This leads us into the next section, where we learn how to manipulate the data set.
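
Because summary() of a wide data set produces a lot of output, it can help to count the NA’s per column directly. The sketch below uses a small made-up data frame (toy and its columns are hypothetical); the same two lines would work on cm2.

```r
# Toy data frame with some missing values, for illustration only
toy <- data.frame(
  income   = c(1200, NA, 800, NA),
  program  = c("Revive", "Caminos", NA, "Revive"),
  enrolled = c(TRUE, NA, NA, NA)
)

# is.na() marks each missing cell as TRUE; colSums() adds them up per column
na_counts <- colSums(is.na(toy))

# Put the columns with the most NA's first
sort(na_counts, decreasing = TRUE)
```

This approach also counts the NA’s in character columns.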

Adjusting the data frame

Let’s review our summary data and locate some variables that have a lot of NA’s. Keep note of some of their names. This is all for review, so it does not matter which variables we zero in on.
Note: there is a greater theme here related to missing data in LSYS data exported from ETO. Some entries show up as NA by default in the export when there is no data within the touchpoint. Some entries show up as blank cells. Both types of entries affect how we analyze the data, and both raise their own conflicts and concerns. We have to adjust our scripts accordingly; however, this is for a later (more advanced level) training.

In many analyses, if a variable has too many NA’s we cannot use it. If we cannot use the data, we often want to remove it from our data frame/tibble. To remove the data, we first have to locate it. Let’s locate our variables.

names(cm2)

Subset the data frame two ways

1st Way: Base R

The output of names() lists the variable names in rows. The number in brackets at the start of each row is the position of the first name on that row, so you must count across the row to find the number of the column you want. I know, very clunky, but this is the most basic (and therefore, at times, clunky) way to identify a column. Now, let’s remove those columns. Remember, R always lists rows, comma, columns. Keep this in mind.
We want to subset our data frame and assign the result to a new object, so that we know this is the object we will work with moving forward. (Alternatively, you can re-assign it to the same object name, which will overwrite the original tibble/object.)
If we have more than 1 column to remove, we concatenate the list of column numbers with c().
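
If counting across the names() output feels error-prone, base R can look the positions up for you. This is a minimal sketch on a made-up data frame (toy is hypothetical, though the three column names are borrowed from cm2):

```r
# Toy data frame for illustration only
toy <- data.frame(
  ParticipantID = 1:3,
  Vacuums       = NA,
  IDHousingRefs = NA,
  EventTypes    = NA,
  ProgramName   = c("Revive", "Caminos", "Journeys")
)

# which() returns the positions where the name matches one of the names we list
drop_idx <- which(names(toy) %in% c("Vacuums", "IDHousingRefs", "EventTypes"))
drop_idx

# Same rows-comma-columns rule: nothing before the comma keeps all rows
toy_v2 <- toy[ , -drop_idx]
names(toy_v2)
```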

cm2_v2 <- cm2[ , -c(141, 81, 75)] ## the negative sign indicates removal
dim(cm2_v2) ## we can check that our new data frame has 3 fewer columns than the original data frame
## [1] 397 155

We can remove (or isolate) rows by listing (or concatenating) the rows we want to remove BEFORE the comma. This is useful when there are clients that need to be removed from the data frame because they did not meet a requirement of the metric, say, being employed. Let’s try removing the first 10 rows of data.

cm2_v3 <- cm2_v2[-c(1:10), ] ## the colon builds the sequence 1 through 10; the negative sign removes those rows
dim(cm2_v3)
## [1] 387 155
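
A common slip when writing a range of rows is typing a minus sign where a colon is meant. R does not complain, because 1-10 is valid arithmetic, so it is worth checking dim() or nrow() after subsetting. A small sketch on a made-up data frame (toy is hypothetical):

```r
# Toy data frame with 15 rows, for illustration only
toy <- data.frame(id = 1:15, score = 101:115)

1:10        # the colon builds the sequence 1, 2, ..., 10
c(1 - 10)   # the minus sign does subtraction, giving the single number -9

# Intended: drop rows 1 through 10, leaving rows 11 through 15
dropped <- toy[-(1:10), ]
nrow(dropped)

# Slip: -c(1 - 10) is -(-9), which is 9, so this KEEPS only row 9
oops <- toy[-c(1 - 10), ]
nrow(oops)
```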

Finally, we can subset by building, rather than removing. This is the opposite of the above process, and we do not have to use negative signs.

cm2_v4 <- cm2[3:15, 1:8] ## keep rows 3 through 15 and columns 1 through 8

Sometimes we just want to quickly view part of our data. We can use the same method of listing rows and columns, but without creating a new object. We simply take our object and identify the subset of the data we want to see within brackets.

cm2[1:5, c(4, 6)]
## # A tibble: 5 × 2
##   DateTaken Version                 
##   <chr>     <chr>                   
## 1 9/30/2024 Final / At Exit         
## 2 9/30/2024 Reassessment (Quarterly)
## 3 9/30/2024 Reassessment (Quarterly)
## 4 9/30/2024 Reassessment (Quarterly)
## 5 9/30/2024 Reassessment (Quarterly)

Before moving on to the second method of subsetting the data, let’s clear our environment of the new tibbles we created, so that we don’t confuse our process. We do that with the function rm(), listing the data frames we want to remove, separated by commas.
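
As a sketch, assuming the three objects created in this section are named cm2_v2, cm2_v3 and cm2_v4 (simple placeholders stand in for them here so that the example runs on its own):

```r
# Placeholders standing in for the tibbles created earlier
cm2_v2 <- data.frame(x = 1)
cm2_v3 <- data.frame(x = 2)
cm2_v4 <- data.frame(x = 3)

# Remove only the objects we name; everything else in the environment is kept
rm(cm2_v2, cm2_v3, cm2_v4)

# exists() confirms that an object is gone
exists("cm2_v2", inherits = FALSE)
```

Unlike rm(list = ls()), this leaves the original cm2 tibble in place.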

2nd Way: dplyr

The most common way I subset a data frame is by using the select() command, which is part of the dplyr package. select() is only used for isolating or removing columns; it is not useful for removing rows. Let’s understand the select() function first.
First, let’s identify our column names once more by using the function names(cm2) (colnames(cm2) returns the same result).

Now, let’s isolate a few columns we want to look at briefly. The command select() can be used simply to view the first several rows of the particular columns we want to see. We do this by using PIPES. Pipes are the backbone of much of the amazing work that can be done in R. While the <- sign is important for assigning new objects, you can think of a pipe (written |>) as essentially the command “and then.” Let’s try.

names(cm2) ##review the names of the data
##   [1] "ParticipantID"                 "DateOfBirth"                  
##   [3] "ProgramName"                   "DateTaken"                    
##   [5] "AuditDate"                     "Version"                      
##   [7] "CMandYouthMet"                 "DateLastMet"                  
##   [9] "CMNotes"                       "DestinationAtExit"            
##  [11] "LarkinStreetProgram"           "SubsidyLocation"              
##  [13] "HousingPlanDetails"            "DestinationLocation"          
##  [15] "DestinationSafe"               "DestinationTime"              
##  [17] "OwnBed"                        "WhereSleeping"                
##  [19] "WhoDecides"                    "NameOnYouth Matters"          
##  [21] "ObstaclesToStableHousing"      "OtherObstacles"               
##  [23] "RentPaymentStatus"             "BenefitEligibility"           
##  [25] "AnyIncome"                     "EarningsAmount"               
##  [27] "NonLegalIncomeAmount"          "UnemploymentInsuranceAmount"  
##  [29] "GAAmount"                      "SSIAmount"                    
##  [31] "PAESAmount"                    "TANFAmount"                   
##  [33] "SNAPAmount"                    "RentalAssistanceAmount"       
##  [35] "FinancialAidAmount"            "OtherIncomeAmount"            
##  [37] "OtherIncomeSource"             "TotalMonthlyIncome"           
##  [39] "BankAccount"                   "HasBudget"                    
##  [41] "LarkinRentAmount"              "SubsidyAmount"                
##  [43] "TotalRentForMoveIn"            "PercentOfIncomeIsRent"        
##  [45] "MoveInSavings"                 "SavingsForMoveIn"             
##  [47] "SavingsDeposit"                "IncomeAdequacyRent"           
##  [49] "IncomeAdequacy"                "CurrentLegalIssues"           
##  [51] "LegalContactThisQuarter"       "DescribeLegalContact"         
##  [53] "CurrentlyOnProbation"          "ArrestHistory"                
##  [55] "AgeAtFirstArrest"              "IncarcerationHistory"         
##  [57] "AgeAtFirstIncarceration"       "MostRecentArrest"             
##  [59] "KnowsLegalIssuesAffectHousing" "LandlordsAskAboutLegal"       
##  [61] "BorrowedIdentityAndCredit"     "KnowsCreditScore"             
##  [63] "InteractsComm"                 "Listens"                      
##  [65] "Mediates"                      "CommRules"                    
##  [67] "HouseSafety"                   "RespComm"                     
##  [69] "Mindful"                       "SafeProd"                     
##  [71] "SafeAppliances"                "Dishes"                       
##  [73] "CleansUp"                      "Clutter"                      
##  [75] "Vacuums"                       "Laundry"                      
##  [77] "TimeChores"                    "HsngAppts"                    
##  [79] "TalksHsngOpts"                 "HousingApps"                  
##  [81] "IDHousingRefs"                 "HousingPhCalls"               
##  [83] "HousingInts"                   "ScoutsHousingLoc"             
##  [85] "IDLegalResources"              "IDCreditResources"            
##  [87] "ResolveLegal"                  "FinLit"                       
##  [89] "SavPlan"                       "FinDocs"                      
##  [91] "PayBills"                      "StableIncome"                 
##  [93] "DebtToIncome"                  "OLDInteractsComm"             
##  [95] "OLDListens"                    "OLDMediates"                  
##  [97] "OLDCommRules"                  "OLDHouseSafety"               
##  [99] "OLDRespComm"                   "OLDMindful"                   
## [101] "OLDSafeProd"                   "OLDSafeAppliances"            
## [103] "OLDDishes"                     "OLDCleansUp"                  
## [105] "OLDClutter"                    "OLDVacuums"                   
## [107] "OLDLaundry"                    "OLDTimeChores"                
## [109] "OLDHsngAppts"                  "OLDTalksHsngOpts"             
## [111] "OLDHousingApps"                "OLDIDHousingRefs"             
## [113] "OLDHousingPhCalls"             "OLDHousingInts"               
## [115] "OLDScoutsHousingLoc"           "OLDIDLegalResources"          
## [117] "OLDIDCreditResources"          "OLDResolveLegal"              
## [119] "OLDFinLit"                     "OLDSavPlan"                   
## [121] "OLDFinDocs"                    "OLDPayBills"                  
## [123] "OLDStableIncome"               "OLDDebtToIncome"              
## [125] "DestinationAtExitWeight"       "PrimaryPhoneAtExit"           
## [127] "AdditionalPhoneAtExit"         "PrimaryEmailAtExit"           
## [129] "AdditionalEmailAtExit"         "BestWayToContact"             
## [131] "ExitAddress"                   "ExiAddressLine1"              
## [133] "ExitAddressLine2"              "ExitCity"                     
## [135] "ExitCounty"                    "ExitState"                    
## [137] "InfoSvcsFollowUp"              "OtherInfoSvcs"                
## [139] "EventContact"                  "EventInterest"                
## [141] "EventTypes"                    "OtherTypes"                   
## [143] "CAAPAmount"                    "OtherCashBenefit"             
## [145] "OtherNonCashBenefit"           "HealthInsurance"              
## [147] "DestinationType"               "NextHousingPlan"              
## [149] "NextDestinationType"           "NextDestination"              
## [151] "NextLarkinStreetProgram"       "NextHousingPlanDetail"        
## [153] "NextDestinationLocation"       "NextDestinationSafe"          
## [155] "NextDestinationTime"           "NextOwnBed"                   
## [157] "NextWhoDecides"                "NextNameOnYouth Matters"
cm2 |> select(ProgramName, DestinationAtExit, BenefitEligibility)
## # A tibble: 397 × 3
##    ProgramName   DestinationAtExit                            BenefitEligibility
##    <chr>         <chr>                                        <chr>             
##  1 Caminos       Unsubsidized rental (receiving subsidy / ma… No Longer Eligible
##  2 Journeys      <NA>                                         Eligible, Interru…
##  3 Revive        <NA>                                         Eligible, Receivi…
##  4 Revive        <NA>                                         Eligible, Receivi…
##  5 Revive        <NA>                                         No Longer Eligible
##  6 New Horizons  <NA>                                         Eligible, Receivi…
##  7 SH Home       <NA>                                         Eligible, Receivi…
##  8 Healthy Paths <NA>                                         Eligible, Not Int…
##  9 Casa Alma     <NA>                                         Eligible, Receivi…
## 10 Casa Alma     <NA>                                         Eligible, Receivi…
## # ℹ 387 more rows
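
To see why the pipe reads as “and then,” compare a nested call with its piped equivalent. This is a minimal sketch on a made-up data frame (toy is hypothetical, with column names borrowed from cm2); both versions give the same result.

```r
library(dplyr)

# Toy data frame for illustration only
toy <- data.frame(
  ParticipantID      = 1:4,
  ProgramName        = c("Revive", "Caminos", "Journeys", "Revive"),
  TotalMonthlyIncome = c(1200, 0, 800, 950)
)

# Nested: read from the inside out
nested <- head(select(toy, ProgramName, TotalMonthlyIncome), 2)

# Piped: take toy, and then select two columns, and then keep the first 2 rows
piped <- toy |>
  select(ProgramName, TotalMonthlyIncome) |>
  head(2)

identical(nested, piped)
```

The pipe passes whatever is on its left as the first argument of the function on its right, which is why toy does not appear inside select().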

We will use this same method to subset our data frame. Note: select() can ONLY select columns, not rows.

cm2_v1 <- cm2 |> select(ParticipantID:DateOfBirth, DestinationAtExit, BenefitEligibility, TotalMonthlyIncome)
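
select() can also remove columns by putting a minus sign in front of their names, which mirrors the negative column numbers used in the base R method. A small sketch on a made-up data frame (toy is hypothetical, with column names borrowed from cm2):

```r
library(dplyr)

# Toy data frame for illustration only
toy <- data.frame(
  ParticipantID = 1:3,
  DateOfBirth   = c("6/22/2002", "10/5/2003", "8/15/2000"),
  ProgramName   = c("Caminos", "Journeys", "Revive"),
  Vacuums       = NA,
  EventTypes    = NA
)

# Keep a run of neighboring columns with the colon
kept <- toy |> select(ParticipantID:ProgramName)

# Or drop the unwanted columns by name with the minus sign
dropped <- toy |> select(-Vacuums, -EventTypes)

names(kept)
names(dropped)
```

Naming the columns avoids having to recount positions if the export ever changes its column order.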

Exporting the New Data Frame

Now that we have a modified data set, we may want to save it for future use. The command write_csv allows us to save our file to our working directory. Note that you will want to save this to the Data Output sub directory, so that we know it is different from our exported files from ETO.

Being able to export clean data sets is very useful. In terms of LSYS reporting processes, we often have to keep a running UDC of clients per program who have met certain metrics. Exporting allows us to import a previous report’s UDC, add on to it by wrangling the current report period’s data, and then export it once more with a new name/indicator, if necessary. Additionally, some reports or data requests need to show client specifics, often when we share data back to programs. The write_csv command allows us to tailor the data to the specifics of a request and export a nice clean file.
After the command write_csv, you first must list the data frame/tibble you want to be written. Then, if you want the file to be saved to a sub folder, you must identify the file path within the directory, followed by a forward slash. Finally, you can name the CSV anything you like, but do not forget to add “.csv” just like when you are using read_csv, or the file will be written without a file extension. The file location SLASH CSV name must be within quotation marks.

write_csv(cm2_v1, "data_output/CMA2_Clean.csv")
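
One thing to watch for: write_csv() does not create folders, so the sub folder has to exist before you write to it. The sketch below works in a temporary folder with a made-up data frame (toy, out_dir and the file name are hypothetical) and reads the file back in to confirm that it was written.

```r
library(readr)

# Create the output folder if it is not already there
out_dir <- file.path(tempdir(), "data_output")
dir.create(out_dir, showWarnings = FALSE)

# Toy data frame for illustration only
toy <- data.frame(ParticipantID = c(20757, 22914),
                  TotalMonthlyIncome = c(1200, NA))

# Build the path (folder SLASH file name) and write the file
out_path <- file.path(out_dir, "CMA2_Clean_toy.csv")
write_csv(toy, out_path)

# Confirm the file exists, then read it back to check its contents
file.exists(out_path)
check <- read_csv(out_path, show_col_types = FALSE)
dim(check)
```

In a real project you would replace tempdir() with your working directory so that the file lands in your own Data Output sub directory.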