As always, ensure that you have the latest version of R installed. (Note: R is helpful in that it will present a pop-up on opening to remind you when an update is required. Author’s note: ALWAYS update when prompted. Don’t leave it until later or another more convenient time. R has a funny way of forcing you to update if you try to ignore the prompt: commands stop running properly, and funky things start happening with your output. Take my advice: JUST.DO.IT.)
Barring a required update, we will always clear our environment (1st) and then set our working directory (2nd).
After those two steps, load the packages you need for your
analysis. For today, we will be loading tidyr, dplyr, readr
and stringr. These four packages are all part of the larger
tidyverse package, a kind of “meta-package” that bundles many different
packages with different functionalities, all focused on types of data
wrangling. Why don’t we just load tidyverse, then, rather than
listing individual packages and loading them individually?
1. It is important to get familiar with the individual (smaller)
packages and understand their purposes and functions.
2. The tidyverse package attaches 9 core packages and installs even more
non-core packages. It is best practice to limit our workspace to what
we need and avoid namespace conflicts.
3. Know the functions and their associated packages. This is
particularly important when debugging potential conflicts.
Of course, none of these packages will load properly without first being installed, which we did in our last session by installing tidyverse. If you find that it is not installed, use the command install.packages("tidyverse"), with the package name in quotation marks.
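If you are not sure what is already installed, a quick check can save you a re-install. Here is a minimal sketch using only base R; the helper name missing_packages is made up for this example and is not part of any package.

```r
# Hypothetical helper: return the names of packages that are NOT installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages that ship with R are always present, so nothing is reported missing.
missing_packages(c("stats", "utils"))
```

To act on the result, you could store it (for example, to_install <- missing_packages(c("tidyr", "dplyr", "readr"))) and call install.packages(to_install) only when length(to_install) is greater than 0.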
ls() #check out the objects in the environment
## character(0)
rm(list=ls()) ## here we are telling R to *remove* (or rather, clear) all of the objects from the workspace so that we start with a clean environment.
setwd("~/Directories/Practice Directory")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(dplyr)
library(readr)
library(stringr)
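The conflict messages that appear when packages load are worth a closer look. When two packages share a function name, you can say exactly which one you mean with the package::function() form. A small sketch, using a made-up data frame called scores:

```r
library(dplyr)

# Toy data frame, made up for illustration.
scores <- data.frame(id = 1:5, score = c(10, 25, 40, 55, 70))

# dplyr's filter() keeps the rows that match a condition.
high_scores <- dplyr::filter(scores, score > 30)

# stats' filter() is a different function: here it computes a 3-point moving average.
smoothed <- stats::filter(scores$score, rep(1 / 3, 3))

high_scores
smoothed
```

Same name, two very different jobs. This is why it helps to know which package a function comes from when something behaves unexpectedly.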
You are going to load the data into R’s memory using the function read_csv. As you may recall, we could use read.csv, a base R function, to import our data set. However, we prefer read_csv for the way it handles data sets and its ease of use with other tidyverse commands. We have to assign our data set to an object, using the assignment operator <-. You may recall that <- can be thought of as an = sign. To work with a more interesting data set for exploratory purposes, we will load the CSV of LSYS Case Management Assessment 2 (CMA 2).
cm2 <- read_csv("cm2.csv")
## Rows: 397 Columns: 158
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (51): DateOfBirth, ProgramName, DateTaken, AuditDate, Version, CMandYout...
## dbl (43): ParticipantID, UnemploymentInsuranceAmount, PAESAmount, TANFAmount...
## num (15): EarningsAmount, NonLegalIncomeAmount, GAAmount, SSIAmount, RentalA...
## lgl (49): WhereSleeping, OLDInteractsComm, OLDListens, OLDMediates, OLDCommR...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you were to type in the code above, it is likely that the read.csv() function would appear in the automatically populated list of functions. This function is different from the read_csv() function, as it is included in the “base” packages that come pre-installed with R. Overall, read.csv() behaves similar to read_csv(), with a few notable differences. First, read.csv() coerces column names with spaces and/or special characters to different names (e.g. interview date becomes interview.date). Second, read.csv() stores data as a data.frame, where read_csv() stores data as a different kind of data frame called a tibble. We prefer tibbles because they have nice printing properties among other desirable qualities. (From https://datacarpentry.org/r-socialsci/02-starting-with-data.html#importing-data)
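You can see both differences for yourself with a tiny file. This is only a sketch: the CSV below is written to a temporary location and its contents are made up.

```r
library(readr)

# Toy CSV with a space in a column name; the contents are made up.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("interview date,site", "2024-01-05,A", "2024-01-06,B"), toy_path)

base_df <- read.csv(toy_path)                          # base R
tidy_df <- read_csv(toy_path, show_col_types = FALSE)  # readr

names(base_df)  # "interview.date" "site"
names(tidy_df)  # "interview date" "site"
class(base_df)  # "data.frame"
class(tidy_df)  # "spec_tbl_df" "tbl_df" "tbl" "data.frame"
```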
R makes it easy to look at the data we have just loaded. The first
step to exploring our data, however, is understanding it. We should
always know what type of data we are looking at. The command
class(data) tells us specifically what type of data
(frame) we have just loaded. Note, you can also see this in the Console
when you load the data.
Note that read_csv() actually loads the data as a tibble. A tibble is an
extension of R data frames used by the tidyverse. When the data is read
using read_csv(), it is stored in an object of class tbl_df, tbl, and
data.frame (plus spec_tbl_df, which records the column specification).
You can see the class of an object with class():
class(cm2)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
cm2 ## simply "call" the tibble by its name and it will display the first 10 rows of the data and as many columns as fit on the screen.
## # A tibble: 397 × 158
## ParticipantID DateOfBirth ProgramName DateTaken AuditDate Version
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 20757 6/22/2002 Caminos 9/30/2024 9/30/2024 Final / At Exit
## 2 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessment (Qu…
## 3 23104 8/15/2000 Revive 9/30/2024 9/30/2024 Reassessment (Qu…
## 4 22258 3/11/2003 Revive 9/30/2024 9/30/2024 Reassessment (Qu…
## 5 22946 9/5/2004 Revive 9/30/2024 9/30/2024 Reassessment (Qu…
## 6 21182 9/20/1998 New Horizons 9/30/2024 9/30/2024 Reassessment (Qu…
## 7 21599 3/9/2002 SH Home 9/30/2024 9/30/2024 Reassessment (Qu…
## 8 18703 9/17/1998 Healthy Paths 9/30/2024 9/30/2024 Reassessment (Qu…
## 9 20635 11/11/2001 Casa Alma 9/30/2024 9/30/2024 Reassessment (Qu…
## 10 22875 11/25/2001 Casa Alma 9/30/2024 9/30/2024 Reassessment (Qu…
## # ℹ 387 more rows
## # ℹ 152 more variables: CMandYouthMet <chr>, DateLastMet <chr>, CMNotes <chr>,
## # DestinationAtExit <chr>, LarkinStreetProgram <chr>, SubsidyLocation <chr>,
## # HousingPlanDetails <chr>, DestinationLocation <chr>, DestinationSafe <chr>,
## # DestinationTime <chr>, OwnBed <chr>, WhereSleeping <lgl>, WhoDecides <chr>,
## # `NameOnYouth Matters` <chr>, ObstaclesToStableHousing <chr>,
## # OtherObstacles <chr>, RentPaymentStatus <chr>, BenefitEligibility <chr>, …
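Notice in the printout that the date columns come in as <chr> (text). If you ever need real dates, read_csv() lets you state a column's type up front. This is a hedged sketch on a made-up three-column file, assuming the dates are written month/day/year as they appear above; it is not how cm2 was loaded in this session.

```r
library(readr)

# Toy CSV written to a temporary file; the rows are made up for illustration.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("ParticipantID,DateOfBirth,ProgramName",
    "1,1/15/2001,Caminos",
    "2,10/5/2003,Journeys"),
  toy_path
)

# Read DateOfBirth as a date (month/day/year); other columns are guessed as usual.
toy <- read_csv(
  toy_path,
  col_types = cols(DateOfBirth = col_date(format = "%m/%d/%Y")),
  show_col_types = FALSE
)

class(toy$DateOfBirth)
```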
If we want to quickly see the data, we can also use head().
#view(cm2) #the command view opens the full data set in a separate viewer tab.
head(cm2) ##shows us just the first six rows of the data, see below
## # A tibble: 6 × 158
## ParticipantID DateOfBirth ProgramName DateTaken AuditDate Version
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 20757 6/22/2002 Caminos 9/30/2024 9/30/2024 Final / At Exit
## 2 22914 10/5/2003 Journeys 9/30/2024 9/30/2024 Reassessment (Quar…
## 3 23104 8/15/2000 Revive 9/30/2024 9/30/2024 Reassessment (Quar…
## 4 22258 3/11/2003 Revive 9/30/2024 9/30/2024 Reassessment (Quar…
## 5 22946 9/5/2004 Revive 9/30/2024 9/30/2024 Reassessment (Quar…
## 6 21182 9/20/1998 New Horizons 9/30/2024 9/30/2024 Reassessment (Quar…
## # ℹ 152 more variables: CMandYouthMet <chr>, DateLastMet <chr>, CMNotes <chr>,
## # DestinationAtExit <chr>, LarkinStreetProgram <chr>, SubsidyLocation <chr>,
## # HousingPlanDetails <chr>, DestinationLocation <chr>, DestinationSafe <chr>,
## # DestinationTime <chr>, OwnBed <chr>, WhereSleeping <lgl>, WhoDecides <chr>,
## # `NameOnYouth Matters` <chr>, ObstaclesToStableHousing <chr>,
## # OtherObstacles <chr>, RentPaymentStatus <chr>, BenefitEligibility <chr>,
## # AnyIncome <chr>, EarningsAmount <dbl>, NonLegalIncomeAmount <dbl>, …
More than viewing the data, we want to understand what type of data we are working with (beyond just the “class” of the data set). This means knowing the size or dimensions of our data, learning what types of variables exist (class and names), and even having a grasp of which variables have the most NA’s, which could impact our analysis. First, let’s look at the overall size of the tibble.
dim(cm2)
## [1] 397 158
We can see from the results that the object “cm2” has 397 rows and 158 columns. Note that R always lists rows first and columns second. This will be helpful when we want to modify our data sets by removing rows or columns (or both). Applied to the type of analysis we do within R & E, we know that we have 397 individual CMA 2’s loaded in this tibble; in data speak, we call them “observations” rather than clients or participants.
If we are just interested in the number of rows or columns, we can use the following commands:
nrow(cm2)
## [1] 397
ncol(cm2)
## [1] 158
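To get a quick count of how many variables there are of each type, you can ask for the class of every column and tally the result. A minimal sketch in base R, with a made-up data frame standing in for cm2:

```r
# Toy data frame standing in for cm2; the columns are made up.
toy <- data.frame(
  id      = c(1, 2, 3),
  program = c("A", "B", "A"),
  income  = c(500, NA, 1200),
  housed  = c(TRUE, FALSE, NA)
)

# First class of each column, then a tally of how many columns have each type.
column_classes <- vapply(toy, function(col) class(col)[1], character(1))
column_classes
table(column_classes)
```

Swapping toy for cm2 should give the same kind of tally for the real data.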
We can learn about the various columns (variables) we are working with and get a general summary of the data with the following:
summary(cm2)
summary() is great because we can see summary statistics of our tibble/data frame/vector. For a numeric variable it provides the minimum, first quartile, median, mean, third quartile, and maximum; for a character variable it reports the length, class, and mode.
Note: if there are any missing values (NA) in a variable, the summary() function will automatically exclude them when calculating the summary statistics. The summary() function will also list the quantity of NA’s in our numeric and logical variables. We can see that some of our variables have too many NA’s to work with. This leads us into the next section where we learn how to manipulate the data set.
Let’s review our summary data and locate some
variables that have a lot of NA’s. Keep note of some of their names.
This is all for review, so there is no importance to the variables we
want to zero in on.
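Scanning a long summary by eye gets tiring with 158 columns. As a sketch of one alternative, you can count the NA's in every column directly with base R. The data frame below is made up; the same lines should work with cm2 in place of toy.

```r
# Toy data frame, made up for illustration.
toy <- data.frame(
  id     = 1:5,
  income = c(500, NA, 1200, NA, NA),
  exit   = c(NA, NA, NA, NA, "Rental"),
  housed = c(TRUE, FALSE, TRUE, TRUE, NA)
)

# Count missing values per column, largest first.
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_counts

# Share of rows missing in each column.
round(na_counts / nrow(toy), 2)

# Names of the columns where more than half of the values are missing.
mostly_missing <- names(na_counts)[na_counts / nrow(toy) > 0.5]
mostly_missing
```

The 50% cut-off is only an example threshold, not a rule.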
Note: there is a larger theme here related to missing data in LSYS
data exported from ETO. Some entries show up as NA, by default of the
export, when there is no data within the touchpoint. Some entries show
up as blank cells. Both of these types of entries impact how we analyze
the data, and there are conflicts and concerns in both cases. We have to
adjust our scripts accordingly; however, this is for a later (more
advanced level) training.
In much data analysis, if there are too many NA’s we cannot use the data. If we cannot use the data, we often want to remove it from our data frame/tibble. To remove the data, we first have to locate it. Let’s locate our variables.
names(cm2)
The names are printed in rows, and the number in brackets at the start
of each row is the position of the first name in that row. You must
count across the row to find the number of the column you want. I know,
very clunky, but this is the most basic (therefore, clunky at times) way
to identify our column variable. Now, let’s remove those columns.
Remember, R always lists rows, comma, columns. Keep this in mind.
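If counting across the rows feels error-prone, you can ask R for the positions instead. A small sketch with base R's match(), on a made-up data frame that borrows a few of the column names:

```r
# Toy data frame, made up for illustration.
toy <- data.frame(
  ParticipantID = 1:3,
  ProgramName   = c("A", "B", "C"),
  CMNotes       = c("x", "y", "z"),
  AnyIncome     = c("Yes", "No", "Yes")
)

# Positions of the columns we care about, looked up by name.
positions <- match(c("CMNotes", "AnyIncome"), names(toy))
positions  # 3 4

# The same positions can then be used inside the brackets, after the comma.
names(toy[, -positions])  # "ParticipantID" "ProgramName"
```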
We want to subset our data frame and assign it to a new object, so that we
know that this is the new object we will work with moving forward.
(Alternatively, you can re-assign it to the same object, and that will
overwrite the original tibble/object).
If we have more than 1 column to remove, we concatenate the
list of column numbers with c().
cm2_v2 <- cm2[ , -c(141, 81, 75)] ### the negative sign indicates removal
dim(cm2_v2) ## we can check that our new data frame has 3 fewer columns than the original data frame
## [1] 397 155
We can remove (or isolate) rows by listing (or concatenating) the rows we want to remove BEFORE the comma. This is useful when there are clients that need to be removed from the data frame because they did not meet a requirement of the metric, say, being employed. Let’s try removing the first 10 rows of data.
cm2_v3 <- cm2_v2[-c(1:10), ]
dim(cm2_v3)
## [1] 387 155
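Typing out row numbers works for a handful of rows, but for something like "keep only employed clients" it is easier to let R find the rows. As a sketch, a TRUE/FALSE test can go before the comma in place of row numbers. The data frame and its employed column are made up for this example.

```r
# Toy data frame, made up for illustration.
toy <- data.frame(
  id       = 1:6,
  employed = c("Yes", "No", "Yes", NA, "No", "Yes")
)

# TRUE for rows that meet the requirement; a missing answer counts as not met.
keep <- !is.na(toy$employed) & toy$employed == "Yes"

# The TRUE/FALSE vector goes before the comma, where the row numbers would go.
toy_employed <- toy[keep, ]
dim(toy_employed)  # 3 2
```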
Finally, we can subset by building, rather than removing. This is the opposite of the above process, and we do not have to use negative signs.
cm2_v4 <- cm2[3:15, c(1:8)]
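A word of caution when typing ranges: the colon builds a sequence, while a minus sign does subtraction, and R will not warn you if you mix them up. A quick sketch on a made-up 20 x 10 data frame shows the difference:

```r
# The colon builds a sequence; the minus sign does arithmetic.
1:8    # 1 2 3 4 5 6 7 8
1 - 8  # -7

# Toy data frame with 20 rows and 10 columns, made up for illustration.
toy <- as.data.frame(matrix(1:200, nrow = 20, ncol = 10))

built <- toy[3:15, 1:8]      # rows 3 to 15, columns 1 to 8
dim(built)                   # 13 8

slip <- toy[3:15, c(1 - 8)]  # c(-7) drops column 7 and keeps the other nine
dim(slip)                    # 13 9
```

Checking dim() after every subset is a cheap way to catch this kind of slip.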
Viewing the data can be achieved using the same method as above. Sometimes we just want to quickly view part of our data. We can use the same method of listing rows and columns, but without creating a new object. We simply take our object and identify the subset of the data we want to see within brackets.
cm2[1:5, c(4, 6)]
## # A tibble: 5 × 2
## DateTaken Version
## <chr> <chr>
## 1 9/30/2024 Final / At Exit
## 2 9/30/2024 Reassessment (Quarterly)
## 3 9/30/2024 Reassessment (Quarterly)
## 4 9/30/2024 Reassessment (Quarterly)
## 5 9/30/2024 Reassessment (Quarterly)
Before moving on to the second method of subsetting the data, let’s clear our environment of the new tibbles we created, so that we don’t confuse our process. We do that with the function rm(), listing the data frames we want to remove, separated by commas.
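A sketch of what that looks like for the three objects created in this section. The first three lines only create placeholder stand-ins so that the snippet runs by itself; in your own session the objects already exist and you only need the rm() line.

```r
# Placeholder stand-ins for the tibbles created above (skip these in a live session).
cm2_v2 <- data.frame(a = 1)
cm2_v3 <- data.frame(a = 2)
cm2_v4 <- data.frame(a = 3)

# Remove several objects at once by listing them, separated by commas.
rm(cm2_v2, cm2_v3, cm2_v4)

# exists() confirms that an object is gone.
exists("cm2_v2")  # FALSE
```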
The most common way I subset a data frame is by using the
select() command, which is part of the dplyr package.
It is only used for isolating or removing columns; it is not useful for
removing rows. Let’s understand the select() function
first.
First let’s identify our column names once more, by using the function
colnames(cm2)
Now, let’s decide to isolate a few columns we want to look at briefly. The command select() can be used simply to view the first several rows of the particular columns we want to see. We do this by using PIPES. Pipes are the backbone of many of the amazing things that can be done in R. While the <- sign is important, of course, for assigning new objects, you can think of the pipe (|>) as essentially the phrase “and then”. Let’s try.
names(cm2) ##review the names of the data
## [1] "ParticipantID" "DateOfBirth"
## [3] "ProgramName" "DateTaken"
## [5] "AuditDate" "Version"
## [7] "CMandYouthMet" "DateLastMet"
## [9] "CMNotes" "DestinationAtExit"
## [11] "LarkinStreetProgram" "SubsidyLocation"
## [13] "HousingPlanDetails" "DestinationLocation"
## [15] "DestinationSafe" "DestinationTime"
## [17] "OwnBed" "WhereSleeping"
## [19] "WhoDecides" "NameOnYouth Matters"
## [21] "ObstaclesToStableHousing" "OtherObstacles"
## [23] "RentPaymentStatus" "BenefitEligibility"
## [25] "AnyIncome" "EarningsAmount"
## [27] "NonLegalIncomeAmount" "UnemploymentInsuranceAmount"
## [29] "GAAmount" "SSIAmount"
## [31] "PAESAmount" "TANFAmount"
## [33] "SNAPAmount" "RentalAssistanceAmount"
## [35] "FinancialAidAmount" "OtherIncomeAmount"
## [37] "OtherIncomeSource" "TotalMonthlyIncome"
## [39] "BankAccount" "HasBudget"
## [41] "LarkinRentAmount" "SubsidyAmount"
## [43] "TotalRentForMoveIn" "PercentOfIncomeIsRent"
## [45] "MoveInSavings" "SavingsForMoveIn"
## [47] "SavingsDeposit" "IncomeAdequacyRent"
## [49] "IncomeAdequacy" "CurrentLegalIssues"
## [51] "LegalContactThisQuarter" "DescribeLegalContact"
## [53] "CurrentlyOnProbation" "ArrestHistory"
## [55] "AgeAtFirstArrest" "IncarcerationHistory"
## [57] "AgeAtFirstIncarceration" "MostRecentArrest"
## [59] "KnowsLegalIssuesAffectHousing" "LandlordsAskAboutLegal"
## [61] "BorrowedIdentityAndCredit" "KnowsCreditScore"
## [63] "InteractsComm" "Listens"
## [65] "Mediates" "CommRules"
## [67] "HouseSafety" "RespComm"
## [69] "Mindful" "SafeProd"
## [71] "SafeAppliances" "Dishes"
## [73] "CleansUp" "Clutter"
## [75] "Vacuums" "Laundry"
## [77] "TimeChores" "HsngAppts"
## [79] "TalksHsngOpts" "HousingApps"
## [81] "IDHousingRefs" "HousingPhCalls"
## [83] "HousingInts" "ScoutsHousingLoc"
## [85] "IDLegalResources" "IDCreditResources"
## [87] "ResolveLegal" "FinLit"
## [89] "SavPlan" "FinDocs"
## [91] "PayBills" "StableIncome"
## [93] "DebtToIncome" "OLDInteractsComm"
## [95] "OLDListens" "OLDMediates"
## [97] "OLDCommRules" "OLDHouseSafety"
## [99] "OLDRespComm" "OLDMindful"
## [101] "OLDSafeProd" "OLDSafeAppliances"
## [103] "OLDDishes" "OLDCleansUp"
## [105] "OLDClutter" "OLDVacuums"
## [107] "OLDLaundry" "OLDTimeChores"
## [109] "OLDHsngAppts" "OLDTalksHsngOpts"
## [111] "OLDHousingApps" "OLDIDHousingRefs"
## [113] "OLDHousingPhCalls" "OLDHousingInts"
## [115] "OLDScoutsHousingLoc" "OLDIDLegalResources"
## [117] "OLDIDCreditResources" "OLDResolveLegal"
## [119] "OLDFinLit" "OLDSavPlan"
## [121] "OLDFinDocs" "OLDPayBills"
## [123] "OLDStableIncome" "OLDDebtToIncome"
## [125] "DestinationAtExitWeight" "PrimaryPhoneAtExit"
## [127] "AdditionalPhoneAtExit" "PrimaryEmailAtExit"
## [129] "AdditionalEmailAtExit" "BestWayToContact"
## [131] "ExitAddress" "ExiAddressLine1"
## [133] "ExitAddressLine2" "ExitCity"
## [135] "ExitCounty" "ExitState"
## [137] "InfoSvcsFollowUp" "OtherInfoSvcs"
## [139] "EventContact" "EventInterest"
## [141] "EventTypes" "OtherTypes"
## [143] "CAAPAmount" "OtherCashBenefit"
## [145] "OtherNonCashBenefit" "HealthInsurance"
## [147] "DestinationType" "NextHousingPlan"
## [149] "NextDestinationType" "NextDestination"
## [151] "NextLarkinStreetProgram" "NextHousingPlanDetail"
## [153] "NextDestinationLocation" "NextDestinationSafe"
## [155] "NextDestinationTime" "NextOwnBed"
## [157] "NextWhoDecides" "NextNameOnYouth Matters"
cm2 |> select(ProgramName, DestinationAtExit, BenefitEligibility)
## # A tibble: 397 × 3
## ProgramName DestinationAtExit BenefitEligibility
## <chr> <chr> <chr>
## 1 Caminos Unsubsidized rental (receiving subsidy / ma… No Longer Eligible
## 2 Journeys <NA> Eligible, Interru…
## 3 Revive <NA> Eligible, Receivi…
## 4 Revive <NA> Eligible, Receivi…
## 5 Revive <NA> No Longer Eligible
## 6 New Horizons <NA> Eligible, Receivi…
## 7 SH Home <NA> Eligible, Receivi…
## 8 Healthy Paths <NA> Eligible, Not Int…
## 9 Casa Alma <NA> Eligible, Receivi…
## 10 Casa Alma <NA> Eligible, Receivi…
## # ℹ 387 more rows
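To see what the pipe is doing, it helps to compare a piped command with the same command written the "inside out" way. A minimal sketch using only base R (the |> pipe needs R 4.1 or newer); the numbers are made up.

```r
values <- c(1, 4, 9, 16)

# Nested call: read from the inside out.
nested <- sum(sqrt(values))

# Piped call: read left to right, "take values, and then sqrt, and then sum".
piped <- values |> sqrt() |> sum()

identical(nested, piped)  # TRUE
```

In the same way, cm2 |> select(ProgramName) is just another way of writing select(cm2, ProgramName).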
We will use this same method to subset our data frame. Note: select() can ONLY select columns, not rows.
cm2_v1 <- cm2 |> select(ParticipantID:DateOfBirth, DestinationAtExit, BenefitEligibility, TotalMonthlyIncome)
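select() can also drop columns instead of keeping them. The sketch below uses a made-up data frame that borrows a few of the column names; the minus sign and the starts_with() helper are both part of dplyr's select() toolkit.

```r
library(dplyr)

# Toy data frame, made up for illustration.
toy <- data.frame(
  ParticipantID = 1:3,
  ProgramName   = c("A", "B", "C"),
  Listens       = c(1, 2, 3),
  OLDListens    = c(1, 1, 2),
  OLDMediates   = c(2, 2, 3)
)

# A minus sign in front of a name drops that column.
without_program <- toy |> select(-ProgramName)
names(without_program)

# starts_with() matches columns by prefix; "!" keeps everything that does NOT match.
without_old <- toy |> select(!starts_with("OLD"))
names(without_old)
```

Something like the second command could be a quick way to set aside all of the OLD columns in cm2 at once, if that is ever needed.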
Now that we have a modified data set, we may want to save it for future use. The command write_csv allows us to save our file to our working directory. Note that you will want to save this to the Data Output sub directory, so that we know it is different from our exported files from ETO.
Being able to export clean data sets is very useful. In terms of LSYS
reporting processes, we often have to keep a running UDC of clients per
program who have met certain metrics, and this allows us to import a
previous report’s UDC, add on to it by wrangling the current report
period’s data, and then export it once more, with a new name/indicator
if necessary. Additionally, some reports or data requests need to show
client specifics, often when we share data back to programs. The
write_csv command allows us to tailor the output to the specifics of a
request, and export a nice clean file.
After the command write_csv, you must first list the
data frame/tibble you want to be written. Then, if you want the file to be
saved to a sub folder, you must identify the file path within the
directory, followed by a forward slash. Finally, you can name the CSV
anything you like, but do not forget to add “.csv”, just like when you
are using read_csv, or the file will be saved without the .csv extension.
The file location SLASH CSV name must be within quotation marks.
write_csv(cm2_v1, "data_output/CMA2_Clean.csv")
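One thing to watch for: write_csv() will not create a sub folder for you, so the command fails if data_output does not exist yet. Here is a hedged sketch of a safer pattern, using a made-up data frame and a temporary folder in place of cm2_v1 and the real project directory.

```r
library(readr)

# Toy data and a temporary folder stand in for cm2_v1 and the project directory.
toy <- data.frame(ParticipantID = 1:3, TotalMonthlyIncome = c(500, 0, 1200))
output_dir <- file.path(tempdir(), "data_output")

# Make sure the sub folder exists before writing to it.
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)

output_path <- file.path(output_dir, "CMA2_Clean.csv")
write_csv(toy, output_path)

# Read the file back to confirm that it was written as expected.
check <- read_csv(output_path, show_col_types = FALSE)
dim(check)
```

Reading the file back in is optional, but it is a quick way to confirm that the export contains the rows and columns you expect.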