Overview and Project Setup

For this project, I wanted to clean and explore data I work with nearly every day. I would be quite wealthy if I had a nickel for each time I downloaded these four primary data sets. Initially, I had planned on producing a Shiny app to help organize, structure, and share the critical pieces of data. While I had some success getting things to work, I was not able to execute my full vision in time for this project. I still plan to continue with the app, not only as a learning opportunity, but as a useful tool for my colleagues.

I wanted to take the data as raw as possible from the various sources to ensure reproducibility, not just for this project but for myself, since these sets are published annually.

Data Sets

Point in Time Count & Housing Inventory Counts

(https://www.hudexchange.info/resource/3031/pit-and-hic-data-since-2007/) These data sets are annual counts of the number of people experiencing homelessness and the number of units of housing (both temporary and permanent) available to assist them. The data sets are Excel workbooks with each year as its own sheet. (Challenge #1)

System Performance Measures

(https://www.hudexchange.info/resource/5691/system-performance-measures-data-since-fy-2015/) The system performance measures (SPMs) connect to the data collection operations of each community, called a “Continuum of Care” or CoC. These are seven HUD-designated metrics that measure the extent to which a community is ending homelessness.

Awards and Allocations

(https://www.hudexchange.info/grantees/allocations-awards/) The HUD awards and allocations are the total funding provided to states and communities over time, covering all HUD programs and funding.

Getting a Clean Set and the Challenge of Reading Excel

First things first, I needed to read Excel data into R. More than that, I needed to read in several sheets from each workbook, clean up the columns, and join the sets. Each set had different quirks that made repeating one method for all four difficult. Now that I have a method, repeating it next year will be a snap.

Funnily enough, I’ve read JSON, HTML, XML, and CSV into R, but never Excel. I understand why that is the case, but my work often has me in .xlsx files, so this was good practice. I expect that to continue in my professional life, and it’s impractical to export to CSV every time.

My goal below is to get the Excel workbooks cleaned and exported to CSV as quickly as possible. I created the CSV files initially for my Shiny app; otherwise, I could have just used the data frames produced by the code below. I do plan on using the CSV files and joined data frames for a Tableau project, so I think they’re quite useful in any case.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## Warning: package 'readr' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readxl)
## Warning: package 'readxl' was built under R version 4.0.3
library(DT)
## Warning: package 'DT' was built under R version 4.0.3
library(usmap)
## Warning: package 'usmap' was built under R version 4.0.3
library(ggplot2)
library(dplyr)
#Import Data
hicDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\2007-2019-Housing-Inventory-Count-by-CoC.xlsx"
pitDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\2007-2019-Point-in-Time-Estimates-by-CoC.xlsx"
spmDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\System-Performance-Measures-Data-Since-FY-2015.xlsx"
awardDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\Awards.xlsx"

Housing Inventory Counts (HIC)

First, the HIC. I downloaded the source data from the websites linked above (the raw files are also on my GitHub). I spent some time trying to read Excel directly from GitHub; I got it working, but I had to create a temp file in the process. An easier way for others to run this is to download the files and change the directory paths to point at one’s own local machine instead.

The HIC file has a number of sheets, one for each year. I found the readxl package, which worked with a bit of tinkering. Everything I found in the texts and online all but waved me off from reading Excel directly - better to export the source to CSV and go about it normally. But I really wanted to commit myself to working with the raw file as much as I could.

The hardest part was reading in all of the sheets, creating a column for the sheet, and filling that column for each record with the sheet name, which is the year of the data set.

sheet <- excel_sheets(hicDir)
hicdf <- lapply(setNames(sheet, sheet), function(x) read_excel(hicDir, sheet = x, col_names = FALSE, skip = 1))
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
hicdf <- bind_rows(hicdf, .id="Sheet")

names(hicdf) <- as.matrix(hicdf[1, ])
## Warning: The `value` argument of ``names<-`()` can't be empty as of tibble 3.0.0.
## Column 79 must be named.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
hicdf <- hicdf[-1, ]
hicdf[] <- lapply(hicdf, function(x) type.convert(as.character(x)))

names(hicdf)[names(hicdf) == "2019"] <- "Year"
write.csv(hicdf,"hic.csv")
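
As an aside, the same read-every-sheet-and-label pattern can be wrapped in a small helper using purrr (attached with the tidyverse). This is only a sketch of an alternative, not the code I actually ran, and the helper name is my own:

read_all_sheets <- function(path, ...) {
  # Read every sheet in the workbook, stack the rows, and label each
  # record with its sheet name (the year, in these HUD files)
  sheets <- excel_sheets(path)
  map_dfr(set_names(sheets), function(s) read_excel(path, sheet = s, ...), .id = "Sheet")
}
# e.g., hicdf <- read_all_sheets(hicDir, col_names = FALSE, skip = 1)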

Point in Time Counts (PIT)

For the PIT set, I was able to reuse much of the code from the HIC process. I had to change a few of the arguments to account for the different Excel structure. HUD really likes to preface its data with two or three rows of explanation or additional headers - very untidy!

sheet <- excel_sheets(pitDir)
pitdf <- lapply(setNames(sheet, sheet), function(x) read_excel(pitDir, sheet=x, col_names=TRUE))
pitdf <- bind_rows(pitdf, .id="Sheet")

names(pitdf)[names(pitdf) == "Sheet"] <- "Year"
write.csv(pitdf,"pit.csv")

System Performance Measures (SPMs)

sheet <- excel_sheets(spmDir)
spmdf <- lapply(setNames(sheet, sheet), function(x) read_excel(spmDir, sheet=x, col_names=TRUE))
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...7
## * ...
spmdf <- bind_rows(spmdf, .id="Sheet")

names(spmdf) <- as.matrix(spmdf[1, ])
## Warning: The `value` argument of ``names<-`()` can't be empty as of tibble 3.0.0.
## Columns 48, 49, 50, 51, 52, and 37 more must be named.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
spmdf <- spmdf[-1, ]
spmdf[] <- lapply(spmdf, function(x) type.convert(as.character(x)))

names(spmdf)[names(spmdf) == "2015"] <- "Year"
write.csv(spmdf,"spm.csv")

Awards

sheet <- excel_sheets(awardDir)
awarddf <- lapply(setNames(sheet, sheet), function(x) read_excel(awardDir, sheet=x, col_names=TRUE))
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
awarddf <- bind_rows(awarddf, .id="Sheet")

names(awarddf) <- as.matrix(awarddf[1, ])
awarddf <- awarddf[-1, ]
awarddf[] <- lapply(awarddf, function(x) type.convert(as.character(x)))

names(awarddf)[names(awarddf) == "2015"] <- "Year"
write.csv(awarddf,"award.csv")

Import Data

After writing to CSV, I uploaded those sets to GitHub to pull back into R. I found out quickly that R was treating all these columns as factors, so I changed an argument to have R read the data types correctly.

pit <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/pit.csv", stringsAsFactors = FALSE)
hic <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/hic.csv", stringsAsFactors = FALSE)
spm <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/spm.csv", stringsAsFactors = FALSE)
award <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/award.csv", stringsAsFactors = FALSE)
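
As an alternative, readr (also attached with the tidyverse) never turns strings into factors, so read_csv() would make the stringsAsFactors argument unnecessary; a sketch for one of the files:

# read_csv() returns a tibble and leaves character columns as character
pit <- read_csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/pit.csv")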

Joining Data

Using joins, the sets were combined into one mega set. For me, this alone would have been enough for my professional use; combining these sets on CoC and state into a clean CSV or data frame is magic. I am more than a bit annoyed at HUD: from set to set, they don’t use the same nomenclature for similar columns, and they do a very poor job with state designation. Luckily, the CoC Number starts with the two-character state abbreviation, so I created a new column to use for mapping further down (my new focus instead of the Shiny app).

joinSet <- left_join(pit, hic)
## Joining, by = c("Year", "CoC.Number")
joinSet <- left_join(joinSet, spm)
## Joining, by = c("Year", "CoC.Number", "NA.")
joinSet <- joinSet %>%
  select(Year:Percent.with.Successful.Retention.or.Exit) %>%
  select(-State) %>%
  mutate(state = str_extract(CoC.Number,"^.{2}"))
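
To make the state extraction concrete, the regular expression "^.{2}" simply grabs the first two characters of the string. A quick illustration (the CoC number here is just an example value):

# Pull the two-character state abbreviation off the front of a CoC Number
str_extract("CA-600", "^.{2}")
## [1] "CA"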

Data Table

I really liked this widget for showing all the columns and sorting quickly. It doesn’t format very well once published, but it gives a good idea of the amount of data I previously had to sort through and combine by hand.

datatable(joinSet, fillContainer = TRUE)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Analysis and Mapping

To do a bit of analysis and verify the data, I wanted to map out the total population experiencing homelessness by state. This is where I discovered that the state column that came with the data was not very good, leading me to create another from scratch.

Using the usmap package, I was able to construct a very simple, if informative, map of the US based on population. No surprise that the large states - California, New York, Texas, and Florida - account for the most people experiencing homelessness according to the PIT.

homelessPop <- joinSet %>%
  select(Year, CoC.Category, Overall.Homeless, state)

sumByState <- aggregate(x = homelessPop$Overall.Homeless,
          by = list(homelessPop$state, homelessPop$Year),
          FUN = sum)

sumByState <- sumByState %>%
  rename(state = Group.1) %>%
  rename(Year = Group.2) %>%
  rename(TotalHomeless = x)
sumByState <- sumByState[-1,]
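
As an aside, the aggregate() call and the three renames above could be collapsed into a single dplyr pipeline. A sketch that should be equivalent (I added na.rm = TRUE as a guard against missing counts, which the aggregate() version does not do):

# Same state-by-year totals, with readable column names from the start
sumByState2 <- homelessPop %>%
  group_by(state, Year) %>%
  summarise(TotalHomeless = sum(Overall.Homeless, na.rm = TRUE), .groups = "drop")
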
plot_usmap(data = sumByState, values = "TotalHomeless", color = "red", labels = TRUE) + 
  scale_fill_continuous(low = "white", high = "blue", name = "2019 Homeless Population (PIT)", label = scales::comma) + 
  theme(legend.position = "right")

Year to Year Comparisons

It is very interesting to compare the population counts over the course of the five years. Sub-setting the massive joined data set, I selected a few columns, spread the data by year into a new set, and created a new column for the change from 2015 to 2019. I was very interested to see the results.

sumByStateYear <- spread(sumByState, key=Year, value=TotalHomeless)
sumByStateYear <- sumByStateYear[-1,]

colnames(sumByStateYear)
## [1] "state" "2015"  "2016"  "2017"  "2018"  "2019"
sumByStateYear <- sumByStateYear %>%
  rename(y2015 = 2, y2016 = 3, y2017 = 4, y2018 = 5, y2019 = 6) %>%
  mutate(yearChange = y2019 - y2015)
datatable(sumByStateYear)
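
tidyr has since superseded spread() with pivot_wider(), which can also handle the year-to-name prefixing directly. A sketch that should reproduce the reshape and renames above (minus the row drop):

# Equivalent reshape using pivot_wider(); names_prefix avoids the rename step
sumByStateYear2 <- sumByState %>%
  pivot_wider(names_from = Year, values_from = TotalHomeless, names_prefix = "y") %>%
  mutate(yearChange = y2019 - y2015)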

Increases and Decreases in Homelessness

As a resident of California with a lot of work in this field in the state, I can certainly see how CA leads the country in homelessness, and the last five years have not shown many bright spots, even though CA has dedicated more money, resources, and attention to homeless services and housing. CA is also the most populous state; compared on a per capita basis, Hawaii claims the top spot.

sumByStateYear %>% 
  arrange(desc(yearChange)) %>%
  slice(1:10) %>%
  ggplot(., aes(x = reorder(state, -yearChange), y = yearChange)) +
  geom_bar(stat = 'identity')

On the decreases side, I was very interested to see Florida leading. At the local and state level, there have been some policy bright spots there. On the other hand, FL is not a Medicaid expansion state, which limits the supportive service resources available to people experiencing homelessness who have disabilities and no insurance. I think this set and these charts will open up a new line of research and will be a rich environment for me in future projects.

sumByStateYear %>% 
  arrange(yearChange) %>%
  slice(1:10) %>%
  ggplot(., aes(x = reorder(state, yearChange), y = yearChange)) +
  geom_bar(stat = 'identity')

Conclusions

I am quite proud of the work done here. Even though I was not able to execute my original vision for a Shiny app in time, that is something I can still work toward, and these data sets and cleaning procedures will pay enormous dividends for my colleagues and myself.

The challenges I experienced in this project were mainly in formatting and reading data the way I had envisioned. I still catch small errors in code, and when that happens I typically assume the worst. Most often these errors are simple mistakes in spelling or punctuation rather than something code-altering and horrible. I need to get better about checking for errors as I go, and about reading and re-reading error messages until I understand what is throwing them. More than once I changed a substantial amount of code just to find that I had left a pipe (%>%) at the end of a line where I meant to leave nothing…