DATA607 - Final Data Project

December 9, 2020

Overview and Project Setup

Setting the Stage

For this project, I wanted to clean and explore data I work with nearly everyday
If I had a nickle…
Learning opportunity for me, but as a useful tool for my colleagues and my work
I wanted to take the data as raw as possible from the various sources to ensure reproducibility not just for the project but for myself as these sets are published annually.

Data Sets - Excel, why did it have to be Excel?

#Import Data
hicDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\2007-2019-Housing-Inventory-Count-by-CoC.xlsx"
pitDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\2007-2019-Point-in-Time-Estimates-by-CoC.xlsx"
spmDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\System-Performance-Measures-Data-Since-FY-2015.xlsx"
awardDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\Awards.xlsx"

Point in Time Count & Housing Inventory Counts (https://www.hudexchange.info/resource/3031/pit-and-hic-data-since-2007/) These data sets are annual counts of the number of people experiencing homelessness and the number of units of housing (both temporary and permanent) available to assist them. The data sets are Excel workbooks with each year as its own sheet.
System Performance Measures (https://www.hudexchange.info/resource/5691/system-performance-measures-data-since-fy-2015/) The system performance measures (SPMs) connect to the data collection operations of each community, called “Continuums of Care” or CoCs. These are seven HUD-designated metrics that ascertain to what extent the community is ending homelessness.
Awards and Allocations (https://www.hudexchange.info/grantees/allocations-awards/) The HUD awards and allocations are the total funding provided to states and communities over time. They provide information for all HUD programs and funding.

Getting a Clean Set and the Thrill Reading Excel

sheet <- excel_sheets(hicDir)
hicdf <- lapply(setNames(sheet, sheet), function(x) read_excel(hicDir, sheet=x, col_names=FALSE,skip =1))
hicdf <- bind_rows(hicdf, .id="Sheet")

names(hicdf) <- as.matrix(hicdf[1, ])
hicdf <- hicdf[-1, ]
hicdf[] <- lapply(hicdf, function(x) type.convert(as.character(x)))

names(hicdf)[names(hicdf) == "2019"] <- "Year"
write.csv(hicdf,"hic.csv")

sheet <- excel_sheets(pitDir)
pitdf <- lapply(setNames(sheet, sheet), function(x) read_excel(pitDir, sheet=x, col_names=TRUE))
pitdf <- bind_rows(pitdf, .id="Sheet")

names(pitdf)[names(pitdf) == "Sheet"] <- "Year"
write.csv(pitdf,"pit.csv")

sheet <- excel_sheets(spmDir)
spmdf <- lapply(setNames(sheet, sheet), function(x) read_excel(spmDir, sheet=x, col_names=TRUE))
spmdf <- bind_rows(spmdf, .id="Sheet")

names(spmdf) <- as.matrix(spmdf[1, ])
spmdf <- spmdf[-1, ]
spmdf[] <- lapply(spmdf, function(x) type.convert(as.character(x)))

names(spmdf)[names(spmdf) == "2015"] <- "Year"
write.csv(spmdf,"spm.csv")

Import Data

After writing to csv, I uploaded those sets to GitHub to pull into R. I found out quickly that R was treating all these columns as factors, and so I changed an argument have R read the data types correctly.

pit <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/pit.csv", stringsAsFactors = FALSE)
hic <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/hic.csv", stringsAsFactors = FALSE)
spm <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/spm.csv", stringsAsFactors = FALSE)
award <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/award.csv", stringsAsFactors = FALSE)

Joining Data

Using the joins, the four sets were combined into one mega set by CoC Community
More than a bit annoyed at HUD, from set to set they don’t have the same nomenclature for similar columns and do a very poor job with state designation.
Luckily, the CoC Number starts with the two character abbreviation for states, so I created a new column to use for mapping

joinSet <- left_join(pit, hic)
joinSet <- left_join(joinSet, spm)

joinSet <- joinSet %>%
  select(Year:Percent.with.Successful.Retention.or.Exit) %>%
  select(-State) %>%
  mutate(state = str_extract(CoC.Number,"^.{2}"))

Analysis and Mapping

To do a bit of analysis to verify the data, I wanted to map out the total population experiencing homelessness by State. This is where I discovered that the state column that came with the data was not very good, leading me to create another from scratch.

Using the package usmaps I was able to construct a very simple, if informative map of the US based on population. No suprise that the large states, California, New York, Texas, and Florida account for the most people experiencing homelessness according to the PIT.

homelessPop <- joinSet %>%
  select(Year, CoC.Category, Overall.Homeless, state)

sumByState <- aggregate(x = homelessPop$Overall.Homeless,
          by = list(homelessPop$state, homelessPop$Year),
          FUN = sum)

sumByState <- sumByState %>%
  rename(state = Group.1) %>%
  rename(Year = Group.2) %>%
  rename(TotalHomeless = x)
sumByState <- sumByState[-1,]

Map of US by Total Homeless Population

plot_usmap(data = sumByState, values = "TotalHomeless", color = "red", labels=TRUE) + 
  scale_fill_continuous(low = "white", high = "blue", name = "2019 Homeless Population (PIT)", label = scales::comma
  ) + theme(legend.position = "right")

Increases in Homelessness over Five Years

sumByStateYear %>% 
    arrange(desc(yearChange)) %>%
    slice(1:10) %>%
    ggplot(., aes(x=reorder(state, -yearChange), y=yearChange))+
              geom_bar(stat='identity')

Decreases in Homelessness over Five Years

sumByStateYear %>% 
    arrange(yearChange) %>%
    slice(1:10) %>%
    ggplot(., aes(x=reorder(state, yearChange), y=yearChange))+
              geom_bar(stat='identity')

Conclusions

I am quite proud of the work so far, this clean data set (and process) was a good investment
Interesting to see some of the exploratory results
The challenges I experienced in this project were mainly in formatting and reading data
Need to be more mindful of the small errors in spelling and code and not jump to conclusions