December 9, 2020

Overview and Project Setup

Setting the Stage

  • For this project, I wanted to clean and explore data I work with nearly everyday
  • If I had a nickle…
  • Learning opportunity for me, but as a useful tool for my colleagues and my work
  • I wanted to take the data as raw as possible from the various sources to ensure reproducibility not just for the project but for myself as these sets are published annually.

Data Sets - Excel, why did it have to be Excel?

#Import Data
hicDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\2007-2019-Housing-Inventory-Count-by-CoC.xlsx"
pitDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\2007-2019-Point-in-Time-Estimates-by-CoC.xlsx"
spmDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\System-Performance-Measures-Data-Since-FY-2015.xlsx"
awardDir <- "D:\\CUNY\\DATA 607 - Data Aq\\FinalProjectSource\\Awards.xlsx"

Getting a Clean Set and the Thrill Reading Excel

sheet <- excel_sheets(hicDir)
hicdf <- lapply(setNames(sheet, sheet), function(x) read_excel(hicDir, sheet=x, col_names=FALSE,skip =1))
hicdf <- bind_rows(hicdf, .id="Sheet")

names(hicdf) <- as.matrix(hicdf[1, ])
hicdf <- hicdf[-1, ]
hicdf[] <- lapply(hicdf, function(x) type.convert(as.character(x)))

names(hicdf)[names(hicdf) == "2019"] <- "Year"
write.csv(hicdf,"hic.csv")
sheet <- excel_sheets(pitDir)
pitdf <- lapply(setNames(sheet, sheet), function(x) read_excel(pitDir, sheet=x, col_names=TRUE))
pitdf <- bind_rows(pitdf, .id="Sheet")

names(pitdf)[names(pitdf) == "Sheet"] <- "Year"
write.csv(pitdf,"pit.csv")
sheet <- excel_sheets(spmDir)
spmdf <- lapply(setNames(sheet, sheet), function(x) read_excel(spmDir, sheet=x, col_names=TRUE))
spmdf <- bind_rows(spmdf, .id="Sheet")

names(spmdf) <- as.matrix(spmdf[1, ])
spmdf <- spmdf[-1, ]
spmdf[] <- lapply(spmdf, function(x) type.convert(as.character(x)))

names(spmdf)[names(spmdf) == "2015"] <- "Year"
write.csv(spmdf,"spm.csv")

Import Data

After writing to csv, I uploaded those sets to GitHub to pull into R. I found out quickly that R was treating all these columns as factors, and so I changed an argument have R read the data types correctly.

pit <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/pit.csv", stringsAsFactors = FALSE)
hic <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/hic.csv", stringsAsFactors = FALSE)
spm <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/spm.csv", stringsAsFactors = FALSE)
award <- read.csv("https://raw.githubusercontent.com/iscostello/DATA607-FinalProject/main/award.csv", stringsAsFactors = FALSE)

Joining Data

  • Using the joins, the four sets were combined into one mega set by CoC Community
  • More than a bit annoyed at HUD, from set to set they don’t have the same nomenclature for similar columns and do a very poor job with state designation.
  • Luckily, the CoC Number starts with the two character abbreviation for states, so I created a new column to use for mapping
joinSet <- left_join(pit, hic)
joinSet <- left_join(joinSet, spm)

joinSet <- joinSet %>%
  select(Year:Percent.with.Successful.Retention.or.Exit) %>%
  select(-State) %>%
  mutate(state = str_extract(CoC.Number,"^.{2}"))

Analysis and Mapping

To do a bit of analysis to verify the data, I wanted to map out the total population experiencing homelessness by State. This is where I discovered that the state column that came with the data was not very good, leading me to create another from scratch.

Using the package usmaps I was able to construct a very simple, if informative map of the US based on population. No suprise that the large states, California, New York, Texas, and Florida account for the most people experiencing homelessness according to the PIT.

homelessPop <- joinSet %>%
  select(Year, CoC.Category, Overall.Homeless, state)

sumByState <- aggregate(x = homelessPop$Overall.Homeless,
          by = list(homelessPop$state, homelessPop$Year),
          FUN = sum)

sumByState <- sumByState %>%
  rename(state = Group.1) %>%
  rename(Year = Group.2) %>%
  rename(TotalHomeless = x)
sumByState <- sumByState[-1,]

Map of US by Total Homeless Population

plot_usmap(data = sumByState, values = "TotalHomeless", color = "red", labels=TRUE) + 
  scale_fill_continuous(low = "white", high = "blue", name = "2019 Homeless Population (PIT)", label = scales::comma
  ) + theme(legend.position = "right")

Increases in Homelessness over Five Years

sumByStateYear %>% 
    arrange(desc(yearChange)) %>%
    slice(1:10) %>%
    ggplot(., aes(x=reorder(state, -yearChange), y=yearChange))+
              geom_bar(stat='identity')

Decreases in Homelessness over Five Years

sumByStateYear %>% 
    arrange(yearChange) %>%
    slice(1:10) %>%
    ggplot(., aes(x=reorder(state, yearChange), y=yearChange))+
              geom_bar(stat='identity')

Conclusions

  • I am quite proud of the work so far, this clean data set (and process) was a good investment
  • Interesting to see some of the exploratory results
  • The challenges I experienced in this project were mainly in formatting and reading data
  • Need to be more mindful of the small errors in spelling and code and not jump to conclusions