library(tidyverse)
library(readxl)
library(knitr)
library(kableExtra)
knitr::opts_chunk$set(warning = FALSE, message = FALSE, eval = FALSE)
This document outlines the process by which data is cleaned in preparation for geocoding and visualization. The first and most basic step is to run a Vistashare query that has a column of addresses; you can include as many other columns as you want.
This cleaning process is simply an example; you may perform more or less cleaning, keep certain address types, and so on. The ultimate goal is to run addresses through a geocoding service (ArcGIS, MapQuest, etc.) and obtain latitude and longitude coordinates. It is up to the map maintainer to decide how many addresses can acceptably fail to geocode or be lost in the cleaning process.
One more time, to hammer home the simplicity of the end goal: all we are trying to do is produce a .csv file that has columns for latitude and longitude plus any other variables you feel are relevant (loan amount, ethnicity, activity status, etc.).
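For concreteness, here is a minimal sketch of the kind of file we are aiming for. The column names and values are illustrative only, borrowed from tables later in this document:

#a toy version of the target output: one row per business with coordinates
#(column names here are illustrative, not the exact ones produced below)
target <- tibble(
  System.Name.ID = c(5623941, 5623874),
  latitude  = c(45.46440, 45.52220),
  longitude = c(-122.6466, -122.7849),
  loan_amount = c(500, 2000)
)
write.csv(target, "Data/example_target.csv", row.names = FALSE)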
Issues with the addresses in the data include:

- embedded newline characters
- empty address fields
- entries containing only letters or only numbers
- PO boxes, which do not map to a street location
- apartment, suite, and unit numbers, which can trip up geocoders

Examples of these bad addresses can be seen below.
The raw data loaded into R:
#Load the raw query from Vistashare
raw <- read.csv("Data/all_addresses_012218.csv")
raw[1:5,1:4] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| ï..Full.Name.Last.First.Mdl | System.Name.ID | Role | Full.Address |
|---|---|---|---|
| Your Champion | 5624017 | Client’s Business | |
| Lauren Taylor | 5623941 | Client’s Business | 1721 SE Tacoma St. #306 Portland, OR 97202 |
| Herron Medicine LLC | 5623874 | Client’s Business | 10535 NW Flotoma Portland, OR |
| Always Available and Reliable Towing | 5623873 | Client’s Business | 14620 SE Hwy 212 Clackamas, OR 97015 |
| K A Consulting | 5623196 | Client’s Business | 3625 NE Rodney Ave Portland, OR 97212 |
The important fields are the address and the unique ID for each business (this query outputs it as System.Name.ID). I proceed to clean the Full.Address field, which is a string column.
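Note the mangled first column name, ï..Full.Name.Last.First.Mdl: that is a UTF-8 byte-order mark from the Vistashare export leaking into the header when read.csv reads the file. The rename() below works around it, but you can also avoid it at read time by specifying the encoding (in which case the column comes through as Full.Name.Last.First.Mdl and the rename target changes accordingly):

#reading with the UTF-8-BOM file encoding strips the byte-order mark
raw <- read.csv("Data/all_addresses_012218.csv", fileEncoding = "UTF-8-BOM")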
My first task was to remove all newline characters; they cause nothing but problems. If an output .csv file looks funny, the cause is often newline characters or commas within a field. Other cleaning operations included filtering out rows whose addresses were empty, contained only letters or only numbers, or were PO boxes. The following code accomplishes this using mostly dplyr and grepl.
#regex matching apartment/suite/unit designations to strip from addresses
#(avoid calling this regexpr, which would shadow the base R function)
unit_pattern <- "(?<=[,[:space:]])Apt[\\.[:space:]]*#{0,1}[A-Za-z0-9]+|[,[:space:]]*SUITE[[:space:]]+[0-9A-Za-z]+|#\\s*[0-9A-Za-z]+|ste\\.*\\s+[0-9A-Za-z\\.-]+"
#eliminate duplicate rows and perform various regex substitution and filtering
address_info <- raw %>%
  distinct() %>%
  #replace newlines with commas, then strip apartment/suite designations
  mutate(Full.Address = gsub("\\n", ", ", Full.Address, ignore.case = TRUE, perl = TRUE)) %>%
  mutate(Full.Address = gsub(unit_pattern, "", Full.Address, ignore.case = TRUE, perl = TRUE)) %>%
  #drop empty addresses and PO boxes; keep only addresses that start with a
  #street number and contain both digits and letters
  filter(Full.Address != "",
         !grepl("P[\\.[:space:]]*O[\\.[:space:]]+Box", Full.Address, ignore.case = TRUE),
         !grepl("^[A-Z]|^\\n", Full.Address, ignore.case = TRUE),
         grepl("[0-9]", Full.Address),
         grepl("[A-Z]", Full.Address, ignore.case = TRUE)) %>%
  #fix the BOM-mangled name of the first column
  rename(full_name = ï..Full.Name.Last.First.Mdl)
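Before sending anything to the geocoder, it is worth checking how much the cleaning discarded. This quick count (my addition, not part of the pipeline above) makes the trade-off visible:

#compare row counts before and after cleaning
nrow(raw)
nrow(address_info)
nrow(raw) - nrow(address_info) #rows lost to deduplication and filtering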
The data is then fed into a geocoding service (in my case, ArcGIS's was available), which returned latitude/longitude coordinates. The resulting table was exported and is read into R below. I filter the data to Oregon and Washington only and write the result to a .csv file. A snapshot of the data written to disk is shown below.
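If you do not have ArcGIS available, one alternative is the tidygeocoder package, which queries free services such as the US Census geocoder or Nominatim directly from R. This is a sketch of that route, not what I used; free services tend to be slower and may match fewer addresses:

#a sketch using tidygeocoder (not part of my workflow)
library(tidygeocoder)
geocoded <- address_info %>%
  geocode(address = Full.Address, method = "census",
          lat = latitude, long = longitude)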
##INITIAL OUTPUT FROM ARCGIS##
##THIS SHOULD INCLUDE LAT-LONG, OTHER GEODATA, AND A UNIQUE ID FOR EACH BUSINESS##
allfields <- read_excel("C:/Users/Daniel/Google Drive/MercyCorps/MCNW/Mapping/Data/output_table_012218.xls")
#### LIMIT TO OREGON AND WASHINGTON ####
trimmed <- allfields %>% select(LongLabel, Region, Subregion, Nbrhd, X, Y, DisplayX, DisplayY, 70:83) %>%
filter(Region %in% c("Oregon", "Washington"))
trimmed[1:5,1:8] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| LongLabel | Region | Subregion | Nbrhd | X | Y | DisplayX | DisplayY |
|---|---|---|---|---|---|---|---|
| 1721 SE Tacoma St, Portland, OR, 97202, USA | Oregon | Multnomah County | Sellwood-Moreland Improvemnt League | -122.6466 | 45.46440 | -122.6466 | 45.46440 |
| 10535 NW Flotoma Dr, Portland, OR, 97229, USA | Oregon | Washington County | Cedar Hills-Cedar Mill | -122.7849 | 45.52220 | -122.7846 | 45.52242 |
| 14620 SE Highway 212, Clackamas, OR, 97015, USA | Oregon | Clackamas County | Rock Creek | -122.5129 | 45.40997 | -122.5129 | 45.40954 |
| 3625 NE Rodney Ave, Portland, OR, 97212, USA | Oregon | Multnomah County | Boise | -122.6647 | 45.54923 | -122.6650 | 45.54924 |
| 3944 SE 32nd Ave, Portland, OR, 97202, USA | Oregon | Multnomah County | Creston-Kenilworth | -122.6327 | 45.49407 | -122.6324 | 45.49407 |
#write a .csv file which can be used in Power BI
write.csv(trimmed, "Data/coords_metadata_012218.csv")
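One caveat about write.csv: by default it writes row names, which come back as an extra unnamed column (shown as X) when the file is re-read with read.csv. If that bothers you, pass row.names = FALSE, as below; note that the positional column references later in this document (e.g. select(-c(14:23))) assume the files as originally written.

#optional: suppress the row-name column when writing
write.csv(trimmed, "Data/coords_metadata_012218.csv", row.names = FALSE)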
This section details some of the steps I took to add relevant data to the map. The important takeaway here is that if you want to add data, you need to do one of two things:

1. Include the extra columns in the original address query, so they ride along through the geocoding step, or
2. Run a separate Vistashare query and attach it to the geocoded data using the unique business ID.

The two additions below use the second option.
A Vistashare query was run to get loan amounts for each business, with the intent of displaying total loan amounts. Here is the result of the query, loaded into R and with rows having no loan amount removed:
loan_amounts <- read_csv("Data/biz_loan_amounts.csv")
loan_amounts %>% filter(!is.na(`Loan Amount`)) %>% head(5) %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| System Name ID | Role | | Rel_8. DLM Borrower.Loan Amount | Loan Amount | Loan ID Number |
|---|---|---|---|---|---|
| 543143 | Client’s Business | NA | NA | 500 | 0069 |
| 543143 | Client’s Business | NA | NA | 500 | 0069 |
| 543143 | Client’s Business | NA | NA | 500 | 0069 |
| 543143 | Client’s Business | NA | NA | 500 | 0069 |
| 543144 | Client’s Business | NA | NA | 500 | 0070 |
#read_csv keeps spaces in column names, so they must be backticked
loan_amounts <- loan_amounts %>% filter(!is.na(`Loan Amount`)) %>%
  distinct(`Loan ID Number`, .keep_all = TRUE) %>%
  group_by(`System Name ID`) %>%
  summarise(total_loans = sum(`Loan Amount`))
loan_amounts %>% head() %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| System Name ID | total_loans |
|---|---|
| 543143 | 2000 |
| 543144 | 2000 |
| 543148 | 2000 |
| 543153 | 2000 |
| 543154 | 1000 |
| 543215 | 15200 |
biz_locs <- read.csv("Data/coords_metadata_012218.csv")
#join the loan totals onto the geocoded businesses (System_Nam is the ID
#field as truncated by ArcGIS) and drop columns not needed for the map
biz_locs <- biz_locs %>%
  left_join(loan_amounts, by = c("System_Nam" = "System Name ID")) %>%
  select(-c(14:23))
write.csv(biz_locs, "Data/coords_loansums_0214.csv")
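A quick sanity check I would suggest (not part of the original script) is to count how many geocoded businesses actually picked up a loan total:

#how many businesses matched a loan record, and how many did not
sum(!is.na(biz_locs$total_loans))
sum(is.na(biz_locs$total_loans))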
The next Vistashare query was run to get information on the client type and matched IDA funds. This was necessary because of limitations on what can be queried using related fields in Vistashare. The important columns in this query included information on whether the client was active, how much (if any) matched IDA funds they had received, and what type of service we had provided to them.
In many cases, a business could have been served by MCNW in multiple ways (i.e., with loans, IDAs, or classes), meaning there were multiple rows corresponding to a given business. In this case we can't, and don't want to, summarize the data (how would we average or sum two roles?), so we keep it as a separate table and define a relationship within Power BI using the business ID field.
IDA <- read.csv("Data/IDA_join_cols_0214.csv")
biz_locs <- read.csv("Data/coords_loansums_0214.csv")
#attach coordinates to each IDA record, drop records that never geocoded,
#and keep one row per business
IDA_coords <- IDA %>%
  left_join(biz_locs, by = c("Related.System.Name.ID" = "System_Nam")) %>%
  filter(!is.na(DisplayX)) %>%
  distinct(System.Name.ID, .keep_all = TRUE)
write.csv(IDA_coords, "Data/IDA_and_loanapp_locs.csv")
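As a final check (again, a suggestion rather than part of the original script), dplyr's semi_join makes it easy to see how many IDA records matched a geocoded business before the filter and distinct steps discard the rest:

#count IDA rows whose business ID matched a geocoded location
IDA %>%
  semi_join(biz_locs, by = c("Related.System.Name.ID" = "System_Nam")) %>%
  nrow()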