library(tidyverse)
library(readxl)
library(knitr)
library(kableExtra)
knitr::opts_chunk$set(warning = FALSE, message = FALSE, eval = FALSE)
This document outlines the process by which data is cleaned in preparation for geocoding and visualization. The first and most basic step is to run a Vistashare query that has a column of addresses; you can include as many other columns as you want.
This cleaning process is simply an example; you may perform more or less cleaning, keep certain address types, and so on. The ultimate goal is to run addresses through a geocoding service (ArcGIS, MapQuest, etc.) and obtain latitude and longitude coordinates. It is up to the map maintainer to decide how many addresses can acceptably fail to geocode or be lost in the cleaning process.
One more time, to hammer home the simplicity of the end goal: all we are trying to do is produce a .csv file that has columns for latitude and longitude plus any other variables you feel are relevant (loan amount, ethnicity, activity status, etc.).
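For concreteness, here is a minimal sketch of the kind of file we are aiming for. The column names and values are illustrative only, borrowed from tables later in this document:

#a toy version of the target output: one row per business with coordinates
#(column names here are illustrative, not the exact ones produced below)
target <- tibble(
  System.Name.ID = c(5623941, 5623874),
  latitude  = c(45.46440, 45.52220),
  longitude = c(-122.6466, -122.7849),
  loan_amount = c(500, 2000)
)
write.csv(target, "Data/example_target.csv", row.names = FALSE)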
Issues with the addresses in the data include:

- embedded newline characters
- empty address fields
- entries containing only letters or only numbers
- PO boxes, which do not map to a street location
- apartment, suite, and unit numbers, which can trip up geocoders

Examples of these bad addresses can be seen below.
The raw data loaded into R:
#Load the raw query from Vistashare
raw <- read.csv("Data/all_addresses_012218.csv")
raw[1:5,1:4] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| ï..Full.Name.Last.First.Mdl | System.Name.ID | Role | Full.Address |
|---|---|---|---|
| Your Champion | 5624017 | Client’s Business | |
| Lauren Taylor | 5623941 | Client’s Business | 1721 SE Tacoma St. #306 Portland, OR 97202 |
| Herron Medicine LLC | 5623874 | Client’s Business | 10535 NW Flotoma Portland, OR |
| Always Available and Reliable Towing | 5623873 | Client’s Business | 14620 SE Hwy 212 Clackamas, OR 97015 |
| K A Consulting | 5623196 | Client’s Business | 3625 NE Rodney Ave Portland, OR 97212 |
The important fields are the address and the unique ID for each business (this query outputs it as System.Name.ID). I proceed to clean the Full.Address field, which is a string column.
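Note the mangled first column name, ï..Full.Name.Last.First.Mdl: that is a UTF-8 byte-order mark from the Vistashare export leaking into the header when read.csv reads the file. The rename() below works around it, but you can also avoid it at read time by specifying the encoding (in which case the column comes through as Full.Name.Last.First.Mdl and the rename target changes accordingly):

#reading with the UTF-8-BOM file encoding strips the byte-order mark
raw <- read.csv("Data/all_addresses_012218.csv", fileEncoding = "UTF-8-BOM")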
My first task was to remove all newline characters; they cause nothing but problems. If an output .csv file looks funny, the cause is often newline characters or commas within a field. Other cleaning operations included filtering out rows whose addresses were empty, contained only letters or only numbers, or were PO boxes. The following code accomplishes this using mostly dplyr and grepl.
#regex matching apartment/suite/unit designations to strip from addresses
#(avoid calling this regexpr, which would shadow the base R function)
unit_pattern <- "(?<=[,[:space:]])Apt[\\.[:space:]]*#{0,1}[A-Za-z0-9]+|[,[:space:]]*SUITE[[:space:]]+[0-9A-Za-z]+|#\\s*[0-9A-Za-z]+|ste\\.*\\s+[0-9A-Za-z\\.-]+"
#eliminate duplicate rows and perform various regex substitution and filtering
address_info <- raw %>%
  distinct() %>%
  #replace newlines with commas, then strip apartment/suite designations
  mutate(Full.Address = gsub("\\n", ", ", Full.Address, ignore.case = TRUE, perl = TRUE)) %>%
  mutate(Full.Address = gsub(unit_pattern, "", Full.Address, ignore.case = TRUE, perl = TRUE)) %>%
  #drop empty addresses and PO boxes; keep only addresses that start with a
  #street number and contain both digits and letters
  filter(Full.Address != "",
         !grepl("P[\\.[:space:]]*O[\\.[:space:]]+Box", Full.Address, ignore.case = TRUE),
         !grepl("^[A-Z]|^\\n", Full.Address, ignore.case = TRUE),
         grepl("[0-9]", Full.Address),
         grepl("[A-Z]", Full.Address, ignore.case = TRUE)) %>%
  #fix the BOM-mangled name of the first column
  rename(full_name = ï..Full.Name.Last.First.Mdl)
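Before sending anything to the geocoder, it is worth checking how much the cleaning discarded. This quick count (my addition, not part of the pipeline above) makes the trade-off visible:

#compare row counts before and after cleaning
nrow(raw)
nrow(address_info)
nrow(raw) - nrow(address_info) #rows lost to deduplication and filtering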
The data is then fed into a geocoding service (in my case, ArcGIS's was available), which returned latitude/longitude coordinates. The resulting table was exported and is read into R below. I filter the data to Oregon and Washington only and write the result to a .csv file. A snapshot of the data written to disk is shown below.
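If you do not have ArcGIS available, one alternative is the tidygeocoder package, which queries free services such as the US Census geocoder or Nominatim directly from R. This is a sketch of that route, not what I used; free services tend to be slower and may match fewer addresses:

#a sketch using tidygeocoder (not part of my workflow)
library(tidygeocoder)
geocoded <- address_info %>%
  geocode(address = Full.Address, method = "census",
          lat = latitude, long = longitude)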
##INITIAL OUTPUT FROM ARCGIS##
##THIS SHOULD INCLUDE LAT-LONG, OTHER GEODATA, AND A UNIQUE ID FOR EACH BUSINESS##
allfields <- read_excel("C:/Users/Daniel/Google Drive/MercyCorps/MCNW/Mapping/Data/output_table_012218.xls")
#### LIMIT TO OREGON AND WASHINGTON ####
trimmed <- allfields %>% select(LongLabel, Region, Subregion, Nbrhd, X, Y, DisplayX, DisplayY, 70:83) %>%
filter(Region %in% c("Oregon", "Washington"))
trimmed[1:5,1:8] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| LongLabel | Region | Subregion | Nbrhd | X | Y | DisplayX | DisplayY |
|---|---|---|---|---|---|---|---|
| 1721 SE Tacoma St, Portland, OR, 97202, USA | Oregon | Multnomah County | Sellwood-Moreland Improvemnt League | -122.6466 | 45.46440 | -122.6466 | 45.46440 |
| 10535 NW Flotoma Dr, Portland, OR, 97229, USA | Oregon | Washington County | Cedar Hills-Cedar Mill | -122.7849 | 45.52220 | -122.7846 | 45.52242 |
| 14620 SE Highway 212, Clackamas, OR, 97015, USA | Oregon | Clackamas County | Rock Creek | -122.5129 | 45.40997 | -122.5129 | 45.40954 |
| 3625 NE Rodney Ave, Portland, OR, 97212, USA | Oregon | Multnomah County | Boise | -122.6647 | 45.54923 | -122.6650 | 45.54924 |
| 3944 SE 32nd Ave, Portland, OR, 97202, USA | Oregon | Multnomah County | Creston-Kenilworth | -122.6327 | 45.49407 | -122.6324 | 45.49407 |
#write a .csv file which can be used in Power BI
write.csv(trimmed, "Data/coords_metadata_012218.csv")
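One caveat about write.csv: by default it writes row names, which come back as an extra unnamed column (shown as X) when the file is re-read with read.csv. If that bothers you, pass row.names = FALSE, as below; note that the positional column references later in this document (e.g. select(-c(14:23))) assume the files as originally written.

#optional: suppress the row-name column when writing
write.csv(trimmed, "Data/coords_metadata_012218.csv", row.names = FALSE)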
This section details some of the steps I took to add relevant data to the map. The important takeaway here is that if you want to add data, you need to do one of two things:

1. Include the extra columns in the original address query, so they ride along through the geocoding step, or
2. Run a separate Vistashare query and attach it to the geocoded data using the unique business ID.

The two additions below use the second option.
A Vistashare query was run to get loan amounts for each business, with the intent of displaying total loan amounts. Here is the result of the query, loaded into R and with rows having no loan amount removed:
loan_amounts <- read_csv("Data/biz_loan_amounts.csv")
loan_amounts %>% filter(!is.na(`Loan Amount`)) %>% head(5) %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| System Name ID | Role | | Rel_8. DLM Borrower.Loan Amount | Loan Amount | Loan ID Number |
|---|---|---|---|---|---|
| 543143 | Client’s Business | NA | NA | 500 | 0069 |
| 543143 | Client’s Business | NA | NA | 500 | 0069 |
| 543143 | Client’s Business | NA | NA | 500 | 0069 |
| 543143 | Client’s Business | NA | NA | 500 | 0069 |
| 543144 | Client’s Business | NA | NA | 500 | 0070 |
#read_csv keeps spaces in column names, so they must be backticked
loan_amounts <- loan_amounts %>% filter(!is.na(`Loan Amount`)) %>%
  distinct(`Loan ID Number`, .keep_all = TRUE) %>%
  group_by(`System Name ID`) %>%
  summarise(total_loans = sum(`Loan Amount`))
loan_amounts %>% head() %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| System Name ID | total_loans |
|---|---|
| 543143 | 2000 |
| 543144 | 2000 |
| 543148 | 2000 |
| 543153 | 2000 |
| 543154 | 1000 |
| 543215 | 15200 |
biz_locs <- read.csv("Data/coords_metadata_012218.csv")
#join the loan totals onto the geocoded businesses (System_Nam is the ID
#field as truncated by ArcGIS) and drop columns not needed for the map
biz_locs <- biz_locs %>%
  left_join(loan_amounts, by = c("System_Nam" = "System Name ID")) %>%
  select(-c(14:23))
write.csv(biz_locs, "Data/coords_loansums_0214.csv")
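A quick sanity check I would suggest (not part of the original script) is to count how many geocoded businesses actually picked up a loan total:

#how many businesses matched a loan record, and how many did not
sum(!is.na(biz_locs$total_loans))
sum(is.na(biz_locs$total_loans))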
The next Vistashare query was run to get information on the client type and matched IDA funds. This was necessary because of limitations on what can be queried using related fields in Vistashare. The important columns in this query included information on whether the client was active, how much (if any) matched IDA funds they had received, and what type of service we had provided to them.
In many cases, a business could have been served by MCNW in multiple ways (i.e., with loans, IDAs, or classes), meaning there were multiple rows corresponding to a given business. In this case we can't, and don't want to, summarize the data (how would we average or sum two roles?), so we keep it as a separate table and define a relationship within Power BI using the business ID field.
IDA <- read.csv("Data/IDA_join_cols_0214.csv")
biz_locs <- read.csv("Data/coords_loansums_0214.csv")
#attach coordinates to each IDA record, drop records that never geocoded,
#and keep one row per business
IDA_coords <- IDA %>%
  left_join(biz_locs, by = c("Related.System.Name.ID" = "System_Nam")) %>%
  filter(!is.na(DisplayX)) %>%
  distinct(System.Name.ID, .keep_all = TRUE)
write.csv(IDA_coords, "Data/IDA_and_loanapp_locs.csv")
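As a final check (again, a suggestion rather than part of the original script), dplyr's semi_join makes it easy to see how many IDA records matched a geocoded business before the filter and distinct steps discard the rest:

#count IDA rows whose business ID matched a geocoded location
IDA %>%
  semi_join(biz_locs, by = c("Related.System.Name.ID" = "System_Nam")) %>%
  nrow()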