Setup R

If RStudio is not installed on your computer download it here: https://www.rstudio.com/products/rstudio/download/#download
Launch RStudio, File -> New Project -> Existing Directory -> Choose File Location
File -> New File -> R Notebook
Run R code in the R Notebook. Instructions will pop up explaining how to add new code chunks and execute code.
Data frames will appear on the right hand side in the Environment window. Click on the name to view data. Alternatively, run the command View(name of data). Click on the blue arrow to view the data types and variables of a data frame.

Install & Load Packages

install.packages("dplyr")
library(dplyr)

library(ggmap)
install.packages("ggmap")

Set Up Google Maps API

Note: You need to be logged in to a personal Google account — your northwestern Google account will not work. Also, you will need to enter a credit card however do not fret you will not be charged. Google gives you $300 of initial free credit and recurring $200 of monthly credit. Geocoding 1,000 addresses costs only $4.

Start at this link: https://developers.google.com/maps/gmp-get-started?authuser=1
Click Get Started
Select “Create Project”
Enter a “Project name.” Do not worry about the “Location”
We now need to enable “Billing” for the project
1. From the upper left navigation menu select “Billing”
2. Select “Manage Billing Accounts”
3. Under “My Billing Accounts” click “Create Account”
4. Setup your billing profile
5. Again from the upper left navigation menu select “Billing”
6. Click “Link a Billing Account” and select the “Billing Account” you just created
We now need to enable the Geocoding API for this project
1. From the upper left navigation meny select “APIs & Services”
2. Click “Enable APIS and Services”
3. Search “Geocoding API” select and enable
Finally we can get our API Key! From the “APIs & Services” menu select “Credentials”
1. Click “+ Create Credentials”
2. Select “API Key”
3. Your API key will now be displayed. Copy and paste your key below and click “Restrict Key”
4. Name your key and click save. Do not worry about the “Application restrictions” if you want you can restrict the key to the Geocoding API .

Register R with Google API Key

register_google("INSERT PERSONAL API KEY")

Import Provider Data

providers_raw <- read.csv("YOUR FILE PATH/Sample_Provider_Level_Data_2010.csv", stringsAsFactors = F)

Prepare Provider Data for Google Geocoding

To geocode our addresses we need create a single variable that has all the address information, like how you would search an address in Google Maps. For example, “1999 Campus Dr, Evanston, IL, 60208.” So we want to concatenate the variables “StreetAddress” “City” and “State” separated by commas. We will also include “Zip_4” if “Zip_4” does not equal 0

mutate - adds or modifies a new variable to a dataframe

paste - concatenates variables separated by characters of your choice set using the sep option

General R advice: Each time I modify a data frame (adding a new variable, selecting columns, etc.) I usually create a new data frame (here, I modify “providers_raw” and save the new data frame as “providers_geocode_prep”). This way I can track all the changes I make and fix errors more easily

providers_geocode_prep <- 
  providers_raw %>%
  mutate(ADDRESS_GEOCODE=paste(StreetAddress, City, State, sep = ", ")) %>%
# if Zip_4==0 then keep ADDRESS_GEOCODE as is, else concatenate Zip_4 to ADDRESS_GEOCODE
  mutate(ADDRESS_GEOCODE = if_else(Zip_4==0,
                                   ADDRESS_GEOCODE,
                                   paste(ADDRESS_GEOCODE,Zip_4,sep=", ")))

Geocode Addresses

mutate_geocode - pass this function an address and it will create variables with the latitude and longitude

providers_geocoded <- providers_geocode_prep %>% 
                      mutate_geocode(location = `ADDRESS_GEOCODE`)

Here I geocoded all of the addresses in one go because I only have 99 addresses. You might want to split yours into multiple batches of say 500 so you can make sure it is working properly. For example:

providers_geocoded_batch1 <- providers_geocode_prep[1:500,] %>% 
                             mutate_geocode(location = `ADDRESS_GEOCODE`)

providers_geocoded_batch2 <- providers_geocode_prep[501:1000,] %>% 
                             mutate_geocode(location = `ADDRESS_GEOCODE`)

providers_geocoded <- bind_rows(providers_geocoded_batch1,providers_geocoded_batch2)

Export Geocoded Provider Data to CSV

write.csv(providers_geocoded, 
          "YOUR FILE PATH/providers_geocoded.csv", 
          row.names = F)

Switch to ArcGIS Tutorial

Cleaning and Joining ACS Data

This suppresses scientific notation. I always like to include this because otherwise R may convert the GEOIDs to scientific notation causing problems

options(scipen=999)

Import ACS Data

acs_race_raw <- read.csv("YOUR FILE PATH/race2019.csv", 
                         stringsAsFactors = F)

Remove second row of the data (the messy descriptions of the variables)

acs_race_clean1 <- acs_race_raw[-c(1),]

The GEOIDs all start with the prefix 1400000US which we want to remove so that we only have the part that uniquely identifies each census tract (the last 11 numbers, equivalently position 10 to 22 of the string)

substr - extracts part of a character variable

acs_race_clean2 <- acs_race_clean1 %>%
                      mutate(GEOID = substr(GEO_ID,10,22))

Select & Name Columns of Interest

Typically use 5 Racial/Ethnic Categories:

Asian & Pacific Islander: B03002_006E + B03002_007E
Black: B03002_004E
Hispanic/Latino: B03002_012E
White: B03002_003E
Other: B03002_005E + B03002_008E + B03002_009E

Make sure to convert the variables from characters to numbers

acs_race_clean3 <- 
  acs_race_clean2 %>%
    mutate(
    ASIAN_PI = as.numeric(B03002_006E) + as.numeric(B03002_007E),
    BLACK = as.numeric(B03002_004E),
    HISP = as.numeric(B03002_012E),
    WHITE = as.numeric(B03002_003E),
    OTHER = as.numeric(B03002_005E) + as.numeric(B03002_008E) + as.numeric(B03002_009E),
    TOTAL = as.numeric(B03002_001E))

Calculate the percentage of each race that is in a given tract

Here we can see that 7 of the tracts have population of 0 which means that these percentages will be filled in with NAN

nrow(acs_race_clean3 %>% filter(TOTAL==0))

## [1] 7

acs_race_clean4 <- acs_race_clean3 %>%
                      mutate(PERC_ASIAN_PI = ASIAN_PI / TOTAL,
                             PERC_BLACK = BLACK / TOTAL,
                             PERC_HISP = HISP / TOTAL,
                             PERC_WHITE = WHITE / TOTAL,
                             PERC_OTHER = OTHER / TOTAL)

Now create our final clean ACS Racial/Ethnic dataset by selecting only the relevant and nicely named columns

acs_race_2019_clean <- acs_race_clean4 %>%
                          select(GEOID:PERC_OTHER)

head(acs_race_2019_clean, n=10)

Join ACS Data to Provider Data

Import Provider Joined with Chicago Communities and Census Tracts Data

provider_comm_tract <- read.csv("YOUR FILE PATH/chi_comm_providers_tracts_join.csv", stringsAsFactors = F)

Join in ACS Race data using a left_join by GEOID. This will join the ACS data to the provider data by the unique census tract GEOID keeping only the rows in the provider data!

We first need to make sure that GEOID is of the same data type in both of the dataframes. I always like making the GEOID a character rather than a number.

Change the GEOID to of type character in the provider data

provider_comm_tract_v2 <- provider_comm_tract %>%
                            mutate(GEOID = as.character(GEOID))

provider_comm_tract_acs <- left_join(provider_comm_tract_v2,
                                     acs_race_2019_clean,
                                     by="GEOID")

Now we have our complete dataset of providers linked to their Chicago community, census tract, and to Racial/Ethnic ACS Data! Hooray! :)

Summarizing the Data

How many providers are in each community?

provider_by_community <- provider_comm_tract_acs %>%
                         group_by(community) %>%
                         summarise(N_PROVIDERS = n()) %>%
                         arrange(desc(N_PROVIDERS))

head(provider_by_community, n = 20)

What’s the mean and median percentage of different Racial/Ethnic groups in provider neighborhoods within different Chicago communities?

provider_by_community_race <-  
  provider_comm_tract_acs %>%
    group_by(community) %>%
      summarise(
        N_PROVIDERS = n(),
        MEAN_PERC_BLACK = mean(PERC_BLACK, na.rm=T),
        MED_PERC_BLACK = median(PERC_BLACK, na.rm=T),
        TOTAL_BLACK = sum(BLACK, na.rm=T),
        MEAN_PERC_WHITE = mean(PERC_WHITE, na.rm=T),
        MED_PERC_WHITE = median(PERC_WHITE, na.rm=T),
        TOTAL_WHITE = sum(WHITE, na.rm=T),
        TOTAL_POP = sum(TOTAL, na.rm=T)) %>%
      arrange(desc(N_PROVIDERS))

head(provider_by_community_race, n=10)

RA Training: Geocoding & ACS Data Cleaning

Daniel Ross

7/15/2021