install.packages("dplyr")
library(dplyr)
library(ggmap)
install.packages("ggmap")
Note: You need to be logged in to a personal Google account — your northwestern Google account will not work. Also, you will need to enter a credit card however do not fret you will not be charged. Google gives you $300 of initial free credit and recurring $200 of monthly credit. Geocoding 1,000 addresses costs only $4.
Start at this link: https://developers.google.com/maps/gmp-get-started?authuser=1
Click Get Started
Select “Create Project”
Enter a “Project name.” Do not worry about the “Location”
We now need to enable “Billing” for the project
From the upper left navigation menu select “Billing”
Select “Manage Billing Accounts”
Under “My Billing Accounts” click “Create Account”
Setup your billing profile
Again from the upper left navigation menu select “Billing”
Click “Link a Billing Account” and select the “Billing Account” you just created
We now need to enable the Geocoding API for this project
From the upper left navigation meny select “APIs & Services”
Click “Enable APIS and Services”
Search “Geocoding API” select and enable
Finally we can get our API Key! From the “APIs & Services” menu select “Credentials”
Click “+ Create Credentials”
Select “API Key”
Your API key will now be displayed. Copy and paste your key below and click “Restrict Key”
Name your key and click save. Do not worry about the “Application restrictions” if you want you can restrict the key to the Geocoding API .
register_google("INSERT PERSONAL API KEY")
providers_raw <- read.csv("YOUR FILE PATH/Sample_Provider_Level_Data_2010.csv", stringsAsFactors = F)
To geocode our addresses we need create a single variable that has all the address information, like how you would search an address in Google Maps. For example, “1999 Campus Dr, Evanston, IL, 60208.” So we want to concatenate the variables “StreetAddress” “City” and “State” separated by commas. We will also include “Zip_4” if “Zip_4” does not equal 0
mutate - adds or modifies a new variable to a dataframe
paste - concatenates variables separated by characters of your choice set using the sep option
General R advice: Each time I modify a data frame (adding a new variable, selecting columns, etc.) I usually create a new data frame (here, I modify “providers_raw” and save the new data frame as “providers_geocode_prep”). This way I can track all the changes I make and fix errors more easily
providers_geocode_prep <-
providers_raw %>%
mutate(ADDRESS_GEOCODE=paste(StreetAddress, City, State, sep = ", ")) %>%
# if Zip_4==0 then keep ADDRESS_GEOCODE as is, else concatenate Zip_4 to ADDRESS_GEOCODE
mutate(ADDRESS_GEOCODE = if_else(Zip_4==0,
ADDRESS_GEOCODE,
paste(ADDRESS_GEOCODE,Zip_4,sep=", ")))
mutate_geocode - pass this function an address and it will create variables with the latitude and longitude
providers_geocoded <- providers_geocode_prep %>%
mutate_geocode(location = `ADDRESS_GEOCODE`)
Here I geocoded all of the addresses in one go because I only have 99 addresses. You might want to split yours into multiple batches of say 500 so you can make sure it is working properly. For example:
providers_geocoded_batch1 <- providers_geocode_prep[1:500,] %>%
mutate_geocode(location = `ADDRESS_GEOCODE`)
providers_geocoded_batch2 <- providers_geocode_prep[501:1000,] %>%
mutate_geocode(location = `ADDRESS_GEOCODE`)
providers_geocoded <- bind_rows(providers_geocoded_batch1,providers_geocoded_batch2)
write.csv(providers_geocoded,
"YOUR FILE PATH/providers_geocoded.csv",
row.names = F)
This suppresses scientific notation. I always like to include this because otherwise R may convert the GEOIDs to scientific notation causing problems
options(scipen=999)
acs_race_raw <- read.csv("YOUR FILE PATH/race2019.csv",
stringsAsFactors = F)
Remove second row of the data (the messy descriptions of the variables)
acs_race_clean1 <- acs_race_raw[-c(1),]
The GEOIDs all start with the prefix 1400000US which we want to remove so that we only have the part that uniquely identifies each census tract (the last 11 numbers, equivalently position 10 to 22 of the string)
substr - extracts part of a character variable
acs_race_clean2 <- acs_race_clean1 %>%
mutate(GEOID = substr(GEO_ID,10,22))
Typically use 5 Racial/Ethnic Categories:
Make sure to convert the variables from characters to numbers
acs_race_clean3 <-
acs_race_clean2 %>%
mutate(
ASIAN_PI = as.numeric(B03002_006E) + as.numeric(B03002_007E),
BLACK = as.numeric(B03002_004E),
HISP = as.numeric(B03002_012E),
WHITE = as.numeric(B03002_003E),
OTHER = as.numeric(B03002_005E) + as.numeric(B03002_008E) + as.numeric(B03002_009E),
TOTAL = as.numeric(B03002_001E))
Calculate the percentage of each race that is in a given tract
Here we can see that 7 of the tracts have population of 0 which means that these percentages will be filled in with NAN
nrow(acs_race_clean3 %>% filter(TOTAL==0))
## [1] 7
acs_race_clean4 <- acs_race_clean3 %>%
mutate(PERC_ASIAN_PI = ASIAN_PI / TOTAL,
PERC_BLACK = BLACK / TOTAL,
PERC_HISP = HISP / TOTAL,
PERC_WHITE = WHITE / TOTAL,
PERC_OTHER = OTHER / TOTAL)
Now create our final clean ACS Racial/Ethnic dataset by selecting only the relevant and nicely named columns
acs_race_2019_clean <- acs_race_clean4 %>%
select(GEOID:PERC_OTHER)
head(acs_race_2019_clean, n=10)
provider_comm_tract <- read.csv("YOUR FILE PATH/chi_comm_providers_tracts_join.csv", stringsAsFactors = F)
Join in ACS Race data using a left_join by GEOID. This will join the ACS data to the provider data by the unique census tract GEOID keeping only the rows in the provider data!
We first need to make sure that GEOID is of the same data type in both of the dataframes. I always like making the GEOID a character rather than a number.
Change the GEOID to of type character in the provider data
provider_comm_tract_v2 <- provider_comm_tract %>%
mutate(GEOID = as.character(GEOID))
provider_comm_tract_acs <- left_join(provider_comm_tract_v2,
acs_race_2019_clean,
by="GEOID")
Now we have our complete dataset of providers linked to their Chicago community, census tract, and to Racial/Ethnic ACS Data! Hooray! :)
How many providers are in each community?
provider_by_community <- provider_comm_tract_acs %>%
group_by(community) %>%
summarise(N_PROVIDERS = n()) %>%
arrange(desc(N_PROVIDERS))
head(provider_by_community, n = 20)
What’s the mean and median percentage of different Racial/Ethnic groups in provider neighborhoods within different Chicago communities?
provider_by_community_race <-
provider_comm_tract_acs %>%
group_by(community) %>%
summarise(
N_PROVIDERS = n(),
MEAN_PERC_BLACK = mean(PERC_BLACK, na.rm=T),
MED_PERC_BLACK = median(PERC_BLACK, na.rm=T),
TOTAL_BLACK = sum(BLACK, na.rm=T),
MEAN_PERC_WHITE = mean(PERC_WHITE, na.rm=T),
MED_PERC_WHITE = median(PERC_WHITE, na.rm=T),
TOTAL_WHITE = sum(WHITE, na.rm=T),
TOTAL_POP = sum(TOTAL, na.rm=T)) %>%
arrange(desc(N_PROVIDERS))
head(provider_by_community_race, n=10)