How do we access geographic data in R, especially at very local levels? This short tutorial describes how to gather neighborhood level data using the osmdata package, and then identify the neighborhood in which each of a series of points may fall.
Try this tutorial out yourself in RStudio.Cloud here! (Note: sometimes the website can be persnickety, so a direct click doesn’t work. Please right click on [here] above and click ‘Open in a New Tab’. Then, it should load properly.)
First, let’s install the following packages:
install.packages(c("tidyverse", "sf", "osmdata"))Then, let’s load the following packages.
library(tidyverse) # for data wrangling
library(sf) # for spatial data management
library(osmdata) # for downloading data from OpenStreetMaposmdataWe’re going to use the osmdata package to extract geospatial information about our city of interest. In this case, we’re looking at Caracas, Venezuela.
First, we need to extract a bounding box - the longitude-latitude coordinates for a box around our area of interest. OpenStreetMap (OSM) needs that box to zoom in on just the information relevant to us. We’ll use the getbb() function to get that data and save it in mybox.
mybox <- getbb(place_name = "Caracas, Venezuela")This creates a small matrix, which we can view below.
mybox## min max
## x -67.16228 -66.69088
## y 10.34050 10.56415
Next, we’re going to run a series of functions.
opq(mybox) function, we will tell OSM to run a query within the geographic area of mybox. - Using add_osm_feature(), we will ask it to track down data that refers to administrative boundaries (key = 'admin_level'), rather than, say, coffeeshops.value = '4').osmdata_sf() to extract data from our query in the sf package’s preferred format.with(osm_multipolygons) to extract the osm_multipolygons data.frame from the list output from osmdata_sf().adm4. (You can name it anything.)adm4 <- opq(mybox) %>%
add_osm_feature(key = 'admin_level',
value = "4") %>%
osmdata_sf() %>%
with(osm_multipolygons)Let’s check out its contents! - Lots of information. - The top chunk is all just metadata. - The rows refer to entries in your dataset - each is a specific set of polygons - eg. the polygon for Miranda, La Guaira, Distrito Capital, and other regions.
adm4 %>% head()## Simple feature collection with 3 features and 21 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -67.41273 ymin: 9.934589 xmax: -65.4209 ymax: 10.87641
## Geodetic CRS: WGS 84
## # A tibble: 3 × 22
## osm_id name ISO3166.2 admin_level alt_name border_type boundary name.en
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 272673 Miranda VE-M 4 "" entidad fe… adminis… Mirand…
## 2 2442704 La Guaira VE-X 4 "Vargas" entidad fe… adminis… Vargas…
## 3 2444378 Distrito … VE-A 4 "" entidad fe… adminis… Capita…
## # … with 14 more variables: name.es <chr>, name.ja <chr>, old_name <chr>,
## # place <chr>, population <chr>, population.date <chr>, ref <chr>,
## # source.population <chr>, source.ref <chr>, start_date <chr>, type <chr>,
## # wikidata <chr>, wikipedia <chr>, geometry <MULTIPOLYGON [°]>
Finally, we can visualize this layer in ggplot(), using the helper function from the sf package called geom_sf(). This special function automatically interprets the geometry column in sf data.frames to locate each point/polygon in space on the map.
ggplot() +
geom_sf(data = adm4, color = "black", size = 1, fill = NA)Let’s repeat this process, but extract administrative polygons at lower levels!
adm6 <- opq(mybox) %>%
add_osm_feature(key = 'admin_level',
value = "6") %>%
osmdata_sf() %>%
with(osm_multipolygons)View the contents.
adm6 %>% head()## Simple feature collection with 6 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -67.41273 ymin: 10.31421 xmax: -66.30938 ymax: 10.6402
## Geodetic CRS: WGS 84
## # A tibble: 6 × 19
## osm_id name admin_level alt_name border_type boundary old_name place
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2442703 Municipio Va… 6 "" "" adminis… "" ""
## 2 2458069 Municipio Ca… 6 "" "" adminis… "" ""
## 3 2458293 Municipio El… 6 "" "" adminis… "" ""
## 4 3321735 Municipio Li… 6 "Munici… "municipio" adminis… "Depart… "cou…
## 5 3335672 Municipio Ch… 6 "" "" adminis… "" ""
## 6 3335760 Municipio Ba… 6 "" "" adminis… "" ""
## # … with 11 more variables: population <chr>, population.date <chr>,
## # postal_code <chr>, source <chr>, source.population <chr>, start_date <chr>,
## # type <chr>, wikidata <chr>, wikimedia_commons <chr>, wikipedia <chr>,
## # geometry <MULTIPOLYGON [°]>
adm7 <- opq(mybox) %>%
add_osm_feature(key = 'admin_level',
value = "7") %>%
osmdata_sf() %>%
with(osm_multipolygons)View the contents.
adm7 %>% head()## Simple feature collection with 6 features and 17 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -67.41273 ymin: 10.38767 xmax: -66.77939 ymax: 10.63164
## Geodetic CRS: WGS 84
## # A tibble: 6 × 18
## osm_id name admin_level boundary cap_munic cap_parroq gid hectareas
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2442705 Parroquia C… 7 adminis… "" "" "" ""
## 2 2442706 Parroquia C… 7 adminis… "" "" "" ""
## 3 2442707 Parroquia C… 7 adminis… "" "" "" ""
## 4 2442709 Parroquia C… 7 adminis… "" "" "" ""
## 5 2442710 Parroquia E… 7 adminis… "La Guai… "El Junko" "109… "2037.59…
## 6 2442711 Parroquia L… 7 adminis… "" "" "" ""
## # … with 10 more variables: old_name <chr>, population <chr>,
## # population.date <chr>, short_name <chr>, source <chr>,
## # source.population <chr>, type <chr>, wikidata <chr>, wikipedia <chr>,
## # geometry <MULTIPOLYGON [°]>
adm8 <- opq(mybox) %>%
add_osm_feature(key = 'admin_level',
value = "8") %>%
osmdata_sf() %>%
with(osm_multipolygons)View the contents.
adm8 %>% head()## Simple feature collection with 1 feature and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -67.16228 ymin: 10.3405 xmax: -66.69088 ymax: 10.56415
## Geodetic CRS: WGS 84
## # A tibble: 1 × 7
## osm_id name admin_level boundary place type geometry
## <chr> <chr> <chr> <chr> <chr> <chr> <MULTIPOLYGON [°]>
## 1 11219583 Caracas 8 administra… city boun… (((-66.90885 10.40716, -…
adm9 <- opq(mybox) %>%
add_osm_feature(key = 'admin_level',
value = "9") %>%
osmdata_sf() %>%
with(osm_multipolygons)View the contents.
adm9 %>% head()## Simple feature collection with 6 features and 21 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -66.86208 ymin: 10.48838 xmax: -66.83773 ymax: 10.5134
## Geodetic CRS: WGS 84
## # A tibble: 6 × 22
## osm_id name admin_level alt_name boundary fixme.wikipedia is_in
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 5172972 Sector Los Palos … 9 "" adminis… "" ""
## 2 5174401 Sector Altamira 9 "" adminis… "" ""
## 3 5177155 Sector La Floresta 9 "" adminis… "" ""
## 4 5177401 Sector El Dorado 9 "" adminis… "" ""
## 5 5177773 Sector Bello Campo 9 "" adminis… "" ""
## 6 5183056 Sector La Castell… 9 "" adminis… "" ""
## # … with 15 more variables: is_in.continent <chr>, is_in.country <chr>,
## # is_in.country_code <chr>, is_in.state_code <chr>, old_name <chr>,
## # openGeoDB.postal_codes <chr>, openGeoDB.telephone_area_code <chr>,
## # place <chr>, population <chr>, source <chr>, source.population <chr>,
## # type <chr>, wikidata <chr>, wikipedia <chr>, geometry <MULTIPOLYGON [°]>
adm10 <- opq(mybox) %>%
add_osm_feature(key = 'admin_level',
value = "10") %>%
osmdata_sf() %>%
with(osm_multipolygons)View the contents.
adm10 %>% head()## Simple feature collection with 6 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -66.83482 ymin: 10.40204 xmax: -66.78954 ymax: 10.45649
## Geodetic CRS: WGS 84
## # A tibble: 6 × 12
## osm_id name admin_level alt_name boundary old_name place source start_date
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 11218161 Barri… 10 "" adminis… "" neig… "" ""
## 2 11218166 Los P… 10 "" adminis… "" neig… "" ""
## 3 11218167 El So… 10 "" adminis… "" neig… "" ""
## 4 11218168 Urban… 10 "" adminis… "" neig… "" ""
## 5 11218173 Loma … 10 "" adminis… "" neig… "" ""
## 6 11218174 Lomas… 10 "" adminis… "" neig… "" ""
## # … with 3 more variables: type <chr>, wikidata <chr>,
## # geometry <MULTIPOLYGON [°]>
adm11 <- opq(mybox) %>%
add_osm_feature(key = 'admin_level',
value = "11") %>%
osmdata_sf() %>%
with(osm_multipolygons)View the contents.
adm11 %>% head()## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -66.98063 ymin: 10.52713 xmax: -66.94835 ymax: 10.53216
## Geodetic CRS: WGS 84
## # A tibble: 6 × 10
## osm_id name admin_level alt_name boundary place source type wikidata
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 13282129 Barrio Jos… 11 "Barrio… adminis… neig… Maxar… boun… Q108766…
## 2 13282551 Barrio La … 11 "" adminis… neig… Maxar… boun… Q108772…
## 3 13282552 Barrio El … 11 "" adminis… neig… Maxar… boun… Q108772…
## 4 13282553 Barrio El … 11 "" adminis… neig… Maxar… boun… Q108772…
## 5 13282554 Barrio El … 11 "" adminis… neig… Maxar… boun… Q108772…
## 6 13282555 Barrio La … 11 "" adminis… neig… Maxar… boun… Q108772…
## # … with 1 more variable: geometry <MULTIPOLYGON [°]>
Depending on the country, data for levels differs greatly. For example, I noticed that level 8 was just the city Caracas, while levels 7 and 9 contained many more polygons. I would recommend you closely inspect the data, just to make sure you’re using the level that is most useful for your analysis.
We can visualize our data quite quickly, using the ggplot() package, where each layer gets its own color.
# continues until level 10 or 11!
ggplot() +
geom_sf(data = adm8, color = "red", fill = NA, size = 0.5) +
geom_sf(data = adm7, color = "firebrick", fill = NA, size = 0.5) +
geom_sf(data = adm6, color = "grey", fill = NA, size = 0.5) +
geom_sf(data = adm4, color = "black", size = 1, fill = NA)Alternatively, we can bind our polygons all into one data.frame, assume that they share column names, and visualize it all at once. This uses the bind_rows() function from dplyr.
all <- bind_rows(adm4, adm6, adm7, adm8, adm9, adm10, adm11) %>%
# Let's order them
mutate(admin_level = factor(admin_level, levels = unique(admin_level) %>% as.numeric() %>% sort() )) %>%
# Let's rename them
mutate(admin_level = admin_level %>% recode_factor(
"4" = "4 (Distrito Capital)",
"5" = "5 (Distrito Metropolitano,\nAlcaldía Mayor)",
"6" = "6 (Municipio)",
"7" = "7 (Parroquia)",
"8" = "8 (Ciudades y poblaciones)",
"9" = "9 (Sectores y Barrios de 1° nivel)",
"10" = "10 (Sectores y Barrios de 2° nivel)",
"11" = "11 (Otros)"))ggplot() +
geom_sf(data = all, mapping = aes(color = admin_level), fill = NA, size = 0.5)And if we wanted an even cleaner visual, we could try this:
ggplot() +
geom_sf(data = adm4, fill = "darkgrey", color = "black", size = 5) +
geom_sf(data = adm4, fill = "darkgrey", color = NA) +
geom_sf(data = all, mapping = aes(color = admin_level), fill = NA, size = 2.5, color = "white") +
geom_sf(data = all, mapping = aes(color = admin_level), fill = NA, size = 2) +
theme_void(base_size = 14) +
labs(subtitle = "Jurisdictional Boundaries in and around Caracas, Venezuela",
caption = "Source: OpenStreetMap, 2022.",
color = "Administrative\nLevel")For the last part of this exercise, we’re going to use some randomly generated points, save them to CSV, and then walk through identifying the neighborhood that each point is located in.
First, we import the data.frame of points (in this case random locations) with read_csv() from the readr package (or in our case, the tidyverse super-package).
mydat <- read_csv("mypoints.csv")Second, we transform the data.frame into an sf data.frame, using st_as_sf().
crs), in this case crs = 4326, the system for the World Geodetic System 84, a pretty safe one.mypoints <- mydat %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326)And voila! We have our points. Check it out!
mypoints %>% head(2)## Simple feature collection with 2 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -67.05809 ymin: 10.45787 xmax: -66.9598 ymax: 10.49246
## Geodetic CRS: WGS 84
## # A tibble: 2 × 2
## id geometry
## <dbl> <POINT [°]>
## 1 1 (-67.05809 10.45787)
## 2 2 (-66.9598 10.49246)
Next, let’s use the sf_join() function, which will automatically attach the name/traits of each neighborhood to our points.
mypoints %>%
# We can then transform the CRS to whatever the CRS is for our polygons
st_transform(crs = st_crs(adm4)) %>%
# And then spatially join in just the 'name' column from our adm4 data.frame,
# using the geometry column as the shared id
# you'll notice I renamed 'name' to 'level_4'.
st_join(adm4 %>% select(level_4 = name)) %>%
# Let's look at the first 6 rows
head()## Simple feature collection with 6 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -67.05809 ymin: 10.35878 xmax: -66.80126 ymax: 10.52398
## Geodetic CRS: WGS 84
## # A tibble: 6 × 3
## id geometry level_4
## <dbl> <POINT [°]> <chr>
## 1 1 (-67.05809 10.45787) Distrito Capital
## 2 2 (-66.9598 10.49246) Distrito Capital
## 3 3 (-66.87696 10.52398) Distrito Capital
## 4 4 (-66.85315 10.35878) Miranda
## 5 5 (-66.80126 10.47997) Miranda
## 6 6 (-66.89291 10.46588) Distrito Capital
Or….. we could do even more!
results <- mypoints %>%
# We can then transform the CRS to whatever the CRS is for our polygons
# In this case, let's just use adm4 as our baseline
st_transform(crs = st_crs(adm4)) %>%
# Join in several!
st_join(adm4 %>% select(level_4 = name)) %>%
st_join(adm6 %>% select(level_6 = name)) %>%
st_join(adm7 %>% select(level_7 = name)) %>%
st_join(adm8 %>% select(level_8 = name)) %>%
st_join(adm9 %>% select(level_9 = name)) %>%
st_join(adm10 %>% select(level_10 = name)) %>%
st_join(adm11 %>% select(level_11 = name))
# Let's look at the first 6 rowsWe can check out the results - some points weren’t in polygons queried by OSM (because they’re just outside Caracas), so they get an NA, but everything else worked! Very exciting!
results %>%
head()## Simple feature collection with 6 features and 8 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -67.05809 ymin: 10.35878 xmax: -66.80126 ymax: 10.52398
## Geodetic CRS: WGS 84
## # A tibble: 6 × 9
## id geometry level_4 level_6 level_7 level_8 level_9
## <dbl> <POINT [°]> <chr> <chr> <chr> <chr> <chr>
## 1 1 (-67.05809 10.45787) Distrito Capital Municipio… Parroq… Caracas <NA>
## 2 2 (-66.9598 10.49246) Distrito Capital Municipio… Parroq… Caracas <NA>
## 3 3 (-66.87696 10.52398) Distrito Capital Municipio… Parroq… Caracas <NA>
## 4 4 (-66.85315 10.35878) Miranda Municipio… Parroq… Caracas <NA>
## 5 5 (-66.80126 10.47997) Miranda Municipio… Parroq… Caracas Petare…
## 6 6 (-66.89291 10.46588) Distrito Capital Municipio… Parroq… Caracas <NA>
## # … with 2 more variables: level_10 <chr>, level_11 <chr>
Finally, let’s export the data. These now describe our best approximation of the locations of these points. Yay!
results %>%
# convert to data.frame
as_tibble() %>%
# drop the geometry column
select(-geometry) %>%
# write to csv
write_csv("results.csv")And you’re done! Happy geolocating!