# Libraries
library(sf)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(spData)
library(ggrepel)
library(geobr)
library(readxl)
library(geosphere)
library(ggspatial)
library(basemaps)
library(prettymapr)

\(\color{darkblue}{\text{Paper 1}}\)

Paper: “Irrigation dams, water and infant mortality: Evidence from South Africa”.

1. Relevant information of the paper for replicating the map

Region: South Africa.
Dams data collected from the “Dam Safety Ofﬁce within the Department of Water Affairs”.
- Only dams of at least 5 m in height and 50,000 cubic meters in capacity are included.
- Sample restricted to dams with “irrigation” uses.
- Dams seem to be located with coordinates as points.
- Data source: http://www.dwaf.gov.za/DSO/Default.aspx
- About 15 percent of dams were dropped because of missing critical information that could not be veriﬁed (690/4,830).
  - For most missing dams (658/690), the missing information was completion date.
  - Of those dropped, 499 reported a purpose of “irrigation.”
- The resulting dataset included 4,140 dams, which were restricted to the subset of irrigation dams (3,176). These dams were matched to magisterial districts using their GPS coordinates.
“To construct river gradient, I match river pixels to land gradient data and calculate the fraction of river pixels that are steep, deﬁned as greater than six percent slope, within each district.” Data of the Department of Water Affairs.
- Used as the color legend for the map.
- The river and land gradient measures are constructed using elevation data from ArcGIS.
- The elevation data is provided in a raster (pixel) format. “I use Slope (3D analyst), an ArcGIS tool, to construct the slope at each pixel using the raster elevation data.”
- “I then use this to construct the average district slope and river gradient slope (the latter is restricted to pixels along the river network provided by the Department of Water Affairs)”.
- Probably, we won’t be able to replicate this part - since it uses raster data (and we still haven’t seen how to deal with it).
- For more information on this, check appendix A.3 of the paper.
Territorial divisions: “magisterial district boundaries that were in place before the end of Apartheid”. There were 354 magisterial districts. Data obtained from Global Administrative Boundaries.

Conclusion on the data that we need (for what we can replicate with the tools that we have learned until now):

Territorial divisions: “magisterial district boundaries that were in place before the end of Apartheid”. Data obtained from Global Administrative Boundaries.
- Magisterial districts in 1996: https://agis.nda.agric.za/portal/home/item.html?id=9a47f9828584496bab347a5dd820b154.
- Magisterial districts in 2005: https://agis.nda.agric.za/portal/home/item.html?id=f0942201f456496ab94189b50660caf4
- Source of the data used below (354 magisterial district boundaries, as in the paper): https://hub.arcgis.com/datasets/nga::land-ownership/about?layer=34.
Dams data which includes dams constructed until March 2013, and only restricted to “irrigation” uses.
- Source: AQUASTAT (https://www.fao.org/aquastat/en/databases/dams/). Accessed on the 26th of January 2025. Excel file.
- Data available:
  - Country where the dam is located.
  - Year of completion of the dam.
  - Dam height (in meters).
  - Reservoir capacity (in million cubic meters).
  - Dam purpose (irrigation included).
  - Coordinates (latitude + longitude), in decimal degrees.
- Further description of the data: https://openknowledge.fao.org/server/api/core/bitstreams/15f369d0-8fb3-4091-bf06-60d40f311348/content

2. Importing and cleaning the data

We need to read the data from 2 different sources: an Excel (for the dams) and a shape file (for the magisterial district boundaries).

2.1. Dams data

# We read the sheet containing the dams. We skip 1 row when reading the data
# (which is just the title of the Excel), and set the column names as defined
# in the Excel
df_dams <- read_excel("Africa-dams_eng.xlsx", sheet = "Dams", col_names = TRUE, skip = 1)
head(df_dams)

## # A tibble: 6 × 28
##   Country `Name of dam` `Alternate dam name` `ISO alpha- 3`
##   <chr>   <chr>         <chr>                <chr>         
## 1 Algeria Meurad        <NA>                 DZA           
## 2 Algeria Hamiz         <NA>                 DZA           
## 3 Algeria Oued Fodda    <NA>                 DZA           
## 4 Algeria Boughzoul     <NA>                 DZA           
## 5 Algeria Cheurfas      <NA>                 DZA           
## 6 Algeria Bakhadda      <NA>                 DZA           
## # ℹ 24 more variables: `Administrative\r\nUnit` <chr>, `Nearest city` <chr>,
## #   River <chr>, `Major basin` <chr>, `Sub-basin` <chr>,
## #   `Completed /operational since` <chr>, `Dam height (m)` <dbl>,
## #   `Reservoir capacity (million m3)` <dbl>, `Reservoir area (km2)` <dbl>,
## #   `Sedimen-tation \r\n(latest known) \r\n(%)` <dbl>, Irrigation <chr>,
## #   `Water supply` <chr>, `Flood control` <chr>, `Hydroelectricity (MW)` <chr>,
## #   Navigation <chr>, Recreation <chr>, `Pollution control` <chr>, …

Cleaning that is done in the cells below:

Selection of the variables we’re interested in and exclusion of the rest.
Removal of rows with null values in the coordinates.
Filtering of dams which:
- Serve irrigation purposes (marked with an “x” in the dataset).
- Have at least 5m height (though those with missing values will be kept, as there are too many missing values in this column).
- Have at least 50,000 cubic meters of capacity (0.05 million cubic meters).
- Completed at most until 2013.

# Just keep variables of interest and filter to keep only South African dams
df_dams_clean <- df_dams %>% 
  select(`Country`, 
         `Name of dam`, 
         `Completed /operational since`, 
         `Dam height (m)`, 
         `Reservoir capacity (million m3)`, 
         `Irrigation`, 
         `Decimal degree latitude`, 
         `Decimal degree longitude`) %>%
  filter(Country == "South Africa")

# Only keep rows where there are no missing values in the coordinates. The ~
# tells dplyr to treat the expression that follows as a formula, where the 
# . represents each column being evaluated.
columns_na <- c("Decimal degree latitude", "Decimal degree longitude")
df_dams_clean_complete_geo <- df_dams_clean %>%
  drop_na(columns_na)
head(df_dams_clean_complete_geo)

## # A tibble: 6 × 8
##   Country      `Name of dam` `Completed /operational since` `Dam height (m)`
##   <chr>        <chr>         <chr>                                     <dbl>
## 1 South Africa Hill Crest    1691                                         NA
## 2 South Africa Gibson        1880                                         NA
## 3 South Africa Newberry      1896                                         NA
## 4 South Africa Woodhead      1897                                         38
## 5 South Africa Milner        1900                                         NA
## 6 South Africa Jameson       1900                                         NA
## # ℹ 4 more variables: `Reservoir capacity (million m3)` <dbl>,
## #   Irrigation <chr>, `Decimal degree latitude` <dbl>,
## #   `Decimal degree longitude` <dbl>

Clearly, this dataset is a lot less rich than the one used in the paper: while in the paper they have data of 4,140 dams, of which 3,176 were used for irrigation purposes, in this dataset we just have 531 dams for which we have geospatial data (before even filtering further the dataset)!

# Filtering (keeping only) dams with at least 5m height, with minimum 
# 0.05 million cubic meters of capacity and that serve irrigation purposes
df_dams_clean_filtered <- df_dams_clean_complete_geo %>%
  filter(`Dam height (m)` >= 5) %>%
  filter(`Reservoir capacity (million m3)` >= 0.05) %>%
  filter(Irrigation == "x") %>%
  filter(`Completed /operational since` <= 2013)

head(df_dams_clean_filtered)

## # A tibble: 6 × 8
##   Country      `Name of dam`  `Completed /operational since` `Dam height (m)`
##   <chr>        <chr>          <chr>                                     <dbl>
## 1 South Africa Calitzdorp     1918                                       30.5
## 2 South Africa Kammanassie    1923                                       35  
## 3 South Africa Mogoto         1924                                       36.5
## 4 South Africa Nwqeba         1925                                       46  
## 5 South Africa Olifantsnek    1929                                       30  
## 6 South Africa Rust De Winter 1934                                       31  
## # ℹ 4 more variables: `Reservoir capacity (million m3)` <dbl>,
## #   Irrigation <chr>, `Decimal degree latitude` <dbl>,
## #   `Decimal degree longitude` <dbl>

# Now, we use the structure as simple feature function for creating the
# dataset with geospatial features. We just indicate the columns we want
# to use as geospatial features. Code structure taken from lecture 2.
sf_dams_1 <- st_as_sf(
  df_dams_clean_filtered,
  coords = c("Decimal degree longitude", "Decimal degree latitude"),
  crs = 'EPSG:4326'
)

sf_dams_2 <- st_as_sf(
  df_dams_clean_complete_geo,
  coords = c("Decimal degree longitude", "Decimal degree latitude"),
  crs = 'EPSG:4326'
)

head(sf_dams_1)

## Simple feature collection with 6 features and 6 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 21.70417 ymin: -33.64306 xmax: 29.23333 ymax: -24.26667
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 7
##   Country      `Name of dam`  `Completed /operational since` `Dam height (m)`
##   <chr>        <chr>          <chr>                                     <dbl>
## 1 South Africa Calitzdorp     1918                                       30.5
## 2 South Africa Kammanassie    1923                                       35  
## 3 South Africa Mogoto         1924                                       36.5
## 4 South Africa Nwqeba         1925                                       46  
## 5 South Africa Olifantsnek    1929                                       30  
## 6 South Africa Rust De Winter 1934                                       31  
## # ℹ 3 more variables: `Reservoir capacity (million m3)` <dbl>,
## #   Irrigation <chr>, geometry <POINT [°]>

2.2. Magisterial district boundaries

# For reading the data, we indicate the name of the folder where
# the data is located and the name of the shape file
sf_mag_boundaries_sa <- st_read("South_Africa_DIVA_GIS_State_L2_Admin_Boundaries.shp")

## Reading layer `South_Africa_DIVA_GIS_State_L2_Admin_Boundaries' from data source `/Users/nourkhalid/Desktop/Geospatial DS/Assignment 1/South_Africa_DIVA_GIS_State_L2_Admin_Boundaries.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 354 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 16.45802 ymin: -34.83514 xmax: 32.89125 ymax: -22.12661
## Geodetic CRS:  WGS 84

Note that it has 354 features, like the 354 magisterial boundaries used in the paper.

3. Map replication

ggplot() + 
  geom_sf(data = sf_mag_boundaries_sa) +
  geom_sf(data = sf_dams_1) +
  labs(title = "Map of (filtered) Dams & Magisterial District Boundaries of South Africa") +
  theme_minimal()+ 
  theme(
    axis.text = element_blank()
  )

However, in the map above we see that the observations are reduced significantly when filtering for dam height, reservoir capacity and for dams that serve irrigation purposes. There are 2 reasons for that: 1. The data used here contains a lot less observations than the data used in the paper. 2. There are several missing values in the dam height column, which filters the dataset more than it probably would if we had a complete dataset.

ggplot() + 
  geom_sf(data = sf_mag_boundaries_sa) +
  geom_sf(data = sf_dams_2) +
  labs(title = "Map of Dams & Magisterial District Boundaries of South Africa") +
  theme_minimal()+
  theme(
    axis.text = element_blank()
  )

\(\color{darkblue}{\text{Paper 2}}\)

Paper: “Rural electrification, migration and structural transformation: Evidence from Ethiopia” (Stephie Fried, David Lagakos, 2020).

1. Relevant information for replicating the map

Region: Ethiopia.
The main data source is the Ethiopian Rural Socioeconomic Survey (ERSS) conducted in 2011/2012 and again in 2013/2014 on a set of enumeration areas representative of parts of Ethiopia. The map shows the location of sample enumeration areas in the ERSS, along with population density (darker regions being more dense), major roads (thicker lines), and the high-voltage electricity grid (thinner, straighter, lines).

2. Our data sources

Layer 1: Subnational Administrative Boundaries dataset, collected by UNDP and OCHA Ethiopia https://data.humdata.org/dataset/cod-ab-eth
Layer 2: Ethiopia administrative level 3 disaggregated population statistics (1,084 woredas), 2022, collected by OCHA. https://data.humdata.org/dataset/cod-ps-eth
Layer 3: Ethiopia Roads (OpenStreetMap Export), 2018 https://data.humdata.org/dataset/hotosm_eth_roads
Layer 4: Electricity Transmission Network, World Bank, 2014 https://datacatalog.worldbank.org/search/dataset/0039865/Ethiopia---Electricity-Transmission-Network
Layer 5: Locations of sample enumeration areas from the ERSS, 2011-2014 https://microdata.worldbank.org/index.php/catalog/2053/data-dictionary
Layer 6: Data for power plants in Ethiopia with total installed generating capacity 10 mw from the Platts World Electric Power Plants Database, WEPP 2006. https://datacatalog.worldbank.org/search/dataset/0041714/Ethiopia-Power-Plants

2.1. Layer 1: Administrative boundaries

# Load the shapefile with Administrative level 3 boundaries
sf.eth_admin_3 <- st_read("eth_admbnda_adm3_csa_bofedb_2021.shp")

## Reading layer `eth_admbnda_adm3_csa_bofedb_2021' from data source 
##   `/Users/nourkhalid/Desktop/Geospatial DS/Assignment 1/eth_admbnda_adm3_csa_bofedb_2021.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1082 features and 16 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 32.9918 ymin: 3.40667 xmax: 47.98824 ymax: 14.84548
## Geodetic CRS:  WGS 84

2.2. Layer 2: Population Density color map

#  demographic data from 2021
eth_pop <- read.csv("pop_adm3_2022_v2.csv")

if (!("ADM3_PCODE" %in% colnames(eth_pop))) {
  eth_pop <- rename(eth_pop, ADM3_PCODE = admin3Pcode) 
}

# Join the data based on the common column
joined_pop <- left_join(sf.eth_admin_3, eth_pop, by = "ADM3_PCODE")

# Filter out all the unnecessary columns that we won't need
selected_columns <- c("ADM3_PCODE", "T_TL", "geometry")
joined_pop_geo <- joined_pop[selected_columns]

# Calculate the area of each admin unit based on geometry column, express it in square km
joined_pop_geo$area_km2 <- as.numeric(st_area(joined_pop_geo) / 1e6)

# Calculate population density as Population / Area
joined_pop_geo$pop_density <- joined_pop_geo$T_TL / joined_pop_geo$area_km2

#Filter out missing or zero values
joined_pop_geo <- joined_pop_geo %>%
  filter(pop_density != 0)

2.3. Layer 3: Main Roads

#Open roads map 
sf.roads <- st_read("hotosm_eth_roads_lines_shp.shp")

## Reading layer `hotosm_eth_roads_lines_shp' from data source 
##   `/Users/nourkhalid/Desktop/Geospatial DS/Assignment 1/hotosm_eth_roads_lines_shp.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 420912 features and 18 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: 32.9902 ymin: 3.410445 xmax: 47.98158 ymax: 14.8554
## Geodetic CRS:  WGS 84

# Filter to keep only primary roads
roads_type <- c("primary")
roads_prim <- sf.roads %>%
  filter(highway %in% roads_type)

# Convert the geometry column to an sf object
roads_geo <- st_as_sf(roads_prim, wkt = "geometry", crs = 4326)

2.4. Layer 4: Power grid lines

sf.lines <- st_read("Ethiopia_grid.shp")

## Reading layer `Ethiopia_grid' from data source 
##   `/Users/nourkhalid/Desktop/Geospatial DS/Assignment 1/Ethiopia_grid.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 132 features and 11 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 34.14213 ymin: 5.389761 xmax: 43.19406 ymax: 14.44666
## Geodetic CRS:  WGS 84

2.5. Layer 5: Enumeration localities

#Load coordinates of enumerators coord from ERSS dataset 
enum <- read.csv("pub_eth_householdgeovariables_y1.csv")

2.6. Layer 6: Power Plants

#Load map of power plants 
plants <- st_read("ETH_PowerPlants.shp")

## Reading layer `ETH_PowerPlants' from data source 
##   `/Users/nourkhalid/Desktop/Geospatial DS/Assignment 1/ETH_PowerPlants.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 22 features and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 36.9127 ymin: 6.856073 xmax: 40.17 ymax: 13.3
## CRS:           NA

plants_geo <- st_set_crs(plants, 4326)

3. Plot replication

#Plot country, roads, lines, density and power plants in the same plot
ggplot() +
  
  #Layer 1: boundaries. We decided to use log of population density to normalize the extremely right-skewed distribution of the data. The range was from 1.5 to 45,408 people per sq km, with 80% being in the range of 0-1,000, and 95% falling into the range of 0-10,000.
  geom_sf(data = joined_pop_geo, aes(fill = log1p(pop_density))) + 

  #Layer 2: population density  
  scale_fill_gradient(low = "aliceblue", high = "midnightblue", name = "Population Density") +  
  
  #Layer 3: roads
  geom_sf(data = roads_geo, color = "black", size = 2) + 
  
  #Layer 4: power lines
  geom_sf(data = sf.lines, color = "red", size = 0.5) + 
  
  #Layer 5: enumerators
  geom_point(data = enum, aes(x = LON_DD_MOD, y = LAT_DD_MOD), color = "black", fill = "white", size = 1, shape = 21) + 
  
  #Layer 6: power plants
  geom_point(data = plants_geo, aes(x = LON, y = LAT), shape = 8, color = "blue") + 
  theme_minimal()

We tried to underlay a baselayer of a relief map, but for some reason, it always includes the zero-zero coordinate and makes the graph look uncentered. It works fine if we only plot linestrings and polygons, but breaks down as soon as we include Layer 5 and 6 which are points. We have tried to fix this issue using four different suggestions found on the internet: 1. Make sure all spatial data layers and the basemap are using the same CRS, 2. Filter out points where (x = 0, y = 0), 3. Limit the map extent to the region of interest using coord_sf(), 4. Ensure that the baselayer is correctly fetched and aligned, for example adjust the zoom level.

However, none of these tips worked.

We have included an optional ggplot without these geom_point layers just for the sake of experimenting with baselayers.

The bounding box of the plot is necessary to calculate the correct coordinates andrestrict the displayed area of the baselayer for when it´s turned on.

bbox <- st_bbox(sf.eth_admin_3)

min_lon <- bbox["xmin"]  # Minimum longitude
max_lon <- bbox["xmax"]  # Maximum longitude
min_lat <- bbox["ymin"]  # Minimum latitude
max_lat <- bbox["ymax"]  # Maximum latitude

cat("Bounding Box:\n")

## Bounding Box:

cat("Min Longitude:", min_lon, "\n")

## Min Longitude: 32.9918

cat("Max Longitude:", max_lon, "\n")

## Max Longitude: 47.98824

cat("Min Latitude:", min_lat, "\n")

## Min Latitude: 3.40667

cat("Max Latitude:", max_lat, "\n")

## Max Latitude: 14.84548

#We had to turn off the geom_points layers to prevent inclusion of the zero coordinate.
ggplot() +
  annotation_map_tile(type = "cartolight", zoom = 4) +  # Layer 0: baselayer
  coord_sf(xlim = c(32.9918, 47.98824), ylim = c(3.40667, 14.84548)) + 
  geom_sf(data = joined_pop_geo, aes(fill = log1p(pop_density))) + #Layer 1: boundaries
  scale_fill_gradient(low = "aliceblue", high = "midnightblue", name = "Population Density") +  #Layer 2: population density
  geom_sf(data = roads_geo, color = "black", size = 2) + #Layer 3: roads
  geom_sf(data = sf.lines, color = "red", size = 0.5) + #Layer 4: power lines
  #geom_point(data = enum, aes(x = LON_DD_MOD, y = LAT_DD_MOD), color = "black", fill = "white", size = 1, shape = 21) + #Layer 5: enumerators
  #geom_point(data = plants_geo, aes(x = LON, y = LAT), shape = 8, color = "blue") + #Layer 6: power plants
  theme_minimal()

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Zoom: 4

\(\color{darkblue}{\text{Paper 3}}\)

Paper: “Migration, Specialization, and Trade: Evidence from Brazil’s March to the West”

Relevant Information

Map to be replicated: The Spatial Distribution of the Brazilian Population between 1950 and 2010. Figure 2 from the working paper Migration, Specialization, and Trade: Evidence from Brazil’s March to the West by Heitor S. Pellegrina and Sebastian Sotelo, 2021.
The map in the original map is divided into meso regions, of which there are around 136. Nonetheless, it was not possible to find population data that followed those divisions in the official data sources of the country. Similarly, there was no population data available for dates prior to 1989.
For this reason, we replicated the map with the data at state-level (27 states) and for the years 2000, 2010, 2020. It is important to note that this year are significantly closer together than the ones picked by the authors (1950, 1980, and 2010) and might probably not have been subject to the phenomenon of they are observing in the paper, at least to the same extent. Therefore, our resulting map shows practically no observable differences in the population distribution. This data was obtained from the Instituto Brasileiro de Geografia e Estatística: https://www.ibge.gov.br/en/statistics/social/population/18448-estimates-of-resident-population-for-municipalities-and-federation-units.html?=&t=downloads .
On the other hand, the geographical data was obtained from https://gadm.org/maps/BRA_1.html .

states <- st_read('BRA_adm1.shp')

## Reading layer `BRA_adm1' from data source 
##   `/Users/nourkhalid/Desktop/Geospatial DS/Assignment 1/BRA_adm1.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 27 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.98971 ymin: -33.74708 xmax: -28.84694 ymax: 5.264878
## Geodetic CRS:  WGS 84

# canviar per el package del hannes - meso_states

#states <- read_state()
states <- st_transform(states, crs = 4326)
head(states)

## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.98971 ymin: -18.34986 xmax: -35.15182 ymax: 4.44236
## Geodetic CRS:  WGS 84
##   ID_0 ISO NAME_0 ID_1   NAME_1 TYPE_1 ENGTYPE_1 NL_NAME_1 VARNAME_1
## 1   33 BRA Brazil    1     Acre Estado     State      <NA>      <NA>
## 2   33 BRA Brazil    2  Alagoas Estado     State      <NA>      <NA>
## 3   33 BRA Brazil    3    Amapá Estado     State      <NA>      <NA>
## 4   33 BRA Brazil    4 Amazonas Estado     State      <NA>   Amazone
## 5   33 BRA Brazil    5    Bahia Estado     State      <NA>      Ba¡a
## 6   33 BRA Brazil    6    Ceará Estado     State      <NA>      <NA>
##                         geometry
## 1 MULTIPOLYGON (((-73.33251 -...
## 2 MULTIPOLYGON (((-35.90153 -...
## 3 MULTIPOLYGON (((-50.02403 0...
## 4 MULTIPOLYGON (((-67.32623 2...
## 5 MULTIPOLYGON (((-38.69708 -...
## 6 MULTIPOLYGON (((-38.47542 -...

Importing and Cleaning the Data

We imported three excel files containing the population data for the 3 different years. The one for year 2000 had the population distributed across municipalities, but included an indicator for the state (the state code), so we aggregated the data. The 2010 file required minor cleaning (changing some data types and column names) while the 2020 file required no cleaning at all. Next, we merged these three data frames together and also with the data frame containing the state geometry. We also highlighted the West states, those of interest in the study, and joined them into a singular geographical unit to be able to create the red boundary in the map.

# Importing and standarising population data

# YEAR 2000
population2000 <- read_xlsx("population2000.xlsx")
pop2000 <- population2000 %>% 
  group_by(Cód.) %>% #grouping by state code because it's divided into territorial divisions smaller than States
  summarize(Pop2000 = sum(Pop2000, na.rm = TRUE), .groups = "drop")
rm(population2000) # removing the un-grouped dataset

# YEAR 2010
pop2010 <- read_xlsx("population2010.xlsx")
pop2010 <- pop2010[-c(1, 2), ]
pop2010$`Cód.` <- as.numeric(pop2010$`Cód.`) # making the state code a numeric column, so the join works
pop2010 <- pop2010[2:ncol(pop2010)]

# YEAR 2020
pop2020 <- read_xlsx("population2020.xlsx")

Merging all the data into a dataset with the relevant columns

population <- left_join(pop2010, pop2000, by = c('Cód.' = 'Cód.'))
population <- left_join(population, pop2020, by = c("Brasil e Unidade da Federação" = 'UF'))

merged_data <- left_join(states, population, by = c("NAME_1" = "Brasil e Unidade da Federação"))

# Making population percentage columns
merged_data <- merged_data %>% 
  rename(State = "NAME_1") %>%
  select(State,"geometry", "Pop2000","Pop2010", "Pop2020") %>% 
  mutate(share00 = Pop2000/sum(Pop2000) * 100,
         share10 = Pop2010/sum(Pop2010) * 100,
         share20 = Pop2020/sum(Pop2020) * 100)

head(merged_data)

## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.98971 ymin: -18.34986 xmax: -35.15182 ymax: 4.44236
## Geodetic CRS:  WGS 84
##      State  Pop2000  Pop2010  Pop2020                       geometry   share00
## 1     Acre   541873   733559   894470 MULTIPOLYGON (((-73.33251 -... 0.3262084
## 2  Alagoas  2738378  3120494  3351543 MULTIPOLYGON (((-35.90153 -... 1.6485079
## 3    Amapá   458796   669526   861773 MULTIPOLYGON (((-50.02403 0... 0.2761959
## 4 Amazonas  2641251  3483985  4207714 MULTIPOLYGON (((-67.32623 2... 1.5900373
## 5    Bahia 13135262 14016906 14930634 MULTIPOLYGON (((-38.69708 -... 7.9074486
## 6    Ceará  7200167  8452381  9187103 MULTIPOLYGON (((-38.47542 -... 4.3345120
##     share10   share20
## 1 0.3845540 0.4224066
## 2 1.6358580 1.5827405
## 3 0.3509859 0.4069657
## 4 1.8264111 1.9870606
## 5 7.3480891 7.0508773
## 6 4.4309956 4.3385389

# Filtering Western states and making a joint geographic element
western_states <- c("Rondônia", "Acre", "Amazonas", "Roraima", "Pará", 
                         "Amapá", "Mato Grosso", "Mato Grosso do Sul", 
                         "Goiás", "Tocantins", "Distrito Federal")

west_data <- merged_data %>% filter(State %in% western_states)

# We merge the geometries of these states into a single shape
merged_west <- st_union(west_data)

# We then piviot the data to make the 3 maps side by side
long_data <- merged_data %>%
  select(State, geometry, share00, share10, share20) %>%
  pivot_longer(cols = starts_with("share"), 
               names_to = "Year", 
               values_to = "Share") %>%
  mutate(Year = recode(Year, 
                       share00 = "2000", 
                       share10 = "2010", 
                       share20 = "2020"))
head(long_data)

## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.98971 ymin: -11.14516 xmax: -35.15182 ymax: -7.12132
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 4
##   State                                                     geometry Year  Share
##   <chr>                                           <MULTIPOLYGON [°]> <chr> <dbl>
## 1 Acre    (((-73.33251 -7.324879, -73.27482 -7.350334, -72.87016 -7… 2000  0.326
## 2 Acre    (((-73.33251 -7.324879, -73.27482 -7.350334, -72.87016 -7… 2010  0.385
## 3 Acre    (((-73.33251 -7.324879, -73.27482 -7.350334, -72.87016 -7… 2020  0.422
## 4 Alagoas (((-35.90153 -9.861805, -35.90153 -9.862083, -35.90125 -9… 2000  1.65 
## 5 Alagoas (((-35.90153 -9.861805, -35.90153 -9.862083, -35.90125 -9… 2010  1.64 
## 6 Alagoas (((-35.90153 -9.861805, -35.90153 -9.862083, -35.90125 -9… 2020  1.58

Visualization

ggplot(data = long_data) +
  geom_sf(aes(fill = Share, geometry = geometry), color = "black", linewidth = 0.3) +
  geom_sf(data = merged_west, fill = NA, color = "red", linewidth = 0.3) +
  facet_wrap(~ Year, ncol = 3) +
  scale_fill_continuous(name = "Population Share (%)", low = "aliceblue", high = "darkblue") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1),  
    axis.text = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 10),
    legend.position = "bottom",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8),
    strip.text = element_text(size = 10),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  ggtitle("Brazilian Population Share by State between 2000 and 2020")

\(\color{darkblue}{\text{Paper 4}}\)

Paper: “In harm’s way? infrastructure investments and the persistence of coastal cities”.

Notes on the possible sources of data for figure 4 (“Road maps of Vietnam, 2000 and 2010”): - Replication package for paper 4: https://www.openicpsr.org/openicpsr/project/207641/version/V1/view. Useful information in: - Raw data -> Boundaries (for constructing Vietnam’s coastline as shown in the paper). - The author says that they were constructed using Natural Earth Data and then cropping the shapefile to include only Vietnam’s coastline. - But their files don’t include the boundaries of the whole Vietnam! So the replication package is not very useful. - Vietnam’s country boundaries obtained from: https://gadm.org/download_country.html. Different shape files for different territorial administrative levels. - Vietnam’s road network information (for 2015) obtained from: https://data.humdata.org/dataset/viet-nam-roads

1. Relevant information of the paper for replicating the map

Maps to be replicated: road network of Vietnam in 2000 and 2010.
Need map of Vietnam.
Need road networks by road type (if they are available). On how the author constructed Vietnam’s road network:

"I obtain road network data from the 2000 and 2010 editions of ITMB Publishing’s detailed International Travel Maps of Vietnam, which show the location of freeways, dual carriageways, major, minor and other roads.
I geo-referenced each map and manually traced the location of each road category to obtain a GIS shapefile of the entire road network in each road category in 2000 and 2010, shown in Figure 3."

2. Importing and cleaning the data

We need to read the data from 2 different sources: an Excel (for the dams) and a shape file (for the magisterial district boundaries).

2.1. Vietnam’s boundaries

sf_vietnam <- st_read(
  "gadm41_VNM_0.shp"
)

## Reading layer `gadm41_VNM_0' from data source 
##   `/Users/nourkhalid/Desktop/Geospatial DS/Assignment 1/gadm41_VNM_0.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 102.1446 ymin: 8.381355 xmax: 109.4692 ymax: 23.39269
## Geodetic CRS:  WGS 84

2.2. Road network

sf_vietnam_roads <- st_read(
  "vnm_rdsl_2015_0SM.shp"
)

## Reading layer `vnm_rdsl_2015_0SM' from data source 
##   `/Users/nourkhalid/Desktop/Geospatial DS/Assignment 1/vnm_rdsl_2015_0SM.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 118138 features and 5 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 102.1562 ymin: 8.596338 xmax: 109.4536 ymax: 23.37756
## Geodetic CRS:  WGS 84

str(sf_vietnam_roads)

## Classes 'sf' and 'data.frame':   118138 obs. of  6 variables:
##  $ osm_id    : chr  "9566542" "9656653" "9656730" "9963509" ...
##  $ name      : chr  "C?u Thê Húc" "Mã Mây" "C?u Chuong Duong" "Hàng Cót" ...
##  $ ref       : chr  NA NA NA NA ...
##  $ type      : chr  "footway" "residential" "primary" "secondary" ...
##  $ Shape_Leng: num  0.000413 0.002599 0.011551 0.001603 0.001613 ...
##  $ geometry  :sfc_MULTILINESTRING of length 118138; first list element: List of 1
##   ..$ : num [1:2, 1:2] 106 106 21 21
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTILINESTRING" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:5] "osm_id" "name" "ref" "type" ...

unique(sf_vietnam_roads$type)

##  [1] "footway"        "residential"    "primary"        "secondary"     
##  [5] "pedestrian"     "tertiary"       "trunk"          "unclassified"  
##  [9] "service"        "motorway_link"  "motorway"       "primary_link"  
## [13] "track"          "tertiary_link"  "road"           "trunk_link"    
## [17] "secondary_link" "living_street"  "steps"          "path"          
## [21] "construction"   "cycleway"       "proposed"       "crossing"      
## [25] "services"       "rest_area"      "yes"

The type of road is stored within the type column. Given that there are 28 unique types of roads (as they include crossings and linkks as separate types), below we simplify the classification by separating the types into either: 1. “motorway” or “trunk” roads (assimilated to a “Dual carriageway” as used in the paper). 2. “primary” road (that could be assimilated to the term “Freeway” as used in the paper). 3. “secondary” roads (assimilated to “major” roads). 4. “tertiary” roads (assimilated to “minor” roads). 5. “other” roads. The other types of roads not included in the previous classifications.

We can map this renaming into the simple features dataset:

road_type_mapping <- c(
  "motorway" = "Dual carriageway/Freeway",
  "motorway_link" = "Dual carriageway/Freeway",
  "trunk" = "Dual carriageway/Freeway",
  "trunk_link" = "Dual carriageway/Freeway",
  "primary" = "Major road",
  "primary_link" = "Major road",
  "secondary" = "Minor road",
  "secondary_link" = "Minor road",
  "tertiary" = "Minor road",
  "tertiary_link" = "Minor road",
  # All others are mapped to "Other"
  "footway" = "Other",
  "residential" = "Other",
  "pedestrian" = "Other",
  "unclassified" = "Other",
  "service" = "Other",
  "track" = "Other",
  "road" = "Other",
  "living_street" = "Other",
  "steps" = "Other",
  "path" = "Other",
  "construction" = "Other",
  "cycleway" = "Other",
  "proposed" = "Other",
  "crossing" = "Other",
  "services" = "Other",
  "rest_area" = "Other",
  "yes" = "Other"
)

sf_vietnam_roads <- sf_vietnam_roads %>%
  mutate(mapped_type = recode(type, !!!road_type_mapping))

unique(sf_vietnam_roads$mapped_type)

## [1] "Other"                    "Major road"              
## [3] "Minor road"               "Dual carriageway/Freeway"

The type of road is stored within the type column. Given that there are 28 unique types of roads (as they include crossings and linkks as separate types), below we simplify the classification by separating the types into either:

“Dual carriageway/Freeway” roads (assimilated to “motorways” and “trunks”).
“Major” roads (that could be assimilated to “primary roads”).
“Minor” roads (assimilated to “secondary” and “tertiary” roads).
“Other” roads. The other types of roads not included in the previous classifications.

3. Map replication

ggplot() + 
  geom_sf(data = sf_vietnam) +
  geom_sf(data = sf_vietnam_roads, aes(color = mapped_type)) +
  scale_color_manual(
    name = "Road Types",
    values = c(
      "Dual carriageway/Freeway" = "#213921",
      "Major road" = "#ff000074",
      "Minor road" = "#ffa60075",
      "Other" = "#e1e193"
    )
  ) +
  labs(
    title = "Map of Vietnam's road network in 2015"
  ) +
 theme_minimal() +
   theme(
    axis.text = element_blank()
  )

\(\color{darkblue}{\text{Paper 5}}\)

Paper: “The Effects of Roads on Trade and Migration: Evidence from a Planned Capital City” The following analysis identifies radial and non-radial highways in Brazil, using Brasília as a central point. The roads are classified based on proximity to Brasília and visualized on a map.

We use data from the Instituto Brasileiro de Geografia e Estatística (Brazilian Institute of Geography and Statistics), which is responsible for the official collection of statistical, geographic, cartographic, geodetic, and environmental information in Brazil. The road dataset represents the main structuring road axes of the Brazilian territory. Additionally, we use the world map data and filter it to include only Brazil.

Load Datasets

# Load Brazil road dataset (I rename some of the variables just for ease of use)
road <- st_read("eixo_rodoviario_estruturante_2014.shp") %>%
  rename(
    SegmentDescription = DescSeg,    
    RoadType = TipoPNV,             
    Code = CODIGO                    
  )
head(road)

# We filter for Brazil from the world dataset
brazil <- world %>%
  filter(name_long == "Brazil")
# We select the cities highlight on fig. 1 in the paper
selected_municipality <- c("Boa Vista", "Macapá", "Belém", "Manaus", "Porto Velho", 
                     "Rio Branco", "Palmas", "Brasília", 
                     "Cuiabá", "Campo Grande", 
                     "São Paulo", "Rio de Janeiro", "Belo Horizonte", 
                     "Florianópolis", "Porto Alegre", "Salvador", "Recife", 
                     "Natal", "Fortaleza", "João Pessoa", 
                     "Aracaju", "Maceió", "Goiânia", "Vitória")

# We filter for the year 2000 as indicated in fig. 1 in the paper. We the function read_municipality(year = 2000) to load municipality data for Brazil from the year 2000. Then, we use filter(name_muni %in% selected_municipality) to keep only the rows where the municipality name (`name_muni`) matches the names listed in the selected_municipality vector, effectively narrowing the dataset to include only the cities we are interested in.
cities <- read_municipality(year=2000) %>%
  filter(name_muni %in% selected_municipality)

## Using year/date 2000

We use Brasília as the center point as the paper states “The roads radiating out from Brasília are known as radial highways” (Morten & Oliveira, 2018). Moreover, we are making sure that the point representing Brasília is in the same CRS as the road dataset. This is important because as seen in class the spatial operations like calculating distances will only work correctly if both the point and the roads use the same CRS.

# Brasília coordinates
brasilia_coords <- cities %>%
  filter(name_muni == "Brasília") %>%
  st_coordinates()

# Brasília as a point in the same CRS as the roads dataset. I use st_point() to create a spatial point for Brasília, where [1, "X"] and [1, "Y"] are the extract the longitude (X) and latitude (Y) from the first row of the coordinate matrix of Brasília. I then use st_sfc() to convert this point into a simple feature geometry in the datset, whicht then lets me to use it in spatial operations and assigns it the same CRS as the road dataset.
brasilia_point <- st_sfc(
  st_point(c(brasilia_coords[1, "X"], brasilia_coords[1, "Y"])), 
  crs = st_crs(road)
)

# We classify roads as Radial or Non-Radial based on distance from Brasília. 
# We use a threshold of 500 to try to get as close to the figure as possible, but this is not exact as there was no clear method of doing this from the paper, as well as the fact that the data source is not exact. 
road <- road %>%
  mutate(
    distance_km = as.numeric(st_distance(geometry, brasilia_point)) / 1000,
    classification = ifelse(distance_km < 500, "Radial highways (2000)", 
                            "Non−radial highways (2000)")  
  )

Visualization

ggplot() +
  geom_sf(data = brazil, fill = NA, color = "grey") +  
  geom_sf(data = cities, size = 3, color = "black") +  
  geom_text_repel(data = cities, aes(label = name_muni, geometry = geom), 
                  stat = "sf_coordinates", size = 2.5) +
  geom_sf(data = road, aes(linetype = classification, color = classification), 
          size = 0.5) +  
  scale_linetype_manual(
    values = c("Radial highways (2000)" = "solid", 
               "Non−radial highways (2000)" = "dotted")
  ) +
  scale_color_manual(
    values = c("Radial highways (2000)" = "black", 
               "Non−radial highways (2000)" = "darkgrey")
  ) +
  theme_minimal() +  
  theme(
    axis.title = element_blank(), 
    axis.text = element_blank(), 
    legend.title = element_blank()
  ) +
  labs(
    linetype = "Highway Type",  
    color = "Highway Type",     
    caption = "Figure 1: Map of straight-line instrument and radial highways"
  )

## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data

Note: We didn’t add the Minimum Spanning Tree (MST) road network because the method in the paper seems complex. It involves slicing the map, and finding the shortest paths from Brasília to state capitals in each slice and then combining those paths.

References

Mettetal, E., 2019. Irrigation dams, water and infant mortality: Evidence from South Africa (fig. 2: hydro dams in South Africa)

Fried, S. and Lagakos, D., 2021. Rural electrification, migration and structural transformation: Evidence from Ethiopia (fig. 4: districts and electricity grid in Ethiopia)

Pellegrina, H.S. and Sotelo, S., 2021. Migration, Specialization, and Trade: Evidence from Brazil’s March to the West (fig. 2: Population in Brazil’s meso-regions (or districts) in different periods

Balboni, C.A., 2019. In harm’s way? infrastructure investments and the persistence of coastal cities. Link here (fig. 3: Vietnam’s road infrastructure by road type - if available)

Morten, M. & Oliveira, J., 2018. The Effects of Roads on Trade and Migration: Evidence from a Planned Capital City (fig. 1: Brazil’s capital and main road infrastructure)

Data Sources

Paper 1:

Department of Water Affairs (Dam Safety Office):Department of Water Affairs. (2013). Dam Safety Office dataset. Retrieved from http://www.dwaf.gov.za/DSO/Default.aspx

Global Administrative Boundaries (Magisterial Districts): Global Administrative Boundaries. Magisterial district boundaries (pre-1996). Retrieved from https://hub.arcgis.com/datasets/nga::land-ownership/about?layer=34

AQUASTAT (FAO): Food and Agriculture Organization of the United Nations (FAO). (2025). AQUASTAT dams database. Retrieved from https://www.fao.org/aquastat/en/databases/dams/

Paper 2:

United Nations Development Programme (UNDP) & OCHA Ethiopia. (n.d.). Subnational Administrative Boundaries dataset. Retrieved from https://data.humdata.org/dataset/cod-ab-eth

OCHA Ethiopia. (2022). Ethiopia administrative level 3 disaggregated population statistics (1,084 woredas). Retrieved from https://data.humdata.org/dataset/cod-ps-eth

OpenStreetMap. (2018). Ethiopia Roads (OpenStreetMap Export). Retrieved from https://data.humdata.org/dataset/hotosm_eth_roads

World Bank. (2014). Electricity Transmission Network. Retrieved from https://datacatalog.worldbank.org/search/dataset/0039865/Ethiopia—Electricity-Transmission-Network

World Bank. (2011–2014). Locations of sample enumeration areas from the ERSS. Retrieved from https://microdata.worldbank.org/index.php/catalog/2053/data-dictionary

Platts. (2006). World Electric Power Plants Database (WEPP): Data for power plants in Ethiopia with total installed generating capacity 10 MW. Retrieved from https://datacatalog.worldbank.org/search/dataset/0041714/Ethiopia-Power-Plants

Paper 3:

Pellegrina, H. S., & Sotelo, S. (2021). Migration, specialization, and trade: Evidence from Brazil’s march to the west. Retrieved from https://www.ibge.gov.br/en/statistics/social/population/18448-estimates-of-resident-population-for-municipalities-and-federation-units.html?=&t=downloads

Paper 4:

Replication Package for Paper 4: Balboni, C.A. Retrieved from https://www.openicpsr.org/openicpsr/project/207641/version/V1/view

Natural Earth Data (for Vietnam’s coastline):Natural Earth. Retrieved from https://www.naturalearthdata.com

Vietnam’s Country Boundaries: GADM. Retrieved from https://gadm.org/download_country.html

Vietnam’s Road Network (2015): United Nations Office for the Coordination of Humanitarian Affairs. (2015). Retrieved from https://data.humdata.org/dataset/viet-nam-roads

Paper 5:

Instituto Brasileiro de Geografia e Estatística. (2019). Transportation logistics. Retrieved from https://www.ibge.gov.br/en/geosciences/maps/brazil-geographic-networks-mapasdobrasil/18884-transportation-logistics.html

Assignment 1: Geospatial Data Science

Anastasiia Chernavskaia, Blanca Jimenez, Pablo Fernández, Nour Mohamed

01/28/2025

\(\color{darkblue}{\text{Paper 1}}\)

1. Relevant information of the paper for replicating the map

2. Importing and cleaning the data

2.1. Dams data

2.2. Magisterial district boundaries

3. Map replication

\(\color{darkblue}{\text{Paper 2}}\)

1. Relevant information for replicating the map

2. Our data sources

2.1. Layer 1: Administrative boundaries

2.2. Layer 2: Population Density color map

2.3. Layer 3: Main Roads

2.4. Layer 4: Power grid lines

2.5. Layer 5: Enumeration localities

2.6. Layer 6: Power Plants

3. Plot replication

\(\color{darkblue}{\text{Paper 3}}\)

Relevant Information

Importing and Cleaning the Data

Merging all the data into a dataset with the relevant columns

Visualization

\(\color{darkblue}{\text{Paper 4}}\)

1. Relevant information of the paper for replicating the map

2. Importing and cleaning the data

2.1. Vietnam’s boundaries

2.2. Road network

3. Map replication

\(\color{darkblue}{\text{Paper 5}}\)

Load Datasets

Visualization

References

Data Sources