Allowing for Static and Dynamic Visualizations in Geospatial Data

STA/ISS 313 - Project 2

Author

The gg_twizzlers

Introduction

Life expectancy, population, fertility, and mortality rates are factors that can tell a story about a country and the world. What if users could explore these factors in a customizable way, choosing which countries to view and how to visualize the data? Our high-level goal is to create an interactive spatio-temporal visualization of changes in world population and related factors (child mortality, life expectancy, and children per woman). Our final product is a Shiny application that lets users visualize these factors in several forms: a spatio-temporal visualization of a world map or a single country, a comparison of one factor across multiple countries, and a prediction of future values using a time series regression model. Our analysis uses four primary datasets, all downloaded from Gapminder at ‘https://www.gapminder.org/’.

Our primary inspiration comes from the YouTube video ‘200 countries in 200 years’ by Hans Rosling, as well as visualizations on the Gapminder website. In this video, Hans Rosling presents a spatio-temporal animation of population and life expectancy over the years. Our group created a similar map and added other visualizations and user customization, using geospatial visualizations, predictions, and plots to compare countries.

We approached the visualizations by creating a Shiny app that lets people select which countries to compare and use a slider to move through time at their own pace, seeing a snapshot of each year for a single country as well as for all countries. Different people prefer to visualize data in different ways, so our aim was to maximize the user’s control over how the data is presented. The four main visualization options we created are: a world map visualization that gives a big-picture view of the state of the world in a given year for a factor (population, fertility, mortality, or life expectancy); a single-country geospatial visualization that shows how a relevant factor changed in a country over time; a scatter or line plot that compares a factor across multiple countries; and a regression model that predicts how a factor will change in the future for a selected country.

Data and Data Cleaning

First we load the necessary data.

library(tidyverse)
life_exp <- read_csv("data/life_expectancy_years.csv")
pop <- read_csv("data/population_total.csv")
mortality <- read_csv("data/child_mortality.csv")
fertility <- read_csv("data/fertility.csv")

The first dataset is a life expectancy dataset.

This dataset has 195 rows, each representing an independent country, and 302 columns. Each value is the life expectancy of people in a given country in a given year (stored as doubles). Any NAs in the data frame indicate that the modern-day country does not have a life expectancy statistic via the World Bank.

The second dataset is a population total dataset.

This dataset has 197 rows, each representing an independent country, and 302 columns. Each value is the population of a given country in a given year (stored as characters). Any NAs in the data frame indicate that the modern-day country does not have a population statistic via the World Bank.

The third dataset is a child mortality dataset.

This dataset has 197 rows, each representing an independent country, and 302 columns. Each value is the number of children per 1,000 who die between ages 0 and 5 in a given country in a given year (stored as integers). Any NAs in the data frame indicate that the modern-day country does not have a child mortality statistic via the World Bank.

The fourth dataset is a fertility dataset.

This dataset has 202 rows, each representing an independent country, and 302 columns. Each value is the total number of children per fertile woman in a given country in a given year (stored as integers). Any NAs in the data frame indicate that the modern-day country does not have a fertility statistic via the World Bank.

We then processed the data to clean it. We cleaned each data frame individually so that year is a single variable with each year as an observation, rather than several variables. We also made sure the geospatial sf data is linked to the proper country. Some countries were named differently in the data frame than in the geospatial data frame, so we renamed those observations so that each is assigned its correct geospatial sf geometry. Finally, we filtered the data to the years 1880 to 2018, the period in which a large number of observations were confirmed data.
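As a rough illustration of the reshaping step described above, the following sketch pivots a toy wide table into long format, converts the year to a number, and keeps only 1880 to 2018. The table `wide` and its values are made up for illustration and are not part of the project data.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per country, one column per year (values are made up)
wide <- tibble(
  country = c("A", "B"),
  `1879` = c(30.1, 28.4),
  `1880` = c(30.3, 28.9),
  `2019` = c(80.2, 75.0)
)

long <- wide |>
  # Collapse every year column into a year/value pair of columns
  pivot_longer(cols = -country, names_to = "year", values_to = "life_exp") |>
  # Column names arrive as text, so convert them before comparing
  mutate(year = as.numeric(year)) |>
  filter(year >= 1880, year <= 2018)

long
```

Only the 1880 rows survive the filter in this toy example, one for each country.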

The code below shows the steps taken to load and clean the data:

Here we load the necessary libraries and set the proper theme for our rendered document.

library(tidyverse)
library(countdown)
library(mapproj)
library(sf)
library(geofacet)
library(ggrepel)
library(ggspatial)
library(patchwork)
library(rnaturalearth)
library(rnaturalearthdata)
library(drc)
library(gganimate)
library(knitr)
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))
options(width = 65)
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, 
  fig.retina = 3, 
  fig.align = "center", 
  dpi = 300
)

Next, we clean the variable names of the years in the data so that they can be converted to numerics.

# Strip the leading "X" that R can prepend to numeric column names
names(fertility) <- str_remove(names(fertility), "^X")
names(life_exp) <- str_remove(names(life_exp), "^X")
names(mortality) <- str_remove(names(mortality), "^X")
names(pop) <- str_remove(names(pop), "^X")

Now, we clean the data in our four data frames: fertility, life_exp, mortality, and pop. The process for cleaning each data frame is the same. We first rename the countries whose names differ between the dataset and the geospatial dataset so that each is assigned its correct geospatial sf geometry. We then join the dataset with the geospatial dataset. Next, we pivot the data so that year is a single variable and each year is an observation rather than its own column. Finally, we convert year to a numeric (double).

world <- ne_countries(scale = "medium", returnclass = "sf")
new_world <- data.frame(world$admin, world$geometry)
new_world <- rename(new_world, country = world.admin)

fertility <- fertility |>
  filter(!(country %in% c("Netherlands Antilles", "Guadeloupe", "French Guiana", 
                      "Martinique", "Mayotte", "Reunion"))) |>
  mutate(country = case_when(
    country == "Bahamas" ~ "The Bahamas",
    country == "Channel Islands" ~ "Jersey",
    country == "Cote d'Ivoire" ~ "Ivory Coast",
    country == "Congo, Dem. Rep." ~ "Democratic Republic of the Congo",
    country == "Congo, Rep." ~ "Republic of Congo",
    country == "Micronesia, Fed. Sts." ~ "Federated States of Micronesia",
    country == "Guinea-Bissau" ~ "Guinea Bissau",
    country == "Hong Kong, China" ~ "Hong Kong S.A.R.",
    country == "Kyrgyz Republic" ~ "Kyrgyzstan",
    country == "St. Kitts and Nevis" ~ "Saint Kitts and Nevis",
    country == "Lao" ~ "Laos",
    country == "St. Lucia" ~ "Saint Lucia",
    country == "Macao, China" ~ "Macao S.A.R",
    country == "North Macedonia" ~ "Macedonia",
    country == "Serbia" ~ "Republic of Serbia",
    country == "Slovak Republic" ~ "Slovakia",
    country == "Eswatini" ~ "Swaziland",
    country == "Timor-Leste" ~ "East Timor",
    country == "Tanzania" ~ "United Republic of Tanzania",
    country == "United States" ~ "United States of America",
    country == "St. Vincent and the Grenadines" ~ "Saint Vincent and the Grenadines",
    country == "Virgin Islands (U.S.)" ~ "United States Virgin Islands",
    TRUE ~ country
  ))

fertility_clean <- fertility |>
  left_join(new_world, by = "country")

fertility_clean <- fertility_clean[ , c("geometry",
                        names(fertility_clean)[names(fertility_clean) != "geometry"])]

fertility_clean <- fertility_clean |>
  pivot_longer(cols = 3:303, names_to = "year", values_to = "est_fertility") |>
  # Convert year to numeric, as for the other three data frames
  mutate(year = as.numeric(year)) |>
  filter(year >= 1880, year <= 2018)
life_exp <- life_exp |>
  filter(country != "Tuvalu") |>
  mutate(country = case_when(
    country == "Bahamas" ~ "The Bahamas",
    country == "Cote d'Ivoire" ~ "Ivory Coast",
    country == "Congo, Dem. Rep." ~ "Democratic Republic of the Congo",
    country == "Congo, Rep." ~ "Republic of Congo",
    country == "Micronesia, Fed. Sts." ~ "Federated States of Micronesia",
    country == "Guinea-Bissau" ~ "Guinea Bissau",
    country == "Hong Kong, China" ~ "Hong Kong S.A.R.",
    country == "Kyrgyz Republic" ~ "Kyrgyzstan",
    country == "St. Kitts and Nevis" ~ "Saint Kitts and Nevis",
    country == "Lao" ~ "Laos",
    country == "St. Lucia" ~ "Saint Lucia",
    country == "North Macedonia" ~ "Macedonia",
    country == "Serbia" ~ "Republic of Serbia",
    country == "Slovak Republic" ~ "Slovakia",
    country == "Eswatini" ~ "Swaziland",
    country == "Timor-Leste" ~ "East Timor",
    country == "Tanzania" ~ "United Republic of Tanzania",
    country == "United States" ~ "United States of America",
    country == "St. Vincent and the Grenadines" ~ "Saint Vincent and the Grenadines",
    TRUE ~ country
  ))

life_clean <- left_join(life_exp, new_world, by = "country")

life_clean <- life_clean[ , c("geometry",
                        names(life_clean)[names(life_clean) != "geometry"])]

life_clean <- life_clean |>
  pivot_longer(cols = 3:303, names_to = "year", values_to = "life_exp")

life_clean <- life_clean |>
  filter(year >= 1880, year <= 2018) |>
  mutate(year = as.numeric(year))

datebreaks1 <-
  seq(1880, 2020, by = 10)
mortality <- mortality |>
  mutate(country = case_when(
    country == "Bahamas" ~ "The Bahamas",
    country == "Congo, Dem. Rep." ~ "Democratic Republic of the Congo",
    country == "Congo, Rep." ~ "Republic of Congo",
    country == "Cote d'Ivoire" ~ "Ivory Coast",
    country == "Micronesia, Fed. Sts." ~ "Federated States of Micronesia",
    country == "Guinea-Bissau" ~ "Guinea Bissau",
    country == "Hong Kong, China" ~ "Hong Kong S.A.R.",
    country == "Kyrgyz Republic" ~ "Kyrgyzstan",
    country == "Lao" ~ "Laos",
    country == "St. Lucia" ~ "Saint Lucia",
    country == "North Macedonia" ~ "Macedonia",
    country == "Serbia" ~ "Republic of Serbia",
    country == "Slovak Republic" ~ "Slovakia", 
    country == "Timor-Leste" ~ "East Timor",
    country == "Tanzania" ~ "United Republic of Tanzania", 
    country == "United States" ~ "United States of America",
    country == "St. Vincent and the Grenadines" ~ "Saint Vincent and the Grenadines",
    TRUE ~ country
  )) 
  
mortality_clean <- left_join(mortality, new_world, by = "country")

mortality_clean <- mortality_clean[ , c("geometry",
                       names(mortality_clean)[names(mortality_clean) != "geometry"])]

mortality_clean <- mortality_clean |>
  pivot_longer(cols = 3:303, names_to = "year", values_to = "mortality") |>
  filter(year <= 2018,
         year >= 1880) |>
  mutate(year = as.numeric(year))
pop <- pop |>
  filter(country != "Tuvalu" & country != "Holy See") |>
  mutate(country = case_when(
    country == "Bahamas" ~ "The Bahamas",
    country == "Cote d'Ivoire" ~ "Ivory Coast",
    country == "Congo, Dem. Rep." ~ "Democratic Republic of the Congo",
    country == "Congo, Rep." ~ "Republic of Congo",
    country == "Micronesia, Fed. Sts." ~ "Federated States of Micronesia",
    country == "Guinea-Bissau" ~ "Guinea Bissau",
    country == "Hong Kong, China" ~ "Hong Kong S.A.R.",
    country == "Kyrgyz Republic" ~ "Kyrgyzstan",
    country == "St. Kitts and Nevis" ~ "Saint Kitts and Nevis",
    country == "Lao" ~ "Laos",
    country == "St. Lucia" ~ "Saint Lucia",
    country == "North Macedonia" ~ "Macedonia",
    country == "Serbia" ~ "Republic of Serbia",
    country == "Slovak Republic" ~ "Slovakia", 
    country == "Eswatini" ~ "Swaziland",
    country == "Timor-Leste" ~ "East Timor",
    country == "Tanzania" ~ "United Republic of Tanzania", 
    country == "United States" ~ "United States of America",
    country == "St. Vincent and the Grenadines" ~ "Saint Vincent and the Grenadines",
    TRUE ~ country
  ))

pop_clean <- left_join(pop, new_world, by = "country")
pop_clean = pop_clean[ , c("geometry",
                           names(pop_clean)[names(pop_clean) != "geometry"])]

pop_clean <- pop_clean |>
  pivot_longer(cols = 3:303, names_to = "year", values_to = "population") |>
  filter(year <= 2018,
         year >= 1880) |>
  mutate(year = as.numeric(year))

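pop_clean <- pop_clean |>
  mutate(population = case_when(
    # Expand magnitude suffixes: "k" = thousand, "M" = million, "B" = billion (if present)
    str_sub(population, -1, -1) == "B" ~ as.numeric(str_sub(population, 1, -2)) * 1000000000,
    str_sub(population, -1, -1) == "M" ~ as.numeric(str_sub(population, 1, -2)) * 1000000,
    str_sub(population, -1, -1) == "k" ~ as.numeric(str_sub(population, 1, -2)) * 1000,
    # Values without a suffix are treated as plain numbers instead of becoming NA
    TRUE ~ suppressWarnings(as.numeric(population))
  ))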

pop_clean <- pop_clean |>
  mutate("dset_type" = "Population")
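One possible way to check that the renaming covers every country is to list the names that appear in a dataset but not in the map table. The sketch below uses two made-up name vectors; in the project, the analogous call would presumably be something like `setdiff(fertility$country, new_world$country)`, which is an assumption about usage rather than code from the original analysis.

```r
# Toy stand-ins for the country names in a dataset and in the map table
data_names <- c("Lao", "United States", "Spain")
map_names <- c("Laos", "United States of America", "Spain")

# Names present in the data with no exact match in the map table;
# these would receive an empty geometry after the left join
unmatched <- setdiff(data_names, map_names)
unmatched
```

Any name returned here either needs a renaming rule or has to be filtered out before joining.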

Visualization Examples

In all our visualizations we will primarily use either one data set or one country as an example, but the same visualizations can be extended to each of the four data sets or any country within that data set.

ggplot(data = fertility_clean |> filter(year == 1920)) +
  geom_sf(aes(geometry = geometry, fill = est_fertility)) +
  scale_fill_distiller(palette = "RdYlGn",
                       direction = 1,
                       limits = c(
                         min(fertility_clean$est_fertility, na.rm = TRUE),
                         max(fertility_clean$est_fertility, na.rm = TRUE)
                       )) +
  theme_void() +
  labs(title = "Average Fertility", 
       subtitle = "For each country in 1920", 
       fill = "Average Children\n per Woman")

Figure 1. Fertility in 1920

ggplot(data = fertility_clean |> filter(year == 1980)) +
  geom_sf(aes(geometry = geometry, fill = est_fertility)) +
  scale_fill_distiller(palette = "RdYlGn",
                       direction = 1,
                       limits = c(
                         min(fertility_clean$est_fertility, na.rm = TRUE),
                         max(fertility_clean$est_fertility, na.rm = TRUE)
                       )) +
  theme_void() +
  labs(title = "Average Fertility", 
       subtitle = "For each country in 1980", 
       fill = "Average Children\n per Woman")

Figure 1 (continued). Fertility in 1980

This visualization is one example of what a user could create using the Shiny application. The first geospatial plot shows the average number of children per woman in every country in 1920. To recreate it, the user would set the year to 1920 using the slider, select fertility as the related factor, and use the ‘World Map’ tab. Moving the slider to 1980 produces the second plot, making it easy to compare the average number of children per woman over time. In this case, the average number of children per woman appears to have decreased from 1920 to 1980.
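To make the slider-and-selector interaction concrete, here is a minimal Shiny sketch. It is not the project's application: the toy data, the input IDs `year` and `country`, and the output ID `snapshot` are all hypothetical, and it renders a table rather than a map to stay short.

```r
library(shiny)

# Toy data standing in for one cleaned factor (values are made up)
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year = rep(c(1900, 1950, 2000), times = 2),
  value = c(5.1, 4.2, 2.3, 6.0, 5.5, 3.1)
)

ui <- fluidPage(
  # Slider for stepping through the available years
  sliderInput("year", "Year", min = 1900, max = 2000,
              value = 1900, step = 50, sep = ""),
  # Drop-down for choosing a country
  selectInput("country", "Country", choices = unique(toy$country)),
  tableOutput("snapshot")
)

server <- function(input, output, session) {
  # Re-runs whenever the slider or the drop-down changes
  output$snapshot <- renderTable({
    toy[toy$year == input$year & toy$country == input$country, ]
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app interactively
```

Replacing the table output with a plot output and a `geom_sf()` plot would give something closer to the map tabs described above.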

ggplot(data = life_clean |> filter(year == 1920)) +
  geom_sf(aes(geometry = geometry, fill = life_exp)) +
  scale_fill_distiller(palette = "RdYlGn",
                       limits = c(
                         min(life_clean$life_exp, na.rm = TRUE),
                         max(life_clean$life_exp, na.rm = TRUE)
                       ),
                       direction = 1) +
  theme_void() +
  labs(title = "Average Life Expectancy", 
       subtitle = "For each country in 1920", 
       fill = "Average Life \nExpectancy in Years")

Figure 2. Life expectancy in 1920

This visualization shows the average life expectancy in each country in 1920.

# Filter to a single country; the color palette can be adjusted as desired
ggplot(data = life_clean |> filter(year == 1920,
                                   country == "Spain")) +
  geom_sf(aes(geometry = geometry, fill = life_exp)) +
  scale_fill_distiller(palette = "RdYlGn",
                       direction = 1,
                       limits = c(
                         min(life_clean$life_exp, na.rm = TRUE),
                         max(life_clean$life_exp, na.rm = TRUE)
                       )) +
  theme_void() +
  labs(title = "Life Expectancy in Spain",
       subtitle = "In 1920", 
       fill = "Average age at death")

Figure 3. Life expectancy in Spain in 1920

This plot shows that the average life expectancy in Spain was around 40 years in 1920. This visualization is not very informative on its own, but users of the Shiny app can use a slider to see how life expectancy (as well as other factors) has changed over time for each country. In the Shiny app, a user would open the ‘Single Country’ tab, select Spain, and move the slider to 1920. They could do the same for a later year to compare how life expectancy has changed in Spain.

The plots above are well suited to intracountry comparison: with a fixed legend, they show whether life expectancy in a particular country has increased, decreased, or remained static over time.

However, some people may want to look at intercountry comparisons. That is, they want to see how a country compares to the world as a whole in terms of, say, life expectancy over time. To support this, we allow the legend to become dynamic, with the minimum and maximum changing as the year changes. The following plots show how comparisons can be made with a dynamically changing legend.

ggplot(data = life_clean |> filter(year == 1980)) +
  geom_sf(aes(geometry = geometry, fill = life_exp)) +
  scale_fill_distiller(palette = "RdYlGn", direction = 1) + 
  labs(title = "Life expectancy in 1980", fill = "Life Expectancy") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank())

ggplot(data = life_clean |> filter(year == 2000)) +
  geom_sf(aes(geometry = geometry, fill = life_exp)) +
  scale_fill_distiller(palette = "RdYlGn", direction = 1) + 
  labs(title = "Life expectancy in 2000", fill = "Life Expectancy") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank())

The visualizations above show the same world map as before, but with a dynamic legend. Take Australia, for example: in 2000 it is colored a darker green than in 1980. This means not only that life expectancy in Australia increased between those years, but also that it increased relative to the rest of the world.
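The difference between the fixed and dynamic legends comes down to which values determine the fill limits. The sketch below illustrates that idea on made-up data; the helper `fill_limits()` and its `dynamic` argument are hypothetical names and are not taken from the project code.

```r
# Toy long-format data (values are made up)
toy <- data.frame(
  country = rep(c("A", "B"), times = 2),
  year = rep(c(1980, 2000), each = 2),
  life_exp = c(60, 70, 68, 80)
)

# Hypothetical helper: compute fill-scale limits for one year's map
fill_limits <- function(data, selected_year, dynamic = FALSE) {
  values <- if (dynamic) {
    # Dynamic legend: limits come from the selected year only
    data$life_exp[data$year == selected_year]
  } else {
    # Fixed legend: limits come from every year in the data
    data$life_exp
  }
  range(values, na.rm = TRUE)
}

fixed_1980 <- fill_limits(toy, 1980)
dynamic_1980 <- fill_limits(toy, 1980, dynamic = TRUE)
```

The returned vector could then be passed as the `limits` argument of `scale_fill_distiller()`, as in the fixed-legend plots earlier in this section.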

datebreaks1 <-
  seq(1880, 2020, by = 10)
min_year = 1990
max_year = 2010
country_name = "Spain"
country_name2 = "Romania"

ggplot(data = life_clean |> filter(country %in% c(country_name, country_name2))) +
  geom_point(
    aes(
      x = year,
      y = life_exp,
      shape = country,
      color = life_exp
      )
    ) +
  labs(title = "Life Expectancy of Spain and Romania", 
       subtitle = "From 1990 to 2010", 
       x = "Year", 
       y = "Life Expectancy",
       shape = "Country",
       color = "Life Expectancy \n(years)") +
  scale_color_distiller(palette = "RdYlGn", direction = 1) +
  coord_cartesian(xlim = c(min_year, max_year))

Figure 4. Comparing life expectancies in Spain and Romania

This plot shows how a user on the Shiny app can visualize a factor for multiple countries using a scatter plot. This makes it easier for users to see the change in a factor over time, as an alternative to using a geo-spatial visualization.

country_name = "Spain"
country_name2 = "France"
min_year = 1900
max_year = 2010

yminyear1 <- life_clean |>
  filter(country == country_name,
         year == min_year) |>
  pull(life_exp)

yminyear2 <- life_clean |>
  filter(country == country_name2,
         year == min_year) |>
  pull(life_exp)

ymaxyear1 <- life_clean |>
  filter(country == country_name,
         year == max_year) |>
  pull(life_exp)

ymaxyear2 <- life_clean |>
  filter(country == country_name2,
         year == max_year) |>
  pull(life_exp)

ymin = min(yminyear1, yminyear2, ymaxyear1, ymaxyear2)
ymax = max(yminyear1, yminyear2, ymaxyear1, ymaxyear2)

ggplot(data = life_clean |> filter(country %in% c(country_name, country_name2))) +
  geom_line(aes(
    x = year,
    y = life_exp,
    group = country,
    color = country
  )) +
  labs(title = "Life Expectancy of Spain and France", 
       subtitle = "From 1900 to 2010", 
       x = "Year", 
       y = "Life Expectancy",
       color = "Country") +
  coord_cartesian(xlim = c(min_year, max_year), ylim = c(ymin - 2, ymax + 2)) +
  theme_minimal() +
  scale_color_viridis_d() 

Figure 5. Comparing life expectancies in France and Spain

Another way to visualize the factors over time in our Shiny app is to plot one factor for multiple countries as a line graph, so that users can compare countries over a period of time. Our Shiny app also lets users view an animation of how a factor such as life expectancy changes over time. That animation takes a long time to render, however, so we did not include it in this write-up; it is available in the Shiny app.

Life expectancy time series regression model

Lastly, we added a time series regression model to the project as our “something we learned outside of class” component. We first tried a linear regression model to predict the data. However, we realized that the plot of a factor such as life expectancy was not linear but closer to sigmoidal. Because life expectancy, fertility, mortality, and population may each be nonlinear over time, we decided that a time series model, specifically an ARIMA (autoregressive integrated moving average) model, would fit the data better.

We first converted our data frame into time series data so that it would work with the ARIMA model.

country_name <- "Spain"
life_clean_ts <- life_clean |>
  filter(country == country_name) |>
  dplyr::select(life_exp)

life_clean_ts <- ts(life_clean_ts, start = 1880, end = 2018, frequency = 1)

Next, we make sure that our data is indeed in the time series form.

is.ts(life_clean_ts)
[1] TRUE
# print(life_clean_ts)
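Because `ts()` attaches years purely by position, the values need to be in chronological order with no missing years before conversion. The following sketch, on made-up values, shows one way to enforce that; the data frame `toy` is hypothetical.

```r
# Toy yearly values, deliberately out of order (values are made up)
toy <- data.frame(
  year = c(1882, 1880, 1881),
  life_exp = c(41.0, 40.2, 40.6)
)

# Sort by year so that position in the vector matches the calendar
toy <- toy[order(toy$year), ]

# ts() assumes evenly spaced observations, so stop if a year is missing
stopifnot(all(diff(toy$year) == 1))

toy_ts <- ts(toy$life_exp, start = min(toy$year), frequency = 1)
toy_ts
```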

Now, we fit our ARIMA model.

AR <- arima(life_clean_ts, order = c(1,0,0))
print(AR)

Call:
arima(x = life_clean_ts, order = c(1, 0, 0))

Coefficients:
         ar1  intercept
      0.9977    56.7898
s.e.  0.0031    23.2020

sigma^2 estimated as 2.907:  log likelihood = -274.08,  aic = 554.16

Finally, we use the ARIMA model to make predictions for future years. Here you can specify the final year for which the model should make predictions. This mirrors our Shiny app, where the user can select the year up to which they want to see predictions.

year_wanted <- 2027

# Forecast one value per year after 2018, the last year in the series
prediction <- predict(AR, n.ahead = year_wanted - 2018)

pred_df <- data.frame(prediction) |>
  mutate(year = 2019:year_wanted,
         life_exp = pred) |>
  dplyr::select(year, life_exp) |>
  kable()

pred_df
year life_exp
2019 83.03873
2020 82.97761
2021 82.91662
2022 82.85578
2023 82.79509
2024 82.73453
2025 82.67411
2026 82.61383
2027 82.55370
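The object returned by `predict()` for an `arima` fit also carries standard errors, which can give a rough sense of how uncertain each forecast is. The sketch below is not part of the original analysis: it fits the same kind of model to a simulated series and builds approximate 95% intervals, assuming normally distributed forecast errors.

```r
set.seed(1)

# Simulated stationary yearly series standing in for one country's data
toy_ts <- ts(70 + as.numeric(arima.sim(model = list(ar = 0.8), n = 139)),
             start = 1880, frequency = 1)

# Same model order as in the write-up: ARIMA(1, 0, 0)
fit <- arima(toy_ts, order = c(1, 0, 0))

# predict() returns point forecasts ($pred) and standard errors ($se)
fc <- predict(fit, n.ahead = 9)

forecast_df <- data.frame(
  year = 2019:2027,
  life_exp = as.numeric(fc$pred),
  # Approximate 95% interval under a normality assumption
  lower = as.numeric(fc$pred - 1.96 * fc$se),
  upper = as.numeric(fc$pred + 1.96 * fc$se)
)

forecast_df
```

The intervals widen as the forecast horizon grows, which is worth keeping in mind when reading predictions several years ahead.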

Discussion

We successfully created a Shiny application that allows users to visualize population, fertility, life expectancy, and mortality data in a customizable manner. Our application allows users to compare countries geospatially using the ‘World Map Visualization’ tab, as well as in the form of a line graph using the ‘Comparing countries’ tab. Users are also able to see predictions of how factors will change in the future using the ‘Prediction’ tab.

We explored the data using our Shiny app. We found that, worldwide, life expectancy and population have increased, while fertility and mortality have decreased. However, the life expectancy of Asian and African countries has consistently been lower than the average life expectancy of American and European countries, while their fertility has been consistently higher. We also noticed drops in the line graphs of life expectancy for European countries, the two most obvious being around 1918 and 1940. We attributed the significant decrease around 1918 to the Spanish flu pandemic, the deadliest pandemic in modern history, and the decrease around 1940 to World War II, which killed millions of people at a young age.
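Drops like these can also be located numerically rather than by eye. As a small sketch on made-up numbers (not the project data), the year with the largest year-over-year decrease can be found from the first differences of the series:

```r
# Toy series with an artificial one-year drop (values are made up)
toy <- data.frame(
  year = 1915:1921,
  life_exp = c(45, 45.5, 46, 30, 44, 46.5, 47)
)

# Year-over-year change; element i compares year i + 1 with year i
change <- diff(toy$life_exp)

# Year in which the largest decrease occurred
drop_year <- toy$year[-1][which.min(change)]
drop_year
```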

There are a couple of limitations in our application. Some factors have missing data for certain countries, which affects the trends and predictions in the application. Additionally, the datasets we used stop at 2018, and it would be interesting to see how trends in mortality, population, life expectancy, and fertility have changed in recent years. In particular, it would be interesting to see whether and how the effects of the Covid-19 pandemic appear in our graphs. More recent data would also allow our predictions to be more accurate. There are many directions in which future work on our Shiny app could go. We could create functions that allow multiple factors to be compared on one plot, rather than limiting comparisons and graphs to one factor. We could also include more factors, such as age of child-bearing, carbon emissions, or income. Adding more factors would give users further customizability and information.