One of the most important skills for a data scientist is data presentation. The countless hours of behind the scenes work: aggregating the data, tidying it, performing statistical analyses, etc is wasted if the conclusions drawn from this work cannot be conveyed properly to a general audience. It is for this reason that data visualization is such a rapidly growing field, for as they say: “a picture is worth a thousand words”.
This is all well and good when talking about trends over time for a single variable, approval ratings or other simple datasets that can be presented in the form of bar charts, scatter plots or box plots. What does one do when the goal is to compare data from various countries?
In this case, it is useful to use a combination of the package ggplot2 from the TidyVerse along with a few other packages containing geographical data to plot a world map. (Additional package inspiration from https://www.r-spatial.org/r/2018/10/25/ggplot2-sf.html)
Let’s begin by loading the tidyverse and additional required packages:
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(rnaturalearth)
library(rnaturalearthdata)
library(rgeos)
## Loading required package: sp
## rgeos version: 0.5-2, (SVN revision 621)
## GEOS runtime version: 3.6.1-CAPI-1.10.1
## Linking to sp version: 1.3-2
## Polygon checking: TRUE
Following the approach outlined in https://www.r-spatial.org/r/2018/10/25/ggplot2-sf.html we can plot the map and zoom into a certain area. The For the data we will be looking at what we want is Europe.
Start by pulling the data for all countries using the ne_countries
function from the package rnaturalearth
. The return class is driven by what ggplot2
expects as an input, which is the sf class.
Specify the data selected above as the input for ggplot
(a core function within the ggplot2
package).
geom_sf
can be applied to a ggplot
, assuming that geometric information is stored in a column titled geometry within the dataset. In this case that means that geom_sf is pulling the information contained in world$geometry. The output of geom_sf
is dependent on the information contained within the geometry column. In this case, the geometry column contains complex plygon coordinates for all of the countries on the world map based on latitude and longitude.
Finally, narrow the x and y limits of the plot to coincide with the outer reaches of the area you want to focus on. In this case this is the latidude and longitude ranges encompassing mainland Europe.
world <- ne_countries(scale = "medium", returnclass = "sf")
Europe <- world[which(world$continent == "Europe"),]
ggplot(Europe) +
geom_sf() +
coord_sf(xlim = c(-25,50), ylim = c(35,70), expand = FALSE)
Sample Map of Europe
We now need a set of data to work with. Here is a collection of data from the European Social Survey (ESS) pulled from Kaggle.com (https://www.kaggle.com/pascalbliem/european-social-survey-ess-8-ed21-201617/data#) with responses to a survey conducted throughout Europe. The data contains 44387 responses collected throughout 23 countries.
data <- read.csv("ESS8e02.1_F1.csv")
Let’s subset the dataset to a few variables we want to plot on the map to make it more manageable:
sub_data <- data[,c("idno","cntry","polintr","trstplc","happy","rlgblg")]
Split out data and take out erroneous responses (“Do Not Know”, “Refusal”, “No Answer”, etc.)
PolInt <- sub_data[which(sub_data$polintr < 5),c(1:3)]
PoliceTrust <- sub_data[which(sub_data$trstplc < 11),c(1,2,4)]
Happy <- sub_data[which(sub_data$happy < 11),c(1,2,5)]
Religious <- sub_data[which(sub_data$rlgblg < 3),c(1,2,6)]
Let’s combine data to look at averages of the responses to the chosen questions:
PolInt_df <- PolInt %>%
group_by(cntry) %>%
summarise(PolitIntr = mean(polintr), n = n())
Police_df <- PoliceTrust %>%
group_by(cntry) %>%
summarise(PolTrust = mean(trstplc), n = n())
Happy_df <- Happy %>%
group_by(cntry) %>%
summarise(Happy = mean(happy), n = n())
Religious_df <- Religious %>%
group_by(cntry) %>%
summarise(Religious = mean(rlgblg), n = n())
Results <- data.frame(PolInt_df$cntry, PolInt_df$PolitIntr,Police_df$PolTrust, Happy_df$Happy, Religious_df$Religious)
#Remove Israel as it is not part of Europe.
Results <- Results[-c(13),]
names(Results) <- c("Country", "PolitInt", "PoliceTrst", "Happy", "Religious")
Change the abbreviated names of coutnries from the survey to the full names used by the rnaturalearth
dataset
abv <- c("AT","BE","CH","CZ","DE","EE","ES","FI","FR","GB","HU","IE","IS","IT","LT","NL","NO","PL","PT","RU","SE","SI" )
repl <- c("Austria","Belgium","Switzerland","Czech Rep.","Germany", "Estonia", "Spain", "Finland", "France", "United Kingdom", "Hungary", "Ireland", "Iceland", "Italy", "Lithuania", "Netherlands","Norway", "Poland", "Portugal", "Russia", "Sweden" , "Slovenia")
replacement_df <- data.frame(abv = abv, rep = repl)
fix_name <- function(abbv){
for (i in 1:23){
if (abbv == replacement_df[i,1]){
return (replacement_df[i,2])
}
}
}
Results[,"Country"] <- sapply(Results[,"Country"],fix_name)
Let’s combine our data into the full Europe dataset and get plotting
for (i in 1:nrow(Results)){
row <- which(Europe$name == Results[i,"Country"])
Europe[row,"PolitInt"] <- Results[i,"PolitInt"]
Europe[row,"PoliceTrst"] <- Results[i,"PoliceTrst"]
Europe[row,"Happy"] <- Results[i,"Happy"]
Europe[row,"Religious"] <- Results[i,"Religious"]
}
Now we can look at the ways ggplot can be used to visualize data. One can provide the geom_sf
with aesthetic mappings, which specify how exactly to present data. To do this, one must specify aes()
within the geom_sf function.
aes()
accepts three color related aesthetics:
1. aes(color)
The color aesthetic assigns a value to the geometries drawn by geom_sf
based on the value associated with each observation. In our case this means the outline of the country will be colored based on the value we specify.
Let’s see how that looks when we assign the outlines values based on the interest in politics within each surveyed country:
politic_plot <- ggplot(Europe) +
geom_sf(aes(color = PolitInt)) +
coord_sf(xlim = c(-25,50), ylim = c(35,70), expand = FALSE)
Map of Europe with country outlines colored based on interest in politics
2. aes(alpha)
The alpha aesthetic assigns a value to the transparency of the fill of each geometry. For us this means how transparent each country appears on the map. Let’s take a look at this using the Polie Trust level:
police_plot <- ggplot(Europe) +
geom_sf(aes(alpha = PoliceTrst)) +
coord_sf(xlim = c(-25,50), ylim = c(35,70), expand = FALSE)
Map of Europe with transparency of country based on trust in the police
3. aes(fill)
The fill aesthetic is especially helpful when working with maps. It assigns the fill of the geometry basd on the value associated with each one. We can plot the Happiness data using this aesthetic to see which countries’ respondents consider themselves happier on average:
happy_plot <- ggplot(Europe) +
geom_sf(aes(fill = Happy))
coord_sf(xlim = c(-25,50), ylim = c(35,70), expand = FALSE)
## <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
## aspect: function
## backtransform_range: function
## clip: on
## crs: NULL
## datum: crs
## default: FALSE
## distance: function
## expand: FALSE
## fixup_graticule_labels: function
## is_free: function
## is_linear: function
## label_axes: list
## label_graticule:
## labels: function
## limits: list
## modify_scales: function
## ndiscr: 100
## range: function
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## setup_data: function
## setup_layout: function
## setup_panel_params: function
## setup_params: function
## transform: function
## super: <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
Map of Europe with country fill-in colored based on overall happiness
If we want, we can specify what colors to use for the fill to customize our map a little. Here is an example of applying a built in gradient (gradient2
) to the fill applied to our map.
Map of Europe with country fill-in color based on overall happiness
Other built-in color options include:
Now let’s take a look at a way to specify the colors we want used while switching over to the religious index data.
religious_plot <- ggplot(Europe) +
geom_sf(aes(fill = Religious)) +
coord_sf(xlim = c(-25,50), ylim = c(35,70), expand = FALSE)
Applying the style scale_fill_gradient
allows us to specify the two extreme colors and let R fill in the gradient for us.
In this case let’s create a gradient between orange for the low values and blue or high.
religious_plot + scale_fill_gradient(low = "orange", high = "blue")
Map of Europe with country fill-in color based on religious index
When looking to visualize data from various countries, overlaying onto a map can be a very effective way. Using the rnaturalearth
package coupled with the geom_sf
capability within ggplot2
allows one to plot any part of the world map. Collects data can then be assigned to each color and visualized as outlines, fills and transparencies based on the color
, fill
and alpha
aesthetics available for within aes()
option. There are countless ways to customize the fills and colors based on what the analyst is attempting to convey.