Introduction

I would like to talk about my favourite ggplot component, called “geom_sf”.

This set of geom, stat, and coord are used to visualise simple feature (sf) objects. For simple plots, you will only need geom_sf() as it uses stat_sf() and adds coord_sf() for you. geom_sf() is an unusual geom because it will draw different geometric objects depending on what simple features are present in the data: you can get points, lines, or polygons. For text and labels, you can use geom_sf_text() and geom_sf_label(). (Source I)
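To see the "different geometric objects" idea in action, here is a minimal sketch. The three tiny sf objects (pts, lines, polys) are made up for illustration and are not part of the data used later; the same geom_sf() call draws each one according to its geometry type.

```r
library(ggplot2)
library(sf)

# Hypothetical toy features, one sf object per geometry type (lon/lat, WGS 84)
pts <- st_sf(geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), crs = 4326))

lines <- st_sf(geometry = st_sfc(
  st_linestring(rbind(c(2, 0), c(3, 1), c(4, 0))),
  crs = 4326
))

polys <- st_sf(geometry = st_sfc(
  st_polygon(list(rbind(c(5, 0), c(7, 0), c(6, 1), c(5, 0)))),
  crs = 4326
))

# The same geom draws polygons, lines, or points depending on the data
p_shapes <- ggplot() +
  geom_sf(data = polys, fill = "grey90") +
  geom_sf(data = lines, colour = "blue") +
  geom_sf(data = pts, colour = "red", size = 3)

p_shapes
```

Note that coord_sf() is never called here: geom_sf() adds it for you, as the quoted documentation says.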


Content Overview

I don’t think I’ll reveal anything new if I tell you that the Earth is not flat but a globe. Representing it in two dimensions is always tricky, and the result depends on the coordinate system you choose. For example, if you peel an orange, the peel won’t lie flat when you place it on a horizontal surface, right? The exact same idea applies here.

“Projection is the process of making a globe, or a portion of it, fit into a flat picture. When you make that transformation, some features will always be distorted; some distances, shapes, and areas will be stretched, and others will be compressed. There’s no 100 percent accurate representation of a globe other than a globe itself.” (Source II)

[Source II]
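The distortion in the quote can be measured. The sketch below is my own illustration, not something taken from Source II: it builds two 10-by-10 degree boxes (the helper box() is hypothetical), one at the equator and one far north, and compares their area on the globe with their flat area after projecting to Mercator (EPSG:3395). It assumes a recent version of sf with its default spherical geometry engine.

```r
library(sf)

# Hypothetical helper: a 10-by-10 degree box whose lower edge sits at lat_min
box <- function(lat_min) {
  st_polygon(list(rbind(
    c(0, lat_min), c(10, lat_min),
    c(10, lat_min + 10), c(0, lat_min + 10),
    c(0, lat_min)
  )))
}

boxes <- st_sfc(box(0), box(60), crs = 4326)

# Area measured on the globe vs. flat area after projecting to Mercator
true_area     <- as.numeric(st_area(boxes))
mercator_area <- as.numeric(st_area(st_transform(boxes, 3395)))

# How much Mercator inflates each box (first: equator, second: far north)
ratio <- mercator_area / true_area
ratio
```

The ratio stays close to 1 at the equator and grows a lot for the northern box, which is why high-latitude countries look so big on a Mercator map.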


Why You Should Care

This topic is valuable because… come on, the work speaks for itself. ggplot lets you build a plot layer by layer, adding features that help you make the best visualization possible. geom_sf, in particular, gives you a blank globe to play with. So keep going through this code-through and check it out.


Learning Objectives

Specifically, you’ll learn how to:
1. Plot the simplest world map
2. Add a coordinate system to make it more realistic
3. Add some color

Some more advanced features
4. Add more layers to your graph with your favourite data
5. Let’s reduce layers and focus on one state

Bonus
6. Annotate function to add texts
7. Annotate function to add arrows


Usage

Here is how the functions are defined:

coord_sf(
  xlim = NULL,
  ylim = NULL,
  expand = TRUE,
  crs = NULL,
  default_crs = NULL,
  datum = sf::st_crs(4326),
  label_graticule = waiver(),
  label_axes = waiver(),
  lims_method = c("cross", "box", "orthogonal", "geometry_bbox"),
  ndiscr = 100,
  default = FALSE,
  clip = "on"
)

geom_sf(
  mapping = aes(),
  data = NULL,
  stat = "sf",
  position = "identity",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)

geom_sf_label(
  mapping = aes(),
  data = NULL,
  stat = "sf_coordinates",
  position = "identity",
  ...,
  parse = FALSE,
  nudge_x = 0,
  nudge_y = 0,
  label.padding = unit(0.25, "lines"),
  label.r = unit(0.15, "lines"),
  label.size = 0.25,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  fun.geometry = NULL
)

geom_sf_text(
  mapping = aes(),
  data = NULL,
  stat = "sf_coordinates",
  position = "identity",
  ...,
  parse = FALSE,
  nudge_x = 0,
  nudge_y = 0,
  check_overlap = FALSE,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  fun.geometry = NULL
)

stat_sf(
  mapping = NULL,
  data = NULL,
  geom = "rect",
  position = "identity",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)
[Source I]


Arguments

xlim, ylim: Limits for the x and y axes. These limits are specified in the units of the default CRS. By default, this means projected coordinates (default_crs = NULL). How limit specifications translate into the exact region shown on the plot can be confusing when non-linear or rotated coordinate systems are used as the default crs. First, different methods can be preferable under different conditions. See parameter lims_method for details. Second, specifying limits along only one direction can affect the automatically generated limits along the other direction. Therefore, it is best to always specify limits for both x and y. Third, specifying limits via position scales or xlim()/ylim() is strongly discouraged, as it can result in data points being dropped from the plot even though they would be visible in the final plot region.

expand: if TRUE, the default, adds a small expansion factor to the limits to ensure that data and axes don’t overlap. If FALSE, limits are taken exactly from the data or xlim/ylim.

crs: The coordinate reference system (CRS) into which all data should be projected before plotting. If not specified, will use the CRS defined in the first sf layer of the plot.

default_crs: The default CRS to be used for non-sf layers (which don’t carry any CRS information) and scale limits. The default value of NULL means that the setting for crs is used. This implies that all non-sf layers and scale limits are assumed to be specified in projected coordinates. A useful alternative setting is default_crs = sf::st_crs(4326), which means x and y positions are interpreted as longitude and latitude, respectively, in the World Geodetic System 1984 (WGS84).

datum: CRS that provides datum to use when generating graticules.

label_graticule: Character vector indicating which graticule lines should be labeled where. Meridians run north-south, and the letters “N” and “S” indicate that they should be labeled on their north or south end points, respectively. Parallels run east-west, and the letters “E” and “W” indicate that they should be labeled on their east or west end points, respectively. Thus, label_graticule = “SW” would label meridians at their south end and parallels at their west end, whereas label_graticule = “EW” would label parallels at both ends and meridians not at all. Because meridians and parallels can in general intersect with any side of the plot panel, for any choice of label_graticule labels are not guaranteed to reside on only one particular side of the plot panel. Also, label_graticule can cause labeling artifacts, in particular if a graticule line coincides with the edge of the plot panel. In such circumstances, label_axes will generally yield better results and should be used instead. This parameter can be used alone or in combination with label_axes.

label_axes: Character vector or named list of character values specifying which graticule lines (meridians or parallels) should be labeled on which side of the plot. Meridians are indicated by “E” (for East) and parallels by “N” (for North). Default is “--EN”, which specifies (clockwise from the top) no labels on the top, none on the right, meridians on the bottom, and parallels on the left. Alternatively, this setting could have been specified with list(bottom = “E”, left = “N”). This parameter can be used alone or in combination with label_graticule.

lims_method: Method specifying how scale limits are converted into limits on the plot region. Has no effect when default_crs = NULL. For a very non-linear CRS (e.g., a perspective centered around the North pole), the available methods yield widely differing results, and you may want to try various options. Methods currently implemented include “cross” (the default), “box”, “orthogonal”, and “geometry_bbox”. For method “cross”, limits along one direction (e.g., longitude) are applied at the midpoint of the other direction (e.g., latitude). This method avoids excessively large limits for rotated coordinate systems but means that sometimes limits need to be expanded a little further if extreme data points are to be included in the final plot region. By contrast, for method “box”, a box is generated out of the limits along both directions, and then limits in projected coordinates are chosen such that the entire box is visible. This method can yield plot regions that are too large. Finally, method “orthogonal” applies limits separately along each axis, and method “geometry_bbox” ignores all limit information except the bounding boxes of any objects in the geometry aesthetic.

ndiscr: Number of segments to use for discretising graticule lines; try increasing this number when graticules look incorrect.

default: Is this the default coordinate system? If FALSE (the default), then replacing this coordinate system with another one creates a message alerting the user that the coordinate system is being replaced. If TRUE, that warning is suppressed.

clip: Should drawing be clipped to the extent of the plot panel? A setting of “on” (the default) means yes, and a setting of “off” means no. In most cases, the default of “on” should not be changed, as setting clip = “off” can cause unexpected results. It allows drawing of data points anywhere on the plot, including in the plot margins. If limits are set via xlim and ylim and some data points fall outside those limits, then those data points may show up in places such as the axes, the legend, the plot title, or the plot margins.

mapping: Set of aesthetic mappings created by aes() or aes_(). If specified and inherit.aes = TRUE (the default), it is combined with the default mapping at the top level of the plot. You must supply mapping if there is no plot mapping.

data: The data to be displayed in this layer. There are three options:
1) if NULL, the default, the data is inherited from the plot data as specified in the call to ggplot().
2) A data.frame, or other object, will override the plot data. All objects will be fortified to produce a data frame. See fortify() for which variables will be created.
3) A function will be called with a single argument, the plot data. The return value must be a data.frame, and will be used as the layer data. A function can be created from a formula (e.g. ~ head(.x, 10)).

stat: The statistical transformation to use on the data for this layer, as a string.

position: Position adjustment, either as a string, or the result of a call to a position adjustment function.

na.rm: If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.

show.legend: logical. Should this layer be included in the legends? NA, the default, includes if any aesthetics are mapped. FALSE never includes, and TRUE always includes. You can also set this to one of “polygon”, “line”, and “point” to override the default legend.

inherit.aes: If FALSE, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn’t inherit behaviour from the default plot specification, e.g. borders().

...: Other arguments passed on to layer(). These are often aesthetics, used to set an aesthetic to a fixed value, like colour = “red” or size = 3. They may also be parameters to the paired geom/stat.

parse: If TRUE, the labels will be parsed into expressions and displayed as described in ?plotmath.

nudge_x, nudge_y: Horizontal and vertical adjustment to nudge labels by. Useful for offsetting text from points, particularly on discrete scales. Cannot be jointly specified with position.

label.padding: Amount of padding around label. Defaults to 0.25 lines.

label.r: Radius of rounded corners. Defaults to 0.15 lines.

label.size: Size of label border, in mm.

fun.geometry: A function that takes a sfc object and returns a sfc_POINT with the same length as the input. If NULL, function(x) sf::st_point_on_surface(sf::st_zm(x)) will be used. Note that the function may warn about the incorrectness of the result if the data is not projected, but you can ignore this except when you really care about the exact locations.

check_overlap: If TRUE, text that overlaps previous text in the same layer will not be plotted. check_overlap happens at draw time and in the order of the data. Therefore data should be arranged by the label column before calling geom_text(). Note that this argument is not supported by geom_label().

geom: The geometric object to use to display the data.

[Source I]
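That is a lot of arguments, so here is a small sketch of how a few of them fit together. It is only an illustration under my own assumptions: it uses the North Carolina counties shapefile that ships with the sf package (not the data used later), picks EPSG:32119 as the projected CRS, and the zoom limits are values I chose by eye.

```r
library(ggplot2)
library(sf)

# Example counties that ship with the sf package
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

p_zoom <- ggplot(nc) +
  geom_sf(fill = "white") +
  geom_sf_text(aes(label = NAME),
               size = 2,
               check_overlap = TRUE) +        # skip labels that would overlap earlier ones
  coord_sf(
    crs         = st_crs(32119),              # project everything to NAD83 / North Carolina
    default_crs = st_crs(4326),               # read the limits below as longitude/latitude
    xlim        = c(-81, -78),                # set limits on both axes, as recommended above
    ylim        = c(35, 36.5),
    expand      = FALSE,                      # take the limits exactly, no padding
    label_axes  = list(bottom = "E", left = "N")
  )

p_zoom
```

Because default_crs is set to WGS 84, the limits can be written in plain longitude and latitude even though the map itself is drawn in projected coordinates. geom_sf_text() may warn that label positions are approximate for unprojected data, which the fun.geometry note above says you can usually ignore.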


Further Exposition

geom_sf builds on the ggplot package. ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden. [Source I]


Basic Examples

Here are some examples of the types of maps you could work with. I recommend working with the Robinson projection because, from my point of view, it is the most realistic one. However, the other options are NOT wrong, since each has its particular use depending on your goal.

Here are some common coordinate reference systems (CRS) and the projections they correspond to:

ESRI:54002: Equidistant cylindrical projection for the world
EPSG:3395: Mercator projection for the world
ESRI:54008: Sinusoidal projection for the world
ESRI:54009: Mollweide projection for the world
ESRI:54030: Robinson projection for the world
EPSG:4326: WGS 84: DOD GPS coordinates (standard −180 to 180 system)
EPSG:4269: NAD 83: Relatively common projection for North America
ESRI:102003: Albers projection specifically for the contiguous United States

I am pretty sure there are more, but these are a few to consider.
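If you want to check what kind of system a code refers to, sf can look it up. This is a small sketch of my own: the named vector crs_codes simply repeats the list above, and it assumes your PROJ installation knows the ESRI codes (recent versions do).

```r
library(sf)

# The codes from the list above, named by projection
crs_codes <- c(
  "Equidistant cylindrical" = "ESRI:54002",
  "Mercator"                = "EPSG:3395",
  "Sinusoidal"              = "ESRI:54008",
  "Mollweide"               = "ESRI:54009",
  "Robinson"                = "ESRI:54030",
  "WGS 84"                  = "EPSG:4326",
  "NAD 83"                  = "EPSG:4269",
  "Albers (contiguous US)"  = "ESRI:102003"
)

# TRUE for geographic (longitude/latitude) systems, FALSE for projected ones
is_geographic <- vapply(
  crs_codes,
  function(code) st_is_longlat(st_crs(code)),
  logical(1)
)

is_geographic
```

WGS 84 and NAD 83 store plain longitude and latitude, while the others flatten the globe in different ways. Any of these codes can be passed to coord_sf(crs = st_crs(...)).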

For the sake of this class, I will get rid of Antarctica. Why?! Because I can.

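library(dplyr)                                   # filter() and the pipe
library(ggplot2)                                 # plotting
library(sf)                                      # simple features support

map <- world.map %>%                             # world.map: sf object of world countries, assumed already loaded
  filter(SOV_A3 != 'ATA')                        # get rid of Antarctica, bye penguins!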


ggplot()+
  geom_sf(data = map)+                                 
  labs(title = 'Simplest world map')+                  # add a title
  theme(plot.title = element_text(face = "bold"))      # make the title bold

ggplot()+
  geom_sf(data = map)+
  labs(title = 'Robinson coordinates')+
  theme(plot.title = element_text(face = "bold"))+
  coord_sf(crs = st_crs("ESRI:54030"))                 # Robinson projection as the coordinate reference system

ggplot()+
  geom_sf(data = map,
          fill = 'forestgreen',                   # color of the countries
          color = 'black')+                       # color of the border of the countries
  labs(title = 'Robinson coordinates',
       subtitle = "Let's add some color to it")+
  theme(plot.title = element_text(face = "bold"),
        panel.background = element_rect(fill = "paleturquoise1"))+   # background color
  coord_sf(crs = st_crs("ESRI:54030"))            # Robinson projection again


Advanced Examples

Now that we know the basics, we can go crazy and try to do some real stuff. I would like to mention that this is my first R class, but I have watched thousands of YouTube videos and I kinda fell in love with the software. Why am I saying this? For two main reasons:
1) What I do could and may be wrong.
2) It could look WAY better.

With that being said, here we go: let’s map where the famous fast-food chain McDonald’s is located. For the sake of the project, and because it was the data I found, this information covers December 2018 to May 2019. Moreover, let’s focus only on the US (without Alaska, Hawaii, and PR) because, according to the movies, it is the only country in the world where all the bad things happen.

# USA
usa <- usa.country %>%                                      # USA map 
  filter(!(NAME %in% c('Puerto Rico', 'Alaska', 'Hawaii'))) # get rid of some states and mi gente latina (joke... I am latino)

# McDonald's
mcdonalds <- restaurants %>%                       # fast food data
  filter(name == "McDonald's") %>%                 # filter by McDonald's
  filter(!(province %in% c('AK', 'HI'))) %>%       # clean data by filtering
  st_as_sf(coords = c("longitude", "latitude"),    # st_as_sf builds the geometry column from the coordinates
           crs = st_crs("EPSG:4326"))              # the raw coordinates are longitude/latitude in WGS 84

ggplot() +
  geom_sf(data = usa,                     # USA layer
          fill = 'white')+                # fill out the states in white
  geom_sf(data = mcdonalds,               # McDonald's layer
          alpha = .4)+                    # make the dots a little more transparent
  coord_sf(crs = st_crs("ESRI:102003"))+  # Albers projection for the contiguous US
  
  labs(title = "McDonald's in USA",                             # title
       subtitle = "Data from December 2018 to May 2019",        # subtitle
       caption = "Source: Kaggle.com") +                        # caption
  
  theme(plot.caption = element_text(face = "italic"),           # make the caption italic
    plot.title = element_text(face = "bold"),                   # make the title bold
    legend.position = "none")                                   # get rid of the legend

nrow(mcdonalds)                            # for fun, how many McDonald's are there?
## [1] 751

There were 751 McDonald’s in the US from December 2018 to May 2019 according to the data.
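The key step above is st_as_sf(), which turns an ordinary table with longitude and latitude columns into an sf object. Here is a tiny self-contained sketch of that step. The places data frame and its coordinates are invented for illustration and have nothing to do with the Kaggle data.

```r
library(sf)

# Hypothetical locations, invented for illustration
places <- data.frame(
  name      = c("A", "B", "C"),
  longitude = c(-84.39, -81.09, -83.63),
  latitude  = c(33.75, 32.08, 32.84)
)

# Build the geometry column; the raw numbers are lon/lat, so the CRS is WGS 84
places_sf <- st_as_sf(places,
                      coords = c("longitude", "latitude"),
                      crs = 4326)

# Re-project afterwards if you need projected coordinates (here: Albers for the contiguous US)
places_albers <- st_transform(places_sf, "ESRI:102003")

places_sf
```

The crs given to st_as_sf() should describe the numbers you already have, not the projection you want to see. The projection for display goes in coord_sf() or st_transform().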


What else can it be used for?


Let your imagination fly! You can download a dataset you are curious about from Google or Kaggle and enjoy playing with the world map.

To try to gain some bonus points, let’s focus on the state of Georgia to keep showing you how cool this is:

# Georgia
ga <- counties %>%                   # counties: sf object of US counties, assumed already loaded
  filter(STATEFP == 13)              # filtering by Georgia

mc.ga <- mcdonalds %>%               # McDonald's data
  filter(province == 'GA')           # McDonald's in Georgia only

ggplot() +
  geom_sf(data = ga,
          fill = 'white',
          size = .5)+                
  geom_sf(data = mc.ga,
          alpha = .8,
          size = 2,                                          # size of dots
          aes(color = city == 'Atlanta'))+                   # map color to whether the city is Atlanta
  scale_color_manual(values = c('black', 'red'))+            # all black, Atlanta in red
  
  labs(title = "McDonald's in Georgia",
       subtitle = "Data from December 2018 to May 2019",
       caption = "Source: Kaggle.com",
       x = '',
       y = '')+
  
  theme(plot.caption = element_text(face = "italic"),
    plot.title = element_text(face = "bold"),
    panel.background = element_rect(fill = NA),
    legend.position = "none")+

  annotate(geom = "text",                                   # text
           x = -82,
           y = 34.5,
           size = 10,
           label = 'City of\nAtlanta\nUnited!',
           hjust = 'left',
           fontface = 'bold')+

  annotate(geom = "curve",                                  # arrow
           x = -82.1,
           xend = -84.2,
           y = 34.5,
           yend = 33.9,
           curvature = .3,
           color = 'red',
           size = 1.3,
           arrow = arrow(length = unit(3, 'mm')))

nrow(mc.ga)                           # for fun again, how many McDonald's are there?
## [1] 34

Huh… there are only 34. Keep in mind that this data covers December 2018 to May 2019 and that we are trusting the data I found. Again, this is just for the class.
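One detail worth isolating is the highlight trick: mapping color to a TRUE/FALSE condition. Below is a minimal sketch with invented points (the shops object and its coordinates are hypothetical). Naming the values in scale_color_manual() is my own suggestion, so the colors do not depend on the order of the levels.

```r
library(ggplot2)
library(sf)

# Hypothetical shop locations, invented for illustration
shops <- st_as_sf(
  data.frame(
    city = c("Atlanta", "Atlanta", "Savannah", "Macon"),
    lon  = c(-84.39, -84.35, -81.09, -83.63),
    lat  = c(33.75, 33.80, 32.08, 32.84)
  ),
  coords = c("lon", "lat"),
  crs = 4326
)

p_highlight <- ggplot(shops) +
  geom_sf(aes(color = city == "Atlanta"), size = 3) +      # TRUE for Atlanta, FALSE otherwise
  scale_color_manual(values = c(`TRUE` = "red", `FALSE` = "black"),
                     guide = "none") +                     # named values, no legend
  annotate(geom = "text", x = -83.5, y = 34.3,
           label = "Atlanta", hjust = "left") +            # position given in lon/lat
  coord_sf(default_crs = st_crs(4326))                     # read annotate() positions as lon/lat

p_highlight
```

Setting default_crs means the annotation coordinates stay in longitude and latitude even if you later switch the map to a projected CRS.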



Story

As shown, geom_sf is a ggplot layer and, from my point of view, one of the best tools you could use to produce an excellent infographic. Some of the examples I have seen during my research covered immigration, pollution, inflation, and travel, among others… Come on, the ideas are countless!

In this case, we could have colored the dots by sales per McDonald’s, shown the most visited McDonald’s with a bubble-size layer, or colored them by age or by region. As mentioned, countless ideas. However, the information I found was very limited, so we ended up with a boring plain map. So why don’t you try it? Find a good dataset on the internet and play with colors and data.
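To give you a starting point for those ideas, here is a rough sketch of the bubble version. Everything in the stores object is made up (the visits and region columns do not exist in the data I used), so treat it as a template to adapt once you have real numbers.

```r
library(ggplot2)
library(sf)

# Hypothetical stores with invented attributes
stores <- st_as_sf(
  data.frame(
    region = c("North", "North", "South", "South"),
    visits = c(1200, 300, 800, 2000),
    lon    = c(-84.39, -83.38, -81.09, -83.63),
    lat    = c(33.75, 33.95, 32.08, 32.84)
  ),
  coords = c("lon", "lat"),
  crs = 4326
)

p_bubbles <- ggplot(stores) +
  geom_sf(aes(size = visits, color = region), alpha = .7) +  # bubble size and color from the data
  scale_size_area(max_size = 8) +                            # bubble area proportional to visits
  labs(size = "Visits", color = "Region")

p_bubbles
```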


So, go ahead and try it! Good luck, mate!







Further Resources

Learn about the different coordinate systems and compare them by clicking here
Want to add text and arrows with the annotate function? More info



Works Cited

This code through references and cites the following sources:


  • Hadley Wickham, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington. (Unknown year). Source I: click here

  • Alberto Cairo (2016). Source II: click here