Visualising Geographic Data

There are a number of really interesting data sets that provide data points with geographic coordinates. Some you may find interesting include the NASA Fires data or the Global Terrorism Database. In this Vignette, I explain some techniques that can be used to display data with geographic coordinates: Longitude and Latitude.

There are two slightly different techniques, both using elements of the tidyverse, that can produce some nice-looking maps efficiently. One uses ggplot and the other uses ggmap, which shares the same syntax as ggplot. Using these techniques we are able to present data in an intuitive, easy-to-understand way.

United States Libraries Data

For this analysis we will use data from AggData, a website with lots of interesting locational data, some free and some paid. One free file available on the website is the location of United States libraries from 2005, found here. Unfortunately we are unable to import it into R using an API, so we have to download the csv file separately. Save the file to your working directory, then import it using the readr package. We use different packages throughout this Vignette, but load them as we go so that it's clear why we use each one. Remember to install the package using install.packages("readr") if you haven't already.

library(readr)
## Warning: package 'readr' was built under R version 3.2.5
libraries <- read_csv("public_libraries.csv")
## Parsed with column specification:
## cols(
##   `Location Number` = col_character(),
##   `Location Name` = col_character(),
##   `Location Type` = col_character(),
##   Address = col_character(),
##   City = col_character(),
##   State = col_character(),
##   `Zip Code` = col_integer(),
##   `Phone Number` = col_character(),
##   County = col_character(),
##   Latitude = col_double(),
##   Longitude = col_double(),
##   Accuracy = col_character()
## )

Now that we have our data we can start to look at it. To see the first few rows of our data, use head(); it gives a good overview of what the data looks like. We can also run a summary using summary().

head(libraries)
## # A tibble: 6 x 12
##   `Location Number` `Location Name`   `Location Type` Address  City  State
##   <chr>             <chr>             <chr>           <chr>    <chr> <chr>
## 1 AK0035-003        Mendenhall Valle… Branch Library  9105 Me… #134… AK   
## 2 AK0095            Akiachak School/… Library System  1 Main … Akia… AK   
## 3 AK0095-002        Akiachak School/… Central Library 1 Main … Akia… AK   
## 4 AK0094-005        Nunamiut School/… Branch Library  114 Ill… Anak… AK   
## 5 AK0001-002        Anchor Point Pub… Central Library 72251 M… Anch… AK   
## 6 AK0001            Anchor Point Pub… Library System  72251 M… Anch… AK   
## # ... with 6 more variables: `Zip Code` <int>, `Phone Number` <chr>,
## #   County <chr>, Latitude <dbl>, Longitude <dbl>, Accuracy <chr>
summary(libraries)
##  Location Number    Location Name      Location Type     
##  Length:26526       Length:26526       Length:26526      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    Address              City              State              Zip Code    
##  Length:26526       Length:26526       Length:26526       Min.   :  802  
##  Class :character   Class :character   Class :character   1st Qu.:21064  
##  Mode  :character   Mode  :character   Mode  :character   Median :48890  
##                                                           Mean   :47250  
##                                                           3rd Qu.:68839  
##                                                           Max.   :99950  
##  Phone Number          County             Latitude       Longitude      
##  Length:26526       Length:26526       Min.   :13.26   Min.   :-170.28  
##  Class :character   Class :character   1st Qu.:36.23   1st Qu.: -96.63  
##  Mode  :character   Mode  :character   Median :40.51   Median : -87.64  
##                                        Mean   :39.42   Mean   : -89.75  
##                                        3rd Qu.:42.55   3rd Qu.: -78.46  
##                                        Max.   :71.30   Max.   : 145.75  
##    Accuracy        
##  Length:26526      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

We have information about the location of 26,526 libraries across the United States. We will use the Latitude and Longitude variables for our visualisation, but we can also use some of the other information to get a better idea of the distribution of libraries across the United States. We can use the dplyr package for this, but first we should clean up the column names that contain spaces. This will make it easier to refer to these variables throughout our analysis.

#Rename column names 
colnames(libraries)[1] <- "Location_Number"
colnames(libraries)[2] <- "Name"
colnames(libraries)[3] <- "Type"
colnames(libraries)[7] <- "Zip"
colnames(libraries)[8] <- "Phone_Number"
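
As an aside, if dplyr were already loaded we could do the same renaming in one step with rename(); a sketch equivalent to the index-based approach above (run one approach or the other, not both):

#An equivalent rename (a sketch): new_name = old_name, backticks for names with spaces
library(dplyr)
libraries <- rename(libraries,
                    Location_Number = `Location Number`,
                    Name = `Location Name`,
                    Type = `Location Type`,
                    Zip = `Zip Code`,
                    Phone_Number = `Phone Number`)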

#Load package 
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Create summary table
by_state <- group_by(libraries, State, Type)
summarise(by_state, n())
## # A tibble: 211 x 3
## # Groups:   State [?]
##    State Type            `n()`
##    <chr> <chr>           <int>
##  1 AK    Bookmobile          1
##  2 AK    Branch Library     16
##  3 AK    Central Library    83
##  4 AK    Library System     84
##  5 AL    Bookmobile         13
##  6 AL    Branch Library     76
##  7 AL    Central Library   218
##  8 AL    Library System    220
##  9 AR    Bookmobile          1
## 10 AR    Branch Library    166
## # ... with 201 more rows
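
To compare the Type counts for each state side by side, we could reshape this table so each Type becomes a column; a sketch using tidyr's spread(), the reshaping verb contemporary with this code:

#Reshape the counts (a sketch): one row per State, one column per library Type
library(tidyr)
by_state %>%
  summarise(count = n()) %>%
  spread(Type, count)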

For a number of the states it looks like the counts for Central Library and Library System are very similar, and in some cases they are the same. We might be double counting some libraries. While there may be an administrative reason for this, we are interested in seeing the location of libraries and we only need each library once. We can clean the data to remove duplicates.

#Clean data - remove duplicates 
libraries_clean <- libraries %>%
  group_by(Name, Longitude, Latitude) %>%
  filter(row_number() == 1)
## Warning: package 'bindrcpp' was built under R version 3.2.5
count(libraries_clean)
## # A tibble: 18,601 x 4
## # Groups:   Name, Longitude, Latitude [18,601]
##    Name                          Longitude Latitude     n
##    <chr>                             <dbl>    <dbl> <int>
##  1 115Th Street Branch               -74.0     40.8     1
##  2 125Th Street Branch               -73.9     40.8     1
##  3 151St Street Center               -87.5     41.6     1
##  4 30/31 Branch Library              -90.6     41.5     1
##  5 4S Ranch                         -117.      33.0     1
##  6 58 Street Branch Library          -74.0     40.8     1
##  7 67Th Street Branch                -74.0     40.8     1
##  8 78Th Street Community Library     -82.4     27.9     1
##  9 81St Avenue Branch Library       -122.      37.8     1
## 10 95Th Street Branch Library        -88.2     41.7     1
## # ... with 18,591 more rows
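
As an aside, dplyr's distinct() expresses the same de-duplication more directly; a sketch:

#Equivalent de-duplication (a sketch): keep the first row per Name/Longitude/Latitude
libraries_clean2 <- distinct(libraries, Name, Longitude, Latitude, .keep_all = TRUE)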

Making a map

First we need to load some new packages. If you haven't already installed these packages, don't forget to do so. ggplot can produce quite nice maps, but later we will use the Google API to pull in some maps, and we will need ggmap to download and display them.

library(maps) #We want this package to help us download some of our maps
## Warning: package 'maps' was built under R version 3.2.5
library(ggplot2) #We use this to produce our plot
library(ggmap) #We need this for our maps

Getting a map of the United States

There are two ways to do this. The first, less glamorous way is to download the United States map using the map_data command from the maps package. This package has a lot of different maps available for download; you can see them all here. We want a map of the United States of America to match our data, which is fortunately pre-loaded by this package.

This can then be plotted using geom_polygon, a geom within ggplot. We use coord_fixed(1.3) to fix the ratio between the y and x axes so that our dimensions are reasonable. Without this command, the aspect ratio simply follows the plotting window and the map's proportions are distorted.

#Download USA State Borders 
USA_states <- map_data("state")

#Plot the data using ggplot
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") + 
  coord_fixed(1.3) 
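
To see the effect of coord_fixed(), we can draw the same plot without it; a sketch for comparison:

#For comparison (a sketch): the same map without coord_fixed(),
#so the aspect ratio follows the plotting window rather than the coordinates
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black")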

Visualising the Libraries

We saw before that the US libraries data has Longitude and Latitude coordinates for each of the libraries in the data set. This is sometimes called point data. We can display point data using these coordinates and treat it much like points in a scatter plot: ggplot is plotting an x and a y for each row, which in essence is very similar. We can therefore add our point data using geom_point.

#Adding in our point data
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") + 
  coord_fixed(1.3) +
  geom_point(data = libraries_clean, aes(x=Longitude, y = Latitude))

As you can see, it looks like our data set contains US libraries that are not on the mainland. We can load a world map using the map_data command and see where they appear to be.

World_map <- map_data("world")

#Visualising our data
ggplot() + geom_polygon(data = World_map, aes(x=long, y = lat, group = group), fill = NA, colour = "black") + 
  coord_fixed(1.3) +
  geom_point(data = libraries_clean, aes(x=Longitude, y = Latitude), colour = "red")

It looks like our data set contains libraries in Hawaii and Alaska, states off the mainland, but also in other US territories such as Guam in the Pacific. That is fine, but we want to visualise the mainland data only. I'm also curious about the others, so below I will try to see which libraries in our data set aren't on the mainland and aren't in Hawaii or Alaska.

To see which libraries are in our data set but not on the US mainland, we can filter using our USA borders and some of the tools available in dplyr. The filter below gives us all libraries in the data set that fall outside the range of the US mainland's longitudes and that don't sit in Alaska or Hawaii. It shows we have libraries from Guam, the Northern Mariana Islands and the Virgin Islands but, alarmingly, also Connecticut and Maine, which should be appearing on the mainland. I checked the ISO codes here.

The libraries in Maine and Connecticut should sit on our mainland, so they must be data errors. The ones in Maine look to be just outside the border, while the Connecticut library appears to be in the UK. We exclude them from our sample by negating both longitude conditions and combining them with and (&) rather than the or (|) used in our filter above. We also remove NA values to clean the data set.

#Filter to see which libraries we have. 
libraries_clean %>% 
  filter(Longitude <= min(USA_states$long) | Longitude >= max(USA_states$long), State != "AK" & State != "HI") %>%
  select(Name, City, State)
## Adding missing grouping variables: `Longitude`, `Latitude`
## # A tibble: 21 x 5
## # Groups:   Name, Longitude, Latitude [21]
##    Longitude Latitude Name                                   City    State
##        <dbl>    <dbl> <chr>                                  <chr>   <chr>
##  1     -1.12     52.6 Andover Public Library                 Andover CT   
##  2    145.       13.5 Guam Public Library Bookmobile         Agana   GU   
##  3    145.       13.5 Guam Public Library System             Agana   GU   
##  4    145.       13.5 Nieves M. Flores Memorial Library      Agana   GU   
##  5    145.       13.4 Maria Rivera Aguigui Memorial Library… Agat    GU   
##  6    145.       13.5 Barrigada Branch Public Library        Barrig… GU   
##  7    145.       13.5 Dededo Branch Public Library           Dededo  GU   
##  8    145.       13.3 Rosa Aguigui Reyes Memorial Library (… Merizo  GU   
##  9    145.       13.4 Yona Public Library                    Yona    GU   
## 10    -67.0      44.9 Peavey Memorial Library                Eastpo… ME   
## # ... with 11 more rows
#We now want to filter these out
libraries_USA <- libraries_clean %>% 
  filter(!Longitude <= min(USA_states$long) & !Longitude >= max(USA_states$long), !is.na(Longitude))
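
A quick sanity check (a sketch) confirms the remaining longitudes now fall within the mainland borders:

#Sanity check: the remaining longitudes should lie within the mainland's range
range(libraries_USA$Longitude)
range(USA_states$long)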

A Clean Map

Now that we have our data cleaned, we can start looking at producing some better looking maps.

ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") + 
  coord_fixed(1.3) +
  geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude))

Because we are using ggplot, we are able to use some of the main aesthetic features available in ggplot. To change the aesthetics we can do the following:

* Colour: We can set a colour for our libraries data to differentiate it from our borders.
* Alpha: We can change the transparency of our point data, to see where there are larger concentrations of libraries.
* Theme: We can apply a ggplot theme. I like theme_minimal(), which gives a nice clean background for our map.
* Labels: Using labs() we are able to change our axis titles and add a title to our graph.
* Legend: I want to get rid of the legend because it doesn't add much value. We can do this using theme(legend.position = "none").

ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") + 
  coord_fixed(1.3) +
  geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude, colour = "red"), alpha=1/15) +
  theme_minimal() + theme(legend.position = "none") +
  labs(title = "The Locations of Libraries in the United States", x = "Longitude", y = "Latitude")

It looks like there are clusters where more libraries are present in our data set. We can overlay the positions of US cities to see whether these clusters correspond to cities. The maps package includes data on cities: we can use the world.cities data set to get the US cities, filter them by population size and then add them to our plot.

#Import Data
USA_cities_raw <- world.cities[world.cities$country.etc == "USA",]

#Look at data - from this we can see there is a population variable. 
head(USA_cities_raw)
##        name country.etc    pop   lat    long capital
## 222 Abilene         USA 113888 32.45  -99.74       0
## 747   Akron         USA 206634 41.08  -81.52       0
## 811 Alameda         USA  70069 37.77 -122.26       0
## 857  Albany         USA  45535 44.62 -123.09       0
## 858  Albany         USA  75510 31.58  -84.18       0
## 859  Albany         USA  93576 42.67  -73.80       0
#Lets limit it to cities with a population over 500000 people. 
USA_cities <- USA_cities_raw %>%
  filter(pop > 500000)

#Add it to our Map
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") + 
  coord_fixed(1.3) +
  geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude, colour = "red"), alpha=1/15) +
  theme_minimal() + theme(legend.position = "none") +
  labs(title = "The Locations of Libraries in the United States", x = "Longitude", y = "Latitude") +
  geom_point(data= USA_cities, aes(x=long, y = lat, size = pop), colour = "blue", alpha = 1/2)

We can see that libraries correspond with big cities, but not all of the places with lots of libraries seem to have large populations. Further EDA could uncover the relationship between population and libraries; a rough sketch of one such check follows.
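
As a sketch of that EDA, we could count the libraries that fall within roughly half a degree of each large city and set the counts against population. The half-degree box is a crude proximity measure of our own choosing, not a true distance:

#A rough sketch: count libraries within ~0.5 degrees of each large city
USA_cities %>%
  rowwise() %>%
  mutate(n_libraries = sum(abs(libraries_USA$Longitude - long) < 0.5 &
                             abs(libraries_USA$Latitude - lat) < 0.5,
                           na.rm = TRUE)) %>%
  ungroup() %>%
  select(name, pop, n_libraries) %>%
  arrange(desc(n_libraries))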

Using Google Maps API

The above map looks much cleaner than before, but we aren't used to looking at maps as basic as this. Using the Google Maps API we are able to pull in a more familiar looking map. Be careful about using this API too much: Google only gives you a certain number of queries per 24 hours before you have to register. I want to start with two maps, one of the United States as a whole and another of New York.

#Download the data
Google_USA <- get_map(location = "USA", zoom = 4)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=USA&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=USA&sensor=false
Google_NY <- get_map(location = "New York", zoom = 10)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=New+York&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=New%20York&sensor=false

The zoom argument determines how close in the map will be. zoom = 4 gives us the continent, while zoom = 10 is suitable for a city and is the default. We can zoom in further depending on our interest and purpose; see the sketch below. Note that what we put in the quotation marks is what Google will deliver, just as if we had searched in Google Maps, so you need to make sure you have the correct search term.
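
For example, a higher zoom takes us down to neighbourhood level; a sketch, where "Central Park" is just an illustrative search term:

#A closer zoom (a sketch): zoom = 14 is roughly neighbourhood level
Google_CP <- get_map(location = "Central Park", zoom = 14)
ggmap(Google_CP)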

Let's make our maps. Here we will use ggmap instead of ggplot, but we can use the same commands as we do in ggplot.

#USA
ggmap(Google_USA) +
  geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude, colour = "red"), alpha=1/20) +
  labs(title = "The Locations of Libraries in the United States", x = "Longitude", y = "Latitude") + 
  theme(legend.position = "none") 
## Warning: Removed 67 rows containing missing values (geom_point).

#New York
ggmap(Google_NY) +
  geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude, colour = "red"), alpha=1/2) +
  labs(title = "The Locations of Libraries in the New York", x = "Longitude", y = "Latitude") + 
  theme(legend.position = "none") 
## Warning: Removed 17866 rows containing missing values (geom_point).

We have two outputs, one for the United States and one for New York. The New York map is at a closer zoom and we can make out more detail; we can start to see which areas of New York the libraries are located in.

The get_map command can also pull a satellite image from the Google API. I’m interested in Los Angeles. I set zoom = 10, which will give me the city.

#Download the data
Google_LA<- get_map(location = "LA", zoom = 10, maptype = c("satellite"))
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=LA&zoom=10&size=640x640&scale=2&maptype=satellite&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=LA&sensor=false
#LA
ggmap(Google_LA) +
  geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude, colour = "red"), alpha=1/2) +
  labs(title = "The Locations of Libraries in LA", x = "Longitude", y = "Latitude") + 
  theme(legend.position = "none") 
## Warning: Removed 18143 rows containing missing values (geom_point).

We can see that libraries are generally along the areas that look like main roads, with some in less dense regions, but there are none in the mountains.

Concluding remarks

Mapping data in R gives us the ability to visualise our data and try to understand different trends. It's possible to overlay more data on these maps to get a better understanding of the relationships between different types of data. We could overlay roads, schools, train/bus stations or other demographic data on to the library data and see where we might need a library in the future; a rough sketch of such an overlay is below.
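
As a rough sketch, a second layer overlays in exactly the same way; here schools is a hypothetical data frame with Longitude and Latitude columns:

#A sketch only: 'schools' is hypothetical point data with Longitude/Latitude columns
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") + 
  coord_fixed(1.3) +
  geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude), colour = "red", alpha = 1/15) +
  geom_point(data = schools, aes(x=Longitude, y = Latitude), colour = "blue", alpha = 1/15)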