There are a number of really interesting data sets that provide data points with geographic coordinates. Some you may find interesting include the NASA Fires data or the Global Terrorism Database. In this Vignette, I explain some techniques that can be used to display data with geographic coordinates: longitude and latitude.
There are two slightly different techniques, using elements of the tidyverse, that can produce some nice looking maps efficiently. One uses ggplot and the other uses ggmap, which shares the same syntax as ggplot. These techniques let us present the data in a visual, easy to understand way.
For this analysis we will use data from AggData, a website with lots of interesting locational data, some free and some paid. One free file available on the website is the location of United States libraries from 2005, found here. Unfortunately we are unable to import it into R using an API, so we have to download the csv file separately. Save the file to your directory, then import it using the readr package. We use different packages throughout this Vignette, but load them as we go so that it's clear why we use each one. Remember to install the package with install.packages("readr") if you haven't already.
library(readr)
## Warning: package 'readr' was built under R version 3.2.5
libraries <- read_csv("public_libraries.csv")
## Parsed with column specification:
## cols(
## `Location Number` = col_character(),
## `Location Name` = col_character(),
## `Location Type` = col_character(),
## Address = col_character(),
## City = col_character(),
## State = col_character(),
## `Zip Code` = col_integer(),
## `Phone Number` = col_character(),
## County = col_character(),
## Latitude = col_double(),
## Longitude = col_double(),
## Accuracy = col_character()
## )
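One thing worth noting in the parsing message: Zip Code was guessed as an integer, which silently drops leading zeros (hence the minimum of 802 in the summary further down). If zip codes matter to your analysis, you can override the guess (a sketch using readr's col_types argument):
#Read zip codes as character to preserve leading zeros
libraries <- read_csv("public_libraries.csv",
col_types = cols(`Zip Code` = col_character()))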
Now that we have our data we can start to look at it. To see the first rows of our data, use head(). It gives a good overview of what the data looks like. We can also run a summary using summary().
head(libraries)
## # A tibble: 6 x 12
## `Location Number` `Location Name` `Location Type` Address City State
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 AK0035-003 Mendenhall Valle… Branch Library 9105 Me… #134… AK
## 2 AK0095 Akiachak School/… Library System 1 Main … Akia… AK
## 3 AK0095-002 Akiachak School/… Central Library 1 Main … Akia… AK
## 4 AK0094-005 Nunamiut School/… Branch Library 114 Ill… Anak… AK
## 5 AK0001-002 Anchor Point Pub… Central Library 72251 M… Anch… AK
## 6 AK0001 Anchor Point Pub… Library System 72251 M… Anch… AK
## # ... with 6 more variables: `Zip Code` <int>, `Phone Number` <chr>,
## # County <chr>, Latitude <dbl>, Longitude <dbl>, Accuracy <chr>
summary(libraries)
## Location Number Location Name Location Type
## Length:26526 Length:26526 Length:26526
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Address City State Zip Code
## Length:26526 Length:26526 Length:26526 Min. : 802
## Class :character Class :character Class :character 1st Qu.:21064
## Mode :character Mode :character Mode :character Median :48890
## Mean :47250
## 3rd Qu.:68839
## Max. :99950
## Phone Number County Latitude Longitude
## Length:26526 Length:26526 Min. :13.26 Min. :-170.28
## Class :character Class :character 1st Qu.:36.23 1st Qu.: -96.63
## Mode :character Mode :character Median :40.51 Median : -87.64
## Mean :39.42 Mean : -89.75
## 3rd Qu.:42.55 3rd Qu.: -78.46
## Max. :71.30 Max. : 145.75
## Accuracy
## Length:26526
## Class :character
## Mode :character
##
##
##
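As an aside, with this many columns head() truncates the display; glimpse() from the tibble package (also re-exported by dplyr) prints every column on its own line (a sketch):
#An overview that shows every column on its own line
tibble::glimpse(libraries)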
We have information about the location of 26,526 libraries across the United States. We will use the Latitude and Longitude variables for our visualisation, but we can also use some of the other information to get a better idea about the distribution of libraries across the United States. We can use the dplyr package, but first we should clean up the column names that contain spaces. This will make it easier to refer to these variables throughout our analysis.
#Rename column names
colnames(libraries)[1] <- "Location_Number"
colnames(libraries)[2] <- "Name"
colnames(libraries)[3] <- "Type"
colnames(libraries)[7] <- "Zip"
colnames(libraries)[8] <- "Phone_Number"
#Load package
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
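As an aside, now that dplyr is loaded, the renaming we did above with colnames() could have been written as a single rename() call (a sketch, shown for reference; the backticks handle the names containing spaces):
#Equivalent renaming with dplyr
libraries <- rename(libraries,
Location_Number = `Location Number`,
Name = `Location Name`,
Type = `Location Type`,
Zip = `Zip Code`,
Phone_Number = `Phone Number`)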
#Create summary table
by_state <- group_by(libraries, State, Type)
summarise(by_state, n())
## # A tibble: 211 x 3
## # Groups: State [?]
## State Type `n()`
## <chr> <chr> <int>
## 1 AK Bookmobile 1
## 2 AK Branch Library 16
## 3 AK Central Library 83
## 4 AK Library System 84
## 5 AL Bookmobile 13
## 6 AL Branch Library 76
## 7 AL Central Library 218
## 8 AL Library System 220
## 9 AR Bookmobile 1
## 10 AR Branch Library 166
## # ... with 201 more rows
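As an aside, naming the summary column, as in summarise(by_state, n = n()), avoids the awkward `n()` header, and dplyr's count() collapses the grouping and counting into one call (a sketch):
#Equivalent summary in one step, sorted by count
count(libraries, State, Type, sort = TRUE)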
For a number of the states it looks like the counts for Central Library and Library System are very similar, and in some cases they are the same. We might be double counting some libraries. While there may be an administrative reason for this, we are interested in seeing the location of libraries, and we only need each library once. We can clean the data to remove duplicates.
#Clean data - remove duplicates
libraries_clean <- libraries %>%
group_by(Name, Longitude, Latitude) %>%
filter(row_number() == 1)
## Warning: package 'bindrcpp' was built under R version 3.2.5
count(libraries_clean)
## # A tibble: 18,601 x 4
## # Groups: Name, Longitude, Latitude [18,601]
## Name Longitude Latitude n
## <chr> <dbl> <dbl> <int>
## 1 115Th Street Branch -74.0 40.8 1
## 2 125Th Street Branch -73.9 40.8 1
## 3 151St Street Center -87.5 41.6 1
## 4 30/31 Branch Library -90.6 41.5 1
## 5 4S Ranch -117. 33.0 1
## 6 58 Street Branch Library -74.0 40.8 1
## 7 67Th Street Branch -74.0 40.8 1
## 8 78Th Street Community Library -82.4 27.9 1
## 9 81St Avenue Branch Library -122. 37.8 1
## 10 95Th Street Branch Library -88.2 41.7 1
## # ... with 18,591 more rows
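An equivalent way to drop the duplicates is dplyr's distinct(), which reads a little more directly (a sketch; .keep_all = TRUE retains the remaining columns, and the result is not left grouped):
#Equivalent de-duplication with distinct()
libraries_clean <- distinct(libraries, Name, Longitude, Latitude, .keep_all = TRUE)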
First we need to load some new packages. If you haven't already installed these packages, don't forget to do so. ggplot can produce quite nice maps on its own, but later we will use the Google API to pull in some base maps, and we will need ggmap to download and display those.
library(maps) #We use this package for our map outline data
## Warning: package 'maps' was built under R version 3.2.5
library(ggplot2) #We use this to produce our plot
library(ggmap) #We need this for our maps
There are two ways to do this. The first, less glamorous way is to retrieve the United States map using the map_data command from the maps package. This package has a lot of different maps available; you can see them all here. We want a map of the United States of America to match our data, which fortunately comes pre-loaded with this package.
This can then be plotted using geom_polygon, a geom within ggplot. We use coord_fixed(1.3) to fix the aspect ratio so that our dimensions are reasonable; without this command, the map is stretched to the shape of the plotting window and looks distorted.
#Download USA State Borders
USA_states <- map_data("state")
#Plot the data using ggplot
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") +
coord_fixed(1.3)
We saw before that the US Libraries data has Longitude and Latitude coordinates for each of the libraries in the data set. This kind of data is sometimes called point data. We can display point data using these coordinates and treat it much like the points in a scatter plot: ggplot is plotting an x and a y for each row, which in essence is very similar. We can therefore add our point data using geom_point.
#Adding in our point data
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") +
coord_fixed(1.3) +
geom_point(data = libraries_clean, aes(x=Longitude, y = Latitude))
As you can see, it looks like our data set has US libraries not on the mainland. We can load a world map using the map_data command and see where they appear to be.
World_map <- map_data("world")
#Visualising our data
ggplot() + geom_polygon(data = World_map, aes(x=long, y = lat, group = group), fill = NA, colour = "black") +
coord_fixed(1.3) +
geom_point(data = libraries_clean, aes(x=Longitude, y = Latitude), colour = "red")
It looks like our data set contains libraries in Hawaii and Alaska, states off the mainland, but also in other US territories such as Guam in the Pacific. We ultimately want to visualise the data on the mainland only, but I'm curious about the others, so below I will try to see which libraries we have that aren't on the mainland and aren't in Hawaii or Alaska.
To see which libraries are in our data set but not on the US mainland, we can filter against our USA borders using some of the tools available in dplyr. The filter below gives us all libraries in the data set which fall outside the US mainland's longitude range and which don't sit in Alaska or Hawaii. It shows we have libraries from Guam, the Northern Mariana Islands and the Virgin Islands but, alarmingly, also Connecticut and Maine, which should appear on the mainland. I checked the ISO codes here.
The libraries in Maine and Connecticut should sit on our mainland map, so they must be data errors. The ones in Maine look to be just outside the border, while the Connecticut library appears to be in the UK. We exclude them from our sample by negating both longitude conditions and combining them with and (&), rather than the or (|) used in the filter above. We also remove NA values to clean the data set.
#Filter to see which libraries we have.
libraries_clean %>%
filter(Longitude <= min(USA_states$long) | Longitude >= max(USA_states$long), State != "AK" & State != "HI") %>%
select(Name, City, State)
## Adding missing grouping variables: `Longitude`, `Latitude`
## # A tibble: 21 x 5
## # Groups: Name, Longitude, Latitude [21]
## Longitude Latitude Name City State
## <dbl> <dbl> <chr> <chr> <chr>
## 1 -1.12 52.6 Andover Public Library Andover CT
## 2 145. 13.5 Guam Public Library Bookmobile Agana GU
## 3 145. 13.5 Guam Public Library System Agana GU
## 4 145. 13.5 Nieves M. Flores Memorial Library Agana GU
## 5 145. 13.4 Maria Rivera Aguigui Memorial Library… Agat GU
## 6 145. 13.5 Barrigada Branch Public Library Barrig… GU
## 7 145. 13.5 Dededo Branch Public Library Dededo GU
## 8 145. 13.3 Rosa Aguigui Reyes Memorial Library (… Merizo GU
## 9 145. 13.4 Yona Public Library Yona GU
## 10 -67.0 44.9 Peavey Memorial Library Eastpo… ME
## # ... with 11 more rows
#We now want to filter these out
libraries_USA <- libraries_clean %>%
filter(!Longitude <= min(USA_states$long) & !Longitude >= max(USA_states$long), !is.na(Longitude))
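For readability, the same filter can be written with dplyr's between() (a sketch; note that between() includes the endpoints, whereas the comparisons above are strict, so points sitting exactly on the boundary would be treated differently):
#A near-equivalent filter using between()
libraries_USA <- libraries_clean %>%
filter(between(Longitude, min(USA_states$long), max(USA_states$long)), !is.na(Longitude))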
Now that we have our data cleaned, we can start looking at producing some better looking maps.
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") +
coord_fixed(1.3) +
geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude))
Because we are using ggplot, we are able to use some of the main aesthetic features available in ggplot. To change the aesthetics we can do the following:
* Colour: we can set a colour for our libraries data to differentiate it from our borders.
* Alpha: we can change the transparency of our point data, to see where there are larger concentrations of libraries.
* Theme: we can apply a ggplot theme. I like theme_minimal(), which gives a nice clean background for our map.
* Labels: using labs() we are able to change our axis titles, and add a title to our graph.
* Legend: a legend wouldn't add much value here. We set colour = "red" outside aes() (a constant inside aes() would create a spurious legend) and keep theme(legend.position = "none") to suppress any legend that remains.
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") +
coord_fixed(1.3) +
geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude), colour = "red", alpha=1/15) +
theme_minimal() + theme(legend.position = "none") +
labs(title = "The Locations of Libraries in the United States", x = "Longitude", y = "Latitude")
It looks like there are clusters where more libraries are present in our data set. We can overlay the positions of US cities to see whether these clusters correspond to cities. The maps package includes a world.cities data set from which we can extract US cities, filter them by population size, and add them to our plot.
#Import Data
USA_cities_raw <- world.cities[world.cities$country.etc == "USA",]
#Look at data - from this we can see there is a population variable.
head(USA_cities_raw)
## name country.etc pop lat long capital
## 222 Abilene USA 113888 32.45 -99.74 0
## 747 Akron USA 206634 41.08 -81.52 0
## 811 Alameda USA 70069 37.77 -122.26 0
## 857 Albany USA 45535 44.62 -123.09 0
## 858 Albany USA 75510 31.58 -84.18 0
## 859 Albany USA 93576 42.67 -73.80 0
#Let's limit it to cities with a population over 500,000 people.
USA_cities <- USA_cities_raw %>%
filter(pop > 500000)
#Add it to our Map
ggplot() + geom_polygon(data = USA_states, aes(x=long, y = lat, group = group), fill = NA, colour = "black") +
coord_fixed(1.3) +
geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude), colour = "red", alpha=1/15) +
theme_minimal() + theme(legend.position = "none") +
labs(title = "The Locations of Libraries in the United States", x = "Longitude", y = "Latitude") +
geom_point(data= USA_cities, aes(x=long, y = lat, size = pop), colour = "blue", alpha = 1/2)
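As an optional tweak, we can label the city points by adding one more layer to the plot we just drew (a sketch using ggplot2's last_plot() helper; the size and vjust values are illustrative):
#Label the cities on the previous plot
last_plot() +
geom_text(data = USA_cities, aes(x = long, y = lat, label = name), size = 3, vjust = -1)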
We can see that libraries correspond with big cities, but not all of the places with lots of libraries seem to have large populations. Further EDA could uncover the relationship between population and libraries.
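As a minimal starting point for that EDA, we could tabulate libraries per state and then join in population figures from a separate source such as census data (a sketch):
#Libraries per state, largest counts first
libraries_per_state <- libraries_USA %>%
ungroup() %>%
count(State, sort = TRUE)
head(libraries_per_state)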
The above map looks much cleaner than before, but we aren't used to looking at maps as basic as this. Using the Google Maps API we are able to pull in a more familiar looking map. Be careful not to use this API too much: Google only gives you a certain number of queries per 24 hours before you have to register (and note that recent versions of ggmap require you to register an API key with register_google() before get_map() will work). I want to start with two maps, one of the United States as a whole and another of New York.
#Download the data
Google_USA <- get_map(location = "USA", zoom = 4)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=USA&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=USA&sensor=false
Google_NY <- get_map(location = "New York", zoom = 10)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=New+York&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=New%20York&sensor=false
The zoom determines how close in we look. zoom = 4 will give us the continent, while zoom = 10 is suitable for a city and is the default. We can zoom in further depending on our interest and purpose. Note that what we put in the quotation marks is what Google will geocode, just as if we had searched in Google Maps, so you need to make sure you have the correct search term.
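If a search term is ambiguous, get_map() also accepts an explicit longitude/latitude pair, which avoids relying on Google's geocoding entirely (a sketch; the coordinates are illustrative values near central Manhattan):
#Centre the map on explicit coordinates rather than a search term
Google_NY_coords <- get_map(location = c(lon = -73.99, lat = 40.73), zoom = 10)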
Let's make our maps. Here we will use ggmap instead of ggplot, but with the same commands we used in ggplot.
#USA
ggmap(Google_USA) +
geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude), colour = "red", alpha=1/20) +
labs(title = "The Locations of Libraries in the United States", x = "Longitude", y = "Latitude") +
theme(legend.position = "none")
## Warning: Removed 67 rows containing missing values (geom_point).
#New York
ggmap(Google_NY) +
geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude), colour = "red", alpha=1/2) +
labs(title = "The Locations of Libraries in New York", x = "Longitude", y = "Latitude") +
theme(legend.position = "none")
## Warning: Removed 17866 rows containing missing values (geom_point).
We have two outputs, one for the United States and one for New York. The New York map is zoomed in further, so we can make out more detail and start to see in which areas of New York the libraries are located.
The get_map command can also pull a satellite image from the Google API. I’m interested in Los Angeles. I set zoom = 10, which will give me the city.
#Download the data
Google_LA <- get_map(location = "LA", zoom = 10, maptype = "satellite")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=LA&zoom=10&size=640x640&scale=2&maptype=satellite&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=LA&sensor=false
#LA
ggmap(Google_LA) +
geom_point(data = libraries_USA, aes(x=Longitude, y = Latitude), colour = "red", alpha=1/2) +
labs(title = "The Locations of Libraries in LA", x = "Longitude", y = "Latitude") +
theme(legend.position = "none")
## Warning: Removed 18143 rows containing missing values (geom_point).
We can see that libraries generally sit along what look like main roads, with some in less dense regions, but there are none in the mountains.
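To make those concentrations easier to read than overlapping points, one option is to overlay density contours (a sketch using ggplot2's geom_density_2d on the same LA map):
#Density contours highlight where libraries cluster
ggmap(Google_LA) +
geom_density_2d(data = libraries_USA, aes(x = Longitude, y = Latitude), colour = "red")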
Mapping data in R gives us the ability to visualise our data and try to understand different trends. It's possible to overlay more data on these maps to get a better understanding of the relationships between different types of data. We could overlay roads, schools, train/bus stations or other demographic data on to the library data and see where we might need a library in the future.