This document is for the Coursera Developing Data Products assignment. The objective is to create a web page using R Markdown that features a map created with Leaflet.
The dataset comes from the zipcode package, which contains city, state, latitude, and longitude information for US ZIP codes from the CivicSpace Database (August 2004), augmented by Daniel Coven’s federalgovernmentzipcodes.us website (updated January 22, 2012).
# Clear the workspace (removes all objects from the global environment)
rm(list=ls())
# Check for missing dependencies and load necessary R packages
if(!require(zipcode)){install.packages('zipcode')}; library(zipcode)
if(!require(leaflet)){install.packages('leaflet')}; library(leaflet)
if(!require(dplyr)){install.packages('dplyr')}; library(dplyr)
if(!require(ggplot2)){install.packages('ggplot2')}; library(ggplot2)
# Load dataset
data(zipcode)
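The `require()`/`install.packages()` lines above install anything that is missing as a side effect. If you would rather see what is missing before installing, a minimal sketch along these lines may help. `missing_packages` is a hypothetical helper name, not part of any package; it relies only on base R's `requireNamespace()`.

```r
# Hypothetical helper: report which packages are not installed,
# without installing or attaching anything.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("zipcode", "leaflet", "dplyr", "ggplot2")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
}
```

The result can then be passed to `install.packages(to_install)` once you have reviewed it.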
Now, let's take a look at the zipcode dataset.
summary(zipcode)
##      zip                city              state              latitude
##  Length:44336       Length:44336       Length:44336       Min.   :-44.25
##  Class :character   Class :character   Class :character   1st Qu.: 34.96
##  Mode  :character   Mode  :character   Mode  :character   Median : 39.10
##                                                           Mean   : 38.47
##                                                           3rd Qu.: 41.86
##                                                           Max.   : 71.30
##                                                           NA's   :647
##    longitude
##  Min.   :-176.64
##  1st Qu.: -97.28
##  Median : -87.82
##  Mean   : -90.84
##  3rd Qu.: -80.06
##  Max.   : 171.18
##  NA's   :647
Interesting: the dataset lists US ZIP codes together with their city, state, latitude, and longitude. The latitude and longitude columns each contain 647 NA values, so I’m removing those rows now.
zipcode2 <- na.omit(zipcode)
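Before dropping rows, it can be worth checking how many are affected and whether the missing latitudes and longitudes fall on the same rows. The sketch below uses a small made-up data frame (`toy`) in place of the real data, so the values are illustrative only.

```r
# Made-up rows standing in for the zipcode data; values are illustrative only.
toy <- data.frame(
  zip       = c("00001", "00002", "00003", "00004"),
  latitude  = c(40.1, NA, 35.2, NA),
  longitude = c(-75.3, NA, -97.5, NA),
  stringsAsFactors = FALSE
)

# Rows with a missing value in any column: these are the rows na.omit() drops
incomplete <- !complete.cases(toy)
n_dropped  <- sum(incomplete)

# Number of rows where only one of the two coordinates is missing
n_mismatch <- sum(is.na(toy$latitude) != is.na(toy$longitude))

toy_clean <- na.omit(toy)
nrow(toy_clean)
```

If `n_mismatch` is zero, the two coordinates are always missing together, and `na.omit()` removes exactly `n_dropped` rows.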
Using the dplyr package, we’ll summarise the data by state: mean latitude, mean longitude, and the number of ZIP codes in each state. The coordinates are named lat and lng so that leaflet can pick them up automatically later.
zipcode3 <- zipcode2 %>% group_by(state) %>% summarise(lat=mean(latitude), lng=mean(longitude), count_zipcode=n())
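Note that averaging ZIP code coordinates gives a point weighted toward wherever ZIP codes are dense, not the geographic centre of the state. The sketch below shows this on made-up data. It uses base R's `tapply()` rather than dplyr so that it runs without extra packages; the state labels and coordinates are invented for illustration.

```r
# Made-up data: state "AA" has two ZIP codes at latitude 10 and one at 40.
toy <- data.frame(
  state     = c("AA", "AA", "AA", "BB"),
  latitude  = c(10, 10, 40, 20),
  longitude = c(-100, -100, -70, -90),
  stringsAsFactors = FALSE
)

# Per-state mean coordinates and row counts (base-R analogue of
# group_by(state) %>% summarise(...))
lat <- tapply(toy$latitude,  toy$state, mean)
lng <- tapply(toy$longitude, toy$state, mean)
n   <- tapply(toy$latitude,  toy$state, length)

centers <- data.frame(
  state         = names(lat),
  lat           = as.vector(lat),
  lng           = as.vector(lng),
  count_zipcode = as.vector(n),
  stringsAsFactors = FALSE
)
centers
```

For "AA" the mean latitude is pulled toward the two ZIP codes at latitude 10, rather than sitting at the midpoint of the range.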
Now, take a quick look at the transformed dataset.
summary(zipcode3)
##     state                lat              lng          count_zipcode
##  Length:60          Min.   :-44.25   Min.   :-170.77   Min.   :   1.0
##  Class :character   1st Qu.: 32.96   1st Qu.: -99.27   1st Qu.: 293.0
##  Mode  :character   Median : 38.71   Median : -86.37   Median : 609.5
##                     Mean   : 34.16   Mean   : -71.43   Mean   : 728.1
##                     3rd Qu.: 42.20   3rd Qu.: -74.92   3rd Qu.:1035.5
##                     Max.   : 61.48   Max.   : 169.46   Max.   :2768.0
View(zipcode3)
Some state codes have very few ZIP codes, and their mean latitude and longitude are far from the rest; these are likely errors in the original dataset. So I’m keeping only the states whose mean latitude is between 28 and 50. Note that this band also excludes Alaska and Hawaii, whose latitudes lie outside it.
zipcode4 <- subset(zipcode3, lat>28 & lat<50)
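A latitude band is a blunt filter: it keeps or drops a row purely by position. One alternative, sketched here as an assumption rather than what this analysis does, is to filter by state code using R's built-in `state.abb` vector (the 50 state abbreviations). The data frame below is made up for illustration, including the coordinates and counts.

```r
# Made-up per-state summary; coordinates and counts are illustrative only.
toy_states <- data.frame(
  state         = c("TX", "AK", "HI", "PR", "AE", "NY"),
  lat           = c(31.0, 61.5, 21.0, 18.2, 49.0, 42.0),
  count_zipcode = c(2700, 270, 140, 170, 5, 2200),
  stringsAsFactors = FALSE
)

# Filter by latitude band, as in the code above
by_latitude <- subset(toy_states, lat > 28 & lat < 50)

# Filter by state code: the 50 states from state.abb, plus DC
by_name <- subset(toy_states, state %in% c(state.abb, "DC"))

by_latitude$state
by_name$state
```

On this toy data the latitude band keeps the non-state code "AE" and drops "AK" and "HI", while the code-based filter does the opposite.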
Alright, let’s apply the leaflet package and plot the markers. The size of each marker indicates the number of ZIP codes in that state.
zipcode4 %>% leaflet() %>% addTiles() %>% addCircles(weight=1, radius=zipcode4$count_zipcode*150)
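With `radius = count_zipcode * 150`, the radius is proportional to the count, so the circle's area grows with the square of the count. If you want the area to track the count instead, a square-root scale is one option. The sketch below compares the two on made-up counts; the scale factor `k` is a hypothetical value chosen only for illustration.

```r
counts <- c(100, 400, 1600)   # made-up ZIP code counts

# Radius proportional to count (as in the map above)
radius_linear <- counts * 150

# Radius proportional to sqrt(count), so area is proportional to count
k <- 5000                     # hypothetical scale factor
radius_area <- sqrt(counts) * k

# How much the circle area grows when the count quadruples (100 -> 400)
area_growth_linear <- (radius_linear[2] / radius_linear[1])^2
area_growth_sqrt   <- (radius_area[2]   / radius_area[1])^2

area_growth_linear
area_growth_sqrt
```

Either vector could be supplied as the `radius` argument of `addCircles()`; which one reads better depends on whether viewers compare diameters or areas.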
Plotting this as a sorted bar chart (Pareto-style) makes the ranking clearer.
# Keep the top 20 states by number of ZIP codes
zipcode5 <- arrange(zipcode4,desc(count_zipcode))[1:20,]
plot1 <- ggplot(zipcode5, aes(x=reorder(state,count_zipcode), y=count_zipcode, fill=count_zipcode)) +
  geom_bar(stat="identity") +
  scale_fill_gradient2(low='red', mid='snow3', high='red', space='Lab') +
  labs(title="Top 20 US States by Number of ZIP Codes", y="Number of ZIP Codes", x="State") +
  coord_flip()
plot1
We can see that Texas is the state with the highest number of ZIP codes, followed by California and Pennsylvania.
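A Pareto chart in the strict sense pairs the sorted bars with a cumulative-share line. The bar chart above does not include one, but the underlying computation is short. This is a minimal sketch on made-up counts with placeholder labels, not the real state totals.

```r
# Made-up counts with placeholder labels; not the real state totals.
counts <- c(A = 40, B = 10, C = 30, D = 20)

# Sort from largest to smallest, then accumulate each item's share of the total
sorted    <- sort(counts, decreasing = TRUE)
cum_share <- cumsum(sorted) / sum(sorted)

# Top 2 items and the share of the total they cover
top2       <- head(sorted, 2)
top2_share <- sum(top2) / sum(counts)

round(cum_share, 2)
```

The `cum_share` vector is what a cumulative line on a Pareto chart would plot against the sorted categories.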