To implement this idea, we don’t need any fancy packages other than ggplot2. The steps are simple:
For our example, we are going to examine the crime incident dataset from Seattle 911 Calls on data.gov. Note that I have covered this data set through multiple blog posts already such as map plots in R and time based heat maps.
Install and Load Libraries
install.packages("ggplot2")
Installing package into <U+393C><U+3E31>C:/Users/jafareem/Documents/R/win-library/3.5<U+393C><U+3E32>
(as <U+393C><U+3E31>lib<U+393C><U+3E32> is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/ggplot2_3.0.0.zip'
Content type 'application/zip' length 3579751 bytes (3.4 MB)
downloaded 3.4 MB
package ‘ggplot2’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\jafareem\AppData\Local\Temp\RtmpcZQ3iW\downloaded_packages
library(lubridate)
package <U+393C><U+3E31>lubridate<U+393C><U+3E32> was built under R version 3.5.1
Attaching package: <U+393C><U+3E31>lubridate<U+393C><U+3E32>
The following object is masked from <U+393C><U+3E31>package:base<U+393C><U+3E32>:
date
library(ggplot2)
library(ggmap)
package <U+393C><U+3E31>ggmap<U+393C><U+3E32> was built under R version 3.5.1Google Maps API Terms of Service: http://developers.google.com/maps/terms.
Please cite ggmap if you use it: see citation('ggmap') for details.
library(dplyr)
Attaching package: <U+393C><U+3E31>dplyr<U+393C><U+3E32>
The following objects are masked from <U+393C><U+3E31>package:lubridate<U+393C><U+3E32>:
intersect, setdiff, union
The following objects are masked from <U+393C><U+3E31>package:stats<U+393C><U+3E32>:
filter, lag
The following objects are masked from <U+393C><U+3E31>package:base<U+393C><U+3E32>:
intersect, setdiff, setequal, union
library(data.table)
data.table 1.11.4 Latest news: http://r-datatable.com
Attaching package: <U+393C><U+3E31>data.table<U+393C><U+3E32>
The following objects are masked from <U+393C><U+3E31>package:dplyr<U+393C><U+3E32>:
between, first, last
The following objects are masked from <U+393C><U+3E31>package:lubridate<U+393C><U+3E32>:
hour, isoweek, mday, minute, month, quarter,
second, wday, week, yday, year
library(ggrepel)
library(magrittr)
Attaching package: <U+393C><U+3E31>magrittr<U+393C><U+3E32>
The following object is masked from <U+393C><U+3E31>package:ggmap<U+393C><U+3E32>:
inset
incidents= fread('https://raw.githubusercontent.com/lgellis/MiscTutorial/master/ggmap/i2Sample.csv', stringsAsFactors = FALSE)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
2 3348k 2 97435 0 0 35225 0 0:01:37 0:00:02 0:01:35 35225
9 3348k 9 319k 0 0 86779 0 0:00:39 0:00:03 0:00:36 86779
14 3348k 14 495k 0 0 103k 0 0:00:32 0:00:04 0:00:28 103k
20 3348k 20 671k 0 0 116k 0 0:00:28 0:00:05 0:00:23 153k
25 3348k 25 847k 0 0 125k 0 0:00:26 0:00:06 0:00:20 178k
30 3348k 30 1023k 0 0 131k 0 0:00:25 0:00:07 0:00:18 185k
35 3348k 35 1199k 0 0 136k 0 0:00:24 0:00:08 0:00:16 176k
41 3348k 41 1375k 0 0 140k 0 0:00:23 0:00:09 0:00:14 176k
46 3348k 46 1551k 0 0 144k 0 0:00:23 0:00:10 0:00:13 176k
51 3348k 51 1727k 0 0 146k 0 0:00:22 0:00:11 0:00:11 176k
56 3348k 56 1903k 0 0 149k 0 0:00:22 0:00:12 0:00:10 176k
62 3348k 62 2079k 0 0 151k 0 0:00:22 0:00:13 0:00:09 176k
67 3348k 67 2255k 0 0 152k 0 0:00:21 0:00:14 0:00:07 176k
72 3348k 72 2431k 0 0 154k 0 0:00:21 0:00:15 0:00:06 176k
77 3348k 77 2607k 0 0 155k 0 0:00:21 0:00:16 0:00:05 176k
83 3348k 83 2783k 0 0 156k 0 0:00:21 0:00:17 0:00:04 176k
88 3348k 88 2975k 0 0 158k 0 0:00:21 0:00:18 0:00:03 179k
94 3348k 94 3151k 0 0 159k 0 0:00:21 0:00:19 0:00:02 179k
99 3348k 99 3327k 0 0 160k 0 0:00:20 0:00:20 --:--:-- 179k
100 3348k 100 3348k 0 0 160k 0 0:00:20 0:00:20 --:--:-- 180k
str(incidents)
Classes ‘data.table’ and 'data.frame': 50000 obs. of 20 variables:
$ V1 : int 450024 1028516 1319344 1334148 32080 326478 1010497 1095263 1251322 1278385 ...
$ CAD.CDW.ID : int 813558 1002946 1036985 1053917 48491 681155 931030 2032781 2200447 2257162 ...
$ CAD.Event.Number :integer64 12000168676 12000438147 15000080173 15000105067 10000294206 11000387576 12000351857 17000233529 ...
$ General.Offense.Number : int 2012168676 2012438147 201580173 2015105067 2010294206 2011387576 2012351857 2017233529 201847957 201528719 ...
$ Event.Clearance.Code : chr "064" "100" "161" "100" ...
$ Event.Clearance.Description: chr "SHOPLIFT" "FRAUD (INCLUDING IDENTITY THEFT)" "TRESPASS" "FRAUD (INCLUDING IDENTITY THEFT)" ...
$ Event.Clearance.SubGroup : chr "THEFT" "FRAUD CALLS" "TRESPASS" "FRAUD CALLS" ...
$ Event.Clearance.Group : chr "SHOPLIFTING" "FRAUD CALLS" "TRESPASS" "FRAUD CALLS" ...
$ Event.Clearance.Date : chr "05/31/2012 06:00:00 PM" "12/24/2012 11:14:00 AM" "03/11/2015 12:45:00 PM" "03/31/2015 04:56:00 PM" ...
$ Hundred.Block.Location : chr "39XX BLOCK OF S OTHELLO ST" "27XX BLOCK OF ALKI AVE SW" "6XX BLOCK OF NW MARKET ST" "77XX BLOCK OF RAINIER AV S" ...
$ District.Sector : chr "S" "W" "B" "S" ...
$ Zone.Beat : chr "S1" "W1" "B2" "S2" ...
$ Census.Tract : chr "11000.1011" "9701.2000" "4800.4000" "11102.4008" ...
$ Longitude : num -122 -122 -122 -122 -122 ...
$ Latitude : num 47.5 47.6 47.7 47.5 47.6 ...
$ Incident.Location : chr "(47.537044021, -122.282344886)" "(47.579317217, -122.409989598)" "(47.668651602, -122.364558421)" "(47.533143434, -122.269986901)" ...
$ Initial.Type.Description : chr "" "TRU - FORGERY/CHKS/BUNCO/SCAMS/ID THEFT" "BURG - RES (INCL UNOCC STRUCTURES ON PROP)" "FRAUD - FORGERY,BUNCO, SCAMS, ID THEFT, ETC" ...
$ Initial.Type.Subgroup : chr "" "FRAUD CALLS" "BURGLARY" "FRAUD CALLS" ...
$ Initial.Type.Group : chr "" "FRAUD CALLS" "RESIDENTIAL BURGLARIES" "FRAUD CALLS" ...
$ At.Scene.Time : chr "" "12/24/2012 10:33:00 AM" "" "" ...
- attr(*, ".internal.selfref")=<externalptr>
attach(incidents)
# Create some color variables for graphing later
custGrey = "#A9A9A9"
#add year to the incidents data frame
incidents$ymd <-mdy_hms(Event.Clearance.Date)
incidents$month <- lubridate::month(incidents$ymd)
incidents$year <- year(incidents$ymd)
incidents$wday <- lubridate::wday(incidents$ymd, label = TRUE)
incidents$hour <- hour(incidents$ymd)
#Create a more manageable data frame with only 2017 data
i2 <- incidents[year>=2017, ]
#Only include complete cases
i2[complete.cases(i2), ]
attach(i2)
The following objects are masked from incidents:
At.Scene.Time, CAD.CDW.ID, CAD.Event.Number,
Census.Tract, District.Sector,
Event.Clearance.Code, Event.Clearance.Date,
Event.Clearance.Description,
Event.Clearance.Group,
Event.Clearance.SubGroup,
General.Offense.Number,
Hundred.Block.Location, Incident.Location,
Initial.Type.Description,
Initial.Type.Group, Initial.Type.Subgroup,
Latitude, Longitude, V1, Zone.Beat
head(i2)
#Group the data into a new data frame which has the count of events per month by subgroup
groupSummaries <- i2 %>%
group_by(month, Event.Clearance.SubGroup) %>%
summarize(N = length(Event.Clearance.SubGroup))
#View the new data set
head(groupSummaries, n=100)
attach(groupSummaries)
The following objects are masked from i2:
Event.Clearance.SubGroup, month
The following object is masked from incidents:
Event.Clearance.SubGroup
#Graph the data set through ggplot 2
ggplot(groupSummaries, aes(x=month, y=N, color=Event.Clearance.SubGroup) )+
geom_line() +
theme(legend.position="bottom",legend.text=element_text(size=7),
legend.title = element_blank()) +
scale_x_discrete(name ="Month",
limits=c(3,6,9,12))
# Create a data frame with only events types that have had a peak of 95 calls in a month or more
groupSummariesF <- groupSummaries %>%
group_by(Event.Clearance.SubGroup) %>%
filter(max(N) > 95) %>%
ungroup()
head(groupSummariesF)
# Create a layered plot with one layer of grey data for the full data set and one layer of color data for the subset data set
ggplot() +
geom_line(aes(month, N, group = Event.Clearance.SubGroup),
data = groupSummaries, colour = alpha("grey", 0.7)) +
geom_line(aes(month, N, group = Event.Clearance.SubGroup, colour = Event.Clearance.SubGroup),
data = groupSummariesF) +
scale_x_discrete(name ="Month",
limits=c(3,6,9,12)) +
theme(legend.position="bottom",legend.text=element_text(size=7),
legend.title = element_blank())
While the above methodology is quite easy, it can be a bit of a pain at times to create and add the new data frame. Further, you have to tinker more with the labelling to really call out the highlighted data points.
Thanks to Hiroaki Yutani, we now have the gghighlight package which does most of the work for us with a small function call!! Please note that a lot of this code was created by looking at examples on her introduction document.
The new school way is even simplier:
For our first example, we are going to create the same time series graph from above. However, we are going to perform the highlighting with gghighlight vs manual layering.
# Install the gghighlight package
#install.packages("gghighlight")
#library(gghighlight)
# Create the highlighted graph
ggplot(groupSummaries, aes(month, N, colour = Event.Clearance.SubGroup)) +
geom_line() +
gghighlight(max(N) > 95, label_key = Event.Clearance.SubGroup) +
scale_x_discrete(name ="Month",
limits=c(3,6,9,12))
Well that was so easy, we are going to try a few more ggmap plot types to see how we fare. Below show both a scatterplot and histogram chart.
# Try a scatterplot chart
ggplot(groupSummaries, aes(month, N, colour = Event.Clearance.SubGroup, use_group_by=FALSE)) +
geom_point() +
gghighlight(N > 200, label_key = Event.Clearance.SubGroup) +
scale_x_discrete(name ="Month",
limits=c(3,6,9,12))
You set use_group_by = TRUE, but grouped calculation failed.
Falling back to ungrouped filter operation...
# Try a histogram chart
ggplot(groupSummaries, aes(N, fill = Event.Clearance.SubGroup)) +
geom_histogram() +
theme(legend.position="bottom",legend.text=element_text(size=7),
legend.title = element_blank()) +
gghighlight(N > 100, label_key = Event.Clearance.SubGroup, use_group_by = FALSE) +
facet_wrap(~ Event.Clearance.SubGroup)