This document is an example to produce faceted heatmaps with R package ggplot2. Especially, focusing on the log-type dataset, it present a way to manipulate time/date variable. The original post is here.
The dataset is a simulated data which provides 200,000 “events” (RDP login attempts) in the eventlog.csv file that has three fields, timestamp, source_country (ISO 3166-1 alpha-2 country code, source of the attack), and the tz (time zone of the source IP address).
library(data.table) # faster fread() function
library(dplyr) # data.frame operations
library(tidyr)
library(purrr) # list/vecgtor munging
library(lubridate) # date/time manipulation
library(countrycode) # turn country codes into names
library(ggplot2) #
library(scales) # working with ggplot2 for label formatting
library(gridExtra) # working with ggplots for arranging plots
library(ggthemes) # clean theme for ggplot2
library(viridis) # color palette
library(DT) # prettier data.frame output
There are two ways to load the CSV file here, data.table function and/or fread function.
attacks = read.table("https://raw.githubusercontent.com/hrbrmstr/facetedcountryheatmaps/master/data/eventlog.csv",sep=",", header=T, na.strings="?")
head(attacks)
## timestamp source_country tz
## 1 2015-03-12T15:59:16.718901Z CN Asia/Shanghai
## 2 2015-03-12T16:00:48.841746Z FR Europe/Paris
## 3 2015-03-12T16:02:26.731256Z CN Asia/Shanghai
## 4 2015-03-12T16:02:38.469907Z US America/Chicago
## 5 2015-03-12T16:03:22.201903Z CN Asia/Shanghai
## 6 2015-03-12T16:03:45.984616Z CN Asia/Shanghai
attacks = fread("https://raw.githubusercontent.com/hrbrmstr/facetedcountryheatmaps/master/data/eventlog.csv")
head(attacks)
## timestamp source_country tz
## 1: 2015-03-12T15:59:16.718901Z CN Asia/Shanghai
## 2: 2015-03-12T16:00:48.841746Z FR Europe/Paris
## 3: 2015-03-12T16:02:26.731256Z CN Asia/Shanghai
## 4: 2015-03-12T16:02:38.469907Z US America/Chicago
## 5: 2015-03-12T16:03:22.201903Z CN Asia/Shanghai
## 6: 2015-03-12T16:03:45.984616Z CN Asia/Shanghai
attacks = tbl_df(attacks)
datatable(head(attacks))
For the activities heatmap, we want to present how active the attack is during weekday of each hour. We need to get the weekday/hour information for each attack from the timestamp field. To do this, we write up a function.
make_hr_wkday = function(cc, ts, tz) {
real_times = ymd_hms(ts, tz = tz[1], quiet = TRUE)
data_frame(source_country = cc,
wkday = weekdays(as.Date(real_times, tz = tz[1])),
hour = format(real_times, "%H", tz = tz[1]))
}
attacks = group_by(attacks, tz) %>%
do(make_hr_wkday(.$source_country, .$timestamp, .$tz)) %>%
ungroup() %>%
mutate(wkday=factor(wkday,
levels=levels(weekdays(0, FALSE))))
datatable(head(attacks))
It’s straightforward to make an overall heatmap of activity. Group & count the number of “attack” by weekday and hour then use geom_tile() function.
Group and count:
wkdays = count(attacks, wkday, hour)
datatable(head(wkdays))
Create the ggplot object, use the hour as the x-axis, the wkday as y-axis, fill by n aggregated count:
gg = ggplot(wkdays, aes(x=hour, y=wkday, fill=n))
Make tiles at each x&y location we specified using geom_tile() function, and tell it to use a thin (0.1 unit) white border to separate the tiles.
gg = gg + geom_tile(color="white", size=0.1)
Add the color scale using the viridis package. By specifying the name here, we get a label on the legend.
gg = gg + scale_fill_viridis(name="# Events", label=comma)
To make sure each tile have a 1:1 aspect ratio (geom_tile() which draws rectangles in default):
gg = gg + coord_equal()
Add title and remove x-axis and y-axis labels
gg = gg + labs(x=NULL, y=NULL, title="Events per weekday & hour of day")
Pick a nice theme from the ggthemes package, theme_tufte(). Put the title left-aligned.
gg = gg + theme_tufte(base_family="Helvetica")
gg = gg + theme(plot.title=element_text(hjust=0))
Remove the tick marks on the axes, also you can adjust the title and legend fonts size.
gg = gg + theme(axis.ticks = element_blank())
gg = gg + theme(axis.text = element_text(size=7))
gg = gg + theme(legend.title = element_text(size=8))
gg = gg + theme(legend.text = element_text(size=6))
gg
There are two ways to do it: first all heatmap plots of each country are using the same scale, then with each country heatmay we use its own scale. By doing this, it will let us compare at a macro and micro level.
For each view, we need to rearrange the dataset with nice country names versue 2-letter abbreviations.
events_by_country = count(attacks, source_country) %>%
mutate(percent=percent(n/sum(n)), count=comma(n)) %>%
mutate(country=sprintf("%s (%s)", countrycode(source_country, "iso2c", "country.name"), source_country)) %>% arrange(desc(n))
datatable(events_by_country)
Make a simple ggplot facet, exclude the top 2 attacking countries since they skew things a lot.
Here we filtered by the top 10 (exclude the top 2) countries, then doing the gorup and count as previous. Using the left_join to get the country info. Not all countries attacked every day/hour, here we use complete() operation to ensure we have values for all countries fro each day/hour combination. Also, turn the country into an ordered factor.
cc_heat = filter(attacks, source_country %in% events_by_country$source_country[3:12]) %>%
count(source_country, wkday, hour) %>%
ungroup() %>%
left_join(events_by_country[,c(1,5)]) %>%
complete(country, wkday, hour, fill=list(n=0)) %>%
mutate(country=factor(country,
levels=events_by_country$country[3:12]))
## Joining by: "source_country"
gg <- ggplot(cc_heat, aes(x=hour, y=wkday, fill=n))
gg <- gg + geom_tile(color="white", size=0.1)
gg <- gg + scale_fill_viridis(name="# Events")
gg <- gg + coord_equal()
gg <- gg + facet_wrap(~country, ncol=2)
gg <- gg + labs(x=NULL, y=NULL, title="Events per weekday & time of day by country\n")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_text(size=5))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(plot.title=element_text(hjust=0))
gg <- gg + theme(strip.text=element_text(hjust=0))
gg <- gg + theme(panel.margin.x=unit(0.5, "cm"))
gg <- gg + theme(panel.margin.y=unit(0.5, "cm"))
gg <- gg + theme(legend.title=element_text(size=6))
gg <- gg + theme(legend.title.align=1)
gg <- gg + theme(legend.text=element_text(size=6))
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(legend.key.size=unit(0.2, "cm"))
gg <- gg + theme(legend.key.width=unit(1, "cm"))
gg
To get individual scales for each country we need to make n separate ggplot objects first, and then combine them using gridExtra::grid.arrange function. It’s very similar commands as before, only without the facet call.
Here We’ll do the top 16 countries (not excluding anything).
Prepare the dataset
cc_heat2 = count(attacks, source_country, wkday, hour) %>%
ungroup() %>%
left_join(events_by_country[,c(1,5)]) %>%
complete(country, wkday, hour, fill=list(n=0)) %>%
mutate(country=factor(country,
levels=events_by_country$country))
## Joining by: "source_country"
Make the heatmap plots
cclist = lapply(events_by_country$country[1:16], function(cc) {
gg <- ggplot(filter(cc_heat2, country==cc),
aes(x=hour, y=wkday, fill=n, frame=country))
gg <- gg + geom_tile(color="white", size=0.1)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_discrete(expand=c(0,0))
gg <- gg + scale_fill_viridis(name="")
gg <- gg + coord_equal()
gg <- gg + labs(x=NULL, y=NULL,
title=sprintf("%s", cc))
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_text(size=5))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(plot.title=element_text(hjust=0, size=6))
gg <- gg + theme(panel.margin.x=unit(0.5, "cm"))
gg <- gg + theme(panel.margin.y=unit(0.5, "cm"))
gg <- gg + theme(legend.title=element_text(size=6))
gg <- gg + theme(legend.title.align=1)
gg <- gg + theme(legend.text=element_text(size=6))
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(legend.key.size=unit(0.2, "cm"))
gg <- gg + theme(legend.key.width=unit(1, "cm"))
gg
})
cclist[["ncol"]] <- 2
do.call(grid.arrange, cclist)
Last updated: 2016-02-16