This document is an example to produce faceted heatmaps with R package ggplot2. Especially, focusing on the log-type dataset, it present a way to manipulate time/date variable. The original post is here.

The dataset is a simulated data which provides 200,000 “events” (RDP login attempts) in the eventlog.csv file that has three fields, timestamp, source_country (ISO 3166-1 alpha-2 country code, source of the attack), and the tz (time zone of the source IP address).

Load the necessary libraries

library(data.table)    # faster fread() function
library(dplyr)         # data.frame operations
library(tidyr)
library(purrr)         # list/vecgtor munging
library(lubridate)     # date/time manipulation
library(countrycode)   # turn country codes into names 
library(ggplot2)       #
library(scales)        # working with ggplot2 for label formatting
library(gridExtra)     # working with ggplots for arranging plots
library(ggthemes)      # clean theme for ggplot2
library(viridis)       # color palette
library(DT)            # prettier data.frame output

Load the data from csv file

There are two ways to load the CSV file here, data.table function and/or fread function.

attacks = read.table("https://raw.githubusercontent.com/hrbrmstr/facetedcountryheatmaps/master/data/eventlog.csv",sep=",", header=T, na.strings="?")
head(attacks)
##                     timestamp source_country              tz
## 1 2015-03-12T15:59:16.718901Z             CN   Asia/Shanghai
## 2 2015-03-12T16:00:48.841746Z             FR    Europe/Paris
## 3 2015-03-12T16:02:26.731256Z             CN   Asia/Shanghai
## 4 2015-03-12T16:02:38.469907Z             US America/Chicago
## 5 2015-03-12T16:03:22.201903Z             CN   Asia/Shanghai
## 6 2015-03-12T16:03:45.984616Z             CN   Asia/Shanghai
attacks = fread("https://raw.githubusercontent.com/hrbrmstr/facetedcountryheatmaps/master/data/eventlog.csv")
head(attacks)
##                      timestamp source_country              tz
## 1: 2015-03-12T15:59:16.718901Z             CN   Asia/Shanghai
## 2: 2015-03-12T16:00:48.841746Z             FR    Europe/Paris
## 3: 2015-03-12T16:02:26.731256Z             CN   Asia/Shanghai
## 4: 2015-03-12T16:02:38.469907Z             US America/Chicago
## 5: 2015-03-12T16:03:22.201903Z             CN   Asia/Shanghai
## 6: 2015-03-12T16:03:45.984616Z             CN   Asia/Shanghai
attacks = tbl_df(attacks)
datatable(head(attacks))

Prepare the data for plots

For the activities heatmap, we want to present how active the attack is during weekday of each hour. We need to get the weekday/hour information for each attack from the timestamp field. To do this, we write up a function.

make_hr_wkday = function(cc, ts, tz) {
 
     real_times = ymd_hms(ts, tz = tz[1], quiet = TRUE)
 
     data_frame(source_country = cc,
                wkday = weekdays(as.Date(real_times, tz = tz[1])),
                hour = format(real_times, "%H", tz = tz[1]))
}
 
attacks = group_by(attacks, tz) %>%
  do(make_hr_wkday(.$source_country, .$timestamp, .$tz)) %>%
  ungroup() %>%
  mutate(wkday=factor(wkday,
                      levels=levels(weekdays(0, FALSE))))
datatable(head(attacks))

Overall heatmap step-by-step

It’s straightforward to make an overall heatmap of activity. Group & count the number of “attack” by weekday and hour then use geom_tile() function.

Group and count:

wkdays = count(attacks, wkday, hour)
datatable(head(wkdays))

Create the ggplot object, use the hour as the x-axis, the wkday as y-axis, fill by n aggregated count:

gg = ggplot(wkdays, aes(x=hour, y=wkday, fill=n))

Make tiles at each x&y location we specified using geom_tile() function, and tell it to use a thin (0.1 unit) white border to separate the tiles.

gg = gg + geom_tile(color="white", size=0.1)

Add the color scale using the viridis package. By specifying the name here, we get a label on the legend.

gg = gg + scale_fill_viridis(name="# Events", label=comma)

To make sure each tile have a 1:1 aspect ratio (geom_tile() which draws rectangles in default):

gg = gg + coord_equal()

Add title and remove x-axis and y-axis labels

gg = gg + labs(x=NULL, y=NULL, title="Events per weekday & hour of day")

Pick a nice theme from the ggthemes package, theme_tufte(). Put the title left-aligned.

gg = gg + theme_tufte(base_family="Helvetica")
gg = gg + theme(plot.title=element_text(hjust=0))

Remove the tick marks on the axes, also you can adjust the title and legend fonts size.

gg = gg + theme(axis.ticks = element_blank())
gg = gg + theme(axis.text = element_text(size=7))

gg = gg + theme(legend.title = element_text(size=8))
gg = gg + theme(legend.text = element_text(size=6))

gg

Heatmap breakdown by country

There are two ways to do it: first all heatmap plots of each country are using the same scale, then with each country heatmay we use its own scale. By doing this, it will let us compare at a macro and micro level.

For each view, we need to rearrange the dataset with nice country names versue 2-letter abbreviations.

Heatmap with the same scale

events_by_country = count(attacks, source_country) %>%
  mutate(percent=percent(n/sum(n)), count=comma(n)) %>%
  mutate(country=sprintf("%s (%s)", countrycode(source_country, "iso2c", "country.name"), source_country)) %>% arrange(desc(n))
datatable(events_by_country)

Make a simple ggplot facet, exclude the top 2 attacking countries since they skew things a lot.

Here we filtered by the top 10 (exclude the top 2) countries, then doing the gorup and count as previous. Using the left_join to get the country info. Not all countries attacked every day/hour, here we use complete() operation to ensure we have values for all countries fro each day/hour combination. Also, turn the country into an ordered factor.

cc_heat = filter(attacks, source_country %in% events_by_country$source_country[3:12]) %>%
  count(source_country, wkday, hour) %>%
  ungroup() %>%
  left_join(events_by_country[,c(1,5)]) %>%
  complete(country, wkday, hour, fill=list(n=0)) %>%
  mutate(country=factor(country,
                        levels=events_by_country$country[3:12]))
## Joining by: "source_country"
gg <- ggplot(cc_heat, aes(x=hour, y=wkday, fill=n))
gg <- gg + geom_tile(color="white", size=0.1)
gg <- gg + scale_fill_viridis(name="# Events")
gg <- gg + coord_equal()
gg <- gg + facet_wrap(~country, ncol=2)
gg <- gg + labs(x=NULL, y=NULL, title="Events per weekday & time of day by country\n")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_text(size=5))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(plot.title=element_text(hjust=0))
gg <- gg + theme(strip.text=element_text(hjust=0))
gg <- gg + theme(panel.margin.x=unit(0.5, "cm"))
gg <- gg + theme(panel.margin.y=unit(0.5, "cm"))
gg <- gg + theme(legend.title=element_text(size=6))
gg <- gg + theme(legend.title.align=1)
gg <- gg + theme(legend.text=element_text(size=6))
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(legend.key.size=unit(0.2, "cm"))
gg <- gg + theme(legend.key.width=unit(1, "cm"))
gg

Heatmap plots with their own scale

To get individual scales for each country we need to make n separate ggplot objects first, and then combine them using gridExtra::grid.arrange function. It’s very similar commands as before, only without the facet call.

Here We’ll do the top 16 countries (not excluding anything).

Prepare the dataset

cc_heat2 = count(attacks, source_country, wkday, hour) %>% 
  ungroup() %>% 
  left_join(events_by_country[,c(1,5)]) %>% 
  complete(country, wkday, hour, fill=list(n=0)) %>% 
  mutate(country=factor(country,
                        levels=events_by_country$country))
## Joining by: "source_country"

Make the heatmap plots

cclist = lapply(events_by_country$country[1:16], function(cc) {
  gg <- ggplot(filter(cc_heat2, country==cc), 
               aes(x=hour, y=wkday, fill=n, frame=country))
  gg <- gg + geom_tile(color="white", size=0.1)
  gg <- gg + scale_x_discrete(expand=c(0,0))
  gg <- gg + scale_y_discrete(expand=c(0,0))
  gg <- gg + scale_fill_viridis(name="")
  gg <- gg + coord_equal()
  gg <- gg + labs(x=NULL, y=NULL, 
                  title=sprintf("%s", cc))
  gg <- gg + theme_tufte(base_family="Helvetica")
  gg <- gg + theme(axis.ticks=element_blank())
  gg <- gg + theme(axis.text=element_text(size=5))
  gg <- gg + theme(panel.border=element_blank())
  gg <- gg + theme(plot.title=element_text(hjust=0, size=6))
  gg <- gg + theme(panel.margin.x=unit(0.5, "cm"))
  gg <- gg + theme(panel.margin.y=unit(0.5, "cm"))
  gg <- gg + theme(legend.title=element_text(size=6))
  gg <- gg + theme(legend.title.align=1)
  gg <- gg + theme(legend.text=element_text(size=6))
  gg <- gg + theme(legend.position="bottom")
  gg <- gg + theme(legend.key.size=unit(0.2, "cm"))
  gg <- gg + theme(legend.key.width=unit(1, "cm"))
  gg
})

cclist[["ncol"]] <- 2
do.call(grid.arrange, cclist)

Last updated: 2016-02-16