Converting events into a time series and visualizing them

Introduction

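The logs we look at in information security work are usually events (a point process) and only rarely time series. Aggregating such logs at a fixed interval, such as daily or weekly, and then visualizing the result is arguably a routine task.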

This post covers ggplot2 idioms for visualizing time series data that started out as event data. We begin with ggplot's default chart and refine it step by step.

pacman::p_load(tidyverse)

read_combined <- function(file) {
  names <- c("ip_address", "remote_user_ident", "local_user_ident", "timestamp", 
             "request", "status_code", "bytes_sent", "referer", "user_agent")
  col_types <- list(col_character(), col_character(), col_character(), 
                    col_datetime("%d/%b/%Y:%H:%M:%S %z"), col_character(), 
                    col_character(), col_integer(), col_character(), col_character())
  data <- read_log(file = file, col_names = names, col_types = col_types)
  return(data)
}

src_url <- "https://github.com/elastic/examples/blob/master/Common%20Data%20Formats/apache_logs/apache_logs?raw=true"
log_raw <- read_combined(src_url)

log_raw %>% 
  slice_head(n = 5)

## # A tibble: 5 × 9
##   ip_address remot…¹ local…² timestamp           request statu…³ bytes…⁴ referer
##   <chr>      <chr>   <chr>   <dttm>              <chr>   <chr>     <int> <chr>  
## 1 83.149.9.… <NA>    <NA>    2015-05-17 10:05:03 GET /p… 200      203023 http:/…
## 2 83.149.9.… <NA>    <NA>    2015-05-17 10:05:43 GET /p… 200      171717 http:/…
## 3 83.149.9.… <NA>    <NA>    2015-05-17 10:05:47 GET /p… 200       26185 http:/…
## 4 83.149.9.… <NA>    <NA>    2015-05-17 10:05:12 GET /p… 200        7697 http:/…
## 5 83.149.9.… <NA>    <NA>    2015-05-17 10:05:07 GET /p… 200        2892 http:/…
## # … with 1 more variable: user_agent <chr>, and abbreviated variable names
## #   ¹remote_user_ident, ²local_user_ident, ³status_code, ⁴bytes_sent

As usual, the source data is the sample log published by Elastic.

Visualization steps

This Apache log contains records from 10:00 on May 17, 2015 to 21:00 on May 20. Here we will visualize the hourly counts for May 18.

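There are two tricks for this kind of chart:

Create a field in which everything finer than the time unit of interest (day, hour, and so on) is truncated.
Shift the displayed range of the chart half a unit to the left of the time range of interest.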
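The two tricks can be sketched on a few made-up timestamps. This is only an illustration: the events vector, the bucket and display_limits names, and the use of table() for counting are assumptions for this sketch and are not taken from the Apache log.

```r
library(lubridate)

# Hypothetical event timestamps (not taken from the Apache log)
events <- ymd_hms(c("2015-05-18 00:12:45", "2015-05-18 00:48:03",
                    "2015-05-18 01:05:30"), tz = "UTC")

# Trick 1: truncate everything finer than the unit of interest (the hour)
bucket <- floor_date(events, unit = "hour")
hourly_counts <- table(format(bucket, "%H:%M", tz = "UTC"))

# Trick 2: shift the display range half a unit (30 minutes) to the left
day_start <- ymd("2015-05-18", tz = "UTC")
display_limits <- c(day_start - minutes(30),
                    day_start + hours(24) - minutes(30))
```

Computing the limits from day_start keeps the half-unit shift in one place if the day or the unit changes.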

Truncating with floor_date()

pacman::p_load(lubridate)
log_mod <- 
  log_raw |> 
  mutate(
    dttm_m = floor_date(timestamp, unit = "hour")
  ) |> 
  filter(dttm_m >= ymd("2015-05-18", tz = "UTC")) |> 
  filter(dttm_m <  ymd("2015-05-19", tz = "UTC"))

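In the code above, lubridate::floor_date() does the truncation. The filter() calls that follow only drop unneeded records and are not required: as long as the display range is restricted to the 18th, data from the 17th or the 19th has no effect on the chart.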

I ran them anyway because I wanted to show that lubridate::ymd() accepts a time zone option.
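A small sketch of what the tz option changes: without tz, ymd() returns a Date, and with tz it returns a POSIXct at midnight in that zone. The variable names d_date and d_dttm are chosen here for illustration only.

```r
library(lubridate)

# Without tz, ymd() returns a Date
d_date <- ymd("2015-05-18")

# With tz, ymd() returns a POSIXct at 00:00:00 in that time zone
d_dttm <- ymd("2015-05-18", tz = "UTC")

class(d_date)
class(d_dttm)
```

Because dttm_m is a POSIXct column, comparing it against a POSIXct boundary built this way keeps both sides of the filter in the same type and time zone.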

A bare geom_bar()

First, let's draw a bar chart without any decoration.

log_mod |> 
  ggplot() +
  aes(x = dttm_m, fill = status_code) +
  geom_bar()

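This is enough when you are only checking the data yourself, but a chart meant for other people deserves a little more polish. For example, I would like to make the following improvements.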

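Remove both the x-axis label dttm_m and the y-axis label count, and add a title instead.
Restrict the x axis to 00:00 through 23:00 on the 18th, with a tick mark every hour.
Use a somewhat more subdued color scheme.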

We will address these in order.

labs()

ggplot2::labs() sets labels such as the axis titles and the plot title. If neither the x axis nor the y axis needs a label, pass an empty string for each.

pacman::p_load(lubridate, scales)

log_mod |> 
  ggplot() +
  aes(x = dttm_m, fill = status_code) +
  geom_bar() +
  labs(x = "", y = "",
      title = "2015年5月18日のアクセス数（1時間ごと）")

scale_x_datetime()

The functions that adjust axis scales follow the naming pattern scale_<axis>_<type>(). The x axis here is a date-time, so we use scale_x_datetime().

pacman::p_load(lubridate, scales)

log_mod |> 
  ggplot() +
  aes(x = dttm_m, fill = status_code) +
  geom_bar() +
  scale_x_datetime(date_breaks = "hour",
                   minor_breaks = NULL,
                   limits = c(ymd_hms("2015-05-17 23:30:00", tz = "UTC"), 
                              ymd_hms("2015-05-18 23:30:00", tz = "UTC")),
                   expand = c(0.01, 0.01),
                   labels = label_date_short(format = c("%Y年", "%b月", "%e日", "%H"), sep = "\n")
  ) +
  labs(x = "", y = "",
      title = "2015年5月18日のアクセス数（1時間ごと）")

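The limits argument is the important part of the code above. We want hourly counts for 2015-05-18 00:00:00 through 2015-05-19 00:00:00, but setting limits to exactly that range, as in

limits = c(ymd_hms("2015-05-18 00:00:00", tz = "UTC"), 
           ymd_hms("2015-05-19 00:00:00", tz = "UTC"))

does not work. Each bar is positioned at the truncated timestamp, that is, on the hour mark, and is drawn centered there, so it extends to both sides of that mark. With the limits above, the bar for the 00:00 hour of the 18th sticks out past the lower limit and is not drawn, while the 00:00 mark of the 19th lies inside the range, so that slot could end up being drawn.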
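The reasoning above can be checked with a little arithmetic in base R. This sketch assumes a bar that is a full hour wide and centered on its x value. To my understanding geom_bar() by default uses 90% of the data resolution, so the real bar is slightly narrower, but the conclusion is the same. All variable names are illustrative.

```r
# Center of the first hourly bar of interest
center <- as.POSIXct("2015-05-18 00:00:00", tz = "UTC")

# Assumption: the bar is one hour wide, so it reaches 30 minutes to each side
half_width <- as.difftime(30, units = "mins")
bar_left  <- center - half_width
bar_right <- center + half_width

# Lower limit set exactly at 00:00 versus shifted half a unit to the left
lower_exact   <- center
lower_shifted <- center - half_width

# Does the whole bar fit above each lower limit?
fits_exact   <- bar_left >= lower_exact
fits_shifted <- bar_left >= lower_shifted
```

Here fits_exact is FALSE and fits_shifted is TRUE, which is why the lower limit is placed at 23:30 on the 17th.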

expand controls the padding around the range. The values here come from my own rule of thumb.

The scales::label_date_short() function is what stacks the year, month, and day vertically in the tick labels.
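To see what label_date_short() produces outside of a plot, the labeller it returns can be called directly on a vector of date-times. This is a sketch with made-up tick positions and the function's default-style format strings rather than the Japanese ones used above. As I understand it, components that did not change from the previous tick are left out of later labels.

```r
library(scales)

# Hypothetical tick positions, one per hour
ticks <- as.POSIXct(c("2015-05-18 00:00:00", "2015-05-18 01:00:00",
                      "2015-05-18 02:00:00"), tz = "UTC")

# label_date_short() returns a function that maps date-times to labels
labeller <- label_date_short(format = c("%Y", "%b", "%d", "%H:%M"), sep = "\n")
tick_labels <- labeller(ticks)

# Print each label, separated by a marker line
cat(tick_labels, sep = "\n---\n")
```

Printing the labels this way makes it easy to adjust the format strings before plugging them into scale_x_datetime().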

fct_rev()

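In ggplot, colors are assigned through fill (the interior) and color (the outline). Here fill is mapped to status_code. HTTP status codes are written as numbers such as "200" or "404", but because we specified col_character() when reading the data, the variable is a character string (a categorical variable).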

R gives a categorical variable an ordering: when a character vector is converted to a factor, its levels are sorted. To reverse that default order, we use forcats::fct_rev().
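The effect of fct_rev() is easy to check on a small character vector. The status_code values below are made up for this sketch.

```r
library(forcats)

# Hypothetical status codes stored as character strings
status_code <- c("200", "404", "301", "200")

# Default factor levels are sorted
default_levels <- levels(factor(status_code))

# fct_rev() reverses the level order
reversed_levels <- levels(fct_rev(status_code))
```

With these values, default_levels runs from "200" to "404" and reversed_levels runs from "404" to "200". The level order determines how the fill groups are stacked in each bar.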

pacman::p_load(lubridate, scales)

log_mod |> 
  ggplot() +
  aes(x = dttm_m, fill = forcats::fct_rev(status_code)) +
  geom_bar() +
#  scale_fill_viridis_d(guide = guide_legend(reverse=TRUE)) +
  scale_x_datetime(date_breaks = "hour",
                   minor_breaks = NULL,
                   limits = c(ymd_hms("2015-05-17 23:30:00", tz = "UTC"), 
                              ymd_hms("2015-05-18 23:30:00", tz = "UTC")),
                   expand = c(0.01, 0.01),
                   labels = label_date_short(format = c("%Y年", "%b月", "%e日", "%H"), sep = "\n")
  ) +
  labs(x = "", y = "",
      title = "2015年5月18日のアクセス数（1時間ごと）")

scale_fill_discrete()

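Now the legend title looks odd, because it shows the raw expression forcats::fct_rev(status_code). I also think the legend reads better with 200 at the top.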

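The standard palette ggplot2 uses for categorical variables is provided by scale_fill_discrete(), and we set options on that function. name sets the legend title, direction = -1 reverses the order in which colors are assigned, and guide = guide_legend(reverse = TRUE) reverses the order of the legend entries.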

pacman::p_load(lubridate, scales)

log_mod |> 
  ggplot() +
  aes(x = dttm_m, fill = forcats::fct_rev(status_code)) +
  geom_bar() +
  scale_fill_discrete(
    name = "Status Code",
    direction = -1,
    guide = guide_legend(reverse=TRUE)
  ) +
  scale_x_datetime(date_breaks = "hour",
                   minor_breaks = NULL,
                   limits = c(ymd_hms("2015-05-17 23:30:00", tz = "UTC"), 
                              ymd_hms("2015-05-18 23:30:00", tz = "UTC")),
                   expand = c(0.01, 0.01),
                   labels = label_date_short(format = c("%Y年", "%b月", "%e日", "%H"), sep = "\n")
  ) +
  labs(x = "", y = "",
      title = "2015年5月18日のアクセス数（1時間ごと）")

scale_fill_viridis_d()

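scale_fill_discrete() uses the default palette, but there are variants that come with other palettes. Here we try scale_fill_viridis_d(), which uses viridis, a palette designed with readers who have color vision deficiency in mind. To specify the colors by hand, use scale_fill_manual().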
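The palettes can be inspected without drawing a chart. The sketch below calls the palette functions in the scales package directly. It assumes, to my understanding, that ggplot2's default discrete fill is built on the hue palette and that scale_fill_viridis_d() is built on the viridis palette. The number of codes is made up.

```r
library(scales)

# Hypothetical number of distinct status codes
n_codes <- 4

# Colors from the hue palette (assumed basis of the default discrete fill)
default_cols <- hue_pal()(n_codes)

# Colors from the viridis palette, in normal and reversed direction
viridis_cols <- viridis_pal()(n_codes)
viridis_rev  <- viridis_pal(direction = -1)(n_codes)

print(default_cols)
print(viridis_cols)
print(viridis_rev)
```

The printed hex codes are also a convenient starting point for a hand-picked vector to pass to scale_fill_manual(values = ...).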

pacman::p_load(lubridate, scales)

log_mod |> 
  ggplot() +
  aes(x = dttm_m, fill = forcats::fct_rev(status_code)) +
  geom_bar() +
  scale_fill_viridis_d(
    name = "Status Code",
    direction = -1,
    guide = guide_legend(reverse=TRUE)
  ) +
  scale_x_datetime(date_breaks = "hour",
                   minor_breaks = NULL,
                   limits = c(ymd_hms("2015-05-17 23:30:00", tz = "UTC"), 
                              ymd_hms("2015-05-18 23:30:00", tz = "UTC")),
                   expand = c(0.01, 0.01),
                   labels = label_date_short(format = c("%Y年", "%b月", "%e日", "%H"), sep = "\n")
  ) +
  labs(x = "", y = "",
      title = "2015年5月18日のアクセス数（1時間ごと）")

Bonus: fine-tuning

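The chart above should be good enough, but to introduce a few more features I will push on with further changes. Below, the legend is moved to the top right.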

pacman::p_load(lubridate, scales)

log_mod |> 
  ggplot() +
  aes(x = dttm_m, fill = forcats::fct_rev(status_code)) +
  geom_bar() +
  scale_fill_viridis_d(
    name = "Status Code",
    direction = -1,
    guide = guide_legend(reverse=TRUE)
  ) +
  scale_x_datetime(date_breaks = "hour",
                   minor_breaks = NULL,
                   limits = c(ymd_hms("2015-05-17 23:30:00", tz = "UTC"), 
                              ymd_hms("2015-05-18 23:30:00", tz = "UTC")),
                   expand = c(0.01, 0.01),
                   labels = label_date_short(format = c("%Y年", "%b月", "%e日", "%H"), sep = "\n")
  ) +
  scale_y_continuous(limits = c(0, 150)) +
  theme(
    legend.position = c(1, 1.07),
    legend.justification = c(1, 1),
    legend.title = element_text(size = 9),
    legend.direction = "horizontal",
    legend.box.spacing = unit(0, "pt"),
    legend.margin = margin(2, 2, 2, 2, "pt")
  ) +
  labs(x = "", y = "",
      title = "2015年5月18日のアクセス数（1時間ごと）")

Closing (references)

That covers the steps for converting event data into time series data and then polishing the chart with ggplot.

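Although it comes from a field unrelated to information security, "疫学のためのRハンドブック" (The Epidemiologist R Handbook) is useful for learning visualization ideas and how to implement them.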

If you want to study ggplot in depth, there is the "ggplot2 book".

In addition, "ggplot2のscale_*()関数についてのまとめ" (a summary of ggplot2's scale_*() functions) is well organized.