dplyr and findIntervalMy friend, Wilson Chua, was trying to merge two data sets in R in which a value in the variable SrcIP (to be explained in a minute) is to be matched with an indicator variable ASNUM which is assigned to a range of values with begin_ip_num as the lower limit and end_ip_num as the upper limit.
I have a similar problem before in which I attempted to use R to match a numerical rating to intervals of scores for rating my students. That was easily solved by using the base function cut. But one of the data sets Wilson was working had 1,770,665 rows and at least 9 columns. When I attempted a cut solution to his problem, the object created exceeded my computer’s physical memory.
So I did some searching and came upon a function in the base package that I have rarely used: findInterval.
From the documentation of findInterval, findInterval(x, vec):
Given a vector of non-decreasing breakpoints in vec, find the interval containing each element of x
The output of findInterval is a vector of increasing index numbers i. An element of x having the index number i belongs to the interval to which findInterval assigned that index number i.
Suppose we have closed intervals \([3,5]\), \([6,10]\), and \([12, 15]\) which are assigned assigned the values AS1, AS2, and AS3. Note that the intervals do not cover all of the numbers 1 to 18. We want to assign one of AS1, AS2, and AS3 to each value of x whenever that value falls in the corresponding interval. What we can do is find the intervals defined by the break points u (the lower limits) and find the intervals defined by the break points v (the upper limits). We have to add 1 to v though in order to ensure that v doesn’t belong to the interval with lower limit v + 1.
x <- 1:18
u <- c(3,6,12)
v <- c(5,10,15)
data1 <- data.frame(x, loc_lower = findInterval(x, u), loc_upper = findInterval(x, v + 1))
data1
## x loc_lower loc_upper
## 1 1 0 0
## 2 2 0 0
## 3 3 1 0
## 4 4 1 0
## 5 5 1 0
## 6 6 2 1
## 7 7 2 1
## 8 8 2 1
## 9 9 2 1
## 10 10 2 1
## 11 11 2 2
## 12 12 3 2
## 13 13 3 2
## 14 14 3 2
## 15 15 3 2
## 16 16 3 3
## 17 17 3 3
## 18 18 3 3
Next, we create the second data set which has a unique identifier ASNUM for each pair of index numbers loc_lower and loc_upper.
data2 <- data.frame(
# find the index numbers for values above the lower limits
loc_lower = findInterval(u, u),
# find the index numbers for values above the upper limits
loc_upper = findInterval(u, v + 1),
ASNUM = c("AS1", "AS2", "AS3")
)
data2
## loc_lower loc_upper ASNUM
## 1 1 0 AS1
## 2 2 1 AS2
## 3 3 2 AS3
To merge the data, we can use dplyr::left_join.
library(tidyverse)
## + ggplot2 2.2.1 Date: 2017-08-28
## + tibble 1.3.4 R: 3.4.1
## + tidyr 0.7.0 OS: Manjaro Linux
## + readr 1.1.1 GUI: X11
## + purrr 0.2.3 Locale: en_PH.UTF-8
## + dplyr 0.7.2 TZ: Asia/Manila
## + stringr 1.2.0
## + forcats 0.2.0
## ── Conflicts ────────────────────────────────────────────────────
## * filter(), from dplyr, masks stats::filter()
## * lag(), from dplyr, masks stats::lag()
data3 <- left_join(data1, data2, by = c("loc_lower", "loc_upper"))
data3
## x loc_lower loc_upper ASNUM
## 1 1 0 0 <NA>
## 2 2 0 0 <NA>
## 3 3 1 0 AS1
## 4 4 1 0 AS1
## 5 5 1 0 AS1
## 6 6 2 1 AS2
## 7 7 2 1 AS2
## 8 8 2 1 AS2
## 9 9 2 1 AS2
## 10 10 2 1 AS2
## 11 11 2 2 <NA>
## 12 12 3 2 AS3
## 13 13 3 2 AS3
## 14 14 3 2 AS3
## 15 15 3 2 AS3
## 16 16 3 3 <NA>
## 17 17 3 3 <NA>
## 18 18 3 3 <NA>
We note that since x = 1 does not belong to the intervals that we have defined earlier, its ASNUM is NA.
There must be, for instance using data.table but I am not versed in that package. If you know how to do this more elegantly in R, please comment below.
Wilson is kind enough to send me the mlab and maxmind data sets. He wanted to match IP’s in mlab data set with the ASN in the maxmind data set. In order to do this, we will need the iptools package in order to transform IP’s to numeric values and match those values with the range of numeric values (with lower limits begin_ip_num and upper limits end_ip_num in the maxmind data set, respectively) to find the specfic ASN (ASNUM). The data sets have been saved in Dropbox and are available here and here. Download these data sets and save them in your working directory. The combined data sets is at least 300 Mb in size, so your download might take a while. Once you have done this, you can start with the following codes. As usual, I put comments in the codes for the benefit of my readers who are just starting in R.
library(iptools)
#read in the source logs from Mlabs
mlab <- read_csv("mlab.csv", col_types = cols(MinRTT = col_double(), Duration = col_double(), OctetsAcked = col_double()), na = ".")
mlab.1 <- mlab %>% select(HostIP, ClientsIP)
Now that we have loaded the csv table next we find the AS number and geolocation.
# since the asnip is not part of the iptools that we have downloaded
# we now need to convert dotted ip into decimal for lookup with maxmind database
mlab.2 <- mlab.1 %>% mutate(SrCIP = ip_to_numeric(HostIP), DstIP = ip_to_numeric(ClientsIP))
mlab.2
## # A tibble: 1,770,665 x 4
## HostIP ClientsIP SrCIP DstIP
## <chr> <chr> <dbl> <dbl>
## 1 83.212.4.10 222.127.245.239 1406403594 3732927983
## 2 38.98.51.33 222.127.246.128 643969825 3732928128
## 3 203.5.76.140 222.127.246.217 3406122124 3732928217
## 4 203.5.76.151 222.127.246.227 3406122135 3732928227
## 5 38.98.51.20 222.127.246.239 643969812 3732928239
## 6 203.5.76.140 222.127.246.239 3406122124 3732928239
## 7 4.71.210.224 222.127.246.239 71815904 3732928239
## 8 203.5.76.151 222.127.248.113 3406122135 3732928625
## 9 83.212.4.37 222.127.248.137 1406403621 3732928649
## 10 217.163.1.88 222.127.248.213 3651338584 3732928725
## # ... with 1,770,655 more rows
Read in the maxmind data for ASN lookup.
maxmind <- read_csv("maxmind.csv",
col_types = cols(
begin_ip_num = col_double(),
end_ip_num = col_double()),
na = "."
)
# sort the table by begin_ip_num to make sure that
# findInterval does not throw fits
maxmind <- arrange(maxmind, begin_ip_num) %>%
# assign an interval locator number to each begin_ip_num
# this works since begin_ip_num and end_ip_num
# are clearly limits of non-overlapping intervals
mutate(
interval_lower = findInterval(begin_ip_num, begin_ip_num),
# add 1 from end_ip_num so that the upper limit is indexed correctly
interval_upper = findInterval(begin_ip_num, end_ip_num + 1)
)
Assign the same intervals to mlab.2 using the same interval location variables name interval_lower and interval_upper.
mlab.2 <- mlab.2 %>% mutate(
interval_lower = findInterval(mlab.2$SrCIP, maxmind$begin_ip_num),
# add 1 from end_ip_num so that the upper limit is indexed correctly
interval_upper = findInterval(mlab.2$SrCIP, maxmind$end_ip_num + 1)
)
Merge the data sets using interval_lower and interval_upper.
mlab.3 <- left_join(mlab.2, maxmind, by = c("interval_lower", "interval_upper"))
Print the Organizations in descending order.
mlab.3 %>% count(ASNUM, Org) %>% arrange(desc(n))
## # A tibble: 38 x 3
## ASNUM Org n
## <chr> <chr> <int>
## 1 AS7575 Australian Academic and Reasearch Network (AARNet) 480676
## 2 AS1659 Taiwan Academic Network (TANet) Information Center 282865
## 3 AS9821 Department of Science and Technology 238522
## 4 AS174 Cogent Communications 173984
## 5 AS3356 Level 3 Communications, Inc. 152140
## 6 AS5408 Greek Research and Technology Network S.A 120336
## 7 AS1299 Telia Company AB 106184
## 8 AS2500 WIDE Project 62502
## 9 AS29791 Voxel Dot Net, Inc. 52429
## 10 AS36492 Google, Inc. 37403
## # ... with 28 more rows
Make a bar graph of the top 20 organizations.
mlab.3 %>%
# group and taly by organization
count(Org) %>%
# arrange in descending order
mutate(Org = reorder(Org, n)) %>%
# choose the top 20 organizations
top_n(20) %>%
# plot
ggplot(aes(Org, n)) +
# the bar graph
geom_bar(stat = "identity") +
# change the x label
xlab("Organization") +
# I like the black and white theme
theme_bw() +
# flip the coordinates
coord_flip()
## Selecting by n