Looking up values in R using dplyr and findInterval

Introduction

My friend, Wilson Chua, was trying to merge two data sets in R in which a value in the variable SrcIP (to be explained in a minute) is to be matched with an indicator variable ASNUM which is assigned to a range of values with begin_ip_num as the lower limit and end_ip_num as the upper limit.

I have a similar problem before in which I attempted to use R to match a numerical rating to intervals of scores for rating my students. That was easily solved by using the base function cut. But one of the data sets Wilson was working had 1,770,665 rows and at least 9 columns. When I attempted a cut solution to his problem, the object created exceeded my computer’s physical memory.

So I did some searching and came upon a function in the base package that I have rarely used: findInterval.

A demonstration with small data sets.

From the documentation of findInterval, findInterval(x, vec):

   Given a vector of non-decreasing breakpoints in vec, find the interval containing each element of x

The output of findInterval is a vector of increasing index numbers i. An element of x having the index number i belongs to the interval to which findInterval assigned that index number i.

Suppose we have closed intervals \([3,5]\), \([6,10]\), and \([12, 15]\) which are assigned assigned the values AS1, AS2, and AS3. Note that the intervals do not cover all of the numbers 1 to 18. We want to assign one of AS1, AS2, and AS3 to each value of x whenever that value falls in the corresponding interval. What we can do is find the intervals defined by the break points u (the lower limits) and find the intervals defined by the break points v (the upper limits). We have to add 1 to v though in order to ensure that v doesn’t belong to the interval with lower limit v + 1.

x <- 1:18
u <- c(3,6,12)
v <- c(5,10,15)
data1 <- data.frame(x, loc_lower = findInterval(x, u), loc_upper = findInterval(x, v + 1))
data1

##     x loc_lower loc_upper
## 1   1         0         0
## 2   2         0         0
## 3   3         1         0
## 4   4         1         0
## 5   5         1         0
## 6   6         2         1
## 7   7         2         1
## 8   8         2         1
## 9   9         2         1
## 10 10         2         1
## 11 11         2         2
## 12 12         3         2
## 13 13         3         2
## 14 14         3         2
## 15 15         3         2
## 16 16         3         3
## 17 17         3         3
## 18 18         3         3

Next, we create the second data set which has a unique identifier ASNUM for each pair of index numbers loc_lower and loc_upper.

data2 <- data.frame(
  # find the index numbers for values above the lower limits
  loc_lower = findInterval(u, u), 
  # find the index numbers for values above the upper limits
  loc_upper = findInterval(u, v + 1),
  ASNUM = c("AS1", "AS2", "AS3")
)
data2

##   loc_lower loc_upper ASNUM
## 1         1         0   AS1
## 2         2         1   AS2
## 3         3         2   AS3

To merge the data, we can use dplyr::left_join.

library(tidyverse)

## + ggplot2 2.2.1        Date: 2017-08-28
## + tibble  1.3.4           R: 3.4.1
## + tidyr   0.7.0          OS: Manjaro Linux
## + readr   1.1.1         GUI: X11
## + purrr   0.2.3      Locale: en_PH.UTF-8
## + dplyr   0.7.2          TZ: Asia/Manila
## + stringr 1.2.0      
## + forcats 0.2.0

## ── Conflicts ────────────────────────────────────────────────────

## * filter(),  from dplyr, masks stats::filter()
## * lag(),     from dplyr, masks stats::lag()

data3 <- left_join(data1, data2, by = c("loc_lower", "loc_upper"))
data3

##     x loc_lower loc_upper ASNUM
## 1   1         0         0  <NA>
## 2   2         0         0  <NA>
## 3   3         1         0   AS1
## 4   4         1         0   AS1
## 5   5         1         0   AS1
## 6   6         2         1   AS2
## 7   7         2         1   AS2
## 8   8         2         1   AS2
## 9   9         2         1   AS2
## 10 10         2         1   AS2
## 11 11         2         2  <NA>
## 12 12         3         2   AS3
## 13 13         3         2   AS3
## 14 14         3         2   AS3
## 15 15         3         2   AS3
## 16 16         3         3  <NA>
## 17 17         3         3  <NA>
## 18 18         3         3  <NA>

We note that since x = 1 does not belong to the intervals that we have defined earlier, its ASNUM is NA.

Is there a better solution?

There must be, for instance using data.table but I am not versed in that package. If you know how to do this more elegantly in R, please comment below.

A larger application

Wilson is kind enough to send me the mlab and maxmind data sets. He wanted to match IP’s in mlab data set with the ASN in the maxmind data set. In order to do this, we will need the iptools package in order to transform IP’s to numeric values and match those values with the range of numeric values (with lower limits begin_ip_num and upper limits end_ip_num in the maxmind data set, respectively) to find the specfic ASN (ASNUM). The data sets have been saved in Dropbox and are available here and here. Download these data sets and save them in your working directory. The combined data sets is at least 300 Mb in size, so your download might take a while. Once you have done this, you can start with the following codes. As usual, I put comments in the codes for the benefit of my readers who are just starting in R.

library(iptools)

#read in the source logs from Mlabs
mlab <- read_csv("mlab.csv", col_types = cols(MinRTT = col_double(), Duration = col_double(), OctetsAcked = col_double()), na = ".")
mlab.1 <- mlab %>% select(HostIP, ClientsIP)

Now that we have loaded the csv table next we find the AS number and geolocation.

# since the asnip is not part of the iptools that we have downloaded
# we now need to convert dotted ip into decimal for lookup with maxmind database
mlab.2 <- mlab.1 %>% mutate(SrCIP = ip_to_numeric(HostIP), DstIP = ip_to_numeric(ClientsIP))
mlab.2

## # A tibble: 1,770,665 x 4
##          HostIP       ClientsIP      SrCIP      DstIP
##           <chr>           <chr>      <dbl>      <dbl>
##  1  83.212.4.10 222.127.245.239 1406403594 3732927983
##  2  38.98.51.33 222.127.246.128  643969825 3732928128
##  3 203.5.76.140 222.127.246.217 3406122124 3732928217
##  4 203.5.76.151 222.127.246.227 3406122135 3732928227
##  5  38.98.51.20 222.127.246.239  643969812 3732928239
##  6 203.5.76.140 222.127.246.239 3406122124 3732928239
##  7 4.71.210.224 222.127.246.239   71815904 3732928239
##  8 203.5.76.151 222.127.248.113 3406122135 3732928625
##  9  83.212.4.37 222.127.248.137 1406403621 3732928649
## 10 217.163.1.88 222.127.248.213 3651338584 3732928725
## # ... with 1,770,655 more rows

Read in the maxmind data for ASN lookup.

maxmind <- read_csv("maxmind.csv", 
                    col_types = cols(
                      begin_ip_num = col_double(), 
                      end_ip_num = col_double()),
                    na = "."
                    )
# sort the table by begin_ip_num to make sure that 
# findInterval does not throw fits
maxmind <-  arrange(maxmind, begin_ip_num) %>%
# assign an interval locator number to each begin_ip_num
# this works since begin_ip_num and end_ip_num
# are clearly limits of non-overlapping intervals
  mutate(
    interval_lower = findInterval(begin_ip_num, begin_ip_num),
  # add 1 from end_ip_num so that the upper limit is indexed correctly
    interval_upper = findInterval(begin_ip_num, end_ip_num + 1)
)

Assign the same intervals to mlab.2 using the same interval location variables name interval_lower and interval_upper.

mlab.2 <- mlab.2 %>% mutate(
  interval_lower = findInterval(mlab.2$SrCIP, maxmind$begin_ip_num),
  # add 1 from end_ip_num so that the upper limit is indexed correctly
  interval_upper = findInterval(mlab.2$SrCIP, maxmind$end_ip_num + 1)
)

Merge the data sets using interval_lower and interval_upper.

mlab.3 <- left_join(mlab.2, maxmind, by = c("interval_lower", "interval_upper"))

Print the Organizations in descending order.

mlab.3 %>% count(ASNUM, Org) %>% arrange(desc(n))

## # A tibble: 38 x 3
##      ASNUM                                                Org      n
##      <chr>                                              <chr>  <int>
##  1  AS7575 Australian Academic and Reasearch Network (AARNet) 480676
##  2  AS1659 Taiwan Academic Network (TANet) Information Center 282865
##  3  AS9821               Department of Science and Technology 238522
##  4   AS174                              Cogent Communications 173984
##  5  AS3356                       Level 3 Communications, Inc. 152140
##  6  AS5408          Greek Research and Technology Network S.A 120336
##  7  AS1299                                   Telia Company AB 106184
##  8  AS2500                                       WIDE Project  62502
##  9 AS29791                                Voxel Dot Net, Inc.  52429
## 10 AS36492                                       Google, Inc.  37403
## # ... with 28 more rows

Make a bar graph of the top 20 organizations.

mlab.3 %>% 
  # group and taly by organization
  count(Org) %>% 
  # arrange in descending order
  mutate(Org = reorder(Org, n)) %>% 
  # choose the top 20 organizations
  top_n(20) %>%
  # plot
  ggplot(aes(Org, n)) + 
  # the bar graph
  geom_bar(stat = "identity") + 
  # change the x label
  xlab("Organization") + 
  # I like the black and white theme
  theme_bw() +
  # flip the coordinates
  coord_flip()

## Selecting by n

Looking up values in R using `dplyr` and `findInterval`

Joseph S. Tabadero, Jr.

April 15, 2017

Introduction

A demonstration with small data sets.

Is there a better solution?

A larger application