The data produced by this code are available on GitHub here.
The U.S. Notice to Mariners data are available in multiple formats, but they all leave something to be desired in terms of data quality and completeness.
Unfortunately, many Broadcast Warnings that appear in the PDF publications are missing from the other available formats, making the PDFs the most complete source of data.
To get at the complete data, then, we must step into the seventh circle of data-mining hell: poorly structured PDF files.
library(dplyr)
library(tidyr)
library(stringr)
library(rvest)
library(pdftools)
library(ggplot2)
We can tap into the NGA site’s HTML and scrape the latest publication’s issue number like so:
notice_to_mariners_url <- "https://msi.nga.mil/NGAPortal/MSI.portal?_nfpb=true&_st=&_pageLabel=msi_portal_page_61"
most_recent_issue <- notice_to_mariners_url %>%
  read_html() %>%
  html_nodes("tr:nth-child(2) .dec-inv") %>%
  html_text()
paste(cat("The most recent issue is:\n"), most_recent_issue)
## The most recent issue is:
## [1] " 48/2017"
most_recent_issue <- most_recent_issue %>%
  str_extract("\\d{2}") %>%
  as.numeric()
paste(cat("The week number we will use for our URL range is:\n"), most_recent_issue)
## The week number we will use for our URL range is:
## [1] " 48"
The URLs for 2017’s PDFs follow a template containing the four-digit year followed by the two-digit week.
Here we replace the two-digit week with %s so that we can format the template for each week of the year.
url_template <-
  "https://msi.nga.mil/MSISiteContent/StaticFiles/NAV_PUBS/UNTM/201701/Broadcast_Warn.pdf" %>%
  str_replace("(\\d{4})\\d{2}", "\\1%s")
paste(cat("Our formatted URL template is:\n"), url_template)
## Our formatted URL template is:
## [1] " https://msi.nga.mil/MSISiteContent/StaticFiles/NAV_PUBS/UNTM/2017%s/Broadcast_Warn.pdf"
Using sprintf() with url_template, we can build a vector of URLs containing each two-digit week up to most_recent_issue.
urls <- sprintf(url_template,
                str_pad(seq_len(most_recent_issue),
                        width = 2,
                        pad = "0"))
paste(cat("Our URLs follow this pattern:\n"), urls[1:5])
## Our URLs follow this pattern:
## [1] " https://msi.nga.mil/MSISiteContent/StaticFiles/NAV_PUBS/UNTM/201701/Broadcast_Warn.pdf"
## [2] " https://msi.nga.mil/MSISiteContent/StaticFiles/NAV_PUBS/UNTM/201702/Broadcast_Warn.pdf"
## [3] " https://msi.nga.mil/MSISiteContent/StaticFiles/NAV_PUBS/UNTM/201703/Broadcast_Warn.pdf"
## [4] " https://msi.nga.mil/MSISiteContent/StaticFiles/NAV_PUBS/UNTM/201704/Broadcast_Warn.pdf"
## [5] " https://msi.nga.mil/MSISiteContent/StaticFiles/NAV_PUBS/UNTM/201705/Broadcast_Warn.pdf"
We will eventually iterate through urls to extract and parse the desired information.
Let’s prepare a handy regex of month abbreviations, matching the format used for dates in the PDF documents.
months_regex <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
"JUL", "AUG", "SEP", "OCT", "NOV", "DEC") %>%
str_c(collapse = "|")
paste(cat("The resulting regex looks like so:\n"), months_regex)
## The resulting regex looks like so:
## [1] " JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC"
We’ll also prepare a regex variable containing terms of interest to extract from the messages to facilitate filtering for research.
relevant_terms_regex <- c("ROCKET", "MISSILE", "HAZARDOUS",
"GUNNERY", "LAUNCHING") %>%
str_c(collapse = "|")
paste(cat("The resulting regex looks like so:\n"), relevant_terms_regex)
## The resulting regex looks like so:
## [1] " ROCKET|MISSILE|HAZARDOUS|GUNNERY|LAUNCHING"
list_to_text() Modification
This is a handy list_to_text() function, slightly modified from the one in the exploratory package.
The difference is that the returned character vector contains only the unique() elements of each list item, which I have found useful when you have a set of key terms with which you’d like to annotate a text dataset.
list_to_text <- function(column, sep = ":"){
  loadNamespace("stringr")
  ret <- sapply(column, function(x){
    # collapse the unique elements of each list item into a single string
    ret <- stringr::str_c(unique(x), collapse = sep)
    if(identical(ret, character(0))){
      # str_c() returns character(0) for empty input; map it to NA
      NA
    } else {
      ret
    }
  })
  as.character(ret)
}
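A quick illustration with a toy list. With the stringr version attached here, str_c() returns character(0) when given an empty input, which the function maps to NA:
list_to_text(list(c("MISSILE", "HAZARDOUS", "MISSILE"), character(0)))
## [1] "MISSILE:HAZARDOUS" NA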
tibble data frame
We’ll create an empty tibble in which to store our information.
raw_df <- tibble()
Now we loop through urls and, for each URL:

- extract the PDF’s text: pdf_text()
- strip the section labels and notice numbers: str_replace_all()
- tag the start of each message and split the text into a list: str_replace_all(), str_split()
- flatten the nested list: unlist()
- keep only the elements of the list with our annotations: str_subset()
- drop the tags and squash the whitespace: str_replace_all(), str_trim()
- convert the list to a data frame: as_tibble()
- append the result to raw_df: bind_rows()

for(URL in urls){
  temp_df <- URL %>%
    pdf_text() %>%
    str_replace_all("([I]+-\\d\\.\\d\\s)|(NM\\s\\d+/\\d+)|(SECTION\\s[I]+)", "") %>%
    str_replace_all("(\\d+/\\d+\\(.*?\\)\\.)", "~_~~\\1") %>%
    str_split("~_") %>%
    unlist(recursive = FALSE) %>%
    str_subset("~~") %>%
    str_replace_all("~~", "") %>%
    str_replace_all("\\s+", " ") %>%
    str_trim() %>%
    as_tibble()
  raw_df <- bind_rows(raw_df, temp_df)
}
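One caveat: if a weekly PDF is missing or a download fails, pdf_text() will throw an error and abort the loop partway through. Here is a minimal, more defensive sketch of the same loop, assuming we would rather skip a failed week than lose the whole run; parse_week() is a helper name introduced here for illustration, not part of the original code.
# same cleaning steps as above, factored into a function
parse_week <- function(URL){
  URL %>%
    pdf_text() %>%
    str_replace_all("([I]+-\\d\\.\\d\\s)|(NM\\s\\d+/\\d+)|(SECTION\\s[I]+)", "") %>%
    str_replace_all("(\\d+/\\d+\\(.*?\\)\\.)", "~_~~\\1") %>%
    str_split("~_") %>%
    unlist(recursive = FALSE) %>%
    str_subset("~~") %>%
    str_replace_all("~~", "") %>%
    str_replace_all("\\s+", " ") %>%
    str_trim() %>%
    as_tibble()
}
raw_df <- tibble()
for(URL in urls){
  raw_df <- bind_rows(
    raw_df,
    # skip a failed week instead of aborting the whole run
    tryCatch(parse_week(URL),
             error = function(e){
               message("Skipping ", URL, ": ", conditionMessage(e))
               tibble()
             })
  )
}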
There are more steps here than strictly necessary, as the format that will ultimately be used is not yet determined. That said, reshaping the result from this point into whatever structure a future analysis requires will be simple.
total_df <- raw_df %>%
  # split each value into the message ID and the message body
  mutate(value = str_replace(value, "(\\)\\.)", "\\)\\.~")) %>%
  separate(value, c("ID", "message"), sep = "~") %>%
  group_by(ID) %>%
  # extract every coordinate pair, in both fine and coarse formats
  mutate(coords_1 = str_extract_all(message,
                                    "\\d+-\\d+\\.\\d+[NS]{1}\\s\\d+-\\d+\\.\\d+[EW]"),
         coords_2 = str_extract_all(message,
                                    "\\d+-\\d+[NS]\\s\\d+-\\d+[EW]")) %>%
  mutate(coords_1 = list_to_text(coords_1),
         coords_2 = list_to_text(coords_2)) %>%
  # spread the delimited coordinate strings into columns, then gather them
  # back into long format, one coordinate pair per row
  separate(coords_1, paste0("fine_", c(1:40)), sep = ":") %>%
  separate(coords_2, paste0("coarse_", c(1:40)), sep = ":") %>%
  gather(coord_precision, coords, contains("_")) %>%
  drop_na(coords) %>%
  separate(coords, c("lat", "long"), sep = " ") %>%
  # rewrite latitudes into degree-minute-second notation for sp::char2dms()
  mutate(lat = str_replace(lat, "-", "d"),
         lat = str_replace(lat, "-", "'"),
         lat = str_replace(lat, "\\.", "'"),
         lat = if_else(str_detect(coord_precision, "fine"),
                       str_replace_all(lat, str_sub(lat, -1L, -1L),
                                       paste0("\\\" ", str_sub(lat, -1L, -1L))),
                       str_replace_all(lat, str_sub(lat, -1L, -1L),
                                       paste0(" ", str_sub(lat, -1L, -1L)))),
         lat = as.numeric(sp::char2dms(lat))) %>%
  # and the same treatment for longitudes
  mutate(long = str_replace(long, "-", "d"),
         long = str_replace(long, "-", "'"),
         long = str_replace(long, "\\.", "'"),
         long = if_else(str_detect(coord_precision, "fine"),
                        str_replace_all(long, str_sub(long, -1L, -1L),
                                        paste0("\\\" ", str_sub(long, -1L, -1L))),
                        str_replace_all(long, str_sub(long, -1L, -1L),
                                        paste0(" ", str_sub(long, -1L, -1L)))),
         long = as.numeric(sp::char2dms(long))) %>%
  # annotate each message with any terms of interest it contains
  mutate(relevant_terms = str_extract_all(message, relevant_terms_regex),
         relevant_terms = list_to_text(relevant_terms)) %>%
  # parse the Zulu time stamp, e.g. "(210938Z NOV 2017)"
  mutate(time_date = str_extract(message, "\\(\\d{6}Z.*?\\)"),
         time_date = str_replace_all(time_date, "(\\()|(\\))", ""),
         message_mday = str_sub(time_date, 1L, 2L),
         message_month = str_extract(time_date, months_regex),
         message_year = str_extract(time_date, "\\s\\d{4}"),
         message_date = as.Date(paste0(message_year,
                                       message_month,
                                       message_mday),
                                format = "%Y%b%d"),
         message_time = str_sub(time_date, 3L, 6L),
         message_time = str_replace(message_time, "(\\d{2})", "\\1:"),
         message_date_time = as.POSIXct(paste(message_date,
                                              message_time),
                                        "%Y-%m-%d %H:%M", tz = "GMT")) %>%
  rename(zulu_time_date = time_date) %>%
  select(-message_mday, -message_month, -message_year) %>%
  # reduce the gathered column names to a simple precision flag
  mutate(coord_precision = if_else(str_detect(coord_precision, "fine"),
                                   "fine", "coarse")) %>%
  select(message_date_time, relevant_terms, long, lat,
         coord_precision, message, zulu_time_date, ID) %>%
  distinct() %>%
  arrange(desc(message_date_time))
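The separate-then-gather trick above deserves a note: because a single message can contain many coordinate pairs, we split each colon-delimited string into up to 40 columns (an assumed upper bound) and immediately gather those columns back into long format, dropping the empty cells. Here is a toy version, using 3 columns instead of 40 for brevity; toy is a scratch tibble invented for this example, and fill = "right" quietly pads the missing pieces with NA:
# two coordinate pairs packed into one colon-delimited string
toy <- tibble(ID = "1",
              coords_1 = "10-00N 100-00E:11-00N 101-00E")
toy %>%
  separate(coords_1, paste0("fine_", 1:3), sep = ":", fill = "right") %>%
  gather(coord_precision, coords, contains("_")) %>%
  drop_na(coords)
# yields two rows: fine_1 / "10-00N 100-00E" and fine_2 / "11-00N 101-00E"
With that mechanic clear, here is the assembled data frame: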
head(total_df)
## # A tibble: 6 x 8
## # Groups: ID [6]
## message_date_time relevant_terms long lat coord_precision
## <dttm> <chr> <dbl> <dbl> <chr>
## 1 2017-11-21 09:38:00 HAZARDOUS 146.05972 16.01833 fine
## 2 2017-11-21 08:47:00 <NA> -68.53528 63.71889 fine
## 3 2017-11-21 08:37:00 <NA> -68.53528 63.71889 fine
## 4 2017-11-21 08:15:00 <NA> 130.00000 -10.00000 coarse
## 5 2017-11-21 08:09:00 <NA> 12.00000 33.00000 coarse
## 6 2017-11-21 08:03:00 HAZARDOUS 144.00000 13.00000 coarse
## # ... with 3 more variables: message <chr>, zulu_time_date <chr>, ID <chr>
glimpse(total_df)
## Observations: 12,813
## Variables: 8
## $ message_date_time <dttm> 2017-11-21 09:38:00, 2017-11-21 08:47:00, 2...
## $ relevant_terms <chr> "HAZARDOUS", NA, NA, NA, NA, "HAZARDOUS", "H...
## $ long <dbl> 146.05972, -68.53528, -68.53528, 130.00000, ...
## $ lat <dbl> 16.018333, 63.718889, 63.718889, -10.000000,...
## $ coord_precision <chr> "fine", "fine", "fine", "coarse", "coarse", ...
## $ message <chr> " WESTERN NORTH PACIFIC. MARIANA ISLANDS. DN...
## $ zulu_time_date <chr> "210938Z NOV 2017", "210847Z NOV 2017", "210...
## $ ID <chr> "4055/17(81).", "393/17(15).", "1093/17(15)....
pander gives us a more readable look at the first ten rows:
pander::panderOptions('table.split.table', Inf)
pander::pander(head(total_df, n = 10))
| message_date_time | relevant_terms | long | lat | coord_precision | message | zulu_time_date | ID |
|---|---|---|---|---|---|---|---|
| 2017-11-21 09:38:00 | HAZARDOUS | 146.1 | 16.02 | fine | WESTERN NORTH PACIFIC. MARIANA ISLANDS. DNC 12. 1. HAZARDOUS OPERATIONS 212230Z TO 220230Z NOV IN AREA WITHIN 12 MILES OF 16-01.06N 146-03.35E. 2. CANCEL THIS MSG 220330Z NOV 17. (210938Z NOV 2017) III-1.20 | 210938Z NOV 2017 | 4055/17(81). |
| 2017-11-21 08:47:00 | NA | -68.54 | 63.72 | fine | DAVIS STRAIT. CANADA-NORTHEAST COAST. DNC 28. MCTS IQALUIT CENTRE 63-43.8N 068-32.7W NBDP AND DSC SERVICES INOPERATIVE. (210847Z NOV 2017) III-1.22 | 210847Z NOV 2017 | 393/17(15). |
| 2017-11-21 08:37:00 | NA | -68.54 | 63.72 | fine | DAVIS STRAIT. CANADA-NORTHEAST COAST. MCTS IQALUIT CENTRE 63-43.8N 068-32.7W NBDP AND DSC SERVICES INOPERATIVE. (210837Z NOV 2017) | 210837Z NOV 2017 | 1093/17(15). |
| 2017-11-21 08:15:00 | NA | 130 | -10 | coarse | TIMOR SEA. AUSTRALIA-NORTH COAST. DNC 04. 1. FISHING NET ADRIFT IN 10-40S 130-40E AT 210444Z NOV. 2. CANCEL HYDROPAC 4039/17. 3. CANCEL THIS MSG 240815Z NOV 17. (210815Z NOV 2017) | 210815Z NOV 2017 | 4054/17(74). |
| 2017-11-21 08:09:00 | NA | 12 | 33 | coarse | EASTERN MEDITERRANEAN SEA. DNC 08, DNC 09. VESSEL, NUMEROUS PERSONS ON BOARD, IN NEED OF ASSISTANCE IN 33-42N 012-40E AT 210400Z NOV. VESSELS IN VICINITY REQUESTED TO KEEP A SHARP LOOKOUT, ASSIST IF POSSIBLE. REPORTS TO MRCC ROME, INMARSAT-C: 424744220, PHONE: 3906 5908 4527, 3906 5908 4409, FAX: 390 6592 2737, 3906 5908 4793, E-MAIL: ITMRCC@MIT.GOV.IT. (210809Z NOV 2017) | 210809Z NOV 2017 | 3747/17(56). |
| 2017-11-21 08:03:00 | HAZARDOUS | 144 | 13 | coarse | WESTERN NORTH PACIFIC. GUAM. DNC 12. 1. HAZARDOUS OPERATIONS 212230Z TO 220230Z NOV IN AREA BOUND BY 13-10N 144-30E, 13-10N 144-42E, 12-50N 144-45E, 11-00N 144-45E, 11-00N 143-00E, 11-45N 143-00E, 12-50N 144-30E. 2. CANCEL THIS MSG 220330Z NOV 17. (210803Z NOV 2017) | 210803Z NOV 2017 | 4053/17(81). |
| 2017-11-21 08:03:00 | HAZARDOUS | 144 | 12 | coarse | WESTERN NORTH PACIFIC. GUAM. DNC 12. 1. HAZARDOUS OPERATIONS 212230Z TO 220230Z NOV IN AREA BOUND BY 13-10N 144-30E, 13-10N 144-42E, 12-50N 144-45E, 11-00N 144-45E, 11-00N 143-00E, 11-45N 143-00E, 12-50N 144-30E. 2. CANCEL THIS MSG 220330Z NOV 17. (210803Z NOV 2017) | 210803Z NOV 2017 | 4053/17(81). |
| 2017-11-21 08:03:00 | HAZARDOUS | 144 | 11 | coarse | WESTERN NORTH PACIFIC. GUAM. DNC 12. 1. HAZARDOUS OPERATIONS 212230Z TO 220230Z NOV IN AREA BOUND BY 13-10N 144-30E, 13-10N 144-42E, 12-50N 144-45E, 11-00N 144-45E, 11-00N 143-00E, 11-45N 143-00E, 12-50N 144-30E. 2. CANCEL THIS MSG 220330Z NOV 17. (210803Z NOV 2017) | 210803Z NOV 2017 | 4053/17(81). |
| 2017-11-21 08:03:00 | HAZARDOUS | 143 | 11 | coarse | WESTERN NORTH PACIFIC. GUAM. DNC 12. 1. HAZARDOUS OPERATIONS 212230Z TO 220230Z NOV IN AREA BOUND BY 13-10N 144-30E, 13-10N 144-42E, 12-50N 144-45E, 11-00N 144-45E, 11-00N 143-00E, 11-45N 143-00E, 12-50N 144-30E. 2. CANCEL THIS MSG 220330Z NOV 17. (210803Z NOV 2017) | 210803Z NOV 2017 | 4053/17(81). |
| 2017-11-21 01:36:00 | NA | 173 | 1 | coarse | NORTH PACIFIC. KIRIBATI. DNC 06, DNC 12. 15 FOOT VESSEL, YELLOW HULL, THREE PERSONS ON BOARD, UNREPORTED TARAWA ATOLL (01-20N 173-01E) TO MAIANA ATOLL (01-02N 173-01E). VESSELS IN VICINITY REQUESTED TO KEEP A SHARP LOOKOUT, ASSIST IF POSSIBLE. REPORTS TO ANY COASTAL RADIO STATION. (210136Z NOV 2017) III-1.19 | 210136Z NOV 2017 | 4052/17(81,83). |
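To make the coordinate surgery concrete, here is the sequence of replacements applied by hand to the fine-precision latitude 16-01.06N from the first row above (lat here is a scratch variable, not the pipeline column):
lat <- "16-01.06N"
lat <- str_replace(lat, "-", "d")    # "16d01.06N"
lat <- str_replace(lat, "-", "'")    # no second hyphen here, so no change
lat <- str_replace(lat, "\\.", "'")  # "16d01'06N"
lat <- str_replace(lat, "N", "\" N") # 16d 01' 06" N, as sp::char2dms() expects
as.numeric(sp::char2dms(lat))
## [1] 16.01833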
With the data extracted and cleaned, let’s plot the coordinates on a world map.
world_map <- borders("world",
                     colour = "gray50",
                     fill = "gray50")

total_df %>%
  select(long, lat, relevant_terms) %>%
  ggplot() +
  world_map +
  geom_point(aes(long, lat,
                 color = factor(relevant_terms,
                                levels = c("GUNNERY:MISSILE:HAZARDOUS",
                                           "MISSILE",
                                           "GUNNERY",
                                           "HAZARDOUS"))),
             size = 3, alpha = 0.35) +
  scale_colour_discrete(name = "Terms Matched",
                        labels = c("All Terms",
                                   "Missile",
                                   "Gunnery",
                                   "Hazardous")) +
  theme(legend.position = "bottom",
        legend.title = element_text(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        panel.grid = element_blank()) +
  labs(title = "2017 Broadcast Warnings",
       subtitle = "Brendan Knapp",
       caption = "Source: National Geospatial-Intelligence Agency")
As the format to be used is not yet determined, I save a few different versions that are easy to distribute, including an .rda file that is simple to load for future use or to include in a package.
readr::write_excel_csv(total_df, "data_raw/total_df_excel.csv")
readr::write_csv(total_df, "data_raw/total_df.csv")
save(total_df, file = "data/total_df.rda")
The time taken to execute all code and knit the document follows. Not too shabby.
Sys.time() - time_start
## Time difference of 1.302391 mins
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] maps_3.2.0 bindrcpp_0.2 ggplot2_2.2.1 pdftools_1.4
## [5] rvest_0.3.2.9000 xml2_1.1.1 stringr_1.2.0 tidyr_0.7.2
## [9] dplyr_0.7.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.13 compiler_3.4.2 plyr_1.8.4 bindr_0.1
## [5] tools_3.4.2 digest_0.6.12 evaluate_0.10.1 tibble_1.3.4
## [9] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.1.4
## [13] curl_3.0 yaml_2.1.14 httr_1.3.1 knitr_1.17
## [17] hms_0.3 rprojroot_1.2 grid_3.4.2 tidyselect_0.2.2
## [21] glue_1.1.1 R6_2.2.2 XML_3.98-1.9 rmarkdown_1.6
## [25] sp_1.2-5 readr_1.1.1 pander_0.6.1 purrr_0.2.4
## [29] selectr_0.3-1 magrittr_1.5 backports_1.1.1 scales_0.5.0
## [33] htmltools_0.3.6 assertthat_0.2.0 colorspace_1.3-2 labeling_0.3
## [37] stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3