Introduction

In one of my Facebook posts, my friend Wilson Chua suggested scraping data from the Freedom of Information (FOI) website to determine which agencies perform best at granting access to information held in Philippine government databases. Specifically, Wilson wanted to know how the Department of Social Welfare and Development (DSWD) performed under the leadership of Sec. Judy Taguiwalo, whose confirmation was disapproved by the Committee on Appointments.

Scraping the data would be relatively easy if it were presented in tabular form. However, the FOI data are presented as separate summaries of requests and actions. Thankfully, there are patterns in the summaries that we can exploit to come up with tidy data. We can do this in R with the help of the rvest package.

Without further ado, let’s start scraping. We load the packages required for this exercise.

# Load the required packages
# for tidying data
library(tidyverse)
# for scraping text from web pages
library(rvest)
# for formatting dates
library(lubridate)

Reading all pages in one go

As of this writing (2017-08-14 11:04 PHT), there are 1597 requests on the FOI website. A page can only display up to 50 requests. I have manually copied the URL of each of the 32 pages into the file foi_request_list.txt.

urls <- read_lines("foi_request_list.txt")
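
As an aside, if the site's pagination follows a predictable query string (an assumption on my part; the pattern below is a placeholder, not the site's actual URL scheme), the list could also be built programmatically instead of being copied by hand:

# hypothetical: assumes the request list is paginated as ?page=1, ?page=2, ...
base_url <- "https://www.foi.gov.ph/requests"  # placeholder URL
urls <- paste0(base_url, "?page=", 1:32)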

We shall now use the URLs in urls to create a list of the text found in the div elements of each page.

# read each page, extract its div nodes, and pull out their text
foi_div_req <- urls %>% lapply(read_html) %>% 
  lapply(html_nodes, "div") %>% 
  lapply(html_text)
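
Note that read_html fetches each page over the network. If you re-run this, it is polite to pause briefly between downloads; a minimal sketch using purrr's map (purrr is loaded with the tidyverse):

# fetch the pages with a one-second pause between requests
foi_pages <- map(urls, function(u) {
  Sys.sleep(1)
  read_html(u)
})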

At this point, the data will not make much sense yet. We will do some cleaning with the help of functions from the stringr package, which is part of the tidyverse.

# remove unnecessary characters
foi_div_req2 <- lapply(foi_div_req, function(x) str_replace_all(x, pattern = "\n", replacement = " ") %>%
        str_replace_all(pattern = "[\\^]", replacement = " ") %>%
        str_replace_all(pattern = "\"", replacement = " ") %>%
        str_replace_all(pattern = "\\s+", replacement = " ") %>%
        str_trim(side = "both"))

By inspection, we can see that the data we are interested in start at the 17th element of each page's text and end 18 elements before the last (the 64th element on a full page).

# drop the first 16 and the last 18 elements of each page's text
foi_div_req2 <- lapply(foi_div_req2, function(x) x[17:(length(x) - 18)])

By inspection, we can also see that we can tidy up the data by exploiting the regular patterns of text in each request summary.

# split the texts in the lists into field entries,
# form one data frame from the list,
# and make sure that there are no duplicates in the data set
foi_div_req3 <- lapply(foi_div_req2, function(x) str_split(x, pattern = "Requested from |\\. Purpose: | Date of Coverage: | Tracking no: #", simplify = TRUE) %>% as.data.frame()) %>% 
  bind_rows() %>% 
  unique()

Notice that although we could have extracted the requestee, requester, and request time above, we did not, since the words by, at, and on appear elsewhere in the summaries. Don't worry; we will deal with this immediately with the help of the tidyr::separate verb.

# split V2 into requestee, requester, request time, and request date
foi_div_req4 <- foi_div_req3 %>% 
  separate(V2, into = c("requestee", "requester", "request_time", "request_date"), sep = " by | at | on")
head(foi_div_req4)
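
For intuition, here is a toy example (the summary text below is invented, not taken from the site) showing how the separator pattern carves up a V2-style string:

# hypothetical V2 value, for illustration only
toy <- tibble(V2 = "Department of Health by Juan dela Cruz at 10:15 AM on Aug 01, 2017")
toy %>% 
  separate(V2, into = c("requestee", "requester", "request_time", "request_date"), sep = " by | at | on")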

We can also separate the date of coverage into its start and end dates, and split the last column into the tracking number (tracking_no) and the action taken (action).

# separate the date of coverage into two dates
foi_div_req5 <- foi_div_req4 %>%
  separate(V4, c("start_date_req", "end_date_req"), sep = "-") %>%
  separate(V5, c("tracking_no", "action"), sep = " ")

Next, we can fix the formatting of the dates. We can also compute the number of days covered by each request, which might come in handy in future analyses. And finally, we rename the remaining columns so that they make more sense (the request and the purpose).

# format the dates and rename the fields appropriately
# compute the number of days covered by the request
foi_div_req6 <- foi_div_req5 %>% 
  mutate(
    request_time = format(strptime(request_time, "%I:%M %p"), format="%H:%M:%S"), 
    request_date = parse_date_time(request_date, "BdY"),
    start_date_req = parse_date_time(start_date_req, "mdY"), 
    end_date_req = parse_date_time(end_date_req, "mdY")) %>%
  mutate(
    request_date_time = as_datetime(paste(request_date, request_time)),
    request_date_range = difftime(end_date_req, start_date_req, units = "days")) %>%
  rename(request = V1, purpose = V3)
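
A quick look at the column types helps confirm that the dates and the day count were parsed as intended (output omitted):

# inspect the structure of the tidied data frame
glimpse(foi_div_req6)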

I have saved the data in a CSV file, which you can download here.
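
For completeness, the save step itself is a readr one-liner (the file name matches the one read back in below):

# write the tidy data to disk
write_csv(foi_div_req6, "foi_requests.csv")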

Some analysis

Now that we have a clean data set at our disposal, we can perform some analyses. For instance, we are interested in how accessible information really is under the FOI program.

Distribution of actions taken on requests

# if you skipped the CSV step, reuse the data frame from above: foi <- foi_div_req6
foi <- read_csv("foi_requests.csv")
foi %>% 
  group_by(action) %>% 
  summarise(total = n()) %>% 
  mutate(percent = round(total / sum(total) * 100, 2)) %>% 
  arrange(desc(total)) 
## # A tibble: 9 x 3
##       action total percent
##        <chr> <int>   <dbl>
## 1     DENIED   550   34.44
## 2 SUCCESSFUL   469   29.37
## 3    PENDING   206   12.90
## 4   ACCEPTED   201   12.59
## 5   EXTENDED    64    4.01
## 6  PARTIALLY    58    3.63
## 7     CLOSED    26    1.63
## 8 PROCESSING    16    1.00
## 9   AWAITING     7    0.44
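
The same distribution can be visualized as a bar chart with ggplot2, which is loaded with the tidyverse:

# bar chart of actions taken on FOI requests
foi %>% 
  count(action) %>% 
  ggplot(aes(x = reorder(action, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Action taken", y = "Number of requests")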

From the output, we can see that denied requests lead all actions taken on requests through the FOI web page. Successful requests, i.e., those that were granted, place second. Take note that there are also partially successful requests; but even if we combine the successful and partially successful requests (469 + 58 = 527), they still do not exceed the number of denied requests (550). Note also that about 1 in 8 requests is pending.
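
A quick sanity check of that comparison, using the action labels from the table above:

# compare granted (fully or partially) requests against denied ones
foi %>% 
  summarise(
    granted = sum(action %in% c("SUCCESSFUL", "PARTIALLY")),
    denied  = sum(action == "DENIED"))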

Using DT::datatable for interactive table generation

We can create an interactive table using the datatable() function from the DT package.

# tally the actions per agency and compute each action's percentage share
foi_requestee_action <- foi %>% 
  group_by(requestee, action) %>% 
  summarise(total = n()) %>%
  mutate(percent = round(total / sum(total) * 100, 2)) %>%
  arrange(requestee, percent)
foi_requestee_action %>% DT::datatable()

We can use the search field in the interactive table to filter the information. For instance, you can type SUCCESSFUL into the text box and click the down-triangle button beside the percent column header to see which agencies have a 100% successful request rate.
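
The same filter can be expressed directly in dplyr, without the interactive table:

# agencies whose requests were 100% successful
foi_requestee_action %>% 
  filter(action == "SUCCESSFUL", percent == 100)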

By the way, as of this writing, the DSWD cannot be found among the agencies in the summaries provided on the FOI website.
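
One way to check this in the scraped data, assuming the agency appears under its full name or its acronym in the requestee field:

# look for the DSWD among the requestees; this returns zero rows as of this writing
foi %>% 
  filter(str_detect(requestee, "Social Welfare|DSWD"))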