Exposition

On 1 February 2021, a coup d’état saw the arrest of the elected leader of Myanmar, Aung San Suu Kyi. The national military, which had reserved for itself 25% of the seats in parliament under the nominally democratic regime, took over many key industries and government departments over which it previously had no jurisdiction. The Tatmadaw, as the military is known, wields its signature cruelty and stupidity as primary weapons. Since the coup, 2,100 people have been arrested and countless terrorist acts committed.

A couple of weeks ago, some happy hacker released a 354 GB (compressed) archive of information on political figures, all registered companies, legal tenders, and a few extras. It was called “Myanmar Financials”.

What’s in the leak?

In this project, we will have a look at only a small portion of the data:

  1. mvoter_app.json This seems to be a nested JSON file with information on every person who registered as a political candidate.

  2. bo-disclosure.csv This file contains information on companies in the mining and gems and basic construction materials sectors. Because the gem industry produces lots of money for the Shan army via jade and Mogok rubies, this will probably contain lots of useful information. The Burmese government puts a lot of effort into controlling this sector.

  3. myco_details This is a directory with over 125,000 txt files relating to another archive of PDF files. The PDFs are all named with a hash, which begins the filename of a corresponding txt file in myco_details. I believe these together provide information on all the companies (and the people who registered them) in Myanmar.

library(tibble)
library(purrr)
library(jsonlite)

file1 <- "https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/mvoter_app.json"
file2 <- "https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/bo-disclosure.csv"

Dataset #1 mvoter_app.json

Read in the JSON data and check it out. At the top of the tree there is $data and a token we can throw out. Inside $data there are 5839 observations, each with lots of nested attributes. At the second level, each observation contains “id”, “type”, and “attributes”. We are only interested in “attributes” because it contains its own ID elements and every “type” is set to “candidate”. “type” is also present in “attributes”.

mvoter <- fromJSON(file1, simplifyDataFrame = TRUE)
mvoter <- mvoter$data

I wondered, at first, which JSON library to use. We have rjson, RJSONIO, and jsonlite. After some research and tests, I found that RJSONIO rewrote a couple of functions from rjson, and that jsonlite started as a fork of RJSONIO, making it the newest. RJSONIO has the most customizable interface, but setting the simplify options to FALSE in jsonlite gave the fastest import by nearly a factor of 2. In the end, jsonlite::fromJSON with simplifyDataFrame is best for this application.
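To make the trade-off between the simplify options concrete, here is a minimal sketch on a made-up JSON string that only mimics the nesting described above. The string `toy` and its values are invented for illustration, and timings on something this small say nothing about the real file.

```r
library(jsonlite)

# Made-up JSON that mimics the nesting described above (not real leak data).
toy <- '{"data": [
  {"id": "1", "type": "candidate", "attributes": {"name": "A", "age": 40}},
  {"id": "2", "type": "candidate", "attributes": {"name": "B", "age": 55}}
], "token": "abc"}'

# Simplified import: the records become a data frame, and the nested
# "attributes" objects become a data frame column inside it.
as_df <- fromJSON(toy, simplifyDataFrame = TRUE)$data

# No simplification: every record stays a plain nested list.
as_list <- fromJSON(toy, simplifyVector = FALSE)$data

class(as_df)             # "data.frame"
class(as_df$attributes)  # "data.frame"
class(as_list)           # "list"

# Rough timing comparison; only meaningful when run against a large file.
system.time(for (i in 1:100) fromJSON(toy, simplifyDataFrame = TRUE))
system.time(for (i in 1:100) fromJSON(toy, simplifyVector = FALSE))
```

The unsimplified list skips the work of building data frames, which is one plausible reason it imports faster, but it leaves all the reshaping to later code.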

nrow(mvoter)
## [1] 5839
length(mvoter$attributes)
## [1] 21

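We can check whether the pieces of mvoter share a uniform structure. Note that map() walks the columns of the data frame (id, type, and attributes) rather than the individual observations, so this compares the names found inside each column: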

mvoter %>%
  map(names) %>%
  unique() %>%
  length() == 1
## [1] FALSE
mvoter <- mvoter$attributes
names(mvoter)
##  [1] "age"                    "ballot_order"           "birthday"              
##  [4] "constituency"           "education"              "ethnicity"             
##  [7] "father"                 "gender"                 "image"                 
## [10] "individual_logo"        "is_elected"             "is_ethnic_candidate"   
## [13] "is_individual"          "mother"                 "name"                  
## [16] "party"                  "religion"               "representing_ethnicity"
## [19] "residential_address"    "sorting_name"           "work"

It seems that not all of these columns have the same shape… The nested lists have different numbers of values. This is okay.
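If the goal is to compare individual records rather than columns, one option is to import without simplification and compare the key set of each record. This is a sketch on invented records, not the author's check; `toy`, `records`, and `key_sets` are hypothetical names.

```r
library(jsonlite)
library(purrr)

# Made-up records; the second one lacks the "age" key.
toy <- '[
  {"id": "1", "attributes": {"name": "A", "age": 40}},
  {"id": "2", "attributes": {"name": "B"}}
]'

# Keep each record as a nested list so missing keys stay visible.
records <- fromJSON(toy, simplifyVector = FALSE)

# Collect the sorted key set of each record's "attributes" object.
key_sets <- map(records, ~ sort(names(.x$attributes)))

# TRUE only when every record exposes exactly the same keys.
uniform <- length(unique(key_sets)) == 1
uniform

# With simplification on, jsonlite pads the missing key with NA instead.
filled <- fromJSON(toy)$attributes
filled
```

Because the simplified import pads missing keys, a data frame can look uniform even when the underlying records are not.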

Dataset #2 bo-disclosure.csv

Read in the csv and drop columns with mostly null values.

I know from checking the file in Excel and a text editor that this is extremely messy data. There are hundreds of rows and hundreds of columns, lots of null values, and sometimes uneven numbers of columns for the observations. Let’s try an obscure package:

#PolyPatEx::fixCSV(file2)
bodisc <- read.csv("https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/bo-disclosureFIXED.csv")

After examining the differences between bo-disclosure.csv and bo-disclosureFIXED.csv, it is apparent that some cells are being parsed very badly. fixCSV just gave it a haircut so the shape was uniform.

At first, I thought that row 26 was the problem, with unmatched quotes damaging the overall structure. Omitting that row with two reads and an rbind() call fixed nothing. We have so many issues that it’s hard to know what they all are… So, I tried taking just the first hundred columns…

These attempts also got me nowhere, so I opened the csv in LibreOffice Calc and deleted about 400 columns, relatively arbitrarily. For the sake of this assignment, this is an adequate solution. In the future, I will have to find a better fix.
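One possible starting point for a better fix is to find out which lines are ragged before editing anything. The sketch below uses count.fields() from base R on a made-up CSV; the file contents and the choice to treat the header width as the expected width are assumptions for illustration.

```r
# Made-up CSV text with one short row and one long row (not the real file).
toy_csv <- c(
  "id,title,state",
  "1,ALPHA LTD,Yangon",
  "2,BETA LTD",
  "3,GAMMA LTD,Mandalay,extra"
)
path <- tempfile(fileext = ".csv")
writeLines(toy_csv, path)

# count.fields() reports the number of fields found on each line.
# comment.char = "" stops a "#" inside an address from truncating a line.
n_fields <- count.fields(path, sep = ",", quote = "\"", comment.char = "")

# Treat the header's field count as the expected width.
expected <- n_fields[1]
ragged_lines <- which(n_fields != expected)
ragged_lines  # line numbers in the file, counting the header as line 1
```

On a real file, NA entries in the result can indicate lines that sit inside a quoted field spanning several lines, which is worth checking when unmatched quotes are suspected.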

bodisc <- read.csv("https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/bo-disclosureMARRED.csv", header = TRUE, fill = TRUE)
bodisc[1:10, 1:10]
##     id                                                                    title
## 1  393                                                JUNE CEMENT INDUSTRY LTD.
## 2  392                           MOATTAMA GAS TRANSPORTATION CO., LTD. (BRANCH)
## 3  390                                SHWE HUA GEMS & JEWELLERY COMPANY LIMITED
## 4  389                                   SILVER SAND STONE GEMS COMPANY LIMITED
## 5  388                                      MYANMAR CNMC NICKEL COMPANY LIMITED
## 6  387                   LON HAI PRODUCTION TRADING OF GEMSTONE COMPANY LIMITED
## 7  386                                AUNG SHWE KABAR GEMS & JEWELLERY CO., LTD
## 8  384 DEVELOPERS ENTREPRENEURS LIAISON CONSTRUCTION ORGANIZERS (DELCO) LIMITED
## 9  382                                    AUNG THUKHA DANA GEMS COMPANY LIMITED
## 10 381                                    SHWE PAUK PAUK MINING COMPANY LIMITED
##    jurisdiction_registration unique_id
## 1                         MM 117418820
## 2                         MM 142356155
## 3                         MM 111532001
## 4                         MM 100900424
## 5                         MM 108341157
## 6                         MM 134823119
## 7                         MM 114585149
## 8                         MM 125445020
## 9                         MM 104728448
## 10                        MM 118049551
##                                                                                                      contact_address
## 1                                                        No.80, Sayarsan Lane,Bahan Township, Yangon Region, Myanmar
## 2                      No. 5 Sacred Tooth Relic Lake Avenue\n Punn Pin Gone Quarter No.5, Mayangone Township, YANGON
## 3                                                                 THARYAR SHWEPYI ROAD, NO. (D-2), TAMWELAY QUARTER,
## 4                                                No,62-63 , Ngu Shwe Wah Street , No (8/43), Chanmyatharzi Township.
## 5                            No.3/F, U Kyaw Hla Street, Seven Ward, Seven Miles\nMayangone Township, Yangon, Myanmar
## 6                             Mahar Aung Myay Township, between 61 & 62 streets, Theik Pan Street, No. 4/5, Mandalay
## 7                                          No.(63), 4th Floor, Baho Road, Sanchaung Township, Yangon Region, Myanmar
## 8                                                           No.150/B, University Avenue Road, Bahan Township, Yangon
## 9  No.(MA-9/33, Corner Of 63th Street & Padauk Street\nMyothit(1) Word, Chan Mya Thar Si Township, MANDALAY, Myanmar
## 10                  No.58, Sa Sa -2, Zabu Thiri StreetBetween 58 x 59 Street,Kan Thar Yar Quarter, Mandalay, Myanmar
##                                              contact_address_2    state
## 1                                                         null   Yangon
## 2                                                         null   Yangon
## 3                                                         null   Yangon
## 4                                                         null Mandalay
## 5  Maunggone Village, Tagaung Taung Nickel Project, Hteegyaint  Sagaing
## 6   Shwe Pone Nyet StreetL/6Kamaryut Township, Yangon, Myanmar Mandalay
## 7                                                         null   Yangon
## 8                                                         null   Yangon
## 9                                                         null Mandalay
## 10                                                        null Mandalay
##            city country  zip_code
## 1         Bahan      MM      null
## 2      Mayangon      MM      null
## 3         Tamwe      MM      null
## 4  Chanmyathazi      MM      null
## 5         Katha      MM      null
## 6   Mahaaungmye      MM 0503-1733
## 7     Sanchaung      MM      null
## 8         Bahan      MM      null
## 9  Chanmyathazi      MM      null
## 10 Chanmyathazi      MM      null
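The section opens by promising to drop mostly-null columns, and the preview above shows missing values stored as the literal string "null". A minimal sketch of that step on a made-up frame follows; the 0.5 cutoff and the column values are assumptions, not the author's settings.

```r
# Made-up frame using the same literal "null" strings seen in the preview.
toy <- data.frame(
  id       = c(1, 2, 3, 4),
  state    = c("Yangon", "Mandalay", "null", "Yangon"),
  zip_code = c("null", "null", "null", "0503-1733"),
  stringsAsFactors = FALSE
)

# Share of missing entries per column, counting "null", "" and NA as missing.
null_share <- vapply(
  toy,
  function(col) mean(is.na(col) | col %in% c("null", "")),
  numeric(1)
)

# Hypothetical cutoff: drop columns that are more than half missing.
threshold <- 0.5
trimmed <- toy[, null_share <= threshold, drop = FALSE]
names(trimmed)
```

An alternative is to pass na.strings = c("null", "") to read.csv() so these values arrive as NA in the first place.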

We are left with some disgustingly messy data, but it is finally loaded. At least, the file contains lots of data. Since we are just trying to correlate political names and company owners, there is probably something useful in this if we can employ regular expressions.
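As a rough idea of what the regular-expression step could look like, the sketch below normalises names before comparing them. Every name in it is an invented placeholder, and the list of honorifics stripped by `normalise_name` (a hypothetical helper) is an assumption that would need checking against the real data.

```r
library(stringr)

# Invented placeholder names, not taken from the leak.
candidates <- c("U Aung Example", "Daw Khin Sample")
owners     <- c("AUNG  EXAMPLE", "Thida Placeholder", "daw khin sample")

# Normalise: upper-case, collapse whitespace, strip a leading honorific.
normalise_name <- function(x) {
  x <- str_squish(str_to_upper(x))
  str_remove(x, "^(U|DAW|KO|MA)\\s+")
}

cand_norm  <- normalise_name(candidates)
owner_norm <- normalise_name(owners)

# For each candidate, list the positions of owners with the same normalised name.
matches <- lapply(cand_norm, function(nm) which(owner_norm == nm))
names(matches) <- candidates
matches
```

Exact matching after normalisation is deliberately strict; with transliterated names, a fuzzier comparison may turn out to be necessary.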

Dataset #3 myco_details

Here we have a directory of a large number of json files. I do not expect them to have a uniform shape. In fact, they have many similarities, but most are sparse or empty. The best thing to do would be to use these to indicate original documents (pdf documents named by the hash) relating to names in the other two datasets.

#files <- list.files(path = "~/Desktop/myco_details/")
#write.csv(files, "my_co_list.csv")
files <- read.csv("https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/my_co_list.csv")
length(files)  # counts the columns of the data frame (row index and x); nrow(files) would count the files
## [1] 2

Each filename consists of a hash, an underscore, and a name in English with hyphens as spaces. We can make a dataframe with one column for the hash, and another column for the name.

files <- stringr::str_split_fixed(files$x, "_", 2)
colnames(files) <- c("hash", "name")
df <- as.data.frame(files)
# Escape the dot and anchor at the end so only a trailing ".json" is removed
df$name <- stringr::str_remove(df$name, "\\.json$")
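The same parsing can be tried on a couple of invented filenames to see how the pieces behave. This is only a sketch: `toy_files` and `df_toy` are hypothetical, and turning hyphens back into spaces is an extra step suggested by the filename description rather than something the code above does.

```r
library(stringr)

# Invented filenames following the hash_name pattern described above.
toy_files <- c(
  "0a1b2c_example-gems-company-limited.json",
  "9f8e7d_sample_jsonware-trading.json"
)

# Split on the first underscore only, so underscores in the name survive.
parts <- str_split_fixed(toy_files, "_", 2)
df_toy <- data.frame(hash = parts[, 1], name = parts[, 2],
                     stringsAsFactors = FALSE)

# Escaped, anchored pattern: removes only the trailing ".json" extension.
df_toy$name <- str_remove(df_toy$name, "\\.json$")

# Hyphens stand in for spaces in the filenames.
df_toy$name <- str_replace_all(df_toy$name, "-", " ")
df_toy
```

The second filename shows why the escape matters: an unescaped "." matches any character, so a pattern like ".json" would also eat the "_json" in the middle of the name.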