On 1 February 2021, a coup d’état saw the arrest of the elected leader of Myanmar, Aung San Suu Kyi. The national military, which had reserved for itself 25% of parliamentary seats under the nominally democratic regime, took over many key industries and government departments over which it previously had no jurisdiction. The Tatmadaw, as the military is known, wields signature cruelty and stupidity as its primary weapons. Since the coup, some 2,100 people have been arrested, and countless terrorist acts committed.
A couple of weeks ago, some happy hacker released a 354 GB compressed archive of information pertaining to political figures, all registered companies, legal tenders, and a bit of extra information. It was called “Myanmar Financials”.
In this project, we will have a look at only a small portion of the data:
mvoter_app.json: a nested JSON file with information on every person who registered as a political candidate.
bo-disclosure.csv: a CSV with information on companies in the mining, gems, and basic construction materials sectors. Because the gem industry produces a lot of money for the Shan army via jade and Mogok rubies, this will probably contain plenty of useful information. The Burmese government puts a lot of effort into controlling this sector.
myco_details: a directory with over 125,000 JSON files relating to another archive of PDF files. The PDFs are all named with a hash, which begins the filename of the corresponding file in myco_details. I believe these together provide information on all the companies (and the people who registered them) in Myanmar.
library(tibble)
library(purrr)
library(jsonlite)
file1 <- "https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/mvoter_app.json"
file2 <- "https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/bo-disclosure.csv"
Read in the JSON data and check it out. At the top of the tree there is $data and a token we can throw out. Inside $data there are 5,839 observations, each with many nested attributes. At the second level, each observation contains “id”, “type” and “attributes”. We are only interested in “attributes”: it carries its own ID elements, every “type” is set to “candidate”, and “type” is present again inside “attributes” anyway.
mvoter <- fromJSON(file1, simplifyDataFrame = TRUE)
mvoter <- mvoter$data
I wondered, at first, which JSON library to use. We have rjson, RJSONIO, and jsonlite. After some research and testing, I found that RJSONIO rewrote a couple of functions from rjson, and jsonlite in turn forked from RJSONIO, making it the newest of the three. RJSONIO has the most customizable interface, but jsonlite with its simplify options set to FALSE provides the fastest import, by nearly a factor of two. In the end, jsonlite::fromJSON with simplifyDataFrame is the best fit for this application.
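To make that comparison concrete, a rough benchmark along these lines can be used (a sketch; it assumes the rjson, RJSONIO, and microbenchmark packages are installed, and the factor-of-two figure will vary by machine):
library(microbenchmark)
# parse the same JSON text with all three libraries, no simplification in jsonlite
json_text <- paste(readLines(file1, warn = FALSE), collapse = "")
microbenchmark(
  rjson    = rjson::fromJSON(json_text),
  RJSONIO  = RJSONIO::fromJSON(json_text),
  jsonlite = jsonlite::fromJSON(json_text, simplifyVector = FALSE),
  times = 5
)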
nrow(mvoter)
## [1] 5839
length(mvoter$attributes)
## [1] 21
We can check if all the observations have a uniform structure:
mvoter %>%
map(names) %>%
unique() %>%
length() == 1
## [1] FALSE
mvoter <- mvoter$attributes
names(mvoter)
## [1] "age" "ballot_order" "birthday"
## [4] "constituency" "education" "ethnicity"
## [7] "father" "gender" "image"
## [10] "individual_logo" "is_elected" "is_ethnic_candidate"
## [13] "is_individual" "mother" "name"
## [16] "party" "religion" "representing_ethnicity"
## [19] "residential_address" "sorting_name" "work"
It seems that not all of these columns have the same shape: the nested lists hold different numbers of values. This is okay.
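A quick diagnostic (using only the packages already loaded) shows which columns came back as nested data frames or lists rather than plain vectors:
# which columns are atomic vectors, and which are nested structures?
purrr::map_chr(mvoter, ~ class(.x)[1])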
Read in the CSV and drop columns with mostly null values.
I know from checking the file in Excel and a text editor that this is extremely messy data: there are hundreds of rows and hundreds of columns, lots of null values, and sometimes uneven numbers of columns across observations. Let’s try an obscure package:
#PolyPatEx::fixCSV(file2)
bodisc <- read.csv("https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/bo-disclosureFIXED.csv")
After examining the differences between bo-disclosure.csv and bo-disclosureFIXED.csv, it is apparent that some cells are being parsed very badly; fixCSV just gave the file a haircut so that its shape was uniform.
At first, I thought that row 26 was the problem, with unmatched quotes damaging the overall structure. Omitting that row with two reads and an rbind() call fixed nothing. There are so many issues that it’s hard to know what they all are, so I tried taking just the first hundred columns.
These attempts also got me nowhere, so I opened the CSV in LibreOffice Calc and deleted about 400 columns, relatively arbitrarily. For the sake of this assignment, this is an adequate solution. In the future, I will have to find a better fix; one possible programmatic direction is sketched below.
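One such direction (a sketch, not run here): diagnose how many fields each line actually parses to, then use a reader that reports ragged rows instead of hand-trimming the file in a spreadsheet.
# how uneven are the rows?
field_counts <- count.fields(file2, sep = ",")
table(field_counts)
# readr flags malformed rows rather than silently mangling them:
# bodisc_raw <- readr::read_csv(file2, guess_max = 10000)
# readr::problems(bodisc_raw)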
bodisc <- read.csv("https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/bo-disclosureMARRED.csv", header = TRUE, fill = TRUE)
bodisc[1:10, 1:10]
## id title
## 1 393 JUNE CEMENT INDUSTRY LTD.
## 2 392 MOATTAMA GAS TRANSPORTATION CO., LTD. (BRANCH)
## 3 390 SHWE HUA GEMS & JEWELLERY COMPANY LIMITED
## 4 389 SILVER SAND STONE GEMS COMPANY LIMITED
## 5 388 MYANMAR CNMC NICKEL COMPANY LIMITED
## 6 387 LON HAI PRODUCTION TRADING OF GEMSTONE COMPANY LIMITED
## 7 386 AUNG SHWE KABAR GEMS & JEWELLERY CO., LTD
## 8 384 DEVELOPERS ENTREPRENEURS LIAISON CONSTRUCTION ORGANIZERS (DELCO) LIMITED
## 9 382 AUNG THUKHA DANA GEMS COMPANY LIMITED
## 10 381 SHWE PAUK PAUK MINING COMPANY LIMITED
## jurisdiction_registration unique_id
## 1 MM 117418820
## 2 MM 142356155
## 3 MM 111532001
## 4 MM 100900424
## 5 MM 108341157
## 6 MM 134823119
## 7 MM 114585149
## 8 MM 125445020
## 9 MM 104728448
## 10 MM 118049551
## contact_address
## 1 No.80, Sayarsan Lane,Bahan Township, Yangon Region, Myanmar
## 2 No. 5 Sacred Tooth Relic Lake Avenue\n Punn Pin Gone Quarter No.5, Mayangone Township, YANGON
## 3 THARYAR SHWEPYI ROAD, NO. (D-2), TAMWELAY QUARTER,
## 4 No,62-63 , Ngu Shwe Wah Street , No (8/43), Chanmyatharzi Township.
## 5 No.3/F, U Kyaw Hla Street, Seven Ward, Seven Miles\nMayangone Township, Yangon, Myanmar
## 6 Mahar Aung Myay Township, between 61 & 62 streets, Theik Pan Street, No. 4/5, Mandalay
## 7 No.(63), 4th Floor, Baho Road, Sanchaung Township, Yangon Region, Myanmar
## 8 No.150/B, University Avenue Road, Bahan Township, Yangon
## 9 No.(MA-9/33, Corner Of 63th Street & Padauk Street\nMyothit(1) Word, Chan Mya Thar Si Township, MANDALAY, Myanmar
## 10 No.58, Sa Sa -2, Zabu Thiri StreetBetween 58 x 59 Street,Kan Thar Yar Quarter, Mandalay, Myanmar
## contact_address_2 state
## 1 null Yangon
## 2 null Yangon
## 3 null Yangon
## 4 null Mandalay
## 5 Maunggone Village, Tagaung Taung Nickel Project, Hteegyaint Sagaing
## 6 Shwe Pone Nyet StreetL/6Kamaryut Township, Yangon, Myanmar Mandalay
## 7 null Yangon
## 8 null Yangon
## 9 null Mandalay
## 10 null Mandalay
## city country zip_code
## 1 Bahan MM null
## 2 Mayangon MM null
## 3 Tamwe MM null
## 4 Chanmyathazi MM null
## 5 Katha MM null
## 6 Mahaaungmye MM 0503-1733
## 7 Sanchaung MM null
## 8 Bahan MM null
## 9 Chanmyathazi MM null
## 10 Chanmyathazi MM null
We are left with some disgustingly messy data, but it is finally loaded, and at least the file contains a lot of it. Since we are just trying to correlate politicians’ names with company owners, there is probably something useful here if we can employ regular expressions.
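For instance, the shareholder names carry honorifics (U, Daw, Dr.) that any name matching would need to handle consistently on both sides; a small, purely illustrative regular-expression sketch:
# strip a leading honorific and tidy whitespace in the shareholder names
owners_clean <- stringr::str_squish(
  stringr::str_remove(
    toupper(as.character(bodisc$legal_owners.0..full_name_of_shareholders)),
    "^(U|DAW|DR\\.?)\\s+"
  )
)
head(owners_clean)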
Here we have a directory of a large number of JSON files. I do not expect them to have a uniform shape; they share many similarities, but most are sparse or empty. The best use for them is as pointers to the original documents (the PDFs named by hash) relating to names in the other two datasets.
#files <- list.files(path = "~/Desktop/myco_details/")
#write.csv(files, "my_co_list.csv")
files <- read.csv("https://raw.githubusercontent.com/TheWerefriend/data607/master/project2/my_co_list.csv")
length(files)  # the number of columns in the data frame; nrow(files) would count the files themselves
## [1] 2
Each filename consists of a hash, an underscore, and a name in English with hyphens as spaces. We can make a dataframe with one column for the hash, and another column for the name.
files <- stringr::str_split_fixed(files$x, "_", 2)
colnames(files) <- c("hash", "name")
df <- as.data.frame(files)
df$name <- stringr::str_remove(df$name, "\\.json$")  # strip the extension: escape the dot, anchor at the end
For each political candidate, we check whether their name appears in the mining companies’ information. Then we look that company up in the my_co files and finally associate an original document. The end result would be a data frame with columns for politician, company, and document (the filename hash).
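Below is a minimal sketch of that linking logic. It assumes a hypothetical column mvoter$name_en of romanised candidate names, which does not exist yet (as the output below shows, the names are currently in Burmese script), and it pulls in dplyr for the join:
library(dplyr)

# normalise names for comparison: drop hyphens, uppercase, squeeze whitespace
normalise <- function(x) {
  stringr::str_squish(toupper(stringr::str_replace_all(as.character(x), "-", " ")))
}

owners <- normalise(bodisc$legal_owners.0..full_name_of_shareholders)

linked <- tibble(politician = mvoter$name_en) %>%     # name_en is hypothetical
  mutate(company = purrr::map_chr(politician, function(p) {
    hit <- which(stringr::str_detect(owners, stringr::fixed(normalise(p))))
    if (length(hit) > 0) as.character(bodisc$title[hit[1]]) else NA_character_
  })) %>%
  left_join(
    tibble(company = normalise(df$name), document = as.character(df$hash)),
    by = "company"
  )
# note: punctuation differences ("CO., LTD." vs none) would still need handling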
head(mvoter$name)
## [1] "ဦးကရိုးလရိန်" "ဦးခင်လဲဖုန်" "ဦးဂူဆာပါး"
## [4] "ဦးခင်မောင်မြင့်(ခ)ဦးဒိတ်" "ဦးဂျေယောဝူ" "ဦးငွါးဇာဒီး"
head(bodisc$highest_ranking_official)
## [1] null null null null null null
## Levels: Chanayethazan null
head(bodisc$legal_owners.0..full_name_of_shareholders)
## [1] DR. NU NU WIN MOATTAMA GAS TRANSPORTATION CO., LTD.
## [3] U EIKE TUN MM
## [5] CNMC NICKEL COMPANY LIMITED Ann Hai
## 267 Levels: AH SHONE Ah Woo Aheik Kwye AIKE CHWIN AIKE HTWE Ann Hai ... Yang Chin Kwan
Well. This is a bit of a non-starter: the politicians’ names are written here in Burmese, while all the other information is written in English.
I’ll have to revisit this with translations.