DATA 698 Final Project - Country Code Extractor For Wire Transactions
A Banking Compliance Anti-Money-Laundering Investigation Application Utilizing Convolutional Neural Networks
library(tidyverse)
library(assertr)
library(googleLanguageR)
library(rjson)
library(R.utils)
library(tensorflow)
library(keras)
Abstract
This study explores the feasibility of identifying country codes from open addresses using convolutional neural network text classification approaches. The goal is to provide the banking market with a robust tool that extracts the country code from complete or incomplete address information effectively and efficiently across large volumes of daily transactions, and detects whether transactions come from or are sent to high-risk jurisdictions, ultimately reducing manual work hours and lowering legal and compliance costs in the financial industry.
Introduction
Team Members
Fan Xu
Sin Ying Wong
This study explores the feasibility of identifying country codes from open addresses using convolutional neural network text classification approaches.
The goal is to provide the banking market with a robust tool that extracts the country code from complete or incomplete address information effectively and efficiently across large volumes of daily transactions, and helps detect whether transactions come from or are sent to high-risk jurisdictions, ultimately reducing manual work hours and lowering legal and compliance costs in the financial industry.
Regulatory Enforcement
An effective BSA/AML compliance program is required by the US government to control the risk associated with money-laundering and terrorist financing, especially in the banking industry.
The Wolfsberg Group, an association of thirteen global banks that aims to develop frameworks and guidance for the management of financial crime risks, issued guidance to U.S. financial institutions to include a country risk factor when measuring money laundering risk.
However, the modern payment networks, including but not limited to SWIFT (Society for Worldwide Interbank Financial Telecommunication), Fedwire (the real-time gross settlement system for central bank money operated by the Federal Reserve), and CHIPS (the Clearing House Interbank Payments System), do not mandate a country code entry for wire transactions.
Nor is there an industry-wide standard for country code solutions.
Convolutional Neural Network
A Convolutional Neural Network (CNN) is very similar to an ordinary Neural Network (NN) in that it is made up of neurons with learnable weights and biases. However, a CNN contains convolutional layers that extract key features and remove noise from the data before feeding it into the ordinary NN layers.
Convolutional neural networks are distinguished from other neural networks by their superior performance on image, speech, and audio signal inputs, and they have recently become a dominant algorithm in the deep learning community.
CNNs are also widely adopted in NLP projects such as sentiment analysis.
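As an illustration only (the layer sizes below are arbitrary and this is not the model trained in this project), the typical CNN pattern of convolution, pooling, and a dense classifier can be sketched in a few lines of keras code:
# Toy sketch of the convolution -> pooling -> dense classifier pattern
toy_cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = 'relu',
                input_shape = c(28, 28, 1)) %>%   # convolution extracts local features
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%   # pooling downsamples and removes noise
  layer_flatten() %>%
  layer_dense(units = 10, activation = 'softmax') # ordinary NN classifier on top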
Literature Review
Challenge in mitigating Country risks
The Wolfsberg Group, which consists of leading international financial institutions including ABN AMRO, Banco Santander, Bank of Tokyo-Mitsubishi UFJ, Barclays, Citigroup, Credit Suisse, Deutsche Bank, Goldman Sachs, HSBC, JP Morgan Chase, Societe Generale, and UBS, developed guidance to assist financial institutions in managing money laundering risks and to prevent them from being leveraged for criminal purposes.
Country risk is a common criterion for measuring money laundering risk. Financial institutions are instructed to identify the criteria they use to measure potential money laundering risk, and country risk is one of the most commonly used criteria for monitoring such risk in transactions. Various credible sources evaluate the money laundering risk level of countries, such as the World Bank, the International Monetary Fund, and the Organisation for Economic Co-operation and Development, as well as relevant national government bodies and non-governmental organizations.
Address information in transaction data is unstructured and commonly incomplete. For banking transactions there is exhaustive guidance and instruction from the modern payment networks as well, including but not limited to SWIFT (Society for Worldwide Interbank Financial Telecommunication), Fedwire (the Federal Reserve's real-time gross settlement system), and CHIPS (the Clearing House Interbank Payments System), yet none of these networks mandates a structured country code field in wire messages.
Classification Using Neural Networks
In machine learning, a neural network (NN) is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. McCulloch and Pitts (1943) proposed a logical calculus model constructed from a net of neurons. Within the neural net, each neuron connects to its neighboring neurons by synapses. As in the real neural activity of human brains, when a neuron receives input from its neighbors and the input exceeds a pre-defined threshold, the neuron creates an output and passes it to other neurons. Such a neural system is not just a randomly spanned 'net'; the neurons are organized in consecutive layers, and inputs are passed from one layer to another. With a proper design of the neural net, it becomes a logical application that receives input data and returns calculated results.
Perceptron. In 1958, Rosenblatt created the simplest neural net model, the so-called 'Perceptron', which contains only one input layer, a single layer of neurons, and one output. The perceptron is able to learn from input data and classify linearly separable data points.
Backpropagation. Research on NNs stalled because multiple layers demand tremendous computing power, and the number of weights grows exponentially with the number of neurons. In the 1970s, Backward Propagation of Errors ('Backpropagation' for short) was invented as a general optimization method for performing automatic differentiation of complex nested functions, and it was introduced to the machine learning community by Rumelhart, Hinton, and Williams in 1986. Given an NN and an error function, backpropagation minimizes the error function using gradient descent with respect to the NN's weights. Since 1986, backpropagation has become the most popular general optimization approach for applying deep NNs to image recognition and text-related recognition.
Convolutional Neural Network (CNN). Various NN variants have appeared over the past 30 years, such as autoencoders, recurrent neural networks (RNN), long short-term memory networks (LSTM), Boltzmann machines, and convolutional neural networks (CNN). Recently, CNNs have demonstrated exceptional performance compared to other NN algorithms.
LeCun (1989) developed the first CNN. It applied backpropagation to recognizing handwritten digits taken from U.S. mail. Instead of directly feeding the image data to the NN model, LeCun performed a linear transformation of the images to remove extraneous marks, which reduced noise in the data, increased accuracy, and at the same time significantly reduced model training time. LeCun and his team refined this algorithm and named it 'LeNet-5'. LeNet-5 was then widely utilized in applications that automatically read handwritten or machine-written checks in the U.S. banking industry.
CNNs became the leading NN solution after Krizhevsky and his team won the 2012 ILSVRC with the AlexNet algorithm. AlexNet reduced the error rate from the previous year's 26% to 15.3%, while the second place in 2012 achieved 26.2%. The architecture of AlexNet is similar to LeNet-5 but with more complex filters and convolutional strategies, and the model was trained on more advanced GPUs with far greater computing power than was available for LeNet-5. In the following years CNNs continued to dominate the ILSVRC, and multiple popular CNN variants were introduced to the community, including ZFNet (2013), VGGNet (2014), GoogLeNet (2014), ResNet (2015), ResNeXt (2016), and SENet (2017), with error rates of 14.8%, 7.3%, 6.67%, 3.57%, 3.03%, and 2.25% respectively, where human performance is around 5.1%.
CNN in Text Classification. In 2017, Wang combined CNNs with explicit and implicit representations of short text for classification. The research team conceptualized a short text as a set of relevant concepts and then obtained the embedding of the short text by coalescing the words and relevant concepts on top of pre-trained word vectors. Wang showed that the model significantly outperforms other state-of-the-art text classification approaches. Reviewing the literature leads back to the aim of this study: identifying country codes from complete or incomplete open addresses. The questions to be addressed are as below:
How to effectively vectorize a large volume of short text data
How to build a convolutional approach to extract key features of the short texts
How to construct an interactive application for the country code extractor
Methodology
For this project, we will use OpenAddresses to collect international raw address data as our dataset. The Basel AML country risk rating, provided by the Basel Institute on Governance and offering a global ranking of countries worldwide, will be used to determine the country risk rating of the addresses. We will build CNN models with deep learning methods, together with text vectorization techniques such as word2vec, to perform text classification and country identification. Statistics and graphics will be provided to validate the models and present their performance to the audience. A conclusion will then be drawn, and an interactive interface will be built for users to test their address entries and view the country risk score.
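As an illustration of the final risk-lookup step (the table and scores below are made-up placeholders, not actual Basel AML Index values), a predicted country code could be joined to a risk-rating table as follows:
# Hypothetical example only: basel_aml_rating stands in for the Basel Institute
# on Governance country risk index (country codes and scores below are invented placeholders)
basel_aml_rating <- tribble(
  ~Country_Code, ~Basel_Risk_Score,
  'AA',          8.1,
  'BB',          3.2
)
# A transaction with a country code predicted by the model (placeholder row)
predicted_txn <- tibble(Address = '123 Example Street', Country_Code = 'AA')
predicted_txn %>%
  left_join(basel_aml_rating, by = 'Country_Code') %>%
  mutate(High_Risk_Jurisdiction = Basel_Risk_Score >= 6.5)   # illustrative threshold only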
Evaluation
We will use the industry-acknowledged convolutional neural network approach to train a deep learning model. The evaluation will be based on standard academic machine learning model performance metrics.
Process
Intro to Data Set
The 'OpenAddresses' data set is used in this project. It contains 600 billion+ address entries from 60 countries in their native languages.
Source: https://openaddresses.io/
The data set contains a total of 2,878 files in GeoJSON format.
# Folder containing the OpenAddresses GeoJSON files
path <- 'D://DATA SCIENCE//DATA 698 FALL 2021//data'
# List every file and derive the two-letter country code from its directory name
file_path <- list.files(path, full.names = TRUE, recursive = TRUE)
file_path <- as.data.frame(file_path) %>%
  rename(File_Path = file_path) %>%
  mutate(Country_Code = str_extract(File_Path, '(?<=\\/)[a-z]{2}(?=\\/)') %>% toupper())
file_path
Data Population Statistics
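The population statistics table itself is not reproduced here. A minimal sketch of how the data_population table used in the next step could be assembled (assuming one address record per line in each GeoJSON file, counted with R.utils::countLines) is:
# Sketch (assumed, not the original chunk): one row per file with its record count
data_population <- file_path %>%
  rowwise() %>%
  mutate(Record_Cnt = as.integer(countLines(File_Path))) %>%   # from R.utils
  ungroup()
data_population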
Data Sample Statistics
As the model is trained on a local CPU, approximately 2,500 address entries are randomly selected from each country to avoid overconsuming PC resources.
Aggregated sample size: 150,000+
#select approximately 2500 records per country code as sample for model training & testing
data_sample_cnt <- data_population %>%
group_by(Country_Code) %>%
mutate(Total_Cnt = sum(Record_Cnt),
Sample_Cnt = round(Record_Cnt/sum(Record_Cnt)*2500)) %>%
mutate(Sample_Cnt = if_else(Sample_Cnt == 0, Sample_Cnt + 1, Sample_Cnt)) %>%
select(-Total_Cnt)
data_sample_cnt
#write.csv(data_sample_cnt, 'D://DATA SCIENCE//DATA 698 FALL 2021//Data_Sample_Cnt.csv')
Load Data Sample
Create a function to read sample rows from each file one by one, instead of reading the whole file first and then applying sampling, to avoid unnecessary memory consumption.
# Create a function to read sample rows from each file one by one
# instead of reading the whole file first and then sampling,
# to avoid unnecessary memory consumption
read_samples <- function(f, n, s){
  # f: file path, n: total number of rows in the file, s: sample size
  con <- file(f, 'r')
  sample_rows <- seq(1, n) %>% sample(s)
  lines <- c()
  i <- 1
  while(TRUE) {
    line <- readLines(con, 1, encoding = 'UTF-8')
    # stop at end of file or once the requested sample size has been collected
    if(length(line) == 0 || length(lines) == s){
      break
    }
    else if(i %in% sample_rows){
      lines <- c(lines, line)
    }
    i <- i + 1
  }
  close(con)
  return(lines)
}
Read the data sample. Note that address data are recorded in their native languages. For example, the first few rows of the data are written in Arabic.
# read data sample rows
set.seed(123)
#start_time <- Sys.time()
address_tbl <- data.frame()
n <- nrow(data_sample_cnt)   # number of files to read
t <- 1                       # running total of rows read
c <- 1                       # batch file counter
a <- 30000                   # row threshold for saving the next batch to disk
for(i in 1:n){
  path <- data_sample_cnt$File_Path[i]
  cc <- data_sample_cnt$Country_Code[i]
  record_cnt <- data_sample_cnt$Record_Cnt[i]
  sample_cnt <- data_sample_cnt$Sample_Cnt[i]
  #print(str_c(i, '/', n, ' files is loading.', data_sample_cnt$File_Path[i],'. Total rows: ',t))
  #print(Sys.time() - start_time)
  address_data <- read_samples(path, record_cnt, sample_cnt)
  j <- 1
  for(record in address_data){
    # progress message every 300 records, plus the first and last record of each file
    if(j %% 300 == 0 | j == 1 | j == sample_cnt){
      print(str_c(i, '/', n, ' files, ', j, '/', sample_cnt, ' ', cc, ' entries'))
    }
    # parse the GeoJSON record, drop the hash field, and concatenate the
    # remaining properties into a single address string
    temp <- fromJSON(record)$properties %>%
      as.data.frame() %>%
      select(-hash) %>%
      transmute(Address = col_concat(., sep = ' '),
                Country_Code = cc)
    address_tbl <- address_tbl %>% rbind(temp)
    j <- j + 1
    t <- t + 1
  }
  # Save intermediate batches of roughly 30,000 rows to disk
  if(nrow(address_tbl) >= a){
    write.csv(address_tbl, str_c('D://DATA SCIENCE//DATA 698 FALL 2021//address_tbl/address_batch_file_', c,'.csv'))
    #print(str_c('Address batch ',c, ' saved completed'))
    #add_tbl <- data.frame()
    c <- c + 1
    a <- a + 30000
  }
}
#end_time <- Sys.time()
#end_time - start_time
address_tbl %>%
.$Address %>%
head()
## [1] "32938 99135 <U+0627><U+0644><U+0645><U+0645><U+0632><U+0631>"
## [2] "12740 66139 <U+062C><U+0628><U+0644> <U+0639><U+0644><U+064A> <U+0627><U+0644><U+0635><U+0646><U+0627><U+0639><U+064A><U+0629> <U+0627><U+0644><U+0623><U+0648><U+0644><U+0649>"
## [3] "07971 43170 <U+0627><U+0644><U+0644><U+064A><U+0627><U+0646> 1"
## [4] "25588 68822 <U+0627><U+0644><U+062D><U+0628><U+064A><U+0647> <U+0627><U+0644><U+062B><U+0627><U+0644><U+062B><U+0629>"
## [5] "44663 93749 <U+0627><U+0644><U+0645><U+0632><U+0647><U+0631> <U+0627><U+0644><U+062B><U+0627><U+0646><U+064A><U+0629>"
## [6] "52755 89298 <U+0627><U+0644><U+0639><U+064A><U+0627><U+0635>"
Translate data into English
Google Cloud Translation API
Wire transfer messages are all transmitted in English (Latin) characters; therefore all address data need to be translated into English.
The R package googleLanguageR, together with the Google Cloud Translation API, is used for this task.
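A minimal sketch of the translation step (the service-account key path is a placeholder, and in practice the API calls would need to be batched to respect quota limits):
# Authenticate with a Google Cloud service-account key (file name is a placeholder)
gl_auth('D://DATA SCIENCE//DATA 698 FALL 2021//gcp_service_account.json')
# Translate the sampled addresses to English; gl_translate() returns a tibble
# with a translatedText column
address_data_en <- address_tbl %>%
  mutate(Address_En = gl_translate(Address, target = 'en')$translatedText,
         ID = row_number())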
Training CNN Model
Vectorization of address text strings
Text vectorization is the process of converting text into a numerical representation. There are various methods to convert text strings into numerical data, such as:
Bag of Words, TF-IDF, Word2Vec, etc.
To reduce the computational burden, we transform the words of the address data into integer indices. Specifically, words are indexed by their overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This is the vectorization method used in some of the datasets shipped with the TensorFlow Keras module.
# Clean the translated addresses and split them into individual words (tokens)
address_data_encode <- address_data_en %>%
  mutate(Address_En = Address_En %>%
           str_remove_all('[0-9]') %>%
           str_replace_all('[:punct:]', ' ') %>%
           str_trim()) %>%
  mutate(Address_En_Encode = Address_En) %>%
  separate_rows(Address_En_Encode) %>%
  filter(Address_En_Encode != '') %>%
  mutate(Address_En_Encode = toupper(Address_En_Encode))

# Rank each word by its overall frequency in the sample
address_data_encode <- address_data_encode %>%
  left_join(address_data_encode %>%
              group_by(Address_En_Encode) %>%
              tally(n = 'Word_Cnt') %>%
              mutate(Encode_Rank = dense_rank(Word_Cnt))
            )

# Replace each word with its frequency rank and reshape to one row per address
address_data_encode <- address_data_encode %>%
  select(ID, Address_En, Country_Code, Encode_Rank) %>%
  group_by(ID) %>%
  mutate(Word_ID = row_number()) %>%
  ungroup() %>%
  spread(key = Word_ID, value = Encode_Rank) %>%
  mutate_all(~replace_na(.,0)) %>%
  ungroup() %>%
  mutate(Country_Code_Factor = dense_rank(Country_Code))
address_data_encode
Train-Test-Split
The vectorized data sample is split into a training set and a testing set with an 80/20 ratio.
Pad the vectorized address data (X) so all entries have the same number of tokens.
Transform the target variable (Y) from numerical to categorical using one-hot encoding.
set.seed(123)
# train-test split with an 80/20 ratio
maxlen <- 400
## training set
data_train <- address_data_encode %>%
  group_by(Country_Code) %>%
  sample_frac(0.8) %>%
  ungroup()
x_train <- data_train %>%
  select(-c(ID, Address_En, Country_Code, Country_Code_Factor)) %>%
  as.matrix() %>%
  pad_sequences(maxlen)
# to_categorical() expects zero-based class labels, so shift the 1..60 factor to 0..59
y_train <- (data_train$Country_Code_Factor - 1) %>% to_categorical(60)
## testing set
data_test <- address_data_encode %>%
  filter(!ID %in% data_train$ID)
x_test <- data_test %>%
  select(-c(ID, Address_En, Country_Code, Country_Code_Factor)) %>%
  as.matrix() %>%
  pad_sequences(maxlen)
# same zero-based shift for the test labels
y_test <- (data_test$Country_Code_Factor - 1) %>% to_categorical(60)
Define a CNN Model
Set up a one-dimensional convolutional neural network model (1D CNN) using the tensorflow and keras packages. A 1D CNN is suitable for one-dimensional input data such as vectorized text strings.
# Defining Model ------------------------------------------------------
#tf$constant("Hellow Tensorflow")
max_features <- 5000
maxlen <- 400
batch_size <- 32
embedding_dims <- 50
filters <- 250
kernel_size <- 3
hidden_dims <- 250
epochs <- 10
#Initialize model
model <- keras_model_sequential()
model %>%
# Start off with an efficient embedding layer which maps
# the vocab indices into embedding_dims dimensions
layer_embedding(max_features, embedding_dims, input_length = maxlen) %>%
layer_dropout(0.2) %>%
# Add a Convolution1D, which will learn filters
# Word group filters of size filter_length:
layer_conv_1d(
filters, kernel_size,
padding = "valid", activation = "relu", strides = 1
) %>%
# Apply max pooling:
layer_global_max_pooling_1d() %>%
# Add a vanilla hidden layer:
layer_dense(hidden_dims) %>%
# Apply 20% layer dropout
layer_dropout(0.2) %>%
layer_activation("relu") %>%
# Project onto a 60-unit output layer and apply softmax for multi-class classification
layer_dense(60) %>%
layer_activation("softmax")
# Compile model
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(),
#loss = "binary_crossentropy",
#optimizer = "adam",
metrics = "accuracy"
)
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## embedding (Embedding) (None, 400, 50) 250000
##
## dropout_1 (Dropout) (None, 400, 50) 0
##
## conv1d (Conv1D) (None, 398, 250) 37750
##
## global_max_pooling1d (GlobalMaxPoo (None, 250) 0
## ling1D)
##
## dense_1 (Dense) (None, 250) 62750
##
## dropout (Dropout) (None, 250) 0
##
## activation_1 (Activation) (None, 250) 0
##
## dense (Dense) (None, 60) 15060
##
## activation (Activation) (None, 60) 0
##
## ================================================================================
## Total params: 365,560
## Trainable params: 365,560
## Non-trainable params: 0
## ________________________________________________________________________________
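The training call itself is not shown above; a sketch consistent with the hyperparameters defined earlier (batch_size = 32, epochs = 10, with the held-out test set used for validation) would be:
# Train the model and keep the per-epoch training history
history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_data = list(x_test, y_test)
)
plot(history)
# Evaluate loss and accuracy on the held-out test set
model %>% evaluate(x_test, y_test)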
Model Performance
At epoch 5, the model reaches a validation accuracy of 81.28% and a loss of 0.65.
Note that callbacks such as early stopping can be set to halt training when there is no further improvement in accuracy. However, we kept training the model through all 10 epochs to show how model performance changes.
After epoch 5, the validation accuracy starts to drop, which indicates that the model is overfitting due to overtraining.
Neural network models require both a large training sample and strong computational power. Enlarging the sample size and providing GPU computation would significantly improve model performance.
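For the interactive application named in the research questions, a new address (after translation to English) would be encoded with the same word-frequency ranks, padded, and scored by the trained model. The sketch below is illustrative only; it assumes a lookup table word_rank_tbl with columns Address_En_Encode and Encode_Rank, i.e. the frequency-rank table built during vectorization.
# Sketch only: encode a new English address with the training vocabulary and score it
predict_country <- function(address_en, word_rank_tbl, model, maxlen = 400){
  # tokenize with the same cleaning rules used during training
  tokens <- address_en %>%
    str_remove_all('[0-9]') %>%
    str_replace_all('[:punct:]', ' ') %>%
    str_squish() %>%
    str_split(' ') %>%
    unlist() %>%
    toupper()
  # look up each token's frequency rank; unknown words get 0 (the padding value)
  codes <- tibble(Address_En_Encode = tokens) %>%
    left_join(word_rank_tbl, by = 'Address_En_Encode') %>%
    pull(Encode_Rank) %>%
    replace_na(0)
  x <- pad_sequences(list(codes), maxlen = maxlen)
  probs <- predict(model, x)
  which.max(probs[1, ])   # index corresponds to Country_Code_Factor
}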
Bibliography
The Wolfsberg Group (2006). Wolfsberg Risk-Based Approach Guidance. https://www.wolfsberg-principles.com/sites/default/files/wb/pdfs/wolfsberg-standards/15.%20Wolfsberg_RBA_Guidance_%282006%29.pdf
McCulloch, W. & Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biology, Vol. 52, No. 1/2, pp. 99-115, 1990.
Tappert, C.C. (2019). Who Is the Father of Deep Learning. 2019 International Conference on Computational Science and Computational Intelligence (CSCI). doi: 10.1109/CSCI49370.2019.00067
Rumelhart, D., Hinton, G. & Williams, R. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel; Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput 1989; 1 (4): 541–551. doi: https://doi.org/10.1162/neco.1989.1.4.541
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (June 2017), 84–90. doi: https://doi.org/10.1145/3065386
Zeiler M.D., Fergus R. (2014) Visualizing and Understanding Convolutional Networks. In: Fleet D., Pajdla T., Schiele B., Tuytelaars T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham. https://doi.org/10.1007/978-3-319-10590-1_53
C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , page 1-9. (June 2015)
Simonyan, K. & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556.
K. He, X. Zhang, S. Ren and J. Sun, Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
S. Xie, R. Girshick, P. Dollar, Z. Tu, K. He. Aggregated Residual Transformations for Deep Neural Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), doi:10.1109/CVPR.2017.634
J. Hu, L. Shen and G. Sun, Squeeze-and-Excitation Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132-7141, doi: 10.1109/CVPR.2018.00745.
J. Wang, Z. Wang, D. Zhang, and J. Yan. 2017. Combining knowledge with deep convolutional neural networks for short text classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI’17). AAAI Press, 2915–2921