Andre Costa

Prompt: You are looking for a pattern of money laundering between individuals. You’ll need to present your Director a short list of suspects and the analysis developed to identify them.

#Part 1 - Data Exploration

As Part 1 suggests, I started my analysis by exploring the different variables of the dataset to see if I could spot some patterns in the data.

I started by plotting the number of transactions by unique value within each variable (e.g., how many transactions were $500). I filtered for the top 25 values by transaction count for each variable to make the plots easier to digest.

data <- read_xlsx(path = "Andre Costa - AML exercise dataset.xlsx", 1) # Read in data

## New names:
## • `` -> `...1`

data <- data[,-1] # remove the first column which is row number

###' Prints the name of each variable and how many unique values it has
###' this helps get a feel for if certain variables are concentrated around
###' a certain amount of variables unexpectedly

par(mfrow = c(3, 3))
for (i in 1:length(names(data))){
  print(paste(names(data)[i], nrow(unique(data[,names(data)[i]]))))
}

## [1] "Send DateTime 11384"
## [1] "Pay DateTime 11244"
## [1] "Amount 892"
## [1] "Sender Name_ 3019"
## [1] "Sender Address_ 2615"
## [1] "Sender ID1 info 2"
## [1] "Sender City 465"
## [1] "Sender State 60"
## [1] "Payee Name_ 4658"
## [1] "Payee Address_ 5709"
## [1] "Payee ID1 info 2"
## [1] "Payee City 1410"
## [1] "Payee State 87"
## [1] "Send Agent Name_ 398"
## [1] "Send Agent City 296"
## [1] "Send Agent State 52"
## [1] "Send Country 22"
## [1] "Send Operator Name_ 489"
## [1] "Pay Agent Name_ 2519"
## [1] "Pay Agent City 995"
## [1] "Pay Agent State 72"
## [1] "Pay Country 65"
## [1] "Pay Operator Name_ 1042"
## [1] "Sender Age 71"
## [1] "Payee Age 79"

# Variables to loop through when plotting
var <- c("Amount", "Sender Name_", "Payee Name_", "Sender Address_", "Payee Address_", "Sender City", "Payee City", "Sender State", "Payee State", "Send Agent Name_", "Pay Agent Name_", "Send Agent City", "Pay Agent City", "Send Agent State", "Pay Agent State", "Send Country", "Pay Country", "Send Operator Name_", "Pay Operator Name_")

for (i in 1:length(var)){
  col_sym <- sym(var[i])
  
  sum <- data %>% 
    group_by(!!col_sym) %>% 
    summarise(txn_count = n(), .groups = "drop") %>% 
    arrange(desc(txn_count)) %>% 
    slice_head(n = 25)
  
  p <- ggplot(sum, aes(x = reorder(!!col_sym, txn_count), y = txn_count))+
    geom_col(fill = "black")+
    geom_text(aes(label = txn_count), hjust = 0, color = "red", size = 2.5) +
    labs(title = paste("Transactions by", var[i], "Top 20"), x = var[i], y = "Transaction Count") +
    coord_flip()
  
  print(p)
}

After creating some inital plots, there are some clear areas of suspicion

Though transactions are almost exclusively sent from the US, the top pay 5 agents which make up more than 20% of payee countries is Mexico, El Salvador, United States, Guatemala, and Tonga. Given involvement with the cartel is a concern, it will be interesting to dig into transactions going to these top countries outside of the US.
When comparing the sender addresses and the payment addresses, the distribution for the sender addresses is remarkably less skewed - though both have a long tail - payee address 0 is clearly used more than others and could be involved in money laundering as it is unlikely that so many different senders would be sending to the one centralized payer. This is likely a hub for money laundering.
Transactions are almost exclusively sent from send agents agent_11 (~90%) and agent_39 - though pay agents are more diversified, pay agent_11 is still at the top by a wide margin, suggesting there is circularity typical of money laundering.
Unsurprisingly, transactions are made in whole number amounts, making it easier to track money laundered. However, $1,000 and $500 being the top amounts may suggest that these are the amounts to watch out for when looking for suspcious activity.
Sender name_10 sticks out for having many more transactions than everyone else, and he/she would be worth digging further into

Given these patterns and areas of suspicion, I will continue my analysis looking out for the following: 1. Transactions originating in the US to countries outside of the US, most notably Mexico, El Salvador, and Guatemala are all in Central America and are thus more likely to be tied to involvement with Mexican cartels and money laundering. 2. Transactions involved with payee address 0 are more likely to be suspicious - this is a clear outlier that can be a hub for money laundering

While looking at these two traits specifically, some other aspects to keep in mind

Transactions involving whole numbers like $500, $1000 are more suspicious as this may be indicative of “structuring” form the money launderers. On a similar note, senders that make more transactions are more suspicious as it would be more complicated and unrealistic for money laundering to involve many people transacting once vs pockets of concentration of bad actors transacting multiple times. Small transactions that are repeated within a short time may also more suspicious as this is another sign of structuring, along with daily transfer amounts just shy of $10,000 as this is the limit before which transactions will be reported to the government
Transfers executed during non-business hours or weekends may be suspicious as these are irregular.
A short time between the wiring and receiving of a transaction may be suspicious because the quick reception of a transfer means that the receiver was informed that the transaction would be coming through (these transactions are being wired and received in person)
Transactions where the sender and/or the payer do not present their ID may be more suspicious as they don’t want the transaction to be traced back to them

Before moving forward, I wanted to manipulate the time data use in analysis later

# Convert send and received times to date_time objects to manipulate
data$`Send DateTime` <- mdy_hm(data$`Send DateTime`)
data$`Pay DateTime` <- mdy_hm(data$`Pay DateTime`)
data$Send_Pay_Time_Diff_Mins <- time_length(data$`Pay DateTime` - data$`Send DateTime`)/60
data$send_weekday <- weekdays(data$`Send DateTime`)
data$send_weekday <- factor(data$send_weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
data$hour_of_day <- hour(data$`Send DateTime`)
data$whole_number <- ifelse(data$Amount %% 1 == 0, 1, 0)

###'There is a clear cluster of transfers received in less than 6 hours that are likely related to money laundering
#data$Send_Pay_Time_Diff <- time_length(data$`Pay DateTime` - data$`Send DateTime`)/86400 #get transfer times in days
#plot(table(data$Send_Pay_Time_Diff[data$Send_Pay_Time_Diff < 5]), xaxt = "n")# plot send times by day up to 5 days
#axis(side = 1, at = seq(0, 5, by = .25))

#Part II Testing the Hypotheses

Dig in to see which senders and payers are those sending to places outside of the US, specifically in the top countries of suspicion

Note: Variables presented here were culled down from those analyzed to highlight those that were most important to continuing the analysis

###' 1. Dig in to see which senders and payers are those sending to places outside of the US, specifically in the top countries of suspicion

suspicious_countries <- c("mexico", "el salvador", "guatemala") #vector of suspcious countries

#filter dataset to only include data from suspicious countries
activity_sus_countries <- data %>%
  filter(`Pay Country` %in% toupper(suspicious_countries))

# nrow(activity_sus_countries) - out of 11,399 transactions, 8364 (~73%)

var <- c("Amount", "Sender Address_", "Payee Name_", "Payee Address_", "Send Agent Name_", "Pay Agent Name_",  "Pay Agent City", "Send Operator Name_", "Pay Operator Name_")

for (i in 1:length(var)){
  col_sym <- sym(var[i])
  
  sum <- activity_sus_countries  %>% 
    group_by(!!col_sym) %>% 
    summarise(txn_count = n(), .groups = "drop") %>% 
    arrange(desc(txn_count)) %>% 
    slice_head(n = 25)
  
  p <- ggplot(sum, aes(x = reorder(!!col_sym, txn_count), y = txn_count))+
    geom_col(fill = "black")+
    geom_text(aes(label = txn_count), hjust = 0, color = "red", size = 2.5) +
    labs(title = paste("Transactions by", var[i], "Top 25"), x = var[i], y = "Transaction Count") +
    coord_flip()
  
  print(p)
}

Couple notable things that happen when filtering to suspicious countries:

Send agent becomes exclusively agent 11, this is likely the problematic send agent, which we already had suspicion of
Pay agent address 0 still stands out among the rest, receiving many more transfers than everyone else and accounting for ~7% of transactions
Pay operator name also narrows down to Operator name 0, but that isn’t as concerning to me as this may be due to a certain operator like Western Union being more common in Central America. Meanwhile, multiple payees using the same address seems very suspicious.

I want to now zoom in and look specifically at the transactions that are paid to address 0 as this is the most suspicious. This will help to narrow down the data further as there is still a large tail of senders who and it’s not clear who the problematic parties are yet

#Filter the data to the suspsicious pay address
sus_countries_and_address <- activity_sus_countries %>% 
  filter(`Payee Address_` == "address_0")

#summarize by ssender activity 
sus_sender_activity <- sus_countries_and_address %>% 
  group_by(`Sender Name_`) %>%  
  summarise(txn_count = n(),
            unique_tsn_amt = n_distinct(Amount),
            avereage_tsn_amt = mean(Amount),
            median_tsn_amt = median(Amount),
            send_id_freq = mean(`Sender ID1 info`),
            payee_id_freq = mean(`Payee ID1 info`),
            avg_transfer_time = mean(Send_Pay_Time_Diff_Mins),
            med_transfer_time = median(Send_Pay_Time_Diff_Mins),
            weekend_proportion = mean(send_weekday %in% c("Saturday", "Sunday")),
            proportion_whole = mean(whole_number),
            off_hours_proportion = mean(hour_of_day > 18 | hour_of_day < 9),
            time_first_to_last_transaction_days = ( (max(`Send DateTime`) - min(`Send DateTime`))/1440 )) %>% 
  arrange(desc(txn_count)) %>% 
  filter(txn_count >=10) #narrow down to more than 10 transactions due to hypothesis above which mentioned higher frequency was more sus

print(sus_sender_activity)

## # A tibble: 12 × 13
##    `Sender Name_` txn_count unique_tsn_amt avereage_tsn_amt median_tsn_amt
##    <chr>              <int>          <int>            <dbl>          <dbl>
##  1 name_115              49              1             500            500 
##  2 name_813              30              2             517.           500 
##  3 name_1119             28             21            1070.           940 
##  4 name_525              21              1             500            500 
##  5 name_82               20             15            1294            995 
##  6 name_362              19             18            1224.          1061 
##  7 name_1007             14              5             756.           750 
##  8 name_2285             12             12            1397.          1248.
##  9 name_60               11              1             850            850 
## 10 name_190              10              1             500            500 
## 11 name_191              10             10             652.           638 
## 12 name_310              10              5             935.           990 
## # ℹ 8 more variables: send_id_freq <dbl>, payee_id_freq <dbl>,
## #   avg_transfer_time <dbl>, med_transfer_time <dbl>, weekend_proportion <dbl>,
## #   proportion_whole <dbl>, off_hours_proportion <dbl>,
## #   time_first_to_last_transaction_days <drtn>

After narrowing the data down further, I arrived at the list of senders above for those that are most suspcious in sending to the Central American area, given their relation to payer address 0. I will note that there are some names that are less likely to be as suspicious like name_1007 given that in transactions with this person, the sender and payeee often show ID, transfer times are high, and often has a unique transaction amounts (e.g., not always $500). Meanwhile, there are some clearly suspect senders like sender name_115 who has 49 transactions, neither party has almost ever shown an ID, and always transfers $500. Sender name_813 has a similar pattern with only two unique transaction amounts (vast majority seem to be $500), low ID frequencies, and nearly 50% of transactions made outside of normal business hours.

As one final step, I wanted to look into agent_11 a bit more as a pay agent. Above, we saw that it was the most common send agent, with all transactions coming from this send agent once we filtered to suspicious countries. So, I thought it would also be useful to check the circumstances of transfers coming to this pay agent site as Peregrine noted “Frequently, AML cases are initiated because of suspicious transactions being made out of specific stores (or send agents)”. In other words, this can help capture how the money is coming back into the US once it has been laundered outside of the US.

#filter dataset to only suspicious pay agent -  which is the suspicious send agent noted
activity_sus_pay_agent <- data %>%
  filter(data$`Pay Agent Name_` == "agent_11")

var <- c("Amount", "Sender Name_", "Payee Address_",  "Sender State",  "Send Agent Name_", "Send Agent City", "Send Agent State", "Send Country", "Send Operator Name_", "Pay Operator Name_")

for (i in 1:length(var)){
  col_sym <- sym(var[i])
  
  sum <- activity_sus_pay_agent  %>% 
    group_by(!!col_sym) %>% 
    summarise(txn_count = n(), .groups = "drop") %>% 
    arrange(desc(txn_count)) %>% 
    slice_head(n = 25)
  
  p <- ggplot(sum, aes(x = reorder(!!col_sym, txn_count), y = txn_count))+
    geom_col(fill = "black")+
    geom_text(aes(label = txn_count), hjust = 0, color = "red", size = 2.5) +
    labs(title = paste("Transactions by", var[i], "Top 25"), x = var[i], y = "Transaction Count") +
    coord_flip()
  
  print(p)
}

Interestingly, sender name name_10 rises to the top as the most common sender to this pay agent site. Earlier we noted that this sender could be suspicious as he/she had a large number of transactions. Now that those transactions are coming back to the pay agent site where problematic transactions sent to the outside of the US originated, sender name_10 becomes more suspcious… Let’s dig deeper.

sus_sender_activity_pay_agent <- activity_sus_pay_agent %>% 
  group_by(`Sender Name_`) %>%  
  summarise(txn_count = n(),
            unique_tsn_amt = n_distinct(Amount),
            send_id_freq = mean(`Sender ID1 info`),
            avereage_tsn_amt = mean(Amount),
            median_tsn_amt = median(Amount),
            payee_id_freq = mean(`Payee ID1 info`),
            avg_transfer_time = mean(Send_Pay_Time_Diff_Mins),
            med_transfer_time = median(Send_Pay_Time_Diff_Mins),
            weekend_proportion = mean(send_weekday %in% c("Saturday", "Sunday")),
            proportion_whole = mean(whole_number),
            off_hours_proportion = mean(hour_of_day > 18 | hour_of_day < 9),
            time_first_to_last_transaction_days = ( (max(`Send DateTime`) - min(`Send DateTime`))/1440 )) %>% 
  arrange(desc(txn_count)) %>% 
  filter(txn_count >=10)

print(sus_sender_activity_pay_agent)

## # A tibble: 4 × 13
##   `Sender Name_` txn_count unique_tsn_amt send_id_freq avereage_tsn_amt
##   <chr>              <int>          <int>        <dbl>            <dbl>
## 1 name_10              301            129        0                1231.
## 2 name_2748             22             10        0.909             673.
## 3 name_1445             21              6        0.714             955.
## 4 name_1968             15             15        1                 641.
## # ℹ 8 more variables: median_tsn_amt <dbl>, payee_id_freq <dbl>,
## #   avg_transfer_time <dbl>, med_transfer_time <dbl>, weekend_proportion <dbl>,
## #   proportion_whole <dbl>, off_hours_proportion <dbl>,
## #   time_first_to_last_transaction_days <drtn>

sus_sender_activity_pay_agent <- subset(sus_sender_activity_pay_agent, sus_sender_activity_pay_agent$`Sender Name_` == "name_10")

After digging in deeper for senders with more than 10 tranactions, only 4 sender names surface. Name_10 clearly stands out from the other senders for several reasons. Sender name_10 has ~300 transactions while the other senders had ~20. More importantly, sender name_10 has not once used his ID for a transaction, while the other senders have used their ID for the most part. Even more glaring is the fact that the median time between transfer start and stop is just 3 minutes for sender name_10, while it is over 10 hours for the others. For these reasons sender name_10 is clearly suspicious and is likely the method of getting laundered money back into the US once it has been sent through Central America by others

In summary, I would recommend starting the investigation by looking at sender Name_10 and those who have transfered money to him/her as this is the most concentrated source of suspicious transactions tied to the suspicious pay agent_11. This will help track down the tail fo the money laundering cycle. To track where it kicks off, I would investigate the senders I identified when looking at senders to suspicious countries above. A final list of suspects is below

print(c(sus_sender_activity_pay_agent$`Sender Name_`, sus_sender_activity$`Sender Name_`))

##  [1] "name_10"   "name_115"  "name_813"  "name_1119" "name_525"  "name_82"  
##  [7] "name_362"  "name_1007" "name_2285" "name_60"   "name_190"  "name_191" 
## [13] "name_310"

Below are links to ChatGPT usage:

Chat1 Chat2 Chat3

Andre Costa - Peregrine Case Study

2025-07-12