This appendix outlines the steps taken to construct the automated banking machine (ABM) data used in this paper. The construction process proceeded in three steps, outlined below.
Step 1: Identify and Drop Duplicates in Mastercard Data
In this step, we are identifying duplicates within each sponsor. We treat both the terminal ID and sponsor name as a combined key. We then add a column called duplicate_tag_1 that indicates whether each terminal ID appears once, twice, or more than twice within the same sponsor:
Tag 0: The terminal ID appears only once within a sponsor.
Tag 1: The terminal ID appears exactly twice within a sponsor.
Tag 2: The terminal ID appears more than twice within a sponsor.
This helps us manage duplicates at the sponsor level and ensures we accurately identify repeated terminal IDs within each sponsor’s dataset.
Code
tagged_2_2023 <- mc_raw_2023 %>%group_by(inst_assigned_term_id, sponsor_name) %>%mutate(duplicate_tag_1 =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup()tagged_2_2023 %>%count(sponsor_name, duplicate_tag_1) %>%pivot_wider(names_from = duplicate_tag_1, values_from = n, names_prefix ="tag_") %>%# Replace missing values (NA) with 0 in the reshaped datamutate(across(where(is.integer), ~replace_na(., 0))) %>%# Create a formatted table with 'kable' for better presentationkbl(caption ="Table 2: Duplicates Within Sponsors (2023)", col.names =c("Sponsor Name", "Tag 0", "Tag 1", "Tag 2")) %>%kable_classic()
Table 2: Duplicates Within Sponsors (2023)
Sponsor Name
Tag 0
Tag 1
Tag 2
BANK OF MONTREAL
37875
52
0
BANK OF NOVA SCOTIA
2434
48
18
CANADIAN IMPERIAL BANK OF COMMERCE
2968
0
0
DIRECTCASH BANK
7255
128
0
EVERLINK PAYMENT SERVICES INC.
1851
0
0
FEDERATION DES CAISSES DESJARDINS DU QUEBEC
1875
0
0
HSBC BANK CANADA
168
0
0
MONERIS SOLUTIONS CORPORATION
26397
0
0
NATIONAL BANK OF CANADA
927
0
0
ROYAL BANK OF CANADA
3955
0
0
TORONTO-DOMINION BANK, THE
2691
0
0
For Directcash Bank, some of the duplicates are due to the same record assigned once to its actual address, and once to a PO Box. Otherwise, the records look exactly the same.
Code
tagged_2_2023 %>%filter(duplicate_tag_1 ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%filter(inst_assigned_term_id =="D1018430"| inst_assigned_term_id =="D1027205"| inst_assigned_term_id =="D1032094") %>%arrange(inst_assigned_term_id) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="blue") %>%row_spec(6, color ="blue")
Sponsor Name
Owner Name
Terminal ID
Address
City
Location Name
DIRECTCASH BANK
ALVENA CO-OPERATIVE ASSOCIATION LTD
D1018430
103 RAILWAY AVE
ALVENA
ALVENA COOP
DIRECTCASH BANK
ALVENA CO-OPERATIVE ASSOCIATION LTD
D1018430
PO BOX 150
ALVENA
ALVENA COOP
DIRECTCASH BANK
DC PAYMENTS CA
D1027205
215 VICTORIA AVE
MONARCH
MONARCH POST OFFICE
DIRECTCASH BANK
DC PAYMENTS CA
D1027205
PO BOX 168
MONARCH
MONARCH POST OFFICE
DIRECTCASH BANK
TIN ROOSTER HOLDINGS LTD
D1032094
BOULDER AVE
DEASE LAKE
DEASE LAKE SUPER A
DIRECTCASH BANK
TIN ROOSTER HOLDINGS LTD
D1032094
PO BOX 160
DEASE LAKE
DEASE LAKE SUPER A
Deduplication rule: When DirectCash Bank records share the same terminal ID but have different addresses, we drop the record where the address contains a PO Box and retain the record with a physical address.
Scotiabank duplicates have the same record with its address written differently, such as different city names with the same address, or one record ignores the street number while the other does have it. These are real duplicates as well.
Code
tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, last_reported, inst_assigned_term_id, address1, city_name, latitude, longitude) %>%filter(inst_assigned_term_id =="AE25"| inst_assigned_term_id =="AH08") %>%arrange(inst_assigned_term_id) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Last Reported","Terminal ID", "Address", "City", "Latitude", "Longitude")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="red") %>%row_spec(4, color ="green") %>%row_spec(5, color ="green")
Sponsor Name
Owner Name
Last Reported
Terminal ID
Address
City
Latitude
Longitude
BANK OF NOVA SCOTIA
SCOTIABANK
24-11-2023 14:20:10
AE25
18TH ST RICHMOND AVE
BRANDON
49.82750
-99.96201
BANK OF NOVA SCOTIA
SCOTIABANK
24-11-2023 14:20:10
AE25
1570 18TH ST
BRANDON
49.82204
-99.96358
BANK OF NOVA SCOTIA
SCOTIABANK
24-11-2023 14:20:10
AE25
18TH ST RICHMOND AVE
BRANDON
49.82750
-99.96201
BANK OF NOVA SCOTIA
SCOTIABANK
24-11-2023 14:20:10
AH08
3401 DUFFERIN ST
TORONTO
43.72598
-79.45241
BANK OF NOVA SCOTIA
SCOTIABANK
24-11-2023 14:20:10
AH08
3401 DUFFERIN ST
NORTH YORK
43.72611
-79.45227
BMO ATM duplicates seem to be the same ATM reported under ownership of both Cardtronics and Direct Cash. Since Cardtronics had already acquired Direct Cash in 2017, this seems to be real duplicates as well.
Code
tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name) %>%filter(inst_assigned_term_id =="D1018373"| inst_assigned_term_id =="D1057885"| inst_assigned_term_id =="D1066100") %>%arrange(inst_assigned_term_id) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")
Sponsor Name
Owner Name
Terminal ID
Address
City
BANK OF MONTREAL
DIRECT CASH
D1018373
4650 SHEPPARD AVE E
SCARBOROUGH
BANK OF MONTREAL
CARDTRONICS
D1018373
1691 CT GAS SCARBOROUGH
SCARBOROUGH
BANK OF MONTREAL
DIRECT CASH
D1057885
165 MAIN ST
ANTIGONISH
BANK OF MONTREAL
CARDTRONICS
D1057885
476 CENTRE ST N
NAPANEE
BANK OF MONTREAL
DIRECT CASH
D1066100
6250 NO 3 RD
RICHMOND
BANK OF MONTREAL
CARDTRONICS
D1066100
RICHMOND BRIDGEHOUSE STATION 6250 NO 3 ROAD
RICHMOND
Deduplication rule: When records within the same sponsor (BMO) share the same terminal ID but have different owner names, specifically Cardtronics and Direct Cash, we drop the record associated with Cardtronics and retain the record associated with DirectCash..
The other BMO ATM duplicates have two owners as well: MBE and RapidCash.
Code
tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name) %>%filter(inst_assigned_term_id =="RCA00539"| inst_assigned_term_id =="RCA00540"| inst_assigned_term_id =="RCA00630") %>%arrange(desc(inst_assigned_term_id)) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")
Sponsor Name
Owner Name
Terminal ID
Address
City
BANK OF MONTREAL
MBE
RCA00630
1ST ST
VERNON
BANK OF MONTREAL
RAPID CASH ATM LTD
RCA00630
12900 29 AVE
VERNON
BANK OF MONTREAL
MBE
RCA00540
TC HWY 1
CALGARY
BANK OF MONTREAL
RAPID CASH ATM LTD
RCA00540
1414 3 ST SW
CALGARY
BANK OF MONTREAL
MBE
RCA00539
TC HWY 1
CALGARY
BANK OF MONTREAL
RAPID CASH ATM LTD
RCA00539
1414 3 ST SW
CALGARY
Deduplication rule: When records within the same sponsor (BMO) share the same terminal ID but have different owner names, specifically MBE and Rapid Cash, we drop the record associated with MBE and retain the record associated with Rapid Cash.
Across Sponsors
We begin by treating the terminal ID as a unique identifier. To manage duplicates, we add a column called duplicate tag that indicates whether each terminal ID appears once, twice, or more than twice. This tagging helps highlight duplicates in the data:
Tag 0: The terminal ID appears only once (i.e., it’s unique).
Tag 1: The terminal ID appears exactly twice.
Tag 2: The terminal ID appears more than twice.
Finally, we count the number of occurrences for each combination of sponsor name and duplicate tag to better understand the distribution of duplicates.
Code
# Group the raw Mastercard 2023 data by 'inst_assigned_term_id'# inst_assigned_term_id is terminal ID in the 2023 mastercard datatagged_2023 <- mc_raw_2023 %>%group_by(inst_assigned_term_id) %>%# Create a new column 'duplicate_tag' to label the number of duplicates:# - Assign 0 if there's only one record for this 'inst_assigned_term_id'# - Assign 1 if there are exactly two records# - Assign 2 if there are more than two recordsmutate(duplicate_tag =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%# Remove the grouping structureungroup()# Count the occurrences of each 'sponsor_name' by 'duplicate_tag', # then reshape the data to have columns for each 'duplicate_tag' (0, 1, 2)tagged_2023 %>%count(sponsor_name, duplicate_tag) %>%pivot_wider(names_from = duplicate_tag, values_from = n, names_prefix ="tag_") %>%# Replace missing values (NA) with 0 in the reshaped datamutate(across(where(is.integer), ~replace_na(., 0))) %>%# Create a formatted table with 'kable' for better presentationkbl(caption ="Table 1: Duplicates Across Sponsors (2023)", col.names =c("Sponsor Name", "Tag 0", "Tag 1", "Tag 2")) %>%kable_classic() %>%kable_styling(full_width =FALSE)
Table 1: Duplicates Across Sponsors (2023)
Sponsor Name
Tag 0
Tag 1
Tag 2
BANK OF MONTREAL
9915
27845
167
BANK OF NOVA SCOTIA
2341
136
23
CANADIAN IMPERIAL BANK OF COMMERCE
2714
239
15
DIRECTCASH BANK
4493
2806
84
EVERLINK PAYMENT SERVICES INC.
1747
1
103
FEDERATION DES CAISSES DESJARDINS DU QUEBEC
1875
0
0
HSBC BANK CANADA
168
0
0
MONERIS SOLUTIONS CORPORATION
1460
24833
104
NATIONAL BANK OF CANADA
927
0
0
ROYAL BANK OF CANADA
3861
92
2
TORONTO-DOMINION BANK, THE
2691
0
0
It seems that the duplicates are concentrated in BMO (i.e. Moneris) and Directcash (which has been acquired by BMO according to MC).
BMO- Moneris Duplicates
BMO has 9915 unique terminal IDs, but 27845 Terminals appear more than once. And 167 Terminals appear at least twice. Moneris has 1460 unique terminal IDs, and 24833 are duplicates. And 104 terminals appear more than once. These are the primary drivers of the duplicates.
Deduplication rule: When BMO and Moneris share the same terminal ID, we use the Sponsor Name to determine which record to keep. We drop the record associated with Moneris and retain the record associated with BMO.
BMO- Direct Cash Duplicates
We also know that direct cash has been acquired by BMO. In these cases, we see the sponsor is Direct Cash in one entry, and the owner is direct cash in another entry.
Code
filter_dc_examples <-c("D4024406", "D4026911", "D4018524", "D4018462")tagged_2023 %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name, duplicate_tag) %>%filter(duplicate_tag ==1) %>%filter(inst_assigned_term_id %in% filter_dc_examples) %>%select(-duplicate_tag) %>%arrange(desc(inst_assigned_term_id)) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="blue") %>%row_spec(6, color ="blue") %>%row_spec(7, color ="orange") %>%row_spec(8, color ="orange")
Sponsor Name
Owner Name
Terminal ID
Address
City
Location Name
BANK OF MONTREAL
DIRECT CASH
D4026911
204 DAKOTA DUNES WAY
WHITECAP
SIGA DAKOTA DUNES 1
DIRECTCASH BANK
SASKATCHEWAN INDIAN GAMING AUTHORITY INC
D4026911
204 DAKOTA DUNES WAY
WHITECAP
SIGA DAKOTA DUNES 1
BANK OF MONTREAL
DIRECT CASH
D4024406
9208 JASPER AVE NW
EDMONTON
RIVERVIEW HUSKY 2
DIRECTCASH BANK
ROCKY STAR ENTERPRISE LTD
D4024406
9208 JASPER AVE NW
EDMONTON
RIVERVIEW HUSKY 2
BANK OF MONTREAL
DIRECT CASH
D4018524
377 RIVERSIDE DR E
WINDSOR
CAESARS WINDSOR EMPLOYEES LOUNGE
DIRECTCASH BANK
DC PAYMENTS CA
D4018524
377 RIVERSIDE DR E
WINDSOR
CAESARS WINDSOR EMPLOYEES LOUNGE
BANK OF MONTREAL
DIRECT CASH
D4018462
44 MARQUIS RD W
PRINCE ALBERT
SIGA NORTHERN LIGHTS 2
DIRECTCASH BANK
DC PAYMENTS CA
D4018462
44 MARQUIS RD W
PRINCE ALBERT
SIGA NORTHERN LIGHTS 2
Deduplication rule: When BMO and DirectCash Bank share the same terminal ID, we use the Sponsor Name to determine which record to keep. We drop the record associated with DirectCash Bank and retain the record associated with BMO.
Other
Most of the duplicates in the 2023 data originate from Moneris and Direct Cash. By applying the rules above, we can clean up most of the data, but a few other duplicates may still need attention.
For the few duplicate CIBC ATMs, it seems that they are actually owned by Cardtronics, but branded by CIBC.
Code
filter_cibc_examples <-c("MS909C9E", "MS909G9J", "MS909G9C", "MS909F9W")tagged_2023 %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name, duplicate_tag) %>%filter(duplicate_tag ==1) %>%filter(inst_assigned_term_id %in% filter_cibc_examples) %>%select(-duplicate_tag) %>%arrange(desc(inst_assigned_term_id)) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="blue") %>%row_spec(6, color ="blue") %>%row_spec(7, color ="orange") %>%row_spec(8, color ="orange")
Sponsor Name
Owner Name
Terminal ID
Address
City
Location Name
CANADIAN IMPERIAL BANK OF COMMERCE
CIBC
MS909G9J
1131 DUNDAS ST W
MISSISSAUGA
GIANT TIGER
BANK OF MONTREAL
CARDTRONICS
MS909G9J
1131 DUNDAS ST W UNIT M1
MISSISSAUGA
GIANT TIGER 234
CANADIAN IMPERIAL BANK OF COMMERCE
CIBC
MS909G9C
9225 HWY 93
MIDLAND
GIANT TIGER
BANK OF MONTREAL
CARDTRONICS
MS909G9C
9225 COUNTY RD 93 HURONIA MALL
MIDLAND
GIANT TIGER 128
CANADIAN IMPERIAL BANK OF COMMERCE
CIBC
MS909F9W
800 BOUL THIBEAU
TROISRIVIRES
GIANT TIGER
BANK OF MONTREAL
CARDTRONICS
MS909F9W
800 BOUL THIBEAU
TROISRIVIRES
GIANT TIGER 116
CANADIAN IMPERIAL BANK OF COMMERCE
CIBC
MS909C9E
1250 MAIN ST
OTTAWA
IDA PHARMACY
BANK OF MONTREAL
CARDTRONICS
MS909C9E
1250 MAIN ST
OTTAWA
IDA PHARMACY
Deduplication rule: Retain the record associated with CIBC
It seems that the RBC and Scotiabank’s “duplicated ATMs” are not duplicates, but different ATMs present in their respective branches. This could be because the terminal variable is only 4 digits in each case. For example, there are NOT duplicates.
Deduplication rule: These are not real duplicates. Nothing to do here.
Now, we explore the cases of duplicates with Tag 2 – i.e., when there are more than two instances of the same terminal ID.
There are two cases of double ATMs in Scotiabank branches that have the same 4-digit terminal ID as RBC ATMs. As mentioned in the previous paragraph, these are not real duplicates, so we do not drop them.
Deduplication rule: These are not true duplicates across sponsors, so we do not perform deduplication between different sponsors.
For some Direct Cash ATMs, there were two duplicate records already in 2019. Coupled with the BMO record in 2023, that makes three of the same.
Code
tagged_2023 %>%select(sponsor_name, owner_name, last_reported, inst_assigned_term_id, address1, location_name, duplicate_tag) %>%filter(inst_assigned_term_id =="D2069425"| inst_assigned_term_id =="D2069137") %>%select(-duplicate_tag) %>%arrange(desc(inst_assigned_term_id)) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Last Reported", "Terminal ID", "Address", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="red") %>%row_spec(4, color ="blue") %>%row_spec(5, color ="blue") %>%row_spec(6, color ="blue")
Sponsor Name
Owner Name
Last Reported
Terminal ID
Address
Location Name
DIRECTCASH BANK
DC PAYMENTS CA
24-05-2019 09:40:38
D2069425
3456 HIGHWAY 138
MACEWEN RANGER CONVENIENCE
DIRECTCASH BANK
DC PAYMENTS CA
23-05-2019 16:59:37
D2069425
PO BOX 45
MACEWEN RANGER CONVENIENCE
BANK OF MONTREAL
CARDTRONICS
13-12-2023 15:55:39
D2069425
3456 HWY 138
ESSO MONKLAND 25005
BANK OF MONTREAL
DIRECT CASH
13-12-2023 15:59:40
D2069137
455 BROADWAY ST W
GALLAGHER CENTRE
DIRECTCASH BANK
GALLAGHER CENTRE- CITY OF YORKTON
24-05-2019 09:40:30
D2069137
455 BROADWAY AVE W BOX 908
GALLAGHER CENTRE
DIRECTCASH BANK
GALLAGHER CENTRE- CITY OF YORKTON
23-05-2019 16:58:50
D2069137
PO BOX 908 STN DELCENTRE
GALLAGHER CENTRE
Deduplication rule: Keep BMO.
Some CIBC-Branded ATMs were originally Direct Cash ATMs, acquired by Cardtronics and reported via BMO. For these ATMs there are three sponsors reporting them as well.
There are 88642 observations in the 2023 Mastercard data.
Eliminate duplicates within sponsor
Code
df_1_2023 <- mc_raw_2023 %>%group_by(sponsor_name, inst_assigned_term_id, state_prov) %>%mutate(tag1 =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup() %>%arrange(state_prov, tag1, inst_assigned_term_id) %>%mutate(todrop =0, todrop =if_else(owner_name =="MBE", todrop +1, todrop),todrop =if_else(grepl("PO BOX", address1), todrop +1, todrop),todrop =if_else(is.na(latitude), todrop +1, todrop),todrop =if_else(tag1 ==1& owner_name =="DIRECT CASH", todrop +1, todrop)) %>%arrange(desc(tag1), desc(inst_assigned_term_id), desc(todrop), desc(address1))df_2_2023 <- df_1_2023 %>%arrange(tag1, inst_assigned_term_id, todrop, address1) %>%group_by(tag1,inst_assigned_term_id) %>%mutate(dupeorder =row_number()) %>%ungroup() %>%filter(tag1 <1| dupeorder ==1) %>%select(-todrop, -tag1)# mc_raw_2023 %>%# group_by(inst_assigned_term_id) %>%# # Flag PO Box entries for removal if the same terminal ID has a real address# mutate(todrop = if_else(grepl("PO BOX", address1) & n() > 1, 1, 0)) %>%# # Remove the flagged PO Box entries# filter(todrop == 0) %>%# ungroup() %>%# select(-todrop)
Then eliminate duplicates with different sponsors
Code
df_3_2023 <- df_2_2023 %>%group_by(inst_assigned_term_id, state_prov) %>%mutate(tag =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup() %>%mutate(tokeep =0) %>%# when BMO and Moneris share a terminal ID, we are keeping BMOmutate(tokeep =if_else(tag >=1& sponsor_name =="BANK OF MONTREAL", tokeep +1, tokeep),# When directcash bank and some other sponsor (BMO) share an terminal ID, we remove# direct cash bank tokeep =if_else(tag >=1& sponsor_name =="DIRECTCASH BANK", tokeep -1, tokeep),# if cibc shares an ATM with another sponsor we retain CIBCtokeep =if_else(tag >=1& sponsor_name =="CANADIAN IMPERIAL BANK OF COMMERCE", tokeep +1, tokeep),### RBC AND SCOTIA BANK DUPLICATES ACROSS SPONSORS ARE NOT REAL DUPLICATEStokeep =if_else(tag >=1& sponsor_name =="ROYAL BANK OF CANADA", tokeep +100, tokeep),tokeep =if_else(tag >=1& sponsor_name =="BANK OF NOVA SCOTIA", tokeep +100, tokeep)) %>%arrange(inst_assigned_term_id, desc(tokeep)) %>%arrange(tokeep)df_4_2023 <- df_3_2023 %>%group_by(inst_assigned_term_id) %>%mutate(dupeorder =row_number()) %>%ungroup() %>%filter(tokeep >=10| dupeorder ==1) %>%select(-tokeep, -tag) %>%# I am dropping this here for now because I wanted the observations to match what Hong Yu had in his stata script.# when I update the document I will probably move this to a later stagefilter(owner_name !="FEDERATION DES CAISSES POPULAIRES ACADIENNES")
After deduplication, there are 60306 observations in the 2023 Mastercard data.
tagged_2_2023 <- mc_raw_2022 %>%group_by(inst_assigned_term_id, sponsor_name) %>%mutate(duplicate_tag_1 =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup()tagged_2_2023 %>%count(sponsor_name, duplicate_tag_1) %>%pivot_wider(names_from = duplicate_tag_1, values_from = n, names_prefix ="tag_") %>%# Replace missing values (NA) with 0 in the reshaped datamutate(across(where(is.integer), ~replace_na(., 0))) %>%# Create a formatted table with 'kable' for better presentationkbl(caption ="Table 4: Duplicates Within Sponsors (2022)", col.names =c("Sponsor Name", "Tag 0", "Tag 1")) %>%kable_classic()
Table 4: Duplicates Within Sponsors (2022)
Sponsor Name
Tag 0
Tag 1
BANK OF MONTREAL
2950
0
BANK OF NOVA SCOTIA
2402
80
CANADIAN IMPERIAL BANK OF COMMERCE
2999
0
DIRECTCASH BANK
7271
96
EVERLINK PAYMENT SERVICES INC.
2279
0
FEDERATION DES CAISSES DESJARDINS DU QUEBEC
1875
0
MONERIS SOLUTIONS CORPORATION
34540
0
NATIONAL BANK OF CANADA
929
0
ROYAL BANK OF CANADA
3989
0
TORONTO-DOMINION BANK THE
2749
0
For Directcash Bank, the duplicates are due to the same record assigned once to its actual address, and once to a PO Box. Otherwise the records look exactly the same.
Code
tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name) %>%filter(sponsor_name =="DIRECTCASH BANK") %>%arrange(inst_assigned_term_id) %>%filter(inst_assigned_term_id !="C4066615") %>%filter(inst_assigned_term_id !="C4066617") %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")
Sponsor Name
Owner Name
Terminal ID
Address
City
DIRECTCASH BANK
DC PAYMENTS CA
C4066599
PO BOX 3725 STN CENTRAL
VICTORIA
DIRECTCASH BANK
DC PAYMENTS CA
C4066599
NELLES BLOCK CFB ESQUIMALT
VICTORIA
DIRECTCASH BANK
DC PAYMENTS CA
C4066616
PO BOX 250 BUILDING 100
SHEARWATER
DIRECTCASH BANK
DC PAYMENTS CA
C4066616
12 WING SHEARWATER BUILDING 100
SHEARWATER
DIRECTCASH BANK
CHARLIES LUNCH
D1016013
PO BOX 172
EASTEND
DIRECTCASH BANK
CHARLIES LUNCH
D1016013
327 RED COAT DR
EASTEND
Deduplication rule: When DirectCash Bank records share the same terminal ID but have different addresses, we drop the record where the address contains a PO Box and retain the record with a physical address.
Scotiabank duplicates have the same record with its address written differently, such as different city names with the same address, or one record ignores the street number while the other does have it. These are real duplicates as well.
Code
tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name) %>%filter(sponsor_name =="BANK OF NOVA SCOTIA") %>%arrange(inst_assigned_term_id) %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")
Most duplicates (around 4000) have the sponsor being Directcash in one entry and the owner being Direct Cash in the other entry
Direct Cash Bank - Moneris
Code
tagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%filter(inst_assigned_term_id =="D1014793"| inst_assigned_term_id =="D1037192") %>%arrange(inst_assigned_term_id) %>%head(4) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green")
Sponsor Name
Owner Name
Terminal ID
Address
City
Location Name
DIRECTCASH BANK
AVONDALE STORES LIMITED
D1014793
242 KILLALY ST W
PORT COLBORNE
AVONDALE 033
MONERIS SOLUTIONS CORPORATION
DIRECT CASH
D1014793
242 KILLALY ST W
PORT COLBORNE
AVONDALE 033
MONERIS SOLUTIONS CORPORATION
DIRECT CASH
D1037192
252 THE ESPLANADE
TORONTO
VILLAGES MINI MART
DIRECTCASH BANK
DC PAYMENTS CA
D1037192
252 THE ESPLANADE
TORONTO
VILLAGES MINI MART
Deduplication rule: When DIRECTCASH BANK and MONERIS share the same terminal ID, we use the Sponsor Name to determine which record to keep. We drop the record associated with DirectCash Bank and retain the record associated with MONERIS.
Other
For the few duplicate CIBC ATMs, it seems that they are actually owned by Cardtronics, but branded by CIBC.
Code
tagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%group_by(inst_assigned_term_id) %>%filter(all(c("CANADIAN IMPERIAL BANK OF COMMERCE", "MONERIS SOLUTIONS CORPORATION") %in% sponsor_name)) %>%ungroup() %>%arrange(desc(inst_assigned_term_id)) %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")
Sponsor Name
Owner Name
Terminal ID
Address
City
Location Name
CANADIAN IMPERIAL BANK OF COMMERCE
CIBC
PI909G9X
1315 COUNTY RD 28
FRASERVILLE
PIONEER GAS STATION
MONERIS SOLUTIONS CORPORATION
CARDTRONICS
PI909G9X
1315 COUNTY ROAD 28
FRASERVILLE
FRASERVILLE OTR 43251
MONERIS SOLUTIONS CORPORATION
CARDTRONICS
PI909G9W
453 DUNLOP ST W
BARRIE
PIONEER OTR42339
CANADIAN IMPERIAL BANK OF COMMERCE
CIBC
PI909G9W
453 DUNLOP ST W
BARRIE
PIONEER GAS STATION
CANADIAN IMPERIAL BANK OF COMMERCE
CIBC
PI909G9V
1089 TALBOT ST
ST THOMAS
PIONEER GAS STATION
MONERIS SOLUTIONS CORPORATION
CARDTRONICS
PI909G9V
1089 TALBOT ST
ST THOMAS
PIONEER OTR 43233
Deduplication rule: Retain the record associated with CIBC
It seems that the RBC and Scotiabank’s “duplicated ATMs” are actually not duplicates, but different ATMs present in their respective branches. This could to be because the terminal ID variable is only 4 digit in each case.
Code
tagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%group_by(inst_assigned_term_id) %>%filter(all(c("ROYAL BANK OF CANADA", "BANK OF NOVA SCOTIA") %in% sponsor_name)) %>%ungroup() %>%arrange(desc(inst_assigned_term_id)) %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")
Sponsor Name
Owner Name
Terminal ID
Address
City
Location Name
ROYAL BANK OF CANADA
ROYAL BANK OF CANADA
WD21
85 MAIN ST
MELITA
MELITA BRWH28 AUD
BANK OF NOVA SCOTIA
SCOTIABANK
WD21
1195 ARTHUR ST W
THUNDER BAY
ARTHUR PARKDALE
ROYAL BANK OF CANADA
ROYAL BANK OF CANADA
WC23
220 PORTAGE AVE
WINNIPEG
WINNIPEG MNWE54AUD
BANK OF NOVA SCOTIA
SCOTIABANK
WC23
1091 ST MARYS RD
WINNIPEG
DAKOTA VILLAGE 3
ROYAL BANK OF CANADA
ROYAL BANK OF CANADA
WC21
220 PORTAGE AVE
WINNIPEG
WINNIPEG MAIN BR AUD
BANK OF NOVA SCOTIA
SCOTIABANK
WC21
318 1 ST
WINNIPEGOSIS
WINNIPEGOSIS BRANCH
Deduplication rule: These are not real duplicates. Nothing to do here.
It is also true that NBC and TD’s “duplicated ATMs” are actually not duplicates, but different ATMs present in their respective branches.
Code
tagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%group_by(inst_assigned_term_id) %>%filter(all(c("TORONTO-DOMINION BANK THE", "NATIONAL BANK OF CANADA") %in% sponsor_name)) %>%ungroup() %>%arrange(desc(inst_assigned_term_id)) %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")
Sponsor Name
Owner Name
Terminal ID
Address
City
Location Name
TORONTO-DOMINION BANK THE
TORONTO-DOMINION BANK
7772
2025 PRINCE OF WALES DR
REGINA
VICTORIA SQUARE
NATIONAL BANK OF CANADA
NATIONAL BANK OF CANADA
7772
560 RUE SACR-COEUR O
ALMA
SUCC FERME
NATIONAL BANK OF CANADA
NATIONAL BANK OF CANADA
7771
3261 1IRE AV
RAWDON
SUCCURSALE
TORONTO-DOMINION BANK THE
TORONTO-DOMINION BANK
7771
2025 PRINCE OF WALES DR
REGINA
VICTORIA SQUARE
NATIONAL BANK OF CANADA
NATIONAL BANK OF CANADA
7770
585 AV SAINT-CHARLES
VAUDREUIL-DORION
SUCCURSALE
TORONTO-DOMINION BANK THE
TORONTO-DOMINION BANK
7770
2025 PRINCE OF WALES DR
REGINA
VICTORIA SQUARE
Deduplication rule: These are not real duplicates. Nothing to do here.
Some Scotiabank ATMs are double counted due to same address but different city names
Code
tagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%#group_by(inst_assigned_term_id) %>%filter(inst_assigned_term_id =="AA13"| inst_assigned_term_id =="AA21") %>%# filter(all(c("TORONTO-DOMINION BANK THE", "NATIONAL BANK OF CANADA") %in% sponsor_name)) %>%# ungroup() %>% arrange(desc(inst_assigned_term_id)) %>%head(4) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green")
Sponsor Name
Owner Name
Terminal ID
Address
City
Location Name
BANK OF NOVA SCOTIA
SCOTIABANK
AA21
305 PORT UNION RD
WEST HILL
PORT UNION FANFARE 2
BANK OF NOVA SCOTIA
SCOTIABANK
AA21
305 PORT UNION RD
SCARBOROUGH
PORT UNION FANFARE 2
BANK OF NOVA SCOTIA
SCOTIABANK
AA13
305 PORT UNION RD
WEST HILL
PORT UNION FANFARE 1
BANK OF NOVA SCOTIA
SCOTIABANK
AA13
305 PORT UNION RD
SCARBOROUGH
PORT UNION FANFARE 1
Deduplication rule: Keep one observation
Deduplicate (2022)
There are 62159 observations in the 2023 Mastercard data.
Eliminate duplicates within sponsor
Code
# Tag duplicates within the same sponsor and provincedf_1_2022 <- mc_raw_2022 %>%group_by(sponsor_name, inst_assigned_term_id, state_prov) %>%mutate(tag1 =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup() %>%arrange(state_prov, tag1, inst_assigned_term_id) %>%mutate(todrop =0,todrop =if_else(grepl("PO BOX", address1), todrop +1, todrop),todrop =if_else(is.na(latitude), todrop +1, todrop)) %>%arrange(desc(tag1), desc(inst_assigned_term_id), desc(todrop), desc(address1))# Drop the duplicates within the same sponsor and provincedf_2_2022 <- df_1_2022 %>%arrange(tag1, inst_assigned_term_id, todrop, address1) %>%group_by(tag1, inst_assigned_term_id) %>%mutate(dupeorder =row_number()) %>%ungroup() %>%filter(tag1 <1| dupeorder ==1) %>%select(-todrop, -tag1)
Then eliminate duplicates with different sponsors
Code
# Tag duplicates across different sponsorsdf_3_2022 <- df_2_2022 %>%group_by(inst_assigned_term_id, state_prov) %>%mutate(tag =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup() %>%mutate(tokeep =0) %>%mutate(tokeep =if_else(tag >=1& sponsor_name =="MONERIS SOLUTIONS CORPORATION", tokeep +1, tokeep),tokeep =if_else(tag >=1& sponsor_name =="DIRECTCASH BANK", tokeep -1, tokeep),tokeep =if_else(tag >=1& sponsor_name =="CANADIAN IMPERIAL BANK OF COMMERCE", tokeep +1, tokeep),tokeep =if_else(tag >=1& sponsor_name =="ROYAL BANK OF CANADA", tokeep +100, tokeep),tokeep =if_else(tag >=1& sponsor_name =="BANK OF NOVA SCOTIA", tokeep +100, tokeep),tokeep =if_else(tag >=1& sponsor_name =="NATIONAL BANK OF CANADA", tokeep +100, tokeep),tokeep =if_else(tag >=1& sponsor_name =="TORONTO-DOMINION BANK THE", tokeep +100, tokeep)) %>%arrange(inst_assigned_term_id, desc(tokeep)) %>%arrange(tokeep)# Final filtering and cleaningdf_4_2022 <- df_3_2022 %>%group_by(inst_assigned_term_id) %>%mutate(dupeorder =row_number()) %>%ungroup() %>%filter(tokeep >=10| dupeorder ==1) %>%select(-tokeep, -tag)
After deduplication, there are 57698 observations in the 2022 Mastercard data.
Step 2: Updating Desjardins ATM Data
Mastercard data is unreliable for Desjardins, as the 2019, 2022, and 2023 datasets show identical ATM counts, which is unlikely given Desjardins’ ongoing ATM closures (see Table 5 below). Therefore, we rely on external data sources for Desjardins.
Code
mc_raw_2019 <-read_csv("data/unprocessed/mastercard/BOC_All_ATMImage_(Nov 2019).txt", guess_max =50000) %>%clean_names()desjardins_19 <- mc_raw_2019 %>%count(owner_name) %>%filter(owner_name =="LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC"| owner_name =="LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO") %>%mutate(date ="Nov 2019")desjardins_22 <- df_4_2022 %>%count(owner_name) %>%filter(owner_name =="LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC"| owner_name =="LA FEDERATION DES CAISSES POPULAIRES DE LONTARIO") %>%mutate(date ="Nov 2022")desjardins_23 <- df_4_2023 %>%count(owner_name) %>%filter(owner_name =="LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC"| owner_name =="LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO") %>%mutate(date ="Dec 2023")desjardins_19 %>%bind_rows(desjardins_22) %>%bind_rows(desjardins_23) %>%mutate(owner_name =case_when(str_detect(owner_name, "LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC") ~"LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC",str_detect(owner_name, "LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO|LA FEDERATION DES CAISSES POPULAIRES DE LONTARIO") ~"LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO",TRUE~ owner_name )) %>%pivot_wider(names_from = date, values_from = n) %>%kbl(caption ="Table 5: Desjardins ATM Counts Across Years (Mastercard Data)", col.names =c("Owner Name", "November 2019", "November 2022", "November 2023"))%>%kable_classic()
Table 5: Desjardins ATM Counts Across Years (Mastercard Data)
Owner Name
November 2019
November 2022
November 2023
LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC
1706
1706
1706
LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO
63
63
63
In the fall of 2022 and 2023, we webscraped Desjardins ATM data directly from their website. Unfortunately, we were unable to do this for 2019, so our improvements focus on the 2022 and 2023 datasets. The matching process for Desjardins ATMs involves the following steps:
Fuzzy Matching: We first fuzzy match the addresses from the Mastercard ATMs to the webscraped ATMs using the fuzzyjoin package. We use the Jaro-Winkler string distance with a threshold of 0.2 (lower values indicate better matches) and reject any matches exceeding this threshold. This results in a matched dataset and two sets of rejected matches (one from Mastercard, one from the webscraped data).
Manual Review: Next, we manually review the rejected matches to identify any false negatives. We carefully inspect and manually match addresses that were not captured by the fuzzy matching process. This results in four datasets: string-matched, manually-matched, rejected Mastercard matches, and rejected webscraped matches.
Consolidation: We then combine the string-matched and manually-matched datasets to create a consolidated set of matched ATMs.
Final Integration: Finally, since the webscraped data is more current and reliable, we add back the unmatched webscraped ATMs (those that didn’t have a corresponding match in the Mastercard data) to the matched dataset.
In summary, the final matched dataset is built through a process of initial fuzzy matching, manual review of unmatched data, and integrating unmatched webscraped ATMs back into the dataset.
library(downloadthis)library(gt)library(fuzzyjoin)# Perform fuzzy matchingmatched_data_23 <-stringdist_join(x = desjardins_2023_mc,y = desjardins_atm_2023_webscraped,by ="address_clean",method ="jw",max_dist =0.21,ignore_case =TRUE,mode ="inner",distance_col ="dist")# Filter to keep the best match for each mc_idbest_matches_23 <- matched_data_23 %>%group_by(mc_id) %>%slice_min(order_by = dist, n =1, with_ties =FALSE) %>%ungroup() %>%select(-latitude.y, -longitude.y, -name) %>%rename(longitude = longitude.x,latitude = latitude.x,address_mc = address_clean.x,address_webscraped = address_clean.y) %>%select(mc_id, webscraped_id, everything())# Identify unmatched records in dataset 1 (to discard)unmatched_mc <- desjardins_2023_mc %>%anti_join(best_matches_23, by ="mc_id")# Identify unmatched records in dataset 2 (to retain)unmatched_webscraped <- desjardins_atm_2023_webscraped %>%anti_join(best_matches_23, by ="webscraped_id")### After manual inspection there were many addresses that should have been matched but were missed in the string matching procedure### In this section I am adding them back into the mc data
We matched 1354 ATMs from the Mastercard data to the web scraped data after string matching. You can download the data below.
Code
library(downloadthis)best_matches_23 %>%download_this(output_name ="desjardin_matching_23_v1",output_extension =".xlsx",button_label ="Download data as xlsx",button_type ="default",has_icon =TRUE,icon ="fa fa-save" )
Next, we review the ATMs that were rejected from both the Mastercard and webscraped data, and performed manual matching where necessary. The table below gives us the ATMs from the webscraped data that did not meet the threshold for string matching
Using the two tables above, we manually match the false negatives based on their IDs. You can search for addresses or IDs using the search bar. The code chunk below creates a dataframe to capture these false negatives based on their IDs.
Next, we review the unmatched data from both the Mastercard and webscraped datasets—those records that remained unmatched after both string matching and manual matching. Below, you have the option to download these unmatched records for further analysis or review.
unmatched_webscraped_final %>%download_this(output_name ="unmatched_webscraped_23",output_extension =".xlsx",button_label ="Download data as xlsx",button_type ="default",has_icon =TRUE,icon ="fa fa-save" )
Consolidate (2023)
We consolidate the string-matched and manually matched data, then add the unmatched webscraped data to create the final dataset for Desjardins ATMs in Quebec. The Ontario matching will be handled in the next step.
Next, we will match the Ontario Desjardins ATMs. This matching procedure differs slightly from the one used for Quebec. In 2022 and 2023, we webscraped Ontario Desjardins ATMs, but discovered that this data only included off-site ATMs and excluded those located inside bank branches. After inspecting the Mastercard data for Ontario, we found that it included ATMs inside bank branches.
Therefore, our matching procedure is as follows: First, we fuzzy match the Ontario Desjardins ATMs from Mastercard to the webscraped Ontario Desjardins bank branches, removing any Mastercard ATMs that do not meet a specific address matching threshold. Next, we manually inspect any false negatives and add them back if necessary. Finally, we include the webscraped Ontario Desjardins ATMs in the dataset, as they represent relevant off-site ATMs that should be retained.
---title: "Construction of Automated Banking Machine Location Data"format: html: toc: true toc-depth: 5 code-fold: true code-tools: trueeditor: visual---This appendix outlines the steps taken to construct the automated banking machine (ABM) data used in this paper. The construction process proceeded in three steps, outlined below.## Step 1: Identify and Drop Duplicates in Mastercard Data### Identify Duplicates (2023)```{r}#| warning: false#| output : false# load packages library(tidyverse)library(janitor)library(knitr)library(kableExtra)library(gt)# import raw MasterCard datamc_raw_2023 <-read_csv("data/unprocessed/mastercard/BOC_All_ATMImage(December 2023).txt", guess_max =50000) %>%clean_names()```#### Within SponsorsIn this step, we are identifying duplicates within each sponsor. We treat both the terminal ID and sponsor name as a combined key. We then add a column called `duplicate_tag_1` that indicates whether each terminal ID appears once, twice, or more than twice within the same sponsor:- **Tag 0**: The terminal ID appears only once within a sponsor.- **Tag 1**: The terminal ID appears exactly twice within a sponsor.- **Tag 2**: The terminal ID appears more than twice within a sponsor.This helps us manage duplicates at the sponsor level and ensures we accurately identify repeated terminal IDs within each sponsor's dataset.```{r, cache=TRUE}tagged_2_2023 <- mc_raw_2023 %>%group_by(inst_assigned_term_id, sponsor_name) %>%mutate(duplicate_tag_1 =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup()tagged_2_2023 %>%count(sponsor_name, duplicate_tag_1) %>%pivot_wider(names_from = duplicate_tag_1, values_from = n, names_prefix ="tag_") %>%# Replace missing values (NA) with 0 in the reshaped datamutate(across(where(is.integer), ~replace_na(., 0))) %>%# Create a formatted table with 'kable' for better presentationkbl(caption ="Table 2: Duplicates Within Sponsors (2023)", col.names =c("Sponsor Name", "Tag 0", "Tag 1", "Tag 2")) %>%kable_classic()```For Directcash Bank, some of the duplicates are due to the same record assigned once to its actual address, and once to a PO Box. Otherwise, the records look exactly the same.```{r}tagged_2_2023 %>%filter(duplicate_tag_1 ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%filter(inst_assigned_term_id =="D1018430"| inst_assigned_term_id =="D1027205"| inst_assigned_term_id =="D1032094") %>%arrange(inst_assigned_term_id) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="blue") %>%row_spec(6, color ="blue")```<u>Deduplication rule: When DirectCash Bank records share the same terminal ID but have different addresses, we drop the record where the address contains a PO Box and retain the record with a physical address.</u>Scotiabank duplicates have the same record with its address written differently, such as different city names with the same address, or one record ignores the street number while the other does have it. These are real duplicates as well.```{r}tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, last_reported, inst_assigned_term_id, address1, city_name, latitude, longitude) %>%filter(inst_assigned_term_id =="AE25"| inst_assigned_term_id =="AH08") %>%arrange(inst_assigned_term_id) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Last Reported","Terminal ID", "Address", "City", "Latitude", "Longitude")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="red") %>%row_spec(4, color ="green") %>%row_spec(5, color ="green")```BMO ATM duplicates seem to be the same ATM reported under ownership of both Cardtronics and Direct Cash. Since Cardtronics had already acquired Direct Cash in 2017, this seems to be real duplicates as well.```{r}tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name) %>%filter(inst_assigned_term_id =="D1018373"| inst_assigned_term_id =="D1057885"| inst_assigned_term_id =="D1066100") %>%arrange(inst_assigned_term_id) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")```<u>Deduplication rule: When records within the same sponsor (BMO) share the same terminal ID but have different owner names, specifically Cardtronics and Direct Cash, we drop the record associated with Cardtronics and retain the record associated with DirectCash..</u>The other BMO ATM duplicates have two owners as well: MBE and RapidCash.```{r}tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name) %>%filter(inst_assigned_term_id =="RCA00539"| inst_assigned_term_id =="RCA00540"| inst_assigned_term_id =="RCA00630") %>%arrange(desc(inst_assigned_term_id)) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")```<u>Deduplication rule: When records within the same sponsor (BMO) share the same terminal ID but have different owner names, specifically MBE and Rapid Cash, we drop the record associated with MBE and retain the record associated with Rapid Cash.</u>#### Across SponsorsWe begin by treating the terminal ID as a unique identifier. To manage duplicates, we add a column called `duplicate tag` that indicates whether each terminal ID appears once, twice, or more than twice. This tagging helps highlight duplicates in the data:- **Tag 0**: The terminal ID appears only once (i.e., it’s unique).- **Tag 1**: The terminal ID appears exactly twice.- **Tag 2**: The terminal ID appears more than twice.Finally, we count the number of occurrences for each combination of sponsor name and duplicate tag to better understand the distribution of duplicates.```{r, cache=TRUE}# Group the raw Mastercard 2023 data by 'inst_assigned_term_id'# inst_assigned_term_id is terminal ID in the 2023 mastercard datatagged_2023 <- mc_raw_2023 %>%group_by(inst_assigned_term_id) %>%# Create a new column 'duplicate_tag' to label the number of duplicates:# - Assign 0 if there's only one record for this 'inst_assigned_term_id'# - Assign 1 if there are exactly two records# - Assign 2 if there are more than two recordsmutate(duplicate_tag =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%# Remove the grouping structureungroup()# Count the occurrences of each 'sponsor_name' by 'duplicate_tag', # then reshape the data to have columns for each 'duplicate_tag' (0, 1, 2)tagged_2023 %>%count(sponsor_name, duplicate_tag) %>%pivot_wider(names_from = duplicate_tag, values_from = n, names_prefix ="tag_") %>%# Replace missing values (NA) with 0 in the reshaped datamutate(across(where(is.integer), ~replace_na(., 0))) %>%# Create a formatted table with 'kable' for better presentationkbl(caption ="Table 1: Duplicates Across Sponsors (2023)", col.names =c("Sponsor Name", "Tag 0", "Tag 1", "Tag 2")) %>%kable_classic() %>%kable_styling(full_width =FALSE)```It seems that the duplicates are concentrated in BMO (i.e. Moneris) and Directcash (which has been acquired by BMO according to MC).##### BMO- Moneris DuplicatesBMO has 9915 unique terminal IDs, but 27845 Terminals appear more than once. And 167 Terminals appear at least twice. Moneris has 1460 unique terminal IDs, and 24833 are duplicates. And 104 terminals appear more than once. These are the primary drivers of the duplicates.Here are examples of BMO- Moneris duplicates.```{r}filter_bmo_examples <-c("001BCLVS", "001NYKBA", "TOT00005")tagged_2023 %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name, duplicate_tag) %>%filter(duplicate_tag ==1) %>%arrange(inst_assigned_term_id, sponsor_name) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%filter(inst_assigned_term_id %in% filter_bmo_examples) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%#kbl(col.names = c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>% row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="blue") %>%row_spec(6, color ="blue") ```<u>Deduplication rule: When BMO and Moneris share the same terminal ID, we use the Sponsor Name to determine which record to keep. We drop the record associated with Moneris and retain the record associated with BMO.</u>##### BMO- Direct Cash DuplicatesWe also know that direct cash has been acquired by BMO. In these cases, we see the sponsor is Direct Cash in one entry, and the owner is direct cash in another entry.```{r}filter_dc_examples <-c("D4024406", "D4026911", "D4018524", "D4018462")tagged_2023 %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name, duplicate_tag) %>%filter(duplicate_tag ==1) %>%filter(inst_assigned_term_id %in% filter_dc_examples) %>%select(-duplicate_tag) %>%arrange(desc(inst_assigned_term_id)) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="blue") %>%row_spec(6, color ="blue") %>%row_spec(7, color ="orange") %>%row_spec(8, color ="orange")```<u>Deduplication rule: When BMO and DirectCash Bank share the same terminal ID, we use the Sponsor Name to determine which record to keep. We drop the record associated with DirectCash Bank and retain the record associated with BMO.</u>##### OtherMost of the duplicates in the 2023 data originate from Moneris and Direct Cash. By applying the rules above, we can clean up most of the data, but a few other duplicates may still need attention.For the few duplicate CIBC ATMs, it seems that they are actually owned by Cardtronics, but branded by CIBC.```{r}filter_cibc_examples <-c("MS909C9E", "MS909G9J", "MS909G9C", "MS909F9W")tagged_2023 %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name, duplicate_tag) %>%filter(duplicate_tag ==1) %>%filter(inst_assigned_term_id %in% filter_cibc_examples) %>%select(-duplicate_tag) %>%arrange(desc(inst_assigned_term_id)) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="blue") %>%row_spec(6, color ="blue") %>%row_spec(7, color ="orange") %>%row_spec(8, color ="orange")```<u>Deduplication rule: Retain the record associated with CIBC</u>It seems that the RBC and Scotiabank's "duplicated ATMs" are not duplicates, but different ATMs present in their respective branches. This could be because the terminal variable is only 4 digits in each case. For example, there are NOT duplicates.```{r}filter_rbc_examples <-c("WD21", "WC23", "VM20")tagged_2023 %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name, duplicate_tag) %>%filter(duplicate_tag ==1) %>%filter(inst_assigned_term_id %in% filter_rbc_examples) %>%select(-duplicate_tag) %>%arrange(inst_assigned_term_id) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() ```<u>Deduplication rule: These are not real duplicates. Nothing to do here.</u>Now, we explore the cases of duplicates with **Tag 2** – i.e., when there are more than two instances of the same terminal ID.There are two cases of double ATMs in Scotiabank branches that have the same 4-digit terminal ID as RBC ATMs. As mentioned in the previous paragraph, these are not real duplicates, so we do not drop them.```{r}filter_trip_bmo_rbc_examples <-c("CM25", "CM23")tagged_2023 %>%arrange(inst_assigned_term_id) %>%filter(duplicate_tag ==2) %>%arrange(desc(inst_assigned_term_id)) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, location_name, duplicate_tag) %>%filter(inst_assigned_term_id %in% filter_trip_bmo_rbc_examples) %>%select(-duplicate_tag) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "Location Name")) %>%kable_classic()```<u>Deduplication rule: These are not true duplicates across sponsors, so we do not perform deduplication between different sponsors. </u>For some Direct Cash ATMs, there were two duplicate records already in 2019. Coupled with the BMO record in 2023, that makes three of the same.```{r}tagged_2023 %>%select(sponsor_name, owner_name, last_reported, inst_assigned_term_id, address1, location_name, duplicate_tag) %>%filter(inst_assigned_term_id =="D2069425"| inst_assigned_term_id =="D2069137") %>%select(-duplicate_tag) %>%arrange(desc(inst_assigned_term_id)) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Last Reported", "Terminal ID", "Address", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="red") %>%row_spec(4, color ="blue") %>%row_spec(5, color ="blue") %>%row_spec(6, color ="blue")```<u>Deduplication rule: Keep BMO.</u>Some CIBC-Branded ATMs were originally Direct Cash ATMs, acquired by Cardtronics and reported via BMO. For these ATMs there are three sponsors reporting them as well.```{r}tagged_2023 %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name, duplicate_tag) %>%filter(duplicate_tag ==2) %>%filter(inst_assigned_term_id =="F4257559") %>%select(-duplicate_tag) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="red")```<u>Deduplication rule: .</u>### Deduplicate (2023)There are `r mc_raw_2023 %>% nrow` observations in the 2023 Mastercard data.Eliminate duplicates within sponsor```{r, cache=TRUE}df_1_2023 <- mc_raw_2023 %>%group_by(sponsor_name, inst_assigned_term_id, state_prov) %>%mutate(tag1 =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup() %>%arrange(state_prov, tag1, inst_assigned_term_id) %>%mutate(todrop =0, todrop =if_else(owner_name =="MBE", todrop +1, todrop),todrop =if_else(grepl("PO BOX", address1), todrop +1, todrop),todrop =if_else(is.na(latitude), todrop +1, todrop),todrop =if_else(tag1 ==1& owner_name =="DIRECT CASH", todrop +1, todrop)) %>%arrange(desc(tag1), desc(inst_assigned_term_id), desc(todrop), desc(address1))df_2_2023 <- df_1_2023 %>%arrange(tag1, inst_assigned_term_id, todrop, address1) %>%group_by(tag1,inst_assigned_term_id) %>%mutate(dupeorder =row_number()) %>%ungroup() %>%filter(tag1 <1| dupeorder ==1) %>%select(-todrop, -tag1)# mc_raw_2023 %>%# group_by(inst_assigned_term_id) %>%# # Flag PO Box entries for removal if the same terminal ID has a real address# mutate(todrop = if_else(grepl("PO BOX", address1) & n() > 1, 1, 0)) %>%# # Remove the flagged PO Box entries# filter(todrop == 0) %>%# ungroup() %>%# select(-todrop)```Then eliminate duplicates with different sponsors```{r, cache=TRUE}df_3_2023 <- df_2_2023 %>%group_by(inst_assigned_term_id, state_prov) %>%mutate(tag =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup() %>%mutate(tokeep =0) %>%# when BMO and Moneris share a terminal ID, we are keeping BMOmutate(tokeep =if_else(tag >=1& sponsor_name =="BANK OF MONTREAL", tokeep +1, tokeep),# When directcash bank and some other sponsor (BMO) share an terminal ID, we remove# direct cash bank tokeep =if_else(tag >=1& sponsor_name =="DIRECTCASH BANK", tokeep -1, tokeep),# if cibc shares an ATM with another sponsor we retain CIBCtokeep =if_else(tag >=1& sponsor_name =="CANADIAN IMPERIAL BANK OF COMMERCE", tokeep +1, tokeep),### RBC AND SCOTIA BANK DUPLICATES ACROSS SPONSORS ARE NOT REAL DUPLICATEStokeep =if_else(tag >=1& sponsor_name =="ROYAL BANK OF CANADA", tokeep +100, tokeep),tokeep =if_else(tag >=1& sponsor_name =="BANK OF NOVA SCOTIA", tokeep +100, tokeep)) %>%arrange(inst_assigned_term_id, desc(tokeep)) %>%arrange(tokeep)df_4_2023 <- df_3_2023 %>%group_by(inst_assigned_term_id) %>%mutate(dupeorder =row_number()) %>%ungroup() %>%filter(tokeep >=10| dupeorder ==1) %>%select(-tokeep, -tag) %>%# I am dropping this here for now because I wanted the observations to match what Hong Yu had in his stata script.# when I update the document I will probably move this to a later stagefilter(owner_name !="FEDERATION DES CAISSES POPULAIRES ACADIENNES")```After deduplication, there are `r df_4_2023 %>% nrow()` observations in the 2023 Mastercard data.### Identify Duplicates (2022)```{r}#| warning: false#| output: falsemc_raw_2022 <-read_csv("data/unprocessed/mastercard/BOC_All_ATMImage(November 2022).txt", guess_max =50000) %>%clean_names()```#### Within Sponsors```{r, cache=TRUE}tagged_2_2023 <- mc_raw_2022 %>%group_by(inst_assigned_term_id, sponsor_name) %>%mutate(duplicate_tag_1 =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup()tagged_2_2023 %>%count(sponsor_name, duplicate_tag_1) %>%pivot_wider(names_from = duplicate_tag_1, values_from = n, names_prefix ="tag_") %>%# Replace missing values (NA) with 0 in the reshaped datamutate(across(where(is.integer), ~replace_na(., 0))) %>%# Create a formatted table with 'kable' for better presentationkbl(caption ="Table 4: Duplicates Within Sponsors (2022)", col.names =c("Sponsor Name", "Tag 0", "Tag 1")) %>%kable_classic()```For Directcash Bank, the duplicates are due to the same record assigned once to its actual address, and once to a PO Box. Otherwise the records look exactly the same. ```{r}tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name) %>%filter(sponsor_name =="DIRECTCASH BANK") %>%arrange(inst_assigned_term_id) %>%filter(inst_assigned_term_id !="C4066615") %>%filter(inst_assigned_term_id !="C4066617") %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")```<u>Deduplication rule: When DirectCash Bank records share the same terminal ID but have different addresses, we drop the record where the address contains a PO Box and retain the record with a physical address.</u>Scotiabank duplicates have the same record with its address written differently, such as different city names with the same address, or one record ignores the street number while the other does have it. These are real duplicates as well.```{r}tagged_2_2023 %>%filter(duplicate_tag_1 !=0) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name) %>%filter(sponsor_name =="BANK OF NOVA SCOTIA") %>%arrange(inst_assigned_term_id) %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")```#### Across Sponsors```{r, cache=TRUE}tagged_2022 <- mc_raw_2022 %>%group_by(inst_assigned_term_id) %>%mutate(duplicate_tag =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup()tagged_2022 %>%count(sponsor_name, duplicate_tag) %>%pivot_wider(names_from = duplicate_tag, values_from = n, names_prefix ="tag_") %>%mutate(across(where(is.integer), ~replace_na(., 0))) %>%# Create a formatted table with 'kable' for better presentationkbl(caption ="Table 3: Duplicates Across Sponsors (2022)", col.names =c("Sponsor Name", "Tag 0", "Tag 1", "Tag 2")) %>%kable_classic() %>%kable_styling(full_width =FALSE)```Most duplicates (around 4000) have the sponsor being Directcash in one entry and the owner being Direct Cash in the other entry##### Direct Cash Bank - Moneris```{r, cache=TRUE}#| warning: falsetagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%filter(inst_assigned_term_id =="D1014793"| inst_assigned_term_id =="D1037192") %>%arrange(inst_assigned_term_id) %>%head(4) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name")) %>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green")```<u>Deduplication rule: When DIRECTCASH BANK and MONERIS share the same terminal ID, we use the Sponsor Name to determine which record to keep. We drop the record associated with DirectCash Bank and retain the record associated with MONERIS.</u>##### OtherFor the few duplicate CIBC ATMs, it seems that they are actually owned by Cardtronics, but branded by CIBC.```{r}tagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%group_by(inst_assigned_term_id) %>%filter(all(c("CANADIAN IMPERIAL BANK OF COMMERCE", "MONERIS SOLUTIONS CORPORATION") %in% sponsor_name)) %>%ungroup() %>%arrange(desc(inst_assigned_term_id)) %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")```<u>Deduplication rule: Retain the record associated with CIBC</u>It seems that the RBC and Scotiabank's "duplicated ATMs" are actually not duplicates, but different ATMs present in their respective branches. This could to be because the terminal ID variable is only 4 digit in each case. ```{r}tagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%group_by(inst_assigned_term_id) %>%filter(all(c("ROYAL BANK OF CANADA", "BANK OF NOVA SCOTIA") %in% sponsor_name)) %>%ungroup() %>%arrange(desc(inst_assigned_term_id)) %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")```<u>Deduplication rule: These are not real duplicates. Nothing to do here.</u>It is also true that NBC and TD's "duplicated ATMs" are actually not duplicates, but different ATMs present in their respective branches.```{r}tagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%group_by(inst_assigned_term_id) %>%filter(all(c("TORONTO-DOMINION BANK THE", "NATIONAL BANK OF CANADA") %in% sponsor_name)) %>%ungroup() %>%arrange(desc(inst_assigned_term_id)) %>%head(6) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") %>%row_spec(5, color ="orange") %>%row_spec(6, color ="orange")```<u>Deduplication rule: These are not real duplicates. Nothing to do here.</u>Some Scotiabank ATMs are double counted due to same address but different city names```{r}tagged_2022 %>%filter(duplicate_tag ==1) %>%select(sponsor_name, owner_name, inst_assigned_term_id, address1, city_name, location_name) %>%#group_by(inst_assigned_term_id) %>%filter(inst_assigned_term_id =="AA13"| inst_assigned_term_id =="AA21") %>%# filter(all(c("TORONTO-DOMINION BANK THE", "NATIONAL BANK OF CANADA") %in% sponsor_name)) %>%# ungroup() %>% arrange(desc(inst_assigned_term_id)) %>%head(4) %>%kbl(col.names =c("Sponsor Name", "Owner Name", "Terminal ID", "Address", "City", "Location Name"))%>%kable_classic() %>%row_spec(1, color ="red") %>%row_spec(2, color ="red") %>%row_spec(3, color ="green") %>%row_spec(4, color ="green") ```<u>Deduplication rule: Keep one observation </u>### Deduplicate (2022)There are `r mc_raw_2022 %>% nrow` observations in the 2023 Mastercard data.Eliminate duplicates within sponsor```{r, cache=TRUE}# Tag duplicates within the same sponsor and provincedf_1_2022 <- mc_raw_2022 %>%group_by(sponsor_name, inst_assigned_term_id, state_prov) %>%mutate(tag1 =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup() %>%arrange(state_prov, tag1, inst_assigned_term_id) %>%mutate(todrop =0,todrop =if_else(grepl("PO BOX", address1), todrop +1, todrop),todrop =if_else(is.na(latitude), todrop +1, todrop)) %>%arrange(desc(tag1), desc(inst_assigned_term_id), desc(todrop), desc(address1))# Drop the duplicates within the same sponsor and provincedf_2_2022 <- df_1_2022 %>%arrange(tag1, inst_assigned_term_id, todrop, address1) %>%group_by(tag1, inst_assigned_term_id) %>%mutate(dupeorder =row_number()) %>%ungroup() %>%filter(tag1 <1| dupeorder ==1) %>%select(-todrop, -tag1)```Then eliminate duplicates with different sponsors```{r, cache=TRUE}# Tag duplicates across different sponsorsdf_3_2022 <- df_2_2022 %>%group_by(inst_assigned_term_id, state_prov) %>%mutate(tag =case_when(n() ==1~0,n() ==2~1,n() >2~2 )) %>%ungroup() %>%mutate(tokeep =0) %>%mutate(tokeep =if_else(tag >=1& sponsor_name =="MONERIS SOLUTIONS CORPORATION", tokeep +1, tokeep),tokeep =if_else(tag >=1& sponsor_name =="DIRECTCASH BANK", tokeep -1, tokeep),tokeep =if_else(tag >=1& sponsor_name =="CANADIAN IMPERIAL BANK OF COMMERCE", tokeep +1, tokeep),tokeep =if_else(tag >=1& sponsor_name =="ROYAL BANK OF CANADA", tokeep +100, tokeep),tokeep =if_else(tag >=1& sponsor_name =="BANK OF NOVA SCOTIA", tokeep +100, tokeep),tokeep =if_else(tag >=1& sponsor_name =="NATIONAL BANK OF CANADA", tokeep +100, tokeep),tokeep =if_else(tag >=1& sponsor_name =="TORONTO-DOMINION BANK THE", tokeep +100, tokeep)) %>%arrange(inst_assigned_term_id, desc(tokeep)) %>%arrange(tokeep)# Final filtering and cleaningdf_4_2022 <- df_3_2022 %>%group_by(inst_assigned_term_id) %>%mutate(dupeorder =row_number()) %>%ungroup() %>%filter(tokeep >=10| dupeorder ==1) %>%select(-tokeep, -tag) ```After deduplication, there are `r df_4_2022 %>% nrow()` observations in the 2022 Mastercard data.## Step 2: Updating Desjardins ATM DataMastercard data is unreliable for Desjardins, as the 2019, 2022, and 2023 datasets show identical ATM counts, which is unlikely given Desjardins' ongoing ATM closures (see Table 5 below). Therefore, we rely on external data sources for Desjardins.```{r, cache=TRUE}#| warning: false#| message: falsemc_raw_2019 <-read_csv("data/unprocessed/mastercard/BOC_All_ATMImage_(Nov 2019).txt", guess_max =50000) %>%clean_names()desjardins_19 <- mc_raw_2019 %>%count(owner_name) %>%filter(owner_name =="LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC"| owner_name =="LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO") %>%mutate(date ="Nov 2019")desjardins_22 <- df_4_2022 %>%count(owner_name) %>%filter(owner_name =="LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC"| owner_name =="LA FEDERATION DES CAISSES POPULAIRES DE LONTARIO") %>%mutate(date ="Nov 2022")desjardins_23 <- df_4_2023 %>%count(owner_name) %>%filter(owner_name =="LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC"| owner_name =="LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO") %>%mutate(date ="Dec 2023")desjardins_19 %>%bind_rows(desjardins_22) %>%bind_rows(desjardins_23) %>%mutate(owner_name =case_when(str_detect(owner_name, "LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC") ~"LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC",str_detect(owner_name, "LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO|LA FEDERATION DES CAISSES POPULAIRES DE LONTARIO") ~"LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO",TRUE~ owner_name )) %>%pivot_wider(names_from = date, values_from = n) %>%kbl(caption ="Table 5: Desjardins ATM Counts Across Years (Mastercard Data)", col.names =c("Owner Name", "November 2019", "November 2022", "November 2023"))%>%kable_classic()```In the fall of 2022 and 2023, we webscraped Desjardins ATM data directly from their website. Unfortunately, we were unable to do this for 2019, so our improvements focus on the 2022 and 2023 datasets. The matching process for Desjardins ATMs involves the following steps:1. **Fuzzy Matching**: We first fuzzy match the addresses from the Mastercard ATMs to the webscraped ATMs using the `fuzzyjoin` package. We use the Jaro-Winkler string distance with a threshold of 0.2 (lower values indicate better matches) and reject any matches exceeding this threshold. This results in a matched dataset and two sets of rejected matches (one from Mastercard, one from the webscraped data).2. **Manual Review**: Next, we manually review the rejected matches to identify any false negatives. We carefully inspect and manually match addresses that were not captured by the fuzzy matching process. This results in four datasets: string-matched, manually-matched, rejected Mastercard matches, and rejected webscraped matches.3. **Consolidation**: We then combine the string-matched and manually-matched datasets to create a consolidated set of matched ATMs.4. **Final Integration**: Finally, since the webscraped data is more current and reliable, we add back the unmatched webscraped ATMs (those that didn’t have a corresponding match in the Mastercard data) to the matched dataset.In summary, the final matched dataset is built through a process of initial fuzzy matching, manual review of unmatched data, and integrating unmatched webscraped ATMs back into the dataset.```{r, cache=TRUE}#| warning: false#| output: false#| message: false## READ IN OUR WEBSCRAPED DATA# read in webscraped Desjardins ATM data 2023desjardins_atm_2023_webscraped <-read_csv("data/unprocessed/webscraped/desjardins_2023_atm_geocoded.csv") %>%mutate(latitude = lat) %>%mutate(longitude = lng) %>%select(-lat, -lng) %>%rename(webscraped_id = id) %>%# clean addressesmutate(address_clean =str_replace_all(address_clean, ",", ""))# read in webscraped Desjardins ATM data 2022desjardins_atm_2022_webscraped <-read_csv("data/unprocessed/webscraped/desjardins_2022_atm_geocoded.csv") %>%mutate(latitude = lat) %>%mutate(longitude = lng) %>%select(-lat, -lng, -score, -caisse_atm_name) %>%rename(webscraped_id = id, name = caisse_caisse_name, address_clean = caisse_atm_address) %>%mutate(address_clean =str_replace_all(address_clean, ",", ""))# read in webscraped Desjardins 2022 ATMdesjardins_ontario_atm_2022 <-read_csv("data/unprocessed/webscraped/desjardins_ontario_atm_2022.csv") %>%mutate(address_clean =str_replace_all(string = atm_address, "\n", " ")) %>%select(-atm_address, -atm_name)# read in web scraped Ontario desjardins ATMsdesjardins_ontario_atm_2023 <-read_csv("data/unprocessed/webscraped/desjardins_ontario_atm_2023.csv") %>%mutate(address_clean =str_replace_all(string = atm_address, "\n", " ")) %>%select(-atm_address, -atm_name)# read in ontario web scraped branchesdesjardins_branch_ont_23 <-read_csv("data/unprocessed/webscraped/desjardins_branches_ontario_2023.csv") %>%select(branch_address)desjardins_branch_ont_22 <-read_csv("data/unprocessed/webscraped/desjardins_branches_ontario_2022.csv") %>%select(branch_address)## Subset deduplicated 2023 MasterCard data for quebec desjardin ATMsdesjardins_2023_mc <- df_4_2023 %>%filter(owner_name =="LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC") %>%mutate(address_clean =paste0(address1, " ", city_name, " ", postal_code, " Quebec")) %>%mutate(address_clean =str_to_lower(address_clean)) %>%select(inst_assigned_term_id, owner_name, latitude, longitude, address_clean) %>%mutate(mc_id =row_number()) %>%mutate(address_clean =str_replace_all(address_clean, pattern ="quebec", "Québec")) %>%mutate(address_clean =str_replace_all(address_clean, "(?i)saint(\\S)", "saint \\1"))## Subset deduplicated 2023 MasterCard data for Ontario desjardin ATMsdesjardins_2023_mc_ontario <- df_4_2023 %>%filter(owner_name =="LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO") %>%mutate(address_clean =paste0(address1, " ", city_name, " ", postal_code, " Ontario")) %>%mutate(address_clean =str_to_lower(address_clean)) %>%select(inst_assigned_term_id, owner_name, latitude, longitude, address_clean) %>%mutate(mc_id =row_number())## Subset deduplicated 2022 MasterCard data for quebec desjardin ATMsdesjardins_2022_mc <- df_4_2022 %>%filter(owner_name =="LA FEDERATION DES CAISSES DESJARDINS DU QUEBEC") %>%mutate(address_clean =paste0(address1, " ", city_name, " ", postal_code, " Quebec")) %>%mutate(address_clean =str_to_lower(address_clean)) %>%select(inst_assigned_term_id, owner_name, latitude, longitude, address_clean) %>%mutate(mc_id =row_number()) %>%mutate(address_clean =str_replace_all(address_clean, pattern ="quebec", "Québec")) %>%mutate(address_clean =str_replace_all(address_clean, "(?i)saint(\\S)", "saint \\1"))## Subset deduplicated 2023 MasterCard data for Ontario desjardin ATMsdesjardins_2022_mc_ontario <- df_4_2022 %>%filter(owner_name =="LA FEDERATION DES CAISSES POPULAIRES DE L'ONTARIO") %>%mutate(address_clean =paste0(address1, " ", city_name, " ", postal_code, " Ontario")) %>%mutate(address_clean =str_to_lower(address_clean)) %>%select(inst_assigned_term_id, owner_name, latitude, longitude, address_clean) %>%mutate(mc_id =row_number())```### String Match (2023)```{r}#| warning: falselibrary(downloadthis)library(gt)library(fuzzyjoin)# Perform fuzzy matchingmatched_data_23 <-stringdist_join(x = desjardins_2023_mc,y = desjardins_atm_2023_webscraped,by ="address_clean",method ="jw",max_dist =0.21,ignore_case =TRUE,mode ="inner",distance_col ="dist")# Filter to keep the best match for each mc_idbest_matches_23 <- matched_data_23 %>%group_by(mc_id) %>%slice_min(order_by = dist, n =1, with_ties =FALSE) %>%ungroup() %>%select(-latitude.y, -longitude.y, -name) %>%rename(longitude = longitude.x,latitude = latitude.x,address_mc = address_clean.x,address_webscraped = address_clean.y) %>%select(mc_id, webscraped_id, everything())# Identify unmatched records in dataset 1 (to discard)unmatched_mc <- desjardins_2023_mc %>%anti_join(best_matches_23, by ="mc_id")# Identify unmatched records in dataset 2 (to retain)unmatched_webscraped <- desjardins_atm_2023_webscraped %>%anti_join(best_matches_23, by ="webscraped_id")### After manual inspection there were many addresses that should have been matched but were missed in the string matching procedure### In this section I am adding them back into the mc data```We matched `r best_matches_23 %>% nrow()` ATMs from the Mastercard data to the web scraped data after string matching. You can download the data below.```{r}library(downloadthis)best_matches_23 %>%download_this(output_name ="desjardin_matching_23_v1",output_extension =".xlsx",button_label ="Download data as xlsx",button_type ="default",has_icon =TRUE,icon ="fa fa-save" )```Next, we review the ATMs that were rejected from both the Mastercard and webscraped data, and performed manual matching where necessary. The table below gives us the ATMs from the webscraped data that did not meet the threshold for string matching```{r}#| warning: falselibrary(DT)unmatched_webscraped %>%select(-name) %>%datatable() ```The table below gives us the ATMs from the Mastercard data that did not meet the threshold for string matching```{r}#| warning: falseunmatched_mc %>%select(mc_id, address_clean, latitude, longitude) %>%datatable() ```### Manual Review (2023)Using the two tables above, we manually match the false negatives based on their IDs. You can search for addresses or IDs using the search bar. The code chunk below creates a dataframe to capture these false negatives based on their IDs.```{r}# Define the manually matched IDs based on your inputmatched_mc_webscraped <- tibble::tribble(~mc_id, ~webscraped_id,1, 542,2, 542,114, 350,124, 356,171, 412,192, 363,193, 363,194, 362,197, 362,198, 362,199, 362,201, 362,222, 101,224, 101,321, 230,322, 230,323, 230,331, 120,332, 123,348, 371,351, 371,375, 144,384, 232,385, 232,395, 152,396, 152,399, 45,400, 45,401, 45,403, 45,409, 235,433, 903,434, 903,499, 282,504, 286,548, 9,639, 678,640, 678,645, 682,646, 682,670, 610,671, 610,672, 610,721, 841,722, 842,729, 687,730, 691,756, 844,779, 751,780, 751,811, 806,901, 703,902, 703,903, 703,907, 706,911, 704,912, 704,913, 704,957, 768,958, 768,959, 768,960, 777,961, 777,968, 778,979, 635,980, 635,981, 635,982, 635,984, 633,985, 633,986, 638,987, 638,990, 638,996, 714,997, 714,1008, 779,1009, 779,1038, 657,1049, 658,1053, 653,1054, 653,1055, 717,1056, 717,1079, 660,1080, 660,1106, 722,1107, 722,1108, 722,1148, 599,1149, 599,1150, 599,1244, 29,1245, 29,1306, 444,1307, 444,1308, 444,1330, 460,1387, 66,1401, 82,1425, 300,1426, 300,1427, 300,1437, 302,1448, 313,1463, 318,1485, 335,1605, 479,1611, 479,1612, 479,1615, 479,1620, 479,1634, 540,1635, 540,1645, 377,1650, 377,1675, 916,1676, 916,1686, 921,1695, 924)``````{r}#| warning: false#| output: false#| message: false# join Id's that we matched manuallymc_clean <- unmatched_mc %>%rename(address_clean_mc = address_clean) %>%rename(latitude_mc = latitude,longitude_mc = longitude)web_clean <- unmatched_webscraped %>%rename(address_web = address_clean) %>%rename(latitude_web = latitude,longitude_web = longitude) %>%select(-name)# join manually matched idmatched_mc_webscraped %>%inner_join(mc_clean) %>%inner_join(web_clean) %>%select(mc_id, webscraped_id, address_clean_mc, address_web)```Next, we review the unmatched data from both the Mastercard and webscraped datasets—those records that remained unmatched after both string matching and manual matching. Below, you have the option to download these unmatched records for further analysis or review.```{r}#| warning: false#| message: falseunmatched_mc_final <- unmatched_mc %>%filter(!mc_id %in% matched_mc_webscraped$mc_id)unmatched_webscraped_final <- unmatched_webscraped %>%filter(!webscraped_id %in% matched_mc_webscraped$webscraped_id) %>%select(-name) %>%rename(latitude_web = latitude,longitude_web = longitude,address_web = address_clean)unmatched_mc_final %>%download_this(output_name ="unmatched_mc_23",output_extension =".xlsx",button_label ="Download data as xlsx",button_type ="default",has_icon =TRUE,icon ="fa fa-save" )unmatched_webscraped_final %>%download_this(output_name ="unmatched_webscraped_23",output_extension =".xlsx",button_label ="Download data as xlsx",button_type ="default",has_icon =TRUE,icon ="fa fa-save" )```### Consolidate (2023)We consolidate the string-matched and manually matched data, then add the unmatched webscraped data to create the final dataset for Desjardins ATMs in Quebec. The Ontario matching will be handled in the next step.```{r}#| warning: false#| message: false# matched_final <- matched_mc_webscraped %>%inner_join(mc_clean) %>%inner_join(web_clean) %>%bind_rows(unmatched_webscraped_final) %>%mutate(address_clean = address_web) %>%mutate(address_clean =ifelse(webscraped_id ==350, "501 rue Beaudoin Saint-Léonard-d’Aston Québec J0C 1M0", address_clean)) %>%mutate(address_clean =ifelse(webscraped_id ==777, "170 boulevard Harwood Vaudreuil-Dorion Québec J7V 1Y2", address_clean)) %>%mutate(address_clean =ifelse(webscraped_id ==758, "745 rue Roy Est Montréal Québec H2L 1E1", address_clean)) %>%filter(webscraped_id !=488) %>%filter(webscraped_id !=778) %>%mutate(latitude = latitude_mc,longitude = longitude_mc) %>%mutate(latitude =ifelse(is.na(latitude), latitude_web, latitude)) %>%mutate(longitude =ifelse(is.na(longitude), latitude_web, longitude)) %>%mutate(owner_name =ifelse(is.na(owner_name), "Desjardins (webscraped)", owner_name)) %>%select(mc_id, webscraped_id, inst_assigned_term_id, owner_name, address_webscraped = address_web, address_mc = address_clean_mc, address_clean, latitude, longitude) %>%mutate(matched ="manual") %>%mutate(matched =ifelse(is.na(mc_id), "unmatched webscraped", matched))best_matches_23 %>%mutate(address_clean = address_webscraped) %>%mutate(matched ="string match") %>%bind_rows(matched_final) %>%rename(string_distance = dist) %>%select(mc_id, webscraped_id, address_mc, address_webscraped, string_distance, matched) %>%download_this(output_name ="matched_desjardins_final_23.",output_extension =".xlsx",button_label ="Download data as xlsx",button_type ="default",has_icon =TRUE,icon ="fa fa-save" )```### Ontario (2023)Next, we will match the Ontario Desjardins ATMs. This matching procedure differs slightly from the one used for Quebec. In 2022 and 2023, we webscraped Ontario Desjardins ATMs, but discovered that this data only included off-site ATMs and excluded those located inside bank branches. After inspecting the Mastercard data for Ontario, we found that it included ATMs inside bank branches.Therefore, our matching procedure is as follows: First, we fuzzy match the Ontario Desjardins ATMs from Mastercard to the webscraped Ontario Desjardins bank branches, removing any Mastercard ATMs that do not meet a specific address matching threshold. Next, we manually inspect any false negatives and add them back if necessary. Finally, we include the webscraped Ontario Desjardins ATMs in the dataset, as they represent relevant off-site ATMs that should be retained.```{r}desjardins_2023_mc_ontario %>%select(mc_id, address_clean) desjardins_branch_ont_23 <- desjardins_branch_ont_23 %>%mutate(branch_address =str_replace_all(branch_address, "\n", " ")) %>%mutate(branch_id =row_number()) %>%mutate(address_clean = branch_address) # Perform fuzzy matchingmatched_data_23_ont <-stringdist_join(x = desjardins_2023_mc_ontario,y = desjardins_branch_ont_23,by ="address_clean",method ="jw",max_dist =0.21,ignore_case =TRUE,mode ="inner",distance_col ="dist")matched_data_23_ont %>%View()best_matches_ontario_23 <- matched_data_23_ont %>%group_by(mc_id) %>%slice_min(order_by = dist, n =1, with_ties =FALSE) %>%ungroup() %>%rename(address_mc = address_clean.x,address_webscraped = address_clean.y) %>%select(mc_id, branch_id, address_mc, branch_address) # Identify unmatched records in dataset 1 (to discard)unmatched_mc_ontario_23 <- desjardins_2023_mc_ontario %>%anti_join(best_matches_ontario_23, by ="mc_id") %>%select(mc_id, address_clean)unmatched_mc_ontario_23 %>%View()# Identify unmatched records in dataset 2 (to retain)unmatched_branches_ontario_23 <- desjardins_branch_ont_23 %>%anti_join(best_matches_ontario_23, by ="branch_id") %>%select(-branch_address)```