===============================================
To have a better look on the structure and attributes before doing any data analysis
head(df, n = 5)
## accountNumber customerId creditLimit availableMoney transactionDateTime
## 1 737265056 737265056 5000 5000 2016-08-13T14:27:32
## 2 737265056 737265056 5000 5000 2016-10-11T05:05:54
## 3 737265056 737265056 5000 5000 2016-11-08T09:18:39
## 4 737265056 737265056 5000 5000 2016-12-10T02:14:50
## 5 830329091 830329091 5000 5000 2016-03-24T21:04:46
## transactionAmount merchantName acqCountry merchantCountryCode
## 1 98.55 Uber US US
## 2 74.51 AMC #191138 US US
## 3 7.47 Play Store US US
## 4 7.47 Play Store US US
## 5 71.18 Tim Hortons #947751 US US
## posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1 02 01 rideshare 06/2023
## 2 09 01 entertainment 02/2024
## 3 09 01 mobileapps 08/2025
## 4 09 01 mobileapps 08/2025
## 5 02 01 fastfood 10/2029
## accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1 2015-03-14 2015-03-14 414 414 1803
## 2 2015-03-14 2015-03-14 486 486 767
## 3 2015-03-14 2015-03-14 486 486 767
## 4 2015-03-14 2015-03-14 486 486 767
## 5 2015-08-06 2015-08-06 885 885 3143
## transactionType echoBuffer currentBalance merchantCity merchantState
## 1 PURCHASE 0
## 2 PURCHASE 0
## 3 PURCHASE 0
## 4 PURCHASE 0
## 5 PURCHASE 0
## merchantZip cardPresent posOnPremises recurringAuthInd
## 1 FALSE
## 2 TRUE
## 3 FALSE
## 4 FALSE
## 5 TRUE
## expirationDateKeyInMatch isFraud
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE FALSE
## 5 FALSE FALSE
tail(df, n = 5)
## accountNumber customerId creditLimit availableMoney transactionDateTime
## 786359 732852505 732852505 50000 48904.96 2016-12-22T18:44:12
## 786360 732852505 732852505 50000 48785.04 2016-12-25T16:20:34
## 786361 732852505 732852505 50000 48766.15 2016-12-27T15:46:24
## 786362 732852505 732852505 50000 48716.72 2016-12-29T00:30:55
## 786363 732852505 732852505 50000 48666.83 2016-12-30T20:10:29
## transactionAmount merchantName acqCountry merchantCountryCode
## 786359 119.92 Lyft US US
## 786360 18.89 hulu.com US US
## 786361 49.43 Lyft US US
## 786362 49.89 walmart.com US US
## 786363 72.18 Uber US US
## posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 786359 90 01 rideshare 12/2022
## 786360 09 01 online_subscriptions 08/2023
## 786361 02 01 rideshare 08/2025
## 786362 09 99 online_retail 07/2022
## 786363 05 01 rideshare 05/2024
## accountOpenDate dateOfLastAddressChange cardCVV enteredCVV
## 786359 2012-08-23 2012-08-23 936 936
## 786360 2012-08-23 2012-08-23 939 939
## 786361 2012-08-23 2012-08-23 936 936
## 786362 2012-08-23 2012-08-23 939 939
## 786363 2012-08-23 2012-08-23 939 939
## cardLast4Digits transactionType echoBuffer currentBalance merchantCity
## 786359 3783 PURCHASE 1095.04
## 786360 3388 PURCHASE 1214.96
## 786361 3783 PURCHASE 1233.85
## 786362 3388 PURCHASE 1283.28
## 786363 3388 PURCHASE 1333.17
## merchantState merchantZip cardPresent posOnPremises recurringAuthInd
## 786359 FALSE
## 786360 FALSE
## 786361 FALSE
## 786362 FALSE
## 786363 FALSE
## expirationDateKeyInMatch isFraud
## 786359 FALSE FALSE
## 786360 FALSE FALSE
## 786361 FALSE FALSE
## 786362 FALSE FALSE
## 786363 FALSE FALSE
summary(df)
## accountNumber customerId creditLimit availableMoney
## Length:786363 Length:786363 Min. : 250 Min. :-1006
## Class :character Class :character 1st Qu.: 5000 1st Qu.: 1077
## Mode :character Mode :character Median : 7500 Median : 3185
## Mean :10759 Mean : 6251
## 3rd Qu.:15000 3rd Qu.: 7500
## Max. :50000 Max. :50000
## transactionDateTime transactionAmount merchantName acqCountry
## Length:786363 Min. : 0.00 Length:786363 Length:786363
## Class :character 1st Qu.: 33.65 Class :character Class :character
## Mode :character Median : 87.90 Mode :character Mode :character
## Mean : 136.99
## 3rd Qu.: 191.48
## Max. :2011.54
## merchantCountryCode posEntryMode posConditionCode merchantCategoryCode
## Length:786363 Length:786363 Length:786363 Length:786363
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## currentExpDate accountOpenDate dateOfLastAddressChange
## Length:786363 Length:786363 Length:786363
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## cardCVV enteredCVV cardLast4Digits transactionType
## Length:786363 Length:786363 Length:786363 Length:786363
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## echoBuffer currentBalance merchantCity merchantState
## Length:786363 Min. : 0.0 Length:786363 Length:786363
## Class :character 1st Qu.: 689.9 Class :character Class :character
## Mode :character Median : 2451.8 Mode :character Mode :character
## Mean : 4508.7
## 3rd Qu.: 5291.1
## Max. :47498.8
## merchantZip cardPresent posOnPremises recurringAuthInd
## Length:786363 Mode :logical Length:786363 Length:786363
## Class :character FALSE:433495 Class :character Class :character
## Mode :character TRUE :352868 Mode :character Mode :character
##
##
##
## expirationDateKeyInMatch isFraud
## Mode :logical Mode :logical
## FALSE:785320 FALSE:773946
## TRUE :1043 TRUE :12417
##
##
##
colnames(df)
## [1] "accountNumber" "customerId"
## [3] "creditLimit" "availableMoney"
## [5] "transactionDateTime" "transactionAmount"
## [7] "merchantName" "acqCountry"
## [9] "merchantCountryCode" "posEntryMode"
## [11] "posConditionCode" "merchantCategoryCode"
## [13] "currentExpDate" "accountOpenDate"
## [15] "dateOfLastAddressChange" "cardCVV"
## [17] "enteredCVV" "cardLast4Digits"
## [19] "transactionType" "echoBuffer"
## [21] "currentBalance" "merchantCity"
## [23] "merchantState" "merchantZip"
## [25] "cardPresent" "posOnPremises"
## [27] "recurringAuthInd" "expirationDateKeyInMatch"
## [29] "isFraud"
nrow(df)
## [1] 786363
ncol(df)
## [1] 29
dim(df)
## [1] 786363 29
## accountNumber customerId creditLimit
## 0 0 0
## availableMoney transactionDateTime transactionAmount
## 0 0 0
## merchantName acqCountry merchantCountryCode
## 0 4562 724
## posEntryMode posConditionCode merchantCategoryCode
## 4054 409 0
## currentExpDate accountOpenDate dateOfLastAddressChange
## 0 0 0
## cardCVV enteredCVV cardLast4Digits
## 0 0 0
## transactionType echoBuffer currentBalance
## 698 786363 0
## merchantCity merchantState merchantZip
## 786363 786363 786363
## cardPresent posOnPremises recurringAuthInd
## 0 786363 786363
## expirationDateKeyInMatch isFraud
## 0 0
Unique Values in each attribute also helps explore the structure of the dataset When an attribute has less number of Unique Values it is always a good criteria to look at them keenly to know more about the structure as a whole and how the other attributes are behaving around these small buckets
## accountNumber customerId creditLimit
## 5000 5000 10
## availableMoney transactionDateTime transactionAmount
## 521916 776637 66038
## merchantName acqCountry merchantCountryCode
## 2490 5 5
## posEntryMode posConditionCode merchantCategoryCode
## 6 4 19
## currentExpDate accountOpenDate dateOfLastAddressChange
## 165 1820 2184
## cardCVV enteredCVV cardLast4Digits
## 899 976 5246
## transactionType echoBuffer currentBalance
## 4 1 487318
## merchantCity merchantState merchantZip
## 1 1 1
## cardPresent posOnPremises recurringAuthInd
## 2 1 1
## expirationDateKeyInMatch isFraud
## 2 2
The most important attribute of any given dataset, which can be a label in a classification problem or a continuous value in regression analysis. In clustering it depends on the use case that which attributes are important to look around with.
## isFraud n
## 1 FALSE 773946
## 2 TRUE 12417
df$isFraud <- as.integer(df$isFraud)
df$cardPresent <- as.integer(df$cardPresent)
df$expirationDateKeyInMatch <- as.integer(df$expirationDateKeyInMatch)
Most of the Credit Cards have 5000 as their limit, on second
number it is 15000. Interestingly, credit cards with 50,000 limit also
exists in the dataset
Most values lie inside Purchase Type. But it is necessary to explore the other two to analyze what is happening in those transactions.
## >>> Suggestions
## PieChart(transactionType, hole=0) # traditional pie chart
## PieChart(transactionType, values="%") # display %'s on the chart
## PieChart(transactionType) # bar chart
## Plot(transactionType) # bubble plot
## Plot(transactionType, values="count") # lollipop plot
##
## --- transactionType ---
##
## ADDRESS_VERIFICATION PURCHASE REVERSAL Total
## Frequencies: 20169 745193 20303 785665
## Proportions: 0.026 0.948 0.026 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 1337879.797, df = 2, p-value = 0.000
plays an important role and dataset is fairly distributed among the two values as True/False
## expirationDateKeyInMatch n
## 1 0 785320
## 2 1 1043
As described earlier is clearly an example of IMBALANCED DATSET with respect to this attribute. But It is expected to be like that in real world scenario as Frauds are always less than 5% with respect to whole dataset. Here it is 1.6%
To know more about the transaction data set it is important to know the Top Merchants where most of the Transactions happen. It helps in Optimizing the metrics and to prioritize the business and data strategies accordingly.
## merchantName n
## 1 Uber 25613
## 2 Lyft 25523
## 3 oldnavy.com 16992
## 4 staples.com 16980
## 5 alibaba.com 16959
## 6 apple.com 16898
## 7 walmart.com 16873
## 8 cheapfast.com 16858
## 9 ebay.com 16842
## 10 target.com 16813
## 11 amazon.com 16780
## 12 sears.com 16755
## 13 discount.com 16731
## 14 gap.com 16675
## creditLimit availableMoney transactionAmount currentBalance
## 1 5000 5000 98.55 0
## 2 5000 5000 74.51 0
## 3 5000 5000 7.47 0
## 4 5000 5000 7.47 0
## 5 5000 5000 71.18 0
Here we can see Available Money can also go Negative, which is -1005.63 as minimum value
## creditLimit availableMoney transactionAmount currentBalance
## Min. : 250 Min. :-1006 Min. : 0.00 Min. : 0.0
## 1st Qu.: 5000 1st Qu.: 1077 1st Qu.: 33.65 1st Qu.: 689.9
## Median : 7500 Median : 3185 Median : 87.90 Median : 2451.8
## Mean :10759 Mean : 6251 Mean : 136.99 Mean : 4508.7
## 3rd Qu.:15000 3rd Qu.: 7500 3rd Qu.: 191.48 3rd Qu.: 5291.1
## Max. :50000 Max. :50000 Max. :2011.54 Max. :47498.8
While data Modelling this information can be really
useful
## merchantCategoryCode
## 695 health
## 972 online_retail
## 1073 health
## 1114 online_retail
## 1322 online_retail
## 1346 rideshare
## 1347 rideshare
## 1368 online_retail
## 1417 online_retail
## 1624 online_retail
## merchantName n
## 1 Lyft 760
## 2 ebay.com 639
## 3 Fresh Flowers 553
## 4 Uber 512
## 5 walmart.com 446
## 6 cheapfast.com 442
## 7 sears.com 422
## 8 oldnavy.com 401
## 9 staples.com 399
## 10 alibaba.com 383
## 11 amazon.com 372
## 12 gap.com 369
## 13 target.com 362
## 14 apple.com 357
## 15 discount.com 346
## accountNumber n
## 1 380680241 783
## 2 782081187 307
## 3 246251253 278
## 4 700725639 272
## 5 472288969 266
## 6 208319653 211
## 7 419709514 207
## customerId n
## 1 380680241 783
## 2 782081187 307
## 3 246251253 278
## 4 700725639 272
## 5 472288969 266
## 6 208319653 211
## 7 419709514 207
tells important infomration about the given process.
Till know we didn’t remove any empty Columns. Now removing the empty columns
## accountNumber customerId creditLimit availableMoney transactionDateTime
## 1 737265056 737265056 5000 5000 2016-08-13T14:27:32
## 2 737265056 737265056 5000 5000 2016-10-11T05:05:54
## 3 737265056 737265056 5000 5000 2016-11-08T09:18:39
## 4 737265056 737265056 5000 5000 2016-12-10T02:14:50
## 5 830329091 830329091 5000 5000 2016-03-24T21:04:46
## transactionAmount merchantName acqCountry merchantCountryCode
## 1 98.55 Uber US US
## 2 74.51 AMC #191138 US US
## 3 7.47 Play Store US US
## 4 7.47 Play Store US US
## 5 71.18 Tim Hortons #947751 US US
## posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1 02 01 rideshare 06/2023
## 2 09 01 entertainment 02/2024
## 3 09 01 mobileapps 08/2025
## 4 09 01 mobileapps 08/2025
## 5 02 01 fastfood 10/2029
## accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1 2015-03-14 2015-03-14 414 414 1803
## 2 2015-03-14 2015-03-14 486 486 767
## 3 2015-03-14 2015-03-14 486 486 767
## 4 2015-03-14 2015-03-14 486 486 767
## 5 2015-08-06 2015-08-06 885 885 3143
## transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1 PURCHASE 0 0 0 0
## 2 PURCHASE 0 1 0 0
## 3 PURCHASE 0 0 0 0
## 4 PURCHASE 0 0 0 0
## 5 PURCHASE 0 1 0 0
head(df_2, n = 5)
## accountNumber customerId creditLimit availableMoney transactionDateTime
## 1 737265056 737265056 5000 5000 2016-08-13T14:27:32
## 2 737265056 737265056 5000 5000 2016-10-11T05:05:54
## 3 737265056 737265056 5000 5000 2016-11-08T09:18:39
## 4 737265056 737265056 5000 5000 2016-12-10T02:14:50
## 5 830329091 830329091 5000 5000 2016-03-24T21:04:46
## transactionAmount merchantName acqCountry merchantCountryCode
## 1 98.55 Uber US US
## 2 74.51 AMC #191138 US US
## 3 7.47 Play Store US US
## 4 7.47 Play Store US US
## 5 71.18 Tim Hortons #947751 US US
## posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1 02 01 rideshare 06/2023
## 2 09 01 entertainment 02/2024
## 3 09 01 mobileapps 08/2025
## 4 09 01 mobileapps 08/2025
## 5 02 01 fastfood 10/2029
## accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1 2015-03-14 2015-03-14 414 414 1803
## 2 2015-03-14 2015-03-14 486 486 767
## 3 2015-03-14 2015-03-14 486 486 767
## 4 2015-03-14 2015-03-14 486 486 767
## 5 2015-08-06 2015-08-06 885 885 3143
## transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1 PURCHASE 0 0 0 0
## 2 PURCHASE 0 1 0 0
## 3 PURCHASE 0 0 0 0
## 4 PURCHASE 0 0 0 0
## 5 PURCHASE 0 1 0 0
## transactionDate
## 1 2016-08-13
## 2 2016-10-11
## 3 2016-11-08
## 4 2016-12-10
## 5 2016-03-24
Here we are taking those attributes which can uniquely identify and help in indentifying a duplicate transaction at a given point:
## accountNumber customerId creditLimit availableMoney transactionDateTime
## 3 737265056 737265056 5000 5000.00 2016-11-08T09:18:39
## 4 737265056 737265056 5000 5000.00 2016-12-10T02:14:50
## 12 830329091 830329091 5000 5000.00 2016-12-07T16:34:04
## 13 830329091 830329091 5000 4959.25 2016-12-14T10:00:35
## 14 830329091 830329091 5000 4918.50 2016-12-20T18:38:23
## transactionAmount merchantName acqCountry merchantCountryCode posEntryMode
## 3 7.47 Play Store US US 09
## 4 7.47 Play Store US US 09
## 12 40.75 GreenCook US US 09
## 13 40.75 GreenCook US US 09
## 14 40.75 GreenCook US US 09
## posConditionCode merchantCategoryCode currentExpDate accountOpenDate
## 3 01 mobileapps 08/2025 2015-03-14
## 4 01 mobileapps 08/2025 2015-03-14
## 12 01 food_delivery 08/2024 2015-08-06
## 13 01 food_delivery 08/2024 2015-08-06
## 14 01 food_delivery 08/2024 2015-08-06
## dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits transactionType
## 3 2015-03-14 486 486 767 PURCHASE
## 4 2015-03-14 486 486 767 PURCHASE
## 12 2015-08-06 885 885 3143 PURCHASE
## 13 2015-08-06 885 885 3143 PURCHASE
## 14 2015-08-06 885 885 3143 PURCHASE
## currentBalance cardPresent expirationDateKeyInMatch isFraud transactionDate
## 3 0.00 0 0 0 2016-11-08
## 4 0.00 0 0 0 2016-12-10
## 12 0.00 0 0 0 2016-12-07
## 13 40.75 0 0 0 2016-12-14
## 14 81.50 0 0 0 2016-12-20
dim(duplicate_trans)
## [1] 100716 24
## merchantName n
## 1 Fresh Flowers 69
## 2 Lyft 58
## 3 ebay.com 46
## 4 walmart.com 39
## 5 American Airlines 34
## 6 cheapfast.com 34
## 7 Uber 34
## 8 sears.com 32
## 9 Blue Mountain Online Services 31
## 10 oldnavy.com 30
nrow(subset(df_2, transactionType == 'REVERSAL'))
## [1] 20303
sum(subset(df_2, transactionType == 'REVERSAL')$transactionAmount)
## [1] 2821792
## merchantName n
## 1 Lyft 26
## 2 walmart.com 16
## 3 Fresh Flowers 15
## 4 gap.com 15
## 5 American Airlines 14
## 6 cheapfast.com 13
## 7 sears.com 12
## 8 apple.com 11
## 9 ebay.com 11
## 10 staples.com 11
## 11 oldnavy.com 10
nrow(subset(duplicate_trans, transactionType == 'REVERSAL'))
## [1] 17826
sum(subset(duplicate_trans, transactionType == 'REVERSAL')$transactionAmount)
## [1] 2669860
## merchantName n
## 1 Lyft 23
## 2 Fresh Flowers 14
## 3 gap.com 13
## 4 walmart.com 13
## 5 apple.com 11
## 6 cheapfast.com 11
## 7 ebay.com 11
## 8 staples.com 11
## 9 American Airlines 10
## 10 oldnavy.com 10
## accountNumber customerId creditLimit availableMoney transactionDateTime
## 1 737265056 737265056 5000 5000 2016-08-13T14:27:32
## 2 737265056 737265056 5000 5000 2016-10-11T05:05:54
## 3 737265056 737265056 5000 5000 2016-11-08T09:18:39
## 4 737265056 737265056 5000 5000 2016-12-10T02:14:50
## 5 830329091 830329091 5000 5000 2016-03-24T21:04:46
## transactionAmount merchantName acqCountry merchantCountryCode
## 1 98.55 Uber US US
## 2 74.51 AMC #191138 US US
## 3 7.47 Play Store US US
## 4 7.47 Play Store US US
## 5 71.18 Tim Hortons #947751 US US
## posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1 02 01 rideshare 06/2023
## 2 09 01 entertainment 02/2024
## 3 09 01 mobileapps 08/2025
## 4 09 01 mobileapps 08/2025
## 5 02 01 fastfood 10/2029
## accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1 2015-03-14 2015-03-14 414 414 1803
## 2 2015-03-14 2015-03-14 486 486 767
## 3 2015-03-14 2015-03-14 486 486 767
## 4 2015-03-14 2015-03-14 486 486 767
## 5 2015-08-06 2015-08-06 885 885 3143
## transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1 PURCHASE 0 0 0 0
## 2 PURCHASE 0 1 0 0
## 3 PURCHASE 0 0 0 0
## 4 PURCHASE 0 0 0 0
## 5 PURCHASE 0 1 0 0
We can see which buckets are removed from the dataset, as for
Models to generalize the dataset well it is advised to remove outliers.
But in some use cases we avoid that as well
dim(df)
## [1] 617105 23
df$CVVMatched <- df$cardCVV == df$enteredCVV
df$CVVMatched <- as.integer(df$CVVMatched)
dim(df)
## [1] 617105 24
df_time <- data.frame(df)
head(df_time,n=5 )
## accountNumber customerId creditLimit availableMoney transactionDateTime
## 1 737265056 737265056 5000 5000 2016-08-13T14:27:32
## 2 737265056 737265056 5000 5000 2016-10-11T05:05:54
## 3 737265056 737265056 5000 5000 2016-11-08T09:18:39
## 4 737265056 737265056 5000 5000 2016-12-10T02:14:50
## 5 830329091 830329091 5000 5000 2016-03-24T21:04:46
## transactionAmount merchantName acqCountry merchantCountryCode
## 1 98.55 Uber US US
## 2 74.51 AMC #191138 US US
## 3 7.47 Play Store US US
## 4 7.47 Play Store US US
## 5 71.18 Tim Hortons #947751 US US
## posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1 02 01 rideshare 06/2023
## 2 09 01 entertainment 02/2024
## 3 09 01 mobileapps 08/2025
## 4 09 01 mobileapps 08/2025
## 5 02 01 fastfood 10/2029
## accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1 2015-03-14 2015-03-14 414 414 1803
## 2 2015-03-14 2015-03-14 486 486 767
## 3 2015-03-14 2015-03-14 486 486 767
## 4 2015-03-14 2015-03-14 486 486 767
## 5 2015-08-06 2015-08-06 885 885 3143
## transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1 PURCHASE 0 0 0 0
## 2 PURCHASE 0 1 0 0
## 3 PURCHASE 0 0 0 0
## 4 PURCHASE 0 0 0 0
## 5 PURCHASE 0 1 0 0
## CVVMatched
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
df$transactionTimeCat = df_time$time_period
head(df, n=5)
## accountNumber customerId creditLimit availableMoney transactionDateTime
## 1 737265056 737265056 5000 5000 2016-08-13T14:27:32
## 2 737265056 737265056 5000 5000 2016-10-11T05:05:54
## 3 737265056 737265056 5000 5000 2016-11-08T09:18:39
## 4 737265056 737265056 5000 5000 2016-12-10T02:14:50
## 5 830329091 830329091 5000 5000 2016-03-24T21:04:46
## transactionAmount merchantName acqCountry merchantCountryCode
## 1 98.55 Uber US US
## 2 74.51 AMC #191138 US US
## 3 7.47 Play Store US US
## 4 7.47 Play Store US US
## 5 71.18 Tim Hortons #947751 US US
## posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1 02 01 rideshare 06/2023
## 2 09 01 entertainment 02/2024
## 3 09 01 mobileapps 08/2025
## 4 09 01 mobileapps 08/2025
## 5 02 01 fastfood 10/2029
## accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1 2015-03-14 2015-03-14 414 414 1803
## 2 2015-03-14 2015-03-14 486 486 767
## 3 2015-03-14 2015-03-14 486 486 767
## 4 2015-03-14 2015-03-14 486 486 767
## 5 2015-08-06 2015-08-06 885 885 3143
## transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1 PURCHASE 0 0 0 0
## 2 PURCHASE 0 1 0 0
## 3 PURCHASE 0 0 0 0
## 4 PURCHASE 0 0 0 0
## 5 PURCHASE 0 1 0 0
## CVVMatched transactionTimeCat
## 1 1 1
## 2 1 3
## 3 1 0
## 4 1 3
## 5 1 2
count(df, transactionTimeCat)
## transactionTimeCat n
## 1 0 128254
## 2 1 154273
## 3 2 154956
## 4 3 179622
count(df, CVVMatched)
## CVVMatched n
## 1 0 5560
## 2 1 611545
Further Principal Component Analysis can also be tried to reduce the dimensions further but for the given case it is not required.
Variance of the dataset is still maintained to be 99%
## [1] "Final total columns"
## [1] 16
colSums(is.na(df))
## creditLimit availableMoney transactionAmount
## 0 0 0
## merchantName acqCountry merchantCountryCode
## 0 0 0
## posEntryMode posConditionCode merchantCategoryCode
## 0 0 0
## transactionType currentBalance cardPresent
## 0 0 0
## expirationDateKeyInMatch isFraud CVVMatched
## 0 0 0
## transactionTimeCat
## 0
head(df, n=5)
## creditLimit availableMoney transactionAmount merchantName acqCountry
## 1 5000 5000 98.55 0 0
## 2 5000 5000 74.51 1 0
## 3 5000 5000 7.47 2 0
## 4 5000 5000 7.47 2 0
## 5 5000 5000 71.18 3 0
## merchantCountryCode posEntryMode posConditionCode merchantCategoryCode
## 1 0 0 0 0
## 2 0 1 0 1
## 3 0 1 0 2
## 4 0 1 0 2
## 5 0 0 0 3
## transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1 0 0 0 0 0
## 2 0 0 1 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 1 0 0
## CVVMatched transactionTimeCat
## 1 1 1
## 2 1 3
## 3 1 0
## 4 1 3
## 5 1 2
table(rose_train$isFraud)
##
## 0 1
## 309174 307931
table(df_under$isFraud)
##
## 0 1
## 8847 8847
##
## Call:
## glm(formula = isFraud ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0817 -1.0694 -0.5585 1.0963 2.0292
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.203188709 0.168595143 1.205 0.22813
## creditLimit 0.000011916 0.000006803 1.752 0.07984
## availableMoney -0.000012060 0.000008381 -1.439 0.15017
## transactionAmount 0.004758081 0.000169930 28.000 < 0.0000000000000002
## merchantName -0.000465575 0.000041221 -11.295 < 0.0000000000000002
## acqCountry 0.043301686 0.228171315 0.190 0.84948
## merchantCountryCode 0.084819309 0.233430307 0.363 0.71634
## posEntryMode -0.113553464 0.017284093 -6.570 0.0000000000504
## posConditionCode -0.075115314 0.039741254 -1.890 0.05874
## merchantCategoryCode 0.006528148 0.004391111 1.487 0.13710
## transactionType -0.027811227 0.050030075 -0.556 0.57829
## currentBalance NA NA NA NA
## cardPresent -0.420620035 0.041709599 -10.084 < 0.0000000000000002
## expirationDateKeyInMatch 0.795534257 0.514831351 1.545 0.12229
## CVVMatched -0.433319209 0.159294140 -2.720 0.00652
## transactionTimeCat -0.014773995 0.015988819 -0.924 0.35548
##
## (Intercept)
## creditLimit .
## availableMoney
## transactionAmount ***
## merchantName ***
## acqCountry
## merchantCountryCode
## posEntryMode ***
## posConditionCode .
## merchantCategoryCode
## transactionType
## currentBalance
## cardPresent ***
## expirationDateKeyInMatch
## CVVMatched **
## transactionTimeCat
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19623 on 14154 degrees of freedom
## Residual deviance: 18139 on 14140 degrees of freedom
## AIC: 18169
##
## Number of Fisher Scoring iterations: 4
## [1] 14155 16
## [1] 3539 16
## predict_reg
## 0 1
## 0 1209 547
## 1 680 1103
## [1] "Accuracy = 0.65329189036451"
## Var1 Freq
## 1 0 8847
## 2 1 8847
## [1] 0.6535584
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1209 680
## 1 547 1103
##
## Accuracy : 0.6533
## 95% CI : (0.6373, 0.669)
## No Information Rate : 0.5038
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3069
##
## Mcnemar's Test P-Value : 0.0001643
##
## Sensitivity : 0.6885
## Specificity : 0.6186
## Pos Pred Value : 0.6400
## Neg Pred Value : 0.6685
## Prevalence : 0.4962
## Detection Rate : 0.3416
## Detection Prevalence : 0.5338
## Balanced Accuracy : 0.6536
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1209 680
## 1 547 1103
##
## Accuracy : 0.6533
## 95% CI : (0.6373, 0.669)
## No Information Rate : 0.5038
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3069
##
## Mcnemar's Test P-Value : 0.0001643
##
## Sensitivity : 0.6186
## Specificity : 0.6885
## Pos Pred Value : 0.6685
## Neg Pred Value : 0.6400
## Precision : 0.6685
## Recall : 0.6186
## F1 : 0.6426
## Prevalence : 0.5038
## Detection Rate : 0.3117
## Detection Prevalence : 0.4662
## Balanced Accuracy : 0.6536
##
## 'Positive' Class : 1
##
fitControl <- trainControl(method="cv",
number = 5,
preProcOptions = list(thresh = 0.99), # threshold for pca preprocess
classProbs = TRUE,
summaryFunction = twoClassSummary)
train$isFraud <- factor(train$isFraud)
model_knn <- train(make.names(isFraud)~.,
train,
method="knn",
metric="ROC",
preProcess = c('center', 'scale'),
tuneLength=10,
trControl=fitControl)
## [1] 0.6590938
raw.data = df_under
nrows <- nrow(raw.data)
set.seed(314)
indexT <- sample(1:nrow(raw.data), 0.8 * nrows)
#separate train and validation set
trainset = raw.data[indexT,]
verset = raw.data[-indexT,]
train = trainset
test = verset
library(rpart) # for regression trees
library(randomForest) # for random forests
# train a decision tree based on our dataset
tree.model <- rpart(isFraud ~ ., data = train)
# plot our regression tree
plot(tree.model, uniform=TRUE)
# add text labels & make them 60% as big as they are by default
text(tree.model, cex=.6)
raw.data = df_under
nrows <- nrow(raw.data)
set.seed(314)
indexT <- sample(1:nrow(raw.data), 0.8 * nrows)
#separate train and validation set
trainset = raw.data[indexT,]
verset = raw.data[-indexT,]
n <- names(trainset)
rf.form <- as.formula(paste("isFraud ~", paste(n[!n %in% "isFraud"], collapse = " + ")))
trainset.rf <- randomForest(rf.form,trainset,ntree=100,importance=T)
###XGBoost is Boot straped unsampled technique which uses variuos decision tree all merged together helps in providing a better accuracy.
## [1] test-auc:0.693403
## Will train until test_auc hasn't improved in 40 rounds.
##
## [21] test-auc:0.717561
## [41] test-auc:0.725942
## [61] test-auc:0.729602
## [81] test-auc:0.733082
## [101] test-auc:0.735312
## [121] test-auc:0.736935
## [141] test-auc:0.738546
## [161] test-auc:0.739718
## [181] test-auc:0.740832
## [201] test-auc:0.741978
## [221] test-auc:0.742289
## [241] test-auc:0.743184
## [261] test-auc:0.743543
## [281] test-auc:0.744220
## [301] test-auc:0.744372
## [321] test-auc:0.745138
## [341] test-auc:0.745831
## [361] test-auc:0.745897
## [381] test-auc:0.746428
## [401] test-auc:0.747213
## [421] test-auc:0.747397
## [441] test-auc:0.747453
## [461] test-auc:0.747624
## [481] test-auc:0.747817
## [500] test-auc:0.747818
## [1] test-auc:0.693403
## Will train until test_auc hasn't improved in 40 rounds.
##
## [21] test-auc:0.717561
## [41] test-auc:0.725942
## [61] test-auc:0.729602
## [81] test-auc:0.733082
## [101] test-auc:0.735312
## [121] test-auc:0.736935
## [141] test-auc:0.738546
## [161] test-auc:0.739718
## [181] test-auc:0.740832
## [201] test-auc:0.741978
## [221] test-auc:0.742289
## [241] test-auc:0.743184
## [261] test-auc:0.743543
## [281] test-auc:0.744220
## [301] test-auc:0.744372
## [321] test-auc:0.745138
## [341] test-auc:0.745831
## [361] test-auc:0.745897
## [381] test-auc:0.746428
## [401] test-auc:0.747213
## [421] test-auc:0.747397
## [441] test-auc:0.747453
## [461] test-auc:0.747624
## [481] test-auc:0.747817
## [500] test-auc:0.747818
## [1] test-auc:0.693403
## Will train until test_auc hasn't improved in 40 rounds.
##
## [21] test-auc:0.717561
## [41] test-auc:0.725942
## [61] test-auc:0.729602
## [81] test-auc:0.733082
## [101] test-auc:0.735312
## [121] test-auc:0.736935
## [141] test-auc:0.738546
## [161] test-auc:0.739718
## [181] test-auc:0.740832
## [201] test-auc:0.741978
## [221] test-auc:0.742289
## [241] test-auc:0.743184
## [261] test-auc:0.743543
## [281] test-auc:0.744220
## [301] test-auc:0.744372
## [321] test-auc:0.745138
## [341] test-auc:0.745831
## [361] test-auc:0.745897
## [381] test-auc:0.746428
## [401] test-auc:0.747213
## [421] test-auc:0.747397
## [441] test-auc:0.747453
## [461] test-auc:0.747624
## [481] test-auc:0.747817
## [500] test-auc:0.747818
## [1] test-auc:0.693403
## Will train until test_auc hasn't improved in 40 rounds.
##
## [21] test-auc:0.717561
## [41] test-auc:0.725942
## [61] test-auc:0.729602
## [81] test-auc:0.733082
## [101] test-auc:0.735312
## [121] test-auc:0.736935
## [141] test-auc:0.738546
## [161] test-auc:0.739718
## [181] test-auc:0.740832
## [201] test-auc:0.741978
## [221] test-auc:0.742289
## [241] test-auc:0.743184
## [261] test-auc:0.743543
## [281] test-auc:0.744220
## [301] test-auc:0.744372
## [321] test-auc:0.745138
## [341] test-auc:0.745831
## [361] test-auc:0.745897
## [381] test-auc:0.746428
## [401] test-auc:0.747213
## [421] test-auc:0.747397
## [441] test-auc:0.747453
## [461] test-auc:0.747624
## [481] test-auc:0.747817
## [500] test-auc:0.747818
## [1] test-auc:0.693403
## Will train until test_auc hasn't improved in 40 rounds.
##
## [21] test-auc:0.717561
## [41] test-auc:0.725942
## [61] test-auc:0.729602
## [81] test-auc:0.733082
## [101] test-auc:0.735312
## [121] test-auc:0.736935
## [141] test-auc:0.738546
## [161] test-auc:0.739718
## [181] test-auc:0.740832
## [201] test-auc:0.741978
## [221] test-auc:0.742289
## [241] test-auc:0.743184
## [261] test-auc:0.743543
## [281] test-auc:0.744220
## [301] test-auc:0.744372
## [321] test-auc:0.745138
## [341] test-auc:0.745831
## [361] test-auc:0.745897
## [381] test-auc:0.746428
## [401] test-auc:0.747213
## [421] test-auc:0.747397
## [441] test-auc:0.747453
## [461] test-auc:0.747624
## [481] test-auc:0.747817
## [500] test-auc:0.747818
## Unit: seconds
## expr
## xgb.model.speed <- xgb.train(data = xgb.data.train, params = list(objective = "binary:logistic", eta = 0.1, max.depth = 3, min_child_weight = 100, subsample = 1, colsample_bytree = 1, nthread = 3, eval_metric = "auc"), watchlist = list(test = xgb.data.test), nrounds = 500, early_stopping_rounds = 40, print_every_n = 20)
## min lq mean median uq max neval
## 2.649612 2.654635 2.674109 2.665423 2.668082 2.732791 5
## NULL
## [1] 0.7478539
First 4 PCA are explaining more than 55% of the variance in the dataset, where first two are really significant.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.4445 1.4019 1.2840 1.09533 1.05290 1.00436 1.00003
## Proportion of Variance 0.1391 0.1310 0.1099 0.07998 0.07391 0.06725 0.06667
## Cumulative Proportion 0.1391 0.2701 0.3800 0.46001 0.53392 0.60116 0.66783
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.99841 0.98256 0.97034 0.95195 0.84932 0.64800 0.17661
## Proportion of Variance 0.06646 0.06436 0.06277 0.06041 0.04809 0.02799 0.00208
## Cumulative Proportion 0.73429 0.79865 0.86142 0.92184 0.96993 0.99792 1.00000
## PC15
## Standard deviation 0.0000000000001122
## Proportion of Variance 0.0000000000000000
## Cumulative Proportion 1.0000000000000000
We carried out multiple modules of data science pipeline which involved: Data Visualizations, Descriptive Statistics, Prescriptive Statistics, Data wrangling, Feature engineering, Predictive Modelling, Interpretation of the model, comparing results and creating impactful solution for our use-case which can be used for decision making.
Classification is the best choice whenever we have a given class label to model around.
We checked on various Metrics such as Precision, Recall, Accuracy, F1 Scores and Balanced Accuracy. Our Best Model’s test performance is on XGBOOST Classifier which gives 75% accuracy, which is state-of-the-art for fraud systems to put the model in production.
Logistic regression and K-Nearest Neigbour are always the best baselines to compare-with. Here we got significant 65% accuracy with these models.
Here, Tree Based Classifiers are best in terms of interpretation of the end results. Because in real world scenarios, detecting fraud is just one part. But what actions to take to handle and optimize the processes, these tree structures help through the attributes hierarchy that which actions can be taken and some buckets can be made to start with.
Principal Component Analysis, an unsupervised technique, which is used for lower-dimensional visualization of the data and check for variance explainability.
Neural Networks are not used for simplicity and not getting into black boxed domain as interpretations of the results become harder. Else, a multi-layer perceptron could also be trained, would not recommend CNN or RNN for this use case. Though some people use RNNs for contexutal understanding in the fraud behavior.
Tree based classifiers have best interpretability and it can directly impact the business because it is the kind of models which can be used directly interpret and comprehend. So that can make decisions and build strategies further.
Clustering can also be done where labels are not required to get other meaningful insights out of data.
In the data set we had clearly seen that there were some Customer IDs which were highly fraudulent. So, a user-profiling based statistical model would significantly improve the results.
Nothing beats the quality of data. So running test-statistics could also help in analyzing the data further.
Important data acquisition ( either by client, business teams or an external API) and rigorous feature engineering.
Better way to fill the missing values in columns, based on the distribution of the data column
Better way to handle outliers. In case of large number of records people use Auto-encoders as well.
Exploring neural networks side for modeling and particularly Recrrent Neural Networks to see if there can be built context between next and previous fraudulent transactions.
Checking for data quality in terms of days like Cyber-Monday or Thanksgiving or any sales.
Time series based anomaly detection in the dataset and then modeling it appropriately
======================================================================================================================================