Team: TAYYAB BIN TAHIR, HARSHA SIDDAGANGAIAH

===============================================

1. Data Description

Every data has a story to tell and our data story is about Digital Transformation and Fraud Mitigation
The data set used is open source transaction data from capital one.
The data set contains 29 attributes and 786363 transactions.
To gain a useful insight, the first step was to assess the structure of the data. Since this is a supervised machine learning problem, the data set offers information on whether the transaction is fraudulent or not for each data point.
All empty values were designated as null, and null columns were later removed during data cleaning, because they had no effect on the final prediction.
As the data was imbalanced and more biased towards non fraudulent data points, the best idea to run a classification models efficiently was to use a Down or Upper Sampling technique to balance the dataset.
After all the data preprocessing we were left with fewer data points. Many techniques have been applied to credit card fraud detection, since this is a classification problem, XGBoost, Random Forest and logistic regression, Decision Trees, K Nearest neighbours classifiers are used to process the data set.

2. Questions

List of top merchants in the transaction dataset
Most transactions Types
Most fradulant Merchant Types
Merchant Names where most Fraud happens
Account IDs which get most frauds in dataset
Frauds in Duplicated Transaction can expected to be more
Most Frauds that happen in Reversal Transactions of Duplicated Subset

Displayed Both Top and Bottom rows of the data

To have a better look on the structure and attributes before doing any data analysis

head(df, n = 5)

##   accountNumber customerId creditLimit availableMoney transactionDateTime
## 1     737265056  737265056        5000           5000 2016-08-13T14:27:32
## 2     737265056  737265056        5000           5000 2016-10-11T05:05:54
## 3     737265056  737265056        5000           5000 2016-11-08T09:18:39
## 4     737265056  737265056        5000           5000 2016-12-10T02:14:50
## 5     830329091  830329091        5000           5000 2016-03-24T21:04:46
##   transactionAmount        merchantName acqCountry merchantCountryCode
## 1             98.55                Uber         US                  US
## 2             74.51         AMC #191138         US                  US
## 3              7.47          Play Store         US                  US
## 4              7.47          Play Store         US                  US
## 5             71.18 Tim Hortons #947751         US                  US
##   posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1           02               01            rideshare        06/2023
## 2           09               01        entertainment        02/2024
## 3           09               01           mobileapps        08/2025
## 4           09               01           mobileapps        08/2025
## 5           02               01             fastfood        10/2029
##   accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1      2015-03-14              2015-03-14     414        414            1803
## 2      2015-03-14              2015-03-14     486        486             767
## 3      2015-03-14              2015-03-14     486        486             767
## 4      2015-03-14              2015-03-14     486        486             767
## 5      2015-08-06              2015-08-06     885        885            3143
##   transactionType echoBuffer currentBalance merchantCity merchantState
## 1        PURCHASE                         0                           
## 2        PURCHASE                         0                           
## 3        PURCHASE                         0                           
## 4        PURCHASE                         0                           
## 5        PURCHASE                         0                           
##   merchantZip cardPresent posOnPremises recurringAuthInd
## 1                   FALSE                               
## 2                    TRUE                               
## 3                   FALSE                               
## 4                   FALSE                               
## 5                    TRUE                               
##   expirationDateKeyInMatch isFraud
## 1                    FALSE   FALSE
## 2                    FALSE   FALSE
## 3                    FALSE   FALSE
## 4                    FALSE   FALSE
## 5                    FALSE   FALSE

Tail of dataframe

tail(df, n = 5)

##        accountNumber customerId creditLimit availableMoney transactionDateTime
## 786359     732852505  732852505       50000       48904.96 2016-12-22T18:44:12
## 786360     732852505  732852505       50000       48785.04 2016-12-25T16:20:34
## 786361     732852505  732852505       50000       48766.15 2016-12-27T15:46:24
## 786362     732852505  732852505       50000       48716.72 2016-12-29T00:30:55
## 786363     732852505  732852505       50000       48666.83 2016-12-30T20:10:29
##        transactionAmount merchantName acqCountry merchantCountryCode
## 786359            119.92         Lyft         US                  US
## 786360             18.89     hulu.com         US                  US
## 786361             49.43         Lyft         US                  US
## 786362             49.89  walmart.com         US                  US
## 786363             72.18         Uber         US                  US
##        posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 786359           90               01            rideshare        12/2022
## 786360           09               01 online_subscriptions        08/2023
## 786361           02               01            rideshare        08/2025
## 786362           09               99        online_retail        07/2022
## 786363           05               01            rideshare        05/2024
##        accountOpenDate dateOfLastAddressChange cardCVV enteredCVV
## 786359      2012-08-23              2012-08-23     936        936
## 786360      2012-08-23              2012-08-23     939        939
## 786361      2012-08-23              2012-08-23     936        936
## 786362      2012-08-23              2012-08-23     939        939
## 786363      2012-08-23              2012-08-23     939        939
##        cardLast4Digits transactionType echoBuffer currentBalance merchantCity
## 786359            3783        PURCHASE                   1095.04             
## 786360            3388        PURCHASE                   1214.96             
## 786361            3783        PURCHASE                   1233.85             
## 786362            3388        PURCHASE                   1283.28             
## 786363            3388        PURCHASE                   1333.17             
##        merchantState merchantZip cardPresent posOnPremises recurringAuthInd
## 786359                                 FALSE                               
## 786360                                 FALSE                               
## 786361                                 FALSE                               
## 786362                                 FALSE                               
## 786363                                 FALSE                               
##        expirationDateKeyInMatch isFraud
## 786359                    FALSE   FALSE
## 786360                    FALSE   FALSE
## 786361                    FALSE   FALSE
## 786362                    FALSE   FALSE
## 786363                    FALSE   FALSE

Summary helps in Analyzing the data at a glance

summary(df)

##  accountNumber       customerId         creditLimit    availableMoney 
##  Length:786363      Length:786363      Min.   :  250   Min.   :-1006  
##  Class :character   Class :character   1st Qu.: 5000   1st Qu.: 1077  
##  Mode  :character   Mode  :character   Median : 7500   Median : 3185  
##                                        Mean   :10759   Mean   : 6251  
##                                        3rd Qu.:15000   3rd Qu.: 7500  
##                                        Max.   :50000   Max.   :50000  
##  transactionDateTime transactionAmount merchantName        acqCountry       
##  Length:786363       Min.   :   0.00   Length:786363      Length:786363     
##  Class :character    1st Qu.:  33.65   Class :character   Class :character  
##  Mode  :character    Median :  87.90   Mode  :character   Mode  :character  
##                      Mean   : 136.99                                        
##                      3rd Qu.: 191.48                                        
##                      Max.   :2011.54                                        
##  merchantCountryCode posEntryMode       posConditionCode   merchantCategoryCode
##  Length:786363       Length:786363      Length:786363      Length:786363       
##  Class :character    Class :character   Class :character   Class :character    
##  Mode  :character    Mode  :character   Mode  :character   Mode  :character    
##                                                                                
##                                                                                
##                                                                                
##  currentExpDate     accountOpenDate    dateOfLastAddressChange
##  Length:786363      Length:786363      Length:786363          
##  Class :character   Class :character   Class :character       
##  Mode  :character   Mode  :character   Mode  :character       
##                                                               
##                                                               
##                                                               
##    cardCVV           enteredCVV        cardLast4Digits    transactionType   
##  Length:786363      Length:786363      Length:786363      Length:786363     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   echoBuffer        currentBalance    merchantCity       merchantState     
##  Length:786363      Min.   :    0.0   Length:786363      Length:786363     
##  Class :character   1st Qu.:  689.9   Class :character   Class :character  
##  Mode  :character   Median : 2451.8   Mode  :character   Mode  :character  
##                     Mean   : 4508.7                                        
##                     3rd Qu.: 5291.1                                        
##                     Max.   :47498.8                                        
##  merchantZip        cardPresent     posOnPremises      recurringAuthInd  
##  Length:786363      Mode :logical   Length:786363      Length:786363     
##  Class :character   FALSE:433495    Class :character   Class :character  
##  Mode  :character   TRUE :352868    Mode  :character   Mode  :character  
##                                                                          
##                                                                          
##                                                                          
##  expirationDateKeyInMatch  isFraud       
##  Mode :logical            Mode :logical  
##  FALSE:785320             FALSE:773946   
##  TRUE :1043               TRUE :12417    
##                                          
##                                          
##

Columns Names

colnames(df)

##  [1] "accountNumber"            "customerId"              
##  [3] "creditLimit"              "availableMoney"          
##  [5] "transactionDateTime"      "transactionAmount"       
##  [7] "merchantName"             "acqCountry"              
##  [9] "merchantCountryCode"      "posEntryMode"            
## [11] "posConditionCode"         "merchantCategoryCode"    
## [13] "currentExpDate"           "accountOpenDate"         
## [15] "dateOfLastAddressChange"  "cardCVV"                 
## [17] "enteredCVV"               "cardLast4Digits"         
## [19] "transactionType"          "echoBuffer"              
## [21] "currentBalance"           "merchantCity"            
## [23] "merchantState"            "merchantZip"             
## [25] "cardPresent"              "posOnPremises"           
## [27] "recurringAuthInd"         "expirationDateKeyInMatch"
## [29] "isFraud"

No of rows

nrow(df)

## [1] 786363

No of Columns

ncol(df)

## [1] 29

Number of Records = 786363

Number of Attributes/Columns = 29

dim(df)

## [1] 786363     29

Checking for Null values in data

Because either these are filled with an appropriate value with respect to each column or completely removed from the dataset, depending on a given use case

##            accountNumber               customerId              creditLimit 
##                        0                        0                        0 
##           availableMoney      transactionDateTime        transactionAmount 
##                        0                        0                        0 
##             merchantName               acqCountry      merchantCountryCode 
##                        0                     4562                      724 
##             posEntryMode         posConditionCode     merchantCategoryCode 
##                     4054                      409                        0 
##           currentExpDate          accountOpenDate  dateOfLastAddressChange 
##                        0                        0                        0 
##                  cardCVV               enteredCVV          cardLast4Digits 
##                        0                        0                        0 
##          transactionType               echoBuffer           currentBalance 
##                      698                   786363                        0 
##             merchantCity            merchantState              merchantZip 
##                   786363                   786363                   786363 
##              cardPresent            posOnPremises         recurringAuthInd 
##                        0                   786363                   786363 
## expirationDateKeyInMatch                  isFraud 
##                        0                        0

Checking for Unique values in data

Unique Values in each attribute also helps explore the structure of the dataset When an attribute has less number of Unique Values it is always a good criteria to look at them keenly to know more about the structure as a whole and how the other attributes are behaving around these small buckets

##            accountNumber               customerId              creditLimit 
##                     5000                     5000                       10 
##           availableMoney      transactionDateTime        transactionAmount 
##                   521916                   776637                    66038 
##             merchantName               acqCountry      merchantCountryCode 
##                     2490                        5                        5 
##             posEntryMode         posConditionCode     merchantCategoryCode 
##                        6                        4                       19 
##           currentExpDate          accountOpenDate  dateOfLastAddressChange 
##                      165                     1820                     2184 
##                  cardCVV               enteredCVV          cardLast4Digits 
##                      899                      976                     5246 
##          transactionType               echoBuffer           currentBalance 
##                        4                        1                   487318 
##             merchantCity            merchantState              merchantZip 
##                        1                        1                        1 
##              cardPresent            posOnPremises         recurringAuthInd 
##                        2                        1                        1 
## expirationDateKeyInMatch                  isFraud 
##                        2                        2

Here that most important attribute is “isFraud” which has 1.6% True Values

The most important attribute of any given dataset, which can be a label in a classification problem or a continuous value in regression analysis. In clustering it depends on the use case that which attributes are important to look around with.

##   isFraud      n
## 1   FALSE 773946
## 2    TRUE  12417

3. Visualization

Plots, Histograms, Box Plots, Correlation Matrix and Hypothesis about the Structure of Data

Converting True/False Booleans as 0s and 1s

df$isFraud <- as.integer(df$isFraud)
df$cardPresent <- as.integer(df$cardPresent)
df$expirationDateKeyInMatch <- as.integer(df$expirationDateKeyInMatch)

Buckets/Unique Values in Credit Card Limit are shown below:

Most of the Credit Cards have 5000 as their limit, on second number it is 15000. Interestingly, credit cards with 50,000 limit also exists in the dataset

Four Types of Merchant Country Code exists in the datset where US is a dominant entity as shown

5 POS Entery Mode exists where “05”, “09”, and “02” covers most of the values

3 POS Condition Mode exists where “01” is dominant among all

There are 3 transaction types:

1. Purchase

2. Reversal

3. Address Verification

Most values lie inside Purchase Type. But it is necessary to explore the other two to analyze what is happening in those transactions.

## >>> Suggestions
## PieChart(transactionType, hole=0)  # traditional pie chart
## PieChart(transactionType, values="%")  # display %'s on the chart
## PieChart(transactionType)  # bar chart
## Plot(transactionType)  # bubble plot
## Plot(transactionType, values="count")  # lollipop plot 
## 
## --- transactionType --- 
## 
##                ADDRESS_VERIFICATION  PURCHASE  REVERSAL        Total 
## Frequencies:                  20169    745193     20303       785665 
## Proportions:                  0.026     0.948     0.026        1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 1337879.797, df = 2, p-value = 0.000

If card is Present during the transaction

plays an important role and dataset is fairly distributed among the two values as True/False

Expiration Date Key In Match is highly skewed to 0 value

Count of expirationDateKeyInMatch

##   expirationDateKeyInMatch      n
## 1                        0 785320
## 2                        1   1043

Plot showing frequency of Fraud

As described earlier is clearly an example of IMBALANCED DATSET with respect to this attribute. But It is expected to be like that in real world scenario as Frauds are always less than 5% with respect to whole dataset. Here it is 1.6%

4. Data Analysis

Deep Diving in Fraudulent Aspects to find correlation and causation

Question: List of top merchants

Here we can clearly see which are the Top Merchants:

To know more about the transaction data set it is important to know the Top Merchants where most of the Transactions happen. It helps in Optimizing the metrics and to prioritize the business and data strategies accordingly.

1. Uber

2. Lyft

3. oldnavy.com

4. staples.com

5. alibaba.com

6. apple.com

7. walmart.com

8. cheapfast.com

9. ebay.com

10. target.com

##     merchantName     n
## 1           Uber 25613
## 2           Lyft 25523
## 3    oldnavy.com 16992
## 4    staples.com 16980
## 5    alibaba.com 16959
## 6      apple.com 16898
## 7    walmart.com 16873
## 8  cheapfast.com 16858
## 9       ebay.com 16842
## 10    target.com 16813
## 11    amazon.com 16780
## 12     sears.com 16755
## 13  discount.com 16731
## 14       gap.com 16675

Question: Most transactions Types

Here we can clearly see that most transactions happen at:

1. Online Retail

2. Food

3. Entertainment

4. Online Gifts

6. Hotels

Let’s explore Numerical Valued Attributes in Credit Card Transactions these attributes hold high importance

##   creditLimit availableMoney transactionAmount currentBalance
## 1        5000           5000             98.55              0
## 2        5000           5000             74.51              0
## 3        5000           5000              7.47              0
## 4        5000           5000              7.47              0
## 5        5000           5000             71.18              0

Separatelty describing these numerical attributes to get the essence of data and its distribution.

Here we can see Available Money can also go Negative, which is -1005.63 as minimum value

##   creditLimit    availableMoney  transactionAmount currentBalance   
##  Min.   :  250   Min.   :-1006   Min.   :   0.00   Min.   :    0.0  
##  1st Qu.: 5000   1st Qu.: 1077   1st Qu.:  33.65   1st Qu.:  689.9  
##  Median : 7500   Median : 3185   Median :  87.90   Median : 2451.8  
##  Mean   :10759   Mean   : 6251   Mean   : 136.99   Mean   : 4508.7  
##  3rd Qu.:15000   3rd Qu.: 7500   3rd Qu.: 191.48   3rd Qu.: 5291.1  
##  Max.   :50000   Max.   :50000   Max.   :2011.54   Max.   :47498.8

Most of the Credit limits are under 10,000.

But some goes till 20,000 and 50,000 as well.

While data Modelling this information can be really useful

Available Money mostly lie under the buckets of less than 10,000 and is Right Skewed, which actually makes sense

Interestingly Transaction amount mostly falls under 500 and is Right Skewed

Current Balance

Box Plots help analyzing the Outliers in the datasets and the skewness

It shows how these Numerical Atrributes are Right Skewed

creditLimit

AvailableMoney

TransactionAmount

Current Balance

Question: Most fradulant Merchant Types

Merchant Type with highest fraud transaction

##      merchantCategoryCode
## 695                health
## 972         online_retail
## 1073               health
## 1114        online_retail
## 1322        online_retail
## 1346            rideshare
## 1347            rideshare
## 1368        online_retail
## 1417        online_retail
## 1624        online_retail

The graph below shows those Merchants where most of the fraudulent transactions happen.

Top 3 are:

1. Online Retail

2. Online Gifts

Question:

The Merchant Names where most Fraud happens:

Top 5 are:

1. Lyft

2. ebay.com

3. Fresh Flowers

4. Uber

5. walmart.com

##     merchantName   n
## 1           Lyft 760
## 2       ebay.com 639
## 3  Fresh Flowers 553
## 4           Uber 512
## 5    walmart.com 446
## 6  cheapfast.com 442
## 7      sears.com 422
## 8    oldnavy.com 401
## 9    staples.com 399
## 10   alibaba.com 383
## 11    amazon.com 372
## 12       gap.com 369
## 13    target.com 362
## 14     apple.com 357
## 15  discount.com 346

Question:

The account IDs which get most frauds can also be seen through dataset.

Top 3 Account IDs are:

1. 380680241

2. 782081187

3. 246251253

##   accountNumber   n
## 1     380680241 783
## 2     782081187 307
## 3     246251253 278
## 4     700725639 272
## 5     472288969 266
## 6     208319653 211
## 7     419709514 207

Account IDs and Customer IDs are the same.

##   customerId   n
## 1  380680241 783
## 2  782081187 307
## 3  246251253 278
## 4  700725639 272
## 5  472288969 266
## 6  208319653 211
## 7  419709514 207

Correlation Matrix between Numerical Attributes

tells important infomration about the given process.

5. Data Wrangling

Till know we didn’t remove any empty Columns. Now removing the empty columns

Data frame after dropping 6 columns

##   accountNumber customerId creditLimit availableMoney transactionDateTime
## 1     737265056  737265056        5000           5000 2016-08-13T14:27:32
## 2     737265056  737265056        5000           5000 2016-10-11T05:05:54
## 3     737265056  737265056        5000           5000 2016-11-08T09:18:39
## 4     737265056  737265056        5000           5000 2016-12-10T02:14:50
## 5     830329091  830329091        5000           5000 2016-03-24T21:04:46
##   transactionAmount        merchantName acqCountry merchantCountryCode
## 1             98.55                Uber         US                  US
## 2             74.51         AMC #191138         US                  US
## 3              7.47          Play Store         US                  US
## 4              7.47          Play Store         US                  US
## 5             71.18 Tim Hortons #947751         US                  US
##   posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1           02               01            rideshare        06/2023
## 2           09               01        entertainment        02/2024
## 3           09               01           mobileapps        08/2025
## 4           09               01           mobileapps        08/2025
## 5           02               01             fastfood        10/2029
##   accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1      2015-03-14              2015-03-14     414        414            1803
## 2      2015-03-14              2015-03-14     486        486             767
## 3      2015-03-14              2015-03-14     486        486             767
## 4      2015-03-14              2015-03-14     486        486             767
## 5      2015-08-06              2015-08-06     885        885            3143
##   transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1        PURCHASE              0           0                        0       0
## 2        PURCHASE              0           1                        0       0
## 3        PURCHASE              0           0                        0       0
## 4        PURCHASE              0           0                        0       0
## 5        PURCHASE              0           1                        0       0

The assumption is Duplicate Transactions are any record other than Address Verification because it has Transaction Amount = 0

Adding Date Column in the Data Frame to computer Results independent of just Date-Time

head(df_2, n = 5)

##   accountNumber customerId creditLimit availableMoney transactionDateTime
## 1     737265056  737265056        5000           5000 2016-08-13T14:27:32
## 2     737265056  737265056        5000           5000 2016-10-11T05:05:54
## 3     737265056  737265056        5000           5000 2016-11-08T09:18:39
## 4     737265056  737265056        5000           5000 2016-12-10T02:14:50
## 5     830329091  830329091        5000           5000 2016-03-24T21:04:46
##   transactionAmount        merchantName acqCountry merchantCountryCode
## 1             98.55                Uber         US                  US
## 2             74.51         AMC #191138         US                  US
## 3              7.47          Play Store         US                  US
## 4              7.47          Play Store         US                  US
## 5             71.18 Tim Hortons #947751         US                  US
##   posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1           02               01            rideshare        06/2023
## 2           09               01        entertainment        02/2024
## 3           09               01           mobileapps        08/2025
## 4           09               01           mobileapps        08/2025
## 5           02               01             fastfood        10/2029
##   accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1      2015-03-14              2015-03-14     414        414            1803
## 2      2015-03-14              2015-03-14     486        486             767
## 3      2015-03-14              2015-03-14     486        486             767
## 4      2015-03-14              2015-03-14     486        486             767
## 5      2015-08-06              2015-08-06     885        885            3143
##   transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1        PURCHASE              0           0                        0       0
## 2        PURCHASE              0           1                        0       0
## 3        PURCHASE              0           0                        0       0
## 4        PURCHASE              0           0                        0       0
## 5        PURCHASE              0           1                        0       0
##   transactionDate
## 1      2016-08-13
## 2      2016-10-11
## 3      2016-11-08
## 4      2016-12-10
## 5      2016-03-24

Duplicate transaction

Here we are taking those attributes which can uniquely identify and help in indentifying a duplicate transaction at a given point:

The attributes used are:

accountNumber (It will be unique for that customer)
transactionAmount (the amount needs to be check for duplicated occurances)
merchantName (where the transaction is actually taking place) 4 .acqCountry (the country where it can take place)
accountOpenDate (account open date will be same)
merchantCategoryCode (Sometimes, data do not come as we expect it to be, where merchantName can come as an empty index, then this merchantCategoryCode will be most useful in that scenario)
cardLast4Digits (always help in analyzing duplicate tansactions)

Duplicated Trnsactions = 100716

##    accountNumber customerId creditLimit availableMoney transactionDateTime
## 3      737265056  737265056        5000        5000.00 2016-11-08T09:18:39
## 4      737265056  737265056        5000        5000.00 2016-12-10T02:14:50
## 12     830329091  830329091        5000        5000.00 2016-12-07T16:34:04
## 13     830329091  830329091        5000        4959.25 2016-12-14T10:00:35
## 14     830329091  830329091        5000        4918.50 2016-12-20T18:38:23
##    transactionAmount merchantName acqCountry merchantCountryCode posEntryMode
## 3               7.47   Play Store         US                  US           09
## 4               7.47   Play Store         US                  US           09
## 12             40.75    GreenCook         US                  US           09
## 13             40.75    GreenCook         US                  US           09
## 14             40.75    GreenCook         US                  US           09
##    posConditionCode merchantCategoryCode currentExpDate accountOpenDate
## 3                01           mobileapps        08/2025      2015-03-14
## 4                01           mobileapps        08/2025      2015-03-14
## 12               01        food_delivery        08/2024      2015-08-06
## 13               01        food_delivery        08/2024      2015-08-06
## 14               01        food_delivery        08/2024      2015-08-06
##    dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits transactionType
## 3               2015-03-14     486        486             767        PURCHASE
## 4               2015-03-14     486        486             767        PURCHASE
## 12              2015-08-06     885        885            3143        PURCHASE
## 13              2015-08-06     885        885            3143        PURCHASE
## 14              2015-08-06     885        885            3143        PURCHASE
##    currentBalance cardPresent expirationDateKeyInMatch isFraud transactionDate
## 3            0.00           0                        0       0      2016-11-08
## 4            0.00           0                        0       0      2016-12-10
## 12           0.00           0                        0       0      2016-12-07
## 13          40.75           0                        0       0      2016-12-14
## 14          81.50           0                        0       0      2016-12-20

dim(duplicate_trans)

## [1] 100716     24

Question:

Frauds in Duplicated Transaction can expected to be more.

The Top 3 Merhcants where duplicated transactions happened and Frauds took place are:

1. Fresh Flowers

2. Lyft

3. ebay.com

##                     merchantName  n
## 1                  Fresh Flowers 69
## 2                           Lyft 58
## 3                       ebay.com 46
## 4                    walmart.com 39
## 5              American Airlines 34
## 6                  cheapfast.com 34
## 7                           Uber 34
## 8                      sears.com 32
## 9  Blue Mountain Online Services 31
## 10                   oldnavy.com 30

Reversal Transactions can be found directly either by using the attribute Transaction Type.

That comes out to be Reversal = 20303

Transaction Amount of Reversal Transaction = 2821792

nrow(subset(df_2, transactionType == 'REVERSAL'))

## [1] 20303

 sum(subset(df_2, transactionType == 'REVERSAL')$transactionAmount)

## [1] 2821792

Merchnats where Frauds took place according to the given attribute transaction type = Reversal in the original dataset:

Most Frauds Happening within Original data and reversal transaction type are at these places:

1. Lyft

2. walmart.com

3. gap.com

##         merchantName  n
## 1               Lyft 26
## 2        walmart.com 16
## 3      Fresh Flowers 15
## 4            gap.com 15
## 5  American Airlines 14
## 6      cheapfast.com 13
## 7          sears.com 12
## 8          apple.com 11
## 9           ebay.com 11
## 10       staples.com 11
## 11       oldnavy.com 10

Another method to find Reversal Transaction is to look directly into the duplicated transactions:

Reversal Transactions in Duplicated records = 17826

Transaction Amount for Reversal Transactions in Duplicated records = 2669860

nrow(subset(duplicate_trans, transactionType == 'REVERSAL'))

## [1] 17826

 sum(subset(duplicate_trans, transactionType == 'REVERSAL')$transactionAmount)

## [1] 2669860

Question:

Most Frauds that happen in Reversal Transactions of Duplicated Subset are for these Merchants:

1. Lyft

2. Fresh Flowers

3. Walmart.com

##         merchantName  n
## 1               Lyft 23
## 2      Fresh Flowers 14
## 3            gap.com 13
## 4        walmart.com 13
## 5          apple.com 11
## 6      cheapfast.com 11
## 7           ebay.com 11
## 8        staples.com 11
## 9  American Airlines 10
## 10       oldnavy.com 10

Removing Outliers

Interquartile Ranges are generally used for removing the outliers in a numerical attribute

##   accountNumber customerId creditLimit availableMoney transactionDateTime
## 1     737265056  737265056        5000           5000 2016-08-13T14:27:32
## 2     737265056  737265056        5000           5000 2016-10-11T05:05:54
## 3     737265056  737265056        5000           5000 2016-11-08T09:18:39
## 4     737265056  737265056        5000           5000 2016-12-10T02:14:50
## 5     830329091  830329091        5000           5000 2016-03-24T21:04:46
##   transactionAmount        merchantName acqCountry merchantCountryCode
## 1             98.55                Uber         US                  US
## 2             74.51         AMC #191138         US                  US
## 3              7.47          Play Store         US                  US
## 4              7.47          Play Store         US                  US
## 5             71.18 Tim Hortons #947751         US                  US
##   posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1           02               01            rideshare        06/2023
## 2           09               01        entertainment        02/2024
## 3           09               01           mobileapps        08/2025
## 4           09               01           mobileapps        08/2025
## 5           02               01             fastfood        10/2029
##   accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1      2015-03-14              2015-03-14     414        414            1803
## 2      2015-03-14              2015-03-14     486        486             767
## 3      2015-03-14              2015-03-14     486        486             767
## 4      2015-03-14              2015-03-14     486        486             767
## 5      2015-08-06              2015-08-06     885        885            3143
##   transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1        PURCHASE              0           0                        0       0
## 2        PURCHASE              0           1                        0       0
## 3        PURCHASE              0           0                        0       0
## 4        PURCHASE              0           0                        0       0
## 5        PURCHASE              0           1                        0       0

We can see which buckets are removed from the dataset, as for Models to generalize the dataset well it is advised to remove outliers. But in some use cases we avoid that as well

Histogram after removing outliers

Credit limit

Available Money

Transaction Amount

Current Balance

Box Plots are again drawn to help visualize how outliers were impacting the overall skewness

Credit Limit

Available Money

Transaction Amount

Current Balance

Dimension of Data frame after removing Outliers

dim(df)

## [1] 617105     23

5. Feature Engineering

Making an additional Column derived from the equality of CardCVV and EnteredCVV as shown

df$CVVMatched <-  df$cardCVV == df$enteredCVV

Transormation from Categorical and Booleans to Numerical

df$CVVMatched <- as.integer(df$CVVMatched)

dim(df)

## [1] 617105     24

Four times that represent morning, afternoon/evening, night and late/mid night are also added as feature

df_time <- data.frame(df)

head(df_time,n=5 )

##   accountNumber customerId creditLimit availableMoney transactionDateTime
## 1     737265056  737265056        5000           5000 2016-08-13T14:27:32
## 2     737265056  737265056        5000           5000 2016-10-11T05:05:54
## 3     737265056  737265056        5000           5000 2016-11-08T09:18:39
## 4     737265056  737265056        5000           5000 2016-12-10T02:14:50
## 5     830329091  830329091        5000           5000 2016-03-24T21:04:46
##   transactionAmount        merchantName acqCountry merchantCountryCode
## 1             98.55                Uber         US                  US
## 2             74.51         AMC #191138         US                  US
## 3              7.47          Play Store         US                  US
## 4              7.47          Play Store         US                  US
## 5             71.18 Tim Hortons #947751         US                  US
##   posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1           02               01            rideshare        06/2023
## 2           09               01        entertainment        02/2024
## 3           09               01           mobileapps        08/2025
## 4           09               01           mobileapps        08/2025
## 5           02               01             fastfood        10/2029
##   accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1      2015-03-14              2015-03-14     414        414            1803
## 2      2015-03-14              2015-03-14     486        486             767
## 3      2015-03-14              2015-03-14     486        486             767
## 4      2015-03-14              2015-03-14     486        486             767
## 5      2015-08-06              2015-08-06     885        885            3143
##   transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1        PURCHASE              0           0                        0       0
## 2        PURCHASE              0           1                        0       0
## 3        PURCHASE              0           0                        0       0
## 4        PURCHASE              0           0                        0       0
## 5        PURCHASE              0           1                        0       0
##   CVVMatched
## 1          1
## 2          1
## 3          1
## 4          1
## 5          1

df$transactionTimeCat = df_time$time_period

head(df, n=5)

##   accountNumber customerId creditLimit availableMoney transactionDateTime
## 1     737265056  737265056        5000           5000 2016-08-13T14:27:32
## 2     737265056  737265056        5000           5000 2016-10-11T05:05:54
## 3     737265056  737265056        5000           5000 2016-11-08T09:18:39
## 4     737265056  737265056        5000           5000 2016-12-10T02:14:50
## 5     830329091  830329091        5000           5000 2016-03-24T21:04:46
##   transactionAmount        merchantName acqCountry merchantCountryCode
## 1             98.55                Uber         US                  US
## 2             74.51         AMC #191138         US                  US
## 3              7.47          Play Store         US                  US
## 4              7.47          Play Store         US                  US
## 5             71.18 Tim Hortons #947751         US                  US
##   posEntryMode posConditionCode merchantCategoryCode currentExpDate
## 1           02               01            rideshare        06/2023
## 2           09               01        entertainment        02/2024
## 3           09               01           mobileapps        08/2025
## 4           09               01           mobileapps        08/2025
## 5           02               01             fastfood        10/2029
##   accountOpenDate dateOfLastAddressChange cardCVV enteredCVV cardLast4Digits
## 1      2015-03-14              2015-03-14     414        414            1803
## 2      2015-03-14              2015-03-14     486        486             767
## 3      2015-03-14              2015-03-14     486        486             767
## 4      2015-03-14              2015-03-14     486        486             767
## 5      2015-08-06              2015-08-06     885        885            3143
##   transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1        PURCHASE              0           0                        0       0
## 2        PURCHASE              0           1                        0       0
## 3        PURCHASE              0           0                        0       0
## 4        PURCHASE              0           0                        0       0
## 5        PURCHASE              0           1                        0       0
##   CVVMatched transactionTimeCat
## 1          1                  1
## 2          1                  3
## 3          1                  0
## 4          1                  3
## 5          1                  2

No of Transaction based on Four timing range

count(df, transactionTimeCat)

##   transactionTimeCat      n
## 1                  0 128254
## 2                  1 154273
## 3                  2 154956
## 4                  3 179622

Count of CVV Matched and not mached

count(df, CVVMatched)

##   CVVMatched      n
## 1          0   5560
## 2          1 611545

The Final Numbers of Columns Selected are 16, we have removed the unique identifiers and empty columns.

Further Principal Component Analysis can also be tried to reduce the dimensions further but for the given case it is not required.

Variance of the dataset is still maintained to be 99%

## [1] "Final total columns"

## [1] 16

Checking if null values exists in columns

colSums(is.na(df))

##              creditLimit           availableMoney        transactionAmount 
##                        0                        0                        0 
##             merchantName               acqCountry      merchantCountryCode 
##                        0                        0                        0 
##             posEntryMode         posConditionCode     merchantCategoryCode 
##                        0                        0                        0 
##          transactionType           currentBalance              cardPresent 
##                        0                        0                        0 
## expirationDateKeyInMatch                  isFraud               CVVMatched 
##                        0                        0                        0 
##       transactionTimeCat 
##                        0

head(df, n=5)

##   creditLimit availableMoney transactionAmount merchantName acqCountry
## 1        5000           5000             98.55            0          0
## 2        5000           5000             74.51            1          0
## 3        5000           5000              7.47            2          0
## 4        5000           5000              7.47            2          0
## 5        5000           5000             71.18            3          0
##   merchantCountryCode posEntryMode posConditionCode merchantCategoryCode
## 1                   0            0                0                    0
## 2                   0            1                0                    1
## 3                   0            1                0                    2
## 4                   0            1                0                    2
## 5                   0            0                0                    3
##   transactionType currentBalance cardPresent expirationDateKeyInMatch isFraud
## 1               0              0           0                        0       0
## 2               0              0           1                        0       0
## 3               0              0           0                        0       0
## 4               0              0           0                        0       0
## 5               0              0           1                        0       0
##   CVVMatched transactionTimeCat
## 1          1                  1
## 2          1                  3
## 3          1                  0
## 4          1                  3
## 5          1                  2

7. Modeling

As the Data is Imbalanced, so to run classification models efficiently it is advised to use a Down or Upper Sampling technique to balance the dataset for better results

Using OverSampling Technique

table(rose_train$isFraud)

## 
##      0      1 
## 309174 307931

Using UnderSampling Technique

table(df_under$isFraud)

## 
##    0    1 
## 8847 8847

For Model Evaluation Always Train the model on Training dataset and Verify on test data sets.

Logistic Regression

Logistic regression model used for the probability threshold to determine if we classify that element in one of two groups. This model is used as a base line to compare-with for this work.

## 
## Call:
## glm(formula = isFraud ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0817  -1.0694  -0.5585   1.0963   2.0292  
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate   Std. Error z value             Pr(>|z|)
## (Intercept)               0.203188709  0.168595143   1.205              0.22813
## creditLimit               0.000011916  0.000006803   1.752              0.07984
## availableMoney           -0.000012060  0.000008381  -1.439              0.15017
## transactionAmount         0.004758081  0.000169930  28.000 < 0.0000000000000002
## merchantName             -0.000465575  0.000041221 -11.295 < 0.0000000000000002
## acqCountry                0.043301686  0.228171315   0.190              0.84948
## merchantCountryCode       0.084819309  0.233430307   0.363              0.71634
## posEntryMode             -0.113553464  0.017284093  -6.570      0.0000000000504
## posConditionCode         -0.075115314  0.039741254  -1.890              0.05874
## merchantCategoryCode      0.006528148  0.004391111   1.487              0.13710
## transactionType          -0.027811227  0.050030075  -0.556              0.57829
## currentBalance                     NA           NA      NA                   NA
## cardPresent              -0.420620035  0.041709599 -10.084 < 0.0000000000000002
## expirationDateKeyInMatch  0.795534257  0.514831351   1.545              0.12229
## CVVMatched               -0.433319209  0.159294140  -2.720              0.00652
## transactionTimeCat       -0.014773995  0.015988819  -0.924              0.35548
##                             
## (Intercept)                 
## creditLimit              .  
## availableMoney              
## transactionAmount        ***
## merchantName             ***
## acqCountry                  
## merchantCountryCode         
## posEntryMode             ***
## posConditionCode         .  
## merchantCategoryCode        
## transactionType             
## currentBalance              
## cardPresent              ***
## expirationDateKeyInMatch    
## CVVMatched               ** 
## transactionTimeCat          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19623  on 14154  degrees of freedom
## Residual deviance: 18139  on 14140  degrees of freedom
## AIC: 18169
## 
## Number of Fisher Scoring iterations: 4

Dimentions of Train data

## [1] 14155    16

Dimentions of Test data

## [1] 3539   16

##    predict_reg
##        0    1
##   0 1209  547
##   1  680 1103

## [1] "Accuracy = 0.65329189036451"

##   Var1 Freq
## 1    0 8847
## 2    1 8847

## [1] 0.6535584

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1209  680
##          1  547 1103
##                                                
##                Accuracy : 0.6533               
##                  95% CI : (0.6373, 0.669)      
##     No Information Rate : 0.5038               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3069               
##                                                
##  Mcnemar's Test P-Value : 0.0001643            
##                                                
##             Sensitivity : 0.6885               
##             Specificity : 0.6186               
##          Pos Pred Value : 0.6400               
##          Neg Pred Value : 0.6685               
##              Prevalence : 0.4962               
##          Detection Rate : 0.3416               
##    Detection Prevalence : 0.5338               
##       Balanced Accuracy : 0.6536               
##                                                
##        'Positive' Class : 0                    
##

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1209  680
##          1  547 1103
##                                                
##                Accuracy : 0.6533               
##                  95% CI : (0.6373, 0.669)      
##     No Information Rate : 0.5038               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3069               
##                                                
##  Mcnemar's Test P-Value : 0.0001643            
##                                                
##             Sensitivity : 0.6186               
##             Specificity : 0.6885               
##          Pos Pred Value : 0.6685               
##          Neg Pred Value : 0.6400               
##               Precision : 0.6685               
##                  Recall : 0.6186               
##                      F1 : 0.6426               
##              Prevalence : 0.5038               
##          Detection Rate : 0.3117               
##    Detection Prevalence : 0.4662               
##       Balanced Accuracy : 0.6536               
##                                                
##        'Positive' Class : 1                    
##

The Results of Logisitc Regression Classifier:

Accuracy = 0.6535 = 65.35%

KNN

This model help is determining how the nearest neighbour help in determining of fradulant transaction.

fitControl <- trainControl(method="cv",
                            number = 5,
                            preProcOptions = list(thresh = 0.99), # threshold for pca preprocess
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary)

train$isFraud <- factor(train$isFraud)

model_knn <- train(make.names(isFraud)~.,
                   train,
                   method="knn",
                   metric="ROC",
                   preProcess = c('center', 'scale'),
                   tuneLength=10,
                   trControl=fitControl)

## [1] 0.6590938

The Results of K nearest neighbour:

Accuracy = 0.6590 = 65.90%

Decision Tree

Decision tree provides us with the interpertability regarding the decision made over feature set which might lead to a good accuracy and its shows a clear heirarchy.

raw.data = df_under


nrows <- nrow(raw.data)
set.seed(314)
indexT <- sample(1:nrow(raw.data), 0.8 * nrows)
#separate train and validation set
trainset = raw.data[indexT,]
verset =   raw.data[-indexT,]

train = trainset
test = verset

library(rpart) # for regression trees
library(randomForest) # for random forests


# train a decision tree based on our dataset 
tree.model <- rpart(isFraud ~ ., data = train)

Decison Tree developed over dataset

# plot our regression tree 
plot(tree.model, uniform=TRUE)
# add text labels & make them 60% as big as they are by default
text(tree.model, cex=.6)

Random Forest

It provides the feature importance with the help of gini index, which helps in determining which feature influence more on the fradulent transactions.

raw.data = df_under


nrows <- nrow(raw.data)
set.seed(314)
indexT <- sample(1:nrow(raw.data), 0.8 * nrows)
#separate train and validation set
trainset = raw.data[indexT,]
verset =   raw.data[-indexT,]
n <- names(trainset)
rf.form <- as.formula(paste("isFraud ~", paste(n[!n %in% "isFraud"], collapse = " + ")))
trainset.rf <- randomForest(rf.form,trainset,ntree=100,importance=T)

Plots showing feature importance

XGboost

###XGBoost is Boot straped unsampled technique which uses variuos decision tree all merged together helps in providing a better accuracy.

## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818 
## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818 
## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818 
## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818 
## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818

## Unit: seconds
##                                                                                                                                                                                                                                                                                                                                                   expr
##  xgb.model.speed <- xgb.train(data = xgb.data.train, params = list(objective = "binary:logistic",      eta = 0.1, max.depth = 3, min_child_weight = 100, subsample = 1,      colsample_bytree = 1, nthread = 3, eval_metric = "auc"),      watchlist = list(test = xgb.data.test), nrounds = 500, early_stopping_rounds = 40,      print_every_n = 20)
##       min       lq     mean   median       uq      max neval
##  2.649612 2.654635 2.674109 2.665423 2.668082 2.732791     5

## NULL

## [1] 0.7478539

We can see that Grid XGBoost Classifier gave better results than the last one:

Accuracy = 0.7479 ~ 75%

It can clearly be seen that XGBoost Trees outperformed all the other classifiers

Principle Component Analysis

First 4 PCA are explaining more than 55% of the variance in the dataset, where first two are really significant.

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.4445 1.4019 1.2840 1.09533 1.05290 1.00436 1.00003
## Proportion of Variance 0.1391 0.1310 0.1099 0.07998 0.07391 0.06725 0.06667
## Cumulative Proportion  0.1391 0.2701 0.3800 0.46001 0.53392 0.60116 0.66783
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.99841 0.98256 0.97034 0.95195 0.84932 0.64800 0.17661
## Proportion of Variance 0.06646 0.06436 0.06277 0.06041 0.04809 0.02799 0.00208
## Cumulative Proportion  0.73429 0.79865 0.86142 0.92184 0.96993 0.99792 1.00000
##                                      PC15
## Standard deviation     0.0000000000001122
## Proportion of Variance 0.0000000000000000
## Cumulative Proportion  1.0000000000000000

8. Results and Impact

We carried out multiple modules of data science pipeline which involved: Data Visualizations, Descriptive Statistics, Prescriptive Statistics, Data wrangling, Feature engineering, Predictive Modelling, Interpretation of the model, comparing results and creating impactful solution for our use-case which can be used for decision making.
Classification is the best choice whenever we have a given class label to model around.
We checked on various Metrics such as Precision, Recall, Accuracy, F1 Scores and Balanced Accuracy. Our Best Model’s test performance is on XGBOOST Classifier which gives 75% accuracy, which is state-of-the-art for fraud systems to put the model in production.
Logistic regression and K-Nearest Neigbour are always the best baselines to compare-with. Here we got significant 65% accuracy with these models.
Here, Tree Based Classifiers are best in terms of interpretation of the end results. Because in real world scenarios, detecting fraud is just one part. But what actions to take to handle and optimize the processes, these tree structures help through the attributes hierarchy that which actions can be taken and some buckets can be made to start with.
Principal Component Analysis, an unsupervised technique, which is used for lower-dimensional visualization of the data and check for variance explainability.
Neural Networks are not used for simplicity and not getting into black boxed domain as interpretations of the results become harder. Else, a multi-layer perceptron could also be trained, would not recommend CNN or RNN for this use case. Though some people use RNNs for contexutal understanding in the fraud behavior.
Tree based classifiers have best interpretability and it can directly impact the business because it is the kind of models which can be used directly interpret and comprehend. So that can make decisions and build strategies further.

9. Future Deep Diving:

Clustering can also be done where labels are not required to get other meaningful insights out of data.
In the data set we had clearly seen that there were some Customer IDs which were highly fraudulent. So, a user-profiling based statistical model would significantly improve the results.
Nothing beats the quality of data. So running test-statistics could also help in analyzing the data further.
Important data acquisition ( either by client, business teams or an external API) and rigorous feature engineering.
Better way to fill the missing values in columns, based on the distribution of the data column
Better way to handle outliers. In case of large number of records people use Auto-encoders as well.
Exploring neural networks side for modeling and particularly Recrrent Neural Networks to see if there can be built context between next and previous fraudulent transactions.
Checking for data quality in terms of days like Cyber-Monday or Thanksgiving or any sales.
Time series based anomaly detection in the dataset and then modeling it appropriately

======================================================================================================================================

Data Analysis of Digital Payments and Fraud Detection

Team: TAYYAB BIN TAHIR, HARSHA SIDDAGANGAIAH

1. Data Description

2. Questions

Displayed Both Top and Bottom rows of the data

Tail of dataframe

Summary helps in Analyzing the data at a glance

Columns Names

No of rows

No of Columns

Number of Records = 786363

Number of Attributes/Columns = 29

Checking for Null values in data

Checking for Unique values in data

Here that most important attribute is “isFraud” which has 1.6% True Values

3. Visualization

Plots, Histograms, Box Plots, Correlation Matrix and Hypothesis about the Structure of Data

Converting True/False Booleans as 0s and 1s

Buckets/Unique Values in Credit Card Limit are shown below:

Four Types of Merchant Country Code exists in the datset where US is a dominant entity as shown

5 POS Entery Mode exists where “05”, “09”, and “02” covers most of the values

3 POS Condition Mode exists where “01” is dominant among all

There are 3 transaction types:

1. Purchase

2. Reversal

3. Address Verification

If card is Present during the transaction

Expiration Date Key In Match is highly skewed to 0 value

Count of expirationDateKeyInMatch

Plot showing frequency of Fraud

4. Data Analysis

Deep Diving in Fraudulent Aspects to find correlation and causation

Question: List of top merchants

Here we can clearly see which are the Top Merchants:

1. Uber

2. Lyft

3. oldnavy.com

4. staples.com

5. alibaba.com

6. apple.com

7. walmart.com

8. cheapfast.com

9. ebay.com

10. target.com

Question: Most transactions Types

Here we can clearly see that most transactions happen at:

1. Online Retail

2. Food

3. Entertainment

4. Online Gifts

5. Ride Share

6. Hotels

Let’s explore Numerical Valued Attributes in Credit Card Transactions these attributes hold high importance

Separatelty describing these numerical attributes to get the essence of data and its distribution.

Most of the Credit limits are under 10,000.

But some goes till 20,000 and 50,000 as well.

Available Money mostly lie under the buckets of less than 10,000 and is Right Skewed, which actually makes sense

Interestingly Transaction amount mostly falls under 500 and is Right Skewed

Current Balance

Box Plots help analyzing the Outliers in the datasets and the skewness

It shows how these Numerical Atrributes are Right Skewed

creditLimit

AvailableMoney

TransactionAmount

Current Balance

Question: Most fradulant Merchant Types

Merchant Type with highest fraud transaction

The graph below shows those Merchants where most of the fraudulent transactions happen.

Top 3 are:

1. Online Retail

2. Online Gifts

3. Ride Share

Question:

The Merchant Names where most Fraud happens:

Top 5 are:

1. Lyft

2. ebay.com

3. Fresh Flowers

4. Uber

5. walmart.com