Credit Card Fraud: A Statistical Analysis of Detection Systems and Their Effect on Prevention

Project Introduction

Credit card fraud is the unauthorized use of a credit card to withdraw or transfer money or to make purchases without the owner’s permission. It can occur through theft of the physical card, phishing, or the use of stolen card data for online purchases. Credit card fraud has been an ongoing problem for the past couple of decades, with scammers developing newer tactics and all types of fraud becoming more common. When fraud cases go unresolved, the victim’s credit score can be affected, and the damage can be long-lasting, for example unpaid debts or long-term debts opened in the victim’s name. Fortunately, in the past few years, more advanced security measures, two-factor authentication, and fraud alerts have been introduced to help prevent credit card fraud.

Problem Statement

New forms of digital payment are constantly being introduced; however, they can make it easier for scammers to access sensitive information, which leads to an increase in credit card fraud. Theft of credit card information is stressful and a hassle to remedy for both consumers and financial institutions. With the large number of card transactions made each day and the steady appearance of new fraud tactics, current fraud detection systems struggle to keep up. Better credit card fraud detection would help protect consumers’ sensitive financial information and reduce losses for financial institutions.

Project Objective

The main objective of this project is to analyze the effectiveness of detection systems for fraudulent transactions. We will apply statistical methods to identify the factors associated with fraud. We aim to assess how effective current credit card fraud detection systems are and whether new and improved systems are needed.

Method: Logistic Regression

Describe the variables:

Dependent Variable: Fraud
Independent Variables: distance_from_home, distance_from_last_transaction, ratio_to_median_purchase_price, repeat_retailer, used_chip, used_pin_number, online_order.
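
For reference, the standard logistic regression model can be written as below. This is the textbook form, not something taken from the project files: p denotes the probability that a transaction is fraudulent, x_1 to x_k stand for whichever independent variables are included, and the betas are the coefficients to be estimated.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = \Pr(\text{fraud} = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The left-hand side is the log-odds of fraud, which is why the fitted coefficients are read on the log-odds scale and exponentiated to obtain odds ratios.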

Install and Load the Libraries:

#if(!require(pROC)) install.packages("pROC")
#install.packages("Hmisc")
#install.packages("pscl")   # provides pR2(); pR2 is a function, not a package
#install.packages("readxl")
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(readxl) 

Import Data:

data <- read_excel("~/Desktop/194A Group-Project/card_trans.data.xlsx")

Descriptive Statistics and Visual Analysis

Data Summary:

summary(data)
##  distance_from_home  distance_from_last_transaction
##  Min.   :    0.005   Min.   :    0.000             
##  1st Qu.:    3.878   1st Qu.:    0.297             
##  Median :    9.968   Median :    0.999             
##  Mean   :   26.629   Mean   :    5.037             
##  3rd Qu.:   25.744   3rd Qu.:    3.356             
##  Max.   :10632.724   Max.   :11851.105             
##  ratio_to_median_purchase_price repeat_retailer    used_chip     
##  Min.   :  0.0044               Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:  0.4757               1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :  0.9977               Median :1.0000   Median :0.0000  
##  Mean   :  1.8242               Mean   :0.8815   Mean   :0.3504  
##  3rd Qu.:  2.0964               3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :267.8029               Max.   :1.0000   Max.   :1.0000  
##  used_pin_number   online_order        fraud       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :0.0000  
##  Mean   :0.1006   Mean   :0.6506   Mean   :0.0874  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
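Interpretation: The median distance_from_home is about 10 (9.97), while the mean is 26.63, so the distribution is strongly right-skewed. The medians of distance_from_last_transaction and ratio_to_median_purchase_price are both about 1.00 (means 5.04 and 1.82). For the binary variables, the median is 0 for used_chip, used_pin_number, and fraud, and 1 for repeat_retailer and online_order. Their means give the share of transactions in each category: about 35% used a chip, about 10% used a PIN, about 88% were with a repeat retailer, about 65% were online orders, and about 8.7% were fraudulent.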
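
The difference between the mean and the median matters for skewed columns, and for 0/1 columns the mean is the share of ones. A minimal sketch of both checks follows. The data frame `toy` is simulated and purely hypothetical; it stands in for `data` and does not reproduce the project's values.

```r
# Hypothetical stand-in for `data`: simulated values, not the project's data set.
set.seed(1)
toy <- data.frame(
  distance_from_home = rlnorm(1000, meanlog = 2, sdlog = 1.5),  # right-skewed on purpose
  online_order       = rbinom(1000, 1, 0.65),
  fraud              = rbinom(1000, 1, 0.09)
)

# Mean and median side by side; a mean far above the median signals right skew.
center <- t(sapply(toy, function(x) c(mean = mean(x), median = median(x))))
print(round(center, 3))

# For a 0/1 column the mean is the proportion of ones, e.g. the fraud rate.
fraud_rate <- mean(toy$fraud)
print(fraud_rate)
```

Running the same two steps on the real `data` object should reproduce the Mean and Median rows of the summary above.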

Correlation Analysis:

corr <- rcorr(as.matrix(data))
corr
##                                distance_from_home
## distance_from_home                           1.00
## distance_from_last_transaction               0.00
## ratio_to_median_purchase_price               0.00
## repeat_retailer                              0.14
## used_chip                                    0.00
## used_pin_number                              0.00
## online_order                                 0.00
## fraud                                        0.19
##                                distance_from_last_transaction
## distance_from_home                                       0.00
## distance_from_last_transaction                           1.00
## ratio_to_median_purchase_price                           0.00
## repeat_retailer                                          0.00
## used_chip                                                0.00
## used_pin_number                                          0.00
## online_order                                             0.00
## fraud                                                    0.09
##                                ratio_to_median_purchase_price repeat_retailer
## distance_from_home                                       0.00            0.14
## distance_from_last_transaction                           0.00            0.00
## ratio_to_median_purchase_price                           1.00            0.00
## repeat_retailer                                          0.00            1.00
## used_chip                                                0.00            0.00
## used_pin_number                                          0.00            0.00
## online_order                                             0.00            0.00
## fraud                                                    0.46            0.00
##                                used_chip used_pin_number online_order fraud
## distance_from_home                  0.00             0.0         0.00  0.19
## distance_from_last_transaction      0.00             0.0         0.00  0.09
## ratio_to_median_purchase_price      0.00             0.0         0.00  0.46
## repeat_retailer                     0.00             0.0         0.00  0.00
## used_chip                           1.00             0.0         0.00 -0.06
## used_pin_number                     0.00             1.0         0.00 -0.10
## online_order                        0.00             0.0         1.00  0.19
## fraud                              -0.06            -0.1         0.19  1.00
## 
## n= 1000000 
## 
## 
## P
##                                distance_from_home
## distance_from_home                               
## distance_from_last_transaction 0.8471            
## ratio_to_median_purchase_price 0.1694            
## repeat_retailer                0.0000            
## used_chip                      0.4858            
## used_pin_number                0.1048            
## online_order                   0.1932            
## fraud                          0.0000            
##                                distance_from_last_transaction
## distance_from_home             0.8471                        
## distance_from_last_transaction                               
## ratio_to_median_purchase_price 0.3113                        
## repeat_retailer                0.3533                        
## used_chip                      0.0399                        
## used_pin_number                0.3688                        
## online_order                   0.8880                        
## fraud                          0.0000                        
##                                ratio_to_median_purchase_price repeat_retailer
## distance_from_home             0.1694                         0.0000         
## distance_from_last_transaction 0.3113                         0.3533         
## ratio_to_median_purchase_price                                0.1695         
## repeat_retailer                0.1695                                        
## used_chip                      0.5575                         0.1787         
## used_pin_number                0.3461                         0.6764         
## online_order                   0.7415                         0.5946         
## fraud                          0.0000                         0.1746         
##                                used_chip used_pin_number online_order fraud 
## distance_from_home             0.4858    0.1048          0.1932       0.0000
## distance_from_last_transaction 0.0399    0.3688          0.8880       0.0000
## ratio_to_median_purchase_price 0.5575    0.3461          0.7415       0.0000
## repeat_retailer                0.1787    0.6764          0.5946       0.1746
## used_chip                                0.1636          0.8268       0.0000
## used_pin_number                0.1636                    0.7711       0.0000
## online_order                   0.8268    0.7711                       0.0000
## fraud                          0.0000    0.0000          0.0000
Interpretations: 
1) Ratio to Median Purchase Price has the highest correlation with fraud (r = 0.46), suggesting that the possibility of fraud increases as the purchase price rises relative to the median.
2) Online Order and Distance from Home each have a positive correlation with fraud (r = 0.19 for both). This means that online transactions and transactions made farther from home are more likely to be fraudulent.
3) Used Chip (r = -0.06) and Used PIN Number (r = -0.10) have negative correlations with fraud, which indicates that transactions made with a chip or a PIN have a lower possibility of fraud.
4) Logistic regression assumptions: the predictors are assumed to have a linear relationship with the log-odds of fraud, and each transaction is assumed to be independent of the others. There is no sign of multicollinearity: the correlations among the predictors are all close to 0, the largest being 0.14 between distance_from_home and repeat_retailer. The outcome is binary: fraud has two possible values (fraud or no fraud).
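
Pairwise correlations are one way to look for multicollinearity; variance inflation factors (VIFs) are a common complement. The sketch below computes them with base R only. The function name `vif_manual` and the simulated data frame `toy` are hypothetical and are not part of the original analysis.

```r
# Hypothetical stand-in predictors (simulated); swap in the real columns of `data`.
set.seed(2)
n <- 2000
toy <- data.frame(
  distance_from_home             = rexp(n, rate = 1 / 25),
  ratio_to_median_purchase_price = rexp(n, rate = 1 / 2),
  online_order                   = rbinom(n, 1, 0.65)
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy)
print(round(vifs, 3))
```

Values close to 1 indicate that a predictor is nearly uncorrelated with the others; a common rule of thumb treats values above roughly 5 to 10 as a warning sign.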

Scatter Plot Matrix:

pairs( ~ fraud + distance_from_home + distance_from_last_transaction
       + ratio_to_median_purchase_price + repeat_retailer + used_chip + 
         used_pin_number + online_order, data = data)

Interpretation: The scatter-plot matrix lets us compare every pair of variables visually at the same time.

Data Analysis

Estimate Regression Model:

null_model <- glm(fraud ~ ratio_to_median_purchase_price + online_order
                  + distance_from_home, data = data, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
null_model
## 
## Call:  glm(formula = fraud ~ ratio_to_median_purchase_price + online_order + 
##     distance_from_home, family = binomial, data = data)
## 
## Coefficients:
##                    (Intercept)  ratio_to_median_purchase_price  
##                       -9.10866                         0.66722  
##                   online_order              distance_from_home  
##                        5.14165                         0.01166  
## 
## Degrees of Freedom: 999999 Total (i.e. Null);  999996 Residual
## Null Deviance:       593000 
## Residual Deviance: 332200    AIC: 332200
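Interpretation: The coefficients are on the log-odds scale.
b0 = −9.11: The log-odds of fraud are −9.11 when all three predictors equal zero, which corresponds to a very low baseline probability of fraud.
b1 = 0.67: Each one-unit increase in the ratio to median purchase price increases the log-odds of fraud by 0.67, holding the other predictors constant.
b2 = 5.14: Online orders have log-odds of fraud 5.14 higher than non-online orders, holding the other predictors constant.
b3 = 0.01: Each one-unit increase in distance from home increases the log-odds of fraud by 0.01, holding the other predictors constant.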
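
To see what the coefficients mean on the probability scale, the linear predictor can be passed through the inverse logit. The sketch below uses the coefficient values printed in the model output; the helper `fraud_prob` and the three example transactions are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the printed model output above.
b <- c(intercept = -9.10866, ratio = 0.66722, online = 5.14165, distance = 0.01166)

# Predicted fraud probability: inverse logit of the linear predictor.
fraud_prob <- function(ratio, online, distance) {
  eta <- b[["intercept"]] + b[["ratio"]] * ratio +
    b[["online"]] * online + b[["distance"]] * distance
  plogis(eta)  # same as 1 / (1 + exp(-eta))
}

# Hypothetical transactions, not rows from the data set.
p_baseline <- fraud_prob(ratio = 0, online = 0, distance = 0)
p_offline  <- fraud_prob(ratio = 1, online = 0, distance = 10)
p_online   <- fraud_prob(ratio = 1, online = 1, distance = 10)
print(c(baseline = p_baseline, offline = p_offline, online = p_online))
```

With these inputs the only difference between the last two transactions is the online_order flag, which moves the predicted probability from well below 0.1% to roughly 4%.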

Odds Ratio:

coefficients <- summary(null_model)$coefficients
odds_ratio1 <- exp(coefficients["ratio_to_median_purchase_price", "Estimate"])
odds_ratio1
## [1] 1.948811
odds_ratio2 <- exp(coefficients["online_order", "Estimate"])
odds_ratio2
## [1] 170.9969
odds_ratio3 <- exp(coefficients["distance_from_home", "Estimate"])
odds_ratio3
## [1] 1.011726
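Interpretation: All three odds ratios are greater than 1, which indicates an increased possibility of fraudulent transactions. Each one-unit increase in the ratio to median purchase price multiplies the odds of fraud by about 1.95 (a 95% increase), online orders have about 171 times the odds of fraud of non-online orders, and each one-unit increase in distance from home raises the odds of fraud by about 1.2%.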
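
An odds ratio converts to a percentage change in the odds as (odds ratio - 1) x 100. The sketch below applies this to the three coefficients printed in the model output. The variable names `pct_change` and `or_distance_10` are introduced here for illustration, and the 10-unit step is an arbitrary choice.

```r
# Coefficients copied from the printed model output.
coefs <- c(ratio_to_median_purchase_price = 0.66722,
           online_order                   = 5.14165,
           distance_from_home             = 0.01166)

odds_ratio <- exp(coefs)
pct_change <- (odds_ratio - 1) * 100   # percentage change in the odds per one-unit increase
print(round(cbind(odds_ratio, pct_change), 3))

# A k-unit increase multiplies the odds by exp(k * b); here k = 10 for distance_from_home.
or_distance_10 <- exp(10 * coefs[["distance_from_home"]])
print(or_distance_10)
```

Because the effect is multiplicative, a small per-unit odds ratio such as the one for distance_from_home still compounds over large distances.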

Pseudo-R-squared:

pR2(null_model)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -1.661204e+05 -2.964878e+05  2.607348e+05  4.397059e-01  2.295148e-01 
##          r2CU 
##  5.130889e-01
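Interpretation: A McFadden value of 0.4397 shows that the model fits the data much better than an intercept-only model: the magnitude of the log-likelihood falls by about 44% (from −296,488 for the null model to −166,120 for the fitted model). McFadden's pseudo-R-squared is not a share of variance explained, and values in the range 0.2 to 0.4 are commonly regarded as indicating a very good fit.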
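
The pseudo-R-squared values can be reproduced by hand from the two log-likelihoods, which makes clear what each one measures. The sketch below uses the llh and llhNull values printed above and n = 1000000 from the correlation output. The formulas are the standard McFadden, maximum-likelihood (Cox and Snell), and Cragg-Uhler (Nagelkerke) definitions, and the variable names are illustrative.

```r
# Values copied from the pR2() output and the sample size reported earlier.
llh     <- -1.661204e+05   # log-likelihood of the fitted model
llhNull <- -2.964878e+05   # log-likelihood of the intercept-only model
n       <- 1000000

mcfadden   <- 1 - llh / llhNull                      # McFadden
g2         <- -2 * (llhNull - llh)                   # likelihood-ratio statistic (G2)
cox_snell  <- 1 - exp(-g2 / n)                       # r2ML
nagelkerke <- cox_snell / (1 - exp(2 * llhNull / n)) # r2CU

print(round(c(McFadden = mcfadden, r2ML = cox_snell, r2CU = nagelkerke), 4))
```

All three compare the fitted model with the intercept-only model, so they measure improvement in fit rather than variance explained.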

Conclusion and Recommendation

Conclusion:

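Based on our results, we conclude that the detection systems that were put in place were not effective at combating credit card fraud. 
•   The intercept of −9.11 implies a very low baseline log-odds of fraud, but the independent variables still increased the risk of fraud: all three coefficients are positive and all three odds ratios are greater than 1.
•   Our three coefficients (β1, β2, β3) are all greater than 0.05, but that describes their size, not their statistical significance. Significance is judged from p-values, which the model printout does not report; in the correlation analysis, each of the three predictors has a p-value of 0.0000 for its correlation with fraud.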
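
Statistical significance of logistic regression coefficients is read from the p-value column of the coefficient table, not from the size of the estimates. The sketch below shows where that column lives, using a small simulated data set. The data frame `toy`, its true coefficients, and the fitted object `fit` are hypothetical and do not reproduce the project's model.

```r
# Hypothetical simulated data; the true coefficients (-4, 0.6, 1.5) are arbitrary.
set.seed(3)
n <- 5000
toy <- data.frame(
  ratio        = rexp(n, rate = 1 / 2),
  online_order = rbinom(n, 1, 0.65)
)
toy$fraud <- rbinom(n, 1, plogis(-4 + 0.6 * toy$ratio + 1.5 * toy$online_order))

fit <- glm(fraud ~ ratio + online_order, data = toy, family = binomial)

# The coefficient table holds estimates, standard errors, z values, and p-values.
coef_table <- summary(fit)$coefficients
p_values   <- coef_table[, "Pr(>|z|)"]
print(signif(p_values, 3))

# A coefficient is significant at the 5% level when its p-value is below 0.05.
significant <- p_values < 0.05
print(significant)
```

Applying `summary()` to the model fitted earlier in this report would give the corresponding table for the real data.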

Recommendations:

•   Two-factor authentication for online orders 
•   Encryption of card data on websites that process payments 
•   AI fraud detection (can monitor suspicious activity in real time and cross-check transactions) 
•   Security questions for higher-value purchases ($100+)