Statistical learning models were applied to a dataset of credit card transactions made by European cardholders in order to recognize fraudulent transactions from their anonymized transaction features. A variety of learning techniques were explored and validated. The GBM model performs well, but further data collection and model adjustment are required.
Credit card[1] transactions have been taking a growing share of the US payment system. Customers and banks suffer from a high rate of stolen account numbers and the losses that follow. Improved fraud[2] detection has therefore become essential to maintaining the viability of the US payment system.[3]
In this analysis, statistical and machine learning techniques were applied to recognize fraudulent credit card transactions. The dataset contains transactions made with credit cards in September 2013 by European cardholders. Unfortunately, due to confidentiality issues, the original features and further background information about the data cannot be provided.
Classification methods, including logistic regression, k-nearest neighbors, GBM and neural network models, were therefore applied. The results indicate that, among the models considered, the GBM model achieves a reasonable ROC and sensitivity. However, further investigation of the data is still needed, through both practical and statistical methods.
The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. For the purposes of this analysis, only a subset of 50,000 instances is used. The original dataset, containing 284,807 observations, can be accessed via Credit Card Fraud Detection[4].
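The degree of imbalance can be checked directly from the two counts quoted above. The following is a minimal sketch; the variable names are illustrative and the "always genuine" baseline is added only to show why plain accuracy says little on data this skewed.

```python
# Counts quoted in the text above.
n_fraud = 492
n_total = 284_807

# Share of transactions that are fraudulent (the positive class).
fraud_rate = n_fraud / n_total
print(f"fraud rate: {fraud_rate:.4%}")

# A classifier that labels every transaction as genuine is correct
# for every non-fraud case, so its accuracy is 1 - fraud_rate.
baseline_accuracy = 1 - fraud_rate
print(f"always-genuine accuracy: {baseline_accuracy:.4%}")
```

Because the trivial baseline is already above 99.8% accurate, measures that focus on the fraud class are more informative for comparing models.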
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information about the data cannot be provided. Features V1, V2, …, V28 are the principal components[5] obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’.
In order to determine whether a given credit card transaction is fraudulent, four modeling techniques were explored: logistic regression, k-nearest neighbors, neural networks, and gradient boosting machines (GBM). All available features except Time were used as predictors. To evaluate the ability of the different models to detect fraudulent credit card transactions, the data was split into training and testing sets.
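The text does not state the split ratio or whether the split was stratified by class. As an illustration only, the sketch below assumes an 80/20 split that keeps the fraud proportion the same on both sides; `stratified_split`, its parameters, and the toy labels are hypothetical and not taken from the analysis.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Put at least one example of every class on the test side.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 10 frauds among 1000 transactions (illustrative, not the real data).
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training rows,", len(test_idx), "testing rows")
print(sum(labels[i] for i in test_idx), "frauds in the testing set")
```

Stratifying matters with so few frauds: a purely random split could leave the testing set with no positive cases at all, in which case sensitivity would be undefined.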
17% of transactions in the training dataset are fraudulent. Given the class imbalance, ROC and sensitivity were used as the measures of accuracy for this unbalanced classification problem. In addition, the ROSE resampling method was used to address the class imbalance.
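To illustrate the idea of rebalancing, here is a minimal sketch of plain random oversampling. This is a simpler technique than ROSE, which generates new synthetic examples rather than duplicating existing rows, so it should be read as a stand-in and not as the method used in the analysis. The function name and toy data are hypothetical.

```python
import random

def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until the classes are the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx:
        raise ValueError("no minority examples to resample")

    # Number of extra minority rows needed to match the majority class.
    n_extra = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]

    keep = majority_idx + minority_idx + extra
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy training data: 5 frauds and 95 genuine transactions, one feature each.
rows = [[float(i)] for i in range(100)]
labels = [1] * 5 + [0] * 95
new_rows, new_labels = random_oversample(rows, labels)
print(len(new_rows), "rows,", sum(new_labels), "frauds after resampling")
```

Whatever resampling scheme is used, it is normally applied to the training data only, so that the testing set keeps the original class proportions.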
The results below show somewhat similar performance across the GBM and KNN models. Ultimately, the GBM model was chosen because of its higher ROC. Additional intermediate tuning results can be found in the appendix.
| Model Name | ROC | Sensitivity |
|---|---|---|
| Logistic Regression | 0.763 | 0.533 |
| KNN Model | 0.892 | 0.800 |
| GBM Model | 0.982 | 0.800 |
| Neural Network Model | 0.066 | 0.700 |
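For reference, the two measures reported in the table can be computed along the following lines. This sketch assumes that the ROC column is the area under the ROC curve and that fraud is coded as 1; the function names, the 0.5 threshold, and the toy scores are illustrative only.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: detected frauds divided by all actual frauds."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the share of (fraud, genuine)
    pairs in which the fraud has the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: predicted fraud probabilities for four transactions.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]  # classify as fraud at a 0.5 threshold

print(sensitivity(y_true, y_pred))  # 0.5
print(roc_auc(y_true, scores))      # 0.75
```

The pairwise loop is quadratic in the number of rows, which is fine for a small illustration; a rank-based formula or a library routine would be preferable on the full data. An area of 0.5 corresponds to random ranking.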
Based on its performance on the testing dataset, the GBM model was the most effective of the models considered at recognizing fraudulent credit card transactions.
The maximum loss over all transactions is 1062.935 dollars, and the average loss is 0.1156212 dollars. Given that billions of transactions are made by credit card every day, this seems to suggest that, on this dataset, the model is acceptable but does not perform very well at the prediction task. There is no solid evidence that the model should be put into practice.
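The text does not define how the loss per transaction was calculated. One plausible reading, shown here purely as an assumption, is that a missed fraud costs its full transaction amount and every other outcome costs nothing. The function name, the cost rule, and the toy figures below are hypothetical.

```python
def transaction_losses(amounts, y_true, y_pred):
    """Loss per transaction under an ASSUMED rule:
    the full amount if a fraud (1) is predicted genuine (0), otherwise 0."""
    return [
        amount if (truth == 1 and pred == 0) else 0.0
        for amount, truth, pred in zip(amounts, y_true, y_pred)
    ]

# Toy data: five transactions, two of them fraudulent, one fraud missed.
amounts = [12.50, 250.00, 8.00, 99.99, 40.00]
y_true = [0, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1]

losses = transaction_losses(amounts, y_true, y_pred)
max_loss = max(losses)
avg_loss = sum(losses) / len(losses)
print(max_loss, avg_loss)  # 250.0 50.0
```

Under a rule of this kind the average loss is spread over all transactions, most of which are genuine, so a small average can coexist with a large maximum.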
Statistically, the application of the chosen model is to some extent limited by the nature of the data. The observations used to train the model were only a subset of 50,000 out of roughly 280,000, which suggests that the sample size is relatively small. To generalize this model to a larger population, more data would need to be included.
- V1 - V28: 28 principal components based on an unknown set of input features that contain information about each transaction.
- Time: contains the seconds elapsed between each transaction and the first transaction in the dataset.
- Amount: the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
- Class: the response variable; it takes value 1 in case of fraud and 0 otherwise.

For additional background on the data, see the data source on Credit Card Fraud Detection (Kaggle).
| Class | Count | 10th Percentile | Median | 90th Percentile |
|---|---|---|---|---|
| fraud | 11 | 1 | 311.91 | 723.210 |
| genuine | 4989 | 1 | 22.95 | 214.132 |
Marginal: 17% of transactions in the training dataset are fraudulent. The median transaction amount in the training dataset is 22, and the 90th percentile of the transaction amount is 203.176.
Conditional: Given that a transaction in the training dataset is fraudulent, the median transaction amount is 8.42 and the 90th percentile is 325.404. Given that a transaction in the training dataset is genuine, the median transaction amount is 22 and the 90th percentile is 203.
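Summaries of this kind, the amount distribution conditional on class, can be produced with a short grouping routine. The sketch below uses Python's standard `statistics` module on made-up amounts; the function name `amount_summary` and the choice of the "inclusive" quantile method are assumptions, and different quantile conventions give slightly different percentiles.

```python
from collections import defaultdict
from statistics import median, quantiles

def amount_summary(amounts, labels):
    """Count, 10th percentile, median and 90th percentile of amount per class."""
    by_class = defaultdict(list)
    for amount, label in zip(amounts, labels):
        by_class[label].append(amount)

    summary = {}
    for label, values in by_class.items():
        # n=10 returns the nine decile cut points: first is the 10th
        # percentile, last is the 90th. Needs at least two values per class.
        deciles = quantiles(values, n=10, method="inclusive")
        summary[label] = {
            "count": len(values),
            "p10": deciles[0],
            "median": median(values),
            "p90": deciles[-1],
        }
    return summary

# Toy data (not the real transactions): 11 genuine amounts and 2 fraud amounts.
amounts = [float(a) for a in range(1, 12)] + [100.0, 300.0]
labels = ["genuine"] * 11 + ["fraud"] * 2
summary = amount_summary(amounts, labels)
for label, stats in summary.items():
    print(label, stats)
```

With very few fraud cases in a class, the percentiles rest on interpolation between a handful of values and should be read with caution.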
Guo, Tao, and Gui-Yang Li. “Neural Data Mining for Credit Card Fraud Detection.” 2008 International Conference on Machine Learning and Cybernetics, 2008. https://doi.org/10.1109/icmlc.2008.4621035.