It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. Given the importance of this issue, the current study applies classification modeling techniques to categorize transactions as valid or fraudulent. Because the dataset used for the analysis is severely imbalanced, we recommend evaluating model performance with the Area Under the Receiver Operating Characteristic curve (AUROC) rather than raw accuracy.
#install.packages(c('ROCR','ggplot2','corrplot','caTools','class',
#                    'randomForest','pROC','imbalance','rpart'))
library(ROCR)
library(ggplot2)
library(corrplot)
library(caTools)
library(class)
library(randomForest)
library(pROC)
library(imbalance)
library(rpart)
transaction <- read.csv('creditcard.csv')
The dataset includes 284807 observations and 31 variables. Out of these 31 variables, one (Class) is the variable of interest. The dataset contains transactions made by European cardholders over a span of two days in September 2013. Due to confidentiality concerns, the dataset contains only numerical input variables, most of which are the result of a PCA transformation. The only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset, and ‘Amount’ is the transaction amount.
The dataset was made available on the Kaggle platform by a research unit of the Université Libre de Bruxelles, the Machine Learning Group (MLG-ULB). It can be found at https://www.kaggle.com/mlg-ulb/creditcardfraud
head(transaction)
## Time V1 V2 V3 V4 V5 V6
## 1 0 -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778
## 2 0 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081
## 3 1 -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938
## 4 1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317
## 5 2 -1.1582331 0.87773675 1.5487178 0.4030339 -0.40719338 0.09592146
## 6 2 -0.4259659 0.96052304 1.1411093 -0.1682521 0.42098688 -0.02972755
## V7 V8 V9 V10 V11 V12
## 1 0.23959855 0.09869790 0.3637870 0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298 0.08510165 -0.2554251 -0.16697441 1.6127267 1.06523531
## 3 0.79146096 0.24767579 -1.5146543 0.20764287 0.6245015 0.06608369
## 4 0.23760894 0.37743587 -1.3870241 -0.05495192 -0.2264873 0.17822823
## 5 0.59294075 -0.27053268 0.8177393 0.75307443 -0.8228429 0.53819555
## 6 0.47620095 0.26031433 -0.5686714 -0.37140720 1.3412620 0.35989384
## V13 V14 V15 V16 V17 V18
## 1 -0.9913898 -0.3111694 1.4681770 -0.4704005 0.20797124 0.02579058
## 2 0.4890950 -0.1437723 0.6355581 0.4639170 -0.11480466 -0.18336127
## 3 0.7172927 -0.1659459 2.3458649 -2.8900832 1.10996938 -0.12135931
## 4 0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279 1.96577500
## 5 1.3458516 -1.1196698 0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337 0.5176168 0.4017259 -0.05813282 0.06865315
## V19 V20 V21 V22 V23 V24
## 1 0.40399296 0.25141210 -0.018306778 0.277837576 -0.11047391 0.06692807
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953 0.10128802 -0.33984648
## 3 -2.26185710 0.52497973 0.247998153 0.771679402 0.90941226 -0.68928096
## 4 -1.23262197 -0.20803778 -0.108300452 0.005273597 -0.19032052 -1.17557533
## 5 0.80348692 0.40854236 -0.009430697 0.798278495 -0.13745808 0.14126698
## 6 -0.03319379 0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
## V25 V26 V27 V28 Amount Class
## 1 0.1285394 -0.1891148 0.133558377 -0.02105305 149.62 0
## 2 0.1671704 0.1258945 -0.008983099 0.01472417 2.69 0
## 3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66 0
## 4 0.6473760 -0.2219288 0.062722849 0.06145763 123.50 0
## 5 -0.2060096 0.5022922 0.219422230 0.21515315 69.99 0
## 6 -0.2327938 0.1059148 0.253844225 0.08108026 3.67 0
summary(transaction)
## Time V1 V2 V3
## Min. : 0 Min. :-56.40751 Min. :-72.71573 Min. :-48.3256
## 1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855 1st Qu.: -0.8904
## Median : 84692 Median : 0.01811 Median : 0.06549 Median : 0.1799
## Mean : 94814 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:139321 3rd Qu.: 1.31564 3rd Qu.: 0.80372 3rd Qu.: 1.0272
## Max. :172792 Max. : 2.45493 Max. : 22.05773 Max. : 9.3826
## V4 V5 V6 V7
## Min. :-5.68317 Min. :-113.74331 Min. :-26.1605 Min. :-43.5572
## 1st Qu.:-0.84864 1st Qu.: -0.69160 1st Qu.: -0.7683 1st Qu.: -0.5541
## Median :-0.01985 Median : -0.05434 Median : -0.2742 Median : 0.0401
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.74334 3rd Qu.: 0.61193 3rd Qu.: 0.3986 3rd Qu.: 0.5704
## Max. :16.87534 Max. : 34.80167 Max. : 73.3016 Max. :120.5895
## V8 V9 V10 V11
## Min. :-73.21672 Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.20863 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : 0.02236 Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.32735 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 20.00721 Max. : 15.59500 Max. : 23.74514 Max. :12.01891
## V12 V13 V14 V15
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143 Min. :-4.49894
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256 1st Qu.:-0.58288
## Median : 0.1400 Median :-0.01357 Median : 0.0506 Median : 0.04807
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931 3rd Qu.: 0.64882
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268 Max. : 8.87774
## V16 V17 V18
## Min. :-14.12985 Min. :-25.16280 Min. :-9.498746
## 1st Qu.: -0.46804 1st Qu.: -0.48375 1st Qu.:-0.498850
## Median : 0.06641 Median : -0.06568 Median :-0.003636
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.52330 3rd Qu.: 0.39968 3rd Qu.: 0.500807
## Max. : 17.31511 Max. : 9.25353 Max. : 5.041069
## V19 V20 V21
## Min. :-7.213527 Min. :-54.49772 Min. :-34.83038
## 1st Qu.:-0.456299 1st Qu.: -0.21172 1st Qu.: -0.22839
## Median : 0.003735 Median : -0.06248 Median : -0.02945
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.458949 3rd Qu.: 0.13304 3rd Qu.: 0.18638
## Max. : 5.591971 Max. : 39.42090 Max. : 27.20284
## V22 V23 V24
## Min. :-10.933144 Min. :-44.80774 Min. :-2.83663
## 1st Qu.: -0.542350 1st Qu.: -0.16185 1st Qu.:-0.35459
## Median : 0.006782 Median : -0.01119 Median : 0.04098
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.528554 3rd Qu.: 0.14764 3rd Qu.: 0.43953
## Max. : 10.503090 Max. : 22.52841 Max. : 4.58455
## V25 V26 V27
## Min. :-10.29540 Min. :-2.60455 Min. :-22.565679
## 1st Qu.: -0.31715 1st Qu.:-0.32698 1st Qu.: -0.070840
## Median : 0.01659 Median :-0.05214 Median : 0.001342
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.35072 3rd Qu.: 0.24095 3rd Qu.: 0.091045
## Max. : 7.51959 Max. : 3.51735 Max. : 31.612198
## V28 Amount Class
## Min. :-15.43008 Min. : 0.00 Min. :0.000000
## 1st Qu.: -0.05296 1st Qu.: 5.60 1st Qu.:0.000000
## Median : 0.01124 Median : 22.00 Median :0.000000
## Mean : 0.00000 Mean : 88.35 Mean :0.001728
## 3rd Qu.: 0.07828 3rd Qu.: 77.17 3rd Qu.:0.000000
## Max. : 33.84781 Max. :25691.16 Max. :1.000000
str(transaction)
## 'data.frame': 284807 obs. of 31 variables:
## $ Time : num 0 0 1 1 2 2 4 7 7 9 ...
## $ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
## $ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
## $ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
## $ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
## $ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
## $ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
## $ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
## $ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
## $ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
## $ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
## $ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
## $ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
## $ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
## $ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
## $ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
## $ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
## $ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
## $ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
## $ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
## $ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
## $ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
## $ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
## $ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
## $ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
## $ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
## $ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
## $ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
## $ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
## $ Amount: num 149.62 2.69 378.66 123.5 69.99 ...
## $ Class : int 0 0 0 0 0 0 0 0 0 0 ...
sapply(transaction, FUN=class)
## Time V1 V2 V3 V4 V5 V6 V7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## V8 V9 V10 V11 V12 V13 V14 V15
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## V16 V17 V18 V19 V20 V21 V22 V23
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## V24 V25 V26 V27 V28 Amount Class
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "integer"
The basic data exploration shows that all of the predictor variables are numeric, a consequence of the PCA transformation, while the target variable Class is stored as an integer.
table(transaction$Class)
##
## 0 1
## 284315 492
prop.table(table(transaction$Class))
##
## 0 1
## 0.998272514 0.001727486
ggplot(data = transaction,aes(x=Class))+
geom_bar(col="slateblue3")
The dataset is highly imbalanced: the positive class (frauds) accounts for a mere 0.172% of all transactions. That is good news for the credit card company, but it makes the modeling task harder.
ggplot(data = transaction,aes(y=Amount,x=Class))+
geom_point(col="slateblue3") +
facet_grid(~Class)
This plot shows that all the fraudulent transactions involve amounts below roughly 2500, considerably less than the largest legitimate transactions.
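This claim can be checked directly from the data:
# Largest amount among fraudulent transactions (should be well below the
# overall maximum of 25691.16 shown in the summary above)
max(transaction$Amount[transaction$Class == 1])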
ggplot(data = transaction, aes(x = Time, fill = factor(Class))) +
  geom_histogram(col = 'red', bins = 48) +
  facet_grid(Class ~ ., scales = 'free_y')
Because the unit of measure for Time is seconds, the raw histogram is hard to read. Converting the elapsed time to hours makes the pattern over the two days much clearer.
# Convert elapsed seconds to (rounded) hours
transaction$hours <- round(transaction$Time/3600)
unique(transaction$hours)
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## [26] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
ggplot(data = transaction,
aes(x=hours,fill=factor(Class)))+
geom_histogram(col='red',bins=48) +
facet_grid(Class ~ ., scales = 'free_y')
# Replace Time with the hour-level version and drop the helper column
transaction$Time <- transaction$hours
transaction <- subset(transaction, select = -hours)
# Ensure Class is numeric so it can be included in the correlation matrix
transaction$Class <- as.numeric(transaction$Class)
corr <- cor(transaction, method = "pearson")
corrplot(corr,method = "circle", type = "lower")
The plot shows little correlation among the variables, as expected since V1-V28 are principal components. “V2” has some negative correlation with “Amount”, but it is weak enough to ignore.
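To read the exact figures behind the plot, the correlations with “Amount” can be listed directly:
# Correlation of every variable with Amount, sorted for readability
round(sort(corr[, "Amount"]), 2)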
transaction$Class <- as.factor(transaction$Class)
# For reproducibility of the random selection
set.seed(123)
split = sample.split(transaction$Class, SplitRatio = 0.75)
training_set = subset(transaction, split == TRUE)
test_set = subset(transaction, split == FALSE)
table(training_set$Class)
##
## 0 1
## 213236 369
prop.table(table(training_set$Class))
##
## 0 1
## 0.998272512 0.001727488
table(test_set$Class)
##
## 0 1
## 71079 123
prop.table(table(test_set$Class))
##
## 0 1
## 0.99827252 0.00172748
To balance the minority (fraud) and majority classes in the training set, we use the SMOTE technique. Setting the ratio to 0.8 means the minority class is oversampled until it reaches 80% of the size of the majority class, giving the model enough fraudulent examples to learn the minority class properly. SMOTE is applied only to the training set; the test set keeps its original class distribution.
smoted_training_set <- oversample(dataset = training_set,
                                  method = 'SMOTE', ratio = 0.8)
table(smoted_training_set$Class)
##
## 0 1
## 213236 170589
prop.table(table(smoted_training_set$Class))
##
## 0 1
## 0.5555553 0.4444447
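For intuition, the heart of SMOTE is interpolation between a minority observation and one of its nearest minority neighbours. A minimal sketch of that single step (illustrative only; the imbalance package handles neighbour search and the target ratio for us):
# One SMOTE step: a synthetic fraud example is a random point on the
# segment between a real fraud case x and a nearby fraud case neighbour
# (both plain numeric feature vectors)
smote_one_point <- function(x, neighbour) {
  gap <- runif(1)               # random interpolation weight in [0, 1]
  x + gap * (neighbour - x)
}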
colnames(smoted_training_set)
## [1] "Time" "V1" "V2" "V3" "V4" "V5" "V6" "V7"
## [9] "V8" "V9" "V10" "V11" "V12" "V13" "V14" "V15"
## [17] "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23"
## [25] "V24" "V25" "V26" "V27" "V28" "Amount" "Class"
# Standardize all predictors; column 31 is the Class label
smoted_training_set[-31] <- scale(smoted_training_set[-31])
test_set[-31] <- scale(test_set[-31])
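The code above scales the test set with its own means and standard deviations. A common alternative, sketched below with hypothetical train_scaled/test_scaled objects (not what was done here), is to reuse the training set's scaling parameters so both splits share one scale:
# Sketch: scale the test set with the training set's centers and SDs
# (assumes column 31 is Class, as above)
train_scaled <- scale(smoted_training_set[-31])
test_scaled  <- scale(test_set[-31],
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))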
classifier <- glm(formula = Class ~ .,
family = binomial,
data = smoted_training_set)
summary(classifier)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = smoted_training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.1981 -0.0515 0.0000 3.1189
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.21357 0.20456 64.595 <2e-16 ***
## Time -0.48566 0.01520 -31.960 <2e-16 ***
## V1 4.16642 0.10716 38.882 <2e-16 ***
## V2 2.01935 0.12771 15.812 <2e-16 ***
## V3 1.69492 0.10179 16.651 <2e-16 ***
## V4 2.65980 0.03521 75.548 <2e-16 ***
## V5 3.53083 0.10236 34.493 <2e-16 ***
## V6 -0.84735 0.02771 -30.578 <2e-16 ***
## V7 -3.19131 0.17999 -17.730 <2e-16 ***
## V8 -2.03610 0.05362 -37.973 <2e-16 ***
## V9 -1.16819 0.03426 -34.100 <2e-16 ***
## V10 -3.56501 0.08646 -41.235 <2e-16 ***
## V11 1.92890 0.03436 56.137 <2e-16 ***
## V12 -5.41377 0.07636 -70.902 <2e-16 ***
## V13 -0.56755 0.01001 -56.699 <2e-16 ***
## V14 -6.79996 0.08555 -79.487 <2e-16 ***
## V15 -0.32356 0.01028 -31.479 <2e-16 ***
## V16 -2.29176 0.06361 -36.028 <2e-16 ***
## V17 -5.63337 0.13085 -43.051 <2e-16 ***
## V18 -1.12256 0.03900 -28.786 <2e-16 ***
## V19 0.37196 0.01656 22.463 <2e-16 ***
## V20 -0.74733 0.03340 -22.376 <2e-16 ***
## V21 0.38731 0.03909 9.908 <2e-16 ***
## V22 0.97774 0.02117 46.188 <2e-16 ***
## V23 0.49043 0.03245 15.114 <2e-16 ***
## V24 0.01944 0.01063 1.828 0.0675 .
## V25 -0.03240 0.01397 -2.319 0.0204 *
## V26 0.09861 0.01000 9.861 <2e-16 ***
## V27 0.08428 0.03446 2.445 0.0145 *
## V28 0.22122 0.01843 12.001 <2e-16 ***
## Amount 1.86849 0.08204 22.774 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 527346 on 383824 degrees of freedom
## Residual deviance: 78623 on 383794 degrees of freedom
## AIC: 78685
##
## Number of Fisher Scoring iterations: 13
The model summary shows that almost all of the variables are statistically significant. Only V24 fails the 5% significance test (p = 0.0675); since its p-value is only slightly above the threshold, we keep it in the model.
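The claim can be checked programmatically by listing every coefficient whose p-value exceeds the 5% level:
# Coefficients not significant at the 5% level (per the summary above,
# this should list only V24)
coefs <- summary(classifier)$coefficients
rownames(coefs)[coefs[, "Pr(>|z|)"] > 0.05]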
# Predicted fraud probabilities for the test set
prob_pred <- predict(classifier, type = 'response', newdata = test_set[-31])
# Classify as fraud when the probability exceeds the 0.4 cut-off
y_pred <- ifelse(prob_pred > 0.4, 1, 0)
The AUROC (area under the receiver operating characteristic curve) measures how well the model separates the two classes across all classification thresholds, making it a more informative summary of overall performance than raw accuracy on an imbalanced dataset.
calc_auc <- function(prob_pred, test_y){
  # Build an ROCR prediction object from the predicted probabilities
  predict_log <- prediction(prob_pred, test_y)
  # Plot the ROC curve (true positive rate vs. false positive rate)
  roc_curve1 <- performance(predict_log, "tpr", "fpr")
  plot(roc_curve1, colorize = TRUE)
  # Compute and return the AUC
  auc <- performance(predict_log, "auc")
  paste(auc@y.values[[1]])
}
# Confusion matrix at the 0.4 cut-off (y_pred is already 0/1, so the
# comparison simply maps it to FALSE/TRUE column labels)
cm <- table(test_set[, 31], y_pred > 0.4)
cm
##
## FALSE TRUE
## 0 3263 67816
## 1 0 123
print(calc_auc(prob_pred,test_set$Class))
## [1] "0.974189831376961"
The AUC value of 0.97 is very high. However, we should not declare this the best model for our dataset without comparing it against some alternatives.
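As a cross-check, the pROC package (loaded at the start but otherwise unused) computes the same quantity; its result should agree with the ROCR figure above:
# AUC via pROC, from the true labels and predicted probabilities
pROC::auc(test_set$Class, prob_pred)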
set.seed(123)
forest.classifier = randomForest(x = smoted_training_set[-31],
y = smoted_training_set$Class,
ntree = 100)
forest.y_pred = predict(forest.classifier, newdata = test_set[-31])
forest.y_prob = predict(forest.classifier, newdata = test_set[-31], type='prob')
# AUC from the forest's hard class predictions (forest.y_prob would give
# a smoother ROC curve, but we mirror the approach used for the tree below)
forest.predict_log <- prediction(as.numeric(forest.y_pred),
                                 as.numeric(test_set$Class))
forest.auc <- performance(forest.predict_log, "auc")
print(forest.auc@y.values[[1]])
## [1] 0.8490134
The AUC value in this case comes out around 0.85, noticeably lower than the 0.97 achieved by logistic regression, so logistic regression remains the better model for this dataset.
cm <- table(test_set[, 31], forest.y_pred)
cm
## forest.y_pred
## 0 1
## 0 64062 7017
## 1 25 98
roc_curve2<- performance(forest.predict_log,"tpr","fpr")
plot(roc_curve2, colorize=T)
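As an exploratory aside (not part of the original comparison), randomForest's built-in variable importance plot shows which components drive the forest's decisions:
# Ten most important variables according to the fitted forest
varImpPlot(forest.classifier, n.var = 10)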
# Single decision tree on the same SMOTE-balanced training set
tree.model <- rpart(Class ~ ., data = smoted_training_set, method = 'class')
tree.predict <- predict(tree.model, newdata = test_set[-31], type = 'class')
tree.predict_log <- prediction(as.numeric(tree.predict),
as.numeric(test_set$Class))
tree.auc <- performance(tree.predict_log, "auc")
print(tree.auc@y.values[[1]])
## [1] 0.6818205
The decision tree's AUC comes out to 0.68, which is significantly lower than that of logistic regression.
roc_curve3<- performance(tree.predict_log,"tpr","fpr")
plot(roc_curve3, colorize=T)
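Finally, the three AUC values obtained above can be collected into one small comparison table:
# Summary of the AUC values reported in this analysis
data.frame(model = c("Logistic regression", "Random forest", "Decision tree"),
           AUC   = c(0.974, 0.849, 0.682))
By this measure, logistic regression trained on the SMOTE-balanced data is the clear winner for this dataset.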