It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. Given the importance of this issue, the current study applies classification modeling techniques to categorize transactions as valid or fraudulent. Because the dataset used for the analysis is severely imbalanced, we recommend evaluating model performance with the Area Under the Receiver Operating Characteristic curve (AUROC) rather than raw accuracy.
#install.packages(c('ROCR','ggplot2','corrplot','caTools','class',
#                    'randomForest','pROC','imbalance','rpart'))
library(ROCR)
library(ggplot2)
library(corrplot)
library(caTools)
library(class)
library(randomForest)
library(pROC)
library(imbalance)
library(rpart)
transaction <- read.csv('creditcard.csv')
The dataset includes 284807 observations and 31 variables. Out of these 31 variables, one (Class) is the variable of interest. The dataset contains transactions made by European cardholders over a span of two days in September 2013. Due to confidentiality concerns, the dataset contains only numerical input variables, most of which are the result of a PCA transformation. The only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset, and ‘Amount’ is the transaction amount.
The dataset was made available on the Kaggle platform by a research unit of the Université Libre de Bruxelles, the Machine Learning Group (MLG-ULB). It can be found at https://www.kaggle.com/mlg-ulb/creditcardfraud
head(transaction)
## Time V1 V2 V3 V4 V5 V6
## 1 0 -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778
## 2 0 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081
## 3 1 -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938
## 4 1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317
## 5 2 -1.1582331 0.87773675 1.5487178 0.4030339 -0.40719338 0.09592146
## 6 2 -0.4259659 0.96052304 1.1411093 -0.1682521 0.42098688 -0.02972755
## V7 V8 V9 V10 V11 V12
## 1 0.23959855 0.09869790 0.3637870 0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298 0.08510165 -0.2554251 -0.16697441 1.6127267 1.06523531
## 3 0.79146096 0.24767579 -1.5146543 0.20764287 0.6245015 0.06608369
## 4 0.23760894 0.37743587 -1.3870241 -0.05495192 -0.2264873 0.17822823
## 5 0.59294075 -0.27053268 0.8177393 0.75307443 -0.8228429 0.53819555
## 6 0.47620095 0.26031433 -0.5686714 -0.37140720 1.3412620 0.35989384
## V13 V14 V15 V16 V17 V18
## 1 -0.9913898 -0.3111694 1.4681770 -0.4704005 0.20797124 0.02579058
## 2 0.4890950 -0.1437723 0.6355581 0.4639170 -0.11480466 -0.18336127
## 3 0.7172927 -0.1659459 2.3458649 -2.8900832 1.10996938 -0.12135931
## 4 0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279 1.96577500
## 5 1.3458516 -1.1196698 0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337 0.5176168 0.4017259 -0.05813282 0.06865315
## V19 V20 V21 V22 V23 V24
## 1 0.40399296 0.25141210 -0.018306778 0.277837576 -0.11047391 0.06692807
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953 0.10128802 -0.33984648
## 3 -2.26185710 0.52497973 0.247998153 0.771679402 0.90941226 -0.68928096
## 4 -1.23262197 -0.20803778 -0.108300452 0.005273597 -0.19032052 -1.17557533
## 5 0.80348692 0.40854236 -0.009430697 0.798278495 -0.13745808 0.14126698
## 6 -0.03319379 0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
## V25 V26 V27 V28 Amount Class
## 1 0.1285394 -0.1891148 0.133558377 -0.02105305 149.62 0
## 2 0.1671704 0.1258945 -0.008983099 0.01472417 2.69 0
## 3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66 0
## 4 0.6473760 -0.2219288 0.062722849 0.06145763 123.50 0
## 5 -0.2060096 0.5022922 0.219422230 0.21515315 69.99 0
## 6 -0.2327938 0.1059148 0.253844225 0.08108026 3.67 0
summary(transaction)
## Time V1 V2 V3
## Min. : 0 Min. :-56.40751 Min. :-72.71573 Min. :-48.3256
## 1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855 1st Qu.: -0.8904
## Median : 84692 Median : 0.01811 Median : 0.06549 Median : 0.1799
## Mean : 94814 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:139321 3rd Qu.: 1.31564 3rd Qu.: 0.80372 3rd Qu.: 1.0272
## Max. :172792 Max. : 2.45493 Max. : 22.05773 Max. : 9.3826
## V4 V5 V6 V7
## Min. :-5.68317 Min. :-113.74331 Min. :-26.1605 Min. :-43.5572
## 1st Qu.:-0.84864 1st Qu.: -0.69160 1st Qu.: -0.7683 1st Qu.: -0.5541
## Median :-0.01985 Median : -0.05434 Median : -0.2742 Median : 0.0401
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.74334 3rd Qu.: 0.61193 3rd Qu.: 0.3986 3rd Qu.: 0.5704
## Max. :16.87534 Max. : 34.80167 Max. : 73.3016 Max. :120.5895
## V8 V9 V10 V11
## Min. :-73.21672 Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.20863 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : 0.02236 Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.32735 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 20.00721 Max. : 15.59500 Max. : 23.74514 Max. :12.01891
## V12 V13 V14 V15
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143 Min. :-4.49894
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256 1st Qu.:-0.58288
## Median : 0.1400 Median :-0.01357 Median : 0.0506 Median : 0.04807
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931 3rd Qu.: 0.64882
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268 Max. : 8.87774
## V16 V17 V18
## Min. :-14.12985 Min. :-25.16280 Min. :-9.498746
## 1st Qu.: -0.46804 1st Qu.: -0.48375 1st Qu.:-0.498850
## Median : 0.06641 Median : -0.06568 Median :-0.003636
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.52330 3rd Qu.: 0.39968 3rd Qu.: 0.500807
## Max. : 17.31511 Max. : 9.25353 Max. : 5.041069
## V19 V20 V21
## Min. :-7.213527 Min. :-54.49772 Min. :-34.83038
## 1st Qu.:-0.456299 1st Qu.: -0.21172 1st Qu.: -0.22839
## Median : 0.003735 Median : -0.06248 Median : -0.02945
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.458949 3rd Qu.: 0.13304 3rd Qu.: 0.18638
## Max. : 5.591971 Max. : 39.42090 Max. : 27.20284
## V22 V23 V24
## Min. :-10.933144 Min. :-44.80774 Min. :-2.83663
## 1st Qu.: -0.542350 1st Qu.: -0.16185 1st Qu.:-0.35459
## Median : 0.006782 Median : -0.01119 Median : 0.04098
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.528554 3rd Qu.: 0.14764 3rd Qu.: 0.43953
## Max. : 10.503090 Max. : 22.52841 Max. : 4.58455
## V25 V26 V27
## Min. :-10.29540 Min. :-2.60455 Min. :-22.565679
## 1st Qu.: -0.31715 1st Qu.:-0.32698 1st Qu.: -0.070840
## Median : 0.01659 Median :-0.05214 Median : 0.001342
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.35072 3rd Qu.: 0.24095 3rd Qu.: 0.091045
## Max. : 7.51959 Max. : 3.51735 Max. : 31.612198
## V28 Amount Class
## Min. :-15.43008 Min. : 0.00 Min. :0.000000
## 1st Qu.: -0.05296 1st Qu.: 5.60 1st Qu.:0.000000
## Median : 0.01124 Median : 22.00 Median :0.000000
## Mean : 0.00000 Mean : 88.35 Mean :0.001728
## 3rd Qu.: 0.07828 3rd Qu.: 77.17 3rd Qu.:0.000000
## Max. : 33.84781 Max. :25691.16 Max. :1.000000
str(transaction)
## 'data.frame': 284807 obs. of 31 variables:
## $ Time : num 0 0 1 1 2 2 4 7 7 9 ...
## $ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
## $ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
## $ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
## $ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
## $ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
## $ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
## $ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
## $ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
## $ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
## $ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
## $ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
## $ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
## $ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
## $ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
## $ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
## $ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
## $ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
## $ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
## $ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
## $ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
## $ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
## $ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
## $ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
## $ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
## $ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
## $ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
## $ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
## $ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
## $ Amount: num 149.62 2.69 378.66 123.5 69.99 ...
## $ Class : int 0 0 0 0 0 0 0 0 0 0 ...
sapply(transaction, FUN=class)
## Time V1 V2 V3 V4 V5 V6 V7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## V8 V9 V10 V11 V12 V13 V14 V15
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## V16 V17 V18 V19 V20 V21 V22 V23
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## V24 V25 V26 V27 V28 Amount Class
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "integer"
The basic data exploration shows that all of the predictor variables are numeric, a consequence of the PCA transformation, while the target variable Class is stored as an integer.
table(transaction$Class)
##
## 0 1
## 284315 492
prop.table(table(transaction$Class))
##
## 0 1
## 0.998272514 0.001727486
ggplot(data = transaction,aes(x=Class))+
geom_bar(col="slateblue3")
The dataset is highly imbalanced: the positive class (frauds) accounts for a mere 0.172% of all transactions. That is good news for the credit card company, but it makes the modeling task harder.
ggplot(data = transaction,aes(y=Amount,x=Class))+
geom_point(col="slateblue3") +
facet_grid(~Class)
This plot shows that all the fraudulent transactions involve amounts below roughly 2500, considerably less than the largest legitimate transactions.
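This claim can be checked directly from the data:
# Largest amount among fraudulent transactions (should be well below the
# overall maximum of 25691.16 shown in the summary above)
max(transaction$Amount[transaction$Class == 1])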
ggplot(data = transaction, aes(x = Time, fill = factor(Class))) +
  geom_histogram(col = 'red', bins = 48) +
  facet_grid(Class ~ ., scales = 'free_y')
Because the unit of measure for Time is seconds, the raw histogram is hard to read. Converting the elapsed time to hours makes the pattern over the two days much clearer.
# Convert elapsed seconds to (rounded) hours
transaction$hours <- round(transaction$Time/3600)
unique(transaction$hours)
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## [26] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
ggplot(data = transaction,
aes(x=hours,fill=factor(Class)))+
geom_histogram(col='red',bins=48) +
facet_grid(Class ~ ., scales = 'free_y')
# Replace Time with the hour-level version and drop the helper column
transaction$Time <- transaction$hours
transaction <- subset(transaction, select = -hours)
# Ensure Class is numeric so it can be included in the correlation matrix
transaction$Class <- as.numeric(transaction$Class)
corr <- cor(transaction, method = "pearson")
corrplot(corr,method = "circle", type = "lower")
The plot shows little correlation among the variables, as expected since V1-V28 are principal components. “V2” has some negative correlation with “Amount”, but it is weak enough to ignore.
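To read the exact figures behind the plot, the correlations with “Amount” can be listed directly:
# Correlation of every variable with Amount, sorted for readability
round(sort(corr[, "Amount"]), 2)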
transaction$Class <- as.factor(transaction$Class)
# For reproducibility of the random selection
set.seed(123)
split = sample.split(transaction$Class, SplitRatio = 0.75)
training_set = subset(transaction, split == TRUE)
test_set = subset(transaction, split == FALSE)
table(training_set$Class)
##
## 0 1
## 213236 369
prop.table(table(training_set$Class))
##
## 0 1
## 0.998272512 0.001727488
table(test_set$Class)
##
## 0 1
## 71079 123
prop.table(table(test_set$Class))
##
## 0 1
## 0.99827252 0.00172748
To balance the minority (fraud) and majority classes in the training set, we use the SMOTE technique. Setting the ratio to 0.8 means the minority class is oversampled until it reaches 80% of the size of the majority class, giving the model enough fraudulent examples to learn the minority class properly. SMOTE is applied only to the training set; the test set keeps its original class distribution.
smoted_training_set <- oversample(dataset = training_set,
                                  method = 'SMOTE', ratio = 0.8)
table(smoted_training_set$Class)
##
## 0 1
## 213236 170589
prop.table(table(smoted_training_set$Class))
##
## 0 1
## 0.5555553 0.4444447
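For intuition, the heart of SMOTE is interpolation between a minority observation and one of its nearest minority neighbours. A minimal sketch of that single step (illustrative only; the imbalance package handles neighbour search and the target ratio for us):
# One SMOTE step: a synthetic fraud example is a random point on the
# segment between a real fraud case x and a nearby fraud case neighbour
# (both plain numeric feature vectors)
smote_one_point <- function(x, neighbour) {
  gap <- runif(1)               # random interpolation weight in [0, 1]
  x + gap * (neighbour - x)
}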
colnames(smoted_training_set)
## [1] "Time" "V1" "V2" "V3" "V4" "V5" "V6" "V7"
## [9] "V8" "V9" "V10" "V11" "V12" "V13" "V14" "V15"
## [17] "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23"
## [25] "V24" "V25" "V26" "V27" "V28" "Amount" "Class"
# Standardize all predictors; column 31 is the Class label
smoted_training_set[-31] <- scale(smoted_training_set[-31])
test_set[-31] <- scale(test_set[-31])
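The code above scales the test set with its own means and standard deviations. A common alternative, sketched below with hypothetical train_scaled/test_scaled objects (not what was done here), is to reuse the training set's scaling parameters so both splits share one scale:
# Sketch: scale the test set with the training set's centers and SDs
# (assumes column 31 is Class, as above)
train_scaled <- scale(smoted_training_set[-31])
test_scaled  <- scale(test_set[-31],
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))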
classifier <- glm(formula = Class ~ .,
family = binomial,
data = smoted_training_set)
summary(classifier)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = smoted_training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.1981 -0.0515 0.0000 3.1189
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.21357 0.20456 64.595 <2e-16 ***
## Time -0.48566 0.01520 -31.960 <2e-16 ***
## V1 4.16642 0.10716 38.882 <2e-16 ***
## V2 2.01935 0.12771 15.812 <2e-16 ***
## V3 1.69492 0.10179 16.651 <2e-16 ***
## V4 2.65980 0.03521 75.548 <2e-16 ***
## V5 3.53083 0.10236 34.493 <2e-16 ***
## V6 -0.84735 0.02771 -30.578 <2e-16 ***
## V7 -3.19131 0.17999 -17.730 <2e-16 ***
## V8 -2.03610 0.05362 -37.973 <2e-16 ***
## V9 -1.16819 0.03426 -34.100 <2e-16 ***
## V10 -3.56501 0.08646 -41.235 <2e-16 ***
## V11 1.92890 0.03436 56.137 <2e-16 ***
## V12 -5.41377 0.07636 -70.902 <2e-16 ***
## V13 -0.56755 0.01001 -56.699 <2e-16 ***
## V14 -6.79996 0.08555 -79.487 <2e-16 ***
## V15 -0.32356 0.01028 -31.479 <2e-16 ***
## V16 -2.29176 0.06361 -36.028 <2e-16 ***
## V17 -5.63337 0.13085 -43.051 <2e-16 ***
## V18 -1.12256 0.03900 -28.786 <2e-16 ***
## V19 0.37196 0.01656 22.463 <2e-16 ***
## V20 -0.74733 0.03340 -22.376 <2e-16 ***
## V21 0.38731 0.03909 9.908 <2e-16 ***
## V22 0.97774 0.02117 46.188 <2e-16 ***
## V23 0.49043 0.03245 15.114 <2e-16 ***
## V24 0.01944 0.01063 1.828 0.0675 .
## V25 -0.03240 0.01397 -2.319 0.0204 *
## V26 0.09861 0.01000 9.861 <2e-16 ***
## V27 0.08428 0.03446 2.445 0.0145 *
## V28 0.22122 0.01843 12.001 <2e-16 ***
## Amount 1.86849 0.08204 22.774 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 527346 on 383824 degrees of freedom
## Residual deviance: 78623 on 383794 degrees of freedom
## AIC: 78685
##
## Number of Fisher Scoring iterations: 13
The model summary shows that almost all of the variables are statistically significant. Only V24 fails the 5% significance test (p = 0.0675); since its p-value is only slightly above the threshold, we keep it in the model.
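The claim can be checked programmatically by listing every coefficient whose p-value exceeds the 5% level:
# Coefficients not significant at the 5% level (per the summary above,
# this should list only V24)
coefs <- summary(classifier)$coefficients
rownames(coefs)[coefs[, "Pr(>|z|)"] > 0.05]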
# Predicted fraud probabilities for the test set
prob_pred <- predict(classifier, type = 'response', newdata = test_set[-31])
# Classify as fraud when the probability exceeds the 0.4 cut-off
y_pred <- ifelse(prob_pred > 0.4, 1, 0)
The AUROC (area under the receiver operating characteristic curve) measures how well the model separates the two classes across all classification thresholds, making it a more informative summary of overall performance than raw accuracy on an imbalanced dataset.
calc_auc <- function(prob_pred, test_y){
  # Build an ROCR prediction object from the predicted probabilities
  predict_log <- prediction(prob_pred, test_y)
  # Plot the ROC curve (true positive rate vs. false positive rate)
  roc_curve1 <- performance(predict_log, "tpr", "fpr")
  plot(roc_curve1, colorize = TRUE)
  # Compute and return the AUC
  auc <- performance(predict_log, "auc")
  paste(auc@y.values[[1]])
}
# Confusion matrix at the 0.4 cut-off (y_pred is already 0/1, so the
# comparison simply maps it to FALSE/TRUE column labels)
cm <- table(test_set[, 31], y_pred > 0.4)
cm
##
## FALSE TRUE
## 0 3263 67816
## 1 0 123
print(calc_auc(prob_pred,test_set$Class))
## [1] "0.974189831376961"
The AUC value of 0.97 is very high. However, we should not declare this the best model for our dataset without comparing it against some alternatives.
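As a cross-check, the pROC package (loaded at the start but otherwise unused) computes the same quantity; its result should agree with the ROCR figure above:
# AUC via pROC, from the true labels and predicted probabilities
pROC::auc(test_set$Class, prob_pred)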
set.seed(123)
forest.classifier = randomForest(x = smoted_training_set[-31],
y = smoted_training_set$Class,
ntree = 100)
forest.y_pred = predict(forest.classifier, newdata = test_set[-31])
forest.y_prob = predict(forest.classifier, newdata = test_set[-31], type='prob')
# AUC from the forest's hard class predictions (forest.y_prob would give
# a smoother ROC curve, but we mirror the approach used for the tree below)
forest.predict_log <- prediction(as.numeric(forest.y_pred),
                                 as.numeric(test_set$Class))
forest.auc <- performance(forest.predict_log, "auc")
print(forest.auc@y.values[[1]])
## [1] 0.8490134
The AUC value in this case comes out around 0.85, noticeably lower than the 0.97 achieved by logistic regression, so logistic regression remains the better model for this dataset.
cm <- table(test_set[, 31], forest.y_pred)
cm
## forest.y_pred
## 0 1
## 0 64062 7017
## 1 25 98
roc_curve2<- performance(forest.predict_log,"tpr","fpr")
plot(roc_curve2, colorize=T)
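As an exploratory aside (not part of the original comparison), randomForest's built-in variable importance plot shows which components drive the forest's decisions:
# Ten most important variables according to the fitted forest
varImpPlot(forest.classifier, n.var = 10)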
# Single decision tree on the same SMOTE-balanced training set
tree.model <- rpart(Class ~ ., data = smoted_training_set, method = 'class')
tree.predict <- predict(tree.model, newdata = test_set[-31], type = 'class')
tree.predict_log <- prediction(as.numeric(tree.predict),
as.numeric(test_set$Class))
tree.auc <- performance(tree.predict_log, "auc")
print(tree.auc@y.values[[1]])
## [1] 0.6818205
The decision tree's AUC comes out to 0.68, which is significantly lower than that of logistic regression.
roc_curve3<- performance(tree.predict_log,"tpr","fpr")
plot(roc_curve3, colorize=T)
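Finally, the three AUC values obtained above can be collected into one small comparison table:
# Summary of the AUC values reported in this analysis
data.frame(model = c("Logistic regression", "Random forest", "Decision tree"),
           AUC   = c(0.974, 0.849, 0.682))
By this measure, logistic regression trained on the SMOTE-balanced data is the clear winner for this dataset.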