Introduction

Oversampling and undersampling are techniques used in data mining and data analytics to rebalance unequal classes and create balanced data sets. Together they are known as resampling.

When one class is the underrepresented minority in the data sample, oversampling techniques may be used to duplicate its observations so that training sees a more balanced share of positive results. Oversampling is typically used when too little data has been collected for the minority class. A popular oversampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by interpolating between the feature values of a minority-class observation and those of its nearest minority-class neighbors.
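
As a minimal sketch of the simplest form of oversampling (plain duplication, not SMOTE), the following base R snippet resamples minority rows with replacement until both classes have the same count. The data frame `toy` and its columns are made up for illustration and are not part of the fraud data set.

```r
# Random oversampling by duplication on a hypothetical toy data frame.
set.seed(42)
toy <- data.frame(amount = c(10, 25, 7, 300, 12, 18, 450, 9),
                  class  = factor(c(0, 0, 0, 1, 0, 0, 1, 0)))

counts         <- table(toy$class)
minority_label <- names(counts)[which.min(counts)]
minority_rows  <- which(toy$class == minority_label)

# Number of extra minority rows needed to match the majority class.
n_extra <- max(counts) - min(counts)

# Draw minority row indices with replacement and append the duplicated rows.
extra    <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
balanced <- rbind(toy, toy[extra, ])

table(balanced$class)
```

Duplication adds no new information, which is one motivation for synthetic methods such as SMOTE.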

In this study, I create three synthetically balanced datasets from one imbalanced training dataset, using the “smotefamily” R package to apply the following techniques: SMOTE, ADASYN, and DB-SMOTE.

SMOTE (Synthetic Minority Oversampling Technique): Each minority-class observation is taken as a seed, and new synthetic examples are generated along the line segments joining it to its nearest minority-class neighbors. The method operates in the “feature space” rather than the “data space.”
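
The interpolation idea can be sketched in a few lines of base R. This is an illustrative toy, not the code used by the smotefamily package; the matrix `minority` and the helper `smote_one` are hypothetical names introduced here.

```r
# Toy SMOTE-style interpolation (illustrative only).
set.seed(1)

# Hypothetical minority-class points in a 2-D feature space.
minority <- matrix(c(1.0, 1.0,
                     1.2, 0.9,
                     0.8, 1.1,
                     1.1, 1.3),
                   ncol = 2, byrow = TRUE)

smote_one <- function(X, i, k = 2) {
  # Euclidean distance from point i to every minority point.
  diffs <- X - matrix(X[i, ], nrow(X), ncol(X), byrow = TRUE)
  d <- sqrt(rowSums(diffs^2))
  d[i] <- Inf                          # exclude the point itself
  neighbours <- order(d)[seq_len(k)]   # indices of the k nearest neighbours
  j <- neighbours[sample.int(k, 1)]    # pick one neighbour at random
  gap <- runif(1)                      # random position along the segment
  X[i, ] + gap * (X[j, ] - X[i, ])
}

# One synthetic point per minority observation.
synthetic <- t(sapply(seq_len(nrow(minority)), function(i) smote_one(minority, i)))
synthetic
```

Each synthetic point lies on the segment between a real minority point and one of its neighbors, so it stays inside the region already occupied by the minority class.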

ADASYN (Adaptive Synthetic Sampling): A weighted distribution over the minority-class instances is used, based on how difficult each one is to learn. More synthetic observations are generated for minority instances that are harder to learn than for the others.
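
One common way to measure "difficulty" is the share of majority-class points among a minority point's k nearest neighbors. The sketch below computes such weights on made-up data, assuming that definition; it is not the smotefamily implementation, and `X`, `y`, `k`, and `n_new` are illustrative values.

```r
# Toy ADASYN-style difficulty weights (illustrative only).
X <- matrix(c(0.0, 0.0,   # minority, far from the majority
              0.1, 0.1,   # minority
              0.2, 0.0,   # minority
              1.0, 1.0,   # minority, surrounded by majority points
              1.1, 1.0,   # majority
              0.9, 1.1,   # majority
              1.0, 0.9,   # majority
              1.2, 1.2),  # majority
            ncol = 2, byrow = TRUE)
y <- c(1, 1, 1, 1, 0, 0, 0, 0)
k <- 3

d <- as.matrix(dist(X))   # pairwise Euclidean distances
diag(d) <- Inf            # a point is not its own neighbour

minority_idx <- which(y == 1)

# r_i: share of majority points among the k nearest neighbours of minority point i.
r <- sapply(minority_idx, function(i) {
  nn <- order(d[i, ])[seq_len(k)]
  mean(y[nn] == 0)
})

weights <- r / sum(r)     # normalised difficulty weights

n_new <- 10               # hypothetical number of synthetic points to create
round(weights * n_new)    # synthetic points allotted to each minority point
```

The minority point that sits among majority points receives the largest weight, so most synthetic observations are generated around it.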

DB-SMOTE (Density-Based SMOTE): This method over-samples the minority class near the decision boundary while aiming to maintain the majority-class detection rate. Minority instances near the boundary are more likely to be misclassified than those far from it.

Data:

The payment fraud data set (Dal Pozzolo et al. 2015) was downloaded from Kaggle. It contains features and labels for credit card transactions, each labeled as fraudulent or valid. The data set has 284,807 observations of 31 variables.

Time: Number of seconds elapsed between this transaction and the first transaction in the data set
V1-V28: Likely the result of PCA dimensionality reduction to protect users' information
Amount: Transaction amount
Class: Class label, fraudulent (1) or valid (0)

Objective:

The goal of this study is to classify transactions as either fraudulent or valid using machine learning techniques combined with SMOTE-based resampling methods.



Loading Libraries

library(plyr)
library(smotefamily) ## Loading smotefamily to balance the unbalanced classes
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
#library(yardstick)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Preparing the workspace
# Clear the workspace:
rm(list = ls())

Data Preparation/Exploration

# Load data
df <- read.csv("Creditfraud.csv", header = TRUE)

names(df)
##  [1] "Time"   "V1"     "V2"     "V3"     "V4"     "V5"     "V6"     "V7"    
##  [9] "V8"     "V9"     "V10"    "V11"    "V12"    "V13"    "V14"    "V15"   
## [17] "V16"    "V17"    "V18"    "V19"    "V20"    "V21"    "V22"    "V23"   
## [25] "V24"    "V25"    "V26"    "V27"    "V28"    "Amount" "Class"
dim(df)
## [1] 284807     31
str(df)
## 'data.frame':    284807 obs. of  31 variables:
##  $ Time  : num  0 0 1 1 2 2 4 7 7 9 ...
##  $ V1    : num  -1.36 1.192 -1.358 -0.966 -1.158 ...
##  $ V2    : num  -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
##  $ V3    : num  2.536 0.166 1.773 1.793 1.549 ...
##  $ V4    : num  1.378 0.448 0.38 -0.863 0.403 ...
##  $ V5    : num  -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
##  $ V6    : num  0.4624 -0.0824 1.8005 1.2472 0.0959 ...
##  $ V7    : num  0.2396 -0.0788 0.7915 0.2376 0.5929 ...
##  $ V8    : num  0.0987 0.0851 0.2477 0.3774 -0.2705 ...
##  $ V9    : num  0.364 -0.255 -1.515 -1.387 0.818 ...
##  $ V10   : num  0.0908 -0.167 0.2076 -0.055 0.7531 ...
##  $ V11   : num  -0.552 1.613 0.625 -0.226 -0.823 ...
##  $ V12   : num  -0.6178 1.0652 0.0661 0.1782 0.5382 ...
##  $ V13   : num  -0.991 0.489 0.717 0.508 1.346 ...
##  $ V14   : num  -0.311 -0.144 -0.166 -0.288 -1.12 ...
##  $ V15   : num  1.468 0.636 2.346 -0.631 0.175 ...
##  $ V16   : num  -0.47 0.464 -2.89 -1.06 -0.451 ...
##  $ V17   : num  0.208 -0.115 1.11 -0.684 -0.237 ...
##  $ V18   : num  0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
##  $ V19   : num  0.404 -0.146 -2.262 -1.233 0.803 ...
##  $ V20   : num  0.2514 -0.0691 0.525 -0.208 0.4085 ...
##  $ V21   : num  -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
##  $ V22   : num  0.27784 -0.63867 0.77168 0.00527 0.79828 ...
##  $ V23   : num  -0.11 0.101 0.909 -0.19 -0.137 ...
##  $ V24   : num  0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
##  $ V25   : num  0.129 0.167 -0.328 0.647 -0.206 ...
##  $ V26   : num  -0.189 0.126 -0.139 -0.222 0.502 ...
##  $ V27   : num  0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
##  $ V28   : num  -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
##  $ Amount: num  149.62 2.69 378.66 123.5 69.99 ...
##  $ Class : int  0 0 0 0 0 0 0 0 0 0 ...


## Remove rows that do not have target variable values
newdf <- df[!(is.na(df$Class)),]

newdf$class <- factor(newdf$Class)

newdf <- select(newdf, -c(Class))

head(newdf)
##   Time         V1          V2        V3         V4          V5          V6
## 1    0 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
## 2    0  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
## 3    1 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
## 4    1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
## 5    2 -1.1582331  0.87773676 1.5487178  0.4030339 -0.40719338  0.09592146
## 6    2 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
##            V7          V8         V9         V10        V11         V12
## 1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
## 3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
## 4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
## 5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
## 6  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
##          V13        V14        V15        V16         V17         V18
## 1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
## 2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
## 3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
## 4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
## 5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
##           V19         V20          V21          V22         V23         V24
## 1  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391  0.06692808
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802 -0.33984648
## 3 -2.26185709  0.52497973  0.247998153  0.771679402  0.90941226 -0.68928096
## 4 -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052 -1.17557533
## 5  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808  0.14126698
## 6 -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
##          V25        V26          V27         V28 Amount class
## 1  0.1285394 -0.1891148  0.133558377 -0.02105305 149.62     0
## 2  0.1671704  0.1258945 -0.008983099  0.01472417   2.69     0
## 3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66     0
## 4  0.6473760 -0.2219288  0.062722849  0.06145763 123.50     0
## 5 -0.2060096  0.5022922  0.219422230  0.21515315  69.99     0
## 6 -0.2327938  0.1059148  0.253844225  0.08108026   3.67     0
# Get the count for each class
# (dplyr::count takes an unquoted column name; a quoted string is counted as a constant)
count(newdf, class)
##   class      n
## 1     0 284315
## 2     1    492

Data Partition

set.seed(111)
#Data Partition
ind <- sample(2, nrow(newdf), replace = TRUE, prob = c(0.8, 0.2))

train <- newdf[ind==1,]
test <-  newdf[ind==2,]

# Get the count and proportion of each class
head(train)
##   Time         V1          V2        V3         V4          V5          V6
## 1    0 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
## 2    0  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
## 3    1 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
## 4    1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
## 5    2 -1.1582331  0.87773676 1.5487178  0.4030339 -0.40719338  0.09592146
## 6    2 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
##            V7          V8         V9         V10        V11         V12
## 1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
## 3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
## 4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
## 5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
## 6  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
##          V13        V14        V15        V16         V17         V18
## 1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
## 2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
## 3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
## 4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
## 5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
##           V19         V20          V21          V22         V23         V24
## 1  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391  0.06692808
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802 -0.33984648
## 3 -2.26185709  0.52497973  0.247998153  0.771679402  0.90941226 -0.68928096
## 4 -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052 -1.17557533
## 5  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808  0.14126698
## 6 -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
##          V25        V26          V27         V28 Amount class
## 1  0.1285394 -0.1891148  0.133558377 -0.02105305 149.62     0
## 2  0.1671704  0.1258945 -0.008983099  0.01472417   2.69     0
## 3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66     0
## 4  0.6473760 -0.2219288  0.062722849  0.06145763 123.50     0
## 5 -0.2060096  0.5022922  0.219422230  0.21515315  69.99     0
## 6 -0.2327938  0.1059148  0.253844225  0.08108026   3.67     0
count(train, class)
##   class      n
## 1     0 227346
## 2     1    388
prop.table(table(train$class))
## 
##           0           1 
## 0.998296258 0.001703742
prop.table(table(test$class))
## 
##           0           1 
## 0.998177772 0.001822228
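
The partition above assigns each row to train or test independently, so the fraud share differs slightly between the two sets. One alternative is a stratified split, which samples within each class so the proportions match. The sketch below shows the idea in base R on a made-up data frame `toy`; it is not the split used in this study.

```r
# Stratified 80/20 split on hypothetical toy data.
set.seed(111)
toy <- data.frame(x = rnorm(1000),
                  class = factor(c(rep(0, 980), rep(1, 20))))

# Sample 80% of the row indices within each class separately.
train_idx <- unlist(
  lapply(split(seq_len(nrow(toy)), toy$class), function(idx) {
    idx[sample.int(length(idx), round(0.8 * length(idx)))]
  }),
  use.names = FALSE
)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both sets keep the same 2% minority share.
prop.table(table(toy_train$class))
prop.table(table(toy_test$class))
```

With a minority class this rare, stratifying guarantees the test set keeps its share of fraud cases rather than leaving it to chance.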

SMOTE implementation

# SMOTE: Synthetic Minority Oversampling Technique to handle class imbalance in binary classification
SMOTed <- SMOTE(train[,-31],train$class, K=5)
SMOTed <- SMOTed$data # extract only the balanced dataset
SMOTed$class <- as.factor(SMOTed$class)
head(SMOTed)
##     Time         V1          V2          V3        V4         V5         V6
## 1 141565  0.1149650  0.76676155  -0.4941322  0.116772  0.8681685 -0.4779820
## 2  93834 -3.7656801  5.89073524 -10.2022676 10.259036 -5.6114484 -3.2353756
## 3  93853 -5.8391916  7.15153235 -12.8167601  7.031115 -9.6512722 -2.9384273
## 4  67571 -0.7584687 -0.04541028  -0.1684383 -1.313275 -1.9017625  0.7394334
## 5 160895 -0.8482902  2.71988212  -6.1990702  3.044437 -3.3019096 -1.9921168
## 6  29785  0.9237644  0.34404807  -2.8800037  1.721680 -3.0195648 -0.6397361
##            V7          V8         V9         V10       V11           V12
## 1   0.4384957  0.06307334 -0.1862071  -0.1593251  1.200304   0.281744447
## 2 -10.6326835  3.27271633 -5.2689052 -11.1821254  8.879476 -18.431131030
## 3 -11.5432072  4.84362653 -3.4942757 -13.3207889  8.460244 -17.003289450
## 4   3.0718920 -0.48342224  0.6182028  -1.7690597 -0.651414  -0.005423291
## 5  -3.7349018  1.52007897 -2.5487882  -4.5335154  2.288022  -5.267204670
## 6  -3.8013245  1.29909589  0.8640655  -2.8952517  3.028162  -2.549177308
##          V13         V14        V15          V16         V17        V18
## 1 -0.6238440  -0.6582463 -0.1558884   0.05622735   0.6536621  0.3346553
## 2 -0.2328220 -15.0216573  0.1411863 -12.18636250 -20.1655674 -7.0516514
## 3  0.1015566 -14.0944517  0.7470308 -12.66169572 -18.9124938 -6.6269748
## 4 -0.5171943   0.2174698  0.8835588  -1.17397765   0.2433471 -0.3423007
## 5  0.3948000  -4.2879958  1.3152795  -6.46918689  -8.7139201 -3.7050697
## 6 -1.5604318  -2.9713168  1.0788954  -4.70201182  -4.9080985 -1.5088733
##         V19         V20         V21        V22        V23        V24
## 1 1.0289268  0.06219857 -0.28441290 -0.7068655 0.13140486  0.6007421
## 2 2.5008272  1.19413732  2.24560593  0.5463207 0.38185343  0.3820245
## 3 4.0089207  0.05568388  2.46205591  1.0548652 0.53048060  0.4726698
## 4 0.6870562 -0.03249990  0.04261946  0.3972238 0.07222887 -0.2422760
## 5 3.5310029  0.31957627  1.12522926  0.8052579 0.19911919  0.0352062
## 6 3.0016851  0.17087177  0.89993118  1.4812710 0.72526555  0.1769596
##           V25        V26        V27         V28 Amount class
## 1 -0.60426428  0.2629379 0.09914463  0.01080972   4.49     1
## 2 -0.82103649  0.3943551 1.41296099  0.78240705   0.01     1
## 3 -0.27599797  0.2824350 0.10488602  0.25441710 316.06     1
## 4  0.56091647 -0.5409546 0.15060645 -0.11714012 549.06     1
## 5  0.01215876  0.6016578 0.13746768 -0.17139698 127.14     1
## 6 -1.81563798 -0.5365171 0.48903502 -0.04972858  30.30     1
# Get the count and proportion of each class
prop.table(table(SMOTed$class))
## 
##         0         1 
## 0.5004028 0.4995972
(table(SMOTed$class))
## 
##      0      1 
## 227346 226980

ADASYN implementation

#ADASYN Balanced
ADASed <- ADAS(train[,-31],train$class,K = 5)
ADASed <- ADASed$data  # extract only the balanced dataset
ADASed$class <- as.factor(ADASed$class)

# Get the count and proportion of each class
prop.table(table(ADASed$class))
## 
##         0         1 
## 0.4998549 0.5001451
(table(ADASed$class))
## 
##      0      1 
## 227346 227478

Density-Based SMOTE implementation

#Density based SMOTE
DBSMOTed <- DBSMOTE(train[,-31],train$class)
## [1] "DBSMOTE is Done"
DBSMOTed <- DBSMOTed$data # extract only the balanced dataset
DBSMOTed$class <- as.factor(DBSMOTed$class)

head(DBSMOTed)
##     Time          V1        V2          V3       V4          V5        V6
## 1   8757  -1.8637556  3.442644  -4.4682597 2.805336  -2.1184125 -2.332285
## 2  84694  -4.8681084  1.264420  -5.1678854 3.193648  -3.0456214 -2.096166
## 3  87883  -1.3602926 -0.458069  -0.7004039 2.737229  -1.0051060  2.891399
## 4  79540  -0.1143607  1.036129   1.9844053 3.128243  -0.7403436  1.548619
## 5  25254 -17.2751912 10.819665 -20.3638860 6.046612 -13.4650334 -4.166647
## 6 110547  -1.5328104  2.232752  -5.9231001 3.386708  -0.1534433 -1.419748
##           V7        V8        V9        V10       V11        V12        V13
## 1  -4.261237  1.701682 -1.439396 -6.9999066 6.3162097 -8.6708180  0.3160240
## 2  -6.445610  2.422536 -3.214055 -8.7459726 5.4160419 -8.1641251 -0.1650106
## 3   5.802537 -1.933197 -1.017717  1.9878619 0.5041161 -0.8634309 -0.1844501
## 4  -1.701284 -2.203842 -1.242265  0.2695618 1.2934183  0.9332158 -0.1353260
## 5 -14.409448 11.580797 -4.073856 -9.1533680 6.2108830 -8.7785720 -0.0613675
## 6  -3.878576  1.444656 -1.465542 -5.2083346 4.5463011 -7.7611940  1.1595403
##           V14         V15         V16         V17        V18        V19
## 1  -7.4177121 -0.43653747 -3.65280196  -6.2931453 -1.2432483  0.3648105
## 2 -10.1935304 -1.89521031 -7.36047461 -14.6687711 -4.8771190  1.3856095
## 3  -1.0169155 -1.55941020  1.15431269  -2.0438577 -0.1516988 -0.9435135
## 4   0.5214837  0.38688419  0.05986895   0.3063394  0.2650520  0.2237183
## 5  -9.5746623  0.04928854 -7.41848728 -14.1027719 -5.0164233  1.3903144
## 6  -5.2316115 -0.17164212 -4.71982950  -4.8479920 -1.1343287  3.5277383
##          V20        V21         V22         V23        V24        V25
## 1  0.3609240  0.6679266 -0.51624236 -0.01221781  0.0706137 0.05850447
## 2  0.6673098  1.2692054  0.05765725  0.62930740 -0.1684318 0.44374390
## 3 -1.4934014 -0.9369895 -0.05381159  0.58010589  0.2169273 0.15164293
## 4  0.7328525 -1.0329347  1.19642831 -0.11285672  0.2547190 0.69666789
## 5  1.5449705  1.7298041 -1.20809608 -0.72683922  0.1125397 1.11919347
## 6  0.5208399  0.6325047 -0.07083795 -0.49029132 -0.3599831 0.05067753
##          V26        V27        V28 Amount class
## 1  0.3048828  0.4180125  0.2088583   1.00     1
## 2  0.2765395  1.4412740 -0.1279437  12.31     1
## 3 -0.3321146 -0.4698003 -1.4950059 829.41     1
## 4  0.4823704  0.1299693  0.2239243   0.20     1
## 5 -0.2331890  1.6840630  0.5037397  99.99     1
## 6  1.0956713  0.4717414 -0.1066671   0.76     1
# Get the count and proportion of each class
prop.table(table(DBSMOTed$class))
## 
##         0         1 
## 0.5187551 0.4812449
(table(DBSMOTed$class))
## 
##      0      1 
## 227346 210907

Logistic regression Model Implementation

model_base_glm <-  glm(class~., data=train, family=binomial)
model_SMOTed_glm <- glm(class~., data=SMOTed, family=binomial)
model_ADASed_glm <- glm(class~., data=ADASed, family=binomial)
model_DBSMOTed_glm <- glm(class~., data=DBSMOTed, family=binomial)

Predictions

#For Logistic regression
predict_base      <- predict(model_base_glm,     test, type='response')
predict_SMOTed    <- predict(model_SMOTed_glm,   test, type='response')
predict_ADASed    <- predict(model_ADASed_glm,   test, type='response')
predict_DBSMOTed  <- predict(model_DBSMOTed_glm, test, type='response')

Confusion Matrix

#Logistic Regression Model
#Base Model
(base_t<-table(as.factor(test$class), predict_base>0.5))
##    
##     FALSE  TRUE
##   0 56961     8
##   1    42    62
#SMOTed Model
(SMOTed_t<-table(test$class, predict_SMOTed>0.5))
##    
##     FALSE  TRUE
##   0 56554   415
##   1    16    88
#ADASed Model
(ADASed_t <-table(test$class, predict_ADASed>0.5))
##    
##     FALSE  TRUE
##   0 56529   440
##   1    15    89
#DBSMOTed Model
(DBSMOTed_t <- table(test$class, predict_DBSMOTed>0.5))
##    
##     FALSE  TRUE
##   0 56630   339
##   1    18    86
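
To make the metrics derived from these tables explicit, the sketch below rebuilds the base model's confusion matrix from the counts printed above and computes precision, recall, and F1 for the fraud class. The helper `class_metrics` is a hypothetical name introduced here; it assumes rows are actual classes and columns are predictions, as in the tables above.

```r
# Base model confusion matrix as printed above: rows = actual, columns = predicted.
base_cm <- matrix(c(56961, 42, 8, 62), nrow = 2,
                  dimnames = list(actual = c("0", "1"),
                                  predicted = c("FALSE", "TRUE")))

# Hypothetical helper: precision, recall and F1 for the positive (fraud) class.
class_metrics <- function(cm) {
  tp <- cm[2, 2]   # fraud predicted as fraud
  fp <- cm[1, 2]   # valid predicted as fraud
  fn <- cm[2, 1]   # fraud predicted as valid
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

round(class_metrics(base_cm), 4)
```

Passing each model's own table to the same helper avoids copy-and-paste mistakes when the calculation is repeated for several models.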

Score Comparisons

Score_base <- sum(diag(base_t))/sum(base_t)
Score_SMOTed <- sum(diag(SMOTed_t))/sum(SMOTed_t)
Score_ADASed <- sum(diag(ADASed_t))/sum(ADASed_t)
Score_DBSMOTed <- sum(diag(DBSMOTed_t))/sum(DBSMOTed_t)
# Logistic Regression
(precision_base <- base_t[2,2]/(base_t[2,2]+base_t[1,2]))
## [1] 0.8857143
precision_SMOTed <- SMOTed_t[2,2]/(SMOTed_t[2,2]+SMOTed_t[1,2])
precision_ADASed <- ADASed_t[2,2]/(ADASed_t[2,2]+ADASed_t[1,2])
# Precision of the DB-SMOTE model comes from its own confusion matrix
(precision_DBSMOTed <- DBSMOTed_t[2,2]/(DBSMOTed_t[2,2]+DBSMOTed_t[1,2]))
## [1] 0.2023529
(recall_base <- base_t[2,2] / (base_t[2,2]+base_t[2,1]))
## [1] 0.5961538
recall_SMOTed <- SMOTed_t[2,2] / (SMOTed_t[2,2]+SMOTed_t[2,1])
recall_ADASed <- ADASed_t[2,2] / (ADASed_t[2,2]+ADASed_t[2,1])
(recall_DBSMOTed <- DBSMOTed_t[2,2] / (DBSMOTed_t[2,2]+DBSMOTed_t[2,1]))
## [1] 0.8269231

Dummy Classifier

# Accuracy of a dummy classifier that always predicts the majority class (valid)
(table_df <- table(test$class))
## 
##     0     1 
## 56969   104
table_df[1]/(table_df[1]+table_df[2])
##         0 
## 0.9981778
# F1 scores of the logistic regression models
(F1Score_base  <- (2*precision_base*recall_base)/ (precision_base+recall_base))
## [1] 0.7126437
(F1score_SMOTed   <- (2*precision_SMOTed*recall_SMOTed)/ (precision_SMOTed+recall_SMOTed))
## [1] 0.2899506
(F1Score_ADASed   <- (2*precision_ADASed*recall_ADASed)/ (precision_ADASed+recall_ADASed))
## [1] 0.2812006
(F1Score_DBSMOTed <- (2*precision_DBSMOTed*recall_DBSMOTed)/ (precision_DBSMOTed+recall_DBSMOTed))
## [1] 0.3251418



# Compare the F1 of the models: 2*((Precision*Recall) / (Precision + Recall))

model_compare_f1 <- data.frame(Model = c('F1Score_base_GLM',
                                      'F1score_SMOTed_GLM',
                                      'F1Score_ADASed_GLM',
                                      'F1Score_DBSMOTed_GLM'),
                              F1 = c(F1Score_base,
                                     F1score_SMOTed,
                                     F1Score_ADASed,
                                     F1Score_DBSMOTed))

ggplot(aes(x=reorder(Model,-F1),y = F1),data = model_compare_f1) +
  geom_bar(stat = 'identity',fill = 'light blue') +
  ggtitle('F1 Score Comparison for SMOTE Approach') +
  xlab('Models')  +
  ylab('F1 Measure')+
  geom_text(aes(label = round(F1,2))) + theme_bw() + 
  theme(axis.text.x = element_text(angle = 40))


Comparing F1 scores at the 0.5 threshold, the base model trained on the imbalanced data scores highest (0.71). Among the resampled models, Density-Based SMOTE returns the best result (0.33), ahead of SMOTE (0.29) and ADASYN (0.28). All three resampling methods raise recall from 0.60 to between 0.83 and 0.86, but they do so at the cost of much lower precision.

References:
1. Colorado State University, MAS Program, Data Mining Course Notes
2. https://whatis.techtarget.com/definition/over-sampling-and-under-sampling
3. https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data

