Oversampling and undersampling are techniques used in data mining and data analytics to adjust the class distribution of an imbalanced data set so that its classes are more evenly represented. Together they are known as resampling.
When one class is an underrepresented minority in the sample, oversampling may be used to add minority-class observations, either by duplicating existing ones or by synthesizing new ones, so that training sees a more balanced share of positive cases. Oversampling is typically used when too little data has been collected for the minority class. A popular oversampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by interpolating between a minority-class observation and one of its nearest minority-class neighbors.
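As a minimal sketch of the duplication idea, random oversampling can be written in base R. The toy data frame and the names `toy`, `minority_rows`, and `balanced` are made up for illustration and are not part of this study's data.

```r
# Toy imbalanced data: 95 majority rows (class 0) and 5 minority rows (class 1).
set.seed(1)
toy <- data.frame(
  x     = rnorm(100),
  class = factor(c(rep(0, 95), rep(1, 5)))
)

# Number of extra minority rows needed to match the majority class.
minority_rows <- which(toy$class == 1)
n_extra       <- sum(toy$class == 0) - length(minority_rows)

# Draw existing minority rows with replacement (plain duplication, no synthesis).
extra_rows <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]

balanced <- rbind(toy, toy[extra_rows, ])
table(balanced$class)
```

Both classes end up with 95 rows. Because the added rows are exact copies, a model can overfit to them, which is one motivation for synthetic approaches such as SMOTE.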
In this study, I create three synthetically balanced data sets from one imbalanced training data set. I use the “smotefamily” R package to apply the following techniques: SMOTE, ADASYN, and DB-SMOTE.
SMOTE (Synthetic Minority Oversampling Technique): A minority-class observation is taken as an example, and new, similar synthetic examples are generated in “feature space” rather than “data space,” that is, by combining feature values rather than by copying rows.
ADASYN (Adaptive Synthetic Sampling): A weighted distribution over the minority-class instances is used, based on their degree of learning difficulty. More synthetic observations are generated for minority-class instances that are harder to learn than for those that are easier to learn.
DB-SMOTE (Density-Based SMOTE): This over-samples the minority class at the decision boundary, and also over-samples the surrounding region so that the majority-class detection rate is maintained. Minority-class instances near the border are more likely to be misclassified than those far from it.
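The interpolation step that these techniques share can be sketched in a few lines of base R. The helper `smote_sketch()` below is hypothetical: it is not the `SMOTE()` function from smotefamily and makes no claim about that package's internals. It assumes a purely numeric minority-class matrix and Euclidean distance.

```r
# Hypothetical helper: create n_new synthetic rows by moving a random fraction
# of the way from a minority point towards one of its k nearest minority neighbours.
smote_sketch <- function(minority, n_new, k = 5) {
  minority <- as.matrix(minority)
  n <- nrow(minority)
  stopifnot(n >= 2)
  k <- min(k, n - 1)
  d <- as.matrix(dist(minority))  # pairwise Euclidean distances
  diag(d) <- Inf                  # a point is never its own neighbour
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority),
                      dimnames = list(NULL, colnames(minority)))
  for (i in seq_len(n_new)) {
    base_idx   <- sample.int(n, 1)
    neighbours <- order(d[base_idx, ])[seq_len(k)]
    nb_idx     <- neighbours[sample.int(k, 1)]
    gap        <- runif(1)        # position along the segment between the two points
    synthetic[i, ] <- minority[base_idx, ] +
      gap * (minority[nb_idx, ] - minority[base_idx, ])
  }
  synthetic
}

# Toy usage: 10 minority points in two dimensions, 20 synthetic points.
set.seed(42)
minority   <- cbind(x1 = rnorm(10, mean = 3), x2 = rnorm(10, mean = 3))
new_points <- smote_sketch(minority, n_new = 20, k = 3)
dim(new_points)
```

Each synthetic row lies on the line segment between two real minority points, so it stays inside the range already covered by the minority class. ADASYN and DB-SMOTE differ mainly in which minority points they choose to generate from and how often.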
Data:
The payment fraud data set (Dal Pozzolo et al. 2015) was downloaded from Kaggle. It contains features and labels for 284,807 credit card transactions, each of which is labeled as fraudulent or valid, with 31 variables per observation:
Time: Number of seconds elapsed between this transaction and the first transaction in the data set
V1-V28: Possibly the result of a PCA dimensionality reduction, applied to protect users' information
Amount: Transaction amount
Class: Class label, fraudulent (1) or valid (0)
Objective:
The goal of this study is to classify transactions as either fraudulent or valid using machine learning techniques combined with resampling (SMOTE-family) methods.
Loading Libraries
library(plyr)
library(smotefamily) # provides SMOTE(), ADAS() and DBSMOTE() to balance the unbalanced class
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
#library(yardstick)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Prepare the workspace
# Clear the workspace:
rm(list = ls())
Data Preparation/Exploration
# Load data
df <- read.csv("Creditfraud.csv", header = TRUE)
names(df)
## [1] "Time" "V1" "V2" "V3" "V4" "V5" "V6" "V7"
## [9] "V8" "V9" "V10" "V11" "V12" "V13" "V14" "V15"
## [17] "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23"
## [25] "V24" "V25" "V26" "V27" "V28" "Amount" "Class"
dim(df)
## [1] 284807 31
str(df)
## 'data.frame': 284807 obs. of 31 variables:
## $ Time : num 0 0 1 1 2 2 4 7 7 9 ...
## $ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
## $ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
## $ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
## $ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
## $ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
## $ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
## $ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
## $ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
## $ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
## $ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
## $ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
## $ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
## $ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
## $ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
## $ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
## $ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
## $ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
## $ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
## $ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
## $ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
## $ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
## $ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
## $ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
## $ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
## $ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
## $ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
## $ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
## $ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
## $ Amount: num 149.62 2.69 378.66 123.5 69.99 ...
## $ Class : int 0 0 0 0 0 0 0 0 0 0 ...
## Remove rows that do not have target variable values
newdf <- df[!(is.na(df$Class)),]
newdf$class <- factor(newdf$Class)
newdf <- select(newdf, -c(Class))
head(newdf)
## Time V1 V2 V3 V4 V5 V6
## 1 0 -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778
## 2 0 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081
## 3 1 -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938
## 4 1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317
## 5 2 -1.1582331 0.87773676 1.5487178 0.4030339 -0.40719338 0.09592146
## 6 2 -0.4259659 0.96052304 1.1411093 -0.1682521 0.42098688 -0.02972755
## V7 V8 V9 V10 V11 V12
## 1 0.23959855 0.09869790 0.3637870 0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298 0.08510165 -0.2554251 -0.16697441 1.6127267 1.06523531
## 3 0.79146096 0.24767579 -1.5146543 0.20764287 0.6245015 0.06608369
## 4 0.23760894 0.37743587 -1.3870241 -0.05495192 -0.2264873 0.17822823
## 5 0.59294075 -0.27053268 0.8177393 0.75307443 -0.8228429 0.53819555
## 6 0.47620095 0.26031433 -0.5686714 -0.37140720 1.3412620 0.35989384
## V13 V14 V15 V16 V17 V18
## 1 -0.9913898 -0.3111694 1.4681770 -0.4704005 0.20797124 0.02579058
## 2 0.4890950 -0.1437723 0.6355581 0.4639170 -0.11480466 -0.18336127
## 3 0.7172927 -0.1659459 2.3458649 -2.8900832 1.10996938 -0.12135931
## 4 0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279 1.96577500
## 5 1.3458516 -1.1196698 0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337 0.5176168 0.4017259 -0.05813282 0.06865315
## V19 V20 V21 V22 V23 V24
## 1 0.40399296 0.25141210 -0.018306778 0.277837576 -0.11047391 0.06692808
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953 0.10128802 -0.33984648
## 3 -2.26185709 0.52497973 0.247998153 0.771679402 0.90941226 -0.68928096
## 4 -1.23262197 -0.20803778 -0.108300452 0.005273597 -0.19032052 -1.17557533
## 5 0.80348692 0.40854236 -0.009430697 0.798278495 -0.13745808 0.14126698
## 6 -0.03319379 0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
## V25 V26 V27 V28 Amount class
## 1 0.1285394 -0.1891148 0.133558377 -0.02105305 149.62 0
## 2 0.1671704 0.1258945 -0.008983099 0.01472417 2.69 0
## 3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66 0
## 4 0.6473760 -0.2219288 0.062722849 0.06145763 123.50 0
## 5 -0.2060096 0.5022922 0.219422230 0.21515315 69.99 0
## 6 -0.2327938 0.1059148 0.253844225 0.08108026 3.67 0
# Get the count for each class
# (dplyr::count masks plyr::count here, so the column is passed unquoted)
count(newdf, class)
##   class      n
## 1     0 284315
## 2     1    492
Data Partition
set.seed(111)
# Randomly assign roughly 80% of the rows to train and 20% to test
ind <- sample(2, nrow(newdf), replace = TRUE, prob = c(0.8, 0.2))
train <- newdf[ind == 1, ]
test <- newdf[ind == 2, ]
# Preview the training set, then get the count and proportion of each class
head(train)
## Time V1 V2 V3 V4 V5 V6
## 1 0 -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778
## 2 0 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081
## 3 1 -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938
## 4 1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317
## 5 2 -1.1582331 0.87773676 1.5487178 0.4030339 -0.40719338 0.09592146
## 6 2 -0.4259659 0.96052304 1.1411093 -0.1682521 0.42098688 -0.02972755
## V7 V8 V9 V10 V11 V12
## 1 0.23959855 0.09869790 0.3637870 0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298 0.08510165 -0.2554251 -0.16697441 1.6127267 1.06523531
## 3 0.79146096 0.24767579 -1.5146543 0.20764287 0.6245015 0.06608369
## 4 0.23760894 0.37743587 -1.3870241 -0.05495192 -0.2264873 0.17822823
## 5 0.59294075 -0.27053268 0.8177393 0.75307443 -0.8228429 0.53819555
## 6 0.47620095 0.26031433 -0.5686714 -0.37140720 1.3412620 0.35989384
## V13 V14 V15 V16 V17 V18
## 1 -0.9913898 -0.3111694 1.4681770 -0.4704005 0.20797124 0.02579058
## 2 0.4890950 -0.1437723 0.6355581 0.4639170 -0.11480466 -0.18336127
## 3 0.7172927 -0.1659459 2.3458649 -2.8900832 1.10996938 -0.12135931
## 4 0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279 1.96577500
## 5 1.3458516 -1.1196698 0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337 0.5176168 0.4017259 -0.05813282 0.06865315
## V19 V20 V21 V22 V23 V24
## 1 0.40399296 0.25141210 -0.018306778 0.277837576 -0.11047391 0.06692808
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953 0.10128802 -0.33984648
## 3 -2.26185709 0.52497973 0.247998153 0.771679402 0.90941226 -0.68928096
## 4 -1.23262197 -0.20803778 -0.108300452 0.005273597 -0.19032052 -1.17557533
## 5 0.80348692 0.40854236 -0.009430697 0.798278495 -0.13745808 0.14126698
## 6 -0.03319379 0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
## V25 V26 V27 V28 Amount class
## 1 0.1285394 -0.1891148 0.133558377 -0.02105305 149.62 0
## 2 0.1671704 0.1258945 -0.008983099 0.01472417 2.69 0
## 3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66 0
## 4 0.6473760 -0.2219288 0.062722849 0.06145763 123.50 0
## 5 -0.2060096 0.5022922 0.219422230 0.21515315 69.99 0
## 6 -0.2327938 0.1059148 0.253844225 0.08108026 3.67 0
count(train, class)
##   class      n
## 1     0 227346
## 2     1    388
prop.table(table(train$class))
##
## 0 1
## 0.998296258 0.001703742
prop.table(table(test$class))
##
## 0 1
## 0.998177772 0.001822228
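The random assignment above leaves slightly different fraud rates in the two partitions (about 0.170% in train and 0.182% in test). One way to keep the class shares closer is to sample within each class. The sketch below does this in base R on made-up toy data (`toy`, `train_toy`, and `test_toy` are illustrative names, not objects from this study). The caret package loaded earlier provides `createDataPartition()` for the same purpose.

```r
# Toy data: 980 valid rows (class 0) and 20 fraud rows (class 1).
set.seed(111)
toy <- data.frame(
  x     = rnorm(1000),
  class = factor(c(rep(0, 980), rep(1, 20)))
)

# Take 80% of the row indices separately within each class.
rows_by_class <- split(seq_len(nrow(toy)), toy$class)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

prop.table(table(train_toy$class))
prop.table(table(test_toy$class))
```

With this toy data both partitions keep exactly 2% of rows in class 1 (16 of 800 in train, 4 of 200 in test).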
SMOTE implementation
## Smote : Synthetic Minority Oversampling Technique To Handle Class Imbalance In Binary Classification
SMOTed <- SMOTE(train[,-31],train$class, K=5)
SMOTed <- SMOTed$data # extract only the balanced dataset
SMOTed$class <- as.factor(SMOTed$class)
head(SMOTed)
## Time V1 V2 V3 V4 V5 V6
## 1 141565 0.1149650 0.76676155 -0.4941322 0.116772 0.8681685 -0.4779820
## 2 93834 -3.7656801 5.89073524 -10.2022676 10.259036 -5.6114484 -3.2353756
## 3 93853 -5.8391916 7.15153235 -12.8167601 7.031115 -9.6512722 -2.9384273
## 4 67571 -0.7584687 -0.04541028 -0.1684383 -1.313275 -1.9017625 0.7394334
## 5 160895 -0.8482902 2.71988212 -6.1990702 3.044437 -3.3019096 -1.9921168
## 6 29785 0.9237644 0.34404807 -2.8800037 1.721680 -3.0195648 -0.6397361
## V7 V8 V9 V10 V11 V12
## 1 0.4384957 0.06307334 -0.1862071 -0.1593251 1.200304 0.281744447
## 2 -10.6326835 3.27271633 -5.2689052 -11.1821254 8.879476 -18.431131030
## 3 -11.5432072 4.84362653 -3.4942757 -13.3207889 8.460244 -17.003289450
## 4 3.0718920 -0.48342224 0.6182028 -1.7690597 -0.651414 -0.005423291
## 5 -3.7349018 1.52007897 -2.5487882 -4.5335154 2.288022 -5.267204670
## 6 -3.8013245 1.29909589 0.8640655 -2.8952517 3.028162 -2.549177308
## V13 V14 V15 V16 V17 V18
## 1 -0.6238440 -0.6582463 -0.1558884 0.05622735 0.6536621 0.3346553
## 2 -0.2328220 -15.0216573 0.1411863 -12.18636250 -20.1655674 -7.0516514
## 3 0.1015566 -14.0944517 0.7470308 -12.66169572 -18.9124938 -6.6269748
## 4 -0.5171943 0.2174698 0.8835588 -1.17397765 0.2433471 -0.3423007
## 5 0.3948000 -4.2879958 1.3152795 -6.46918689 -8.7139201 -3.7050697
## 6 -1.5604318 -2.9713168 1.0788954 -4.70201182 -4.9080985 -1.5088733
## V19 V20 V21 V22 V23 V24
## 1 1.0289268 0.06219857 -0.28441290 -0.7068655 0.13140486 0.6007421
## 2 2.5008272 1.19413732 2.24560593 0.5463207 0.38185343 0.3820245
## 3 4.0089207 0.05568388 2.46205591 1.0548652 0.53048060 0.4726698
## 4 0.6870562 -0.03249990 0.04261946 0.3972238 0.07222887 -0.2422760
## 5 3.5310029 0.31957627 1.12522926 0.8052579 0.19911919 0.0352062
## 6 3.0016851 0.17087177 0.89993118 1.4812710 0.72526555 0.1769596
## V25 V26 V27 V28 Amount class
## 1 -0.60426428 0.2629379 0.09914463 0.01080972 4.49 1
## 2 -0.82103649 0.3943551 1.41296099 0.78240705 0.01 1
## 3 -0.27599797 0.2824350 0.10488602 0.25441710 316.06 1
## 4 0.56091647 -0.5409546 0.15060645 -0.11714012 549.06 1
## 5 0.01215876 0.6016578 0.13746768 -0.17139698 127.14 1
## 6 -1.81563798 -0.5365171 0.48903502 -0.04972858 30.30 1
# Get the count and proportion of each class
prop.table(table(SMOTed$class))
##
## 0 1
## 0.5004028 0.4995972
(table(SMOTed$class))
##
## 0 1
## 227346 226980
ADASYN implementation
#ADASYN Balanced
ADASed <- ADAS(train[,-31],train$class,K = 5)
ADASed <- ADASed$data # extract only the balanced dataset
ADASed$class <- as.factor(ADASed$class)
# Get the count and proportion of each class
prop.table(table(ADASed$class))
##
## 0 1
## 0.4998549 0.5001451
(table(ADASed$class))
##
## 0 1
## 227346 227478
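The adaptive weighting that distinguishes ADASYN from plain SMOTE can be illustrated with a small base R sketch. The helper `adasyn_weights()` is hypothetical and is not how `ADAS()` is implemented in smotefamily. It assumes numeric features, Euclidean distance, and that a minority point's "difficulty" is the share of majority-class rows among its k nearest neighbours.

```r
# Hypothetical helper: one weight per minority row, proportional to the share of
# majority-class rows among its k nearest neighbours; weights sum to 1.
adasyn_weights <- function(X, y, minority = "1", k = 5) {
  X <- as.matrix(X)
  d <- as.matrix(dist(X))  # pairwise Euclidean distances between all rows
  diag(d) <- Inf           # exclude each point from its own neighbourhood
  min_idx <- which(y == minority)
  ratio <- vapply(min_idx, function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    mean(y[nn] != minority)
  }, numeric(1))
  if (sum(ratio) == 0) {
    rep(1 / length(ratio), length(ratio))  # no hard points: fall back to equal weights
  } else {
    ratio / sum(ratio)
  }
}

# Toy 1-D data: one minority point sits inside the majority cluster (x = 10.5),
# the other four form their own cluster far away (x = 100..103).
x <- c(10.5, 100, 101, 102, 103, 1:20)
y <- factor(c(rep("1", 5), rep("0", 20)))
w <- adasyn_weights(x, y, minority = "1", k = 3)
w
# 1 0 0 0 0
```

All of the weight goes to the minority point surrounded by majority rows, so a generator driven by these weights would create synthetic observations only around that hard-to-learn point.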
Density-Based SMOTE implementation
#Density based SMOTE
DBSMOTed <- DBSMOTE(train[,-31],train$class)
## [1] "DBSMOTE is Done"
DBSMOTed <- DBSMOTed$data # extract only the balanced dataset
DBSMOTed$class <- as.factor(DBSMOTed$class)
head(DBSMOTed)
## Time V1 V2 V3 V4 V5 V6
## 1 8757 -1.8637556 3.442644 -4.4682597 2.805336 -2.1184125 -2.332285
## 2 84694 -4.8681084 1.264420 -5.1678854 3.193648 -3.0456214 -2.096166
## 3 87883 -1.3602926 -0.458069 -0.7004039 2.737229 -1.0051060 2.891399
## 4 79540 -0.1143607 1.036129 1.9844053 3.128243 -0.7403436 1.548619
## 5 25254 -17.2751912 10.819665 -20.3638860 6.046612 -13.4650334 -4.166647
## 6 110547 -1.5328104 2.232752 -5.9231001 3.386708 -0.1534433 -1.419748
## V7 V8 V9 V10 V11 V12 V13
## 1 -4.261237 1.701682 -1.439396 -6.9999066 6.3162097 -8.6708180 0.3160240
## 2 -6.445610 2.422536 -3.214055 -8.7459726 5.4160419 -8.1641251 -0.1650106
## 3 5.802537 -1.933197 -1.017717 1.9878619 0.5041161 -0.8634309 -0.1844501
## 4 -1.701284 -2.203842 -1.242265 0.2695618 1.2934183 0.9332158 -0.1353260
## 5 -14.409448 11.580797 -4.073856 -9.1533680 6.2108830 -8.7785720 -0.0613675
## 6 -3.878576 1.444656 -1.465542 -5.2083346 4.5463011 -7.7611940 1.1595403
## V14 V15 V16 V17 V18 V19
## 1 -7.4177121 -0.43653747 -3.65280196 -6.2931453 -1.2432483 0.3648105
## 2 -10.1935304 -1.89521031 -7.36047461 -14.6687711 -4.8771190 1.3856095
## 3 -1.0169155 -1.55941020 1.15431269 -2.0438577 -0.1516988 -0.9435135
## 4 0.5214837 0.38688419 0.05986895 0.3063394 0.2650520 0.2237183
## 5 -9.5746623 0.04928854 -7.41848728 -14.1027719 -5.0164233 1.3903144
## 6 -5.2316115 -0.17164212 -4.71982950 -4.8479920 -1.1343287 3.5277383
## V20 V21 V22 V23 V24 V25
## 1 0.3609240 0.6679266 -0.51624236 -0.01221781 0.0706137 0.05850447
## 2 0.6673098 1.2692054 0.05765725 0.62930740 -0.1684318 0.44374390
## 3 -1.4934014 -0.9369895 -0.05381159 0.58010589 0.2169273 0.15164293
## 4 0.7328525 -1.0329347 1.19642831 -0.11285672 0.2547190 0.69666789
## 5 1.5449705 1.7298041 -1.20809608 -0.72683922 0.1125397 1.11919347
## 6 0.5208399 0.6325047 -0.07083795 -0.49029132 -0.3599831 0.05067753
## V26 V27 V28 Amount class
## 1 0.3048828 0.4180125 0.2088583 1.00 1
## 2 0.2765395 1.4412740 -0.1279437 12.31 1
## 3 -0.3321146 -0.4698003 -1.4950059 829.41 1
## 4 0.4823704 0.1299693 0.2239243 0.20 1
## 5 -0.2331890 1.6840630 0.5037397 99.99 1
## 6 1.0956713 0.4717414 -0.1066671 0.76 1
# Get the count and proportion of each class
prop.table(table(DBSMOTed$class))
##
## 0 1
## 0.5187551 0.4812449
(table(DBSMOTed$class))
##
## 0 1
## 227346 210907
Logistic regression Model Implementation
model_base_glm <- glm(class~., data=train, family=binomial)
model_SMOTed_glm <- glm(class~., data=SMOTed, family=binomial)
model_ADASed_glm <- glm(class~., data=ADASed, family=binomial)
model_DBSMOTed_glm <- glm(class~., data=DBSMOTed, family=binomial)
Predictions
#For Logistic regression
predict_base <- predict(model_base_glm, test, type='response')
predict_SMOTed <- predict(model_SMOTed_glm, test, type='response')
predict_ADASed <- predict(model_ADASed_glm, test, type='response')
predict_DBSMOTed <- predict(model_DBSMOTed_glm, test, type='response')
Confusion Matrix
#Logistic Regression Model
#Base Model
(base_t<-table(as.factor(test$class), predict_base>0.5))
##
## FALSE TRUE
## 0 56961 8
## 1 42 62
#SMOTed Model
(SMOTed_t<-table(test$class, predict_SMOTed>0.5))
##
## FALSE TRUE
## 0 56554 415
## 1 16 88
#ADASed Model
(ADASed_t <-table(test$class, predict_ADASed>0.5))
##
## FALSE TRUE
## 0 56529 440
## 1 15 89
#DBSMOTed Model
(DBSMOTed_t <- table(test$class, predict_DBSMOTed>0.5))
##
## FALSE TRUE
## 0 56630 339
## 1 18 86
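Precision, recall, and F1 are all read off the same four cells of a confusion matrix, so it can help to compute them in one place. The helper `confusion_metrics()` below is a hypothetical convenience function, not part of any package used here. It assumes the layout of the tables above: rows are the actual classes (0, 1) and columns are the predictions (FALSE, TRUE).

```r
# Hypothetical helper: precision, recall and F1 for the positive class (row 2, column 2).
confusion_metrics <- function(tab) {
  tp <- tab[2, 2]  # fraud predicted as fraud
  fp <- tab[1, 2]  # valid predicted as fraud
  fn <- tab[2, 1]  # fraud predicted as valid
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Base-model confusion matrix, copied from the output above (filled column by column).
base_tab <- matrix(c(56961, 42, 8, 62), nrow = 2,
                   dimnames = list(actual = c("0", "1"),
                                   predicted = c("FALSE", "TRUE")))
round(confusion_metrics(base_tab), 4)
# precision    recall        f1
#    0.8857    0.5962    0.7126
```

Passing each model's table through the same function avoids mixing cells from different confusion matrices when the formulas are typed out by hand.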
Score Comparisons
Score_base <- sum(diag(base_t))/sum(base_t)
Score_SMOTed <- sum(diag(SMOTed_t))/sum(SMOTed_t)
Score_ADASed <- sum(diag(ADASed_t))/sum(ADASed_t)
Score_DBSMOTed <- sum(diag(DBSMOTed_t))/sum(DBSMOTed_t)
#Logistic Regression
# Precision = TP / (TP + FP)
(precision_base <- base_t[2,2]/(base_t[2,2]+base_t[1,2]))
## [1] 0.8857143
precision_SMOTed <- SMOTed_t[2,2]/(SMOTed_t[2,2]+SMOTed_t[1,2])
precision_ADASed <- ADASed_t[2,2]/(ADASed_t[2,2]+ADASed_t[1,2])
(precision_DBSMOTed <- DBSMOTed_t[2,2]/(DBSMOTed_t[2,2]+DBSMOTed_t[1,2]))
## [1] 0.2023529
# Recall = TP / (TP + FN)
(recall_base <- base_t[2,2] / (base_t[2,2]+base_t[2,1]))
## [1] 0.5961538
recall_SMOTed <- SMOTed_t[2,2] / (SMOTed_t[2,2]+SMOTed_t[2,1])
recall_ADASed <- ADASed_t[2,2] / (ADASed_t[2,2]+ADASed_t[2,1])
(recall_DBSMOTed <- DBSMOTed_t[2,2] / (DBSMOTed_t[2,2]+DBSMOTed_t[2,1]))
## [1] 0.8269231
Dummy Classifier
# Accuracy of always predicting the majority class (valid)
(table_df <- table(test$class))
##
## 0 1
## 56969 104
table_df[1]/(table_df[1]+table_df[2])
## 0
## 0.9981778
(F1Score_base <- (2*precision_base*recall_base)/ (precision_base+recall_base))
## [1] 0.7126437
(F1Score_SMOTed <- (2*precision_SMOTed*recall_SMOTed)/ (precision_SMOTed+recall_SMOTed))
## [1] 0.2899506
(F1Score_ADASed <- (2*precision_ADASed*recall_ADASed)/ (precision_ADASed+recall_ADASed))
## [1] 0.2812006
(F1Score_DBSMOTed <- (2*precision_DBSMOTed*recall_DBSMOTed)/ (precision_DBSMOTed+recall_DBSMOTed))
## [1] 0.3251418
#Compare the F1 of the models: 2*((Precision*Recall) / (Precision + Recall))
model_compare_f1 <- data.frame(Model = c('F1Score_base_GLM',
                                         'F1Score_SMOTed_GLM',
                                         'F1Score_ADASed_GLM',
                                         'F1Score_DBSMOTed_GLM'),
                               F1 = c(F1Score_base,
                                      F1Score_SMOTed,
                                      F1Score_ADASed,
                                      F1Score_DBSMOTed))
ggplot(aes(x=reorder(Model,-F1),y = F1),data = model_compare_f1) +
  geom_bar(stat = 'identity',fill = 'light blue') +
  ggtitle('F1 Score Comparison for SMOTE Approach') +
  xlab('Models') +
  ylab('F1 Measure')+
  geom_text(aes(label = round(F1,2))) + theme_bw() +
  theme(axis.text.x = element_text(angle = 40))
We can notice that, measured by F1, the base logistic regression model trained on the imbalanced data scores highest (0.71). Among the resampled models, Density-based SMOTE returns the best F1 (0.33), ahead of SMOTE (0.29) and ADASYN (0.28). All three oversampling techniques raise recall from 0.60 to between 0.83 and 0.86, but precision falls from 0.89 to between 0.17 and 0.20, because many more valid transactions are flagged as fraudulent at the 0.5 threshold.
References:
1. Colorado State University, MAS Program, Data Mining Course Notes
2. https://whatis.techtarget.com/definition/over-sampling-and-under-sampling
3. https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data