CREDIT CARD FRAUD
A-INTRODUCTION
The dataset under study contains transactions made with credit cards in September 2013 by European cardholders.
It contains only numerical input variables, which are the result of a PCA transformation.
For confidentiality reasons the original features and data cannot be disclosed; the transformed variables run from V1 through V28.
The only features that have not been transformed are Time and Amount (the transaction amount); the Time feature contains the seconds elapsed between each transaction and the first transaction in the dataset.
The 'Class' feature is the response variable: it takes the value 1 in case of fraud and 0 otherwise.
B-PURPOSE OF THE STUDY.
From the candidates selected below, we look for a model that predicts whether the transactions carried out are fraudulent or not.
0. DATA LOADING AND STRUCTURE
We load the libraries and our data.
library(data.table)
library(tidymodels)
library(tidyverse)
library(ggplot2)
library(scales)
library(ROCR)
library(themis)
library(doParallel)
library(kableExtra)
options(scipen=999) # disable scientific notation
dat <- fread('creditcard.csv')
We inspect the structure of our data.
str(dat)
## Classes 'data.table' and 'data.frame': 284807 obs. of 31 variables:
## $ Time : num 0 0 1 1 2 2 4 7 7 9 ...
## $ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
## $ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
## $ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
## $ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
## $ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
## $ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
## $ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
## $ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
## $ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
## $ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
## $ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
## $ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
## $ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
## $ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
## $ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
## $ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
## $ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
## $ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
## $ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
## $ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
## $ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
## $ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
## $ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
## $ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
## $ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
## $ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
## $ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
## $ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
## $ Amount: num 149.62 2.69 378.66 123.5 69.99 ...
## $ Class : int 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(dat)
## Time V1 V2 V3
## Min. : 0 Min. :-56.40751 Min. :-72.71573 Min. :-48.3256
## 1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855 1st Qu.: -0.8904
## Median : 84692 Median : 0.01811 Median : 0.06549 Median : 0.1799
## Mean : 94814 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:139321 3rd Qu.: 1.31564 3rd Qu.: 0.80372 3rd Qu.: 1.0272
## Max. :172792 Max. : 2.45493 Max. : 22.05773 Max. : 9.3826
## V4 V5 V6 V7
## Min. :-5.68317 Min. :-113.74331 Min. :-26.1605 Min. :-43.5572
## 1st Qu.:-0.84864 1st Qu.: -0.69160 1st Qu.: -0.7683 1st Qu.: -0.5541
## Median :-0.01985 Median : -0.05434 Median : -0.2742 Median : 0.0401
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.74334 3rd Qu.: 0.61193 3rd Qu.: 0.3986 3rd Qu.: 0.5704
## Max. :16.87534 Max. : 34.80167 Max. : 73.3016 Max. :120.5895
## V8 V9 V10 V11
## Min. :-73.21672 Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.20863 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : 0.02236 Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.32735 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 20.00721 Max. : 15.59500 Max. : 23.74514 Max. :12.01891
## V12 V13 V14 V15
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143 Min. :-4.49894
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256 1st Qu.:-0.58288
## Median : 0.1400 Median :-0.01357 Median : 0.0506 Median : 0.04807
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931 3rd Qu.: 0.64882
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268 Max. : 8.87774
## V16 V17 V18
## Min. :-14.12985 Min. :-25.16280 Min. :-9.498746
## 1st Qu.: -0.46804 1st Qu.: -0.48375 1st Qu.:-0.498850
## Median : 0.06641 Median : -0.06568 Median :-0.003636
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.52330 3rd Qu.: 0.39968 3rd Qu.: 0.500807
## Max. : 17.31511 Max. : 9.25353 Max. : 5.041069
## V19 V20 V21
## Min. :-7.213527 Min. :-54.49772 Min. :-34.83038
## 1st Qu.:-0.456299 1st Qu.: -0.21172 1st Qu.: -0.22839
## Median : 0.003735 Median : -0.06248 Median : -0.02945
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.458949 3rd Qu.: 0.13304 3rd Qu.: 0.18638
## Max. : 5.591971 Max. : 39.42090 Max. : 27.20284
## V22 V23 V24
## Min. :-10.933144 Min. :-44.80774 Min. :-2.83663
## 1st Qu.: -0.542350 1st Qu.: -0.16185 1st Qu.:-0.35459
## Median : 0.006782 Median : -0.01119 Median : 0.04098
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.528554 3rd Qu.: 0.14764 3rd Qu.: 0.43953
## Max. : 10.503090 Max. : 22.52841 Max. : 4.58455
## V25 V26 V27
## Min. :-10.29540 Min. :-2.60455 Min. :-22.565679
## 1st Qu.: -0.31715 1st Qu.:-0.32698 1st Qu.: -0.070840
## Median : 0.01659 Median :-0.05214 Median : 0.001342
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.35072 3rd Qu.: 0.24095 3rd Qu.: 0.091045
## Max. : 7.51959 Max. : 3.51735 Max. : 31.612198
## V28 Amount Class
## Min. :-15.43008 Min. : 0.00 Min. :0.000000
## 1st Qu.: -0.05296 1st Qu.: 5.60 1st Qu.:0.000000
## Median : 0.01124 Median : 22.00 Median :0.000000
## Mean : 0.00000 Mean : 88.35 Mean :0.001728
## 3rd Qu.: 0.07828 3rd Qu.: 77.17 3rd Qu.:0.000000
## Max. : 33.84781 Max. :25691.16 Max. :1.000000
glimpse(dat)
## Rows: 284,807
## Columns: 31
## $ Time <dbl> 0, 0, 1, 1, 2, 2, 4, 7, 7, 9, 10, 10, 10, 11, 12, 12, 12, 13, 1~
## $ V1 <dbl> -1.3598071, 1.1918571, -1.3583541, -0.9662717, -1.1582331, -0.4~
## $ V2 <dbl> -0.07278117, 0.26615071, -1.34016307, -0.18522601, 0.87773675, ~
## $ V3 <dbl> 2.53634674, 0.16648011, 1.77320934, 1.79299334, 1.54871785, 1.1~
## $ V4 <dbl> 1.37815522, 0.44815408, 0.37977959, -0.86329128, 0.40303393, -0~
## $ V5 <dbl> -0.33832077, 0.06001765, -0.50319813, -0.01030888, -0.40719338,~
## $ V6 <dbl> 0.46238778, -0.08236081, 1.80049938, 1.24720317, 0.09592146, -0~
## $ V7 <dbl> 0.239598554, -0.078802983, 0.791460956, 0.237608940, 0.59294074~
## $ V8 <dbl> 0.098697901, 0.085101655, 0.247675787, 0.377435875, -0.27053267~
## $ V9 <dbl> 0.3637870, -0.2554251, -1.5146543, -1.3870241, 0.8177393, -0.56~
## $ V10 <dbl> 0.09079417, -0.16697441, 0.20764287, -0.05495192, 0.75307443, -~
## $ V11 <dbl> -0.55159953, 1.61272666, 0.62450146, -0.22648726, -0.82284288, ~
## $ V12 <dbl> -0.61780086, 1.06523531, 0.06608369, 0.17822823, 0.53819555, 0.~
## $ V13 <dbl> -0.99138985, 0.48909502, 0.71729273, 0.50775687, 1.34585159, -0~
## $ V14 <dbl> -0.31116935, -0.14377230, -0.16594592, -0.28792375, -1.11966983~
## $ V15 <dbl> 1.468176972, 0.635558093, 2.345864949, -0.631418118, 0.17512113~
## $ V16 <dbl> -0.47040053, 0.46391704, -2.89008319, -1.05964725, -0.45144918,~
## $ V17 <dbl> 0.207971242, -0.114804663, 1.109969379, -0.684092786, -0.237033~
## $ V18 <dbl> 0.02579058, -0.18336127, -0.12135931, 1.96577500, -0.03819479, ~
## $ V19 <dbl> 0.40399296, -0.14578304, -2.26185710, -1.23262197, 0.80348692, ~
## $ V20 <dbl> 0.25141210, -0.06908314, 0.52497973, -0.20803778, 0.40854236, 0~
## $ V21 <dbl> -0.018306778, -0.225775248, 0.247998153, -0.108300452, -0.00943~
## $ V22 <dbl> 0.277837576, -0.638671953, 0.771679402, 0.005273597, 0.79827849~
## $ V23 <dbl> -0.110473910, 0.101288021, 0.909412262, -0.190320519, -0.137458~
## $ V24 <dbl> 0.06692807, -0.33984648, -0.68928096, -1.17557533, 0.14126698, ~
## $ V25 <dbl> 0.12853936, 0.16717040, -0.32764183, 0.64737603, -0.20600959, -~
## $ V26 <dbl> -0.18911484, 0.12589453, -0.13909657, -0.22192884, 0.50229222, ~
## $ V27 <dbl> 0.133558377, -0.008983099, -0.055352794, 0.062722849, 0.2194222~
## $ V28 <dbl> -0.021053053, 0.014724169, -0.059751841, 0.061457629, 0.2151531~
## $ Amount <dbl> 149.62, 2.69, 378.66, 123.50, 69.99, 3.67, 4.99, 40.80, 93.20, ~
## $ Class <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
1. DATA QUALITY
We check whether there are NA values in the columns.
colSums(is.na(dat))
## Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 0 0 0 0 0 0 0 0 0 0 0
## V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 0 0 0 0 0 0 0 0 0 0 0
## V22 V23 V24 V25 V26 V27 V28 Amount Class
## 0 0 0 0 0 0 0 0 0
and we see that there are no NA values in any of them.
We check the summary statistics of our variables to look for anything irregular, or for variables with scale differences worth noting so we can rescale them later if needed.
lapply(dat, summary)
## $Time
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 54202 84692 94814 139321 172792
##
## $V1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -56.40751 -0.92037 0.01811 0.00000 1.31564 2.45493
##
## $V2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -72.71573 -0.59855 0.06549 0.00000 0.80372 22.05773
##
## $V3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -48.3256 -0.8904 0.1799 0.0000 1.0272 9.3826
##
## $V4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.68317 -0.84864 -0.01985 0.00000 0.74334 16.87534
##
## $V5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -113.74331 -0.69160 -0.05434 0.00000 0.61193 34.80167
##
## $V6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -26.1605 -0.7683 -0.2742 0.0000 0.3986 73.3016
##
## $V7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -43.5572 -0.5541 0.0401 0.0000 0.5704 120.5895
##
## $V8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -73.21672 -0.20863 0.02236 0.00000 0.32735 20.00721
##
## $V9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -13.43407 -0.64310 -0.05143 0.00000 0.59714 15.59500
##
## $V10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -24.58826 -0.53543 -0.09292 0.00000 0.45392 23.74514
##
## $V11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.79747 -0.76249 -0.03276 0.00000 0.73959 12.01891
##
## $V12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -18.6837 -0.4056 0.1400 0.0000 0.6182 7.8484
##
## $V13
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.79188 -0.64854 -0.01357 0.00000 0.66251 7.12688
##
## $V14
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -19.2143 -0.4256 0.0506 0.0000 0.4931 10.5268
##
## $V15
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.49894 -0.58288 0.04807 0.00000 0.64882 8.87774
##
## $V16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -14.12985 -0.46804 0.06641 0.00000 0.52330 17.31511
##
## $V17
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -25.16280 -0.48375 -0.06568 0.00000 0.39968 9.25353
##
## $V18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.498746 -0.498850 -0.003636 0.000000 0.500807 5.041069
##
## $V19
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.213527 -0.456299 0.003735 0.000000 0.458949 5.591971
##
## $V20
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -54.49772 -0.21172 -0.06248 0.00000 0.13304 39.42090
##
## $V21
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -34.83038 -0.22839 -0.02945 0.00000 0.18638 27.20284
##
## $V22
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.933144 -0.542350 0.006782 0.000000 0.528554 10.503090
##
## $V23
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -44.80774 -0.16185 -0.01119 0.00000 0.14764 22.52841
##
## $V24
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.83663 -0.35459 0.04098 0.00000 0.43953 4.58455
##
## $V25
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.29540 -0.31715 0.01659 0.00000 0.35072 7.51959
##
## $V26
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.60455 -0.32698 -0.05214 0.00000 0.24095 3.51735
##
## $V27
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -22.565679 -0.070840 0.001342 0.000000 0.091045 31.612198
##
## $V28
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -15.43008 -0.05296 0.01124 0.00000 0.07828 33.84781
##
## $Amount
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.60 22.00 88.35 77.17 25691.16
##
## $Class
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.001728 0.000000 1.000000
We convert our target variable to a factor and look at its frequency table.
dat$Class<-as.factor(dat$Class)
table(dat$Class)
##
## 0 1
## 284315 492
As we can see, the difference between the classes is very large.
We now look at the proportion of one class relative to the other.
prop.table(table(dat$Class))
##
## 0 1
## 0.998272514 0.001727486
The following chart shows, in percentage terms, the class imbalance between non-fraudulent and fraudulent transactions; the difference is glaring and the data are clearly unbalanced.
ggplot(data = dat, aes(x = factor(Class),
y = prop.table(stat(count)), fill = factor(Class),
label = scales::percent_format(accuracy=0.01)(prop.table(stat(count))))) +
geom_bar(position = "dodge") +
geom_text(stat = 'count',
position = position_dodge(0.9),
vjust = -0.5,
size = 3.5) + coord_flip()+
scale_y_continuous(labels = scales::percent_format(accuracy=0.01))+labs(
    x ='Classes', y ='Percentage of transactions') + ggtitle('CLASS DISTRIBUTION')
As we can see, the positive class (fraud) is the minority class, and it is the one of greatest interest for determining our results.
Likewise, given the difference in counts between the two classes, the dataset is not balanced: there is a substantial gap between them, with the majority class far more heavily represented than the other.
In classification terms this problem can lead to results biased against the minority class, which usually contains the cases of greatest interest for our study.
To address it, resampling techniques can be used to balance the class frequencies, such as upsampling or downsampling, as well as methods like SMOTE and ROSE.
For this particular case we are going to use downsampling.
Downsampling keeps all the cases of the minority class and randomly draws a sample of the same size from the majority class.
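For reference, a minimal sketch of how SMOTE could be specified instead, using the themis package already loaded above (assuming the same train split created in section 3.1); step_smote() synthesizes new minority-class rows rather than discarding majority-class ones:
rcp_smote <- recipe(Class ~ ., data = train) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>% # SMOTE works on numeric predictors
  step_smote(Class, over_ratio = 1, neighbors = 5) # the defaults, shown explicitly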
2. DISTRIBUTION OF THE CONTINUOUS VARIABLES
2.1 Distribution of the data by time and class.
HISTOGRAM
ggplot(dat, aes(Time, fill=factor(Class))) +
  geom_histogram(bins=100)+labs(
    x = 'Seconds elapsed since the first transaction',
    y = 'Number of transactions') + ggtitle('DISTRIBUTION OF TRANSACTIONS BY TIME AND CLASS')+
  facet_wrap(~Class, scales = 'free_y', nrow = 2)
The non-fraudulent transactions show a bimodal structure, while the fraudulent ones show a few peaks that stand out from the rest of the distribution.
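Note that Time is recorded in raw seconds over roughly a two-day span (maximum 172792 s). As a sketch of an alternative view, the histogram can be folded onto a 24-hour clock; the derived hour-of-day variable below is our own addition, not part of the original analysis:
ggplot(dat, aes((Time %/% 3600) %% 24, fill = factor(Class))) +
  geom_histogram(bins = 24) + # one bin per hour of day
  labs(x = 'Hour of day (derived from Time)', y = 'Number of transactions') +
  facet_wrap(~Class, scales = 'free_y', nrow = 2)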
2.2 Distribution of the data by amount and class.
BOXPLOT
ggplot(dat, aes(factor(Class),Amount)) +
  geom_boxplot()+labs(
    x = 'Class',
    y = 'Amount')+ ggtitle('DISTRIBUTION OF THE DATA BY AMOUNT AND CLASS')
The chart shows greater variability in transaction amounts for the non-fraudulent transactions.
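Since Amount is heavily right-skewed (median 22 vs. a maximum of 25691.16), a pseudo-log axis, which unlike a plain log scale tolerates the zero amounts, can make the boxplots easier to compare; a minimal sketch:
ggplot(dat, aes(factor(Class), Amount)) +
  geom_boxplot() +
  scale_y_continuous(trans = scales::pseudo_log_trans()) + # log-like axis that handles zeros
  labs(x = 'Class', y = 'Amount (pseudo-log scale)')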
3. DATA MODELING.
FUNCTIONS WE WILL APPLY TO THE MODELS
Function to build a confusion matrix
confusion<-function(real,scoring,umbral){
  # cross-tabulate the actual classes against the scores dichotomized at the threshold
  conf<-table(real,scoring>=umbral)
  # return the table only if both predicted columns (FALSE/TRUE) are present
  if(ncol(conf)==2) return(conf) else return(NULL)
}
Function to compute the model metrics: accuracy, precision, recall (coverage) and F1
metricas<-function(matriz_conf){
  # accuracy: correct predictions over all cases
  acierto <- (matriz_conf[1,1] + matriz_conf[2,2]) / sum(matriz_conf) *100
  # precision: true positives over predicted positives
  precision <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[1,2]) *100
  # recall (coverage): true positives over actual positives
  cobertura <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[2,1]) *100
  # F1: harmonic mean of precision and recall
  F1 <- 2*precision*cobertura/(precision+cobertura)
  salida<-c(acierto,precision,cobertura,F1)
  return(salida)
}
Function to try different thresholds and see the effect on precision and recall
umbrales<-function(real,scoring){
  # one row per threshold, from 0.05 to 0.95 in steps of 0.05
  umbrales<-data.frame(umbral=rep(0,times=19), acierto=rep(0,times=19),
                       precision=rep(0,times=19), cobertura=rep(0,times=19),
                       F1=rep(0,times=19))
  cont <- 1
  for (cada in seq(0.05,0.95,by = 0.05)){
    # metrics at this threshold
    datos<-metricas(confusion(real,scoring,cada))
    registro<-c(cada,datos)
    umbrales[cont,]<-registro
    cont <- cont + 1
  }
  return(umbrales)
}
Functions to compute the ROC curve and the AUC
roc<-function(prediction){
  # ROC curve: true positive rate vs false positive rate (ROCR)
  r<-performance(prediction,'tpr','fpr')
  plot(r)
}
auc<-function(prediction){
  # area under the ROC curve (ROCR)
  a<-performance(prediction,'auc')
  return(a@y.values[[1]])
}
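As a quick sanity check of these helpers before fitting any model, they can be exercised on invented vectors (the values here are purely illustrative):
set.seed(123)
toy_real <- rbinom(100, 1, 0.3) # invented labels, 1 = positive class
toy_score <- runif(100) # invented scores in [0, 1]
metricas(confusion(toy_real, toy_score, 0.5)) # accuracy, precision, recall, F1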
SPLITTING THE DATA INTO TRAINING AND TEST SETS
We create the training and test partitions, 70% and 30% respectively.
We will apply 10-fold cross-validation (vfold_cv with v = 10) to all the models.
3.1 Logistic Regression
set.seed(45)
split_inicial <- initial_split(
data = dat,
prop = 0.7,
strata = Class
)
train <- training(split_inicial)
test <- testing(split_inicial)
# Balance the classes and preprocess the data
mod_rcp <- recipe(Class~.,
data = train)%>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_downsample(Class)
# extract the now balanced and preprocessed subset of data
downsp <- mod_rcp%>%
prep()%>%
juice()
table(downsp$Class)
##
## 0 1
## 365 365
# now build the model with the balanced data
mod_rl <- logistic_reg(mode = 'classification') %>%
set_engine(engine = "glm",family=binomial(link='logit'))
mod_dwn <- recipe(Class~.,
data = downsp)
set.seed(45)
cv_folds <- vfold_cv(
data = downsp,
v = 10,
strata = Class
)
wf_m <- workflow()%>%
add_recipe(mod_dwn)%>%
add_model(mod_rl)
my_metrics <- metric_set(accuracy,roc_auc,kap)
set.seed(45)
rlog <- wf_m%>%
fit_resamples(
resamples = cv_folds,
metrics = my_metrics,
control = control_resamples(save_pred = TRUE)
  )
collect_metrics(rlog)
## # A tibble: 3 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.936 10 0.00625 Preprocessor1_Model1
## 2 kap binary 0.871 10 0.0125 Preprocessor1_Model1
## 3 roc_auc binary 0.971 10 0.00642 Preprocessor1_Model1
show_best(rlog, metric = "roc_auc")
## # A tibble: 1 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.971 10 0.00642 Preprocessor1_Model1
best_rlog2 <- rlog%>%
select_best("roc_auc")
wf_final <-
wf_m%>%
finalize_workflow(best_rlog2)
set.seed(45)
final_model_rl <- fit(object = wf_final, data = train)
model_extr1 <- final_model_rl%>%extract_fit_engine()
summary(model_extr1)
##
## Call:
## stats::glm(formula = ..y ~ ., family = ~binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9346 -0.0295 -0.0188 -0.0117 4.6224
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.319231413 0.291577338 -28.532 < 0.0000000000000002 ***
## Time -0.000004848 0.000002602 -1.863 0.062460 .
## V1 0.107654903 0.050909722 2.115 0.034462 *
## V2 -0.033633533 0.069822411 -0.482 0.630018
## V3 -0.037947213 0.063979283 -0.593 0.553103
## V4 0.692706140 0.084390123 8.208 0.000000000000000224 ***
## V5 0.109726031 0.080660343 1.360 0.173720
## V6 -0.150614456 0.088786867 -1.696 0.089818 .
## V7 -0.034023457 0.082385089 -0.413 0.679621
## V8 -0.227243848 0.036927965 -6.154 0.000000000756928775 ***
## V9 -0.338204498 0.129851997 -2.605 0.009200 **
## V10 -0.785183945 0.115353532 -6.807 0.000000000009982042 ***
## V11 -0.136922109 0.096175533 -1.424 0.154542
## V12 0.132575869 0.106214146 1.248 0.211960
## V13 -0.448729872 0.096353914 -4.657 0.000003206937752662 ***
## V14 -0.672508391 0.077686302 -8.657 < 0.0000000000000002 ***
## V15 -0.119418918 0.100663673 -1.186 0.235498
## V16 -0.229062650 0.145298089 -1.577 0.114910
## V17 -0.038005784 0.083032375 -0.458 0.647152
## V18 0.008196272 0.150900559 0.054 0.956684
## V19 0.223503656 0.114126424 1.958 0.050185 .
## V20 -0.440758512 0.100280739 -4.395 0.000011064741078914 ***
## V21 0.407535314 0.067648082 6.024 0.000000001697969830 ***
## V22 0.644599718 0.151453601 4.256 0.000020803568207664 ***
## V23 -0.050734804 0.074327573 -0.683 0.494870
## V24 0.316967217 0.175035093 1.811 0.070160 .
## V25 -0.219317203 0.153222376 -1.431 0.152326
## V26 -0.188142940 0.222860995 -0.844 0.398548
## V27 -0.916394316 0.145986409 -6.277 0.000000000344596593 ***
## V28 -0.411353180 0.112069490 -3.671 0.000242 ***
## Amount 0.000657254 0.000514437 1.278 0.201384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5330.5 on 199363 degrees of freedom
## Residual deviance: 1629.6 on 199333 degrees of freedom
## AIC: 1691.6
##
## Number of Fisher Scoring iterations: 12
We are going to select the highly significant variables (***) and refit the model with only these, so:
df <- dat%>%select(V4,V8,V10,V13,V14,V20,V21,V22,V27,V28,Class)
set.seed(45)
split_inicial <- initial_split(
data = df,
prop = 0.7,
strata = Class
)
train <- training(split_inicial)
test <- testing(split_inicial)
# Balance the classes and preprocess the data
mod_rcp <- recipe(Class~.,
data = train)%>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes())%>%
step_downsample(Class)
# extract the now balanced and preprocessed subset of data
downsp <- mod_rcp%>%
prep()%>%
juice()
table(downsp$Class)
##
## 0 1
## 365 365
# now build the model with the balanced data
mod_rl <- logistic_reg(mode = 'classification') %>%
set_engine(engine = "glm",family=binomial(link='logit'))
mod_dwn <- recipe(Class~.,
data = downsp)
set.seed(45)
cv_folds <- vfold_cv(
data = downsp,
v = 10,
strata = Class
)
wf_m <- workflow()%>%
add_recipe(mod_dwn)%>%
add_model(mod_rl)
my_metrics <- metric_set(accuracy,roc_auc,kap)
set.seed(45)
rlog <- wf_m%>%
fit_resamples(
resamples = cv_folds,
metrics = my_metrics,
control = control_resamples(save_pred = TRUE)
  )
collect_metrics(rlog)
## # A tibble: 3 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.940 10 0.00723 Preprocessor1_Model1
## 2 kap binary 0.879 10 0.0145 Preprocessor1_Model1
## 3 roc_auc binary 0.977 10 0.00521 Preprocessor1_Model1
show_best(rlog, metric = "roc_auc")
## # A tibble: 1 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.977 10 0.00521 Preprocessor1_Model1
best_rlog3 <- rlog%>%
select_best("roc_auc")
wf_final <-
wf_m%>%
finalize_workflow(best_rlog3)
set.seed(45)
final_model_rl <- fit(object = wf_final, data = train)
model_extr2 <- final_model_rl%>%extract_fit_engine()
summary(model_extr2)
##
## Call:
## stats::glm(formula = ..y ~ ., family = ~binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.5349 -0.0296 -0.0192 -0.0121 4.5134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.63193 0.14901 -57.930 < 0.0000000000000002 ***
## V4 0.62479 0.05058 12.352 < 0.0000000000000002 ***
## V8 -0.16315 0.02004 -8.141 0.000000000000000392 ***
## V10 -0.70409 0.06629 -10.622 < 0.0000000000000002 ***
## V13 -0.44463 0.08645 -5.143 0.000000270405507326 ***
## V14 -0.68226 0.04797 -14.224 < 0.0000000000000002 ***
## V20 -0.19123 0.04400 -4.346 0.000013850931663338 ***
## V21 0.45066 0.04989 9.033 < 0.0000000000000002 ***
## V22 0.77452 0.12653 6.121 0.000000000928133987 ***
## V27 -0.74164 0.10397 -7.133 0.000000000000980873 ***
## V28 -0.29612 0.09210 -3.215 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5330.5 on 199363 degrees of freedom
## Residual deviance: 1697.2 on 199353 degrees of freedom
## AIC: 1719.2
##
## Number of Fisher Scoring iterations: 11
We see that virtually all the variables now have three stars (***), except one, which is nevertheless still quite significant.
The model's score predictions on the test set will be:
rl_predict <- predict(model_extr2,test,type = "response")
head(rl_predict,20)
## 2 3 10 17 22
## 0.00012991159 0.00029768491 0.00007357205 0.00050964868 0.00043325804
## 23 25 27 29 30
## 0.00046599407 0.00001341684 0.00022701711 0.00055445835 0.00055038904
## 37 42 47 48 49
## 0.00039528039 0.00016444255 0.00012446489 0.00047243773 0.00003924375
## 50 54 55 56 58
## 0.00028193918 0.00094029386 0.00014850395 0.00051287401 0.00008386964
Let's look at it graphically:
plot(rl_predict~test$Class)
We now compute the pseudo R-squared, having reviewed above the coefficients and significance of the variables we chose for our model:
psdo_R <- 1-(model_extr1$deviance/model_extr1$null.deviance)
psdo_R
## [1] 0.6942926
The value we obtain is very good: this McFadden pseudo R-squared indicates that, with the variables we have, the model accounts for roughly 69% of the null deviance, leaving only about 31% unexplained.
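As a quick arithmetic check, the value can be reproduced from the deviances printed in the model summary above (agreement is up to the rounding of those printed deviances):
1 - (1629.6 / 5330.5) # residual deviance over null deviance
## [1] 0.6942876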
With the umbrales function we try different cutoffs. For this particular fraud-detection case we will try to increase recall without losing too much precision: for fraud we accept somewhat more detection errors in exchange for greater coverage, so that a suspicious transaction can be blocked and the customer's money protected.
umb_rl<-umbrales(test$Class,rl_predict)
umb_rl
## umbral acierto precision cobertura F1
## 1 0.05 99.89935 61.98830 83.46457 71.14094
## 2 0.10 99.91924 70.71429 77.95276 74.15730
## 3 0.15 99.92276 73.28244 75.59055 74.41860
## 4 0.20 99.93095 78.33333 74.01575 76.11336
## 5 0.25 99.92861 78.94737 70.86614 74.68880
## 6 0.30 99.92627 79.09091 68.50394 73.41772
## 7 0.35 99.92393 79.24528 66.14173 72.10300
## 8 0.40 99.92276 79.61165 64.56693 71.30435
## 9 0.45 99.92276 80.19802 63.77953 71.05263
## 10 0.50 99.92744 84.21053 62.99213 72.07207
## 11 0.55 99.92627 84.78261 61.41732 71.23288
## 12 0.60 99.92627 85.55556 60.62992 70.96774
## 13 0.65 99.92159 84.88372 57.48031 68.54460
## 14 0.70 99.92159 85.71429 56.69291 68.24645
## 15 0.75 99.91807 85.18519 54.33071 66.34615
## 16 0.80 99.91690 85.00000 53.54331 65.70048
## 17 0.85 99.91339 84.41558 51.18110 63.72549
## 18 0.90 99.90754 84.28571 46.45669 59.89848
## 19 0.95 99.89935 82.53968 40.94488 54.73684
We select a threshold that boosts recall without reducing precision too much.
umbral_final_rl<-0.10
umbral_final_rl
## [1] 0.1
We obtain the confusion matrix at the optimized threshold.
confusion(test$Class,rl_predict,0.10)
##
## real FALSE TRUE
## 0 85275 41
## 1 28 99
rl_metricas<-filter(umb_rl,umbral==umbral_final_rl)
rl_metricas
## umbral acierto precision cobertura F1
## 1 0.1 99.91924 70.71429 77.95276 74.1573
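These figures can be cross-checked directly against the confusion matrix above (99 true positives, 41 false positives, 28 false negatives):
99 / (99 + 41) * 100 # precision
## [1] 70.71429
99 / (99 + 28) * 100 # recall (coverage)
## [1] 77.95276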
We evaluate the ROC.
# create the prediction object (ROCR)
rl_prediction<-prediction(rl_predict,test$Class)
# plot the ROC curve
roc(rl_prediction)
We extract the final metrics, including the AUC.
rl_metricas<-cbind(rl_metricas,AUC=round(auc(rl_prediction),2)*100)
print(t(rl_metricas))
## [,1]
## umbral 0.10000
## acierto 99.91924
## precision 70.71429
## cobertura 77.95276
## F1 74.15730
## AUC 97.00000
3.2 Random Forest
set.seed(45)
split_inicial <- initial_split(
data = dat,
prop = 0.7,
strata = Class
)
train <- training(split_inicial)
test <- testing(split_inicial)
# Balance the classes and preprocess the data
mod_rcp <- recipe(Class~.,
data = train)%>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes())%>%
step_downsample(Class)
# extract the now balanced and preprocessed subset of data
downsp <- mod_rcp%>%
prep()%>%
juice()
table(downsp$Class)
##
## 0 1
## 365 365
mod_rf <- rand_forest(
mode = "classification",
mtry = tune(),
trees = 100,
min_n = tune()
) %>%
set_engine(engine = "randomForest",importance=T)
mod_dwn <- recipe(Class~.,
data = downsp)
set.seed(45)
cv_folds <- vfold_cv(
data = downsp,
v = 10,
strata = Class)
wf_rf <- workflow()%>%
add_recipe(mod_dwn)%>%
add_model(mod_rf)
rf_grid <- grid_regular(
mtry(range = c(10, 30)),
min_n(range = c(2, 8)),
levels = 5
)
registerDoParallel(cores = parallel::detectCores() - 1)
set.seed(45)
rf_fit <- wf_rf%>%tune_grid(
resamples = cv_folds,
metrics = metric_set(roc_auc),
control = control_resamples(save_pred = TRUE),
grid=rf_grid)
stopImplicitCluster()
show_best(rf_fit, metric = "roc_auc")
## # A tibble: 5 x 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 25 8 roc_auc binary 0.977 10 0.00539 Preprocessor1_Model24
## 2 15 8 roc_auc binary 0.977 10 0.00472 Preprocessor1_Model22
## 3 10 6 roc_auc binary 0.976 10 0.00508 Preprocessor1_Model16
## 4 15 3 roc_auc binary 0.976 10 0.00515 Preprocessor1_Model07
## 5 30 5 roc_auc binary 0.976 10 0.00529 Preprocessor1_Model15
mejor_hiper1 <- select_best(rf_fit, metric="roc_auc")
mejor_hiper1
## # A tibble: 1 x 3
## mtry min_n .config
## <int> <int> <chr>
## 1 25 8 Preprocessor1_Model24
wf_finalrf <-
wf_rf%>%
finalize_workflow(mejor_hiper1)
set.seed(45)
final_model_rf <- fit(object = wf_finalrf, data = train)
model_extr3 <- final_model_rf%>%
  extract_fit_engine()
We compute the model's score predictions (scorings).
rf_predict <- predict(model_extr3, test, type = 'prob')[,2]
head(rf_predict,20)
## 2 3 10 17 22 23 25 27 29 30 37 42 47 48 49 50
## 0.00 0.03 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00
## 54 55 56 58
## 0.00 0.00 0.00 0.00
Let's look at it graphically:
plot(rf_predict~test$Class)
We try different cutoffs with the umbrales function and inspect the resulting table. We choose a threshold that maximizes recall without reducing precision too much.
umb_rf<-umbrales(test$Class,rf_predict)
umb_rf
## umbral acierto precision cobertura F1
## 1 0.05 99.79870 41.69742 88.97638 56.78392
## 2 0.10 99.89935 61.20219 88.18898 72.25806
## 3 0.15 99.94148 76.19048 88.18898 81.75182
## 4 0.20 99.94382 78.01418 86.61417 82.08955
## 5 0.25 99.95319 82.70677 86.61417 84.61538
## 6 0.30 99.95787 85.27132 86.61417 85.93750
## 7 0.35 99.96021 87.20000 85.82677 86.50794
## 8 0.40 99.96489 90.08264 85.82677 87.90323
## 9 0.45 99.96606 90.83333 85.82677 88.25911
## 10 0.50 99.96489 90.75630 85.03937 87.80488
## 11 0.55 99.96723 92.30769 85.03937 88.52459
## 12 0.60 99.96723 93.80531 83.46457 88.33333
## 13 0.65 99.96255 94.39252 79.52756 86.32479
## 14 0.70 99.95904 94.23077 77.16535 84.84848
## 15 0.75 99.95436 94.89796 73.22835 82.66667
## 16 0.80 99.95084 94.73684 70.86614 81.08108
## 17 0.85 99.94616 96.55172 66.14173 78.50467
## 18 0.90 99.93446 100.00000 55.90551 71.71717
## 19 0.95 99.92041 100.00000 46.45669 63.44086
We set it directly with the following syntax.
umbral_final_rf<-0.2
umbral_final_rf
## [1] 0.2
We obtain the confusion matrix at the optimized threshold.
confusion(test$Class,rf_predict,0.2)
##
## real FALSE TRUE
## 0 85285 31
## 1 17 110
rf_metricas<-filter(umb_rf,umbral==umbral_final_rf)
rf_metricas
## umbral acierto precision cobertura F1
## 1 0.2 99.94382 78.01418 86.61417 82.08955
We evaluate the ROC.
# create the prediction object (ROCR)
rf_prediction<-prediction(rf_predict,test$Class)
# plot the ROC curve
roc(rf_prediction)
We extract the final metrics, including the AUC.
rf_metricas<-cbind(rf_metricas,AUC=round(auc(rf_prediction),2)*100)
print(t(rf_metricas))
## [,1]
## umbral 0.20000
## acierto 99.94382
## precision 78.01418
## cobertura 86.61417
## F1 82.08955
## AUC 96.00000
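Since the randomForest engine was fitted with importance = TRUE, the fitted forest in model_extr3 also carries variable-importance measures that the analysis above does not use; a minimal sketch for inspecting them:
randomForest::importance(model_extr3) # mean decrease in accuracy and in Gini per variable
randomForest::varImpPlot(model_extr3) # quick visual ranking of the predictors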
3.3 XGBOOST (extreme gradient boosting)
set.seed(45)
split_inicial <- initial_split(
data = dat,
prop = 0.7,
strata = Class
)
train <- training(split_inicial)
test <- testing(split_inicial)
# Balance the classes and preprocess the data
mod_rcp <- recipe(Class~.,
data = train)%>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes())%>%
step_downsample(Class)
# extract the now balanced and preprocessed subset of data
downsp <- mod_rcp%>%
prep()%>%
juice()
table(downsp$Class)
##
## 0 1
## 365 365
mod_xg <- boost_tree(mtry = tune(),
min_n = tune(),
tree_depth = tune(),
trees = 100,
learn_rate = 0.1,
sample_size = 0.8) %>%
set_engine("xgboost") %>%
set_mode("classification")
mod_dwn <- recipe(Class~.,
data = downsp)
set.seed(3001)
cv_folds <- vfold_cv(
data = downsp,
v = 10,
strata = Class)
wf_xg<- workflow()%>%
add_recipe(mod_dwn)%>%
add_model(mod_xg)
# Parameters for tuning
params_xgb <- parameters(
finalize(mtry(), x = train[, -1]),
min_n(range = c(2L, 50L)),
tree_depth(range = c(3L, 8L))
)
# Grid
set.seed(2000)
grid_xgb <- params_xgb %>%
  grid_max_entropy(size = 10)
registerDoParallel(cores = parallel::detectCores() - 1)
set.seed(2631)
xg_fit <- wf_xg%>%tune_grid(
resamples = cv_folds,
metrics = metric_set(roc_auc),
grid = grid_xgb,
control = control_grid(save_pred = TRUE))
stopImplicitCluster()
show_best(xg_fit, metric = "roc_auc")
## # A tibble: 5 x 9
## mtry min_n tree_depth .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 27 6 3 roc_auc binary 0.976 10 0.00555 Preprocessor1_M~
## 2 29 9 7 roc_auc binary 0.975 10 0.00591 Preprocessor1_M~
## 3 3 7 8 roc_auc binary 0.975 10 0.00572 Preprocessor1_M~
## 4 2 8 3 roc_auc binary 0.973 10 0.00610 Preprocessor1_M~
## 5 5 36 4 roc_auc binary 0.970 10 0.00647 Preprocessor1_M~
mejor_hiper5 <- select_best(xg_fit, metric="roc_auc")
mejor_hiper5
## # A tibble: 1 x 4
## mtry min_n tree_depth .config
## <int> <int> <int> <chr>
## 1 27 6 3 Preprocessor1_Model01
wf_finalxg <-
  wf_xg%>%
  finalize_workflow(mejor_hiper5)
set.seed(3300)
final_model_xg <- fit(object = wf_finalxg, data = train)
model_extr4 <- final_model_xg%>%
  extract_fit_engine()
We compute the model's score predictions (scorings); we predict through the fitted workflow, since the extracted xgboost engine expects a numeric matrix rather than a data frame.
xg_predict <- predict(final_model_xg, new_data = test, type = 'prob')$.pred_1
head(xg_predict,20)
## 2 3 10 17 22 23 25 27 29 30 37 42 47 48 49 50 54 55 56 58
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Let's look at it graphically:
plot(xg_predict~test$Class)
We try different cutoffs with the umbrales function and inspect the resulting table.
umb_xg<-umbrales(test$Class,xg_predict)
umb_xg
## umbral acierto precision cobertura F1
## 1 0.05 99.78231 39.57597 88.18898 54.63415
## 2 0.10 99.89233 59.25926 88.18898 70.88608
## 3 0.15 99.93329 73.33333 86.61417 79.42238
## 4 0.20 99.94265 77.46479 86.61417 81.78439
## 5 0.25 99.95436 83.33333 86.61417 84.94208
## 6 0.30 99.95436 83.84615 85.82677 84.82490
## 7 0.35 99.95904 86.50794 85.82677 86.16601
## 8 0.40 99.96021 87.20000 85.82677 86.50794
## 9 0.45 99.96372 90.00000 85.03937 87.44939
## 10 0.50 99.96606 91.52542 85.03937 88.16327
## 11 0.55 99.96606 92.24138 84.25197 88.06584
## 12 0.60 99.96489 92.17391 83.46457 87.60331
## 13 0.65 99.96021 92.66055 79.52756 85.59322
## 14 0.70 99.95787 94.17476 76.37795 84.34783
## 15 0.75 99.95436 94.00000 74.01575 82.81938
## 16 0.80 99.95436 94.89796 73.22835 82.66667
## 17 0.85 99.94382 96.47059 64.56693 77.35849
## 18 0.90 99.93563 97.36842 58.26772 72.90640
## 19 0.95 99.91807 98.30508 45.66929 62.36559
We look for a threshold that maximizes recall without reducing precision too much.
We set it directly with the following syntax.
umbral_final_xg<-0.20
umbral_final_xg
## [1] 0.2
We obtain the confusion matrix at the optimized threshold.
confusion(test$Class,xg_predict,0.20)
##
## real FALSE TRUE
## 0 85284 32
## 1 17 110
xg_metricas<-filter(umb_xg,umbral==umbral_final_xg)
xg_metricas
## umbral acierto precision cobertura F1
## 1 0.2 99.94265 77.46479 86.61417 81.78439
We evaluate the ROC.
# create the prediction object (ROCR)
xg_prediction<-prediction(xg_predict,test$Class)
# plot the ROC curve
roc(xg_prediction)
We extract the final metrics, including the AUC.
xg_metricas<-cbind(xg_metricas,AUC=round(auc(xg_prediction),2)*100)
print(t(xg_metricas))
## [,1]
## umbral 0.20000
## acierto 99.94265
## precision 77.46479
## cobertura 86.61417
## F1 81.78439
## AUC 96.00000
3.4 Comparison of metrics and ROC curves.
3.4.1 Metrics comparison
comparativa <- rbind(rl_metricas,rf_metricas,xg_metricas)
rownames(comparativa) <- c('Regresion Logistica', 'Random Forest', 'Extreme Gradient Boosting')
t(comparativa)
## Regresion Logistica Random Forest Extreme Gradient Boosting
## umbral 0.10000 0.20000 0.20000
## acierto 99.91924 99.94382 99.94265
## precision 70.71429 78.01418 77.46479
## cobertura 77.95276 86.61417 86.61417
## F1 74.15730 82.08955 81.78439
## AUC 97.00000 96.00000 96.00000
3.4.2 Comparison of the models' ROC curves.
rl_auc <- rlog%>%collect_predictions(summarize=TRUE,parameters=best_rlog2)%>%
roc_curve(Class,.pred_0)%>%
mutate(model = "Regresion Logistica")
rf_auc <- rf_fit%>%collect_predictions(summarize=TRUE,parameters=mejor_hiper1)%>%
roc_curve(Class,.pred_0)%>%
mutate(model = "Random Forest")
xg_auc <- xg_fit%>%collect_predictions(summarize=TRUE,parameters=mejor_hiper5)%>%
roc_curve(Class,.pred_0)%>%
mutate(model = "Xgboost")
bind_rows(rl_auc,rf_auc,xg_auc)%>% # draw the three ROC curves together
  ggplot(aes(x=1-specificity, y = sensitivity, col = model))+ # x and y axes, colored by model name
  geom_path(lwd = 1.5, alpha = 0.8) + # connect the curves; lwd = line width, alpha = transparency
  geom_abline(lty = 3) + # reference diagonal; lty = line type
  coord_equal() + # make the axis ranges equal
  scale_color_viridis_d(option = "plasma", end = .6)
3.5 CONCLUSION AND SAVING THE CHOSEN PREDICTIVE MODEL
We see that all the models predict very well; for its simplicity, and because it is slightly better on AUC, we choose logistic regression as our final model.
dat$SCORING_CLASS <- predict(model_extr1,dat,type='response')
# display the table with the added scoring and save the final model
kable(head(dat,6), booktabs = T) %>%
  kable_styling(font_size=8)
| Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | SCORING_CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.3598071 | -0.0727812 | 2.5363467 | 1.3781552 | -0.3383208 | 0.4623878 | 0.2395986 | 0.0986979 | 0.3637870 | 0.0907942 | -0.5515995 | -0.6178009 | -0.9913898 | -0.3111694 | 1.4681770 | -0.4704005 | 0.2079712 | 0.0257906 | 0.4039930 | 0.2514121 | -0.0183068 | 0.2778376 | -0.1104739 | 0.0669281 | 0.1285394 | -0.1891148 | 0.1335584 | -0.0210531 | 149.62 | 0 | 0.0007505 |
| 0 | 1.1918571 | 0.2661507 | 0.1664801 | 0.4481541 | 0.0600176 | -0.0823608 | -0.0788030 | 0.0851017 | -0.2554251 | -0.1669744 | 1.6127267 | 1.0652353 | 0.4890950 | -0.1437723 | 0.6355581 | 0.4639170 | -0.1148047 | -0.1833613 | -0.1457830 | -0.0690831 | -0.2257752 | -0.6386720 | 0.1012880 | -0.3398465 | 0.1671704 | 0.1258945 | -0.0089831 | 0.0147242 | 2.69 | 0 | 0.0001612 |
| 1 | -1.3583541 | -1.3401631 | 1.7732093 | 0.3797796 | -0.5031981 | 1.8004994 | 0.7914610 | 0.2476758 | -1.5146543 | 0.2076429 | 0.6245015 | 0.0660837 | 0.7172927 | -0.1659459 | 2.3458649 | -2.8900832 | 1.1099694 | -0.1213593 | -2.2618571 | 0.5249797 | 0.2479982 | 0.7716794 | 0.9094123 | -0.6892810 | -0.3276418 | -0.1390966 | -0.0553528 | -0.0597518 | 378.66 | 0 | 0.0002707 |
| 1 | -0.9662717 | -0.1852260 | 1.7929933 | -0.8632913 | -0.0103089 | 1.2472032 | 0.2376089 | 0.3774359 | -1.3870241 | -0.0549519 | -0.2264873 | 0.1782282 | 0.5077569 | -0.2879237 | -0.6314181 | -1.0596472 | -0.6840928 | 1.9657750 | -1.2326220 | -0.2080378 | -0.1083005 | 0.0052736 | -0.1903205 | -1.1755753 | 0.6473760 | -0.2219288 | 0.0627228 | 0.0614576 | 123.50 | 0 | 0.0001049 |
| 2 | -1.1582331 | 0.8777368 | 1.5487178 | 0.4030339 | -0.4071934 | 0.0959215 | 0.5929407 | -0.2705327 | 0.8177393 | 0.7530744 | -0.8228429 | 0.5381956 | 1.3458516 | -1.1196698 | 0.1751211 | -0.4514492 | -0.2370332 | -0.0381948 | 0.8034869 | 0.4085424 | -0.0094307 | 0.7982785 | -0.1374581 | 0.1412670 | -0.2060096 | 0.5022922 | 0.2194222 | 0.2151531 | 69.99 | 0 | 0.0002148 |
| 2 | -0.4259659 | 0.9605230 | 1.1411093 | -0.1682521 | 0.4209869 | -0.0297276 | 0.4762009 | 0.2603143 | -0.5686714 | -0.3714072 | 1.3412620 | 0.3598938 | -0.3580907 | -0.1371337 | 0.5176168 | 0.4017259 | -0.0581328 | 0.0686531 | -0.0331938 | 0.0849677 | -0.2082535 | -0.5598248 | -0.0263977 | -0.3714266 | -0.2327938 | 0.1059148 | 0.2538442 | 0.0810803 | 3.67 | 0 | 0.0001270 |
saveRDS(model_extr1,'modelo_final.rds')
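Finally, a sketch of how the saved model could be reloaded and used to score fresh data; new_transactions is a hypothetical data frame with the same predictor columns as dat:
modelo <- readRDS('modelo_final.rds')
# new_transactions is hypothetical: fresh data with the same predictor columns
scores <- predict(modelo, newdata = new_transactions, type = 'response')
flagged <- scores >= 0.10 # e.g. flag transactions using the 0.10 threshold explored above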