WORKSHOP PYSPARK

Configuraciones iniciales e importación de librerías

Configuración de sesión de Spark

In [272]:
import findspark
findspark.init()
In [273]:
from pyspark.sql import SparkSession
In [274]:
spark = SparkSession.builder.getOrCreate()

Librerías a usar

In [275]:
#Pyspark libraries/modules
from pyspark.sql import *
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.types import StringType, DoubleType, IntegerType, ArrayType, DateType

#Python libraries/modules
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from scipy import stats
from scipy.stats import kstest

#Pyspark Machine Learning libraries/modules
from pyspark.mllib.stat import Statistics

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics
import re

Caso a tratar

Cargamos un dataset público con información de una campaña de marketing portuguesa para la venta de un préstamo bancario. El objetivo es, una vez limpiado el dataset, estudiarlo para extraer el máximo número de insights de los datos.

Información sobre las variables del dataset

  1. age
  2. job: tipo de trabajo
  3. marital: estado civil (nota: "divorced" implica tanto divorciado como viudo)
  4. education
  5. default: ¿no ha pagado alguna cuota de un crédito?
  6. balance: saldo medio anual, en euros
  7. housing: ¿tiene una hipoteca?
  8. loan: ¿tiene un crédito?
  9. contact: tipo de comunicación del último contacto
  10. day: día de la fecha de último contacto
  11. month: mes de la fecha de último contacto
  12. duration: duración del último contacto, en segundos
  13. campaign: número de contactos llevados a cabo durante esta campaña y para este cliente (incluye el último contacto)
  14. pdays: número de dias que han pasado desde que el cliente fue contactado por última vez durante la pasada campaña (-1 significa que el cliente no fue contactado)
  15. previous: número de contactos que se han llevado a cabo antes de esta campaña y para este cliente
  16. poutcome: resultado de la última campaña de marketing
  17. Target: ¿tiene el cliente contratado un depósito a plazo fijo?

Chequeos y Comprobaciones Iniciales

Importación del dataset

In [276]:
df = spark.read.csv('bank-full.csv',sep=",",header = True,inferSchema = True)
In [277]:
df.toPandas().head()
Out[277]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

Dimensiones del dataset

In [278]:
print("El dataset tiene",df.count(),"filas y ",len(df.columns),"columnas")
El dataset tiene 45211 filas y  17 columnas

Balanceo de la variable Target

In [279]:
Frecuencias = (df.select('Target').groupBy('Target').count()
.withColumn('Porcentaje',F.round(F.col('count')*100/df.count(),3)))
In [280]:
Frecuencias.show()
+------+-----+----------+
|Target|count|Porcentaje|
+------+-----+----------+
|    no|39922|    88.302|
|   yes| 5289|    11.698|
+------+-----+----------+

In [281]:
df_frec = Frecuencias.toPandas()
plt.figure(figsize=(10,4))
plt.title("Frecuencia porcentual de variable Target")
sns.barplot(x=df_frec['Target'],y= df_frec['Porcentaje'])
plt.show()

Por medio de la tabla de frecuencias y el gráfico vemos que la variable Target se encuentra desbalanceada. la categoría no representa un 88% de las filas y la categoría yes sólo un 11,69% (redondeado 12%).

Data Wrangling

Formateo de las variables (strings y numéricas)

In [282]:
df.printSchema()
root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- Target: string (nullable = true)

De forma preliminar vimos que variables como age, balance y duration tenían formato string pero agregando el argumento inferSchema en la función read.csv() detecta automaticamente los data types de estas columnas, sin embargo vemos que hay algunos casos que no fueron bien inferidos tales como month.

In [283]:
df.select('month').distinct().show()
+-----+
|month|
+-----+
|  jun|
|  aug|
|  may|
|  feb|
|  sep|
|  mar|
|  oct|
|  jul|
|  nov|
|  apr|
|  dec|
|  jan|
+-----+

Como podemos ver, month tiene un formato tipo string, por lo tanto por medio de un condicional vamos a convertir las etiquetas a número de meses, de esta forma la variable pasa de string a integer.

In [284]:
df = df.withColumn('month',F.when(F.col('month') == "jan", 1)
              .when(F.col('month') == "feb", 2)
              .when(F.col('month') == "mar", 3)
              .when(F.col('month') == "apr", 4)
              .when(F.col('month') == "may", 5)
              .when(F.col('month') == "jun", 6)
              .when(F.col('month') == "jul", 7)
              .when(F.col('month') == "aug", 8)
              .when(F.col('month') == "sep", 9)
              .when(F.col('month') == "oct", 10)
              .when(F.col('month') == "nov", 11)
              .otherwise(12))

Finalmente, volvemos a ejecutar el printSchema() para verificar que todas las variables tengan su correspondiente data type.

In [285]:
df.printSchema()
root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: integer (nullable = false)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- Target: string (nullable = true)

In [286]:
data_types = list(df.dtypes)
data_types_df = spark.createDataFrame(data_types, ['variable', 'data_type']).select('data_type').groupBy('data_type').count().toPandas()
In [287]:
plt.figure(figsize=(10,4))
plt.title( "Cantidad de variables por Data type")
sns.barplot(x = data_types_df['data_type'],y = data_types_df['count'])
plt.show()

Hasta el momento tenemos mayot cantidad de variables string que integer, luego cuando se creen las variables dummies esta proporción va a cambiar.

Comprobación y eliminación de registros repetidos

In [288]:
antes = df.count()
In [289]:
df = df.dropDuplicates()
In [290]:
despues = df.count()
In [291]:
print(' Filas antes de chequear duplicados:',antes,'\n','Filas después de chequear duplicados:',despues,'\n','Total filas duplicadas:',antes-despues)
 Filas antes de chequear duplicados: 45211 
 Filas después de chequear duplicados: 45211 
 Total filas duplicadas: 0

Por medio de dropDuplicates() vemos que no hay filas completas duplicadas en el dataset. Sin embargo tampoco contamos con un ID único ni nombre que permita realizar un chequeo a fondo.

Comprobación de columnas que no aportan información

1 - Comprobación para variables Cualitativas

Primero creamos un dataframe con variables cualitativas

In [292]:
cualitativas = [c for c,t in df.dtypes if t in ['string']]
cualitativas = df.select(cualitativas)
In [293]:
cualitativas.show(1)
+------+-------+---------+-------+-------+----+-------+--------+------+
|   job|marital|education|default|housing|loan|contact|poutcome|Target|
+------+-------+---------+-------+-------+----+-------+--------+------+
|admin.|married|  unknown|     no|    yes|  no|unknown| unknown|    no|
+------+-------+---------+-------+-------+----+-------+--------+------+
only showing top 1 row

Loop para generar tablas de frecuencias de las variables cualitativas

In [294]:
for col in cualitativas.columns:
    cualitativas.select(col).groupBy(col).count()\
    .withColumn('Porcentaje',F.round(F.col('count')*100/cualitativas.count(),3))\
    .orderBy(F.col('Porcentaje').desc())\
    .drop('count')\
    .show()
+-------------+----------+
|          job|Porcentaje|
+-------------+----------+
|  blue-collar|    21.526|
|   management|     20.92|
|   technician|    16.803|
|       admin.|    11.437|
|     services|     9.188|
|      retired|     5.008|
|self-employed|     3.493|
| entrepreneur|     3.289|
|   unemployed|     2.882|
|    housemaid|     2.743|
|      student|     2.075|
|      unknown|     0.637|
+-------------+----------+

+--------+----------+
| marital|Porcentaje|
+--------+----------+
| married|    60.193|
|  single|     28.29|
|divorced|    11.517|
+--------+----------+

+---------+----------+
|education|Porcentaje|
+---------+----------+
|secondary|    51.319|
| tertiary|     29.42|
|  primary|    15.153|
|  unknown|     4.107|
+---------+----------+

+-------+----------+
|default|Porcentaje|
+-------+----------+
|     no|    98.197|
|    yes|     1.803|
+-------+----------+

+-------+----------+
|housing|Porcentaje|
+-------+----------+
|    yes|    55.584|
|     no|    44.416|
+-------+----------+

+----+----------+
|loan|Porcentaje|
+----+----------+
|  no|    83.977|
| yes|    16.023|
+----+----------+

+---------+----------+
|  contact|Porcentaje|
+---------+----------+
| cellular|    64.774|
|  unknown|    28.798|
|telephone|     6.428|
+---------+----------+

+--------+----------+
|poutcome|Porcentaje|
+--------+----------+
| unknown|    81.748|
| failure|     10.84|
|   other|      4.07|
| success|     3.342|
+--------+----------+

+------+----------+
|Target|Porcentaje|
+------+----------+
|    no|    88.302|
|   yes|    11.698|
+------+----------+

2 - Comprobación para variables Cuantitativas

Primero creamos un dataframe con variables cuantitativas

In [295]:
numericas = [c for c,t in df.dtypes if t in ['int']]
numericas = df.select(numericas)
In [296]:
numericas.show(1)
+---+-------+---+-----+--------+--------+-----+--------+
|age|balance|day|month|duration|campaign|pdays|previous|
+---+-------+---+-----+--------+--------+-----+--------+
| 34|      3|  5|    5|     120|       3|   -1|       0|
+---+-------+---+-----+--------+--------+-----+--------+
only showing top 1 row

Loop para generar tablas de frecuencias de las variables cualitativas

In [297]:
for col in numericas.columns:
    numericas.select(col).groupBy(col).count()\
    .withColumn('Porcentaje',F.round(F.col('count')*100/numericas.count(),3))\
    .orderBy(F.col('Porcentaje').desc())\
    .drop('count')\
    .show(5)
+---+----------+
|age|Porcentaje|
+---+----------+
| 32|     4.612|
| 31|     4.415|
| 33|     4.362|
| 34|     4.269|
| 35|     4.189|
+---+----------+
only showing top 5 rows

+-------+----------+
|balance|Porcentaje|
+-------+----------+
|      0|     7.772|
|      1|     0.431|
|      2|     0.345|
|      4|     0.307|
|      3|     0.296|
+-------+----------+
only showing top 5 rows

+---+----------+
|day|Porcentaje|
+---+----------+
| 20|     6.087|
| 18|     5.105|
| 21|     4.481|
| 17|     4.289|
|  6|     4.273|
+---+----------+
only showing top 5 rows

+-----+----------+
|month|Porcentaje|
+-----+----------+
|    5|    30.448|
|    7|    15.251|
|    8|    13.817|
|    6|    11.813|
|   11|     8.781|
+-----+----------+
only showing top 5 rows

+--------+----------+
|duration|Porcentaje|
+--------+----------+
|     124|     0.416|
|      90|     0.407|
|      89|     0.391|
|     114|     0.387|
|     122|     0.387|
+--------+----------+
only showing top 5 rows

+--------+----------+
|campaign|Porcentaje|
+--------+----------+
|       1|    38.805|
|       2|    27.659|
|       3|    12.212|
|       4|      7.79|
|       5|     3.902|
+--------+----------+
only showing top 5 rows

+-----+----------+
|pdays|Porcentaje|
+-----+----------+
|   -1|    81.737|
|  182|     0.369|
|   92|     0.325|
|  183|     0.279|
|   91|     0.279|
+-----+----------+
only showing top 5 rows

+--------+----------+
|previous|Porcentaje|
+--------+----------+
|       0|    81.737|
|       1|     6.131|
|       2|     4.658|
|       3|     2.526|
|       4|     1.579|
+--------+----------+
only showing top 5 rows

Resumen de variables que no aportan mucha información:

Estableciendo un umbral de 80%, si una de sus categorías concentra este porcentaje decimos que la variable a priori no aporta mucha información ya que esta categoría se repite mayoritariamente a lo largo de toda la columna.

  • Variables cualitativas: default, loan, poutcome
  • Variables numéricas: picked_up_call, extra_info

Estudio de la varianza para las variables numéricas

Método 1

In [298]:
numericas_pd = numericas.toPandas()
In [299]:
avgs = [F.avg(col).alias('avg_' + col) for col in numericas_pd]
maxs = [F.max(col).alias('max_' + col) for col in numericas_pd]
mins = [F.min(col).alias('min_' + col) for col in numericas_pd]
stds = [F.stddev(col).alias('std_' + col) for col in numericas_pd]

operations = avgs + stds + maxs + mins 
In [300]:
resultados = df.select(operations).first()

for col in numericas_pd:
    
    avg = resultados['avg_' + col]
    std = resultados['std_' + col]
    maxi = resultados['max_' + col]
    mini = resultados['min_' + col]
    
    print('{}: avg={}, std={}, min={}, max={}'\
          .format(col, round(avg, 2), round(std, 2), mini, maxi))
age: avg=40.94, std=10.62, min=18, max=95
balance: avg=1362.27, std=3044.77, min=-8019, max=102127
day: avg=15.81, std=8.32, min=1, max=31
month: avg=6.14, std=2.41, min=1, max=12
duration: avg=258.16, std=257.53, min=0, max=4918
campaign: avg=2.76, std=3.1, min=1, max=63
pdays: avg=40.2, std=100.13, min=-1, max=871
previous: avg=0.58, std=2.3, min=0, max=275

Método 2: más automatizado

In [301]:
numericas_pd.describe()
Out[301]:
age balance day month duration campaign pdays previous
count 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000
mean 40.936210 1362.272058 15.806419 6.144655 258.163080 2.763841 40.197828 0.580323
std 10.618762 3044.765829 8.322476 2.408034 257.527812 3.098021 100.128746 2.303441
min 18.000000 -8019.000000 1.000000 1.000000 0.000000 1.000000 -1.000000 0.000000
25% 33.000000 72.000000 8.000000 5.000000 103.000000 1.000000 -1.000000 0.000000
50% 39.000000 448.000000 16.000000 6.000000 180.000000 2.000000 -1.000000 0.000000
75% 48.000000 1428.000000 21.000000 8.000000 319.000000 3.000000 -1.000000 0.000000
max 95.000000 102127.000000 31.000000 12.000000 4918.000000 63.000000 871.000000 275.000000

Análisis gráfico de media y varianza:

Por medio de histogramas de las variables cuantitativas, vamos a observar la media y varianza de cada una de ellas para ver cuales son las variables que tienen mayor desvio.

In [302]:
for col in numericas_pd.columns:
    plt.figure(figsize=(20, 4)) 
    plt.title("Histograma de "+col+" con media y desvio")
    sns.histplot(numericas_pd[col]) 
    mean = numericas_pd[col].mean() 
    std = numericas_pd[col].std() 
    plt.axvline(std, color='red', linewidth=2, linestyle = 'dashed') 
    plt.axvline(mean, color='blue', linewidth=2, linestyle = 'dashed')

En conclusión las variables que presentan mayor desvío respecto de su media, son las variables : age, month y day

Comprobación y eliminación de nulos

Como primer paso vamos a revisar los valores únicos de las variables cualitativas para ver si hay alguna etiqueta que no corresponda y que este ocultando valores perdidos.

In [303]:
for col in cualitativas.columns:
    cualitativas.select(col).distinct().show()
+-------------+
|          job|
+-------------+
|   management|
|      retired|
|      unknown|
|self-employed|
|      student|
|  blue-collar|
| entrepreneur|
|       admin.|
|   technician|
|     services|
|    housemaid|
|   unemployed|
+-------------+

+--------+
| marital|
+--------+
|divorced|
| married|
|  single|
+--------+

+---------+
|education|
+---------+
|  unknown|
| tertiary|
|secondary|
|  primary|
+---------+

+-------+
|default|
+-------+
|     no|
|    yes|
+-------+

+-------+
|housing|
+-------+
|     no|
|    yes|
+-------+

+----+
|loan|
+----+
|  no|
| yes|
+----+

+---------+
|  contact|
+---------+
|  unknown|
| cellular|
|telephone|
+---------+

+--------+
|poutcome|
+--------+
| success|
| unknown|
|   other|
| failure|
+--------+

+------+
|Target|
+------+
|    no|
|   yes|
+------+

Por medio de un distinct vemos que hay una categoría presente en varias variables que es "unknown". Al no haber dato en la variable esta categoría lo que hace es rellenar la celda, por lo tanto vamos a remover esta etiqueta por un valor vacío para desenmascarar a los valores perdidos ocultos.

In [304]:
df = df.replace('unknown',None)

Una vez reemplazada la etiqueta procedo a revisar el % de nulos por variables

In [305]:
for col in df.columns:
    
    n_missing = df.filter(F.col(col).isNull()).count()
    perc_missing = 100 * n_missing / df.count()
    print(col, round(perc_missing, 2))
age 0.0
job 0.64
marital 0.0
education 4.11
default 0.0
balance 0.0
housing 0.0
loan 0.0
contact 28.8
day 0.0
month 0.0
duration 0.0
campaign 0.0
pdays 0.0
previous 0.0
poutcome 81.75
Target 0.0

Realizando una revisión del porcentaje de valores nulos, vemos que la variable job, education, contact y poutcome tienen valores nulos, por lo tanto para tratar estos casos vamos a determinar algunos criterios:

  • Las variables con un % de nulos > 80% (ej: poutcome) voy a proceder a eliminarla del dataset ya que es una variable incompleta en la mayoría de sus filas.
  • Para el resto de las variables, como son cualitativas, voy a realizar una imputación de los valores nulos por la categoría con mayor frecuencia de cada una de ellas.

Eliminación de variable poutcome

In [306]:
len(df.columns)
Out[306]:
17
In [307]:
df = df.drop('poutcome')

Chequeamos que las variable hayan quedado eliminadas del dataset:

In [308]:
len(df.columns)
Out[308]:
16

Una vez eliminada la variable poutcome, nos quedan 3 variables más con nulos: job, education y contact de las cuales vamos a tomar el criterio de imputación por la cateogoría más frecuente de cada una de ellas:

Categorías con mayor frecuencia:

In [309]:
max_job = df.select('job').groupBy('job').count().orderBy(F.col('count').desc()).select('job').first()[0]
max_educ = df.select('education').groupBy('education').count().orderBy(F.col('count').desc()).select('education').first()[0]
max_cont = df.select('contact').groupBy('contact').count().orderBy(F.col('count').desc()).select('contact').first()[0]

Categoría más frecuentes:

In [310]:
max_job + ', '+ max_educ + ', ' + max_cont
Out[310]:
'blue-collar, secondary, cellular'

Imputación por categoría más frecuente:

In [311]:
df = df.fillna(max_job, subset=['job'])
df = df.fillna(max_educ, subset=['education'])
df = df.fillna(max_cont, subset=['contact'])

Luego de la imputación verificamos nuevamente si quedaron valores nulos:

In [312]:
for col in df.columns:
    
    n_missing = df.filter(F.col(col).isNull()).count()
    perc_missing = 100 * n_missing / df.count()
    print(col, round(perc_missing, 2))
age 0.0
job 0.0
marital 0.0
education 0.0
default 0.0
balance 0.0
housing 0.0
loan 0.0
contact 0.0
day 0.0
month 0.0
duration 0.0
campaign 0.0
pdays 0.0
previous 0.0
Target 0.0

Finalmente hemos descubierto a los valores nulos ocultos y se han inputado siguiendo determinados criterios.

Comprobación y eliminación de outliers

Como primer paso vamos a crear un dataframe con variables numéricas

In [313]:
var_num = df.select([c for c,t in df.dtypes if t in['int']])
var_num = var_num.toPandas()

A continuación vamos a comprobar por medio de una función si las variables tienen distribución normal o no, para evaluar que método es conveniente aplicar.

El alpha seteado es de 0.05

In [314]:
def is_normal(variable, alpha=0.05):
    mean = variable.mean()
    std = variable.std()
    pvalue = kstest(variable, 'norm', args=(mean, std)).pvalue
    return pvalue >= alpha

Loop para chequear normalidad por cada columna del dataset de variables numéricas

In [315]:
normal_cols = []
non_normal_cols = []

for col in var_num.columns:
    normal = is_normal(var_num[col])
    if normal:
        normal_cols.append(col)
    else:
        non_normal_cols.append(col)
In [316]:
normal_cols
Out[316]:
[]

Niguna de las variables analizadas tienen distribución normal

In [317]:
non_normal_cols
Out[317]:
['age', 'balance', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous']

Concluimos que las variables no tienen una distribución normal, por lo tanto aplicaremos el test de Tukey que se maneja con cuartiles:

Chequeo de outliers por variable

In [318]:
def tukey_outliers(df,column,extreme=False):
    q1, q3 = np.percentile(df[column],[25,75])
    iqr = q3 - q1
    constant = 1.5 if not extreme else 3
    return df[~((df[column]>(q3+constant*iqr)) | (df[column]<(q1-constant*iqr)))]
In [319]:
for columna in var_num.columns: #loop para outliers
    
    outliers = round((1 - len(tukey_outliers(var_num,columna,extreme=False))/len(var_num))*100,2)
    
    print('Outliers en columna',columna,':',outliers,'%')
Outliers en columna age : 1.08 %
Outliers en columna balance : 10.46 %
Outliers en columna day : 0.0 %
Outliers en columna month : 0.0 %
Outliers en columna duration : 7.16 %
Outliers en columna campaign : 6.78 %
Outliers en columna pdays : 18.26 %
Outliers en columna previous : 18.26 %

Removiendo Outliers

In [320]:
def remove_tukey_outliers(df, col):
    q1, q3 = df.approxQuantile(col, [0.25, 0.75], 0.01)
    IQR = q3 - q1
    
    min_thresh = q1 - 1.5 * IQR
    max_thresh = q3 + 1.5 * IQR
    
    df_no_outliers = df.filter(F.col(col).between(min_thresh, max_thresh))
    
    return df_no_outliers
In [321]:
df_final = df
df_col_out = ['age','balance','campaign','duration']
for col in df_col_out:
    df_final = remove_tukey_outliers(df_final,col)
print("Filas eliminadas:",df.count()-df_final.count())
Filas eliminadas: 10828

Dimensiones del dataset sin outliers

In [322]:
print("El dataset luego de la limpieza contiene: ", df_final.count(), "filas y ",len(df_final.columns),"columnas")
El dataset luego de la limpieza contiene:  34383 filas y  16 columnas

Análisis Exploratorio de Datos

Estudio de las variables categóricas

In [323]:
cualitativas_final = df_final.select([c for c,t in df.dtypes if t in['string']])
In [324]:
df_final.toPandas().describe(include = 'O')
Out[324]:
job marital education default housing loan contact Target
count 34383 34383 34383 34383 34383 34383 34383 34383
unique 11 3 3 2 2 2 2 2
top blue-collar married secondary no yes no cellular no
freq 7841 20489 19564 33710 19678 28496 32491 31570
In [325]:
for col in cualitativas_final.columns:
    tablas = cualitativas_final.select(col).groupBy(col).count().orderBy(F.col('count').desc()).toPandas()
    plt.figure(figsize=(10, 4)) 
    plt.title("Gráfico de barras: "+col)
    sns.barplot(x=tablas['count'], y =  tablas[col],orient = 'h')
    plt.show()

Vemos que la mayoria de las variables tienen 2 o 3 categorías salvo job que tiene varias categorías.

Creación de variables

Creamos variable job_recod que indica si la persona tiene trabajo o no en base a las categorías disponibles en job

In [326]:
df_final = df_final.withColumn('trabaja',F.when(F.col('job')=='retired',0)
                               .when(F.col('job')=='unemployed',0)
                               .when(F.col('job')=='student',0)
                               .otherwise(1))
In [327]:
df_final.groupBy('trabaja').count().show()
+-------+-----+
|trabaja|count|
+-------+-----+
|      1|31280|
|      0| 3103|
+-------+-----+

Luego creamos la variable duration_cat que indica si el llamado es mayor igual al promedio o no.

In [328]:
mean_duration = df_final.select(F.mean('duration')).first()[0]
In [329]:
print("Promedio de llamada: ",round(mean_duration,2),"Segs.")
Promedio de llamada:  205.42 Segs.
In [330]:
df_final = df_final.withColumn('duration_cat',F.when(F.col('duration')>=mean_duration,'long_call')
                               .otherwise('normal_call'))
In [331]:
df_final.groupBy('duration_cat').count().show()
+------------+-----+
|duration_cat|count|
+------------+-----+
|   long_call|13789|
| normal_call|20594|
+------------+-----+

Variables Dummy

In [332]:
vars_dummy = [c for c,t in df.dtypes if t in['string']]
In [334]:
from pyspark.ml.feature import StringIndexer

df_ind = df_final

for element in vars_dummy:
    feature_indexer = StringIndexer(inputCol=element, outputCol=element+'_indexed')
    feature_indexer_model = feature_indexer.fit(df_final)
    cualitativas_final_ind = feature_indexer_model.transform(df_ind)

df_ind.show(1,vertical = True)
-RECORD 0-------------------
 age          | 34          
 job          | admin.      
 marital      | married     
 education    | secondary   
 default      | no          
 balance      | 3           
 housing      | yes         
 loan         | no          
 contact      | cellular    
 day          | 5           
 month        | 5           
 duration     | 120         
 campaign     | 3           
 pdays        | -1          
 previous     | 0           
 Target       | no          
 trabaja      | 1           
 duration_cat | normal_call 
only showing top 1 row

In [335]:
dictionaries = []

for element in vars_dummy:
    string_indexer = StringIndexer(inputCol=element, outputCol=element+'_category')
    onehotencoder = OneHotEncoder(dropLast=False, inputCol= string_indexer.getOutputCol(), outputCol=element+'_dummy')
    pipeline = Pipeline(stages=[string_indexer, onehotencoder])
    pipeline_model = pipeline.fit(df_final)
    dictionaries.append((element, list(enumerate(pipeline_model.stages[0].labels)), pipeline_model.stages[0]))
    df_final = pipeline_model.transform(df_final)
    df_final = df_final.drop(string_indexer.getOutputCol())
In [336]:
for element in dictionaries:
    df_final = (df_final.withColumn('activated_indices'+element[0], F.udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))
                        (F.col(element[0]+'_dummy'))))
    
    vocab = [re.sub(r'\W', '_', value) for value in element[-1].labels]
    df_final = df_final.select(df_final.columns + [F.col("activated_indices"+element[0])[i] for i in range(len(vocab))])
    
    dictionary = {"activated_indices"+element[0]+"[{0}]".format(x): element[0]+'_'+vocab[x] for x in range(len(vocab))}

    # Renombramos las columnas con el elemento que sea
    df_final = df_final.selectExpr(["{0} as {1}".format(x, x) if x not in dictionary else "{0} as {1}".format(x, dictionary[x]) 
                                  for x in df_final.columns])
    df_final = df_final.drop('activated_indices'+element[0], element[0]+'_dummy')
In [337]:
df_final.show(1,vertical = True)
-RECORD 0--------------------------
 age                 | 34          
 job                 | admin.      
 marital             | married     
 education           | secondary   
 default             | no          
 balance             | 3           
 housing             | yes         
 loan                | no          
 contact             | cellular    
 day                 | 5           
 month               | 5           
 duration            | 120         
 campaign            | 3           
 pdays               | -1          
 previous            | 0           
 Target              | no          
 trabaja             | 1           
 duration_cat        | normal_call 
 job_blue_collar     | 0.0         
 job_management      | 0.0         
 job_technician      | 0.0         
 job_admin_          | 1.0         
 job_services        | 0.0         
 job_retired         | 0.0         
 job_self_employed   | 0.0         
 job_entrepreneur    | 0.0         
 job_unemployed      | 0.0         
 job_housemaid       | 0.0         
 job_student         | 0.0         
 marital_married     | 1.0         
 marital_single      | 0.0         
 marital_divorced    | 0.0         
 education_secondary | 1.0         
 education_tertiary  | 0.0         
 education_primary   | 0.0         
 default_no          | 1.0         
 default_yes         | 0.0         
 housing_yes         | 1.0         
 housing_no          | 0.0         
 loan_no             | 1.0         
 loan_yes            | 0.0         
 contact_cellular    | 1.0         
 contact_telephone   | 0.0         
 Target_no           | 1.0         
 Target_yes          | 0.0         
only showing top 1 row

In [338]:
df_final.printSchema()
root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = false)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = false)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = false)
 |-- day: integer (nullable = true)
 |-- month: integer (nullable = false)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- Target: string (nullable = true)
 |-- trabaja: integer (nullable = false)
 |-- duration_cat: string (nullable = false)
 |-- job_blue_collar: double (nullable = true)
 |-- job_management: double (nullable = true)
 |-- job_technician: double (nullable = true)
 |-- job_admin_: double (nullable = true)
 |-- job_services: double (nullable = true)
 |-- job_retired: double (nullable = true)
 |-- job_self_employed: double (nullable = true)
 |-- job_entrepreneur: double (nullable = true)
 |-- job_unemployed: double (nullable = true)
 |-- job_housemaid: double (nullable = true)
 |-- job_student: double (nullable = true)
 |-- marital_married: double (nullable = true)
 |-- marital_single: double (nullable = true)
 |-- marital_divorced: double (nullable = true)
 |-- education_secondary: double (nullable = true)
 |-- education_tertiary: double (nullable = true)
 |-- education_primary: double (nullable = true)
 |-- default_no: double (nullable = true)
 |-- default_yes: double (nullable = true)
 |-- housing_yes: double (nullable = true)
 |-- housing_no: double (nullable = true)
 |-- loan_no: double (nullable = true)
 |-- loan_yes: double (nullable = true)
 |-- contact_cellular: double (nullable = true)
 |-- contact_telephone: double (nullable = true)
 |-- Target_no: double (nullable = true)
 |-- Target_yes: double (nullable = true)

Correlaciones

In [339]:
var_num_corr=[c for c,t in df_final.dtypes if t in['int','double']]
corr_matrix = Statistics.corr(df_final.select(var_num_corr).rdd.map(lambda v: Vectors.dense(v)),method='pearson')
corr_matrix=pd.DataFrame(corr_matrix,columns=var_num_corr, index=var_num_corr)

corr_matrix
Out[339]:
age balance day month duration campaign pdays previous trabaja job_blue_collar ... default_no default_yes housing_yes housing_no loan_no loan_yes contact_cellular contact_telephone Target_no Target_yes
age 1.000000 0.077290 -0.007015 0.086837 -0.037576 0.040698 -0.033130 -0.008915 -0.139331 -0.013717 ... 0.014419 -0.014419 -0.147030 0.147030 -0.008672 0.008672 -0.113395 0.113395 0.016941 -0.016941
balance 0.077290 1.000000 0.008142 0.052643 0.023973 -0.026345 0.027388 0.034972 -0.038939 -0.020796 ... 0.131074 -0.131074 -0.057900 0.057900 0.094102 -0.094102 -0.039060 0.039060 -0.086747 0.086747
day -0.007015 0.008142 1.000000 0.076505 -0.035157 0.103589 -0.083889 -0.044219 0.014657 -0.021185 ... -0.007576 0.007576 -0.026344 0.026344 -0.008857 0.008857 -0.020015 0.020015 0.030920 -0.030920
month 0.086837 0.052643 0.076505 1.000000 -0.034701 0.051476 -0.115286 -0.037664 0.008224 -0.082268 ... -0.022077 0.022077 -0.176837 0.176837 -0.032103 0.032103 -0.029637 0.029637 -0.026665 0.026665
duration -0.037576 0.023973 -0.035157 -0.034701 1.000000 -0.075095 0.015862 0.014962 -0.019310 0.014597 ... 0.001694 -0.001694 0.003812 -0.003812 0.007362 -0.007362 0.054304 -0.054304 -0.255014 0.255014
campaign 0.040698 -0.026345 0.103589 0.051476 -0.075095 1.000000 -0.069169 -0.005101 0.027784 0.011953 ... -0.011303 0.011303 -0.036898 0.036898 0.002527 -0.002527 -0.059438 0.059438 0.079285 -0.079285
pdays -0.033130 0.027388 -0.083889 -0.115286 0.015862 -0.069169 1.000000 0.435377 0.007528 0.019680 ... 0.032755 -0.032755 0.132043 -0.132043 0.023629 -0.023629 -0.015919 0.015919 -0.125062 0.125062
previous -0.008915 0.034972 -0.044219 -0.037664 0.014962 -0.005101 0.435377 1.000000 0.001040 -0.015842 ... 0.017585 -0.017585 0.041618 -0.041618 0.006849 -0.006849 -0.021512 0.021512 -0.111471 0.111471
trabaja -0.139331 -0.038939 0.014657 0.008224 -0.019310 0.027784 0.007528 0.001040 1.000000 0.171189 ... -0.007867 0.007867 0.162673 -0.162673 -0.047771 0.047771 0.052191 -0.052191 0.097077 -0.097077
job_blue_collar -0.013717 -0.020796 -0.021185 -0.082268 0.014597 0.011953 0.019680 -0.015842 0.171189 1.000000 ... -0.010770 0.010770 0.161610 -0.161610 -0.004136 0.004136 -0.015969 0.015969 0.078529 -0.078529
job_management -0.004715 0.043852 0.013482 0.074338 -0.020920 0.020613 -0.008308 0.020490 0.157944 -0.272562 ... -0.000925 0.000925 -0.068387 0.068387 0.036406 -0.036406 0.023931 -0.023931 -0.041511 0.041511
job_technician -0.046356 -0.012413 0.030247 0.040272 -0.004722 0.018560 -0.011268 0.001244 0.142452 -0.245828 ... 0.007431 -0.007431 -0.021256 0.021256 -0.016708 0.016708 0.035091 -0.035091 0.003592 -0.003592
job_admin_ -0.051122 -0.019392 -0.007069 -0.035169 -0.001311 -0.040925 0.028150 0.016753 0.116756 -0.201484 ... 0.010506 -0.010506 0.041413 -0.041413 -0.028134 0.028134 0.013142 -0.013142 -0.014374 0.014374
job_services -0.060692 -0.034344 -0.008866 -0.041768 0.003976 -0.011109 0.005569 -0.011770 0.102814 -0.177425 ... 0.004134 -0.004134 0.062783 -0.062783 -0.031432 0.031432 0.005705 -0.005705 0.030175 -0.030175
job_retired 0.358957 0.037035 -0.009392 0.021930 0.016629 -0.011517 -0.015788 -0.008369 -0.647031 -0.110765 ... 0.004115 -0.004115 -0.124579 0.124579 -0.005627 0.005627 -0.042734 0.042734 -0.044923 0.044923
job_self_employed -0.001764 0.011320 0.001710 0.024318 -0.005166 0.007348 -0.008634 0.001285 0.058932 -0.101697 ... -0.006081 0.006081 -0.031090 0.031090 0.007315 -0.007315 -0.000708 0.000708 -0.000499 0.000499
job_entrepreneur 0.025822 -0.004903 0.001812 0.030833 0.006191 0.004771 -0.013222 -0.009048 0.058246 -0.100515 ... -0.022008 0.022008 0.008638 -0.008638 -0.033387 0.033387 0.006110 -0.006110 0.023750 -0.023750
job_unemployed 0.004522 0.009234 -0.003027 -0.033280 0.010754 -0.015246 -0.012632 -0.009520 -0.542114 -0.092804 ... -0.006245 0.006245 -0.051179 0.051179 0.033398 -0.033398 -0.008767 0.008767 -0.026424 0.026424
job_housemaid 0.095089 -0.000376 0.003802 0.036070 -0.014955 0.013289 -0.033558 -0.014423 0.052399 -0.090423 ... 0.001459 -0.001459 -0.084962 0.084962 0.017431 -0.017431 -0.049671 0.049671 0.010986 -0.010986
job_student -0.211544 0.016201 -0.012667 -0.007669 0.003373 -0.021647 0.020620 0.019879 -0.476701 -0.081606 ... 0.016925 -0.016925 -0.093612 0.093612 0.062985 -0.062985 -0.034997 0.034997 -0.099721 0.099721
marital_married 0.294217 0.026723 0.002030 0.059189 -0.020764 0.052897 -0.025897 -0.009835 0.043664 0.134127 ... 0.015849 -0.015849 0.032080 -0.032080 -0.041994 0.041994 -0.045627 0.045627 0.059745 -0.059745
marital_single -0.439773 -0.004081 -0.004546 -0.064415 0.023995 -0.050094 0.030938 0.016075 -0.041722 -0.099114 ... -0.004331 0.004331 -0.035256 0.035256 0.056455 -0.056455 0.032804 -0.032804 -0.073726 0.073726
marital_divorced 0.170654 -0.035136 0.003311 0.000364 -0.002101 -0.010226 -0.004048 -0.007644 -0.007917 -0.065333 ... -0.018142 0.018142 0.000684 -0.000684 -0.015442 0.015442 0.023513 -0.023513 0.012661 -0.012661
education_secondary -0.062711 -0.044986 0.000278 -0.053827 0.025413 -0.023462 0.021056 -0.005685 -0.015657 0.041077 ... -0.010625 0.010625 0.078369 -0.078369 -0.046505 0.046505 0.008899 -0.008899 0.024772 -0.024772
education_tertiary -0.076079 0.059198 0.017404 0.081131 -0.011930 0.009564 -0.008348 0.022360 0.064348 -0.321132 ... 0.015427 -0.015427 -0.104046 0.104046 0.041452 -0.041452 0.040335 -0.040335 -0.079320 0.079320
education_primary 0.183619 -0.012294 -0.022418 -0.027747 -0.020284 0.020563 -0.018753 -0.020387 -0.059651 0.349290 ... -0.004733 0.004733 0.022579 -0.022579 0.012283 -0.012283 -0.063446 0.063446 0.065910 -0.065910
default_no 0.014419 0.131074 -0.007576 -0.022077 0.001694 -0.011303 0.032755 0.017585 -0.007867 -0.010770 ... 1.000000 -1.000000 0.010681 -0.010681 0.071211 -0.071211 -0.016603 0.016603 -0.023793 0.023793
default_yes -0.014419 -0.131074 0.007576 0.022077 -0.001694 0.011303 -0.032755 -0.017585 0.007867 0.010770 ... -1.000000 1.000000 -0.010681 0.010681 -0.071211 0.071211 0.016603 -0.016603 0.023793 -0.023793
housing_yes -0.147030 -0.057900 -0.026344 -0.176837 0.003812 -0.036898 0.132043 0.041618 0.162673 0.161610 ... 0.010681 -0.010681 1.000000 -1.000000 -0.030707 0.030707 0.058217 -0.058217 0.159991 -0.159991
housing_no 0.147030 0.057900 0.026344 0.176837 -0.003812 0.036898 -0.132043 -0.041618 -0.162673 -0.161610 ... -0.010681 0.010681 -1.000000 1.000000 0.030707 -0.030707 -0.058217 0.058217 -0.159991 0.159991
loan_no -0.008672 0.094102 -0.008857 -0.032103 0.007362 0.002527 0.023629 0.006849 -0.047771 -0.004136 ... 0.071211 -0.071211 -0.030707 0.030707 1.000000 -1.000000 -0.007430 0.007430 -0.076519 0.076519
loan_yes 0.008672 -0.094102 0.008857 0.032103 -0.007362 -0.002527 -0.023629 -0.006849 0.047771 0.004136 ... -0.071211 0.071211 0.030707 -0.030707 -1.000000 1.000000 0.007430 -0.007430 0.076519 -0.076519
contact_cellular -0.113395 -0.039060 -0.020015 -0.029637 0.054304 -0.059438 -0.015919 -0.021512 0.052191 -0.015969 ... -0.016603 0.016603 0.058217 -0.058217 -0.007430 0.007430 1.000000 -1.000000 0.005681 -0.005681
contact_telephone 0.113395 0.039060 0.020015 0.029637 -0.054304 0.059438 0.015919 0.021512 -0.052191 0.015969 ... 0.016603 -0.016603 -0.058217 0.058217 0.007430 -0.007430 -1.000000 1.000000 -0.005681 0.005681
Target_no 0.016941 -0.086747 0.030920 -0.026665 -0.255014 0.079285 -0.125062 -0.111471 0.097077 0.078529 ... -0.023793 0.023793 0.159991 -0.159991 -0.076519 0.076519 0.005681 -0.005681 1.000000 -1.000000
Target_yes -0.016941 0.086747 -0.030920 0.026665 0.255014 -0.079285 0.125062 0.111471 -0.097077 -0.078529 ... 0.023793 -0.023793 -0.159991 0.159991 0.076519 -0.076519 -0.005681 0.005681 -1.000000 1.000000

36 rows × 36 columns

Gráfico Matriz de correlaciones

In [342]:
plt.figure(figsize=(18,18))
sns.heatmap(corr_matrix,cmap='coolwarm', vmax=1, vmin=-1, square=True, annot=True, fmt='.2f')
Out[342]:
<AxesSubplot:>

Comprobación de pares de variables correlacionadas

Por medio de la función incluida abajo, se chequean los pares de variables con sus respectivas correlaciones.

In [344]:
def corrank(X):
    import itertools
    df_correlaciones = pd.DataFrame([[(i,j),X.corr().loc[i,j]] for i,j in list(itertools.combinations(X.corr(), 2))],columns=['pares','corr'])    
    return df_correlaciones.sort_values(by='corr',ascending=False)
corrank(corr_matrix).head(5)
Out[344]:
pares corr
318 (job_management, education_tertiary) 0.863244
195 (pdays, previous) 0.724100
13 (age, job_retired) 0.589237
294 (job_blue_collar, education_primary) 0.559119
19 (age, marital_married) 0.535614

En la tabla generada arriba se pueden observar los pares de variables con fuerte correlación positiva.

In [345]:
corrank(corr_matrix).tail(5)
Out[345]:
pares corr
602 (housing_yes, housing_no) -1.0
624 (contact_cellular, contact_telephone) -1.0
615 (loan_no, loan_yes) -1.0
585 (default_no, default_yes) -1.0
629 (Target_no, Target_yes) -1.0

En la tabla generada arriba se pueden observar los pares de variables con fuerte correlación negativa.

Próximamente:

Creación de modelos con spark

Autor:
Marcelo G Gonzalez