Exercise 1

This practical is similar to Practical 1, but adds an extra step: the use of more advanced stacked architectures, either Convolutional Networks or Recurrent Networks. The same dataset from the "PetFinder" competition used in Practical 1 is provided as a starting point, with the addition, in this case, of using the description as an extra feature, together with all the processing that this requires.
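
The code chunks below assume the following packages are loaded (a setup sketch inferred from the functions used throughout; the original setup chunk is not shown):

library(keras)    # text_tokenizer, pad_sequences, layer_* and fit()
library(tm)       # removeWords(), stopwords()
library(stringr)  # str_count(), str_to_lower(), str_replace_all(), str_trim()
library(dplyr)    # %>%, mutate_at(), left_join()
library(tidyr)    # separate()
library(tibble)   # rownames_to_column()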

head(train$Description)
## [1] "Nibble is a 3+ month old ball of cuteness. He is energetic and playful. I rescued a couple of cats a few months ago but could not get them neutered in time as the clinic was fully scheduled. The result was this little kitty. I do not have enough space and funds to care for more cats in my household. Looking for responsible people to take over Nibble's care."                                         
## [2] "Good guard dog, very alert, active, obedience waiting for her good master, plz call or sms for more details if you really get interested, thanks!!"                                                                                                                                                                                                                                                              
## [3] "This handsome yet cute boy is up for adoption. He is the most playful pal we've seen in our puppies. He loves to nibble on shoelaces , Chase you at such a young age. Imagine what a cute brat he will be when he grows. We are looking for a loving home for Hunter , one that will take care of him and give him the love that he needs. Please call urgently if you would like to adopt this cutie."          
## [4] "This is a stray kitten that came to my house. Have been feeding it, but cannot keep it."                                                                                                                                                                                                                                                                                                                         
## [5] "anyone within the area of ipoh or taiping who interested to adopt my cat can contact my father at this number (mazuvil)or can just email me. currently bulat is at my hometown at perak but anyone outside the area still want to adopt can travel there to my hometown.there is a lot of cats in my house rite now..i think i should let one of them go to a better owner who can give better attention to him."
## [6] "healthy and active, feisty kitten found in neighbours' garden. Not sure of sex."

Text preprocessing

The text is preprocessed in three steps: tokenization, building the vocabulary, and word embeddings.

1 - Tokenization

# 1.1 - Convert all words to lowercase

texto <- tolower(train$Description)

# 1.2 - Remove English stop words

texto <- removeWords(texto, stopwords("english"))

# 1.3 - Text length

summary(str_count(texto, '\\S+'))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   14.00   26.00   37.19   47.00 1252.00       6

75% of the descriptions have fewer than 47 words, so we will use 50 words as the cutoff.
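
As a quick check of that cutoff, the fraction of descriptions fully covered by 50 tokens can be computed as follows (a small sketch; the resulting value is not shown here):

# proportion of descriptions with at most 50 whitespace-separated tokens
mean(str_count(texto, '\\S+') <= 50, na.rm = TRUE)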

2 - Vocabulary

max_features: the num_words argument defines the number of words we want to consider (this will be our feature space).

max_features <- 20000

tokenizer <- text_tokenizer(num_words = max_features)

prep_fun <- function(x) {
  x %>%
    # make text lower case
    str_to_lower() %>%
    # remove non-letter symbols
    str_replace_all("[^[:alpha:][:space:]]", "") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ") %>%
    # trim any whitespace left at the ends
    str_trim()
}

texto <- prep_fun(texto)

tokenizer %>% 
  fit_text_tokenizer(texto)

tokenizer$document_count
## [1] 10582
tokenizer$word_index %>%
  head()
## $home
## [1] 1
## 
## $please
## [1] 2
## 
## $dog
## [1] 3
## 
## $can
## [1] 4
## 
## $adoption
## [1] 5
## 
## $cat
## [1] 6
text_seqs <- texts_to_sequences(tokenizer, texto)

text_seqs %>%
  head()
## [[1]]
##  [1] 3646   77   10  885 1649  464   15   28  684   23   33  148   58  180
## [15]   41  363  229 3038 1537   45  179  282  372 3215   20   23  982   14
## [29]  348   66   25 7549   20
## 
## [[2]]
##  [1]    7  185    3  257   29 2166  322    7 1118  983   19  138  267   73
## [15]   58   13  122
## 
## [[3]]
##  [1]   373   268    37   105     5    15  5355   575    44    42  3646
## [12]  7550   958   173    97  2167    37 10131  2326    14    18     1
## [23]  3039    22    25    20     9    16    93     2    19   334    32
## [34]     8   568
## 
## [[4]]
## [1]  82  21 277  30 226  51
## 
## [[5]]
##  [1]    87   473    83  1074  6224    13     8     6     4    11   647
## [12]   358 10132     4    49   125    89  3040  2730  2168    87   200
## [23]    83    81    92     8     4  1429  2730   191    23    30  7551
## [34]    17   212   119    22    74   240    27     4     9   240   112
## 
## [[6]]
## [1]   26   29 1502   21   12  629  668  204 1805

3 - Embedding

maxlen <- 50

texto_train <- text_seqs %>%
  pad_sequences(maxlen = maxlen)

dim(texto_train)
## [1] 10582    50

Dataset preparation

To build the dataset, one-hot-encoded, embedding and numeric inputs are combined. In this case the text is treated as one more layer of the model and trained together with the rest of the network.
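
A minimal sketch of how those one-hot blocks might be built (an assumption: the columns Gender, MaturitySize, Type and the binned Age_r were prepared in Practical 1, and their level counts 3 + 4 + 2 + 4 = 13 match the categorical input defined below):

# one-hot encode the categorical inputs later combined with cbind(...)
# (Age_r is assumed to be the binned age created in Practical 1)
Gender       <- model.matrix(~ factor(train$Gender) - 1)        # 3 columns
MaturitySize <- model.matrix(~ factor(train$MaturitySize) - 1)  # 4 columns
Type         <- model.matrix(~ factor(train$Type) - 1)          # 2 columns
Age_r        <- model.matrix(~ factor(train$Age_r) - 1)         # 4 columns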

Model definition

In this model I will use a word-embedding vector built from the training data, included as one more layer of the model.

# parameters

maxlen <- 50
batch_size <- 32
embedding_dims <- 32
filters <- 32
kernel_size <- 3
hidden_dims <- 32
epochs <- 20

# gender(3) + ms(4) + type(2) + age_r(4)

inp_e1 <- layer_input(shape = c(50), name = 'texto')
inp_e2 <- layer_input(shape = c(1), name = 'breed')
inp_cat <- layer_input(shape = c(13), name = 'categoricas')

emb_out1 <- inp_e1 %>%
  layer_embedding(input_dim = 20000+1, output_dim = embedding_dims, input_length = 50, name = 'descripcion') %>%
  layer_flatten()

emb_out2 <- inp_e2 %>%
  layer_embedding(input_dim = 164+1, output_dim = embedding_dims, input_length = 1, name = 'raza') %>%
  layer_flatten()

model_1 <- layer_concatenate(c(emb_out1, emb_out2, inp_cat)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(0.5) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(0.3) %>%
  layer_dense(units = 12, activation = "relu") %>%
  layer_dropout(0.2) %>%
  layer_dense(units = 5, activation = "softmax", name = "main_output")

model <- keras_model(
  inputs = c(inp_e1, inp_e2, inp_cat),
  outputs = model_1
)

model %>% compile(
  loss = "sparse_categorical_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)
summary(model)
## Model: "model"
## ___________________________________________________________________________
## Layer (type)            Output Shape     Param #  Connected to             
## ===========================================================================
## texto (InputLayer)      [(None, 50)]     0                                 
## ___________________________________________________________________________
## breed (InputLayer)      [(None, 1)]      0                                 
## ___________________________________________________________________________
## descripcion (Embedding) (None, 50, 32)   640032   texto[0][0]              
## ___________________________________________________________________________
## raza (Embedding)        (None, 1, 32)    5280     breed[0][0]              
## ___________________________________________________________________________
## flatten (Flatten)       (None, 1600)     0        descripcion[0][0]        
## ___________________________________________________________________________
## flatten_1 (Flatten)     (None, 32)       0        raza[0][0]               
## ___________________________________________________________________________
## categoricas (InputLayer [(None, 13)]     0                                 
## ___________________________________________________________________________
## concatenate (Concatenat (None, 1645)     0        flatten[0][0]            
##                                                   flatten_1[0][0]          
##                                                   categoricas[0][0]        
## ___________________________________________________________________________
## dense (Dense)           (None, 64)       105344   concatenate[0][0]        
## ___________________________________________________________________________
## dropout (Dropout)       (None, 64)       0        dense[0][0]              
## ___________________________________________________________________________
## dense_1 (Dense)         (None, 32)       2080     dropout[0][0]            
## ___________________________________________________________________________
## dropout_1 (Dropout)     (None, 32)       0        dense_1[0][0]            
## ___________________________________________________________________________
## dense_2 (Dense)         (None, 12)       396      dropout_1[0][0]          
## ___________________________________________________________________________
## dropout_2 (Dropout)     (None, 12)       0        dense_2[0][0]            
## ___________________________________________________________________________
## main_output (Dense)     (None, 5)        65       dropout_2[0][0]          
## ===========================================================================
## Total params: 753,197
## Trainable params: 753,197
## Non-trainable params: 0
## ___________________________________________________________________________
y <- train$AdoptionSpeed
x <- list(texto_train, raza_1, cbind(Gender, MaturitySize, Type, Age_r))
hist <- model %>%
  fit(x, y,
    batch_size = batch_size,
    epochs = epochs,
    validation_split = 0.3
  )
hist
## Trained on 7,407 samples (batch_size=32, epochs=20)
## Final epoch (plot to see history):
##     loss: 0.1692
##      acc: 0.944
## val_loss: 4.646
##  val_acc: 0.3351
plot(hist)

The model overfits very quickly: accuracy reaches 0.94 on the training data but only about 0.33 on the validation data, a value very similar to that of the simpler models from Practical 1.
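
One common way to limit this kind of overfitting (not applied here) is to stop training once the validation loss stops improving; a minimal sketch of the same fit() call with a standard Keras early-stopping callback:

# hypothetical variant of the fit() call above, stopping on val_loss
hist <- model %>%
  fit(x, y,
      batch_size = batch_size,
      epochs = epochs,
      validation_split = 0.3,
      callbacks = list(
        callback_early_stopping(monitor = "val_loss", patience = 3,
                                restore_best_weights = TRUE)
      ))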

Exercise 2

In this exercise we try to use a pre-trained model for the text. The results are not as expected: the model still overfits heavily. We first work on the description variable.

token <- tokenizer %>% 
            fit_text_tokenizer(texto)

word_index <- token$word_index

cat("Tokens únicos", length(word_index))
## Tokens únicos 19625

We will use the pre-trained GloVe model (glove.6B, 100-dimensional vectors).

glove_dir <- 'C:/DiploDatos/Practico_AP/glove6b'
lines <- readLines(file.path(glove_dir, "glove.6B.100d.txt"))

Building the embedding matrix:

p1 <- as.data.frame(lines)
p1$lines <- as.character(p1$lines)

max(str_count(p1$lines, " "))
## [1] 100
p1 <- separate(p1, lines, paste("V", 0:100, sep = ""), sep = " ")

p1 <- p1 %>% 
  mutate_at(vars(-V0), as.numeric)

word_index_2 <- as.data.frame(t(as.data.frame(word_index)))
word_index_2 <- rownames_to_column(word_index_2, "word")
word_index_2$V1 <- NULL

word_index <- left_join(word_index_2, p1, by = c('word' = 'V0'))

When matching the words from the pre-trained model against those appearing in the text, 6,813 words are not found.
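
That count can be checked directly on the joined table, where words without a GloVe vector show up as NA (a small sketch; the original computation is not shown):

# words from the corpus with no GloVe vector, before the NA replacement below
sum(!complete.cases(word_index))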

word_index <- word_index %>% 
  mutate_at(vars(-word), ~replace(., is.na(.), 0))

embedding_matrix <- as.matrix(word_index[, c(2:(embedding_dims+1))])
dim(embedding_matrix)
## [1] 19625    32
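
Two details are worth noting here. First, since embedding_dims is 32, only the first 32 of the 100 GloVe dimensions are kept. Second, pad_sequences reserves index 0 for padding, so strictly the matrix should carry an extra leading row of zeros and the embedding layer an input_dim one larger than the number of words (19625 + 1 = 19626); a minimal sketch of that adjustment (not applied in the model below):

# prepend a zero row so that matrix row i + 1 holds the vector for word index i,
# leaving row 1 for the padding index 0 (sketch only; not used below)
embedding_matrix_padded <- rbind(rep(0, embedding_dims), embedding_matrix)
dim(embedding_matrix_padded)  # 19626 x 32, which would require input_dim = 19626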

Model

# the text now enters through sequence_input (defined below);
# only the breed and categorical inputs are redefined here
inp_e2 <- layer_input(shape = c(1), name = 'breed')
inp_cat <- layer_input(shape = c(13), name = 'categoricas')

embedding_layer <-  
  layer_embedding(
    input_dim = 19625,
    output_dim = 32,
    weights = list(embedding_matrix),
    input_length = 100,
    trainable = FALSE)

sequence_input <- layer_input(shape = 100, dtype='int32')

preds <- sequence_input %>%
  embedding_layer %>%
  layer_conv_1d(filters = 128, kernel_size = 5, activation = 'relu') %>%
  layer_max_pooling_1d(pool_size = 5) %>%
  layer_flatten() %>%
  layer_dropout(0.5) %>%
  layer_dense(units = 5, activation = "softmax", name = "main_output")

model <- keras_model(
  inputs = c(sequence_input, inp_e2, inp_cat),
  outputs = preds)

model %>% compile(
  loss = "sparse_categorical_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)
summary(model)
## Model: "model_1"
## ___________________________________________________________________________
## Layer (type)            Output Shape     Param #  Connected to             
## ===========================================================================
## input_1 (InputLayer)    [(None, 100)]    0                                 
## ___________________________________________________________________________
## embedding (Embedding)   (None, 100, 32)  628000   input_1[0][0]            
## ___________________________________________________________________________
## conv1d (Conv1D)         (None, 96, 128)  20608    embedding[0][0]          
## ___________________________________________________________________________
## max_pooling1d (MaxPooli (None, 19, 128)  0        conv1d[0][0]             
## ___________________________________________________________________________
## flatten_2 (Flatten)     (None, 2432)     0        max_pooling1d[0][0]      
## ___________________________________________________________________________
## dropout_3 (Dropout)     (None, 2432)     0        flatten_2[0][0]          
## ___________________________________________________________________________
## breed (InputLayer)      [(None, 1)]      0                                 
## ___________________________________________________________________________
## categoricas (InputLayer [(None, 13)]     0                                 
## ___________________________________________________________________________
## main_output (Dense)     (None, 5)        12165    dropout_3[0][0]          
## ===========================================================================
## Total params: 660,773
## Trainable params: 32,773
## Non-trainable params: 628,000
## ___________________________________________________________________________
texto_train <- text_seqs %>%
  pad_sequences(maxlen = 100)

y <- train$AdoptionSpeed
x <- list(texto_train, raza_1, cbind(Gender, MaturitySize, Type, Age_r))

dim(texto_train)
## [1] 10582   100
hist <- model %>%
  fit(x, y,
      batch_size = batch_size,
      epochs = epochs,
      validation_split = 0.3
  )

hist
## Trained on 7,407 samples (batch_size=32, epochs=20)
## Final epoch (plot to see history):
##     loss: 0.9048
##      acc: 0.6335
## val_loss: 1.671
##  val_acc: 0.3225
plot(hist)