This practical is similar to Practical 1, but adds one extra step: the use of more advanced network architectures, whether Convolutional Networks or Recurrent Networks. As a starting point you are given the same dataset from the "PetFinder" competition used in Practical 1, with the addition, in this case, of using the description as an extra feature, together with all the processing that this requires.
head(train$Description)
## [1] "Nibble is a 3+ month old ball of cuteness. He is energetic and playful. I rescued a couple of cats a few months ago but could not get them neutered in time as the clinic was fully scheduled. The result was this little kitty. I do not have enough space and funds to care for more cats in my household. Looking for responsible people to take over Nibble's care."
## [2] "Good guard dog, very alert, active, obedience waiting for her good master, plz call or sms for more details if you really get interested, thanks!!"
## [3] "This handsome yet cute boy is up for adoption. He is the most playful pal we've seen in our puppies. He loves to nibble on shoelaces , Chase you at such a young age. Imagine what a cute brat he will be when he grows. We are looking for a loving home for Hunter , one that will take care of him and give him the love that he needs. Please call urgently if you would like to adopt this cutie."
## [4] "This is a stray kitten that came to my house. Have been feeding it, but cannot keep it."
## [5] "anyone within the area of ipoh or taiping who interested to adopt my cat can contact my father at this number (mazuvil)or can just email me. currently bulat is at my hometown at perak but anyone outside the area still want to adopt can travel there to my hometown.there is a lot of cats in my house rite now..i think i should let one of them go to a better owner who can give better attention to him."
## [6] "healthy and active, feisty kitten found in neighbours' garden. Not sure of sex."
Tokenization, vocabulary and word embeddings
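The code below assumes the following packages are loaded (a setup sketch; they are presumably attached earlier in the notebook):
library(keras)    # tokenizer, embeddings, model layers
library(tm)       # removeWords(), stopwords()
library(stringr)  # str_count(), str_to_lower(), str_replace_all(), str_trim()
library(dplyr)    # %>%, mutate_at(), left_join()
library(tidyr)    # separate()
library(tibble)   # rownames_to_column()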
# 1.1 - Convert all words to lowercase
texto <- tolower(train$Description)
# 1.2 - Remove English stop words
texto <- removeWords(texto, stopwords("english"))
# 1.3 - Text length in words
summary(str_count(texto, '\\S+'))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 14.00 26.00 37.19 47.00 1252.00 6
75% of the descriptions have fewer than 47 words; we will use 50 as the cut-off.
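Before fixing the cut-off, a quick hypothetical check (not in the original code) of how many descriptions would actually be truncated at 50 tokens:
# share of stop-word-filtered descriptions longer than 50 tokens
mean(str_count(texto, '\\S+') > 50, na.rm = TRUE)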
max_features: the num_words argument defines the number of words we want to keep, i.e. the vocabulary size (this will be our feature space).
max_features <- 20000
tokenizer <- text_tokenizer(num_words = max_features)
prep_fun <- function(x) {
  x %>%
    # make text lower case
    str_to_lower() %>%
    # remove non-letter symbols
    str_replace_all("[^[:alpha:][:space:]]", "") %>%
    # collapse multiple spaces (note the doubled backslash required in R strings)
    str_replace_all("\\s+", " ") %>%
    # trim any leading/trailing whitespace left
    str_trim()
}
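To illustrate what prep_fun does, a quick hypothetical check on a raw description:
prep_fun("Nibble is a 3+ month old ball of cuteness!!")
# returns something like "nibble is a month old ball of cuteness"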
texto <- prep_fun(texto)
tokenizer %>%
fit_text_tokenizer(texto)
tokenizer$document_count
## [1] 10582
tokenizer$word_index %>%
head()
## $home
## [1] 1
##
## $please
## [1] 2
##
## $dog
## [1] 3
##
## $can
## [1] 4
##
## $adoption
## [1] 5
##
## $cat
## [1] 6
text_seqs <- texts_to_sequences(tokenizer, texto)
text_seqs %>%
head()
## [[1]]
## [1] 3646 77 10 885 1649 464 15 28 684 23 33 148 58 180
## [15] 41 363 229 3038 1537 45 179 282 372 3215 20 23 982 14
## [29] 348 66 25 7549 20
##
## [[2]]
## [1] 7 185 3 257 29 2166 322 7 1118 983 19 138 267 73
## [15] 58 13 122
##
## [[3]]
## [1] 373 268 37 105 5 15 5355 575 44 42 3646
## [12] 7550 958 173 97 2167 37 10131 2326 14 18 1
## [23] 3039 22 25 20 9 16 93 2 19 334 32
## [34] 8 568
##
## [[4]]
## [1] 82 21 277 30 226 51
##
## [[5]]
## [1] 87 473 83 1074 6224 13 8 6 4 11 647
## [12] 358 10132 4 49 125 89 3040 2730 2168 87 200
## [23] 83 81 92 8 4 1429 2730 191 23 30 7551
## [34] 17 212 119 22 74 240 27 4 9 240 112
##
## [[6]]
## [1] 26 29 1502 21 12 629 668 204 1805
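As a quick sanity check (not part of the original pipeline), the integer codes can be decoded back to words by inverting word_index:
# invert the word -> index map and decode the first ten codes of the first description
index_to_word <- setNames(names(tokenizer$word_index), unlist(tokenizer$word_index))
index_to_word[as.character(text_seqs[[1]][1:10])]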
maxlen <- 50
texto_train <- text_seqs %>%
pad_sequences(maxlen = maxlen)
dim(texto_train)
## [1] 10582 50
To assemble the dataset we combine one-hot-encoded, embedded and numeric features. In this case the text is treated as one more input of the model and trained within the same network.
In this model I use a word-embedding layer built from the training data, included as one more layer of the model.
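The non-text inputs used further below (raza_1, an integer-encoded breed, plus the one-hot matrices Gender, MaturitySize, Type and Age_r) are assumed to come from the preprocessing of Practical 1. A minimal sketch of how they could be built follows; the column names, the use of Breed1 and the category counts are assumptions, not the original code:
# hypothetical sketch of the non-text inputs
raza_1       <- as.integer(factor(train$Breed1)) - 1            # breed codes 0..164 for the embedding
Gender       <- model.matrix(~ factor(train$Gender) - 1)        # 3 one-hot columns
MaturitySize <- model.matrix(~ factor(train$MaturitySize) - 1)  # 4 one-hot columns
Type         <- model.matrix(~ factor(train$Type) - 1)          # 2 one-hot columns
Age_r        <- model.matrix(~ factor(train$Age_r) - 1)         # 4 one-hot columns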
# parameters
maxlen <- 50
batch_size <- 32
embedding_dims <- 32
filters <- 32
kernel_size <- 3
hidden_dims <- 32
epochs <- 20
# one-hot widths: Gender(3) + MaturitySize(4) + Type(2) + Age_r(4) = 13 columns of inp_cat
inp_e1 <- layer_input(shape = c(50), name = 'texto')
inp_e2 <- layer_input(shape = c(1), name = 'breed')
inp_cat <- layer_input(shape = c(13), name = 'categoricas')
emb_out1 <- inp_e1 %>%
  layer_embedding(input_dim = 20000+1, output_dim = embedding_dims, input_length = 50, name = 'descripcion') %>%
  layer_flatten()
emb_out2 <- inp_e2 %>%
  layer_embedding(input_dim = 164+1, output_dim = embedding_dims, input_length = 1, name = 'raza') %>%
  layer_flatten()
model_1 <- layer_concatenate(c(emb_out1, emb_out2, inp_cat)) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dropout(0.5) %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dropout(0.3) %>%
layer_dense(units = 12, activation = "relu") %>%
layer_dropout(0.2) %>%
layer_dense(units = 5, activation = "softmax", name = "main_output")
model <- keras_model(
inputs = c(inp_e1, inp_e2, inp_cat),
outputs = model_1
)
model %>% compile(
loss = "sparse_categorical_crossentropy",
optimizer = "adam",
metrics = "accuracy"
)
summary(model)
## Model: "model"
## ___________________________________________________________________________
## Layer (type) Output Shape Param # Connected to
## ===========================================================================
## texto (InputLayer) [(None, 50)] 0
## ___________________________________________________________________________
## breed (InputLayer) [(None, 1)] 0
## ___________________________________________________________________________
## descripcion (Embedding) (None, 50, 32) 640032 texto[0][0]
## ___________________________________________________________________________
## raza (Embedding) (None, 1, 32) 5280 breed[0][0]
## ___________________________________________________________________________
## flatten (Flatten) (None, 1600) 0 descripcion[0][0]
## ___________________________________________________________________________
## flatten_1 (Flatten) (None, 32) 0 raza[0][0]
## ___________________________________________________________________________
## categoricas (InputLayer [(None, 13)] 0
## ___________________________________________________________________________
## concatenate (Concatenat (None, 1645) 0 flatten[0][0]
## flatten_1[0][0]
## categoricas[0][0]
## ___________________________________________________________________________
## dense (Dense) (None, 64) 105344 concatenate[0][0]
## ___________________________________________________________________________
## dropout (Dropout) (None, 64) 0 dense[0][0]
## ___________________________________________________________________________
## dense_1 (Dense) (None, 32) 2080 dropout[0][0]
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 32) 0 dense_1[0][0]
## ___________________________________________________________________________
## dense_2 (Dense) (None, 12) 396 dropout_1[0][0]
## ___________________________________________________________________________
## dropout_2 (Dropout) (None, 12) 0 dense_2[0][0]
## ___________________________________________________________________________
## main_output (Dense) (None, 5) 65 dropout_2[0][0]
## ===========================================================================
## Total params: 753,197
## Trainable params: 753,197
## Non-trainable params: 0
## ___________________________________________________________________________
y <- train$AdoptionSpeed
x <- list(texto_train, raza_1, cbind(Gender, MaturitySize, Type, Age_r))
hist <- model %>%
fit(x, y,
batch_size = batch_size,
epochs = epochs,
validation_split = 0.3
)
hist
## Trained on 7,407 samples (batch_size=32, epochs=20)
## Final epoch (plot to see history):
## loss: 0.1692
## acc: 0.944
## val_loss: 4.646
## val_acc: 0.3351
plot(hist)
The model overfits very quickly: accuracy of 0.94 on the training data versus roughly 0.33 on the validation data, a value very similar to the simpler models from Practical 1.
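One simple knob not explored in this run is early stopping, which halts training once the validation loss stops improving; a minimal sketch with a keras callback (hypothetical, not part of the original experiment):
# hypothetical: stop training when val_loss stops improving for 3 epochs
early_stop <- callback_early_stopping(monitor = "val_loss", patience = 3,
                                      restore_best_weights = TRUE)
hist <- model %>%
  fit(x, y,
      batch_size = batch_size,
      epochs = epochs,
      validation_split = 0.3,
      callbacks = list(early_stop))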
At this point we try a pre-trained model for the text. The results are not what we hoped for; the model still overfits heavily. We start by working on the Description variable.
token <- tokenizer %>%
fit_text_tokenizer(texto)
word_index <- token$word_index
cat("Tokens únicos", length(word_index))
## Tokens únicos 19625
We will use the pre-trained GloVe embeddings.
glove_dir = 'C:/DiploDatos/Practico_AP/glove6b'
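# hypothetical setup step, not in the original script: download and unpack the
# standard GloVe 6B release the first time, if the 100d file is not present yet
if (!file.exists(file.path(glove_dir, "glove.6B.100d.txt"))) {
  download.file("http://nlp.stanford.edu/data/glove.6B.zip", "glove.6B.zip", mode = "wb")
  unzip("glove.6B.zip", exdir = glove_dir)
}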
lines <- readLines(file.path(glove_dir, "glove.6B.100d.txt"))
Building the embedding matrix:
# one row per line of the GloVe file: a word followed by its 100 components
p1 <- as.data.frame(lines)
p1$lines <- as.character(p1$lines)
max(str_count(p1$lines, " "))
## [1] 100
# split each line into the word (V0) and its 100 numeric components (V1..V100)
p1 <- separate(p1, lines, paste("V", 0:100, sep = ""), sep = " ")
p1 <- p1 %>%
  mutate_at(vars(-V0), as.numeric)
# turn the tokenizer's word_index (named list of word -> index) into a data frame
word_index_2 <- as.data.frame(t(as.data.frame(word_index)))
word_index_2 <- rownames_to_column(word_index_2, "word")
# drop the index column, keeping only the words
word_index_2$V1 <- NULL
# attach the GloVe vector (if any) to each word of the corpus vocabulary
word_index <- left_join(word_index_2, p1, by = c('word' = 'V0'))
When matching the corpus vocabulary against the words of the pre-trained model, 6,813 words are not found.
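That count can be reproduced by counting the vocabulary rows whose GloVe columns stayed NA after the join (a quick check, assuming the column layout built above):
# words from the corpus vocabulary without a GloVe vector (NA after the left join)
sum(is.na(word_index$V1))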
# words without a GloVe vector get a zero embedding
word_index <- word_index %>%
  mutate_at(vars(-word), ~replace(., is.na(.), 0))
# keep columns 2 to 33, i.e. the first embedding_dims (32) of the 100 GloVe dimensions
embedding_matrix <- as.matrix(word_index[, c(2:(embedding_dims+1))])
dim(embedding_matrix)
## [1] 19625 32
inp_e1 <- layer_input(shape = c(100), name = 'texto')
inp_e2 <- layer_input(shape = c(1), name = 'breed')
inp_cat <- layer_input(shape = c(13), name = 'categoricas')
embedding_layer <-
layer_embedding(
input_dim = 19625,
output_dim = 32,
weights = list(embedding_matrix),
input_length = 100,
trainable = FALSE)
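As an aside (not what the model below uses): the keras tokenizer assigns word indices starting at 1 and reserves 0 for padding, so a common variant prepends a zero row to the matrix and sizes the layer one row larger:
# hypothetical variant, not used below: reserve row 0 for the padding index
embedding_matrix_padded <- rbind(rep(0, embedding_dims), embedding_matrix)
dim(embedding_matrix_padded)  # 19626 x 32, paired with input_dim = 19626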
sequence_input <- layer_input(shape = 100, dtype='int32')
preds <- sequence_input %>%
embedding_layer %>%
layer_conv_1d(filters = 128, kernel_size = 5, activation = 'relu') %>%
layer_max_pooling_1d(pool_size = 5) %>%
layer_flatten() %>%
layer_dropout(0.5) %>%
layer_dense(units = 5, activation = "softmax", name = "main_output")
model <- keras_model(
inputs = c(sequence_input, inp_e2, inp_cat),
outputs = preds)
model %>% compile(
loss = "sparse_categorical_crossentropy",
optimizer = "adam",
metrics = "accuracy"
)
summary(model)
## Model: "model_1"
## ___________________________________________________________________________
## Layer (type) Output Shape Param # Connected to
## ===========================================================================
## input_1 (InputLayer) [(None, 100)] 0
## ___________________________________________________________________________
## embedding (Embedding) (None, 100, 32) 628000 input_1[0][0]
## ___________________________________________________________________________
## conv1d (Conv1D) (None, 96, 128) 20608 embedding[0][0]
## ___________________________________________________________________________
## max_pooling1d (MaxPooli (None, 19, 128) 0 conv1d[0][0]
## ___________________________________________________________________________
## flatten_2 (Flatten) (None, 2432) 0 max_pooling1d[0][0]
## ___________________________________________________________________________
## dropout_3 (Dropout) (None, 2432) 0 flatten_2[0][0]
## ___________________________________________________________________________
## breed (InputLayer) [(None, 1)] 0
## ___________________________________________________________________________
## categoricas (InputLayer [(None, 13)] 0
## ___________________________________________________________________________
## main_output (Dense) (None, 5) 12165 dropout_3[0][0]
## ===========================================================================
## Total params: 660,773
## Trainable params: 32,773
## Non-trainable params: 628,000
## ___________________________________________________________________________
texto_train <- text_seqs %>%
pad_sequences(maxlen = 100)
y <- train$AdoptionSpeed
x <- list(texto_train, raza_1, cbind(Gender, MaturitySize, Type, Age_r))
dim(texto_train)
## [1] 10582 100
hist <- model %>%
fit(x, y,
batch_size = batch_size,
epochs = epochs,
validation_split = 0.3
)
hist
## Trained on 7,407 samples (batch_size=32, epochs=20)
## Final epoch (plot to see history):
## loss: 0.9048
## acc: 0.6335
## val_loss: 1.671
## val_acc: 0.3225
plot(hist)