Sentiment analysis is used to analyze feedback from users. By analyzing their words and comments, we can understand how our service is perceived and improve it. Feedback text, especially from social media platforms, comes in many forms: users mix in punctuation, extra whitespace, URLs, and slang words.
Before we work with deep learning, we first load the required libraries:
library(tidyverse)## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(keras)
use_condaenv("r-tensorflow")
library(textclean)
library(tm)## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
We will use financial news as our data. Read the data and rename the columns:
data <- read_csv("data/finance_news.csv", col_names = F) %>%
rename(
label = X1,
text = X2
)##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_character(),
## X2 = col_character()
## )
head(data)## # A tibble: 6 x 2
## label text
## <chr> <chr>
## 1 neutral According to Gran , the company has no plans to move all production ~
## 2 neutral Technopolis plans to develop in stages an area of no less than 100,0~
## 3 negative The international electronic industry company Elcoteq has laid off t~
## 4 positive With the new production plant the company would increase its capacit~
## 5 positive According to the company 's updated strategy for the years 2009-2012~
## 6 positive FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is aggressively pursuing it~
We will clean the data by:
- Replacing URLs & HTML
- Removing usernames & hashtags
- Replacing contractions
- Replacing word elongation
- Replacing slang words
- Removing/replacing certain characters
- Removing punctuation
- Lower-casing the text
- Removing extra whitespace
data <- data %>%
mutate(
text = text %>%
replace_url() %>%
replace_html() %>%
str_remove_all("@([0-9a-zA-Z_]+)") %>%
str_remove_all("#([0-9a-zA-Z_]+)") %>%
str_replace_all("[\\?]+", " questionmark ") %>%
str_replace_all("[\\!]+", " exclamationmark ") %>%
str_remove_all('[\\&]+') %>%
str_remove_all('[\\"]+') %>%
replace_contraction() %>%
replace_word_elongation() %>%
replace_internet_slang() %>%
str_remove_all(pattern = "[[:digit:]]") %>% # remove number
str_remove_all(pattern = "[[:punct:]]") %>%
str_remove_all(pattern = "\\$") %>% # remove dollar sign
str_to_lower() %>%
str_squish(),
label = base::factor(label, levels = c("negative", "neutral", "positive")) %>%
as.numeric() %>% {. - 1}
) %>%
select(text, label) %>%
na.omit() # remove NA
head(data)## # A tibble: 6 x 2
## text label
## <chr> <dbl>
## 1 according to gran the company has no plans to move all production to ru~ 1
## 2 technopolis plans to develop in stages an area of no less than square m~ 1
## 3 the international electronic industry company elcoteq has laid off tens~ 0
## 4 with the new production plant the company would increase its capacity t~ 2
## 5 according to the company s updated strategy for the years basware targe~ 2
## 6 financing of aspocomp s growth aspocomp is aggressively pursuing its gr~ 2
rm.stopwords <- VCorpus(VectorSource(data$text)) %>%
tm_map(removeWords,stopwords("en")) %>%
tm_map(stripWhitespace) %>%
sapply(as.character) %>%
as.data.frame(stringsAsFactors = FALSE)
#rm.stopwords
data.clean <- bind_cols(rm.stopwords, data[,2]) %>%
`colnames<-`(c("text","label"))
# glimpse(data.clean)
# head(data.clean)After we perform the cleaning process, we will use keras text_tokenizer() to transform each word as separate tokens. The num_words parameter sets maximum number of words to consider as features.
Giving 1024 as the input means, from the total unique words that we have, we’d only use 1024 of them to make the model.
num_words <- 1024
# prepare tokenizers
tokenizer <- text_tokenizer(num_words = num_words, lower = TRUE) %>%
fit_text_tokenizer(data.clean$text)
paste(
"Total Unique Words:", length(tokenizer$word_counts),"|",
"Total Features:", num_words
)## [1] "Total Unique Words: 9341 | Total Features: 1024"
We need to split the data into train, validation, and test sets. These parts will be used as follows:
- the training set is used to train the model;
- the validation set is used to validate hyperparameter tuning of the models;
- the test set is used to evaluate the model on unseen data.
library(rsample)
set.seed(100)
# split into train - test
split <- initial_split(data.clean, strata = "label")
data_train <- training(split)
data_test <- testing(split)
# split data test to test - validation
split_val <- initial_split(data_test, prop = 0.5, strata = "label")
data_val <- training(split_val)
data_test <- testing(split_val)
In this sentiment analysis we will not ignore word order, because the sequence of words affects the meaning of the sentiment itself.
maxlen <- max(str_count(data.clean$text, "\\w+")) + 1 # Text cutoff
# prepare x
data_train_x <- texts_to_sequences(tokenizer, data_train$text) %>%
pad_sequences(maxlen = maxlen)
data_val_x <- texts_to_sequences(tokenizer, data_val$text) %>%
pad_sequences(maxlen = maxlen)
data_test_x <- texts_to_sequences(tokenizer, data_test$text) %>%
pad_sequences(maxlen = maxlen)
# prepare y
data_train_y <- to_categorical(data_train$label, num_classes = 3)
data_val_y <- to_categorical(data_val$label, num_classes = 3)
data_test_y <- to_categorical(data_test$label, num_classes = 3)
Embedding Layer
The embedding layer can only be used as the initial/first layer of the LSTM architecture. In deep learning frameworks such as Keras, the embedding layer trains the text data into numerical vectors that represent the closeness of meaning between words.
The embedding layer accepts several parameters. Some examples are input_dim (the number of unique words in the vocabulary), input_length (the maximum length of each input sequence), and output_dim (the dimension of the embedding vectors).
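As an illustration (a fragment shown on its own, mirroring the values we use in the model below), the embedding layer with those parameters annotated looks like this:
# illustrative fragment: the embedding layer and its main parameters
layer_embedding(
  input_dim = num_words,   # vocabulary size: number of distinct integer tokens kept
  input_length = maxlen,   # length of each (padded) input sequence
  output_dim = 32          # size of the dense vector each word index is mapped to
)
# after the model below has been trained, the learned word vectors could be inspected with
# get_weights(model)[[1]], a num_words x 32 matrix with one row per word index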
Deep Neural Layer
The deep network layer accepts the embedding matrix as input and compresses it into a smaller dimension. This compressed representation captures the information contained in the data. For text data, the deep learning architectures most commonly used are RNNs, in particular LSTM and GRU.
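As an aside (not something we do in this post), the LSTM layer used below could be swapped for a GRU layer with essentially the same arguments; a hypothetical sketch:
# hypothetical alternative: a GRU layer in place of the LSTM used below
layer_gru(
  units = maxlen,
  dropout = 0.25,
  recurrent_dropout = 0.25,
  return_sequences = FALSE
)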
Output Layer
The output layer is the last layer in the deep learning architecture. In Keras we use layer_dense, where we set the units parameter, i.e., how many neurons we want. In this case we use 3 units, because we have 3 classes (negative, neutral, positive).
# initiate keras model sequence
model <- keras_model_sequential()
# model
model %>%
# layer input
layer_embedding(
name = "input",
input_dim = num_words,
input_length = maxlen,
output_dim = 32
) %>%
# layer dropout
layer_dropout(
name = "embedding_dropout",
rate = 0.5
) %>%
# layer lstm 1
layer_lstm(
name = "lstm",
units = maxlen,
dropout = 0.25,
recurrent_dropout = 0.25,
return_sequences = FALSE
) %>%
# layer output
layer_dense(
name = "output",
units = 3,
activation = "softmax"
)
Dropout parameters are added to reduce the risk of overfitting. Dropout rates range from 0 to 1; values between 0.2 and 0.5 are the most commonly used. Rates closer to 0 tend to allow overfitting, while rates closer to 1 risk underfitting.
For two-class problems, the loss function used is binary_crossentropy, while multiclass cases use categorical_crossentropy. These are not the only options, but they are the most common loss functions for classification tasks.
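For contrast, a hypothetical two-class setup (the name model_binary is made up for illustration) would end in a single sigmoid unit and be compiled with binary_crossentropy, whereas our three-class model below uses softmax with categorical_crossentropy:
# hypothetical two-class variant, only to contrast the choice of loss function
model_binary <- keras_model_sequential() %>%
  layer_embedding(input_dim = num_words, input_length = maxlen, output_dim = 32) %>%
  layer_lstm(units = maxlen) %>%
  layer_dense(units = 1, activation = "sigmoid")   # single output unit for 2 classes
model_binary %>% compile(
  optimizer = "adam",
  metrics = "accuracy",
  loss = "binary_crossentropy"   # instead of categorical_crossentropy
)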
# compile the model
model %>% compile(
optimizer = "adam",
metrics = "accuracy",
loss = "categorical_crossentropy"
)
# model summary
summary(model)## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input (Embedding) (None, 39, 32) 32768
## ________________________________________________________________________________
## embedding_dropout (Dropout) (None, 39, 32) 0
## ________________________________________________________________________________
## lstm (LSTM) (None, 39) 11232
## ________________________________________________________________________________
## output (Dense) (None, 3) 120
## ================================================================================
## Total params: 44,120
## Trainable params: 44,120
## Non-trainable params: 0
## ________________________________________________________________________________
# model fit settings
epochs <- 10
batch_size <- 128
# fit the model
history <- model %>% fit(
data_train_x, data_train_y,
batch_size = batch_size,
epochs = epochs,
verbose = 1,
validation_data = list(
data_test_x, data_test_y
)
)
# history plot
plot(history)## `geom_smooth()` using formula 'y ~ x'
The accuracy of the model above on unseen data is:
# predict on test
data_test_pred <- model %>%
predict_classes(data_test_x) %>%
as.vector()
# performance on "unseen data"
yardstick::accuracy_vec(
truth = factor(data_test$label,labels = c("negative", "neutral", "positive")),
estimate = factor(data_test_pred, labels = c("negative", "neutral", "positive"))
)## [1] 0.7599174
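A side note not covered in the original workflow: on newer keras/TensorFlow versions the predict_classes() helper has been removed; an equivalent sketch is to take the row-wise argmax of predict():
# hedged alternative to predict_classes() for newer keras/TensorFlow versions
pred_prob <- predict(model, data_test_x)   # matrix of class probabilities (rows = observations)
pred_class <- max.col(pred_prob) - 1       # 0-based class index, matching predict_classes()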
We will fine-tune our model in order to increase the accuracy and decrease the loss.
Fine-tuning hyperparameters means changing a few variables before training, such as the learning rate, batch size, and number of epochs. The parameters are adjusted manually (trial and error). We can also improve the quality of the data during preprocessing. In addition, we can add dropout layers; this method is used to avoid overfitting.
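One knob mentioned above but never changed in the models below is the learning rate. A hedged sketch of setting it explicitly (the argument may be named lr on older keras versions), plus an optional early-stopping callback against overfitting:
# illustrative sketch only; the tuned models below keep optimizer = "adam" with its defaults
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),  # `lr` on older keras versions
  metrics = "accuracy",
  loss = "categorical_crossentropy"
)
# an early-stopping callback could also be passed to fit(), e.g.
# callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 3))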
# initiate keras model sequence
model1 <- keras_model_sequential()
# model
model1 %>%
# layer input
layer_embedding(
name = "input",
input_dim = num_words,
input_length = maxlen,
output_dim = 32
) %>%
# layer dropout
layer_dropout(
name = "embedding_dropout",
rate = 0.3 #tuning
) %>%
# layer lstm 1
layer_lstm(
name = "lstm",
units = maxlen,
dropout = 0.25,
recurrent_dropout = 0.25,
return_sequences = FALSE
) %>%
# layer output
layer_dense(
name = "output",
units = 3,
activation = "softmax"
)
# compile the model
model1 %>% compile(
optimizer = "adam",
metrics = "accuracy",
loss = "categorical_crossentropy"
)
# model summary
summary(model1)## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input (Embedding) (None, 39, 32) 32768
## ________________________________________________________________________________
## embedding_dropout (Dropout) (None, 39, 32) 0
## ________________________________________________________________________________
## lstm (LSTM) (None, 39) 11232
## ________________________________________________________________________________
## output (Dense) (None, 3) 120
## ================================================================================
## Total params: 44,120
## Trainable params: 44,120
## Non-trainable params: 0
## ________________________________________________________________________________
# model fit settings
epochs <- 20 #tuning
batch_size <- 128
# fit the model
history1 <- model1 %>% fit(
data_train_x, data_train_y,
batch_size = batch_size,
epochs = epochs,
verbose = 1,
validation_data = list(
data_test_x, data_test_y
)
)
# history plot
plot(history1)## `geom_smooth()` using formula 'y ~ x'
After tuning the model, the accuracy of model1 on unseen data is:
# predict on test
data_test_pred1 <- model1 %>%
predict_classes(data_test_x) %>%
as.vector()
# performance on "unseen data"
yardstick::accuracy_vec(
truth = factor(data_test$label,labels = c("negative", "neutral", "positive")),
estimate = factor(data_test_pred1, labels = c("negative", "neutral", "positive"))
)## [1] 0.8553719
# initiate keras model sequence
model2 <- keras_model_sequential()
# model
model2 %>%
# layer input
layer_embedding(
name = "input",
input_dim = num_words,
input_length = maxlen,
output_dim = 32
) %>%
# layer dropout
layer_dropout(
name = "embedding_dropout",
rate = 0.3
) %>%
# layer lstm 1
layer_lstm(
name = "lstm",
units = maxlen,
dropout = 0.35, #tuning
recurrent_dropout = 0.25,
return_sequences = FALSE
) %>%
# layer output
layer_dense(
name = "output",
units = 3,
activation = "softmax"
)
# compile the model
model2 %>% compile(
optimizer = "adam",
metrics = "accuracy",
loss = "categorical_crossentropy"
)
# model summary
summary(model2)## Model: "sequential_2"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input (Embedding) (None, 39, 32) 32768
## ________________________________________________________________________________
## embedding_dropout (Dropout) (None, 39, 32) 0
## ________________________________________________________________________________
## lstm (LSTM) (None, 39) 11232
## ________________________________________________________________________________
## output (Dense) (None, 3) 120
## ================================================================================
## Total params: 44,120
## Trainable params: 44,120
## Non-trainable params: 0
## ________________________________________________________________________________
# model fit settings
epochs <- 20
batch_size <- 32 #tuning
# fit the model
history2 <- model2 %>% fit(
data_train_x, data_train_y,
batch_size = batch_size,
epochs = epochs,
verbose = 1,
validation_data = list(
data_test_x, data_test_y
)
)
# history plot
plot(history2)## `geom_smooth()` using formula 'y ~ x'
After tuning the model, the accuracy of model2 on unseen data is:
# predict on test
data_test_pred2 <- model2 %>%
predict_classes(data_test_x) %>%
as.vector()
# performance on "unseen data"
yardstick::accuracy_vec(
truth = factor(data_test$label,labels = c("negative", "neutral", "positive")),
estimate = factor(data_test_pred2, labels = c("negative", "neutral", "positive"))
)## [1] 0.872314
# initiate keras model sequence
model3 <- keras_model_sequential()
# model
model3 %>%
# layer input
layer_embedding(
name = "input",
input_dim = num_words,
input_length = maxlen,
output_dim = 32
) %>%
# layer dropout
layer_dropout(
name = "embedding_dropout",
rate = 0.3
) %>%
# layer dropout
layer_dropout(
name = "embedding_dropout2",
rate = 0.2
) %>%
# layer lstm 1
layer_lstm(
name = "lstm",
units = maxlen,
dropout = 0.35, #tuning
recurrent_dropout = 0.25,
return_sequences = FALSE
) %>%
# layer output
layer_dense(
name = "output",
units = 3,
activation = "softmax"
)
# compile the model
model3 %>% compile(
optimizer = "adam",
metrics = "accuracy",
loss = "categorical_crossentropy"
)
# model summary
summary(model3)## Model: "sequential_3"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input (Embedding) (None, 39, 32) 32768
## ________________________________________________________________________________
## embedding_dropout (Dropout) (None, 39, 32) 0
## ________________________________________________________________________________
## embedding_dropout2 (Dropout) (None, 39, 32) 0
## ________________________________________________________________________________
## lstm (LSTM) (None, 39) 11232
## ________________________________________________________________________________
## output (Dense) (None, 3) 120
## ================================================================================
## Total params: 44,120
## Trainable params: 44,120
## Non-trainable params: 0
## ________________________________________________________________________________
# model fit settings
epochs <- 30
batch_size <- 32
# fit the model
history3 <- model3 %>% fit(
data_train_x, data_train_y,
batch_size = batch_size,
epochs = epochs,
verbose = 1,
validation_data = list(
data_test_x, data_test_y
)
)
# history plot
plot(history3)## `geom_smooth()` using formula 'y ~ x'
We tried increasing the number of epochs and adding another dropout layer; the accuracy of model3 increased to 88.64%. So we will use model3 to predict on the test data:
# predict on test
data_test_pred3 <- model3 %>%
predict_classes(data_test_x) %>%
as.vector()
# performance on "unseen data"
yardstick::accuracy_vec(
truth = factor(data_test$label,labels = c("negative", "neutral", "positive")),
estimate = factor(data_test_pred3, labels = c("negative", "neutral", "positive"))
)## [1] 0.8780992
Bonus: after creating and tuning the model, we will interpret the text classification with LIME.
Library setup
# Data Wrangling
library(SnowballC)
# Model Fitting and Evaluation
library(caret)## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(e1071)##
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
##
## permutations
# Model Interpretation
library(lime)##
## Attaching package: 'lime'
## The following object is masked from 'package:dplyr':
##
## explain
#Graphical
library(htmlwidgets)
library(shiny)
library(shinythemes)
We have the financial news as our data; let's try to do the model fitting:
text <- data.clean$text
y_train <- data.clean$label
max_features <- 1000
tokenizer <- text_tokenizer(num_words = max_features)
tokenizer %>% fit_text_tokenizer(text)
Via the tokenizer object you can check word indices, word counts, and other interesting properties.
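For example, a quick peek at the fitted tokenizer (a small illustrative check, not required for the pipeline):
head(tokenizer$word_index, 5)    # word -> integer index, ordered by frequency
head(tokenizer$word_counts, 5)   # word -> raw count in the corpus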
# tokenizer$word_counts
# tokenizer$word_index
# Finally, we can replace words in the dataset with integers
text_seqs <- texts_to_sequences(tokenizer, text)
text_seqs %>% head(3)## [[1]]
## [1] 46 3 217 503 42 81 3 589
##
## [[2]]
## [1] 674 217 711 112 977 475 67 47 633 426 590 752 6
##
## [[3]]
## [1] 181 504 78 3 218 92 978 174 82 427 3 159 792 712 146
# As a final step, restrict the maximum length of all sequences and create a matrix as input for model
x_train <- text_seqs %>% pad_sequences(maxlen = maxlen)
# Let's print the first 2 rows and see that shorter sequences are left-padded with zeros up to maxlen
x_train[1:2,]## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,] 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0 0 0
## [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## [1,] 0 0 0 0 0 46 3 217 503 42 81 3
## [2,] 674 217 711 112 977 475 67 47 633 426 590 752
## [,39]
## [1,] 589
## [2,] 6
# In order to explain a text with LIME, we need to write a preprocess function
# that turns words into integers. Therefore, the encoding steps described above
# have to be repeated, but inside a function.
# As we already have a fitted tokenizer object, we can apply it to train/test data or to new text.
get_embedding_explanation <- function(text) {
tokenizer %>% fit_text_tokenizer(text)
text_to_seq <- texts_to_sequences(tokenizer, text)
text_to_seq %>% pad_sequences(maxlen = maxlen)
}
# Let's choose some text (3 rows) to explain
sentence_to_explain <- data.clean$text[10:12]
# Notice that our input is just plain text. Unlike tabular data, the lime function
# for text classification requires a preprocess function, because it is needed
# to convert the text into integers.
explainer <- lime(sentence_to_explain, model = model3, preprocess = get_embedding_explanation)
# Get explanation for the first 10 words
explanation <- explain(sentence_to_explain, explainer, n_labels = 1, n_features = 10, n_permutations = 1e4)
# Different graphical ways to show the same information
sentence_to_explain## [1] "operating profit totalled eur mn eur mn representing net sales"
## [2] "teliasonera tlsn said offer line strategy increase ownership core business holdings strengthen eesti telekom s offering customers"
## [3] "stora enso norske skog mreal upmkymmene credit suisse first boston cfsb raised fair value shares four largest nordic forestry groups"
plot_text_explanations(explanation)
plot_features(explanation)
Interactive way to test the model
interactive_text_explanations(explainer)
Since Shiny applications are not supported in static R Markdown documents, below is a screenshot:
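Alternatively, since the interactive Shiny view cannot run in a static document, the text explanation could be saved as a standalone HTML file; this sketch assumes plot_text_explanations() returns an htmlwidget object:
# hedged sketch: save the text explanations as a self-contained HTML file
widget <- plot_text_explanations(explanation)
htmlwidgets::saveWidget(widget, "lime_text_explanations.html", selfcontained = TRUE)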
To create a model for sentiment analysis, we have to do data preprocessing, or text cleansing, first. The model can be built with Keras: we define and compile it, split the data into training and evaluation sets, and train it. After that, we can tune the model to get the best possible accuracy.
We can interpret the model with LIME and visualize it using the htmlwidgets & shiny libraries.