1 Sentiment Analysis

Sentiment analysis is used to analyze feedback from users. By analyzing the words or comments they leave, we can understand how our service is perceived and improve it. Feedback text, especially from social media platforms, comes in many forms: users mix in punctuation, extra whitespace, URLs, and slang words.

Before we work with the deep learning model, we load the required libraries first:

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(keras)
use_condaenv("r-tensorflow")
library(textclean)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate

2 Finance data

We will use finance news as our data. Read the data and rename the columns:

data <- read_csv("data/finance_news.csv", col_names = F) %>% 
  rename(
    label = X1,
    text = X2
  )
## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_character(),
##   X2 = col_character()
## )
head(data)
## # A tibble: 6 x 2
##   label    text                                                                 
##   <chr>    <chr>                                                                
## 1 neutral  According to Gran , the company has no plans to move all production ~
## 2 neutral  Technopolis plans to develop in stages an area of no less than 100,0~
## 3 negative The international electronic industry company Elcoteq has laid off t~
## 4 positive With the new production plant the company would increase its capacit~
## 5 positive According to the company 's updated strategy for the years 2009-2012~
## 6 positive FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is aggressively pursuing it~

3 Text Cleansing

We will clean the data by:

  • Replacing URLs & HTML
  • Removing usernames & hashtags
  • Replacing contractions
  • Replacing word elongation
  • Replacing slang words
  • Removing/replacing certain characters
  • Removing punctuation
  • Lowercasing the text
  • Removing extra whitespace
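Before applying these steps to the whole dataset, here is a small illustrative before/after sketch on a made-up string (the exact replacements depend on the textclean lexicons):

# illustrative only: what the cleaning helpers do to a messy string
dirty <- "Soooo excited!!! check https://example.com @user1 #finance can't wait"
dirty %>%
  replace_url() %>%                        # drop URLs
  str_remove_all("@([0-9a-zA-Z_]+)") %>%   # drop usernames
  str_remove_all("#([0-9a-zA-Z_]+)") %>%   # drop hashtags
  replace_contraction() %>%                # "can't" -> "cannot"
  replace_word_elongation() %>%            # "Soooo" -> "So"
  replace_internet_slang() %>%             # expand slang terms (none in this toy string)
  str_remove_all(pattern = "[[:punct:]]") %>%
  str_to_lower() %>%
  str_squish()

Now we apply the full pipeline to the dataset: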

data <- data %>%
  mutate(
    text = text %>%
      replace_url()  %>% 
      replace_html() %>% 
      str_remove_all("@([0-9a-zA-Z_]+)") %>% 
      str_remove_all("#([0-9a-zA-Z_]+)") %>% 
      str_replace_all("[\\?]+", " questionmark ") %>% 
      str_replace_all("[\\!]+", " exclamationmark ") %>% 
      str_remove_all('[\\&]+') %>% 
      str_remove_all('[\\"]+') %>% 
      replace_contraction() %>%
      replace_word_elongation() %>% 
      replace_internet_slang() %>% 
      str_remove_all(pattern = "[[:digit:]]") %>% # remove number
      str_remove_all(pattern = "[[:punct:]]") %>% 
      str_remove_all(pattern = "\\$") %>% # remove dollar sign
      str_to_lower() %>% 
      str_squish(), 
    label = base::factor(label, levels = c("negative", "neutral", "positive")) %>% 
                     as.numeric() %>% {. - 1}
  ) %>% 
  select(text, label) %>% 
  na.omit() # remove NA

head(data)
## # A tibble: 6 x 2
##   text                                                                     label
##   <chr>                                                                    <dbl>
## 1 according to gran the company has no plans to move all production to ru~     1
## 2 technopolis plans to develop in stages an area of no less than square m~     1
## 3 the international electronic industry company elcoteq has laid off tens~     0
## 4 with the new production plant the company would increase its capacity t~     2
## 5 according to the company s updated strategy for the years basware targe~     2
## 6 financing of aspocomp s growth aspocomp is aggressively pursuing its gr~     2

  • Remove stopwords: stop words are common words that are filtered out before or after processing natural language data (text).
rm.stopwords <- VCorpus(VectorSource(data$text)) %>%
  tm_map(removeWords,stopwords("en")) %>%
  tm_map(stripWhitespace) %>% 
  sapply(as.character) %>%
  as.data.frame(stringsAsFactors = FALSE)

#rm.stopwords

data.clean <- bind_cols(rm.stopwords, data[,2]) %>%
  `colnames<-`(c("text","label"))

# glimpse(data.clean)
# head(data.clean)

4 Tokenizer

After the cleaning process, we use the keras text_tokenizer() to split the text into separate tokens. The num_words parameter sets the maximum number of words to keep as features.

Setting it to 1024 means that, out of all the unique words in the corpus, only the 1024 most frequent ones are used to build the model.

num_words <- 1024 

# prepare tokenizers
tokenizer <- text_tokenizer(num_words = num_words, lower = TRUE) %>% 
  fit_text_tokenizer(data.clean$text)

paste(
  "Total Unique Words:", length(tokenizer$word_counts),"|",
  "Total Features:", num_words
)
## [1] "Total Unique Words: 9341 | Total Features: 1024"

5 Split Data

We need to split the data into train, validation, and test sets. These parts will be used as follows:

  • Data train is used to train the model
  • Data validation is used to validate the hyperparameter tuning of the models
  • Data test is used to evaluate the model on unseen data

library(rsample)
set.seed(100)

# split into train - test
split <- initial_split(data.clean, strata = "label")
data_train <- training(split)
data_test <- testing(split)

# split data test into test - validation
split_val <- initial_split(data_test, prop = 0.5, strata = "label")
data_val <- training(split_val)
data_test <- testing(split_val)
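Because the splits are stratified on label, the class proportions should be roughly the same in every part. A quick check (not part of the original workflow):

# quick sanity check of class balance in each split
data_train %>% count(label)
data_val %>% count(label)
data_test %>% count(label)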

5.1 Text to sequence

In this sentiment analysis we keep the word order, because the sequence of words affects the meaning of the sentiment itself. Each text is converted to a sequence of integers and padded to the same maximum length:

maxlen <- max(str_count(data.clean$text, "\\w+")) + 1 # Text cutoff
  
# prepare x
data_train_x <- texts_to_sequences(tokenizer, data_train$text) %>%
  pad_sequences(maxlen = maxlen)

data_val_x <- texts_to_sequences(tokenizer, data_val$text) %>%
  pad_sequences(maxlen = maxlen)

data_test_x <- texts_to_sequences(tokenizer, data_test$text) %>%
  pad_sequences(maxlen = maxlen)

# prepare y
data_train_y <- to_categorical(data_train$label, num_classes = 3)
data_val_y <- to_categorical(data_val$label, num_classes = 3)
data_test_y <- to_categorical(data_test$label, num_classes = 3)

6 Build Architecture

6.1 Model Initialization

Embedding Layer: an embedding layer can only be used as the initial / first layer of the LSTM architecture. In deep learning frameworks such as Keras, the embedding layer learns to map text data to numerical vectors that represent the closeness in meaning of each word.

The embedding layer accepts several parameters, for example:

  • input_dim: the size of the vocabulary, i.e. the num_words value explained above.
  • input_length: the maximum length of the word sequence in the document input.
  • output_dim: the dimension of the embedding vectors passed to the next layer; 32 is a common choice, but it can be larger depending on the problem at hand.
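To make these parameters concrete, here is a minimal sketch (not part of the model built below) of an embedding layer on its own, using made-up dimensions:

# illustrative only: an embedding layer by itself
toy_embedding <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = 1024,   # vocabulary size (num_words)
    input_length = 10,  # length of each padded input sequence
    output_dim = 32     # size of the vector learned for each word
  )
summary(toy_embedding)  # output shape: (None, 10, 32) -- one 32-d vector per token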

Deep Neural Layer: the deep network layer accepts the embedding matrix as input and compresses it into a lower-dimensional representation that summarises the information in the data. For text data, the architectures commonly used are RNNs, in particular LSTM / GRU.

Output Layer: the output layer is the last layer in the deep learning architecture. In Keras we use layer_dense(), where we set the units parameter, i.e. how many neurons we want. In this case I use 3 units, because we have 3 classes (negative, neutral, positive).

# initiate keras model sequence
model <- keras_model_sequential()

# model
model %>%
  # layer input
  layer_embedding(
    name = "input",
    input_dim = num_words,
    input_length = maxlen, 
    output_dim = 32
  ) %>%
  # layer dropout
  layer_dropout(
    name = "embedding_dropout",
    rate = 0.5
  ) %>%
  # layer lstm 1
  layer_lstm(
    name = "lstm",
    units = maxlen,
    dropout = 0.25,
    recurrent_dropout = 0.25,
    return_sequences = FALSE
  ) %>%
  # layer output
  layer_dense(
    name = "output",
    units = 3,
    activation = "softmax"
  )

Dropout is added to reduce the risk of overfitting. The dropout rate ranges between 0 and 1; values of 0.2 to 0.5 are commonly used. Rates closer to 0 tend toward overfitting, while rates closer to 1 risk underfitting.

6.2 Model Compiling

For two-class problems the loss function used is binary_crossentropy, while for multiclass cases we use categorical_crossentropy. These are not the only options, but they are the most common loss functions for classification.
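As a side note (a sketch only, not used in this analysis): if the labels were kept as plain integers 0/1/2 instead of being one-hot encoded with to_categorical(), the same architecture could be compiled with sparse_categorical_crossentropy, which expects integer class labels:

# sketch only: compiling with integer labels instead of one-hot matrices
model_sparse <- keras_model_sequential() %>%
  layer_embedding(input_dim = num_words, input_length = maxlen, output_dim = 32) %>%
  layer_lstm(units = maxlen) %>%
  layer_dense(units = 3, activation = "softmax")

model_sparse %>% compile(
  optimizer = "adam",
  loss = "sparse_categorical_crossentropy",  # labels stay as 0 / 1 / 2
  metrics = "accuracy"
)
# model_sparse %>% fit(data_train_x, data_train$label, epochs = 10, batch_size = 128)

In this notebook we stick with the one-hot labels from to_categorical() and categorical_crossentropy: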

# compile the model
model %>% compile(
  optimizer = "adam",
  metrics = "accuracy",
  loss = "categorical_crossentropy"
)

# model summary
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## input (Embedding)                   (None, 39, 32)                  32768       
## ________________________________________________________________________________
## embedding_dropout (Dropout)         (None, 39, 32)                  0           
## ________________________________________________________________________________
## lstm (LSTM)                         (None, 39)                      11232       
## ________________________________________________________________________________
## output (Dense)                      (None, 3)                       120         
## ================================================================================
## Total params: 44,120
## Trainable params: 44,120
## Non-trainable params: 0
## ________________________________________________________________________________

6.3 Model Training

# model fit settings
epochs <- 10
batch_size <- 128

# fit the model
history <- model %>% fit(
  data_train_x, data_train_y,
  batch_size = batch_size, 
  epochs = epochs,
  verbose = 1,
  validation_data = list(
    data_test_x, data_test_y
  )
)

# history plot
plot(history)
## `geom_smooth()` using formula 'y ~ x'

6.4 Model Evaluation

The accuracy of the model above on the test data is:

# predict on test
data_test_pred <- model %>%
  predict_classes(data_test_x) %>%
  as.vector()

# performance on "unseen data"
yardstick::accuracy_vec(
 truth = factor(data_test$label,labels = c("negative", "neutral", "positive")),
 estimate = factor(data_test_pred, labels = c("negative", "neutral", "positive"))
)
## [1] 0.7599174

6.5 Model Tuning

We will fine-tune our model to increase the accuracy and decrease the loss.

Fine-tuning hyperparameters means changing a few variables before training, such as the learning rate, batch size, and number of epochs; the values are chosen manually (trial and error). We can also improve the quality of the data during preprocessing, and we can add dropout layers to reduce the risk of overfitting.
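The models below only change the dropout rate, number of epochs, and batch size by hand. For reference, lowering the learning rate or adding early stopping would look roughly like this (a sketch only, not used in models 1-3; on older keras versions the optimizer argument may be lr instead of learning_rate):

# sketch only: custom learning rate and early stopping
model_tuned <- keras_model_sequential() %>%
  layer_embedding(input_dim = num_words, input_length = maxlen, output_dim = 32) %>%
  layer_dropout(rate = 0.3) %>%
  layer_lstm(units = maxlen, dropout = 0.25, recurrent_dropout = 0.25) %>%
  layer_dense(units = 3, activation = "softmax")

model_tuned %>% compile(
  optimizer = optimizer_adam(learning_rate = 5e-4),  # smaller step size than the default 1e-3
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)

# history_tuned <- model_tuned %>% fit(
#   data_train_x, data_train_y,
#   epochs = 30, batch_size = 64,
#   validation_data = list(data_val_x, data_val_y),
#   callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 3))
# )

Models 1-3 below keep the default Adam optimizer and vary only dropout, epochs, and batch size.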

6.5.1 Model 1

# initiate keras model sequence
model1 <- keras_model_sequential()

# model
model1 %>%
  # layer input
  layer_embedding(
    name = "input",
    input_dim = num_words,
    input_length = maxlen, 
    output_dim = 32
  ) %>%
  # layer dropout
  layer_dropout(
    name = "embedding_dropout",
    rate = 0.3 #tuning
  ) %>%
  # layer lstm 1
  layer_lstm(
    name = "lstm",
    units = maxlen,
    dropout = 0.25,
    recurrent_dropout = 0.25,
    return_sequences = FALSE
  ) %>%
  # layer output
  layer_dense(
    name = "output",
    units = 3,
    activation = "softmax"
  )


# compile the model
model1 %>% compile(
  optimizer = "adam",
  metrics = "accuracy",
  loss = "categorical_crossentropy"
)

# model summary
summary(model1)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## input (Embedding)                   (None, 39, 32)                  32768       
## ________________________________________________________________________________
## embedding_dropout (Dropout)         (None, 39, 32)                  0           
## ________________________________________________________________________________
## lstm (LSTM)                         (None, 39)                      11232       
## ________________________________________________________________________________
## output (Dense)                      (None, 3)                       120         
## ================================================================================
## Total params: 44,120
## Trainable params: 44,120
## Non-trainable params: 0
## ________________________________________________________________________________
# model fit settings
epochs <- 20 #tuning
batch_size <- 128

# fit the model
history1 <- model1 %>% fit(
  data_train_x, data_train_y,
  batch_size = batch_size, 
  epochs = epochs,
  verbose = 1,
  validation_data = list(
    data_test_x, data_test_y
  )
)

# history plot
plot(history1)
## `geom_smooth()` using formula 'y ~ x'

After tuning, the accuracy of model1 above is:

# predict on test
data_test_pred1 <- model1 %>%
  predict_classes(data_test_x) %>%
  as.vector()

# performance on "unseen data"
yardstick::accuracy_vec(
 truth = factor(data_test$label,labels = c("negative", "neutral", "positive")),
 estimate = factor(data_test_pred1, labels = c("negative", "neutral", "positive"))
)
## [1] 0.8553719

6.5.2 Model 2

# initiate keras model sequence
model2 <- keras_model_sequential()

# model
model2 %>%
  # layer input
  layer_embedding(
    name = "input",
    input_dim = num_words,
    input_length = maxlen, 
    output_dim = 32
  ) %>%
  # layer dropout
  layer_dropout(
    name = "embedding_dropout",
    rate = 0.3 
  ) %>%
  # layer lstm 1
  layer_lstm(
    name = "lstm",
    units = maxlen,
    dropout = 0.35, #tuning
    recurrent_dropout = 0.25,
    return_sequences = FALSE
  ) %>%
  # layer output
  layer_dense(
    name = "output",
    units = 3,
    activation = "softmax"
  )


# compile the model
model2 %>% compile(
  optimizer = "adam",
  metrics = "accuracy",
  loss = "categorical_crossentropy"
)

# model summary
summary(model2)
## Model: "sequential_2"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## input (Embedding)                   (None, 39, 32)                  32768       
## ________________________________________________________________________________
## embedding_dropout (Dropout)         (None, 39, 32)                  0           
## ________________________________________________________________________________
## lstm (LSTM)                         (None, 39)                      11232       
## ________________________________________________________________________________
## output (Dense)                      (None, 3)                       120         
## ================================================================================
## Total params: 44,120
## Trainable params: 44,120
## Non-trainable params: 0
## ________________________________________________________________________________
# model fit settings
epochs <- 20
batch_size <- 32 #tuning

# fit the model
history2 <- model2 %>% fit(
  data_train_x, data_train_y,
  batch_size = batch_size, 
  epochs = epochs,
  verbose = 1,
  validation_data = list(
    data_test_x, data_test_y
  )
)

# history plot
plot(history2)
## `geom_smooth()` using formula 'y ~ x'

After tuning, the accuracy of model2 above is:

# predict on test
data_test_pred2 <- model2 %>%
  predict_classes(data_test_x) %>%
  as.vector()

# performance on "unseen data"
yardstick::accuracy_vec(
 truth = factor(data_test$label,labels = c("negative", "neutral", "positive")),
 estimate = factor(data_test_pred2, labels = c("negative", "neutral", "positive"))
)
## [1] 0.872314

6.5.3 Model 3

# initiate keras model sequence
model3 <- keras_model_sequential()

# model
model3 %>%
  # layer input
  layer_embedding(
    name = "input",
    input_dim = num_words,
    input_length = maxlen, 
    output_dim = 32
  ) %>%
  # layer dropout
  layer_dropout(
    name = "embedding_dropout",
    rate = 0.3 
  ) %>%
  # layer dropout
  layer_dropout(
    name = "embedding_dropout2",
    rate = 0.2 
  ) %>%
  # layer lstm 1
  layer_lstm(
    name = "lstm",
    units = maxlen,
    dropout = 0.35, #tuning
    recurrent_dropout = 0.25,
    return_sequences = FALSE
  ) %>%
  # layer output
  layer_dense(
    name = "output",
    units = 3,
    activation = "softmax"
  )


# compile the model
model3 %>% compile(
  optimizer = "adam",
  metrics = "accuracy",
  loss = "categorical_crossentropy"
)

# model summary
summary(model3)
## Model: "sequential_3"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## input (Embedding)                   (None, 39, 32)                  32768       
## ________________________________________________________________________________
## embedding_dropout (Dropout)         (None, 39, 32)                  0           
## ________________________________________________________________________________
## embedding_dropout2 (Dropout)        (None, 39, 32)                  0           
## ________________________________________________________________________________
## lstm (LSTM)                         (None, 39)                      11232       
## ________________________________________________________________________________
## output (Dense)                      (None, 3)                       120         
## ================================================================================
## Total params: 44,120
## Trainable params: 44,120
## Non-trainable params: 0
## ________________________________________________________________________________
# model fit settings
epochs <- 30
batch_size <- 32

# fit the model
history3 <- model3 %>% fit(
  data_train_x, data_train_y,
  batch_size = batch_size, 
  epochs = epochs,
  verbose = 1,
  validation_data = list(
    data_test_x, data_test_y
  )
)

# history plot
plot(history3)
## `geom_smooth()` using formula 'y ~ x'

We increased the number of epochs and added another dropout layer, and the accuracy of model3 increased further; on the held-out test data it reaches about 87.8% (shown below). So we will use model3 to predict the test data:

# predict on test
data_test_pred3 <- model3 %>%
  predict_classes(data_test_x) %>%
  as.vector()

# performance on "unseen data"
yardstick::accuracy_vec(
 truth = factor(data_test$label,labels = c("negative", "neutral", "positive")),
 estimate = factor(data_test_pred3, labels = c("negative", "neutral", "positive"))
)
## [1] 0.8780992

7 Lime

Bonus: after creating and tuning the model, we will interpret the text classification with LIME.

Library setup

# Data Wrangling
library(SnowballC)

# Model Fitting and Evaluation
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
## 
##     permutations
# Model Interpretation
library(lime)
## 
## Attaching package: 'lime'
## The following object is masked from 'package:dplyr':
## 
##     explain
#Graphical
library(htmlwidgets)
library(shiny)
library(shinythemes)

We have the finance news as our data; let's prepare the text and tokenizer for the explainer:

text <- data.clean$text
y_train <- data.clean$label

max_features <- 1000
tokenizer <- text_tokenizer(num_words = max_features)

tokenizer %>% fit_text_tokenizer(text)

Via the tokenizer object you can check word indices, word counts, and other interesting properties.

# tokenizer$word_counts 
# tokenizer$word_index
# Finally, we can replace words in dataset with integers
text_seqs <- texts_to_sequences(tokenizer, text)

text_seqs %>% head(3)
## [[1]]
## [1]  46   3 217 503  42  81   3 589
## 
## [[2]]
##  [1] 674 217 711 112 977 475  67  47 633 426 590 752   6
## 
## [[3]]
##  [1] 181 504  78   3 218  92 978 174  82 427   3 159 792 712 146
# As a final step, restrict the maximum length of all sequences and create a matrix as input for model
x_train <- text_seqs %>% pad_sequences(maxlen = maxlen)

# Let's print the first 2 rows and see that each sequence has been left-padded with zeros to maxlen (39) columns
x_train[1:2,]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0
## [2,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,]     0     0     0     0     0     0     0     0     0     0     0     0
## [2,]     0     0     0     0     0     0     0     0     0     0     0     0
##      [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## [1,]     0     0     0     0     0    46     3   217   503    42    81     3
## [2,]   674   217   711   112   977   475    67    47   633   426   590   752
##      [,39]
## [1,]   589
## [2,]     6
# In order to explain a text with LIME, we need to write a preprocess function
# that turns words into integers. The encoding steps described above therefore
# have to be repeated, but wrapped inside a function.
# Since we already have a tokenizer object, we can apply it to the train/test data or to new text.

get_embedding_explanation <- function(text) {
  
  tokenizer %>% fit_text_tokenizer(text)
  
  text_to_seq <- texts_to_sequences(tokenizer, text)
  sentences <- text_to_seq %>% pad_sequences(maxlen = maxlen)
}
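# Quick sanity check (not part of the original workflow): the preprocess function
# should return an integer matrix with one row per document and maxlen columns,
# which is exactly the input shape model3 expects
dim(get_embedding_explanation(data.clean$text[1:2]))  # should be c(2, maxlen)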
# Let's choose some text (3 rows) to explain
sentence_to_explain <- data.clean$text[10:12]

# Notice that our input is just plain text. Unlike tabular data, the lime() function
# for text classification requires a preprocess function, because it needs a way to
# convert the text into integers before passing it to the model.
explainer <- lime(sentence_to_explain, model = model3, preprocess = get_embedding_explanation)

# Get explanation for the first 10 words
explanation <- explain(sentence_to_explain, explainer, n_labels = 1, n_features = 10,n_permutations = 1e4)


# Different graphical ways to show the same information
sentence_to_explain
## [1] "operating profit totalled eur mn eur mn representing net sales"                                                                      
## [2] "teliasonera tlsn said offer line strategy increase ownership core business holdings strengthen eesti telekom s offering customers"   
## [3] "stora enso norske skog mreal upmkymmene credit suisse first boston cfsb raised fair value shares four largest nordic forestry groups"
plot_text_explanations(explanation)
plot_features(explanation)

Interactive way to test the model

interactive_text_explanations(explainer)

Since Shiny applications are not supported in static R Markdown documents, a screenshot of the interactive explainer is shown instead.

8 Conclusion

To create a model for sentiment analysis, we first have to do data preprocessing or text cleansing. The model can then be built with Keras: we define the architecture, compile it, train it on the training split, and evaluate it on held-out data. After that, we can tune the model to reach higher accuracy.

Finally, we can interpret the model with LIME and visualise the explanations using the htmlwidgets & shiny libraries.