2018 G7 Sentiment Summary

It comes after a tumultuous G7 meeting in Quebec, which saw US President Donald Trump retract support for the summit’s final statement after trading barbs with fellow members, including Canada, over tariffs. Stock markets are broadly positive despite the turbulence at the weekend’s G7 meeting. The sentiment analysis below suggests the public seems unfazed by the tough talk from President Trump. The markets are likely to stand their ground until there is a formal reaction from the other G7 members.

word2vec for sentiment analysis

The problem with the previous method (https://rpubs.com/JanpuHou/284847) is that it simply counts positive and negative words and draws a conclusion from their difference. With such a lexicon-based approach, the phrase “not bad” gets a negative score, because “bad” is counted and “not” is ignored. Word2vec, by contrast, is a neural-network algorithm that learns the context in which words appear, and it is currently among the best-performing approaches to sentiment classification of movie reviews. The same method can be used to analyze feedback, reviews, comments, and so on.
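To see why plain word counting breaks down on negation, here is a minimal sketch; the lexicons and the scoring function are hypothetical, purely for illustration:

# toy lexicons (hypothetical, for illustration only)
positive_words <- c("good", "great", "love")
negative_words <- c("bad", "awful", "hate")

# count positive words minus negative words
naive_score <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

naive_score("not bad") # -1: "bad" is counted, the negation "not" is ignored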

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. Sentiment analysis is the most common text classification task: it analyses an incoming message and tells whether the underlying sentiment is positive, negative, or neutral.
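text2vec, the package used below, does not implement word2vec itself, but its GloVe implementation learns comparable context-based embeddings. A minimal sketch, assuming a recent text2vec version (the GlobalVectors constructor arguments have changed across releases) and using the small movie_review dataset bundled with the package:

library(text2vec)
data("movie_review")

tokens <- word_tokenizer(tolower(movie_review$review))
it <- itoken(tokens, progressbar = FALSE)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)

# term-co-occurrence matrix counted within a 5-word window
tcm <- create_tcm(it, vocab_vectorizer(vocab), skip_grams_window = 5)

# 50-dimensional GloVe embeddings
glove <- GlobalVectors$new(rank = 50, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 10)

# nearest neighbours of "bad" by cosine similarity
sims <- sim2(word_vectors, word_vectors["bad", , drop = FALSE], method = "cosine")
head(sort(sims[, 1], decreasing = TRUE), 5)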

Data Pipeline: loading classified tweets dataset

# loading packages
library(twitteR)
library(ROAuth)
library(tidyverse)
## -- Attaching packages ---------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter()   masks stats::filter()
## x dplyr::id()       masks twitteR::id()
## x dplyr::lag()      masks stats::lag()
## x dplyr::location() masks twitteR::location()
library(purrrlyr)
library(text2vec)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded glmnet 2.0-16
library(ggrepel)

### loading and preprocessing a training set of tweets
# function for converting some symbols
conv_fun <- function(x) iconv(x, "latin1", "ASCII", "")
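iconv with an empty sub argument drops every character that cannot be represented in ASCII, a blunt but effective way to strip emoji and accented characters from tweets; the exact output can depend on your locale:

conv_fun("I love caf\u00e9s!") # the accented character is dropped, e.g. "I love cafs!"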

##### loading classified tweets ######
# source: http://help.sentiment140.com/for-students/
# 0 - the polarity of the tweet (0 = negative, 4 = positive)
# 1 - the id of the tweet
# 2 - the date of the tweet
# 3 - the query. If there is no query, then this value is NO_QUERY.
# 4 - the user that tweeted
# 5 - the text of the tweet

tweets_classified <- read_csv('D:/R_Files/training.1600000.processed.noemoticon.csv',
                              col_names = c('sentiment', 'id', 'date', 'query', 'user', 'text')) %>%
  # converting some symbols
  dmap_at('text', conv_fun) %>%
  # replacing class values
  mutate(sentiment = ifelse(sentiment == 0, 0, 1))
## Parsed with column specification:
## cols(
##   sentiment = col_integer(),
##   id = col_integer(),
##   date = col_character(),
##   query = col_character(),
##   user = col_character(),
##   text = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 432913 parsing failures.
## # A tibble: 5 x 5
##      row col   expected   actual     file                                  
##    <int> <chr> <chr>      <chr>      <chr>                                 
## 1 460801 id    an integer 2169448960 'D:/R_Files/training.1600000.process~
## 2 460802 id    an integer 2169449034 'D:/R_Files/training.1600000.process~
## 3 460803 id    an integer 2169449182 'D:/R_Files/training.1600000.process~
## 4 460804 id    an integer 2169449187 'D:/R_Files/training.1600000.process~
## 5 460805 id    an integer 2169449521 'D:/R_Files/training.1600000.process~
## See problems(...) for more details.
# there are some tweets with NA ids that we replace with dummies
tweets_classified_na <- tweets_classified %>%
  filter(is.na(id) == TRUE) %>%
  mutate(id = c(1:n()))
tweets_classified <- tweets_classified %>%
  filter(!is.na(id)) %>%
  rbind(., tweets_classified_na)
head(tweets_classified)
## # A tibble: 6 x 6
##   sentiment         id date                         query    user   text  
##       <dbl>      <int> <chr>                        <chr>    <chr>  <chr> 
## 1         0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheS~ @swit~
## 2         0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scott~ is up~
## 3         0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY matty~ @Keni~
## 4         0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleC~ my wh~
## 5         0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nati~
## 6         0 1467811372 Mon Apr 06 22:20:00 PDT 2009 NO_QUERY joy_w~ @Kwes~

Splitting Data into Train and Test Sets

# data splitting on train and test
set.seed(2340)
trainIndex <- createDataPartition(tweets_classified$sentiment, p = 0.8, 
                                  list = FALSE, 
                                  times = 1)
tweets_train <- tweets_classified[trainIndex, ]
tweets_test <- tweets_classified[-trainIndex, ]
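createDataPartition() samples in a stratified way on the outcome, so both subsets should preserve the roughly 50/50 class balance of the sentiment140 data; a quick sanity check (output omitted):

# class proportions should be nearly identical in both sets
prop.table(table(tweets_train$sentiment))
prop.table(table(tweets_test$sentiment))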

Word Vectorization

##### Vectorization #####
# define preprocessing function and tokenization function
prep_fun <- tolower
tok_fun <- word_tokenizer

it_train <- itoken(tweets_train$text, 
                   preprocessor = prep_fun, 
                   tokenizer = tok_fun,
                   ids = tweets_train$id,
                   progressbar = TRUE)
it_test <- itoken(tweets_test$text, 
                  preprocessor = prep_fun, 
                  tokenizer = tok_fun,
                  ids = tweets_test$id,
                  progressbar = TRUE)

creating vocabulary and document-term matrix

# creating vocabulary and document-term matrix
vocab <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
dtm_train <- create_dtm(it_train, vectorizer)
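The full vocabulary of 1.28 million training tweets is huge and the resulting document-term matrix extremely sparse. The code above keeps every term, but text2vec's prune_vocabulary() can shrink the vocabulary before vectorizing to save memory and training time; a sketch with illustrative thresholds:

# optional: drop very rare and near-ubiquitous terms (thresholds are illustrative)
vocab_pruned <- prune_vocabulary(vocab,
                                 term_count_min = 10,
                                 doc_proportion_max = 0.5)
vectorizer_pruned <- vocab_vectorizer(vocab_pruned)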
# define tf-idf model
tfidf <- TfIdf$new()

fit the TF-IDF to the train data

# fit the model to the train data and transform it with the fitted model
dtm_train_tfidf <- fit_transform(dtm_train, tfidf)
# apply pre-trained tf-idf transformation to test data
dtm_test_tfidf  <- create_dtm(it_test, vectorizer) %>% 
  transform(tfidf)
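Note the asymmetry: fit_transform() learns the idf weights from the training DTM and applies them, while transform() only applies the already-fitted weights, so the test matrix is scaled exactly as the training matrix was. The underlying idea in toy form (text2vec's exact weighting variant may differ):

# tf    = term count within a document / document length
# idf   = log(#documents / #documents containing the term)
# tfidf = tf * idf
idf <- function(n_docs, n_docs_with_term) log(n_docs / n_docs_with_term)
idf(2, 2) # a term in every document gets weight 0 (carries no signal)
idf(2, 1) # a term in half the documents gets log(2) ~ 0.69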

Train the Sentiment Classifier

t1 <- Sys.time()
glmnet_classifier <- cv.glmnet(x = dtm_train_tfidf,
                               y = tweets_train[['sentiment']], 
                               family = 'binomial', 
                               # L1 penalty
                               alpha = 1,
                               # interested in the area under ROC curve
                               type.measure = "auc",
                               # 5-fold cross-validation
                               nfolds = 5,
                               # a higher threshold is less accurate but speeds up training
                               thresh = 1e-3,
                               # likewise, a lower iteration cap speeds up training
                               maxit = 1e3)
## Warning: from glmnet Fortran code (error code -52); Convergence for 52th
## lambda value not reached after maxit=1000 iterations; solutions for larger
## lambdas returned
## Warning: from glmnet Fortran code (error code -51); Convergence for 51th
## lambda value not reached after maxit=1000 iterations; solutions for larger
## lambdas returned
## Warning: from glmnet Fortran code (error code -50); Convergence for 50th
## lambda value not reached after maxit=1000 iterations; solutions for larger
## lambdas returned
## Warning: from glmnet Fortran code (error code -52); Convergence for 52th
## lambda value not reached after maxit=1000 iterations; solutions for larger
## lambdas returned
## Warning: from glmnet Fortran code (error code -50); Convergence for 50th
## lambda value not reached after maxit=1000 iterations; solutions for larger
## lambdas returned
## Warning: from glmnet Fortran code (error code -52); Convergence for 52th
## lambda value not reached after maxit=1000 iterations; solutions for larger
## lambdas returned
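These warnings are expected given the deliberately loose thresh and maxit settings: for a few of the smallest lambda values the solver stops before converging, and glmnet returns solutions only for the larger lambdas. Raising maxit (say, to 1e5) would remove the warnings at the cost of a much longer training run.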
print(difftime(Sys.time(), t1, units = 'mins'))
## Time difference of 31.205 mins
plot(glmnet_classifier)

print(paste("max AUC(Area under the curve) =", round(max(glmnet_classifier$cvm),4)))
## [1] "max AUC(Area under the curve) = 0.8767"
preds <- predict(glmnet_classifier, dtm_test_tfidf, type = 'response')[ ,1]
auc(as.numeric(tweets_test$sentiment), preds)
## [1] 0.8755386
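Because the classifier is a linear model over tf-idf-weighted terms, its coefficients are directly interpretable: large positive weights mark words that push a tweet toward the positive class. A quick way to inspect them (a sketch; the dense conversion via as.matrix() is just for convenience):

coefs <- coef(glmnet_classifier, s = "lambda.min")
coefs_df <- data.frame(word = rownames(coefs), weight = as.numeric(as.matrix(coefs)))
head(coefs_df[order(-coefs_df$weight), ], 10) # most "positive" terms
head(coefs_df[order(coefs_df$weight), ], 10)  # most "negative" terms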
# save the objects for future using
rm(list = setdiff(ls(), c('glmnet_classifier', 'conv_fun', 'prep_fun', 'tok_fun', 'vectorizer', 'tfidf')))
save.image('D:/R_Files/image.RData')
rm(list = ls())

Analyze the Sentiment of G7 Tweets with the Trained Model

The most acrimonious G7 in a generation (Jun 8, 2018 - Jun 9, 2018) ended with relations between the US and its traditional allies at new lows after President Trump withdrew his support for the meeting’s communiqué on social media.

# reloading the classifier and helper objects saved above
load('D:/R_Files/image.RData')

# fetching fresh tweets (requires a prior setup_twitter_oauth() call
# with your Twitter API credentials) and converting some symbols
df_tweets <- twListToDF(searchTwitter('G7 OR #G7', n = 1000, lang = 'en')) %>%
  dmap_at('text', conv_fun)

# preprocessing and tokenization
it_tweets <- itoken(df_tweets$text,
                    preprocessor = prep_fun,
                    tokenizer = tok_fun,
                    ids = df_tweets$id,
                    progressbar = TRUE)

# creating vocabulary and document-term matrix
dtm_tweets <- create_dtm(it_tweets, vectorizer)
# transforming data with the tf-idf model fitted on the training set
dtm_tweets_tfidf <- transform(dtm_tweets, tfidf)

# predict probabilities of positiveness
preds_tweets <- predict(glmnet_classifier, dtm_tweets_tfidf, type = 'response')[ ,1]

# adding rates to initial dataset
df_tweets$sentiment <- preds_tweets
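Before plotting, a quick aggregate view is helpful. The 0.65 and 0.35 cut-offs mirror the threshold lines in the plot below; they are arbitrary, illustrative boundaries, not outputs of the model:

mean(df_tweets$sentiment)        # average probability of positiveness
mean(df_tweets$sentiment > 0.65) # share of clearly positive tweets
mean(df_tweets$sentiment < 0.35) # share of clearly negative tweets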

visualization of the sentiment analysis

# color palette
cols <- c("#ce472e", "#f05336", "#ffd73e", "#eec73a", "#4ab04a")

set.seed(932)
samp_ind <- sample(c(1:nrow(df_tweets)), nrow(df_tweets) * 0.1) # 10% for labeling

# plotting
ggplot(df_tweets, aes(x = created, y = sentiment, color = sentiment)) +
  theme_minimal() +
  scale_color_gradientn(colors = cols, limits = c(0, 1),
                        breaks = seq(0, 1, by = 0.25),
                        labels = seq(0, 1, by = 0.25),
                        guide = guide_colourbar(ticks = TRUE, nbin = 50, barheight = .5, label = TRUE, barwidth = 10)) +
  geom_point(aes(color = sentiment), alpha = 0.8) +
  geom_hline(yintercept = 0.65, color = "#4ab04a", size = 1.5, alpha = 0.6, linetype = "longdash") +
  geom_hline(yintercept = 0.35, color = "#f05336", size = 1.5, alpha = 0.6, linetype = "longdash") +
  geom_smooth(size = 1.2, alpha = 0.2) +
  geom_label_repel(data = df_tweets[samp_ind, ],
                   aes(label = round(sentiment, 2)),
                   fontface = 'bold',
                   size = 2.5,
                   max.iter = 100) +
  theme(legend.position = 'bottom',
        legend.direction = "horizontal",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = 20, face = "bold", vjust = 2, color = 'black', lineheight = 0.8),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        axis.text.y = element_text(size = 8, face = "bold", color = 'black'),
        axis.text.x = element_text(size = 8, face = "bold", color = 'black')) +
  ggtitle("Tweets Sentiment rate (probability of positiveness)")
## `geom_smooth()` using method = 'gam'