Training a machine to classify tweets according to sentiment

The aim of this exercise is to train a machine to classify tweets according to sentiment. To accomplish this, we extracted tweets from Twitter based on the following six tags: #Amma, #IPL, #RIO, #GST, #NarendraModi, #demonitization

We collected 500 tweets for each of these tags and put them together to create a corpus of 3000 tweets. After cleaning the data and removing duplicates (retweets etc.), we got 1059 tweets in our final corpus. We classified these 1059 tweets manually by reading each tweet and judging whether it reflected positive, neutral or negative sentiment, giving a value of +1 for positive sentiment, 0 for neutral and -1 for negative sentiment.
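The cleaning and de-duplication code is not part of this write-up, so the following is only a minimal sketch of how such a step could look in base R. The sample tweets, the `clean_tweet` helper and the regular expressions are all assumptions made for illustration, not the procedure actually used to build the corpus.

```r
# Hypothetical raw tweets; the real corpus is not reproduced here.
raw <- c("RT @user1: GST rollout delayed again http://t.co/abc",
         "GST rollout delayed again",
         "Great win in the #IPL final!",
         "RT @user2: Great win in the #IPL final!",
         "Amma memorial opens today")

# Hypothetical helper: strips retweet markers, URLs and non-letters.
clean_tweet <- function(x) {
  x <- gsub("^RT\\s+@\\w+:\\s*", "", x, perl = TRUE)  # drop a leading retweet marker
  x <- gsub("http\\S+", "", x, perl = TRUE)           # drop URLs
  x <- gsub("[^A-Za-z ]", "", x, perl = TRUE)         # keep letters and spaces only
  x <- gsub("\\s+", " ", x, perl = TRUE)              # collapse repeated spaces
  trimws(x)
}

cleaned <- clean_tweet(raw)
corpus  <- cleaned[!duplicated(tolower(cleaned))]     # keep the first copy of each tweet
print(corpus)
```

Once the retweet prefix and the URL are stripped, a retweet becomes identical to the tweet it repeats, so `duplicated()` removes it.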

Here we use the tm and RTextTools packages to classify the text.

Of the 1059 tweets in our final corpus, we used 70% to train the model and the remaining 30% to test it.

Published at : http://rpubs.com/rajiv2806/Training_a_machine_to_classify_tweets_according_to_sentiment

Invoking libraries

library("tm")
library("RTextTools")
library('magrittr')

Read the training data set in R and assign column names

library(RCurl)

## Loading required package: bitops

data = read.csv(text = getURL("https://raw.githubusercontent.com/Rajiv2806/Supervised_Text_Classfication/master/Tweets%20traindata.txt"),sep = '\t',stringsAsFactors = F)
dim(data)

## [1] 1617    2

colnames(data) <- c('sentiment','text')                    # Rename variables
head(data)                                                 # View few rows

##   sentiment
## 1         0
## 2         0
## 3         0
## 4        -1
## 5         0
## 6         1
##                                                                                                                                          text
## 1 A good leader appeals to the best in people amp a bad one appeals to the worst in them MrModi has lost sense of right or wrong NarendraModi
## 2        A Rio salvou esse ano de bosta Foi quase uma realidade paralela boa em que eu fiz algo que amo e fui plenamente feliz Obrigada SRCOM
## 3                                                       a Sandro Campagna fate raccontare del mega ceffone durante una partita olimpica a rio
## 4                         A Small Miracle By Mi Jayalalithaa Amma She Opposing Sasikala in AIADMK SAVE MI TAMILNADU FROM MUNNARGUDIMAFIA NEWS
## 5                                        A torcida brasileira vibrou com ele durante a Rio no Estdio Aqutico talo Gomes tocantinense de Porto
## 6                                                                           A Tribute To Amma on th day By The Nature CycloneVardah Stay Safe

head(data[which(data$sentiment < 1),])                     # View few rows with neutral (0) or negative (-1) sentiment

##   sentiment
## 1         0
## 2         0
## 3         0
## 4        -1
## 5         0
## 7        -1
##                                                                                                                                          text
## 1 A good leader appeals to the best in people amp a bad one appeals to the worst in them MrModi has lost sense of right or wrong NarendraModi
## 2        A Rio salvou esse ano de bosta Foi quase uma realidade paralela boa em que eu fiz algo que amo e fui plenamente feliz Obrigada SRCOM
## 3                                                       a Sandro Campagna fate raccontare del mega ceffone durante una partita olimpica a rio
## 4                         A Small Miracle By Mi Jayalalithaa Amma She Opposing Sasikala in AIADMK SAVE MI TAMILNADU FROM MUNNARGUDIMAFIA NEWS
## 5                                        A torcida brasileira vibrou com ele durante a Rio no Estdio Aqutico talo Gomes tocantinense de Porto
## 7                                          AAPtards are lying that Kejriwal was in Chennai for Ammas funeral thats why he didnt attend Agenda

Split this data in two parts for evaluating models

samp_id = sample(1:nrow(data),              
                 round(nrow(data)*.70),     # 70% records will be used for training
                 replace = F)               # sampling without replacement.

train = data[samp_id,]                      # training split: the sampled 70% of rows
test = data[-samp_id,]                      # test split: the remaining 30% of rows

dim(test) ; dim(train)                      # dimensions of the test and training splits

## [1] 485   2

## [1] 1132    2

head(test)

##    sentiment
## 3          0
## 4         -1
## 5          0
## 13         1
## 15         1
## 25         0
##                                                                                                                   text
## 3                                a Sandro Campagna fate raccontare del mega ceffone durante una partita olimpica a rio
## 4  A Small Miracle By Mi Jayalalithaa Amma She Opposing Sasikala in AIADMK SAVE MI TAMILNADU FROM MUNNARGUDIMAFIA NEWS
## 5                 A torcida brasileira vibrou com ele durante a Rio no Estdio Aqutico talo Gomes tocantinense de Porto
## 13                                                               Actress Trisha at Amma Memorial Paying her her Homage
## 15                Dear PM Sir can u lead all polititian to impose LOKPAL BILL urgently in INDIA as real SURGICALSTRIKE
## 25                                                      Ah yes the timeless PR trick of PositiveSpin Boeing Trump Iran

head(train)

##     sentiment
## 307         0
## 886        -1
## 524         0
## 602         0
## 43         -1
## 79          0
##                                                                                                                 text
## 307                                                  Cricket PaKvAuS BBL iPL PSL WBBL CSA First ScoRes Follow SND ON
## 886                            Demonetisation RahulGandhi addresses crowd in Dadri says Modi waging war against poor
## 524                                   Al Franken Faces Donald Trump and the Next Four Years USPolitics trump clinton
## 602                                                           Jayalalithaa with Mother Teresa ThrowBack Amma RIPAmma
## 43                              President Trump and the Ocean while theres lots to be worried about there is hope TY
## 79  OIL COMPANIES amp WALL ST are biggest advertisers for CORP media Do u REALLY think theyll criticize Trumps picks
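The split above draws a fresh random sample on every run, so the rows in each split (and the accuracy figures further down) change from run to run. The sketch below shows the same 70/30 split made repeatable with `set.seed()`. The toy data frame and the seed value are assumptions for illustration only.

```r
set.seed(42)                                    # hypothetical seed; any fixed value works

# Toy stand-in for the labelled tweets (not the real data)
toy <- data.frame(sentiment = rep(c(-1, 0, 1), length.out = 20),
                  text = paste("tweet", 1:20),
                  stringsAsFactors = FALSE)

idx <- sample(seq_len(nrow(toy)),               # candidate row indices
              round(nrow(toy) * 0.70),          # 70% of the rows
              replace = FALSE)                  # each row is picked at most once

toy_train <- toy[idx, ]                         # sampled rows
toy_test  <- toy[-idx, ]                        # everything that was not sampled

nrow(toy_train); nrow(toy_test)
length(intersect(rownames(toy_train), rownames(toy_test)))   # overlap between the splits
```

Because the test split is defined by negative indexing on the same `idx`, the two splits can never share a row.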

Process the text data and create DTM (Document Term Matrix)

train.data = rbind(train,test)              # join the data sets
train.data$text = tolower(train.data$text)  # Convert to lower case

text = train.data$text                      
text = removePunctuation(text)              # remove punctuation marks
text = removeNumbers(text)                  # remove numbers
text = stripWhitespace(text)                # remove blank space
cor = Corpus(VectorSource(text))            # Create text corpus
dtm = DocumentTermMatrix(cor,               # Create DTM
                         control = list(weighting =             
                                               function(x)
                                                 weightTfIdf(x, normalize = F))) # TF-IDF weighting, no length normalization

training_codes = train.data$sentiment       # Coded labels
dim(dtm)

## [1] 1617 5798
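To make the weighting step concrete, here is a small base R sketch of un-normalized TF-IDF on three made-up documents. It assumes the weight is the raw term count multiplied by log2(N / document frequency), which is our understanding of what `weightTfIdf(x, normalize = F)` computes. The documents and variable names are hypothetical, and the tm package is not used here.

```r
docs   <- c("good match good win", "bad tax", "good tax news")  # made-up documents
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per term
tf <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

doc_freq <- colSums(tf > 0)                   # number of documents containing each term
idf      <- log2(nrow(tf) / doc_freq)         # assumed IDF: log2(N / document frequency)
tfidf    <- sweep(tf, 2, idf, `*`)            # scale every column by its IDF

round(tfidf, 3)
```

A term that appears in fewer documents gets a larger IDF, so words used in only a handful of tweets carry more weight than words shared by most of them.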

Test the models and choose the best model

We tested the following algorithms: MAXENT, SVM, GLMNET, SLDA, TREE, BAGGING, BOOSTING and RF. MAXENT gave the highest accuracy, about 63%.

container <- create_container(dtm,               # creates a 'container' obj for training, classifying, and analyzing docs
                              t(training_codes), # labels or the Y variable / outcome we want to train on
                              trainSize = 1:nrow(train), 
                              testSize = (nrow(train)+1):nrow(train.data), 
                              virgin = FALSE)      # virgin = FALSE: the test rows have known labels.
                                                   # use virgin = TRUE when the labels of the data to classify are unknown.
str(container)     # view struc of the container obj; is a list of training n test data

## Formal class 'matrix_container' [package "RTextTools"] with 6 slots
##   ..@ training_matrix      :Formal class 'matrix.csr' [package "SparseM"] with 4 slots
##   .. .. ..@ ra       : num [1:11732] 8.07 8.34 8.66 6.85 4.66 ...
##   .. .. ..@ ja       : int [1:11732] 1 2 3 4 5 6 7 8 9 10 ...
##   .. .. ..@ ia       : int [1:1133] 1 12 23 34 41 53 69 84 90 100 ...
##   .. .. ..@ dimension: int [1:2] 1132 5798
##   ..@ classification_matrix:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
##   .. .. ..@ ra       : num [1:4929] 10.66 10.66 7.34 8.66 10.66 ...
##   .. .. ..@ ja       : int [1:4929] 4594 4595 873 4596 4597 4598 4599 4600 4601 113 ...
##   .. .. ..@ ia       : int [1:486] 1 13 26 39 46 59 67 83 96 112 ...
##   .. .. ..@ dimension: int [1:2] 485 5798
##   ..@ training_codes       : Factor w/ 3 levels "-1","0","1": 2 1 2 2 1 2 2 2 1 2 ...
##   ..@ testing_codes        : Factor w/ 3 levels "-1","0","1": 2 1 2 3 3 2 1 2 2 2 ...
##   ..@ column_names         : chr [1:5798] "bbl" "cricket" "csa" "first" ...
##   ..@ virgin               : logi FALSE

models <- train_models(container,              # ?train_models; makes a model object using the specified algorithms.
                       algorithms=c("MAXENT")) #"MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"

results <- classify_models(container, models)

head(results)

##   MAXENTROPY_LABEL MAXENTROPY_PROB
## 1                0       0.9990497
## 2                0       0.8322476
## 3                0       0.9990895
## 4                1       1.0000000
## 5                1       0.6660575
## 6                0       0.9849228

Building a confusion matrix to see accuracy of prediction results

out = data.frame(model_sentiment = results$MAXENTROPY_LABEL,    # label predicted by the model
                 model_prob = results$MAXENTROPY_PROB,          # probability of the predicted label
                 actual_sentiment = train.data$sentiment[(nrow(train)+1):nrow(train.data)])  # actual value of Y

dim(out); head(out);

## [1] 485   3

##   model_sentiment model_prob actual_sentiment
## 1               0  0.9990497                0
## 2               0  0.8322476               -1
## 3               0  0.9990895                0
## 4               1  1.0000000                1
## 5               1  0.6660575                1
## 6               0  0.9849228                0

summary(out)           # how many -1s, 0s and 1s did the model predict?

##  model_sentiment   model_prob     actual_sentiment  
##  -1: 62          Min.   :0.4893   Min.   :-1.00000  
##  0 :386          1st Qu.:0.9807   1st Qu.: 0.00000  
##  1 : 37          Median :0.9998   Median : 0.00000  
##                  Mean   :0.9568   Mean   :-0.05773  
##                  3rd Qu.:1.0000   3rd Qu.: 0.00000  
##                  Max.   :1.0000   Max.   : 1.00000

(z = as.matrix(table(out[,1], out[,3])))   # display the confusion matrix: rows = predicted, columns = actual

##     
##       -1   0   1
##   -1  28  29   5
##   0   54 295  37
##   1    3  19  15

(pct = round((sum(diag(z))/sum(z))*100, 2))      # prediction accuracy in % terms: correct predictions (diagonal) over all test tweets

## [1] 69.69
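To check the figures by hand, the sketch below rebuilds the confusion matrix printed above and derives overall accuracy, counting all three diagonal cells, along with per-class recall and precision. The matrix values are copied from the output above. The variable names `cm`, `accuracy`, `recall` and `precision` are ours.

```r
# Confusion matrix from the output above: rows = predicted, columns = actual
cm <- matrix(c(28,  54,  3,     # actual -1
               29, 295, 19,     # actual  0
                5,  37, 15),    # actual  1
             nrow = 3,
             dimnames = list(predicted = c("-1", "0", "1"),
                             actual    = c("-1", "0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)      # correct predictions over all test tweets
recall    <- diag(cm) / colSums(cm)       # share of each actual class that was found
precision <- diag(cm) / rowSums(cm)       # share of each predicted class that was right

round(100 * accuracy, 2)                  # 69.69
round(100 * recall, 1)                    # -1: 32.9, 0: 86.0, 1: 26.3
round(100 * precision, 1)                 # -1: 45.2, 0: 76.4, 1: 40.5
```

Recall answers "of the tweets that really were negative, how many did the model catch?", while precision answers "of the tweets the model called negative, how many really were?". Both are much lower for the positive and negative classes than for the dominant neutral class.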

head(out,10)

##    model_sentiment model_prob actual_sentiment
## 1                0  0.9990497                0
## 2                0  0.8322476               -1
## 3                0  0.9990895                0
## 4                1  1.0000000                1
## 5                1  0.6660575                1
## 6                0  0.9849228                0
## 7                0  0.9999381               -1
## 8                0  0.9999888                0
## 9               -1  0.9997143                0
## 10              -1  0.8312409                0

From the confusion matrix (rows are predicted labels, columns are actual labels) we can see that about 86% of neutral tweets (295 of 343), 26% of positive tweets (15 of 57) and 33% of negative tweets (28 of 85) were predicted accurately by the model.

Processing the training data and test data together

data.test = read.csv(text = getURL("https://raw.githubusercontent.com/Rajiv2806/Supervised_Text_Classfication/master/Tweets%20testdata.txt"),sep = '\t',stringsAsFactors = F)

dim(data.test)

## [1] 692   1

colnames(data.test) = 'text'

set.seed(16122016)
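data.test1 = data.test[sample(1:nrow(data.test), 1617, replace = T),] # sample 1617 tweets with replacement for demo purposes

text.test = tolower(data.test1)                                       # clean the new tweets the same way as the training text
text.test = stripWhitespace(removeNumbers(removePunctuation(text.test)))
cor.test = Corpus(VectorSource(text.test))                            # corpus of the new tweets (not the training corpus `cor`)
dtm.test = DocumentTermMatrix(cor.test, control = list(weighting = 
                                                  function(x)
                                                    weightTfIdf(x, normalize = F)))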

row.names(dtm.test) = (nrow(dtm)+1):(nrow(dtm)+nrow(dtm.test))     # row naming for doc ID
dtm.f = c(dtm, dtm.test)    # concatenating the dtms
training_codes.f = c(training_codes, 
                     rep(NA, length(data.test1)))     # replace unknown Y values with NA

Predict the test data

container.f = create_container(dtm.f,      # build a new container; same as before except virgin = T (new tweets are unlabelled)
                               t(training_codes.f), trainSize=1:nrow(dtm), 
                               testSize = (nrow(dtm)+1):(nrow(dtm)+nrow(dtm.test)), virgin = T)

model.f = train_models(container.f, algorithms = c("MAXENT")) 

predicted <- classify_models(container.f, model.f)     # classify_models makes predictions from a train_models() object.

out = data.frame(model_sentiment = predicted$MAXENTROPY_LABEL,    # predicted label for each new tweet (no actual labels to compare with)
                 model_prob = predicted$MAXENTROPY_PROB,
                 text = data.test1)
dim(out)

## [1] 1617    3

head(out,10)

##    model_sentiment model_prob
## 1                0  1.0000000
## 2               -1  1.0000000
## 3                0  1.0000000
## 4                0  1.0000000
## 5               -1  1.0000000
## 6                0  1.0000000
## 7                0  0.9999992
## 8                0  0.9999998
## 9               -1  1.0000000
## 10               0  1.0000000
##                                                                                                                                   text
## 1                                                 Rio organizers say opening ceremony will be the coolest party the athletes have seen
## 2                           If someone can declare some money worthless overnight its not money India Venezuela demonetization bitcoin
## 3                                                                                                     NarendraModi The Richest Fakkeer
## 4                                     Rio Nadal nadie te mete mano hay que ver lo que te vayas se te calienta la boca y me quedo eonoe
## 5                                                                                                                   Demonetization AAP
## 6                                                                                                           demonetization Chidambaram
## 7  ndtv Discussions on dual control of assessees remain inconclusive to be carried forward in the next GST Council meeting arunjaitley
## 8                                                                      Apple iPhone GB Silver TMobile Smartphone win rt amp follow Rio
## 9                                                 Ammas loss n nw Vardahcyclone Testing Time staystrong n staysafechennai we are wid u
## 10                                                                                  Our hero the condolences meet of Amma organised by

Reading the tweets next to their predicted labels in the output above, the predicted sentiment looks reasonable in a good number of cases. The test file has no manual labels, so this is a visual check rather than a measured accuracy.