The aim of this exercise is to train a machine to classify tweets according to sentiment. To accomplish this we extracted tweets from Twitter based on the following six tags: #Amma, #IPL, #RIO, #GST, #NarendraModi and #demonitization.
We collected 500 tweets for each of these tags and put them together into a corpus of 3000 tweets. After cleaning the data and removing duplicates (retweets etc.), we were left with about 1059 tweets in our final corpus. We then classified these 1059 tweets manually, reading each tweet and coding it as +1 for positive, 0 for neutral and -1 for negative sentiment.
Here we use the tm and RTextTools packages to classify the text.
Of the 1059 tweets in our final corpus, 70% are used to train the model and the remaining 30% to test it.
Published at : http://rpubs.com/rajiv2806/Training_a_machine_to_classify_tweets_according_to_sentiment
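The collection step itself is not shown in this document. For completeness, here is a rough sketch of how such a pull could look using the twitteR package; the OAuth credentials are placeholders and the exact calls are an assumption for illustration, not the script that produced this corpus.
# Sketch only (assumes a registered Twitter app; the credentials below are placeholders)
library(twitteR)
setup_twitter_oauth("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
tags = c("#Amma", "#IPL", "#RIO", "#GST", "#NarendraModi", "#demonitization")
raw = do.call(rbind, lapply(tags, function(tag) {
  twListToDF(searchTwitter(tag, n = 500)) # up to 500 tweets per tag
}))
raw = raw[!duplicated(raw$text), ] # drop retweets and other duplicates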
library("tm")
library("RTextTools")
library('magrittr')
library(RCurl)
## Loading required package: bitops
data = read.csv(text = getURL("https://raw.githubusercontent.com/Rajiv2806/Supervised_Text_Classfication/master/Tweets%20traindata.txt"),sep = '\t',stringsAsFactors = F)
dim(data)
## [1] 1617 2
colnames(data) <- c('sentiment','text') # Rename variables
head(data) # View few rows
## sentiment
## 1 0
## 2 0
## 3 0
## 4 -1
## 5 0
## 6 1
## text
## 1 A good leader appeals to the best in people amp a bad one appeals to the worst in them MrModi has lost sense of right or wrong NarendraModi
## 2 A Rio salvou esse ano de bosta Foi quase uma realidade paralela boa em que eu fiz algo que amo e fui plenamente feliz Obrigada SRCOM
## 3 a Sandro Campagna fate raccontare del mega ceffone durante una partita olimpica a rio
## 4 A Small Miracle By Mi Jayalalithaa Amma She Opposing Sasikala in AIADMK SAVE MI TAMILNADU FROM MUNNARGUDIMAFIA NEWS
## 5 A torcida brasileira vibrou com ele durante a Rio no Estdio Aqutico talo Gomes tocantinense de Porto
## 6 A Tribute To Amma on th day By The Nature CycloneVardah Stay Safe
head(data[which(data$sentiment < 1),]) # view a few rows with non-positive (neutral or negative) sentiment
## sentiment
## 1 0
## 2 0
## 3 0
## 4 -1
## 5 0
## 7 -1
## text
## 1 A good leader appeals to the best in people amp a bad one appeals to the worst in them MrModi has lost sense of right or wrong NarendraModi
## 2 A Rio salvou esse ano de bosta Foi quase uma realidade paralela boa em que eu fiz algo que amo e fui plenamente feliz Obrigada SRCOM
## 3 a Sandro Campagna fate raccontare del mega ceffone durante una partita olimpica a rio
## 4 A Small Miracle By Mi Jayalalithaa Amma She Opposing Sasikala in AIADMK SAVE MI TAMILNADU FROM MUNNARGUDIMAFIA NEWS
## 5 A torcida brasileira vibrou com ele durante a Rio no Estdio Aqutico talo Gomes tocantinense de Porto
## 7 AAPtards are lying that Kejriwal was in Chennai for Ammas funeral thats why he didnt attend Agenda
samp_id = sample(1:nrow(data),
                 round(nrow(data)*.70), # 70% of the records will be used for training
                 replace = F) # sampling without replacement
train = data[samp_id,] # 70% of the data, used for training
test = data[-samp_id,] # remaining 30%, held out for testing
dim(test) ; dim(train) # dimensions of the test and training sets
## [1] 485 2
## [1] 1132 2
head(test)
## sentiment
## 3 0
## 4 -1
## 5 0
## 13 1
## 15 1
## 25 0
## text
## 3 a Sandro Campagna fate raccontare del mega ceffone durante una partita olimpica a rio
## 4 A Small Miracle By Mi Jayalalithaa Amma She Opposing Sasikala in AIADMK SAVE MI TAMILNADU FROM MUNNARGUDIMAFIA NEWS
## 5 A torcida brasileira vibrou com ele durante a Rio no Estdio Aqutico talo Gomes tocantinense de Porto
## 13 Actress Trisha at Amma Memorial Paying her her Homage
## 15 Dear PM Sir can u lead all polititian to impose LOKPAL BILL urgently in INDIA as real SURGICALSTRIKE
## 25 Ah yes the timeless PR trick of PositiveSpin Boeing Trump Iran
head(train)
## sentiment
## 307 0
## 886 -1
## 524 0
## 602 0
## 43 -1
## 79 0
## text
## 307 Cricket PaKvAuS BBL iPL PSL WBBL CSA First ScoRes Follow SND ON
## 886 Demonetisation RahulGandhi addresses crowd in Dadri says Modi waging war against poor
## 524 Al Franken Faces Donald Trump and the Next Four Years USPolitics trump clinton
## 602 Jayalalithaa with Mother Teresa ThrowBack Amma RIPAmma
## 43 President Trump and the Ocean while theres lots to be worried about there is hope TY
## 79 OIL COMPANIES amp WALL ST are biggest advertisers for CORP media Do u REALLY think theyll criticize Trumps picks
train.data = rbind(train,test) # stack train and test so both share the same DTM vocabulary
train.data$text = tolower(train.data$text) # Convert to lower case
text = train.data$text
text = removePunctuation(text) # remove punctuation marks
text = removeNumbers(text) # remove numbers
text = stripWhitespace(text) # remove blank space
cor = Corpus(VectorSource(text)) # Create text corpus
dtm = DocumentTermMatrix(cor, # create the document-term matrix (DTM)
                         control = list(weighting =
                           function(x)
                             weightTfIdf(x, normalize = F))) # unnormalized TF-IDF weighting
training_codes = train.data$sentiment # Coded labels
dim(dtm)
## [1] 1617 5798
We tested several algorithms supported by RTextTools (MAXENT, SVM, GLMNET, SLDA, TREE, BAGGING, BOOSTING, RF) and obtained the highest accuracy, about 63%, with MAXENT, so only MAXENT is trained below; a sketch of how the full comparison can be run is shown just after the container is built.
container <- create_container(dtm, # creates a 'container' obj for training, classifying, and analyzing docs
t(training_codes), # labels or the Y variable / outcome we want to train on
trainSize = 1:nrow(train),
testSize = (nrow(train)+1):nrow(train.data),
virgin = FALSE) # whether to treat the classification (test) data as 'virgin', i.e. unlabelled
# virgin = FALSE means the test labels are known, so prediction accuracy can be evaluated later
str(container) # view structure of the container object: sparse training and test matrices plus their label codes
## Formal class 'matrix_container' [package "RTextTools"] with 6 slots
## ..@ training_matrix :Formal class 'matrix.csr' [package "SparseM"] with 4 slots
## .. .. ..@ ra : num [1:11732] 8.07 8.34 8.66 6.85 4.66 ...
## .. .. ..@ ja : int [1:11732] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. ..@ ia : int [1:1133] 1 12 23 34 41 53 69 84 90 100 ...
## .. .. ..@ dimension: int [1:2] 1132 5798
## ..@ classification_matrix:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
## .. .. ..@ ra : num [1:4929] 10.66 10.66 7.34 8.66 10.66 ...
## .. .. ..@ ja : int [1:4929] 4594 4595 873 4596 4597 4598 4599 4600 4601 113 ...
## .. .. ..@ ia : int [1:486] 1 13 26 39 46 59 67 83 96 112 ...
## .. .. ..@ dimension: int [1:2] 485 5798
## ..@ training_codes : Factor w/ 3 levels "-1","0","1": 2 1 2 2 1 2 2 2 1 2 ...
## ..@ testing_codes : Factor w/ 3 levels "-1","0","1": 2 1 2 3 3 2 1 2 2 2 ...
## ..@ column_names : chr [1:5798] "bbl" "cricket" "csa" "first" ...
## ..@ virgin : logi FALSE
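As mentioned above, several algorithms were compared before settling on MAXENT. A sketch of how such a comparison could be run with RTextTools on this container is shown below; it illustrates the package's standard workflow rather than reproducing the exact comparison script used here.
algos = c("MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF")
models.all = train_models(container, algorithms = algos) # train every algorithm on the same container
results.all = classify_models(container, models.all) # predictions from each algorithm
analytics = create_analytics(container, results.all) # per-algorithm precision, recall and accuracy
summary(analytics) # compare the algorithms
Below, only the best-performing algorithm (MAXENT) is trained.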
models <- train_models(container, # ?train_models; makes a model object using the specified algorithms.
algorithms=c("MAXENT")) #"MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"
results <- classify_models(container, models)
head(results)
## MAXENTROPY_LABEL MAXENTROPY_PROB
## 1 0 0.9990497
## 2 0 0.8322476
## 3 0 0.9990895
## 4 1 1.0000000
## 5 1 0.6660575
## 6 0 0.9849228
out = data.frame(model_sentiment = results$MAXENTROPY_LABEL, # predicted class label for each test tweet
model_prob = results$MAXENTROPY_PROB,
actual_sentiment = train.data$sentiment[(nrow(train)+1):nrow(train.data)]) # actual value of Y
dim(out); head(out);
## [1] 485 3
## model_sentiment model_prob actual_sentiment
## 1 0 0.9990497 0
## 2 0 0.8322476 -1
## 3 0 0.9990895 0
## 4 1 1.0000000 1
## 5 1 0.6660575 1
## 6 0 0.9849228 0
summary(out) # distribution of predicted and actual sentiment labels
## model_sentiment model_prob actual_sentiment
## -1: 62 Min. :0.4893 Min. :-1.00000
## 0 :386 1st Qu.:0.9807 1st Qu.: 0.00000
## 1 : 37 Median :0.9998 Median : 0.00000
## Mean :0.9568 Mean :-0.05773
## 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :1.0000 Max. : 1.00000
(z = as.matrix(table(out[,1], out[,3]))) # display the confusion matrix.
##
## -1 0 1
## -1 28 29 5
## 0 54 295 37
## 1 3 19 15
(pct = round((sum(diag(z))/sum(z))*100, 2)) # overall prediction accuracy in %, across all three classes
## [1] 69.69
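The overall accuracy hides large differences between the three classes. Because the rows of z are the predicted labels and the columns the actual labels, per-class recall and precision can be read off the same matrix:
round(diag(z) / colSums(z), 2) # recall: share of each actual class that was predicted correctly
round(diag(z) / rowSums(z), 2) # precision: share of each predicted class that was actually correct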
head(out,10)
## model_sentiment model_prob actual_sentiment
## 1 0 0.9990497 0
## 2 0 0.8322476 -1
## 3 0 0.9990895 0
## 4 1 1.0000000 1
## 5 1 0.6660575 1
## 6 0 0.9849228 0
## 7 0 0.9999381 -1
## 8 0 0.9999888 0
## 9 -1 0.9997143 0
## 10 -1 0.8312409 0
From the confusion matrix, and the per-class recall and precision computed above, we can see that the model predicts neutral tweets far more reliably than either positive or negative ones.

## Processing the training data and test data together
data.test = read.csv(text = getURL("https://raw.githubusercontent.com/Rajiv2806/Supervised_Text_Classfication/master/Tweets%20testdata.txt"),sep = '\t',stringsAsFactors = F)
dim(data.test)
## [1] 692 1
colnames(data.test) = 'text'
set.seed(16122016)
data.test1 = data.test[sample(1:nrow(data.test), 1617, replace = T),] # sample 1617 tweets (with replacement) from the new data for this demo
text.test = data.test1 %>% tolower %>% removePunctuation %>% # same cleaning steps as for the training text
            removeNumbers %>% stripWhitespace
dtm.test = DocumentTermMatrix(Corpus(VectorSource(text.test)), # DTM for the new, unlabelled tweets
                              control = list(weighting = function(x) weightTfIdf(x, normalize = F)))
row.names(dtm.test) = (nrow(dtm)+1):(nrow(dtm)+nrow(dtm.test)) # row naming for doc ID
dtm.f = c(dtm, dtm.test) # concatenating the dtms
training_codes.f = c(training_codes,
rep(NA, length(data.test1))) # replace unknown Y values with NA
container.f = create_container(dtm.f, # build a new container; all same as before
t(training_codes.f), trainSize=1:nrow(dtm),
testSize = (nrow(dtm)+1):(nrow(dtm)+nrow(dtm.test)), virgin = T)
model.f = train_models(container.f, algorithms = c("MAXENT"))
predicted <- classify_models(container.f, model.f) # classify_models makes predictions from a train_models() object.
out = data.frame(model_sentiment = predicted$MAXENTROPY_LABEL, # collect predictions for the new, unlabelled tweets
model_prob = predicted$MAXENTROPY_PROB,
text = data.test1)
dim(out)
## [1] 1617 3
head(out,10)
## model_sentiment model_prob
## 1 0 1.0000000
## 2 -1 1.0000000
## 3 0 1.0000000
## 4 0 1.0000000
## 5 -1 1.0000000
## 6 0 1.0000000
## 7 0 0.9999992
## 8 0 0.9999998
## 9 -1 1.0000000
## 10 0 1.0000000
## text
## 1 Rio organizers say opening ceremony will be the coolest party the athletes have seen
## 2 If someone can declare some money worthless overnight its not money India Venezuela demonetization bitcoin
## 3 NarendraModi The Richest Fakkeer
## 4 Rio Nadal nadie te mete mano hay que ver lo que te vayas se te calienta la boca y me quedo eonoe
## 5 Demonetization AAP
## 6 demonetization Chidambaram
## 7 ndtv Discussions on dual control of assessees remain inconclusive to be carried forward in the next GST Council meeting arunjaitley
## 8 Apple iPhone GB Silver TMobile Smartphone win rt amp follow Rio
## 9 Ammas loss n nw Vardahcyclone Testing Time staystrong n staysafechennai we are wid u
## 10 Our hero the condolences meet of Amma organised by
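The new tweets carry no manual labels, so there is no confusion matrix for them; a simple way to use the predictions is to look at how the predicted sentiment is distributed over the new tweets:
table(out$model_sentiment) # counts of tweets predicted as negative (-1), neutral (0) and positive (1)
round(prop.table(table(out$model_sentiment)), 2) # the same distribution as proportions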