Train machine to classify tweets

Introduction

The purpose of this Markdown document is to extract tweets related to 6 different hashtags and train a machine learning model to classify them as Positive, Neutral or Negative. Approximately 70% of the tweets are classified manually and used as the training set; the remaining 30% are used to test the predictions of the chosen model.

library("tm")
library("RTextTools")

Extraction and cleaning of tweets

1000 tweets were extracted for the hashtags ‘#demonetization’, ‘#Jayalalithaa’, ‘#brexit’, ‘#indvseng’, ‘#trump’ and ‘#blockchain’ using the Twitter API.

This Twitter data is cleaned up by removing retweet markers, source users, hashtags, punctuation, numbers, links, new lines, double blanks, leading and trailing spaces, and non-ASCII characters.

Then all these tweets are merged into a single document and duplicates are removed.
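
The cleaning and de-duplication steps described above can be sketched in base R. This is only an illustration, not the code that was actually used: the helper name `clean_tweet`, the regular expressions and the sample tweets are all assumptions.

```r
# Hypothetical helper: clean raw tweet text with base R regular expressions.
clean_tweet <- function(x) {
  x <- gsub("\\bRT\\b", " ", x, perl = TRUE)               # retweet marker
  x <- gsub("http\\S+|www\\.\\S+", " ", x, perl = TRUE)    # links
  x <- gsub("@\\w+", " ", x, perl = TRUE)                  # source user / mentions
  x <- gsub("#", " ", x, fixed = TRUE)                     # hashtag symbol (the word is kept)
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")   # non-ASCII characters
  x <- gsub("[[:punct:]]", " ", x)                         # punctuation
  x <- gsub("[[:digit:]]", " ", x)                         # numbers
  x <- gsub("\\s+", " ", x, perl = TRUE)                   # new lines and repeated blanks
  trimws(x)                                                # leading and trailing spaces
}

# Two made-up tweets that differ only in link and number.
raw <- c("RT @user1: Great win! #IndvsEng http://t.co/abc 123\n",
         "RT @user2: Great win! #IndvsEng http://t.co/xyz 456")

cleaned <- clean_tweet(raw)
merged  <- unique(cleaned)   # duplicates collapse once the noise is removed
merged
```

Links are removed before punctuation so that the URL is still recognisable as one token when its pattern is applied.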

Manual classification for training set

Around 70% of these tweets (1823 tweets), covering all 6 topics, are randomly chosen and manually classified as -1 for Negative, 0 for Neutral and 1 for Positive.

Read the training dataset

data = read.csv("/Users/surya/Documents/CBA_ISB/TA/Group Assignment 2/Tweets Classification_Training.csv",stringsAsFactors = F)

From the training dataset, select 70% of the tweets for training and 30% to validate the results.

set.seed(16102016)                          # To fix the sample 

samp_id = sample(1:nrow(data),              # do ?sample to examine the sample() func
                 round(nrow(data)*.70),     # 70% records will be used for training
                 replace = F)               # sampling without replacement.

train = data[samp_id,]                      # 70% of training data set, examine struc of samp_id obj
test = data[-samp_id,]                      # remaining 30% of training data set
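
A quick check of the split can catch indexing mistakes and badly unbalanced classes. The sketch below uses a made-up data frame `toy` with randomly generated labels as a stand-in for the labelled tweets; only the row count (1823) and the column name `Classification` are taken from this document.

```r
# Hypothetical stand-in for the labelled data: 1823 rows with labels -1, 0, 1.
set.seed(16102016)
toy <- data.frame(text = paste("tweet", 1:1823),
                  Classification = sample(c(-1, 0, 1), 1823, replace = TRUE),
                  stringsAsFactors = FALSE)

samp_id   <- sample(1:nrow(toy), round(nrow(toy) * 0.70), replace = FALSE)
train_toy <- toy[samp_id, ]
test_toy  <- toy[-samp_id, ]

# The two parts are disjoint and together cover every row.
stopifnot(nrow(train_toy) + nrow(test_toy) == nrow(toy))
c(train = nrow(train_toy), test = nrow(test_toy))

# Share of each label in the two parts; a large gap would suggest an unlucky split.
round(prop.table(table(train_toy$Classification)), 2)
round(prop.table(table(test_toy$Classification)), 2)
```

With 1823 rows the split gives 1276 training rows and 547 validation rows.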

Process the text data and create DTM (Document Term Matrix)

train.data = rbind(train,test)              # join the data sets
train.data$text = tolower(train.data$text)  # Convert to lower case

text = train.data$text                      
text = removePunctuation(text)              # remove punctuation marks
text = removeNumbers(text)                  # remove numbers
text = stripWhitespace(text)                # remove blank space

stpw1 = readLines("/Users/surya/Documents/CBA_ISB/TA/Group Assignment 2/stopwords_Twitter.txt")# stopwords list from git
stpw2 = tm::stopwords('english')               # tm package stop word list; tokenizer package has the same name function

comn  = unique(c(stpw1, stpw2))                 # Union of the two lists
stopwords = unique(gsub("'"," ",comn))  # final stop word list, with apostrophes replaced by spaces
text  =  removeWords(text,stopwords) 

cor = Corpus(VectorSource(text))            # Create text corpus
dtm = DocumentTermMatrix(cor,               # Create DTM
                         control = list(weighting =             
                                               function(x)
                                                 weightTfIdf(x, normalize = F))) # TF-IDF weighting

training_codes = train.data$Classification       # Coded labels
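
To make the weighting concrete, here is a base R sketch of unnormalized TF-IDF on a made-up three-document corpus. It assumes the inverse document frequency is log2(N / df), which is the form `weightTfIdf` is understood to use; the corpus and all variable names are illustrative only.

```r
# Toy corpus (hypothetical), already lower-cased and cleaned.
docs <- c("brexit vote shame", "brexit vote win", "trump win")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term frequency: one row per document, one column per term.
tf <- t(vapply(tokens,
               function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
               integer(length(vocab))))
colnames(tf) <- vocab

# Document frequency and inverse document frequency (assumed log2(N / df)).
df  <- colSums(tf > 0)
idf <- log2(nrow(tf) / df)

# Unnormalized TF-IDF: raw counts scaled by idf, column by column.
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Terms that occur in fewer documents ("shame", "trump") get a larger weight than terms shared by several documents ("brexit", "vote", "win").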

Adding stopwords was observed to improve the accuracy of the model. Proper nouns related to these tweets are added to the stopwords list.

head(stpw1,10)

##  [1] "amma"         "rajnikanth"   "modi"         "trump"       
##  [5] "russia"       "putin"        "jayalalitha"  "jayalalithaa"
##  [9] "blockchain"   "bitcoin"

Test the models and choose the best model

All the models are tested and their confusion matrices are evaluated. The SVM model gave the highest accuracy.

container <- create_container(dtm,               # creates a 'container' obj for training, classifying, and analyzing docs
                              t(training_codes), # labels or the Y variable / outcome we want to train on
                              trainSize = 1:nrow(train), 
                              testSize = (nrow(train)+1):nrow(train.data), 
                              virgin = FALSE)

models <- train_models(container,              # ?train_models; makes a model object using the specified algorithms.
                       algorithms=c("SVM")) #"MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"

results <- classify_models(container, models)

# building a confusion matrix to see accuracy of prediction results
out = data.frame(model_sentiment = results$SVM_LABEL,    # model's predicted label for Y
                 model_prob = results$SVM_PROB,
                 actual_sentiment = train.data$Classification[(nrow(train)+1):nrow(train.data)])  # actual value of Y

# dim(out); head(out); 
# summary(out)           # how many -1s, 0s and 1s were there anyway?

(z = as.matrix(table(out[,1], out[,3])))   # display the confusion matrix.

##     
##       -1   0   1
##   -1 126  36  34
##   0   36  93  45
##   1   29  27 121

(pct = round(((z[1,1] + z[2,2] + z[3,3])/sum(z))*100, 2))      # prediction accuracy in % terms

## [1] 62.16
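
Overall accuracy hides how each class behaves, so per-class precision and recall are worth a look. The sketch below re-enters the confusion matrix printed above by hand (rows are predicted labels, columns are actual labels, following `table(out[,1], out[,3])`); the variable names are illustrative.

```r
# Confusion matrix as printed above: rows = predicted label, columns = actual label.
z <- matrix(c(126, 36,  34,
               36, 93,  45,
               29, 27, 121),
            nrow = 3, byrow = TRUE,
            dimnames = list(predicted = c("-1", "0", "1"),
                            actual    = c("-1", "0", "1")))

accuracy  <- sum(diag(z)) / sum(z)   # share of correctly labelled tweets
precision <- diag(z) / rowSums(z)    # of tweets predicted as a class, share that were right
recall    <- diag(z) / colSums(z)    # of tweets truly in a class, share that were found

round(100 * accuracy, 2)
round(100 * cbind(precision, recall), 2)
```

On these numbers the Neutral class has the lowest precision (93 correct out of 174 Neutral predictions).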

Process the training and the test data

data.test = read.csv("/Users/surya/Documents/CBA_ISB/TA/Group Assignment 2/Tweets Classification_Test.csv",stringsAsFactors = F)

dim(data.test)

## [1] 783   2

set.seed(16102016)
text = data.test$text
text = removePunctuation(text)
text = removeNumbers(text)
text = stripWhitespace(text)
cor = Corpus(VectorSource(text))
dtm.test = DocumentTermMatrix(cor, control = list(weighting = 
                                                  function(x)
                                                    weightTfIdf(x, normalize = F)))

row.names(dtm.test) = (nrow(dtm)+1):(nrow(dtm)+nrow(dtm.test))     # row naming for doc ID
dtm.f = c(dtm, dtm.test)    # concatenating the dtms
training_codes.f = c(training_codes, 
                     rep(NA, nrow(data.test)))       # one NA per test tweet for the unknown Y values
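
The test tweets above go through punctuation, number and whitespace removal but not the stop word removal applied to the training text. One way to keep both paths identical is to route them through a single helper. The sketch below is a base R illustration under that assumption: `normalize_text` and the short stop word list are hypothetical, and whether this changes the accuracy here has not been tested.

```r
# Hypothetical helper that applies one set of normalization steps to any text vector.
normalize_text <- function(x, stop_words = character(0)) {
  x <- tolower(x)                          # lower case
  x <- gsub("[[:punct:]]", "", x)          # punctuation
  x <- gsub("[[:digit:]]", "", x)          # numbers
  if (length(stop_words) > 0) {
    # Remove whole-word matches of any stop word.
    pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  trimws(gsub("\\s+", " ", x, perl = TRUE))  # collapse and trim blanks
}

stop_words <- c("modi", "trump", "the")    # illustrative list only

# The same call is used for training and test text.
train_text <- normalize_text("The Modi speech, 2016!", stop_words)
test_text  <- normalize_text("Trump wins the vote", stop_words)
c(train_text, test_text)
```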

Predict the test data

container.f = create_container(dtm.f,      # build a new container; all same as before
                               t(training_codes.f), trainSize=1:nrow(dtm), 
                               testSize = (nrow(dtm)+1):(nrow(dtm)+nrow(dtm.test)), virgin = T)

model.f = train_models(container.f, algorithms = c("SVM")) 

predicted <- classify_models(container.f, model.f)     # ?classify_models makes predictions from a train_models() object.

out = data.frame(model_sentiment = predicted$SVM_LABEL,    # collect the predictions alongside the test tweets
                 model_prob = predicted$SVM_PROB,
                 text = data.test)

head(out,10)

##    model_sentiment model_prob text.id
## 1                0  0.4524000       1
## 2                1  0.6481473       2
## 3               -1  0.5395165       3
## 4               -1  0.4411334       4
## 5               -1  0.5498858       5
## 6               -1  0.6400861       6
## 7               -1  0.4342796       7
## 8               -1  0.3772911       8
## 9                0  0.4136253       9
## 10              -1  0.5338034      10
##                                                                                                           text.text
## 1                                          Sensational inning of DoubleCentury by in test match IndvsEng ViratKohli
## 2    Serie of art works dedicated to DonaldTrump exhibition art Trump sculpture paiting marble arte maga TrumpTrain
## 3       Several Dozen Suspects Arrested Amid Bank of Russia Cyberheist Investigation bitcoin blockchain crypto news
## 4                               Sex life Vardah IndvsEng BiggBoss ModiNoteGate DeMonetisationDisaster indianidol BB
## 5                                                                     sexywife cheating milf cougar ass boobs trump
## 6                Shame on Boris amp for taking UK backwards with Brexit Hopefully of us amp the vast majority of yo
## 7  She didnt make public appearance after being admitted to the hospital in September Was she in a vegetative state
## 8                         Sheikh Imran Hosein over Donald Trump als president Mocht hij de Islam nu nog verzaken ok
## 9                                     ShekarReddy SmallContractor Crorepathi Demonetization ITRain CurrencyShortage
## 10                   Shes Just another trump troll assuming its a she Maybe lb kid living in their parents basement

write.csv(out,"/Users/surya/Documents/CBA_ISB/TA/Group Assignment 2/prediction.csv")     # save the predictions for the holdout sample.

Observations

Removing uninformative stopwords helped improve the accuracy.
Different models seem to work well for particular sentiments. Overall, SVM seems to have the highest accuracy among the models tested in this case.
We also noticed that some of the chosen topics, for instance bitcoin/blockchain, contain a lot of factual information, which may have affected the predictions overall. Also, in the case of Jayalalitha, most of the tweets are RIP messages or condolences, which might not be a good choice for testing these models.