The purpose of this Markdown document is to extract tweets related to six different hashtags and train a model to classify them as Positive, Neutral, or Negative. Approximately 70% of the tweets are classified manually and used as the training set; the remaining 30% are used to test the predictions of the chosen model.
library("tm")
library("RTextTools")
Extracted 1000 tweets for the hashtags #demonetization, #Jayalalithaa, #brexit, #indvseng, #trump, and #blockchain using the Twitter API.
This Twitter data is cleaned up by removing retweet markers, source user names, hashtags, punctuation, numbers, links, new lines, double blanks, leading and trailing spaces, and non-ASCII characters.
Then all these tweets are merged into a single document and duplicates are removed.
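The cleaning and de-duplication steps described above are not shown in this document, so the following is only a minimal base-R sketch of what they might look like. The helper name `clean_tweet()` and the exact regular expressions are assumptions, not the code that was actually used. The sketch also assumes that only the `#` symbol was dropped and the hashtag word kept, since words such as IndvsEng still appear in the sample output further down.

```r
# Hypothetical helper (not from the original analysis): applies the cleaning
# steps listed above to a character vector of raw tweets.
clean_tweet <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")    # drop non-ASCII characters
  x <- gsub("http\\S+|www\\.\\S+", " ", x, perl = TRUE)    # links (before punctuation is stripped)
  x <- gsub("\\bRT\\b", " ", x, perl = TRUE)               # retweet marker
  x <- gsub("@\\w+", " ", x, perl = TRUE)                  # source user / mentions
  x <- gsub("#", "", x, fixed = TRUE)                      # hashtag symbol only (assumption)
  x <- gsub("[[:punct:]]", " ", x, perl = TRUE)            # punctuation
  x <- gsub("[[:digit:]]+", " ", x, perl = TRUE)           # numbers
  x <- gsub("\\s+", " ", x, perl = TRUE)                   # new lines and double blanks
  trimws(x)                                                # leading and trailing spaces
}

# Two made-up raw tweets that differ only in their link
raw <- c("RT @user: Brexit is here!! http://t.co/abc #brexit 2016",
         "RT @user: Brexit is here!! http://t.co/xyz #brexit 2016")

cleaned <- clean_tweet(raw)
merged  <- unique(cleaned)   # merge into one vector and drop duplicates
merged
```

Links are removed before punctuation so that the pieces of a URL are not left behind as stray words.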
Around 70% of these tweets (1823 tweets), covering all six topics, are randomly chosen and manually classified as -1 for Negative, 0 for Neutral, and 1 for Positive.
Read the training dataset
data = read.csv("/Users/surya/Documents/CBA_ISB/TA/Group Assignment 2/Tweets Classification_Training.csv",stringsAsFactors = F)
From the training dataset, select 70% of the tweets for training and the remaining 30% for validating the results
set.seed(16102016) # To fix the sample
samp_id = sample(1:nrow(data),               # row indices to sample from; see ?sample
                 round(nrow(data)*.70),      # 70% of the records are used for training
                 replace = F)                # sampling without replacement
train = data[samp_id,]                       # 70% of the labelled data set
test = data[-samp_id,]                       # remaining 30%, used for validation
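As a quick sanity check on the split, the sketch below repeats the same sampling logic on a made-up data frame with 1823 rows (the number of labelled tweets reported above). The `toy` data frame and its random labels are purely illustrative assumptions; only the row counts carry over to the real data.

```r
set.seed(16102016)
n_labelled <- 1823                                   # labelled tweets reported above
toy <- data.frame(id = seq_len(n_labelled),
                  Classification = sample(c(-1, 0, 1), n_labelled,
                                          replace = TRUE))   # made-up labels

samp_id   <- sample(seq_len(nrow(toy)), round(nrow(toy) * 0.70), replace = FALSE)
toy_train <- toy[samp_id, ]
toy_valid <- toy[-samp_id, ]

# Sizes of the two parts: 1276 training rows and 547 validation rows
c(train = nrow(toy_train), validation = nrow(toy_valid))

# Class shares in each part; large differences would suggest an unlucky split
round(prop.table(table(toy_train$Classification)), 2)
round(prop.table(table(toy_valid$Classification)), 2)
```

The 547 validation rows agree with the total of the confusion matrix reported later in this document.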
Process the text data and create DTM (Document Term Matrix)
train.data = rbind(train,test) # join the data sets
train.data$text = tolower(train.data$text) # Convert to lower case
text = train.data$text
text = removePunctuation(text) # remove punctuation marks
text = removeNumbers(text) # remove numbers
text = stripWhitespace(text) # remove blank space
stpw1 = readLines("/Users/surya/Documents/CBA_ISB/TA/Group Assignment 2/stopwords_Twitter.txt") # custom stopword list from git
stpw2 = tm::stopwords('english') # tm package stopword list; the tokenizers package has a function with the same name
comn = unique(c(stpw1, stpw2)) # union of the two lists
stopwords = unique(gsub("'"," ",comn)) # final stopword list, with apostrophes replaced by spaces
text = removeWords(text,stopwords)
cor = Corpus(VectorSource(text)) # Create text corpus
dtm = DocumentTermMatrix(cor, # Create DTM
                         control = list(weighting =
                                          function(x)
                                            weightTfIdf(x, normalize = F))) # TF-IDF weighting, not normalized by document length
training_codes = train.data$Classification # Coded labels
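To make the weighting step concrete, here is a small base-R sketch of unnormalized TF-IDF on three made-up documents. It assumes that, with `normalize = F`, `weightTfIdf()` multiplies each raw term count by log2(number of documents / number of documents containing the term); that is our reading of the tm documentation and should be checked against the installed version. The toy documents and variable names are illustrative only.

```r
# Three made-up, already cleaned documents
docs   <- c("trump wins vote", "brexit vote delayed", "brexit brexit deal")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per term
tf <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Inverse document frequency: log2(N / number of documents containing the term)
idf <- log2(nrow(tf) / colSums(tf > 0))

# Unnormalized TF-IDF: raw counts scaled column-wise by the IDF
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 2)
```

Terms that occur in many documents (here "vote" and "brexit") receive a smaller weight per occurrence than terms confined to a single document.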
Adding custom stopwords was observed to improve the accuracy of the model. Proper nouns related to these tweets were added to the stopword list:
head(stpw1,10)
## [1] "amma" "rajnikanth" "modi" "trump"
## [5] "russia" "putin" "jayalalitha" "jayalalithaa"
## [9] "blockchain" "bitcoin"
All the models were tested and their confusion matrices evaluated. The SVM model gave the highest accuracy.
container <- create_container(dtm,                 # 'container' object for training, classifying, and analyzing docs
                              t(training_codes),   # labels, i.e. the outcome Y we want to train on
                              trainSize = 1:nrow(train),
                              testSize = (nrow(train)+1):nrow(train.data),
                              virgin = FALSE)      # FALSE: the test rows have known labels
models <- train_models(container,                  # see ?train_models; fits the specified algorithms
                       algorithms=c("SVM"))        # options: "MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"
results <- classify_models(container, models)
# building a confusion matrix to see accuracy of prediction results
out = data.frame(model_sentiment = results$SVM_LABEL, # model's predicted label
                 model_prob = results$SVM_PROB,       # probability attached to the predicted label
                 actual_sentiment = train.data$Classification[(nrow(train)+1):nrow(train.data)]) # actual label
# dim(out); head(out);
# summary(out) # how many tweets fall in each class?
(z = as.matrix(table(out[,1], out[,3]))) # display the confusion matrix: rows = predicted, columns = actual
##
## -1 0 1
## -1 126 36 34
## 0 36 93 45
## 1 29 27 121
(pct = round(((z[1,1] + z[2,2] + z[3,3])/sum(z))*100, 2)) # prediction accuracy in % terms
## [1] 62.16
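Overall accuracy hides how the model does on each class. The sketch below re-enters the confusion matrix printed above by hand (rows are predicted labels, columns are actual labels) and derives per-class precision and recall from it. The object names are illustrative assumptions; the counts are the ones reported above.

```r
# Confusion matrix copied from the output above: rows = predicted, columns = actual
z <- matrix(c(126, 36,  34,
               36, 93,  45,
               29, 27, 121),
            nrow = 3, byrow = TRUE,
            dimnames = list(predicted = c("-1", "0", "1"),
                            actual    = c("-1", "0", "1")))

accuracy  <- sum(diag(z)) / sum(z)    # share of correctly classified tweets
precision <- diag(z) / rowSums(z)     # of the tweets predicted as a class, how many were right
recall    <- diag(z) / colSums(z)     # of the tweets truly in a class, how many were found

round(100 * accuracy, 2)              # 62.16, matching the figure reported above
round(cbind(precision, recall), 3)
```

Using `sum(diag(z))` instead of adding the three cells by hand keeps the calculation correct if the number of classes changes.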
Process the test data and combine it with the training DTM
data.test = read.csv("/Users/surya/Documents/CBA_ISB/TA/Group Assignment 2/Tweets Classification_Test.csv",stringsAsFactors = F)
dim(data.test)
## [1] 783 2
set.seed(16102016)
text = data.test$text
text = removePunctuation(text)
text = removeNumbers(text)
text = stripWhitespace(text)
cor = Corpus(VectorSource(text))
dtm.test = DocumentTermMatrix(cor, control = list(weighting =
                                                    function(x)
                                                      weightTfIdf(x, normalize = F)))
row.names(dtm.test) = (nrow(dtm)+1):(nrow(dtm)+nrow(dtm.test)) # row naming for doc ID
dtm.f = c(dtm, dtm.test) # concatenating the dtms
training_codes.f = c(training_codes,
                     rep(NA, nrow(data.test))) # one NA per test tweet, since the Y values are unknown
Predict the test data
container.f = create_container(dtm.f, # build a new container; same as before except virgin = T
                               t(training_codes.f), trainSize=1:nrow(dtm),
                               testSize = (nrow(dtm)+1):(nrow(dtm)+nrow(dtm.test)), virgin = T) # test labels are unknown
model.f = train_models(container.f, algorithms = c("SVM"))
predicted <- classify_models(container.f, model.f) # ?classify_models makes predictions from a train_models() object.
out = data.frame(model_sentiment = predicted$SVM_LABEL, # predicted label for each test tweet
                 model_prob = predicted$SVM_PROB,       # probability attached to the predicted label
                 text = data.test)                      # id and text columns of the test file
head(out,10)
## model_sentiment model_prob text.id
## 1 0 0.4524000 1
## 2 1 0.6481473 2
## 3 -1 0.5395165 3
## 4 -1 0.4411334 4
## 5 -1 0.5498858 5
## 6 -1 0.6400861 6
## 7 -1 0.4342796 7
## 8 -1 0.3772911 8
## 9 0 0.4136253 9
## 10 -1 0.5338034 10
## text.text
## 1 Sensational inning of DoubleCentury by in test match IndvsEng ViratKohli
## 2 Serie of art works dedicated to DonaldTrump exhibition art Trump sculpture paiting marble arte maga TrumpTrain
## 3 Several Dozen Suspects Arrested Amid Bank of Russia Cyberheist Investigation bitcoin blockchain crypto news
## 4 Sex life Vardah IndvsEng BiggBoss ModiNoteGate DeMonetisationDisaster indianidol BB
## 5 sexywife cheating milf cougar ass boobs trump
## 6 Shame on Boris amp for taking UK backwards with Brexit Hopefully of us amp the vast majority of yo
## 7 She didnt make public appearance after being admitted to the hospital in September Was she in a vegetative state
## 8 Sheikh Imran Hosein over Donald Trump als president Mocht hij de Islam nu nog verzaken ok
## 9 ShekarReddy SmallContractor Crorepathi Demonetization ITRain CurrencyShortage
## 10 Shes Just another trump troll assuming its a she Maybe lb kid living in their parents basement
write.csv(out,"/Users/surya/Documents/CBA_ISB/TA/Group Assignment 2/prediction.csv") # save the predictions for the holdout sample.