Let's look at how we can train a machine to classify tweets from Twitter. The packages below have to be installed and loaded first.
library(twitteR)
library(base64enc)
library("tm")
library("RTextTools")
library(magrittr)
Tweets are extracted through the Twitter API, which requires OAuth authentication with an API key, API secret, access token and access token secret.
#############################################
# Authentication
#############################################
# options(httr_oauth_cache=T)
# Read the credentials from environment variables rather than hard-coding them in the script
# (the variable names below are illustrative; set them in ~/.Renviron or the shell).
api_key <- Sys.getenv("TWITTER_API_KEY") # Consumer key
api_secret <- Sys.getenv("TWITTER_API_SECRET") # Consumer secret
access_token <- Sys.getenv("TWITTER_ACCESS_TOKEN") # Access token
access_token_secret <- Sys.getenv("TWITTER_ACCESS_TOKEN_SECRET") # Access token secret
# When prompted after this call, type 1 (Yes) to cache the credentials locally
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)
Let's perform the analysis on hashtags. We chose the following six hashtags for the tweet search: #2016, #brexit, #christmas, #kerala, #modi and #trump.
The tweets extracted for each hashtag are cleaned with regular expressions and written to a separate file per hashtag. The cleaning removes retweet markers, user names, hashtags, punctuation marks, numbers, links and newlines, and it collapses repeated blank spaces and trims leading and trailing white space.
#############################################
# Extract Tweets
#############################################
hashtags = c('#2016', '#brexit', '#christmas', '#kerala', '#modi', '#trump')
for (hashtag in hashtags){
  tweets = searchTwitter(hashtag, lang="en", n=1000) # search term and maximum number of tweets
  tweets = twListToDF(tweets) # convert from list to data frame
  tweets.df = tweets[,1] # first column holds the tweet text; clean it as a character vector
  # regex reference: https://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.100).aspx
  tweets.df = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets.df) # remove retweet source
  tweets.df = gsub("@\\w+", "", tweets.df) # remove @user
  tweets.df = gsub("#\\w+", "", tweets.df) # remove #key_word
  tweets.df = gsub("[[:punct:]]", "", tweets.df) # remove punctuation marks
  tweets.df = gsub("[[:digit:]]", "", tweets.df) # remove numbers
  tweets.df = gsub("http\\w+", "", tweets.df) # remove links (punctuation is already gone, so they look like "httpstco...")
  tweets.df = gsub("\n", " ", tweets.df) # replace new lines (\n) with a space
  tweets.df = gsub("[ \t]{2,}", " ", tweets.df) # collapse runs of two or more spaces or tabs
  tweets.df = gsub("[^[:alnum:]///' ]", " ", tweets.df) # keep only alphanumerics, "/", "'" and spaces
  tweets.df = iconv(tweets.df, "latin1", "ASCII", sub="") # keep only ASCII characters
  tweets.df = gsub("^\\s+|\\s+$", "", tweets.df) # remove leading and trailing white space
  tweets[,1] = tweets.df # save the cleaned text back into the data frame
  write.csv(tweets, paste0(gsub('#','',hashtag),'.csv')) # one file per hashtag, e.g. brexit.csv
}
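To see what the cleaning steps do to a single tweet, here is a minimal sketch that wraps the same `gsub()` calls in a helper and applies them to one string. The helper name `clean_tweet` and the sample tweet are invented for illustration; only base R is used, so it runs without Twitter access.

```r
# Hypothetical helper: applies the same regex steps as the loop above, in the same order.
clean_tweet <- function(x) {
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)  # retweet source
  x <- gsub("@\\w+", "", x)                         # @user
  x <- gsub("#\\w+", "", x)                         # #key_word
  x <- gsub("[[:punct:]]", "", x)                   # punctuation
  x <- gsub("[[:digit:]]", "", x)                   # numbers
  x <- gsub("http\\w+", "", x)                      # links (already stripped of punctuation)
  x <- gsub("\n", " ", x)                           # new lines
  x <- gsub("[ \t]{2,}", " ", x)                    # repeated spaces or tabs
  x <- gsub("[^[:alnum:]///' ]", " ", x)            # anything else that is not alphanumeric
  x <- iconv(x, "latin1", "ASCII", sub = "")        # non-ASCII characters
  gsub("^\\s+|\\s+$", "", x)                        # leading and trailing white space
}

# Invented sample tweet, not taken from the collected data.
sample_tweet <- "RT @user1: Brexit vote 2016 is here! https://t.co/abc123 #brexit\nSo it begins"
cleaned <- clean_tweet(sample_tweet)
print(cleaned)
```

Note that the order matters: links are removed only after punctuation, which is why the pattern `http\\w+` is enough to catch them.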
The per-hashtag files are then combined and written into a single file. Duplicate tweets are dropped so that only unique tweets remain.
# read all the tweets and combine the data sets
text.tweets = NULL
for (hashtag in hashtags) {
  d0 = read.csv(paste0(gsub('#','',hashtag),'.csv'), stringsAsFactors = F)
  text.tweets = c(text.tweets, d0$text)
}
text.tweets = unique(text.tweets)
df = data.frame(id = 1:length(text.tweets), text = text.tweets)
write.csv(df, 'tweets_classification_data.csv', row.names = F)
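The combine-and-deduplicate step can be tried without any Twitter data. The sketch below writes two tiny stand-in files to a temporary directory (file names and contents are invented for illustration), reads them back and reports how many duplicates `unique()` removes.

```r
# Toy stand-ins for the per-hashtag files (contents are invented for illustration).
out_dir <- tempdir()
write.csv(data.frame(text = c("free tickets today", "ice rink opens")),
          file.path(out_dir, "tagA.csv"))
write.csv(data.frame(text = c("ice rink opens", "vote result tonight")),
          file.path(out_dir, "tagB.csv"))
files <- file.path(out_dir, c("tagA.csv", "tagB.csv"))

# Read every file and pull out its text column.
all_text <- unlist(lapply(files, function(f) {
  read.csv(f, stringsAsFactors = FALSE)$text
}))

n_before <- length(all_text)        # tweets across both files
unique_text <- unique(all_text)     # the repeated tweet is kept only once
n_removed <- n_before - length(unique_text)

df_toy <- data.frame(id = seq_along(unique_text), text = unique_text,
                     stringsAsFactors = FALSE)
print(df_toy)
cat("duplicates removed:", n_removed, "\n")
```

`unique()` only drops exact matches, so two tweets that differ by a single character after cleaning are both kept.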
Let's now classify the text by sentiment. The classification is performed in two steps: (1) creating a training model, and (2) testing the data against the trained model.
Let's read the training dataset from the local disk. This is the combined file of unique, cleaned tweets; as the output below shows, it also carries a classification column holding the sentiment label of each tweet.
# Step 1- Read the training data set in R #
data = read.csv(file.choose(), # File path, chosen interactively
                stringsAsFactors = F) # Keep strings as text
dim(data)
## [1] 3159 3
head(data) # View few rows
## id
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## text
## 1 Free and
## 2 gets connected via from this
## 3 iSkate Irelands largest outdoor ice rink opens in mins
## 4 The beautiful tree in Santa eulalia till
## 5 will collect extra recycling over just separate it into clear bags or boxes
## 6 Closing todaylast chance to win an Ipad follow us on Twitter Instagram amp postshare your photo use
## classification
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
The data is then split into two parts for evaluating the models: 70% of the tweets are used to train the model, and the remaining 30% are used to test it.
# Step 2- Split this data in two parts for evaluating models
set.seed(16102016) # To fix the sample
samp_id = sample(1:nrow(data), # see ?sample for details of the sample() function
                 round(nrow(data)*.70), # 70% of the records will be used for training
                 replace = F) # sampling without replacement
train = data[samp_id,] # 70% of the data set, used for training
test = data[-samp_id,] # remaining 30% of the data set, used for testing
Let's view the dimensions of the training data:
dim(train) # dimns of training data
## [1] 2211 3
Let's view the dimensions of the test data:
dim(test) # dimns of test data
## [1] 948 3
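A random split does not guarantee that the three sentiment classes appear in the same proportions in both parts, so it can be worth checking. The sketch below repeats the same `sample()` pattern on an invented toy data frame (the class counts and the seed are assumptions for illustration) and compares the class shares.

```r
set.seed(1)  # any fixed seed makes the split reproducible

# Toy labelled data: 20 negative, 60 neutral, 20 positive rows (invented).
toy <- data.frame(id = 1:100,
                  classification = rep(c(-1, 0, 1), times = c(20, 60, 20)))

samp_id <- sample(seq_len(nrow(toy)), round(nrow(toy) * 0.70), replace = FALSE)
toy_train <- toy[samp_id, ]
toy_test <- toy[-samp_id, ]

# The two parts should cover every row exactly once.
no_overlap <- length(intersect(toy_train$id, toy_test$id)) == 0
covers_all <- nrow(toy_train) + nrow(toy_test) == nrow(toy)

# Share of each class in the two parts; large gaps suggest an unlucky split.
print(round(prop.table(table(toy_train$classification)), 2))
print(round(prop.table(table(toy_test$classification)), 2))
```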
The text is then preprocessed: it is converted to lower case, and punctuation marks, numbers and extra blank spaces are removed. The training and test sets are joined first so that a single matrix covers both. A text corpus is created from the processed text, and the Document Term Matrix (DTM) is built with TF-IDF weighting.
# Step 3- Process the text data and create DTM (Document Term Matrix)
train.data = rbind(train,test) # join the data sets
train.data$text = tolower(train.data$text) # Convert to lower case
text = train.data$text
text = removePunctuation(text) # remove punctuation marks
text = removeNumbers(text) # remove numbers
text = stripWhitespace(text) # remove blank space
cor = Corpus(VectorSource(text)) # Create text corpus
dtm = DocumentTermMatrix(cor, # Create DTM
                         control = list(weighting =
                                          function(x)
                                            weightTfIdf(x, normalize = F))) # TF-IDF weighting, not normalized
training_codes = train.data$classification # Coded labels
Let's view the dimensions of the Document Term Matrix (documents by terms):
dim(dtm) # Dimensions of the DTM
## [1] 3159 3456
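To make the weighting concrete, here is a small base R sketch of TF-IDF on four invented documents. It assumes the unnormalized form, where each raw term count is multiplied by log2(N / df), with N the number of documents and df the number of documents containing the term; to my understanding this is what `weightTfIdf(x, normalize = F)` computes, but treat that as an assumption and check `?weightTfIdf`.

```r
# Four invented, already-cleaned documents.
docs <- c("tweet brexit vote",
          "tweet brexit deal",
          "tweet christmas tree",
          "tweet christmas")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Document-term count matrix: one row per document, one column per term.
counts <- t(vapply(tokens,
                   function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                   integer(length(vocab))))
colnames(counts) <- vocab

doc_freq <- colSums(counts > 0)            # documents containing each term
idf <- log2(nrow(counts) / doc_freq)       # rarer terms get a larger weight
tfidf <- sweep(counts, 2, idf, `*`)        # term count times inverse document frequency

print(round(tfidf, 2))
```

A term that occurs in every document ("tweet" here) gets weight 0, while a term found in only one of the four documents ("vote") gets weight log2(4) = 2. That is why very common words contribute little to the classifier.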
Let's now create a container, test the models and choose the best one. A container is an object used for training, classifying and analyzing the data.
# Step 4- Test the models and choose best model
container <- create_container(dtm, # creates a 'container' object for training, classifying, and analyzing documents
                              t(training_codes), # labels, i.e. the outcome (Y) we want to train on
                              trainSize = 1:nrow(train), # rows used for training
                              testSize = (nrow(train)+1):nrow(train.data), # rows used for testing
                              virgin = FALSE) # FALSE because the test rows have known labels;
                              # virgin = TRUE is meant for unlabeled data that is only to be classified.
str(container) # view the structure of the container: training and test matrices plus their labels
## Formal class 'matrix_container' [package "RTextTools"] with 6 slots
## ..@ training_matrix :Formal class 'matrix.csr' [package "SparseM"] with 4 slots
## .. .. ..@ ra : num [1:19438] 5.34 6.72 8.63 4.4 6.17 ...
## .. .. ..@ ja : int [1:19438] 8 109 139 182 193 307 692 835 1446 1576 ...
## .. .. ..@ ia : int [1:2212] 1 16 25 30 38 48 58 65 74 86 ...
## .. .. ..@ dimension: int [1:2] 2211 3456
## ..@ classification_matrix:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
## .. .. ..@ ra : num [1:8099] 11.63 6.34 10.04 11.63 3.55 ...
## .. .. ..@ ja : int [1:8099] 555 1360 1493 3316 126 527 1204 1270 1304 1790 ...
## .. .. ..@ ia : int [1:949] 1 5 19 28 30 37 45 53 57 66 ...
## .. .. ..@ dimension: int [1:2] 948 3456
## ..@ training_codes : Factor w/ 3 levels "-1","0","1": 3 3 2 3 2 2 2 2 2 2 ...
## ..@ testing_codes : Factor w/ 3 levels "-1","0","1": 3 3 3 3 3 2 3 2 3 1 ...
## ..@ column_names : chr [1:3456] "aajkaal" "aap" "abc" "able" ...
## ..@ virgin : logi FALSE
The BOOSTING algorithm is used, and the first few classification results are displayed below. We initially tried the MAXENT algorithm, but that model gave a lower prediction accuracy, so the BOOSTING algorithm was chosen.
models <- train_models(container, # ?train_models; makes a model object using the specified algorithms.
algorithms=c("BOOSTING")) #"MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"
results <- classify_models(container, models)
head(results)
## LOGITBOOST_LABEL LOGITBOOST_PROB
## 1 0 0.5000000
## 2 0 0.5000000
## 3 0 0.5000000
## 4 0 0.5000000
## 5 -1 0.5000000
## 6 -1 0.8807971
Let's now build the confusion matrix from the predictions on the test set and determine the prediction accuracy.
# building a confusion matrix to see accuracy of prediction results
out = data.frame(model_sentiment = results$LOGITBOOST_LABEL, # label predicted by the model
                 model_prob = results$LOGITBOOST_PROB, # probability attached to the predicted label
                 actual_sentiment = train.data$classification[(nrow(train)+1):nrow(train.data)]) # actual value of Y
dim(out); head(out);
## [1] 948 3
## model_sentiment model_prob actual_sentiment
## 1 0 0.5000000 1
## 2 0 0.5000000 1
## 3 0 0.5000000 1
## 4 0 0.5000000 1
## 5 -1 0.5000000 1
## 6 -1 0.8807971 0
summary(out) # distribution of the predicted labels, the probabilities and the actual labels
## model_sentiment model_prob actual_sentiment
## -1:135 Min. :0.1192 Min. :-1.0000
## 0 :766 1st Qu.:0.5000 1st Qu.: 0.0000
## 1 : 47 Median :0.5000 Median : 0.0000
## Mean :0.6647 Mean : 0.0443
## 3rd Qu.:0.8808 3rd Qu.: 0.0000
## Max. :0.9975 Max. : 1.0000
The confusion matrix is shown below, with the model's predictions in rows and the actual labels in columns:
(z = as.matrix(table(out[,1], out[,3]))) # display the confusion matrix.
##
## -1 0 1
## -1 65 50 20
## 0 54 605 107
## 1 0 13 34
The prediction accuracy, as a percentage, is:
(pct = round(((z[1,1] + z[2,2] + z[3,3])/sum(z))*100, 2)) # prediction accuracy in % terms
## [1] 74.26
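Overall accuracy hides how the model does on each class. As a rough sketch, the confusion matrix reported above can be re-entered by hand and used to compute per-class precision and recall, plus the accuracy of always predicting the most common class. The variable names are illustrative; the counts are the ones printed above.

```r
# Confusion matrix from the output above: rows = predicted, columns = actual.
z <- matrix(c(65,  50,  20,
              54, 605, 107,
               0,  13,  34),
            nrow = 3, byrow = TRUE,
            dimnames = list(predicted = c("-1", "0", "1"),
                            actual    = c("-1", "0", "1")))

accuracy <- sum(diag(z)) / sum(z)        # correct predictions over all predictions
precision <- diag(z) / rowSums(z)        # of the tweets predicted as a class, share that were right
recall <- diag(z) / colSums(z)           # of the tweets truly in a class, share that were found
baseline <- max(colSums(z)) / sum(z)     # accuracy of always predicting the largest class

print(round(accuracy * 100, 2))          # 74.26
print(round(cbind(precision, recall), 3))
print(round(baseline * 100, 2))
```

On these counts only 34 of the 161 truly positive tweets are recovered, and always predicting neutral would already score about 70%, so the 74.26% figure is best read alongside the per-class numbers.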
The hashtags were chosen for their relevance during the year and their influence on the public; all of them are topics that draw public emotion.
The sentiments of the corpus are categorised into three classes: positive, negative and neutral. On the 948 test tweets, the model predicted 47 as positive, 135 as negative and 766 as neutral, while the actual labels (the column totals of the confusion matrix) were 161 positive, 119 negative and 668 neutral.
The model was run several times, removing stopwords and filtering the data further each time. We observed that the prediction accuracy increased as the data was filtered more heavily. Before the data was filtered, the prediction accuracy was only 60.80%. We therefore changed the modelling algorithm from MAXENT to BOOSTING and extended the stopword list so that the remaining text was more meaningful. This increased the accuracy to 74.26%.
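The stopword filtering itself is not shown in the code above. A minimal base R sketch of the idea follows; the helper name `remove_stopwords` and the short stopword list are assumptions for illustration, not the list used to obtain the reported accuracy.

```r
# Illustrative stopword list; a real run would use a much longer one.
stop_words <- c("the", "is", "in", "and", "a", "to")

# Hypothetical helper: drop whole-word matches, then tidy the white space.
remove_stopwords <- function(text, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  out <- gsub(pattern, "", text, perl = TRUE)   # \\b keeps "in" from matching inside "rink"
  trimws(gsub("\\s+", " ", out))
}

filtered <- remove_stopwords(c("the tree is in the square",
                               "ice rink opens to a crowd"),
                             stop_words)
print(filtered)
```

Fewer uninformative terms means a smaller Document Term Matrix, which is one plausible reason the accuracy improved as the stopword list grew.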