Classification of Tweets from Twitter

Let's look into how we can train a machine to classify tweets from Twitter. To train a classifier on the tweets, the packages below have to be installed and loaded.

library(twitteR)
library(base64enc)
library(tm)
library(RTextTools)
library(magrittr)
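
If any of these packages are missing, install them first. Below is a minimal one-time setup sketch, assuming all five packages are available on CRAN:

# Install any packages that are not already present (one-time setup)
pkgs <- c("twitteR", "base64enc", "tm", "RTextTools", "magrittr")
new_pkgs <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(new_pkgs) > 0) install.packages(new_pkgs)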

The Twitter data can be extracted after performing the OAuth handshake using the API key, API secret, access token and access token secret.

#############################################
# Authentication
#############################################
# options(httr_oauth_cache=T)

api_key <- "1bL9ufvChCVI94e5ennGX6Hc6"   #Consumer key: *

api_secret <- "pgGNIfrCF62c3gCiiOopzNicAuh8trgR9yjPpUHehnYYQ5Pz7E"   # Consumer secret: *

access_token <- "811063757291950080-XqmeBQYkLFyhA4TAzACmyYP2CxlDAmQ"  # Access token: 

access_token_secret <- "N5SoPjplfBQVUEX2oawTdcSMYJw218gzQyKVJ4pkWjGcX" # Access token secret: 

# When prompted after running the command below, type 1 to select Yes
# (use a local file to cache OAuth credentials)

setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)

Let's perform the analysis on hashtags. We have chosen the following six hashtags to search tweets on Twitter:

1. #2016
2. #brexit
3. #christmas
4. #kerala
5. #modi
6. #trump

The extracted tweets are written separately into different files, and regexes are applied to clean the data: retweet markers, punctuation marks, links, user names, new lines, extra blank spaces, and leading and trailing white space are all removed.

#############################################
# Extract Tweets
#############################################

hashtags = c('#2016', '#brexit', '#christmas', '#kerala', '#modi', '#trump')

for (hashtag in hashtags){
  tweets = searchTwitter(hashtag, lang="en", n=1000)   # hashtag to search for and number of tweets
  tweets = twListToDF(tweets)                          # convert from list to data frame

  tweets.df = tweets[,1]                               # assign tweet text for cleaning

  head(tweets.df)

  # Regex reference: https://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.100).aspx

  tweets.df = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets.df);head(tweets.df)  # remove retweet source
  tweets.df = gsub("@\\w+", "", tweets.df);head(tweets.df)         # remove @user mentions
  tweets.df = gsub("#\\w+", "", tweets.df);head(tweets.df)         # remove #keywords
  tweets.df = gsub("[[:punct:]]", "", tweets.df);head(tweets.df)   # remove punctuation marks
  tweets.df = gsub("[[:digit:]]", "", tweets.df);head(tweets.df)   # remove numbers
  tweets.df = gsub("http\\w+", "", tweets.df);head(tweets.df)      # remove links
  tweets.df = gsub("\n", " ", tweets.df);head(tweets.df)           # replace new lines (\n) with spaces
  tweets.df = gsub("[ \t]{2,}", " ", tweets.df);head(tweets.df)    # collapse runs of spaces and tabs
  tweets.df = gsub("[^[:alnum:]///' ]", " ", tweets.df)            # keep only alphanumeric characters (plus /, ' and space)
  tweets.df = iconv(tweets.df, "latin1", "ASCII", sub="")          # keep only ASCII characters
  tweets.df = gsub("^\\s+|\\s+$", "", tweets.df);head(tweets.df)   # remove leading and trailing white space

  tweets[,1] = tweets.df                               # save cleaned text back into the data frame

  head(tweets)

  write.csv(tweets, paste0(gsub('#','',hashtag),'.csv'))   # write one CSV per hashtag
}
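
As a quick optional check, the loop above should have produced one CSV file per hashtag in the working directory:

list.files(pattern = "\\.csv$")   # expect 2016.csv, brexit.csv, christmas.csv, kerala.csv, modi.csv, trump.csv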

All the tweets stored in the different files are then combined and written into a single file, keeping only the unique tweets and removing duplicates.

# read all the tweets and combine the data sets

text.tweets = NULL

for (hashtag in hashtags) {
  
  d0 = read.csv(paste0(gsub('#','',hashtag),'.csv'), stringsAsFactors = F)  
  text.tweets = c(text.tweets, d0$text)
}


text.tweets = unique(text.tweets)
df = data.frame(id = 1:length(text.tweets), text = text.tweets)
write.csv(df, 'tweets_classification_data.csv', row.names = F)

Let's now perform text classification based on the sentiment of the tweets. The classification is performed in two steps: (1) creating a training model, and (2) testing the data against the trained model.

Let's read the training dataset from the local disk. This is the combined file containing the unique, cleaned tweets, to which a classification column with hand-coded sentiment labels (-1 = negative, 0 = neutral, 1 = positive) has been added.

# Step 1- Read the training data set in R #
data = read.csv(file.choose(),          # choose the file path interactively
                stringsAsFactors = F)   # keep strings as character, not factors


dim(data)
## [1] 3159    3
head(data)                                                # view a few rows
##   id
## 1  1
## 2  2
## 3  3
## 4  4
## 5  5
## 6  6
##                                                                                                  text
## 1                                                                                            Free and
## 2                                                                        gets connected via from this
## 3                                              iSkate Irelands largest outdoor ice rink opens in mins
## 4                                               The beautiful tree in Santa eulalia              till
## 5                         will collect extra recycling over just separate it into clear bags or boxes
## 6 Closing todaylast chance to win an Ipad follow us on Twitter Instagram amp postshare your photo use
##   classification
## 1              1
## 2              1
## 3              1
## 4              1
## 5              1
## 6              1

The data is then split into two parts for evaluating the model: 70% of the tweets are used for training the model, and testing is performed on the remaining 30%.

# Step 2- Split this data in two parts for evaluating models
set.seed(16102016)                          # To fix the sample 

samp_id = sample(1:nrow(data),              # do ?sample to examine the sample() func
                 round(nrow(data)*.70),     # 70% records will be used for training
                 replace = F)               # sampling without replacement.

train = data[samp_id,]                      # 70% of the data, used for training; examine struc of samp_id obj
test = data[-samp_id,]                      # remaining 30% of the data, used for testing

Let's view the dimensions of the training data

dim(train)                      # dimensions of training data
## [1] 2211    3

Let's view the dimensions of the test data

dim(test)                      # dimensions of test data
## [1] 948   3
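
As an optional sanity check (a small sketch, not part of the original pipeline), we can verify that the two splits partition the data exactly and look at the class balance of the coded labels in the training split:

stopifnot(nrow(train) + nrow(test) == nrow(data))   # 2211 + 948 = 3159
table(train$classification)                         # counts of -1 / 0 / 1 labels in the training split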

The text data is then preprocessed: punctuation marks, numbers and extra blank spaces are removed. A text corpus is created from the processed data, and a Document Term Matrix (DTM) is built using TF-IDF weighting.

# Step 3- Process the text data and create DTM (Document Term Matrix)
train.data = rbind(train,test)              # join the data sets
train.data$text = tolower(train.data$text)  # Convert to lower case

text = train.data$text                      
text = removePunctuation(text)              # remove punctuation marks
text = removeNumbers(text)                  # remove numbers
text = stripWhitespace(text)                # remove blank space
cor = Corpus(VectorSource(text))            # Create text corpus
dtm = DocumentTermMatrix(cor,               # Create DTM
                         control = list(weighting =             
                                               function(x)
                                                 weightTfIdf(x, normalize = F))) # TF-IDF weighting (unnormalized)

training_codes = train.data$classification       # Coded labels

Let's view the dimensions of the Document Term Matrix

dim(dtm)                                # Dimensions of the DTM
## [1] 3159 3456
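
To get a feel for what the DTM holds, we can optionally inspect a few entries and list high-weight terms (an illustrative sketch; the indices and the frequency threshold are arbitrary):

inspect(dtm[1:5, 1:5])             # first 5 documents x first 5 terms, with a sparsity summary
findFreqTerms(dtm, lowfreq = 50)   # terms whose summed TF-IDF weight is at least 50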

Let's now create a container and test the models to choose the best one. A container is an object that holds the data for training, classifying and analyzing the documents.

# Step 4- Test the models and choose best model
container <- create_container(dtm,               # creates a 'container' obj for training, classifying, and analyzing docs
                              t(training_codes), # labels or the Y variable / outcome we want to train on
                              trainSize = 1:nrow(train), 
                              testSize = (nrow(train)+1):nrow(train.data), 
                              virgin = FALSE)      # whether to treat the classification data as 'virgin' (unlabeled) data or not.
                                                   # if virgin = TRUE, the classification set is treated as having no known labels.
str(container)     # view structure of the container obj; a list of training and test data
## Formal class 'matrix_container' [package "RTextTools"] with 6 slots
##   ..@ training_matrix      :Formal class 'matrix.csr' [package "SparseM"] with 4 slots
##   .. .. ..@ ra       : num [1:19438] 5.34 6.72 8.63 4.4 6.17 ...
##   .. .. ..@ ja       : int [1:19438] 8 109 139 182 193 307 692 835 1446 1576 ...
##   .. .. ..@ ia       : int [1:2212] 1 16 25 30 38 48 58 65 74 86 ...
##   .. .. ..@ dimension: int [1:2] 2211 3456
##   ..@ classification_matrix:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
##   .. .. ..@ ra       : num [1:8099] 11.63 6.34 10.04 11.63 3.55 ...
##   .. .. ..@ ja       : int [1:8099] 555 1360 1493 3316 126 527 1204 1270 1304 1790 ...
##   .. .. ..@ ia       : int [1:949] 1 5 19 28 30 37 45 53 57 66 ...
##   .. .. ..@ dimension: int [1:2] 948 3456
##   ..@ training_codes       : Factor w/ 3 levels "-1","0","1": 3 3 2 3 2 2 2 2 2 2 ...
##   ..@ testing_codes        : Factor w/ 3 levels "-1","0","1": 3 3 3 3 3 2 3 2 3 1 ...
##   ..@ column_names         : chr [1:3456] "aajkaal" "aap" "abc" "able" ...
##   ..@ virgin               : logi FALSE

The BOOSTING algorithm is used, and the result from the container is displayed below. Initially we tried the MAXENT algorithm, but that model gave a lower prediction accuracy, so the BOOSTING algorithm was chosen.

models <- train_models(container,              # ?train_models; makes a model object using the specified algorithms.
                       algorithms=c("BOOSTING")) #"MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"

results <- classify_models(container, models)

head(results)
##   LOGITBOOST_LABEL LOGITBOOST_PROB
## 1                0       0.5000000
## 2                0       0.5000000
## 3                0       0.5000000
## 4                0       0.5000000
## 5               -1       0.5000000
## 6               -1       0.8807971
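
For completeness, here is a sketch of how the two candidate algorithms could be compared side by side on the same container via RTextTools' create_analytics(); the object names models_cmp and results_cmp are illustrative:

models_cmp  <- train_models(container, algorithms = c("MAXENT", "BOOSTING"))
results_cmp <- classify_models(container, models_cmp)
analytics   <- create_analytics(container, results_cmp)    # requires virgin = FALSE, i.e. labeled test data
summary(analytics)                                         # per-algorithm precision, recall and F-score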

Let's now build a confusion matrix from the predicted and actual labels and determine the prediction accuracy.

# building a confusion matrix to see accuracy of prediction results
out = data.frame(model_sentiment = results$LOGITBOOST_LABEL,    # rounded probability == model's prediction of Y
                 model_prob = results$LOGITBOOST_PROB,
                 actual_sentiment = train.data$classification[(nrow(train)+1):nrow(train.data)])  # actual value of Y

dim(out); head(out); 
## [1] 948   3
##   model_sentiment model_prob actual_sentiment
## 1               0  0.5000000                1
## 2               0  0.5000000                1
## 3               0  0.5000000                1
## 4               0  0.5000000                1
## 5              -1  0.5000000                1
## 6              -1  0.8807971                0
summary(out)           # distribution of predicted labels, probabilities and actual labels
##  model_sentiment   model_prob     actual_sentiment 
##  -1:135          Min.   :0.1192   Min.   :-1.0000  
##  0 :766          1st Qu.:0.5000   1st Qu.: 0.0000  
##  1 : 47          Median :0.5000   Median : 0.0000  
##                  Mean   :0.6647   Mean   : 0.0443  
##                  3rd Qu.:0.8808   3rd Qu.: 0.0000  
##                  Max.   :0.9975   Max.   : 1.0000

The confusion matrix is as below:

(z = as.matrix(table(out[,1], out[,3])))   # display the confusion matrix.
##     
##       -1   0   1
##   -1  65  50  20
##   0   54 605 107
##   1    0  13  34

The prediction accuracy in percentage terms is

(pct = round(((z[1,1] + z[2,2] + z[3,3])/sum(z))*100, 2))      # prediction accuracy in % terms
## [1] 74.26
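
Beyond the overall accuracy, per-class recall can be read off the same confusion matrix (a quick optional check; rows of z are predicted labels, columns are actual labels). For the matrix above this gives roughly 0.55, 0.91 and 0.21 for the classes -1, 0 and 1 respectively:

round(diag(z) / colSums(z), 2)   # recall = correct predictions / actual totals, per class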
  1. The hashtags that were chosen for the testing were:
    1. 2016
    2. brexit
    3. christmas
    4. kerala
    5. modi
    6. trump

The hashtags were chosen considering the relevance of these terms in the current year and their influence on the public. All of the chosen hashtags involve public emotion.

  2. The sentiment of the corpus is categorised into three classes: positive, negative and neutral. On the 948 test tweets, the model predicted 47 tweets as positive, 135 as negative and 766 as neutral.

  3. This model was run on the classification task after removing stopwords and filtering the data a couple of times. It was observed that the more the data was filtered, the higher the prediction accuracy. Before filtering, the prediction accuracy obtained was only 60.80%. We therefore changed the modelling algorithm from MAXENT to BOOSTING and filtered the data into more meaningful text by including more stopwords. This increased the accuracy to 74.26%.

  4. From the above exercise we have learnt that data cleaning and the choice of algorithm both have a large impact on prediction accuracy: more thorough filtering of the tweets and the switch from MAXENT to BOOSTING raised the accuracy from 60.80% to 74.26%.