Intro

This page demonstrates how text analysis can be done in R. The bag-of-words technique is used to process the text, and an h2o random forest is used as the classification model.

Data

The data for this demo is sourced from http://www.crowdflower.com/data-for-everyone. A subset of the GOP Debate dataset provided on the CrowdFlower site is used here: only the tweets labeled either Positive or Negative are used for the analysis.

data_url="http://cdn2.hubspot.net/hubfs/346378/DFE_CSVs/GOP_REL_ONLY.csv?t=1447974095863"
debate_tweets = read.csv(data_url, stringsAsFactors=FALSE)
#Take only tweets marked as either positive or negative
debate_tweets=debate_tweets[debate_tweets$sentiment=="Positive"|debate_tweets$sentiment=="Negative",]
str(debate_tweets)
## 'data.frame':    10729 obs. of  20 variables:
##  $ candidate                : chr  "Scott Walker" "No candidate mentioned" "Donald Trump" "Ted Cruz" ...
##  $ candidate.confidence     : num  1 1 1 0.633 1 ...
##  $ relevant_yn              : chr  "yes" "yes" "yes" "yes" ...
##  $ relevant_yn.confidence   : num  1 1 1 1 1 ...
##  $ sentiment                : chr  "Positive" "Positive" "Positive" "Positive" ...
##  $ sentiment.confidence     : num  0.633 1 0.705 0.633 0.676 ...
##  $ subject_matter           : chr  "None of the above" "None of the above" "None of the above" "None of the above" ...
##  $ subject_matter.confidence: num  1 0.704 1 1 1 ...
##  $ candidate_gold           : chr  "" "" "" "" ...
##  $ name                     : chr  "PeacefulQuest" "MattFromTexas31" "sharonDay5" "DRJohnson11" ...
##  $ relevant_yn_gold         : chr  "" "" "" "" ...
##  $ retweet_count            : int  26 138 156 228 17 0 1 0 188 0 ...
##  $ sentiment_gold           : chr  "" "" "" "" ...
##  $ subject_matter_gold      : chr  "" "" "" "" ...
##  $ text                     : chr  "RT @ScottWalker: Didn't catch the full #GOPdebate last night. Here are some of Scott's best lines in 90 seconds. #Walker16 http"| __truncated__ "RT @RobGeorge: That Carly Fiorina is trending -- hours after HER debate -- above any of the men in just-completed #GOPdebate sa"| __truncated__ "RT @DanScavino: #GOPDebate w/ @realDonaldTrump delivered the highest ratings in the history of presidential debates. #Trump2016"| __truncated__ "RT @GregAbbott_TX: @TedCruz: \"On my first day I will rescind every illegal executive action taken by Barack Obama.\" #GOPDebat"| __truncated__ ...
##  $ tweet_coord              : chr  "" "" "" "" ...
##  $ tweet_created            : chr  "2015-08-07 09:54:46 -0700" "2015-08-07 09:54:45 -0700" "2015-08-07 09:54:45 -0700" "2015-08-07 09:54:44 -0700" ...
##  $ tweet_id                 : num  6.3e+17 6.3e+17 6.3e+17 6.3e+17 6.3e+17 ...
##  $ tweet_location           : chr  "" "Texas" "" "" ...
##  $ user_timezone            : chr  "" "Central Time (US & Canada)" "Arizona" "Central Time (US & Canada)" ...
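
Before cleansing the text, it is worth checking the balance of the sentiment labels, since an imbalanced outcome affects how the model's accuracy should be judged later. A minimal check using base R (output not shown here):

#Count and proportion of each sentiment label
table(debate_tweets$sentiment)
prop.table(table(debate_tweets$sentiment))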

Cleanse Data

When using the bag-of-words technique, punctuation and stopwords carry little useful information, so they are typically removed from the text before processing. Similarly, words are replaced with their stems so that variations of a word (for example, smile vs. smiling) are not counted as different words.

library(tm)
#Build corpus
corpus = Corpus(VectorSource(debate_tweets$text))
# Convert to lower-case (content_transformer keeps the corpus structure intact)
corpus = tm_map(corpus, content_transformer(tolower))

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

# Remove stopwords
corpus = tm_map(corpus, removeWords, stopwords("english"))

# Stem document 
corpus = tm_map(corpus, stemDocument)
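
To verify the effect of stemming claimed above, stemDocument can be applied directly to a small character vector; this sketch simply illustrates that word variants collapse to a common stem:

#Variants of "smile" all map to the same stem
stemDocument(c("smile", "smiling", "smiled"))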

Bag of Words

Once the text is cleansed, it is split into words and the frequency of each word is counted. The result is a document-term matrix in which each row holds the word counts for one tweet and each column represents one of the unique words found across all of the text being analyzed. Since the number of unique words can be very large, sparse terms (words that appear in very few documents, for example in less than 1% of them) are typically removed to reduce the computational cost.

#Get frequency of words
frequencies = DocumentTermMatrix(corpus)
#Dimension before removing sparse terms
dim(frequencies)
## [1] 10729 13390
# Remove sparse terms (keep only terms present in at least 1% of documents)
dtm_trimmed = removeSparseTerms(frequencies, 0.99)

#Dimension after removing sparse terms
dim(dtm_trimmed)
## [1] 10729   171
# Convert to a data frame
tweets_df = as.data.frame(as.matrix(dtm_trimmed))

# Make all variable names R-friendly
colnames(tweets_df) = make.names(colnames(tweets_df))

# Add sentiment variable
tweets_df$sentiment = factor(debate_tweets$sentiment)
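
It can also be instructive to see which terms occur most often before trimming. A quick sketch using tm's findFreqTerms; the threshold of 200 is an arbitrary choice, not taken from the original analysis:

#List terms that appear at least 200 times across all tweets
findFreqTerms(frequencies, lowfreq = 200)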

Build Model

To build the model, 70% of the data is used for training, and the model is then validated against the remaining 30%. I have used the random forest from the h2o package, which builds random forests much faster than the randomForest package does.

#Convert column names to ASCII so that h2o can work with them
colnames(tweets_df)=iconv(colnames(tweets_df), to='ASCII', sub='')
# Split the data to test and train dataset
library(caTools)
set.seed(100)
split = sample.split(tweets_df$sentiment,0.7)
train = subset(tweets_df, split==TRUE)
test = subset(tweets_df, split==FALSE)
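
Since sample.split stratifies on the outcome variable, both subsets should preserve the Positive/Negative ratio of the full data. A quick sanity check (purely illustrative):

#Confirm the class ratio is similar in both subsets
prop.table(table(train$sentiment))
prop.table(table(test$sentiment))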

#Using the h2o library for fast random forests
library(h2o)
h2o.init(nthreads = 7,max_mem_size = '4G',assertion = F)
## Successfully connected to http://127.0.0.1:54321/ 
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 hours 7 minutes 
##     H2O cluster version:        3.2.0.3 
##     H2O cluster name:           H2O_started_from_R_We_egk502 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.56 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  7 
##     H2O cluster healthy:        TRUE
## IP Address: 127.0.0.1 
## Port      : 54321 
## Session ID: _sid_800855c655c11cf429e3ba3ff20440dc 
## Key Count : 0
#Variables to be used for prediction (all columns except the outcome)
x_var = setdiff(colnames(tweets_df), "sentiment")

#Convert to h2o frames
htrain=as.h2o(train)
htest=as.h2o(test)
#h2o random forest model
tweet_sentiment = h2o.randomForest(x = x_var, y = "sentiment", ntrees = 200, max_depth = 70, training_frame = htrain)
tweet_sentiment
## Model Details:
## ==============
## 
## H2OBinomialModel: drf
## Model ID:  DRF_model_R_1448164593123_35 
## Model Summary: 
##   number_of_trees model_size_in_bytes min_depth max_depth mean_depth
## 1             200             2308441        70        70   70.00000
##   min_leaves max_leaves mean_leaves
## 1        828       1076   952.80500
## 
## 
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## Description: Metrics reported on Out-Of-Bag training samples
## 
## MSE:  0.1198188
## R^2:  0.2736625
## LogLoss:  0.3838993
## AUC:  0.8177705
## Gini:  0.635541
## 
## Confusion Matrix for F1-optimal threshold:
##          Negative Positive    Error        Rate
## Negative     5231      714 0.120101   =714/5945
## Positive      658      907 0.420447   =658/1565
## Totals       5889     1621 0.182690  =1372/7510
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.321396 0.569366 196
## 2                     max f2  0.115075 0.653883 317
## 3               max f0point5  0.560603 0.615100 115
## 4               max accuracy  0.568147 0.841145 113
## 5              max precision  0.999719 1.000000   0
## 6           max absolute_MCC  0.406487 0.462302 166
## 7 max min_per_class_accuracy  0.167825 0.727157 270
#Predict test sentiment
test_sentiment = as.data.frame(h2o.predict(tweet_sentiment, htest))

#Check how accurate the predictions are
table(test$sentiment, test_sentiment$predict)
##           
##            Negative Positive
##   Negative     2225      323
##   Positive      281      390
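
From the confusion matrix above, the model gets (2225 + 390) / 3219, or roughly 81%, of the test tweets right, compared with a baseline of about 79% for always predicting the majority class (Negative). A short sketch to compute both figures from the same table:

#Overall accuracy vs. the majority-class baseline
conf_mat = table(test$sentiment, test_sentiment$predict)
sum(diag(conf_mat)) / sum(conf_mat)     #model accuracy
max(table(test$sentiment)) / nrow(test) #baseline accuracy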