Load the libraries required for this assignment:
library("tm")
library("SnowballC")
library("caTools")
library("rpart")
library("rpart.plot")
library("ROCR")
library("randomForest")
We will try to understand the sentiment of tweets about the company Apple. By using Twitter to better understand public perception, Apple can monitor how people feel over time and how people receive new announcements.
Our challenge in this lecture is to see if we can correctly classify tweets as being negative, positive, or neither about Apple.
Working with text data is difficult: text is unstructured, often poorly written, and full of symbols and other non-standard representations, all of which make text analytics harder. The field that tackles this problem is Natural Language Processing (NLP), whose goal is to understand and derive meaning from human language in a way that machines can work with.
Fully understanding language is hard, so we take a much simpler approach: the bag-of-words model, which simply counts the number of times each word appears in a document.
Text data often has inconsistencies that cause algorithms trouble. For example, Apple, apple, and aPple should be treated as a single word, not three different words, since all of these tweets are about the Apple company. We therefore apply several pre-processing techniques to overcome such problems before building models.
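As a toy illustration of both points (a made-up sentence, using only base R), we can count word frequencies before and after lower-casing:
sentence <- "Apple fans love apple products and aPple knows it"
table(strsplit(sentence, " ")[[1]])            # "Apple", "apple" and "aPple" are counted as three different words
table(strsplit(tolower(sentence), " ")[[1]])   # after tolower() they collapse into a single "apple" with count 3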
Here is an outline of the steps we will cover: collecting and labeling the data, pre-processing the text, building a document-term matrix, and training and evaluating predictive models.
To collect the data needed for this task, we had to perform two steps.
The first was to collect data about tweets from the internet.
Twitter data is publicly available, and it can be collected either by scraping the website or via the Twitter API.
The sender of the tweet might be useful to predict sentiment, but we will ignore it to keep our data anonymized.
So we will just be using the text of the tweet.
Then we need to construct the outcome variable for these tweets, which means that we have to label them as positive, negative, or neutral sentiment.
We would like to label thousands of tweets, and we know that two people might disagree over the correct classification of a tweet. To do this efficiently, one option is to use the Amazon Mechanical Turk.
The task that we put on the Amazon Mechanical Turk was to judge the sentiment expressed by the
following item toward the software company Apple.
The items we gave them were tweets that we had collected.
The workers could pick from five options as their response, ranging from strongly negative to strongly positive.
These outcomes were represented as numbers on a scale from -2 to 2.
Each tweet was labeled by five workers. For each tweet, we take the average of the five scores given by the five workers, hence the final scores can range from -2 to 2 in increments of 0.2.
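For example, a hypothetical tweet scored 2, 2, 1, 1 and 2 by the five workers would receive an average score of 1.6:
mean(c(2, 2, 1, 1, 2))   # 1.6, one of the possible values in increments of 0.2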
The following graph shows the distribution of the number of tweets classified into each of the categories. We can see here that the majority of tweets were classified as neutral, with a small number classified as strongly negative or strongly positive.
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
Note: when working with text data we add stringsAsFactors = FALSE as an argument, so that the tweet text is read in as character strings rather than as factors.
Explore the structure of our data:
str(tweets)
## 'data.frame': 1181 obs. of 2 variables:
## $ Tweet: chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
We have 1181 observations of 2 variables:
The tweet texts are real tweets gathered from the internet and directed at Apple, with a few words cleaned up.
We can view the distribution of the average sentiment score with a histogram:
hist(tweets$Avg,breaks = 5)
We are more interested in being able to detect the tweets with clear negative sentiment, so
let's define a new variable in our data set called Negative.
tweets$Negative <- as.factor(tweets$Avg <= -1)
We can see how many tweets fall into the negative category with the help of table():
table(tweets$Negative)
##
## FALSE TRUE
## 999 182
Add one more variable for the positive tweets, i.e. tweets whose average sentiment score is greater than or equal to 1:
tweets$Positive <- as.factor(tweets$Avg>=1)
Now we can check the structure of our data frame:
str(tweets)
## 'data.frame': 1181 obs. of 4 variables:
## $ Tweet : chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
## $ Negative: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ Positive: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
table(tweets$Positive)
##
## FALSE TRUE
## 1120 61
One of the fundamental concepts in text analysis, also implemented in the tm package,
is that of a corpus.
A corpus is a collection of documents.
We will need to convert our tweets to a corpus for pre-processing.
Various functions in the tm package can be used to create a corpus in many different ways.
We will create it from the Tweet column of our data frame using two functions, Corpus() and VectorSource().
We feed the latter the Tweet variable of the tweets data frame.
corpus <- Corpus(VectorSource(tweets$Tweet))
Let's check out our corpus:
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1181
We can check that the documents match our tweets by using double brackets [[.
To inspect the first tweet in our corpus, we select the first element as:
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 101
The following pre-processing steps are required to deal with text data.
Follow the standard steps to build and pre-process the corpus:
1) Build a new corpus variable called corpus.
2) Using tm_map, convert the text to lowercase.
3) Using tm_map, remove all punctuation from the corpus.
4) Using tm_map, remove all English stopwords from the corpus.
5) Using tm_map, stem the words in the corpus.
6) Build a document term matrix from the corpus, called dtm.
Each operation, like stemming or removing stop words, can be done with one line in R using the tm_map() function, which takes as arguments the corpus and the transformation we want to apply.
The first step is to transform all text to lower case:
corpus <- tm_map(corpus, tolower)
After performing this first step we can check the same document as before, and we can see that the tweet no longer contains any upper-case characters:
corpus[[1]]
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
Convert the corpus back to a PlainTextDocument, since tolower() is a base R function rather than a standard tm transformation:
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
Look at the first ten English stop words:
stopwords("english")[1:10]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
Stop words are words that carry little or no meaning on their own, so we remove them because they contribute little to our corpus.
Removing words can be done with the removeWords argument to the tm_map() function, together with an extra argument specifying which words we want to remove.
We will remove all of these English stop words, but we will also remove the word “apple” since all of these tweets have the word “apple” and it probably won't be very useful in our prediction problem.
corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
Now check out our corpus
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 67
Lastly, we want to stem our document with the stemDocument argument.
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 61
We can see that this took off the ending of “customer,” “service,” “received,” and “appstore.”
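As a quick sanity check of what the stemmer does to individual words, we can call wordStem() from the SnowballC package (which stemDocument() uses under the hood) on a few of these words; the exact output follows the Porter stemmer rules:
wordStem(c("customer", "service", "received", "appstore"))   # expected roughly: "custom" "servic" "receiv" "appstor"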
We are now ready to extract the word frequencies to be used in our prediction problem.
The tm package provides a function called DocumentTermMatrix() that generates a matrix where rows correspond to documents (tweets), columns correspond to words, and the values are the number of times each word appears in each document.
DTM <- DocumentTermMatrix(corpus)
DTM
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
We see that in the corpus there are 3289 unique words.
Let's see what this matrix looks like using the inspect() function, slicing a block of rows and columns from the Document Term Matrix by their indices:
inspect(DTM[1000:1005, 505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity : 98%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
## Terms
## Docs cheapen cheaper check cheep cheer cheerio cherylcol chief
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 1 0 0 0
## Terms
## Docs chiiiiqu child children
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
In this range we see that the word “cheer” appears in tweet 1005, but “cheap” does not appear in any of these tweets. This data is what we call sparse, meaning that there are many zeros in our matrix.
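We can verify the sparsity figure reported in the matrix summary above from the non-sparse and sparse entry counts:
8980 / (8980 + 3875329)   # only about 0.23% of entries are non-zero, so sparsity is ~99.8%, printed as 100%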
We can look at what the most popular terms are, or words, with the function findFreqTerms(),
selecting a minimum number of 20 occurrences over the whole corpus:
freq <- findFreqTerms(DTM, lowfreq = 20)
freq
## [1] "android" "anyon" "app"
## [4] "appl" "back" "batteri"
## [7] "better" "buy" "can"
## [10] "cant" "come" "dont"
## [13] "fingerprint" "freak" "get"
## [16] "googl" "ios7" "ipad"
## [19] "iphon" "iphone5" "iphone5c"
## [22] "ipod" "ipodplayerpromo" "itun"
## [25] "just" "like" "lol"
## [28] "look" "love" "make"
## [31] "market" "microsoft" "need"
## [34] "new" "now" "one"
## [37] "phone" "pleas" "promo"
## [40] "promoipodplayerpromo" "realli" "releas"
## [43] "samsung" "say" "store"
## [46] "thank" "think" "time"
## [49] "twitter" "updat" "use"
## [52] "via" "want" "well"
## [55] "will" "work"
Out of the 3289 words in our matrix, only 56 words appear at least 20 times in our tweets.
This means that we probably have a lot of terms that will be pretty useless for our prediction model. The number of terms is an issue for two main reasons: computationally, more terms means more independent variables and longer model-building times; and statistically, a high ratio of independent variables to observations tends to hurt how well the models generalize.
Therefore let's remove some terms that don't appear very often.
sparse_DTM <- removeSparseTerms(DTM, 0.995)
This function takes a second parameter, the sparsity threshold. A threshold of 0.995 means we keep only the terms that appear in at least 0.5% of the tweets; sparser terms are removed.
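Concretely, with the 1181 tweets in our corpus, a threshold of 0.995 means a term must appear in roughly 6 or more tweets to be kept:
ceiling((1 - 0.995) * 1181)   # about 6; terms appearing in fewer documents are removed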
Let's see what the new Document Term Matrix properties look like:
sparse_DTM
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)
It only contains 309 unique terms, i.e. only about 9.4% of the full set.
Now let's convert the sparse matrix into a data frame that we will be able to use for our predictive models.
tweetsSparse <- as.data.frame(as.matrix(sparse_DTM))
Since R struggles with variable names that start with a number, and we probably have some words
here that start with a number, we should run the make.names() function to make sure all of our
words are appropriate variable names.
It will convert the variable names to make sure they are all appropriate names for R before we
build our predictive models.
You should do this each time you build a data frame using text analytics.
To make all variable names R-friendly use:
colnames(tweetsSparse) <- make.names(colnames(tweetsSparse))
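For example, make.names() prefixes names that start with a digit with an "X" (the first term below is just a hypothetical illustration):
make.names(c("5s", "iphone5c"))   # returns "X5s" "iphone5c"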
We should now add our dependent variables back to this data frame. We'll call them tweetsSparse$Negative and tweetsSparse$Positive, and set them from the original Negative and Positive variables of the tweets data frame.
Add Negative variable
tweetsSparse$Negative <- tweets$Negative
Add Positive Variable
tweetsSparse$Positive <- tweets$Positive
Now our data is ready for analysis, so let's build the machine learning models.
Before building the models, we need to split our data into a training set and a testing set, putting 70% of the data in the training set. Before doing so, we set a seed so that we all get the same split. First, split the data based on the Negative variable:
set.seed(123)
splitNegative <- sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
trainSparseNegative <- subset(tweetsSparse, splitNegative == TRUE)
testSparseNegative <- subset(tweetsSparse, splitNegative == FALSE)
Split the data based on the Positive variable:
splitPositive <- sample.split(tweetsSparse$Positive, SplitRatio = 0.7)
trainSparsePositive <- subset(tweetsSparse, splitPositive == TRUE)
testSparsePositive <- subset(tweetsSparse, splitPositive == FALSE)
For this prediction problem we use both the negative and the positive data sets, each split into training and testing sets, and we will train several different machine learning models on them.
So, let's first use CART to build a predictive model, using the rpart() function to predict
Negative using all of the other variables as our independent variables and the data set trainSparseNegative.
We'll add one more argument here, which is method = "class" so that the rpart() function knows
to build a classification model.
We keep default settings for all other parameters, in particular we are not adding anything for
minbucket or cp.
tweetCARTNegative <- rpart(Negative ~ . , data = trainSparseNegative, method = "class")
prp(tweetCARTNegative)
The tree splits on a small number of words with strongly negative connotations. If one of these words appears in the tweet, the tree predicts TRUE, i.e. negative sentiment; if none of them appears, it predicts FALSE, i.e. non-negative sentiment. This tree makes sense intuitively, since the splitting words are generally seen as negative words.
Now, let's build one more CART model, using the rpart() function to predict
Positive using all of the other variables as our independent variables and the data set trainSparsePositive.
We'll add one more argument here, which is method = "class" so that the rpart() function knows
to build a classification model.
We keep default settings for all other parameters, in particular we are not adding anything for
minbucket or cp.
tweetCARTPositive <- rpart(Positive ~ . , data = trainSparsePositive, method = "class")
Now we have two models ready: one for positive sentiment and one for negative sentiment.
Using the predict() function we compute the predictions of our models tweetCARTNegative and tweetCARTPositive on the test sets testSparseNegative and testSparsePositive.
Be careful to add the argument type = "class" to make sure we get class predictions.
Prediction for the negative sentiment:
predictCARTNegative <- predict(tweetCARTNegative, newdata = testSparseNegative, type = "class")
Prediction for the positive sentiment:
predictCARTPositive <- predict(tweetCARTPositive, newdata = testSparsePositive, type = "class")
Now let's evaluate model accuracy. There are different approaches, such as the confusion matrix and the AUC; for our predictions let's start by computing the confusion matrix for the negative model:
cmat_CARTNegative <- table(testSparseNegative$Negative, predictCARTNegative)
cmat_CARTNegative
## predictCARTNegative
## FALSE TRUE
## FALSE 294 6
## TRUE 37 18
accu_CART <- (cmat_CARTNegative[1,1] + cmat_CARTNegative[2,2])/sum(cmat_CARTNegative)
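From the confusion matrix above, the CART accuracy for Negative on the test set works out to about 0.879:
accu_CART   # (294 + 18) / 355, approximately 0.8789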
Compute the confusion matrix for the positive model:
cmat_CARTPositive <- table(testSparsePositive$Positive, predictCARTPositive)
cmat_CARTPositive
## predictCARTPositive
## FALSE TRUE
## FALSE 335 0
## TRUE 20 0
accu_CARTP <- (cmat_CARTPositive[1,1] + cmat_CARTPositive[2,2])/sum(cmat_CARTPositive)
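Similarly, the CART accuracy for Positive is about 0.9437, since this model never predicts TRUE on the test set:
accu_CARTP   # (335 + 0) / 355, approximately 0.9437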
Let's compare this to a simple baseline model that always predicts non-negative (i.e. the most common value of the dependent variable).
To compute the accuracy of the baseline model, let's make a table of just the outcome variable Negative.
cmat_baseline <- table(testSparseNegative$Negative)
cmat_baseline
##
## FALSE TRUE
## 300 55
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline)
To compute the accuracy of the baseline model, let's make a table of just the outcome variable Positive.
cmat_baselineP <- table(testSparsePositive$Positive)
cmat_baselineP
##
## FALSE TRUE
## 335 20
accu_baselineP <- max(cmat_baselineP)/sum(cmat_baselineP)
The accuracy of the baseline model is then 0.8451 for negative and 0.9437 for positive.
So the CART model beats the baseline for the negative case (about 0.879 vs 0.8451), while the positive CART model never predicts TRUE on the test set and therefore only matches the positive baseline (0.9437).
Next, let's build random forest models for both the negative and the positive data sets.
We use the randomForest() function to predict Negative and Positive, again using all of our other variables as independent variables, with the data sets trainSparseNegative and trainSparsePositive.
For these models we also keep the default parameter settings. Random forest for the negative sentiment:
set.seed(123)
tweetRFN <- randomForest(Negative ~ . , data = trainSparseNegative)
tweetRFN
##
## Call:
## randomForest(formula = Negative ~ ., data = trainSparseNegative)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 17
##
## OOB estimate of error rate: 10.46%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 686 17 0.02418208
## TRUE 70 59 0.54263566
Random forest for the positive sentiment:
set.seed(123)
tweetRFP <- randomForest(Positive ~ . , data = trainSparsePositive)
tweetRFP
##
## Call:
## randomForest(formula = Positive ~ ., data = trainSparsePositive)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 17
##
## OOB estimate of error rate: 6.37%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 779 1 0.001282051
## TRUE 52 0 1.000000000
And then compute the Out-of-Sample predictions for negative:
predictRFN <- predict(tweetRFN, newdata = testSparseNegative)
And then compute the Out-of-Sample predictions for positive:
predictRFP <- predict(tweetRFP, newdata = testSparsePositive)
There are several ways to calculate the accuracy of a model, such as the confusion matrix and the AUC; here we use the confusion matrix.
Compute the confusion matrix for negative:
cmat_RFN <- table(testSparseNegative$Negative, predictRFN)
cmat_RFN
## predictRFN
## FALSE TRUE
## FALSE 296 4
## TRUE 26 29
accu_RFN <- (cmat_RFN[1,1] + cmat_RFN[2,2])/sum(cmat_RFN)
Compute the confusion matrix for positive:
cmat_RFP <- table(testSparsePositive$Positive, predictRFP)
cmat_RFP
## predictRFP
## FALSE TRUE
## FALSE 335 0
## TRUE 18 2
accu_RFP <- (cmat_RFP[1,1] + cmat_RFP[2,2])/sum(cmat_RFP)
The overall accuracy of this Random Forest model is 0.9155 for the Negative and 0.9493 for the Positive sentiment.
This model is a little better than the CART model, but given the interpretability of the CART model, the latter would probably be preferred over the random forest model.
If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy
would increase to about the same as the random forest model.
So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a relatively small data set of tweets.
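As a sketch of the cross-validation mentioned above, one could tune cp with the caret package (caret and e1071 are not loaded at the top, and the resulting accuracy will depend on the folds chosen):
library(caret)
library(e1071)
set.seed(123)
numFolds <- trainControl(method = "cv", number = 10)                 # 10-fold cross-validation
cpGrid <- expand.grid(.cp = seq(0.001, 0.05, 0.001))                 # candidate cp values
train(Negative ~ . , data = trainSparseNegative, method = "rpart",
      trControl = numFolds, tuneGrid = cpGrid)                       # reports the cp with the best cross-validated accuracy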
Finally, we create logistic regression models, tweetLogN for Negative and tweetLogP for Positive. For logistic regression we use the glm() function with family = "binomial".
Build the model for the Negative sentiment, using all independent variables as predictors:
tweetLogN <- glm(Negative ~ . , data = trainSparseNegative, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# summary(tweetLogN)
Build the model for the Positive sentiment, using all independent variables as predictors:
tweetLogP <- glm(Positive ~ . , data = trainSparsePositive, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# summary(tweetLogP)
Predict on the testing set for the negative model:
tweetLog_predict_testN <- predict(tweetLogN, type = "response", newdata = testSparseNegative)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
Predict on the testing set for the positive model:
tweetLog_predict_testP <- predict(tweetLogP, type = "response", newdata = testSparsePositive)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
Confusion matrix for negative, classifying predictions with probability greater than 0.5 as TRUE:
cmat_logRegrN <- table(testSparseNegative$Negative, tweetLog_predict_testN > 0.5)
cmat_logRegrN
##
## FALSE TRUE
## FALSE 280 20
## TRUE 21 34
accu_logRegrN <- (cmat_logRegrN[1,1] + cmat_logRegrN[2,2])/sum(cmat_logRegrN)
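Since the ROCR package was loaded at the top, we can also compute the AUC for these probability predictions; here is a sketch for the Negative model (the numeric value depends on the split, so none is shown):
predROCR <- prediction(tweetLog_predict_testN, testSparseNegative$Negative)
perfROCR <- performance(predROCR, "tpr", "fpr")
plot(perfROCR, colorize = TRUE)                       # ROC curve
as.numeric(performance(predROCR, "auc")@y.values)     # area under the curve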
Confusion matrix for positive, classifying predictions with probability greater than 0.5 as TRUE:
cmat_logRegrP <- table(testSparsePositive$Positive, tweetLog_predict_testP > 0.5)
cmat_logRegrP
##
## FALSE TRUE
## FALSE 313 22
## TRUE 4 16
accu_logRegrP <- (cmat_logRegrP[1,1] + cmat_logRegrP[2,2])/sum(cmat_logRegrP)
The overall accuracy of this logistic regression model is 0.8845 for Negative and 0.9268 for Positive sentiment.
For the Positive case this is actually worse than the baseline (0.9437), and for the Negative case it trails the random forest model, although it does beat the negative baseline.
If you were to compute the accuracy on the training set instead, you would see that the model does
really well on the training set.
This is an example of over-fitting. The model fits the training set really well, but does not
perform well on the test set.
A logistic regression model with a large number of variables is particularly at risk for overfitting.