This lecture and its data were prepared by MITx: 15.071x, The Analytics Edge.
A key question, however, is how to handle the information contained in these tweets. Humans cannot, of course, keep up with internet-scale volumes of data: there are about half a billion tweets per day. Even at a much smaller scale, the cost and time required to process tweets manually are prohibitive.
The field that addresses how computers understand text is called Natural Language Processing. Its goal is to understand and derive meaning from human language. In 1950, Alan Turing, a major computer scientist of the era, proposed a test of machine intelligence: a computer program passes it if it can take part in a real-time conversation and cannot be distinguished from a human.
Apple is a computer company known for its laptops, phones, tablets, and personal media players. While Apple has a large number of fans, it also has a large number of people who don't like its products, and it has several competitors. To better understand public perception, Apple wants to monitor how people feel over time and how people receive new announcements. Our challenge in this lecture is to see if we can correctly classify tweets as being negative, positive, or neither about Apple. To collect the data needed for this task, we had to perform two steps.
The first was to collect data about tweets from the internet. Twitter data is publicly available. And you can collect it through scraping the website or by using a special interface for programmers that Twitter provides called an API. The sender of the tweet might be useful to predict sentiment. But we’ll ignore it to keep our data anonymized. So we’ll just be using the text of the tweet. Then we need to construct the outcome variable for these tweets, which means that we have to label them as positive, negative, or neutral sentiment. We would like to label thousands of tweets. And we know that two people might disagree over the correct classification of a tweet. So to do this efficiently, one option is to use the Amazon Mechanical Turk.
It allows people to break tasks down into small components and then distribute these tasks online to be solved by people all over the world. People can sign up to perform the available tasks for a fee. As the task creator, we pay the workers a fixed amount per completed task. For example, we might pay $0.02 for a single classified tweet. The Amazon Mechanical Turk serves as a broker and takes a small cut of the money. Many of the tasks on the Mechanical Turk require human intelligence, like classifying the sentiment of a tweet. But these tasks may be time-consuming or require building otherwise unneeded capacity for the creator of the task, so it is appealing to outsource the job. The task that we put on the Amazon Mechanical Turk was to judge the sentiment expressed by the following item toward the software company Apple. The items we gave the workers were tweets that we had collected. The workers could pick from the following options as their response: strongly negative, negative, neutral, positive, and strongly positive. We represented each of these outcomes as a number on a scale from -2 to 2.
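For illustration only (the number of workers per tweet and the scores below are assumptions, not values from the dataset), the Avg value stored for a tweet is simply the mean of the individual worker scores:
# Hypothetical scores from five workers for one tweet, on the -2 to 2 scale
workerScores = c(-1, -2, -1, 0, -1)
mean(workerScores)   # gives -1, the kind of value stored in the Avg column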
# Read in the data
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
str(tweets)
## 'data.frame': 1181 obs. of 2 variables:
## $ Tweet: chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
# Create dependent variable: TRUE when the average sentiment score is -1 or below
tweets$Negative = as.factor(tweets$Avg <= -1)
table(tweets$Negative)
##
## FALSE TRUE
## 999 182
# Load the text-mining packages (if needed, install them first with install.packages("tm") and install.packages("SnowballC"))
library(tm)
## Loading required package: NLP
library(SnowballC)
A fundamental approach in text analytics is called Bag of Words. It just counts the number of times each word appears in the text and uses these counts as the independent variables.
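As a toy illustration (the sentence and variable name below are made up, not taken from the dataset), counting the words in a single sentence produces its bag-of-words representation:
# Split a sentence into words and count how often each word appears
sentence = "this course is great great great"
table(strsplit(sentence, " ")[[1]])   # "great" gets a count of 3, every other word a count of 1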
It’s used as a baseline in text analytics projects and in natural language processing. This isn’t the whole story, though: preprocessing the text can dramatically improve the performance of the Bag of Words method. One part of preprocessing is cleaning up irregularities, because text data often has many inconsistencies that will cause algorithms trouble. The preprocessing steps used to build the bag of words below are: converting the text to lowercase, removing punctuation, removing stop words (and the word "apple"), and stemming the remaining words.
# Create corpus
corpus = Corpus(VectorSource(tweets$Tweet))
# Look at corpus
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1181
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 101
# Convert to lower-case
corpus = tm_map(corpus, tolower)
corpus[[1]]
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
# IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)
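For reference only, newer versions of the tm package also let you wrap base-R functions in content_transformer, which preserves the corpus structure and avoids the conversion above. Here, corpusAlt is a name introduced purely for illustration; the rest of the lecture continues with corpus.
# Equivalent alternative for recent tm versions: lower-case the text while keeping the corpus structure
corpusAlt = tm_map(Corpus(VectorSource(tweets$Tweet)), content_transformer(tolower))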
# Remove punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 97
# Look at stop words
stopwords("english")[1:10]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
# Remove stop words and the word "apple" (it appears in nearly every tweet, so it carries little sentiment signal)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 67
# Stem document
corpus = tm_map(corpus, stemDocument)
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 61
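To see what stemming does, SnowballC (loaded above) provides wordStem, the stemmer underlying stemDocument; it collapses related word forms onto a common root:
# Related word forms are reduced to the same stem
wordStem(c("argue", "argued", "argues", "arguing"))   # each reduces to "argu"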
# Create the document-term matrix: one row per tweet, one column per term, entries are word counts
frequencies = DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
# Look at matrix
inspect(frequencies[1000:1005,505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity : 98%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
## Terms
## Docs cheapen cheaper check cheep cheer cheerio cherylcol chief
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 1 0 0 0
## Terms
## Docs chiiiiqu child children
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
# Check for sparsity: which terms appear at least 100 times?
findFreqTerms(frequencies, lowfreq=100)
## [1] "iphon" "itun" "new"
# Remove sparse terms: keep only terms appearing in at least 0.5% of the tweets (about 6 tweets)
sparse = removeSparseTerms(frequencies, 0.995)
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)
# Convert to a data frame
tweetsSparse = as.data.frame(as.matrix(sparse))
# Make all variable names R-friendly (some terms start with a number, which R does not allow)
colnames(tweetsSparse) = make.names(colnames(tweetsSparse))
# Add dependent variable
tweetsSparse$Negative = tweets$Negative
# Split the data into training (70%) and testing (30%) sets
library(caTools)
set.seed(123)
split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)
# Build a CART model
library(rpart)
library(rpart.plot)
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")
prp(tweetCART)
# Evaluate the performance of the model
predictCART = predict(tweetCART, newdata=testSparse, type="class")
table(testSparse$Negative, predictCART)
## predictCART
## FALSE TRUE
## FALSE 294 6
## TRUE 37 18
# Compute accuracy
(294+18)/(294+6+37+18)
## [1] 0.8788732
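Equivalently (a small convenience step, not part of the lecture's code), the accuracy can be computed directly from the confusion matrix:
# Accuracy = correct predictions on the diagonal / all predictions
confMat = table(testSparse$Negative, predictCART)
sum(diag(confMat)) / sum(confMat)   # 0.8788732, matching the manual calculation above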
# Baseline accuracy (always predicting the most common outcome, non-negative)
table(testSparse$Negative)
##
## FALSE TRUE
## 300 55
300/(300+55)
## [1] 0.8450704
# Random forest model
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
tweetRF = randomForest(Negative ~ ., data=trainSparse)
# Make predictions on the test set
predictRF = predict(tweetRF, newdata=testSparse)
table(testSparse$Negative, predictRF)
## predictRF
## FALSE TRUE
## FALSE 293 7
## TRUE 34 21
# Accuracy:
(293+21)/(293+7+34+21)
## [1] 0.884507
Over 7,000 research articles have been written on the topic. Hundreds of start-ups are developing sentiment analysis solutions. Many websites perform real-time analysis of tweets. For example, “tweetfeel” shows trends given any term, and “The Stock Sonar” shows sentiment and stock prices.
Sentiment analysis is a particular application of text analytics. In general, the critical aspect of text analytics is selecting the specific features that are relevant to a particular application. It is also important to apply domain-specific knowledge, which often leads to better results: for example, using the meaning of symbols, or including features like the number of words.
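As a hedged sketch of that idea (WordCount is a variable name introduced here and is not part of the lecture's pipeline), one such extra feature could be added alongside the word counts before splitting the data:
# Add the number of words in each tweet as an additional independent variable
tweetsSparse$WordCount = sapply(strsplit(tweets$Tweet, "\\s+"), length)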
As we have seen, sentiment analysis can replace more labor-intensive methods like polling. Text analytics can also deal with the massive amounts of unstructured data being generated on the internet. Computers are becoming more and more capable of interacting with humans and performing human tasks.