This lecture and its data were prepared by MITx: 15.071x, The Analytics Edge.
A key question, however, is how to handle the information contained in these tweets. Humans cannot, of course, keep up with internet-scale volumes of data: there are about half a billion tweets per day. Even at a much smaller scale, the cost and time required to process tweets manually are prohibitive.
The field that addresses how computers understand text is called Natural Language Processing. Its goal is to understand and derive meaning from human language. In 1950, Alan Turing, a major computer scientist of the era, proposed a test of machine intelligence: a computer program passes it if it can take part in a real-time conversation and cannot be distinguished from a human.
Apple is a computer company known for its laptops, phones, tablets, and personal media players. While Apple has a large number of fans, it also has a large number of people who don't like its products, and it has several competitors. To better understand public perception, Apple wants to monitor how people feel over time and how people receive new announcements. Our challenge in this lecture is to see if we can correctly classify tweets as being negative, positive, or neither about Apple. To collect the data needed for this task, we had to perform two steps.
The first was to collect data about tweets from the internet. Twitter data is publicly available. And you can collect it through scraping the website or by using a special interface for programmers that Twitter provides called an API. The sender of the tweet might be useful to predict sentiment. But we’ll ignore it to keep our data anonymized. So we’ll just be using the text of the tweet. Then we need to construct the outcome variable for these tweets, which means that we have to label them as positive, negative, or neutral sentiment. We would like to label thousands of tweets. And we know that two people might disagree over the correct classification of a tweet. So to do this efficiently, one option is to use the Amazon Mechanical Turk.
It allows people to break tasks down into small components and then distribute these tasks online to be solved by people all over the world. People can sign up to perform the available tasks for a fee. As the task creator, we pay the workers a fixed amount per completed task. For example, we might pay $0.02 for a single classified tweet. The Amazon Mechanical Turk serves as a broker and takes a small cut of the money. Many of the tasks on the Mechanical Turk require human intelligence, like classifying the sentiment of a tweet. But these tasks may be time-consuming or require building otherwise unneeded capacity for the creator of the task, so it is appealing to outsource the job. The task that we put on the Amazon Mechanical Turk was to judge the sentiment expressed by the following item toward the software company Apple. The items we gave the workers were tweets that we had collected. The workers could pick from the following options as their response: strongly negative, negative, neutral, positive, and strongly positive. We represented each of these outcomes as a number on a scale from -2 to 2.
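For illustration only (the number of workers per tweet and the scores below are assumptions, not values from the dataset), the Avg value stored for a tweet is simply the mean of the individual worker scores:
# Hypothetical scores from five workers for one tweet, on the -2 to 2 scale
workerScores = c(-1, -2, -1, 0, -1)
mean(workerScores)   # gives -1, the kind of value stored in the Avg column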
# Read in the data
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
str(tweets)
## 'data.frame': 1181 obs. of 2 variables:
## $ Tweet: chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
# Create dependent variable: TRUE when the average sentiment score is -1 or below
tweets$Negative = as.factor(tweets$Avg <= -1)
table(tweets$Negative)
##
## FALSE TRUE
## 999 182
# Load the text-mining packages (if needed, install them first with install.packages("tm") and install.packages("SnowballC"))
library(tm)
## Loading required package: NLP
library(SnowballC)
A fundamental approach in text analytics is called Bag of Words. It just counts the number of times each word appears in the text and uses these counts as the independent variables.
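As a toy illustration (the sentence and variable name below are made up, not taken from the dataset), counting the words in a single sentence produces its bag-of-words representation:
# Split a sentence into words and count how often each word appears
sentence = "this course is great great great"
table(strsplit(sentence, " ")[[1]])   # "great" gets a count of 3, every other word a count of 1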
It’s used as a baseline in text analytics projects and in natural language processing. This isn’t the whole story, though: preprocessing the text can dramatically improve the performance of the Bag of Words method. One part of preprocessing is cleaning up irregularities, because text data often has many inconsistencies that will cause algorithms trouble. The preprocessing steps used to build the bag of words below are: converting the text to lowercase, removing punctuation, removing stop words (and the word "apple"), and stemming the remaining words.
# Create corpus
corpus = Corpus(VectorSource(tweets$Tweet))
# Look at corpus
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1181
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 101
# Convert to lower-case
corpus = tm_map(corpus, tolower)
corpus[[1]]
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
# IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)
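For reference only, newer versions of the tm package also let you wrap base-R functions in content_transformer, which preserves the corpus structure and avoids the conversion above. Here, corpusAlt is a name introduced purely for illustration; the rest of the lecture continues with corpus.
# Equivalent alternative for recent tm versions: lower-case the text while keeping the corpus structure
corpusAlt = tm_map(Corpus(VectorSource(tweets$Tweet)), content_transformer(tolower))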
# Remove punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 97
# Look at stop words
stopwords("english")[1:10]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
# Remove stop words and the word "apple" (it appears in nearly every tweet, so it carries little sentiment signal)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 67
# Stem document
corpus = tm_map(corpus, stemDocument)
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 61
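To see what stemming does, SnowballC (loaded above) provides wordStem, the stemmer underlying stemDocument; it collapses related word forms onto a common root:
# Related word forms are reduced to the same stem
wordStem(c("argue", "argued", "argues", "arguing"))   # each reduces to "argu"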
# Create the document-term matrix: one row per tweet, one column per term, entries are word counts
frequencies = DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
# Look at matrix
inspect(frequencies[1000:1005,505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity : 98%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
## Terms
## Docs cheapen cheaper check cheep cheer cheerio cherylcol chief
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 1 0 0 0
## Terms
## Docs chiiiiqu child children
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
# Check for sparsity: which terms appear at least 100 times?
findFreqTerms(frequencies, lowfreq=100)
## [1] "iphon" "itun" "new"
# Remove sparse terms: keep only terms appearing in at least 0.5% of the tweets (about 6 tweets)
sparse = removeSparseTerms(frequencies, 0.995)
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)
# Convert to a data frame
tweetsSparse = as.data.frame(as.matrix(sparse))
# Make all variable names R-friendly (some terms start with a number, which R does not allow)
colnames(tweetsSparse) = make.names(colnames(tweetsSparse))
# Add dependent variable
tweetsSparse$Negative = tweets$Negative
# Split the data into training (70%) and testing (30%) sets
library(caTools)
set.seed(123)
split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)
# Build a CART model
library(rpart)
library(rpart.plot)
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")
prp(tweetCART)
# Evaluate the performance of the model
predictCART = predict(tweetCART, newdata=testSparse, type="class")
table(testSparse$Negative, predictCART)
## predictCART
## FALSE TRUE
## FALSE 294 6
## TRUE 37 18
# Compute accuracy
(294+18)/(294+6+37+18)
## [1] 0.8788732
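Equivalently (a small convenience step, not part of the lecture's code), the accuracy can be computed directly from the confusion matrix:
# Accuracy = correct predictions on the diagonal / all predictions
confMat = table(testSparse$Negative, predictCART)
sum(diag(confMat)) / sum(confMat)   # 0.8788732, matching the manual calculation above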
# Baseline accuracy (always predicting the most common outcome, non-negative)
table(testSparse$Negative)
##
## FALSE TRUE
## 300 55
300/(300+55)
## [1] 0.8450704
# Random forest model
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
tweetRF = randomForest(Negative ~ ., data=trainSparse)
# Make predictions on the test set
predictRF = predict(tweetRF, newdata=testSparse)
table(testSparse$Negative, predictRF)
## predictRF
## FALSE TRUE
## FALSE 293 7
## TRUE 34 21
# Accuracy:
(293+21)/(293+7+34+21)
## [1] 0.884507
Over 7,000 research articles have been written on the topic. Hundreds of start-ups are developing sentiment analysis solutions. Many websites perform real-time analysis of tweets. For example, “tweetfeel” shows trends given any term, and “The Stock Sonar” shows sentiment and stock prices.
Sentiment analysis is a particular application of text analytics. In general, the critical aspect of text analytics is selecting the specific features that are relevant to a particular application. It is also important to apply domain-specific knowledge, which often leads to better results: for example, using the meaning of symbols, or including features like the number of words.
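As a hedged sketch of that idea (WordCount is a variable name introduced here and is not part of the lecture's pipeline), one such extra feature could be added alongside the word counts before splitting the data:
# Add the number of words in each tweet as an additional independent variable
tweetsSparse$WordCount = sapply(strsplit(tweets$Tweet, "\\s+"), length)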
As we have seen, sentiment analysis can replace more labor-intensive methods like polling. Text analytics can also deal with the massive amounts of unstructured data being generated on the internet. Computers are becoming more and more capable of interacting with humans and performing human tasks.