This data and lecture material was prepared by MITx: 15.071x The Analytics Edge.

Why do people care about textual data?

A key question is how to handle the information contained in these tweets. Humans cannot, of course, keep up with internet-scale volumes of data: there are about half a billion tweets per day. Even at a much smaller scale, the cost and time required to process text manually are prohibitive.

The field that addresses how computers understand text is called Natural Language Processing. Its goal is to understand and derive meaning from human language. In 1950, Alan Turing, a major computer scientist of the era, proposed a test of machine intelligence: a computer program passes the test if it can take part in a real-time conversation and cannot be distinguished from a human.

Understanding the sentiment of tweets about the company Apple

Apple is a computer company known for its laptops, phones, tablets, and personal media players. While Apple has a large number of fans, it also has a large number of people who don’t like its products, as well as several competitors. To better understand public perception, Apple wants to monitor how people feel over time and how people receive new announcements. Our challenge in this lecture is to see if we can correctly classify tweets about Apple as negative, positive, or neither. To collect the data needed for this task, we had to perform two steps.

The first step was to collect tweets from the internet. Twitter data is publicly available, and you can collect it by scraping the website or by using a special interface for programmers that Twitter provides, called an API. The sender of the tweet might be useful for predicting sentiment, but we’ll ignore it to keep our data anonymized, so we’ll just use the text of the tweet. The second step was to construct the outcome variable for these tweets, which means that we had to label them as positive, negative, or neutral in sentiment. We would like to label thousands of tweets, and we know that two people might disagree over the correct classification of a tweet. To do this efficiently, one option is to use Amazon Mechanical Turk.

What is the Amazon Mechanical Turk?

It allows people to break tasks down into small components and distribute these tasks online to be solved by people all over the world. People can sign up to perform the available tasks for a fee. As the task creator, we pay the workers a fixed amount per completed task; for example, we might pay $0.02 for a single classified tweet. Amazon Mechanical Turk serves as a broker and takes a small cut of the money. Many of the tasks on the Mechanical Turk require human intelligence, like classifying the sentiment of a tweet, but these tasks may be time consuming or require building otherwise unneeded capacity for the creator of the task, so it’s appealing to outsource the job. The task that we put on Amazon Mechanical Turk was to judge the sentiment expressed by the following item toward the software company Apple. The items we gave the workers were tweets that we had collected. The workers could pick from the following options as their response: strongly negative, negative, neutral, positive, and strongly positive. We represented each of these outcomes as a number on a scale from -2 to 2.

# Read in the data

tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)

str(tweets)
## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
# Create dependent variable

tweets$Negative = as.factor(tweets$Avg <= -1)

table(tweets$Negative)
## 
## FALSE  TRUE 
##   999   182
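
# The dependent variable is TRUE when the average Mechanical Turk score is -1 or below.
# From the table above, only about 15% of the tweets are labeled negative, so a model
# that always predicts "not negative" is already roughly 85% accurate.
182/(999+182)   # about 0.154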
# Install new packages (only needed once)
# install.packages("tm")
# install.packages("SnowballC")

# Load the text mining packages
library(tm)
## Loading required package: NLP
library(SnowballC)

Pre-Processing in R

A simple but effective approach, called Bag of Words, just counts the number of times each word appears in the text and uses these counts as the independent variables.

It’s used as a baseline in text analytics projects and in natural language processing. This isn’t the whole story, though: preprocessing the text can dramatically improve the performance of the Bag of Words method. One part of preprocessing is to clean up irregularities, since text data often has many inconsistencies that will cause algorithms trouble.
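
As a minimal illustration of the idea (using a made-up sentence rather than the tweet data), the counting step can be sketched in base R:

toy = "the phone is great and the screen is great"
words = strsplit(tolower(toy), " ")[[1]]   # split the sentence into words
table(words)                               # the word counts become the independent variables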

These are the steps for pre-processing the text in the Bag of Words approach:

  • Change all the letters to lowercase
  • Remove punctuation
  • Remove unhelpful terms (here, the word "apple" itself)
  • Remove stop words
  • Stem the words

    # Create corpus
    
    corpus = Corpus(VectorSource(tweets$Tweet))
    
    # Look at corpus
    corpus
    ## <<VCorpus>>
    ## Metadata:  corpus specific: 0, document level (indexed): 0
    ## Content:  documents: 1181
    corpus[[1]]
    ## <<PlainTextDocument>>
    ## Metadata:  7
    ## Content:  chars: 101
    # Convert to lower-case
    
    corpus = tm_map(corpus, tolower)
    
    corpus[[1]]
    ## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
    # IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
    
    corpus = tm_map(corpus, PlainTextDocument)
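
    # NOTE: in newer versions of tm, an equivalent alternative to the two steps above is
    # to wrap tolower in content_transformer(), which keeps each document as a
    # PlainTextDocument so that no extra conversion is needed:
    # corpus = tm_map(corpus, content_transformer(tolower))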
    
    
    # Remove punctuation
    
    corpus = tm_map(corpus, removePunctuation)
    
    corpus[[1]]
    ## <<PlainTextDocument>>
    ## Metadata:  7
    ## Content:  chars: 97
    # Look at stop words 
    stopwords("english")[1:10]
    ##  [1] "i"         "me"        "my"        "myself"    "we"       
    ##  [6] "our"       "ours"      "ourselves" "you"       "your"
    # Remove stopwords and apple
    
    corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
    
    corpus[[1]]
    ## <<PlainTextDocument>>
    ## Metadata:  7
    ## Content:  chars: 67
    # Stem document 
    
    corpus = tm_map(corpus, stemDocument)
    
    corpus[[1]]
    ## <<PlainTextDocument>>
    ## Metadata:  7
    ## Content:  chars: 61
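
    # The summaries above only show metadata such as the character count. One way to see
    # the processed text itself (in recent versions of tm) is to coerce the document to
    # character (output not shown):
    as.character(corpus[[1]])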

    Bag of Words in R

    # Create matrix
    
    frequencies = DocumentTermMatrix(corpus)
    
    frequencies
    ## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
    ## Non-/sparse entries: 8980/3875329
    ## Sparsity           : 100%
    ## Maximal term length: 115
    ## Weighting          : term frequency (tf)
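
    # Quick sanity check on the summary above: the matrix has 1181 x 3289 cells, and
    # only 8980 of them are non-zero, which is why the sparsity rounds to 100%.
    1181 * 3289            # 3884309 cells in total
    8980 / (1181 * 3289)   # about 0.0023, i.e. roughly 0.2% of entries are non-zero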
    # Look at matrix 
    
    inspect(frequencies[1000:1005,505:515])
    ## <<DocumentTermMatrix (documents: 6, terms: 11)>>
    ## Non-/sparse entries: 1/65
    ## Sparsity           : 98%
    ## Maximal term length: 9
    ## Weighting          : term frequency (tf)
    ## 
    ##               Terms
    ## Docs           cheapen cheaper check cheep cheer cheerio cherylcol chief
    ##   character(0)       0       0     0     0     0       0         0     0
    ##   character(0)       0       0     0     0     0       0         0     0
    ##   character(0)       0       0     0     0     0       0         0     0
    ##   character(0)       0       0     0     0     0       0         0     0
    ##   character(0)       0       0     0     0     0       0         0     0
    ##   character(0)       0       0     0     0     1       0         0     0
    ##               Terms
    ## Docs           chiiiiqu child children
    ##   character(0)        0     0        0
    ##   character(0)        0     0        0
    ##   character(0)        0     0        0
    ##   character(0)        0     0        0
    ##   character(0)        0     0        0
    ##   character(0)        0     0        0
    # Check for sparsity
    
    findFreqTerms(frequencies, lowfreq=100)
    ## [1] "iphon" "itun"  "new"
    # Remove sparse terms
    
    sparse = removeSparseTerms(frequencies, 0.995)
    sparse
    ## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
    ## Non-/sparse entries: 4669/360260
    ## Sparsity           : 99%
    ## Maximal term length: 20
    ## Weighting          : term frequency (tf)
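
    # The 0.995 threshold keeps only terms that appear in at least 0.5% of the tweets,
    # i.e. in roughly 6 of the 1181 documents; this cuts the 3289 terms down to 309.
    0.005 * 1181   # about 5.9 tweets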
    # Convert to a data frame
    
    tweetsSparse = as.data.frame(as.matrix(sparse))
    
    # Make all variable names R-friendly
    
    colnames(tweetsSparse) = make.names(colnames(tweetsSparse))
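
    # make.names is needed because some terms are not syntactically valid R variable
    # names, for example terms starting with a digit or containing punctuation:
    make.names(c("7even", "e-mail"))   # becomes "X7even" "e.mail"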
    
    # Add dependent variable
    
    tweetsSparse$Negative = tweets$Negative
    
    # Split the data
    
    library(caTools)
    
    set.seed(123)
    
    split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
    
    trainSparse = subset(tweetsSparse, split==TRUE)
    testSparse = subset(tweetsSparse, split==FALSE)
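
    # sample.split stratifies on the outcome variable, so the proportion of negative
    # tweets should be roughly the same in the training and testing sets (output not shown):
    prop.table(table(trainSparse$Negative))
    prop.table(table(testSparse$Negative))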

    Predicting Sentiment - CART model

    # Build a CART model
    
    library(rpart)
    library(rpart.plot)
    
    tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")
    
    prp(tweetCART)
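
    # Optional: instead of rpart's default complexity parameter, cp could be chosen by
    # 10-fold cross-validation with the caret package (a sketch only; the cp grid below
    # is illustrative, and caret must be installed):
    # library(caret)
    # set.seed(123)
    # tweetCARTcv = train(Negative ~ ., data=trainSparse, method="rpart",
    #                     trControl=trainControl(method="cv", number=10),
    #                     tuneGrid=expand.grid(cp=seq(0.002, 0.05, 0.002)))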

    # Evaluate the performance of the model
    predictCART = predict(tweetCART, newdata=testSparse, type="class")
    
    table(testSparse$Negative, predictCART)
    ##        predictCART
    ##         FALSE TRUE
    ##   FALSE   294    6
    ##   TRUE     37   18
    # Compute accuracy
    
    (294+18)/(294+6+37+18)
    ## [1] 0.8788732
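
    # The same confusion matrix also gives the CART model's sensitivity and specificity
    # (treating Negative = TRUE as the "positive" class):
    18/(18+37)    # sensitivity: only about a third of truly negative tweets are caught
    294/(294+6)   # specificity: 98% of non-negative tweets are classified correctly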
    # Baseline accuracy 
    
    table(testSparse$Negative)
    ## 
    ## FALSE  TRUE 
    ##   300    55
    300/(300+55)
    ## [1] 0.8450704

    Predicting Sentiment - Random forest model

    # Random forest model
    
    library(randomForest)
    ## randomForest 4.6-12
    ## Type rfNews() to see new features/changes/bug fixes.
    set.seed(123)
    
    tweetRF = randomForest(Negative ~ ., data=trainSparse)
    
    # Make predictions on the test set
    predictRF = predict(tweetRF, newdata=testSparse)
    
    table(testSparse$Negative, predictRF)
    ##        predictRF
    ##         FALSE TRUE
    ##   FALSE   293    7
    ##   TRUE     34   21
    # Accuracy:
    (293+21)/(293+7+34+21)
    ## [1] 0.884507
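
    # The random forest is slightly more accurate than both the CART model (0.879) and
    # the simple baseline (0.845). To see which words the forest relies on most, a
    # variable importance plot can help (plot not shown):
    varImpPlot(tweetRF)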

    Over 7,000 research articles have been written on the topic. Hundreds of start-ups are developing sentiment analysis solutions. Many websites perform real-time analysis of tweets. For example, “tweetfeel” shows trends given any term, and “The Stock Sonar” shows sentiment and stock prices.

    Sentiment analysis

    Sentiment analysis is a particular application of text analytics. In general, the critical aspect of text analytics is to select the specific features that are relevant in a particular application. In addition, it’s important to apply problem-specific knowledge, which often leads to better results; for example, we can use the meaning of special symbols or include features such as the number of words in a tweet.

    The sentiment analysis we have seen can replace more labor-intensive methods like polling. Text analytics can also deal with the massive amounts of unstructured data being generated on the internet. Computers are becoming more and more capable of interacting with humans and performing human tasks.