TURNING TWEETS INTO KNOWLEDGE

Twitter

  • Twitter is a social networking and communication website founded in 2006

  • Users share and send messages that can be no longer than 140 characters

  • One of the Top 10 most-visited sites on the internet

Impact of Twitter

  • Use by protestors across the world

  • Natural disaster notification, tracking of diseases

  • Companies will invest more than $120 billion by 2015 in analytics, hardware, software, and services

  • Celebrities, politicians, and companies connect with fans and customers

Understanding People

  • Many companies maintain online presences

  • Managing public perception in the age of instant communication is essential

  • How can we use analytics to address this?

Using Text as Data

  • Until now, our data has typically been
    • Structured
    • Numerical
    • Categorical
  • Tweets are
    • Loosely structured
    • Textual
    • Poor spelling, non-traditional grammar
    • Multilingual

Text Analytics

  • We have discussed why people care about textual data, but how do we handle it?

  • Humans can’t keep up with Internet-scale volumes of data

How Can Computers Help?

  • Computers need to understand text

  • This field is called Natural Language Processing

Why is it Hard?

  • Computers need to understand text

  • Ambiguity:
    • “I put my bag in the car. It is large and blue”
    • “It” = bag? “It” = car?

Creating the Dataset

  • Twitter data is publicly available
    • Scrape the website, or
    • Use a special interface for programmers (API)
    • The sender of a tweet may be useful, but we will ignore it
  • Need to construct the outcome variable for tweets
    • Thousands of tweets
    • Two people may disagree over the correct classification

A Bag of Words

  • Fully understanding text is difficult

  • Simpler approach

    • Count the number of times each word appears
  • “This course is great. I would recommend this course to my friends.” (counts shown in the table below, with a short R sketch after it)

WORD      NUMBER OF TIMES
THIS      2
COURSE    2
GREAT     1
WOULD     1
FRIENDS   1
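
As a quick check, the same counts can be reproduced in R (a minimal sketch; the sentence, lowercasing, and punctuation removal here are just for illustration):

# Minimal bag-of-words sketch for the example sentence above
sentence = "This course is great. I would recommend this course to my friends."
words = unlist(strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+"))
# Count how many times each word appears
table(words)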

A Simple but Effective Approach

  • One feature for each word - a simple approach, but effective

  • Used as a baseline in text analytics projects and natural language processing

  • Not the whole story though - preprocessing can dramatically improve performance!

Text Analytics in General

  • Selecting the specific features that are relevant in the application

  • Applying problem-specific knowledge can get better results
    • Meaning of symbols
    • Features like the number of words (see the short sketch below)
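
For example, a word-count feature can be computed directly from the raw text (an illustrative sketch; the example documents and the wordCounts name are chosen here only for illustration):

# Illustrative: number of words per document as an extra feature
docs = c("I love my new iPhone", "Terrible customer service")
wordCounts = sapply(strsplit(docs, "\\s+"), length)
wordCounts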

The Analytics Edge

  • Analytical sentiment analysis can replace more labor-intensive methods like polling

  • Text analytics can deal with the massive amounts of unstructured data being generated on the internet

  • Computers are becoming more and more capable of interacting with humans and performing human tasks

Preprocessing in R

Read in the data

# Load in the data
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
# Examine the structure of the data frame
str(tweets)
## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...

Create dependent variable

# Create the dependent variable
tweets$Negative = as.factor(tweets$Avg <= -1)
# Tabulate the negative tweets (kable() from the knitr package prints the table)
library(knitr)
z = table(tweets$Negative)
kable(z)
Var1 Freq
FALSE 999
TRUE 182

Install and load new packages

# Install (first time only) with install.packages("tm") and install.packages("SnowballC"), then load
library(tm)
library(SnowballC)

Create corpus

# Create Corpus
corpus = VCorpus(VectorSource(tweets$Tweet)) 

# Look at corpus
corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1181
corpus[[1]]$content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"

Convert to lower-case

# Convert to lower-case
corpus = tm_map(corpus, content_transformer(tolower))

corpus[[1]]$content
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"

Remove punctuation

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

corpus[[1]]$content
## [1] "i have to say apple has by far the best customer care service i have ever received apple appstore"

Look at stop words

# Examine stop words
stopwords("english")[1:10]
##  [1] "i"         "me"        "my"        "myself"    "we"        "our"       "ours"      "ourselves" "you"       "your"

Remove stopwords and apple

# Remove stop words, plus "apple" (every tweet mentions it, so it does not help distinguish sentiment)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))

corpus[[1]]$content
## [1] "   say    far  best customer care service   ever received  appstore"

Stem document

# Stem document
corpus = tm_map(corpus, stemDocument)

corpus[[1]]$content
## [1] "say far best custom care servic ever receiv appstor"

Create matrix

# Create the document-term matrix
frequencies = DocumentTermMatrix(corpus)

frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)

Look at matrix

# Look at the matrix
inspect(frequencies[1000:1005,505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   cheapen cheaper check cheep cheer cheerio cherylcol chief chiiiiqu child
##   1000       0       0     0     0     0       0         0     0        0     0
##   1001       0       0     0     0     0       0         0     0        0     0
##   1002       0       0     0     0     0       0         0     0        0     0
##   1003       0       0     0     0     0       0         0     0        0     0
##   1004       0       0     0     0     0       0         0     0        0     0
##   1005       0       0     0     0     1       0         0     0        0     0

Check for sparsity

# Look at terms that appear at least 20 times
findFreqTerms(frequencies, lowfreq=20)
##  [1] "android"              "anyon"                "app"                  "appl"                 "back"                 "batteri"              "better"               "buy"                 
##  [9] "can"                  "cant"                 "come"                 "dont"                 "fingerprint"          "freak"                "get"                  "googl"               
## [17] "ios7"                 "ipad"                 "iphon"                "iphone5"              "iphone5c"             "ipod"                 "ipodplayerpromo"      "itun"                
## [25] "just"                 "like"                 "lol"                  "look"                 "love"                 "make"                 "market"               "microsoft"           
## [33] "need"                 "new"                  "now"                  "one"                  "phone"                "pleas"                "promo"                "promoipodplayerpromo"
## [41] "realli"               "releas"               "samsung"              "say"                  "store"                "thank"                "think"                "time"                
## [49] "twitter"              "updat"                "use"                  "via"                  "want"                 "well"                 "will"                 "work"

Remove sparse terms

# Remove sparse terms (keep only terms that appear in at least 0.5% of tweets)
sparse = removeSparseTerms(frequencies, 0.995)
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)

Convert to a data frame

# Convert to a data frame
tweetsSparse = as.data.frame(as.matrix(sparse))

Make all variable names R-friendly

# Make all variable names R-friendly
colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

Add dependent variable

# Add dependent variable
tweetsSparse$Negative = tweets$Negative

Split the data

# Split the data into training and test sets
library(caTools)

set.seed(123)

split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)

Build a CART model

# Build CART Model
library(rpart)
library(rpart.plot)
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")
# Plot the CART Tree
prp(tweetCART)

Evaluate the performance of the model

# Make predictions using the CART Model
predictCART = predict(tweetCART, newdata=testSparse, type="class")
# Tabulate the negative tweets in the testing set vs the predictions
a = table(testSparse$Negative, predictCART)
kable(a)
        FALSE   TRUE
FALSE     294      6
TRUE       37     18

Compute accuracy

# Compute Accuracy
sum(diag(a))/(sum(a))
## [1] 0.8788732
(294+18)/(294+6+37+18)
## [1] 0.8788732

Baseline accuracy

# Tabulate baseline accuracy
a = table(testSparse$Negative)
kable(a)
Var1 Freq
FALSE 300
TRUE 55
a[1]/(sum(a))
##     FALSE 
## 0.8450704
300/(300+55)
## [1] 0.8450704

Random forest model

# Implement Random Forest model
library(randomForest)
set.seed(123)
tweetRF = randomForest(Negative ~ ., data=trainSparse)

Make predictions:

# Make predictions using the RF model
predictRF = predict(tweetRF, newdata=testSparse)
# Tabulate the negative tweets in the testing set vs the predictions
table(testSparse$Negative, predictRF)
##        predictRF
##         FALSE TRUE
##   FALSE   293    7
##   TRUE     34   21
a = table(testSparse$Negative, predictRF)
kable(a)
        FALSE   TRUE
FALSE     293      7
TRUE       34     21

Random Forest Accuracy

# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.884507
(293+21)/(293+7+34+21)
## [1] 0.884507
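
To recap, the accuracies reported above can be collected side by side (a small summary sketch using the counts from this section):

# Recap: baseline vs. CART vs. random forest accuracy on the test set
data.frame(model = c("Baseline", "CART", "Random forest"),
           accuracy = c(300/355, (294+18)/355, (293+21)/355))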

Man vs Machine - IBM Watson

A Grand Challenge

  • In 2004, IBM Vice President Charles Lickel and co-workers were having dinner at a restaurant

  • All of a sudden, the restaurant fell silent

  • Everyone was watching the game show Jeopardy! on the television in the bar

  • A contestant, Ken Jennings, was setting the record for the longest winning streak of all time (75 days)

  • Why was everyone so interested?
    • Jeopardy! is a quiz show that asks complex and clever questions (puns, obscure facts, uncommon words)
    • Originally aired in 1964
    • A huge variety of topics
    • Generally viewed as an impressive feat to do well

The Challenge Begins

  • In 2005, a team at IBM Research started creating a computer that could compete at Jeopardy!

  • Six years later, a two-game exhibition match aired on television
    • The winner would receive $1,000,000

The Contestants

  • Ken Jennings
    • Longest winning streak of 75 days
  • Brad Rutter
    • Biggest money winner of over $3.5 million
  • Watson
    • A supercomputer with 3,000 processors and a database of 200 million pages of information

The Game of Jeopardy!

  • Three rounds per game
    • Jeopardy
    • Double Jeopardy (dollar values doubled)
    • Final Jeopardy (wager on response to one question)
  • Each round has five questions in six categories
    • Wide variety of topics (over 2,500 different categories)
  • Each question has a dollar value - the first to buzz in and answer correctly wins the money
    • If they answer incorrectly they lose the money

Example Round (game board shown as an image in the original slides)

Jeopardy! Questions

  • Cryptic definitions of categories and clues

  • Answer in the form of a question
    • Q: Mozart’s last and perhaps most powerful symphony shares its name with this planet.
    • A: What is Jupiter?
    • Q: Smaller than only Greenland, it’s the world’s second largest island.
    • A: What is New Guinea?

Why is Jeopardy Hard?

  • Wide variety of categories, purposely made cryptic

  • Computers can easily answer precise questions

  • Understanding natural language is hard
    • Where was Albert Einstein born?
    • Suppose you have the following information:
      • “One day, from his city views of Ulm, Otto chose a water color to send to Albert Einstein as a remembrance of his birthplace.”

Using Analytics

  • Watson received each question in text form
    • Normally, players see and hear the questions
  • IBM used analytics to make Watson a competitive player

  • Used over 100 different techniques for analyzing natural language, finding hypotheses, and ranking hypotheses

Watson’s Database and Tools

  • A massive number of data sources
    • Encyclopedias, texts, manuals, magazines, Wikipedia, etc.
  • Lexicon
    • Describes the relationship between different words
    • Ex: “Water” is a “clear liquid” but not all “clear liquids” are “water”
  • Part of speech tagger and parser
    • Identifies functions of words in text
    • Ex: “Race” can be a verb or a noun
    • He won the race by 10 seconds.
    • Please indicate your race.

How Watson Works

  • Step 1: Question Analysis
    • Figure out what the question is looking for
  • Step 2: Hypothesis Generation
    • Search information sources for possible answers
  • Step 3: Scoring Hypotheses
    • Compute confidence levels for each answer
  • Step 4: Final Ranking
    • Look for a highly supported answer

Progress from 2006 to 2010

IBM Watson’s progress from 2006 to 2010 (performance chart shown as an image in the original slides)

The Results

Game     Ken Jennings   Brad Rutter   Watson
Game 1   $4,800         $10,400       $35,734
Game 2   $19,200        $11,200       $41,413
Total    $24,000        $21,600       $77,147

The Analytics Edge

  • Combine many algorithms to increase accuracy and confidence
    • Any one algorithm wouldn’t have worked
  • Approach the problem in a different way than how a human does
    • Hypothesis generation
  • Deal with massive amounts of data, often in unstructured form
    • 90% of data is unstructured