TURNING TWEETS INTO KNOWLEDGE

Twitter

  • Twitter is a social networking and communication website founded in 2006

  • Users share and send messages that can be no longer than 140 characters

  • One of the Top 10 most-visited sites on the internet

Impact of Twitter

  • Use by protestors across the world

  • Natural disaster notification, tracking of diseases

  • Companies will invest more than $120 billion by 2015 in analytics, hardware, software, and services

  • Celebrities, politicians, and companies connect with fans and customers

Understanding People

  • Many companies maintain online presences

  • Managing public perception in the age of instant communication is essential

  • How can we use analytics to address this?

Using Text as Data

  • Until now, our data has typically been
    • Structured
    • Numerical
    • Categorical
  • Tweets are
    • Loosely structured
    • Textual
    • Poor spelling, non-traditional grammar
    • Multilingual

Text Analytics

  • We have discussed why people care about textual data, but how do we handle it?

  • Humans can’t keep up with Internet-scale volumes of data

How Can Computers Help?

  • Computers need to understand text

  • This field is called Natural Language Processing

Why is it Hard?

  • Computers need to understand text

  • Ambiguity:
    • “I put my bag in the car. It is large and blue”
    • “It” = bag? “It” = car?

Creating the Dataset

  • Twitter data is publicly available
    • Scrape the website, or
    • Use a special interface for programmers (API)
    • The sender of a tweet may be useful, but we will ignore it
  • Need to construct the outcome variable for tweets
    • Thousands of tweets
    • Two people may disagree over the correct classification

A Bag of Words

  • Fully understanding text is difficult

  • Simpler approach

    • Count the number of times each word appears
  • “This course is great. I would recommend this course to my friends.” (counts shown in the table below, with a short R sketch after it)

WORD      NUMBER OF TIMES
THIS      2
COURSE    2
GREAT     1
WOULD     1
FRIENDS   1
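
As a quick check, the same counts can be reproduced in R (a minimal sketch; the sentence, lowercasing, and punctuation removal here are just for illustration):

# Minimal bag-of-words sketch for the example sentence above
sentence = "This course is great. I would recommend this course to my friends."
words = unlist(strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+"))
# Count how many times each word appears
table(words)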

A Simple but Effective Approach

  • One feature for each word - a simple approach, but effective

  • Used as a baseline in text analytics projects and natural language processing

  • Not the whole story though - preprocessing can dramatically improve performance!

Text Analytics in General

  • Selecting the specific features that are relevant in the application

  • Applying problem-specific knowledge can get better results
    • Meaning of symbols
    • Features like the number of words (see the short sketch below)
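
For example, a word-count feature can be computed directly from the raw text (an illustrative sketch; the example documents and the wordCounts name are chosen here only for illustration):

# Illustrative: number of words per document as an extra feature
docs = c("I love my new iPhone", "Terrible customer service")
wordCounts = sapply(strsplit(docs, "\\s+"), length)
wordCounts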

The Analytics Edge

  • Analytical sentiment analysis can replace more labor-intensive methods like polling

  • Text analytics can deal with the massive amounts of unstructured data being generated on the internet

  • Computers are becoming more and more capable of interacting with humans and performing human tasks

Preprocessing in R

Read in the data

# Load in the data
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
# Examine the structure of the data frame
str(tweets)
## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...

Create dependent variable

# Create the dependent variable
tweets$Negative = as.factor(tweets$Avg <= -1)
# Tabulate the negative tweets (kable() from the knitr package prints the table)
library(knitr)
z = table(tweets$Negative)
kable(z)
Var1 Freq
FALSE 999
TRUE 182

Install and load new packages

# Install (first time only) with install.packages("tm") and install.packages("SnowballC"), then load
library(tm)
library(SnowballC)

Create corpus

# Create Corpus
corpus = VCorpus(VectorSource(tweets$Tweet)) 

# Look at corpus
corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1181
corpus[[1]]$content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"

Convert to lower-case

# Convert to lower-case
corpus = tm_map(corpus, content_transformer(tolower))

corpus[[1]]$content
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"

Remove punctuation

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

corpus[[1]]$content
## [1] "i have to say apple has by far the best customer care service i have ever received apple appstore"

Look at stop words

# Examine stop words
stopwords("english")[1:10]
##  [1] "i"         "me"        "my"        "myself"    "we"        "our"       "ours"      "ourselves" "you"       "your"

Remove stopwords and apple

# Remove stop words, plus "apple" (every tweet mentions it, so it does not help distinguish sentiment)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))

corpus[[1]]$content
## [1] "   say    far  best customer care service   ever received  appstore"

Stem document

# Stem document
corpus = tm_map(corpus, stemDocument)

corpus[[1]]$content
## [1] "say far best custom care servic ever receiv appstor"

Create matrix

# Create the document-term matrix
frequencies = DocumentTermMatrix(corpus)

frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)

Look at matrix

# Look at the matrix
inspect(frequencies[1000:1005,505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   cheapen cheaper check cheep cheer cheerio cherylcol chief chiiiiqu child
##   1000       0       0     0     0     0       0         0     0        0     0
##   1001       0       0     0     0     0       0         0     0        0     0
##   1002       0       0     0     0     0       0         0     0        0     0
##   1003       0       0     0     0     0       0         0     0        0     0
##   1004       0       0     0     0     0       0         0     0        0     0
##   1005       0       0     0     0     1       0         0     0        0     0

Check for sparsity

# Look at terms that appear at least 20 times
findFreqTerms(frequencies, lowfreq=20)
##  [1] "android"              "anyon"                "app"                  "appl"                 "back"                 "batteri"              "better"               "buy"                 
##  [9] "can"                  "cant"                 "come"                 "dont"                 "fingerprint"          "freak"                "get"                  "googl"               
## [17] "ios7"                 "ipad"                 "iphon"                "iphone5"              "iphone5c"             "ipod"                 "ipodplayerpromo"      "itun"                
## [25] "just"                 "like"                 "lol"                  "look"                 "love"                 "make"                 "market"               "microsoft"           
## [33] "need"                 "new"                  "now"                  "one"                  "phone"                "pleas"                "promo"                "promoipodplayerpromo"
## [41] "realli"               "releas"               "samsung"              "say"                  "store"                "thank"                "think"                "time"                
## [49] "twitter"              "updat"                "use"                  "via"                  "want"                 "well"                 "will"                 "work"

Remove sparse terms

# Remove sparse terms (keep only terms that appear in at least 0.5% of tweets)
sparse = removeSparseTerms(frequencies, 0.995)
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)

Convert to a data frame

# Convert to a data frame
tweetsSparse = as.data.frame(as.matrix(sparse))

Make all variable names R-friendly

# Make all variable names R-friendly
colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

Add dependent variable

# Add dependent variable
tweetsSparse$Negative = tweets$Negative

Split the data

# Split the data into training and test sets
library(caTools)

set.seed(123)

split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)

Build a CART model

# Build CART Model
library(rpart)
library(rpart.plot)
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")
# Plot the CART Tree
prp(tweetCART)

Evaluate the performance of the model

# Make predictions using the CART Model
predictCART = predict(tweetCART, newdata=testSparse, type="class")
# Tabulate the negative tweets in the testing set vs the predictions
a = table(testSparse$Negative, predictCART)
kable(a)
        FALSE   TRUE
FALSE     294      6
TRUE       37     18

Compute accuracy

# Compute Accuracy
sum(diag(a))/(sum(a))
## [1] 0.8788732
(294+18)/(294+6+37+18)
## [1] 0.8788732

Baseline accuracy

# Tabulate baseline accuracy
a = table(testSparse$Negative)
kable(a)
Var1 Freq
FALSE 300
TRUE 55
a[1]/(sum(a))
##     FALSE 
## 0.8450704
300/(300+55)
## [1] 0.8450704

Random forest model

# Implement Random Forest model
library(randomForest)
set.seed(123)
tweetRF = randomForest(Negative ~ ., data=trainSparse)

Make predictions:

# Make predictions using the RF model
predictRF = predict(tweetRF, newdata=testSparse)
# Tabulate the negative tweets in the testing set vs the predictions
table(testSparse$Negative, predictRF)
##        predictRF
##         FALSE TRUE
##   FALSE   293    7
##   TRUE     34   21
a = table(testSparse$Negative, predictRF)
kable(a)
        FALSE   TRUE
FALSE     293      7
TRUE       34     21

Random Forest Accuracy

# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.884507
(293+21)/(293+7+34+21)
## [1] 0.884507
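
To recap, the accuracies reported above can be collected side by side (a small summary sketch using the counts from this section):

# Recap: baseline vs. CART vs. random forest accuracy on the test set
data.frame(model = c("Baseline", "CART", "Random forest"),
           accuracy = c(300/355, (294+18)/355, (293+21)/355))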

Man vs Machine - IBM Watson

A Grand Challenge

  • In 2004, IBM Vice President Charles Lickel and co-workers were having dinner at a restaurant

  • All of a sudden, the restaurant fell silent

  • Everyone was watching the game show Jeopardy! on the television in the bar

  • A contestant, Ken Jennings, was setting the record for the longest winning streak of all time (75 days)

  • Why was everyone so interested?
    • Jeopardy! is a quiz show that asks complex and clever questions (puns, obscure facts, uncommon words)
    • Originally aired in 1964
    • A huge variety of topics
    • Generally viewed as an impressive feat to do well

The Challenge Begins

  • In 2005, a team at IBM Research started creating a computer that could compete at Jeopardy!

  • Six years later, a two-game exhibition match aired on television
    • The winner would receive $1,000,000

The Contestants

  • Ken Jennings
    • Longest winning streak of 75 days
  • Brad Rutter
    • Biggest money winner of over $3.5 million
  • Watson
    • A supercomputer with 3,000 processors and a database of 200 million pages of information

The Game of Jeopardy!

  • Three rounds per game
    • Jeopardy
    • Double Jeopardy (dollar values doubled)
    • Final Jeopardy (wager on response to one question)
  • Each round has five questions in six categories
    • Wide variety of topics (over 2,500 different categories)
  • Each question has a dollar value - the first to buzz in and answer correctly wins the money
    • If they answer incorrectly they lose the money

Example Round (game board shown as an image in the original slides)

Jeopardy! Questions

  • Cryptic definitions of categories and clues

  • Answer in the form of a question
    • Q: Mozart’s last and perhaps most powerful symphony shares its name with this planet.
    • A: What is Jupiter?
    • Q: Smaller than only Greenland, it’s the world’s second largest island.
    • A: What is New Guinea?

Why is Jeopardy Hard?

  • Wide variety of categories, purposely made cryptic

  • Computers can easily answer precise questions

  • Understanding natural language is hard
    • Where was Albert Einstein born?
    • Suppose you have the following information:
      • “One day, from his city views of Ulm, Otto chose a water color to send to Albert Einstein as a remembrance of his birthplace.”

Using Analytics

  • Watson received each question in text form
    • Normally, players see and hear the questions
  • IBM used analytics to make Watson a competitive player

  • Used over 100 different techniques for analyzing natural language, finding hypotheses, and ranking hypotheses

Watson’s Database and Tools

  • A massive number of data sources
    • Encyclopedias, texts, manuals, magazines, Wikipedia, etc.
  • Lexicon
    • Describes the relationship between different words
    • Ex: “Water” is a “clear liquid” but not all “clear liquids” are “water”
  • Part of speech tagger and parser
    • Identifies functions of words in text
    • Ex: “Race” can be a verb or a noun
    • He won the race by 10 seconds.
    • Please indicate your race.

How Watson Works

  • Step 1: Question Analysis
    • Figure out what the question is looking for
  • Step 2: Hypothesis Generation
    • Search information sources for possible answers
  • Step 3: Scoring Hypotheses
    • Compute confidence levels for each answer
  • Step 4: Final Ranking
    • Look for a highly supported answer

Progress from 2006 to 2010

IBM Watson’s progress from 2006 to 2010 (performance chart shown as an image in the original slides)

The Results

Game     Ken Jennings   Brad Rutter   Watson
Game 1   $4,800         $10,400       $35,734
Game 2   $19,200        $11,200       $41,413
Total    $24,000        $21,600       $77,147

The Analytics Edge

  • Combine many algorithms to increase accuracy and confidence
    • Any one algorithm wouldn’t have worked
  • Approach the problem in a different way than how a human does
    • Hypothesis generation
  • Deal with massive amounts of data, often in unstructured form
    • 90% of data is unstructured