Twitter is a social networking and communication website founded in 2006
Users share and send messages that can be no longer than 140 characters
One of the Top 10 most-visited sites on the internet
Used by protesters across the world
Natural disaster notification, tracking of diseases
Companies will invest more than $120 billion by 2015 on analytics, hardware, software and services
Celebrities, politicians, and companies connect with fans and customers
Many companies maintain online presences
Managing public perception in the age of instant communication is essential
How can we use analytics to address this?
We have discussed why people care about textual data, but how do we handle it?
Humans can’t keep up with Internet-scale volumes of data
Computers need to understand text
This field is called Natural Language Processing
Fully understanding text is difficult
A simpler approach: count how often each word appears
“This course is great. I would recommend this course to my friends.”
| WORDS | Number of times |
|---|---|
| THIS | 2 |
| COURSE | 2 |
| GREAT | 1 |
| … | … |
| WOULD | 1 |
| FRIENDS | 1 |
One feature for each word - a simple but effective approach (see the sketch after this list)
Used as a baseline in text analytics projects and natural language processing
Not the whole story though - preprocessing can dramatically improve performance!
Selecting the specific features that are relevant to the application
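As a quick illustration of this counting idea, here is a minimal base-R sketch (illustrative only; the tm-based workflow later in these notes is what the analysis actually uses):

# Bag of words by hand: lower-case, strip punctuation, split on whitespace, count
text = "This course is great. I would recommend this course to my friends."
words = unlist(strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+"))
table(words)  # "this" and "course" each appear twice; every other word once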
Analytical sentiment analysis can replace more labor-intensive methods like polling
Text analytics can deal with the massive amounts of unstructured data being generated on the internet
Computers are becoming more and more capable of interacting with humans and performing human tasks
# Load in the data
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
# Output the string
str(tweets)
## 'data.frame': 1181 obs. of 2 variables:
## $ Tweet: chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...

# Create the dependent variable
tweets$Negative = as.factor(tweets$Avg <= -1)
# Tabulate the negative tweets
z = table(tweets$Negative)
kable(z)

| Var1 | Freq |
|---|---|
| FALSE | 999 |
| TRUE | 182 |
library(tm)
library(SnowballC)

# Create Corpus
corpus = VCorpus(VectorSource(tweets$Tweet))
# Look at corpus
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1181
corpus[[1]]$content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"# Conver to lower-case
corpus = tm_map(corpus, content_transformer(tolower))
corpus[[1]]$content
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"# Remove punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]$content
## [1] "i have to say apple has by far the best customer care service i have ever received apple appstore"# Examine stop words
stopwords("english")[1:10]
## [1] "i" "me" "my" "myself" "we" "our" "ours" "ourselves" "you" "your"# Remove stop words
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]$content
## [1] " say far best customer care service ever received appstore"# Stem document
corpus = tm_map(corpus, stemDocument)
corpus[[1]]$content
## [1] "say far best custom care servic ever receiv appstor"# Create the document matrix
frequencies = DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)

# Look at the matrix
inspect(frequencies[1000:1005,505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity : 98%
## Maximal term length: 9
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs cheapen cheaper check cheep cheer cheerio cherylcol chief chiiiiqu child
## 1000 0 0 0 0 0 0 0 0 0 0
## 1001 0 0 0 0 0 0 0 0 0 0
## 1002 0 0 0 0 0 0 0 0 0 0
## 1003 0 0 0 0 0 0 0 0 0 0
## 1004 0 0 0 0 0 0 0 0 0 0
## 1005 0 0 0 0 1 0 0 0 0 0

# Find terms that appear at least 20 times
findFreqTerms(frequencies, lowfreq=20)
## [1] "android" "anyon" "app" "appl" "back" "batteri" "better" "buy"
## [9] "can" "cant" "come" "dont" "fingerprint" "freak" "get" "googl"
## [17] "ios7" "ipad" "iphon" "iphone5" "iphone5c" "ipod" "ipodplayerpromo" "itun"
## [25] "just" "like" "lol" "look" "love" "make" "market" "microsoft"
## [33] "need" "new" "now" "one" "phone" "pleas" "promo" "promoipodplayerpromo"
## [41] "realli" "releas" "samsung" "say" "store" "thank" "think" "time"
## [49] "twitter" "updat" "use" "via" "want" "well" "will" "work"

# Remove sparse terms
sparse = removeSparseTerms(frequencies, 0.995)
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)

# Convert to a data frame
tweetsSparse = as.data.frame(as.matrix(sparse))

# Make all variable names R-friendly
colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

# Add dependent variable
tweetsSparse$Negative = tweets$Negative

# Split the data into training and test sets
library(caTools)
set.seed(123)
split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)

# Build CART Model
library(rpart)
library(rpart.plot)
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")
# Plot the CART Tree
prp(tweetCART)

# Make predictions using the CART Model
predictCART = predict(tweetCART, newdata=testSparse, type="class")
# Tabulate the negative tweets in the testing set vs the predictions
a = table(testSparse$Negative, predictCART)
kable(a)

|  | FALSE | TRUE |
|---|---|---|
| FALSE | 294 | 6 |
| TRUE | 37 | 18 |
# Compute Accuracy
sum(diag(a))/(sum(a))
## [1] 0.8788732
(294+18)/(294+6+37+18)
## [1] 0.8788732

# Tabulate baseline accuracy
a = table(testSparse$Negative)
kable(a)

| Var1 | Freq |
|---|---|
| FALSE | 300 |
| TRUE | 55 |
a[1]/(sum(a))
## FALSE
## 0.8450704
300/(300+55)
## [1] 0.8450704

# Implement Random Forest model
library(randomForest)
set.seed(123)
tweetRF = randomForest(Negative ~ ., data=trainSparse)

# Make predictions using the RF model
predictRF = predict(tweetRF, newdata=testSparse)
# Tabulate the negative tweets in the testing set vs the predictions
table(testSparse$Negative, predictRF)
## predictRF
## FALSE TRUE
## FALSE 293 7
## TRUE 34 21
a = table(testSparse$Negative, predictRF)
kable(a)

|  | FALSE | TRUE |
|---|---|---|
| FALSE | 293 | 7 |
| TRUE | 34 | 21 |
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.884507
(293+21)/(293+7+34+21)
## [1] 0.884507

In 2004, IBM Vice President Charles Lickel and co-workers were having dinner at a restaurant
All of a sudden, the restaurant fell silent
Everyone was watching the game show Jeopardy! on the television in the bar
A contestant, Ken Jennings, was setting the record for the longest winning streak of all time (75 days)
In 2005, a team at IBM Research started creating a computer that could compete at Jeopardy!
Example Round
Cryptic definitions of categories and clues
Wide variety of categories, purposely made cryptic
Computers can easily answer precise questions
IBM used analytics to make Watson a competitive player
Used over 100 different techniques for analyzing natural language, finding hypotheses, and ranking hypotheses
IBM Watson’s progress from 2006 to 2010
| Games | Ken Jennings | Brad Rutter | Watson |
|---|---|---|---|
| Game 1 | $4,800 | $10,400 | $35,734 |
| Game 2 | $19,200 | $11,200 | $41,413 |
| Total | $24,000 | $21,600 | $77,147 |