By Rendani Mbuvha, School of Statistics and Actuarial Science University of Witwatersrand
This notebook provides a short tutorial in building predictive models using text data. The approach we take here is using the Bag of Words or the Document Term Matrix (DTM) technique. This is essentially representing documents by the relative occurrences of different words from set vocabulary. Counts can be of single words or of N-consecutive words (N-grams). This exercise will focus on single word counts (unigrams) and two consecutive word counts(bigrams).
First of all, we need to install R, R studio, and the required packages (where required).
In R install libraries:
Use the command install.packages(“package-name”)
Set working directory to the folder where you would to save your work.
setwd("~/Dropbox/Teaching/ASSA_Workshop/")
Load the required libraries
# Main library for NLP
library(text2vec)
package ‘text2vec’ was built under R version 3.3.2text2vec is still in beta version - APIs can be changed.
For tutorials and examples visit http://text2vec.org.
For FAQ refer to
1. https://stackoverflow.com/questions/tagged/text2vec?sort=newest
2. https://github.com/dselivanov/text2vec/issues?utf8=%E2%9C%93&q=is%3Aissue%20label%3Aquestion
If you have questions please post them at StackOverflow and mark with 'text2vec' tag.
Attaching package: ‘text2vec’
The following object is masked _by_ ‘.GlobalEnv’:
movie_review
# For model evaluation calc
library(ROCR)
#Gives nice plots
library(ggplot2)
package ‘ggplot2’ was built under R version 3.3.2Need help? Try the ggplot2 mailing list:
http://groups.google.com/group/ggplot2.
The dataset for this tutorial consists of 5000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary, meaning an IMDB rating < 5 results in a sentiment score of 0, and a rating >=7 has a sentiment score of 1. No individual movie had more than 30 reviews.
The aim of this tutorial is to predict a movie’s sentiment based on a vectorized representation of the words contained in the movie’s review.
Possible areas in the financial services industry where such modeling techniques can be used include:
Load the IMDb movie review dataset. And perform some preprocessing.
data("movie_review")
#Clean up the review
movie_review$review <- gsub('b"|b\'|\n|\\\\|\\"', "", movie_review$review)
movie_review$review <- gsub("([<>])|[[:punct:]]", "\\1", movie_review$review)
#Randomly sample training records
training_ids = sample(1:5000,3500)
testing_ids = setdiff(1:5000,training_ids)
#Creating subsets.
training = movie_review[training_ids,]
testing = movie_review[testing_ids,]
head(training)
The next step will be to split documents (reviews) into separate words(Tokenize).
#Words to skip - to avoid inflated counts
stop_words = c("the","to","of","for","br","this","for", "in","a","b","and","on","is","by","that","with","from","as","it","are","have","be","us","an","was","u","i")
#Tokenize the documents
train_tokens = itoken(training$review,
preprocessor = tolower,
tokenizer = word_tokenizer,
ids = training$id,
progressbar = TRUE)
Create universe of words from the entire training dataset. This is referred to as the vocabulary or dictionary of the problem.
vocab = create_vocabulary(train_tokens,stopwords = stop_words)
|
|======== | 10%
|
|=============== | 20%
|
|======================= | 30%
|
|============================== | 40%
|
|====================================== | 50%
|
|============================================== | 60%
|
|===================================================== | 70%
|
|============================================================= | 80%
|
|==================================================================== | 90%
|
|============================================================================| 100%
#Trim the vocabulary a little bit
pruned_vocab = prune_vocabulary(vocab,term_count_min = 10, doc_proportion_max = 0.95, doc_proportion_min = 0.001, vocab_term_max = 1000)
Now we are able to create our text features/rating factors. This is done by simply vectorizing each review by the counts of each word (from the vocabulary) that is represented in the document. Simply this is effectively scoring through the vocabulary. This creates what is called a bag-of-words or a document term matrix (DTM).
#Index each token
vectorizer = vocab_vectorizer(pruned_vocab)
#bin counts falling into each token
dtm_train = create_dtm(train_tokens, vectorizer)
|
|======== | 10%
|
|=============== | 20%
|
|======================= | 30%
|
|============================== | 40%
|
|====================================== | 50%
|
|============================================== | 60%
|
|===================================================== | 70%
|
|============================================================= | 80%
|
|==================================================================== | 90%
|
|============================================================================| 100%
Just a quick check on our DTM
# Dimension check
dim(dtm_train)
[1] 3500 1000
# check the first 6 rows
head(data.frame(as.matrix(dtm_train)))
Train a simple GLM on the above created features to predict sentiment.
library(glmnet)
package ‘glmnet’ was built under R version 3.3.2Loading required package: Matrix
package ‘Matrix’ was built under R version 3.3.2Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loaded glmnet 2.0-13
set.seed(201L)
NFOLDS = 4
#row normalize the data
dtm_train_norm = normalize(dtm_train, "l1")
unigram_classifier = cv.glmnet(x = dtm_train_norm, y = training$sentiment,
family = 'binomial',
alpha = 1,
type.measure = "auc",
nfolds = NFOLDS,
thresh = 1e-3)
Prepare the testing set by tokenizing and vectorizing using the training scheme.
testing_tokens = itoken(testing$review,
ids = testing$id,
prep_fun= tolower,
tokenizer =word_tokenizer,
progressbar = TRUE)
dtm_test = create_dtm(testing_tokens, vectorizer)
|
|======== | 10%
|
|=============== | 20%
|
|======================= | 30%
|
|============================== | 40%
|
|====================================== | 50%
|
|============================================== | 60%
|
|===================================================== | 70%
|
|============================================================= | 80%
|
|==================================================================== | 90%
|
|============================================================================| 100%
#row normalize the data
dtm_test_norm = normalize(dtm_test, "l1")
preds = predict(unigram_classifier , dtm_test_norm , type = 'response',s = "lambda.min")
Evaluate model performance
#Create a prediction object
prediction = prediction(preds, testing$sentiment)
#infer True postive and false positve rates
perf = performance(prediction, measure = "tpr", x.measure = "fpr")
#accuracy and area under the curve
acc = performance(prediction, measure = "acc")
auc = performance(prediction, measure = "auc")
Plot accuracy at different cutoffs
plot(acc,main=" Unigram Model Accuracy",col="orange",lwd=3)
grid()
The model’s accuracy at cutoff 0.5 is approximately 0.819
roc.data = data.frame(fpr=unlist(perf@x.values),
tpr=unlist(perf@y.values))
ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.8) +
geom_line(aes(y=tpr)) +
geom_abline(slope=1, intercept=0, linetype='dashed') +
ggtitle("ROC Curve") +
ylab('True Positive Rate') +
xlab('False Positive Rate')
The model clearly generates lift over random guessing. The AUC is ~ O.8986.
coef_mat = as.matrix(coef(unigram_classifier, s="lambda.min"))
coef_mat = coef_mat[order(coef_mat),]
Bottom 20 coefficients - Words that reduce probability of a positive review.
data.frame(head(coef_mat,n = 20))
Top 20 coefficients - Words that increase probability of a positive review.
data.frame(tail(coef_mat, n = 20))
Here we repeat the exercise using higher order N-grams.
set.seed(20166)
bigram_vocab = create_vocabulary(train_tokens,stopwords = stop_words, ngram = c( 1L,2L))
|
|======== | 10%
|
|=============== | 20%
|
|======================= | 30%
|
|============================== | 40%
|
|====================================== | 50%
|
|============================================== | 60%
|
|===================================================== | 70%
|
|============================================================= | 80%
|
|==================================================================== | 90%
|
|============================================================================| 100%
Prune a little
bigram_vocab = prune_vocabulary(bigram_vocab,
doc_proportion_max = 0.5, term_count_max = 3000)
bigram_vectorizer = vocab_vectorizer(bigram_vocab)
bigram_dtm_train = create_dtm(train_tokens, bigram_vectorizer)
|
|======== | 10%
|
|=============== | 20%
|
|======================= | 30%
|
|============================== | 40%
|
|====================================== | 50%
|
|============================================== | 60%
|
|===================================================== | 70%
|
|============================================================= | 80%
|
|==================================================================== | 90%
|
|============================================================================| 100%
bigram_dtm_train_norm = normalize(bigram_dtm_train, "l1")
bigram_classifier = cv.glmnet(x = bigram_dtm_train_norm, y = training$sentiment,
family = 'binomial',
alpha = 1,
type.measure = "auc",
nfolds = NFOLDS,
thresh = 1e-3,
maxit = 1e3)
bigram_dtm_test = create_dtm(testing_tokens, bigram_vectorizer)
|
|======== | 10%
|
|=============== | 20%
|
|======================= | 30%
|
|============================== | 40%
|
|====================================== | 50%
|
|============================================== | 60%
|
|===================================================== | 70%
|
|============================================================= | 80%
|
|==================================================================== | 90%
|
|============================================================================| 100%
#row normalize the data
bigram_dtm_test_norm = normalize(bigram_dtm_test, "l1")
preds_bg = predict(bigram_classifier , bigram_dtm_test_norm , type = 'response',s ="lambda.min")
#Create a prediction object
prediction_bg = prediction(preds_bg, testing$sentiment)
#infer True postive and false positve rates
perf_bg = performance(prediction_bg, measure = "tpr", x.measure = "fpr")
#accuracy and area under the curve
acc_bg = performance(prediction_bg, measure = "acc")
acc_bg@y.values[[1]][750]
[1] 0.8246667
auc_bg = performance(prediction_bg, measure = "auc")
Plot accuracy
plot(acc_bg,main=" Bigram Model Accuracy",col="orange",lwd=3)
grid()
coef_mat_bg = as.matrix(coef(bigram_classifier, s="lambda.min"))
coef_mat_bg = coef_mat_bg[order(coef_mat_bg),]
Bottom 20 coefficients - Words that reduce probability of a positive review.
data.frame(head(coef_mat_bg,n = 20))
Top 20 coefficients - Words that increase probability of a positive review.
data.frame(tail(coef_mat_bg, n = 20))
roc.data_bg = data.frame(fpr=unlist(perf_bg@x.values),
tpr=unlist(perf_bg@y.values))
ggplot(roc.data_bg, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.8) +
geom_line(aes(y=tpr)) +
geom_abline(slope=1, intercept=0, linetype='dashed') +
ggtitle("ROC Curve") +
ylab('True Positive Rate') +
xlab('False Positive Rate')