Word Prediction from Blogs, News, and Twitter Data

Summary

This report describes the data used to build an application that predicts the next word a user will type when writing text. The original data comes from a dataset called the “HC corpora” and can be found at “http://www.corpora.heliohost.org/”. For this report we use a small proportion of the original data.

Data Description

Our data comes in four languages, including English. For each language there are three files, drawn from blogs, news, and Twitter; for English these are “en_US.blogs.txt”, “en_US.news.txt”, and “en_US.twitter.txt”. In this report we show statistics only for the English-language data.

list.files(pattern="^en_US")

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Some statistics about these files:

en_US.blogs.txt: 899,288 lines, 37,334,690 words
en_US.news.txt: 1,010,242 lines, 34,372,720 words
en_US.twitter.txt: 4,269,678 lines, 30,374,206 words
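
Counts like the ones above can be obtained with base R. The following is a minimal sketch, assuming that a word is any whitespace-delimited token (the report does not state how words were counted); the helper `count_lines_words` is a hypothetical name, and it is demonstrated on a small temporary file rather than on the real corpus.

```r
# Hypothetical helper: count lines and whitespace-delimited words in a text file.
count_lines_words <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Split each line on runs of whitespace and drop empty tokens.
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  c(lines = length(lines), words = length(tokens))
}

# Demonstration on a small temporary file (not the real data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", "the lazy dog"), tmp)
counts <- count_lines_words(tmp)
print(counts)
```

On the real files the call would be, for example, `count_lines_words("en_US.blogs.txt")`; the word totals may differ slightly from those listed above depending on the tokenization rule.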

Exploratory Data Analysis

The exploratory data analysis started by loading a small sample of the data and then creating a word cloud for each sample. The word cloud shows the words contained in the data with size reflecting how frequent the word is in the data.
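
The sampling step and the word frequencies behind a word cloud can be sketched in base R as follows. The sample fraction, the seed, and the tokenization rule are assumptions made for illustration; the report does not state which values were used, and the input lines here are made up.

```r
# Illustrative stand-in for lines read from one of the corpus files.
lines <- c("I love this blog", "This blog is great", "I love R",
           "News about R today", "great news today", "love this")

set.seed(42)                 # assumed seed, for reproducibility only
sample_frac <- 0.5           # assumed fraction; the report does not state it
n_keep <- ceiling(length(lines) * sample_frac)
sampled <- sample(lines, n_keep)

# Lowercase, split on anything that is not a letter or apostrophe, drop empties.
words <- unlist(strsplit(tolower(sampled), "[^a-z']+"))
words <- words[nzchar(words)]

# Word frequencies, most frequent first; these values would set the word sizes.
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 10))

# Drawing the cloud needs a plotting package. One option (an assumption, not
# necessarily what this report used) is the 'wordcloud' package:
# wordcloud::wordcloud(names(freq), as.integer(freq), max.words = 100)
```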

Blogs Data

This section of the report shows a summary of the data sample for the blogs data.

The structure of the loaded blogs data and a plot of the 10 most frequent bigrams:

str(blogData)

## 'data.frame':    89 obs. of  1 variable:
##  $ name: Factor w/ 89 levels "- HOUSING: making 83 families homeless, and refusing them to offer them alternative sites.",..: 79 81 32 42 10 89 4 11 36 54 ...

plotTopN(blogBigrams,25,"blogs", 10)

A word cloud for the blogs data with the 100 most frequent words:

The table below shows an example of the frequency of a few words in individual documents. In this case, the word “abbott” is found in document 1, with a frequency of 1:

##         freq
## term     1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 17 18 19 20 21 23 24 26 27 32
##   a      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   aa     1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   abbott 1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   able   0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   about  0 0 0 0 0 0 0 0 0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
##   abovei 1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         freq
## term     35 37 38 40 55 79 81 87 96 98 160
##   a       0  0  0  0  0  0  0  1  0  0   0
##   aa      0  0  0  0  0  0  0  0  0  0   0
##   abbott  0  0  0  0  0  0  0  0  0  0   0
##   able    0  0  0  0  0  0  0  0  0  0   0
##   about   0  0  0  0  0  0  0  0  0  0   0
##   abovei  0  0  0  0  0  0  0  0  0  0   0

News Data

This section of the report shows a summary of the data sample for the news data.

The structure of the loaded news data and a plot of the 10 most frequent bigrams:

str(newsData)

## 'data.frame':    1010 obs. of  1 variable:
##  $ name: Factor w/ 1010 levels "-- Expands library hours on Mondays, Wednesdays and Fridays.",..: 310 727 776 67 64 561 592 930 412 947 ...

plotTopN(newsBigrams,200,"news",10)

A word cloud for the news data with the 100 most frequent words:

Twitter Data

This section of the report shows a summary of the data sample for the Twitter data.

The structure of the loaded Twitter data and a plot of the 10 most frequent bigrams:

str(twitterData)

## 'data.frame':    2299 obs. of  1 variable:
##  $ name: Factor w/ 2297 levels "- Geisha Doll, gangbanger or ? ",..: 1172 1664 692 1590 2237 1417 27 1290 1258 1407 ...

plotTopN(twitterBigrams,80,"twitter",10)

A word cloud for the Twitter data with the 100 most frequent words:

Interesting Findings

Here are a few findings from the work done so far:

Most of the time, a small sample of the data is enough to predict words well.
Accuracy and perplexity are both useful measures for evaluating word prediction.
In some cases perplexity works better than accuracy, but not always; it depends on the data sample and its size.
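
For reference, the two measures can be computed as in the sketch below. Accuracy is the share of positions where the model's top prediction equals the true next word; perplexity is the exponential of the average negative log-probability that the model assigns to the true next words. All values in the example are made up for illustration and are not results from this report.

```r
# Made-up example: the probability the model assigned to each true next word,
# and the model's top prediction at each position (illustrative values only).
p_true    <- c(0.5, 0.25, 0.125, 0.5)
actual    <- c("the", "of", "day", "you")
predicted <- c("the", "to", "day", "you")

# Accuracy: share of positions where the top prediction matches the true word.
accuracy <- mean(predicted == actual)

# Perplexity: exp of the average negative log-probability of the true words.
perplexity <- exp(-mean(log(p_true)))

print(accuracy)
print(perplexity)
```

Higher accuracy is better, while lower perplexity is better, so the two measures can rank models differently on the same sample.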

Plans for the Prediction Algorithm and the Shiny App

The prediction algorithm is based on the frequencies of bigrams and trigrams. The model receives as input one word (in the bigram case) or two words (in the trigram case) and predicts the next word. The prediction is based on the probabilities with which words occur in the training set.
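
The idea can be illustrated with a small base R sketch. It assumes the simplest possible rule: among the n-grams that start with the given context, pick the most frequent one, which is the continuation with the highest estimated conditional probability. The function names (`make_ngrams`, `count_ngrams`, `predict_next`) and the toy training text are hypothetical, and the sketch leaves out smoothing and any fallback from trigrams to bigrams.

```r
# Toy training text; the real model would be trained on the sampled corpus.
train <- c("i love new york", "i love new ideas", "new york is big")

# Lowercase and split into word tokens, dropping empty strings.
tokenise <- function(x) {
  lapply(strsplit(tolower(x), "[^a-z']+"), function(t) t[nzchar(t)])
}

# Build all n-grams of one tokenised sentence as space-joined strings.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over all sentences.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(tokenise(sentences), make_ngrams, n = n))
  table(grams)
}

bigrams  <- count_ngrams(train, 2)
trigrams <- count_ngrams(train, 3)

# Predict the next word: among n-grams starting with the given context,
# return the last word of the most frequent one (NA if the context is unseen).
predict_next <- function(context, counts) {
  prefix <- paste0(tolower(context), " ")
  hits <- counts[startsWith(names(counts), prefix)]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(".* ", "", best)
}

print(predict_next("love", bigrams))     # bigram case: one word of context
print(predict_next("i love", trigrams))  # trigram case: two words of context
```

Dividing each count by the total count of n-grams sharing the same context would give the conditional probabilities explicitly; taking the maximum count selects the same word.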

For the Shiny app I plan to have a pane where the user writes text and a small selection menu where the predicted words appear, from which the user can choose one. The chosen word will then be written into the text pane for the user.
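
One possible layout for such an app is sketched below. This is only an outline under stated assumptions: the input IDs, the "Insert word" button, and the helpers `predict_words` and `append_word` are hypothetical, and `predict_words` returns fixed placeholder words instead of querying the n-gram model.

```r
library(shiny)

# Placeholder predictor (hypothetical): a real app would query the n-gram model.
predict_words <- function(text) {
  if (!nzchar(trimws(text))) return(character(0))
  c("the", "and", "to")
}

# Append the chosen word to the text, separated by a single space.
append_word <- function(text, word) {
  paste(trimws(text, which = "right"), word)
}

ui <- fluidPage(
  textAreaInput("text", "Write here:", rows = 5),
  selectInput("choice", "Predicted words:", choices = character(0)),
  actionButton("use", "Insert word")
)

server <- function(input, output, session) {
  # Refresh the selection menu whenever the text changes.
  observeEvent(input$text, {
    updateSelectInput(session, "choice", choices = predict_words(input$text))
  })
  # Write the chosen word into the text pane when the button is pressed.
  observeEvent(input$use, {
    req(input$choice)
    updateTextAreaInput(session, "text",
                        value = append_word(input$text, input$choice))
  })
}

app <- shinyApp(ui, server)  # start interactively with: runApp(app)
```

Keeping the prediction and text-editing logic in plain functions, separate from the server code, makes it possible to test them without launching the app.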