Synopsis

This document serves as an update on the word suggestion application research. Three sources of data are explored below: blogs, news, and Twitter posts.

The goal here is to analyze, summarize, and explore the data. This work will form the foundation for the development of the application.

Exploratory Data Analysis

Raw data

Beginning with the three text files, below are the basic file facts:

file_name                            source    line_count   word_count   mean_words_per_line
data/final/en_US/en_US.blogs.txt     blogs        899,288   37,334,690   41.5
data/final/en_US/en_US.news.txt      news       1,010,242   34,372,720   34.0
data/final/en_US/en_US.twitter.txt   twitter    2,360,148   30,374,206   12.9
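For reproducibility, here is a rough R sketch of how these figures could be gathered. The original counts appear to come from the Unix wc utility (per the original column naming), so small differences in word splitting are possible; the file paths are those listed in the table.

  files <- c(
    blogs   = "data/final/en_US/en_US.blogs.txt",
    news    = "data/final/en_US/en_US.news.txt",
    twitter = "data/final/en_US/en_US.twitter.txt"
  )

  file_facts <- do.call(rbind, lapply(names(files), function(src) {
    lines <- readLines(files[[src]], encoding = "UTF-8", skipNul = TRUE)
    words <- sum(lengths(strsplit(lines, "\\s+")))   # approximates `wc -w`
    data.frame(source              = src,
               line_count          = length(lines),
               word_count          = words,
               mean_words_per_line = round(words / length(lines), 1))
  }))
  file_facts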

Below is the first record in the Twitter data, which, after review of the files, is a fair representative of the data overall. One observes multiple sentences, abbreviations ("Btw", "RT"), mixed case, informal phrasing ("gonna"), and occasional odd formatting. This variety presents some challenges and will require treatment to standardize the text so that the application returns accurate suggestions.

[1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

Corpus and Tokenization

Next, the data are read into an NLP package for further exploration. The quanteda package was chosen for this work. After importing the data into data tables, the tables were read into a corpus. The corpus here is simple: it is just the collection of blog, news, and Twitter records, with each record treated as its own document. A minimal sketch of this construction follows.
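In the sketch below, the 100,000-record sample per source described later is drawn at this stage for simplicity; the sampling helper, seed, and object names are illustrative, and the actual code appears in Appendix B.

  library(quanteda)

  set.seed(1234)   # assumed seed for the random sample
  sample_lines <- function(path, n = 100000) {
    sample(readLines(path, encoding = "UTF-8", skipNul = TRUE), n)
  }

  blogs   <- sample_lines("data/final/en_US/en_US.blogs.txt")
  news    <- sample_lines("data/final/en_US/en_US.news.txt")
  twitter <- sample_lines("data/final/en_US/en_US.twitter.txt")

  docs <- data.frame(
    text   = c(blogs, news, twitter),
    source = rep(c("blogs", "news", "twitter"), each = 100000),
    stringsAsFactors = FALSE
  )
  corp <- corpus(docs, text_field = "text")   # each record is its own document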

Below is a summary of the basic data after tokenization into words. Note that these figures describe a random sample of 100,000 records from each of the blog, news, and Twitter sources.

source    mean_sentences   mean_tokens   mean_line_length_chars
blogs     2.6              40.9          228.9
news      2.0              33.3          201.9
twitter   1.6              12.5          68.6
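A sketch of how these per-record statistics could be computed with quanteda, using the corp object from the sketch above (column and object names are assumptions):

  toks <- tokens(corp)

  stats <- data.frame(
    source     = docvars(corp, "source"),
    nsentences = nsentence(corp),              # sentences per record
    ntokens    = ntoken(toks),                 # word tokens per record
    nchars     = nchar(as.character(corp))     # characters per record
  )
  aggregate(cbind(nsentences, ntokens, nchars) ~ source, data = stats, FUN = mean)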

Below is a view of the relationship between the number of sentences and the token count per record. The plots display differences in the rate at which additional sentences add to the token count.

The Twitter records generally contain brief sentences (and therefore fewer tokens), typically five or fewer sentences per record. The blogs data have a noticeable number of entries with many tokens and sentences, which makes intuitive sense, as blogs are essentially a person’s stream of consciousness. The news data show less variation, suggesting most entries adhere to journalistic standards of brevity. These statements are also supported by the table above.

Here is a wordcloud of the 100 most frequent words observed in the data. One can see the prevalence of stop words; these are left in for this project because the application will use as much information as possible, including the stop words.
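A sketch of the wordcloud step, assuming the quanteda.textplots companion package for plotting:

  library(quanteda.textplots)   # assumed companion package for plotting

  dfm_words <- dfm(tokens(corp, remove_punct = TRUE))
  textplot_wordcloud(dfm_words, max_words = 100)   # 100 most frequent words, stop words retained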

N-gram creation

In order for the application to suggest a likely next word in a given phrase, tokenization will be expanded from single words to combinations of multiple words (n-grams). Below are the ten most frequent combinations for the bigram and trigram cases.
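A sketch of the n-gram construction and frequency counts (the separator and object names are assumptions):

  toks     <- tokens(corp, remove_punct = TRUE)
  bigrams  <- tokens_ngrams(toks, n = 2, concatenator = " ")
  trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")

  topfeatures(dfm(bigrams), 10)    # ten most frequent bigrams
  topfeatures(dfm(trigrams), 10)   # ten most frequent trigrams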

These n-grams will eventually form a lookup table to be referenced by the application.
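One hypothetical shape for that lookup, splitting each trigram (from the sketch above) into a two-word prefix and a candidate next word; the object names, columns, and example prefix are illustrative, not the application's code:

  tri_dfm  <- dfm(trigrams)
  tri_freq <- topfeatures(tri_dfm, n = nfeat(tri_dfm))   # all trigram counts, descending
  parts    <- strsplit(names(tri_freq), " ", fixed = TRUE)

  lookup <- data.frame(
    prefix    = vapply(parts, function(p) paste(p[1:2], collapse = " "), character(1)),
    next_word = vapply(parts, function(p) p[3], character(1)),
    count     = as.integer(tri_freq),
    stringsAsFactors = FALSE
  )

  # Example query: most frequent continuations of a two-word phrase
  head(lookup[lookup$prefix == "thanks for", ], 3)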

The count of unique single words (unigrams) is 197,442, the bigram count is 2,406,552, and the trigram count is 5,585,010, foreshadowing a strain on system resources as the word combination size (n) increases.

Summary

In conclusion, the data look acceptable and can be handled with a suitable NLP package. The corpus yielded insights into the quantity of text involved. Judicious use of n-grams will be guided by system limitations, as it was observed that the larger the combination size, the larger the eventual lookup table becomes.

Appendix A: System setup

This work was developed on the following system:

  Model Name: iMac
  Processor Name: Quad-Core Intel Core i7
  Memory: 32 GB

Appendix B: R code