The Milestone Report

Introduction

This report summarizes a review of the three sample files for Natural Language prediction. The three files comprise a sample from news articles, Web logs (blogs), and X/Twitter tweet texts. Although text is supplied in English, Danish, Russian, and Finnish, this analysis will focus only on the English version of these files (en_US).

As with any initial exploration of new datasets, we examine the entire corpus, not just a subset. Training and testing subsets will be arranged in a later program when formal modelling begins. To keep this report concise and focused, we use both macOS (UNIX) commands and R functions to summarize the data quickly. Bear in mind that the summaries presented here do not necessarily represent the final data that may be used for predictions. We are exploring how these texts are structured, and the first step in that process is to verify what the “English” looks like.

How Large Are These Files?

# how large are the files of the en_US dataset?
# bash-3.2$ ls -l
# total 1195648
# -rw-r--r--@ 1 randy  staff  210160014 Jul 22  2014 en_US.blogs.txt
# -rw-r--r--@ 1 randy  staff  205811889 Jul 22  2014 en_US.news.txt
# -rw-r--r--@ 1 randy  staff  167105338 Jul 22  2014 en_US.twitter.txt
#
# That's 210MB for blogs, 205MB for news, and 167MB for tweets
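The same sizes can also be read from within R. The following is a minimal sketch, assuming the three files sit in the current working directory; the `paths` and `bytes` names are illustrative and not part of the original analysis.

```r
# Assumed file locations: the three en_US files in the working directory.
paths <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# file.size() returns the size in bytes (NA for a file that does not exist).
bytes <- file.size(paths)

# Convert to megabytes (1 MB = 10^6 bytes) for easier reading.
data.frame(file = paths, megabytes = round(bytes / 1e6, 1))
```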

How Many Lines Are In Each File?

A count of lines for each file was performed to determine how many lines we may have to deal with. The basic process involved loading each file and using R’s length() function to output a line count, as shown below.

# con  <- file(source, "r")
# text <- readLines(con, skipNul = TRUE)
# close(con)
# length(text)

This process was repeated for the news data, the blog data, and the tweet data. The results were:

# 1,010,242 lines of news text.
# 899,288   lines of blog text.
# 2,360,148 lines of X/Twitter text.
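Since the same steps are repeated for each file, they could be wrapped in a small function. This is only a sketch: count_lines() is a hypothetical helper, and the file names are assumed to be in the working directory.

```r
# count_lines() is a hypothetical helper wrapping the steps shown above.
count_lines <- function(path) {
  con <- file(path, "r")
  on.exit(close(con))  # close the connection even if reading fails
  length(readLines(con, skipNul = TRUE))
}

# Assumed file names; only files that are actually present are counted.
paths <- c(news    = "en_US.news.txt",
           blogs   = "en_US.blogs.txt",
           twitter = "en_US.twitter.txt")
sapply(paths[file.exists(paths)], count_lines)
```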

How Many Words Are In Each File?

What constitutes a word in English? A word in English is a contiguous sequence of alphabetic characters demarcated by one (or more) spaces. These are the character sequences we see in sentences. But does a contiguous sequence of alphabetic characters demarcated by one (or more) spaces have to be a word of English? Not necessarily. Here’s why.

To examine the “words” in these texts, and to build a frequency distribution of the words, we must break down the lines/sentences in these texts into their individual “words”. We do so with a filter that retains only alphabetic characters, plus the apostrophe and the hyphen so that contractions and hyphenated compounds stay intact.

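library(stringr)  # provides str_replace_all() and str_split()
# replace every run of characters other than letters, the apostrophe, and the hyphen with '\n'
text <- str_replace_all(text,'[^A-Za-z\'-]+','\n')
# separate out each word using '\n' as the separator
words <- str_split(text,'\n')
# turn "words" into a vector of character strings
words <- unlist(words)
# drop the empty strings left at line ends so they are not counted as words
words <- words[words != ""]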
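To see what this filter does, it can be tried on a single made-up sentence. The sketch below is a base R equivalent (strsplit() instead of the stringr calls above); tokenize() is a hypothetical helper, and dropping empty strings is a choice made in this sketch.

```r
# tokenize() is a hypothetical helper: split on any run of characters that is
# not a letter, an apostrophe, or a hyphen, then drop empty strings.
tokenize <- function(lines) {
  pieces <- unlist(strsplit(lines, "[^A-Za-z'-]+"))
  pieces[nzchar(pieces)]
}

sample_line <- "Don't stop: it's a well-known fact, LOL!!"
tokenize(sample_line)
# "Don't" "stop" "it's" "a" "well-known" "fact" "LOL"
```

Only the straight apostrophe (') is retained by this pattern; a typographic apostrophe (’) would be treated as a separator and split a contraction in two.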

We again employ the R function length() to count the English “words” found in each text. The essential word statistics for each text are:

# 33,949,660 words in our news text.
# 37,400,126 words in our blog text.
# 29,798,289 words in our X/Twitter text.

What does this mean for words per line?

# for news,   33,949,660 words / 1,010,242 lines = ~34 words/line
# for blogs,  37,400,126 words / 899,288 lines   = ~42 words/line
# for tweets, 29,798,289 words / 2,360,148 lines = ~13 words/line
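The same arithmetic can be reproduced in R from the counts reported above. The vector names are illustrative.

```r
# Line and word counts as reported above.
line_counts <- c(news = 1010242, blogs = 899288, tweets = 2360148)
word_counts <- c(news = 33949660, blogs = 37400126, tweets = 29798289)

# Average words per line, rounded to one decimal place.
round(word_counts / line_counts, 1)
# news 33.6, blogs 41.6, tweets 12.6
```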

Here’s where it gets tricky: if you sample the actual texts, news and blogs follow grammatical conventions more closely, whereas X/Twitter entries tend to be ad hoc, creatively crafted to communicate as much as possible using the fewest keystrokes.

Is it fair to assume that news is written by journalists who’ve been trained in proper writing?
Blogs can be written by anyone, implying a broader range of “license” in the use of English.
And tweets are all over the place; in our penchant for speed, new abbreviations (OMG, BTW, LOL, etc.) and emoticons/emoji have introduced a whole new dimension to the accepted(?) use of English.

How Might We Model These Texts for Predictions?

At this writing, the strategy is to find an optimal language predictor and use it to create a prediction model for each source of text. News averages 34 words/line in over a million lines, whereas blogs average 42 words/line in fewer than 900,000 lines. Could this hint that the more disciplined news writers offer a better predictor than the bloggers? And tweets! Averaging only 13 words/line, with who knows what level of “grammatical condensation” employed, the tweet model could be the most interesting of the three!

The end product of this predictor development will be a Shiny app that takes a phrase (multiple words) as input and predicts the next word.
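To illustrate the idea of next-word prediction, here is a toy sketch based on word-pair (bigram) counts. It is one possible approach chosen for illustration and not necessarily the model this project will use; the corpus is made up and predict_next() is a hypothetical helper.

```r
# Toy corpus standing in for the tokenized text (hypothetical data).
tokens <- c("the", "cat", "sat", "on", "the", "mat", "and", "the", "cat", "slept")

# Pair each word with the word that follows it.
bigrams <- data.frame(
  first  = head(tokens, -1),
  second = tail(tokens, -1),
  stringsAsFactors = FALSE
)

# predict_next() is a hypothetical helper: it returns the most frequent
# follower of the last word in the phrase, or NA if that word was never seen.
predict_next <- function(phrase, bigrams) {
  parts <- strsplit(phrase, "[^A-Za-z'-]+")[[1]]
  parts <- parts[nzchar(parts)]
  if (length(parts) == 0) return(NA_character_)
  last_word <- parts[length(parts)]
  followers <- bigrams$second[bigrams$first == last_word]
  if (length(followers) == 0) return(NA_character_)
  names(sort(table(followers), decreasing = TRUE))[1]
}

predict_next("I saw the", bigrams)
# "cat"
```

In this toy corpus, "the" is followed by "cat" twice and "mat" once, so "cat" is returned. A Shiny app would wrap a function of this kind behind a text input.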

Some Interesting Finds in the Word Frequency Distribution

We summarize our frequency distribution of words in a table of ordered frequency counts.

What is the frequency of occurrence of the words from the news text, ordered from most used to least used?

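library(dplyr)  # provides %>%, group_by(), summarize(), arrange(), n(), desc()
# count the occurrences of each distinct word, then sort from most to least frequent
foo <- data.frame(value = newsWords) %>% group_by(value) %>% summarize(count = n())
foo <- arrange(foo,desc(count))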
head(foo); tail(foo)
# A tibble: 6 × 2
#  value   count
#  <chr>   <int>
# 1 the   1717673
# 2 to     893070
# 3 and    852951
# 4 a      843179
# 5 of     767698
# 6 in     628247
# A tibble: 6 × 2
#  value        count
#  <chr>        <int>
# 1 zuzz             1
# 2 zydeco-blues     1
# 3 zygote           1
# 4 zz               1
# 5 zzril            1
# 6 zzz's            1

Here’s where we find strings of characters between spaces that turn out not to be English words. Clearly, more refined filtering will be needed for prediction.
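One way to gauge how much of the vocabulary consists of such one-off strings is to count the words that occur exactly once. The following base R sketch uses a toy vector; in practice the input would be a word vector such as newsWords.

```r
# Toy word vector standing in for newsWords (hypothetical data).
words <- c("the", "to", "the", "zuzz", "and", "the", "to", "zz")

# table() counts each distinct word; sort() orders from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)

# Words that occur exactly once, and their share of the distinct vocabulary.
singletons <- names(freq)[freq == 1]
length(singletons) / length(freq)
```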

What do we find in blog text?

foo <- data.frame(value = blogWords) %>% group_by(value) %>% summarize(count = n())
foo <- arrange(foo,desc(count))
head(foo); tail(foo)
# A tibble: 6 × 2
#  value   count
#  <chr>   <int>
# 1 the   1667591
# 2 to    1052325
# 3 and   1033994
# 4 of     866973
# 5 a      863855
# 6 I      816958
# A tibble: 6 × 2
#  value                           count
#  <chr>                           <int>
# 1 zzzz                                1
# 2 zzzz's                              1
# 3 zzzzz's                             1
# 4 zzzzzzzzz                           1
# 5 zzzzzzzzzzzzzzzzzzzzzzzzzzz         1
# 6 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz     1

Again, we find single occurrences of odd strings. Although not an English major, this researcher believes they will not pass for English! (Anyone care to guess what all those zzzzzzz’s mean?)

And what do we find in the X/Twitter text?

foo <- data.frame(value = twitterWords) %>% group_by(value) %>% summarize(count = n())
foo <- arrange(foo,desc(count))
head(foo); tail(foo)
# A tibble: 6 × 2
#  value  count
#  <chr>  <int>
# 1 the   841467
# 2 to    769408
# 3 I     624548
# 4 a     577366
# 5 you   482123
# 6 and   405247
# A tibble: 6 × 2
#  value    count
#  <chr>    <int>
# 1 zzolo        1
# 2 zzt          1
# 3 zztop        1
# 4 zzzs         1
# 5 zzzzzn       1
# 6 zzzzzzzx     1

Note that a simple filter for alphabetic character strings is insufficient to identify English words. We must go further. Building a proper language predictor will require careful filtering and cleaning of the base text, and an early step will be to isolate only proper English words using a dictionary.
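A dictionary check of this kind might look like the sketch below. Both vectors are toy data, and the choice of dictionary is an assumption; a real run might read a word list from a file, for example the one macOS ships at /usr/share/dict/words.

```r
# Toy dictionary (hypothetical), stored in lower case.
dictionary <- c("the", "to", "and", "zygote", "i")

# Toy word vector standing in for the extracted "words".
words <- c("the", "zzril", "I", "zygote", "zzz's", "to")

# Compare in lower case so that "I" and "The" still match the dictionary.
is_known <- tolower(words) %in% dictionary

words[is_known]    # kept:     "the" "I" "zygote" "to"
words[!is_known]   # rejected: "zzril" "zzz's"
```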

We Close With Visualization

The three plots should be self-explanatory, but note that they were built from the word distributions obtained so far. Further cleaning and filtering will likely change these counts. Stay tuned…
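As an illustration of how a frequency plot can be drawn from these distributions, here is a minimal base R sketch using the six most frequent news words reported earlier. The plot type and labels are assumptions made for this sketch and may differ from the plots in this report.

```r
# Top six news words and their counts, copied from the frequency table above.
top_news <- c("the" = 1717673, "to" = 893070, "and" = 852951,
              "a" = 843179, "of" = 767698, "in" = 628247)

# Bar chart of the counts, in millions of occurrences.
barplot(top_news / 1e6,
        main = "Most frequent words in the news text",
        xlab = "Word",
        ylab = "Occurrences (millions)")
```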