This report summarizes a review of the three sample files for natural language prediction. The three files are samples of news articles, web logs (blogs), and X/Twitter tweets. Although text is supplied in English, German, Russian, and Finnish, this analysis focuses only on the English versions of these files (en_US).
As with any initial exploration of new datasets, we examine the entire corpus, not just a subset. Training and testing subsets will be arranged in a later program when formal modelling begins. To keep this report concise and focused, we use both macOS (UNIX) commands and R functions to summarize the data quickly. Bear in mind that the summaries presented here do not necessarily represent the final data that may be used for predictions. We are exploring how these texts are structured, and the first step in that process is to verify what the “English” looks like.
# how large are the files of the en_US dataset?
# bash-3.2$ ls -l
# total 1195648
# -rw-r--r--@ 1 randy staff 210160014 Jul 22 2014 en_US.blogs.txt
# -rw-r--r--@ 1 randy staff 205811889 Jul 22 2014 en_US.news.txt
# -rw-r--r--@ 1 randy staff 167105338 Jul 22 2014 en_US.twitter.txt
#
# That's roughly 210 MB for blogs, 206 MB for news, and 167 MB for tweets
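The same sizes can also be read from inside R rather than from the shell. The following is a minimal sketch, assuming the three files sit in the current working directory; the helper name `size_in_mb` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: report file sizes in megabytes (1 MB = 10^6 bytes,
# the same decimal convention used in the summary above).
size_in_mb <- function(paths) {
  sizes <- file.size(paths)          # size in bytes; NA if a file is missing
  names(sizes) <- basename(paths)
  round(sizes / 1e6, 1)
}

# Assumed file locations; adjust to wherever the corpus was unpacked.
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
print(size_in_mb(files))
```

Files that are not found simply show up as NA, which makes a wrong working directory easy to spot.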
A count of lines for each file was performed to determine how many lines we may have to deal with. The basic process involved loading each file and using R’s length() function to output a line count, as shown below.
# con <- file(source, "r")
# text <- readLines(con, skipNul = TRUE)
# close(con)
# length(text)
This process was repeated for the news data, the blog data, and the tweet data. The results were:
# 1,010,242 lines of news text.
# 899,288 lines of blog text.
# 2,360,148 lines of X/Twitter text.
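Repeating the count by hand for each file can be folded into one small function. This is only a sketch of that idea: `count_lines` is a hypothetical helper, and the file names are assumed to be in the working directory.

```r
# Hypothetical helper: count the lines in one text file.
count_lines <- function(path) {
  con <- file(path, "r")
  on.exit(close(con))                 # close the connection even on error
  length(readLines(con, skipNul = TRUE, warn = FALSE))
}

# Assumed file names, listed in the same order as the results above.
files <- c(news    = "en_US.news.txt",
           blogs   = "en_US.blogs.txt",
           twitter = "en_US.twitter.txt")
existing <- files[file.exists(files)]   # skip anything not found locally
line_counts <- vapply(existing, count_lines, integer(1))
print(format(line_counts, big.mark = ","))
```

Using `on.exit()` keeps the connection handling in one place, so a read error on one file does not leave a connection open.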
What constitutes a word in English? A word in English is a contiguous sequence of alphabetic characters demarcated by one (or more) spaces. These are the character sequences we see in sentences. But does a contiguous sequence of alphabetic characters demarcated by one (or more) spaces have to be a word of English? Not necessarily. Here’s why.
To examine the “words” in these texts, and to build a frequency distribution of them, we must break the lines/sentences down into their individual “words”. We do so with a filter that retains only alphabetic characters, plus the apostrophe and the hyphen so that contractions and hyphenated compounds stay intact.
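library(stringr)
# replace every character that is not a letter, apostrophe, or hyphen with '\n'
text <- str_replace_all(text,'[^A-Za-z\'-]','\n')
# separate out each word using '\n' as the separator
words <- str_split(text,'\n')
# turn "words" into a vector of character strings
words <- unlist(words)
# drop the empty strings produced wherever two separators were adjacent
words <- words[words != ""]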
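For readers without stringr installed, the same filter can be approximated in base R. The sketch below is an illustration, not the code used to produce the counts in this report; `tokenize_words` and the sample lines are hypothetical.

```r
# Hypothetical helper: split lines of text into alphabetic "words".
tokenize_words <- function(lines) {
  # Collapse every run of characters other than letters, apostrophes,
  # or hyphens into a single space.
  cleaned <- gsub("[^A-Za-z'-]+", " ", lines)
  tokens <- unlist(strsplit(cleaned, " ", fixed = TRUE))
  tokens[nzchar(tokens)]              # drop empty strings
}

sample_lines <- c("Don't stop: it's a well-known fact!", "2 cats, 3 dogs")
print(tokenize_words(sample_lines))
```

Digits and punctuation disappear, while contractions and hyphenated compounds survive as single tokens. Collapsing runs of separators before splitting also keeps the number of empty strings small.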
We again employ the R function length() to count the English “words” found in each of the texts. Once that processing was complete, the word counts for each text were as follows.
# 33,949,660 words in our news text.
# 37,400,126 words in our blog text.
# 29,798,289 words in our X/Twitter text.
What does this mean for words per line?
# for news, 33,949,660 words / 1,010,242 lines = ~34 words/line
# for blogs, 37,400,126 words / 899,288 lines = ~42 words/line
# for tweets, 29,798,289 words / 2,360,148 lines = ~13 words/line
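The arithmetic above is easy to reproduce. This short sketch simply re-enters the counts reported in this section; the vector names are arbitrary.

```r
# Counts copied from the figures reported above.
lines_n <- c(news = 1010242, blogs = 899288, twitter = 2360148)
words_n <- c(news = 33949660, blogs = 37400126, twitter = 29798289)

# Average number of words per line for each source.
words_per_line <- words_n / lines_n
print(round(words_per_line, 1))
```

Keeping the counts in named vectors means the same division can be rerun unchanged once later cleaning steps alter the word totals.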
Here’s where it gets tricky: if you sample the actual texts, news and blogs follow conventional grammar more closely, whereas X/Twitter entries tend to be ad hoc, creatively crafted to communicate as much as possible using the fewest keystrokes.
At this writing, the strategy is to create a prediction model for each source of text and look for the optimum language predictor among them. News averages 34 words/line across just over a million lines, whereas blogs average 42 words/line across fewer than 900,000 lines. Might this be a hint that the more disciplined news writer offers a better predictor than what we may get from the bloggers? And tweets! Averaging only 13 words/line, with who knows what level of “grammatical condensation” employed, tweets could yield the most interesting model!
The end product of this predictor development is a Shiny app that takes a phrase (multiple words) as input and predicts the next word.
We summarize our frequency distribution of words in a table of ordered frequency counts.
What is the frequency of occurrence of the words from the news text, ordered from most used to least used?
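library(dplyr)
# count occurrences of each distinct word, then sort from most to least frequent
foo <- data.frame(value = newsWords) %>% group_by(value) %>% summarize(count = n())
foo <- arrange(foo,desc(count))
# show the six most frequent and the six least frequent entries
head(foo); tail(foo)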
# A tibble: 6 × 2
# value count
# <chr> <int>
# 1 the 1717673
# 2 to 893070
# 3 and 852951
# 4 a 843179
# 5 of 767698
# 6 in 628247
# A tibble: 6 × 2
# value count
# <chr> <int>
# 1 zuzz 1
# 2 zydeco-blues 1
# 3 zygote 1
# 4 zz 1
# 5 zzril 1
# 6 zzz's 1
Here’s where we find strings of characters between spaces that turn out not to be English words (zuzz and zzril, for example). Clearly, more refined filtering will be needed for prediction.
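Once the counts exist, it is straightforward to measure how much of the vocabulary consists of such one-off strings. The sketch below uses base R and a toy word vector; `word_frequencies` is a hypothetical helper, and no figures from the real corpus are implied.

```r
# Hypothetical helper: build a word-frequency table sorted from most
# to least frequent.
word_frequencies <- function(words) {
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(value = names(counts),
             count = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_words <- c("the", "cat", "the", "zzz", "the", "cat")
freq <- word_frequencies(toy_words)
print(head(freq, 3))

# Share of distinct words that occur exactly once.
print(mean(freq$count == 1))
```

On the real data, a large share of singletons would suggest that much of the vocabulary is noise rather than usable English.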
What do we find in blog text?
foo <- data.frame(value = blogWords) %>% group_by(value) %>% summarize(count = n())
foo <- arrange(foo,desc(count))
head(foo); tail(foo)
# A tibble: 6 × 2
# value count
# <chr> <int>
# 1 the 1667591
# 2 to 1052325
# 3 and 1033994
# 4 of 866973
# 5 a 863855
# 6 I 816958
# A tibble: 6 × 2
# value count
# <chr> <int>
# 1 zzzz 1
# 2 zzzz's 1
# 3 zzzzz's 1
# 4 zzzzzzzzz 1
# 5 zzzzzzzzzzzzzzzzzzzzzzzzzzz 1
# 6 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz 1
Again, we find single occurrences of odd strings. This researcher is no English major, but these will not pass for English! (Anyone care to guess what all those zzzzzzz’s mean?)
And what do we find in the X/Twitter text?
foo <- data.frame(value = twitterWords) %>% group_by(value) %>% summarize(count = n())
foo <- arrange(foo,desc(count))
head(foo); tail(foo)
# A tibble: 6 × 2
# value count
# <chr> <int>
# 1 the 841467
# 2 to 769408
# 3 I 624548
# 4 a 577366
# 5 you 482123
# 6 and 405247
# A tibble: 6 × 2
# value count
# <chr> <int>
# 1 zzolo 1
# 2 zzt 1
# 3 zztop 1
# 4 zzzs 1
# 5 zzzzzn 1
# 6 zzzzzzzx 1
Note that a simple filter for alphabetic character strings is insufficient to identify a word of English. We must go further. Building a proper language predictor will require careful filtering and cleaning of the base text, and an early step will be to isolate only proper English words using a dictionary.
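A dictionary filter of the kind described could look something like the sketch below. It is only an illustration under stated assumptions: `keep_dictionary_words` is a hypothetical helper, the dictionary is a four-word toy list, and no particular word list has been chosen for this project. A real run might load a full list, for example the system word list at /usr/share/dict/words found on many Unix-like systems.

```r
# Hypothetical helper: keep only tokens found in a reference word list,
# ignoring letter case.
keep_dictionary_words <- function(words, dictionary) {
  words[tolower(words) %in% tolower(dictionary)]
}

# Toy dictionary for illustration only; a real run would load a full word list.
toy_dictionary <- c("the", "zygote", "sleep", "I")
tokens <- c("the", "zzril", "Zygote", "zzzz", "I")
print(keep_dictionary_words(tokens, toy_dictionary))
```

Matching in lower case keeps capitalized sentence starts from being discarded, at the cost of treating proper nouns and common nouns alike.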
The three plots should be self-explanatory, but note that they were made from the word distributions obtained so far. Further cleaning and filtering will likely change these counts. Stay tuned…