Summary

The goal of this project is to analyze a large corpus of text documents to discover the structure in the data and how words are put together. This milestone report covers loading, cleaning and analyzing the text data in preparation for building a predictive text model.

Corpus Load and Cleanup

The first step in creating our predictive model is to understand the data used to train it. In this case the dataset is composed of 3 collections of text: blog entries, news and tweets. To create the corpus, we’ll be loading 3 files:

+ en_US.blogs.txt     (200 MB)
+ en_US.news.txt      (196 MB)
+ en_US.twitter.txt   (159 MB)

To compose the corpus we need to load the text collections and explore some basic features of the data. Because of the size of the data set and the processing and memory constraints, a random sample of 10% of each file is used to create the corpus for this report. After loading the text files, let’s take a look at the first lines of each of them:

First 5 blog entries:

## [1] "If I were a bear,"
## [2] "“It’s no costume, Pricklewood, I’m the real McCoy.” I then got down onto the carpet, grasped the feet of the armchair with my toes and lifted it off the ground. “How many humans do you know who can do that?” I asked."
## [3] "I napped after studio class tonight. Andrew and I were up really late last night, despite going to bed at a somewhat decent time. We ended up talking to each other for about an hour and a half to two hours just about our lives and what was going on and stuff like that. It was a very nice talk, but it made me tired for today, which explains why I’m still awake now. But don’t worry…I’ll be getting to bed shortly, right after I shower."
## [4] "Actually, this April 15, 1912, headline was based on preliminary news and was in error. The final tally shows that 1,513 lives were lost in the sinking of the Titanic, and there were only 711 survivors."
## [5] "I’m sad you can’t be with us but we know you’ll be watching. I don’t know if you ever got to eat at Longhorn or not but I hear they cook a really mean steak."

First 5 news entries:

## [1] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [2] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [3] "But time and again in the report, Sullivan called on CPS to correct problems to improve employee accountability, saying, for example, that measures to keep employees from submitting fraudulent invoices or to block employees from accessing inappropriate websites were not in place."                                                                                                                                                                                                                          
## [4] "Let your hair down; it looks better."                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [5] "Light, simple, and relying heavily on stove-top and cold preparation, crab might make a nice early Thanksgiving menu for those who must shuttle between families. It could also work for families like mine who eat late and want a snack but must keep the oven free for the hated bird."

First 5 tweets:

## [1] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"                                     
## [2] "I'm coo... Jus at work hella tired r u ever in cali"                                                               
## [3] "we need to reconnect THIS WEEK"                                                                                    
## [4] "I'm doing it!👦"
## [5] "Good questions. RT : Your #brand will be judged based on its #website. Is your website a good brand ambassador?..."
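
The lines above come from the 10% random sample of each file. As a minimal sketch of how such a sample can be drawn reproducibly, the snippet below keeps each line with probability 0.1 after fixing a seed. The helper name `sample_lines`, the seed value and the toy input are assumptions made for illustration; they are not part of the report's code.

```r
# Hypothetical helper: keep each line independently with probability `p`.
sample_lines <- function(lines, p = 0.1, seed = 1234) {
  set.seed(seed)  # fixed seed so the same sample is drawn on every run
  keep <- rbinom(length(lines), size = 1, prob = p) == 1
  lines[keep]
}

# Toy stand-in for a text file that was read line by line
toy_lines <- paste("line", 1:1000)
toy_sample <- sample_lines(toy_lines, p = 0.1)

# The size is close to 100 but not exactly 100, because every line
# is an independent coin flip rather than a fixed-size draw.
length(toy_sample)
```

Sampling line by line like this keeps whole entries intact, so sentences are never cut in the middle.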

To perform a deeper analysis of the corpus, we need to clean up the text, remove unwanted symbols and characters, and split the text into tokens. After splitting the corpus, we get the following information:

Source     Lines      Words
Blogs      89541    3712749
News      101023    3426025
Tweets    235690    2967812
Total     426254   10106586

The above table gives a general sense of the size of the corpus and number of words in each type of data set, as well as in total.
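
The cleanup and tokenization step can be sketched in base R. The report's own `tokenize` helper lives in cleanData.R and is not shown here, so the function below, `simple_tokenize`, is a hypothetical stand-in that only lowercases the text, replaces anything that is not a letter, apostrophe or space, and splits on whitespace.

```r
# Hypothetical stand-in for the report's tokenize() helper.
simple_tokenize <- function(lines) {
  text <- tolower(lines)
  # Replace every run of characters other than a-z, apostrophe or space
  text <- gsub("[^a-z' ]+", " ", text)
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens[tokens != ""]  # drop empty strings left by leading separators
}

toy <- c("First Cubs game ever!", "I'm doing it!")
toy_tokens <- simple_tokenize(toy)
toy_tokens
length(toy_tokens)  # word count for this toy collection
```

Note that this simplified pattern also discards accented and other non-ASCII letters, which a production tokenizer would probably want to keep.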

Exploratory Data Analysis

With the complete list of tokens created, let’s explore the corpus to get a broader understanding of word relationships. First we need to create a few n-gram collections and explore word frequencies, so we’ll create a collection of bigrams and a collection of trigrams (groups of 2 and 3 contiguous words respectively).
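
As an illustration of what building these collections involves, here is a minimal base R sketch that slides a window of size n over a token vector and joins the words with an underscore, the same separator that appears in the tables of this report. The function name `make_ngrams` is hypothetical; the `ngrams` helper actually used for the report is defined elsewhere and may work differently.

```r
# Hypothetical helper: build n-grams from a character vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # One shifted copy of the token vector per position in the n-gram
  parts <- lapply(seq_len(n), function(i) tokens[starts + i - 1])
  do.call(paste, c(parts, sep = "_"))
}

toy_tokens <- c("thanks", "for", "the", "follow", "thanks", "for", "the", "rt")
toy_bigrams <- make_ngrams(toy_tokens, 2)
toy_trigrams <- make_ngrams(toy_tokens, 3)
toy_bigrams
toy_trigrams
```

A vector of k tokens yields k - n + 1 n-grams, so the toy input of 8 tokens gives 7 bigrams and 6 trigrams.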

Now we can obtain repetition frequencies for tokens, bigrams and trigrams, and we can see how frequently they show up in the corpus.

Tokens:

##    token  count freq length
## 1    the 475300 4.70      3
## 2     to 275210 2.72      2
## 3    and 239752 2.37      3
## 4      a 237969 2.35      1
## 5     of 201201 1.99      2
## 6     in 164352 1.63      2
## 7      i 163857 1.62      1
## 8    for 110620 1.09      3
## 9     is 106953 1.06      2
## 10  that 103601 1.03      4

Bigrams:

##        gram word1 word2 count freq
## 1    of_the    of   the 43114 0.43
## 2    in_the    in   the 40407 0.40
## 3    to_the    to   the 21532 0.21
## 4   for_the   for   the 20355 0.20
## 5    on_the    on   the 19829 0.20
## 6     to_be    to    be 16202 0.16
## 7    at_the    at   the 14249 0.14
## 8   and_the   and   the 12650 0.13
## 9      in_a    in     a 11976 0.12
## 10 with_the  with   the 10381 0.10

Trigrams:

##              gram  word1 word2 word3 count freq
## 1      one_of_the    one    of   the  3443 0.03
## 2        a_lot_of      a   lot    of  2924 0.03
## 3  thanks_for_the thanks   for   the  2312 0.02
## 4         to_be_a     to    be     a  1898 0.02
## 5     going_to_be  going    to    be  1710 0.02
## 6      the_end_of    the   end    of  1551 0.02
## 7       i_want_to      i  want    to  1484 0.01
## 8      out_of_the    out    of   the  1434 0.01
## 9      as_well_as     as  well    as  1368 0.01
## 10       it_was_a     it   was     a  1359 0.01
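
The `count` and `freq` columns in these tables follow from a simple tabulation: count each distinct item, then divide by the total number of items and express the result as a percentage. A minimal sketch on a toy bigram vector (the data here is made up for illustration):

```r
# Toy bigram collection; the real one comes from the sampled corpus
toy_bigrams <- c("of_the", "in_the", "of_the", "to_be", "of_the", "in_the")

# Count each distinct bigram and order from most to least frequent
counts <- sort(table(toy_bigrams), decreasing = TRUE)

freq_df <- data.frame(
  gram = names(counts),
  count = as.vector(counts),
  stringsAsFactors = FALSE
)

# Relative frequency as a percentage of all bigrams, rounded to 2 decimals
freq_df$freq <- round(freq_df$count / length(toy_bigrams) * 100, 2)
freq_df
```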

Now a few plots will give us a better sense of the word distribution:

Figure 1. Word frequencies for the 30 most common tokens in the corpus.

Figure 2. Word frequencies for the 30 most frequent bigrams in the corpus.

Figure 3. Word frequencies for the 30 most frequent trigrams in the corpus.

This initial analysis shows that the most frequent words are short, and are mainly articles, conjunctions and prepositions.
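
This observation can be quantified with the numbers already reported. The sketch below takes the ten most frequent tokens and their counts from the token table, together with the total word count from the corpus table, and computes their combined share of the corpus and their average length. The variable names are chosen for illustration only.

```r
# Top 10 tokens and counts, copied from the token table in this report
top10 <- data.frame(
  token = c("the", "to", "and", "a", "of", "in", "i", "for", "is", "that"),
  count = c(475300, 275210, 239752, 237969, 201201,
            164352, 163857, 110620, 106953, 103601),
  stringsAsFactors = FALSE
)
total_tokens <- 10106586  # total word count from the corpus table

# Share of all tokens covered by these ten words, in percent
coverage <- sum(top10$count) / total_tokens * 100

# Average number of characters per token
mean_length <- mean(nchar(top10$token))

round(coverage, 2)
mean_length
```

Ten words averaging just over two characters each account for roughly a fifth of every token in the sample.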

Predictive Model

The main idea for the predictive model is to use the collections of bigrams and trigrams (and possibly higher-order n-grams) to build a Markov chain that yields next-word probabilities as words are entered and a sentence starts to form.
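
To make the idea concrete, here is a minimal sketch of a first-order Markov chain built from bigrams: for a given word, it collects every word that follows it in the token stream and turns the counts into probabilities. The function `next_word_probs` and the toy tokens are hypothetical. This is one possible starting point under those assumptions, not the final model.

```r
# Hypothetical helper: estimate P(next word | current word) from a token vector.
next_word_probs <- function(tokens, current) {
  first <- head(tokens, -1)   # every token except the last
  second <- tail(tokens, -1)  # the token that follows each of them
  followers <- second[first == current]
  if (length(followers) == 0) return(numeric(0))  # word never seen as a context
  sort(prop.table(table(followers)), decreasing = TRUE)
}

toy_tokens <- c("i", "want", "to", "go", "i", "want", "a", "break",
                "i", "need", "to", "go")

probs_i <- next_word_probs(toy_tokens, "i")
probs_i                             # "want" is the most likely word after "i"
next_word_probs(toy_tokens, "to")   # "go" is the only word seen after "to"
```

A trigram version would condition on the previous two words instead of one, and a practical model would also need a fallback for contexts that never appear in the training sample.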

Appendix: The Code

#Load libraries and helper functions
library(dplyr)
library(data.table)
library(knitr)
library(ggplot2)
library(tidyr)
source("loadData.R")
source("cleanData.R")

#Load the data from txt files
blogs <- fread('Data/en_US.blogs.txt', sep = "\n",header = FALSE, encoding = "UTF-8",
               verbose = FALSE, showProgress = FALSE)$V1
news <- fread('Data/en_US.news.txt', sep = "\n",header = FALSE, encoding = "UTF-8", 
              verbose = FALSE, showProgress = FALSE)$V1
tweets <- readFile('Data/en_US.twitter.txt')

#Take random samples representing roughly 10% of each dataset
sample <- rbinom(length(blogs),1,0.1)
blogs <- blogs[sample == 1]
sample <- rbinom(length(news),1,0.1)
news <- news[sample == 1]
sample <- rbinom(length(tweets),1,0.1)
tweets <- tweets[sample == 1]

head(blogs,5)
head(news,5)
head(tweets,5)

#Count number of lines in each collection and in total
bloglines <- length(blogs)
newslines <- length(news)
tweetslines <- length(tweets)
totallines <- bloglines + newslines + tweetslines

# Split collections into sets of words, remove symbols and convert to lower case
blogtokens <- tokenize(blogs, remove_symbols = TRUE, 
                       convert_to_lower = TRUE, filter_bad_words = FALSE)
newstokens <- tokenize(news, remove_symbols = TRUE, 
                       convert_to_lower = TRUE, filter_bad_words = FALSE)
tweetstokens <- tokenize(tweets, remove_symbols = TRUE, 
                         convert_to_lower = TRUE, filter_bad_words = FALSE)
#Group all tokens together
tokens <- c(blogtokens,newstokens,tweetstokens)

#Create sets of 2-grams and 3-grams
bigrams <- ngrams(tokens,2)
trigrams <- ngrams(tokens,3)

#Count total number of tokens, bigrams and trigrams
totaltokens <- length(tokens)
totalbigrams <- length(bigrams)
totaltrigrams <- length(trigrams)

#Create ordered tables of word counts
token_table <- sort(table(tokens), decreasing = TRUE)
bigram_table <- sort(table(bigrams), decreasing = TRUE)
trigram_table <- sort(table(trigrams), decreasing = TRUE)

#Create data frame with token frequencies and word length
tokenDF <- data.frame(token = names(token_table), count = as.vector(token_table))
tokenDF$token <- as.vector(tokenDF$token)
tokenDF <- mutate(tokenDF, freq = round((count/totaltokens)*100,2), length = nchar(token))
tokenDF[1:10,]

#Create bigram data frame with frequencies and repetition counts
bigramDF <- data.frame(gram = names(bigram_table), count = as.vector(bigram_table))
bigramDF$gram <- as.vector(bigramDF$gram)
bigramDF <- separate(bigramDF,gram,c("word1","word2"),"_",remove = FALSE)
bigramDF <- mutate(bigramDF, freq = round((count/totalbigrams)*100,2) )
bigramDF[1:10,]

#Create trigram data frame with frequencies and repetition counts
trigramDF <- data.frame(gram = names(trigram_table), count = as.vector(trigram_table))
trigramDF$gram <- as.vector(trigramDF$gram)
trigramDF <- separate(trigramDF,gram,c("word1","word2","word3"),"_",remove = FALSE)
trigramDF <- mutate(trigramDF, freq = round((count/totaltrigrams)*100,2))
trigramDF[1:10,]

#Plot word frequencies
ggplot(tokenDF[1:30,], aes(token,freq)) + 
    geom_bar( stat="identity", position="dodge") + 
    scale_x_discrete(limits=tokenDF$token[1:30]) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title="Figure 1. Token Frequencies",x="Token",y="Frequency (%)")

#Plot bigram frequencies
ggplot(bigramDF[1:30,], aes(gram,freq)) + 
    geom_bar( stat="identity", position="dodge") + 
    scale_x_discrete(limits=bigramDF$gram[1:30]) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title="Figure 2. 2-Gram Frequencies",x="Bigram",y="Frequency (%)")

#Plot trigram frequencies
ggplot(trigramDF[1:30,], aes(gram,freq)) + 
    geom_bar( stat="identity", position="dodge") + 
    scale_x_discrete(limits=trigramDF$gram[1:30]) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(title="Figure 3. 3-Gram Frequencies",x="Trigram",y="Frequency (%)")