The goal of this project is to analyze a large corpus of text documents and discover structure in the data and in how words are put together. This milestone report covers loading, cleaning, and exploring the text data in preparation for building a predictive text model.
The first step toward building the predictive model is to understand the data used to train it. In this case the dataset consists of three collections of text: blog entries, news articles, and tweets. To create the corpus, we'll load three files:
+ en_US.blogs.txt (200 MB)
+ en_US.news.txt (196 MB)
+ en_US.twitter.txt (159 MB)
In order to compose the corpus we need to load the text collections and explore some basic features of the data. Due to the size of the dataset and to processing and memory constraints, a sample of 10% of each file will be used to build the corpus for this report. After loading the text files, let's take a look at the first lines of each of them:
First 5 blog entries:
## [1] "If I were a bear,"
## [2] "<U+0093>It<U+0092>s no costume, Pricklewood, I<U+0092>m the real McCoy.<U+0094> I then got down onto the carpet, grasped the feet of the armchair with my toes and lifted it off the ground. <U+0093>How many humans do you know who can do that?<U+0094> I asked."
## [3] "I napped after studio class tonight. Andrew and I were up really late last night, despite going to bed at a somewhat decent time. We ended up talking to each other for about an hour and a half to two hours just about our lives and what was going on and stuff like that. It was a very nice talk, but it made me tired for today, which explains why I<U+0092>m still awake now. But don<U+0092>t worry<U+0085>I<U+0092>ll be getting to bed shortly, right after I shower."
## [4] "Actually, this April 15, 1912, headline was based on preliminary news and was in error. The final tally shows that 1,513 lives were lost in the sinking of the Titanic, and there were only 711 survivors."
## [5] "I<U+0092>m sad you can<U+0092>t be with us but we know you<U+0092>ll be watching. I don<U+0092>t know if you ever got to eat at Longhorn or not but I hear they cook a really mean steak."
First 5 news items:
## [1] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [2] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [3] "But time and again in the report, Sullivan called on CPS to correct problems to improve employee accountability, saying, for example, that measures to keep employees from submitting fraudulent invoices or to block employees from accessing inappropriate websites were not in place."
## [4] "Let your hair down; it looks better."
## [5] "Light, simple, and relying heavily on stove-top and cold preparation, crab might make a nice early Thanksgiving menu for those who must shuttle between families. It could also work for families like mine who eat late and want a snack but must keep the oven free for the hated bird."
First 5 tweets:
## [1] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [2] "I'm coo... Jus at work hella tired r u ever in cali"
## [3] "we need to reconnect THIS WEEK"
## [4] "I'm doing it!<f0><U+009F><U+0091><U+00A6>"
## [5] "Good questions. RT : Your #brand will be judged based on its #website. Is your website a good brand ambassador?..."
Now, in order to perform some deeper analysis of the corpus, we need to clean up the text, remove unwanted symbols and characters, and split the text into tokens (a sketch of this step appears after the summary table below). After tokenizing the corpus, we get the following information:
| Source | Lines | Words |
|---|---|---|
| Blogs | 89541 | 3712749 |
| News | 101023 | 3426025 |
| Tweets | 235690 | 2967812 |
| Total | 426254 | 10106586 |
The table above gives a general sense of the size of the corpus and of the number of lines and words contributed by each source, as well as in total.
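The tokenize() function used in this step is a helper sourced from cleanData.R and is not reproduced in the appendix. As a rough illustration only, a minimal tokenizer along these lines could look like the sketch below; the function tokenize_sketch and its body are an assumption about the helper's behavior, not the actual implementation.
#Illustrative sketch of a tokenizer similar to the tokenize() helper in cleanData.R
#(an assumption about its behavior, not the real implementation)
tokenize_sketch <- function(lines, remove_symbols = TRUE, convert_to_lower = TRUE) {
  if (convert_to_lower) lines <- tolower(lines)
  if (remove_symbols) lines <- gsub("[^[:alnum:]' ]", " ", lines)  #keep letters, digits, apostrophes
  words <- unlist(strsplit(lines, "\\s+"))  #split on whitespace
  words[words != ""]                        #drop empty strings
}
With a tokenizer like this, the line counts in the table correspond to length(blogs), length(news) and length(tweets), and the word counts to the lengths of the resulting token vectors.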
With the complete list of tokens created, let's proceed to explore the corpus and get a wider understanding of word relationships. First we need to create a few n-gram collections and explore word frequencies, so we'll build a collection of bigrams and a collection of trigrams (groups of 2 and 3 contiguous words, respectively).
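The ngrams() function used for this is likewise a sourced helper. One simple way to build such collections from the token vector is sketched below; ngrams_sketch is an illustrative stand-in, not the actual helper.
#Illustrative sketch of an n-gram builder similar to the ngrams() helper:
#joins each run of n consecutive tokens with "_" (e.g. "of_the", "one_of_the")
ngrams_sketch <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  parts <- lapply(0:(n - 1), function(i) tokens[(1 + i):(length(tokens) - n + 1 + i)])
  do.call(paste, c(parts, sep = "_"))
}
bigrams_sketch <- ngrams_sketch(tokens, 2)
trigrams_sketch <- ngrams_sketch(tokens, 3)
Note that this naive version also forms n-grams that span line boundaries, since all tokens were concatenated into a single vector; the actual helper may treat each line separately.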
Now we can compute occurrence counts and relative frequencies for tokens, bigrams and trigrams, and see how often each appears in the corpus.
Tokens:
## token count freq length
## 1 the 475300 4.70 3
## 2 to 275210 2.72 2
## 3 and 239752 2.37 3
## 4 a 237969 2.35 1
## 5 of 201201 1.99 2
## 6 in 164352 1.63 2
## 7 i 163857 1.62 1
## 8 for 110620 1.09 3
## 9 is 106953 1.06 2
## 10 that 103601 1.03 4
Bigrams:
## gram word1 word2 count freq
## 1 of_the of the 43114 0.43
## 2 in_the in the 40407 0.40
## 3 to_the to the 21532 0.21
## 4 for_the for the 20355 0.20
## 5 on_the on the 19829 0.20
## 6 to_be to be 16202 0.16
## 7 at_the at the 14249 0.14
## 8 and_the and the 12650 0.13
## 9 in_a in a 11976 0.12
## 10 with_the with the 10381 0.10
Trigrams:
## gram word1 word2 word3 count freq
## 1 one_of_the one of the 3443 0.03
## 2 a_lot_of a lot of 2924 0.03
## 3 thanks_for_the thanks for the 2312 0.02
## 4 to_be_a to be a 1898 0.02
## 5 going_to_be going to be 1710 0.02
## 6 the_end_of the end of 1551 0.02
## 7 i_want_to i want to 1484 0.01
## 8 out_of_the out of the 1434 0.01
## 9 as_well_as as well as 1368 0.01
## 10 it_was_a it was a 1359 0.01
Now a few plots will give us a better sense of the word distributions.
Figure 1. Word frequencies for the 30 most common tokens in the corpus.
Figure 2. Word frequencies for the 30 most frequent bi-grams in the corpus.
Figure 3. Word frequencies for the 30 most frequent trigrams in the corpus.
From this initial analysis, we can see that the most frequent words are typically short ones: mainly articles, conjunctions, and prepositions.
The main idea for the predictive model is to use the collections of bigrams and trigrams (and possibly higher-order n-grams) to build a Markov chain that yields next-word probabilities as words are typed and a sentence takes shape; a minimal sketch of this idea follows.
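As a rough illustration of that idea (a sketch of the intended approach, not the final model), the trigram table built above can already be used to look up likely next words given the two preceding words. The predict_next() function below is a hypothetical helper introduced only for this example.
#Minimal sketch of next-word prediction from the trigram counts (hypothetical helper)
predict_next <- function(w1, w2, trigramDF, top = 3) {
  candidates <- trigramDF[trigramDF$word1 == w1 & trigramDF$word2 == w2, ]
  candidates <- candidates[order(-candidates$count), ]
  #relative counts act as (unsmoothed) Markov transition probabilities
  head(data.frame(word = candidates$word3,
                  prob = candidates$count / sum(candidates$count)), top)
}
predict_next("one", "of", trigramDF)  #most likely continuations of "one of"
A full model would need to back off to the bigram and unigram tables when a given word pair has not been seen, and apply some form of smoothing to avoid zero probabilities.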
#Load libraries and helper functions
library(dplyr)
library(data.table)
library(knitr)
library(ggplot2)
library(tidyr)
source("loadData.R")
source("cleanData.R")
#Load the data from txt files
blogs <- fread('Data/en_US.blogs.txt', sep = "\n", header = FALSE, encoding = "UTF-8",
               verbose = FALSE, showProgress = FALSE)$V1
news <- fread('Data/en_US.news.txt', sep = "\n", header = FALSE, encoding = "UTF-8",
              verbose = FALSE, showProgress = FALSE)$V1
tweets <- readFile('Data/en_US.twitter.txt')
#Take random samples representing 10% of each dataset
sample <- rbinom(length(blogs),1,0.1)
blogs <- blogs[sample == 1]
sample <- rbinom(length(news),1,0.1)
news <- news[sample == 1]
sample <- rbinom(length(tweets),1,0.1)
tweets <- tweets[sample == 1]
head(blogs,5)
head(news,5)
head(tweets,5)
#Count number of lines in each collection and in total
bloglines <- length(blogs)
newslines <- length(news)
tweetslines <- length(tweets)
totallines <- bloglines + newslines + tweetslines
# Split collections into sets of words, remove symbols and convert to lower case
blogtokens <- tokenize(blogs, remove_symbols = TRUE,
                       convert_to_lower = TRUE, filter_bad_words = FALSE)
newstokens <- tokenize(news, remove_symbols = TRUE,
                       convert_to_lower = TRUE, filter_bad_words = FALSE)
tweetstokens <- tokenize(tweets, remove_symbols = TRUE,
                         convert_to_lower = TRUE, filter_bad_words = FALSE)
#Group all tokens together
tokens <- c(blogtokens,newstokens,tweetstokens)
#Create sets of 2-grams and 3-grams
bigrams <- ngrams(tokens,2)
trigrams <- ngrams(tokens,3)
#Count total number of tokens, bigrams and trigrams
totaltokens <- length(tokens)
totalbigrams <- length(bigrams)
totaltrigrams <- length(trigrams)
#Create ordered tables of word counts
token_table <- sort(table(tokens), decreasing = TRUE)
bigram_table <- sort(table(bigrams), decreasing = TRUE)
trigram_table <- sort(table(trigrams), decreasing = TRUE)
#Create data frame with token frequencies and word lengths
tokenDF <- data.frame(token = names(token_table), count = as.vector(token_table))
tokenDF$token <- as.vector(tokenDF$token)
tokenDF <- mutate(tokenDF, freq = round((count/totaltokens)*100,2), length = nchar(token))
tokenDF[1:10,]
#Create bigram data frame with frequencies and repetition counts
bigramDF <- data.frame(gram = names(bigram_table), count = as.vector(bigram_table))
bigramDF$gram <- as.vector(bigramDF$gram)
bigramDF <- separate(bigramDF,gram,c("word1","word2"),"_",remove = FALSE)
bigramDF <- mutate(bigramDF, freq = round((count/totalbigrams)*100,2) )
bigramDF[1:10,]
#Create trigram data frame with frequencies and repetition counts
trigramDF <- data.frame(gram = names(trigram_table), count = as.vector(trigram_table))
trigramDF$gram <- as.vector(trigramDF$gram)
trigramDF <- separate(trigramDF,gram,c("word1","word2","word3"),"_",remove = FALSE)
trigramDF <- mutate(trigramDF, freq = round((count/totaltrigrams)*100,2))
trigramDF[1:10,]
#Plot word frequencies
ggplot(tokenDF[1:30,], aes(token, freq)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_discrete(limits = tokenDF$token[1:30]) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Figure 1. Token Frequencies", x = "Token", y = "Frequency (%)")
#Plot bigram frequencies
ggplot(bigramDF[1:30,], aes(gram, freq)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_discrete(limits = bigramDF$gram[1:30]) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Figure 2. 2-Gram Frequencies", x = "Bigram", y = "Frequency (%)")
#Plot trigram frequencies
ggplot(trigramDF[1:30,], aes(gram, freq)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_discrete(limits = trigramDF$gram[1:30]) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Figure 3. 3-Gram Frequencies", x = "Trigram", y = "Frequency (%)")