Introduction

This is a milestone report for the Coursera Data Science Specialization’s capstone project. Our ultimate aim is to build a text prediction model. So far we have ingested the raw text files into corpus objects using the quanteda package and performed exploratory analysis, including line and word count summaries, plots of the most frequent words, and a look at several other features of the data. We have also begun constructing an n-gram model for text prediction while keeping processing time and object size in mind.

Data Ingestion

We begin by ingesting three text files:

  1. a set of text lines from blog posts
  2. a set of text lines from news articles
  3. a set of text lines from Tweets

For modeling purposes we will soon merge the three types of text lines into a single corpus, but first we create document-level variables to classify each line as either blog, news or Twitter.

library(readr)
library(quanteda)

setwd("C:/Users/v-eritho/Desktop/RScripts/capstone_project/data/")

# Read the raw text files, one line per element
us_blogs <- read_lines("en_US.blogs.txt")
us_news <- read_lines("en_US.news.txt")
us_twitter <- read_lines("en_US.twitter.txt")

# Build one corpus per source
us_blogs_corpus <- corpus(us_blogs)
us_news_corpus <- corpus(us_news)
us_twitter_corpus <- corpus(us_twitter)

# Add document-level variables (docvars) for blogs, news and twitter
docvars(us_blogs_corpus, field = "source") <- rep("blogs", times = ndoc(us_blogs_corpus))
docvars(us_news_corpus, field = "source") <- rep("news", times = ndoc(us_news_corpus))
docvars(us_twitter_corpus, field = "source") <- rep("twitter", times = ndoc(us_twitter_corpus))

Line Counts

Next we merge the three data sources into a single corpus and make a table showing how many documents of each source type exist in our newly-merged corpus. We see there are about 900K blog lines, 1 million news lines and 2.4 million Tweets.

us_corpus <- us_blogs_corpus + us_news_corpus + us_twitter_corpus
table(docvars(us_corpus, "source"))
## 
##   blogs    news twitter 
##  899288 1010242 2360148

Sample Dataset

Because our merged dataset contains over 4 million text lines, we can dramatically improve processing time while still retaining a large sample by using only 5% of the original dataset as the input.

Below we see how many text lines of each type (blogs, news, Twitter) exist in the sample dataset. This will be helpful in constructing our text prediction model. Based on the line counts below, we see that the sample data set is indeed approximately five percent of the total, original data set. Note that for most of the exploratory analysis below, we are using the original, merged dataset, rather than this sample dataset.

sample_us_corpus <- corpus_sample(us_corpus, size = round(0.05 * ndoc(us_corpus)))
table(docvars(sample_us_corpus, "source"))
## 
##   blogs    news twitter 
##   44876   50519  118089
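
As a quick check (this one-liner is an addition for verification, using the corpora created above), the sample size as a fraction of the full merged corpus should be approximately 0.05:

# Sample size as a fraction of the full merged corpus
ndoc(sample_us_corpus) / ndoc(us_corpus)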

Exploratory Analysis

Keywords-in-Context

Below is a quick way to see a particular word in its various contexts.

kwic(sample_us_corpus[1:25000], "terror")
##  [text488994, 338]  had been subjected to constant | terror | and peril, compelling them
##   [text172196, 53]         such is my distress and | terror | about what he will throw
##   [text272808, 37]              , tension, action, | terror | , and, of course
##  [text4917081, 24]             last December. U.S. | terror | experts have described Ali Mussa

Word Counts

In addition to our line counts, we want to count words. Summing the feature frequencies shows there are over 101 million total word occurrences (tokens) in our original dataset. Below is a table showing the most frequently occurring words. You can see that “the” appears over 4.7 million times.
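
Note that the document-feature matrices us_DFM and sample_us_DFM used below are not constructed in the code shown above. A minimal sketch of how they might be built, assuming lowercasing and punctuation removal (the exact preprocessing options are not shown in this report):

library(quanteda)
# In recent quanteda releases the textstat_* functions live in a companion
# package; if so, also run: library(quanteda.textstats)

# Assumed construction of the document-feature matrices used below
us_DFM <- dfm(tokens(us_corpus, remove_punct = TRUE), tolower = TRUE)
sample_us_DFM <- dfm(tokens(sample_us_corpus, remove_punct = TRUE), tolower = TRUE)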

sum(textstat_frequency(us_DFM)$frequency)
## [1] 101627352
head(textstat_frequency(us_DFM))
##   feature frequency rank docfreq
## 1     the   4765865    1 2028862
## 2      to   2754522    2 1631477
## 3     and   2414926    3 1395670
## 4       a   2382276    4 1451159
## 5      of   2005499    5 1205450
## 6       i   1653822    6  962367

Next we consider the distribution of the word counts even further. Specifically, how many unique words does one need in a frequency-sorted dictionary to cover 50% of all word instances in the merged corpus? Below we see that the top 153 most common words account for about 50.9 million word instances, which is approximately 50% of our merged corpus.

sum(textstat_frequency(us_DFM)$frequency[1:153])
## [1] 50855976

Similarly, to cover 90% of the word instances in the corpus, one needs roughly the top 9,000 most commonly occurring words.

sum(textstat_frequency(us_DFM)$frequency[1:9000])
## [1] 91743606
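
The cutoffs above (153 and 9,000 words) can also be found programmatically rather than by trial. A minimal sketch, assuming the us_DFM built earlier:

# Cumulative coverage of the frequency-sorted vocabulary
word_freqs <- textstat_frequency(us_DFM)$frequency
coverage <- cumsum(word_freqs) / sum(word_freqs)

min(which(coverage >= 0.50))  # number of top words for 50% coverage (153 above)
min(which(coverage >= 0.90))  # number of top words for 90% coverage (~9,000 above)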

Next we plot the 30 most frequent words:

library(ggplot2)
ggplot(textstat_frequency(us_DFM)[1:30, ], 
       aes(x = reorder(feature, frequency), y = frequency)) +
       geom_point() + 
       labs(x = NULL, y = "Frequency")

Next we find the maximum number of characters in a single line for each source. As expected, the maximum length of any Tweet is 140 characters.

# Maximum line length (in characters) for each source
max_char_blogs <- max(nchar(us_blogs))
max_char_blogs
## [1] 40833
max_char_news <- max(nchar(us_news))
max_char_news
## [1] 11384
max_char_twitter <- max(nchar(us_twitter))
max_char_twitter
## [1] 140

We can easily calculate the ratio of Twitter lines containing “love” to lines containing “hate”.

length(grep("love", us_twitter)) / length(grep("hate", us_twitter)) 
## [1] 4.108592
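
Since grep() matches substrings (so “lovely” and “hateful” are also counted), a variant restricted to whole-word matches might look like this (an added sketch, not part of the original analysis):

# Ratio of Tweets containing the whole word "love" to those containing "hate"
sum(grepl("\\blove\\b", us_twitter)) / sum(grepl("\\bhate\\b", us_twitter))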

We can create a word cloud from 500 of the randomly sampled documents in our corpus.

# Word Cloud
set.seed(555)
textplot_wordcloud(sample_us_DFM[1:500], min.freq = 6, random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

n-grams

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. These n-grams will be foundational to our model.
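
As a quick illustration (not part of the model pipeline), quanteda’s tokens_ngrams() shows what the 2-grams of a short sentence look like:

# 2-grams ("bigrams") of a toy sentence
tokens_ngrams(tokens("to be or not to be"), n = 2)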

We first show that our merged dataset contains 925,854 different words, or “types”.

nfeat(us_DFM)   # number of unique word types (columns of the DFM)
## [1] 925854

Below are the top 50 1-grams (single “types”):

table1 <- topfeatures(us_DFM, n = 50, decreasing = TRUE, scheme = "count")
names(table1)
# Top 2-grams
table2 <- textstat_collocations(sample_us_corpus, size = 2)
table2 <- table2[order(table2$count, decreasing = TRUE), ]

# Top 3-grams
table3 <- textstat_collocations(sample_us_corpus, size = 3)
table3 <- table3[order(table3$count, decreasing = TRUE), ]

# Top 4-grams
table4 <- textstat_collocations(sample_us_corpus, size = 4)
table4 <- table4[order(table4$count, decreasing = TRUE), ]
# Most common 2-grams
head(table2)
##   collocation count length   lambda        z
## 1      in the 20230      2 2.021626 238.5982
## 2      of the 21282      2 1.822882 224.8970
## 3     will be  4091      2 4.239709 218.0463
## 4      it was  4729      2 3.231778 192.1029
## 5       to be  7945      2 2.627257 191.9659
## 6   more than  2123      2 5.399407 191.0009
# Most common 3-grams
head(table3)
##           collocation count length     lambda         z
## 13         one of the  1694      3  1.4719325 20.837448
## 113933       a lot of  1555      3  0.2603113  1.687363
## 688    thanks for the  1233      3  1.0477274  9.101429
## 296504        to be a   867      3 -0.1658977 -3.461414
## 14938     going to be   803      3  2.2550053  4.201064
## 364        the end of   744      3  1.3294764 10.601295
# Most common 4-grams
head(table4)
##                  collocation count length     lambda          z
## 4459          the end of the   407      4  1.5483429  1.8011449
## 40606          at the end of   350      4  0.1875669  0.3045873
## 10102        the rest of the   330      4  0.9132821  1.3424442
## 341                  = = = =   325      4  6.8543446  3.0687479
## 117779 thanks for the follow   312      4 -3.8425992 -2.1890331
## 35        for the first time   292      4  3.7693906  4.3877843

Model Building

As mentioned previously, n-grams are a core component of our model. Our goal is to build a text prediction model, and we will begin by creating a simple model based on n-grams. Specifically, our model will rely on a dataset with three columns containing:

  1. the first n-1 words of the n-gram
  2. the prediction, which is the nth word of the n-gram
  3. a count of how many times the n-gram appears in the merged corpus

This is done with the code below.

# For each 3-gram, collect the third word (the prediction) in temp2
# and the first two words in temp3
splits <- strsplit(table3$collocation, " ")

temp2 <- NULL
temp3 <- NULL

for (i in 1:length(splits)) {
        temp2 <- c(temp2, splits[[i]][3])
        temp3 <- c(temp3, splits[[i]][1:2])
}

temp4 <- temp3[c(TRUE, FALSE)]      # first word of each 3-gram
temp5 <- temp3[c(FALSE, TRUE)]      # second word of each 3-gram
temp1 <- paste0(temp4, " ", temp5)  # the first two words, pasted together

head(temp1)
head(temp2)

This code produces the building blocks of the dataset behind our first text prediction model, a simple n-gram model. Ultimately I expect to need Katz back-off, smoothing and/or skip-grams, but I plan to start with a very simple model.
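
For completeness, one possible way to assemble the three-column table described above (a sketch, assuming temp1, temp2 and table3 from the previous chunk; the column names are placeholders):

# Three-column trigram table: first n-1 words, predicted word, and count
trigram_model <- data.frame(predictor  = temp1,
                            prediction = temp2,
                            count      = table3$count,
                            stringsAsFactors = FALSE)

# Most frequent trigrams first, so a lookup can return the most likely next word
head(trigram_model[order(trigram_model$count, decreasing = TRUE), ])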

I would appreciate any feedback on this report.