This is a milestone report for the Coursera Data Science Specialization’s capstone project. Our ultimate aim is to build a text prediction model. So far we have ingested the raw text files into corpus objects using the quanteda package and performed exploratory analysis, including line and word count summaries, plots of the most frequent words, and a variety of other interesting features of the data. We have also begun constructing an n-gram model for text prediction while keeping processing time and object size in mind.
We begin by ingesting three text files:
For modeling purposes we will soon merge the three types of text lines into a single corpus, but first we create document-level variables to classify each line as either blog, news or Twitter.
setwd("C:/Users/v-eritho/Desktop/RScripts/capstone_project/data/")
us_blogs <- read_lines("en_US.blogs.txt")
us_news <- read_lines("en_US.news.txt")
us_twitter <- read_lines("en_US.twitter.txt")
us_blogs_corpus <- corpus(us_blogs)
us_news_corpus <- corpus(us_news)
us_twitter_corpus <- corpus(us_twitter)
# Add document-level variables (docvars) for blogs, news and twitter
docvars(us_blogs_corpus, field = "source") <- rep("blogs", times = ndoc(us_blogs_corpus))
docvars(us_news_corpus, field = "source") <- rep("news", times = ndoc(us_news_corpus))
docvars(us_twitter_corpus, field = "source") <- rep("twitter", times = ndoc(us_twitter_corpus))
Next we merge the three data sources into a single corpus and make a table showing how many documents of each source type exist in our newly-merged corpus. We see there are about 900K blog lines, 1 million news lines and 2.4 million Tweets.
us_corpus <- us_blogs_corpus + us_news_corpus + us_twitter_corpus
table(docvars(us_corpus, "source"))
##
## blogs news twitter
## 899288 1010242 2360148
Since our merged dataset contains over 4 million text entries, we can retain a large sample while dramatically improving processing time by using only 5% of the original data as our input.
Below we see how many text lines of each type (blogs, news, Twitter) exist in the sample dataset; this will be helpful in constructing our text prediction model. The line counts confirm that the sample is indeed approximately five percent of the original data. Note that most of the exploratory analysis below uses the full merged dataset rather than this sample.
sample_us_corpus <- corpus_sample(us_corpus, size = round(.05 * ndoc(us_corpus)))
table(docvars(sample_us_corpus, "source"))
##
## blogs news twitter
## 44876 50519 118089
Below is a quick way to see a particular word in its various contexts, using quanteda’s keyword-in-context (kwic) function.
kwic(sample_us_corpus[1:25000], "terror")
##
##  [text488994, 338] had been subjected to constant | terror | and peril, compelling them
##   [text172196, 53]       such is my distress and | terror | about what he will throw
##   [text272808, 37]           , tension, action, | terror | , and, of course
##  [text4917081, 24]          last December. U.S. | terror | experts have described Ali Mussa
In addition to our line counts, we want to count words. Here we see there are over 101 million total word occurrences (tokens) in our original dataset. Below is a table showing the most frequently occurring words; “the” alone appears over 4.7 million times.
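Note that the counts below use a document-feature matrix, us_DFM, whose construction was omitted above. A minimal sketch of how it (and the sample_us_DFM used later for the word cloud) could be built with quanteda follows; the tokenization options shown are assumptions and would need to match whatever settings produced the counts reported below.
# Sketch only: build document-feature matrices from the corpora.
# (The remove_punct setting is an assumption, not necessarily the exact option used.)
us_tokens <- tokens(us_corpus, remove_punct = TRUE)
us_DFM <- dfm(us_tokens)
sample_tokens <- tokens(sample_us_corpus, remove_punct = TRUE)
sample_us_DFM <- dfm(sample_tokens)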
sum(textstat_frequency(us_DFM)$frequency)
## [1] 101627352
head(textstat_frequency(us_DFM))
## feature frequency rank docfreq
## 1 the 4765865 1 2028862
## 2 to 2754522 2 1631477
## 3 and 2414926 3 1395670
## 4 a 2382276 4 1451159
## 5 of 2005499 5 1205450
## 6 i 1653822 6 962367
Next we consider the distribution of word counts even further. Specifically, how many unique words does one need in a frequency-sorted dictionary to cover 50% of all word instances in the merged corpus? Below we see that the top 153 most common words account for about 50.8 million word occurrences, which is approximately 50% of our merged corpus.
sum(textstat_frequency(us_DFM)$frequency[1:153])
## [1] 50855976
Similarly, to cover 90% of the word instances in the corpus, one needs the top 9,000 most commonly-occurring words.
sum(textstat_frequency(us_DFM)$frequency[1:9000])
## [1] 91743606
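As an aside, these cutoffs can also be computed directly from the cumulative frequency distribution rather than by trial and error. Below is a small sketch; the coverage_cutoff() helper is hypothetical and assumes the same us_DFM as above.
# Hypothetical helper: smallest number of top-ranked words whose combined
# frequency covers at least a proportion p of all word instances.
word_freqs <- textstat_frequency(us_DFM)$frequency   # already sorted, most frequent first
coverage_cutoff <- function(p) {
  which(cumsum(word_freqs) / sum(word_freqs) >= p)[1]
}
coverage_cutoff(0.5)  # should be close to the 153 words noted above
coverage_cutoff(0.9)  # should be close to the 9,000 words noted above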
Next we plot the 30 most frequent words:
library(ggplot2)
ggplot(textstat_frequency(us_DFM)[1:30, ],
aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
labs(x = NULL, y = "Frequency")
Next we find the maximum line length, in characters, for each source. As expected, the maximum length of any Tweet is 140 characters.
max_char_blogs <- max(nchar(us_blogs))
max_char_blogs
## [1] 40833
max_char_news <- max(nchar(us_news))
max_char_news
## [1] 11384
max_char_twitter <- max(nchar(us_twitter))
max_char_twitter
## [1] 140
We can easily calculate the ratio of Twitter lines containing the word “love” to those containing “hate”.
length(grep("love", us_twitter)) / length(grep("hate", us_twitter))
## [1] 4.108592
We can create a word cloud from 500 documents of our randomly sampled corpus.
# Word Cloud
set.seed(555)
textplot_wordcloud(sample_us_DFM[1:500], min.freq = 6, random.order = FALSE,
rot.per = .25,
colors = RColorBrewer::brewer.pal(8,"Dark2"))
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. n-grams will be foundational to our model.
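As a small, self-contained illustration of the idea, here is a toy example using quanteda’s tokens_ngrams() on a single sentence (this is separate from the textstat_collocations() approach used below).
# Toy example: the 2-grams of a six-word sentence
toy_tokens <- tokens("to be or not to be")
tokens_ngrams(toy_tokens, n = 2)
# produces the bigrams: to_be, be_or, or_not, not_to, to_be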
We first show that our merged dataset contains 925,854 different words, or “types”.
us_DFM@Dim[2]
## [1] 925854
Below are the top 50 1-grams (single “types”):
table1 <- topfeatures(us_DFM, n = 50, decreasing = TRUE, scheme = "count")
head(attr(table1,"names"), 50)
# Top 2-grams
table2 <- textstat_collocations(sample_us_corpus, size = 2)
table2 <- table2[order(table2$count, decreasing = TRUE), ]
# Top 3-grams
table3 <- textstat_collocations(sample_us_corpus, size = 3)
table3 <- table3[order(table3$count, decreasing = TRUE), ]
# Top 4-grams
table4 <- textstat_collocations(sample_us_corpus, size = 4)
table4 <- table4[order(table4$count, decreasing = TRUE), ]
# Most common 2-grams
head(table2)
## collocation count length lambda z
## 1 in the 20230 2 2.021626 238.5982
## 2 of the 21282 2 1.822882 224.8970
## 3 will be 4091 2 4.239709 218.0463
## 4 it was 4729 2 3.231778 192.1029
## 5 to be 7945 2 2.627257 191.9659
## 6 more than 2123 2 5.399407 191.0009
# Most common 3-grams
head(table3)
## collocation count length lambda z
## 13 one of the 1694 3 1.4719325 20.837448
## 113933 a lot of 1555 3 0.2603113 1.687363
## 688 thanks for the 1233 3 1.0477274 9.101429
## 296504 to be a 867 3 -0.1658977 -3.461414
## 14938 going to be 803 3 2.2550053 4.201064
## 364 the end of 744 3 1.3294764 10.601295
# Most common 4-grams
head(table4)
## collocation count length lambda z
## 4459 the end of the 407 4 1.5483429 1.8011449
## 40606 at the end of 350 4 0.1875669 0.3045873
## 10102 the rest of the 330 4 0.9132821 1.3424442
## 341 = = = = 325 4 6.8543446 3.0687479
## 117779 thanks for the follow 312 4 -3.8425992 -2.1890331
## 35 for the first time 292 4 3.7693906 4.3877843
As mentioned previously, n-grams are a core component of our model. Our goal is to build a text prediction model, and we will begin by creating a simple model based on n-grams. Specifically, our model will rely on a dataset with three columns containing: the first two words of each 3-gram, the third word of that 3-gram (the word we will predict), and how often that 3-gram occurs.
This is done using the below code.
# Split each 3-gram into its first two words (the prefix) and its third word
# (the word to be predicted).
splits <- strsplit(table3$collocation, " ")
temp2 <- NULL
temp3 <- NULL
for (i in 1:length(splits)) {
  temp2 <- c(temp2, splits[[i]][3])      # third word of each 3-gram
  temp3 <- c(temp3, splits[[i]][1:2])    # first two words, interleaved
}
temp4 <- temp3[c(TRUE, FALSE)]           # every first word
temp5 <- temp3[c(FALSE, TRUE)]           # every second word
temp1 <- paste0(temp4, " ", temp5)       # two-word prefixes
head(temp1)                              # preview the prefixes
head(temp2)                              # preview the predicted words
This code builds the dataset behind our first text prediction model, a simple model based on n-grams. Ultimately I expect to need Katz back-off, smoothing and/or skip-grams, but I plan to start with a very simple model.
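As a forward-looking illustration, below is a rough sketch of how a lookup-based prediction could use that dataset. The ngram_table data frame and the predict_next() helper are hypothetical (not part of the pipeline above) and assume temp1, temp2 and table3 as created in the previous chunk.
# Hypothetical sketch: assemble the prefix / prediction / count table and
# predict the next word by picking the most frequent matching 3-gram.
ngram_table <- data.frame(prefix = temp1,
                          prediction = temp2,
                          count = table3$count,
                          stringsAsFactors = FALSE)

predict_next <- function(last_two_words) {
  matches <- ngram_table[ngram_table$prefix == last_two_words, ]
  if (nrow(matches) == 0) return(NA_character_)  # back-off logic would go here
  matches$prediction[which.max(matches$count)]
}

predict_next("thanks for")  # based on the 3-gram counts above, likely "the"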
I would appreciate any feedback on this report.