I loaded a corpus of texts provided through the Coursera course and conducted some exploratory analyses of these texts. Since the corpus is quite large, I sampled 10% of its lines to find the most common n-grams (n = 1, 2, 3, 4). In this sample, I also determined that roughly 13,000 words covered 90% of all word occurrences, while reaching 100% coverage required more than 100,000 words.
I used the following packages: ggplot2, gridExtra, knitr, readtext, quanteda, stringi, and tidyverse.
In the coming weeks, I will need to construct a predictive model for user text input. The data will be divided into training, development, and test sets to develop this model. The n-gram frequency tables built from the training set will be used to predict the next word, given the user's text input (Markov assumption). This will be based on a backoff strategy, where an (n-1)-gram is used whenever an n-gram is not available, defaulting to a unigram if needed (a rough sketch of this idea is given below).
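As a rough illustration of this backoff lookup only (not the final implementation), the prediction step could look like the following sketch. The `tables` object and its column names (`prefix`, `nextword`, `frequency`) are hypothetical; the actual frequency tables will be built from the training set later.
# sketch of a backoff lookup; `tables` is an assumed list of frequency tables,
# tables[[1]] = unigrams, ..., tables[[4]] = 4-grams, each a data.frame with
# columns `prefix`, `nextword`, and `frequency`, sorted by decreasing frequency
# (for unigrams, `prefix` is empty and `nextword` is the word itself)
predictnext <- function(input, tables) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 3)  # keep the last three words
  for (n in rev(seq_len(length(words)))) {                 # try the longest prefix first
    prefix <- paste(tail(words, n), collapse = "_")
    candidates <- tables[[n + 1]]
    hits <- candidates[candidates$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$nextword[1])           # most frequent observed continuation
  }
  tables[[1]]$nextword[1]                                   # default: the most frequent unigram
}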
In the Shiny app, users will input text, which will then be passed to the prediction model to suggest the next word.
We start by loading in the assigned corpus (a collection of texts to analyze), derived from blog, news, and Twitter sources. The data set also contains texts in other languages, but we will focus only on the English sources here.
We also load in a publicly available list of profanities to filter out later in our analysis.
# blogs
btemp <- file("en_US.blogs.txt", open = "r")
blogs <- readLines(btemp, encoding = "UTF-8", skipNul = TRUE)
# news
ntemp <- file("en_US.news.txt", open = "r")
news <- readLines(ntemp, encoding = "UTF-8", skipNul = TRUE)
# twitter
ttemp <- file("en_US.twitter.txt", open = "r")
twitter <- readLines(ttemp, encoding = "UTF-8", skipNul = TRUE)
close(btemp)
close(ntemp)
close(ttemp)
rm(btemp, ntemp, ttemp)
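The profanity list is read in the same way; a minimal sketch is shown below, where the file name is a placeholder for whichever list is downloaded, and profanity is the resulting character vector used later during cleaning.
# profanity list (file name is a placeholder for the downloaded list)
ptemp <- file("profanity.txt", open = "r")
profanity <- readLines(ptemp, encoding = "UTF-8", skipNul = TRUE)
close(ptemp)
rm(ptemp)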
Next, we quickly summarize the three text files comprising our corpus (i.e., file size, total number of lines, total number of words, and total number of characters). The table below shows that these are quite large files, with tens of millions of words in the entire corpus. Texts from Twitter have fewer words per line, as expected given the platform’s constraints on message length.
# summarizing the data files
# file size
filesize <- round(file.info(c("en_US.blogs.txt",
"en_US.news.txt",
"en_US.twitter.txt"))$size / 1024 ^ 2)
# line count
nlines <- sapply(list(blogs, news, twitter),
length)
# word count
nwords <- sapply(list(blogs, news, twitter),
stri_stats_latex)[4,]
# mean words per line
blogslwords <- stri_count_words(blogs)
newslwords <- stri_count_words(news)
twitterlwords <- stri_count_words(twitter)
summarylwords <- sapply(list(blogslwords, newslwords, twitterlwords),
function(x) summary(x)[c('Mean', 'Max.')])
rownames(summarylwords) = c('words_per_line_mean', 'words_per_line_max')
# character count
nchar <- sapply(list(blogs, news, twitter),
stri_stats_general)[3,]
# summary of the data sets
initsummary <- data.frame(source = c("blogs", "news", "twitter"),
file_size_MB = format(filesize, big.mark = ","),
total_lines = format(nlines, big.mark = ","),
total_words = format(nwords, big.mark = ","),
total_characters = format(nchar, big.mark = ","),
t(round(summarylwords))
)
kable(initsummary, align = "r")
| source | file_size_MB | total_lines | total_words | total_characters | words_per_line_mean | words_per_line_max |
|---|---|---|---|---|---|---|
| blogs | 200 | 899,288 | 37,570,839 | 206,824,382 | 42 | 6726 |
| news | 196 | 1,010,242 | 34,494,539 | 203,223,154 | 34 | 1796 |
| twitter | 159 | 2,360,148 | 30,451,170 | 162,096,241 | 13 | 47 |
pl1 <- qplot(blogslwords,
main = "Blogs",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 10)
pl2 <- qplot(newslwords,
main = "News",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 10)
pl3 <- qplot(twitterlwords,
main = "Twitter",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 1)
grid.arrange(pl1, pl2, pl3)
We plot the distribution of words per line for each source. As suggested by the summary table above, there is a long tail for blog texts in particular, with the longest blog post containing over 6,000 words.
The text collections in our corpus are very large, and it would be computationally intensive to use the entire corpus. We therefore sample 10% of the lines from each collection to generate a smaller sample corpus for analysis.
The quanteda package is used here for its fast processing and ease of use.
The documents in the sample corpus next need to be tokenized, with unwanted elements (including the profanities loaded earlier) removed during cleaning; a sketch of this step is given below.
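Since the sampling and tokenization code itself is not reproduced here, the following is a minimal sketch of how the sampletokens object used below could be produced; the 10% sampling rate comes from the description above, while the specific cleaning options are assumptions.
# sketch of the sampling and tokenization step (assumed, not the exact code used)
set.seed(1234)                                           # for a reproducible sample
sampledlines <- c(sample(blogs, round(length(blogs) * 0.1)),
                  sample(news, round(length(news) * 0.1)),
                  sample(twitter, round(length(twitter) * 0.1)))
samplecorpus <- corpus(sampledlines)                     # build a quanteda corpus
sampletokens <- tokens(samplecorpus,
                       remove_punct = TRUE,              # drop punctuation
                       remove_numbers = TRUE,            # drop digits
                       remove_symbols = TRUE,            # drop symbols
                       remove_url = TRUE)                # drop URLs
sampletokens <- tokens_tolower(sampletokens)             # normalize case
sampletokens <- tokens_remove(sampletokens, pattern = profanity)  # filter profanities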
For our model, we will use n-grams (n = 2, 3, 4) to predict the next word from the previous one, two, or three words, and we also examine the most common individual words. Below, we visualize the frequencies of the most common n-grams.
First, we create a document-feature matrix (dfm) from the list of cleaned tokens. The dfm records the number of times each unique token appears in each document of the sample corpus. Common English stop words (e.g., “the”, “me”, “my”) and individual letters are removed during this step.
# make n-gram dfms
unigram <- tokens_ngrams(sampletokens, n = 1)
unigramdfm <- dfm_remove(dfm(unigram),
pattern = c(stopwords("english"), letters))  # drop stop words and single letters
bigram <- tokens_ngrams(sampletokens, n = 2)
bigramdfm <- dfm_remove(dfm(bigram), pattern = stopwords("english"))
trigram <- tokens_ngrams(sampletokens, n = 3)
trigramdfm <- dfm_remove(dfm(trigram), pattern = stopwords("english"))
tetragram <- tokens_ngrams(sampletokens, n = 4)
tetragramdfm <- dfm_remove(dfm(tetragram), pattern = stopwords("english"))
# top 20 unigrams
top20unigram <- topfeatures(unigramdfm, 20)
top20unigram <- data.frame(word = names(top20unigram), freq = top20unigram, row.names = NULL)
ggplot(top20unigram, aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity") +
labs(title = "Top 20 Unigram Words") +
xlab("Frequency") +
ylab("Unigram") +
coord_flip()
# top 20 bigrams
top20bigrams <- topfeatures(bigramdfm, 20)
top20bigrams <- data.frame(word = names(top20bigrams), freq = top20bigrams, row.names = NULL)
ggplot(top20bigrams, aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity") +
labs(title = "Top 20 Bigrams") +
xlab("Frequency") +
ylab("Bigram") +
coord_flip()
# top 20 trigrams
top20trigrams <- topfeatures(trigramdfm, 20)
top20trigrams <- data.frame(word = names(top20trigrams), freq = top20trigrams, row.names = NULL)
ggplot(top20trigrams, aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity") +
labs(title = "Top 20 Trigrams") +
xlab("Frequency") +
ylab("Trigram") +
coord_flip()
# top 20 4-grams
top20tetragrams <- topfeatures(tetragramdfm, 20)
top20tetragrams <- data.frame(word = names(top20tetragrams), freq = top20tetragrams, row.names = NULL)
ggplot(top20tetragrams, aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity") +
labs(title = "Top 20 4-grams") +
xlab("Frequency") +
ylab("4-gram") +
coord_flip()
Next, we construct a plot to show the number of words needed to cover a given proportion of our sample corpus (i.e., 50%, 60%, 70%, 80%, 90%, 95%, 97.5%, 99%, 100%). We can see that the number of required words grows very rapidly between roughly 90% and 100% coverage.
# clean up objects from memory
rm(sampletokens, top20unigram, top20bigrams, top20trigrams, top20tetragrams, unigram, bigram, trigram, tetragram)
# save dfms to local RDS files for convenience
saveRDS(unigramdfm, file = "unigramdfm.RDS")
saveRDS(bigramdfm, file = "bigramdfm.RDS")
saveRDS(trigramdfm, file = "trigramdfm.RDS")
saveRDS(tetragramdfm, file = "tetragramdfm.RDS")
unigramfeatures <- textstat_frequency(unigramdfm)
# function to calculate the number of top-ranked words needed to cover a given
# proportion of all word occurrences in the sample
coverage <- function(features, percentc) {
  total <- sum(features$frequency)
  cumcovered <- cumsum(features$frequency)   # features are already sorted by frequency
  which(cumcovered >= total * percentc)[1]   # first rank that reaches the target coverage
}
unigrams0.5 <- coverage(unigramfeatures, 0.5)
unigrams0.6 <- coverage(unigramfeatures, 0.6)
unigrams0.7 <- coverage(unigramfeatures, 0.7)
unigrams0.8 <- coverage(unigramfeatures, 0.8)
unigrams0.9 <- coverage(unigramfeatures, 0.9)
unigrams0.95 <- coverage(unigramfeatures, 0.95)
unigrams0.975 <- coverage(unigramfeatures, 0.975)
unigrams0.99 <- coverage(unigramfeatures, 0.99)
unigrams1.0 <- coverage(unigramfeatures, 1)
unigramcoverage <- data.frame(
Percent.Coverage = c(50, 60, 70, 80, 90, 95, 97.5, 99, 100),
Words = c(unigrams0.5, unigrams0.6, unigrams0.7, unigrams0.8, unigrams0.9,
unigrams0.95, unigrams0.975, unigrams0.99, unigrams1.0))
# remove unnecessary objects
rm(unigrams0.5, unigrams0.6, unigrams0.7, unigrams0.8, unigrams0.9, unigrams0.95, unigrams0.975, unigrams0.99, unigrams1.0)
# graph of coverage
ggplot(unigramcoverage, aes(x = Words, y = Percent.Coverage, group = 1)) +
geom_line(color = "maroon") +
geom_text(aes(label = Words), vjust = 1) +
labs(x = "Number of unigrams", y = "Coverage (%)") +
labs(title = "Number of unigrams needed to cover a percentage of the total corpus") +
coord_flip()
Work remains to be done on refining my data cleaning procedures, and I will need to read further about different modelling methods for natural language processing. The Katz back-off model appears to be commonly used for word prediction; it is a discounting method that attempts to overcome data sparsity. The Stupid Backoff model is a simplified but similar approach, and I will try to implement these in the next few weeks (a rough sketch of the Stupid Backoff idea follows below).
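As a hedged sketch only (not a final implementation), Stupid Backoff scores a candidate next word by its relative frequency at the highest n-gram order where it has been observed, multiplied by a fixed penalty (0.4 in the original Stupid Backoff description) for each back-off step. The `tables` object below has the same hypothetical structure as in the earlier backoff sketch.
# sketch of Stupid Backoff scoring; `tables`, `prefix`, `nextword`, and
# `frequency` are assumed names, as in the earlier backoff sketch
sbscore <- function(words, candidate, tables, lambda = 0.4) {
  for (n in rev(seq_len(length(words)))) {               # highest available order first
    prefix <- paste(tail(words, n), collapse = "_")
    tbl <- tables[[n + 1]]
    hits <- tbl[tbl$prefix == prefix, ]                  # observed continuations of this prefix
    hit <- hits[hits$nextword == candidate, ]
    if (nrow(hit) > 0) {
      # relative frequency at this order, discounted once per back-off step
      return(lambda ^ (length(words) - n) * hit$frequency[1] / sum(hits$frequency))
    }
  }
  uni <- tables[[1]]                                     # back off all the way to unigrams
  lambda ^ length(words) * uni$frequency[uni$nextword == candidate][1] / sum(uni$frequency)
}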