An n-gram is an expression comprising n consecutive words. For example, "thank you" is a two-word n-gram (a bigram), and "thank you very much" is a four-word n-gram.
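As a quick illustration, the following sketch uses the RWeka NGramTokenizer (the same tokenizer applied to the full corpus later in this report) to list the two-word n-grams of a short phrase:
library(RWeka)
# List the two-word n-grams (bigrams) of a short phrase
NGramTokenizer("thank you very much", Weka_control(min = 2, max = 2))
# should yield: "thank you" "you very" "very much"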
In this project I shall investigate the data contained in three very large .txt files, comprising blog posts, news articles, and Twitter messages. The purpose of this investigation is to find, from a sample of this data, the most common one-word to five-word n-grams, with a view to ultimately building a predictive text application.
The source data comprises a .zip file hosted by Coursera. The following block of R code downloads and unzips the .zip file, and reads the content of the English-language news articles, blog posts, and Twitter messages into three correspondingly named R datasets.
# Load the packages used throughout this report
library(stringi)   # text statistics
library(tm)        # corpus handling and cleaning
library(RWeka)     # n-gram tokenization
library(ggplot2)   # plotting
# Download and unzip .zip file from Coursera website
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "zipfile.zip", mode = "wb")  # binary mode is safest for .zip downloads on Windows
unzip("zipfile.zip", files = NULL, list = FALSE, overwrite = TRUE, junkpaths = FALSE, exdir = ".",
unzip = "internal", setTimes = FALSE)
# Copy English language .txt files into working directory
file.copy("./final/en_US/en_US.blogs.txt", "./en_US.blogs.txt")
## [1] TRUE
file.copy("./final/en_US/en_US.news.txt", "./en_US.news.txt")
## [1] TRUE
file.copy("./final/en_US/en_US.twitter.txt", "./en_US.twitter.txt")
## [1] TRUE
# Delete unused files
unlink("./final", recursive = TRUE)
unlink("zipfile.zip")
# Read English language .txt files into R
blogs <- readLines("./en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
The following block of R code summarises the three datasets in terms of their numbers of lines, words, and characters, and the minimum, average, and maximum number of words in each line in each dataset. This will inform the decision as to how large a sample should be taken from each file to build the corpus of text from which the most common one-word to five-word n-grams shall be found.
WordsPerLine <- sapply(list(blogs, news, twitter),
                       function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(WordsPerLine) <- c('Min wpl', 'Ave wpl', 'Max wpl')
summ <- data.frame(Dataset = c("blogs", "news", "twitter"),
                   t(rbind(sapply(list(blogs, news, twitter),
                                  stri_stats_general)[c('Lines', 'Chars'), ],
                           Words = sapply(list(blogs, news, twitter),
                                          stri_stats_latex)['Words', ],
                           WordsPerLine)))
head(summ)
## Dataset Lines Chars Words Min.wpl Ave.wpl Max.wpl
## 1 blogs 899288 206824382 37570839 0 41.75107 6726
## 2 news 77259 15639408 2651432 1 34.61779 1123
## 3 twitter 2360148 162096241 30451170 1 12.75065 47
samplesize <- 0.025
There are 70.67 million words across the three datasets. A 2.5% sample would still be quite large, at around 1.77 million words, before the data is cleaned up to remove punctuation marks, numbers, non-English characters, and excess spaces.
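As a rough check of that arithmetic, the following sketch (assuming the summ data frame and samplesize value defined above) computes the total word count and the expected sample size, in millions of words:
# Total words across the three datasets, in millions
sum(summ$Words) / 1e6
# Approximate number of words in a 2.5% sample, in millions
sum(summ$Words) * samplesize / 1e6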
The next step is to remove non-English characters from the three datasets, and then to merge a 2.5% sample of each into a single sample dataset. The sample dataset shall then be cleaned further: converted to lower case and stripped of punctuation marks, numbers, and excess spaces, before being converted to plain text.
The resulting text shall be the corpus from which the most common one-word to five-word n-grams shall be found.
# Remove redundant data
rm(WordsPerLine)
# Remove non-English characters
blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")
# Take a sample from each dataset and merge it into a single sample dataset
set.seed(20190702)
data_sample <- c(sample(blogs, round(length(blogs) * samplesize)),
                 sample(news, round(length(news) * samplesize)),
                 sample(twitter, round(length(twitter) * samplesize)))
# Remove redundant data
rm(blogs)
rm(news)
rm(twitter)
# Convert sample dataset into a corpus and then clean
corpus <- VCorpus(VectorSource(data_sample))
# Convert all text to lower case
corpus <- tm_map(corpus, tolower)
# Remove punctuation marks
corpus <- tm_map(corpus, removePunctuation)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove excess spaces
corpus <- tm_map(corpus, stripWhitespace)
# Convert to plain text
corpus <- tm_map(corpus, PlainTextDocument)
# Remove redundant data
rm(data_sample)
The next block of R code shall tokenize the corpus, breaking it into one-word to five-word n-grams, and shall build a term-document matrix for each n-gram length.
# Tokenize corpus into one-word to five-word n-grams
tokenizer1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tokenizer2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tokenizer3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tokenizer4 <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
tokenizer5 <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))
# Create matrices of one-word to five-word n-grams
TDM1 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer1))
TDM2 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer2))
TDM3 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer3))
TDM4 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer4))
TDM5 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer5))
# Find unigrams that occur 50 or more times
freq1 <- findFreqTerms(TDM1, lowfreq = 50)
# Count the number of times each unigram appears and list them in decreasing order
unigramFreqDF <- rowSums(as.matrix(TDM1[freq1,]))
unigramFreqDF <- unigramFreqDF[order(unigramFreqDF, decreasing = TRUE)]
# Convert to a data frame of unigrams and their frequencies
unigramFreqDF <- data.frame(word = names(unigramFreqDF), frequency = unigramFreqDF)
# Count unique unigrams occurring at least 50 times
unique_words <- nrow(unigramFreqDF)
# Create a table of the top 50 unigrams
top50_freq1 <- as.data.frame(unigramFreqDF[1:50,])
# Save unigrams file
saveRDS(unigramFreqDF, file = "unigrams.rds")
# Remove redundant data
rm(freq1)
# Find bigrams that occur 50 or more times
freq2 <- findFreqTerms(TDM2, lowfreq = 50)
# Count the number of times each bigram appears and list them in decreasing order
bigramFreqDF <- rowSums(as.matrix(TDM2[freq2,]))
bigramFreqDF <- bigramFreqDF[order(bigramFreqDF, decreasing = TRUE)]
# Convert to a data frame of bigrams and their frequencies
bigramFreqDF <- data.frame(words = names(bigramFreqDF), frequency = bigramFreqDF)
# Count unique bigrams occurring at least 50 times
unique_bigrams <- nrow(bigramFreqDF)
# Create a table of the top 50 bigrams
top50_freq2 <- as.data.frame(bigramFreqDF[1:50,])
# Save bigrams file
saveRDS(bigramFreqDF, file = "bigrams.rds")
# Remove redundant data
rm(freq2)
# Find trigrams that occur 50 or more times
freq3 <- findFreqTerms(TDM3, lowfreq = 50)
# Count the number of times each trigram appears and list them in decreasing order
trigramFreqDF <- rowSums(as.matrix(TDM3[freq3,]))
trigramFreqDF <- trigramFreqDF[order(trigramFreqDF, decreasing = TRUE)]
# Convert to a data frame of trigrams and their frequencies
trigramFreqDF <- data.frame(words = names(trigramFreqDF), frequency = trigramFreqDF)
# Count unique trigrams occurring at least 50 times
unique_trigrams <- nrow(trigramFreqDF)
# Create a table of the top 50 trigrams
top50_freq3 <- as.data.frame(trigramFreqDF[1:50,])
# Save trigrams file
saveRDS(trigramFreqDF, file = "trigrams.rds")
# Remove redundant data
rm(freq3)
# Find quadgrams that occur 10 or more times
freq4 <- findFreqTerms(TDM4, lowfreq = 10)
# Count the number of times each quadgram appears and list them in decreasing order
quadgramFreqDF <- rowSums(as.matrix(TDM4[freq4,]))
quadgramFreqDF <- quadgramFreqDF[order(quadgramFreqDF, decreasing = TRUE)]
# Convert to a data frame of quadgrams and their frequencies
quadgramFreqDF <- data.frame(words = names(quadgramFreqDF), frequency = quadgramFreqDF)
# Count unique quadgrams occurring at least 10 times
unique_quadgrams <- nrow(quadgramFreqDF)
# Create a table of the top 50 quadgrams
top50_freq4 <- as.data.frame(quadgramFreqDF[1:50,])
# Save quadgrams file
saveRDS(quadgramFreqDF, file = "quadgrams.rds")
# Remove redundant data
rm(freq4)
# Find quingrams that occur 10 or more times
freq5 <- findFreqTerms(TDM5, lowfreq = 10)
# Count the number of times each quingram appears and list them in decreasing order
quingramFreqDF <- rowSums(as.matrix(TDM5[freq5,]))
quingramFreqDF <- quingramFreqDF[order(quingramFreqDF, decreasing = TRUE)]
# Convert to a data frame of quingrams and their frequencies
quingramFreqDF <- data.frame(words = names(quingramFreqDF), frequency = quingramFreqDF)
# Count unique quingrams occurring at least 10 times
unique_quingrams <- nrow(quingramFreqDF)
# Create a table of the top 50 quingrams
top50_freq5 <- as.data.frame(quingramFreqDF[1:50,])
# Save quingrams file
saveRDS(quingramFreqDF, file = "quingrams.rds")
# Remove redundant data
rm(freq5)
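The five blocks above repeat the same pattern, varying only the n-gram length and the frequency cut-off. As a possible consolidation, the pattern could be wrapped in a single helper function; the sketch below is illustrative only, and the name ngram_freq is not part of the code above.
# Sketch of a helper covering the repeated pattern above
ngram_freq <- function(corpus, n, lowfreq) {
  tok <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tok))
  freq <- rowSums(as.matrix(tdm[findFreqTerms(tdm, lowfreq = lowfreq), ]))
  freq <- sort(freq, decreasing = TRUE)
  data.frame(words = names(freq), frequency = freq)
}
# For example, the bigram table could then be rebuilt with:
# bigramFreqDF <- ngram_freq(corpus, n = 2, lowfreq = 50)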
There are 2,934 unique words with at least 50 occurrences in the 1.77 million word sample.
The following graph shows the 50 most common single words (unigrams) in the sample.
ggplot(data = top50_freq1, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "green", colour = "black") +
ggtitle(paste("Top 50 Unigrams")) +
xlab("Unigrams") +
ylab("Frequency") +
guides(fill = FALSE) +
theme(axis.text.x = element_text(angle = 90))
There are 2,789 unique bigrams with at least 50 occurrences in the 1.77 million word sample.
The following graph shows the 50 most common bigrams in the sample.
ggplot(data = top50_freq2, aes(x = reorder(words, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "red", colour = "black") +
ggtitle(paste("Top 50 Bigrams")) +
xlab("Bigrams") +
ylab("Frequency") +
guides(fill = FALSE) +
theme(axis.text.x = element_text(angle = 90))
There are 396 unique trigrams with at least 50 occurrences in the 1.77 million word sample.
The following graph shows the 50 most common trigrams in the sample.
ggplot(data = top50_freq3, aes(x = reorder(words, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "blue", colour = "black") +
ggtitle(paste("Top 50 Trigrams")) +
xlab("Trigrams") +
ylab("Frequency") +
guides(fill = FALSE) +
theme(axis.text.x = element_text(angle = 90))
There are 759 unique quadgrams with at least 10 occurrences in the 1.77 million word sample.
The following graph shows the 50 most common quadgrams in the sample.
ggplot(data = top50_freq4, aes(x = reorder(words, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "orange", colour = "black") +
ggtitle(paste("Top 50 Quadgrams")) +
xlab("Quadgrams") +
ylab("Frequency") +
guides(fill = FALSE) +
theme(axis.text.x = element_text(angle = 90))
There are 72 unique quingrams with at least 10 occurrences in the 1.77 million word sample.
The following graph shows the 50 most common quingrams in the sample.
ggplot(data = top50_freq5, aes(x = reorder(words, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "purple", colour = "black") +
ggtitle(paste("Top 50 Quingrams")) +
xlab("Quingrams") +
ylab("Frequency") +
guides(fill = FALSE) +
theme(axis.text.x = element_text(angle = 90))
The graphs displayed above show the most common one-word to five-word n-grams found in the 1.77 million word sample.
These n-gram frequency tables shall be used in the next stage of the project to develop a predictive model and, ultimately, a predictive text application.
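As a hint of how those tables might feed a predictive model, the following sketch (illustrative only; the function name predict_next is hypothetical) looks up the most frequent bigram beginning with a given word and returns its second word:
# Suggest the most likely next word using the saved bigram table
bigrams <- readRDS("bigrams.rds")
predict_next <- function(word) {
  matches <- bigrams[grepl(paste0("^", word, " "), bigrams$words), ]
  if (nrow(matches) == 0) return(NA_character_)
  # bigrams.rds is already sorted by decreasing frequency
  sub("^\\S+ ", "", as.character(matches$words[1]))
}
# predict_next("thank")   # most likely "you"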
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tm_0.7-6 NLP_0.2-0 stringi_1.4.3 RWeka_0.4-40 raster_2.9-5
## [6] sp_1.3-1 rJava_0.9-11 ngram_3.0.4 ggplot2_3.1.1 dplyr_0.8.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 pillar_1.4.1 compiler_3.6.0
## [4] plyr_1.8.4 tools_3.6.0 RWekajars_3.9.3-1
## [7] digest_0.6.19 evaluate_0.14 tibble_2.1.3
## [10] gtable_0.3.0 lattice_0.20-38 pkgconfig_2.0.2
## [13] rlang_0.3.4 parallel_3.6.0 yaml_2.2.0
## [16] xfun_0.7 xml2_1.2.0 withr_2.1.2
## [19] stringr_1.4.0 knitr_1.23 grid_3.6.0
## [22] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0
## [25] rmarkdown_1.13 purrr_0.3.2 magrittr_1.5
## [28] scales_1.0.0 codetools_0.2-16 htmltools_0.3.6
## [31] assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
## [34] lazyeval_0.2.2 munsell_0.5.0 slam_0.1-45
## [37] crayon_1.3.4