Synopsis

An n-gram is an expression comprising n consecutive words. For example, "thank you" is a two-word n-gram (a bigram) and "thank you very much" is a four-word n-gram (a quadgram).

In this project I shall investigate the data contained in three very large .txt files, comprising blog posts, news articles, and Twitter messages. The purpose of this investigation is to find, from a sample of this data, the most common one-word to five-word n-grams, with a view to ultimately building a predictive text application.

Download and unzip data

The source data comprises a .zip file hosted by Coursera. The following block of R code downloads and unzips the .zip file, and reads the content of the English language news articles, blog posts, and Twitter messages into three correspondingly-named R datasets.
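
The code in this report assumes that the following packages, all of which appear in the session information at the end of this report, have already been installed and loaded.

# Load required packages
library(stringi)   # word and character counts (stri_count_words, stri_stats_general, stri_stats_latex)
library(tm)        # corpus handling and cleaning (VCorpus, tm_map, TermDocumentMatrix, findFreqTerms)
library(RWeka)     # n-gram tokenization (NGramTokenizer, Weka_control)
library(ggplot2)   # plots of the top 50 n-grams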

# Download and unzip .zip file from Coursera website

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "zipfile.zip", mode = "wb")   # binary mode avoids corrupting the .zip on Windows
unzip("zipfile.zip", exdir = ".")                # remaining arguments left at their defaults

# Copy English language .txt files into working directory
file.copy("./final/en_US/en_US.blogs.txt", "./en_US.blogs.txt")
## [1] TRUE
file.copy("./final/en_US/en_US.news.txt", "./en_US.news.txt")
## [1] TRUE
file.copy("./final/en_US/en_US.twitter.txt", "./en_US.twitter.txt")
## [1] TRUE
# Delete unused files
unlink("./final", recursive = TRUE)
unlink("zipfile.zip")

# Read English language .txt files into R

blogs <- readLines("./en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Preliminary exploration

The following block of R code summarises the three datasets in terms of their numbers of lines, words, and characters, and the minimum, average, and maximum number of words in each line in each dataset. This will inform the decision as to how large a sample should be taken from each file to build the corpus of text from which the most common one-word to five-word n-grams shall be found.

WordsPerLine <- sapply(list(blogs, news, twitter), function(x) 
        summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(WordsPerLine) <- c('Min wpl','Ave wpl','Max wpl')
summ <- data.frame(Dataset = c("blogs", "news", "twitter"),
                   t(rbind(sapply(list(blogs, news, twitter),
                                  stri_stats_general)[c('Lines', 'Chars'), ],
                           Words = sapply(list(blogs, news, twitter),
                                          stri_stats_latex)['Words', ],
                           WordsPerLine)))
head(summ)
##   Dataset   Lines     Chars    Words Min.wpl  Ave.wpl Max.wpl
## 1   blogs  899288 206824382 37570839       0 41.75107    6726
## 2    news   77259  15639408  2651432       1 34.61779    1123
## 3 twitter 2360148 162096241 30451170       1 12.75065      47
samplesize <- 0.025

There are 70.67 million words across the three datasets. A 2.5% sample would still be quite large, at around 1.77 million words, before the data is cleaned up to remove punctuation marks, numbers, non-English characters, and excess spaces.
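
As a quick check on those figures, the totals can be computed directly from the summary table built above (a sketch only; `total_words` and `sample_words` are illustrative names that are not used elsewhere in the analysis).

# Total words across the three datasets and expected size of a 2.5% sample
total_words <- sum(summ$Words)                    # roughly 70.67 million words
sample_words <- round(total_words * samplesize)   # roughly 1.77 million words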

Clean data

The next step is to remove non-English characters from the three datasets before merging them into a single sample dataset comprising 2.5% of each of the three datasets. The sample dataset shall then be cleaned further: converted to lower case, stripped of punctuation marks, numbers, and excess spaces, and finally converted to plain text.

The remaining text shall be the corpus from which the most common one-word to five-word n-grams shall be found.

# Remove redundant data

rm(WordsPerLine)

# Remove non-English characters

blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")

# Take a sample from each dataset and merge it into a single sample dataset

set.seed(20190702)
data_sample <- c(sample(blogs, length(blogs) * samplesize), 
                 sample(news, length(news) * samplesize), 
                 sample(twitter, length(twitter) * samplesize)
                 )

# Remove redundant data

rm(blogs)
rm(news)
rm(twitter)

# Convert sample dataset into a corpus and then clean

corpus <- VCorpus(VectorSource(data_sample))

# Convert all text to lower case
        corpus <- tm_map(corpus, tolower)
# Remove punctuation marks
        corpus <- tm_map(corpus, removePunctuation)
# Remove numbers
        corpus <- tm_map(corpus, removeNumbers)
# Remove excess spaces
        corpus <- tm_map(corpus, stripWhitespace)
# Convert to plain text
        corpus <- tm_map(corpus, PlainTextDocument)
# Remove redundant data
        rm(data_sample)

Calculate frequencies of n-grams

The next block of R code shall tokenize the corpus, that is, break it into one-word to five-word n-grams, count how often each n-gram occurs, and save the resulting frequency tables to .rds files.

# Tokenize corpus into one-word to five-word n-grams

tokenizer1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tokenizer2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tokenizer3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tokenizer4 <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
tokenizer5 <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))

# Create term-document matrices of one-word to five-word n-grams 

TDM1 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer1))
TDM2 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer2))
TDM3 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer3))
TDM4 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer4))
TDM5 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer5))
# Find unigrams that occur 50 or more times
        freq1 <- findFreqTerms(TDM1, lowfreq = 50)
# Count the number of times each unigram appears and list them in decreasing order
        unigramFreqDF <- rowSums(as.matrix(TDM1[freq1,]))
        unigramFreqDF <- unigramFreqDF[order(unigramFreqDF, decreasing = TRUE)]
# Add column names
        unigramFreqDF <- data.frame(word = names(unigramFreqDF), frequency = unigramFreqDF)
# Count unique words
        unique_words <- nrow(unigramFreqDF)
# Create a table of the top 50 unigrams
        top50_freq1 <- as.data.frame(unigramFreqDF[1:50,])
# Save unigrams file
        saveRDS(unigramFreqDF, file = "unigrams.rds")
# Remove redundant data
        rm(freq1)
# Find bigrams that occur 50 or more times
        freq2 <- findFreqTerms(TDM2, lowfreq = 50)
# Count the number of times each bigram appears and list them in decreasing order
        bigramFreqDF <- rowSums(as.matrix(TDM2[freq2,]))
        bigramFreqDF <- bigramFreqDF[order(bigramFreqDF, decreasing = TRUE)]
# Add column names
        bigramFreqDF <- data.frame(words = names(bigramFreqDF), frequency = bigramFreqDF)
# Count unique bigrams
        unique_bigrams <- nrow(bigramFreqDF)
# Create a table of the top 50 bigrams
        top50_freq2 <- as.data.frame(bigramFreqDF[1:50,])
# Save bigrams file
        saveRDS(bigramFreqDF, file = "bigrams.rds")
# Remove redundant data
        rm(freq2)
# Find trigrams that occur 50 or more times
        freq3 <- findFreqTerms(TDM3, lowfreq = 50)
# Count the number of times each trigram appears and list them in decreasing order
        trigramFreqDF <- rowSums(as.matrix(TDM3[freq3,]))
        trigramFreqDF <- trigramFreqDF[order(trigramFreqDF, decreasing = TRUE)]
# Add column names
        trigramFreqDF <- data.frame(words = names(trigramFreqDF), frequency = trigramFreqDF)
# Count unique trigrams
        unique_trigrams <- nrow(trigramFreqDF)
# Create a table of the top 50 trigrams
        top50_freq3 <- as.data.frame(trigramFreqDF[1:50,])
# Save trigrams file
        saveRDS(trigramFreqDF, file = "trigrams.rds")
# Remove redundant data
        rm(freq3)
# Find quadgrams that occur 10 or more times
        freq4 <- findFreqTerms(TDM4, lowfreq = 10)
# Count the number of times each quadgram appears and list them in decreasing order
        quadgramFreqDF <- rowSums(as.matrix(TDM4[freq4,]))
        quadgramFreqDF <- quadgramFreqDF[order(quadgramFreqDF, decreasing = TRUE)]
# Add column names
        quadgramFreqDF <- data.frame(words = names(quadgramFreqDF), frequency = quadgramFreqDF)
# Count unique quadgrams
        unique_quadgrams <- nrow(quadgramFreqDF)
# Create a table of the top 50 quadgrams
        top50_freq4 <- as.data.frame(quadgramFreqDF[1:50,])
# Save quadgrams file
        saveRDS(quadgramFreqDF, file = "quadgrams.rds")
# Remove redundant data
        rm(freq4)
# Find quingrams that occur 10 or more times
        freq5 <- findFreqTerms(TDM5, lowfreq = 10)
# Count the number of times each quingram appears and list them in decreasing order
        quingramFreqDF <- rowSums(as.matrix(TDM5[freq5,]))
        quingramFreqDF <- quingramFreqDF[order(quingramFreqDF, decreasing = TRUE)]
# Add column names
        quingramFreqDF <- data.frame(words = names(quingramFreqDF), frequency = quingramFreqDF)
# Count unique quingrams
        unique_quingrams <- nrow(quingramFreqDF)
# Create a table of the top 50 quingrams
        top50_freq5 <- as.data.frame(quingramFreqDF[1:50,])
# Save quingrams file
        saveRDS(quingramFreqDF, file = "quingrams.rds")
# Remove redundant data
        rm(freq5)
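
The five blocks above are nearly identical, differing only in the term-document matrix and the frequency cut-off. As an aside, the repeated steps could be wrapped in a single helper function; the sketch below is illustrative only (the name get_ngram_freq does not appear in the original code) and reproduces the same logic.

# Illustrative helper: build a frequency table from an n-gram term-document matrix
get_ngram_freq <- function(tdm, lowfreq) {
        terms <- findFreqTerms(tdm, lowfreq = lowfreq)
        freqs <- rowSums(as.matrix(tdm[terms, ]))
        freqs <- freqs[order(freqs, decreasing = TRUE)]
        data.frame(words = names(freqs), frequency = freqs)
}

# For example, the trigram table above could also be produced with:
# trigramFreqDF <- get_ngram_freq(TDM3, lowfreq = 50)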

Top 50 unigrams

There are 2,934 unique words with at least 50 occurrences in the 1.77-million-word sample.

The following graph shows the 50 most common single words (unigrams) in the sample.

ggplot(data = top50_freq1, aes(x = reorder(word, -frequency), y = frequency)) + 
  geom_bar(stat = "identity", fill = "green", colour = "black") + 
  ggtitle(paste("Top 50 Unigrams")) + 
  xlab("Unigrams") + 
  ylab("Frequency") + 
  guides(fill = FALSE) + 
  theme(axis.text.x = element_text(angle = 90))

Top 50 bigrams

There are 2,789 unique bigrams with at least 50 occurrences in the 1.77-million-word sample.

The following graph shows the 50 most common bigrams in the sample.

ggplot(data = top50_freq2, aes(x = reorder(words, -frequency), y = frequency)) + 
  geom_bar(stat = "identity", fill = "red", colour = "black") + 
  ggtitle(paste("Top 50 Bigrams")) + 
  xlab("Bigrams") + 
  ylab("Frequency") + 
  guides(fill = FALSE) + 
  theme(axis.text.x = element_text(angle = 90))

Top 50 trigrams

There are 396 unique trigrams with at least 50 occurrences in the 1.77-million-word sample.

The following graph shows the 50 most common trigrams in the sample.

ggplot(data = top50_freq3, aes(x = reorder(words, -frequency), y = frequency)) + 
  geom_bar(stat = "identity", fill = "blue", colour = "black") + 
  ggtitle(paste("Top 50 Trigrams")) + 
  xlab("Trigrams") + 
  ylab("Frequency") + 
  guides(fill = FALSE) + 
  theme(axis.text.x = element_text(angle = 90))

Top 50 quadgrams

There are 759 unique quadgrams with at least 10 occurrences in the 1.77-million-word sample.

The following graph shows the 50 most common quadgrams in the sample.

ggplot(data = top50_freq4, aes(x = reorder(words, -frequency), y = frequency)) + 
  geom_bar(stat = "identity", fill = "orange", colour = "black") + 
  ggtitle(paste("Top 50 Quadgrams")) + 
  xlab("Quadgrams") + 
  ylab("Frequency") + 
  guides(fill = FALSE) + 
  theme(axis.text.x = element_text(angle = 90))

Top 50 quingrams

There are 72 unique quingrams with at least 10 occurrences in the 1.77-million-word sample.

The following graph shows the 50 most common quingrams in the sample.

ggplot(data = top50_freq5, aes(x = reorder(words, -frequency), y = frequency)) + 
  geom_bar(stat = "identity", fill = "purple", colour = "black") + 
  ggtitle(paste("Top 50 Quingrams")) + 
  xlab("Quingrams") + 
  ylab("Frequency") + 
  guides(fill = FALSE) + 
  theme(axis.text.x = element_text(angle = 90))

Conclusion

The graphs displayed above show the most common one-word to five-word n-grams found in the 1.77-million-word sample.

These n-gram frequency tables shall be used in the next phase of the project to develop a predictive model and, ultimately, a predictive text application.
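
The frequency tables saved as .rds files above can be reloaded directly in that later work; for example (a sketch only, with illustrative object names):

# Reload the saved n-gram frequency tables for use in the predictive model
bigrams  <- readRDS("bigrams.rds")
trigrams <- readRDS("trigrams.rds")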

Software versions used

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tm_0.7-6      NLP_0.2-0     stringi_1.4.3 RWeka_0.4-40  raster_2.9-5 
##  [6] sp_1.3-1      rJava_0.9-11  ngram_3.0.4   ggplot2_3.1.1 dplyr_0.8.1  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1        pillar_1.4.1      compiler_3.6.0   
##  [4] plyr_1.8.4        tools_3.6.0       RWekajars_3.9.3-1
##  [7] digest_0.6.19     evaluate_0.14     tibble_2.1.3     
## [10] gtable_0.3.0      lattice_0.20-38   pkgconfig_2.0.2  
## [13] rlang_0.3.4       parallel_3.6.0    yaml_2.2.0       
## [16] xfun_0.7          xml2_1.2.0        withr_2.1.2      
## [19] stringr_1.4.0     knitr_1.23        grid_3.6.0       
## [22] tidyselect_0.2.5  glue_1.3.1        R6_2.4.0         
## [25] rmarkdown_1.13    purrr_0.3.2       magrittr_1.5     
## [28] scales_1.0.0      codetools_0.2-16  htmltools_0.3.6  
## [31] assertthat_0.2.1  colorspace_1.4-1  labeling_0.3     
## [34] lazyeval_0.2.2    munsell_0.5.0     slam_0.1-45      
## [37] crayon_1.3.4