The goal of this exercise is to build a simple model of the relationships between words collected from Twitter, news feeds, and blogs. This document is the first step in building a predictive text-mining application: it explores the feasibility of a simple Natural Language Processing (NLP) model that supports next-word prediction. The data were provided by SwiftKey.
The data were downloaded as a zip archive, then segmented and prepared for exploratory data analysis. Basic statistics are presented along with graphs of the most frequent word groups, or n-grams. The data are split into training, validation, and test sets so that the accuracy of the prediction algorithm can be assessed. Memory usage and the time required to build the models are also reported.
The following R functions are useful for measuring the memory footprint and running time of the model-building steps and of the predictive algorithm; a brief usage sketch follows the list.
- object.size(): reports the number of bytes an R object occupies in memory.
- Rprof(): the sampling profiler, used to find where bottlenecks in a function may exist. The profr package (available on CRAN) provides additional tools for visualizing and summarizing profiling data.
- gc(): triggers garbage collection to reclaim unused RAM and, in the process, reports how much memory R is currently using.
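A minimal sketch of how these functions might be wired into the workflow; the object and file names below are placeholders rather than code from the final application.

```r
# Report how much memory a (placeholder) character vector occupies
sampleText <- rep("the quick brown fox jumps over the lazy dog", 5e5)
print(object.size(sampleText), units = "MB")

# Profile a representative step and summarize where the time is spent
Rprof("profile.out")
wordFreq <- sort(table(unlist(strsplit(sampleText, " "))), decreasing = TRUE)
Rprof(NULL)
head(summaryRprof("profile.out")$by.self)

# Reclaim unused memory and report how much R is currently using
gc()
```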
## Unzip the data archive if it has not already been extracted
if(!file.exists("./projectData/final")){
unzip(zipfile="./projectData/Coursera-SwiftKey.zip",exdir="./projectData")
}
# The raw files are large, so wrap the reading step in a small helper that opens
# each file with an explicit UTF-8 connection, reads its lines, and closes the connection
file_head <- function(name, open = ""){
connection <- file(name, open = open, encoding = "UTF-8")
result <- readLines(connection, warn=FALSE, skipNul=TRUE)
close(connection)
result
}
blogs <- file_head("./projectData/final/en_US/en_US.blogs.txt")
news <- file_head("./projectData/final/en_US/en_US.news.txt", "rb")
tweets <- file_head("./projectData/final/en_US/en_US.twitter.txt")
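The file-level summary presented in the table further below can be generated with something along these lines; this is only a sketch using stringi (which the session info shows attached), not necessarily the exact code that produced the table.

```r
library(stringi)

# Line, character, and word counts for one corpus
file_stats <- function(lines) {
  c(stri_stats_general(lines), TotalWords = sum(stri_count_words(lines)))
}

rbind(blogs  = file_stats(blogs),
      news   = file_stats(news),
      tweets = file_stats(tweets))
```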
Retrieve a 10% random sample of the tweets, news, and blogs data. The three samples are combined, randomly shuffled, and then split 80/10/10 to produce the training, validation, and test sets that will be used to build and evaluate the model.
set.seed(2020)
# Sample 10% of each source, then shuffle the combined sample
dataSample <- c(sample(tweets, length(tweets) * 0.1),
                sample(news, length(news) * 0.1),
                sample(blogs, length(blogs) * 0.1))
dataSample <- sample(dataSample)
# Cut points for an 80% / 10% / 10% split
valIndex <- floor(length(dataSample) * 0.8)
testIndex <- floor(length(dataSample) * 0.9)
train <- dataSample[1:valIndex]
valid <- dataSample[(valIndex+1):testIndex]
test <- dataSample[(testIndex+1):length(dataSample)]
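# Illustrative sanity check (not part of the original analysis): the split
# should come out to roughly 80% train, 10% validation, 10% test
round(lengths(list(train = train, valid = valid, test = test)) / length(dataSample), 2)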
genTokens <- function(lines) {
  lines <- tolower(lines)
  # Normalize curly apostrophes to straight ones
  lines <- gsub("\u2019", "'", lines)
  # Mark sentence boundaries with a ''split'' token so n-grams never span sentences
  lines <- gsub("[.!?]$|[.!?] |$", " ''split'' ", lines)
  # Split on anything that is not a letter or apostrophe and drop empty tokens
  tokens <- unlist(strsplit(lines, "[^a-z']"))
  tokens <- tokens[tokens != ""]
  return(tokens)
}
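# Hypothetical quick check of the tokenizer on a made-up sentence: the text is
# lower-cased, sentence-ending punctuation becomes the ''split'' marker, and the
# result is split on anything that is not a letter or an apostrophe
genTokens("Hello there! How's it going?")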
trainTokens <- genTokens(train)
validTokens <- genTokens(valid)
testTokens <- genTokens(test)
# Offset copies of the token stream: tokens2[i] is the word after trainTokens[i],
# tokens3[i] the word after that, and so on; "." pads the tail
tokens2 <- c(trainTokens[-1], ".")
tokens3 <- c(tokens2[-1], ".")
tokens4 <- c(tokens3[-1], ".")
tokens5 <- c(tokens4[-1], ".")
tokens6 <- c(tokens5[-1], ".")
# Pasting the aligned vectors yields every n-gram in the training text
unigrams <- trainTokens
bigrams <- paste(trainTokens, tokens2)
trigrams <- paste(trainTokens, tokens2, tokens3)
quadgrams <- paste(trainTokens, tokens2, tokens3, tokens4)
fivegrams <- paste(trainTokens, tokens2, tokens3, tokens4, tokens5)
sixgrams <- paste(trainTokens, tokens2, tokens3, tokens4, tokens5, tokens6)
# Drop any n-gram that spans a sentence boundary (i.e. contains the ''split'' marker)
unigrams <- unigrams[!grepl("''split''", unigrams)]
bigrams <- bigrams[!grepl("''split''", bigrams)]
trigrams <- trigrams[!grepl("''split''", trigrams)]
quadgrams <- quadgrams[!grepl("''split''", quadgrams)]
fivegrams <- fivegrams[!grepl("''split''", fivegrams)]
sixgrams <- sixgrams[!grepl("''split''", sixgrams)]
# Tabulate each set of n-grams and sort by frequency, most frequent first
unigrams <- sort(table(unigrams), decreasing=T)
bigrams <- sort(table(bigrams), decreasing=T)
trigrams <- sort(table(trigrams), decreasing=T)
quadgrams <- sort(table(quadgrams), decreasing=T)
fivegrams <- sort(table(fivegrams), decreasing=T)
sixgrams <- sort(table(sixgrams), decreasing=T)
# save(unigrams, bigrams, trigrams, quadgrams, fivegrams, sixgrams, file = "nGrams.RData")
# load("nGrams.RData")
plotme <- function()
{
# Multiple plot function
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL)
{
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row, layout.pos.col = matchidx$col))
}
}
}
# The 10 most frequent unigrams (single words) in the dataset
ugrams <- as.data.frame(unigrams)[1:10, ]
graph.data <- ugrams[order(ugrams$Freq, decreasing = T), ]
p1 <- ggplot(data=graph.data, aes(x=unigrams, y=Freq, fill=unigrams)) + geom_bar(stat="identity") + ggtitle("Top 10 Words") +
theme(axis.text.x = element_text(angle = 40, hjust = 1), plot.title = element_text(hjust = 0.5),legend.position='none')
# The 10 most frequent bigrams (two-word sequences) in the dataset
bgrams <- as.data.frame(bigrams)[1:10, ]
graph.data <- bgrams[order(bgrams$Freq, decreasing = T), ]
p2 <- ggplot(data=graph.data, aes(x=bigrams, y=Freq, fill=bigrams)) + geom_bar(stat="identity") + ggtitle("Top 10 Two Words Set") +
theme(axis.text.x = element_text(angle = 40, hjust = 1), plot.title = element_text(hjust = 0.5),legend.position='none')
# The 10 most frequent trigrams (three-word sequences) in the dataset
tgrams <- as.data.frame(trigrams)[1:10, ]
graph.data <- tgrams[order(tgrams$Freq, decreasing = T), ]
p3 <- ggplot(data=graph.data, aes(x=trigrams, y=Freq, fill=trigrams)) + geom_bar(stat="identity") + ggtitle("Top 10 Three Words Set") +
theme(axis.text.x = element_text(angle = 40, hjust = 1), plot.title = element_text(hjust = 0.5),legend.position='none')
# The 10 most frequent quadgrams (four-word sequences) in the dataset
qgrams <- as.data.frame(quadgrams)[1:10, ]
graph.data <- qgrams[order(qgrams$Freq, decreasing = T), ]
p4 <- ggplot(data=graph.data, aes(x=quadgrams, y=Freq, fill=quadgrams)) + geom_bar(stat="identity") + ggtitle("Top 10 Four Words Set") +
theme(axis.text.x = element_text(angle = 40, hjust = 1), plot.title = element_text(hjust = 0.5),legend.position='none')
multiplot(p1, p3, p2, p4, cols=2)
}
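The four panels are then rendered in a two-by-two grid by calling the function (ggplot2 must already be attached, as in the session info below):

```r
library(ggplot2)
plotme()
```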
| File | Lines | Non-empty lines | Characters | Non-whitespace characters | Total words |
|---|---|---|---|---|---|
| blogs | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
| news | 1010242 | 1010242 | 203791405 | 170428853 | 34678691 |
| tweets | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
## Time difference of 10.91456 mins
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gridExtra_2.3 patchwork_1.0.1 tm_0.7-7
## [4] NLP_0.2-0 devtools_2.3.0 usethis_1.6.1
## [7] dplyr_1.0.0 stringr_1.4.0 stringi_1.4.6
## [10] ggplot2_3.3.2 readtext_0.76 quanteda.corpora_0.91
## [13] quanteda_2.1.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 lattice_0.20-41 prettyunits_1.1.1 ps_1.3.3
## [5] assertthat_0.2.1 rprojroot_1.3-2 digest_0.6.25 slam_0.1-47
## [9] R6_2.4.1 backports_1.1.8 evaluate_0.14 highr_0.8
## [13] httr_1.4.1 pillar_1.4.6 rlang_0.4.7 data.table_1.12.8
## [17] callr_3.4.3 Matrix_1.2-18 rmarkdown_2.3 labeling_0.3
## [21] desc_1.2.0 munsell_0.5.0 compiler_4.0.2 xfun_0.15
## [25] pkgconfig_2.0.3 pkgbuild_1.0.8 htmltools_0.5.0 tidyselect_1.1.0
## [29] tibble_3.0.3 fansi_0.4.1 crayon_1.3.4 withr_2.2.0
## [33] gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5 scales_1.1.1
## [37] RcppParallel_5.0.2 cli_2.0.2 farver_2.0.3 fs_1.4.2
## [41] remotes_2.1.1 testthat_2.3.2 xml2_1.3.2 ellipsis_0.3.1
## [45] stopwords_2.0 generics_0.0.2 vctrs_0.3.1 fastmatch_1.1-0
## [49] tools_4.0.2 glue_1.4.1 purrr_0.3.4 processx_3.4.3
## [53] pkgload_1.1.0 parallel_4.0.2 yaml_2.2.1 colorspace_1.4-1
## [57] sessioninfo_1.1.1 memoise_1.1.0 knitr_1.29
In an effort to create a Natural Language Processing (NLP) model, I started by downloading the datasets provided by SwiftKey. The data are in US English and consist of text from blogs, tweets, and news feeds. The data were reduced to smaller samples to keep memory consumption manageable and were then tokenized to generate the n-grams needed for next-word prediction. Because performance matters for the final application, I set aside validation and test data so that the accuracy of the word-prediction algorithm can be properly assessed. Specifically, I aim to develop a reasonably accurate application that delivers seemingly instant predictions.
At this point, we are ready to move forward with developing the model and, finally, the application itself.