Introduction

This report is part of the Coursera Data Science Capstone project from Johns Hopkins University, run in partnership with their corporate partner for this capstone, SwiftKey, who build smart keyboards for mobile devices. The objective of the project is to create a text prediction algorithm that would be one of the cornerstones of such a keyboard.
Part of the analysis involves utilising some natural language processing (NLP) techniques. The first step in the process was to acquire the raw data and take a look at its structure and size to understand what we are working with.
For the next step of the investigation into the algorithm development, NLP functions such as corpus building, tokenisation, generation of a document-feature matrix (DFM) and building n-grams were used to explore the structure of the data set and hopefully point towards potential solutions for building the algorithm itself. Some of the early analysis components are described in more detail below.

Getting the data

The source data used in this Natural Language Processing (NLP) capstone project was retrieved in zip format and stored locally on the machine before being processed in R.

        ## To measure time taken:
                start.time <- Sys.time()
        
        ## Setup the working directory where the data is located
        #knitr::opts_knit$set(root.dir = "D:/Documents/Coursera/Assignments/Capstone/Revision")
                setwd("D:/Documents/Coursera/Assignments/Capstone/Revision")
                #getwd()
                
                ## Task Timestamp
                        end.time1 <- Sys.time()             
                
        ## Creates a data folder if one doesn't exist
                if (!file.exists("data")){
                        dir.create("data")
                }
                        
                ## Task Timestamp
                        end.time2 <- Sys.time()  
            
                
        ## Checks to see if data file exists, if not retrieves it from remote web location
                if (!file.exists("./data/Coursera-SwiftKey.zip")) {
                        fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
                        download.file(fileUrl, destfile = "./data/Coursera-SwiftKey.zip")
                        dateDownloaded <- date()
                        dateDownloaded
                        list.files("./data")
                }
                
                path <- file.path("./data/final" , "en_US")
                files<-list.files(path, recursive=TRUE)             
                
                ## Task Timestamp
                        end.time3 <- Sys.time()

Read the contents of the raw files into R:

On my first attempt at processing this data I used the TM package for natural language processing, but ran into memory issues on my machine. I tried reducing the sample size, but this still did not work as expected. I subsequently decided to use the Quanteda package, as it turned out to be more resource efficient. Because of this, I used readtext to load the data into R, as it works nicely with Quanteda.

        ## To measure time taken:
                start.time <- Sys.time()
        library(readtext)
        setwd("D:/Documents/Coursera/Assignments/Capstone/Revision")

        ## Use Quanteda companion package for loading texts: readtext.
        fileText_twitter <- readtext("./data/final/en_US/en_US.twitter.txt")
        fileText_blogs <- readtext("./data/final/en_US/en_US.blogs.txt")
        fileText_news <- readtext("./data/final/en_US/en_US.news.txt")
                ## Task Timestamp                
                        end.time3 <- Sys.time() 

                        

Some initial exploratory analysis:

After getting the data into R, it is useful to take a look at the material itself to see how it is made up. Firstly, I looked at some volume and count metrics to get an idea of how much data I would be working with. If the data set turns out to be large, some consideration may need to be given to how it is handled.
Throughout this report, generated objects are saved offline once they have been used and are then removed from R in order to free up memory (issues were initially occurring with memory resources, even after sampling a subset of the data).
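A minimal sketch of this save-then-remove pattern is shown below; the object name big_object is purely illustrative and not part of the actual analysis.

        ## Minimal sketch of the save-then-remove pattern (placeholder object name)
        big_object <- runif(1e6)                            # stand-in for a large intermediate result
        saveRDS(big_object, file = "big_object.Rds")        # persist the object to disk
        rm(big_object)                                      # drop it from the R workspace
        gc()                                                # return the freed memory to the OS
        #big_object <- readRDS(file = "big_object.Rds")     # reload it later only when needed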

        library(stringi)
        library(kableExtra)
                # To measure time taken
                        start.time <- Sys.time()
                        start.time  

        ## Check filesize
                blogs_filesize<-round(file.info("./data/final/en_US/en_US.blogs.txt")$size/(1024*1024))
                news_filesize<-round(file.info("./data/final/en_US/en_US.news.txt")$size/(1024*1024))
                twitter_filesize<-round(file.info("./data/final/en_US/en_US.twitter.txt")$size/(1024*1024))
        
                paste("Blogs Filesize",blogs_filesize,"MB")
                paste("News Filesize",news_filesize,"MB")
                paste("Twitter Filesize",twitter_filesize,"MB")                                
                ## Task Timestamp
                        end.time1 <- Sys.time()   

                        
        ## Check count words in files
                BlogsWords <- stri_count_words(fileText_blogs)
                NewsWords <- stri_count_words(fileText_news)
                TwitterWords <- stri_count_words(fileText_twitter)
        # Task Timestamp
                        end.time2 <- Sys.time()  
                        
         ## Check count characters in files
                characters_news<-nchar(fileText_news)
                characters_blogs<-nchar(fileText_blogs)                
                characters_twitter<-nchar(fileText_twitter)                       
                ## Task Timestamp
                        end.time3 <- Sys.time() 
                        
        ## Table of the data sets
                kable(data.frame(files = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
                           File_Size_MB = c(blogs_filesize, news_filesize, twitter_filesize),
                           Line_Count = c(length(fileText_blogs), length(fileText_news), length(fileText_twitter)),
                           Word_Count = c(sum(BlogsWords), sum(NewsWords), sum(TwitterWords)),
                           Mean_Word_Count = c(mean(BlogsWords), mean(NewsWords), mean(TwitterWords)),
                           Max_Word_Count = c(max(BlogsWords), max(NewsWords), max(TwitterWords)),
                           Max_characters_line=c(max(characters_blogs),max(characters_news),max(characters_twitter)))) %>%
                kable_styling()                       
                ## Task Timestamp
                        end.time4 <- Sys.time()
## [1] "2020-03-09 08:24:36 GMT"
## [1] "Blogs Filesize 200 MB"
## [1] "News Filesize 196 MB"
## [1] "Twitter Filesize 159 MB"
files              File_Size_MB  Line_Count  Word_Count  Mean_Word_Count  Max_Word_Count  Max_characters_line
en_US.blogs.txt             200           2    38154238         38154238        38154238            209260725
en_US.news.txt              196           2     2693898          2693898         2693898             15761023
en_US.twitter.txt           159           2    30218125         30218125        30218125            164744972

Constructing the Corpus:

The corpora were created from the data frames produced by readtext. Individual corpora as well as a consolidated version were generated to allow flexibility for later analysis. As mentioned earlier, the Quanteda package was used for this project as it was found to be the most efficient from a memory and processing perspective; the TM package is another common choice for NLP. Utilising corpora allows post-processing of the text bodies.

library(quanteda)
## Package version: 1.5.2
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
                ## To measure time taken:
                start.time <- Sys.time()
                
        corpus_Twitter <- corpus(fileText_twitter)
        #str(fileText_twitter)  
                ## Task Timestamp
                        end.time2 <- Sys.time()  
        
        corpus_Blog <- corpus(fileText_blogs)
                ## Task Timestamp
                        end.time3 <- Sys.time()
                        
        corpus_News <- corpus(fileText_news)
                ## Task Timestamp
                        end.time4 <- Sys.time()
 
        Total_Corpus <- corpus_Twitter+corpus_Blog+corpus_News
                ## Task Timestamp
                        end.time5 <- Sys.time()           

                ## head(docvars(corpus_Twitter))
        docnames(Total_Corpus) <- c("Twitter", "Blog", "News")
                #summary(Total_Corpus)
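
As a quick sanity check on the combined corpus (purely illustrative, not part of the original pipeline), Quanteda's accessor functions can be used to confirm its shape:

        ## Quick inspection of the combined corpus (illustrative only)
        ndoc(Total_Corpus)               # should report 3 documents: Twitter, Blog, News
        docnames(Total_Corpus)           # the document names assigned above
        summary(Total_Corpus, n = 3)     # per-document counts of types, tokens and sentences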

Tokenise the data:

The data is tokenised by segmenting the texts within the corpus on word boundaries. At this point some cleaning of the data is carried out by removing symbols, punctuation and numbers so that we are left with just words. By referencing a list of profanity words, it was also possible to filter these out of the token list. The resulting tokens are stored in a list of vectors, which is more efficient than character strings while still preserving the positions of words. This facilitates positional analysis of the source text using functions such as textstat_collocations(), tokens_ngrams(), etc. Looking at n-grams will be particularly interesting for the text prediction algorithm because it provides information about the frequency of sequences of tokens in the already tokenised text, which can be used to predict the next word. An illustrative use of textstat_collocations() is shown after the tokenisation code below.

                ## To measure time taken:
                start.time <- Sys.time()

        Total_Tokens <- tokens(Total_Corpus, remove_numbers=TRUE, remove_punct=TRUE, remove_symbols=TRUE, remove_separators=TRUE, remove_twitter=TRUE, remove_hyphens=TRUE, remove_url=TRUE) 
        Total_Tokens <- tokens_tolower(Total_Tokens)
        Total_Tokens <- tokens_remove(Total_Tokens, pattern="^[^a-zA-Z]|[^a-zA-Z]$", valuetype="regex", padding=TRUE)

                ## Task Timestamp
                        end.time1 <- Sys.time()         
        profanity <- readLines("./data/Profanity/Profanity.txt")
## Warning in readLines("./data/Profanity/Profanity.txt"): incomplete final line
## found on './data/Profanity/Profanity.txt'
        #head(profanity)

        Total_Tokens_Clean <- tokens_remove(Total_Tokens, profanity, padding = TRUE)
                ## Task Timestamp
                        end.time2 <- Sys.time()  
                        
        #head(Total_Tokens_Clean)
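
As an illustration of the positional analysis mentioned above, textstat_collocations() can score frequent multi-word expressions directly from the cleaned tokens. This is only a sketch and is not part of the main pipeline (it can be slow and memory-hungry on the full token set):

        ## Sketch: surface frequent two-word expressions in the cleaned tokens (illustrative only)
        Collocations <- textstat_collocations(Total_Tokens_Clean, size = 2, min_count = 100)
        head(Collocations[order(-Collocations$count), ], 10)   # most frequent word pairs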

Create N-Grams:

In order to make predictions of what the next word might be when typing, it is necessary to explore n-grams. N-grams are simply sequences of words of length n. By investigating the frequency of n-grams, it may be possible to use this information as input for the text prediction model. N-grams are also useful because DFMs can be generated from them for further analysis. First some bi-grams were generated and, from these, Document-feature Matrices (DFMs) were created to carry out frequency analysis of those bi-grams. A sketch of how these counts might feed a prediction lookup follows the bi-gram output below.

                ## To measure time taken:
                start.time <- Sys.time()

        Bi_gram <- tokens_ngrams(Total_Tokens_Clean, n=2)
        
                        ## Task Timestamp
                                end.time1 <- Sys.time() 
                        ## Check memory
                                #sort( sapply(ls(),function(x){object.size(get(x))}))
                                memory.size() 
        
                        ## To measure time taken:
                        end.time2 <- Sys.time()
        
        Bi_gram_Df <- dfm(Bi_gram)
        
                        ## Check memory
                                #sort( sapply(ls(),function(x){object.size(get(x))}))
                                memory.size() 
        
                        ## To measure time taken:
                        end.time3 <- Sys.time()
        
        nfeat(Bi_gram_Df)
        
                        ## Check memory
                                #sort( sapply(ls(),function(x){object.size(get(x))}))
                                memory.size() 
        
                        ## To measure time taken:
                        end.time4 <- Sys.time()
        
        
        topfeatures(Bi_gram_Df, n=25)
## [1] 1790.57
## [1] 4711.28
## [1] 10814688
## [1] 4504.86
##   of_the   in_the  for_the   to_the   on_the    to_be   at_the   i_have 
##   258486   246653   137564   136227   129690   118884    89350    79814 
##  and_the    i_was     is_a     in_a    and_i     i_am   it_was    it_is 
##    77901    75737    74571    73001    72709    72149    70550    66794 
##    for_a with_the   if_you   have_a going_to   is_the  will_be   to_get 
##    65994    65922    63895    60664    60465    56239    55336    54032 
## from_the 
##    53029
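
To give an idea of how these bi-gram frequencies could eventually feed the prediction step (a sketch only, not the final model; on the full data set this is memory-hungry), the counts in the bi-gram DFM can be collapsed into a first-word / next-word frequency table:

        ## Sketch: turn bi-gram counts into a simple next-word frequency table (not the final model)
        Bi_gram_Freq <- colSums(Bi_gram_Df)                       # total count of each bi-gram feature
        Bi_gram_Tab <- data.frame(
                first  = sub("_.*$", "", names(Bi_gram_Freq)),    # word before the underscore
                second = sub("^.*_", "", names(Bi_gram_Freq)),    # word after the underscore
                count  = as.numeric(Bi_gram_Freq),
                stringsAsFactors = FALSE)

        ## Most frequent continuations of "of" according to the bi-gram counts
        of_rows <- Bi_gram_Tab[Bi_gram_Tab$first == "of", ]
        head(of_rows[order(-of_rows$count), ], 5)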

Then some tri-grams were generated and, from these, further DFMs were created to facilitate frequency analysis of those tri-grams.

                        ## To measure time taken:
                        start.time <- Sys.time()
        
        Tri_gram <- tokens_ngrams(Total_Tokens_Clean, n=3)
                ## To measure time taken:
                end.time1 <- Sys.time()     
                        ## Check memory
                                #sort( sapply(ls(),function(x){object.size(get(x))}))
                                memory.size() 
        

                 ## Save data offline        
                saveRDS(Total_Tokens_Clean, file = "Total_Tokens_Clean.Rds")
                #Total_Tokens_Clean<- readRDS(file = "Total_Tokens_Clean.Rds")
                         ## To measure time taken:
                        end.time2 <- Sys.time()
                        rm(Total_Tokens_Clean)
                        gc()
                
        Tri_gram_Df <- dfm(Tri_gram)
                        ## To measure time taken:
                        end.time3 <- Sys.time()
                        ## Save data offline        
                        saveRDS(Tri_gram, file = "Tri_gram.Rds")
                rm(Tri_gram)
                gc()
                        ## Check memory
                                #sort( sapply(ls(),function(x){object.size(get(x))}))
                                memory.size() 
        
                        ## To measure time taken:
                        end.time4 <- Sys.time()
        
        nfeat(Tri_gram_Df)
        
                        ## Check memory
                                #sort( sapply(ls(),function(x){object.size(get(x))}))
                                memory.size() 
        
                        ## To measure time taken:
                        end.time5 <- Sys.time()
        
        
        topfeatures(Tri_gram_Df, n=25)

                        ## To measure time taken:
                        end.time6 <- Sys.time()
## [1] 3990.08
##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  36805408 1965.7   69403431 3706.6  40919339 2185.4
## Vcells 214947334 1640.0  429170446 3274.4 760567283 5802.7
##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  36805640 1965.7   69403431 3706.6  69403431 3706.6
## Vcells 255347070 1948.2  703818267 5369.8 760567283 5802.7
## [1] 3730.62
## [1] 34644587
## [1] 4259.57
##     thanks_for_the         one_of_the           a_lot_of            to_be_a 
##              23805              21051              19331              13206 
##          i_want_to        going_to_be           i_have_a looking_forward_to 
##              13180              12730              10896              10572 
##          i_have_to           it_was_a      thank_you_for         the_end_of 
##              10332              10289              10139               9718 
##         out_of_the         be_able_to         i_love_you          i_need_to 
##               9703               9310               9196               9147 
##        some_of_the      can't_wait_to         as_well_as        the_rest_of 
##               8584               8290               8192               8176 
##          one_of_my     for_the_follow        is_going_to        you_want_to 
##               8054               7932               7824               7726 
##        a_couple_of 
##               7466
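
These tri-gram counts already suggest one possible shape for the prediction step: look up the last two words typed in a tri-gram frequency table and, if nothing matches, back off to the bi-gram table. The function below is only a sketch of that idea, not the model that will ultimately be built; tri_tab and bi_tab are hypothetical frequency tables (with columns prefix, prediction and count) assumed to have been built from the n-gram DFMs in the same way as the bi-gram table above.

        ## Sketch of a back-off style next-word lookup (illustrative; tri_tab and bi_tab are
        ## hypothetical frequency tables with columns prefix, prediction and count)
        predict_next <- function(last_two, last_one, tri_tab, bi_tab) {
                hit <- tri_tab[tri_tab$prefix == last_two, ]         # try the tri-gram table first
                if (nrow(hit) == 0) {
                        hit <- bi_tab[bi_tab$prefix == last_one, ]   # back off to the bi-gram table
                }
                if (nrow(hit) == 0) return(NA_character_)            # no match found
                hit$prediction[which.max(hit$count)]                 # most frequent continuation
        }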

Create a Document-feature Matrix (DFM):

Another way to analyse text is simply to treat it as a bag of words. This can be done using a Document-feature Matrix (DFM). A DFM no longer contains positional information about the words; instead it provides a summary of the frequencies of features (tokens) across the documents in a matrix.

                        ## To measure time taken:
                        start.time <- Sys.time()
                # Load offline data
                Total_Tokens_Clean<- readRDS(file = "Total_Tokens_Clean.Rds")
                ## Task Timestamp
                        end.time1 <- Sys.time() 
                DFM <- dfm(Total_Tokens_Clean, remove=stopwords("english"))
                ## Task Timestamp
                        end.time2 <- Sys.time() 
        rm(Total_Tokens_Clean)
                gc()
                ## Task Timestamp
                        end.time3 <- Sys.time() 
                DFM <- dfm_remove(DFM, "\\b[a-zA-Z]\\b", valuetype="regex")
                ## Task Timestamp
                        end.time4 <- Sys.time() 

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  2668536 142.6   44418196 2372.2  69403431 3706.6
## Vcells 39930422 304.7  450443692 3436.7 760567283 5802.7
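
Because the DFM is just a document-by-feature count matrix, tidy frequency tables can be pulled straight out of it. As a small illustration (not part of the main pipeline), Quanteda's textstat_frequency() returns the counts together with document frequency and rank:

        ## Sketch: a tidy frequency table from the DFM (illustrative only)
        Freq <- textstat_frequency(DFM, n = 10)    # top 10 features across all three documents
        head(Freq)                                 # columns: feature, frequency, rank, docfreq, group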

Another way of filtering the features is to remove uncommon words using the min_docfreq argument of dfm_trim(), as these are unlikely to be words we want to predict. Setting it to 2 below removes any word that does not appear in at least two of the documents.

                        ## To measure time taken:
                        start.time <- Sys.time()
                DFM_Trimmed <- dfm_trim(DFM, min_docfreq=2)
                ## Save data offline
                saveRDS(DFM, file = "DFM.Rds")
                saveRDS(DFM_Trimmed, file = "DFM_Trimmed.Rds")
                #DFM<- readRDS(file = "DFM.Rds")
                rm(DFM)
                gc()

                ## Check memory
                                sort( sapply(ls(),function(x){object.size(get(x))}))
                                memory.size()                                 
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  2283900 122.0   35534557 1897.8  69403431 3706.6
## Vcells 38077596 290.6  360354954 2749.3 760567283 5802.7
##   blogs_filesize    news_filesize twitter_filesize             path 
##               56               56               56              136 
##            files        end.time1        end.time2        end.time3 
##              288              344              344              344 
##        end.time4        end.time5        end.time6        end.time7 
##              344              344              344              344 
##        end.time8       start.time       time.taken      time.taken1 
##              344              344              512              512 
##      time.taken2      time.taken3      time.taken4      time.taken5 
##              512              512              512              512 
##      time.taken6      time.taken7      DFM_Trimmed 
##              512              512         11576728 
## [1] 484.14
                        ## To measure time taken:
                        start.time <- Sys.time()
                Top_N <- topfeatures(DFM_Trimmed, n=25)
                Top_N
#textstat_frequency()
##            just    like     one     can     get    time    love    good     now 
## 1407676  255568  226386  216180  192319  186663  171336  152123  151846  146211 
##     day    know     new     see      go  people    back   great   think    make 
##  145885  141683  129716  118641  117793  114650  112153  108566  103521  101297 
##      us   going  really  thanks   today 
##  100408   97956   96880   96753   96725

Wordclouds can be used to visualise the frequency of the words used. In this plot a comparison is also made between the words used in the three types of data. The words closer to the centre are those that are common across the documents being compared. Twitter appears to account for the biggest portion of the most frequent words, most likely because its source messages are shorter. So whilst the other sources contain longer individual texts, Twitter generally shows a higher re-use of its most common words (likely due to the typical tweet length). Care must be taken when reviewing these wordcloud visualisations as they are not an accurate representation of the distribution of the words (for example, longer words naturally take up more area in the plot, which has nothing to do with their frequency).

                        ## To measure time taken:
                        #DFM_Trimmed<- readRDS(file = "DFM_Trimmed.Rds")
                        start.time <- Sys.time()
                textplot_wordcloud(DFM_Trimmed, comparison = TRUE,  max_words =200,max_size = 6,labelcolor = "darkred")

                        #rm(DFM_Trimmed)
                        ## To measure time taken:
                        start.time <- Sys.time()
        #Twitter_Top<- readRDS(file = "Twitter_Top.Rds")
                Twitter_Top <- topfeatures(DFM_Trimmed[1], n=20)
                Blog_Top <- topfeatures(DFM_Trimmed[2], n=20)
                News_Top <- topfeatures(DFM_Trimmed[3], n=20)

                p1<-barplot(Twitter_Top, main = "Top N Words in Twitter Data Set", ylab = "Count", col="lightsteelblue1",border="indianred3",xaxt="n")
                        text(p1, Twitter_Top * 0.06, labels=names(Twitter_Top), cex=1.0, srt=90, adj=c(0,0.5))

                p2<-barplot(Blog_Top, main = "Top N Words in Blogs Data Set", ylab = "Count", col="steelblue2",border="indianred3",xaxt="n")
                        text(p2, Blog_Top * 0.06, labels=names(Blog_Top), cex=1.0, srt=90, adj=c(0,0.5))

                p3<-barplot(News_Top, main = "Top N Words in News Data Set", ylab = "Count", col="lightsteelblue2",border="indianred3",xaxt="n")
                        text(p3, News_Top * 0.06, labels=names(News_Top), cex=1.0, srt=90, adj=c(0,0.5))


## Check memory
                                sort( sapply(ls(),function(x){object.size(get(x))}))
                                memory.size()                 
##   blogs_filesize    news_filesize twitter_filesize             path 
##               56               56               56              136 
##            files        end.time1        end.time2        end.time3 
##              288              344              344              344 
##        end.time4        end.time5        end.time6        end.time7 
##              344              344              344              344 
##        end.time8       start.time               p1               p2 
##              344              344              376              376 
##               p3       time.taken      time.taken1      time.taken2 
##              376              512              512              512 
##      time.taken3      time.taken4      time.taken5      time.taken6 
##              512              512              512              512 
##      time.taken7         Blog_Top         News_Top      Twitter_Top 
##              512             1648             1648             1648 
##            Top_N      DFM_Trimmed 
##             2008         11576728 
## [1] 1249.44

Next Steps:
