This report is part of the Coursera Data Science Capstone project from Johns Hopkins University, run in partnership with the corporate partner for this capstone, SwiftKey, who build smart keyboards for mobile devices. The objective of the project is to create a text prediction algorithm that would be one of the cornerstones of such a keyboard.
Part of the analysis involves utilising some natural language processing (NLP) techniques. The first step in the process was to acquire the raw data and take a look at its structure and size to understand what we are working with.
For the next step of the investigation into the algorithm development, NLP functions such as corpus building, tokenisation, generation of a document-feature matrix (DFM) and building n-grams were used to explore the data set structure and hopefully find a way forward with potential solutions for building the algorithm itself. Some early analysis components can be viewed below in a little more detail.
The source data used in this Natural Language Processing (NLP) capstone project was retrieved in zip format and stored locally on the machine before being processed in R.
## To measure time taken:
start.time <- Sys.time()
## Setup the working directory where the data is located
#knitr::opts_knit$set(root.dir = "D:/Documents/Coursera/Assignments/Capstone/Revision")
setwd("D:/Documents/Coursera/Assignments/Capstone/Revision")
#getwd()
## Task Timestamp
end.time1 <- Sys.time()
## Creates a data folder if one doesn't exist
if (!file.exists("data")){
dir.create("data")
}
## Task Timestamp
end.time2 <- Sys.time()
## Checks to see if data file exists, if not retrieves it from remote web location
if (!file.exists("./data/Coursera-SwiftKey.zip")) {
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "./data/Coursera-SwiftKey.zip")
dateDownloaded <- date()
dateDownloaded
list.files("./data")
}
path <- file.path("./data/final" , "en_US")
files<-list.files(path, recursive=TRUE)
## Task Timestamp
end.time3 <- Sys.time()
Initially, on my first attempt at processing this data, I used the TM package for Natural Language Processing but ran into memory issues on my machine. I did try reducing the sample set size, but this still did not work as expected. Subsequently I decided to use the Quanteda package, as it worked out to be more resource efficient. Because of this I used readtext to load the data into R, as it works nicely with Quanteda.
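For reference, the attempt at reducing the sample size followed roughly the sketch below; the 10% rate, the seed and the choice of the Twitter file are illustrative assumptions rather than the exact values used at the time.
## Hedged sketch: read one raw file and keep a random subset of lines before
## any further processing (rate and seed are illustrative only; not run here).
set.seed(1234)
twitter_lines  <- readLines("./data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
twitter_sample <- sample(twitter_lines, round(length(twitter_lines) * 0.1))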
## To measure time taken:
start.time <- Sys.time()
library(readtext)
setwd("D:/Documents/Coursera/Assignments/Capstone/Revision")
## Use Quanteda companion package for loading texts: readtext.
fileText_twitter <- readtext("./data/final/en_US/en_US.twitter.txt")
fileText_blogs <- readtext("./data/final/en_US/en_US.blogs.txt")
fileText_news <- readtext("./data/final/en_US/en_US.news.txt")
## Task Timestamp
end.time3 <- Sys.time()
After getting the data into R, it is useful to take a look at the material itself to see how it is made up. Firstly I had a look at some volume and count metrics to get an idea of how much data I would be working with. If the data set turns out to be large, some considerations may need to be made in order to handle it.
Throughout this report, generated output files are saved offline after they are used and then removed from R in order to free up memory (as issues were initially occurring with memory resources, even after sampling a subset of the data).
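The general pattern is sketched below; my_object is a placeholder name used purely for illustration, not an object from the analysis.
## Hedged sketch of the save-offline pattern used throughout this report.
saveRDS(my_object, file = "my_object.Rds")      # persist the result to disk
rm(my_object)                                   # remove it from the R session
gc()                                            # trigger garbage collection to release memory
#my_object <- readRDS(file = "my_object.Rds")   # reload later when needed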
library(stringi)
library(kableExtra)
# To measure time taken
start.time <- Sys.time()
start.time
## Check filesize
blogs_filesize<-round(file.info("./data/final/en_US/en_US.blogs.txt")$size/(1024*1024))
news_filesize<-round(file.info("./data/final/en_US/en_US.news.txt")$size/(1024*1024))
twitter_filesize<-round(file.info("./data/final/en_US/en_US.twitter.txt")$size/(1024*1024))
paste("Blogs Filesize",blogs_filesize,"MB")
paste("News Filesize",news_filesize,"MB")
paste("Twitter Filesize",twitter_filesize,"MB")
## Task Timestamp
end.time1 <- Sys.time()
## Check count words in files
BlogsWords <- stri_count_words(fileText_blogs)
NewsWords <- stri_count_words(fileText_news)
TwitterWords <- stri_count_words(fileText_twitter)
# Task Timestamp
end.time2 <- Sys.time()
## Check count characters in files
characters_news<-nchar(fileText_news)
characters_blogs<-nchar(fileText_blogs)
characters_twitter<-nchar(fileText_twitter)
## Task Timestamp
end.time3 <- Sys.time()
## Table of the data sets
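## Note: readtext loads each file as a single document, so the Line_Count column
## below reflects the structure of the readtext object rather than the number of
## lines in the source files.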
kable(data.frame(files = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
File_Size_MB = c(blogs_filesize, news_filesize, twitter_filesize),
Line_Count = c(length(fileText_blogs), length(fileText_news), length(fileText_twitter)),
Word_Count = c(sum(BlogsWords), sum(NewsWords), sum(TwitterWords)),
Mean_Word_Count = c(mean(BlogsWords), mean(NewsWords), mean(TwitterWords)),
Max_Word_Count = c(max(BlogsWords), max(NewsWords), max(TwitterWords)),
Max_characters_line=c(max(characters_blogs),max(characters_news),max(characters_twitter)))) %>%
kable_styling()
## Task Timestamp
end.time4 <- Sys.time()
## [1] "2020-03-09 08:24:36 GMT"
## [1] "Blogs Filesize 200 MB"
## [1] "News Filesize 196 MB"
## [1] "Twitter Filesize 159 MB"
| files | File_Size_MB | Line_Count | Word_Count | Mean_Word_Count | Max_Word_Count | Max_characters_line |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200 | 2 | 38154238 | 38154238 | 38154238 | 209260725 |
| en_US.news.txt | 196 | 2 | 2693898 | 2693898 | 2693898 | 15761023 |
| en_US.twitter.txt | 159 | 2 | 30218125 | 30218125 | 30218125 | 164744972 |
The corpora were created from the data.frames produced by readtext. Individual corpora as well as a consolidated version were generated to allow flexibility for analysis later on. Utilising corpora allows post-processing of the text bodies. As mentioned earlier, the Quanteda package was used for this project because it was found to be the most efficient from a memory and processing perspective; the TM package is another common choice for NLP.
library(quanteda)
## Package version: 1.5.2
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
## To measure time taken:
start.time <- Sys.time()
corpus_Twitter <- corpus(fileText_twitter)
#str(fileText_twitter)
## Task Timestamp
end.time2 <- Sys.time()
corpus_Blog <- corpus(fileText_blogs)
## Task Timestamp
end.time3 <- Sys.time()
corpus_News <- corpus(fileText_news)
## Task Timestamp
end.time4 <- Sys.time()
Total_Corpus <- corpus_Twitter+corpus_Blog+corpus_News
## Task Timestamp
end.time5 <- Sys.time()
## head(docvars(corpus_Twitter))
docnames(Total_Corpus) <- c("Twitter", "Blog", "News")
#summary(Total_Corpus)
The data is tokenised by segmenting the texts within the corpus at word boundaries. At this point some cleaning of the data is carried out by removing symbols, punctuation and numbers so that we are just left with words. By referencing a list of profanity words, it was also possible to filter these out from the token list. The resultant tokens are stored in a list of vectors. This is more efficient than character strings, but it still preserves the positions of words, which facilitates positional analysis of the source text using functions such as textstat_collocations(), tokens_ngrams(), etc. Looking at n-grams will be particularly interesting for the text prediction algorithm, because it provides information about the frequency of sequences of tokens in the already tokenised text. This can be used to predict the next word.
## To measure time taken:
start.time <- Sys.time()
Total_Tokens <- tokens(Total_Corpus, remove_numbers=TRUE, remove_punct=TRUE, remove_symbols=TRUE, remove_separators=TRUE, remove_twitter=TRUE, remove_hyphens=TRUE, remove_url=TRUE)
Total_Tokens <- tokens_tolower(Total_Tokens)
Total_Tokens <- tokens_remove(Total_Tokens, pattern="^[^a-zA-Z]|[^a-zA-Z]$", valuetype="regex", padding=TRUE)
## Task Timestamp
end.time1 <- Sys.time()
profanity <- readLines("./data/Profanity/Profanity.txt")
## Warning in readLines("./data/Profanity/Profanity.txt"): incomplete final line
## found on './data/Profanity/Profanity.txt'
#head(profanity)
Total_Tokens_Clean <- tokens_remove(Total_Tokens, profanity, padding = TRUE)
## Task Timestamp
end.time2 <- Sys.time()
#head(Total_Tokens_Clean)
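As a small illustration of the positional analysis mentioned above, textstat_collocations() could be applied to the cleaned tokens along the lines of the sketch below; the size and min_count values are illustrative choices, and on the full data set this call would be slow, so it is not run as part of this report.
## Hedged sketch: score two-word collocations on the cleaned tokens
## (size and min_count are illustrative values).
collocs <- textstat_collocations(Total_Tokens_Clean, size = 2, min_count = 50)
head(collocs[order(collocs$count, decreasing = TRUE), ], 10)   # most frequent word pairs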
In order to make predictions of what the next word might be when typing, it will be necessary to explore n-grams. N-grams are simply sequences of words of length n. By investigating the frequency of n-grams, it may be possible to use this information as input to the text prediction model. N-grams are also useful because DFMs can be generated from them to carry out further analysis. First some bi-grams were generated, and from these some document-feature matrices (DFMs) were created to carry out frequency analysis of those bi-grams.
## To measure time taken:
start.time <- Sys.time()
Bi_gram <- tokens_ngrams(Total_Tokens_Clean, n=2)
## Task Timestamp
end.time1 <- Sys.time()
## Check memory
#sort( sapply(ls(),function(x){object.size(get(x))}))
memory.size()
## To measure time taken:
end.time2 <- Sys.time()
Bi_gram_Df <- dfm(Bi_gram)
## Check memory
#sort( sapply(ls(),function(x){object.size(get(x))}))
memory.size()
## To measure time taken:
end.time3 <- Sys.time()
nfeat(Bi_gram_Df)
## Check memory
#sort( sapply(ls(),function(x){object.size(get(x))}))
memory.size()
## To measure time taken:
end.time4 <- Sys.time()
topfeatures(Bi_gram_Df, n=25)
## [1] 1790.57
## [1] 4711.28
## [1] 10814688
## [1] 4504.86
## of_the in_the for_the to_the on_the to_be at_the i_have
## 258486 246653 137564 136227 129690 118884 89350 79814
## and_the i_was is_a in_a and_i i_am it_was it_is
## 77901 75737 74571 73001 72709 72149 70550 66794
## for_a with_the if_you have_a going_to is_the will_be to_get
## 65994 65922 63895 60664 60465 56239 55336 54032
## from_the
## 53029
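To illustrate how these bi-gram frequencies might eventually feed the prediction algorithm, the sketch below turns the bi-gram DFM into a crude next-word lookup by splitting each feature on the underscore. The predict_next() helper is hypothetical and purely illustrative, not the final model; on the full bi-gram DFM this would also be memory-hungry, so in practice it would be run on a trimmed version.
## Hedged sketch: a crude next-word lookup built from bi-gram frequencies
## (predict_next is a hypothetical helper, not part of the final algorithm).
bigram_freq <- colSums(Bi_gram_Df)                        # total count per bi-gram
parts       <- strsplit(names(bigram_freq), "_", fixed = TRUE)
first_word  <- vapply(parts, `[`, character(1), 1)
second_word <- vapply(parts, `[`, character(1), 2)
predict_next <- function(word, n = 3) {
  idx <- which(first_word == word)
  second_word[head(idx[order(bigram_freq[idx], decreasing = TRUE)], n)]
}
predict_next("going")   # "to" would be expected among the top suggestions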
Then some tri-grams were generated and, from these, further DFMs were created to facilitate frequency analysis of those tri-grams.
## To measure time taken:
start.time <- Sys.time()
Tri_gram <- tokens_ngrams(Total_Tokens_Clean, n=3)
## To measure time taken:
end.time1 <- Sys.time()
## Check memory
#sort( sapply(ls(),function(x){object.size(get(x))}))
memory.size()
## Save data offline
saveRDS(Total_Tokens_Clean, file = "Total_Tokens_Clean.Rds")
#Total_Tokens_Clean<- readRDS(file = "Total_Tokens_Clean.Rds")
## To measure time taken:
end.time2 <- Sys.time()
rm(Total_Tokens_Clean)
gc()
Tri_gram_Df <- dfm(Tri_gram)
## To measure time taken:
end.time3 <- Sys.time()
## Save data offline
saveRDS(Tri_gram, file = "Tri_gram.Rds")
rm(Tri_gram)
gc()
## Check memory
#sort( sapply(ls(),function(x){object.size(get(x))}))
memory.size()
## To measure time taken:
end.time4 <- Sys.time()
nfeat(Tri_gram_Df)
## Check memory
#sort( sapply(ls(),function(x){object.size(get(x))}))
memory.size()
## To measure time taken:
end.time5 <- Sys.time()
topfeatures(Tri_gram_Df, n=25)
## To measure time taken:
end.time6 <- Sys.time()
## [1] 3990.08
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 36805408 1965.7 69403431 3706.6 40919339 2185.4
## Vcells 214947334 1640.0 429170446 3274.4 760567283 5802.7
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 36805640 1965.7 69403431 3706.6 69403431 3706.6
## Vcells 255347070 1948.2 703818267 5369.8 760567283 5802.7
## [1] 3730.62
## [1] 34644587
## [1] 4259.57
## thanks_for_the one_of_the a_lot_of to_be_a
## 23805 21051 19331 13206
## i_want_to going_to_be i_have_a looking_forward_to
## 13180 12730 10896 10572
## i_have_to it_was_a thank_you_for the_end_of
## 10332 10289 10139 9718
## out_of_the be_able_to i_love_you i_need_to
## 9703 9310 9196 9147
## some_of_the can't_wait_to as_well_as the_rest_of
## 8584 8290 8192 8176
## one_of_my for_the_follow is_going_to you_want_to
## 8054 7932 7824 7726
## a_couple_of
## 7466
Another way to analyse text is simply to treat it as a bag of words. This can be done using a document-feature matrix (DFM): positional information relating to the words is no longer retained, and what remains is a matrix summarising the frequency of each feature/token in each document.
## To measure time taken:
start.time <- Sys.time()
# Load offline data
Total_Tokens_Clean<- readRDS(file = "Total_Tokens_Clean.Rds")
## Task Timestamp
end.time1 <- Sys.time()
DFM <- dfm(Total_Tokens_Clean, remove=stopwords("english"))
## Task Timestamp
end.time2 <- Sys.time()
rm(Total_Tokens_Clean)
gc()
## Task Timestamp
end.time3 <- Sys.time()
DFM <- dfm_remove(DFM, "\\b[a-zA-Z]\\b", valuetype="regex")
## Task Timestamp
end.time4 <- Sys.time()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2668536 142.6 44418196 2372.2 69403431 3706.6
## Vcells 39930422 304.7 450443692 3436.7 760567283 5802.7
Another way of filtering the features is to remove uncommon words using the min_docfreq argument of dfm_trim(), as you are not likely to want to predict these. Setting it to 2 below filters out any word that does not appear in at least two of the documents.
## To measure time taken:
start.time <- Sys.time()
DFM_Trimmed <- dfm_trim(DFM, min_docfreq=2)
## Save data offline
saveRDS(DFM, file = "DFM.Rds")
saveRDS(DFM_Trimmed, file = "DFM_Trimmed.Rds")
#DFM<- readRDS(file = "DFM.Rds")
rm(DFM)
gc()
## Check memory
sort( sapply(ls(),function(x){object.size(get(x))}))
memory.size()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2283900 122.0 35534557 1897.8 69403431 3706.6
## Vcells 38077596 290.6 360354954 2749.3 760567283 5802.7
## blogs_filesize news_filesize twitter_filesize path
## 56 56 56 136
## files end.time1 end.time2 end.time3
## 288 344 344 344
## end.time4 end.time5 end.time6 end.time7
## 344 344 344 344
## end.time8 start.time time.taken time.taken1
## 344 344 512 512
## time.taken2 time.taken3 time.taken4 time.taken5
## 512 512 512 512
## time.taken6 time.taken7 DFM_Trimmed
## 512 512 11576728
## [1] 484.14
## To measure time taken:
start.time <- Sys.time()
Top_N <- topfeatures(DFM_Trimmed, n=25)
Top_N
#textstat_frequency()
## just like one can get time love good now
## 1407676 255568 226386 216180 192319 186663 171336 152123 151846 146211
## day know new see go people back great think make
## 145885 141683 129716 118641 117793 114650 112153 108566 103521 101297
## us going really thanks today
## 100408 97956 96880 96753 96725
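The commented-out textstat_frequency() call above hints at another way to compare the three sources. The sketch below shows how it could be used to get the top features per document, using the document names as a grouping factor; this is an illustrative use and its output is not part of the original analysis.
## Hedged sketch: per-source frequency table, grouped by document name.
freq_by_doc <- textstat_frequency(DFM_Trimmed, n = 10, groups = factor(docnames(DFM_Trimmed)))
head(freq_by_doc, 10)   # top 10 features for the first group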
Wordclouds can be used to visualise the frequency of the words used. In this plot a comparison is also made between the words used in the three types of data. The words closer to the centre are those that are common across the documents being compared. Twitter appears to have the biggest portion of the most frequent words, most likely because the source messages are shorter: even though it does not have the largest total word count, Twitter in general shows a higher re-use of its most common words (likely due to the typical tweet length). Care must be taken when reviewing these wordcloud visualisations, as they are not an accurate representation of the distribution of the words (for example, longer words naturally take up more area in the plot, but this is not related to their frequency).
## To measure time taken:
#DFM_Trimmed<- readRDS(file = "DFM_Trimmed.Rds")
start.time <- Sys.time()
textplot_wordcloud(DFM_Trimmed, comparison = TRUE, max_words =200,max_size = 6,labelcolor = "darkred")
#rm(DFM_Trimmed)
## To measure time taken:
start.time <- Sys.time()
#Twitter_Top<- readRDS(file = "Twitter_Top.Rds")
Twitter_Top <- topfeatures(DFM_Trimmed[1], n=20)
Blog_Top <- topfeatures(DFM_Trimmed[2], n=20)
News_Top <- topfeatures(DFM_Trimmed[3], n=20)
p1<-barplot(Twitter_Top, main = "Top N Words in Twitter Data Set", ylab = "Count", col="lightsteelblue1",border="indianred3",xaxt="n")
text(p1, Twitter_Top * 0.06, labels=names(Twitter_Top), cex=1.0, srt=90, adj=c(0,0.5))
p2<-barplot(Blog_Top, main = "Top N Words in Blogs Data Set", ylab = "Count", col="steelblue2",border="indianred3",xaxt="n")
text(p2, Blog_Top * 0.04, labels=names(Blog_Top), cex=1.0, srt=90, adj=c(0,0.5))
p3<-barplot(News_Top, main = "Top N Words in News Data Set", ylab = "Count", col="lightsteelblue2",border="indianred3",xaxt="n")
text(p3, News_Top * 0.01, labels=names(News_Top), cex=1.0, srt=90, adj=c(0,0.5))
## Check memory
sort( sapply(ls(),function(x){object.size(get(x))}))
memory.size()
## blogs_filesize news_filesize twitter_filesize path
## 56 56 56 136
## files end.time1 end.time2 end.time3
## 288 344 344 344
## end.time4 end.time5 end.time6 end.time7
## 344 344 344 344
## end.time8 start.time p1 p2
## 344 344 376 376
## p3 time.taken time.taken1 time.taken2
## 376 512 512 512
## time.taken3 time.taken4 time.taken5 time.taken6
## 512 512 512 512
## time.taken7 Blog_Top News_Top Twitter_Top
## 512 1648 1648 1648
## Top_N DFM_Trimmed
## 2008 11576728
## [1] 1249.44