Milestone Report for NLP Capstone Project

Introduction
Executive Summary
Understanding The Problem
Summary of Data
Sampling the Data
Creating the Corpus & Tokenising
Conclusions & Next Steps
Links

Introduction

This is the milestone report for the Data Science Specialization Capstone Project on Coursera. It shows my current progress and thinking, in the hope of obtaining constructive feedback from my peers and teachers. Because the intended audience for this report is non-data scientists, I have kept code output to a minimum; interested readers may view the source code for this file at my GitHub repository.

Executive Summary

The objective of this Data Science Specialization Capstone Project is to produce a predictive text algorithm in R that works from a user’s text input: as the user types some text, the system suggests the next most likely word to be entered.

From my current understanding of the task I will need to process the user’s input as they type and compare the text against a word list. The predicted word will be the word that has the highest probability following the previous word or multi-word phrase.

At this stage of the project I have downloaded the dataset provided and performed some exploratory analyses and data preparation in order to proceed with the predictive modeling and construction of the end user application.

My immediate objective is to find the optimal sample size from the dataset required to build a corpus on which to train the prediction algorithm. The raw dataset is too large to use in full, even at the start (my computer crashes even when processing a sample of 0.05% of the dataset), and the final corpus will need to work well with as little memory as possible so that it is suitable for a mobile device.

Understanding The Problem

The immediate problems are how to handle undesirable features within the dataset, such as non-English words, abbreviations and contractions, and foul language (we don’t want to offer bad words).

The main problem is that if we try to achieve total coverage of all possible word combinations, the algorithm will need to process a large amount of data, which exceeds the available computing resources and makes the user wait. So a strategy is needed to find the minimal amount of data to use while achieving maximum coverage and delivering word suggestions within a tolerable time.

The next problem will be to predict the correct word, i.e. the most relevant one. In the simplest case, this can be done by choosing the word most frequently used after one or more preceding words. From my limited understanding at this stage, there are more advanced techniques which improve relevance, and I will explore them further as I learn more to complete the project.
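The simplest case described above can be sketched in base R as follows. This is only a minimal illustration under stated assumptions, not the final algorithm: the toy sentences and the helper names build_bigram_table() and predict_next() are hypothetical and were introduced for this example.

```r
# Toy training text (an assumption for illustration only)
sentences <- c("i love new york", "i love new shoes",
               "new york city is big", "i live in new york")

# Count how often each word follows each preceding word
build_bigram_table <- function(sentences) {
  tokens <- strsplit(tolower(sentences), "\\s+")
  pairs <- do.call(rbind, lapply(tokens, function(w) {
    if (length(w) < 2) return(NULL)
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  }))
  counts <- as.data.frame(table(first = pairs$first, second = pairs$second),
                          stringsAsFactors = FALSE)
  counts[counts$Freq > 0, ]  # keep only pairs that were actually observed
}

# Suggest the word seen most often after `previous`; NA if it was never seen
predict_next <- function(bigram_table, previous) {
  candidates <- bigram_table[bigram_table$first == tolower(previous), ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$Freq)]
}

bigram_table <- build_bigram_table(sentences)
predict_next(bigram_table, "new")
```

The same lookup idea extends to longer contexts by keying the table on the previous two words instead of one.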

Summary of Data

The downloaded dataset comprises three files, which contain text mined from blogs, news and Twitter sources. I loaded the complete dataset into R and performed some basic exploration, summarised below (all lengths are in characters per line):

Source	Number of lines	Average length	Min length	Max length	Variance	Std. Dev.
Blogs	899288	229.98695	1	40833	66905.414	258.66081
News	1010242	201.16101	1	11384	17746.919	133.21756
Twitter	2360074	68.68048	0	421	1386.001	37.22904

From this summary we can observe some features of the dataset and their implications:

They are very large files, so we will need to obtain random samples for processing
The minimum character counts of 0 and 1 show that the files contain some meaningless text
The maximum character count of 421 for Twitter shows that it contains at least one line which exceeds the expected character limit of 140
The relatively small means and standard deviations compared to the maximum values suggest that the majority of lines contain fewer than 1000 characters
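A summary of this kind can be computed with a few base R calls. The sketch below is an assumption about how such a table might be produced, not the code used for this report; the function name summarise_lines() and the toy input are hypothetical.

```r
# Summarise the number of characters per line for one source
summarise_lines <- function(lines, source_name) {
  n_chars <- nchar(lines)
  data.frame(
    Source = source_name,
    Number.of.lines = length(lines),
    Average.length = mean(n_chars),
    Min.length = min(n_chars),
    Max.length = max(n_chars),
    Variance = var(n_chars),
    Std.Dev = sd(n_chars),
    stringsAsFactors = FALSE
  )
}

# Toy input (an assumption); real input would be read with readLines()
toy_lines <- c("", "hello", "a much longer line of text")
summarise_lines(toy_lines, "Toy")
```

Calling the function once per source and combining the results with rbind() would give one row per file.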

To understand the problem further, I made a density plot to visualise the relative spread of line lengths between the three sources. I have constrained the x-axis to 1000 characters; in reality the plot extends to over 40,000 characters.

The plot shows that Twitter lines tend to be very short, whereas the lengths of blogs and news lines are highly variable. However, it seems that the variations are due to outliers in the data.

Sampling the Data

Using the caret library, I obtained a random sample of 0.1% of the blogs and news datasets, and 0.05% of the Twitter dataset. The sample size is deliberately very small so that I can quickly perform various experiments on the dataset. The summary statistics of the samples in terms of character counts per line are shown below.
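The sampling step can be illustrated with base R alone. Note that this sketch uses sample.int() rather than the caret library mentioned above, so it is a stand-in for the idea and not the code behind this report; the function name sample_lines(), the seed value and the toy input are assumptions.

```r
# Draw a random sample of a given fraction of lines, without replacement
sample_lines <- function(lines, fraction, seed = 1234) {
  set.seed(seed)  # fixed seed (arbitrary value) so the sample is reproducible
  n <- round(length(lines) * fraction)
  lines[sample.int(length(lines), n)]
}

# Toy input standing in for one source file (an assumption)
toy_lines <- paste("line", 1:10000)
toy_sample <- sample_lines(toy_lines, 0.001)  # 0.1%, as used for blogs and news
length(toy_sample)
```

Fixing the seed means the same lines are drawn on every run, which makes experiments on the sample comparable.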

Source	Number of lines	Average length	Min length	Max length	Variance	Std. Dev.
Blogs	899	219.54839	4	1711	60389.464	245.74268
News	1010	194.45347	6	1011	16083.150	126.81936
Twitter	1180	68.22373	6	140	1389.171	37.27159

The sample statistics appear to be representative of the full dataset. Plotting the distribution of the number of characters per line as before:

The plot shows that the sampling procedure has removed some noise from the data. Interestingly, we see how Twitter texts are tightly constrained to the 140-character limit; news texts have a wider spread, but also seem mostly constrained to certain lengths (which would be expected, given the nature of news items); and blog texts have the widest spread.

My next step was to combine the texts into a single dataset and then use the sent_detect() function from the qdap library to split each line into individual sentences. This produced a dataset with 4980 lines.
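To show what sentence splitting involves, here is a deliberately naive base R sketch. It does not call sent_detect() and will not reproduce qdap's rules; the function name split_sentences() and the toy lines are assumptions made for illustration.

```r
# Split each line after ., ! or ? when followed by whitespace (a naive rule)
split_sentences <- function(lines) {
  pieces <- unlist(strsplit(lines, "(?<=[.!?])\\s+", perl = TRUE))
  pieces[nchar(pieces) > 0]  # drop any empty pieces
}

# Toy input (an assumption)
toy_lines <- c("Save it? So gross.", "3. Mix well!")
split_sentences(toy_lines)
```

A rule this simple treats list numbering such as "3." as a sentence of its own, which is one way very short pieces can end up in the split data.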

The line length distribution of the combined texts is plotted below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   32.00   70.00   81.81  117.00  560.00

Source	Number of lines	Average length	Min length	Max length	Variance	Std. Dev.
Combined	4980	81.81084	1	560	4112.302	64.12723

Splitting the text into sentences has caused some fragments to appear in the dataset. I’m not yet sure what impact this will have, but I will deal with them later.

head(s.combined[which(nchar(s.combined)<10)],10)

##  [1] "!"         "!"         "!"         "” ."       "SORRY|"   
##  [6] "!"         "Save it?"  "3."        "So gross." "!"

Creating the Corpus & Tokenising

Next, the data is converted into a corpus with the tm library and then tokenized using the NGramTokenizer() function in the RWeka library to obtain frequency counts for unigrams, bigrams, and trigrams.

Some transformations were performed while creating the corpus, which significantly reduced the size of the original corpus from 799.5Mb to 16.6Mb. The transformations are shown below, with a comment describing each one:

library(tm)

make_corpus <- function(chrVector) {
  # Create the corpus from a character vector
  corpus <- Corpus(VectorSource(chrVector))

  # Convert to lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))

  # Remove email addresses
  removeEmails <- function(x) gsub("\\S+@\\S+", "", x)
  corpus <- tm_map(corpus, content_transformer(removeEmails))

  # Remove URLs (everything from "http" up to the next whitespace)
  removeUrls <- function(x) gsub("http\\S*", "", x)
  corpus <- tm_map(corpus, content_transformer(removeUrls))

  # Remove Twitter hashtags
  removeHashtags <- function(x) gsub("#[[:alnum:]]*", "", x)
  corpus <- tm_map(corpus, content_transformer(removeHashtags))

  # Remove Twitter handles (e.g. @username)
  removeHandles <- function(x) gsub("@[[:alnum:]]*", "", x)
  corpus <- tm_map(corpus, content_transformer(removeHandles))

  # Remove Twitter-specific terms like RT (retweet) and PM (private message)
  corpus <- tm_map(corpus, removeWords, c("rt", "pm", "p m"))

  # Remove punctuation, numbers and English stop words
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))

  # Remove bad words (word list obtained from http://www.bannedwordlist.com).
  # removeWords() needs a character vector, so flatten the data frame first.
  badwords <- read.csv("./swearWords.csv", stringsAsFactors = FALSE, header = FALSE)
  badwords <- as.character(unlist(badwords, use.names = FALSE))
  corpus <- tm_map(corpus, removeWords, badwords)

  # Collapse the extra whitespace left behind by the removals above
  corpus <- tm_map(corpus, stripWhitespace)

  # Every custom function is wrapped in content_transformer(), so the
  # documents stay as text and no PlainTextDocument conversion is needed
  corpus
}

To reduce the size of the corpus further, I next want to use word stemming and to find out whether I need to remove more of the noise detected above. Stemming has caused performance problems on my computer, so I have skipped that step until I have solved the problem.

I then tokenised the corpus into three sets of n-grams (unigrams, bigrams and trigrams), as summarised below:

##      Grams         Example Count
## 1 Unigrams           great 12091
## 2  Bigrams   united states 37754
## 3 Trigrams can reached com 39348
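For readers curious about what n-gram counting involves, the idea can be sketched in base R. This does not use NGramTokenizer() from RWeka, so it is only a stand-in for the tokenising step; the function names make_ngrams() and count_ngrams() and the toy sentences are assumptions.

```r
# Build all n-grams (as space-separated strings) from a vector of words
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across sentences and sort them by decreasing frequency
count_ngrams <- function(sentences, n) {
  tokens <- strsplit(sentences, "\\s+")
  grams <- unlist(lapply(tokens, make_ngrams, n = n))
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(grams = names(freq), Freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Toy input (an assumption)
toy <- c("new york city", "new york times", "last week")
count_ngrams(toy, 2)
```

Because n-grams are built per sentence, no n-gram spans a sentence boundary in this sketch.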

The n-grams are sorted by frequency (the number of times they appear in the texts), and the cumulative coverage is calculated. The following plots show what the coverage looks like:

The number of unigrams to achieve 50% coverage is 1002; 80%: 4608; and 90%: 8144.

If only 8144 unigrams are needed to cover 90% of all word occurrences, then it would seem that by removing very low-frequency words I can achieve a smaller dataset (about 67.4% of the 12091 unigrams) on which to base the prediction algorithm.

It’s also interesting to look at the most frequently used bigrams and trigrams:
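The coverage figures come from a cumulative sum over the sorted frequencies. A minimal sketch of that calculation is given here, assuming a named vector of term counts; the function name terms_for_coverage() and the toy counts are hypothetical.

```r
# Number of most-frequent terms needed to reach a target share of all occurrences
terms_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cumulative_share <- cumsum(freq) / sum(freq)
  unname(which(cumulative_share >= target)[1])
}

# Toy frequency counts (an assumption)
toy_freq <- c(the = 50, of = 25, cat = 15, sat = 5, mat = 5)
terms_for_coverage(toy_freq, 0.5)
terms_for_coverage(toy_freq, 0.8)
```

Dividing the result by the total number of distinct terms gives the share of the vocabulary that has to be kept for that level of coverage.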

head(bigrams)

##             grams Freq cumsum        pct     cumpct
## 34563         u s   28     28 0.07094535 0.07094535
## 23278         p m   25     53 0.06334406 0.13428941
## 21755    new york   24     77 0.06081030 0.19509970
## 17545   last week   16     93 0.04054020 0.23563990
## 14663 high school   14    107 0.03547267 0.27111258
## 17549   last year   13    120 0.03293891 0.30405149

head(trigrams)

##                         grams Freq cumsum        pct     cumpct
## 22619           new york city    6      6 0.01520296 0.01520296
## 19833                   m p m    5     11 0.01266913 0.02787209
## 32119            st marys tca    5     16 0.01266913 0.04054123
## 21034        metal gear solid    4     20 0.01013531 0.05067653
## 4710            cant wait see    3     23 0.00760148 0.05827801
## 5468  chief financial officer    3     26 0.00760148 0.06587949

It seems that there are single-letter words and acronyms which shouldn’t be part of the corpus, and I will need to remove such terms in the next steps.

Conclusions & Next Steps

From my current understanding, my plan for the remaining time for this project is to:

Remove noise such as single-letter words, acronyms and non-English terms (e.g. Chinese characters).
Perhaps exclude terms which are shorter than a certain number of characters to ensure that I have proper words; this may also be better because I think prediction should reduce typing time for longer, rare or frequently misspelled words rather than short ones.
Use stemming and stem completion to reduce the number of terms in the corpus. Stemming will allow a root word to be predicted in place of its variants.
Create the prediction algorithm. I have yet to study how to do this. As I understand it, I can apply clustering to find word associations and predict relevant words by searching within clusters.
Increase the sample size and optimize the final corpus to achieve appropriate coverage and improve prediction accuracy.
Create the final application in Shiny.

I have faced a number of challenges in getting this far into the project. The learning curve is steep and I have had little time to work on it, not least because I have been travelling for the past week for the holiday season and have been off the grid. Personal challenges aside, the technical challenge of the project is considerable, especially managing the processing time and memory usage, but with further study once I’m back I think it is surmountable.

Thank you.