Introduction

This is the milestone report for the Data Science Specialization Capstone Project on Coursera. It presents my current progress and discussion, in the hope of obtaining constructive feedback from my peers and teachers. As the intended audience for this report is non-data scientists, I have kept code output to a minimum; interested readers may view the source code for this file at my GitHub repository.

Executive Summary

The objective of this Data Science Specialization Capstone Project is to produce a predictive text algorithm in R that works from a user’s text input: as the user types, the system suggests the next most likely word to be entered.

From my current understanding of the task I will need to process the user’s input as they type and compare the text against a word list. The predicted word will be the word that has the highest probability following the previous word or multi-word phrase.

At this stage of the project I have downloaded the dataset provided and performed some exploratory analyses and data preparation in order to proceed with the predictive modeling and construction of the end user application.

My immediate objective is to find the optimal sample size from the dataset required to build a corpus on which to train the prediction algorithm. The raw dataset is too large to use in full, even at this early stage (my computer crashes even when processing a sample of 0.05% of the dataset), and the final corpus will need to work well with as little memory as possible so that it is suitable for a mobile device.

Understanding The Problem

From my current understanding of the task I will need to process the user’s input as they type and compare the text against a word list. The predicted word will be the word that has the highest probability following the previous word or multi-word phrase.

The immediate problems concern how to handle undesirable features of the dataset, such as non-English words, abbreviations and contractions, and foul language (we don’t want to offer bad words).

The main problem is that if we try to achieve total coverage of all possible word combinations, the algorithm will need to process an amount of data that exceeds the available computing resources and makes the user wait. A strategy is therefore needed to find the minimal amount of data to use while still achieving maximum coverage and delivering word suggestions within a tolerable time.

The next problem will be to predict the correct (i.e. the most relevant) word. In the simplest case, this can be done by choosing the word that most frequently follows the preceding one or more words. From my limited understanding at this stage, there are advanced techniques which improve relevance, and I will explore them further as I learn more to complete the project.
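The simplest case described above can be sketched in a few lines of R. This is only an illustration under assumptions: the data frame bigram_counts, its column names and the function predict_next() are hypothetical, and the counts are toy values rather than results from the dataset.

```r
# Toy bigram counts (hypothetical values, for illustration only)
bigram_counts <- data.frame(
  first  = c("new", "new", "last", "last", "high"),
  second = c("york", "year", "week", "year", "school"),
  freq   = c(24, 5, 16, 13, 14),
  stringsAsFactors = FALSE
)

# Return the word that most frequently follows `word`,
# or NA when `word` never appears as the first word of a bigram
predict_next <- function(word, counts) {
  candidates <- counts[counts$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$freq)]
}

predict_next("new", bigram_counts)    # "york"
predict_next("Last", bigram_counts)   # "week"
predict_next("zebra", bigram_counts)  # NA
```

Returning NA for an unseen word makes the gap explicit; a fuller solution would presumably fall back to a shorter context or to the most common words overall.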

Summary of Data

The downloaded dataset comprises three files containing texts mined from blogs, news and Twitter sources. I loaded the complete dataset into R and performed some basic explorations, as summarised below (lengths are in characters per line):

Source   Number.of.lines  Average.length  Min.length  Max.length   Variance  Std..Dev.
Blogs             899288       229.98695           1       40833  66905.414  258.66081
News             1010242       201.16101           1       11384  17746.919  133.21756
Twitter          2360074        68.68048           0         421   1386.001   37.22904
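A table of this kind can be produced with base R alone. The sketch below is an assumption about how such a summary might be computed, not the exact code behind the report; summarise_lengths() and the toy input are hypothetical.

```r
# Summarise the number of characters per line for one source
summarise_lengths <- function(lines, source_name) {
  len <- nchar(lines)
  data.frame(
    Source          = source_name,
    Number.of.lines = length(lines),
    Average.length  = mean(len),
    Min.length      = min(len),
    Max.length      = max(len),
    Variance        = var(len),
    Std..Dev.       = sd(len),
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for one of the real files
toy_lines <- c("a short line", "", "a somewhat longer line of text")
toy_summary <- summarise_lengths(toy_lines, "Toy")
toy_summary
```

Calling the function once per source and combining the results with rbind() would give one row per file.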

From this summary we can observe some features of the dataset and their implications:

To understand the problem further, I made a density plot to visualise the relative spread of line lengths between the three sources. I have constrained the x-axis to 1000 characters; in reality the plot extends to over 40,000 characters.

The plot shows that Twitter lines tend to be very short, whereas the lengths of blogs and news lines are highly variable. However, it seems that the variations are due to outliers in the data.
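A density plot of this kind could be drawn with base R as sketched below. The line lengths here are simulated toy values (an assumption for illustration, not the real data); with the real files they would come from nchar() applied to each source.

```r
# Simulated line lengths (hypothetical; real values would be nchar(blogs) etc.)
set.seed(42)
line_lengths <- list(
  Blogs   = rexp(500, rate = 1 / 230),
  News    = rexp(500, rate = 1 / 200),
  Twitter = runif(500, min = 1, max = 140)
)

# Estimate each density over the same 0 to 1000 character range
densities <- lapply(line_lengths, density, from = 0, to = 1000)

# Overlay the three curves on one set of axes (interactive sessions only)
if (interactive()) {
  ymax <- max(sapply(densities, function(d) max(d$y)))
  plot(NULL, xlim = c(0, 1000), ylim = c(0, ymax),
       xlab = "Characters per line", ylab = "Density")
  for (i in seq_along(densities)) lines(densities[[i]], col = i)
  legend("topright", legend = names(densities),
         col = seq_along(densities), lty = 1)
}
```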

Sampling the Data

Using the caret library I obtained a random sample of 0.1% of the blogs and news datasets, and 0.05% of the Twitter dataset. The sample size is deliberately very small so that I can quickly perform various experiments on the dataset. The summary statistics of the samples in terms of character counts per line are shown below.
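The sampling step can be illustrated without caret. The sketch below uses base R's sample.int() as a stand-in, so it is not the caret call used for the report; sample_lines(), the seed and the toy vector are assumptions.

```r
# Draw a simple random sample containing `fraction` of the lines
sample_lines <- function(lines, fraction) {
  n <- round(length(lines) * fraction)
  lines[sample.int(length(lines), n)]
}

set.seed(1234)  # assumed seed, only to make the example reproducible

# Toy stand-in for one of the source files
toy_blogs <- paste("blog line", 1:10000)
toy_sample <- sample_lines(toy_blogs, 0.001)
length(toy_sample)  # 10
```

sample.int() samples without replacement by default, so no line is picked twice.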

Source   Number.of.lines  Average.length  Min.length  Max.length   Variance  Std..Dev.
Blogs                899       219.54839           4        1711  60389.464  245.74268
News                1010       194.45347           6        1011  16083.150  126.81936
Twitter             1180        68.22373           6         140   1389.171   37.27159

The sample statistics appear to be representative of the full dataset. Plotting the distribution of the number of characters per line as before:

The plot shows that the sampling procedure has removed some noise from the data. Interestingly, we see how Twitter texts are tightly constrained by the 140-character limit; news texts have a wider spread, but also seem mostly constrained to certain lengths (which would be expected, given the nature of news items); and blog texts have the widest spread.

My next step is to combine the texts into a single dataset and then use the sent_detect() function from the qdap library to split each line into individual sentences. This produced a dataset with 4980 lines.
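To show what the combine-and-split step does, here is a rough base R stand-in. It is not qdap's sent_detect() and will behave differently on abbreviations and unusual punctuation; split_sentences() and the toy texts are hypothetical.

```r
# Split each line after ".", "!" or "?" when followed by whitespace
split_sentences <- function(lines) {
  pieces <- unlist(strsplit(lines, "(?<=[.!?])\\s+", perl = TRUE))
  pieces <- trimws(pieces)
  pieces[nzchar(pieces)]  # drop empty strings
}

# Toy stand-ins for the three sampled sources
toy_blogs   <- "I went home. It was late!"
toy_news    <- "Markets rose today."
toy_twitter <- "So gross. Save it? !"

# Combine the sources, then split into sentences
s.toy <- split_sentences(c(toy_blogs, toy_news, toy_twitter))
s.toy
length(s.toy)  # 6
```

The last toy tweet also shows how a stray "!" ends up as a sentence of its own.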

The line length distribution of the combined texts is plotted and summarised below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   32.00   70.00   81.81  117.00  560.00
Source    Number.of.lines  Average.length  Min.length  Max.length  Variance  Std..Dev.
Combined             4980        81.81084           1         560  4112.302   64.12723

Splitting the text into sentences has caused some fragments to appear in the dataset. I’m not sure yet what the impact of this is, but I will deal with them later.

head(s.combined[which(nchar(s.combined)<10)],10)
##  [1] "!"         "!"         "!"         "” ."       "SORRY|"   
##  [6] "!"         "Save it?"  "3."        "So gross." "!"
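One possible way to deal with such fragments later, offered only as a sketch and not as a decision already made in this report, is to keep sentences that contain at least two alphabetic words. The helper names and the min_words threshold are assumptions.

```r
# Count alphabetic words in each string
count_words <- function(x) {
  lengths(regmatches(x, gregexpr("[[:alpha:]]+", x)))
}

# Keep only strings with at least `min_words` alphabetic words
drop_fragments <- function(sentences, min_words = 2) {
  sentences[count_words(sentences) >= min_words]
}

# Fragments like those listed above ("\u201d" is a closing curly quote)
fragments <- c("!", "\u201d .", "SORRY|", "Save it?", "3.", "So gross.")
kept <- drop_fragments(fragments)
kept  # "Save it?" "So gross."
```

A word-count rule keeps short but meaningful sentences such as "So gross.", which a plain character-length cutoff would discard.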

Creating the Corpus & Tokenising

Next, the data is converted into a corpus with the tm library and then tokenized using the NGramTokenizer() function in the RWeka library to obtain frequency counts for unigrams, bigrams, and trigrams.
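To make the idea of n-gram frequency counts concrete, the sketch below builds them with base R only. It is a stand-in for illustration, not the RWeka NGramTokenizer() call used in the report; make_ngrams(), count_ngrams() and the toy texts are hypothetical.

```r
# Build all n-grams (as space-separated strings) from one text
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over a vector of texts, most frequent first
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(texts, make_ngrams, n = n))
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(grams = names(freq), Freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

toy_texts <- c("new york city", "new york times", "i love new york")
toy_bigrams  <- count_ngrams(toy_texts, 2)
toy_trigrams <- count_ngrams(toy_texts, 3)
head(toy_bigrams, 1)  # "new york" with Freq 3
```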

Some transformations are performed while creating the corpus, which significantly reduces its size from 799.5Mb to 16.6Mb. The transformations are described in the code comments:

library(tm)

make_corpus <- function(chrVector) {
  # Create the corpus (VCorpus, so content_transformer() is applied per document)
  corpus <- VCorpus(VectorSource(chrVector))

  # Helper: regex removal wrapped so that tm_map() keeps proper text documents
  removePattern <- content_transformer(function(x, pattern) gsub(pattern, "", x))

  # Convert to lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))

  # Remove emails
  corpus <- tm_map(corpus, removePattern, "\\S+@\\S+")

  # Remove URLs (the whole address, not only the leading "http")
  corpus <- tm_map(corpus, removePattern, "http\\S*")

  # Remove Twitter hashtags
  corpus <- tm_map(corpus, removePattern, "#[[:alnum:]]*")

  # Remove Twitter handles (e.g. @username)
  corpus <- tm_map(corpus, removePattern, "@[[:alnum:]]*")

  # Remove Twitter-specific terms like RT (retweet) and PM (private message)
  corpus <- tm_map(corpus, removeWords, c("rt", "pm", "p m"))

  # Remove punctuation, numbers and English stop words
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))

  # Remove bad words (word list obtained from http://www.bannedwordlist.com).
  # read.csv() returns a data frame, but removeWords needs a character vector.
  badwords <- read.csv("./swearWords.csv", stringsAsFactors = FALSE, header = FALSE)
  corpus <- tm_map(corpus, removeWords, as.character(unlist(badwords)))

  # Collapse the whitespace left behind by the removals above
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

Looking for ways to reduce the size of the corpus further, I next want to use word stemming and to find out whether I need to remove more of the noise detected above. Stemming has caused performance problems on my computer, so I have skipped that step until I have solved the problem.

I then tokenised the corpus into three sets of n-grams (unigrams, bigrams, and trigrams), as summarised below:

##      Grams         Example Count
## 1 Unigrams           great 12091
## 2  Bigrams   united states 37754
## 3 Trigrams can reached com 39348

The n-grams are sorted by frequency (the number of times they appear in the texts), and the cumulative coverage is calculated. The following plots show what the coverage looks like:

The number of unigrams to achieve 50% coverage is 1002; 80%: 4608; and 90%: 8144.

If only 8144 unigrams are needed to cover 90% of the effective vocabulary, then it would seem that by removing very low frequency words I will be able to base the prediction algorithm on a smaller dataset (about 67.4% of the 12091 unigrams).

It’s also interesting to look at the most frequently used bigrams and trigrams:
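The coverage figures above come from a cumulative sum over the sorted frequencies. A minimal sketch of that calculation follows; the column names mirror the n-gram tables in this report, while add_coverage(), n_for_coverage() and the toy counts are assumptions for illustration.

```r
# Sort by frequency and add cumulative count and percentage columns
add_coverage <- function(ngrams) {
  ngrams <- ngrams[order(-ngrams$Freq), ]
  total <- sum(ngrams$Freq)
  ngrams$cumsum <- cumsum(ngrams$Freq)
  ngrams$pct    <- 100 * ngrams$Freq / total
  ngrams$cumpct <- 100 * ngrams$cumsum / total
  ngrams
}

# Smallest number of top n-grams whose cumulative share reaches target_pct
n_for_coverage <- function(ngrams, target_pct) {
  which(ngrams$cumpct >= target_pct)[1]
}

# Toy unigram counts (hypothetical values)
toy_unigrams <- data.frame(
  grams = c("cat", "the", "emu", "of", "dog"),
  Freq  = c(10, 50, 4, 30, 6),
  stringsAsFactors = FALSE
)
toy_cov <- add_coverage(toy_unigrams)
n_for_coverage(toy_cov, 50)  # 1
n_for_coverage(toy_cov, 80)  # 2
n_for_coverage(toy_cov, 90)  # 3
```

The same function applied to the real unigram table with targets of 50, 80 and 90 would be expected to give the counts quoted above.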

head(bigrams)
##             grams Freq cumsum        pct     cumpct
## 34563         u s   28     28 0.07094535 0.07094535
## 23278         p m   25     53 0.06334406 0.13428941
## 21755    new york   24     77 0.06081030 0.19509970
## 17545   last week   16     93 0.04054020 0.23563990
## 14663 high school   14    107 0.03547267 0.27111258
## 17549   last year   13    120 0.03293891 0.30405149
head(trigrams)
##                         grams Freq cumsum        pct     cumpct
## 22619           new york city    6      6 0.01520296 0.01520296
## 19833                   m p m    5     11 0.01266913 0.02787209
## 32119            st marys tca    5     16 0.01266913 0.04054123
## 21034        metal gear solid    4     20 0.01013531 0.05067653
## 4710            cant wait see    3     23 0.00760148 0.05827801
## 5468  chief financial officer    3     26 0.00760148 0.06587949

It seems that there are single-letter words and acronyms which shouldn’t be part of the corpus, and I will need to remove such terms in the next steps.
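One way such terms might be filtered is to drop any n-gram containing a single-letter token. This is a sketch of one option rather than the method I have settled on; has_single_letter() is a hypothetical helper, and the rows are the six bigrams shown above.

```r
# TRUE when an n-gram contains a token that is a single letter
has_single_letter <- function(grams) {
  grepl("(^| )[[:alpha:]]( |$)", grams)
}

# The six most frequent bigrams from the output above
top_bigrams <- data.frame(
  grams = c("u s", "p m", "new york", "last week", "high school", "last year"),
  Freq  = c(28, 25, 24, 16, 14, 13),
  stringsAsFactors = FALSE
)

cleaned <- top_bigrams[!has_single_letter(top_bigrams$grams), ]
cleaned$grams  # "new york" "last week" "high school" "last year"
```

Because stop words such as "a" and "i" are already removed during corpus creation, this filter would mostly affect broken-up abbreviations like "u s" and "p m".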

Conclusions & Next Steps

From my current understanding, my plan for the remaining time for this project is to:

I have faced a number of challenges in getting this far into the project. The learning curve is steep and I have had little time to work on it, not least because I have been travelling for the past week over the holiday season and have been off the grid. Personal challenges aside, the technical challenge of the project is considerable, especially managing the processing time and memory usage, but with further study once I’m back I think it is surmountable.

Thank you.