Introduction

Natural Language Processing (NLP) is a branch of artificial intelligence that strives to teach computers to process human language and understand its full meaning (e.g. sentiment and context). For this Capstone project, I will be using NLP methods to create a predictive text model similar to those used by the intelligent keyboards found on mobile phones. This milestone report details the analysis and pre-processing of the raw text files to prepare them for use in a predictive text model.

Getting the Data

The raw text data was provided by SwiftKey, a company that develops intelligent keyboard apps. The data includes files from blogs, news and Twitter in four languages, including English. The files were downloaded from this Coursera link.
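For reference, a minimal sketch of how the English files can be read into R (the paths assume the standard folder layout of the download; skipNul guards against embedded nulls in the Twitter file):

# Read each raw file into a character vector, one element per line
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)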

Initial Analysis

In order to get a look at the data I was working with, I generated several random samples of each set. This quick scan of the raw data was crucial in determining my next steps. First, it highlighted the differences in language structure between the datasets. Secondly, it raised questions that needed to be answered before cleaning the data. Finally, it provided valuable insights on things I had to look for after the cleaning process.
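The samples in the sections below were produced along these lines (a sketch, using the blogs vector from the reading step; the sample size of three lines is illustrative):

set.seed(42)              # make the random sample reproducible
sample(blogs, size = 3)   # pull three random lines for manual review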

Blogs Sample

## [1] "10. The Parent Trap (Hayley Mills) - My favorite since childhood."                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "Churches aside…I read something yesterday which got me thinking about individual believers–those that have trusted Christ for salvation, and call ourselves Followers of Jesus. In his book, Barefoot Church, (a great read about the church’s mandate to serve the poor and “the least of these”–and that we largely aren’t doing it), author Brandon Hatmaker spends a little time unpacking one of the biggest causes of that lack of attention to the mission:"
## [3] "Time passed."
  • The length of each line varies greatly.
  • Structure isn’t as formal as the news data, but it is much more formal than the Twitter data.
  • This data seems to be conversational without the issues the Twitter data has (slang, abbreviations, emojis, etc.). That may prove useful for predicting text, so I will keep it in mind when choosing samples for my model.

News Sample

## [1] "A book by Jean Twenge, \u0093Generation Me,\u0094 suggests people in their 20s grew up being fed unrealistic expectations by their parents and media. Nearly three in five high school graduates, for instance, expect to finish not just college but graduate school. In reality, by age 35 fewer than one in 10 of us get that extra sheepskin."
## [2] "The A's have nothing permanent. The Giants have almost everything, as they helpfully noted in their statement."                                                                                                                                                                                                                                   
## [3] "Pletcher said she was such a novice that she showed up for her first class with one of those watercolor paint boxes kids use."
  • Hex characters (the \u0093 and \u0094 escapes above, which encode curly quotes) need to be removed, but the standard cleaning methods may not handle them; a possible approach is sketched below.
  • There are a lot of proper nouns in this dataset, especially people’s names.
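One possible approach to the hex characters, sketched with gsub and iconv (the escapes match the \u0093/\u0094 bytes in the sample above):

# Replace the curly-quote bytes with plain ASCII quotes, then drop any
# remaining non-ASCII characters rather than letting them leak into tokens
news <- gsub("[\u0093\u0094]", "\"", news)
news <- iconv(news, from = "UTF-8", to = "ASCII", sub = "")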

Twitter Sample

## [1] "He Gone"                                                                                                           
## [2] "I thought you felt like a ROCK STAR every day."                                                                    
## [3] "LOL yeah i know(;"                                                                                                 
## [4] "Look what you did with your browords The word \"Bromance\" has been added to the new Merriam-Webster's Dictionary."
## [5] "Have a great Friday everyone! Get outside...enjoy the weather this weekend...then refuel with La Salita!"
  • This is by far the messiest dataset of the three.
  • When reviewing the random samples, I looked for data that is specific to Twitter (e.g. hashtags). The samples indicate that “@username” strings have already been removed; however, some Twitter-specific data remains, including:
  1. “RT” (indicating a Retweet)
  2. Hashtags (“#topic”)
  3. Emojis
  4. Text-type abbreviations (LOL, BTW, etc).
  5. Obscenities and offensive words
  • Twitter will require special attention during cleaning.

This review led to two decisions about the cleaning process, sketched in code below:

  1. The order of processing is critical. For instance, I need to remove hashtag strings before removing punctuation.
  2. Obscenities, offensive terms, and text-type abbreviations will be removed by adding them to a list of stopwords - terms that are removed from NLP data because they do not add context or meaning to keywords and sentences. A few examples are “the”, “or”, and “in”.
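A sketch of both points: hashtags (and retweet markers) are stripped before punctuation, and the standard stopword list is extended with text-type abbreviations and a profanity list (the profanity file name is a placeholder):

library(tm)

# Remove Twitter-specific strings BEFORE punctuation removal, so that
# "#topic" disappears entirely instead of leaving "topic" behind
twitter <- gsub("#\\S+", "", twitter)      # hashtags
twitter <- gsub("\\bRT\\b", "", twitter)   # retweet markers

# Extend the standard English stopword list with text-type abbreviations
# and obscenities (file name is hypothetical)
my_stopwords <- c(stopwords("en"), "lol", "btw", "omg",
                  readLines("profanity_list.txt"))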

Word Counts

In order to have an accurate model, I need to create a large vocabulary from these datasets. Using all of the data is prohibitive due to the memory and processing time it would require, so I must choose a suitable sample size from each dataset. To get an idea of what kind of numbers I’ll be working with, I ran some summary calculations on the data.

Line & Word Totals
dataset    NumLines    NumWords   AvgWordsPerLine   NumSentences
Blogs        899288    37546250                42        2375718
Twitter     2360148    30093372                13        3770155
News        1010242    34762395                34        2024588
Totals      4269678   102402017                24        8170461

  • It’s interesting to note that the blogs data has the fewest lines, yet the highest word total.
  • The fact that Twitter has the lowest word total was a little surprising. Twitter limits tweets to 140 characters, but with such a large number of lines in the dataset, I thought its word total would be closer to the others than it is.
  • I calculated the average number of words per line in case it’s useful when I decide how many lines to sample from each dataset.
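For reference, the counts in the table above can be computed along these lines with the stringi package (a sketch, run once per dataset):

library(stringi)

# Lines, words, average words per line, and sentences for one dataset
summarize_text <- function(x) {
  words <- sum(stri_count_words(x))
  c(NumLines        = length(x),
    NumWords        = words,
    AvgWordsPerLine = round(words / length(x)),
    NumSentences    = sum(stri_count_boundaries(x, type = "sentence")))
}
summarize_text(blogs)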

Cleaning the Data

I decided to clean each data set in its entirety before pulling samples to use in my model. First, I removed hashtag strings from the Twitter data. Then, all three data sets went through the following steps (sketched in code after the list):

  1. Converted all to lower case
  2. Removed numbers
  3. Removed stopwords
  4. Removed punctuation (keeping intra-word dashes)
  5. Removed single character words
  6. Stripped whitespaces that resulted from word removal
  7. Removed spaces at the start of lines
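A sketch of this pipeline with the tm package, following the step order above (it assumes the data has been wrapped in a corpus and reuses the my_stopwords list from earlier):

library(tm)

corpus <- VCorpus(VectorSource(blogs))

corpus <- tm_map(corpus, content_transformer(tolower))        # 1. lower case
corpus <- tm_map(corpus, removeNumbers)                       # 2. numbers
corpus <- tm_map(corpus, removeWords, my_stopwords)           # 3. stopwords
corpus <- tm_map(corpus, removePunctuation,
                 preserve_intra_word_dashes = TRUE)           # 4. punctuation
corpus <- tm_map(corpus, content_transformer(
  function(x) gsub("\\b\\w\\b", "", x, perl = TRUE)))         # 5. single-char words
corpus <- tm_map(corpus, stripWhitespace)                     # 6. extra whitespace
corpus <- tm_map(corpus, content_transformer(
  function(x) sub("^\\s+", "", x)))                           # 7. leading spaces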

Comparing Cleaned Data to Raw Data

Let’s see how many words were removed during the cleaning process. First, I’d like to look at the same blog samples I reviewed before the cleaning process to get an idea of how the new data looks.

## [1] "parent trap hayley mills favorite since childhood"                                                                                                                                                                                                                                                           
## [2] "churches aside… read something yesterday got thinking individual believers– trusted christ salvation call followers jesus book barefoot church great read church’ mandate serve poor “ least ”– largely aren’ author brandon hatmaker spends little time unpacking one biggest causes lack attention mission"
## [3] "time passed"
Word Totals Post Processing
dataset      NumWords   CleanWords   Difference   AvgPerLine   CleanAvgPerLine
Blogs        37546250     19236864     18309386           42                21
Twitter      30093372     13448724     16644648           13                 6
News         34762395     20617263     14145132           34                20
Totals      102402017     53302851     49099166           24                12

The cleaning process removed approximately 49.1M words - roughly 48% of the raw total - from the data sets. That should mean the data I use for my model will contain a minimal amount of junk, and the vocabulary built from it will only contain terms that are helpful in training.

Prediction Algorithm

I researched a few NLP prediction models and have narrowed it down to two possibilities - Markov Chains and Word Vectors (word2vec).
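To make the Markov chain option concrete, here is a toy bigram sketch: the next word is predicted as the most frequent follower of the current word in the training text. The real model would be built from the sampled corpus, not this hand-made string:

# Toy bigram "Markov chain" on a tiny hand-made string
words   <- unlist(strsplit("the cat sat on the mat the cat ran", " "))
bigrams <- data.frame(w1 = head(words, -1), w2 = tail(words, -1),
                      stringsAsFactors = FALSE)

predict_next <- function(word) {
  followers <- bigrams$w2[bigrams$w1 == word]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))   # most frequent follower wins
}
predict_next("the")   # returns "cat" (follows "the" twice vs. "mat" once)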

I will randomly sample 2% to 5% of each cleaned data set. Blog data will be represented the most in my samples, and the smallest portion will come from Twitter. The samples will be combined into one data frame. That data frame will be split into training (40%) and testing (60%) sets. I will test both models with the same data and choose the one with the best overall performance (a combination of speed, memory, and accuracy).
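Sketched in code, with the cleaned datasets as hypothetical *_clean vectors and illustrative sampling fractions within the stated 2% to 5% range:

set.seed(1234)

# Sample a fraction of each cleaned dataset; blogs get the largest share
sample_frac <- function(x, frac) x[sample(length(x), size = floor(frac * length(x)))]
combined <- c(sample_frac(blogs_clean,   0.05),
              sample_frac(news_clean,    0.04),
              sample_frac(twitter_clean, 0.02))

# Split into training (40%) and testing (60%) sets, per the plan above
train_idx <- sample(length(combined), size = floor(0.40 * length(combined)))
training  <- combined[train_idx]
testing   <- combined[-train_idx]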