Natural Language Processing (NLP) is a branch of artificial intelligence that aims to teach computers to process human language and understand its full meaning (e.g. sentiment and context). For this Capstone project, I will be using NLP methods to create a predictive text model similar to those used by the intelligent keyboards found on mobile phones. This milestone report covers the analysis and pre-processing of the raw text files that will feed the predictive text model.
The raw text data was provided by SwiftKey, a company that develops intelligent keyboard apps. The data includes files from blogs, news, and Twitter in four languages, including English. The files were downloaded from a link provided in the Coursera course.
To get a feel for the data I was working with, I generated several random samples of each set. This quick scan of the raw data was crucial in determining my next steps. First, it highlighted the differences in language structure between the datasets. Second, it raised questions that needed to be answered before cleaning the data. Finally, it gave me a list of things to check for after the cleaning process. Samples from the blog, news, and Twitter files are shown below, in that order.
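The samples were pulled with nothing more elaborate than base R's `readLines` and `sample`; a minimal sketch is below, where the file paths are assumptions based on the standard SwiftKey download layout and `set.seed` is only there to make the samples reproducible.

```r
# Pull a few random lines from each raw file for a quick visual inspection.
# File paths are assumed; adjust to wherever the SwiftKey download lives.
set.seed(1234)

inspect_sample <- function(path, n = 3) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

inspect_sample("final/en_US/en_US.blogs.txt")
inspect_sample("final/en_US/en_US.news.txt")
inspect_sample("final/en_US/en_US.twitter.txt", n = 5)
```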
## [1] "10. The Parent Trap (Hayley Mills) - My favorite since childhood."
## [2] "Churches aside…I read something yesterday which got me thinking about individual believers–those that have trusted Christ for salvation, and call ourselves Followers of Jesus. In his book, Barefoot Church, (a great read about the church’s mandate to serve the poor and “the least of these”–and that we largely aren’t doing it), author Brandon Hatmaker spends a little time unpacking one of the biggest causes of that lack of attention to the mission:"
## [3] "Time passed."
## [1] "A book by Jean Twenge, \u0093Generation Me,\u0094 suggests people in their 20s grew up being fed unrealistic expectations by their parents and media. Nearly three in five high school graduates, for instance, expect to finish not just college but graduate school. In reality, by age 35 fewer than one in 10 of us get that extra sheepskin."
## [2] "The A's have nothing permanent. The Giants have almost everything, as they helpfully noted in their statement."
## [3] "Pletcher said she was such a novice that she showed up for her first class with one of those watercolor paint boxes kids use."
## [1] "He Gone"
## [2] "I thought you felt like a ROCK STAR every day."
## [3] "LOL yeah i know(;"
## [4] "Look what you did with your browords The word \"Bromance\" has been added to the new Merriam-Webster's Dictionary."
## [5] "Have a great Friday everyone! Get outside...enjoy the weather this weekend...then refuel with La Salita!"
1) The order of processing is critical. For instance, I need to remove hashtag strings before removing punctuation (a short illustration follows this list).
2) Obscenities, offensive terms, and text-type abbreviations will be removed by adding them to a list of stopwords - terms that are removed from NLP data because they add no context or meaning to keywords and sentences. A few examples are “the”, “or”, and “in”.
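To make the ordering point concrete, here is a small illustration (not my actual cleaning code): if punctuation is stripped first, the `#` disappears and the hashtag's text survives as an ordinary word, whereas removing the whole hashtag string first drops it entirely.

```r
tweet <- "Have a great Friday everyone! #rockstar"

# Punctuation first: "#" is stripped and "rockstar" survives as a word.
gsub("#\\S+", "", gsub("[[:punct:]]+", " ", tweet))

# Hashtag strings first: the entire tag is gone before punctuation is removed.
gsub("[[:punct:]]+", " ", gsub("#\\S+", "", tweet))
```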
In order to have an accurate model, I need to create a large vocabulary from these datasets. Using all of the data is prohibitive due to the memory and processing time it would require, so I must choose a suitable sample size from each dataset. To get an idea of what kind of numbers I’ll be working with, I ran some summary calculations on the data.
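A base-R sketch of the kind of counting involved is shown below; the sentence count is a rough split on terminal punctuation, and `blogs`, `twitter`, and `news` are assumed to be character vectors read in with `readLines`. The resulting summary is in the table that follows.

```r
# Per-dataset summary: line count, word count, average words per line,
# and an approximate sentence count based on terminal punctuation.
summarize_text <- function(lines) {
  words_per_line <- lengths(strsplit(lines, "\\s+"))
  sentence_hits  <- gregexpr("[.!?]+", lines)
  data.frame(
    NumLines       = length(lines),
    NumWord        = sum(words_per_line),
    WordperLineAvg = round(mean(words_per_line)),
    NumSentences   = sum(sapply(sentence_hits, function(m) sum(m > 0)))
  )
}

# rbind(summarize_text(blogs), summarize_text(twitter), summarize_text(news))
```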
| dataset | NumLines | NumWord | WordperLineAvg | NumSentences |
|---|---|---|---|---|
| Blogs | 899288 | 37546250 | 42 | 2375718 |
| Twitter | 2360148 | 30093372 | 13 | 3770155 |
| News | 1010242 | 34762395 | 34 | 2024588 |
| Totals | 4269678 | 102402017 | 24 | 8170461 |
I decided to clean each data set in its entirety before pulling samples to use in my model. First, I removed hashtag strings from the Twitter data. Then, all three data sets went through the same cleaning steps: converting to lowercase, removing numbers and punctuation, removing stopwords (including the obscenity and abbreviation lists noted above), and collapsing extra whitespace.
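A minimal sketch of that pipeline, assuming the `tm` package for `stopwords()` and `removeWords()`, a hypothetical `custom_terms` vector holding the obscenity and abbreviation list, and `blogs`, `news`, and `twitter` as the raw character vectors read earlier, might look like this:

```r
library(tm)  # assumed; supplies stopwords() and removeWords()

# Hypothetical placeholder for the obscenity / abbreviation list.
custom_terms <- c("lol", "omg")

clean_text <- function(lines) {
  x <- tolower(lines)
  x <- gsub("[[:digit:]]+", "", x)   # numbers
  x <- gsub("[[:punct:]]+", " ", x)  # punctuation
  x <- removeWords(x, c(stopwords("en"), custom_terms))
  gsub("\\s+", " ", trimws(x))       # collapse extra whitespace
}

# Hashtag strings come out of the Twitter lines first, then all three
# data sets go through the shared cleaning steps.
clean_twitter <- clean_text(gsub("#\\S+", "", twitter))
clean_blogs   <- clean_text(blogs)
clean_news    <- clean_text(news)
```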
Let’s see how many words were removed during the cleaning process. First, I’d like to revisit the same blog samples I looked at earlier to get an idea of how the cleaned data reads.
## [1] "parent trap hayley mills favorite since childhood"
## [2] "churches aside… read something yesterday got thinking individual believers– trusted christ salvation call followers jesus book barefoot church great read church’ mandate serve poor “ least ”– largely aren’ author brandon hatmaker spends little time unpacking one biggest causes lack attention mission"
## [3] "time passed"
| dataset | NumWord | CleanWords | Difference | AvgPerLine | CleanAvgPerLine |
|---|---|---|---|---|---|
| Blogs | 37546250 | 19236864 | 18309386 | 42 | 21 |
| Twitter | 30093372 | 13448724 | 16644648 | 13 | 6 |
| News | 34762395 | 20617263 | 14145132 | 34 | 20 |
| Totals | 102402017 | 53302851 | 49099166 | 24 | 12 |
The cleaning process removed approximately 49.1M words from the data sets. That should mean the data I use for my model will have a minimal amount of junk, and the vocabulary created using the data will only contain terms that are helpful in training it.
I researched a few NLP prediction models and have narrowed the choice down to two possibilities - Markov chains and word vectors (word2vec).
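To give a feel for the Markov-chain option, here is a minimal, illustrative bigram sketch: it records which word follows which in the training text and predicts the most frequent follower. All names are hypothetical, and a real model would use higher-order n-grams with smoothing and backoff; a word2vec model would instead learn dense vector representations of words (e.g. via a package such as `text2vec`).

```r
# Minimal bigram "Markov chain": predict the next word from the current one.
# Line boundaries are ignored here for brevity.
build_bigrams <- function(lines) {
  words <- unlist(strsplit(lines, "\\s+"))
  words <- words[words != ""]
  data.frame(word      = head(words, -1),
             next_word = tail(words, -1),
             stringsAsFactors = FALSE)
}

predict_next <- function(bigrams, current) {
  followers <- bigrams$next_word[bigrams$word == current]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))  # most frequent follower wins
}

bigrams <- build_bigrams(c("time passed", "time flies", "time passed slowly"))
predict_next(bigrams, "time")  # "passed"
```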
I will randomly sample 2% to 5% of each cleaned data set. Blog data will be represented the most in my samples, and the smallest portion will come from Twitter. The samples will be combined into one data frame. That data frame will be split into training (40%) and testing (60%) sets. I will test both models with the same data and choose the one with the best overall performance (a combination of speed, memory, and accuracy).
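A rough sketch of that sampling and splitting plan is below; the sampling rates, object names, and split are illustrative stand-ins for the values I will settle on, and `clean_blogs`, `clean_news`, and `clean_twitter` are assumed to be the cleaned character vectors from above.

```r
set.seed(42)  # reproducible samples

# Sample a different fraction of each cleaned set (rates chosen so blogs
# contribute the most text and Twitter the least).
take <- function(lines, rate) sample(lines, size = round(length(lines) * rate))

combined <- data.frame(
  text = c(take(clean_blogs,   0.05),
           take(clean_news,    0.04),
           take(clean_twitter, 0.02)),
  stringsAsFactors = FALSE
)

# Split into training (40%) and testing (60%) sets.
train_idx <- sample(nrow(combined), size = round(0.4 * nrow(combined)))
training  <- combined[train_idx,  , drop = FALSE]
testing   <- combined[-train_idx, , drop = FALSE]
```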