Obtaining the data:
- download the data;
- load and manipulate the data in R.
The Capstone Dataset is the training data that will be the basis for most of the capstone. The initial exploration of the data and the modeling steps will be performed on this data set.
The data sets were downloaded locally and unzipped into the “final” folder.
| path | size |
|---|---|
| ./final/en_US/en_US.blogs.txt | 200M |
| ./final/en_US/en_US.news.txt | 196M |
| ./final/en_US/en_US.twitter.txt | 159M |
| ./final/en_US/sample | 0 |
Where do the data come from? The test data set was downloaded from the link provided in the project description.
Can you think of any other data sources that might help you in this project? Useful links:
- Natural language processing Wikipedia page;
- Text mining infrastructure in R;
- CRAN Task View: Natural Language Processing.
What are the common steps in natural language processing? The common steps in NLP:
1. Importing
2. Cleaning, Preprocessing
3. Representing, filtering, weighting
4. Analysing
Loading the data in. The dataset used in the project is fairly large. Initially we are using a smaller subset of the data.
Loading the first 3 lines of en_US.blogs.txt to test the connection and inspect the content:
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3
##
## [1] In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## [2] We love you Mr. Brown.
## [3] Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
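A minimal base R sketch of this kind of connection test is shown here. The report's own loading code is not shown, so the helper name `read_first_lines` and the temporary stand-in file are assumptions for illustration; the output above was additionally wrapped in a corpus object, which this sketch leaves out.

```r
# Hypothetical stand-in for ./final/en_US/en_US.blogs.txt so the sketch runs anywhere.
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line", "fourth line"), path)

# Hypothetical helper: open a connection, read the first n lines, close it again.
read_first_lines <- function(path, n = 3) {
  con <- file(path, open = "rb")   # binary mode avoids text-mode translation
  on.exit(close(con))              # always release the connection
  readLines(con, n = n, encoding = "UTF-8", warn = FALSE)
}

first_lines <- read_first_lines(path)
print(first_lines)
```

Declaring the encoding as UTF-8 when reading is one way to keep quotation marks and other non-ASCII characters from turning into escape sequences.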
Tokenization is the process of splitting a text into tokens, that is, identifying appropriate tokens such as words, punctuation, and numbers. To obtain general information about the data sets, we use a custom function that takes a file as input and returns a tokenized version of it, from which we compute:
* count of lines in the text file;
* count of words in the file;
* count of sentences;
* count of punctuation characters;
* count of numbers.
| File | lines_count | sentences_count | words_count | non-word_count | numbers_count |
|---|---|---|---|---|---|
| blogs.txt | 899288 | 2029113 | 38154238 | 38601176 | 494878 |
| news.txt | 77259 | 142759 | 2693898 | 2755796 | 82852 |
| twitter.txt | 2360148 | 2583764 | 30218125 | 31130580 | 582533 |
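The custom function itself is not shown in this report. As a rough sketch of how such counts could be produced with base R regular expressions, assuming a hypothetical helper `text_stats` and deliberately naive rules (a sentence is any run of `.`, `!` or `?` followed by whitespace or the end of the line, so abbreviations such as "Mr." are counted too):

```r
# Hypothetical helper: summary counts for a character vector of lines.
text_stats <- function(lines) {
  # Count all matches of a pattern across every line.
  count_matches <- function(pattern) {
    m <- gregexpr(pattern, lines, perl = TRUE)
    sum(vapply(m, function(x) sum(x > 0), integer(1)))
  }
  data.frame(
    lines_count       = length(lines),
    sentences_count   = count_matches("[.!?]+(\\s|$)"),  # naive sentence ends
    words_count       = count_matches("[[:alpha:]']+"),  # runs of letters
    punctuation_count = count_matches("[[:punct:]]"),    # single punctuation marks
    numbers_count     = count_matches("[0-9]+")          # runs of digits
  )
}

stats <- text_stats(c("We love you Mr. Brown.",
                      "He saved 25 dollars! Did he have enough?"))
print(stats)
```

The exact definitions used for the table above may differ, so the numbers from this sketch are not expected to reproduce it.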
Sampling. A relatively small number of randomly selected rows or chunks is often enough to get an accurate approximation of the results that would be obtained using all the data. We created a separate sub-sample data set by reading in a random subset of the original data and writing it out to a separate file. The sample file is stored so that it does not have to be recreated every time. In our sample data set we used:
* 0.5% of en_US.blogs.txt, ~ 4K lines;
* 10% of en_US.news.txt, ~ 8K lines;
* 0.2% of en_US.twitter.txt, ~ 4K lines.
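The sampling step described above could look roughly like the following base R sketch. The helper name `sample_lines`, the seed, the file name and the toy input are all assumptions for illustration, not the code used for this report.

```r
# Hypothetical helper: keep a random fraction of the lines, preserving their order.
sample_lines <- function(lines, fraction, seed = 1234) {
  set.seed(seed)                                   # reproducible sample
  n_keep <- round(length(lines) * fraction)
  keep <- sample.int(length(lines), size = n_keep)
  lines[sort(keep)]
}

# Toy stand-in for one of the source files.
all_lines <- paste("line", seq_len(1000))

# Store the sample once and reuse it on later runs.
sample_file <- file.path(tempdir(), "capstone_sample.txt")
if (!file.exists(sample_file)) {
  writeLines(sample_lines(all_lines, fraction = 0.1), sample_file)
}
sampled <- readLines(sample_file)
print(length(sampled))
```

Fixing the seed and caching the file means later analysis steps always see the same sample.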
Sample data set cleaning:
- removed swearWords and bad-words (profanity filtering: removing profanity and other words we do not want to predict).
There were 40043 words in the sample data; after the profanity filters were applied, 40031 words were left, so 12 words were removed from the sample data set.
Data cleaning steps:
- Convert the text to lower case;
- Remove numbers;
- Remove common English stopwords;
- Remove punctuation;
- Eliminate extra white spaces;
- Remove single letter words;
- Remove words with 3 or more repeated letters;
- Text lemmatization (different from stemming in that it takes the morphological analysis of the words into consideration).
After the data cleaning procedures were applied, 29660 words were left in the data set.
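Most of the cleaning steps listed above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the pipeline used here: `clean_text` is a hypothetical helper, the stop word and banned word lists are tiny placeholders, punctuation is stripped before the word lists are applied, and lemmatization is left out because it needs a dedicated package.

```r
# Hypothetical helper: apply the cleaning steps to a character vector of lines.
clean_text <- function(x, stop_words, banned_words) {
  x <- tolower(x)                        # convert to lower case
  x <- gsub("[0-9]+", " ", x)            # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)      # remove punctuation
  tokens <- strsplit(trimws(x), "\\s+")  # split on whitespace, dropping extra spaces
  vapply(tokens, function(words) {
    words <- words[!words %in% c(stop_words, banned_words)]   # stop words, profanity
    words <- words[nchar(words) > 1]                          # single letter words
    words <- words[!grepl("(.)\\1{2,}", words, perl = TRUE)]  # 3+ repeated letters
    paste(words, collapse = " ")
  }, character(1))
}

cleaned <- clean_text(
  "The  kids played Skylander 3 times, it was sooooo cool!!! A darn game.",
  stop_words   = c("the", "it", "was", "a"),  # placeholder stop word list
  banned_words = "darn"                       # placeholder profanity list
)
print(cleaned)
```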
Exploratory analysis - we performed an exploratory analysis of the data to understand the distribution of words and the relationships between words in the sample text. To understand the frequencies of words and word pairs, we built figures and tables that show those frequencies in the data.
The top 10 most frequent words from the sample data set:
| word | freq |
|---|---|
| say | 2981 |
| good | 1889 |
| get | 1681 |
| will | 1656 |
| one | 1477 |
| make | 1378 |
| time | 1197 |
| just | 1181 |
| year | 1162 |
| like | 1154 |
The top 10 most frequent 2-gram tokens from the sample data set:
| word | freq |
|---|---|
| last year | 153 |
| new york | 112 |
| year ago | 92 |
| high school | 90 |
| right now | 90 |
| feel like | 84 |
| look like | 84 |
| last week | 77 |
| st louis | 74 |
| new jersey | 67 |
The top 10 most frequent 3-gram tokens from the sample data set:
| word | freq |
|---|---|
| new york city | 21 |
| two year ago | 12 |
| happy new year | 11 |
| st louis county | 11 |
| world war ii | 9 |
| five year ago | 8 |
| let us know | 8 |
| two week ago | 8 |
| assistant us attorney | 7 |
| new year eve | 7 |
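Frequency tables like the ones above can be built by sliding a window of n words over the cleaned tokens and counting each combination. The sketch below uses base R only; `ngram_freq` is a hypothetical helper and the token vector is a toy example, so it is not the code that produced these tables.

```r
# Hypothetical helper: frequency table of n-grams from a vector of word tokens.
ngram_freq <- function(tokens, n) {
  if (length(tokens) < n) {
    return(data.frame(word = character(0), freq = integer(0)))
  }
  starts <- seq_len(length(tokens) - n + 1)
  # Paste each window of n consecutive tokens into one string.
  grams <- vapply(starts,
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(word = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

tokens <- c("new", "york", "city", "new", "york",
            "last", "year", "new", "york", "city")
bigrams  <- ngram_freq(tokens, 2)
trigrams <- ngram_freq(tokens, 3)
print(head(bigrams, 3))
```

For the full corpus a dedicated tokenizer would likely be faster, but the counting logic stays the same.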
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
Based on the sample data set, 626 and 8472 unique words are needed to cover 50% and 90% of all word instances, respectively.
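The coverage figures follow from sorting the word frequencies in decreasing order and finding the first position where the cumulative share reaches the target. A minimal sketch, assuming a hypothetical helper `words_for_coverage` and a toy frequency vector:

```r
# Hypothetical helper: number of most frequent words needed to reach a coverage share.
words_for_coverage <- function(freq, coverage) {
  freq <- sort(unname(freq), decreasing = TRUE)  # most frequent words first
  cum_share <- cumsum(freq) / sum(freq)          # cumulative share of word instances
  which(cum_share >= coverage)[1]                # first position reaching the target
}

toy_freq <- c(50, 30, 10, 5, 5)                  # toy counts for five distinct words
n50 <- words_for_coverage(toy_freq, 0.5)
n90 <- words_for_coverage(toy_freq, 0.9)
print(c(n50, n90))
```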
Next steps:
Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
How can you efficiently store an n-gram model (think Markov Chains)?
How can you use the knowledge about word frequencies to make your model smaller and more efficient?
How many parameters do you need (i.e. how big is n in your n-gram model)?
Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data)?
How do you evaluate whether your model is any good?
How can you use backoff models to estimate the probability of unobserved n-grams?
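As one possible starting point for these next steps, the sketch below predicts the next word from 2-gram counts and backs off to the most frequent single word when the previous word was never seen. It is an assumption about how the model might begin, not a finished design: `predict_next` and the table layout are hypothetical, and the counts are a handful of rows taken from the frequency tables above.

```r
# A few 2-gram counts from the exploratory tables, split into first and second word.
bigrams <- data.frame(
  first  = c("new", "new", "last", "last"),
  second = c("york", "jersey", "year", "week"),
  freq   = c(112, 67, 153, 77),
  stringsAsFactors = FALSE
)
# A few single word counts from the exploratory tables.
unigrams <- data.frame(
  word = c("say", "good", "get"),
  freq = c(2981, 1889, 1681),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent continuation, backing off to unigrams.
predict_next <- function(prev_word, bigrams, unigrams) {
  candidates <- bigrams[bigrams$first == prev_word, ]
  if (nrow(candidates) > 0) {
    return(candidates$second[which.max(candidates$freq)])  # best observed 2-gram
  }
  unigrams$word[which.max(unigrams$freq)]                  # unseen context: back off
}

print(predict_next("new", bigrams, unigrams))
print(predict_next("zebra", bigrams, unigrams))
```

The same pattern extends to longer contexts by trying 3-gram counts first, then 2-grams, then single words.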