Unit 1

Task 0

Tasks to accomplish

Obtaining the data:
- download the data;
- load and manipulate the data in R.

The Capstone Dataset is the training data that will be the basis for most of the capstone. The initial exploration of the data and the modeling steps will be performed on this data set.

The data sets were downloaded locally and unzipped into the “final” folder.

Questions to consider

What do the data look like?
Working directory information (file names and sizes):
path size
./final/en_US/en_US.blogs.txt 200M
./final/en_US/en_US.news.txt 196M
./final/en_US/en_US.twitter.txt 159M
./final/en_US/sample 0

Where do the data come from?
The data set was downloaded from the link provided in the project description.

Can you think of any other data sources that might help you in this project?
Useful links:
- Natural language processing Wikipedia page
- Text mining infrastructure in R
- CRAN Task View: Natural Language Processing

What are the common steps in natural language processing?
The common steps in NLP:
1. Importing
2. Cleaning, Preprocessing
3. Representing, filtering, weighting
4. Analysing

Task 1

Tasks to accomplish

Tips, tricks, and hints

Loading the data in. The dataset used in the project is fairly large. Initially we are using a smaller subset of the data.

Loading the first 3 lines of en_US.blogs.txt to test the connection and inspect the content:

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## [2] We love you Mr. Brown.
## [3] Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
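The garbled quotation marks that can show up in output like this usually point to an encoding mismatch when the file is read. A minimal sketch of reading the first few lines through an explicit connection, assuming the files are UTF-8 encoded; `read_first_lines` and the temporary demo file are hypothetical stand-ins, not the code used for this report:

```r
# Hypothetical helper: read the first n lines of a text file through an
# explicit connection and mark the strings as UTF-8.
read_first_lines <- function(path, n = 3) {
  con <- file(path, open = "rb")   # binary mode: no platform-specific text translation
  on.exit(close(con))              # always release the connection
  readLines(con, n = n, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a temporary file standing in for en_US.blogs.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line", "fourth line"), tmp)
head_lines <- read_first_lines(tmp, n = 3)
print(head_lines)
```

For the real data, `tmp` would be replaced by the path to the file inside the “final” folder.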

Tokenization is the process of splitting a text into tokens, that is, identifying appropriate tokens such as words, punctuation, and numbers. To obtain general information about the data sets, we use a custom function that takes a file as input, tokenizes it, and returns:
* count of lines in the text file;
* count of words in the file;
* count of sentences;
* count of punctuation characters;
* count of numbers.
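The counts listed above can be approximated with regular expressions in base R. The sketch below is an assumption about how such a function might look, not the custom function used for the table; `count_matches` and `text_stats` are hypothetical names, and the sentence rule (a run of `.`, `!` or `?` followed by whitespace or end of line) is deliberately crude and will over-count abbreviations such as "Mr.".

```r
# Hypothetical helper: total number of regex matches across all lines.
count_matches <- function(x, pattern) {
  sum(lengths(regmatches(x, gregexpr(pattern, x))))
}

# Hypothetical helper: summary counts for a character vector of lines.
text_stats <- function(lines) {
  c(
    lines_count     = length(lines),
    sentences_count = count_matches(lines, "[.!?]+([[:space:]]|$)"),
    words_count     = count_matches(lines, "[[:alpha:]]+"),
    punct_count     = count_matches(lines, "[[:punct:]]"),
    numbers_count   = count_matches(lines, "[0-9]+")
  )
}

demo_lines <- c("Hello world! It is 2024.", "Two cats, 3 dogs.")
stats <- text_stats(demo_lines)
print(stats)
```

On a full file the same function would be applied to the vector returned by `readLines`.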

Object’s name lines_count sentences_count words_count non-word_count numbers_count
blogs.txt 899288 2029113 38154238 38601176 494878
news.txt 77259 142759 2693898 2755796 82852
twitter.txt 2360148 2583764 30218125 31130580 582533

Sampling. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation of the results that would be obtained using all the data. We created a separate sub-sample data set by reading in a random subset of the original data and writing it out to a separate file. The sample file is stored so that it does not have to be recreated every time. In our sample data set we used:
* 0.5% from en_US.blogs.txt ~ 4K;
* 10% from en_US.news.txt ~ 8K;
* 0.2% from en_US.twitter.txt ~ 4K.
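A minimal sketch of this kind of sampling, assuming the lines of a file are already in a character vector. `sample_lines`, the seed, and the output path are hypothetical choices for illustration; the fraction mirrors the 0.5% used for the blogs file.

```r
# Hypothetical helper: draw a reproducible random subset of lines,
# keeping the selected lines in their original order.
sample_lines <- function(lines, fraction, seed = 1234) {
  set.seed(seed)                                   # assumed seed, for reproducibility
  n_keep <- max(1, round(fraction * length(lines)))
  idx <- sort(sample.int(length(lines), size = n_keep))
  lines[idx]
}

all_lines <- sprintf("line %d", 1:10000)           # stand-in for a full text file
sampled <- sample_lines(all_lines, fraction = 0.005)

# Store the sample once so that it does not have to be recreated every time.
sample_path <- file.path(tempdir(), "sample_blogs.txt")  # assumed location
if (!file.exists(sample_path)) writeLines(sampled, sample_path)
length(sampled)
```

Fixing the seed makes the stored sample reproducible if the file ever has to be regenerated.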

Sample data set cleaning:
- removed swearWords and bad-words (profanity filtering: removing profanity and other words we do not want to predict).

There were 40043 words in the sample data. After the profanity filters were applied, 40031 words were left, so 12 words were removed from the sample data set.

Data cleaning steps:
- Convert the text to lower case;
- Remove numbers;
- Remove common English stopwords;
- Remove punctuation;
- Eliminate extra white spaces;
- Remove single letter words;
- Remove words with 3 or more repeated letters;
- Text lemmatization (different from stemming in that it takes the morphological analysis of the words into consideration).
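Most of the cleaning steps above can be sketched in base R. This is an illustration under stated assumptions rather than the pipeline used for the report: `clean_text` is a hypothetical function, `stop_words` is a tiny stand-in for a full stopword list (a profanity list could be passed the same way), punctuation is removed before stopwords for simplicity, and lemmatization is left out because it needs a dictionary or a dedicated package.

```r
# Tiny stand-in list; a real run would use a full English stopword list.
stop_words <- c("the", "a", "an", "and", "is", "of", "to", "in")

# Hypothetical helper: clean a character vector of lines.
clean_text <- function(x, remove_words = stop_words) {
  x <- tolower(x)                          # convert to lower case
  x <- gsub("[0-9]+", " ", x)              # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)        # remove punctuation
  tokens <- strsplit(x, "[[:space:]]+")    # split on runs of white space
  tokens <- lapply(tokens, function(w) {
    w <- w[nchar(w) > 1]                   # drop empty strings and single-letter words
    w <- w[!w %in% remove_words]           # drop stopwords (or profanity)
    w[!grepl("(.)\\1\\1", w, perl = TRUE)] # drop words with a character repeated 3+ times
  })
  vapply(tokens, paste, character(1), collapse = " ")  # rejoin with single spaces
}

cleaned <- clean_text("The 3 cats, a dog and I said: Sooooo goooood!!!")
print(cleaned)
```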

After the data cleaning procedures were applied, 29660 words were left in the data set.

Unit 2 Exploratory Data Analysis

Tasks to accomplish

Exploratory analysis - we performed an exploratory analysis of the data to understand the distribution of words and the relationships between words in the sample text. To understand the frequencies of words and word pairs, we built figures and tables that show these frequencies in the data.

Questions to consider

Some words are more frequent than others - what are the distributions of word frequencies?
The top 10 most frequent words in the sample data set:
word freq
say 2981
good 1889
get 1681
will 1656
one 1477
make 1378
time 1197
just 1181
year 1162
like 1154
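A word-frequency table of this shape can be built from a vector of cleaned tokens with `table()`. The sketch below uses a made-up token vector, so its counts are illustrative only and unrelated to the figures above.

```r
# Made-up tokens standing in for the cleaned sample data.
tokens <- c("say", "good", "say", "time", "good", "say")

# Count each distinct word and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)
top_words <- data.frame(
  word = names(freq),
  freq = as.integer(freq),
  stringsAsFactors = FALSE
)
print(head(top_words, 10))
```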

What are the frequencies of 2-grams and 3-grams in the dataset?
The top 10 most frequent 2-gram tokens in the sample data set:
word freq
last year 153
new york 112
year ago 92
high school 90
right now 90
feel like 84
look like 84
last week 77
st louis 74
new jersey 67

The top 10 most frequent 3-gram tokens in the sample data set:

word freq
new york city 21
two year ago 12
happy new year 11
st louis county 11
world war ii 9
five year ago 8
let us know 8
two week ago 8
assistant us attorney 7
new year eve 7
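For illustration, n-grams can be generated from a token vector by sliding a window of length n over it. `make_ngrams` is a hypothetical helper and the tokens are made up; in practice the window would be applied per line or per sentence so that n-grams do not span unrelated texts.

```r
# Hypothetical helper: all consecutive n-word sequences in a token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("new", "york", "city", "is", "in", "new", "york")
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

# Frequency table of 2-grams, most frequent first.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(bigram_freq)
```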

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

In the sample data set, 626 and 8472 unique words are needed to cover 50% and 90% of all word instances, respectively.
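Coverage figures of this kind come from the cumulative share of word instances in a frequency-sorted dictionary: sort the counts in decreasing order, accumulate them, and find the first position where the running share reaches the target. A small sketch with made-up counts; `words_for_coverage` is a hypothetical helper:

```r
# Hypothetical helper: number of most-frequent words needed to reach
# a given share of all word instances.
words_for_coverage <- function(freq, coverage) {
  freq <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)     # running share of all instances
  unname(which(cum_share >= coverage)[1])   # first rank reaching the target
}

# Made-up counts for five words (100 instances in total).
freq <- c(a = 40, b = 30, c = 15, d = 10, e = 5)
n50 <- words_for_coverage(freq, 0.5)
n90 <- words_for_coverage(freq, 0.9)
print(c(n50 = n50, n90 = n90))
```

Here the running shares are 0.40, 0.70, 0.85, 0.95 and 1.00, so two words cover 50% and four words cover 90%.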

Next steps:

Unit 3 Modeling

Tasks to accomplish

Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
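As a starting point for the first task, a bigram model can be stored as a table of (previous word, next word, count) rows, and the prediction is the most frequent follower of the last word typed. The sketch below is one possible shape under that assumption; `build_bigram_model` and `predict_next` are hypothetical names and the tokens are made up.

```r
# Hypothetical helper: count how often each word follows each other word.
build_bigram_model <- function(tokens) {
  pairs <- data.frame(
    prev = head(tokens, -1),   # every word except the last
    nxt  = tail(tokens, -1),   # the word that follows it
    n    = 1L,
    stringsAsFactors = FALSE
  )
  aggregate(n ~ prev + nxt, data = pairs, FUN = sum)
}

# Hypothetical helper: most frequent follower of `word`, or NA if unseen.
predict_next <- function(model, word) {
  cand <- model[model$prev == word, ]
  if (nrow(cand) == 0) return(NA_character_)
  as.character(cand$nxt[which.max(cand$n)])
}

tokens <- c("i", "love", "new", "york", "i", "love", "you", "new", "york", "city")
model <- build_bigram_model(tokens)
print(predict_next(model, "new"))
print(predict_next(model, "zzz"))
```

The `NA` returned for an unseen word is exactly the gap that the second task, handling unobserved n-grams, is meant to close.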

Questions to consider

How can you efficiently store an n-gram model (think Markov Chains)?
How can you use the knowledge about word frequencies to make your model smaller and more efficient?
How many parameters do you need (i.e. how big is n in your n-gram model)?
Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data)?
How do you evaluate whether your model is any good?
How can you use backoff models to estimate the probability of unobserved n-grams?
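One possible direction for the storage and backoff questions, sketched under assumptions: keep one count table per n-gram order, keyed by the n-gram string, optionally pruned with a minimum count to keep the model small, and at prediction time fall back from the longest matching context to shorter ones. This is a simplified backoff that returns only the top candidate and does not compute smoothed probabilities; `count_ngrams`, `predict_backoff`, and `min_count` are hypothetical names.

```r
# Hypothetical helper: table of n-gram counts, dropping rare n-grams.
count_ngrams <- function(tokens, n, min_count = 1) {
  starts <- seq_len(max(length(tokens) - n + 1, 0))
  grams <- vapply(starts,
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  tab <- table(grams)
  tab[tab >= min_count]
}

# Hypothetical helper: try the highest-order table first, then back off.
predict_backoff <- function(counts, context) {
  for (n in rev(seq_along(counts))) {
    tab <- counts[[n]]
    if (length(tab) == 0) next
    if (n == 1) return(names(tab)[which.max(tab)])   # unigram fallback
    if (length(context) < n - 1) next
    prefix <- paste0(paste(tail(context, n - 1), collapse = " "), " ")
    hits <- tab[startsWith(names(tab), prefix)]
    if (length(hits) > 0) {
      return(sub(".* ", "", names(hits)[which.max(hits)]))  # last word of best n-gram
    }
  }
  NA_character_
}

tokens <- c("i", "live", "in", "new", "york", "city", "you", "live", "in",
            "new", "york", "city", "we", "love", "new", "jersey")
counts <- lapply(1:3, function(n) count_ngrams(tokens, n))

p_tri <- predict_backoff(counts, c("love", "new"))    # seen as a trigram context
p_bi  <- predict_backoff(counts, c("visit", "new"))   # backs off to 2-grams
p_uni <- predict_backoff(counts, c("hello", "world")) # backs off to 1-grams
print(c(p_tri, p_bi, p_uni))
```

Raising `min_count` is one way to use word-frequency knowledge to shrink the tables, at the cost of backing off more often.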