Unit 1

Task 0

Tasks to accomplish

Obtaining the data:
- download the data into R;
- load and manipulate the data in R.

Capstone Dataset is the training data that will be the basis for most of the capstone. The original exploration of the data and modeling steps will be performed on this data set.

The data sets were downloaded locally and unzipped into the “final” folder.

Questions to consider

What do the data look like? Working directory information (file names and sizes):

path	size
./final/en_US/en_US.blogs.txt	200M
./final/en_US/en_US.news.txt	196M
./final/en_US/en_US.twitter.txt	159M
./final/en_US/sample	0
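A listing like the one above can be produced with base R. The following is a minimal sketch, not the code used for this report: the helper name `list_file_sizes` is hypothetical, and the demo runs on a temporary folder rather than on the real “final” folder.

```r
# Hypothetical helper: list every file under `dir` with its size in megabytes.
list_file_sizes <- function(dir) {
  paths <- list.files(dir, recursive = TRUE, full.names = TRUE)
  data.frame(
    path = paths,
    size_mb = round(file.info(paths)$size / 1024^2, 1),
    stringsAsFactors = FALSE
  )
}

# Demo on a temporary folder so the sketch runs anywhere.
demo_dir <- file.path(tempdir(), "final_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("one line", "another line"), file.path(demo_dir, "demo.txt"))

sizes <- list_file_sizes(demo_dir)
print(sizes)
```

For the real data the call would be `list_file_sizes("./final/en_US")`, assuming the archive was unzipped as described above.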

Where do the data come from? The test data set was downloaded from the link provided in the project description.

Can you think of any other data sources that might help you in this project? Useful links:
* Natural language processing Wikipedia page
* Text mining infrastructure in R
* CRAN Task View: Natural Language Processing

What are the common steps in natural language processing? The common steps in NLP:
1. Importing
2. Cleaning, Preprocessing
3. Representing, filtering, weighting
4. Analysing

Task 1

Tasks to accomplish

Tips, tricks, and hints

Loading the data in. The dataset used in the project is fairly large. Initially we are using a smaller subset of the data.

Loading the first 3 lines of en_US.blogs.txt to test the connection and inspect the content:

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## [2] We love you Mr. Brown.
## [3] Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
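Reading only the first few lines through a connection can be sketched in base R as follows. This is an illustration under assumptions, not the report's own code: the function name `read_first_lines` is hypothetical, the files are assumed to be UTF-8 encoded, and the demo uses a temporary file instead of en_US.blogs.txt.

```r
# Hypothetical helper: open a connection, read the first `n` lines, close it.
read_first_lines <- function(path, n = 3) {
  con <- file(path, open = "r")
  on.exit(close(con))  # always release the connection, even on error
  # Assumption: the text is UTF-8, so the strings are marked accordingly.
  readLines(con, n = n, warn = FALSE, encoding = "UTF-8")
}

# Demo on a temporary file.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line", "fourth line"), demo_path)

first_lines <- read_first_lines(demo_path)
print(first_lines)
```

Reading a handful of lines this way avoids loading a file of roughly 200M just to check that the path and encoding are right.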

Tokenization is the process of splitting a text into tokens, that is, identifying appropriate tokens such as words, punctuation, and numbers. To obtain general information about the data sets, we use a custom function that takes a file as input, tokenizes it, and returns:
* count of lines in the text file;
* count of words in the file;
* count of sentences;
* count of punctuation characters;
* count of numbers.

Object’s name	lines_count	sentences_count	words_count	non-word_count	numbers_count
blogs.txt	899288	2029113	38154238	38601176	494878
news.txt	77259	142759	2693898	2755796	82852
twitter.txt	2360148	2583764	30218125	31130580	582533
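Counts of this kind can be approximated with regular expressions in base R. The sketch below is not the custom function used for the table. The name `text_stats` and all patterns are assumptions, and the sentence count is a crude heuristic that counts runs of `.`, `!` and `?`.

```r
# Hypothetical helper: basic counts for a character vector of text lines.
text_stats <- function(lines) {
  # Count all regex matches across every line (gregexpr returns -1 for no match).
  count_matches <- function(pattern) {
    matches <- gregexpr(pattern, lines, perl = TRUE)
    sum(vapply(matches, function(m) sum(m > 0), integer(1)))
  }
  data.frame(
    lines_count       = length(lines),
    sentences_count   = count_matches("[.!?]+"),         # runs of sentence terminators
    words_count       = count_matches("[[:alpha:]']+"),  # alphabetic tokens
    punctuation_count = count_matches("[[:punct:]]"),    # single punctuation characters
    numbers_count     = count_matches("[0-9]+")          # runs of digits
  )
}

demo_lines <- c("Hello world. How are you?", "I have 2 cats!")
stats <- text_stats(demo_lines)
print(stats)
```

Different tokenization rules (for example, how abbreviations such as “Mr.” are treated) would give different counts, so the numbers from this sketch are not expected to reproduce the table exactly.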

Sampling. Often a relatively small number of randomly selected rows or chunks is enough to get an accurate approximation to the results that would be obtained using all the data. We created a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. The sample file is stored so that it does not have to be recreated every time. In our sample data set we used:
* 0.5% from en_US.blogs.txt ~ 4K;
* 10% from en_US.news.txt ~ 8K;
* 0.2% from en_US.twitter.txt ~ 4K.
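The sampling step can be sketched in base R as shown here. This is an illustration, not the report's code: the function name `sample_lines`, the seed, and the output file name are assumptions, and the demo uses synthetic lines in place of the real files.

```r
# Hypothetical helper: keep a random fraction of the lines, in original order.
sample_lines <- function(lines, fraction, seed = 2019) {
  set.seed(seed)  # fixed seed (an assumption) so the sample is reproducible
  n_keep <- round(length(lines) * fraction)
  keep <- sort(sample.int(length(lines), size = n_keep))
  lines[keep]
}

# Demo: 0.5% of 1000 synthetic lines gives 5 lines.
all_lines <- paste("line", 1:1000)
blog_sample <- sample_lines(all_lines, fraction = 0.005)

# Store the sample once so it does not have to be recreated on every run.
sample_path <- file.path(tempdir(), "sample_demo.txt")
if (!file.exists(sample_path)) writeLines(blog_sample, sample_path)
```

Using a different fraction per source, as in the list above, keeps the three sources at a comparable number of lines despite their very different sizes.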

Sample data set cleaning. Profanity filtering: words from the swearWords and bad-words lists were removed, since these are words we do not want to predict.
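A token-level profanity filter can be sketched in a few lines of base R. The function name `remove_banned` and the placeholder word list below are assumptions for illustration. The actual swearWords and bad-words lists are not reproduced here.

```r
# Hypothetical helper: drop every token that appears in a banned-word list.
remove_banned <- function(tokens, banned) {
  tokens[!tolower(tokens) %in% tolower(banned)]
}

# Placeholder list standing in for the real profanity lists.
banned_words <- c("heck", "darn")
tokens <- c("what", "the", "Heck", "is", "this")

filtered <- remove_banned(tokens, banned_words)
removed_count <- length(tokens) - length(filtered)
print(filtered)
```

Comparing token counts before and after the filter gives the number of removed words directly.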

The sample data contained 40043 words. After the profanity filter was applied, 40031 words were left, so 12 words were removed from the sample data set.

Data cleaning steps:
- Convert the text to lower case;
- Remove numbers;
- Remove common English stopwords;
- Remove punctuation;
- Eliminate extra white spaces;
- Remove single letter words;
- Remove words with 3 or more repeated letters;
- Text lemmatization (which differs from stemming in that it takes the morphological analysis of the words into consideration).

After the data cleaning procedures were applied, 29660 words were left in the data set.
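Most of the cleaning steps listed above can be sketched with base R string functions. This is a simplified illustration, not the pipeline used for the report. The function name `clean_text` and the short stopword list are assumptions, “3 or more repeated letters” is read here as three or more identical consecutive letters, and lemmatization is left out because it needs a lemma dictionary.

```r
# Hypothetical helper: apply the cleaning steps (except lemmatization) in order.
clean_text <- function(x,
                       stopwords = c("the", "a", "an", "and", "of",
                                     "to", "is", "in", "it")) {
  x <- tolower(x)                                    # 1. lower case
  x <- gsub("[0-9]+", " ", x)                        # 2. remove numbers
  stop_pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, " ", x, perl = TRUE)       # 3. remove stopwords
  x <- gsub("[[:punct:]]+", " ", x)                  # 4. remove punctuation
  x <- trimws(gsub("[[:space:]]+", " ", x))          # 5. collapse extra white space
  tokens <- unlist(strsplit(x, " ", fixed = TRUE))
  tokens <- tokens[nchar(tokens) > 1]                # 6. drop single-letter words
  tokens[!grepl("(.)\\1{2,}", tokens, perl = TRUE)]  # 7. drop words like "sooooo"
}

cleaned <- clean_text("The kids   played 2 games, and it was sooooo cool!")
print(cleaned)
```

A real stopword list is much longer than the placeholder used here, so the share of words removed by this sketch will differ from the figures reported above.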

Unit 2 Exploratory Data Analysis

Tasks to accomplish

Exploratory analysis: we performed an exploratory analysis of the data to understand the distribution of words and the relationships between words in the sample text. To understand the frequencies of words and word pairs, we built figures and tables that demonstrate these frequencies in the data.

Questions to consider

Some words are more frequent than others - what are the distributions of word frequencies?
The top 10 most frequent words in the sample dataset:

word	freq
say	2981
good	1889
get	1681
will	1656
one	1477
make	1378
time	1197
just	1181
year	1162
like	1154
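A frequency table of this shape can be built from a vector of cleaned tokens with `table()`. The sketch below is illustrative: the function name `word_freq` is hypothetical and the token vector is a toy example, not the sample data.

```r
# Hypothetical helper: word counts sorted from most to least frequent.
word_freq <- function(tokens) {
  counts <- sort(table(tokens), decreasing = TRUE)
  data.frame(
    word = names(counts),
    freq = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

demo_tokens <- c("say", "good", "say", "get", "say", "good")
freq_table <- word_freq(demo_tokens)
print(head(freq_table, 10))  # top 10 rows (fewer if the vocabulary is smaller)
```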

What are the frequencies of 2-grams and 3-grams in the dataset?
The top 10 most frequent 2-gram tokens in the sample dataset:

word	freq
last year	153
new york	112
year ago	92
high school	90
right now	90
feel like	84
look like	84
last week	77
st louis	74
new jersey	67

The top 10 most frequent 3-gram tokens in the sample dataset:

word	freq
new york city	21
two year ago	12
happy new year	11
st louis county	11
world war ii	9
five year ago	8
let us know	8
two week ago	8
assistant us attorney	7
new year eve	7
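The 2-gram and 3-gram counts can be sketched by sliding a window of n tokens over the cleaned text. The function name `make_ngrams` is an assumption and the token vector is a toy example. In practice the window would probably be applied per line, so that n-grams do not span two unrelated documents.

```r
# Hypothetical helper: all n-grams from a token vector, as space-separated strings.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

demo_tokens <- c("new", "york", "city", "new", "york")
bigrams  <- make_ngrams(demo_tokens, 2)
trigrams <- make_ngrams(demo_tokens, 3)

# Frequency of each 2-gram, most frequent first.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```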

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

In the sample data set, 626 unique words are needed to cover 50% of all word instances and 8472 unique words are needed to cover 90%.
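Coverage figures of this kind follow from the cumulative share of the sorted word frequencies. A minimal sketch, with a hypothetical function name `words_for_coverage` and toy frequencies rather than the sample data:

```r
# Hypothetical helper: how many of the most frequent words are needed
# so that their combined count reaches `coverage` of all word instances.
words_for_coverage <- function(freq, coverage) {
  freq <- sort(freq, decreasing = TRUE)
  cumulative_share <- cumsum(freq) / sum(freq)
  which(cumulative_share >= coverage)[1]
}

# Toy frequencies: cumulative shares are 0.60, 0.85, 0.95, 1.00.
demo_freq <- c(60, 25, 10, 5)
n_half   <- words_for_coverage(demo_freq, 0.5)
n_ninety <- words_for_coverage(demo_freq, 0.9)
print(c(n_half, n_ninety))
```

Because word frequencies are heavily skewed, the number of words needed grows much faster than the coverage target, which matches the jump from 626 to 8472 words reported above.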

Next steps:

Unit 3 Modeling

Tasks to accomplish

Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
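As a starting point for both tasks, the sketch below builds a bigram model and backs off to the most frequent single word when the previous word was never observed. It illustrates the idea only and is not the model built for this project. The names `build_bigram_model` and `predict_next` are hypothetical, and the dense count table would be too large for the full corpus.

```r
# Hypothetical helper: bigram and unigram counts from a token vector.
build_bigram_model <- function(tokens) {
  first  <- tokens[-length(tokens)]  # every token except the last
  second <- tokens[-1]               # the token that follows it
  list(
    bigrams  = table(first, second),
    unigrams = sort(table(tokens), decreasing = TRUE)
  )
}

# Hypothetical helper: most likely next word, with backoff for unseen words.
predict_next <- function(model, previous) {
  if (previous %in% rownames(model$bigrams)) {
    followers <- model$bigrams[previous, , drop = FALSE]
    return(colnames(followers)[which.max(followers)])
  }
  names(model$unigrams)[1]  # backoff: most frequent word overall
}

demo_tokens <- c("new", "york", "city", "new", "york", "times",
                 "new", "jersey", "new", "york")
model <- build_bigram_model(demo_tokens)

seen_prediction   <- predict_next(model, "new")    # "new" was followed by "york" 3 times
unseen_prediction <- predict_next(model, "zebra")  # never observed, so back off
print(c(seen_prediction, unseen_prediction))
```

The same pattern extends to longer contexts: try the 3-gram table first, then the 2-gram table, then single words.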

Questions to consider

How can you efficiently store an n-gram model (think Markov Chains)?
How can you use the knowledge about word frequencies to make your model smaller and more efficient?
How many parameters do you need (i.e. how big is n in your n-gram model)?
Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data)?
How do you evaluate whether your model is any good?
How can you use backoff models to estimate the probability of unobserved n-grams?

Text mining in R

Nadia Stavisky

03 November, 2019
