This report aims to present the progress made in implementing a language model for text prediction and the results of the exploratory analysis of the data set.
The dataset used in this project was taken from HC Corpora (http://www.corpora.heliohost.org). The en_US subset of the dataset was chosen for this project. It includes blog text (en_US.blogs.txt), Twitter text (en_US.twitter.txt) and news text (en_US.news.txt).
The original dataset contains millions of lines of text. To process and explore the data within reasonable limits, a subset of each text file was extracted: 10% of the lines from each text file were used.
Initial inspection of the full dataset showed the following characteristics of the uncleaned data:
| Characteristic | en_US.blogs.txt | en_US.twitter.txt | en_US.news.txt |
|---|---|---|---|
| Number of lines | 899,288 | 2,360,148 | 1,010,242 |
| Number of words | 37,334,690 | 30,374,206 | 34,372,720 |
| Number of unique words | 253,042 | 212,227 | 302,652 |
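Counts like those in the table above can be computed with a few lines of code. The following is a minimal Python sketch, not the report's own code. It assumes that a "word" is any whitespace-separated token and that uniqueness is case-sensitive; the report does not state how it tokenized the uncleaned text, and `corpus_stats` is a hypothetical helper name.

```python
def corpus_stats(lines):
    """Return (line count, word count, unique word count) for raw text lines."""
    # Assumption: words are whitespace-separated tokens, compared case-sensitively.
    words = [word for line in lines for word in line.split()]
    return len(lines), len(words), len(set(words))


sample = ["The cat sat", "the cat ran away"]
print(corpus_stats(sample))
```

On the full files, the lines would be read from disk one file at a time, which keeps memory use proportional to a single source.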
Processing the original dataset takes a lot of time. For our purposes, only 10% of each source text was used as the working dataset for creating our language model.
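The report does not say how the 10% of lines was selected. As one possible approach, the Python sketch below draws a reproducible random sample without replacement. The function name `sample_lines`, the fixed seed and the use of random sampling are all assumptions made for illustration.

```python
import random


def sample_lines(lines, fraction=0.10, seed=42):
    """Return a random subset holding roughly `fraction` of the lines."""
    rng = random.Random(seed)        # fixed seed so the sample is reproducible
    k = int(len(lines) * fraction)   # number of lines to keep (rounded down)
    return rng.sample(lines, k)      # sampling without replacement


lines = [f"line {i}" for i in range(1000)]
subset = sample_lines(lines)
print(len(subset))
```

Fixing the seed means the same working dataset can be regenerated later when the model is rebuilt.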
In order to properly create a model for text prediction, there is a need to cleanse the data. The following are the steps taken to clean the data:
Combined all lines of text from each subset of text (twitter, news and blogs) into a single list.
Standardized the character used for single quotes and apostrophes. Apostrophes and single quotes are used differently in formal English grammar, but since we are dealing with free text, we assume that the characters were used interchangeably. A single character (the single quote) was adopted for all characters that look similar to an apostrophe or single quote. Examples are the grave accent (`), the left and right single quotes (‘’) and the prime character (′).
Standardized the character used for hyphens. Other characters that look like a hyphen but are not, including the em dash and the en dash, were all replaced with the hyphen character.
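The two character standardization steps above can be sketched with a single translation table. This is an illustrative Python sketch, not the report's implementation. The character sets contain only the examples named in the text, so the report's actual list may be longer.

```python
# Assumption: only the characters named in the text are covered here.
QUOTE_LIKE = "`\u2018\u2019\u2032"   # grave accent, left/right single quote, prime
DASH_LIKE = "\u2013\u2014"           # en dash, em dash

# Map every quote-like character to ' and every dash-like character to -
TRANSLATION = str.maketrans(
    {**{c: "'" for c in QUOTE_LIKE}, **{c: "-" for c in DASH_LIKE}}
)


def normalize_quotes_and_dashes(text):
    """Replace look-alike quote and dash characters with ' and -."""
    return text.translate(TRANSLATION)


print(normalize_quotes_and_dashes("don\u2019t stop \u2014 ever"))
```

A translation table handles all replacements in one pass over the text, which matters when the input has millions of lines.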
All acronyms, titles and abbreviations that end in a dot (.) were modified to remove the dots, e.g. N.B.A. was changed to NBA, and Mr. was changed to Mr (without the dot).
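One way to implement this step is with two regular expressions, one for dotted acronyms and one for a list of known titles. The Python sketch below is an assumption about how it could be done, not the report's code. The `TITLES` list is hypothetical and deliberately short.

```python
import re

TITLES = ("Mr", "Mrs", "Ms", "Dr", "Prof")  # hypothetical, illustrative list

# Two or more single letters, each followed by a dot, e.g. N.B.A.
ACRONYM = re.compile(r"\b(?:[A-Za-z]\.){2,}")
# A known title followed by a dot, e.g. Mr.
TITLE = re.compile(r"\b(" + "|".join(TITLES) + r")\.")


def strip_abbreviation_dots(text):
    """Remove the dots from dotted acronyms and from known titles."""
    text = ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)
    return TITLE.sub(r"\1", text)


print(strip_abbreviation_dots("The N.B.A. fined Mr. Smith"))
```

Running this before the general punctuation removal keeps an acronym together as one token instead of splitting it into single letters.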
Replaced all punctuation (except single quotes and hyphens) with spaces.
Words connected by hyphens were separated, with each hyphen replaced by a space, but only if there was more than one hyphen in the “word”. For example, text like ‘no-holds-barred’ was changed to ‘no holds barred’, but text like ‘anti-social’ kept its hyphen.
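The hyphen rule above is simple enough to express directly. The following Python sketch is illustrative only, and the function names are hypothetical.

```python
def split_multi_hyphen(word):
    """Replace hyphens with spaces only when the word has more than one."""
    return word.replace("-", " ") if word.count("-") > 1 else word


def apply_hyphen_rule(text):
    """Apply the rule to every whitespace-separated token in the text."""
    return " ".join(split_multi_hyphen(w) for w in text.split())


print(apply_hyphen_rule("a no-holds-barred but anti-social fight"))
```

Keeping single-hyphen words intact preserves compounds such as ‘anti-social’ as single vocabulary entries.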
Numbers with units attached were separated from the unit with a space, e.g. ‘9pm’ became ‘9 pm’ and ‘5feet’ became ‘5 feet’.
All instances of numbers were replaced with a number marker. This signifies that a number is usually found in that place and that the exact number used is unimportant. Any “word” composed only of digits and punctuation was considered a number, e.g. 999.00 and 100-110 were replaced by the marker.
All lines of text were broken down into a single list of words. The words were separated using a space, multiple spaces, or a combination of spaces and punctuation as word separators.
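The two number-handling steps can be sketched as follows in Python. This treats the steps in isolation and is not the report's code. The marker string `<num>` is a hypothetical placeholder, since the report's actual marker is not shown in the text.

```python
import re

NUM_MARKER = "<num>"  # hypothetical placeholder; the report's marker is not shown


def split_units(text):
    """Insert a space between a digit and a letter that directly follows it."""
    return re.sub(r"(\d)([A-Za-z])", r"\1 \2", text)


def is_number_token(token):
    """True if the token has at least one digit and only digits or punctuation."""
    return bool(re.fullmatch(r"[\d\W_]+", token)) and any(c.isdigit() for c in token)


def mark_numbers(text):
    """Separate units from numbers, then replace number tokens with the marker."""
    tokens = split_units(text).split()
    return " ".join(NUM_MARKER if is_number_token(t) else t for t in tokens)


print(mark_numbers("meet at 9pm for 999.00 or 100-110"))
```

Collapsing all numbers into one token lets the model learn patterns such as "at (number) pm" without needing to have seen every specific number.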
Words that start with these special characters (ø⁰) were processed to preserve the trailing words and remove the special characters.
All words were converted to lower case.
All blanks and NAs were removed from the word list.
All words that contain characters not compatible with ISO-8859-1 (Western alphabet) were removed. See https://en.wikipedia.org/wiki/ISO/IEC_8859-1 for more information.
Lastly, a profanity filter was applied to the word list. To identify offensive words, we used a fork of the Google profanity list found at https://gist.github.com/ryanlewis/a37739d710ccdb4b406d. Certain letters were then substituted to match more possible profanity in the text. People often use a special character or a number in place of a letter when writing profane words, so the substitutions in the table below were applied to the profanity list from the link above to expand our search list.
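A simple way to test ISO-8859-1 compatibility is to attempt to encode each word and drop those that fail. The Python sketch below illustrates the idea, with hypothetical function names. It is not the report's implementation.

```python
def is_latin1(word):
    """True if every character in the word can be encoded as ISO-8859-1."""
    try:
        word.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def keep_latin1(words):
    """Drop words containing characters outside ISO-8859-1."""
    return [w for w in words if is_latin1(w)]


print(keep_latin1(["caf\u00e9", "\u65e5\u672c", "hello"]))
```

Accented Western European letters such as é pass this check, while scripts such as Japanese or Cyrillic do not.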
| letter | a | a | a | e | e | i | i | i | o | o | u | s | s | t |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| replacement | 4 | @ | * | 3 | * | ! | 1 | * | 0 | * | * | 5 | $ | + |
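The report does not describe exactly how the substitutions were combined. The Python sketch below shows one possible expansion, generating every combination of original and substituted letters for each word. The mapping is copied from the table above; the function name and the all-combinations strategy are assumptions, and a harmless placeholder word is used in the example.

```python
from itertools import product

# Letter -> replacement characters, taken from the substitution table.
SUBSTITUTIONS = {
    "a": "4@*", "e": "3*", "i": "!1*", "o": "0*",
    "u": "*", "s": "5$", "t": "+",
}


def expand_variants(word):
    """Return the word plus every variant produced by the letter substitutions."""
    # For each character: the character itself followed by its replacements.
    options = [ch + SUBSTITUTIONS.get(ch, "") for ch in word]
    return {"".join(combo) for combo in product(*options)}


print(sorted(expand_variants("heck")))
```

The number of variants grows multiplicatively with the number of substitutable letters, so for long words a cheaper strategy (for example, one substitution at a time) may be preferable.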
The following are the counts for the working data set (uncleaned):
| Characteristic | blogs | news | twitter |
|---|---|---|---|
| Number of lines | 89,929 | 101,025 | 236,015 |
| Number of words | 3,750,705 | 3,433,298 | 3,035,210 |
| Number of unique words | 220,800 | 209,421 | 223,856 |
The following shows some characteristics of the combined cleaned working data set.
| Characteristic | Value |
|---|---|
| Total Combined Number of Lines in the Working Data Set | 426,969 |
| Total Number of Tokens/Words in the Working Data Set | 10,185,910 |
| Total Number of Tokens/Words in the Working Data Set Excluding Stop Words | 5,551,724 |
| Total Number of Unique Words in the Working Data Set Excluding Stop Words | 200,387 |
Note: The stop words were taken from the tm (Text Mining) package in R, using stopwords('en').
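Summary figures like those in the table above, and a ranking of the most frequent words, can be derived from the cleaned token list with a frequency counter. The Python sketch below is illustrative only. Its `STOP_WORDS` set is a tiny hypothetical stand-in for the tm package's English stop word list, which is much longer.

```python
from collections import Counter

# Hypothetical stand-in; the report used tm's English stop word list in R.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}


def summarize(tokens, stop_words=STOP_WORDS, top_n=50):
    """Count tokens with and without stop words and rank the most frequent."""
    content = [t for t in tokens if t not in stop_words]
    counts = Counter(content)
    return {
        "total_tokens": len(tokens),
        "tokens_excluding_stop_words": len(content),
        "unique_excluding_stop_words": len(counts),
        "top_words": counts.most_common(top_n),
    }


tokens = "the cat and the dog saw the cat".split()
print(summarize(tokens, top_n=1))
```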
Top 50 Words (Excluding Stop Words)
Below is the graph of the top 50 words excluding stop words:
Below is the word cloud of the top 50 words excluding stop words:
Top 20 2-Word Phrases (2-Grams)
Below is the graph of the top 20 2-word phrases (2-grams) in the working data set. This includes all words in the corpora:
2-Word Phrase Cloud (2-Grams):
Top 20 3-Word Phrases (3-Grams)
Below is the graph of the top 20 3-word phrases (3-grams) in the working data set. This includes all words in the corpora:
3-Word Phrase Cloud (3-Grams):
Top 20 4-Word Phrases (4-Grams)
Below is the graph of the top 20 4-word phrases (4-grams) in the working data set. This includes all words in the corpora:
4-Word Phrase Cloud (4-Grams):
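The 2-, 3- and 4-gram frequencies behind rankings like these can be computed with a sliding window over the tokens. The following Python sketch is not the report's code. It assumes each line has already been cleaned and tokenized, and that n-grams do not cross line boundaries; the report does not state this either way.

```python
from collections import Counter


def ngram_counts(tokenized_lines, n):
    """Count n-grams (joined by spaces) across a list of token lists."""
    counts = Counter()
    for tokens in tokenized_lines:
        # Assumption: n-grams never span two different lines.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


lines = [["i", "love", "you"], ["i", "love", "it"]]
for n in (2, 3, 4):
    print(n, ngram_counts(lines, n).most_common(20))
```

Stop words are kept here, since common phrases such as "of the" are exactly what a next-word predictor needs to learn.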
For the Shiny Application and Text Prediction Language Model, the plan is to add the following in the process: