Predictive Text Application Milestone Report

Introduction

This report presents the progress made in implementing a language model for text prediction, along with the results of an exploratory analysis of the data set.

The dataset used in this project was taken from HC Corpora (http://www.corpora.heliohost.org). Only the en_US subset was used. It consists of blog text (en_US.blogs.txt), Twitter text (en_US.twitter.txt) and news text (en_US.news.txt).

The original dataset contains millions of lines of text. To keep processing and exploration within reasonable limits, a working subset was extracted by taking 10% of the lines from each text file.
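
As a rough illustration of the sampling step, the sketch below draws a 10% subset of lines in Python. It is not the code used for this report: the function names, the fixed seed, the choice of random rather than systematic sampling, and rounding the sample size up are all assumptions.

```python
import random


def sample_size(n_lines, percent=10):
    """Number of lines to keep, rounded up (integer arithmetic only)."""
    return (n_lines * percent + 99) // 100


def sample_lines(lines, percent=10, seed=42):
    """Return a random subset of `lines`, kept in their original order."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    k = sample_size(len(lines), percent)
    keep = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in keep]
```

With rounding up, `sample_size(899288)` gives 89,929, which agrees with the blogs line count reported for the working data set, although the report does not say how the subset was actually drawn.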

Dataset Characteristics

Initial inspection of the full dataset showed the following characteristics of the uncleaned data:

Characteristic	en_US.blogs.txt	en_US.twitter.txt	en_US.news.txt
Number of lines	899,288	2,360,148	1,010,242
Number of words	37,334,690	30,374,206	34,372,720
Number of unique words	253,042	212,227	302,652
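
Counts of this kind can be approximated with a few lines of Python. The sketch below assumes whitespace tokenization and case-sensitive uniqueness. The report does not state which tokenizer produced the figures above, so the numbers it yields may differ. `corpus_stats` is a hypothetical helper name.

```python
def corpus_stats(lines):
    """Count lines, whitespace-separated words, and distinct words."""
    n_lines = 0
    n_words = 0
    vocabulary = set()
    for line in lines:
        tokens = line.split()  # assumption: split on any whitespace
        n_lines += 1
        n_words += len(tokens)
        vocabulary.update(tokens)
    return {"lines": n_lines, "words": n_words, "unique_words": len(vocabulary)}
```

Because the function only iterates once, it can be fed an open file handle, so a large file does not need to be loaded into memory first.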

Data Cleansing

Processing the full dataset is slow, so only 10% of each source text was used as the working dataset for building the language model.

A usable text prediction model requires clean data. The following steps were taken to clean the data:

Combined all lines of text from each subset (twitter, news and blogs) into a single list.
Standardized the character used for single quotes and apostrophes. Apostrophes and single quotes serve different purposes in formal English grammar, but since we are dealing with free text, we assume that the characters were used interchangeably. A single character (the single quote) replaced every character that looks like an apostrophe or single quote, for example the grave character (`), the left and right single quotes (‘’) and the prime character (′).
Standardized the character used for hyphens. A couple of other characters look like a hyphen but are not, namely the em dash and the en dash. They were all replaced with the hyphen character.
Removed the dots from all acronyms, titles and abbreviations that end in a dot (.), e.g. ‘N.B.A.’ became ‘NBA’ and ‘Mr.’ became ‘Mr’.
Replaced all punctuation (except the single quote and the hyphen) with a space.
Separated words connected by hyphens by replacing the hyphens with spaces, but only if the “word” contains more than one hyphen, e.g. ‘no-holds-barred’ was changed to ‘no holds barred’, while ‘anti-social’ kept its hyphen.
Separated numbers from the units attached to them using a space, e.g. ‘9pm’ became ‘9 pm’ and ‘5feet’ became ‘5 feet’.
Replaced all numbers with a number marker. The marker signifies that a number is usually found in that position and that its exact value is unimportant. Any “word” composed only of digits and punctuation was considered a number, e.g. 999.00 and 100-110 were replaced by the marker.
Broke all lines of text down into a single list of words, using a space, multiple spaces or a combination of spaces and punctuation as word separators.
Stripped these special characters (ø⁰) from the start of words while preserving the rest of the word.
Converted all words to lower case.
Removed all blanks and NAs from the word list.
Removed all words that contain characters not representable in ISO-8859-1 (Latin-1, the Western European character set). See https://en.wikipedia.org/wiki/ISO/IEC_8859-1 for more information.
Lastly, applied a profanity filter to the word list. To identify offensive words, we used a fork of the Google profanity list found at https://gist.github.com/ryanlewis/a37739d710ccdb4b406d. Because people often use a special character or a number in place of a letter when writing profane words, the substitutions in the table below were applied to that list to expand our search list.

letter	a	a	a	e	e	i	i	i	o	o	u	s	s	t
replacement	4	@	*	3	*	!	1	*	0	*	*	5	$	+
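
To make the cleaning steps and the substitution table more concrete, here is a minimal Python sketch of a subset of them. It is an illustration under assumptions, not the code behind this report: the marker string `numbermarker`, the short title list, the regular expressions, and the order in which the steps run (numbers are handled before punctuation is stripped so that 999.00 stays one token) are all hypothetical choices. For the profanity list, the sketch assumes every combination of substitutions is generated, which the report does not confirm.

```python
import re
from itertools import product

# Look-alike characters mapped to one canonical form.
LOOKALIKES = str.maketrans({
    "`": "'", "\u2018": "'", "\u2019": "'", "\u2032": "'",  # quote-like
    "\u2013": "-", "\u2014": "-",                            # en/em dash
})
NUM_MARKER = "numbermarker"  # hypothetical; the real marker is not shown

# Letter substitutions from the table above.
SUBSTITUTIONS = {"a": "4@*", "e": "3*", "i": "!1*", "o": "0*",
                 "u": "*", "s": "5$", "t": "+"}


def clean_line(text):
    """Apply a subset of the cleaning steps and return lower-case tokens."""
    text = text.translate(LOOKALIKES)
    # Titles (illustrative list only): "Mr." -> "Mr".
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr)\.", r"\1", text)
    # Dotted acronyms: "N.B.A." -> "NBA".
    text = re.sub(r"\b(?:[A-Za-z]\.){2,}",
                  lambda m: m.group(0).replace(".", ""), text)
    # Separate a number from an attached unit: "9pm" -> "9 pm".
    text = re.sub(r"(\d)([A-Za-z])", r"\1 \2", text)
    # Replace digit-and-punctuation runs ("999.00", "100-110") with the marker.
    text = re.sub(r"\d[\d.,:/-]*", f" {NUM_MARKER} ", text)
    # Replace remaining punctuation, except ' and -, with a space.
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Split words containing more than one hyphen.
    text = re.sub(r"\b\w+(?:-\w+){2,}\b",
                  lambda m: m.group(0).replace("-", " "), text)
    return text.lower().split()


def expand_variants(word):
    """All spellings of `word` reachable through the substitution table."""
    choices = [ch + SUBSTITUTIONS.get(ch, "") for ch in word]
    return {"".join(combo) for combo in product(*choices)}
```

For example, `expand_variants("darn")` yields the four spellings `darn`, `d4rn`, `d@rn` and `d*rn`. The number of variants grows multiplicatively with each substitutable letter, so a full cross product over a long word list can become large.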

Exploratory Analysis of The Working Data Set

The table below summarizes the working data set before cleaning:

Characteristics	blogs	news	twitter
Number of lines	89,929	101,025	236,015
Number of words	3,750,705	3,433,298	3,035,210
Number of unique words	220,800	209,421	223,856

The following shows some characteristics of the combined cleaned working data set.

Characteristic	Value
Total Combined Number of Lines in the Working Data Set	426,969
Total Number of Tokens/Words in the Working Data Set	10,185,910
Total Number of Tokens/Words in the Working Data Set Excluding Stop Words	5,551,724
Total Number of Unique Words in the Working Data Set Excluding Stop Words	200,387

Note: The stop words were taken from the Text Mining package in R (tm package) using stopwords("en").
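
The stop-word-filtered counts above, and the top-word ranking in the next section, could be derived roughly as follows. This Python sketch is illustrative only: `STOP_WORDS` is a tiny stand-in for the tm stop word list, which is not reproduced here, and `content_word_stats` is a hypothetical helper name.

```python
from collections import Counter

# Tiny stand-in list (an assumption); the report used the tm package's list.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "it", "i"})


def content_word_stats(tokens, stop_words=STOP_WORDS, top_n=50):
    """Token totals with and without stop words, plus the most frequent words."""
    kept = [t for t in tokens if t not in stop_words]
    counts = Counter(kept)
    return {
        "tokens": len(tokens),
        "tokens_no_stop": len(kept),
        "unique_no_stop": len(counts),
        "top": counts.most_common(top_n),
    }
```

Passing the cleaned, lower-cased word list as `tokens` keeps the comparison against the stop word list case-consistent.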

Top Words and Phrases

Top 50 Words (Excluding Stop Words)

Below is the graph of the top 50 words excluding stop words:

Below is the word cloud of the top 50 words excluding stop words:

Top Phrases (N-grams)
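
The phrase counts charted in this section can be produced along the following lines. This is a minimal Python sketch rather than the code used for the report. It assumes each line has already been cleaned and tokenized, and that n-grams are counted within a line rather than across line boundaries, which the report does not specify. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(lines_of_tokens, n, k=20):
    """Return the k most frequent n-grams as (phrase, count) pairs."""
    counts = Counter()
    for tokens in lines_of_tokens:
        # Lines shorter than n tokens contribute nothing.
        counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts.most_common(k)
```

Calling `top_ngrams(lines, 2)`, `top_ngrams(lines, 3)` and `top_ngrams(lines, 4)` on the same token lists would give the inputs for the three charts.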

Top 20 2-Word Phrases (2-Grams)

Below is the graph of the top 20 2-word phrases (2-grams) in the working data set. This includes all words in the corpora:

2-Word Phrase Cloud (2-Grams):

Top 20 3-Word Phrases (3-Grams)

Below is the graph of the top 20 3-word phrases (3-grams) in the working data set. This includes all words in the corpora:

3-Word Phrase Cloud (3-Grams):

Top 20 4-Word Phrases (4-Grams)

Below is the graph of the top 20 4-word phrases (4-grams) in the working data set. This includes all words in the corpora:

4-Word Phrase Cloud (4-Grams):

Next Steps

For the Shiny application and the text prediction language model, the plan is to add the following to the process:

Use of a dictionary to look up valid English words.
Spelling correction - This may include deriving the root word and, if a word in the corpus does not exist in the dictionary, calculating how far it is from a valid English word.
Word tagging - Identify how a word is used in the sentence, whether it is a proper noun, and whether it appeared at the start or the end of a sentence.
The method for assigning probabilities to words following a phrase has not been finalized. Initially, I plan to implement a stupid backoff algorithm for the language model. If the results are not satisfactory, I intend to try another method, yet to be chosen.
The language model to be used by the Shiny app will be a combined model of all 3 data sources (blogs, twitter and news).
The language model will be compressed to be as small as possible. Ideally, the target number of unique words will be 25,000. To reach such a small vocabulary, the word list should contain only words with a frequency of at least 20. This threshold may vary depending on the performance of the Shiny app.
Stop words will also be kept out of the language model; however, a marker for stop words will be kept to enable the app to suggest stop words in the appropriate places.
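
Since stupid backoff is the planned starting point, a minimal Python sketch of the idea is given here for orientation. It is not the planned implementation: the function names, the trigram limit, and the backoff factor of 0.4 (the value commonly used with this method) are assumptions. A word is scored by its relative frequency after the longest matching context, and each time the context has to be shortened the score is multiplied by the backoff factor. The results are scores, not normalized probabilities.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor (assumed; commonly used with stupid backoff)


def build_counts(sentences, max_n=3):
    """Count every n-gram of length 1..max_n, keyed by tuple of tokens."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts):
    """Stupid backoff score of `word` following the tokens in `context`."""
    context = tuple(context)
    penalty = 1.0
    while context:
        seen = counts.get(context + (word,), 0)
        if seen > 0:
            return penalty * seen / counts[context]
        context = context[1:]  # drop the oldest word and back off
        penalty *= ALPHA
    total = sum(c for gram, c in counts.items() if len(gram) == 1)
    return penalty * counts.get((word,), 0) / total


def predict(context, counts, k=3):
    """Return the k highest-scoring next words (ties broken alphabetically)."""
    vocabulary = [gram[0] for gram in counts if len(gram) == 1]
    ranked = sorted(vocabulary, key=lambda w: (-score(w, context, counts), w))
    return ranked[:k]
```

Scoring every vocabulary word on each call is fine for a toy example but would be too slow for a 25,000-word model. A practical version would look up only the continuations observed after the context.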
