Introduction

This report presents the progress made so far in implementing a language model for text prediction, together with the results of an exploratory analysis of the data set.

The dataset used in this project was taken from the HC Corpora (http://www.corpora.heliohost.org). The en_US subset was chosen for this project. It includes blog text (en_US.blogs.txt), Twitter text (en_US.twitter.txt) and news text (en_US.news.txt).

The original dataset contains millions of lines of text. To process and explore the data within reasonable time and memory limits, a subset of each text file was extracted: 10% of the lines from each file were used.
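
As an illustration of the sampling step, here is a minimal Python sketch that keeps 10% of the lines of a text. The report's own analysis was done in R and does not say how the 10% was drawn, so the random selection, the fixed seed and the name `sample_lines` are all assumptions made for this example.

```python
import random

def sample_lines(lines, fraction=0.10, seed=42):
    """Return about `fraction` of the lines, chosen at random, in original order.

    Hypothetical helper: the seed and the random selection are assumptions,
    not details taken from the report.
    """
    rng = random.Random(seed)              # fixed seed makes the sample reproducible
    k = round(len(lines) * fraction)       # number of lines to keep
    keep = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in keep]

# Tiny stand-in for one of the corpus files
corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus)
print(len(subset))
```

Sampling line indices rather than the lines themselves keeps the selected lines in their original order, which makes the subset easier to compare against the source file.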

Dataset Characteristics

Initial inspection of the full dataset showed the following characteristics of the uncleaned data:

Characteristic           en_US.blogs.txt   en_US.twitter.txt   en_US.news.txt
Number of lines                  899,288           2,360,148        1,010,242
Number of words               37,334,690          30,374,206       34,372,720
Number of unique words           253,042             212,227          302,652
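
Counts like the ones in this table can be computed in a single pass over the lines. The Python sketch below is only an approximation of that idea: it treats a word as a whitespace-delimited, lowercased token, which is an assumption, since the report does not state the tokenization it used. The name `corpus_stats` is hypothetical.

```python
def corpus_stats(lines):
    """Count lines, words and unique words in a list of text lines.

    Assumption: a "word" is a whitespace-delimited token, compared
    case-insensitively. The report's exact definition is not given.
    """
    n_words = 0
    vocab = set()
    for line in lines:
        tokens = line.lower().split()
        n_words += len(tokens)
        vocab.update(tokens)
    return {"lines": len(lines), "words": n_words, "unique_words": len(vocab)}

sample = ["The cat sat on the mat", "the dog sat"]
stats = corpus_stats(sample)
print(stats)
```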

Data Cleansing

Processing the full dataset is time-consuming, so only 10% of each source text was used as the working dataset for building the language model.

The data must be cleansed before a text prediction model can be built from it. The following steps were taken to clean the data:

Exploratory Analysis of the Working Data Set

The table below shows the number of lines, words and unique words in the working data set before cleaning:

Characteristic               blogs        news     twitter
Number of lines             89,929     101,025     236,015
Number of words          3,750,705   3,433,298   3,035,210
Number of unique words     220,800     209,421     223,856

The following shows some characteristics of the combined cleaned working data set.

Characteristic                                     Value
Total lines                                      426,969
Total tokens (words)                          10,185,910
Total tokens (words), excluding stop words     5,551,724
Unique words, excluding stop words               200,387

Note: The stop word list was taken from the R text mining package (tm) using stopwords('en').
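
To make the "excluding stop words" figures concrete, here is a small Python sketch of the filtering and counting step. It is not the report's R code: the stop list below is a tiny illustrative set rather than the list returned by tm's stopwords('en'), and the name `content_word_counts` is hypothetical.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption for this sketch); the report
# used the much longer English list from the R tm package.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}

def content_word_counts(lines, stop_words=STOP_WORDS):
    """Count every token that is not a stop word."""
    counts = Counter()
    for line in lines:
        for token in line.lower().split():
            if token not in stop_words:
                counts[token] += 1
    return counts

lines = ["the cat sat on the mat", "a dog and a cat"]
counts = content_word_counts(lines)
total_tokens = sum(counts.values())   # tokens remaining after stop word removal
unique_words = len(counts)            # distinct words remaining
top_word = counts.most_common(1)      # most frequent remaining word
print(total_tokens, unique_words, top_word)
```

The same `Counter` gives a ranked word list through `most_common(50)`, which is the kind of data a top-50 word chart needs.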

Top Words and Phrases

Top 50 Words (Excluding Stop Words)

Below is the graph of the top 50 words excluding stop words:

Below is the word cloud of the top 50 words excluding stop words:

Top Phrases (N-grams)

Top 20 2-Word Phrases (2-Grams)

Below is the graph of the top 20 2-word phrases (2-grams) in the working data set. All words in the corpora are counted, including stop words:

2-Word Phrase Cloud (2-Grams):

Top 20 3-Word Phrases (3-Grams)

Below is the graph of the top 20 3-word phrases (3-grams) in the working data set. All words in the corpora are counted, including stop words:

3-Word Phrase Cloud (3-Grams):

Top 20 4-Word Phrases (4-Grams)

Below is the graph of the top 20 4-word phrases (4-grams) in the working data set. All words in the corpora are counted, including stop words:

4-Word Phrase Cloud (4-Grams):
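
The 2-, 3- and 4-gram rankings in this section all come from the same operation: sliding a window of n words over each line and counting the resulting phrases. The Python sketch below illustrates that idea under some assumptions of its own: whitespace tokenization, lowercasing, and n-grams that never span two lines. The report's R implementation may differ, and `ngram_counts` is a hypothetical name.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count n-word phrases, keeping stop words, within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        # Zip n staggered views of the token list to form each n-gram;
        # lines with fewer than n tokens contribute nothing.
        for gram in zip(*(tokens[i:] for i in range(n))):
            counts[" ".join(gram)] += 1
    return counts

lines = ["i love new york", "i love new shoes"]
bigrams = ngram_counts(lines, 2)
trigrams = ngram_counts(lines, 3)
fourgrams = ngram_counts(lines, 4)
print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

Calling `most_common(20)` on each counter would give the data behind a top-20 chart for the corresponding n.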

Next Steps

For the Shiny Application and Text Prediction Language Model, the plan is to add the following in the process:
