Exploratory analysis of a text dataset

Objectives

The objective of this report is to perform exploratory analysis on the Project Dataset.

Project dataset

The project dataset consists of 3 files (blogs, twitter and news), each containing texts obtained from that source, and is provided in 4 languages (English, German, Russian and Finnish). The words and phrases will be used to design and train a predictive texting model in the given language. For the purpose of this report and course project, only the English files are used.

Summary of findings

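There are 3 txt files: blogs, news and twitter. The size column in the file listing reports 120, 120 and 136 respectively; these values are far too small to be on-disk sizes for files holding millions of words, so they should not be read as file sizes. The word counts are 44m, 3m and 37m respectively, totalling 84m. The line counts are 899k, 77k and 2.4m. Twitter has by far the shortest lines (about 16 words per line), compared with about 49 for blogs and 41 for news.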

After removing words containing non-ASCII characters, 83m of the 84m words remain (about 941k are removed). Of these, 11.6m are empty (white-space) entries, which leaves 71.5m words, numbers and alphanumeric combinations.

Out of the 71m words/combinations, the 10 most frequently used words make up just under 20% of usage (19.3%). The top 150 words cover 50%, whereas the top 100000 cover 90%.

Plan for algorithm

We will now use this dataset to build a model for predictive texting. The method will be based on analysing word combinations to identify the most commonly used ones. The main focus will be a 2-gram model, but the model will also accommodate 3-word phrases.
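As a rough illustration of the plan, the sketch below counts 2-grams and 3-grams and suggests the next word from the most frequent continuation. It is written in Python purely for illustration. The function names `build_ngram_table` and `predict_next` are hypothetical. The rule of trying the 3-gram context first and falling back to the 2-gram is an assumption, not something the report specifies.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table


def predict_next(words, bigrams, trigrams, k=3):
    """Suggest up to k next words: try the 2-word context first, then 1-word.

    The fallback order is an assumption made for this sketch.
    """
    words = list(words)
    if len(words) >= 2:
        followers = trigrams.get(tuple(words[-2:]))
        if followers:
            return [w for w, _ in followers.most_common(k)]
    if words:
        followers = bigrams.get((words[-1],))
        if followers:
            return [w for w, _ in followers.most_common(k)]
    return []


# Toy corpus standing in for the real data.
tokens = "i want to go to the shop and i want to eat".split()
bigrams = build_ngram_table(tokens, 2)
trigrams = build_ngram_table(tokens, 3)
print(predict_next(["i"], bigrams, trigrams))  # ['want']
```

On the full corpus the two tables would be built once and only the most frequent continuations kept, since most contexts occur very rarely.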

Analysis

The first step is to list the files and understand their respective sizes, word counts and line counts.

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

##   file_names unlist.size.
## 1      blogs          120
## 2       news          120
## 3    twitter          136

##   file_names wordcount linecount
## 1      blogs  43937582    899288
## 2       news   3137816     77259
## 3    twitter  36947512   2360148
## 4      Total  84022910   3336695
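A summary like the one above could be produced along the following lines. This is a minimal Python sketch, not the code that generated the table. It assumes words are whitespace-delimited and that files are UTF-8 encoded. The helper names `count_lines_and_words` and `summarise` are hypothetical.

```python
from pathlib import Path


def count_lines_and_words(path, encoding="utf-8"):
    """Return (line_count, word_count) for a text file, reading it line by line."""
    lines = 0
    words = 0
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # whitespace-delimited tokens (assumption)
    return lines, words


def summarise(paths):
    """Build one (name, words, lines, words_per_line) row per file plus a Total row."""
    rows = []
    total_lines = total_words = 0
    for path in paths:
        lines, words = count_lines_and_words(path)
        rows.append((Path(path).name, words, lines, words / lines if lines else 0.0))
        total_lines += lines
        total_words += words
    rows.append(("Total", total_words, total_lines,
                 total_words / total_lines if total_lines else 0.0))
    return rows
```

The extra words-per-line column makes it easy to compare typical line length across sources. Calling `summarise(["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"])` would return one row per file in that order.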

Frequency of words

The next step is to look a bit deeper into each of the files at the frequency of (English) words. All 3 files are also combined into one corpus for ease of analysis. Because of the nature of texting/blogging, words containing numbers are kept in case they are popular shorthand.

Revised word count

The first value below is the number of words remaining; the second is the number of words removed.

## [1] 83081749

## [1] 941161
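The two counts are consistent with the earlier total: 83081749 + 941161 = 84022910. A filter of this kind can be sketched as below, in Python for illustration. It assumes a word is dropped whenever it contains any non-ASCII character, which is one reading of "words removed". The helper name `split_ascii_tokens` is hypothetical.

```python
def split_ascii_tokens(tokens):
    """Partition tokens into (kept, removed); kept tokens are pure ASCII."""
    kept, removed = [], []
    for token in tokens:
        (kept if token.isascii() else removed).append(token)
    return kept, removed


# Words with digits such as "4u" survive; accented words are dropped.
sample = "I had a café latte and a naïve plan 4u".split(" ")
kept, removed = split_ascii_tokens(sample)
print(len(kept), len(removed))  # 8 2
```

Note that an empty string counts as ASCII, so empty tokens produced by splitting on repeated spaces pass through this filter and need to be handled separately.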

##    US_EN.ASCII     Freq
## 1              11560554
## 2          the  2644358
## 3           to  1895039
## 4            I  1651448
## 5            a  1507983
## 6          and  1507581
## 7           of  1279227
## 8           in   964643
## 9          you   802892
## 10          is   786907

##    US_EN.ASCII    Freq percentage cum_percentage
## 2          the 2644358 0.03697307     0.03697307
## 3           to 1895039 0.02649619     0.06346926
## 4            I 1651448 0.02309033     0.08655959
## 5            a 1507983 0.02108442     0.10764401
## 6          and 1507581 0.02107880     0.12872281
## 7           of 1279227 0.01788599     0.14660880
## 8           in  964643 0.01348751     0.16009631
## 9          you  802892 0.01122593     0.17132224
## 10          is  786907 0.01100243     0.18232467
## 11         for  742367 0.01037968     0.19270435
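The percentage column is each word's count divided by the number of non-empty words, 83081749 - 11560554 = 71521195; for example 2644358 / 71521195 ≈ 0.03697 for "the". The sketch below, in Python for illustration, shows one way to build such a table and to find how many top-ranked words are needed to reach a given coverage. The function names `frequency_table` and `words_needed` are hypothetical, and dropping empty tokens before computing shares is an assumption that matches the arithmetic above.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(tokens):
    """Return rows of (word, freq, share, cumulative_share), most frequent first.

    Empty tokens are dropped before shares are computed.
    """
    counts = Counter(t for t in tokens if t)
    total = sum(counts.values())
    ranked = counts.most_common()
    cumulative = accumulate(freq for _, freq in ranked)
    return [(word, freq, freq / total, cum / total)
            for (word, freq), cum in zip(ranked, cumulative)]


def words_needed(table, target):
    """Smallest number of top-ranked words whose cumulative share reaches target."""
    for rank, (_, _, _, cum_share) in enumerate(table, start=1):
        if cum_share >= target:
            return rank
    return len(table)


# Toy input; the trailing empty token is ignored.
tokens = "the cat sat on the mat the end".split() + [""]
table = frequency_table(tokens)
print(table[0])                  # ('the', 3, 0.375, 0.375)
print(words_needed(table, 0.5))  # 2
```

Applied to the full corpus, `words_needed(table, 0.5)` and `words_needed(table, 0.9)` would give the vocabulary sizes needed for 50% and 90% coverage.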