Summary

The goal of this project is to build a predictive model that supports users typing text on a mobile device. Three different data sources in the form of text corpora are available, each different in style. The first is a set of tweets, the second a set of blogs, and the third a set of news articles. All sources are written by multiple users.

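This report covers data acquisition and inspection, an exploration of frequently used words, a term frequency analysis, and the planned future steps.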

Note: I have been running into stack overflow errors when running knitr on my MacBook to generate the report. This is a shorter version of the original report. The graphic files have been submitted with this report.

Data Acquisition and Inspection

The first step is to download and unzip the text files if they are not already on disk. The next step is to load and inspect the text data.
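
The download-if-missing step can be sketched as follows. This is a minimal illustration in Python rather than the code used for this report; the function name `fetch_and_unzip` and its parameters are hypothetical, and the caller supplies the URL and paths.

```python
import os
import urllib.request
import zipfile


def fetch_and_unzip(url, zip_path, dest_dir):
    """Download the archive and extract it, skipping work already done.

    Hypothetical helper: url, zip_path and dest_dir are supplied by the caller.
    """
    # Download only if the archive is not on disk yet.
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    # Extract only if the destination directory does not exist yet.
    if not os.path.isdir(dest_dir):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(dest_dir)
    # Report what is available at the top level of the destination.
    return sorted(os.listdir(dest_dir))
```

Checking for the directory rather than for individual files keeps the sketch short; a partially extracted archive would not be detected by this check.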

The Twitter text file contains 2360148 tweets. Each line represents one tweet. The longest tweet has 140 characters.

The news text file contains 1010242 news articles. Each line represents one news article. The longest article has 11384 characters.

The blog text file contains 899288 blog entries. Each line represents one blog entry. The longest entry has 40833 characters.

Table of Text Source with Entries and Longest Entry

Corpus     Number of Entries    Characters in Longest Entry
Twitter    2360148              140
News       1010242              11384
Blogs      899288               40833
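
The two figures in the table (number of entries and length of the longest entry) can be computed in a single pass over each file. The sketch below is an illustration in Python, not the code behind the numbers above; `corpus_stats` and `file_stats` are hypothetical names, and it assumes one entry per line and UTF-8 encoded files.

```python
def corpus_stats(lines):
    """Return (number of entries, character count of the longest entry)."""
    count = 0
    longest = 0
    for line in lines:
        # Drop the trailing newline so it is not counted as a character.
        entry = line.rstrip("\r\n")
        count += 1
        longest = max(longest, len(entry))
    return count, longest


def file_stats(path):
    """Compute the same statistics for a text file with one entry per line."""
    # Assumption: UTF-8 input; errors="replace" keeps the scan going on stray bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return corpus_stats(handle)
```

Streaming over the file handle avoids loading a whole corpus into memory, which matters for files with millions of lines.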

Data Exploration - Frequently Used Words

For visual representation we use word clouds. A word cloud is a visual representation of text data, typically used to depict keywords on websites or to visualize free-form text. It is a simple technique to see the most frequently used words in each text data source. Common words like ‘is’, ‘of’, ‘the’ etc. have been removed.
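
A word cloud is driven by word counts taken after the common words are removed. The sketch below shows only that counting step, in Python; it does not draw anything and is not the code used for the submitted graphics. The stop word list and the tokenizing pattern are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "of", "the", "a", "an", "and", "to", "in"}


def top_words(lines, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words that are not stop words."""
    counts = Counter()
    for line in lines:
        # Lowercase, then treat runs of letters and apostrophes as words.
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)
```

The resulting (word, count) pairs are what a word cloud renderer would scale into font sizes.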

Wordcloud of Twitter Text

Due to limitations on my machine (stack overflow running knitr), this graph has been submitted along with this report.

Wordcloud of News Text

Due to limitations on my machine (stack overflow running knitr), this graph has been submitted along with this report.

Wordcloud of Blog Text

Due to limitations on my machine (stack overflow running knitr), this graph has been submitted along with this report.

Term Frequency

Additional analysis was done on term frequency (using the ‘tm’ package). The news text contains fewer terms than the other two text sources. This might be because it has a smaller number of authors and because most news articles are written by professional writers. In addition, blogs and tweets cover a wide range of topics.
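
One way to compare the sources is to count the distinct terms in each. The sketch below assumes that "terms" means distinct lowercase words; it is a Python illustration with a hypothetical function name and is not a reproduction of the ‘tm’ analysis.

```python
import re


def vocabulary_size(lines):
    """Count distinct lowercase terms across all lines."""
    terms = set()
    for line in lines:
        # Same simple tokenizing assumption: runs of letters and apostrophes.
        terms.update(re.findall(r"[a-z']+", line.lower()))
    return len(terms)
```

Because the corpora differ in size, comparing equally sized samples would give a fairer picture than comparing whole files.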

Future Steps

To build a predictive model based on these text data, the following steps will be taken:

  1. create term frequency tables for each corpus

  2. create ngrams

  3. create probability table based on ngrams

  4. build a predictive model using a Markov model and conditional probabilities of phrases

  5. build a Shiny application around the data sources and algorithm
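
Steps 2 to 4 can be illustrated with a first-order (bigram) Markov model. This is a minimal sketch in Python under simplifying assumptions (whitespace tokenization, no smoothing, bigrams only), with hypothetical function names; the planned model may use longer n-grams and a different design.

```python
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Map each word to a table of P(next word | word)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Each adjacent pair (first, second) is one bigram observation.
        for first, second in zip(words, words[1:]):
            counts[first][second] += 1
    model = {}
    for first, followers in counts.items():
        total = sum(followers.values())
        # Conditional probability: count(first, next) / count(first, anything).
        model[first] = {w: c / total for w, c in followers.items()}
    return model


def predict_next(model, word, k=3):
    """Return up to k of the most probable next words, most likely first."""
    followers = model.get(word.lower(), {})
    return sorted(followers, key=followers.get, reverse=True)[:k]
```

For example, after training on "i love you", "i love cake" and "i like you", the word "i" is followed by "love" with probability 2/3, so "love" is the first suggestion. Words never seen in training return an empty list, which is where a backoff or smoothing scheme would be needed.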