The goal of this project is to build a predictive model that supports users typing text on a mobile device. Three different data sources in the form of text corpora are available, each with its own style: the first is a set of tweets, the second a set of blog entries, and the third a set of news articles. All sources were written by multiple authors.
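This report contains a summary of the three corpora, word clouds of the most frequent words, a term frequency analysis, and the plan for building the predictive model.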
Note: I ran into stack overflow errors when running knitr on a MacBook to generate reports, so this is a shorter version of the original report. The graphic files have been submitted with this report.
The first step is to download and unzip the text files if they are not already on disk. The next step is to load and inspect the text data.
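The download-if-missing step can be sketched as follows. This is a minimal Python illustration using only the standard library, not the code used for this report; the function name `fetch_and_unzip` and its parameters are assumptions, and the archive URL is left to the caller because the report does not state it.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_unzip(url, zip_path, out_dir):
    """Download the archive only if it is missing, extract it only if the
    output directory is missing, and return the extracted files."""
    zip_path = Path(zip_path)
    out_dir = Path(out_dir)
    if not zip_path.exists():
        # The URL is supplied by the caller; it is not given in the report.
        urllib.request.urlretrieve(url, str(zip_path))
    if not out_dir.exists():
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(out_dir)
    return sorted(p for p in out_dir.rglob("*") if p.is_file())
```

Calling the function a second time is cheap, because both the download and the extraction are skipped once the files are on disk.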
The Twitter text file contains 2360148 tweets. Each line represents one tweet. The longest tweet has 140 characters.
The news text file contains 1010242 news articles. Each line represents one news article. The longest article has 11384 characters.
The blog text file contains 899288 blog entries. Each line represents one blog entry. The longest entry has 40833 characters.
| Corpus | Number of Entries | Longest Entry (characters) |
|---|---|---|
| Twitter | 2360148 | 140 |
| News | 1010242 | 11384 |
| Blogs | 899288 | 40833 |
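The two figures in the table (number of entries and length of the longest entry) can be computed in a single pass over each file. The sketch below is an illustrative Python version, not the code behind the table; the names `summarize_lines` and `summarize_file` are assumptions, as is UTF-8 as the file encoding.

```python
def summarize_lines(lines):
    """Return (number of entries, length of the longest entry in characters)."""
    count = 0
    longest = 0
    for line in lines:
        entry = line.rstrip("\r\n")  # drop the line terminator before measuring
        count += 1
        longest = max(longest, len(entry))
    return count, longest


def summarize_file(path, encoding="utf-8"):
    """Stream a corpus file line by line so it never sits in memory at once."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return summarize_lines(handle)


print(summarize_lines(["a tweet\n", "a longer tweet\n"]))  # (2, 14)
```

Streaming matters here because the blog and news files contain entries that are tens of thousands of characters long.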
For visual representation we use word clouds. A word cloud is a visual representation of text data, typically used to depict keywords on websites or to visualize free-form text. It is a simple technique for seeing the most frequently used words in each text data source. Common words like ‘is’, ‘of’, ‘the’ etc. have been removed.
Due to limitations on my machine (a stack overflow when running knitr), the word cloud graphs have been submitted as separate files along with this report.
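The counting that underlies a word cloud is a word frequency table with common words filtered out. A minimal Python sketch is shown below; it is not the code used to produce the submitted graphs, the stop-word list is a small illustrative assumption, and the tokenizer (lower-casing, keeping runs of letters and apostrophes) is a simplification.

```python
import re
from collections import Counter

# Small illustrative stop-word list; an assumption, not the list used in the report.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def word_frequencies(lines, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs across all lines."""
    counts = Counter()
    for line in lines:
        # Lower-case the line and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", line.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


sample = ["The cat is on the mat", "A cat and a dog"]
print(word_frequencies(sample).most_common(1))  # [('cat', 2)]
```

A word cloud then scales each word's font size by its count.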
Additional analysis was done on term frequency (using the ‘tm’ package). The news text contains fewer terms than the other two text sources. This might be because news has a smaller number of authors and most news articles are written by professional writers. In addition, blogs and tweets cover a wide range of topics.
To build a predictive model based on these text data, the following steps will be taken:
1. Create term frequency tables for each corpus.
2. Create n-grams.
3. Create a probability table based on the n-grams.
4. Build a predictive model using a Markov model and the conditional probabilities of phrases.
5. Build a Shiny application around the data sources and the algorithm.
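The n-gram, probability table, and Markov prediction steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the planned implementation: the function names are hypothetical, tokenization is plain whitespace splitting, and probabilities are unsmoothed maximum-likelihood estimates with no backoff to shorter contexts.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_model(lines, n=2):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for line in lines:
        for gram in ngrams(tokenize(line), n):
            model[gram[:-1]][gram[-1]] += 1
    return model


def predict(model, context, k=3):
    """Return up to k (word, probability) pairs, most likely first.

    The probability is count(context followed by word) divided by
    count(context followed by any word).
    """
    followers = model.get(tuple(context))
    if not followers:
        return []  # unseen context: this sketch has no backoff
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(k)]


lines = ["i love you", "i love cake", "i love you too", "i like tea"]
bigram_model = build_model(lines, n=2)
print(predict(bigram_model, ["i"]))  # [('love', 0.75), ('like', 0.25)]
```

With `n=2` the model is a first-order Markov chain: the next word depends only on the previous word. Larger `n` gives longer contexts but sparser counts, which is why a real application would likely need smoothing or backoff.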