This is an intermediate report for the Coursera Data Science Capstone Project. The objective of this step is to understand and become familiar with the data. The exploratory data analysis performed here will point to the key ingredients for the prediction algorithm and app. In particular, the main objectives of this report are:
The data set consists of a collection of text files in four languages: English, Russian, German and Finnish. For each language, three text files exist: blogs, news and twitter. We will only consider the English language files for this project.
The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language. The data sources are newspapers, magazines, blogs (personal and professional) and Twitter updates.
* Basic information about the data set (file size, line, character and word counts for the blogs, news and twitter files):
| File Name | Size (MB) | Lines | Non-Empty Lines | Characters | Characters (no whitespace) | Word Count |
|---|---|---|---|---|---|---|
| en_US.blogs | 200.4242 | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| en_US.news | 196.2775 | 1010242 | 1010242 | 203223154 | 169860866 | 34494539 |
| en_US.twitter | 159.3641 | 2360148 | 2360148 | 162096031 | 134082634 | 30451128 |
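For reference, here is a minimal R sketch of how figures like these can be computed; the paths under `final/en_US/` are an assumption and may need to be adjusted:

```r
# Sketch: compute file size, line, character and word counts for each English file.
library(stringi)

files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    File_Name  = basename(f),
    Size_MB    = file.size(f) / 1024^2,
    Lines      = length(lines),
    Chars      = sum(nchar(lines)),
    Word_Count = sum(stri_count_words(lines))
  )
})
do.call(rbind, stats)
```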
As one can see from the previous table, the data files are quite large. Exploratory data analysis on the full data set would be too time consuming. In order to facilitate faster exploratory analysis, we will create a random sample of the English language blogs, news and twitter files. We will randomly sample 15,000 lines from each file. These samples will then be written to their own directory for subsequent analysis.
Sampling was limited to 15,000 lines per file to keep the analysis within the limits of my laptop's performance.
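A minimal sketch of the sampling step, assuming the raw files live in `final/en_US/` and the samples are written to a `sample/` directory (both paths are placeholders):

```r
# Sketch: draw a 15,000-line random sample from each file and save it
# to a separate directory for the exploratory analysis.
set.seed(1234)  # for reproducibility

sample_file <- function(infile, outfile, n = 15000) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, n), outfile)
}

dir.create("sample", showWarnings = FALSE)
sample_file("final/en_US/en_US.blogs.txt",   "sample/en_US.blogs.sample.txt")
sample_file("final/en_US/en_US.news.txt",    "sample/en_US.news.sample.txt")
sample_file("final/en_US/en_US.twitter.txt", "sample/en_US.twitter.sample.txt")
```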
Next, the sampled data is used to create a corpus, and the following clean-up steps are performed.
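The sketch below shows one typical set of clean-up transformations with the tm package (lower-casing, removing punctuation, numbers and English stop words, and stripping whitespace); the exact steps used in the full analysis may differ slightly:

```r
# Sketch: build a corpus from the sampled files and clean it up.
library(tm)

corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse whitespace
```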
A term-document matrix has been created for each corpus.
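Continuing from the cleaned corpus above, a minimal sketch of that step:

```r
# Sketch: build a term-document matrix from the cleaned corpus
# and peek at the first few terms.
tdm <- TermDocumentMatrix(corpus)
inspect(tdm[1:10, ])
```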
We will show the most significant content of the three corpora using a word cloud chart. It is of interest to get an idea of the most frequently occurring words in the documents. The following code computes word frequencies for each document and orders them from largest to smallest. We report the top 20 most frequent words in each file.
Below is a summary of the most frequent words from the sampled data:
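A sketch of the frequency computation and word cloud, reusing the term-document matrix built above (the wordcloud and RColorBrewer packages are assumed to be installed):

```r
# Sketch: word frequencies, top-20 list and word cloud.
library(wordcloud)
library(RColorBrewer)

freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 20)  # top 20 most frequent words

wordcloud(names(freq), freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```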
We have used the RWeka package to create unigrams, bigrams and trigrams, and the ggplot2 package to plot them in order to evaluate the frequency of the main words in each corpus.
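As an illustration, here is a sketch of the bigram case; the unigram and trigram versions only differ in the min/max values passed to Weka_control:

```r
# Sketch: tokenise the corpus into bigrams with RWeka and plot the
# 20 most frequent ones with ggplot2.
library(RWeka)
library(ggplot2)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)

top_bigrams <- data.frame(ngram = names(bigram_freq)[1:20],
                          freq  = bigram_freq[1:20])

ggplot(top_bigrams, aes(x = reorder(ngram, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency", title = "Top 20 bigrams")
```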
Prediction of the next word in a sentence will depend on the previous N-grams in that sentence or phrase. The prediction algorithm should therefore be based on 2-grams, 3-grams and higher-order n-grams. For example, given a phrase for which we want to predict the next word, separate predictions would be made based on the previous 2-gram, 3-gram, and so on.
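As a rough illustration of the idea (not the final model), the sketch below looks up the most frequent bigram continuation of the last word typed, using the hypothetical bigram_freq table built above; a real predictor would back off between trigrams, bigrams and unigrams:

```r
# Sketch: given the last word of the input, return the most frequent
# word that follows it in the sampled corpus, based on a sorted
# bigram frequency table with names like "last next".
predict_next <- function(last_word, bigram_freq) {
  pattern    <- paste0("^", last_word, " ")
  candidates <- bigram_freq[grepl(pattern, names(bigram_freq))]
  if (length(candidates) == 0) return(NA_character_)
  strsplit(names(candidates)[1], " ")[[1]][2]  # most frequent continuation
}

predict_next("happy", bigram_freq)
```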
After the exploratory analysis, the next steps are:
* Build the predictive model using the sampled data obtained from the previous analysis. The model will subsequently be tested and tweaked to strike a good balance between accuracy and speed.
* Develop the Shiny app and presentation to predict the next word based on user input.