Data Science Capstone : Milestone Report

The goal of this project is to show that we have become comfortable working with the data and that we are on track to create the prediction algorithm. It explains the exploratory analysis and the goals for the eventual app and algorithm. The document is concise: it covers only the major features of the data we have identified and briefly summarizes the plans for creating the prediction algorithm and Shiny app, in a way that would be understandable to a non-data-scientist manager.

This document makes use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Load the files

The files were downloaded from the given URL, unzipped, and read into R with readLines (the readtext function from the readtext package is a possible alternative). A parallelizeTask helper is used so that long-running tasks can be run with parallel processing.

library(quanteda)
# library(readtext)  # alternative reader, not used here

# Read each corpus file into a character vector, one line per element
blog    <- parallelizeTask(readLines, "en_US.blogs.txt")
news    <- parallelizeTask(readLines, "en_US.news.txt")
twitter <- parallelizeTask(readLines, "en_US.twitter.txt")
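
The parallelizeTask helper itself is not defined in this report. A minimal sketch of such a helper, assuming it simply registers a parallel backend around the task with the parallel and doParallel packages, might look like this:

library(parallel)
library(doParallel)

parallelizeTask <- function(task, ...) {
  # Register a parallel backend so functions that support it can use multiple cores
  ncores <- detectCores() - 1
  cl <- makeCluster(ncores)
  registerDoParallel(cl)
  # Run the task, then shut the cluster down
  result <- task(...)
  stopCluster(cl)
  result
}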

Summary of files

The summary of the three en_US files was captured using the Unix command "wc", which gives the line count, word count and character count of each file.

| file              | line_count | word_count | character_count |
|-------------------|-----------:|-----------:|----------------:|
| en_US.blogs.txt   |     899288 |   37334114 |       210160014 |
| en_US.news.txt    |    1010242 |   34365936 |       205811889 |
| en_US.twitter.txt |    2360148 |   30359852 |       167105338 |
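
These counts come straight from the shell; the same command can also be run from within R (assuming the files sit in the working directory):

# Lines, words and bytes for each corpus file, printed to the console
system("wc en_US.blogs.txt en_US.news.txt en_US.twitter.txt")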

Sampling

A 1% random sample was drawn from each file using the base sample function; the quanteda package is then used to build a document-feature matrix from the combined sample.

library(quanteda)

set.seed(1234)  # make the sample reproducible
# Draw a 1% sample (without replacement) from each source
blogSubset    <- sample(blog, size = round(length(blog) * 0.01), replace = FALSE)
twitterSubset <- parallelizeTask(sample, twitter, size = round(length(twitter) * 0.01), replace = FALSE)
newsSubset    <- parallelizeTask(sample, news, size = round(length(news) * 0.01), replace = FALSE)
enSubset      <- c(blogSubset, twitterSubset, newsSubset)  # combined sample for the dfm

20 Most frequently occurring words in the document

This model was built using the dfm function from the quanteda package and plotted with the ggplot2 package.
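
The plot itself is not reproduced here; a minimal sketch of how such a frequency plot can be built from the combined sample enSubset defined above is:

library(quanteda)
library(ggplot2)

# Tokenize the combined sample and build the document-feature matrix
enTokens <- tokens(enSubset, remove_punct = TRUE, remove_numbers = TRUE)
enDfm <- dfm(enTokens)

# Take the 20 most frequent words and plot their counts
top20 <- topfeatures(enDfm, 20)
topDf <- data.frame(word = names(top20), count = as.numeric(top20))

ggplot(topDf, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Top 20 words in the sample")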

Frequency of top 20 Bi-grams

Frequency of top 20 Tri-grams
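
A sketch of how the bi-gram and tri-gram frequencies can be obtained, using the enTokens object from the sketch above:

# Build bi-gram and tri-gram document-feature matrices from the same tokens
bigramDfm  <- dfm(tokens_ngrams(enTokens, n = 2))
trigramDfm <- dfm(tokens_ngrams(enTokens, n = 3))

# The 20 most frequent bi-grams and tri-grams
topfeatures(bigramDfm, 20)
topfeatures(trigramDfm, 20)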

Similarities between texts:

The quanteda package provides various functions to measure the similarity between texts (textstat_simil) and the distance between words. This information can be used to build the models.
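
As a rough sketch (using the three subsets defined above), the sources can be compared by cosine similarity:

# Treat each source as a single document and compare them by cosine similarity
# (in newer quanteda releases textstat_simil() lives in the quanteda.textstats package)
sourceDfm <- dfm(tokens(c(blogs   = paste(blogSubset, collapse = " "),
                          news    = paste(newsSubset, collapse = " "),
                          twitter = paste(twitterSubset, collapse = " ")),
                        remove_punct = TRUE))
textstat_simil(sourceDfm, method = "cosine")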

Evaluate words from a foreign language:

We can remove all English words and non-alphanumeric characters; the remaining words would be from a foreign language.
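
A minimal sketch of this idea, assuming englishWords is a character vector holding an English word list (such a list is not included in this report):

# Drop punctuation, numbers and symbols, then remove known English words;
# whatever remains is a candidate foreign-language token
cleanTokens   <- tokens(enSubset, remove_punct = TRUE, remove_numbers = TRUE,
                        remove_symbols = TRUE)
foreignTokens <- tokens_remove(cleanTokens, pattern = englishWords,
                               valuetype = "fixed")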

Further ideas

Two further ideas are identifying words that may not be in the corpora and using a smaller number of words in the dictionary to cover the same number of phrases. The size of the n-gram frequency dataset grows as n does, because longer n-grams are each used less often. We therefore have to optimize the algorithm to reduce data usage, and my idea is to use tri-grams at most. What we need to do is use less data to cover more of the word frequency, i.e. maximize coverage, as sketched below.
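
A rough sketch of the coverage calculation on the sampled dfm (enDfm from above):

# How many distinct words are needed to cover 50% and 90% of all word
# instances in the sample?
wordFreq <- topfeatures(enDfm, n = nfeat(enDfm))   # all words, sorted by count
coverage <- cumsum(wordFreq) / sum(wordFreq)
min(which(coverage >= 0.5))
min(which(coverage >= 0.9))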

There are many statistical models for smoothing the probabilities; for example, Katz's back-off model and Good-Turing smoothing are good candidates.
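
Roughly, the back-off idea for a tri-gram model can be sketched as follows, where $C(\cdot)$ are the observed counts, $d$ is a Good-Turing discount and $\alpha$ is the probability mass left for the lower-order model (details of the full Katz model are omitted):

$$
P(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
d \, \dfrac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})} & \text{if } C(w_{i-2}\, w_{i-1}\, w_i) > 0 \\[1ex]
\alpha(w_{i-2}, w_{i-1}) \, P(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
$$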

The next task consists of generating the n-grams and frequencies from the sampled “training” dataset.

Once the n-gram creation process is smoothed out, the n-grams are stored in a dictionary or data.frame. Markov chain packages have also been suggested for storing and retrieving the n-grams.
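
A sketch of one possible data.frame layout, assuming the trigramDfm built earlier, keyed on the first two words so the candidate third word can be looked up quickly:

# Turn the tri-gram counts into a lookup table: prefix (first two words) -> next word
trigramCounts <- topfeatures(trigramDfm, n = nfeat(trigramDfm))
trigramDf <- data.frame(ngram = names(trigramCounts),
                        count = as.numeric(trigramCounts),
                        stringsAsFactors = FALSE)
parts <- strsplit(trigramDf$ngram, "_", fixed = TRUE)   # tokens_ngrams joins words with "_"
trigramDf$prefix   <- vapply(parts, function(p) paste(p[1:2], collapse = " "), character(1))
trigramDf$nextWord <- vapply(parts, function(p) p[3], character(1))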

Predict the next word using a simple back-off approach with weighting to pick the most probable next word, then apply it to test data to calculate its accuracy.
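
A minimal sketch of such a back-off lookup, assuming a trigramDf as above and an analogous bigramDf with prefix (one word) and nextWord columns:

# Try the tri-gram table first; if the two-word prefix is unseen,
# back off to the bi-gram table
predictNext <- function(phrase, trigramDf, bigramDf) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[(n - 1):n], collapse = " ")
    hits <- trigramDf[trigramDf$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$nextWord[which.max(hits$count)])
  }
  hits <- bigramDf[bigramDf$prefix == words[n], ]
  if (nrow(hits) > 0) return(hits$nextWord[which.max(hits$count)])
  NA_character_
}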

Since currently available predictive text models can run on mobile phones, the final model needs to handle memory efficiently.
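
One common way to keep the tables small (a sketch, not part of the analysis above) is to prune rare n-grams and check the size of what remains:

# Keep only tri-grams seen more than once and report the resulting object size
trigramDf <- trigramDf[trigramDf$count > 1, ]
format(object.size(trigramDf), units = "MB")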