Preparation and Plan for a Predictive Model of English Text

Project

Build a basic n-gram model that predicts the next word from the previous 1, 2, or 3 words.

This project summary uses tables and plots to give the reader a glimpse into the dataset that will become the basis of a prediction algorithm. Exploratory analysis highlights the major features of the data. The report concludes with a concise plan for the eventual prediction algorithm and application, and is posted here: http://rpubs.com/ for ease of access. This summary is broken down into the following sections:

Exploratory Analysis: download the data and load it into R for analysis.
Statistics Summary: summary statistics for the data set created from the text file ‘en_US.blogs.txt’.
Primary Points of Interest so far.
Project Prediction Plan.

1. Exploratory Analysis

Download the data, load it into R, load the required libraries, and clean. Examine the structure, object size, and word count of the blogs text file, then free memory with garbage collection.

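library(stringi)  # provides stri_count_words(), used further down

# Read the blogs file, one entry per line
blogs <- readLines("/Users/susanlmartin/coursera/course10/data/final/en_US/en_US.blogs.txt")
# Number of characters in each blog entry
cnum.blogs <- nchar(blogs)
#cnum.news<-nchar(news)
#cnum.twitter<-nchar(twitter)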

str(blogs)

##  chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...

object.size(blogs)

## 267758632 bytes

gc(verbose = getOption("verbose"), reset = FALSE, full = TRUE)

##            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
## Ncells  2089079 111.6    3104917 165.9         NA  2156027 115.2
## Vcells 31123072 237.5   41284637 315.0      16384 33738118 257.5

# Number of words in each blog entry
cword.blogs <- stri_count_words(blogs)
#head(cword.blogs, 10)

2. Statistics Summary

##    0%    1%    5%   25%   50%   75%   95%   99%  100% 
##     1     8    15    47   156   329   694  1112 40833
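The chunk that produced the table above is hidden, so the call is not shown. A minimal sketch of how such a table could be produced with `quantile()`, assuming it summarises a per-entry length vector such as `cnum.blogs` (which vector was actually used is an assumption; the toy vector `entry_lengths` is hypothetical):

```r
# Hypothetical stand-in for a per-entry length vector such as cnum.blogs.
entry_lengths <- c(1, 12, 47, 156, 329, 694, 1112, 40833)

# The same probability levels that label the table above.
probs <- c(0, 0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99, 1)

# quantile() returns a named numeric vector, one value per level.
length_quantiles <- quantile(entry_lengths, probs = probs)
print(length_quantiles)
```

On the real data, `entry_lengths` would be replaced by the vector of interest, for example `cnum.blogs` or `cword.blogs`.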

Sample the data: draw a random subset of 5,000 entries from ‘Blogs’.

set.seed(3202016)
sample_size<-5000

#twitter.s<-sample(twitter, sample_size)
#news.s<-sample(news, sample_size)
blogs.s<-sample(blogs, sample_size)
# take a look at typical entries in the dataset (entries 3 and 2, in that order)
blogs.s[3:2]

## [1] "The cuisine is continental, but more importantly, the beer list resembles the likes we have seen on better part of the continent, too. Food may be king here, but it is seldom to see such a well-chosen list of 32 beers in this country, including 4 Nøgne Ø's, 2 Atna beers and 9 suberb Belgians. The Norwegian macro beer establishment is represented by Aass, another good choice in my opinion. In terms of selection this place ranks third in the beer desert of Oslo, but when you add beer knowledge and service it may well be a contender for the gold medal."
## [2] "On Saturday we had an East End wander and intended little by way of beer as we were going out for dinner later, but we did call into Mason and Taylor. I wrote positively about it here, some time ago. On Saturday it was quiet, being around two in the afternoon, but the contrast couldn't have been greater. Young enthusiastic staff all said hello as we walked in. My choice of Saltaire Rye IPA (£3.90) brought an immediate offer of a taster and I was asked if I knew the beer. We were advised that other samples were freely available. \"Just ask.\". Brilliant. A blues ensemble with New Orleans touches, struck up and we thoroughly enjoyed two more pints of the excellent Saltaire beer. I did try a couple of BrewDog keg tasters and quite enjoyed them. Motueka seemed good, but £8 a pint is too much for my sensibilities."

3. Primary Points of Interest

The main point of interest so far is a calculation of n-grams: the 20 most frequent n-grams for n = 1, 2, 3.
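The n-gram calculation itself is not shown in the report. As a rough sketch only, the counting could look like the following in base R. The helper names (`tokenize`, `line_ngrams`, `top_ngrams`), the cleaning rule, and the toy corpus are all hypothetical, not the author's code; on the real data the input would be a sample such as `blogs.s`.

```r
# Toy corpus (hypothetical); the report would use a sample such as blogs.s.
toy_text <- c("the cat sat on the mat",
              "the cat ate the fish",
              "a dog sat on the mat")

# Lower-case, replace everything except a-z, apostrophes and spaces,
# then split on whitespace and drop empty tokens.
tokenize <- function(line) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(line))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words[nzchar(words)]
}

# All n-grams of one tokenized line, as space-separated strings.
line_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of n-grams over all lines, most frequent first,
# truncated to the top k entries.
top_ngrams <- function(lines, n, k = 20) {
  grams <- unlist(lapply(lines, function(l) line_ngrams(tokenize(l), n)))
  counts <- sort(table(grams), decreasing = TRUE)
  head(counts, k)
}

unigrams <- top_ngrams(toy_text, 1)
bigrams  <- top_ngrams(toy_text, 2)
trigrams <- top_ngrams(toy_text, 3)
print(bigrams)
```

N-grams are built per line here, so they never span two blog entries. That is a design choice of this sketch, not something the report states.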

4. Project Prediction Plan

At this point, the data has been downloaded, loaded into R, and cleaned. The primary features of the data have been identified and shown in the histograms and tables above. Next, the prediction algorithm will be built on a training sample and evaluated on a held-out test sample, with error rate and accuracy measured. A Shiny app will then be constructed so that a user can enter words and see the predicted next words.
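The plan does not yet fix a prediction method. One simple possibility, shown purely as an illustration, is a most-frequent-continuation lookup that backs off from the previous 3 words to 2 and then 1, with accuracy measured on held-out text. There is no smoothing and there are no probabilities here. The function names (`build_model`, `predict_next`, `evaluate`) and the toy sentences are hypothetical.

```r
# Hypothetical toy training and test sentences; not the report's data.
train <- c("the cat sat on the mat",
           "the cat sat on the sofa",
           "the dog sat on the mat")
test  <- c("the dog sat on the sofa")

tokenize <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  words[nzchar(words)]
}

# Map each context (the previous 1..max_context words) to its most frequent
# next word. Ties go to the alphabetically first word. Growing vectors in a
# loop is slow; acceptable only for a small sketch.
build_model <- function(lines, max_context = 3) {
  contexts <- character(0)
  nexts <- character(0)
  for (line in lines) {
    words <- tokenize(line)
    if (length(words) < 2) next
    for (i in 2:length(words)) {
      for (k in seq_len(min(max_context, i - 1))) {
        contexts <- c(contexts, paste(words[(i - k):(i - 1)], collapse = " "))
        nexts <- c(nexts, words[i])
      }
    }
  }
  unlist(lapply(split(nexts, contexts),
                function(w) names(which.max(table(w)))))
}

# Try the longest available context first, then back off to shorter ones.
predict_next <- function(model, phrase, max_context = 3) {
  words <- tokenize(phrase)
  for (k in rev(seq_len(min(max_context, length(words))))) {
    context <- paste(tail(words, k), collapse = " ")
    if (context %in% names(model)) return(model[[context]])
  }
  NA_character_  # no known context
}

# Fraction of held-out words that are predicted correctly.
evaluate <- function(model, lines) {
  hits <- 0
  total <- 0
  for (line in lines) {
    words <- tokenize(line)
    if (length(words) < 2) next
    for (i in 2:length(words)) {
      guess <- predict_next(model, paste(words[1:(i - 1)], collapse = " "))
      hits <- hits + isTRUE(guess == words[i])
      total <- total + 1
    }
  }
  hits / total
}

model <- build_model(train)
accuracy <- evaluate(model, test)
print(predict_next(model, "a bird on the"))
print(accuracy)
```

The error rate mentioned in the plan is then `1 - accuracy`. A Shiny app could call a function like `predict_next()` on the user's input.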

To provide the best possible user experience, application performance will be profiled and optimized, balancing memory usage and speed of operation. A review of the final output may indicate the need to clean data a bit further.
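As a hedged sketch of the profiling step, base R's `object.size()` and `system.time()` give a first look at the memory and speed trade-off. The lookup table below is a hypothetical stand-in for a trained model, not the report's model.

```r
# Hypothetical lookup table standing in for a trained model:
# a named character vector mapping a key to a predicted word.
set.seed(1)
keys <- paste0("w", seq_len(50000))
model <- setNames(sample(keys), keys)

# Memory footprint of the model object.
model_size <- object.size(model)
print(format(model_size, units = "MB"))

# Elapsed time for 1,000 lookups by name.
queries <- sample(keys, 1000)
timing <- system.time(results <- unname(model[queries]))
print(timing[["elapsed"]])
```

For a per-function breakdown of where time is spent, `Rprof()` and `summaryRprof()` from the utils package are the usual next step.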

Request for feedback

Thank you for your review, including any suggestions for the described prediction algorithm and Shiny app.

Note that the echo = FALSE parameter was added to some of the code chunks to prevent printing of the R code that generates programmatic output, including plots.