This is a milestone report for the Coursera Capstone project, a Natural Language Processing (NLP) analysis. The report (1) shows how to download and load the project data, (2) presents basic summary statistics about the datasets, (3) reports interesting findings, and (4) requests peer feedback on project plans.
The following code downloads and loads the requisite US English datasets:
# Download and unzip the corpus if it is not already present
if (!file.exists('./corpus.zip')) {
  download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',
                destfile = 'corpus.zip')
  unzip('corpus.zip', overwrite = TRUE)
}

# Read each US English source; skipNul drops embedded NUL characters
en_us_twitter <- readLines('./final/en_US/en_US.twitter.txt', skipNul = TRUE)
en_us_news    <- readLines('./final/en_US/en_US.news.txt',    skipNul = TRUE)
en_us_blogs   <- readLines('./final/en_US/en_US.blogs.txt',   skipNul = TRUE)
For ease of data manipulation, the files are loaded into a single data table, keeping the source information intact in case it is needed later:
library(data.table)
src_levels <- c("twitter", "news", "blogs")  # keep source labels consistent across rows
all_lines <- rbind(
  data.table(src = factor("twitter", levels = src_levels), lines = en_us_twitter),
  data.table(src = factor("news",    levels = src_levels), lines = en_us_news),
  data.table(src = factor("blogs",   levels = src_levels), lines = en_us_blogs))
The data comes to close to 600 MB in total, with Twitter leading at 301.4 MB, blogs at 248.5 MB, and news headlines from Reuters at 19.2 MB. Per-line statistics are as follows:
summary(all_lines)
##       src             lines
##  twitter:2360148   Length:3336694
##  news   :  77258   Class :character
##  blogs  : 899288   Mode  :character
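For reference, the file-size figures quoted above can be checked roughly as follows (a sketch only; the paths assume the unzipped layout used earlier):

# Approximate on-disk size of each source file, in MB
files <- c(twitter = './final/en_US/en_US.twitter.txt',
           news    = './final/en_US/en_US.news.txt',
           blogs   = './final/en_US/en_US.blogs.txt')
sapply(files, function(f) round(file.size(f) / 2^20, 1))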
Frequencies of the most common words and n-grams are presented below.
So far, analysis has been focused on organizing the data to generate per-word frequency statistics and n-gram frequencies for 2-word and 3-word sequences. I’ve explored the data by first breaking lines down into phrases, using regular expressions to split on punctuation (,.!? etc.), on the assumption that between-sentence and between-clause word correlations have little predictive next-word value. Subsequently, each phrase is split on white-space word breaks, and word frequencies are then explored.
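A minimal sketch of that tokenization, assuming the combined all_lines table from above; the exact regular expression and cleaning steps used in the analysis may differ (and in practice this runs on the small training sample described next):

# Split each line on clause-ending punctuation, then split phrases on whitespace
phrases   <- unlist(strsplit(tolower(all_lines$lines), "[,.!?;:]+"))
words     <- unlist(strsplit(trimws(phrases), "\\s+"))
words     <- words[words != ""]
word_freq <- sort(table(words), decreasing = TRUE)  # per-word frequency table
head(word_freq, 10)                                 # most frequent words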
The approach so far has broken the overall dataset of 3-million-plus lines of text into training and testing datasets, with partitioning stratified by data source at 1/1000 (primarily to ease exploratory computations). The training dataset is then further split into a verification dataset at 1/20. The resulting training dataset has 3171 lines of text. The n-gram frequencies of this training set are as follows. Most interesting finding: “ha ha ha” is the most frequent word triple (actually, it was most frequent under a previous random partition sample; that is no longer the case in the run behind this RPubs report. Another finding: it seems I already have some sampling variability across runs, so I should set the random seed).
In terms of n-grams, the following counts of distinct sequences are observed in the training partition: single words: 12137; word pairs: 37663; word triples: 46834.
Interestingly, the number of unique triples is not remarkably greater than the count of distinct doubles.
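A rough sketch of the stratified sampling and n-gram counting described above; the seed value and the count_ngrams helper are illustrative assumptions, not the exact code used:

set.seed(1234)  # a seed makes the sampled partition reproducible (addresses the note above)

# Stratified 1/1000 sample per source for exploration; the further 1/20
# verification split is omitted here for brevity
train_idx <- all_lines[, .I[sample(.N, ceiling(.N / 1000))], by = src]$V1
train     <- all_lines[train_idx]

# Count n-grams of length n within a vector of phrases, using the same
# simple whitespace tokenization as in the earlier sketch
count_ngrams <- function(phrases, n) {
  tokens <- strsplit(trimws(tolower(phrases)), "\\s+")
  grams <- unlist(lapply(tokens, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

phrases  <- unlist(strsplit(train$lines, "[,.!?;:]+"))
unigrams <- count_ngrams(phrases, 1)
bigrams  <- count_ngrams(phrases, 2)
trigrams <- count_ngrams(phrases, 3)
length(unigrams); length(bigrams); length(trigrams)  # counts of distinct n-grams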
I’m planning to develop a prediction algorithm that uses n-gram frequencies to select the next word based on the preceding n-word sequence (likely the 2 preceding words, i.e. completing a 3-gram). The Shiny app will, I think, give the most probable word suggestions based on the probability of the current word being typed, the current word in the 2-gram context, and the current word in the 3-gram context; in other words, the user will have 3 auto-complete options. Otherwise, I confess my plans are pretty sparse as I’m playing catch-up: the course start date coincided with my vacation plans. Barcelona was great – highly recommended. Not the best location for contemplating NLP prediction approaches, though. :-)
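Coming back to the prediction plan, a very rough illustration of the trigram lookup is sketched below, reusing the trigrams table from the sampling sketch above; predict_next is a hypothetical helper, and the eventual app will still need smoothing and back-off:

# Hypothetical next-word lookup: given the two preceding words, return the
# highest-frequency third words from the (already sorted) trigram table.
# No smoothing or back-off yet; assumes plain lower-case word tokens.
predict_next <- function(w1, w2, trigram_freq, k = 3) {
  prefix  <- paste(tolower(w1), tolower(w2))
  matches <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
  if (length(matches) == 0) return(character(0))
  head(sub(paste0("^", prefix, " "), "", names(matches)), k)
}

predict_next("thanks", "for", trigrams)  # top-3 candidate next words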