The Data

This report presents preliminary results from analyzing several natural language data sources taken from http://www.corpora.heliohost.org/aboutcorpus.html. The source includes files in several languages, but only the English files are used for this analysis. The three files are sourced from blogs, Twitter, and news sites.

library(knitr)

# Read the three English-language source files (local paths)
newsData <- readLines("C:/Users/brend/Documents/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8")
twitData <- readLines("C:/Users/brend/Documents/Capstone/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
blogData <- readLines("C:/Users/brend/Documents/Capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")


# For each source, record its name, line count, and total word count
# (words approximated by splitting each line on single spaces)
DFnews <- c("News", length(newsData), sum(sapply(strsplit(newsData, " "), length)))
DFtwit <- c("Twitter", length(twitData), sum(sapply(strsplit(twitData, " "), length)))
DFBlog <- c("Blogs", length(blogData), sum(sapply(strsplit(blogData, " "), length)))

dfSummary <- data.frame(rbind(DFnews, DFtwit, DFBlog))


kable(x=dfSummary, col.names=c("Source", "Line Count", "Word Count"), digits=1)
Source     Line Count   Word Count
News            77259      2643969
Twitter       2360148     30373543
Blogs          899288     37334131

Part of Speech Analysis

In addition to frequency-driven analysis, analysis of parts of speech can help improve prediction rates. Patterns in parts of speech are more generalizable than N-grams, and part-of-speech analysis allows information beyond the frequency counts of an N-gram model to be used in making predictions.

The NLP package in R is used to annotate words with their part of speech. Applying the analysis to every line is computationally intensive, so the results below are based on samples of 500 lines drawn at random from the blog, Twitter, and news data files. The results indicate some differences in how parts of speech are used across sources. In all three sources, "NN" (nouns, singular or mass) are the most common words. Prepositions ("IN"), determiners ("DT"), and plural nouns ("NNS") are also common. This work paves the way for analyzing patterns in parts of speech.
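For reference, the annotation step looks roughly like the sketch below. This is a minimal sketch that assumes the openNLP package supplies the Maxent annotators consumed by NLP's annotate(); the blog-only sample and the seed are illustrative, not the exact setup used for the results above.

library(NLP)
library(openNLP)

set.seed(42)  # illustrative seed
sampleText <- as.String(paste(sample(blogData, 500), collapse = " "))

# Annotate sentences and words first, then tag each word's part of speech
baseAnnotations <- annotate(sampleText,
                            list(Maxent_Sent_Token_Annotator(),
                                 Maxent_Word_Token_Annotator()))
posAnnotations <- annotate(sampleText, Maxent_POS_Tag_Annotator(), baseAnnotations)

# Tabulate POS tags across the sample
wordAnnotations <- subset(posAnnotations, type == "word")
posTags <- sapply(wordAnnotations$features, `[[`, "POS")
sort(table(posTags), decreasing = TRUE)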

This information will help inform my overall strategy. It may be useful to develop different prediction methods for different parts of speech. English has relatively few prepositions, yet they are among the most commonly used parts of speech, so a simple frequency or N-gram model of prepositions should be fairly predictive. Nouns, by contrast, are the most common part of speech, and I hypothesize that they are fairly idiosyncratic and depend on the subject matter; modeling based on context clues may be more appropriate for them.

Plan of Attack

I plan to use an algorithm that combines three pieces of information to make a prediction.

  1. An N-gram frequency-based prediction of the next word given the preceding words (a minimal bigram sketch follows this list).
  2. A prediction based on patterns of parts of speech. If the part of speech of the next word can be predicted accurately, this reduces the set of possible words and may allow for a more efficient prediction in addition to improved accuracy.
  3. The association of words used in the same line, regardless of order. If two words frequently appear in the same line, seeing one increases the probability of seeing the other.
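As a concrete illustration of the first component, the sketch below builds a simple bigram table and looks up the most frequent continuation of a given word. The function names, the lowercase-letters-only tokenization, and the 5000-line sample are illustrative assumptions, not the final design.

# Minimal bigram sketch (illustrative only, not the final algorithm)
buildBigrams <- function(lines) {
  # Tokenize on anything other than lowercase letters and apostrophes;
  # bigrams spanning line boundaries are tolerated in this sketch
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[words != ""]
  bigrams <- paste(head(words, -1), tail(words, -1))
  sort(table(bigrams), decreasing = TRUE)  # most frequent bigrams first
}

predictNext <- function(bigramCounts, word) {
  # Assumes 'word' is a plain lowercase token (no regex metacharacters)
  matches <- grep(paste0("^", word, " "), names(bigramCounts), value = TRUE)
  if (length(matches) == 0) return(NA_character_)
  sub(paste0("^", word, " "), "", matches[1])  # table is sorted, so [1] is most frequent
}

bigramCounts <- buildBigrams(sample(blogData, 5000))
predictNext(bigramCounts, "the")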

I have not yet decided how to integrate the three approaches, but prediction algorithms that combine multiple models tend to be the most accurate.
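One possibility, sketched below purely for illustration, is a weighted combination of candidate scores from the three components. The function name, the score-vector interface, and the weights are all placeholders, since the integration method is still undecided.

# Illustrative weighted combination of three score vectors, each a named
# numeric vector mapping candidate words to scores. Weights are placeholders.
combinePredictions <- function(ngramScores, posScores, assocScores,
                               w = c(0.5, 0.25, 0.25)) {
  candidates <- unique(c(names(ngramScores), names(posScores), names(assocScores)))
  # Score each candidate under one component, defaulting to 0 when absent
  lookup <- function(scores) ifelse(candidates %in% names(scores),
                                    scores[candidates], 0)
  total <- w[1] * lookup(ngramScores) + w[2] * lookup(posScores) +
           w[3] * lookup(assocScores)
  names(total) <- candidates
  names(sort(total, decreasing = TRUE))[1]  # highest combined score wins
}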