The Data

This report presents preliminary results from analyzing several natural language data sources taken from http://www.corpora.heliohost.org/aboutcorpus.html. The source includes files in several languages, but only the English files are used for this analysis. The three files are sourced from blogs, Twitter, and news sites.

library(knitr)

# Read the three English-language source files (local paths)
newsData <- readLines("C:/Users/brend/Documents/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8")
twitData <- readLines("C:/Users/brend/Documents/Capstone/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
blogData <- readLines("C:/Users/brend/Documents/Capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")


# For each source, record its name, line count, and total word count
# (words approximated by splitting each line on single spaces)
DFnews <- c("News", length(newsData), sum(sapply(strsplit(newsData, " "), length)))
DFtwit <- c("Twitter", length(twitData), sum(sapply(strsplit(twitData, " "), length)))
DFBlog <- c("Blogs", length(blogData), sum(sapply(strsplit(blogData, " "), length)))

dfSummary <- data.frame(rbind(DFnews, DFtwit, DFBlog))


kable(x=dfSummary, col.names=c("Source", "Line Count", "Word Count"), digits=1)
Source     Line Count   Word Count
News            77259      2643969
Twitter       2360148     30373543
Blogs          899288     37334131

Part of Speech Analysis

In addition to frequency-driven analysis, analysis of parts of speech can help improve prediction rates. Patterns in parts of speech are more generalizable than N-grams, and part-of-speech analysis allows information beyond the frequency counts of an N-gram model to be used in making predictions.

The NLP package in R is used to annotate words with their part of speech. Applying the analysis to every line is computationally intensive, so the results below are based on samples of 500 lines drawn at random from the blog, Twitter, and news data files. The results indicate some differences in how parts of speech are used across sources. In all three sources, "NN" (nouns, singular or mass) are the most common words. Prepositions ("IN"), determiners ("DT"), and plural nouns ("NNS") are also common. This work paves the way for analyzing patterns in parts of speech.
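For reference, the annotation step looks roughly like the sketch below. This is a minimal sketch that assumes the openNLP package supplies the Maxent annotators consumed by NLP's annotate(); the blog-only sample and the seed are illustrative, not the exact setup used for the results above.

library(NLP)
library(openNLP)

set.seed(42)  # illustrative seed
sampleText <- as.String(paste(sample(blogData, 500), collapse = " "))

# Annotate sentences and words first, then tag each word's part of speech
baseAnnotations <- annotate(sampleText,
                            list(Maxent_Sent_Token_Annotator(),
                                 Maxent_Word_Token_Annotator()))
posAnnotations <- annotate(sampleText, Maxent_POS_Tag_Annotator(), baseAnnotations)

# Tabulate POS tags across the sample
wordAnnotations <- subset(posAnnotations, type == "word")
posTags <- sapply(wordAnnotations$features, `[[`, "POS")
sort(table(posTags), decreasing = TRUE)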

This information will help inform my overall strategy. It may be useful to develop different prediction methods for different parts of speech. English has relatively few prepositions, yet they are among the most commonly used parts of speech, so a simple frequency or N-gram model of prepositions should be fairly predictive. Nouns, by contrast, are the most common part of speech, and I hypothesize that they are fairly idiosyncratic and depend on the subject matter; modeling based on context clues may be more appropriate for them.

Plan of Attack

I plan to use an algorithm that combines three pieces of information to make a prediction.

  1. An N-gram frequency-based prediction of the next word given the preceding words (a minimal bigram sketch follows this list).
  2. A prediction based on patterns of parts of speech. If the part of speech of the next word can be predicted accurately, this reduces the set of possible words and may allow for a more efficient prediction in addition to improved accuracy.
  3. The association of words used in the same line, regardless of order. If two words frequently appear in the same line, seeing one increases the probability of seeing the other.
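As a concrete illustration of the first component, the sketch below builds a simple bigram table and looks up the most frequent continuation of a given word. The function names, the lowercase-letters-only tokenization, and the 5000-line sample are illustrative assumptions, not the final design.

# Minimal bigram sketch (illustrative only, not the final algorithm)
buildBigrams <- function(lines) {
  # Tokenize on anything other than lowercase letters and apostrophes;
  # bigrams spanning line boundaries are tolerated in this sketch
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[words != ""]
  bigrams <- paste(head(words, -1), tail(words, -1))
  sort(table(bigrams), decreasing = TRUE)  # most frequent bigrams first
}

predictNext <- function(bigramCounts, word) {
  # Assumes 'word' is a plain lowercase token (no regex metacharacters)
  matches <- grep(paste0("^", word, " "), names(bigramCounts), value = TRUE)
  if (length(matches) == 0) return(NA_character_)
  sub(paste0("^", word, " "), "", matches[1])  # table is sorted, so [1] is most frequent
}

bigramCounts <- buildBigrams(sample(blogData, 5000))
predictNext(bigramCounts, "the")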

I have not yet decided how to integrate the three approaches, but prediction algorithms that combine multiple models tend to be the most accurate.
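One possibility, sketched below purely for illustration, is a weighted combination of candidate scores from the three components. The function name, the score-vector interface, and the weights are all placeholders, since the integration method is still undecided.

# Illustrative weighted combination of three score vectors, each a named
# numeric vector mapping candidate words to scores. Weights are placeholders.
combinePredictions <- function(ngramScores, posScores, assocScores,
                               w = c(0.5, 0.25, 0.25)) {
  candidates <- unique(c(names(ngramScores), names(posScores), names(assocScores)))
  # Score each candidate under one component, defaulting to 0 when absent
  lookup <- function(scores) ifelse(candidates %in% names(scores),
                                    scores[candidates], 0)
  total <- w[1] * lookup(ngramScores) + w[2] * lookup(posScores) +
           w[3] * lookup(assocScores)
  names(total) <- candidates
  names(sort(total, decreasing = TRUE))[1]  # highest combined score wins
}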