This report shows preliminary results from analyzing several natural language data sources taken from http://www.corpora.heliohost.org/aboutcorpus.html. The source includes files in several languages, but only the English files are used for this analysis. The three files are sourced from blogs, Twitter, and news sites.
library(knitr)
# Read the three English-language files (paths are local to the author's machine)
newsData <- readLines("C:/Users/brend/Documents/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8")
twitData <- readLines("C:/Users/brend/Documents/Capstone/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
blogData <- readLines("C:/Users/brend/Documents/Capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
# Count words by splitting each line on single spaces
countWords <- function(lines) sum(lengths(strsplit(lines, " ")))
# One row per source, with numeric count columns and no row names
dfSummary <- data.frame(
  Source = c("News", "Twitter", "Blogs"),
  Lines = c(length(newsData), length(twitData), length(blogData)),
  Words = c(countWords(newsData), countWords(twitData), countWords(blogData))
)
kable(dfSummary, col.names = c("Source", "Line Count", "Word Count"), row.names = FALSE)
| Source | Line Count | Word Count |
|---|---|---|
| News | 77259 | 2643969 |
| Twitter | 2360148 | 30373543 |
| Blogs | 899288 | 37334131 |
In addition to frequency-driven analysis, analysis of parts of speech can help improve prediction rates. Patterns in parts of speech should generalize better than N-grams, because a single sequence of tags covers many word sequences that never appear verbatim in the training data. Part-of-speech analysis also makes additional information available for prediction beyond the frequency information used in an N-gram model.
The NLP package in R is used to annotate words with their part of speech. Applying the analysis to every line is computationally intensive, so the results below are based on samples of 500 lines drawn at random from the blog, Twitter, and news data files. The results indicate some differences in how parts of speech are used across sources. In all three sources, “NN” (noun, singular or mass) is the most common tag. Prepositions (“IN”), determiners (“DT”), and plural nouns (“NNS”) are also common. This work paves the way for analyzing patterns in parts of speech.
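The sampling and tagging step described above could be sketched roughly as follows. This is an illustration, not the code behind the reported results: it assumes the Maxent annotators from the openNLP package (used together with NLP, and requiring a Java runtime), and the function name `tagCounts`, its arguments, and the seed value are hypothetical.

```r
# Sketch only: assumes the NLP and openNLP packages and a Java runtime are installed.
# tagCounts() is a hypothetical helper, not a function from either package.
tagCounts <- function(lines, n = 500, seed = 1) {
  set.seed(seed)  # assumed seed, only to make the random sample reproducible
  sampled <- sample(lines, min(n, length(lines)))
  text <- NLP::as.String(paste(sampled, collapse = "\n"))

  # Sentence and word boundaries must be annotated before POS tags can be added
  tokens <- NLP::annotate(text, list(
    openNLP::Maxent_Sent_Token_Annotator(),
    openNLP::Maxent_Word_Token_Annotator()
  ))
  tagged <- NLP::annotate(text, openNLP::Maxent_POS_Tag_Annotator(), tokens)

  # Keep word-level annotations and extract the POS tag of each word
  words <- subset(tagged, type == "word")
  tags <- sapply(words$features, `[[`, "POS")

  # Tag frequencies, most common first
  sort(table(tags), decreasing = TRUE)
}

# Example call on one of the sources loaded earlier:
# head(tagCounts(blogData), 5)
```

Running the same call on each of the three sources gives frequency tables that can be compared side by side.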
This information will help inform my overall strategy. It may be useful to develop different prediction methods for different parts of speech. English has relatively few prepositions, yet prepositions are one of the most commonly used parts of speech, so a simple frequency or N-gram model of prepositions should be fairly predictive. Nouns, however, are the most common part of speech, and I hypothesize that they are fairly idiosyncratic and depend on the subject matter. Modeling them based on context clues may be more appropriate.
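To make the "simple frequency or N-gram model" concrete, here is a minimal base R sketch of a bigram model that predicts the most frequent follower of a word. The function names `buildBigrams` and `predictNext` and the toy sentences are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is an assumption rather than the method used in this report.

```r
# Count bigrams in a character vector of lines (hypothetical helper).
# Assumes at least one line contains two or more words.
buildBigrams <- function(lines) {
  # Lowercase, split on whitespace, and drop empty tokens
  tokens <- lapply(strsplit(tolower(lines), "\\s+"), function(w) w[nzchar(w)])
  # Pair each word with its successor within the same line
  pairs <- do.call(rbind, lapply(tokens, function(w) {
    if (length(w) < 2) return(NULL)
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  }))
  # Sum occurrences of each distinct (first, second) pair
  aggregate(n ~ first + second, data = cbind(pairs, n = 1L), FUN = sum)
}

# Return the most frequent word following `word`, or NA if it was never seen
predictNext <- function(counts, word) {
  cand <- counts[counts$first == tolower(word), ]
  if (nrow(cand) == 0) return(NA_character_)
  as.character(cand$second[which.max(cand$n)])
}

# Toy data, for illustration only
toy <- c("on the table", "on the floor", "in the house", "on a chair")
toyCounts <- buildBigrams(toy)
predictNext(toyCounts, "on")  # "the": "on the" occurs twice, "on a" once
```

A preposition such as "on" has few distinct followers in this toy data, which is the situation where a plain frequency lookup can be expected to work well.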
I plan on using an algorithm that combines three pieces of information to make a prediction.
I have not decided how to integrate all three approaches, but prediction algorithms that combine multiple models tend to be among the most accurate.