---
title: "Milestone Report"
author: William Surles
date: July 22, 2015
output:
  html_document:
    self_contained: true
    theme: flatly
---


Introduction

This document shows my exploratory analysis of the text documents. I create a corpus and then n-gram frequency matrices of words, chart the most frequent 1-grams, 2-grams, and 3-grams, and briefly summarize my plans for creating the prediction algorithm.

I have three different types of language documents: Twitter, news, and blogs. I am looking for the most common series of words so I can better predict what word a person might want to use next in a sentence.
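To make this concrete, here is a minimal sketch (with invented counts, not results from this data) of how n-gram frequencies can drive the prediction: given the last words a person typed, find the most frequent stored n-gram that starts with them and suggest its final word.

## Hypothetical illustration with made-up counts, only to show the idea
ngram_counts <- data.frame(
  ngram = c("thanks for the", "thanks for all", "at the end"),
  count = c(52, 17, 33),
  stringsAsFactors = FALSE)

predictNext <- function(counts, prefix) {
  ## keep n-grams starting with the typed prefix, return the last word of the most frequent one
  hits <- counts[grepl(paste0("^", prefix, " "), counts$ngram), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$ngram[which.max(hits$count)]
  tail(strsplit(best, " ")[[1]], 1)
}

predictNext(ngram_counts, "thanks for")  # returns "the"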

Libraries

library(readr)
library(dplyr)
library(stringr)
library(stringi)
library(printr)
library(knitr)
library(tm)
library(RWeka)
library(slam)
library(wordcloud)
library(rCharts)

Summary of the Data

First, I pull the data into R and explore the shape of the data. I have a lot of text here. This will be a good sample of the English language.

## Load the 3 data sets
text_twitter <- read_lines("final/en_US/en_US.twitter.txt") 
text_blogs <- read_lines("final/en_US/en_US.blogs.txt") 
text_news <- read_lines("final/en_US/en_US.news.txt") 
## Explore the shape of the data
df_summary <- data.frame(text = c("twitter","blog","news"))

df_summary$words <- c(
  sum(stri_count(text_twitter,regex="\\S+")),
  sum(stri_count(text_blogs,regex="\\S+")),
  sum(stri_count(text_news,regex="\\S+")))

df_summary$lines <- c(
  length(text_twitter),
  length(text_blogs),
  length(text_news))

df_summary$longest_line <- c(
  max(nchar(text_twitter, type = "chars", allowNA = FALSE)),
  max(nchar(text_blogs, type = "chars", allowNA = FALSE)),
  max(nchar(text_news, type = "chars", allowNA = FALSE)))
## Print a pretty table
df_summary <- prettyNum(df_summary, big.mark=",")
kable(df_summary, align = 'r', caption = "Summary of text files")
| text    |      words|     lines| longest_line|
|:--------|----------:|---------:|------------:|
| twitter | 30,373,543| 2,360,148|          140|
| blog    | 37,334,131|   899,288|       40,833|
| news    | 34,372,530| 1,010,242|       11,384|

Create Corpus

Right now I am just exploring the text, so I sample it for faster exploration. The corpus is just a better way to store language data in R.

## sample texts
sample_twitter <- sample(text_twitter, length(text_twitter)/1000)
sample_blogs <- sample(text_blogs, length(text_blogs)/1000)
sample_news <- sample(text_news, length(text_news)/1000)
## create corpus
corpus <- Corpus(VectorSource(c(sample_twitter, sample_blogs, sample_news)))

Clean Corpus

I need to clean up the text. Numbers, punctuation, and capital letters just create unnecessary variation in the words, which makes prediction less accurate and blows the n-grams up to far more combinations than are actually useful. I may already have some issues because of the size of the data. But I do not want to remove stop words or apply stemming, because those words and endings are an important part of predicting the next word in everyday language.

## clean the text
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)

Explore Corpus

To avoid duplicating code, I write functions for building the n-gram matrix, selecting the most frequent n-grams as a data frame, and charting them.

createNgramMatrix <- function(corpus, ngram) {
  ## create a matrix of ngram frequency
  options(mc.cores=1)
  ngramToken <- function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngramToken))  
  ## this is a very sparse matrix and rollup makes it much faster to convert to a matrix
  tdm <- as.matrix(rollup(tdm, 2, na.rm = TRUE, FUN = sum))
  return(tdm)
}

crunchData <- function(tdm, n) {
  ## turn a term-document matrix into a data frame and keep the top n rows
  df <- data.frame(ngram = rownames(tdm), count = tdm[, 1]) %>%
    arrange(desc(count)) %>%
    top_n(n, count)
  return(df)
}

chartData <- function(df) {
  ## Create a horizontal bar chart of ngrams
  p <- nPlot(count ~ ngram, data = df, type = 'multiBarHorizontalChart')
  p$chart(margin = list(top = 20, right = 20, bottom = 50, left = 100), showControls = F, showLegend = F)
  p$show('inline', include_assets = TRUE)
  return(p)
}

I look at some of the most common series of words. These should all look very familiar.

Ngram 1

tdm <- createNgramMatrix(corpus, 1)
df <- crunchData(tdm, 20)
p <- chartData(df)

Ngram 2

tdm <- createNgramMatrix(corpus, 2)
df <- crunchData(tdm, 10)
p <- chartData(df)

Ngram 3

tdm <- createNgramMatrix(corpus, 3)
df <- crunchData(tdm, 10)
p <- chartData(df)

Wordcloud

Just for fun

tdm <- createNgramMatrix(corpus, 1)
df <- crunchData(tdm, 100)
wordcloud(df$ngram, df$count)

[Wordcloud of the 100 most frequent words]

Plans for creating a prediction algorithm and Shiny app
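Going forward, the idea is to use the 1-, 2-, and 3-gram frequency tables built above to look up the most likely next word given the last one or two words a user types, and to serve that lookup through a Shiny app. Below is a minimal sketch of one possible backoff lookup; the data frames df3, df2, and df1 are assumed stand-ins for the 3-, 2-, and 1-gram tables returned by crunchData, not code that has been run on this data.

## Sketch only: assumes df3, df2, df1 hold 3-, 2-, and 1-gram counts
## with columns ngram and count, as returned by crunchData()
predictWord <- function(phrase, df3, df2, df1) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  findHits <- function(df, prefix) {
    df[grepl(paste0("^", prefix, " "), as.character(df$ngram)), ]
  }
  ## try the 3-gram table using the last two typed words
  hits <- findHits(df3, paste(words, collapse = " "))
  ## back off to the 2-gram table using only the last word
  if (nrow(hits) == 0) hits <- findHits(df2, tail(words, 1))
  ## back off to the most frequent single words
  if (nrow(hits) == 0) hits <- df1
  best <- as.character(hits$ngram[which.max(hits$count)])
  tail(unlist(strsplit(best, " ")), 1)
}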
