---
title: "Milestone Report"
author: William Surles
date: July 22, 2015
output:
  html_document:
    self_contained: true
    theme: flatly
---
This document shows my exploratory analysis of text documents. I create a corpus and then n-gram matrices of words. I chart the most frequent 1-grams, 2-grams, and 3-grams, and I briefly summarize my plans for creating the prediction algorithm.

I have three different types of language documents: Twitter, news, and blogs. I am looking for the most common sequences of words so I can better predict which word a person might want to use next in a sentence.
library(readr)
library(dplyr)
library(stringr)
library(stringi)
library(printr)
library(knitr)
library(tm)
library(RWeka)
library(slam)
library(wordcloud)
library(rCharts)
First, I pull the data into R and explore the shape of it. There is a lot of text here, which should make a good sample of the English language.
## Load the 3 data sets
text_twitter <- read_lines("final/en_US/en_US.twitter.txt")
text_blogs <- read_lines("final/en_US/en_US.blogs.txt")
text_news <- read_lines("final/en_US/en_US.news.txt")
## Explore the shape of the data
df_summary <- data.frame(text = c("twitter","blog","news"))
df_summary$words <- c(
  sum(stri_count(text_twitter, regex = "\\S+")),
  sum(stri_count(text_blogs, regex = "\\S+")),
  sum(stri_count(text_news, regex = "\\S+")))
df_summary$lines <- c(
  length(text_twitter),
  length(text_blogs),
  length(text_news))
df_summary$longest_line <- c(
  max(nchar(text_twitter, type = "chars", allowNA = FALSE)),
  max(nchar(text_blogs, type = "chars", allowNA = FALSE)),
  max(nchar(text_news, type = "chars", allowNA = FALSE)))
## Print a pretty table
df_summary <- prettyNum(df_summary, big.mark=",")
kable(df_summary, align = 'r', caption = "Summary of text files")
| text    |      words |     lines | longest_line |
|---------|-----------:|----------:|-------------:|
| twitter | 30,373,543 | 2,360,148 |          140 |
| blog    | 37,334,131 |   899,288 |       40,833 |
| news    | 34,372,530 | 1,010,242 |       11,384 |
Right now I am just exploring the text, so I sample it for faster exploration. A corpus is just a better way to store language data in R.
## sample texts
sample_twitter <- sample(text_twitter, length(text_twitter)/1000)
sample_blogs <- sample(text_blogs, length(text_blogs)/1000)
sample_news <- sample(text_news, length(text_news)/1000)
## create corpus
corpus <- Corpus(VectorSource(c(sample_twitter, sample_blogs, sample_news)))
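As a quick sanity check (just a sketch; the exact output depends on the random sample), I can look at how many documents the corpus holds and peek at the first one:

## quick look at the corpus (output will vary with the random sample)
length(corpus)
content(corpus[[1]])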
I need to clean up the text. Numbers, punctuation, and capital letters would just create unnecessary variation in the words. That variation would make prediction less accurate and would also blow my n-grams up into far more combinations than are actually useful. I may already have some issues because of the size of the data. However, I do not want to remove stop words or stems, because they are an important part of predicting the next word in everyday language.
## clean the text
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
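For reference, these are the tm transformations I am deliberately skipping, shown commented out so they do not run:

## transformations deliberately skipped: stop words and stems carry
## information about the next word, so they stay in the corpus
# corpus <- tm_map(corpus, removeWords, stopwords("english"))
# corpus <- tm_map(corpus, stemDocument)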
To avoid duplicating code, I write functions for building the n-gram matrix, crunching it into a data frame of the top n-grams, and charting the results.
createNgramMatrix <- function(corpus, ngram) {
  ## create a matrix of ngram frequency
  options(mc.cores = 1)
  ngramToken <- function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngramToken))
  ## the term-document matrix is very sparse; rollup makes it much faster to convert to a regular matrix
  tdm <- as.matrix(rollup(tdm, 2, na.rm = TRUE, FUN = sum))
  return(tdm)
}
crunchData <- function(tdm, n) {
  ## turn the term matrix into a data frame and choose the top n rows
  df <- data.frame(ngram = rownames(tdm), count = tdm[, 1]) %>%
    arrange(desc(count)) %>%
    top_n(n, count)
  return(df)
}
chartData <- function(df) {
  ## create a horizontal bar chart of ngrams
  p <- nPlot(count ~ ngram, data = df, type = 'multiBarHorizontalChart')
  p$chart(margin = list(top = 20, right = 20, bottom = 50, left = 100),
          showControls = FALSE, showLegend = FALSE)
  p$show('inline', include_assets = TRUE)
  return(p)
}
Now I look at the most common sequences of words: the top 1-grams, 2-grams, and 3-grams. These should all look very familiar.
tdm <- createNgramMatrix(corpus, 1)
df <- crunchData(tdm, 20)
p <- chartData(df)
tdm <- createNgramMatrix(corpus, 2)
df <- crunchData(tdm, 10)
p <- chartData(df)
tdm <- createNgramMatrix(corpus, 3)
df <- crunchData(tdm, 10)
p <- chartData(df)
Just for fun, here is a word cloud of the 100 most frequent words.
tdm <- createNgramMatrix(corpus, 1)
df <- crunchData(tdm, 100)
wordcloud(df$ngram, df$count)
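For the prediction algorithm, the idea is to use these n-gram frequencies to suggest the next word: given the last words a person has typed, look up the most frequent n-grams that start with them and offer the final word as the prediction. Below is a rough, hypothetical sketch of that lookup using a 3-gram data frame from crunchData; the function name predictNextWord and the example phrase are placeholders, not the final algorithm.

## hypothetical sketch of a next-word lookup built from 3-gram counts
predictNextWord <- function(df, phrase, top = 3) {
  ## split each 3-gram into a two-word prefix and the final word
  words <- str_split_fixed(as.character(df$ngram), " ", 3)
  lookup <- data.frame(
    prefix = paste(words[, 1], words[, 2]),
    next_word = words[, 3],
    count = df$count,
    stringsAsFactors = FALSE)
  ## return the most frequent words that follow the given two-word phrase
  lookup %>%
    filter(prefix == phrase) %>%
    arrange(desc(count)) %>%
    head(top)
}

tdm <- createNgramMatrix(corpus, 3)
df <- crunchData(tdm, 1000)
predictNextWord(df, "one of the")

Calling predictNextWord(df, "one of the") would return the most common words seen after "one of the" in the sampled corpus, which is the basic behavior the final prediction model needs.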