---
title: "Milestone Report"
author: William Surles
date: July 22, 2015
output:
  html_document:
    self_contained: true
    theme: flatly
---
This document shows my exploratory analysis of text documents. I create a corpus and then n-gram matrices of words. I chart the most frequent 1-grams, 2-grams, and 3-grams, and I briefly summarize my plans for creating the prediction algorithm.

I have three different types of language documents: Twitter, news, and blogs. I am looking for the most common sequences of words so I can better predict which word a person might want to use next in a sentence.
library(readr)
library(dplyr)
library(stringr)
library(stringi)
library(printr)
library(knitr)
library(tm)
library(RWeka)
library(slam)
library(wordcloud)
library(rCharts)
First, I pull the data into R and explore the shape of it. There is a lot of text here, which should make a good sample of the English language.
## Load the 3 data sets
text_twitter <- read_lines("final/en_US/en_US.twitter.txt")
text_blogs <- read_lines("final/en_US/en_US.blogs.txt")
text_news <- read_lines("final/en_US/en_US.news.txt")
## Explore the shape of the data
df_summary <- data.frame(text = c("twitter","blog","news"))
df_summary$words <- c(
  sum(stri_count(text_twitter, regex = "\\S+")),
  sum(stri_count(text_blogs, regex = "\\S+")),
  sum(stri_count(text_news, regex = "\\S+")))
df_summary$lines <- c(
  length(text_twitter),
  length(text_blogs),
  length(text_news))
df_summary$longest_line <- c(
  max(nchar(text_twitter, type = "chars", allowNA = FALSE)),
  max(nchar(text_blogs, type = "chars", allowNA = FALSE)),
  max(nchar(text_news, type = "chars", allowNA = FALSE)))
## Print a pretty table
df_summary <- prettyNum(df_summary, big.mark=",")
kable(df_summary, align = 'r', caption = "Summary of text files")
| text    |      words |     lines | longest_line |
|---------|-----------:|----------:|-------------:|
| twitter | 30,373,543 | 2,360,148 |          140 |
| blog    | 37,334,131 |   899,288 |       40,833 |
| news    | 34,372,530 | 1,010,242 |       11,384 |
Right now I am just exploring the text, so I sample it for faster exploration. A corpus is just a better way to store language data in R.
## sample texts
sample_twitter <- sample(text_twitter, length(text_twitter)/1000)
sample_blogs <- sample(text_blogs, length(text_blogs)/1000)
sample_news <- sample(text_news, length(text_news)/1000)
## create corpus
corpus <- Corpus(VectorSource(c(sample_twitter, sample_blogs, sample_news)))
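As a quick sanity check (just a sketch; the exact output depends on the random sample), I can look at how many documents the corpus holds and peek at the first one:

## quick look at the corpus (output will vary with the random sample)
length(corpus)
content(corpus[[1]])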
I need to clean up the text. Numbers, punctuation, and capital letters would just create unnecessary variation in the words. That variation would make prediction less accurate and would also blow my n-grams up into far more combinations than are actually useful. I may already have some issues because of the size of the data. However, I do not want to remove stop words or stems, because they are an important part of predicting the next word in everyday language.
## clean the text
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
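For reference, these are the tm transformations I am deliberately skipping, shown commented out so they do not run:

## transformations deliberately skipped: stop words and stems carry
## information about the next word, so they stay in the corpus
# corpus <- tm_map(corpus, removeWords, stopwords("english"))
# corpus <- tm_map(corpus, stemDocument)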
To avoid duplicating code, I write functions for building the n-gram matrix, crunching it into a data frame of the top n-grams, and charting the results.
createNgramMatrix <- function(corpus, ngram) {
  ## create a matrix of ngram frequency
  options(mc.cores = 1)
  ngramToken <- function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngramToken))
  ## the term-document matrix is very sparse; rollup makes it much faster to convert to a regular matrix
  tdm <- as.matrix(rollup(tdm, 2, na.rm = TRUE, FUN = sum))
  return(tdm)
}
crunchData <- function(tdm, n) {
  ## turn the term matrix into a data frame and choose the top n rows
  df <- data.frame(ngram = rownames(tdm), count = tdm[, 1]) %>%
    arrange(desc(count)) %>%
    top_n(n, count)
  return(df)
}
chartData <- function(df) {
  ## create a horizontal bar chart of ngrams
  p <- nPlot(count ~ ngram, data = df, type = 'multiBarHorizontalChart')
  p$chart(margin = list(top = 20, right = 20, bottom = 50, left = 100),
          showControls = FALSE, showLegend = FALSE)
  p$show('inline', include_assets = TRUE)
  return(p)
}
Now I look at the most common sequences of words: the top 1-grams, 2-grams, and 3-grams. These should all look very familiar.
tdm <- createNgramMatrix(corpus, 1)
df <- crunchData(tdm, 20)
p <- chartData(df)
tdm <- createNgramMatrix(corpus, 2)
df <- crunchData(tdm, 10)
p <- chartData(df)
tdm <- createNgramMatrix(corpus, 3)
df <- crunchData(tdm, 10)
p <- chartData(df)
Just for fun, here is a word cloud of the 100 most frequent words.
tdm <- createNgramMatrix(corpus, 1)
df <- crunchData(tdm, 100)
wordcloud(df$ngram, df$count)
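For the prediction algorithm, the idea is to use these n-gram frequencies to suggest the next word: given the last words a person has typed, look up the most frequent n-grams that start with them and offer the final word as the prediction. Below is a rough, hypothetical sketch of that lookup using a 3-gram data frame from crunchData; the function name predictNextWord and the example phrase are placeholders, not the final algorithm.

## hypothetical sketch of a next-word lookup built from 3-gram counts
predictNextWord <- function(df, phrase, top = 3) {
  ## split each 3-gram into a two-word prefix and the final word
  words <- str_split_fixed(as.character(df$ngram), " ", 3)
  lookup <- data.frame(
    prefix = paste(words[, 1], words[, 2]),
    next_word = words[, 3],
    count = df$count,
    stringsAsFactors = FALSE)
  ## return the most frequent words that follow the given two-word phrase
  lookup %>%
    filter(prefix == phrase) %>%
    arrange(desc(count)) %>%
    head(top)
}

tdm <- createNgramMatrix(corpus, 3)
df <- crunchData(tdm, 1000)
predictNextWord(df, "one of the")

Calling predictNextWord(df, "one of the") would return the most common words seen after "one of the" in the sampled corpus, which is the basic behavior the final prediction model needs.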