Data Science Capstone - Milestone Report

Summary

Text prediction is a very useful feature, commonly seen in mobile phone keyboards when typing a message and in search engines when typing a query. In this capstone project we build such a predictive model: given a sequence of words, it predicts the most likely next word. This report presents an exploratory analysis of the data; we show the preprocessing steps and examine some of the n-grams present in the dataset.

Basic Data Summaries

The data for this project are taken from the HC Corpora collection (downloaded from here) and consist of three chunks of text: blogs, news articles, and tweets. We will be working with the English-language portion.

We first load the full data files and check their in-memory size and line counts.

Load necessary libraries:

library(tm)        # corpus handling and text cleaning
library(knitr)     # kable() tables
library(qdap)      # sent_detect() for sentence splitting
library(RWeka)     # NGramTokenizer() for n-grams
library(ggplot2)   # plotting

Load full data files:

blogs <- readLines("../data/en_US/en_US.blogs.txt", -1)
news <- readLines("../data/en_US/en_US.news.txt", -1)
twitter <- readLines("../data/en_US/en_US.twitter.txt", -1)
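
Note that readLines() can stop early or warn when a file contains embedded nul characters or an incomplete final line (the news file in this corpus is a known example of this, depending on the platform). If that happens, a more defensive read through a binary connection is possible; a minimal sketch (read_corpus_file is just an illustrative helper name):

# optional, more robust read: open a binary connection and skip embedded nuls
read_corpus_file <- function(path) {
    con <- file(path, open = "rb")
    on.exit(close(con))
    readLines(con, n = -1, encoding = "UTF-8", skipNul = TRUE)
}

# example usage with the same paths as above:
# news <- read_corpus_file("../data/en_US/en_US.news.txt")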

Perform preliminary checks on object size and line count:

stats_names <- rbind("blogs","news","twitter")
stats_lines <- rbind(length(blogs),
                     length(news),
                     length(twitter))

stats_size <- rbind(format(object.size(blogs), units = "Mb"),
                    format(object.size(news), units = "Mb"),
                    format(object.size(twitter), units = "Mb"))

stats_table <- cbind(stats_names,stats_size,stats_lines)

colnames(stats_table) <- c("file", "size", "lines")

kable(stats_table,format="markdown")

|file    |size     |   lines|
|:-------|:--------|-------:|
|blogs   |248.5 Mb |  899288|
|news    |19.2 Mb  |   77259|
|twitter |301.4 Mb | 2360148|

We also look at a sample line from each source:

c(blogs[2],news[2],twitter[2])

[1] “We love you Mr. Brown.”
[2] “The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.”
[3] “When you meet someone special… you’ll know. Your heart will beat more rapidly and you’ll smile for no reason.”

To reduce memory usage and processing time, we will only use the first 10,000 lines of each file in the remainder of this analysis.
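
A random sample of lines would arguably be more representative of each source than the first 10,000 lines; we keep the simpler approach here, but a possible sampling sketch (sample_lines is a hypothetical helper, not used below) is:

set.seed(1234)

# draw n random lines from a character vector (or all of them if it is shorter)
sample_lines <- function(x, n = 10000) {
    x[sample(seq_along(x), min(n, length(x)))]
}

# blogs_sample <- sample_lines(blogs)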

Load the (limited) data:

numlines <- 10000

blogs <- readLines("../data/en_US/en_US.blogs.txt", numlines)
news <- readLines("../data/en_US/en_US.news.txt", numlines)
twitter <- readLines("../data/en_US/en_US.twitter.txt", numlines)

Data processing

To prepare the data, we combine all three text chunks and remove characters that are not used in the model-building process.

Combine the data sources, apply sentence detection (so that each line is split into individual sentences and n-grams do not span sentence boundaries), and create a corpus object:

text_combined <- c(blogs, news, twitter)   # combine the three sources into one character vector
text_combined <- sent_detect(text_combined, language = "en", model = NULL)

vec_corp <- VectorSource(text_combined)
corp <- VCorpus(vec_corp)

Remove numbers, punctuation, and whitespace, convert to lower case, and convert to data frame:

corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, content_transformer(tolower))

# pull the cleaned text back out of the corpus into a data frame
corp_df <- data.frame(text=unlist(sapply(corp, '[',"content")),stringsAsFactors=F)
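
As a quick sanity check on the cleaning steps, we can inspect a few processed sentences and count any lines that ended up empty (output not shown):

head(corp_df$text, 3)               # should be lower case, no digits or punctuation
sum(nchar(corp_df$text) == 0)       # number of empty strings left after cleaning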

Exploratory Analysis

In this section, we look at distributions of unigrams, bigrams, and trigrams in the dataset.

Tokenize and find relative frequencies:

delimiter_list <- " \\t\\r\\n.!?,;\"()"

ngram_1_tok <- NGramTokenizer(corp_df, Weka_control(min=1,max=1))
ngram_1_tab <- data.frame(table(ngram_1_tok))
ngram_1_tab <- ngram_1_tab[order(ngram_1_tab$Freq,decreasing=T),]
ngram_1_tab$RelPerc <- ngram_1_tab$Freq / sum(ngram_1_tab$Freq)
colnames(ngram_1_tab) <- c("NGram","Freq","RelFreq")

ngram_2_tok <- NGramTokenizer(corp_df, Weka_control(min=2,max=2, delimiters = delimiter_list))
ngram_2_tab <- data.frame(table(ngram_2_tok))
ngram_2_tab <- ngram_2_tab[order(ngram_2_tab$Freq,decreasing=T),]
ngram_2_tab$RelPerc <- ngram_2_tab$Freq / sum(ngram_2_tab$Freq)
colnames(ngram_2_tab) <- c("NGram","Freq","RelFreq")

ngram_3_tok <- NGramTokenizer(corp_df, Weka_control(min=3,max=3, delimiters = delimiter_list))
ngram_3_tab <- data.frame(table(ngram_3_tok))
ngram_3_tab <- ngram_3_tab[order(ngram_3_tab$Freq,decreasing=T),]
ngram_3_tab$RelPerc <- ngram_3_tab$Freq / sum(ngram_3_tab$Freq)
colnames(ngram_3_tab) <- c("NGram","Freq","RelFreq")

Summarize unique and total ngram counts:

stats_ngram_name <- rbind("unigram","bigram","trigram")

stats_ngram_unique <- rbind(dim(ngram_1_tab)[1],
                            dim(ngram_2_tab)[1],
                            dim(ngram_3_tab)[1])

stats_ngram_total <- rbind(sum(ngram_1_tab$Freq),
                           sum(ngram_2_tab$Freq),
                           sum(ngram_3_tab$Freq))

stats_ngram_table <- cbind(stats_ngram_name,stats_ngram_unique,stats_ngram_total)

colnames(stats_ngram_table) <- c("ngram","unique","total")
kable(stats_ngram_table,format="markdown")

|ngram   | unique|  total|
|:-------|------:|------:|
|unigram |  21774| 178247|
|bigram  | 113293| 178246|
|trigram | 166133| 178245|

Build plots:

plot_1gram <- ggplot(data=ngram_1_tab[1:20,], aes(NGram,RelFreq)) + 
    geom_bar(stat="identity") + 
    theme(text = element_text(size=10)) + 
    labs(title="unigrams: top 20", x = "n-gram", y="Relative Frequency")

plot_2gram <- ggplot(data=ngram_2_tab[1:20,], aes(NGram,RelFreq)) + 
    geom_bar(stat="identity") + 
    theme(text = element_text(size=10),
          axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(title="bigrams: top 20", x = "n-gram", y="Relative Frequency")

plot_3gram <- ggplot(data=ngram_3_tab[1:20,], aes(NGram,RelFreq)) + 
    geom_bar(stat="identity") + 
    theme(text = element_text(size=10),
          axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(title="trigrams: top 20", x = "n-gram", y="Relative Frequency")

Plot relative frequencies:

[Figure (plot_results1): bar charts of the relative frequencies of the top 20 unigrams, bigrams, and trigrams]

The top unigrams are words such as “the” and “to”, and this is not surprising. Such words, called ‘stop words’ in language processing, might be considered noise and removed in some types of text analysis (for example, topic modeling). However, in this case it makes sense to keep them, since we would like to be able to predict these words as well. The summary table above also shows that unigrams have by far the smallest unique count of the three n-gram types; this is expected, since the number of distinct combinations grows rapidly with n, so a comparatively small vocabulary of words generates a much larger set of bigrams and trigrams, most of which occur only a few times.
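
For reference, removing stop words would be a one-line addition to the cleaning pipeline above; we deliberately do not apply it here, but a sketch using tm's built-in English stop-word list would look like this (corp_nostop is a hypothetical name):

# NOT applied in this analysis: the predictor must be able to suggest stop words too
corp_nostop <- tm_map(corp, removeWords, stopwords("english"))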

The top bigrams include “in the”, “of the”, and similar pairs of function words. Again, this is not surprising, and we would expect our future model to predict an article such as “the” or “a” when the previous word is a preposition like “in” or “to”.
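
As a quick illustration of that idea, the most frequent continuations of a given word can already be read off the bigram table built above; for example, for the word “in” (illustrative only; in_bigrams is just a temporary name):

# bigrams whose first token is "in", already sorted by frequency
in_bigrams <- ngram_2_tab[grepl("^in ", ngram_2_tab$NGram), ]
head(in_bigrams, 5)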

The top trigrams also contain the same frequent words: “one of the”, “to be a”, “a lot of”, etc.

Next steps

The next steps involve using the extracted n-grams to build a model that predicts the next word given a phrase.
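
As a rough preview, the tables built above already support a simple highest-frequency lookup with back-off from trigrams to bigrams to unigrams. The sketch below is only illustrative (predict_next is a made-up helper name); the final model will need smoothing, pruning, and a more compact data structure than these frequency tables:

# predict the next word from the last one or two words of a phrase,
# trying the trigram table first and backing off to bigrams, then unigrams
predict_next <- function(phrase) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    n <- length(words)

    if (n >= 2) {
        prefix <- paste(words[n - 1], words[n])
        hits <- ngram_3_tab[grepl(paste0("^", prefix, " "), ngram_3_tab$NGram), ]
        if (nrow(hits) > 0)
            return(sub(".* ", "", as.character(hits$NGram[1])))
    }

    prefix <- words[n]
    hits <- ngram_2_tab[grepl(paste0("^", prefix, " "), ngram_2_tab$NGram), ]
    if (nrow(hits) > 0)
        return(sub(".* ", "", as.character(hits$NGram[1])))

    as.character(ngram_1_tab$NGram[1])   # most common word overall as a last resort
}

# predict_next("one of")   # expected to return a frequent continuation such as "the"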