As a first step in the process of developing a prediction model in the spirit of SwiftKey, we perform a series of preliminary analyses on our text. The objective of these steps is to understand the size and structure of the source data and to prepare it for n-gram extraction.
The data provided include text from three sources (Twitter, News, Blog) across four locales (en_US, de_DE, ru_RU and fi_FI). For our analysis we look at the three sources within the en_US locale.
First, we load the required packages, set our working directory, and create pointers to the files used in our analysis.
## Loading Packages
require(stringr)
require(sqldf)
require(R.utils)
require(dplyr)
require(ggplot2)
require(knitr)
require(stringi)
require(ggthemes)
require(plotflow)
require(grid)
require(gridExtra)
## Set Working Directory and File Pointers
setwd("~/Documents/School/Coursera/course10_capstone/")
data_dir <- "~/Documents/School/Coursera/course10_capstone/Data/en_US/"
files <- c("en_US.twitter.txt",
"en_US.blogs.txt",
"en_US.news.txt")
Using the file.info function, we find that our data files range in size from roughly 159 MB (Twitter) to 200 MB (Blogs). The code used to extract these figures is laid out below.
## Get Size of Files
file_size = sapply(1:length(files), function(x) {
file.info(paste0(data_dir, files[x]))$size/ 1024^2
})
Given the large size of the files, the text is split into sentences and stored in an external SQLite database, text.db. Since the resulting database file exceeds 3 GB, keeping the sentences on disk rather than in memory goes a long way toward managing our memory use!
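The database-loading code itself is not reproduced in this report, but the step can be sketched roughly as follows. This is an illustration only: the table and column names, and the use of nltk.sent_tokenize for sentence splitting, are assumptions rather than the exact project code.

import sqlite3
import nltk  # requires the 'punkt' models: nltk.download('punkt')

# Illustrative sketch: table and column names are assumptions.
con = sqlite3.connect("text.db")
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS twitter (sentence TEXT)")

with open("Data/en_US/en_US.twitter.txt", encoding="utf-8", errors="ignore") as f:
    for line in f:
        # split each line into sentences and store them one row at a time
        for sent in nltk.sent_tokenize(line.strip()):
            cur.execute("INSERT INTO twitter (sentence) VALUES (?)", (sent,))

con.commit()
con.close()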
For our next preliminary analysis, we look at the number of lines in each of our files.
num_lines = sapply(1:length(files), function(x) {
countLines(paste0(data_dir, files[x]))
})
Using the stringi package, we can glean some further insights into character and word counts.
twitter <- stri_read_lines("Data/en_US/en_US.twitter.txt")
blogs <- stri_read_lines("Data/en_US/en_US.blogs.txt")
news <- stri_read_lines("Data/en_US/en_US.news.txt")
twitter_stats <- stri_stats_latex(twitter)
blogs_stats <- stri_stats_latex(blogs)
news_stats <- stri_stats_latex(news)
num_chars <- c(twitter_stats[1], blogs_stats[1], news_stats[1])
num_words <- c(twitter_stats[4], blogs_stats[4], news_stats[4])
The results of our preliminary analysis are shown below.
|   | file_names        | size_mb | num_lines | num_words | num_chars |
|---|-------------------|---------|-----------|-----------|-----------|
| 1 | en_US.twitter.txt | 159.36  | 2360148   | 40358144  | 334210676 |
| 2 | en_US.blogs.txt   | 200.42  | 899288    | 44668811  | 420320028 |
| 3 | en_US.news.txt    | 196.28  | 1010242   | 43729890  | 411623778 |
Using Python and NLTK, the text from the three files was analyzed and prepared for n-gram extraction. To keep the analysis efficient, we work with a random sample of 20,000 sentences from each of the corpora.
import sqlite3

# Connection and cursor for the sentence database built earlier (text.db)
CON = sqlite3.connect("text.db")
CUR = CON.cursor()

def sample_text(table, sample_size):
    """Generate a random sample of sentences from a corpus table."""
    with CON:
        CUR.execute(
            "SELECT * FROM %s ORDER BY Random() LIMIT %s" % (table, sample_size))
    return [x[0].decode("ASCII") for x in CUR.fetchall()]
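As a usage illustration, the three samples used below might be drawn along these lines; the table names are assumptions that simply mirror the three sources:

# Hypothetical usage: table names are assumed to mirror the three sources.
samples = {table: sample_text(table, 20000)
           for table in ("twitter", "blogs", "news")}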
Most of the data processing and n-gram extraction steps use Python and its Natural Language Toolkit (NLTK). The factors leading to this decision include the following:

- Due to problems getting rJava running on newer builds of OSX, many of the usual R-based NLP packages such as rWeka cannot be used. NLTK, however, is a strong and well-supported Python NLP package, and was chosen to avoid excessive debugging of Java environments.
- NLTK, collections.Counter, and other Python packages allow for sophisticated NLP feature analysis with relative ease.
- This analysis leverages R for further exploratory analysis and visualization. R will also be the primary driver of predictive model design and development.
Using NLTK, the following steps were taken to clean the text files for proper analysis:

- Removal of English stopwords (nltk.corpus.stopwords.words('english'))
- Removal of punctuation (string.punctuation, plus a list of additional punctuation marks observed to affect the texts)

With the text sampled and pre-processed, we develop an n-gram extractor that allows us to easily generate sets of unigrams, bigrams, and trigrams for each of the corpora. The code for extract_ngram is found below.
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from string import punctuation

def extract_ngram(text_list, length):
    """Count n-grams of the given length across a list of texts."""
    # stemmer = PorterStemmer()
    words_2_exclude = [x for x in punctuation] + stopwords.words('english')
    words_2_exclude.extend(("''", '``', '--', '..', '...'))
    phrase_counter = Counter()
    for text in text_list:
        for sent in nltk.sent_tokenize(text):
            # drop stopwords and punctuation before forming n-grams
            words = [x for x in nltk.word_tokenize(sent) if x not in words_2_exclude]
            for ngram in ngrams(words, length):
                phrase_counter[ngram] += 1
    return phrase_counter
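The extractor is then applied to each sample and the counts exported to CSV. The loop below is a sketch of that step rather than the exact project code: it assumes the samples dictionary from the sampling example above, and the output file names are chosen to match those read into R below.

import csv

# Illustrative sketch: assumes the `samples` dict and extract_ngram() defined above.
# Output file names (e.g. twitter_unigrams.csv) match those read into R below.
for source, texts in samples.items():
    for n, label in ((1, "unigrams"), (2, "bigrams"), (3, "trigrams")):
        counts = extract_ngram(texts, n)
        with open("%s_%s.csv" % (source, label), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["ngram", "count"])
            for ngram, count in counts.most_common():
                writer.writerow([" ".join(ngram), count])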
The results of applying this function, exporting the counts to CSV from Python, and visualizing them in R are shown below.
ug_T <- read.csv('twitter_unigrams.csv')[1:10,]
ug_T <- reorder_by(ngram, ~ count, ug_T)
ug_B <- read.csv('blogs_unigrams.csv')[1:10,]
ug_B <- reorder_by(ngram, ~ count, ug_B)
ug_N <- read.csv('news_unigrams.csv')[1:10,]
ug_N <- reorder_by(ngram, ~ count, ug_N)
# Prepare Charts (Sample)
xlab <- "Unigram"
ylab <- "Count"
title <- "Top 10 Unigrams from Twitter"
ug_T_chart <- ggplot(ug_T, aes(x=ngram, y=count, order = count)) +
geom_bar(stat="identity") + coord_flip() +
theme_economist() +
scale_colour_economist() +
xlab(xlab) +
ylab(ylab) +
theme(legend.title=element_blank(), legend.position="right") +
ggtitle(title)
(Bar charts of the top 10 unigrams from the Twitter, Blogs, and News samples appear here.)
The conclusions to be drawn here are largely tangential to the development of the predictive model; however, the exploratory results raise some interesting questions and insights nonetheless.
Regardless, there are many more technical factors (namely memory, time, and computing resources) that now have to be taken into consideration, along with new domain factors (linguistics as a whole). These considerations were less pronounced in previous courses, and they make the task ahead that much more daunting and exciting!