Introduction

As a first step in the process of developing a prediction model in the spirit of Swiftkey, we perform a series of preliminary analyses on our text. The objectives of these steps are the following:

To assert that we have are able to successfully load and manipulate the data
To develop an understanding of our data’s attributes and what implications they hold on our approach to working with the data
To perform basic exploratory analysis and report any interesting statistics and findings
To lay a foundation for features analysis that will assist in the next phase of the project. For the purposes of this report, that involves preliminary n-gram analysis and visualization

Summary of HC Corpora Data

The data provided include text from three sources (Twitter, News, Blog) across four locales (en_US, de_DE, ru_RU and fi_FI). For our analysis we look at the three sources within the en_US locale.

Set Working Directory, Packages Used, and File Pointers

First, we set our working directory and create pointers to the files used in our analysis.

## Loading Packages
require(stringr)
require(sqldf)
require(R.utils)
require(dplyr)
require(ggplot2)
require(knitr)
require(stringi)
require(ggthemes)
require(plotflow)
require(grid)
require(gridExtra)

## Set Working Directory and File Pointers
setwd("~/Documents/School/Coursera/course10_capstone/")
data_dir <- "~/Documents/School/Coursera/course10_capstone/Data/en_US/"

files <- c("en_US.twitter.txt",
           "en_US.blogs.txt",
           "en_US.news.txt")

File Sizes

Using the file.info function, we discover that the size of our data files range from 167 MB (Twitter) to 210 MB (Blogs). The code used to extract these data is laid out below.

## Get Size of Files
file_size = sapply(1:length(files), function(x) {
  file.info(paste0(data_dir, files[x]))$size/ 1024^2
})

Storage Considerations

Based on the large size of the files, the data from those files is processed into sentences and stored in an external SQLite database, text.db. Given that the resultant database file exceeds 3 GB in size, it is safe to say that this helps us manage our memory use!

Line Counts

For our next preliminary analysis, we look at at the number of lines in our files.

num_lines = sapply(1:length(files), function(x) {
  countLines(paste0(data_dir, files[x]))
})

Number of Words and Chars

Using the stringi package, we can glean some further insights into character and word length.

twitter <- stri_read_raw("Data/en_US/en_US.twitter.txt")
blogs <- stri_read_raw("Data/en_US/en_US.blogs.txt")
news <- stri_read_raw("Data/en_US/en_US.news.txt")

twitter_stats <- stri_stats_latex(twitter)
blogs_stats <- stri_stats_latex(blogs)
news_stats <-  stri_stats_latex(news)

num_chars <- c(twitter_stats[1], blogs_stats[1], news_stats[1])
num_words <- c(twitter_stats[4], blogs_stats[4], news_stats[4])

Preliminary Analysis Results

The results of our preliminary analysis are shown below.

	file_names	size_mb	num_lines	num_words	num_chars
1	en_US.twitter.txt	159.36	2360148	40358144	334210676
2	en_US.blogs.txt	200.42	899288	44668811	420320028
3	en_US.news.txt	196.28	1010242	43729890	411623778

Preparing the Data and Environment

Using Python and NLTK, the text from the three files was analyzed and prepared for N-Gram extraction.

Sample Preparation

In order to optimize efficiency of analysis, we use 20,000-entry random samples from each of the corpora.

def sample_text(table, sample_size):
    """generates random sample from from corpus"""
    with CON:
        CUR.execute(
            "SELECT * FROM %s ORDER BY Random() LIMIT %s" % (table, sample_size))
        return [x[0].decode("ASCII") for x in CUR.fetchall()]

Notes on Approach and Tools Used

Most of the data processing and n-gram extraction steps use Python and its Natural Language Toolkit (NLTK). The factors leading to this decision include the following:

Speed On average, Python is able to work with large data sets more quickly
Technical Compatibility Since there are outstanding errors with rJava running on newer builds of OSX, many of the usual R-based NLP packages such as rWeka cannot be used. NLTK, however, is a strong and well supported NLP package with Python, and was chosen to avoid excessive debugging of Java environments.
Strong NLP Functionality The techniques available through the NLTK, collections.Counter and other Python packages allow for use of sophisticated NLP feature analysis with relative ease.

This analysis leverages R for further exploratory analysis and visualization. R will also be the primary driver of predictive model design and development.

Text Pre-Processing

Using NLTK, the following steps were taken to clean the text files for proper analysis:

Ignoring stopwords (nltk.corpus.stopwords.words('english)
Excluding non-apostrophe punctuation (string.punctuation and a list of additional punctuation observed to affect the texts)
Case normalization (All converted to lower case)
Stripping whitespace from text

Feature Analysis

With the text sampled and pre-processed, we develop an n-gram extractor that allows us to easily generate sets of unigrams, bigrams, and trigrams for each of the corpora. The code for extract_ngram is found below.

def extract_ngram(text_list, length):
    # stemmer=PorterStemmer()
    words_2_exclude = [x for x in punctuation] + stopwords.words('english')
    words_2_exclude.extend(("''",'``', '--', '..', '...'))
    phrase_counter = Counter()
    for text in text_list:
        for sent in nltk.sent_tokenize(text):
            words = [x for x in nltk.word_tokenize(sent) if not x in words_2_exclude]
            for ngram in ngrams(words, length):
                phrase_counter[ngram] += 1
    return phrase_counter

The results of applying this function, exporting the results to CSV from Python, and visualizing them in R follow below.

Unigrams

ug_T <- read.csv('twitter_unigrams.csv')[1:10,]
ug_T <- reorder_by(ngram, ~ count, ug_T)
ug_B <- read.csv('blogs_unigrams.csv')[1:10,]
ug_B <- reorder_by(ngram, ~ count, ug_B)
ug_N <- read.csv('news_unigrams.csv')[1:10,]
ug_N <- reorder_by(ngram, ~ count, ug_N)


# Prepare Charts (Sample)
xlab <- "Unigram"
ylab <- "Count"
title <- "Top 10 Unigrams from Twitter"
ug_T_chart <- ggplot(ug_T, aes(x=ngram, y=count, order = count)) + 
        geom_bar(stat="identity") + coord_flip() +
        theme_economist() + 
        scale_colour_economist() +
        xlab(xlab) + 
        ylab(ylab) +
        theme(legend.title=element_blank(), legend.position="right") +
        ggtitle(title)

## NULL

Bigrams

## NULL

Trigrams

## NULL

Conclusions and Future Steps

The conclusions to be drawn here are largely tangential to the development of the predictive model; however, there are some interesting questions and insights that are drawn up nonetheless:

How would a stricter stemming model affect n-grams containing contractions? (e.g. can’t, isn’t, etc.) Would it be semantically meaningful to handle them?
Is there a role for certain stopwords to play?
Is there a more appropriate sample size or selection process to work with that would generate a more accurate ground to develop a model off of?

Regardless, there are many more technical factors (namely memory, time, and computer resources) that now have to be taken into consideration along with new domain factors (linguistics as whole). These considerations were less pronounced in previous courses; however, it makes the task ahead that much more daunting and exciting!

Milestone Report: Coursera Capstone Course