1. Introduction

The goal of this project is to get familiar with the data (US English text from internet news, blog posts and Twitter messages) in order to build a predictive model later on. Such a predictive model should suggest possible next words as one is writing text.

2. Data Source

data_dir <- "data"
file_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipfile <- "Coursera-SwiftKey.zip"
file_path <- paste(data_dir, zipfile, sep = "/")

if(!file.exists(file_path)) {
        if(!dir.exists(data_dir)) dir.create(data_dir)
        download.file(file_url, destfile = file_path, method = "curl")
        unzip(file_path, exdir = data_dir)
}

3. Exploratory Data Analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships in the data and prepare to build the first linguistic models.

The second step is to understand the frequencies of words and word pairs:

- How frequently do certain words appear in the text?
- How frequently do certain pairs (or triplets) of words appear together?
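
As a toy illustration of such counts (a minimal sketch on a made-up example, using stringi; the real analysis below works on the full data), words and word pairs can be tabulated like this:

library(stringi)

## Toy example: count words and word pairs (bigrams) per line
example_text <- c("the quick brown fox", "the quick red fox")
token_list <- stri_split_boundaries(tolower(example_text), type = "word",
                                    skip_word_none = TRUE)

## word frequencies over all lines
word_freq <- sort(table(unlist(token_list)), decreasing = TRUE)

## bigram frequencies, built within each line so pairs never cross lines
bigram_list <- lapply(token_list, function(tok)
        if (length(tok) > 1) paste(head(tok, -1), tail(tok, -1)))
bigram_freq <- sort(table(unlist(bigram_list)), decreasing = TRUE)

word_freq
bigram_freq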

3.1 Basic Statistics

I will build figures and tables to understand variation in the frequencies of words and word pairs in the data. For each file we collect the following information:

- file size in MB
- number of lines
- number of non-empty lines
- number of words
- distribution of words (quantiles and plot)
- number of characters
- number of non-whitespace characters

3.2 Summary of the Data

We are interested in the English text files and therefore filter all subdirectories of the data folder for files with the prefix “en_”.

filez <- list.files(path=data_dir, pattern="^en_.*.txt", recursive=T, full.names=T)
print(filez)
## [1] "data/final/en_US/en_US.blogs.txt"  
## [2] "data/final/en_US/en_US.news.txt"   
## [3] "data/final/en_US/en_US.twitter.txt"
library(stringi) ## Character string processing
library(knitr)

## Generic metrics function: file size, line and character counts, word count
getMetrics <- function(file_path, file_lines) {
        fiSize <- round(file.info(file_path)$size / 1024^2, 2) ## file size in MB
        x <- stri_stats_general(file_lines)
        nLines <- x[["Lines"]]
        nChars <- x[["Chars"]]
        charsNWhite <- x[["CharsNWhite"]] ## characters excluding whitespace
        words <- stri_count_words(file_lines) ## number of words per line
        nWords <- sum(words)                  ## total number of words
        vars <- c(fiSize, nLines, nChars, charsNWhite, nWords)
        vars <- unlist(lapply(vars, function(i) prettyNum(i, big.mark = ",")))
        return(vars)
}


## Blogs
con1 <- file(filez[1])
blogs <- readLines(con1, skipNul = TRUE)
close(con1)
rm(con1)
blog_metrics <- getMetrics(filez[1], blogs)


## News
con2 <- file(filez[2])
news <- readLines(con2, skipNul = TRUE)
close(con2)
rm(con2)
news_metrics <- getMetrics(filez[2], news)


## Twitter
con3 <- file(filez[3])
twitter <- readLines(con3, skipNul = TRUE)
close(con3)
rm(con3)
twitter_metrics <- getMetrics(filez[3], twitter)

## Build a metrics table
metrics <- as.data.frame(rbind(blog_metrics, news_metrics, twitter_metrics))
names(metrics) <- c("size (MB)", "num. of lines", "num. of characters", "num. of non-whitespace chars", "num. of words")
kable(metrics, digits=5)
                 size (MB)  num. of lines  num. of characters  num. of non-whitespace chars  num. of words
blog_metrics        200.42        899,288         206,824,382                   170,389,539     37,546,246
news_metrics        196.28      1,010,242         203,223,154                   169,860,866     34,762,395
twitter_metrics     159.36      2,360,148         162,096,241                   134,082,806     30,093,410

Next, we plot the distribution of the number of words per line for all three files.

suppressMessages(library(ggplot2))
library(scales)
library(cowplot)

blog_words <- stri_count_words(blogs)
news_words <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)

## Using 'xlim' to show only the relevant parts
qplot(blog_words, binwidth=1, xlim=c(1, 150))

Most blog posts have fewer than 50 words.

qplot(news_words, binwidth=1, xlim=c(1, 100))
## Warning: Removed 13419 rows containing non-finite values (stat_bin).

There is a first peak in the frequency of news word counts around 5, which may indicate the occurrence of headlines in the text. Most news items have between 20 and 40 words.

p3 <- qplot(twitter_words, binwidth=1, xlim=c(1, 50))
p3 + scale_y_continuous(labels = comma)

Twitter messages have a limit of 140 characters (with exceptions for links) and are generally very short.
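
As a quick sanity check of this claim (a small sketch, reusing the twitter vector loaded above):

## distribution of the number of characters per tweet
summary(nchar(twitter))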

3.3 Conclusion

The basic exploration confirms that blog posts use more words than news articles, and news articles use many more words than tweets. Nevertheless, I plan to join the three files into one big dataset, without losing the characteristics of each file.
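
One way to combine the three files while keeping track of their origin (a minimal sketch; the combined data frame and its column names are illustrative and not used elsewhere in this report):

## keep a source label for every line so per-source characteristics
## can still be recovered after combining
combined <- data.frame(
        text   = c(blogs, news, twitter),
        source = c(rep("blogs",   length(blogs)),
                   rep("news",    length(news)),
                   rep("twitter", length(twitter))),
        stringsAsFactors = FALSE)

table(combined$source)  ## number of lines per source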

4. Preprocessing

4.1 Building the Corpus and Cleaning the Data

To get a cleaner version of the data, we build one corpus out of the three files, convert uppercase to lowercase, and remove punctuation, white space and affixes (stemming).

For this we use the text mining package “quanteda”, which is built around the packages “stringi”, “data.table” and “Matrix” and is much faster than the package “tm”.

suppressMessages(library(quanteda))
suppressMessages(library(RColorBrewer))

set.seed(1234)  ## for reproducibility of the 30% random sample
data.sample <- c(sample(blogs, length(blogs) * 0.3),
                 sample(news, length(news) * 0.3),
                 sample(twitter, length(twitter) * 0.3))


##en_corpus <- corpus(c(blogs, news, twitter))
en.corpus <- corpus(data.sample)

## Creating a Document Term Matrix
dfm.matrix <- dfm(en.corpus, ignoredFeatures = stopwords("english"), stem = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 1,280,902 documents
##    ... indexing features: 411,153 feature types
##    ... removed 174 features, from 174 supplied (glob) feature types
##    ... stemming features (English), trimmed 82275 feature variants
##    ... created a 1280902 x 328705 sparse dfm
##    ... complete. 
## Elapsed time: 139.917 seconds.

4.2 Exploring the Data

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of a (n − 1)–order Markov model.

An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”.
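
As a toy illustration of the bigram case (a minimal sketch on made-up tokens, using the maximum-likelihood estimate; not part of the analysis below), the probability of the next word given the previous one is just a ratio of counts:

## P(w2 | w1) estimated as count(w1 w2) / count(w1)
toy_tokens  <- c("i", "like", "tea", "i", "like", "coffee")
toy_bigrams <- paste(head(toy_tokens, -1), tail(toy_tokens, -1))

p_tea_given_like <- sum(toy_bigrams == "like tea") / sum(toy_tokens == "like")
p_tea_given_like
## [1] 0.5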

Unigrams

We organize terms by their frequency and show the 30 most frequent words (unigrams) in the data sample. First as a word cloud:

plot(dfm.matrix, max.words = 30, colors = brewer.pal(8, "Dark2"), scale = c(4, 0.5))

And as a histogram:

unigram.dfm <- topfeatures(dfm.matrix, 30)  ## 30 most frequent words
unigram.dfm <- data.frame(frequency = unigram.dfm, words = names(unigram.dfm))

ggplot(unigram.dfm, aes(x = reorder(words, -frequency), y = frequency)) +
         labs(x = "30 Most Common Unigrams", y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = I("blue"))

5. Conclusion and Next Steps

Each of the data files is huge and processing takes quite a long time, so the available memory should be used carefully.

The quanteda package made tasks like cleaning the text a lot easier, but it may be worth removing profane words as well and keeping multi-word expressions such as “East Coast” combined. Furthermore, the package was easy to install; RWeka caused a lot of trouble with the Java versions installed on my machine.

The next step is to build 2-, 3- and 4-grams and to use them as the basis for a first simple prediction model of the relationship between words. I will explore both simple and more complex modeling techniques.
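
A possible starting point for the bigram counts (a sketch only, reusing the sample corpus from above; I assume a quanteda version in which dfm() accepts an ngrams argument, while newer versions would use tokens_ngrams() instead):

## Sketch: 30 most frequent bigrams from the sample corpus
## (assumes dfm() still accepts an ngrams argument in this quanteda version;
##  otherwise build the bigrams with tokens_ngrams(tokens(en.corpus), n = 2) first)
bigram.dfm <- dfm(en.corpus, ngrams = 2)
topfeatures(bigram.dfm, 30)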