This report presents an exploratory analysis of the Coursera-SwiftKey English data sets. There are three data sets, containing snippets from blog posts, news articles and tweets. First, I describe how the data sets were loaded and preprocessed. Then, I examine the distribution of words per sentence and list the top 20 words with and without stop words for each data set, discussing some interesting findings. Lastly, I present my future plans for the prediction algorithm and Shiny application.
The required libraries are loaded.
library(readr)
library(quanteda)
There are three data sets involved: the first (en_US.blogs.txt) contains snippets from blog posts, the second (en_US.news.txt) contains snippets from news articles, and the third (en_US.twitter.txt) contains tweets from Twitter. A look at the first few pages of each data set suggests that they have different writing styles and focus on different topics, so they will be treated separately in this analysis.
The data sets are read in using readr::read_lines, in binary mode and assuming UTF-8 encoding. Binary mode is used so that the operating system does not perform any transformations while loading the data sets. UTF-8 is assumed because the Linux file command-line tool reports that encoding for all three data sets, which I verified by visual inspection in Vim.
blogs <- read_lines(file("final/en_US/en_US.blogs.txt", "rb", encoding="UTF-8"))
news <- read_lines(file("final/en_US/en_US.news.txt", "rb", encoding="UTF-8"))
twitter <- read_lines(file("final/en_US/en_US.twitter_noNUL.txt", "rb", encoding="UTF-8"))
The Twitter data set required special preprocessing because it contains embedded NUL characters, on which readr::read_lines aborts. These NULs were replaced with spaces using the following Linux command line:
tr '\000' ' ' < en_US.twitter.txt > en_US.twitter_noNUL.txt
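For completeness, the same replacement could also have been done within R by working on the raw bytes; a minimal sketch, assuming the original file is at final/en_US/en_US.twitter.txt:
# Read the raw bytes, replace each NUL (0x00) with a space (0x20), and write
# the cleaned bytes back out. The file paths are assumptions for illustration.
twitter_path <- "final/en_US/en_US.twitter.txt"
raw_bytes <- readBin(twitter_path, what = "raw", n = file.info(twitter_path)$size)
raw_bytes[raw_bytes == as.raw(0x00)] <- as.raw(0x20)
writeBin(raw_bytes, "final/en_US/en_US.twitter_noNUL.txt")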
To verify that the data sets have been loaded successfully, I compare the size in bytes of each loaded data set with the size reported by ls (the Linux directory listing command):
210160014 en_US.blogs.txt
205811889 en_US.news.txt
167105338 en_US.twitter_noNUL.txt
From the Linux file command, it is known that the data sets are CRLF-terminated, i.e. each line ends with two bytes, which read_lines strips when loading. Therefore, the following statements should all be true:
sum(nchar(blogs, type="bytes")) + 2*length(blogs) == 210160014
## [1] TRUE
sum(nchar(news, type="bytes")) + 2*length(news) == 205811889
## [1] TRUE
sum(nchar(twitter, type="bytes")) + 2*length(twitter) == 167105338
## [1] TRUE
Voila! The data sets appear to have been successfully loaded.
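As an additional check that was not part of the verification above, base R’s validUTF8 function (available since R 3.3.0) could confirm that every loaded line is valid UTF-8; a minimal sketch (output not shown):
# Each sum counts lines that are not valid UTF-8 and should therefore be 0.
sum(!validUTF8(blogs))
sum(!validUTF8(news))
sum(!validUTF8(twitter))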
In the Unicode UTF-8 encoding, punctuation marks such as double quotation marks, single quotation marks, apostrophes and hyphens can each be represented by several different code points. I normalize the data so that each of these punctuation marks is encoded only by its ASCII symbol. This is done because any R text manipulation function can be expected to handle ASCII, whereas I do not want to assume that it also handles the other Unicode representations; for example, a tool that removes punctuation might recognize only the ASCII punctuation symbols and miss the alternative encodings.
gsub_fast <- function(pattern, replacement, x) {
  gsub(pattern, replacement, x, fixed = TRUE, useBytes = TRUE)
}
normalize_data <- function(data) {
  # Double quotation marks
  data <- gsub_fast("\u0093", "\"", data)
  data <- gsub_fast("\u0094", "\"", data)
  data <- gsub_fast("\u201c", "\"", data)
  data <- gsub_fast("\u201d", "\"", data)
  # Single quotation marks and apostrophes
  data <- gsub_fast("\u0091", "'", data)
  data <- gsub_fast("\u0092", "'", data)
  data <- gsub_fast("\u2018", "'", data)
  data <- gsub_fast("\u2019", "'", data)
  # Dashes
  data <- gsub_fast("\u0096", "-", data)
  data <- gsub_fast("\u0097", "--", data)
  data <- gsub_fast("\u2013", "-", data)
  data <- gsub_fast("\u2014", "--", data)
  # Bullet points
  data <- gsub_fast("\u0095", "", data) # \u0095 is a bullet point.
  data <- gsub_fast("\u00F8", "", data) # \u00F8 is probably standing in for a bullet point.
  # To be continued...
  # Since gsub(useBytes = TRUE) drops the encoding declaration, redeclare it as UTF-8.
  Encoding(data) <- "UTF-8"
  data
}
blogs <- normalize_data(blogs)
news <- normalize_data(news)
twitter <- normalize_data(twitter)
The tools used to identify the non-ASCII punctuation were tools::showNonASCII and the r12a Unicode code converter. Due to time constraints, this normalization is not yet complete, but I expect that completing it will not significantly affect any findings.
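For reference, a minimal sketch of how tools::showNonASCII can be applied to a random sample of lines to spot the remaining non-ASCII characters (the sample size of 10,000 is an arbitrary illustrative choice):
# Print sampled blog lines that still contain non-ASCII characters,
# with those characters shown as escaped byte codes.
set.seed(1)
tools::showNonASCII(sample(blogs, 10000))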
For profanity removal, I chose the list from the Banned Word List website, the reasons being that the list is free and conservative, i.e. it does not include double entendres or words that are only offensive in context. I may add to the list later.
profanity <- read_lines(file("swearWords.txt", "rb", encoding="UTF-8"))
A basic line count summary table can quickly be produced as follows:
c(blogs = length(blogs), news = length(news), twitter = length(twitter))
## blogs news twitter
## 899288 1010242 2360148
However, each line can contain multiple sentences, which makes lines a less than ideal unit of analysis. Therefore, I use the quanteda::tokenize function to split the data sets into sentences. This sentence segmenter is not perfect, but I expect that any mistakes will not be statistically significant.
blogs.sentence <- tokenize(blogs, what = "sentence", simplify = TRUE)
news.sentence <- tokenize(news, what = "sentence", simplify = TRUE)
twitter.sentence <- tokenize(twitter, what = "sentence", simplify = TRUE)
For the word count summary, I present histograms of words per sentence (i.e. sentence length) for each data set. I do so because Twitter has a 140-character limit and I expect this to show up clearly in the histograms.
I split the sentences into words…
blogs.sentence.words <- tokenize(blogs.sentence, removePunct = TRUE)
news.sentence.words <- tokenize(news.sentence, removePunct = TRUE)
twitter.sentence.words <- tokenize(twitter.sentence, removePunct = TRUE)
… and count the number of words per sentence, or sentence length.
blogs.sentence.length <- sapply(blogs.sentence.words, length)
news.sentence.length <- sapply(news.sentence.words, length)
twitter.sentence.length <- sapply(twitter.sentence.words, length)
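As an aside, base R’s lengths function (available since R 3.2.0) computes the same per-element counts and is typically faster than sapply; an equivalent sketch:
# Equivalent, vectorised word-per-sentence counts (requires R >= 3.2.0).
blogs.sentence.length <- lengths(blogs.sentence.words)
news.sentence.length <- lengths(news.sentence.words)
twitter.sentence.length <- lengths(twitter.sentence.words)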
I take the median and 95% quantile of the sentence lengths:
c(median = median(blogs.sentence.length), quantile(blogs.sentence.length, probs = 0.95))
## median 95%
## 13 37
c(median = median(news.sentence.length), quantile(news.sentence.length, probs = 0.95))
## median 95%
## 16 36
c(median = median(twitter.sentence.length), quantile(twitter.sentence.length, probs = 0.95))
## median 95%
## 7 19
The results show that news sentences have the highest median length (16 words), followed by blog sentences (13 words) and tweet sentences (7 words). Also, in every data set at least 95% of the sentences are 50 words or fewer, which motivates the cutoff used in the blog and news histograms below.
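This can be checked directly; a quick sketch (output not shown):
# Proportion of sentences with at most 50 words in each data set;
# each value should be at least 0.95.
mean(blogs.sentence.length <= 50)
mean(news.sentence.length <= 50)
mean(twitter.sentence.length <= 50)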
The histograms follow:
hist(blogs.sentence.length[blogs.sentence.length <= 50], freq = FALSE,
main = "Histogram of blog sentence length, with cutoff at 50 words",
xlab = "Blog sentence length")
hist(news.sentence.length[news.sentence.length <= 50], freq = FALSE,
main = "Histogram of news sentence length, with cutoff at 50 words",
xlab = "News sentence length")
hist(twitter.sentence.length, freq = FALSE,
main = "Histogram of tweet sentence length",
xlab = "Tweet sentence length")
The histograms show that tweet sentences are usually very short (fewer than 10 words), while blog and news sentences are usually longer (more than 10 words).
In this report, I will also show the top 20 words for each data set with and without stop words. I do so because I expect another interesting finding here, due to the different data set sources.
First, keeping stop words but removing profanity, I create a document-feature matrix for each data set:
blogs.dfm.withStop <- dfm(blogs, ignoredFeatures = profanity)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 899,288 documents
## ... indexing features: 373,541 feature types
## ... removed 65 features, from 77 supplied (glob) feature types
## ... created a 899288 x 373476 sparse dfm
## ... complete.
## Elapsed time: 65.82 seconds.
news.dfm.withStop <- dfm(news, ignoredFeatures = profanity)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1,010,242 documents
## ... indexing features: 326,752 feature types
## ... removed 50 features, from 77 supplied (glob) feature types
## ... created a 1010242 x 326702 sparse dfm
## ... complete.
## Elapsed time: 61.5 seconds.
twitter.dfm.withStop <- dfm(twitter, ignoredFeatures = profanity)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2,360,148 documents
## ... indexing features: 418,945 feature types
## ... removed 65 features, from 77 supplied (glob) feature types
## ... created a 2360148 x 418880 sparse dfm
## ... complete.
## Elapsed time: 120.69 seconds.
I then obtain the top 20 words:
topfeatures(blogs.dfm.withStop, 20)
## the and to a of i in that is
## 1858296 1092671 1066341 898210 875328 774102 594283 460576 432533
## it for you with was on my this as
## 403372 363319 298244 286590 278308 274437 270716 258893 223697
## have be
## 218770 208491
topfeatures(news.dfm.withStop, 20)
## the to and a of in for that is
## 1972403 901187 885352 875429 771095 674225 351258 346828 284181
## on with said was he it at as his
## 266847 254792 250408 228959 228695 219262 212783 187425 157648
## i be
## 157390 152197
topfeatures(twitter.dfm.withStop, 20)
## the to i a you and for in of is
## 936567 787269 722297 609341 547678 438192 385056 377946 359149 358708
## it my on that me be at with your have
## 294728 291780 277000 234579 202302 187634 186486 173414 171153 168602
The word lists appear quite similar to each other, though there are differences: “my” is common in both the blog and Twitter data sets, “me” is common in the Twitter data set, while “said”, “he” and “his” are common in the news data set.
Second, removing both stop words and profanity, I create a document-feature matrix for each data set:
blogs.dfm.noStop <- dfm(blogs, ignoredFeatures = c(profanity, stopwords("english")))
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 899,288 documents
## ... indexing features: 373,541 feature types
## ... removed 239 features, from 251 supplied (glob) feature types
## ... created a 899288 x 373302 sparse dfm
## ... complete.
## Elapsed time: 62.39 seconds.
news.dfm.noStop <- dfm(news, ignoredFeatures = c(profanity, stopwords("english")))
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1,010,242 documents
## ... indexing features: 326,752 feature types
## ... removed 224 features, from 251 supplied (glob) feature types
## ... created a 1010242 x 326528 sparse dfm
## ... complete.
## Elapsed time: 63.51 seconds.
twitter.dfm.noStop <- dfm(twitter, ignoredFeatures = c(profanity, stopwords("english")))
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2,360,148 documents
## ... indexing features: 418,945 feature types
## ... removed 239 features, from 251 supplied (glob) feature types
## ... created a 2360148 x 418706 sparse dfm
## ... complete.
## Elapsed time: 97.09 seconds.
Again, I obtain the top 20 words:
topfeatures(blogs.dfm.noStop, 20)
## one will just like can time get know now people
## 124812 112700 100642 98715 98303 88605 70873 60277 60148 59450
## also new even first day make back us see really
## 55337 54480 52073 50890 50669 50656 50571 50217 50058 49948
topfeatures(news.dfm.noStop, 20)
## said will one new also can two year just
## 250408 108168 83226 70341 58763 58696 57355 54815 53168
## first last time like state people years get city
## 52651 51529 51159 49434 48058 47589 46828 43503 37118
## now percent
## 36108 34427
topfeatures(twitter.dfm.noStop, 20)
## just like get love good will day can thanks rt
## 151022 122022 112321 106176 100750 94683 90074 89735 89532 89301
## now one know u great time today go lol new
## 83797 82017 79821 77139 75995 75587 72850 72405 70026 69624
The word lists now appear less similar to each other, possibly reflecting the different topics and writing styles of each data set source.
Regarding sentence length, tweet sentences are distinctly shorter than news and blog sentences, which is likely due to the 140-character limit for tweets. This may be significant because I expect the limit to change writing style and therefore the words used.
Regarding the top 20 word lists, the differences suggest that prediction would be more accurate with a separate predictor for each data set source. This assumes, though, that a predictor application can detect whether the user is writing a blog post, a news article or a tweet.
I plan to build a combination (ensemble) predictor that is a weighted combination of predictors for each category: blog, news or Twitter. The predictors will be based on Markov bigram models; I will also test trigram models if hardware permits. The Shiny application will let the user choose the sentence category, which gives a higher weight to the corresponding predictor, or leave the category as unknown. The application will show the top 5 predictions.
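To make the bigram idea concrete, here is a minimal sketch of a single category predictor; bigram_counts and its columns are hypothetical placeholders rather than part of the current analysis:
# Hypothetical bigram Markov predictor: given the last word typed, return the
# top n most frequent following words. bigram_counts is assumed to be a
# data.frame with columns word1, word2 and count built from a training corpus.
predict_next <- function(last_word, bigram_counts, n = 5) {
  candidates <- bigram_counts[bigram_counts$word1 == tolower(last_word), ]
  candidates <- candidates[order(candidates$count, decreasing = TRUE), ]
  head(candidates$word2, n)
}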
If hardware permits, the predictor will be trained on the whole training data set, since more data means that the predictor’s model contains more knowledge. If hardware is an issue, then sampling will be needed.
Testing will be done in two ways. First, test data will be taken from the training set. This reflects the case where the training set is a person’s writing history and that person is in a situation that he or she has been in before. Second, test data will be taken from very recent news articles and tweets to test the predictor’s general knowledge.
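A minimal sketch of the first kind of test, sampling test sentences directly from the training data (the sample size is an arbitrary illustrative choice):
# Draw an in-sample test set of 10,000 sentences from the blog training data.
set.seed(2)
blogs.test.insample <- sample(blogs.sentence, 10000)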
I do not yet know how to handle words that are not in the predictor’s vocabulary. Perhaps individual word frequencies will be considered? Certainly, a category predictor that is asked to predict from words outside its vocabulary will be weighted lower.