Data Science Capstone Milestone W2

Executive summary

The goal of Coursera’s Data Science Capstone project is to develop a Shiny application that recommends the next word on the fly, while the user is typing, based on the preceding words. The application will be built on a predictive text model trained on English text datasets from various sources. Text models and their characteristic features, such as n-grams, belong to the field of Natural Language Processing, for which R offers packages such as tm and RWeka.

This report presents an exploratory analysis of the proposed English text datasets and lays the groundwork for further analysis, model building, and model selection. The first part covers setup (loading packages and getting the data), the second part shows some immediate dataset features and the cleaning of the data, and the third part builds simple n-grams from the datasets.

The analysis shows that the Twitter data has fewer words per line and slightly shorter words on average, and that the cleaned data does not differ much from the full unprocessed dataset, except that it contains no words shorter than 3 characters. The 2-grams and 3-grams calculated on the smaller sample show that the most frequent combinations contain common stop words. The conclusion adds some remarks on how to proceed with the prediction models.

Part 1: Setup

Getting the data

The data we are going to use is compiled from three sources: news, blogs, and tweets. The dataset also contains similar texts in other languages, but we are going to use only the English datasets and build an English-only predictive model.

handle <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")){
        download.file(url=handle, "Coursera-SwiftKey.zip", mode="wb")
        unzip("Coursera-SwiftKey.zip")
}

blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8", skipNul=T)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding="UTF-8", skipNul=T)
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul=T)

Loading packages

The tm and stringi packages are used for text mining and for manipulating text datasets, quanteda for fast word frequency calculation, SnowballC for stemming, RWeka for building n-grams, dplyr and ggplot2 for basic data manipulation and plotting, and scales for formatting axis labels.

library(tm)
library(stringi)
library(quanteda)
library(SnowballC)
library(RWeka)
library(dplyr)
library(scales)
library(ggplot2)

Session info

This report was built in the following environment (shown here for reproducibility):

sessionInfo()

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.5 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.1.0    scales_0.4.0     dplyr_0.4.3      RWeka_0.4-29    
## [5] SnowballC_0.5.1  quanteda_0.9.6-9 stringi_1.1.1    tm_0.6-2        
## [9] NLP_0.1-9       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5       ca_0.64           knitr_1.13       
##  [4] magrittr_1.5      RWekajars_3.9.0-1 munsell_0.4.3    
##  [7] colorspace_1.2-6  lattice_0.20-33   R6_2.1.2         
## [10] plyr_1.8.4        stringr_1.0.0     tools_3.3.0      
## [13] parallel_3.3.0    grid_3.3.0        gtable_0.2.0     
## [16] data.table_1.9.6  DBI_0.4-1         htmltools_0.3.5  
## [19] assertthat_0.1    yaml_2.1.13       digest_0.6.9     
## [22] Matrix_1.2-6      rJava_0.9-8       formatR_1.4      
## [25] evaluate_0.9      slam_0.1-34       rmarkdown_0.9.6  
## [28] chron_2.3-47

Part 2: Exploratory analysis

Data features

The most important features concern the words in the dataset: the total number of words, the number of characters per word, the number of words per line, and so on. These and other features are given below, separately for each text source:

Words.per.Line <- sapply(list(blogs,news,twitter), function(x) summary(stri_count_words(x))['Mean'])
Chars.per.Word <- sapply(list(blogs,news,twitter), 
          function(x) stri_stats_general(x)[c("Chars")] / stri_stats_latex(x)[c("Words")])
stats <- data.frame(
  Source=c("blogs","news","twitter"),      
  t(rbind(
  sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],
  Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',],
  Words.per.Line,
  Chars.per.Word)
))
print(stats)

##    Source   Lines     Chars    Words Words.per.Line Chars.per.Word
## 1   blogs  899288 206824382 37570839          41.75       5.504918
## 2    news 1010242 203223154 34494539          34.41       5.891459
## 3 twitter 2360148 162096241 30451170          12.75       5.323153
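
To make the two derived columns concrete, here is a minimal base-R sketch on a made-up three-line sample. It splits on whitespace, which only approximates the word-counting rules of stringi, so it would not reproduce the table exactly. As far as the stringi documentation describes it, the Chars value counts every character, including spaces and punctuation, so the sketch shows both that ratio and the mean token length. All names here (toy_lines, words_by_line, and so on) are hypothetical.

```r
# Toy input (hypothetical), not taken from the SwiftKey data
toy_lines <- c("the quick brown fox", "hello world", "a b c d e f")

# Split each line on runs of whitespace
words_by_line <- strsplit(toy_lines, "\\s+")
all_words <- unlist(words_by_line)

# Words per line: mean number of tokens per line
words_per_line <- mean(lengths(words_by_line))

# Chars per word in the style of the table above:
# total characters (spaces included) divided by the number of words
chars_per_word_total <- sum(nchar(toy_lines)) / length(all_words)

# Alternative: mean length of the tokens themselves (no spaces counted)
chars_per_word_tokens <- mean(nchar(all_words))

print(c(words_per_line, chars_per_word_total, chars_per_word_tokens))
```

The gap between the last two numbers suggests that the Chars.per.Word column is better read as a relative measure between sources than as a literal average word length.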

As expected, the Twitter data has fewer words per line, while the number of characters per word is similar across all three sources. For further analysis the sources are merged and used as a single corpus, and the overall word frequency is plotted.

word_freq <- dfm(c(blogs, news, twitter), verbose = FALSE)
most_freq <- topfeatures(word_freq)
freq.df <- data.frame(words=names(most_freq), freq=most_freq)

ggplot(freq.df, aes(words, freq)) + geom_bar(stat="identity") +
            theme(axis.text.x=element_text(angle=45, hjust=1)) + 
            scale_y_continuous(labels=comma)

Cleaning the data

Cleaning a text corpus can be a very tiresome process. For the purposes of this report it is enough to show which steps will probably be included in the final prediction model, applied to a much smaller subset of the data. The sample is 0.1% of each source.

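# Fix the random seed so that the sample is reproducible
set.seed(2121)
# Re-encode the text; "utf-8-mac" is an encoding name available on macOS
# (the platform in the session info), other platforms may need a different target
blogs <- iconv(blogs,to="utf-8-mac")
news <- iconv(news,to="utf-8-mac")
twitter <- iconv(twitter,to="utf-8-mac")

# Draw 0.1% of the lines from each source and combine them
sample_corpus <- c(sample(blogs, length(blogs) * 0.001),
                 sample(news, length(news) * 0.001),
                 sample(twitter, length(twitter) * 0.001))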
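
The sampling step can also be wrapped in a small helper, which makes the fraction explicit and easy to change later. This is only a sketch: sample_fraction and the toy data are hypothetical names, and rounding the sample size is an assumption made here to avoid surprises from floating-point arithmetic.

```r
# Hypothetical helper: draw a given fraction of lines without replacement
sample_fraction <- function(lines, fraction) {
  n <- round(length(lines) * fraction)  # explicit integer sample size
  sample(lines, n)
}

set.seed(2121)

# Toy stand-in for one text source
toy_source <- paste("line", 1:5000)

# 0.1% of 5000 lines gives 5 sampled lines
toy_sample <- sample_fraction(toy_source, 0.001)
print(length(toy_sample))
```

With such a helper, the same fraction could be applied to blogs, news, and twitter in one lapply call.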

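corpus <- VCorpus(VectorSource(sample_corpus))                    # one document per sampled line
corpus <- tm_map(corpus, content_transformer(tolower), lazy=TRUE) # lower-case all text
corpus <- tm_map(corpus, removePunctuation, lazy=TRUE)            # drop punctuation
corpus <- tm_map(corpus, removeNumbers, lazy=TRUE)                # drop digits
corpus <- tm_map(corpus, stripWhitespace, lazy=TRUE)              # collapse repeated whitespace
corpus <- tm_map(corpus, stemDocument, lazy=TRUE)                 # stem words (uses SnowballC)
corpus <- tm_map(corpus, PlainTextDocument, lazy=TRUE)            # ensure plain text documents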
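
To see what these transformations do to a single line, the sketch below approximates the first four steps with base-R regular expressions. It is an illustration under assumptions, not the tm implementation: clean_line is a hypothetical name, stemming is left out, and the final trimws call is an addition that tm's stripWhitespace does not perform.

```r
# Hypothetical helper approximating the cleaning steps with base R
clean_line <- function(x) {
  x <- tolower(x)                     # lower-case
  x <- gsub("[[:punct:]]+", "", x)    # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)    # remove numbers
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # extra step: trim leading/trailing space
}

cleaned <- clean_line("Hello,   World! It's 2016...")
print(cleaned)
# prints "hello world its"
```

Note that removing punctuation also merges contractions ("It's" becomes "its"), which is one of the normalization issues a final model may need to handle more carefully.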

The word frequency of the cleaned data (from the sample) is plotted below:

words_matrix <- TermDocumentMatrix(corpus)
words_corpus <- findFreqTerms(words_matrix, lowfreq=100)
words_corpus <- words_corpus[!is.na(words_corpus)]
words_corpus_freq <- rowSums(as.matrix(words_matrix[words_corpus,]))
words_corpus_freq <- data.frame(word=names(words_corpus_freq), frequency=words_corpus_freq)

data <- words_corpus_freq
num <- 10
title <- "Top words by freq"
df <- data[order(-data$frequency),][1:num,] 

ggplot(df, aes(x = seq(1:num), y = frequency)) +
    geom_bar(stat = "identity", width = 0.80) +
    coord_cartesian(xlim = c(0, num+1)) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    scale_x_discrete(breaks=c(1:num), limits=c(1:num), labels=rownames(df)[1:num]) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Part 3: n-gram analysis

Building n-grams

The sample corpus is tokenized into 2-grams and 3-grams, and the frequency of each token is then calculated:

bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bi_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = bi_tokenizer))

bi_corpus <- findFreqTerms(bi_matrix, lowfreq=40)
bi_corpus <- bi_corpus[!is.na(bi_corpus)]

bi_corpus_freq <- rowSums(as.matrix(bi_matrix[bi_corpus,]))
bi_corpus_freq <- data.frame(word=names(bi_corpus_freq), frequency=bi_corpus_freq)

tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tri_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = tri_tokenizer))

tri_corpus <- findFreqTerms(tri_matrix, lowfreq=2)
tri_corpus <- tri_corpus[!is.na(tri_corpus)]

tri_corpus_freq <- rowSums(as.matrix(tri_matrix[tri_corpus,]))
tri_corpus_freq <- data.frame(word=names(tri_corpus_freq), frequency=tri_corpus_freq)
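
For readers unfamiliar with n-grams, the following base-R sketch shows the idea on a single toy sentence, without RWeka. The function make_ngrams and the toy data are hypothetical, and the sketch makes no claim to match the tokenization rules of NGramTokenizer.

```r
# Hypothetical helper: build all n-grams from a vector of tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_tokens <- strsplit("to be or not to be", " ")[[1]]

bigrams <- make_ngrams(toy_tokens, 2)
trigrams <- make_ngrams(toy_tokens, 3)

# Count how often each 2-gram occurs, most frequent first
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(bigram_freq)
```

Here "to be" occurs twice and every other 2-gram once, which is the same kind of frequency table that the TermDocumentMatrix approach produces for the sample corpus.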

2-gram features

data2 <- bi_corpus_freq
num <- 10
title <- "Top 2-grams"
df2 <- data2[order(-data2$frequency),][1:num,] 

ggplot(df2, aes(x = seq(1:num), y = frequency)) +
    geom_bar(stat = "identity", width = 0.80) +
    coord_cartesian(xlim = c(0, num+1)) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    scale_x_discrete(breaks=c(1:num), limits=c(1:num), labels=rownames(df2)[1:num]) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

3-gram features

data3 <- tri_corpus_freq
num <- 10
title <- "Top 3-grams"
df3 <- data3[order(-data3$frequency),][1:num,] 

ggplot(df3, aes(x = seq(1:num), y = frequency)) +
    geom_bar(stat = "identity", width = 0.80) +
    coord_cartesian(xlim = c(0, num+1)) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    scale_x_discrete(breaks=c(1:num), limits=c(1:num), labels=rownames(df3)[1:num]) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusion

This report shows some basic features of the datasets. A profanity filter is not considered, nor is advanced normalization or cleaning of the data. For example, the most frequent 2-grams and 3-grams are mostly composed of stop words, while n-grams containing non-stop words are the most interesting cases for prediction. The prediction models should tackle these problems, as well as the problem of optimizing model training, because even for these small datasets n-gram tokenization takes a really long time.
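
As a pointer towards the modelling stage, one of the simplest possible predictors looks up the most frequent 2-gram that starts with the previous word. The sketch below is only an assumption about how such a lookup could work, not the model that will be built: predict_next and toy_counts are hypothetical, and the counts are invented for illustration.

```r
# Hypothetical helper: pick the most frequent continuation of a word
# from a named vector of 2-gram counts (names look like "first second")
predict_next <- function(previous_word, bigram_counts) {
  parts <- strsplit(names(bigram_counts), " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  matches <- first == previous_word
  if (!any(matches)) return(NA_character_)  # unseen word: no prediction
  second[matches][which.max(bigram_counts[matches])]
}

# Invented counts, for illustration only
toy_counts <- c("i am" = 5, "i have" = 3, "you are" = 4)

print(predict_next("i", toy_counts))
print(predict_next("we", toy_counts))
```

A real model would need to handle unseen words (for example by falling back to shorter n-grams) and store the counts in a structure that is faster to query than a scan over all names.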