Data Science Specialization - Milestone report

Executive summary

This is my milestone report for the Capstone Project of the Data Science Specialization offered on Coursera. The purpose of this report is to explore the text data provided for the project and to investigate features that will help in designing the final data product.

Loading dependencies

My solution depends mostly on the tm and RWeka packages. For exploratory plotting, I'm using the wordcloud package (together with RColorBrewer for the color palettes).

library(tm)
## Loading required package: NLP
library(RWeka)
library(RColorBrewer)
library(wordcloud)

Reading in the data

The three data files were combined into one: combined_text_data.txt. For the sake of this assignment, I’ve extracted the number of lines, words and characters from the combined text data, using a custom Python script.

For those who are interested in such numbers, there are:

  • 4,269,678 Lines
  • 99,827,173 Words
  • 572,143,905 Characters

Fun trivia: when examining the data files separately, one can notice that although the Twitter dataset has the largest number of lines (2,360,148), those lines contain the smallest number of words (28,014,705) and characters (162,096,241). This is to be expected, since tweets are much shorter and tend to use simpler words than blog posts or news articles.

setwd("/home/misi/Coursera/Data science Capstone")

con <- file("Data/combined_text_data.txt", "r")
data <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)
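
For reference, similar counts can be reproduced directly in R from the loaded data. This is just a quick sketch: the figures above came from the Python script, and a whitespace-based word count may differ slightly from it.

# Rough corpus-size figures computed from the loaded data; the word count
# splits on whitespace, so it may deviate slightly from the Python script.
length(data)                          # number of lines
sum(lengths(strsplit(data, "\\s+")))  # approximate word count
sum(nchar(data))                      # number of characters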

Sampling the data

Using the sampling function below, I've created a sample of 100,000 lines, saved into sampled_text_data.txt. Each line of the combined data is kept with a probability of 0.5 until the requested sample size is reached.

sampling <- function(data, sample_size){
  con <- file("Data/sample/sampled_text_data.txt", "w")
  sample_line_counter <- 0
  for(i in 1:length(data)){
    # Keep each line with probability 0.5 (a coin flip per line)
    if(rbinom(1, 1, 0.5) == 1){
      writeLines(data[i], con)
      sample_line_counter <- sample_line_counter + 1
    }
    # Stop once the requested number of lines has been written
    if(sample_line_counter == sample_size){
      break
    }
  }
  close(con)
}

sampling(data, 100000)

Loading in the sample data and processing it

The loaded sample text was processed by removing numbers and punctuation and stripping extra whitespace. The text was then transformed to lower case.

sample_corpus <- Corpus(DirSource("Data/sample"), 
                        readerControl = list(language = "en_US", 
                                             encoding = "UTF-8"))

sample_corpus <- tm_map(sample_corpus, removeNumbers)
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))

Calculating n-grams

Below is my n-gram function, which returns the top (by default 100) highest-frequency terms.

n_gram_frequencies <- function(corpus, ngrams, top=100){
  
  # Tokenizer that produces n-grams of the requested length
  ngram <- function(x) NGramTokenizer(x, Weka_control(min = ngrams, max = ngrams))
  # Term-document matrix built on the n-gram tokens
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngram))
  # Total frequency of each n-gram across all documents, sorted descending
  freq <- rowSums(as.matrix(tdm))
  freq <- sort(freq, decreasing=TRUE)
  freq <- head(freq, top)
  freq <- data.frame("word" = names(freq), "frequency" = freq)
  
  return(freq)
}

Visualizing the frequent n-grams

Word clouds are a fun way to visualize the frequency of words (or n-grams). The more frequent a term is, the larger it appears, and the colors also reflect the frequency. While such plots are not very quantitative, that is not really the point of exploration either: the goal is to visualize the data in a way that gives ideas for the follow-up analysis (and model building).

cloud <- function(x){
    # wordcloud and RColorBrewer are already loaded above; require() them
    # here so the function also works on its own
    require(wordcloud)
    require(RColorBrewer)
    # x is the output of n_gram_frequencies(): columns "word" and "frequency"
    wordcloud(x$word, x$frequency,
              random.order=FALSE, use.r.layout=FALSE,
              scale=c(2,1), colors=brewer.pal(8, "Set2"))
}

1-grams

UNI_100 <- n_gram_frequencies(sample_corpus, 1)
cloud(UNI_100)

2-grams

BI_70 <- n_gram_frequencies(sample_corpus, 2, 70)
cloud(BI_70)

3-grams

TRI_30 <- n_gram_frequencies(sample_corpus, 3, 30)
cloud(TRI_30)

First conclusion

The most frequent n-grams are dominated by stop words. This can clearly be an issue, since these words might decrease the accuracy of the predictions. Let's see what happens when they are removed.

Removing the stop words

sample_corpus <- tm_map(sample_corpus, removeWords, stopwords("english"))

1-grams (no stop words)

UNI_100 <- n_gram_frequencies(sample_corpus, 1)
cloud(UNI_100)

2-grams (no stop words)

BI_70 <- n_gram_frequencies(sample_corpus, 2, 70)
cloud(BI_70)

3-grams (no stop words)

TRI_30 <- n_gram_frequencies(sample_corpus, 3, 30)
cloud(TRI_30)

Conclusion

Based on the exploratory analysis, the prediction model will use a back-off strategy, starting from 4-grams and falling back to 3-grams and then 2-grams if needed. The underlying training text data will not contain numbers, punctuation, extra whitespace or English stop words. For the sake of efficiency, very uncommon n-grams might be removed.
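
As an illustration, a back-off lookup along these lines could look like the sketch below. This is only a rough sketch under stated assumptions: predict_next_word and the ngram_tables list (frequency tables keyed by n, in the format produced by n_gram_frequencies()) are hypothetical placeholders, not the final data product.

# Minimal back-off sketch. ngram_tables is assumed to be a list of frequency
# tables keyed by "4", "3" and "2", each with "word" and "frequency" columns
# (the format returned by n_gram_frequencies()). All names are illustrative.
predict_next_word <- function(input, ngram_tables){
  tokens <- tolower(unlist(strsplit(input, "\\s+")))
  for(n in 4:2){
    # Use the last n-1 words of the input as the context for an n-gram match
    context <- paste(tail(tokens, n - 1), collapse = " ")
    tab <- ngram_tables[[as.character(n)]]
    # Keep the n-grams whose first n-1 words equal the context
    matches <- tab[startsWith(as.character(tab$word), paste0(context, " ")), ]
    if(nrow(matches) > 0){
      best <- as.character(matches$word[which.max(matches$frequency)])
      # Return the last word of the most frequent matching n-gram
      return(tail(unlist(strsplit(best, " ")), 1))
    }
    # Otherwise fall back to the next shorter n-gram
  }
  NA_character_
}

Pruning very uncommon n-grams would then simply mean filtering the frequency tables by a minimum count before they are handed to such a lookup.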