Executive Summary

This is the R Markdown document for the Data Science Capstone course assignment. It explains my report submission, a milestone on the way to the Data Science Specialization final project.

The data for this analysis comes from a corpus called “HC Corpora”.

The main tasks done so far are:

  1. Read the English corpus and create one sample (30%) from each file;
  2. Clean up these samples, removing features such as punctuation, extra spaces, numbers, and upper case;
  3. Create a main corpus from these sample files with the 'tm' package;
  4. Tokenize words up to 3-grams, explore their frequency distributions, and make some demonstrative plots;
  5. Study and test several NLP packages available in [R] to observe their pros and cons, failures, memory consumption, and performance;
  6. Think about a more suitable model to predict words and meet the project proposal.

1. Loading Data

The dataset was downloaded from the course repository available at: Capstone Dataset.

I saved it on my desktop, in a subdirectory of the main working directory called “Capstone_Dataset”. Under this, there are 4 subfolders for distinct languages (de - German, en - English, fi - Finnish, ru - Russian). Each one contains 3 similar files representing corpora from SwiftKey - blogs, news, and twitter. They hold a large number of sentences and millions of words.

So far, I have analyzed only the English files:

Locale   File      Size (MB)   Lines       Words
en_US    blogs     200.4       1,010,242   37,334,131
en_US    news      196.3       899,288     34,372,530
en_US    twitter   159.4       2,360,148   30,373,583
—        Total     556.1       4,269,678   102,080,244
To access the code where I obtained these counts, see this link.
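As a rough illustration of how such counts can be obtained in [R] (the file paths and the count_file() helper are assumptions for this sketch, not the exact code behind the table):

# Hypothetical helper: report size, lines and words of one corpus file
count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(file    = basename(path),
             size_MB = round(file.info(path)$size / 1024^2, 1),
             lines   = length(lines),
             words   = sum(sapply(strsplit(lines, "\\s+"), length)))
}

files <- file.path("Capstone_Dataset", "en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, count_file))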

2. Sampling and First Cleaning

To sample the files above, I created a set of functions stored in this .R script
-> sampling.R (click the link to view the source on my GitHub).

It includes two other scripts:

  • systematic.R => a function for a systematic sample (the method I used to sample these large files; a minimal sketch of the idea follows this list);
  • clean.text.R => a set of functions, written by G. Sanchez, to do a first cleaning of the texts: removing punctuation and extra spaces and converting to lower case.
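For illustration only, a minimal sketch of a systematic sample over the lines of a text vector (it shows the idea, not the actual contents of systematic.R):

# Systematic sampling: keep every k-th line after a random start
# 'pct' is the desired sample fraction (e.g. 0.30 for roughly 30%)
systematic_sample <- function(x, pct = 0.30) {
  k     <- max(1, floor(1 / pct))   # sampling interval
  start <- sample.int(k, 1)         # random starting point in 1..k
  x[seq(from = start, to = length(x), by = k)]
}

# Example: keep roughly 30% of the blog lines
# blogs_sample <- systematic_sample(blogs_lines, pct = 0.30)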

A main function called “read_dataset()” reads the 3 files (blogs, news, and twitter), collects a 30% systematic sample of each, cleans it, and saves the result in 3 .RDS files (to save disk space and future processing time); a short saveRDS()/readRDS() sketch follows the file list below.

  • blogs.RDS (27.6 MB)
  • news.RDS (27.8 MB)
  • twitter.RDS (23.4 MB)
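As a reference, a minimal sketch of how such samples can be stored and reloaded with .RDS files (the object name blogs_sample is illustrative):

# Save a cleaned character vector in compressed .RDS format...
saveRDS(blogs_sample, file = "blogs.RDS")

# ...and reload it quickly in later sessions
blogs_sample <- readRDS("blogs.RDS")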

This process took 334 minutes (about 5.6 hours), and its log was saved in this file: sampling_output.txt.


3. Creating Work Corpus and Second Cleaning

This was the hardest job so far. I had some trouble converting the .RDS files into a source for the ‘tm’ Corpus() method. After that, I did another transformation to provide the input expected by the ‘quanteda’ package [corpus() method]. While generating two intermediary .txt files, I did a second cleaning, removing features that can cause trouble in the exploratory analysis and NLP processing - tweet characters (hashtags, @ mentions), numbers, and extra spaces. However, some methods from these packages did not work well. In ‘qdapRegex’, the functions that remove tweet features performed poorly; they took a very long time and I had to abort their use. I intend to solve this going forward.
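Since the ‘qdapRegex’ functions were too slow for me, one possible alternative is plain base-R regular expressions; the sketch below only illustrates that idea and is not the code I actually ran:

# Remove common tweet features with base R regex (vectorized over a character vector)
remove_tweet_features <- function(x) {
  x <- gsub("(?i)\\bRT\\b", " ", x, perl = TRUE)   # retweet marker
  x <- gsub("@\\w+", " ", x)                       # @mentions
  x <- gsub("#\\w+", " ", x)                       # hashtags
  x <- gsub("http\\S+|www\\.\\S+", " ", x)         # URLs
  x <- gsub("[0-9]+", " ", x)                      # numbers
  x <- gsub("\\s+", " ", x)                        # extra spaces
  trimws(x)
}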

To save time, I check whether part of the necessary work has already been done before running each step.

This is the core code I used in [R]:

library(tm)        # Corpus(), DirSource()
library(quanteda)  # corpus(), textfile()

SAMPLE_DIR   <- "./Samples/"
SAMPLE_CLEAN <- "sample_clean.txt"

cFiles <- c("news", "twitter", "blogs")

# generate 3 txt files with the sampled corpora from the SwiftKey dataset, if they don't exist yet
# (objRDS() is a helper from sampling.R that loads the corresponding .RDS sample)
if(!file.exists(SAMPLE_DIR)) {
  dir.create(SAMPLE_DIR)
  lapply(cFiles,
    function(i) {
      write.table(objRDS(i), paste0(SAMPLE_DIR, i, ".txt"), sep="\t",
                  col.names=FALSE, quote=FALSE, row.names=FALSE, append=TRUE)
    })
  print(paste("Directory", SAMPLE_DIR, "created with 3 sample text files."))
}

# create a primary Corpus with the "tm" package containing all sample texts (as a Source) to be cleaned
# (because the cleaning functions for the "quanteda" package aren't working well yet...)

# generate a clean text file with the sampled corpora, if it doesn't exist yet
if(!file.exists(SAMPLE_CLEAN)) {
  cp <- Corpus(DirSource(SAMPLE_DIR), readerControl = list(language="lat"))
  print("Corpus object created.")

  # removeTweetFeatures() and removeOtherFeatures() are my own cleaning helpers
  #cp <- removeTweetFeatures(cp)   # disabled: too slow (see the note above)
  cp <- removeOtherFeatures(cp, numbers=TRUE, punctuation=FALSE, spaces=TRUE, stopwords=TRUE)

  # rewrite the cleaned documents to a single file representing the cleaned sample corpora
  dfClean <- data.frame(text = unlist(lapply(seq_along(cp), function(j) as.character(cp[[j]]))),
                        stringsAsFactors = FALSE)
  write.table(dfClean, SAMPLE_CLEAN, sep="", col.names=FALSE, quote=FALSE, row.names=FALSE, append=TRUE)

  print(paste(SAMPLE_CLEAN, "is generated"))
  rm(cp)    # release memory
}

# this 'sample clean text file' can be a source for the "corpus" method in the "quanteda" package
mycorp <- corpus(textfile(SAMPLE_CLEAN))

4. Exploratory Analysis

Below we show a graph known as a “Word Cloud”. It is a handy tool to highlight the most frequent words found in these corpora (I chose 100). After removing ‘stop words’, we can see visually that a few tokens prevail: will, said, just, one, get, can, like.

[R] function to produce the word cloud plot, taking a ‘tm’ corpus object as parameter:

library(tm)            # TermDocumentMatrix()
library(wordcloud)
library(RColorBrewer)

plot_word_cloud <- function(corp) {
  tdm <- TermDocumentMatrix(corp)             # term-document matrix from the 'tm' corpus
  m   <- as.matrix(tdm)
  v   <- sort(rowSums(m), decreasing=TRUE)    # total frequency of each term
  d   <- data.frame(word = names(v), freq = v)
  pal <- brewer.pal(9, "BuGn")
  pal <- pal[-(1:2)]                          # drop the lightest shades
  png("wordcloud.png", width=1280, height=800)
  wordcloud(d$word, d$freq, scale=c(8,.3), min.freq=2, max.words=100,
            random.order=TRUE, rot.per=.15, colors=pal, vfont=c("sans serif","plain"))
  dev.off()
}
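A hypothetical call, assuming cp is the ‘tm’ corpus created in the previous section:

plot_word_cloud(cp)   # writes wordcloud.png to the working directory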

[Figure: word cloud of the 100 most frequent words]

To construct the n-gram tokens, I used the tokenize() function from the ‘quanteda’ package. This is the [R] code excerpt:

# build 1-, 2- and 3-gram tokens and plot their frequency distributions
# (table_tokens() and plot_bar_gram() are my own helpers; a sketch of them follows below)
for(i in 1:3) {
  tk <- tokenize(mycorp, ngrams=i, concatenator=',')
  plot_bar_gram(table_tokens(tk), i)
}
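table_tokens() and plot_bar_gram() are helper functions of mine; the sketch below shows what they could look like (an assumption for illustration, not their actual source):

# Count token frequencies from a 'tokenize' result (a list of character vectors)
table_tokens <- function(tk) {
  sort(table(unlist(tk)), decreasing = TRUE)
}

# Plot the top most frequent n-grams as a horizontal bar chart
plot_bar_gram <- function(freq, n, top = 20) {
  top_freq <- head(freq, top)
  par(mar = c(4, 10, 2, 1))
  barplot(rev(top_freq), horiz = TRUE, las = 1, col = "steelblue",
          main = paste0(n, "-gram frequency (top ", top, ")"),
          xlab = "frequency")
}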

These are the token frequency distribution graphs I obtained after applying the ‘quanteda’ tokenize() function (the loop above):

[Figures: token frequency distributions for uni-grams, bi-grams, and tri-grams]

The complete code used to produce this exploratory analysis is available in my GitHub repository for this project.


5. Next Steps

To enhance the prediction modelling and finish my project, I am taking into account:

  • Evaluate two key aspects that we should keep in mind - the size and runtime of the algorithm (a small measurement sketch follows this list).
  • Study and learn more about ‘quanteda’ package possibilities.
    Among these:
    • Incorporate a dictionary in this analysis
    • Try to use a thesaurus and antonyms to improve the prediction capacity
  • Whether to keep or remove ‘stop words’ and ‘punctuation’ when tokenizing. Why?
    • Stop words are an arbitrary choice imposed by the user, drawn from a pre-defined list of words to ignore.
    • That list may not perfectly fit the needs of a prediction model.
      While answering “Quiz 2”, I noticed that the choice was sometimes decisive for finding the best answer, depending on the case (if you want to see my solution, here it is).
  • How I could easily switch to corpora in other languages (very important to me, because my native language is Portuguese).
  • Design and develop the model to run in a Shiny app (on the shinyapps.io server), with low memory consumption.
  • Build a predictive model based on the previous data modeling steps.
    And…
  • Evaluate the model for efficiency and accuracy, based on the time to return answers and how often the 1st, 2nd, and 3rd predicted words are correct.
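Regarding the first item above (size and runtime), a minimal sketch of how these could be measured in [R]; predict_next_word() and model are hypothetical names for the future prediction function and its data:

# Runtime: how long the model takes to suggest the next word
timing <- system.time(predict_next_word(model, "I want to"))
print(timing["elapsed"])

# Size: memory footprint of the model object
print(object.size(model), units = "MB")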
Published on 2015-12-29 10:23:04, -0200 (BRST).