Executive Summary

This is the R Markdown document for the Data Science Capstone course assignment. It explains my report submission, a milestone on the way to the Data Science Specialization final project.

The data for this analysis comes from a corpus called “HC Corpora”.

The main tasks done so far are:

  1. Read the English corpus and create one sample (30%) from each file;
  2. Clean up these samples, removing features such as punctuation, extra spaces, numbers, and upper case;
  3. Create a main corpus from these sample files with the 'tm' package;
  4. Tokenize words up to 3-grams, explore their frequency distributions, and make some demonstrative plots;
  5. Study and test several NLP packages available in [R] to observe their pros and cons, failures, memory consumption, and performance;
  6. Think about a more suitable model to predict words and meet the project proposal.

1. Loading Data

The dataset was downloaded from the course repository available at: Capstone Dataset.

I saved it on my desktop, in a subdirectory of the main working directory called “Capstone_Dataset”. Under this, there are 4 subfolders for distinct languages (de - German, en - English, fi - Finnish, ru - Russian). Each one contains 3 similar files representing corpora from SwiftKey - blogs, news, and twitter. They hold a large number of sentences and millions of words.

So far, I have analyzed only the English files:

Locale   File      Size (MB)   Lines       Words
en_US    blogs     200.4       1,010,242   37,334,131
en_US    news      196.3       899,288     34,372,530
en_US    twitter   159.4       2,360,148   30,373,583
—        Total     556.1       4,269,678   102,080,244
To access the code where I obtained these counts, see this link.
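As a rough illustration of how such counts can be obtained in [R] (the file paths and the count_file() helper are assumptions for this sketch, not the exact code behind the table):

# Hypothetical helper: report size, lines and words of one corpus file
count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(file    = basename(path),
             size_MB = round(file.info(path)$size / 1024^2, 1),
             lines   = length(lines),
             words   = sum(sapply(strsplit(lines, "\\s+"), length)))
}

files <- file.path("Capstone_Dataset", "en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, count_file))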

2. Sampling and First Cleaning

To sample the files above, I created a set of functions stored in this .R script
-> sampling.R (click the link to view the source on my GitHub).

It includes two other scripts:

  • systematic.R => a function for a systematic sample (the method I used to sample these large files; a minimal sketch of the idea follows this list);
  • clean.text.R => a set of functions, written by G. Sanchez, to do a first cleaning of the texts: removing punctuation and extra spaces and converting to lower case.
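For illustration only, a minimal sketch of a systematic sample over the lines of a text vector (it shows the idea, not the actual contents of systematic.R):

# Systematic sampling: keep every k-th line after a random start
# 'pct' is the desired sample fraction (e.g. 0.30 for roughly 30%)
systematic_sample <- function(x, pct = 0.30) {
  k     <- max(1, floor(1 / pct))   # sampling interval
  start <- sample.int(k, 1)         # random starting point in 1..k
  x[seq(from = start, to = length(x), by = k)]
}

# Example: keep roughly 30% of the blog lines
# blogs_sample <- systematic_sample(blogs_lines, pct = 0.30)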

A main function called “read_dataset()” reads the 3 files (blogs, news, and twitter), collects a 30% systematic sample of each, cleans it, and saves the result in 3 .RDS files (to save disk space and future processing time); a short saveRDS()/readRDS() sketch follows the file list below.

  • blogs.RDS (27.6 MB)
  • news.RDS (27.8 MB)
  • twitter.RDS (23.4 MB)
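As a reference, a minimal sketch of how such samples can be stored and reloaded with .RDS files (the object name blogs_sample is illustrative):

# Save a cleaned character vector in compressed .RDS format...
saveRDS(blogs_sample, file = "blogs.RDS")

# ...and reload it quickly in later sessions
blogs_sample <- readRDS("blogs.RDS")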

This process took 334 minutes (about 5.6 hours), and its log was saved in this file: sampling_output.txt.


3. Creating Work Corpus and Second Cleaning

This was the hardest job so far. I had some trouble converting the .RDS files into a source for the ‘tm’ Corpus() method. After that, I did another transformation to provide the input expected by the ‘quanteda’ package [corpus() method]. While generating two intermediary .txt files, I did a second cleaning, removing features that can cause trouble in the exploratory analysis and NLP processing - tweet characters (hashtags, @ mentions), numbers, and extra spaces. However, some methods from these packages did not work well. In ‘qdapRegex’, the functions that remove tweet features performed poorly; they took a very long time and I had to abort their use. I intend to solve this going forward.
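Since the ‘qdapRegex’ functions were too slow for me, one possible alternative is plain base-R regular expressions; the sketch below only illustrates that idea and is not the code I actually ran:

# Remove common tweet features with base R regex (vectorized over a character vector)
remove_tweet_features <- function(x) {
  x <- gsub("(?i)\\bRT\\b", " ", x, perl = TRUE)   # retweet marker
  x <- gsub("@\\w+", " ", x)                       # @mentions
  x <- gsub("#\\w+", " ", x)                       # hashtags
  x <- gsub("http\\S+|www\\.\\S+", " ", x)         # URLs
  x <- gsub("[0-9]+", " ", x)                      # numbers
  x <- gsub("\\s+", " ", x)                        # extra spaces
  trimws(x)
}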

To save time, I check whether part of the necessary work has already been done before running each step.

This is the core code I used in [R]:

library(tm)        # Corpus(), DirSource()
library(quanteda)  # corpus(), textfile()

SAMPLE_DIR   <- "./Samples/"
SAMPLE_CLEAN <- "sample_clean.txt"

cFiles <- c("news", "twitter", "blogs")

# generate 3 txt files with the sampled corpora from the SwiftKey dataset, if they don't exist yet
# (objRDS() is a helper from sampling.R that loads the corresponding .RDS sample)
if(!file.exists(SAMPLE_DIR)) {
  dir.create(SAMPLE_DIR)
  lapply(cFiles,
    function(i) {
      write.table(objRDS(i), paste0(SAMPLE_DIR, i, ".txt"), sep="\t",
                  col.names=FALSE, quote=FALSE, row.names=FALSE, append=TRUE)
    })
  print(paste("Directory", SAMPLE_DIR, "created with 3 sample text files."))
}

# create a primary Corpus with the "tm" package containing all sample texts (as a Source) to be cleaned
# (because the cleaning functions for the "quanteda" package aren't working well yet...)

# generate a clean text file with the sampled corpora, if it doesn't exist yet
if(!file.exists(SAMPLE_CLEAN)) {
  cp <- Corpus(DirSource(SAMPLE_DIR), readerControl = list(language="lat"))
  print("Corpus object created.")

  # removeTweetFeatures() and removeOtherFeatures() are my own cleaning helpers
  #cp <- removeTweetFeatures(cp)   # disabled: too slow (see the note above)
  cp <- removeOtherFeatures(cp, numbers=TRUE, punctuation=FALSE, spaces=TRUE, stopwords=TRUE)

  # rewrite the cleaned documents to a single file representing the cleaned sample corpora
  dfClean <- data.frame(text = unlist(lapply(seq_along(cp), function(j) as.character(cp[[j]]))),
                        stringsAsFactors = FALSE)
  write.table(dfClean, SAMPLE_CLEAN, sep="", col.names=FALSE, quote=FALSE, row.names=FALSE, append=TRUE)

  print(paste(SAMPLE_CLEAN, "is generated"))
  rm(cp)    # release memory
}

# this 'sample clean text file' can be a source for the "corpus" method in the "quanteda" package
mycorp <- corpus(textfile(SAMPLE_CLEAN))

4. Exploratory Analysis

Below we show a graph known as a “Word Cloud”. It is a handy tool to highlight the most frequent words found in these corpora (I chose 100). After removing ‘stop words’, we can see visually that a few tokens prevail: will, said, just, one, get, can, like.

[R] function to produce the word cloud plot, taking a ‘tm’ corpus object as parameter:

library(tm)            # TermDocumentMatrix()
library(wordcloud)
library(RColorBrewer)

plot_word_cloud <- function(corp) {
  tdm <- TermDocumentMatrix(corp)             # term-document matrix from the 'tm' corpus
  m   <- as.matrix(tdm)
  v   <- sort(rowSums(m), decreasing=TRUE)    # total frequency of each term
  d   <- data.frame(word = names(v), freq = v)
  pal <- brewer.pal(9, "BuGn")
  pal <- pal[-(1:2)]                          # drop the lightest shades
  png("wordcloud.png", width=1280, height=800)
  wordcloud(d$word, d$freq, scale=c(8,.3), min.freq=2, max.words=100,
            random.order=TRUE, rot.per=.15, colors=pal, vfont=c("sans serif","plain"))
  dev.off()
}
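A hypothetical call, assuming cp is the ‘tm’ corpus created in the previous section:

plot_word_cloud(cp)   # writes wordcloud.png to the working directory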

[Figure: word cloud of the 100 most frequent words]

To construct the n-gram tokens, I used the tokenize() function from the ‘quanteda’ package. This is the [R] code excerpt:

# build 1-, 2- and 3-gram tokens and plot their frequency distributions
# (table_tokens() and plot_bar_gram() are my own helpers; a sketch of them follows below)
for(i in 1:3) {
  tk <- tokenize(mycorp, ngrams=i, concatenator=',')
  plot_bar_gram(table_tokens(tk), i)
}
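table_tokens() and plot_bar_gram() are helper functions of mine; the sketch below shows what they could look like (an assumption for illustration, not their actual source):

# Count token frequencies from a 'tokenize' result (a list of character vectors)
table_tokens <- function(tk) {
  sort(table(unlist(tk)), decreasing = TRUE)
}

# Plot the top most frequent n-grams as a horizontal bar chart
plot_bar_gram <- function(freq, n, top = 20) {
  top_freq <- head(freq, top)
  par(mar = c(4, 10, 2, 1))
  barplot(rev(top_freq), horiz = TRUE, las = 1, col = "steelblue",
          main = paste0(n, "-gram frequency (top ", top, ")"),
          xlab = "frequency")
}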

These are the token frequency distribution graphs I obtained after applying the ‘quanteda’ tokenize() function (the loop above):

[Figures: token frequency distributions for uni-grams, bi-grams, and tri-grams]

The complete code used to produce this exploratory analysis is available in my GitHub repository for this project.


5. Next Steps

To enhance the prediction modelling and finish my project, I am taking into account:

  • Evaluate two key aspects that we should keep in mind - the size and runtime of the algorithm (a small measurement sketch follows this list).
  • Study and learn more about ‘quanteda’ package possibilities.
    Among these:
    • Incorporate a dictionary in this analysis
    • Try to use a thesaurus and antonyms to improve the prediction capacity
  • Whether to keep or remove ‘stop words’ and ‘punctuation’ when tokenizing. Why?
    • Stop words are an arbitrary choice imposed by the user, drawn from a pre-defined list of words to ignore.
    • That list may not perfectly fit the needs of a prediction model.
      While answering “Quiz 2”, I noticed that the choice was sometimes decisive for finding the best answer, depending on the case (if you want to see my solution, here it is).
  • How I could easily switch to corpora in other languages (very important to me, because my native language is Portuguese).
  • Design and develop the model to run in a Shiny app (on the shinyapps.io server), with low memory consumption.
  • Build a predictive model based on the previous data modeling steps.
    And…
  • Evaluate the model for efficiency and accuracy, based on the time to return answers and how often the 1st, 2nd, and 3rd predicted words are correct.
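Regarding the first item above (size and runtime), a minimal sketch of how these could be measured in [R]; predict_next_word() and model are hypothetical names for the future prediction function and its data:

# Runtime: how long the model takes to suggest the next word
timing <- system.time(predict_next_word(model, "I want to"))
print(timing["elapsed"])

# Size: memory footprint of the model object
print(object.size(model), units = "MB")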
Published on 2015-12-29 10:23:04, -0200 (BRST).