The present report gives a brief introduction to the Capstone Project of the Coursera Data Science Specialization. The main goal of the project is to develop a Shiny app capable of predicting the next word, given some introduced text.
Three text files from different backgrounds (blogs, news and Twitter) are available for this task. This report addresses the initial project tasks of loading, cleaning and exploring the datasets, with the final objective of understanding the best way to create the N-grams that will constitute the model.
The first hurdle to overcome was the dataset size, which totals almost 600 MB. Although it is possible to read all the data into R, some of the subsequent steps required the data to be partitioned and processed in chunks.
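A minimal sketch of this chunked approach, assuming the raw files live under a final/en_US folder (the path and the chunk size of 50000 lines are illustrative, not taken from the project code):

con <- file("final/en_US/en_US.blogs.txt", "r")
repeat {
    chunk <- readLines(con, n = 50000, skipNul = TRUE)   # read the next block of lines
    if (length(chunk) == 0) break                        # stop at end of file
    # ... clean and process each chunk here ...
}
close(con)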
There is a considerable number of R packages to assist with natural language processing projects. Some of them have similar functions, and many of them, despite offering very powerful features, lack efficiency when applied to large amounts of data. In this stage of loading and cleaning the data, the following packages were used:
library(tm)
library(RWeka)
library(ggplot2)
library(wordcloud)
library(NLP)
library(SnowballC)
library(textcat)   # needed for the language detection step below
Before starting to work with the dataset, some pre-cleaning was performed, eliminating non-ASCII characters, foreign-language lines, and some special “words” such as web sites and email addresses.
Several types of abbreviations and contractions were found, which were manually converted to the expected full words.
# convert to ASCII, dropping characters that cannot be represented
data <- iconv(data, "UTF-8", "ASCII", sub="")
# remove stray control characters (\001)
data <- gsub("\1", "",data)
# keep only the lines detected as English (textcat package)
data <- data[textcat(data, p = textcat::ECIMCI_profiles) == "en"]
data <- paste(data, collapse = " ")
# strip email addresses and web site addresses
data <- gsub("[a-z]+@[a-z]+\\.", "",data)
data <- gsub("www\\.[a-z]+\\.[a-z]+", "",data)
# mark remaining non-printable characters with a placeholder
data <- gsub("[^[:print:]]", "###",data)
# expand contractions, treating ### as an apostrophe stand-in
data <- gsub("[Ii]t's ", "it is ",data)
data <- gsub("'s ", " ",data)
data <- gsub("###s ", " ",data)
data <- gsub("'ve ", " have ",data)
data <- gsub("'u ", " you ",data)
data <- gsub("'r ", " are ",data)
data <- gsub("###ve ", " have ",data)
data <- gsub("n't", " not",data)
data <- gsub("n###t", " not",data)
data <- gsub("[Nn]'t", " not",data)
data <- gsub("[Nn]###t", " not",data)
data <- gsub("'m ", " am ",data)
data <- gsub("###m ", " am ",data)
# remove any leftover placeholders and stray 'n
data <- gsub("###", "",data)
data <- gsub(" 'n ", " ",data)
# capitalize the standalone pronoun "i"
data <- gsub("^i | i ", " I ",data)
The last task of this loading process was to separate the text into sentences in order to form coherent N-grams. A start-of-sentence mark was introduced at the beginning of every sentence.
# split the text into sentences on sentence-ending punctuation
sent_vec <- unlist(strsplit(data, "[?.,!]+"))
# prepend the start-of-sentence marker to every sentence
start <- "<s>"
sent_vec <- gsub("^", paste0(start, " "), sent_vec)
The table below shows some basic statistics for the datasets. Despite having roughly double the number of lines of the other datasets, the Twitter dataset has a comparable number of sentences. This can be related to its lack of writing structure.
| file | lines | sentences | words |
|---|---|---|---|
| Blogs | 899288 | 4039408 | 37465856 |
| News | 1010242 | 4146647 | 34852897 |
| Twitter | 2360148 | 3913034 | 30823628 |
| All | 4269678 | 12099089 | 103142381 |
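A sketch of how such counts can be obtained for one file; the word count via a whitespace split is an approximation and not necessarily the exact method used for the table:

lines <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
n_lines <- length(lines)
# sentences: split on the same punctuation marks used above
n_sentences <- length(unlist(strsplit(lines, "[?.,!]+")))
# words: approximate count by splitting on whitespace
n_words <- length(unlist(strsplit(lines, "\\s+")))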
For this exploration, 5000 sentences were sampled from each dataset and uni-grams, bi-grams and tri-grams were created from them. The table below presents the number of uni-grams, bi-grams and tri-grams created. Although there are some differences among the datasets, the numbers of n-grams created are similar.
| UniGram_Blogs | UniGram_News | UniGram_Twitter | BiGram_Blogs | BiGram_News | BiGram_Twitter | TriGram_Blogs | TriGram_News | TriGram_Twitter |
|---|---|---|---|---|---|---|---|---|
| 6543 | 6835 | 6597 | 28281 | 27450 | 29062 | 37034 | 33529 | 38259 |
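A minimal sketch of how such n-gram counts can be produced with RWeka's NGramTokenizer, reusing the sent_vec sentence vector created above (the sampling call and object names are illustrative):

set.seed(1234)
sample_sent <- sample(sent_vec, 5000)
# tokenize the sample into bi-grams and count the distinct ones
bigrams <- NGramTokenizer(paste(sample_sent, collapse = " "),
                          Weka_control(min = 2, max = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
length(bigram_freq)   # number of distinct bi-grams

The same call with min = 1, max = 1 or min = 3, max = 3 produces the uni-gram and tri-gram counts.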
The following table presents the quantile distribution of the word frequencies. In all three datasets 50% of the words appear only once and 90% of them have a frequency of 9 or less. The bi-gram and tri-gram analyses show an escalation of this trend: the n-grams that appear only once represent more than 80% and 90% of the dictionary, respectively.
| Quantile | UniGram_Blogs | UniGram_News | UniGram_Twitter | BiGram_Blogs | BiGram_News | BiGram_Twitter | TriGram_Blogs | TriGram_News | TriGram_Twitter |
|---|---|---|---|---|---|---|---|---|---|
| 0% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 10% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 20% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 30% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 40% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 50% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 60% | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 70% | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 80% | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 1 | 1 |
| 90% | 9 | 9 | 9 | 2 | 2 | 2 | 1 | 1 | 1 |
| 100% | 2311 | 2360 | 2375 | 439 | 383 | 499 | 49 | 57 | 52 |
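These deciles can be computed directly from the frequency tables, for example for the uni-grams (reusing the sample_sent vector sketched above):

unigrams <- NGramTokenizer(paste(sample_sent, collapse = " "),
                           Weka_control(min = 1, max = 1))
unigram_freq <- table(unigrams)
# decile distribution of the uni-gram frequencies
quantile(as.numeric(unigram_freq), probs = seq(0, 1, by = 0.1))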
The uni-gram distribution is similar in all datasets, with most of the top words common to the three of them. In all cases “the” is the most frequent word.
The bi-gram distribution presents the same pattern already seen in the uni-grams. The <s> symbol represents the beginning of a sentence.
The tri-gram plots keep the trend of short words dominating the highest frequencies; however, these plots already show some words that do not appear in the charts above.
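The kind of frequency plot described above can be produced with ggplot2, for example for the top uni-grams (a sketch, not the exact figure code; unigram_freq is the table sketched earlier):

library(ggplot2)
top20 <- sort(unigram_freq, decreasing = TRUE)[1:20]
df <- data.frame(term = names(top20), freq = as.numeric(top20))
ggplot(df, aes(x = reorder(term, freq), y = freq)) +
    geom_col() +
    coord_flip() +
    labs(x = "Uni-gram", y = "Frequency")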
The analysis revealed a dataset with many sparse terms: a sample of 5000 lines created more than 30000 tri-grams in each dataset, yet 90% of them appear only once. The data analyzed here had already been stemmed (inflected forms of the same root word, e.g. run, runs, running, were collapsed), which means that in the original dataset this issue would have been even more evident. As a further treatment, terms that refer to people's names, brands or places could be identified and eliminated, reducing the number of uncommon words/n-grams.
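One simple way to reduce this sparsity is to drop the n-grams that appear only once before building the model; a sketch using the frequency table from above (the threshold of 1 is illustrative):

# keep only the bi-grams observed more than once
pruned_bigrams <- bigram_freq[bigram_freq > 1]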
In the next steps of this project the three datasets will be joined and then split into a training set and a test set.
The training set will be used to build the models.
The models' performance will be evaluated by applying a perplexity function to the test set.
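As a rough illustration of that evaluation step, perplexity can be computed from the probabilities the model assigns to each test word; the function below is a sketch and does not come from the project code:

# perplexity of per-word probabilities p_1 ... p_N: exp(-(1/N) * sum(log(p_i)))
perplexity <- function(word_probs) {
    exp(-mean(log(word_probs)))
}

# hypothetical probabilities assigned by a model to four test words
perplexity(c(0.10, 0.05, 0.20, 0.01))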