This report is part of the Coursera Data Science Capstone project. Its goal is to understand the basic relationships observed in the data and to prepare for building a first linguistic model. The training dataset can be found here. It contains three files (blogs, news and Twitter) for each of four languages (Russian, Finnish, German and English); this report focuses on the English-language files. The names of the data files are as follows:
en_US.blogs.txt
en_US.twitter.txt
en_US.news.txt
## File Size (MB) Lines Words Avg words per line
## 1 blogs 205 899288 37546246 41.75108
## 2 news 201 1010242 34762395 34.40997
## 3 twitter 163 2360148 30093410 12.75065
The goal for this prediction model is to minimize both the size and runtime of the model in order to provide a reasonable experience to the user. To accomplish this we work with a random sample of 30,000 lines out of the original 4,269,678.
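A minimal sketch of that sampling step follows. The report only shows the resulting source object, so the seed value and the sample_text name are assumptions; the file names are the ones listed above.

library(tm)

set.seed(1234)                            # seed value assumed, for a reproducible sample
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
all_lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))
sample_text <- sample(all_lines, 30000)   # 30,000 of the 4,269,678 lines
source <- VectorSource(sample_text)       # passed to VCorpus() below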
Taking less than a 1% random sample of the initial corpus keeps it easy to manage for exploration. We then apply some basic text-mining preprocessing:
# Creating a corpus using a VectorSource
corpus <- VCorpus(source)
rm(source)
# Inspect document 10 after each transformation to see its effect
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "4.Tribute My Ass"
corpus <- tm_map(corpus, content_transformer(tolower))
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "4.tribute my ass"
corpus <- tm_map(corpus, removeNumbers)
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] ".tribute my ass"
corpus <- tm_map(corpus, removePunctuation)
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "tribute my ass"
corpus <- tm_map(corpus, removeWords, profanity_vector)
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "tribute my "
corpus <- tm_map(corpus, stripWhitespace)
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "tribute my "
The corpus is tokenized into 1-, 2- and 3-grams, and n-gram frequency tables are built (here with textcnt from the tau package) to understand the frequency of words and phrases:
library(tau)   # provides textcnt() for n-gram counting

dataframe <- data.frame(text = unlist(sapply(corpus, `[`, "content")),
                        stringsAsFactors = F)

# Count n-grams of order n over the sampled text
tokenize_ngrams <- function(x, n = 3) {
  return(textcnt(x, method = "string", n = n, decreasing = TRUE))
}

unigrams <- tokenize_ngrams(dataframe, n = 1)
bigrams  <- tokenize_ngrams(dataframe, n = 2)
trigrams <- tokenize_ngrams(dataframe, n = 3)

# Convert a textcnt object into a data frame of terms and their frequencies
freq_ngram <- function(txtcnt) {
  return(data.frame(word = rownames(as.data.frame(unclass(txtcnt))),
                    freq = unclass(txtcnt)))
}

unigramFreq <- freq_ngram(unigrams)
bigramFreq  <- freq_ngram(bigrams)
trigramFreq <- freq_ngram(trigrams)
The following plots show the unigrams that repeat more than 5,000 and 100 times, respectively.
The following plots show the bigrams and trigrams that repeat more than 800 and 100 times, respectively:
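A minimal sketch of how one such frequency plot can be produced from the tables above, assuming ggplot2 is available; the 5,000 cut-off mirrors the unigram threshold quoted in the text:

library(ggplot2)

# Bar plot of the unigrams occurring more than 5,000 times
ggplot(subset(unigramFreq, freq > 5000),
       aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency")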
Only 142 words are needed to cover 50% of the corpus, and 7,418 words cover 90% of it. This corresponds to 0.328% and 17.13% of the word dictionary of the corpus, respectively.
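A minimal sketch of how these coverage figures can be computed from unigramFreq (which is already sorted by decreasing frequency, since textcnt was called with decreasing = TRUE):

# Number of most-frequent words needed to cover a given share of all word occurrences
coverage <- function(freqTable, target = 0.5) {
  cum_share <- cumsum(freqTable$freq) / sum(freqTable$freq)
  which(cum_share >= target)[1]
}

coverage(unigramFreq, 0.5)   # ~142 words for 50% coverage
coverage(unigramFreq, 0.9)   # ~7,418 words for 90% coverage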
The code developed so far does not distinguish between languages. When foreign-language words need to be filtered out, one can use the “tm_map” function with “removeWords” and a word list built from a language dictionary.
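A hedged sketch of that idea; foreign_words would have to be built from an actual word list, and the file name below is hypothetical:

# foreign_words is assumed to be a character vector of non-English terms
# (file name hypothetical)
foreign_words <- readLines("non_english_words.txt", encoding = "UTF-8")
corpus <- tm_map(corpus, removeWords, foreign_words)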
There are several approaches that could be used to increase coverage. One of them is stemming, sketched below:
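A minimal stemming sketch using tm's stemDocument, which relies on the SnowballC package being installed:

library(tm)
library(SnowballC)   # provides the Porter stemmer used by stemDocument()

# Map inflected word forms to a common stem so one entry covers several surface forms
corpus <- tm_map(corpus, stemDocument, language = "english")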