This report is part of the Coursera Data Science Capstone project. Its goal is to understand the basic relationships observed in the data and to prepare for building a first linguistic model. The training dataset can be found here. It contains three files (blogs, news and Twitter) for each of four languages (Russian, Finnish, German and English); this report focuses on the English-language files. The names of the data files are as follows:
en_US.blogs.txt
en_US.twitter.txt
en_US.news.txt
## File Size (MB) Lines Words Avg words per line
## 1 blogs 205 899288 37546246 41.75108
## 2 news 201 1010242 34762395 34.40997
## 3 twitter 163 2360148 30093410 12.75065
The goal for this prediction model is to minimize both the size and runtime of the model in order to provide a reasonable experience to the user. To accomplish this we work with a random sample of 30,000 lines out of the original 4,269,678.
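A minimal sketch of that sampling step follows. The report only shows the resulting source object, so the seed value and the sample_text name are assumptions; the file names are the ones listed above.

library(tm)

set.seed(1234)                            # seed value assumed, for a reproducible sample
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
all_lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))
sample_text <- sample(all_lines, 30000)   # 30,000 of the 4,269,678 lines
source <- VectorSource(sample_text)       # passed to VCorpus() below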
Taking less than a 1% random sample of the initial corpus keeps it easy to manage for exploration. We then apply some basic text-mining preprocessing:
# Creating a corpus using a VectorSource
corpus <- VCorpus(source)
rm(source)
# Inspect document 10 after each transformation to see its effect
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "4.Tribute My Ass"
corpus <- tm_map(corpus, content_transformer(tolower))
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "4.tribute my ass"
corpus <- tm_map(corpus, removeNumbers)
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] ".tribute my ass"
corpus <- tm_map(corpus, removePunctuation)
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "tribute my ass"
corpus <- tm_map(corpus, removeWords, profanity_vector)
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "tribute my "
corpus <- tm_map(corpus, stripWhitespace)
data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)[10,]
## [1] "tribute my "
The corpus is tokenized into 1-, 2- and 3-grams, and n-gram frequency tables are built (here with textcnt from the tau package) to understand the frequency of words and phrases:
library(tau)   # provides textcnt() for n-gram counting

dataframe <- data.frame(text = unlist(sapply(corpus, `[`, "content")),
                        stringsAsFactors = F)

# Count n-grams of order n over the sampled text
tokenize_ngrams <- function(x, n = 3) {
  return(textcnt(x, method = "string", n = n, decreasing = TRUE))
}

unigrams <- tokenize_ngrams(dataframe, n = 1)
bigrams  <- tokenize_ngrams(dataframe, n = 2)
trigrams <- tokenize_ngrams(dataframe, n = 3)

# Convert a textcnt object into a data frame of terms and their frequencies
freq_ngram <- function(txtcnt) {
  return(data.frame(word = rownames(as.data.frame(unclass(txtcnt))),
                    freq = unclass(txtcnt)))
}

unigramFreq <- freq_ngram(unigrams)
bigramFreq  <- freq_ngram(bigrams)
trigramFreq <- freq_ngram(trigrams)
The following plots show the unigrams that repeat more than 5,000 and 100 times, respectively.
The following plots show the bigrams and trigrams that repeat more than 800 and 100 times, respectively:
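A minimal sketch of how one such frequency plot can be produced from the tables above, assuming ggplot2 is available; the 5,000 cut-off mirrors the unigram threshold quoted in the text:

library(ggplot2)

# Bar plot of the unigrams occurring more than 5,000 times
ggplot(subset(unigramFreq, freq > 5000),
       aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency")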
Only 142 words are needed to cover 50% of the corpus, and 7,418 words cover 90% of it. This corresponds to 0.328% and 17.13% of the word dictionary of the corpus, respectively.
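A minimal sketch of how these coverage figures can be computed from unigramFreq (which is already sorted by decreasing frequency, since textcnt was called with decreasing = TRUE):

# Number of most-frequent words needed to cover a given share of all word occurrences
coverage <- function(freqTable, target = 0.5) {
  cum_share <- cumsum(freqTable$freq) / sum(freqTable$freq)
  which(cum_share >= target)[1]
}

coverage(unigramFreq, 0.5)   # ~142 words for 50% coverage
coverage(unigramFreq, 0.9)   # ~7,418 words for 90% coverage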
The code developed so far does not distinguish between languages. When foreign-language words need to be filtered out, one can use the “tm_map” function with “removeWords” and a word list built from a language dictionary.
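A hedged sketch of that idea; foreign_words would have to be built from an actual word list, and the file name below is hypothetical:

# foreign_words is assumed to be a character vector of non-English terms
# (file name hypothetical)
foreign_words <- readLines("non_english_words.txt", encoding = "UTF-8")
corpus <- tm_map(corpus, removeWords, foreign_words)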
There are several approaches that could be used to increase coverage. One of them is stemming, sketched below:
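A minimal stemming sketch using tm's stemDocument, which relies on the SnowballC package being installed:

library(tm)
library(SnowballC)   # provides the Porter stemmer used by stemDocument()

# Map inflected word forms to a common stem so one entry covers several surface forms
corpus <- tm_map(corpus, stemDocument, language = "english")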