Week 2 - Data Science Capstone

Libraries and basic configuration

  suppressPackageStartupMessages(library("NLP"))
  suppressPackageStartupMessages(library("tm"))
  suppressPackageStartupMessages(library("wordcloud"))
  suppressPackageStartupMessages(library("RColorBrewer"))
  suppressPackageStartupMessages(source("functions.R"))
  suppressPackageStartupMessages(library("RWeka"))
  suppressPackageStartupMessages(library("qdap"))
  suppressPackageStartupMessages(library("ggplot2"))
  suppressPackageStartupMessages(library("ngram"))

Exploratory Analysis - Instructions

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationship between words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

1. Some words are more frequent than others - what are the distributions of word frequencies?
2. What are the frequencies of 2-grams and 3-grams in the dataset?
3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
4. How do you evaluate how many of the words come from foreign languages?
5. Can you think of a way to increase the coverage -- identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Introduction and Conclusions

  1. In this work, I first show one way to sample and clean the text data from blogs, news, and Twitter. Next, I build a “Corpus” object, a collection of documents that supports text analysis. Finally, I define two functions that support the n-gram analysis: my_ngram and coverage.

  2. The main conclusions are: the data must be sampled to be workable on a personal computer, stopwords are the most frequent terms in every scenario, and a small number of distinct words covers a large share of all word instances.

Read Data

blogs_en   <- readLines("final/en_US/en_US.blogs.txt",  encoding = "UTF-8")
news_en    <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8")
twitter_en <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
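
Depending on the platform, readLines() can warn about embedded nul characters or an incomplete final line (this is often reported for en_US.news.txt). A common workaround, shown here only as a sketch and not used above, is to read the file through a binary connection with skipNul = TRUE:

# Sketch of a workaround (not used above): read through a binary connection
# and skip embedded nul characters
con     <- file("final/en_US/en_US.news.txt", open = "rb")
news_en <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)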


summary(blogs_en)
##    Length     Class      Mode 
##    899288 character character
summary(news_en)
##    Length     Class      Mode 
##   1010242 character character
summary(twitter_en)
##    Length     Class      Mode 
##   2360148 character character
b <- wordcount(blogs_en)
n <- wordcount(news_en)
t <- wordcount(twitter_en)

barplot(c(b,n,t), names.arg =  c("blog", "news", "twitter"), horiz = TRUE, main="Word Count")
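
For a compact overview, the line and word counts can also be gathered into one table. The sketch below only reuses the objects already created above; corpus_stats is a hypothetical name:

# Hypothetical summary table combining line counts and word counts per source
corpus_stats <- data.frame(
  source = c("blogs", "news", "twitter"),
  lines  = c(length(blogs_en), length(news_en), length(twitter_en)),
  words  = c(b, n, t)
)
corpus_stats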

Set seed, sample, and clean the data

#set.seed
set.seed(3010)
#sampling
qtd_sample    <- 500
t_blogs_en    <- sample(blogs_en, qtd_sample)
t_news_en     <- sample(news_en, qtd_sample)
t_twitter_en  <- sample(twitter_en, qtd_sample)
t_all         <- c(t_blogs_en, t_news_en, t_twitter_en)
#cleaning
t_all         <- removePunctuation(t_all)
t_all         <- stripWhitespace(t_all)
t_all         <- removeNumbers(t_all)
t_all         <- tolower(t_all)
t_all         <- t_all[which(t_all!="")]

summary(t_all)
##    Length     Class      Mode 
##      1500 character character
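
The cleaning steps above can also be wrapped into a small reusable function. The sketch below (clean_text is a hypothetical helper introduced here) applies exactly the same tm transformations:

# Hypothetical wrapper around the same cleaning steps used above
clean_text <- function(x){
  x <- removePunctuation(x)
  x <- stripWhitespace(x)
  x <- removeNumbers(x)
  x <- tolower(x)
  x[x != ""]
}
# equivalent to the cleaning above:
# t_all <- clean_text(c(t_blogs_en, t_news_en, t_twitter_en))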

1. Some words are more frequent than others - what are the distributions of word frequencies?

modi_txt <- t_all

modi              <- Corpus(VectorSource(modi_txt))
modi_no_stopwords <- tm_map(modi, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(modi, removeWords, stopwords("english")):
## transformation drops documents
tdm_modi          <- TermDocumentMatrix (modi)
TDM1              <- as.matrix(tdm_modi)
v                 <- sort(rowSums(TDM1), decreasing = TRUE)

print("With stopwords")
## [1] "With stopwords"
wordcloud (modi,  max.words=30,  random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

print("Without stopwords")
## [1] "Without stopwords"
wordcloud (modi_no_stopwords, max.words=30,  random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
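
The sorted term-frequency vector v computed above can also be inspected directly, for example to list the terms behind the first word cloud (stopwords included):

# ten most frequent terms in the term-document matrix (stopwords included)
head(v, 10)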

my_ngram <- function(x, n){
  # tokenize the input text with RWeka's NGramTokenizer (default control),
  # then build n-grams of size n from the resulting tokens
  gramN <- NGramTokenizer(x)
  gramN <- NGramTokenizer(gramN, Weka_control(min = n, max = n))
  # tabulate the n-grams and sort by decreasing frequency
  gramN <- data.frame(table(gramN))
  gramN <- gramN[order(gramN$Freq, decreasing = TRUE), ]
  return(gramN)
}

gramOne <- my_ngram(t_all, 1)

gramOne[1:10,]
##      gramN  Freq
## 8127   the 12177
## 8256    to  6711
## 417    and  6694
## 108      a  5645
## 5606    of  5302
## 4093    in  4054
## 4012     i  3378
## 8123  that  2711
## 4272    is  2612
## 3244   for  2402
par(mfrow=c(1,1))
barplot(gramOne$Freq[1:10], names.arg = as.character(gramOne$gramN[1:10]))

2. What are the frequencies of 2-grams and 3-grams in the dataset?

gramTwo   <- my_ngram(t_all, 2)
gramThree <- my_ngram(t_all, 3)


gramTwo[1:10,]
##          gramN Freq
## 17944   of the  584
## 12862   in the  530
## 27164   to the  286
## 18304   on the  242
## 9546   for the  232
## 3365    at the  198
## 29749 with the  183
## 2495   and the  182
## 26732    to be  174
## 13690   it was  146
barplot(gramTwo$Freq[1:10], names.arg = as.character(gramTwo$gramN[1:10]))

gramThree[1:10,]
##                   gramN Freq
## 22538        one of the   20
## 607            a lot of   18
## 4040         as well as   10
## 12131       going to be    9
## 15578     in the middle    9
## 29731        the end of    9
## 16646           it is a    8
## 16791          it was a    8
## 30981 the united states    8
## 9590         end of the    7
barplot(gramThree$Freq[1:10], names.arg = as.character(gramThree$gramN[1:10]))
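
ggplot2 is loaded above but not otherwise used; as an optional alternative, the same top-10 bigram frequencies could be plotted with it (a sketch, not one of the original figures):

# Optional ggplot2 version of the top-10 bigram bar plot
top2 <- gramTwo[1:10, ]
ggplot(top2, aes(x = reorder(gramN, Freq), y = Freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "2-gram", y = "Frequency", title = "Top 10 bigrams")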

3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

coverage <- function(x, freq){
  # x:    a frequency table (as returned by my_ngram), sorted by decreasing Freq
  # freq: the target coverage, e.g. 0.5 for 50%
  tmp <- 0
  max <- trunc(sum(x$Freq) * freq)
  # accumulate frequencies from the top of the table until the target is
  # reached, then return how many entries were needed
  for(i in 1:nrow(x)){
    tmp <- x$Freq[i] + tmp
    if(tmp > max){
      return(i)
    }
  }
}

coverage(gramOne, 0.5)
## [1] 137
coverage(gramOne, 0.9)
## [1] 4795
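
Beyond a single cutoff, the full cumulative-coverage curve can be drawn from the same frequency table; this sketch reuses gramOne:

# Cumulative coverage: share of all word instances covered by the top-k words
cum_cov <- cumsum(gramOne$Freq) / sum(gramOne$Freq)
plot(cum_cov, type = "l",
     xlab = "Number of unique words (frequency-sorted)",
     ylab = "Coverage of word instances")
abline(h = c(0.5, 0.9), lty = 2)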

4. How do you evaluate how many of the words come from foreign languages?

  1. If a word contains accented characters (á, ã, ü, ç) or ideograms (男 or 男人), it is most likely not an English word; a rough check along these lines is sketched after this list.

  2. It is also possible to check each word against a digital English dictionary (a word list) and flag words that are not found.
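
A rough check in the spirit of point 1 can be sketched directly on the sampled text. It only flags tokens that contain non-ASCII characters and is not a full language detector; tokens and non_ascii are hypothetical names:

# Flag tokens with non-ASCII characters (accents, ideograms) as possibly foreign
tokens    <- unlist(strsplit(t_all, "\\s+"))
non_ascii <- tokens[grepl("[^[:ascii:]]", tokens, perl = TRUE)]
length(non_ascii)        # how many such tokens survive the cleaning steps
head(unique(non_ascii))  # a few examples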

5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

  1. Remove rare (low-frequency) words from the dictionary.

  2. Use a thesaurus to map synonymous words onto a single representative word.

  3. Cluster similar words, for example by stemming, so that one dictionary entry covers several surface forms (a sketch follows this list).
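
As an illustration of point 3, stemming collapses inflected forms so that fewer dictionary entries cover the same share of word instances. The sketch below assumes the SnowballC package (not loaded above) and reuses gramOne and coverage():

# Sketch: stem the unigrams and re-aggregate their frequencies (assumes SnowballC)
library(SnowballC)
stems   <- wordStem(as.character(gramOne$gramN), language = "english")
stemmed <- aggregate(Freq ~ stem,
                     data = data.frame(stem = stems, Freq = gramOne$Freq),
                     FUN  = sum)
stemmed <- stemmed[order(stemmed$Freq, decreasing = TRUE), ]
coverage(stemmed, 0.9)   # typically smaller than coverage(gramOne, 0.9)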