Exploratory Data Analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.

Tasks to accomplish

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.

Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider

Some words are more frequent than others - what are the distributions of word frequencies?

What are the frequencies of 2-grams and 3-grams in the dataset?

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

How do you evaluate how many of the words come from foreign languages?

Can you think of a way to increase the coverage - identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

suppressPackageStartupMessages(library(tm))
suppressPackageStartupMessages(library(XML))
suppressPackageStartupMessages(library(wordcloud))
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(NLP))
suppressPackageStartupMessages(library(openNLP))
suppressPackageStartupMessages(library(RWeka))
suppressPackageStartupMessages(library(qdap))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(stringi))
suppressPackageStartupMessages(library(dplyr))

set.seed(2020-9-11)  # note: R evaluates 2020-9-11 arithmetically, so the seed is 2000
Eng_twitter <- "final/en_US/en_US.twitter.txt" 
Eng_blogs <- "final/en_US/en_US.blogs.txt" 
Eng_news <- "final/en_US/en_US.news.txt" 

Eng_twitter_sample <- "final/en_US/en_US.twitter_sample.txt" 
Eng_blogs_sample <- "final/en_US/en_US.blogs_sample.txt" 
Eng_news_sample <- "final/en_US/en_US.news_sample.txt" 

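# Copy the first numberLines lines of inFile to outFile
# (this takes the head of the file, not a random sample)
createSampleFile<-function(inFile, outFile, numberLines=1000){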
  incon <- file(inFile, "r") 
  outcon <- file(outFile, "w") 

  for(i in seq(numberLines)){
    line<-readLines(incon, 1)
  
    writeLines(line, outcon)
  }

  close(incon)
  close(outcon)
  
}


createSampleFile(Eng_twitter, Eng_twitter_sample, 1000)
createSampleFile(Eng_blogs, Eng_blogs_sample, 1000)
createSampleFile(Eng_news, Eng_news_sample, 1000)
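
Since createSampleFile copies the first lines of each file, the sample files reflect only the start of each source. A minimal sketch of a random alternative follows. The helper name createRandomSampleFile and the temporary demo files are assumptions for illustration, not part of the original analysis, and the helper reads the whole input into memory, which may be slow for the full corpora.

```r
# Hypothetical helper (not in the original analysis): write a random
# sample of lines from inFile to outFile.
createRandomSampleFile <- function(inFile, outFile, numberLines = 1000) {
  # skipNul = TRUE drops embedded nul bytes; warn = FALSE silences the
  # "incomplete final line" warning
  allLines <- readLines(inFile, skipNul = TRUE, warn = FALSE)
  n <- min(numberLines, length(allLines))
  picked <- allLines[sample.int(length(allLines), n)]
  writeLines(picked, outFile)
  invisible(n)
}

# Toy usage on temporary files, so the sketch runs without the course data
demoIn <- tempfile(fileext = ".txt")
demoOut <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:50), demoIn)
set.seed(1)
createRandomSampleFile(demoIn, demoOut, numberLines = 10)
sampledLines <- readLines(demoOut)
print(length(sampledLines))
```

With the real data, the same call would take the paths defined above, for example createRandomSampleFile(Eng_news, Eng_news_sample, 1000).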

useSample<-FALSE

if(useSample){
  con_Eng_twitter <- file(Eng_twitter_sample, "r") 
  con_Eng_blogs <- file(Eng_blogs_sample, "r") 
  con_Eng_news <- file(Eng_news_sample, "r") 
}else{
  con_Eng_twitter <- file(Eng_twitter, "r") 
  con_Eng_blogs <- file(Eng_blogs, "r") 
  con_Eng_news <- file(Eng_news, "r") 
}

con_Eng_twitter_file <- readLines(con_Eng_twitter)
## Warning in readLines(con_Eng_twitter): line 167155 appears to contain an
## embedded nul
## Warning in readLines(con_Eng_twitter): line 268547 appears to contain an
## embedded nul
## Warning in readLines(con_Eng_twitter): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(con_Eng_twitter): line 1759032 appears to contain an
## embedded nul
con_Eng_blogs_file <- readLines(con_Eng_blogs)
con_Eng_news_file <- readLines(con_Eng_news)
## Warning in readLines(con_Eng_news): incomplete final line found on 'final/en_US/
## en_US.news.txt'
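
The warnings above point to stray nul bytes in the Twitter file and a missing final newline in the news file. As a sketch of one way to read such files more quietly, readLines accepts skipNul and warn arguments. The demo file below is made up for illustration, and whether dropping nul bytes is appropriate for the real data is an assumption.

```r
# Build a small file with an embedded nul byte and no final newline
demoFile <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("ab"), as.raw(0), charToRaw("c\nsecond line")), demoFile)

# skipNul = TRUE skips nul bytes instead of truncating the line;
# warn = FALSE suppresses the incomplete-final-line warning
demoLines <- readLines(demoFile, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
print(demoLines)
```
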
# Count characters excluding spaces
char <- function(x){stri_length(x) - stri_count_fixed(x," ")} 

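# Note: FileSize_MB reports the size of each character vector in memory
# (object.size), not the size of the file on disk
filesummary<-data.frame(Source=c("Twitter", "Blogs", "News"), 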
                        FileSize_MB=c(
                          format(structure(
                            object.size(con_Eng_twitter_file), 
                            class="object_size"),
                            units="auto"),
                          format(structure(
                            object.size(con_Eng_blogs_file),
                            class="object_size"),
                            units="auto"),
                          format(structure(
                            object.size(con_Eng_news_file),
                            class="object_size"),
                            units="auto")),
                        Lines=c(length(con_Eng_twitter_file),
                                length(con_Eng_blogs_file),
                                length(con_Eng_news_file)),
                        Words=c(sum(stri_count_words(con_Eng_twitter_file)),
                                sum(stri_count_words(con_Eng_blogs_file)), 
                                sum(stri_count_words(con_Eng_news_file))),
                        Characters=c(sum(char(con_Eng_twitter_file)),
                                     sum(char(con_Eng_blogs_file)),
                                     sum(char(con_Eng_news_file))))

filesummary<-mutate(filesummary, 
                    Words_Per_Line=Words/Lines,
                    Char_Per_Line=round(Characters/Lines,1),
                    Char_Per_Word=round(Characters/Words,2))

print(filesummary)
##    Source FileSize_MB   Lines    Words Characters Words_Per_Line Char_Per_Line
## 1 Twitter      319 Mb 2360148 30218125  134371428       12.80349          56.9
## 2   Blogs    255.4 Mb  899288 38154238  171926595       42.42716         191.2
## 3    News     19.8 Mb   77259  2693898   13117055       34.86840         169.8
##   Char_Per_Word
## 1          4.45
## 2          4.51
## 3          4.87

Samples for testing

con_Eng_twitter_file_sample <- sample(con_Eng_twitter_file,1000)
con_Eng_blogs_file_sample <- sample(con_Eng_blogs_file,1000)
con_Eng_news_file_sample <- sample(con_Eng_news_file,1000)
sample <- c(con_Eng_twitter_file_sample, 
            con_Eng_blogs_file_sample, 
            con_Eng_news_file_sample)
txt <- sent_detect(sample)  # qdap: split the sampled lines into sentences
remove(con_Eng_twitter_file_sample, 
       con_Eng_blogs_file_sample, 
       con_Eng_news_file_sample,
       con_Eng_twitter_file, 
       con_Eng_blogs_file, 
       con_Eng_news_file,
       sample)

Cleaning the text: removing numbers, punctuation, extra whitespace, and empty strings

txt <- removeNumbers(txt)
txt <- removePunctuation(txt)
txt <- stripWhitespace(txt)
txt <- tolower(txt)
txt <- txt[which(txt!="")]
txt <- data.frame(txt,stringsAsFactors = FALSE)
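
To make the effect of these cleaning steps concrete, here is a rough base R sketch of the same sequence applied to a toy vector. The function cleanText and the demo strings are illustrative assumptions. It approximates the tm functions with regular expressions and additionally trims leading and trailing spaces, so its output may differ from the pipeline above in edge cases.

```r
# Hypothetical base R approximation of the cleaning steps above
cleanText <- function(x) {
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  x <- tolower(trimws(x))            # trim ends and lowercase
  x[x != ""]                         # drop strings that became empty
}

demoText <- c("Hello, World! 42 times.", "   ", "It's 2020...")
cleaned <- cleanText(demoText)
print(cleaned)
```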

Making ordered data frames of 1-grams, 2-grams, 3-grams

words<-WordTokenizer(txt) 
grams<-NGramTokenizer(txt)

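# NGramTokenizer() with default settings returns 1-, 2- and 3-grams in one vector.
# The loops below assume the 3-grams come first, followed by the 2-grams and then
# the 1-grams: i becomes the index of the first 2-gram, j the index of the first 1-gram.
for(i in 1:length(grams)){
  if(length(WordTokenizer(grams[i]))==2){
    break
    }
  }
for(j in 1:length(grams)){
  if(length(WordTokenizer(grams[j]))==1){
    break
    }
  }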


onegrams <- data.frame(table(words))
onegrams <- onegrams[order(onegrams$Freq, decreasing = TRUE),]
bigrams <- data.frame(table(grams[i:(j-1)]))
bigrams <- bigrams[order(bigrams$Freq, decreasing = TRUE),]
trigrams <- data.frame(table(grams[1:(i-1)]))
trigrams <- trigrams[order(trigrams$Freq, decreasing = TRUE),]
remove(i,j,grams)
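
For readers who want to see how such frequency tables arise, the sketch below builds n-gram counts in base R on two toy sentences. The helpers makeNgrams and ngramTable are hypothetical and are not the RWeka method used above. They split on single spaces and never form n-grams across sentence boundaries, which is an assumption of this sketch.

```r
# Hypothetical helper: all n-grams of one tokenised sentence
makeNgrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(s) paste(tokens[s:(s + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: frequency-sorted n-gram table for cleaned sentences
ngramTable <- function(sentences, n) {
  tokenList <- strsplit(sentences, " ", fixed = TRUE)
  grams <- unlist(lapply(tokenList, makeNgrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), Freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

demoSentences <- c("the cat sat on the mat", "the cat ran")
demoBigrams <- ngramTable(demoSentences, 2)
demoTrigrams <- ngramTable(demoSentences, 3)
print(head(demoBigrams, 3))
```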

Word cloud from Words

wordcloud(words, scale=c(5,0.1), max.words=100, random.order=FALSE, 
          rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

wordcloud(onegrams$words, onegrams$Freq, scale=c(5,0.5), max.words=300, random.order=FALSE, 
          rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))

The first word cloud shows the most frequent words in the sample, excluding common stop words such as “the”, “a”, “of”, and “to”. The second word cloud shows all single words, stop words included. Word frequencies range from 3796 down to 1.

What are the frequencies of 2-grams and 3-grams in the dataset?

barplot(bigrams[1:20,2],col="lightblue",
        names.arg = bigrams$Var1[1:20],srt = 45,
        space=0.1, xlim=c(0,20),las=2)

barplot(trigrams[1:20,2],col="lightblue",
        names.arg = trigrams$Var1[1:20],srt = 45,
        space=0.1, xlim=c(0,20),las=2)

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

sumCover <- 0
for(i in 1:length(onegrams$Freq)) {
  sumCover <- sumCover + onegrams$Freq[i]
  if(sumCover >= 0.5*sum(onegrams$Freq)){break}
}
print(i)
## [1] 148
sumCover <- 0
for(i in 1:length(onegrams$Freq)) {
  sumCover <- sumCover + onegrams$Freq[i]
  if(sumCover >= 0.9*sum(onegrams$Freq)){break}
}
print(i)
## [1] 5618

So, in this sample, the 148 most frequent words cover 50% of all word instances, and the 5618 most frequent words cover 90%. These figures come from the 3000 sampled lines, so they only estimate the coverage for the language as a whole.
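
The same coverage count can be computed without an explicit loop by taking the cumulative sum of the sorted frequencies. The function wordsForCoverage and the toy frequencies below are assumptions for illustration.

```r
# Hypothetical helper: number of most frequent words needed to reach
# a target share of all word instances
wordsForCoverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)  # cumulative share of instances
  which(coverage >= target)[1]          # first rank reaching the target
}

# Toy frequencies: cumulative shares are 0.40, 0.65, 0.80, 0.92, ...
demoFreq <- c(40, 25, 15, 12, 5, 2, 1)
half <- wordsForCoverage(demoFreq, 0.5)
ninety <- wordsForCoverage(demoFreq, 0.9)
print(c(half, ninety))
```

Applied to the table built earlier, wordsForCoverage(onegrams$Freq, 0.5) and wordsForCoverage(onegrams$Freq, 0.9) should match the values printed by the loops.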

How do you evaluate how many of the words come from foreign languages?

One practical approach is to compare the tokens against a well-known English dictionary and treat the tokens it does not contain as candidates for foreign words. The same kind of lookup, run against a list of offensive terms, can be used to remove “rude” words. Such words appear to be rare in this data, so their impact should be small, and they are not handled separately here.
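
A minimal sketch of the dictionary comparison follows. The function outOfDictionaryShare and the tiny word list are hypothetical, and a real check would need a full English word list. Tokens missing from the dictionary are only candidates, since typos, names, and slang will also be flagged.

```r
# Hypothetical helper: share of tokens not found in a reference word list
outOfDictionaryShare <- function(tokens, dictionary) {
  tokens <- tolower(tokens)
  known <- tokens %in% tolower(dictionary)
  list(share = mean(!known),
       unknown = sort(unique(tokens[!known])))
}

dictDemo <- c("the", "cat", "sat", "on", "mat")  # made-up tiny dictionary
tokenDemo <- c("the", "cat", "sat", "on", "the", "tatami", "und", "mat")
dictResult <- outOfDictionaryShare(tokenDemo, dictDemo)
print(dictResult)
```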

Can you think of a way to increase the coverage - identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Three ideas:

Prediction based on location (local traditions, holidays, names, places, etc.).

Learning the writing style of the author.

Using an additional dictionary of n-grams: first remove low-frequency words from the dictionary, then use the remaining words for better prediction of n-grams.
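
The idea of removing low-frequency words from the dictionary can be sketched as follows. Mapping rare words to a placeholder token is an assumption of this sketch rather than something the analysis above does, and the names replaceRareWords, minCount, and "<unk>" are illustrative.

```r
# Hypothetical helper: replace words seen fewer than minCount times
# with a single placeholder, shrinking the dictionary
replaceRareWords <- function(tokens, minCount = 2, unknownToken = "<unk>") {
  counts <- table(tokens)
  rare <- names(counts)[counts < minCount]
  tokens[tokens %in% rare] <- unknownToken
  tokens
}

rareDemo <- c("the", "cat", "sat", "the", "dog", "sat", "the", "emu")
pruned <- replaceRareWords(rareDemo)
print(pruned)
print(length(unique(rareDemo)))  # dictionary size before pruning
print(length(unique(pruned)))    # dictionary size after pruning
```

N-grams built from the pruned tokens share the placeholder, so a smaller dictionary covers the same number of phrases at the cost of losing the rare words themselves.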