The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
Tasks to accomplish
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
Questions to consider
Some words are more frequent than others - what are the distributions of word frequencies?
What are the frequencies of 2-grams and 3-grams in the dataset?
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
How do you evaluate how many of the words come from foreign languages?
Can you think of a way to increase the coverage - identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
suppressPackageStartupMessages(library(tm))
suppressPackageStartupMessages(library(XML))
suppressPackageStartupMessages(library(wordcloud))
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(NLP))
suppressPackageStartupMessages(library(openNLP))
suppressPackageStartupMessages(library(RWeka))
suppressPackageStartupMessages(library(qdap))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(stringi))
suppressPackageStartupMessages(library(dplyr))
set.seed(2020-9-11) # note: the argument is arithmetic, not a date, so this is set.seed(2000)
Eng_twitter <- "final/en_US/en_US.twitter.txt"
Eng_blogs <- "final/en_US/en_US.blogs.txt"
Eng_news <- "final/en_US/en_US.news.txt"
Eng_twitter_sample <- "final/en_US/en_US.twitter_sample.txt"
Eng_blogs_sample <- "final/en_US/en_US.blogs_sample.txt"
Eng_news_sample <- "final/en_US/en_US.news_sample.txt"
# Copy the first `numberLines` lines of inFile to outFile (not a random sample).
createSampleFile <- function(inFile, outFile, numberLines = 1000) {
  incon <- file(inFile, "r")
  outcon <- file(outFile, "w")
  for (i in seq_len(numberLines)) {
    line <- readLines(incon, 1)
    writeLines(line, outcon)
  }
  close(incon)
  close(outcon)
}
createSampleFile(Eng_twitter, Eng_twitter_sample, 1000)
createSampleFile(Eng_blogs, Eng_blogs_sample, 1000)
createSampleFile(Eng_news, Eng_news_sample, 1000)
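The same "first N lines" sample can be written without an explicit loop. The sketch below is only an illustration: `write_head_sample` and the temporary demo files are hypothetical names that are not part of the analysis, and, like `createSampleFile`, it takes the head of the file rather than a random sample.

```r
# Hypothetical helper: copy the first `n_lines` lines of `in_file` to `out_file`.
write_head_sample <- function(in_file, out_file, n_lines = 1000) {
  # readLines() stops after n lines, so the whole file is never loaded
  lines <- readLines(in_file, n = n_lines, warn = FALSE, skipNul = TRUE)
  writeLines(lines, out_file)
  invisible(length(lines))
}

# Small demonstration on temporary files (assumed data, not the real corpora)
demo_in <- tempfile(fileext = ".txt")
demo_out <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:5), demo_in)
n_written <- write_head_sample(demo_in, demo_out, n_lines = 3)
```

Because the files are passed by path, `readLines` and `writeLines` open and close their own connections, so nothing has to be closed by hand.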
useSample<-FALSE
if(useSample){
con_Eng_twitter <- file(Eng_twitter_sample, "r")
con_Eng_blogs <- file(Eng_blogs_sample, "r")
con_Eng_news <- file(Eng_news_sample, "r")
}else{
con_Eng_twitter <- file(Eng_twitter, "r")
con_Eng_blogs <- file(Eng_blogs, "r")
con_Eng_news <- file(Eng_news, "r")
}
con_Eng_twitter_file <- readLines(con_Eng_twitter)
## Warning in readLines(con_Eng_twitter): line 167155 appears to contain an
## embedded nul
## Warning in readLines(con_Eng_twitter): line 268547 appears to contain an
## embedded nul
## Warning in readLines(con_Eng_twitter): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(con_Eng_twitter): line 1759032 appears to contain an
## embedded nul
con_Eng_blogs_file <- readLines(con_Eng_blogs)
con_Eng_news_file <- readLines(con_Eng_news)
## Warning in readLines(con_Eng_news): incomplete final line found on 'final/en_US/
## en_US.news.txt'
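The warnings above suggest the raw files contain embedded nul bytes and that the news file may not have been read cleanly. A possible way to make the read more robust is sketched below. It assumes (this is not verified here) that opening the connection in binary mode and skipping nuls is appropriate for these files; `read_text_file` and the demo file are hypothetical.

```r
# Hypothetical helper: read all lines from a text file, dropping embedded nuls.
read_text_file <- function(path) {
  con <- file(path, open = "rb")   # binary mode: control characters do not end the read
  on.exit(close(con))              # always release the connection
  readLines(con, warn = FALSE, skipNul = TRUE, encoding = "UTF-8")
}

# Demonstration on a temporary file with an embedded nul and no final newline
demo_path <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("ab"), as.raw(0), charToRaw("cd\nef")), demo_path)
demo_lines <- read_text_file(demo_path)
```

If the line count for a file changes noticeably when it is read this way, the summary table should be recomputed from the new read.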
# function to count characters without spaces
char <- function(x){stri_length(x) - stri_count_fixed(x," ")}
filesummary<-data.frame(Source=c("Twitter", "Blogs", "News"),
FileSize_MB=c(
format(structure(
object.size(con_Eng_twitter_file),
class="object_size"),
units="auto"),
format(structure(
object.size(con_Eng_blogs_file),
class="object_size"),
units="auto"),
format(structure(
object.size(con_Eng_news_file),
class="object_size"),
units="auto")),
Lines=c(length(con_Eng_twitter_file),
length(con_Eng_blogs_file),
length(con_Eng_news_file)),
Words=c(sum(stri_count_words(con_Eng_twitter_file)),
sum(stri_count_words(con_Eng_blogs_file)),
sum(stri_count_words(con_Eng_news_file))),
Characters=c(sum(char(con_Eng_twitter_file)),
sum(char(con_Eng_blogs_file)),
sum(char(con_Eng_news_file))))
filesummary<-mutate(filesummary,
Words_Per_Line=Words/Lines,
Char_Per_Line=round(Characters/Lines,1),
Char_Per_Word=round(Characters/Words,2))
print(filesummary)
## Source FileSize_MB Lines Words Characters Words_Per_Line Char_Per_Line
## 1 Twitter 319 Mb 2360148 30218125 134371428 12.80349 56.9
## 2 Blogs 255.4 Mb 899288 38154238 171926595 42.42716 191.2
## 3 News 19.8 Mb 77259 2693898 13117055 34.86840 169.8
## Char_Per_Word
## 1 4.45
## 2 4.51
## 3 4.87
Samples for testing
con_Eng_twitter_file_sample <- sample(con_Eng_twitter_file,1000)
con_Eng_blogs_file_sample <- sample(con_Eng_blogs_file,1000)
con_Eng_news_file_sample <- sample(con_Eng_news_file,1000)
sample <- c(con_Eng_twitter_file_sample,
con_Eng_blogs_file_sample,
con_Eng_news_file_sample)
txt <- sent_detect(sample)
remove(con_Eng_twitter_file_sample,
con_Eng_blogs_file_sample,
con_Eng_news_file_sample,
con_Eng_twitter_file,
con_Eng_blogs_file,
con_Eng_news_file,
sample)
Removing everything we do not need
txt <- removeNumbers(txt)
txt <- removePunctuation(txt)
txt <- stripWhitespace(txt)
txt <- tolower(txt)
txt <- txt[which(txt!="")]
txt <- data.frame(txt,stringsAsFactors = FALSE)
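To make the effect of these cleaning steps concrete, here is a rough base R approximation applied to a single made-up sentence. It is a sketch only: the regular expressions stand in for `removeNumbers`, `removePunctuation`, and `stripWhitespace` and may not match the `tm` functions in every edge case, and `clean_text` is a hypothetical name.

```r
# Hypothetical base R approximation of the cleaning pipeline
clean_text <- function(x) {
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  x <- tolower(x)
  x <- trimws(x)
  x[x != ""]                         # discard strings that became empty
}

cleaned <- clean_text(c("It's 2020,  Hello  World!", "12345", "  "))
print(cleaned)
```

Note that removing punctuation also merges contractions ("It's" becomes "its"), which affects the word counts that follow.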
Making ordered data frames of 1-grams, 2-grams, 3-grams
words<-WordTokenizer(txt)
grams<-NGramTokenizer(txt)
for(i in 1:length(grams)){
if(length(WordTokenizer(grams[i]))==2){
break
}
}
for(j in 1:length(grams)){
if(length(WordTokenizer(grams[j]))==1){
break
}
}
onegrams <- data.frame(table(words))
onegrams <- onegrams[order(onegrams$Freq, decreasing = TRUE),]
bigrams <- data.frame(table(grams[i:(j-1)]))
bigrams <- bigrams[order(bigrams$Freq, decreasing = TRUE),]
trigrams <- data.frame(table(grams[1:(i-1)]))
trigrams <- trigrams[order(trigrams$Freq, decreasing = TRUE),]
remove(i,j,grams)
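The loops above rely on the order in which the tokenizer returns its n-grams. As a cross-check, n-grams of a fixed size can also be built explicitly. The following is a minimal base R sketch on a toy sentence. `make_ngrams` is a hypothetical helper and not the method used in this report; in practice it would be applied sentence by sentence so that n-grams do not cross sentence boundaries.

```r
# Hypothetical helper: build all n-grams of size n from a vector of tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(s) paste(tokens[s:(s + n - 1)], collapse = " "),
         character(1))
}

# Toy example (assumed text, not from the corpora)
tokens <- strsplit("the cat sat on the mat the cat ran", " ", fixed = TRUE)[[1]]
bigram_counts <- sort(table(make_ngrams(tokens, 2)), decreasing = TRUE)
trigram_list <- make_ngrams(tokens, 3)
print(head(bigram_counts, 3))
```

The resulting table can be turned into an ordered data frame in the same way as `bigrams` and `trigrams`.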
Word cloud from Words
wordcloud(words, scale=c(5,0.1), max.words=100, random.order=FALSE,
rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
wordcloud(onegrams$words, onegrams$Freq, scale=c(5,0.5), max.words=300, random.order=FALSE,
rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
The first graph shows the distribution of words in the sample, excluding stop words such as “the”, “a”, “of”, and “to”. The second graph shows the distribution of all single words. Word frequencies range from 1 to 3796.
What are the frequencies of 2-grams and 3-grams in the dataset?
barplot(bigrams[1:20,2],col="lightblue",
names.arg = bigrams$Var1[1:20],srt = 45,
space=0.1, xlim=c(0,20),las=2)
barplot(trigrams[1:20,2],col="lightblue",
names.arg = trigrams$Var1[1:20],srt = 45,
space=0.1, xlim=c(0,20),las=2)
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
sumCover <- 0
for(i in 1:length(onegrams$Freq)) {
sumCover <- sumCover + onegrams$Freq[i]
if(sumCover >= 0.5*sum(onegrams$Freq)){break}
}
print(i)
## [1] 148
sumCover <- 0
for(i in 1:length(onegrams$Freq)) {
sumCover <- sumCover + onegrams$Freq[i]
if(sumCover >= 0.9*sum(onegrams$Freq)){break}
}
print(i)
## [1] 5618
Thus, 148 words are needed to cover 50% of all word instances in this sample (used here as a proxy for the language), and 5618 words are needed to cover 90%.
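The same coverage question can be answered without a loop by taking the cumulative sum of the sorted frequencies. The sketch below uses made-up frequencies. `words_for_coverage` is a hypothetical helper, and it should give the same answers as the loops when applied to `onegrams$Freq`.

```r
# Hypothetical helper: smallest number of top-frequency words whose
# combined count reaches `target` (a share between 0 and 1) of all instances
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)
  which(coverage >= target)[1]
}

# Toy frequencies (assumed values): cumulative shares are 0.40, 0.70, 0.85, 0.95, 1.00
demo_freq <- c(40, 30, 15, 10, 5)
n50 <- words_for_coverage(demo_freq, 0.5)
n90 <- words_for_coverage(demo_freq, 0.9)
```

The vector `coverage` is also convenient for plotting the full coverage curve rather than reading off only two points.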
How do you evaluate how many of the words come from foreign languages?
It seems to me that the best way is to compare the text against a well-known dictionary. The same approach can be used to remove “rude” words. Nevertheless, such words are few and their impact is small, so we do not need to take them into account.
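A dictionary comparison of this kind could look roughly like the sketch below. The word list here is a tiny made-up vector; a real check would load a full English dictionary, and words missing from it would include typos, slang, and names as well as foreign words, so the share is only an upper bound.

```r
# Hypothetical reference word list; a real one would be loaded from a dictionary file
dictionary <- c("the", "cat", "sat", "on", "mat")

# Toy tokens (assumed data): two of them are not English
tokens <- c("the", "cat", "le", "chat", "sat")

in_dictionary <- tokens %in% dictionary
unknown_words <- unique(tokens[!in_dictionary])  # candidates for foreign words
unknown_share <- mean(!in_dictionary)            # share of word instances not found
print(unknown_words)
```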
Can you think of a way to increase the coverage - identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
Prediction based on location (traditions, holidays, names, places, etc.).
Learning the writing style of the author.
Using an additional dictionary with n-grams: first remove low-frequency words from the dictionary, then use the remaining words for better prediction of n-grams.
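One possible reading of the last idea is to map every low-frequency word to a single placeholder token before counting n-grams, so that a smaller dictionary still covers every phrase. The sketch below shows this under that assumption; `replace_rare_words`, the `<unk>` placeholder, and the threshold of 2 are illustrative choices, not part of the analysis above.

```r
# Hypothetical helper: replace words seen fewer than `min_count` times
# with a single placeholder token
replace_rare_words <- function(tokens, min_count = 2, unknown = "<unk>") {
  counts <- table(tokens)
  rare <- names(counts)[counts < min_count]
  tokens[tokens %in% rare] <- unknown
  tokens
}

# Toy tokens (assumed data): "c" occurs once and is replaced
tokens <- c("a", "b", "a", "c", "a", "b")
reduced <- replace_rare_words(tokens, min_count = 2)
print(reduced)
```

The dictionary size then becomes a tuning choice: a higher threshold gives a smaller dictionary at the cost of more placeholder tokens in the n-grams.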