Synopsis

The purpose of this notebook is to perform an exploratory data analysis of the text corpora (here, a sample of the US Twitter dataset). I will follow the questions suggested by the course authors.

Task 2 - Exploratory Data Analysis

Tasks to accomplish

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Some words are more frequent than others - what are the distributions of word frequencies?

The following code reads in a sample from the US Twitter dataset, cleans it and inspects the resulting Document-Term Matrix.

require(tm)
require(dplyr)
require(SnowballC)
# read in a sample of 2000 tweets
con <- file("final/en_US/en_US.twitter.txt")
ust <- readLines(con, 2000)
close(con)
# build a corpus and remove English stopwords (before lowercasing)
ust <- VCorpus(VectorSource(ust))
ust <- tm_map(ust, removeWords, stopwords("english"))
# build the document-term matrix; lowercasing, number and punctuation
# removal and stemming are applied via the control options
dtm <- DocumentTermMatrix(ust, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE
))
dtm
## <<DocumentTermMatrix (documents: 2000, terms: 5292)>>
## Non-/sparse entries: 13906/10570094
## Sparsity           : 100%
## Maximal term length: 30
## Weighting          : term frequency (tf)

As you can see, in the first 2000 lines of the corpus there are 5292 terms and the sparsity is 100%. Let's summarize this sample by creating:

  • a frequency table,
  • a word cloud as a visualization of the sample, and
  • a bar plot of the 15 most frequent words.

require(RColorBrewer)
require(wordcloud)
require(ggplot2)
# word frequency table: one row per term, with its total count in the sample
dtm_m <- as.matrix(dtm)
fr_tb <- cbind(
  colnames(dtm_m),
  colSums(dtm_m)
) %>% as.data.frame()
rownames(fr_tb) <- NULL
colnames(fr_tb) <- c("word", "freq")
fr_tb$freq <- as.numeric(fr_tb$freq)
fr_tb <- fr_tb[order(fr_tb$freq, decreasing = TRUE), ]
head(fr_tb, 15)
# word cloud of up to 200 terms, sized by frequency
wordcloud::wordcloud(
  words = fr_tb$word,
  freq = fr_tb$freq,
  colors = brewer.pal(8, "Dark2"),
  min.freq = 1,
  max.words = 200,
  random.order = FALSE)

# bar plot of the 15 most frequent words, keeping the frequency order
fr_tb$word <- factor(fr_tb$word, levels = fr_tb$word)
ggplot(data = head(fr_tb, 15)) + geom_col(aes(x = word, y = freq))

It is worth noting that the cleaning was not done fully correctly: the stopword "the" still occurs in the results, because removeWords() was applied before the text was lowercased, so capitalized occurrences such as "The" were not matched.
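
A quick illustration of this on a toy string (assuming the default English stopword list): removeWords() matches stopwords case-sensitively, so capitalized occurrences survive when stopword removal is applied before lowercasing.

library(tm)
# lowercase "the" and "on" are removed, but the capitalized "The" survives
removeWords("The cat sat on the mat", stopwords("english"))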

What are the frequencies of 2-grams and 3-grams in the dataset?

In order to create the 2-gram and 3-gram frequency data frames, I will use the RWeka package.

require(RWeka)
require(tm)

con <- file("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
ust <- readLines(con, 2000)
close(con)
# build a VCorpus directly so that the custom RWeka tokenizers are honoured
ust <- VCorpus(VectorSource(ust))
ust <- tm_map(ust, removeWords, stopwords("english"))
ust <- tm_map(ust, removeNumbers)
ust <- tm_map(ust, stemDocument, language = "english")
ust <- tm_map(ust, removePunctuation)
ust <- tm_map(ust, stripWhitespace)
# 2-gram and 3-gram tokenizers based on RWeka
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm_2gram <- DocumentTermMatrix(ust, control = list(tokenize = BigramTokenizer))
dtm_3gram <- DocumentTermMatrix(ust, control = list(tokenize = TrigramTokenizer))
# inspect(dtm_2gram[1:20, 1:20])
# inspect(dtm_3gram[1:20, 1:20])
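
As a quick sanity check, the tokenizers can be tried on a toy sentence (the sentence is just an illustration, not part of the corpus):

# split a toy sentence into overlapping 2-grams
NGramTokenizer("this is just a short example sentence",
               Weka_control(min = 2, max = 2))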

The tables and plots below show the 15 most frequent 2-grams and 3-grams:

require(ggplot2)
require(ggpubr)
theme_set(theme_pubr())
# 2-gram frequency table
dtm_m_2gram <- as.matrix(dtm_2gram)
fr_tb_2gram <- cbind(
  colnames(dtm_m_2gram),
  colSums(dtm_m_2gram)
) %>% as.data.frame()
rownames(fr_tb_2gram) <- NULL
colnames(fr_tb_2gram) <- c("two_gram", "freq")
fr_tb_2gram$freq <- as.numeric(fr_tb_2gram$freq)
fr_tb_2gram <- fr_tb_2gram[order(fr_tb_2gram$freq, decreasing = TRUE), ]

# 3-gram frequency table
dtm_m_3gram <- as.matrix(dtm_3gram)
fr_tb_3gram <- cbind(
  colnames(dtm_m_3gram),
  colSums(dtm_m_3gram)
) %>% as.data.frame()
rownames(fr_tb_3gram) <- NULL
colnames(fr_tb_3gram) <- c("three_gram", "freq")
fr_tb_3gram$freq <- as.numeric(fr_tb_3gram$freq)
fr_tb_3gram <- fr_tb_3gram[order(fr_tb_3gram$freq, decreasing = TRUE), ]
head(fr_tb_2gram, 15)
head(fr_tb_3gram, 15)

# side-by-side bar plots of the 15 most frequent 2-grams and 3-grams
tw_gr <- ggplot(data = head(fr_tb_2gram, 15)) + geom_col(aes(x = freq, y = two_gram))
th_gr <- ggplot(data = head(fr_tb_3gram, 15)) + geom_col(aes(x = freq, y = three_gram))
ggarrange(tw_gr, th_gr)

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

To answer this question, we compare the cumulative frequency of the most frequent unique words with the total number of word instances in the corpus; this fraction should reach 0.5 and 0.9. To find the numerator of this fraction, we use the 1-gram frequency table ordered by decreasing frequency.

require(dplyr)
require(tm)
con <- file("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
ust <- readLines(con, 2000)
close(con)
# same cleaning pipeline as for the n-grams above
ust <- VCorpus(VectorSource(ust))
ust <- tm_map(ust, removeWords, stopwords("english"))
ust <- tm_map(ust, removeNumbers)
ust <- tm_map(ust, stemDocument, language = "english")
ust <- tm_map(ust, removePunctuation)
ust <- tm_map(ust, stripWhitespace)
dtm <- DocumentTermMatrix(ust)
dtm_m <- as.matrix(dtm)
# 1-gram frequency table ordered by decreasing frequency
fr_tb <- cbind(
  colnames(dtm_m),
  colSums(dtm_m)
) %>% as.data.frame()
rownames(fr_tb) <- NULL
colnames(fr_tb) <- c("word", "freq")
fr_tb$freq <- as.numeric(fr_tb$freq)
fr_tb <- fr_tb[order(fr_tb$freq, decreasing = TRUE), ]

# cumulative share of all word instances covered by the most frequent words
cs <- cumsum(fr_tb$freq) / sum(fr_tb$freq)
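
The two cut-offs reported below can then be read off the cumulative share cs; a minimal sketch (the exact counts depend on the sample and on the cleaning steps):

# smallest number of top-frequency words whose cumulative share
# reaches 50% and 90% of all word instances in the sample
n_50 <- min(which(cs >= 0.5))
n_90 <- min(which(cs >= 0.9))
c(n_50, n_90)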

So, to cover 50% of all word instances in this sample we need 369 words; to cover 90%, we need 3407 words.

How do you evaluate how many of the words come from foreign languages?

One of the easiest approaches is to compare the corpus with an English dictionary. I will use the word list provided by the dwyl/english-words project (words_alpha.txt): words without numbers or symbols.

library(dplyr)
# download the English word list if it is not already present
url2 <- "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
if(!file.exists("words_alpha.txt")) {download.file(url2, "words_alpha.txt")}
ust_words <- as.character(fr_tb$word)
eng_dict <- readLines("words_alpha.txt")
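
For reference, a short sketch of how the count reported below can be obtained (the exact number depends on the sample and on the cleaning steps):

# sample words that do not appear in the English word list
oov_words <- setdiff(ust_words, eng_dict)
length(oov_words)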

Comparing these two sets of words reveals that there are 1632 words not found in the English dictionary. This seems like quite a high number, so it is worth examining what these words are (first 15):

library(dplyr)
setdiff(ust_words, eng_dict) %>% head(15)
##  [1] "lol"     "realli"  "happi"   "peopl"   "everi"   "anyon"   "someth" 
##  [8] "favorit" "littl"   "noth"    "someon"  "amaz"    "everyon" "pretti" 
## [15] "whi"

The corpus was stemmed previously, which is one reason why so many words are not found in the English dictionary. Another reason for the large number of "non-English" words is that the corpus contains misspellings, abbreviations and onomatopoeias, as well as a lot of colloquial language. Thus an enhanced English word list would be needed to estimate how many words really come from foreign languages. An alternative approach would be to check the corpus against dictionaries of other popular languages such as Spanish, Italian, German and French.
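
A rough sketch of that alternative, using tm's built-in stopword lists as a very small stand-in for full dictionaries (so the counts would only be indicative):

# count sample words that appear in the stopword lists of a few other languages;
# full dictionaries would be needed for a reliable estimate
sapply(c("spanish", "italian", "german", "french"),
       function(lang) length(intersect(ust_words, stopwords(lang))))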

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

There are basically two ways to add new words to the dictionary without having them in the corpus in the first place:

  • use of a thesaurus,
  • the stemming technique.

While the definition of a thesaurus (a book that lists words in groups of synonyms and related concepts) is quite clear, the stemming technique may require some explanation. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root (Wikipedia).
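
A short illustration with the SnowballC stemmer already used above (the toy word list is just an example):

library(SnowballC)
# related inflected/derived forms are reduced to a common stem, which need not
# be a valid English word: "happy" and "happiness" both map to "happi"
wordStem(c("happy", "happiness", "really", "people"), language = "english")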