The purpose of this notebook is to perform an exploratory analysis of the provided text data. I will follow the questions suggested by the course authors.
Tasks to accomplish
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
The following code reads in a sample from the US Twitter dataset, cleans it, and inspects the resulting Document-Term Matrix.
require(tm)
require(dplyr)
require(SnowballC)
# load("final/ustwitter_clean.Rdata")
# read in a sample of 2000 lines
con <- file("final/en_US/en_US.twitter.txt")
ust <- readLines(con, 2000)
close(con)
ust <- VCorpus(VectorSource(ust))
ust <- tm_map(ust, removeWords, stopwords("english"))
# build the DTM; the remaining cleaning is handled by the control options
# (the valid option names are "stopwords" and "stemming")
dtm <- DocumentTermMatrix(ust, control = list(
  language = "english",
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = TRUE,
  stemming = TRUE,
  removePunctuation = TRUE
))
dtm
## <<DocumentTermMatrix (documents: 2000, terms: 5292)>>
## Non-/sparse entries: 13906/10570094
## Sparsity : 100%
## Maximal term length: 30
## Weighting : term frequency (tf)
As you can see, in the first 2000 lines of the corpus there are 5292 terms and the sparsity rounds to 100%. Let's summarize this sample with a frequency table, a word cloud and a bar chart of the most frequent words:
require(RColorBrewer)
require(wordcloud)
require(ggplot2)
dtm_m <- as.matrix(dtm)
# term frequency table: one row per term with its total count,
# sorted by decreasing frequency
fr_tb <- cbind(
  colnames(dtm_m),
  colSums(dtm_m)
) %>% as.data.frame()
rownames(fr_tb) <- NULL
colnames(fr_tb) <- c("word", "freq")
fr_tb$freq <- as.numeric(fr_tb$freq)
fr_tb <- fr_tb[order(fr_tb$freq, decreasing = TRUE), ]
head(fr_tb, 15)
wordcloud::wordcloud(
  words = fr_tb$word,
  freq = fr_tb$freq,
  colors = brewer.pal(8, "Dark2"),
  min.freq = 1,
  max.words = 200,
  random.order = FALSE)
# bar chart of the 15 most frequent terms, ordered by frequency rather than alphabetically
ggplot(data = head(fr_tb, 15)) +
  geom_col(aes(x = reorder(word, -freq), y = freq)) +
  xlab("word")
It is worth noting that the cleaning was not fully effective: the stopword "the" still occurs in the results, most likely because stopword removal was applied before the text was lower-cased, so capitalized occurrences such as "The" were not matched.
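A minimal sketch of one possible fix, using the same file and the tm functions already loaded (ust_fixed and dtm_fixed are names introduced here for illustration): lower-case the text with content_transformer(tolower) before calling removeWords, so that capitalized stopwords such as "The" are matched too.
# sketch: lower-case first, then remove stopwords, so "The" is caught as well
ust_fixed <- VCorpus(VectorSource(readLines("final/en_US/en_US.twitter.txt", 2000)))
ust_fixed <- tm_map(ust_fixed, content_transformer(tolower))      # "The" -> "the"
ust_fixed <- tm_map(ust_fixed, removeWords, stopwords("english")) # now matches all stopwords
dtm_fixed <- DocumentTermMatrix(ust_fixed, control = list(
  removeNumbers = TRUE,
  stemming = TRUE,
  removePunctuation = TRUE
))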
To create 2-gram and 3-gram frequency data frames, I will use the RWeka package.
require(RWeka)
require(tm)
con <- file("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
ust <- readLines(con, 2000)
close(con)
# build a VCorpus directly (the custom RWeka tokenizers below require it)
ust <- VCorpus(VectorSource(ust))
ust <- tm_map(ust, removeWords, stopwords("english"))
ust <- tm_map(ust, removeNumbers)
ust <- tm_map(ust, stemDocument, language = "english")
ust <- tm_map(ust, removePunctuation)
ust <- tm_map(ust, stripWhitespace)
# alternative tokenizer using NLP::ngrams:
# BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm_2gram <- DocumentTermMatrix(ust, control = list(tokenize = BigramTokenizer))
dtm_3gram <- DocumentTermMatrix(ust, control = list(tokenize = TrigramTokenizer))
# inspect(dtm_2gram[1:20,1:20])
# inspect(dtm_3gram[1:20,1:20])
The tables and charts below show the top 15 most frequent 2-grams and 3-grams:
require(ggplot2)
require(ggpubr)
theme_set(theme_pubr())
# 2-gram frequency table, sorted by decreasing frequency
dtm_m_2gram <- as.matrix(dtm_2gram)
fr_tb_2gram <- cbind(
  colnames(dtm_m_2gram),
  colSums(dtm_m_2gram)
) %>% as.data.frame()
rownames(fr_tb_2gram) <- NULL
colnames(fr_tb_2gram) <- c("two_gram", "freq")
fr_tb_2gram$freq <- as.numeric(fr_tb_2gram$freq)
fr_tb_2gram <- fr_tb_2gram[order(fr_tb_2gram$freq, decreasing = TRUE), ]
# 3-gram frequency table, sorted by decreasing frequency
dtm_m_3gram <- as.matrix(dtm_3gram)
fr_tb_3gram <- cbind(
  colnames(dtm_m_3gram),
  colSums(dtm_m_3gram)
) %>% as.data.frame()
rownames(fr_tb_3gram) <- NULL
colnames(fr_tb_3gram) <- c("three_gram", "freq")
fr_tb_3gram$freq <- as.numeric(fr_tb_3gram$freq)
fr_tb_3gram <- fr_tb_3gram[order(fr_tb_3gram$freq, decreasing = TRUE), ]
head(fr_tb_3gram, 15)
head(fr_tb_2gram, 15)
# horizontal bar charts of the top 15 n-grams, ordered by frequency
tw_gr <- ggplot(data = head(fr_tb_2gram, 15)) +
  geom_col(aes(x = freq, y = reorder(two_gram, freq))) + ylab("2-gram")
th_gr <- ggplot(data = head(fr_tb_3gram, 15)) +
  geom_col(aes(x = freq, y = reorder(three_gram, freq))) + ylab("3-gram")
ggarrange(tw_gr, th_gr)
To answer this question, we need to find how many of the most frequent unique words are required to cover a given fraction of all word instances in the corpus; the target fractions are 0.5 and 0.9. To find the numerator of this fraction we use the 1-gram frequency table, ordered by decreasing frequency, and take its cumulative sum.
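The idea in miniature, on a toy frequency vector (the numbers are made up for illustration, not taken from the data):
# toy example: 5 "words" with made-up frequencies, already sorted decreasing
freqs <- c(40, 25, 15, 12, 8)
coverage <- cumsum(freqs) / sum(freqs)   # 0.40 0.65 0.80 0.92 1.00
which(coverage >= 0.5)[1]                # 2 words cover at least 50%
which(coverage >= 0.9)[1]                # 4 words cover at least 90%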
require(dplyr)
require(tm)
con <- file("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
ust <- readLines(con, 2000)
close(con)
# build a VCorpus directly and apply the same cleaning steps as above
ust <- VCorpus(VectorSource(ust))
ust <- tm_map(ust, removeWords, stopwords("english"))
ust <- tm_map(ust, removeNumbers)
ust <- tm_map(ust, stemDocument, language = "english")
ust <- tm_map(ust, removePunctuation)
ust <- tm_map(ust, stripWhitespace)
dtm <- DocumentTermMatrix(ust)
dtm_m <- as.matrix(dtm)
# 1-gram frequency table, sorted by decreasing frequency
fr_tb <- cbind(
  colnames(dtm_m),
  colSums(dtm_m)
) %>% as.data.frame()
rownames(fr_tb) <- NULL
colnames(fr_tb) <- c("word", "freq")
fr_tb$freq <- as.numeric(fr_tb$freq)
fr_tb <- fr_tb[order(fr_tb$freq, decreasing = TRUE), ]
# cumulative fraction of word instances covered by the top-ranked words
cs <- cumsum(fr_tb$freq) / sum(fr_tb$freq)
# number of words needed to cover 50% and 90% of all word instances
which(cs >= 0.5)[1]
which(cs >= 0.9)[1]
So to cover 50% of all word instances we need to have 369 words. To cover 90%, we need 3407 words.
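To visualize how quickly coverage grows with the number of words, the cumulative fraction can be plotted against word rank; a minimal sketch with base graphics, assuming the cs vector computed above:
# coverage curve: fraction of all word instances covered by the top-n words
plot(cs, type = "l", log = "x",
     xlab = "number of most frequent words (log scale)",
     ylab = "fraction of word instances covered")
abline(h = c(0.5, 0.9), lty = 2)            # the 50% and 90% targets
abline(v = c(which(cs >= 0.5)[1],           # words needed for 50% coverage
             which(cs >= 0.9)[1]), lty = 3) # words needed for 90% coverage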
One of the easiest approaches is to compare the corpus with an English dictionary. I will use the word list provided by DWYL Ltd (words_alpha.txt, i.e. words without numbers and symbols).
library(dplyr)
url2 <- "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
if(!file.exists("words_alpha.txt")) {download.file(url2, "words_alpha.txt")}
ust_words <- as.character(fr_tb$word)
eng_dict <- readLines("words_alpha.txt")
# number of corpus words not found in the English word list
length(setdiff(ust_words, eng_dict))
Comparing these two word lists reveals that there are 1632 corpus words not found in the English dictionary. That seems like quite a high number, so it is worth examining what these words are (first 15):
library(dplyr)
setdiff(ust_words, eng_dict) %>% head(15)
## [1] "lol" "realli" "happi" "peopl" "everi" "anyon" "someth"
## [8] "favorit" "littl" "noth" "someon" "amaz" "everyon" "pretti"
## [15] "whi"
The corpus was stemmed earlier, which is one reason why so many words are not found in the English dictionary (e.g. "realli", "happi" and "peopl" are stems of ordinary English words). Other reasons for the large number of "non-English" words are misspellings, abbreviations and onomatopoeias; moreover, the corpus contains a lot of colloquial language. Thus an enhanced English word list would be needed to find out how many words actually come from foreign languages. An alternative approach would be to check the corpus against dictionaries of other popular languages such as Spanish, Italian, German, French, etc.
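One way to account for the stemming effect is to stem the dictionary itself before the comparison; a minimal sketch, assuming ust_words and eng_dict from the chunks above (eng_dict_stemmed and still_unknown are names introduced here for illustration):
require(SnowballC)
# stem the dictionary the same way the corpus was stemmed, then compare again;
# words that differ only by inflection should no longer be flagged
eng_dict_stemmed <- unique(wordStem(eng_dict, language = "english"))
still_unknown <- setdiff(ust_words, union(eng_dict, eng_dict_stemmed))
length(still_unknown)     # expected to be noticeably smaller than 1632
head(still_unknown, 15)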
There are basically two ways to handle words that are not present in the corpus in the first place: mapping them to known words via a thesaurus, and mapping them to known word stems via stemming.
While the definition of a thesaurus (a book that lists words in groups of synonyms and related concepts) is quite clear, the stemming technique may require some explanation. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root (Wikipedia).
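To illustrate both techniques in code: wordStem from the SnowballC package maps inflected forms to a common stem, and a small named vector stands in for a thesaurus lookup (toy_thesaurus and its entries are made up for illustration):
require(SnowballC)
# stemming: related inflected forms collapse onto one stem
wordStem(c("connect", "connected", "connection", "connections"), language = "english")
# expected to map all four forms onto the single stem "connect"
# thesaurus: a toy lookup table mapping unseen words to synonyms
# that do occur in the corpus (entries are illustrative only)
toy_thesaurus <- c(gigantic = "huge", automobile = "car", acquire = "get")
unseen <- c("gigantic", "acquire")
toy_thesaurus[unseen]   # replace unseen words by their in-corpus synonyms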