This project uses natural language processing to investigate large text files with the goal of developing an interactive product similar to the SwiftKey app. SwiftKey takes an ordered series of words as input and predicts the next word in the series. This document covers the second task of the larger project: conducting an exploratory data analysis.
Text mining analysis involves several steps, “mainly influenced by the fact that texts, from a computer perspective, are rather unstructured collections of words. A text mining analyst typically starts with a set of highly heterogeneous input texts.” (Feinerer, Ingo, et al. 2008. “Text Mining Infrastructure in R,” p. 4, online here). Because of these traits, it is necessary to complete a number of preprocessing tasks, such as text reformatting, before conducting textual analysis. After preprocessing the text, the analyst must then “transform the preprocessed texts into structured formats to be actually computed” (Feinerer, et al. 2008, p. 5). When using the quanteda library, these structured formats take the form of document-feature matrices (dfms). The next sections outline these steps; the following code chunk loads the relevant text data.
library(quanteda)      # corpus(), dfm(), tokens(), textplot_wordcloud()
library(stringi)       # stri_count_words(), stri_stats_general(), stri_stats_latex()
library(ggplot2)       # frequency bar plots
library(RColorBrewer)  # brewer.pal() palettes for the wordclouds

set.seed(2143)  # make the 1% samples reproducible

# read each file in full, sample 1% of its lines, and build a quanteda corpus
tweet.con <- file("./en_US/en_US.twitter.txt", open="rb")
tweet.vector <- readLines(tweet.con, encoding="UTF-8", skipNul=TRUE)
tweet.subset <- sample(tweet.vector, size=length(tweet.vector)*0.01, replace=FALSE)
tweet.corpus <- corpus(tweet.subset)
close(tweet.con)

blog.con <- file("./en_US/en_US.blogs.txt", open="rb")
blog.vector <- readLines(blog.con, encoding="UTF-8", skipNul=TRUE)
blog.subset <- sample(blog.vector, size=length(blog.vector)*0.01, replace=FALSE)
blog.corpus <- corpus(blog.subset)
close(blog.con)

news.con <- file("./en_US/en_US.news.txt", open="rb")
news.vector <- readLines(news.con, encoding="UTF-8", skipNul=TRUE)
news.subset <- sample(news.vector, size=length(news.vector)*0.01, replace=FALSE)
news.corpus <- corpus(news.subset)
close(news.con)

# tag each sub-corpus with its source, then combine into a single corpus
# (note: the name `corpus` masks quanteda::corpus() from here on)
docvars(tweet.corpus, "type") <- "tweet"
docvars(blog.corpus, "type") <- "blog"
docvars(news.corpus, "type") <- "news"
corpus <- tweet.corpus + blog.corpus + news.corpus
Before processing the text further, I calculate summary statistics for the full text files.
# words per line (min / mean / max) for each full file
corpus_summary <- sapply(list(blog.vector, news.vector, tweet.vector),
                         function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(corpus_summary) <- c('min.words.line', 'mean.words.line', 'max.words.line')

# combine line, character, and word totals with the per-line word statistics
stats <- data.frame(
  Dataset = c("blogs", "news", "twitter"),
  t(rbind(
    sapply(list(blog.vector, news.vector, tweet.vector), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blog.vector, news.vector, tweet.vector), stri_stats_latex)['Words', ],
    corpus_summary
  ))
)
head(stats)
##   Dataset   Lines     Chars    Words min.words.line mean.words.line max.words.line
## 1   blogs  899288 206824382 37570839              0        41.75108           6726
## 2    news 1010242 203223154 34494539              1        34.40997           1796
## 3 twitter 2360148 162096031 30451128              1        12.75063             47
First, I use the quanteda library to create and preprocess a corpus for analysis. The concept of a “corpus” as a specific data object was new to me; in general usage the term means something fairly generic (e.g., “a collection of texts”). According to the vignette included in the quanteda documentation, however, the corpus object the package creates has some distinctive design features.
The developers of quanteda designed their corpus to be “a more or less static container of texts with respect to processing and analysis. This means that the texts in a corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing, and assigned to new objects, but the idea is that the corpus will remain as an original reference copy so that other analyses can be performed on the same corpus.” (“Getting Started with quanteda,” online here.). This static, non-mutable design of the “corpus” data object in the quanteda package means the original texts remain available as a reference copy while derived objects are created from them.
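As a quick illustration of this design (my own addition, not part of the original analysis), texts and metadata can be pulled out of the combined corpus into new objects without altering the corpus itself:
#extract texts and document variables without modifying the corpus
summary(corpus, n = 3) #per-document summary of the combined corpus
first.texts <- texts(corpus)[1:3] #texts() in this quanteda version; newer releases use as.character()
table(docvars(corpus, "type")) #documents contributed by each source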
Because of the large size of the corpus, I take a sample before starting the analysis. I somewhat arbitrarily sample 1 percent of each text file and concatenate the samples into a corpus of roughly 6 MB (compared to the roughly 0.5 GB of the original files). This sample will serve as a training dataset; the resulting model will eventually be applied to the full data.
I am now able to perform a number of common natural language processing tasks, such as stemming (reducing derivationally related or inflected forms of a word to a common base form, or stem), tokenizing, text cleaning, and forming n-grams (contiguous sequences of n words). If I understand the concept of stemming correctly, it seems to be a reasonable approach for identifying “words that may not be in the corpora or use a small number of words in the dictionary to cover the same number of phrases,” as the background material asks. It’s something I’ll investigate more later.
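As a small sketch of what stemming would look like here (my own addition; I do not use stemming in the analysis below), quanteda’s dfm_wordstem() reduces features such as “running” and “runs” to the common stem “run”:
#stem the unigram features of a dfm and inspect the resulting stems
dfm.stemmed <- dfm_wordstem(dfm(corpus))
head(featnames(dfm.stemmed), 20)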
After loading the files into R, I have the option of tokenizing the dataset. In the quanteda package, text becomes features (the units, based on individual words or word sequences, used to extract frequency information, described further below) through the process of tokenization (i.e., splitting each text into discrete units such as words). These steps are applied automatically when creating a dfm, but can also be called explicitly as a prior step if needed.
To tokenize data, quanteda provides the tokens() command. This produces an intermediate object consisting of a list of character vectors of tokens, where each element of the list corresponds to an input document. Tokenizing allows me to efficiently clean the text, removing URLs, special characters, punctuation, numbers, and excess whitespace, and to change the text to lower case. The tokens() command is conservative by default: each of these removal options is set to FALSE unless requested. While explicit tokenizing is not a necessary step here, it can be a useful one.
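A minimal sketch of that explicit step (not run in the analysis below; the option names reflect the quanteda version used here):
#tokenize the corpus, stripping numbers, punctuation, symbols, and URLs
corpus.tokens <- tokens(corpus, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE)
corpus.tokens <- tokens_tolower(corpus.tokens) #lower-case the tokens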
Having preprocessed the data, I can now move on to exploratory analysis. A first step in this process is to extract a document-feature matrix (dfm) from the corpus (these matrices are similar to the document-term matrix objects the tm package creates). The dfm records the value of each feature within each document. The quanteda package vignette defines features as “raw terms, stemmed terms, the parts of speech of terms, terms after stopwords have been removed, or a dictionary class to which a term belongs. Features can be entirely general, such as ngrams or syntactic dependencies, and we leave this open-ended.” (“Getting Started with quanteda,” online here.).
When creating a dfm from the corpus, I am able to extract n-grams from the text. An n-gram is a sequence of n “words.” Summarizing and displaying n-grams is an important step in the exploratory analysis below. The following code chunks create dfms and word clouds of the 100 most frequently used monograms (unigrams), bigrams, and trigrams in the corpus, along with bar charts of the top 20 features. I include two separate versions of the monogram plots, one of which I rudimentarily cleaned by removing punctuation and stop words. I do not include similar cleaned graphics for the bigrams and trigrams (doing so would require separately tokenizing the text, which is fairly time intensive).
##monograms
dfm.monogram <- dfm(corpus, stem = FALSE)
dfm.monogram.alt <- dfm(corpus, remove = stopwords("english"), remove_punct = TRUE, stem = FALSE)
#monogram_wordclouds
textplot_wordcloud(dfm.monogram, max.words=100, colors = brewer.pal(8, "Dark2"))
textplot_wordcloud(dfm.monogram.alt, max.words=100, colors = brewer.pal(8, "Dark2"))
##trimmed monogram
#dfm.monogram.trimmed <- dfm_trim(dfm.monogram, min_count = 5, min_docfreq = 3)
top.features <- topfeatures(dfm.monogram, n=20)
top.features.df <- data.frame(top.features)
top.features.df["unigram"] <- rownames(top.features.df)
top.features.plot <- ggplot(top.features.df, aes(x=reorder(unigram, -top.features), y=top.features))
top.features.plot <- top.features.plot + geom_bar(position = "identity", stat = "identity")
top.features.plot <- top.features.plot + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Feature (no cleaning)") + ylab("Count")
top.features.plot
top.features <- topfeatures(dfm.monogram.alt, n=20)
top.features.df <- data.frame(top.features)
top.features.df["unigram"] <- rownames(top.features.df)
top.features.plot <- ggplot(top.features.df, aes(x=reorder(unigram, -top.features), y=top.features))
top.features.plot <- top.features.plot + geom_bar(position = "identity", stat = "identity")
top.features.plot <- top.features.plot + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Feature (punctuation and stopwords removed)") + ylab("Count")
top.features.plot
##bigrams
dfm.bigram <- dfm(corpus, stem=FALSE, ngrams=2)
#dfm.bigram.alt <- dfm(corpus, remove = stopwords("english"), remove_punct = TRUE, stem = FALSE, ngrams = 2)
#bigram_wordclouds
textplot_wordcloud(dfm.bigram, max.words=100, colors = brewer.pal(8, "Dark2"))
#textplot_wordcloud(dfm.bigram.alt, max.words=100, colors = brewer.pal(8, "Dark2"))
top.features <- topfeatures(dfm.bigram, n=20)
top.features.df <- data.frame(top.features)
top.features.df["unigram"] <- rownames(top.features.df)
top.features.plot <- ggplot(top.features.df, aes(x=reorder(unigram, -top.features), y=top.features))
top.features.plot <- top.features.plot + geom_bar(position = "identity", stat = "identity")
top.features.plot <- top.features.plot + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Feature (no cleaning)") + ylab("Count")
top.features.plot
##trigrams
dfm.trigram <- dfm(corpus, stem=FALSE, ngrams=3)
#dfm.trigram.alt <- dfm(corpus, remove = stopwords("english"), remove_punct = TRUE, ngrams = 3, stem = FALSE)
#trigram_wordclouds
textplot_wordcloud(dfm.trigram, max.words=100, colors = brewer.pal(8, "Dark2"))
#textplot_wordcloud(dfm.trigram.alt, max.words=100, colors = brewer.pal(8, "Dark2"))
top.features <- topfeatures(dfm.trigram, n=20)
top.features.df <- data.frame(top.features)
top.features.df["unigram"] <- rownames(top.features.df)
top.features.plot <- ggplot(top.features.df, aes(x=reorder(unigram, -top.features), y=top.features))
top.features.plot <- top.features.plot + geom_bar(position = "identity", stat = "identity")
top.features.plot <- top.features.plot + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Feature") + ylab("Count")
top.features.plot
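As an additional check (my own addition), counting the unique features in each dfm gives a sense of how quickly the vocabulary grows as n increases (this is nfeat() in the quanteda version used here; older releases call it nfeature()):
#number of unique features in each n-gram dfm
c(monograms = nfeat(dfm.monogram), bigrams = nfeat(dfm.bigram), trigrams = nfeat(dfm.trigram))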
How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances? How about 90%? I need to think about this one.
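One rough way to approach this question (a sketch of my own, not yet part of the analysis above) is to sort the unigram frequencies and walk down the cumulative distribution:
#cumulative share of all token instances covered by the most frequent words
word.freq <- sort(colSums(dfm.monogram), decreasing = TRUE)
coverage <- cumsum(word.freq) / sum(word.freq)
min(which(coverage >= 0.5)) #unique words needed to cover 50% of instances
min(which(coverage >= 0.9)) #unique words needed to cover 90% of instances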
How do you evaluate how many of the words come from foreign languages? Perhaps use regular expressions to look at the frequency of non-English characters?
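A crude first pass (again only a sketch, and only a heuristic, since many English lines legitimately contain non-ASCII characters) could use stringi to flag sampled lines containing characters outside the ASCII range:
#share of sampled tweets containing at least one non-ASCII character
non.ascii <- stri_detect_regex(tweet.subset, "[^\\p{ASCII}]")
sum(non.ascii) / length(non.ascii)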
Note that this section is mostly a placeholder for now.
The basic goal of this task is to build an n-gram model, which will allow you to predict a word given the previous one, two, or three words. This will be based on the combinations of words that you observe in the data used to train the model.
Ultimately, however, it will not be possible to predict all words this way. Sometimes, people will want to type unique combinations of words that are not part of the training data input. Because of this, I will also build a model to handle cases where a particular n-gram is not observed.
In creating these models, one goal is to minimize both the size and the runtime of the model. It is therefore necessary to consider how much memory the different objects in the workspace require and how long the model takes to run. Finding the right balance between the two is an important part of a good user experience.
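For example (a quick sketch using base R's object.size(), listed in the tools section at the end of this report), the largest objects currently in the workspace can be identified like this:
#report the five largest objects in the workspace, in bytes
sort(sapply(ls(), function(x) object.size(get(x))), decreasing = TRUE)[1:5]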
- How can you efficiently store an n-gram model (think Markov Chains)?
- How can you use the knowledge about word frequencies to make your model smaller and more efficient?
- How many parameters do you need (i.e., how big is n in your n-gram model)?
- Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data)?
- How do you evaluate whether your model is any good?
- How can you use backoff models to estimate the probability of unobserved n-grams?
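To make the backoff idea in the last question concrete, here is a very rough sketch (my own placeholder, not the eventual model) of a bigram lookup that falls back to the most frequent unigram when no matching bigram exists in the training sample:
#frequency tables from the dfms above; bigram features are stored as "word1_word2"
bigram.freq <- sort(colSums(dfm.bigram), decreasing = TRUE)
unigram.freq <- sort(colSums(dfm.monogram), decreasing = TRUE)
predict.next <- function(word) {
  candidates <- bigram.freq[startsWith(names(bigram.freq), paste0(word, "_"))]
  if (length(candidates) > 0) {
    sub(".*_", "", names(candidates)[1]) #most frequent observed continuation
  } else {
    names(unigram.freq)[1] #back off to the most frequent unigram overall
  }
}
predict.next("one")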
Here are some code chunks that I didn’t end up using.
##Note that this code chunk takes some time to run. It relies on the tm package and its Corpus and VCorpus functions.
#docs <- VCorpus(VectorSource(sampleData)) #tm corpus used by the tm_map() cleaning steps below
#quanteda's corpus() builds directly from the sampled character vectors (it can also convert a tm VCorpus)
myCorpus <- corpus(sampleData) # build a quanteda corpus
twitterCorpus <- corpus(sampleTwitter)
blogCorpus <- corpus(sampleBlogs)
newsCorpus <- corpus(sampleNews)
# Note that the content_transformer is from the tm package
cleaning_function <- content_transformer(function(x, pattern) gsub(pattern, "", x))
#remove urls and emails with regex and cleaning function
docs <- tm_map(docs, cleaning_function, "(f|ht)tp(s?)://(.*)[.][a-z]+")
docs <- tm_map(docs, cleaning_function, "@[^\\s]+")
#make lowercase
docs <- tm_map(docs, content_transformer(tolower))
#remove stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
#remove punctuation
docs <- tm_map(docs, removePunctuation)
#remove numbers
docs <- tm_map(docs, removeNumbers)
#remove white space
docs <- tm_map(docs, stripWhitespace)
#create text document
docs <- tm_map(docs, PlainTextDocument)
#It will take a minute to load the data
twitter <- readLines(con <- file("./en_US/en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
blogs <- readLines(con <- file("./en_US/en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con) #yes, each connection opened with file() should be closed after readLines()
news <- readLines(con <- file("./en_US/en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
#Create a training data set with a 1% sample of the text
set.seed(02143)
sampleTwitter <- twitter[sample(1:length(twitter),0.01*length(twitter))]
sampleNews <- news[sample(1:length(news),0.01*length(news))]
sampleBlogs <- blogs[sample(1:length(blogs),0.01*length(blogs))]
sampleData <- list()
sampleData <- c(sampleTwitter, sampleNews, sampleBlogs)
writeLines(sampleData, "./sample/sampleData.txt")
# 8MB; docs_quanteda is assumed to be a quanteda corpus built from the sample above
monograms <- tokens(docs_quanteda, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_separators = TRUE, remove_twitter = TRUE, remove_url = TRUE, ngrams = 1L, skip = 0L, concatenator = "_", hash = TRUE, include_docvars = TRUE) #skip = 0L means that the ngram extraction only applies to immediately neighboring words. hash = TRUE refers to the type of data object created by the `tokens()` command.
# 137MB
bigrams <- tokens(docs_quanteda, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_separators = TRUE, remove_twitter = TRUE, remove_url = TRUE, ngrams = 2, skip = 0L, concatenator = "_", hash = TRUE, include_docvars = TRUE)
# 287MB
trigrams <- tokens(docs_quanteda, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_separators = TRUE, remove_twitter = TRUE, remove_url = TRUE, ngrams = 3, skip = 0L, concatenator = "_", hash = TRUE, include_docvars = TRUE)
##Tools
#object.size(): this function reports the number of bytes that an R object occupies in memory.
#Rprof(): this function runs the profiler in R that can be used to determine where bottlenecks in a function may exist. The profr package provides some additional tools for visualizing and summarizing profiling data.
#gc(): this function runs the garbage collector to retrieve unused RAM for R. In the process, it reports how much memory R is currently using.
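A short example of how these tools might be used together (a sketch; I have not profiled the chunks above yet):
#profile a representative heavy step and report the calls with the most self-time
Rprof("profile.out")
dfm.profile <- dfm(corpus)
Rprof(NULL)
head(summaryRprof("profile.out")$by.self)
object.size(dfm.profile) #memory footprint of the resulting dfm
gc() #reclaim unused memory and report current usage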