First Step: Text Mining
Benjamin Rouillé d’Orfeuil
February 21, 2017
We aim to develop a machine learning algorithm that predicts the next word given a word or phrase. To achieve this goal, the first step is to build a language model by ingesting a corpus of documents. These documents are used to understand the distribution of words and how they are put together. For this reason, this exploratory data analysis focuses on extracting n-grams, i.e., sequences of n consecutive words, from the corpus.
We first load the data and provide a basic summary of the raw dataset. To speed up the analysis, we generate a random sample of the original dataset. This representative subset of the data is then cleaned: more specifically, we filter out profanity and entries we don't want to predict, such as punctuation. We then move to the analysis of the data and extract n-grams to understand the frequency of single words and sequences of words encountered in the corpus. In the last section, we lay out plans for building the prediction algorithm and developing the associated Shiny application.
Three text files are available in the English corpus. Summary statistics are given below.
dir <- "../../final/en_US/"
files <- paste0(dir, list.files(dir) )
nFiles <- length(files)
info <- function(fileName) {
  size <- file.info(fileName)$size / 1024^2 # size of the file in MB
  wc <- unlist(strsplit(system2("wc", args = paste("-lw", fileName), stdout = TRUE), " ") )
  nLines <- (wc[wc != ""])[1] # number of lines in the file
  nWords <- (wc[wc != ""])[2] # number of words in the file
  df <- data.frame(as.numeric(size), as.numeric(nLines), as.numeric(nWords) )
  colnames(df) <- c("size", "nLines", "nWords")
  rownames(df) <- sub("^.*/", "", fileName)
  return(df)
}
for (i in seq(nFiles) ) {
  if (i == 1) files.info <- info(files[i])
  else files.info <- rbind(files.info, info(files[i]) )
}
library("knitr")
kable(files.info, format.args = list(big.mark = ","), digits = 0, col.names = c("Size (MB)", "Lines", "Words") )
| | Size (MB) | Lines | Words |
|---|---|---|---|
| en_US.blogs.txt | 200 | 899,288 | 37,334,690 |
| en_US.news.txt | 196 | 1,010,242 | 34,372,720 |
| en_US.twitter.txt | 159 | 2,360,148 | 30,374,206 |
The files are pretty large. For this reason, we will consider only a small sample in this analysis.
We randomly sample 5,000 entries from every document available and save the extracted lines in a file on disk.
set.seed(128) # for reproducibility
nSample <- 5000
sampleFile <- function(fileName) {
  keep <- sample(info(fileName)$nLines, nSample) # indices of the lines to keep
  file <- readLines(fileName, skipNul = TRUE, encoding = "UTF-8")
  sample <- file[keep]
  return(sample)
}
for (i in seq(nFiles) ) {
  if (i == 1) selected <- sampleFile(files[i])
  else selected <- c(selected, sampleFile(files[i]) ) # concatenate the sampled character vectors
}
writeLines(selected, con = "sample.txt")
sample.info <- info("sample.txt")
This file is only 2 MB and contains 15,000 lines and 439,363 words.
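For reference, this summary can be rendered with the same kable formatting used for the full corpus (a minimal sketch reusing the info() helper and table layout defined above).

kable(sample.info, format.args = list(big.mark = ","), digits = 0,
      col.names = c("Size (MB)", "Lines", "Words") )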
Let’s start by lowercasing all characters.
clean <- tolower(selected)
We now remove URLs, email addresses, hashtags and Twitter usernames.
clean <- gsub("http\\S+\\s*", "", clean); clean <- gsub("www\\S+\\s*", "", clean) # urls
clean <- gsub('\\S+@\\S+', "", clean) # emails
clean <- gsub('#\\S+', "", clean); clean <- gsub('@\\S+', "", clean) # twitter
In the next chunk, we remove profanity and offensive words. The list was downloaded from this URL.
badwords <- readLines("badwords.txt", skipNul = TRUE, encoding = "UTF-8")
for (word in badwords) clean <- gsub(word, "", clean)
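Note that gsub() treats each entry of badwords.txt as a regular expression and also matches substrings inside longer words. A slightly safer variant, sketched below under the assumption that the list contains plain lowercase words rather than regular expressions, anchors each word on word boundaries and substitutes in a single pass.

# Sketch of an alternative: combine the list into one word-boundary pattern and substitute once.
pattern <- paste0("\\b(", paste(badwords, collapse = "|"), ")\\b")
clean <- gsub(pattern, "", clean)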
Finally, we remove numbers and punctuation. Multiple and trailing spaces also need to be collapsed. This can easily be achieved using the ngram package. Functions from this package manipulate a single string; for this reason, we first concatenate the corpus.
library("ngram")
words <- concatenate(clean)
words <- preprocess(words, remove.punct = TRUE, remove.numbers = TRUE, fix.spacing = TRUE)
Note that we did not remove stopwords (words such as the, also, and, …) from the corpus. Although these words carry little meaning on their own, they are extremely frequent and we believe it is important to keep them in this project. Our model will rely heavily on the n-grams, i.e., ordered sequences of words of length n, extracted from the corpus. A vast majority of n-grams would be meaningless if stopwords were removed (the trigram "one of the", for example, would collapse entirely) and, in turn, the model would perform poorly.
The next step consists of retrieving n-grams from the cleaned corpus. The ngram package makes this easy. In the next chunk, we also write some useful functions for performing the n-gram analysis.
getNGram <- function(words, n, print = TRUE) {
  ngram <- ngram(words, n = n)
  df <- get.phrasetable(ngram)
  if (print == TRUE) {
    print(kable(head(df, n = 5), format.args = list(big.mark = ","), digits = 5,
                col.names = c(paste0(n, "-gram"), "Frequency", "Proportion") ) )
  }
  return(df)
}
library("RColorBrewer")
library("wordcloud")
getNGramWordCloud <- function(ngram) {
  wordcloud(ngram$ngrams, ngram$freq, max.words = 50, min.freq = 5, scale = c(2.5, .5),
            colors = brewer.pal(6, "Dark2") )
}
library("ggplot2")
getNGramFreq <- function(ngram, n = 15) {
  title <- paste0(n, " most frequent ", length(strsplit(ngram[1, 1], " ")[[1]]), "-gram.")
  ggplot(ngram[1:n, ], aes(x = reorder(ngrams, freq), y = freq) ) + geom_bar(stat = "identity") +
    geom_text(aes(label = sprintf("%1.2f%%", 100 * prop) ), hjust = 1.25, colour = "white", size = 3) +
    coord_flip() + labs(title = title, x = "", y = "Count")
}
We can now produce the 1-, 2- and 3-grams and give some basic statistics. Word clouds and bar plots are shown along with the tables.
unigram <- getNGram(words, 1, print = TRUE)
| 1-gram | Frequency | Proportion |
|---|---|---|
| the | 21,931 | 0.05149 |
| to | 12,154 | 0.02853 |
| and | 11,125 | 0.02612 |
| a | 10,551 | 0.02477 |
| of | 9,375 | 0.02201 |
getNGramFreq(unigram)
getNGramWordCloud(unigram)
It is worth noting that the 10 most frequent unigrams represent 22% of all the unigrams extracted.
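This figure can be checked directly from the phrasetable returned by get.phrasetable(), whose prop column holds each unigram's share of all word occurrences.

sum(head(unigram$prop, 10) ) # proportion of all word occurrences covered by the 10 most frequent unigrams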
bigram <- getNGram(words, 2, print = TRUE)
| 2-gram | Frequency | Proportion |
|---|---|---|
| of the | 2,027 | 0.00476 |
| in the | 1,957 | 0.00459 |
| to the | 1,010 | 0.00237 |
| on the | 922 | 0.00216 |
| for the | 811 | 0.00190 |
getNGramFreq(bigram)
getNGramWordCloud(bigram)
trigram <- getNGram(words, 3, print = TRUE)
| 3-gram | Frequency | Proportion |
|---|---|---|
| one of the | 147 | 0.00035 |
| a lot of | 139 | 0.00033 |
| to be a | 92 | 0.00022 |
| as well as | 78 | 0.00018 |
| out of the | 73 | 0.00017 |
getNGramFreq(trigram)
getNGramWordCloud(trigram)
The very next step is to develop an algorithm to predict the next word in a sequence. N-gram models allow for the assignment of probabilities to sequences of words. Using the n-grams that have been extracted from the corpus, we can easily estimate the probability of the last word of an n-gram given the previous words, as sketched below. Markov chains are an easy way to store and query n-gram probabilities. There are still multiple questions to consider when building our first model of the relationship between words.
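As a first illustration of this idea, the unigram and bigram phrasetables built above already support a simple maximum-likelihood lookup: the probability of a word w following a word u can be estimated as freq(u w) / freq(u). The function below is only a sketch; predictNext is our own name, and it assumes its argument is a single lowercase token that appears in the sample.

# Maximum-likelihood next-word lookup from the unigram and bigram phrasetables.
# P(w | u) is estimated as count(u w) / count(u).
predictNext <- function(word, k = 3) {
  matches <- bigram[grepl(paste0("^", word, " "), bigram$ngrams), ]
  if (nrow(matches) == 0) return(character(0)) # word never seen: no prediction yet
  matches$prob <- matches$freq / unigram$freq[trimws(unigram$ngrams) == word]
  top <- head(matches[order(-matches$prob), ], k)
  trimws(sub(paste0("^", word, " "), "", top$ngrams) ) # keep only the predicted words
}
predictNext("of") # most likely continuations of "of", e.g. "the"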
Once we obtain an accurate and efficient predictive model, we will develop a Shiny web application. The user will be asked to enter some text, and the application will then suggest the words that most likely follow what the user has typed, as in the minimal sketch below.
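A minimal sketch of such an application, assuming a lookup function like the predictNext() sketched above (the widget names below are placeholders):

library("shiny")
ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Type some text:"),
  verbatimTextOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderPrint({
    tokens <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
    if (length(tokens) == 0) "Waiting for input..."
    else predictNext(tail(tokens, 1)) # suggest words likely to follow the last typed word
  })
}
shinyApp(ui, server)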