The goal of this project is to build an application with text prediction capabilities. The prediction model will be built based a dataset that can be freely downloaded from here. The dataset is mainly retrieved from HC Corpora, containing the output of crawls of blogs, news sites and twitter. More details about the dataset can be found here. The given dataset contains data across four languages (Russian, Finnish, German and English). This project will only focus on the datasets for English language, including the following three files:
library(NLP)
library(tm)
library(SnowballC)
library(ggplot2)
source("./src/load_data.R")
source("./src/summary.R")
source("./src/sampling.R")
set.seed(1717)
First, my approach is to download the remote file and unzip it. In order to keep the raw data intact, I re-save the data to the other folder for later analysis. When I need to perform analysis, the function should first check if the files are available locally and load them, otherwise download the files from remote server.
# Use the above functions to download the data
downloadDataset("./files/Coursera-SwiftKey.zip", "https://goo.gl/cn60C1")
dataList <- list(blogs = readData("blogs"),
twitter = readData("twitter"),
news = readData("news"))
# Get file size of the data files
fileSource <- DirSource(file.path(".", "final", "en_US"))
lapply(seq_along(fileSource$filelist),
getFileSize,
path=fileSource$filelist,
name=c("blogs", "twitter", "news"))
## [[1]]
## [1] "blogs 200.42 MB"
##
## [[2]]
## [1] "twitter 196.28 MB"
##
## [[3]]
## [1] "news 159.36 MB"
# Get lines of the data files
(listOfLines <- lapply(dataList, getLines))
## $blogs
## [1] 899288
##
## $twitter
## [1] 2360148
##
## $news
## [1] 1010242
# Get shortest entry of each file
lapply(dataList, getShortestEntry)
## $blogs
## [1] 1
##
## $twitter
## [1] 2
##
## $news
## [1] 1
# Get longest entry of each data file
lapply(dataList, getLongestEntry)
## $blogs
## [1] 40833
##
## $twitter
## [1] 140
##
## $news
## [1] 11384
Sampling is required because the size of the given data is considerably large while we only have limited computational resources. The sample size is 10% of the original size. This is to keep the proportion among different data source in order not to over or under-estimate some of the data sources. The sampled data are saved as a text file (.txt) to the local disk.
sampleDataList <- lapply(dataList, getSamples)
saveSampledData <- function(data, index){
fileName <- names(sampleDataList[index])
filePath <- paste("./data/sampled/", fileName, ".txt", sep = "")
writeToFile(data, filePath)
}
mapply(saveSampledData, sampleDataList, seq_along(sampleDataList))
# Remove unnecessary data to save memory
rm(dataList, sampleDataList)
In this report, we use a package specifically for text mining - tm, which provides many built-in functions to perform basic text-mining tasks. First we read the sampled data from local disk and create the corpus. Before cleaning up, all texts are transformed to lower case in order to avoid case-sensitive issues in the later filtering tasks.
corpusData <- Corpus(DirSource(file.path(".", "data", "sampled")))
# convert all texts to lower case for later clean-up
corpusData <- tm_map(corpusData, content_transformer(tolower))
# initial word count
wordCount.initial <- sum(rowSums(as.matrix(TermDocumentMatrix(corpusData))))
# initial number of unique words
wordCount.unique.initial <- length(rowSums(as.matrix(TermDocumentMatrix(corpusData))))
One of tasks is to perform profanity filtering. Instead of making a list from scratch, some resources are available on the Internet.vishalsurana has made a list of English-bad-words, and the text file of the list can be retrieved [here]]5.
downloadDataset("./data/bad_words/bad_word_list.txt", "https://goo.gl/To9w5B")
badWords <- readFile("./data/bad_words/bad_word_list.txt")
corpusData <- tm_map(corpusData, removeWords, badWords)
Use regular expression to identify and remove the unwanted texts, e.g. email addresses and web addresses(including http and https).
# create the function to remove specific patterns
toRemove <- content_transformer(function(x, pattern) gsub(pattern, "", x))
# remove "@" related texts, e.g. email and @twitter_account
corpusData <- tm_map(corpusData, toRemove, "[^ ]*@[^ ]+", lazy = TRUE)
# remove URL like http: or https:
corpusData <- tm_map(corpusData, toRemove, "([ ]|^)https?:[^ ]+", lazy = TRUE)
Stop words are words that are commonly used in a language but contain less significance in meaning, such as articles(the, a, an). Filtering stop words or not depends on the system you are building. Some systems avoid filtering stop words in order to support phrase search see wikipedia. In this report, we decided to remove stop words because we have rather limited number of samples. In order to increase recognition rate, we only need words with more value in terms of meaning.
# remove stopwords
corpusData <- tm_map(corpusData, removeWords, stopwords("english"))
Remove unwanted characters such as punctuations, numbers and white space.
# remove punctuations
corpusData <- tm_map(corpusData, removePunctuation)
# remove numbers
corpusData <- tm_map(corpusData, removeNumbers)
# remove whitespace
corpusData <- tm_map(corpusData, stripWhitespace)
# remove non-English texts
corpusData <- tm_map(corpusData, toRemove, "[^[:alpha:][:space:]]*", lazy = TRUE)
# convert to plain text document
corpusData <- tm_map(corpusData, PlainTextDocument)
Next we perform stemming in order to remove several word endings, e.g. “ing” and “es”. This will make a word more recognizable despite whether or not it may have various endings in the original text. After stemming, stemmed words would become incomplete. Then we perform stemCompletion to heuristically complete the words. The stemCompletion function might fail when working with tm_map, so a fix is needed to proceed the task.
# stem the data
corpusData_stemmed <- tm_map(corpusData, stemDocument, lazy = TRUE)
# perfrom stem completion using the original corpus
# the original
# the following funtion is to fix the bug when using stemCompletion with tm_map
stemCompletion2 <- function(x, dictionary) {
x <- unlist(strsplit(as.character(x), " "))
x <- x[x != ""]
x <- stemCompletion(x, dictionary=dictionary, type="shortest")
x <- paste(x, sep="", collapse=" ")
PlainTextDocument(stripWhitespace(x))
}
corpusData <- lapply(corpusData_stemmed,
stemCompletion2,
dictionary=corpusData)
We look into the corpus and inspect the frequently used words in the corpus.
# Specify the source to be character vectors
corpusData <- Corpus(VectorSource(corpusData))
# create a term document matrix
tdm <- TermDocumentMatrix(corpusData)
# remove 90% of the terms that do not appear in all of the documents
tdm <- removeSparseTerms(tdm, 0.90)
# inspect frequent words
head(findFreqTerms(tdm, lowfreq = 1000), 40)
## [1] "also" "back" "can" "day" "even" "first" "get"
## [8] "going" "good" "got" "great" "just" "know" "last"
## [15] "like" "love" "make" "much" "new" "now" "one"
## [22] "people" "really" "right" "said" "see" "still" "think"
## [29] "time" "today" "two" "want" "way" "well" "will"
## [36] "work" "year"
# calculate the frequency
termFrequency <- rowSums(as.matrix(tdm))
# calculate the number of words after cleaning up
wordCount.cleaned <- sum(termFrequency)
wordCount.cleanedPercentage <- round((wordCount.initial - wordCount.cleaned) / wordCount.initial * 100,2)
# calculate the number of unique words after cleaning up
wordCount.unique.cleaned <- length(termFrequency)
wordCount.unique.cleanedPercentage <- round((wordCount.unique.initial - wordCount.unique.cleaned) / wordCount.unique.initial * 100,2)
# Print top 20 of the most frequent words
termFrequency <- sort(termFrequency, decreasing = T)
head(termFrequency, 20)
## will just said one like can get time new good
## 3192 3071 3033 2891 2687 2533 2213 2185 1933 1844
## now know day love people back see first going make
## 1833 1695 1672 1584 1539 1416 1359 1334 1319 1301
# freqTerms <- subset(termFrequency, termFrequency >= 1000)
freqTerms <- head(termFrequency, 40)
dfFreqTerms <- data.frame(term = names(freqTerms), freq = freqTerms)
frequencyPlot <- ggplot(dfFreqTerms, aes(x = reorder(term, freq), y = freq)) +
geom_bar(stat = "identity") +
xlab("Terms") + ylab("Count") + coord_flip()
frequencyPlot
We visualize the word frequency by generating a word cloud.
library(wordcloud)
colorSetting <- brewer.pal(6, "Dark2")
wordcloud(names(termFrequency), termFrequency, scale=c(5, .3), 50, random.order = F, colors = colorSetting)
Use LDA to find possible topics out of the corpus. However, it is difficult to find meaningful combinations. This is because the given dataset is not topic-based and many of the texts from Twitter and blogs are published in a informal format.
library(topicmodels)
lda <- LDA(t(tdm), k = 7)
(term <- terms(lda, 10))
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
## [1,] "said" "said" "will" "can" "will" "just" "one"
## [2,] "just" "will" "much" "day" "just" "like" "can"
## [3,] "two" "one" "like" "now" "next" "love" "just"
## [4,] "one" "new" "get" "tonight" "today" "thanks" "will"
## [5,] "day" "year" "just" "people" "said" "get" "like"
## [6,] "even" "last" "time" "think" "great" "good" "time"
## [7,] "get" "can" "many" "today" "new" "will" "good"
## [8,] "years" "also" "can" "great" "think" "back" "make"
## [9,] "city" "like" "now" "last" "get" "know" "think"
## [10,] "good" "state" "first" "see" "good" "one" "know"
The initial sample data contains 8.1101310^{5} words and 107247 unique words. After cleaning up, the number of words reduced to 5.3453510^{5} and 54904 unrepeated words, which means that we have excluded 34.09% of unwanted words and 48.81% of unrepeated words.
We have an overview of the commonly used words from the word cloud and the word count diagram. However, it is not clear if the corpus is clean enough for making precise predictions.
It needs to be noted that the stem and stemCompletion does not seem to make huge difference on the outcome while it required expensive computational power and time. This is due to the fact that, if stemCompletion is executed, the corpus is very much similar to the un-stemmed one. For example, ‘cities’ will be stemmed as ‘citi’ and then be restored as ‘cities’.
However, if stemCompletion is not performed, some words would become misspelled, e.g. both ‘city’ and ‘cities’ would become ‘citi’. This leads to another critical issue - misspelling. Considering the time constraints, we would have chosen proceed without stemming.
As mentioned earlier, misspelling is an issue that is hard to remedy. One possible approach can be replacing the words in the corpus with the most similar words in a discitonary list. However, we did not perform this task in this report.
We use RWeka package to generate n-gram word list. In this report, we tokenized the corpus and made unigram, bigram, and trigram word lists.
library(RWeka)
finalTdm <- TermDocumentMatrix(finalCorpus)
# Sets the default number of threads to use
options(mc.cores=1)
UnigramTokenizer <- function(data) NGramTokenizer(data, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(data) NGramTokenizer(data, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(data) NGramTokenizer(data, Weka_control(min = 3, max = 3))
QuadgramTokenizer <- function(data) NGramTokenizer(data, Weka_control(min = 4, max = 4))
finalTdm.unigram <- TermDocumentMatrix(finalCorpus,
control = list(tokenize = UnigramTokenizer))
finalTdm.bigram <- TermDocumentMatrix(finalCorpus,
control = list(tokenize = BigramTokenizer))
finalTdm.trigram <- TermDocumentMatrix(finalCorpus,
control = list(tokenize = TrigramTokenizer))
finalTdm.quadgram <- TermDocumentMatrix(finalCorpus,
control = list(tokenize = QuadgramTokenizer))
countSum <- function(tdm){
tdm <- removeSparseTerms(tdm, 0.75)
return(sort(rowSums(as.matrix(tdm)), decreasing=TRUE))
}
finalTdm.unigram <- countSum(finalTdm.unigram)
finalTdm.bigram <- countSum(finalTdm.bigram)
finalTdm.trigram <- countSum(finalTdm.trigram)
finalTdm.quadgram <- countSum(finalTdm.quadgram)
# top 20 unigram
head(finalTdm.unigram, 20)
## will just said one like can get time new good
## 3192 3071 3033 2891 2687 2533 2213 2185 1933 1844
## now know day love people back see first going make
## 1833 1695 1672 1584 1539 1416 1359 1334 1319 1301
# top 20 bigram
head(finalTdm.bigram, 20)
## right now last year new york last night
## 266 199 195 159
## high school first time feel like years ago
## 141 137 135 133
## last week can get make sure looking forward
## 119 117 117 110
## even though good morning can see just got
## 101 98 92 91
## one day happy birthday every day looks like
## 89 88 87 84
# top 20 trigram
head(finalTdm.trigram, 20)
## happy mothers day let us know new york city
## 37 26 23
## beep beep beep happy new year cinco de mayo
## 20 20 19
## new york times first time since two years ago
## 19 18 17
## u u u president barack obama looking forward seeing
## 17 14 12
## gov chris christie pates fountain parks just got back
## 11 11 10
## world war ii let know can rock n roll
## 10 9 9
## st louis county two weeks ago
## 9 9
# top 20 quadgram
head(finalTdm.quadgram, 20)
## beep beep beep beep u u u u
## 19 9
## classic pates fountain parks add boston add boston
## 8 6
## amazon services llc amazon happy mothers day moms
## 6 6
## services llc amazon eu boston add boston add
## 6 5
## lovz lovz lovz lovz please let us know
## 5 5
## g fat g saturated g fiber mg sodium
## 4 4
## ha ha ha ha just finished mi run
## 4 4
## let us know can let us know think
## 4 4
## lies lies lies lies martin luther king jr
## 4 4
## o o o o th street th street
## 4 4
getGram <- function(word, n){
return(unlist(strsplit(word, "[ ]"))[n])
}
convertToDf <- function(tdm, numGrams){
df <- data.frame(frequency = tdm)
for(i in numGrams:1){
columnNames <- c(i,names(df))
df <- cbind(as.matrix(lapply(names(tdm), getGram, n=i)), df)
colnames(df) <- columnNames
}
return(df)
}
df.unigram <- convertToDf(finalTdm.unigram, 1)
saveRDS(df.unigram, file = "./data/final/unigram.RDS")
rm(df.unigram, finalTdm.unigram)
df.bigram <- convertToDf(finalTdm.bigram, 2)
saveRDS(df.bigram, file = "./data/final/bigram.RDS")
rm(df.bigram, finalTdm.bigram)
df.trigram <- convertToDf(finalTdm.trigram, 3)
saveRDS(df.trigram, file = "./data/final/trigram.RDS")
rm(df.trigram, finalTdm.trigram)
df.quadgram <- convertToDf(finalTdm.quadgram, 4)
saveRDS(df.quadgram, file = "./data/final/quadgram.RDS")
rm(df.quadgram, finalTdm.quadgram)