Introduction

The goal of this project is to build an application with text prediction capabilities. The prediction model will be built based a dataset that can be freely downloaded from here. The dataset is mainly retrieved from HC Corpora, containing the output of crawls of blogs, news sites and twitter. More details about the dataset can be found here. The given dataset contains data across four languages (Russian, Finnish, German and English). This project will only focus on the datasets for English language, including the following three files:

  1. en_US.blogs.txt
  2. en_US.twitter.txt
  3. en_US.news.txt
library(NLP)
library(tm)
library(SnowballC)
library(ggplot2)
source("./src/load_data.R")
source("./src/summary.R")
source("./src/sampling.R")

set.seed(1717)

Load the data

First, my approach is to download the remote file and unzip it. In order to keep the raw data intact, I re-save the data to the other folder for later analysis. When I need to perform analysis, the function should first check if the files are available locally and load them, otherwise download the files from remote server.

# Use the above functions to download the data
downloadDataset("./files/Coursera-SwiftKey.zip", "https://goo.gl/cn60C1")
dataList <- list(blogs = readData("blogs"),
                 twitter = readData("twitter"),
                 news = readData("news"))
# Get file size of the data files
fileSource <- DirSource(file.path(".", "final", "en_US"))
lapply(seq_along(fileSource$filelist), 
       getFileSize,
       path=fileSource$filelist, 
       name=c("blogs", "twitter", "news"))
## [[1]]
## [1] "blogs 200.42 MB"
## 
## [[2]]
## [1] "twitter 196.28 MB"
## 
## [[3]]
## [1] "news 159.36 MB"
# Get lines of the data files
(listOfLines <- lapply(dataList, getLines))
## $blogs
## [1] 899288
## 
## $twitter
## [1] 2360148
## 
## $news
## [1] 1010242
# Get shortest entry of each file
lapply(dataList, getShortestEntry)
## $blogs
## [1] 1
## 
## $twitter
## [1] 2
## 
## $news
## [1] 1
# Get longest entry of each data file
lapply(dataList, getLongestEntry)
## $blogs
## [1] 40833
## 
## $twitter
## [1] 140
## 
## $news
## [1] 11384

Perform Sampling

Sampling is required because the size of the given data is considerably large while we only have limited computational resources. The sample size is 10% of the original size. This is to keep the proportion among different data source in order not to over or under-estimate some of the data sources. The sampled data are saved as a text file (.txt) to the local disk.

sampleDataList <- lapply(dataList, getSamples)

saveSampledData <- function(data, index){
  fileName <- names(sampleDataList[index])
  filePath <- paste("./data/sampled/", fileName, ".txt", sep = "")  
  writeToFile(data, filePath)
}

mapply(saveSampledData, sampleDataList, seq_along(sampleDataList))

# Remove unnecessary data to save memory
rm(dataList, sampleDataList)

Pre-process Data

In this report, we use a package specifically for text mining - tm, which provides many built-in functions to perform basic text-mining tasks. First we read the sampled data from local disk and create the corpus. Before cleaning up, all texts are transformed to lower case in order to avoid case-sensitive issues in the later filtering tasks.

corpusData <- Corpus(DirSource(file.path(".", "data", "sampled")))

# convert all texts to lower case for later clean-up
corpusData <- tm_map(corpusData,  content_transformer(tolower))

# initial word count
wordCount.initial <- sum(rowSums(as.matrix(TermDocumentMatrix(corpusData))))

# initial number of unique words
wordCount.unique.initial <- length(rowSums(as.matrix(TermDocumentMatrix(corpusData))))

Profanity filtering

One of tasks is to perform profanity filtering. Instead of making a list from scratch, some resources are available on the Internet.vishalsurana has made a list of English-bad-words, and the text file of the list can be retrieved [here]]5.

downloadDataset("./data/bad_words/bad_word_list.txt", "https://goo.gl/To9w5B")
badWords <- readFile("./data/bad_words/bad_word_list.txt")

corpusData <- tm_map(corpusData, removeWords, badWords)

Remove unwanted texts using regular expression

Use regular expression to identify and remove the unwanted texts, e.g. email addresses and web addresses(including http and https).

# create the function to remove specific patterns
toRemove <- content_transformer(function(x, pattern) gsub(pattern, "", x))

# remove "@" related texts, e.g. email and @twitter_account
corpusData <- tm_map(corpusData, toRemove, "[^ ]*@[^ ]+", lazy = TRUE)

# remove URL like http: or https:
corpusData <- tm_map(corpusData, toRemove, "([ ]|^)https?:[^ ]+", lazy = TRUE)

Remove stopwords

Stop words are words that are commonly used in a language but contain less significance in meaning, such as articles(the, a, an). Filtering stop words or not depends on the system you are building. Some systems avoid filtering stop words in order to support phrase search see wikipedia. In this report, we decided to remove stop words because we have rather limited number of samples. In order to increase recognition rate, we only need words with more value in terms of meaning.

# remove stopwords
corpusData <- tm_map(corpusData, removeWords, stopwords("english"))

Clean corpus

Remove unwanted characters such as punctuations, numbers and white space.

# remove punctuations
corpusData <- tm_map(corpusData, removePunctuation)

# remove numbers
corpusData <- tm_map(corpusData, removeNumbers)

# remove whitespace
corpusData <- tm_map(corpusData, stripWhitespace)

# remove non-English texts
corpusData <- tm_map(corpusData, toRemove, "[^[:alpha:][:space:]]*", lazy = TRUE)

# convert to plain text document
corpusData <- tm_map(corpusData, PlainTextDocument)

Stem the corpus

Next we perform stemming in order to remove several word endings, e.g. “ing” and “es”. This will make a word more recognizable despite whether or not it may have various endings in the original text. After stemming, stemmed words would become incomplete. Then we perform stemCompletion to heuristically complete the words. The stemCompletion function might fail when working with tm_map, so a fix is needed to proceed the task.

# stem the data
corpusData_stemmed <- tm_map(corpusData, stemDocument, lazy = TRUE)

# perfrom stem completion using the original corpus
# the original 
# the following funtion is to fix the bug when using stemCompletion with tm_map
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary=dictionary, type="shortest")
  x <- paste(x, sep="", collapse=" ") 
  PlainTextDocument(stripWhitespace(x))
}

corpusData <- lapply(corpusData_stemmed, 
                             stemCompletion2, 
                             dictionary=corpusData)

Explore the data

Frequency of words

We look into the corpus and inspect the frequently used words in the corpus.

# Specify the source to be character vectors
corpusData <- Corpus(VectorSource(corpusData))

# create a term document matrix
tdm <- TermDocumentMatrix(corpusData)

# remove 90% of the terms that do not appear in all of the documents
tdm <- removeSparseTerms(tdm, 0.90)

# inspect frequent words
head(findFreqTerms(tdm, lowfreq = 1000), 40)
##  [1] "also"   "back"   "can"    "day"    "even"   "first"  "get"   
##  [8] "going"  "good"   "got"    "great"  "just"   "know"   "last"  
## [15] "like"   "love"   "make"   "much"   "new"    "now"    "one"   
## [22] "people" "really" "right"  "said"   "see"    "still"  "think" 
## [29] "time"   "today"  "two"    "want"   "way"    "well"   "will"  
## [36] "work"   "year"
# calculate the frequency
termFrequency <- rowSums(as.matrix(tdm))

# calculate the number of words after cleaning up
wordCount.cleaned <- sum(termFrequency)
wordCount.cleanedPercentage <- round((wordCount.initial - wordCount.cleaned) / wordCount.initial * 100,2)

# calculate the number of unique words after cleaning up
wordCount.unique.cleaned <- length(termFrequency)
wordCount.unique.cleanedPercentage <- round((wordCount.unique.initial - wordCount.unique.cleaned) / wordCount.unique.initial * 100,2)

# Print top 20 of the most frequent words
termFrequency <- sort(termFrequency, decreasing = T)
head(termFrequency, 20)
##   will   just   said    one   like    can    get   time    new   good 
##   3192   3071   3033   2891   2687   2533   2213   2185   1933   1844 
##    now   know    day   love people   back    see  first  going   make 
##   1833   1695   1672   1584   1539   1416   1359   1334   1319   1301
# freqTerms <- subset(termFrequency, termFrequency >= 1000)
freqTerms <- head(termFrequency, 40)
dfFreqTerms <- data.frame(term = names(freqTerms), freq = freqTerms)
frequencyPlot <- ggplot(dfFreqTerms, aes(x = reorder(term, freq), y = freq)) + 
                 geom_bar(stat = "identity") + 
                 xlab("Terms") + ylab("Count") + coord_flip()
frequencyPlot

Generate Word Cloud

We visualize the word frequency by generating a word cloud.

library(wordcloud)

colorSetting <- brewer.pal(6, "Dark2")
wordcloud(names(termFrequency), termFrequency, scale=c(5, .3), 50, random.order = F, colors = colorSetting)

Finding possible topics

Use LDA to find possible topics out of the corpus. However, it is difficult to find meaningful combinations. This is because the given dataset is not topic-based and many of the texts from Twitter and blogs are published in a informal format.

library(topicmodels)
lda <- LDA(t(tdm), k = 7)
(term <- terms(lda, 10))
##       Topic 1 Topic 2 Topic 3 Topic 4   Topic 5 Topic 6  Topic 7
##  [1,] "said"  "said"  "will"  "can"     "will"  "just"   "one"  
##  [2,] "just"  "will"  "much"  "day"     "just"  "like"   "can"  
##  [3,] "two"   "one"   "like"  "now"     "next"  "love"   "just" 
##  [4,] "one"   "new"   "get"   "tonight" "today" "thanks" "will" 
##  [5,] "day"   "year"  "just"  "people"  "said"  "get"    "like" 
##  [6,] "even"  "last"  "time"  "think"   "great" "good"   "time" 
##  [7,] "get"   "can"   "many"  "today"   "new"   "will"   "good" 
##  [8,] "years" "also"  "can"   "great"   "think" "back"   "make" 
##  [9,] "city"  "like"  "now"   "last"    "get"   "know"   "think"
## [10,] "good"  "state" "first" "see"     "good"  "one"    "know"

Summary of Exploratory Analysis

The initial sample data contains 8.1101310^{5} words and 107247 unique words. After cleaning up, the number of words reduced to 5.3453510^{5} and 54904 unrepeated words, which means that we have excluded 34.09% of unwanted words and 48.81% of unrepeated words.

We have an overview of the commonly used words from the word cloud and the word count diagram. However, it is not clear if the corpus is clean enough for making precise predictions.

To stem or not to stem

It needs to be noted that the stem and stemCompletion does not seem to make huge difference on the outcome while it required expensive computational power and time. This is due to the fact that, if stemCompletion is executed, the corpus is very much similar to the un-stemmed one. For example, ‘cities’ will be stemmed as ‘citi’ and then be restored as ‘cities’.

However, if stemCompletion is not performed, some words would become misspelled, e.g. both ‘city’ and ‘cities’ would become ‘citi’. This leads to another critical issue - misspelling. Considering the time constraints, we would have chosen proceed without stemming.

Misspelling

As mentioned earlier, misspelling is an issue that is hard to remedy. One possible approach can be replacing the words in the corpus with the most similar words in a discitonary list. However, we did not perform this task in this report.

N-Gram Word List

We use RWeka package to generate n-gram word list. In this report, we tokenized the corpus and made unigram, bigram, and trigram word lists.

library(RWeka)
finalTdm <- TermDocumentMatrix(finalCorpus) 

# Sets the default number of threads to use
options(mc.cores=1)

UnigramTokenizer <- function(data) NGramTokenizer(data, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(data) NGramTokenizer(data, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(data) NGramTokenizer(data, Weka_control(min = 3, max = 3))
QuadgramTokenizer <- function(data) NGramTokenizer(data, Weka_control(min = 4, max = 4))

finalTdm.unigram <- TermDocumentMatrix(finalCorpus, 
                    control = list(tokenize = UnigramTokenizer))
finalTdm.bigram <- TermDocumentMatrix(finalCorpus, 
                    control = list(tokenize = BigramTokenizer))
finalTdm.trigram <- TermDocumentMatrix(finalCorpus, 
                    control = list(tokenize = TrigramTokenizer))
finalTdm.quadgram <- TermDocumentMatrix(finalCorpus, 
                    control = list(tokenize = QuadgramTokenizer))

countSum <- function(tdm){
  tdm <- removeSparseTerms(tdm, 0.75)
  return(sort(rowSums(as.matrix(tdm)), decreasing=TRUE))
}

finalTdm.unigram <- countSum(finalTdm.unigram)
finalTdm.bigram <- countSum(finalTdm.bigram)
finalTdm.trigram <- countSum(finalTdm.trigram)
finalTdm.quadgram <- countSum(finalTdm.quadgram)

# top 20 unigram
head(finalTdm.unigram, 20)
##   will   just   said    one   like    can    get   time    new   good 
##   3192   3071   3033   2891   2687   2533   2213   2185   1933   1844 
##    now   know    day   love people   back    see  first  going   make 
##   1833   1695   1672   1584   1539   1416   1359   1334   1319   1301
# top 20 bigram
head(finalTdm.bigram, 20)
##       right now       last year        new york      last night 
##             266             199             195             159 
##     high school      first time       feel like       years ago 
##             141             137             135             133 
##       last week         can get       make sure looking forward 
##             119             117             117             110 
##     even though    good morning         can see        just got 
##             101              98              92              91 
##         one day  happy birthday       every day      looks like 
##              89              88              87              84
# top 20 trigram
head(finalTdm.trigram, 20)
##      happy mothers day            let us know          new york city 
##                     37                     26                     23 
##         beep beep beep         happy new year          cinco de mayo 
##                     20                     20                     19 
##         new york times       first time since          two years ago 
##                     19                     18                     17 
##                  u u u president barack obama looking forward seeing 
##                     17                     14                     12 
##     gov chris christie   pates fountain parks          just got back 
##                     11                     11                     10 
##           world war ii           let know can            rock n roll 
##                     10                      9                      9 
##        st louis county          two weeks ago 
##                      9                      9
# top 20 quadgram
head(finalTdm.quadgram, 20)
##          beep beep beep beep                      u u u u 
##                           19                            9 
## classic pates fountain parks        add boston add boston 
##                            8                            6 
##   amazon services llc amazon       happy mothers day moms 
##                            6                            6 
##       services llc amazon eu        boston add boston add 
##                            6                            5 
##          lovz lovz lovz lovz           please let us know 
##                            5                            5 
##            g fat g saturated            g fiber mg sodium 
##                            4                            4 
##                  ha ha ha ha         just finished mi run 
##                            4                            4 
##              let us know can            let us know think 
##                            4                            4 
##          lies lies lies lies        martin luther king jr 
##                            4                            4 
##                      o o o o          th street th street 
##                            4                            4
getGram <- function(word, n){
  return(unlist(strsplit(word, "[ ]"))[n])
}

convertToDf <- function(tdm, numGrams){
  df <- data.frame(frequency = tdm)
  for(i in numGrams:1){
    columnNames <- c(i,names(df))
    df <- cbind(as.matrix(lapply(names(tdm), getGram, n=i)), df)
    colnames(df) <- columnNames
  }
  return(df)
}

df.unigram <- convertToDf(finalTdm.unigram, 1)
saveRDS(df.unigram, file = "./data/final/unigram.RDS")
rm(df.unigram, finalTdm.unigram)

df.bigram <- convertToDf(finalTdm.bigram, 2)
saveRDS(df.bigram, file = "./data/final/bigram.RDS")
rm(df.bigram, finalTdm.bigram)

df.trigram <- convertToDf(finalTdm.trigram, 3)
saveRDS(df.trigram, file = "./data/final/trigram.RDS")
rm(df.trigram, finalTdm.trigram)

df.quadgram <- convertToDf(finalTdm.quadgram, 4)
saveRDS(df.quadgram, file = "./data/final/quadgram.RDS")
rm(df.quadgram, finalTdm.quadgram)