1. Executive Summary

SwiftKey develops smart prediction software that makes typing on smartphones and other mobile devices easier. This project analyzes their large, publicly available corpus of Twitter, news, and blog text for suitability in developing word prediction algorithms. The key challenge was finding a software and hardware combination capable of completing the basic data processing tasks needed to generate ngrams (sequences of word tokens), on which a word prediction algorithm can then be applied.

This report fulfills the requirements of the Week 2 milestone for Johns Hopkins University's Data Science Capstone class. The code and output demonstrate that the SwiftKey data has been downloaded and unzipped, that basic line count and word count statistics have been generated, and that the quanteda Natural Language Processing (NLP) package is sufficient to explore the corpus and report frequency matrix data on word tokens and ngrams. The final section details the further work required to complete word prediction.

2. Loading the data

The zip file location for the corpus has been provided by the course instructors.

fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destFile <- "Coursera-SwiftKey.zip"
dataDir <- "final"

if (!file.exists(dataDir)) {
     
     cat("Downloading ZIP file from web.... \n")
     download.file(fileUrl, destfile = destFile, method="curl")
     
     unzip(destFile)
     print("unzip complete....")
     
     # Did the download and unzip work as expected?
     if (!file.exists(dataDir))
          stop("ZIP file download and unzip process failed: data directory not created.")
     
} else  cat("Using existing data directory: ", dataDir, "\n")
## Using existing data directory:  final

Text data has been provided in German, Finnish, Russian, and English. We will only be using the English data.

list.files("final", include.dirs = T)
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"

3. Exploratory analysis

How big is this data?

3.1 File sizes in Megabytes:

finfo <- file.info(list.files(file.path("final","en_US"),full.names=T, pattern="*.txt"))
fsizeMB <- finfo$size/(1024^2)
names(fsizeMB) <- basename(rownames(finfo))
fsizeMB
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##          200.4242          196.2775          159.3641 

This is a rather large corpus (> 500 MB), requiring long processing times on an 8 GB RAM Mac if read in as one big blob. It presents a whole slew of problems related to memory size, CPU runtime, and non-ASCII characters, in addition to needing the NLP features.

Switching from a personal MacBook Pro to a more capable Google Compute Engine machine, and using the quanteda NLP package (instead of tm), resolved these issues.

library(readr)     # provides read_lines()
library(quanteda)  # NLP package used in the sections below

twit <- read_lines("final/en_US/en_US.twitter.txt")
news <- read_lines("final/en_US/en_US.news.txt")
blog <- read_lines("final/en_US/en_US.blogs.txt")
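
If memory were still a constraint, a common mitigation (not used in this milestone, since a larger machine was available) would be to work with a random sample of lines before building the corpus. A minimal sketch, with an arbitrary 10% sampling fraction:

# Sketch only: sample ~10% of each source to reduce memory use
set.seed(1234)                                          # make the sample reproducible
twit.sample <- sample(twit, round(length(twit) * 0.1))
news.sample <- sample(news, round(length(news) * 0.1))
blog.sample <- sample(blog, round(length(blog) * 0.1))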

3.2 Individual file Line counts

lc <- c(length(blog), length(news), length(twit))
names(lc) <- c("blog", "news", "twit")
barplot(lc/1000, col = c("pink", "green", "lightblue"), main="Line Count in Corpus", ylab = "(in thousands)")

The news file is larger in size, but the line count is higher for Twitter, likely because of Twitter's character limit.

3.3 Individual file Word count

To determine word counts, I could have written my own function, but decided to use NLP packages to get familiar with them.
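
For reference, a hand-rolled count might have looked like the rough sketch below, simply splitting each line on whitespace (this will differ slightly from the NLP tokenizer's counts):

# Sketch only: rough whitespace-based word count for one source
sum(lengths(strsplit(blog, "\\s+")))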

# Read the text data into NLP package
corpus.blog <- corpus(paste(iconv(blog, "UTF-8", "ASCII"),collapse = "\n"))
corpus.twit <- corpus(paste(iconv(twit, "UTF-8", "ASCII"),collapse = "\n"))
corpus.news <- corpus(paste(iconv(news, "UTF-8", "ASCII"),collapse = "\n"))

# Generate document frequency matrices. Looong run-times on this huge corpus as text is split into words
dfm.blog <- dfm(corpus.blog)
dfm.twit <- dfm(corpus.twit)
dfm.news <- dfm(corpus.news)

wc <- c(rowSums(dfm.blog), rowSums(dfm.news), rowSums(dfm.twit))
names(wc) <- c("blog","news","twit")
barplot(wc/1000^2, col = c("pink", "green", "lightblue"), main="Word Count in Corpus", ylab = "(in millions)")

wc
##     blog     news     twit 
## 24736471 33517462 35340301
cat("Total ", sum(wc)/(1000^2), " million words in the database.\n")
## Total  93.59423  million words in the database.

4. Combining individual files into one large corpus

The next task was to combine all the data into one corpus. Since the goal of the capstone project is to predict words, we don't necessarily need to separate out the sources of our training data.

Combining corpora is very easy in quanteda with the "+" operator:

corpus.all <- corpus.blog + corpus.news + corpus.twit
dfm.all <- dfm(corpus.all)

# Get Word count from Corpus.
# Define function for reuse later.
getwc <- function (mydfm) {
  wc <- rowSums(mydfm);
  names(wc) <- c("blog", "news", "twit");
  wc
}

wc.all <- getwc(dfm.all)
wc.all
##     blog     news     twit 
## 24736471 33517462 35340301

The combined word counts match the individual counts from earlier exactly, so we can proceed to exploring the data.

5. Interesting features of the data

We can answer some simple/fun/interesting/academic questions to demonstrate that processing in quanteda will be adequate for our purpose.

5.1 What are the top 20 words used in the database?

topfeatures.all <- topfeatures(dfm.all, 20)
topfeatures.all
##       .     the       ,      to       a     and      of       !       i 
## 5424403 3604592 3339277 2117364 1836267 1800544 1473181 1382555 1306250 
##      in       "     for      is     you    that      it      on       : 
## 1262582 1005690  884585  822494  776799  760898  699970  653700  576062 
##    with       ? 
##  550084  531829

It looks like we will need to clean out punctuation and single-character tokens later.
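
As a preview of that cleanup, quanteda's dfm_select() can drop very short features; a minimal sketch (the min_nchar threshold of 2 is an arbitrary choice, and the argument assumes a reasonably recent quanteda version):

# Sketch only: keep features with at least 2 characters, dropping single-character tokens
dfm.min2 <- dfm_select(dfm.all, min_nchar = 2)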

5.2 Is there a lot of profanity in the database?

We can remove profane words. A file with profane words has already been downloaded locally.

proflist <- scan("profanity_list.txt", what="")
dfm.noprofanity <- dfm(corpus.all, remove = c(proflist))
wc.noprofanity <- getwc(dfm.noprofanity)
wc.profanity <- wc.all - wc.noprofanity
wc.profanity
##   blog   news   twit 
##  26443   9150 137850

Based on these counts, Twitter users appear to be more profane than users of the other sources.
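
Since the three sources have different total word counts, a fairer comparison is the profanity rate per million words; a quick sketch using the objects computed above:

# Sketch only: profane words per million words for each source
round(wc.profanity / wc.all * 1e6)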

5.3 Removing punctuation/numbers/hyphens, what else bubbles up?

dfm.nopun <- dfm(corpus.all, remove_punct = T, remove_numbers = T, remove_separators = T, 
                 remove = c(proflist))
wc.nopun <- getwc(dfm.nopun)
topfeatures.nopun <- topfeatures(dfm.nopun, 20)
topfeatures.nopun
##     the      to       a     and      of       i      in     for      is 
## 3604592 2117364 1836267 1800544 1473181 1306250 1262582  884585  822494 
##     you    that      it      on    with      my      na     was      at 
##  776799  760898  699970  653700  550084  486497  476976  466440  459849 
##      be    have 
##  423970  412184

5.4 How about additionally removing stop words?

dfm.nostop <- dfm(corpus.all, remove_punct = T, remove_numbers = T,remove_separators = T, 
                  remove = c(stopwords("english"), proflist))
topfeatures.nostop <- topfeatures(dfm.nostop, 20)
wc.nostop <- getwc(dfm.nostop)
topfeatures.nostop
##     na   just   said    one   like    can    get   time    new   good 
## 476976 247627 242483 218769 213320 191906 185902 167965 156514 150006 
##    now    day   love   know people     go   back    see  great  first 
## 146297 141700 137087 128968 119301 115294 114347 111689 106293 103717
textplot_wordcloud(dfm.nostop, min.freq=50000, random.order=F, rot.per=0.25, colors=RColorBrewer::brewer.pal(8,"Dark2"))

We need to clean the data some more, for example by removing “na”, but that is a task for another milestone.
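
When that cleanup happens, quanteda's dfm_remove() is one way to drop such stray tokens; a minimal sketch (the token list here is purely illustrative):

# Sketch only: drop leftover junk tokens such as "na" from the dfm
dfm.cleaner <- dfm_remove(dfm.nostop, pattern = c("na"))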

5.5 Summary of word counts

barcolumns <- c("All words", "No profanity", "No punctuation", "No stop words")
wc.sums <- c(sum(wc.all), sum(wc.noprofanity), sum(wc.nopun), sum(wc.nostop))
wc.sums <- wc.sums/1000^2   # scale to millions, matching the ylab
names(wc.sums) <- barcolumns
barplot(wc.sums, col = c("red","blue","green","yellow"), 
        main="Corpus word count after cleaning", ylab = "in millions")

6. Creating ngrams

We can create ngrams, which are frequency matrices of word sequences in our corpus. Once these matrices are available, the final implementation of the project can look up ngram prefixes while the user is typing (a minimal sketch of such a lookup appears at the end of this section). The quanteda package makes this extremely easy.

# Memory & runtime issue: have to save ngram2.dfm and load it for knitr
#ngram2 <- tokens(corpus.all, what="word", ngrams=2L, remove_punct=T)
#ngram2.dfm <- dfm(ngram2)
ngram2.dfm <- readRDS("ngram2.trim.RDS")
barplot(topfeatures(ngram2.dfm, 20)/1000, las=2, main ="Histogram of Top 2-grams in database", ylab = "(in thousands)", col="green")

rm(ngram2.dfm)

# Memory & runtime issue: have to save ngram3.dfm and load it for knitr
#ngram3 <- tokens(corpus.all, what="word", ngrams=3L, remove_punct=T)
#ngram3.dfm <- dfm(ngram3)
ngram3.dfm <- readRDS("ngram3.trim.RDS")
barplot(topfeatures(ngram3.dfm, 20)/1000, las=2, main ="Histogram of Top 3-grams in database", ylab = "(in thousands)", col="green")

rm(ngram3.dfm)

# Memory & runtime issue: have to save ngram4.dfm and load it for knitr
#ngram4 <- tokens(corpus.all, what="word", ngrams=4L, remove_punct=T)
#ngram4.dfm <- dfm(ngram4)
ngram4.dfm <- readRDS("ngram4.trim.RDS")
par(mar = c(9,4,4,2) + 0.1)
barplot(topfeatures(ngram4.dfm, 20)/1000, las=2, main ="Histogram of Top 4-grams in database", ylab = "(in thousands)", col="green")

rm(ngram4.dfm)
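
To illustrate how these ngram tables could feed the eventual predictor, here is a minimal sketch, not part of the milestone code, that looks up the most frequent continuations of a typed two-word prefix in a 3-gram dfm such as ngram3.dfm above (it assumes the dfm is still loaded, i.e., it would run before the rm() call). It relies on quanteda's default behavior of joining ngram words with "_"; the function name and example prefix are hypothetical.

# Sketch only: return the top n most frequent 3-grams starting with a 2-word prefix
predict_next <- function(prefix, ngram.dfm, n = 5) {
     # quanteda joins ngram words with "_" by default, e.g. "thanks_for_the"
     key <- paste0(gsub(" ", "_", tolower(prefix)), "_")
     counts <- colSums(ngram.dfm)                      # total count of each ngram
     hits <- counts[startsWith(names(counts), key)]    # ngrams beginning with the prefix
     head(sort(hits, decreasing = TRUE), n)
}

# Example usage (hypothetical): predict_next("thanks for", ngram3.dfm)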

7. Conclusion and future work