1. Executive Summary

This report attempts to fulfill the requirements of the Week 2 milestone for Coursera's Data Science Capstone class. The code and output show that the SwiftKey data has been downloaded and unzipped. Basic line-count and word-count statistics are generated, and the quanteda package is used to explore the database and report frequency-matrix data on word tokens and ngrams. The final section details further work that remains to be done.

2. Loading the data

The zip file location for the corpus has been provided by Coursera.

fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destFile <- "Coursera-SwiftKey.zip"
dataDir <- "final"

if (!file.exists(dataDir)) {
     
     cat("Downloading ZIP file from web.... \n")
     download.file(fileUrl, destfile = destFile, method="curl")
     
     unzip(destFile)
     print("unzip complete....")
     
     # Did the download and unzip work as expected ?
     if (!file.exists(dataDir)) 
          stop("ZIP file download and unzip process failed. Data directory not created.")
     
} else  cat("Using existing data directory: ", dataDir, "\n")
## Using existing data directory:  final

Text data has been provided in German, Finnish, Russian, and English. We will only be using the English data.

list.files("final", include.dirs = T)
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"

All the English text files have been concatenated into one large file called all.txt, and the data has been rearranged for easier handling.
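
For reference, a concatenation like that can be done with a few lines of readr code; the sketch below is illustrative only (the actual commands used to build all.txt are not reproduced here):

# Illustrative sketch only: stitch the three en_US files into one all.txt
library(readr)
enFiles <- list.files(file.path("final", "en_US"), pattern = "\\.txt$", full.names = TRUE)
allLines <- unlist(lapply(enFiles, read_lines))
write_lines(allLines, "all.txt")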

3. Exploratory analysis

How big is this data?

3.1 File sizes in Megabytes:

finfo <- file.info(list.files(file.path("final","en_US"),full.names=T, pattern="*.txt"))
fsizeMB <- finfo$size/(1024^2)
names(fsizeMB) <- basename(rownames(finfo))
fsizeMB
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##          200.4242          196.2775          159.3641 

This is a rather large corpus, requiring long processing times on my Mac if read as one big blob, and it presented a whole slew of problems I had never encountered before.

The first problem was that the news data was read just fine on my Mac, but not on the Windows machine that I had to migrate to later. I switched to the read_lines() function from the readr package, which gets around this problem.

library(readr)

twit <- read_lines("final/en_US/en_US.twitter.txt")
news <- read_lines("final/en_US/en_US.news.txt")
blog <- read_lines("final/en_US/en_US.blogs.txt")

3.2 Individual file line counts

lc <- c(length(blog), length(news), length(twit))
names(lc) <- c("blog", "news", "twit")
barplot(lc, col = c("pink", "green", "lightblue"), main="Line Count in Corpus")

The news file is larger in size, but the line count is higher for Twitter; we can guess that this is because tweets are short.
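
As a quick sanity check on that guess (a small sketch added for illustration, not part of the original output), we can compare the average line length across the three sources:

# Average line length in bytes per source; type = "bytes" sidesteps any
# stray non-UTF-8 characters in the raw files
avgLen <- c(blog = mean(nchar(blog, type = "bytes")),
            news = mean(nchar(news, type = "bytes")),
            twit = mean(nchar(twit, type = "bytes")))
round(avgLen, 1)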

3.3 Individual file word counts

To determine word counts, I could have written my own function, but decided to use off-the-shelf packages to get familiar with them, and started with the tm package. This worked fine until I used the TermDocumentMatrix() function on the news database: it never came back after 12-14 hours, and the R session died multiple times.
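
For reference, the abandoned tm flow looked roughly like the sketch below (reconstructed for illustration; the exact calls are not preserved):

# Illustrative reconstruction of the tm approach (not run)
library(tm)
news.corpus <- VCorpus(VectorSource(news))
# This step never completed on the news data and repeatedly killed the R session
news.tdm <- TermDocumentMatrix(news.corpus)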

So, I switched to the quanteda package, as recommended by mentors and others, for further analysis.

library(quanteda)

corpus.blog <- corpus(paste(iconv(blog, "UTF-8", "ASCII"),collapse = "\n"))
corpus.twit <- corpus(paste(iconv(twit, "UTF-8", "ASCII"),collapse = "\n"))
corpus.news <- corpus(paste(iconv(news, "UTF-8", "ASCII"),collapse = "\n"))

dfm.blog <- dfm(corpus.blog)
dfm.twit <- dfm(corpus.twit)
dfm.news <- dfm(corpus.news)

wc <- c(rowSums(dfm.blog), rowSums(dfm.news), rowSums(dfm.twit))
names(wc) <- c("blog","news","twit")
barplot(wc, col = c("pink", "green", "lightblue"), main="Word Count in Corpus")

wc
##     blog     news     twit 
## 24736471 33517462 35340301
cat("Total ", sum(wc)/(1024^2), " million words in the database.\n")
## Total  89.25842  million words in the database.

The quanteda package has a similar flow to the tm package, but it completed all the files without trouble.

4. Combining individual files into one large corpus

The next task was to combine all the data into one corpus. Since the goal of the capstone project is to predict words, we don't necessarily need to keep the sources of our training data separate.

Combining corpora is very easy in quanteda, using the "+" operator:

corpus.all <- corpus.blog + corpus.news + corpus.twit
dfm.all <- dfm(corpus.all)

# Get Word count from Corpus.
# Define function for reuse later.
getwc <- function (mydfm) {
  wc <- rowSums(mydfm)
  names(wc) <- c("blog", "news", "twit")
  wc
}

wc.all <- getwc(dfm.all)
wc.all
##     blog     news     twit 
## 24736471 33517462 35340301

The combined word counts match up exactly with the individual counts from earlier, so we can proceed to generating some interesting information.
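
As a quick programmatic confirmation (added for illustration; not part of the original output):

# Should report TRUE: combined dfm row sums equal the per-file word counts
all.equal(wc.all, wc)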

5. Interesting features of the data

We can answer some simple/fun/interesting/academic questions to demonstrate that processing in quanteda will be adequate for our purpose.

5.1 What are the top 20 words used in the database?

topfeatures.all <- topfeatures(dfm.all, 20)
topfeatures.all
##       .     the       ,      to       a     and      of       !       i 
## 5424403 3604592 3339277 2117364 1836267 1800544 1473181 1382555 1306250 
##      in       "     for      is     you    that      it      on       : 
## 1262582 1005690  884585  822494  776799  760898  699970  653700  576062 
##    with       ? 
##  550084  531829

Looks like we will need to clean out punctuation and single characters later.

5.2 Is there a lot of profanity in the database?

We can remove profane words. A file with profane words has already been downloaded locally.
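
If the list is not present, it could be fetched with something like the sketch below; the URL here is only a placeholder, not necessarily the source actually used:

# Placeholder sketch: download a one-word-per-line profanity list if missing
# (the URL below is an example placeholder, not the actual source used)
if (!file.exists("profanity_list.txt")) {
     download.file("https://example.com/bad-words.txt",
                   destfile = "profanity_list.txt", method = "curl")
}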

proflist <- scan("profanity_list.txt", what="")
dfm.noprofanity <- dfm(corpus.all, remove = c(proflist))
wc.noprofanity <- getwc(dfm.noprofanity)
wc.profanity <- wc.all - wc.noprofanity
wc.profanity
##   blog   news   twit 
##  26443   9150 137850

From this data, Twitter users appear to be more profane than users of the other media.
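
To make that comparison fairer, since the sources differ in size, we can normalize to profane words per million words (a small sketch added for illustration):

# Profane words per million words in each source
profanity.rate <- round(wc.profanity / wc.all * 1e6)
profanity.rate

Using the counts above, this works out to roughly 1,070 (blog), 270 (news), and 3,900 (twitter) profane words per million, which backs up the impression.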

5.3 Removing punctuation, numbers, and separators, what else bubbles up?

dfm.nopun <- dfm(corpus.all, remove_punct = T, remove_numbers = T, remove_separators = T, 
                 remove = c(proflist))
wc.nopun <- getwc(dfm.nopun)
topfeatures.nopun <- topfeatures(dfm.nopun, 20)
topfeatures.nopun
##     the      to       a     and      of       i      in     for      is 
## 3604592 2117364 1836267 1800544 1473181 1306250 1262582  884585  822494 
##     you    that      it      on    with      my      na     was      at 
##  776799  760898  699970  653700  550084  486497  476976  466440  459849 
##      be    have 
##  423970  412184

5.4 How about additionally removing stop words?

dfm.nostop <- dfm(corpus.all, remove_punct = T, remove_numbers = T,remove_separators = T, 
                  remove = c(stopwords("english"), proflist))
topfeatures.nostop <- topfeatures(dfm.nostop, 20)
wc.nostop <- getwc(dfm.nostop)
topfeatures.nostop
##     na   just   said    one   like    can    get   time    new   good 
## 476976 247627 242483 218769 213320 191906 185902 167965 156514 150006 
##    now    day   love   know people     go   back    see  great  first 
## 146297 141700 137087 128968 119301 115294 114347 111689 106293 103717
textplot_wordcloud(dfm.nostop, min.freq=100000, random.order=F, rot.per=0.25, colors=RColorBrewer::brewer.pal(8,"Dark2"))

We need to clean the data some more, like removing “na”, but that’s a task for another milestone.
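
When we get there, one possible approach (sketch only) is to drop such tokens directly from the dfm:

# Possible cleanup for a later milestone: drop the stray "na" token
dfm.nostop.clean <- dfm_remove(dfm.nostop, pattern = "na")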

5.5 Summary of word counts

barcolumns <- c("All words", "No profanity", "No punctuations", "No stop words")
wc.sums <- c(sum(wc.all), sum(wc.noprofanity), sum(wc.nopun), sum(wc.nostop))
wc.sums <- wc.sums/1e6
names(wc.sums) <- barcolumns
barplot(wc.sums, col = c("red","blue","green","yellow"), 
        main="Corpus word count after cleaning", ylab = "in millions")

6. Creating ngrams

We can create ngrams, which are frequency matrices of commonly occurring word sequences in our corpus. Once these matrices are available, our final capstone implementation can use them to look up common word sequences and predict the next word. The quanteda package makes this extremely easy.

# Memory & runtime issue: have to save ngram2.dfm and load it for knitr
#ngram2 <- tokens(corpus.all, what="word", ngrams=2L, remove_punct=T)
#ngram2.dfm <- dfm(ngram2)
load("ngram2.dfm")
barplot(topfeatures(ngram2.dfm, 20)/1000, las=2, main ="Histogram of Top 2-grams in database", ylab = "(in thousands)", col="green")

rm(ngram2.dfm)

# Memory & runtime issue: have to save ngram3.dfm and load it for knitr
#ngram3 <- tokens(corpus.all, what="word", ngrams=3L, remove_punct=T)
#ngram3.dfm <- dfm(ngram3)
load("ngram3.dfm")
barplot(topfeatures(ngram3.dfm, 20)/1000, las=2, main ="Histogram of Top 3-grams in database", ylab = "(in thousands)", col="green")

rm(ngram3.dfm)

At the time of completing this assignment, creating ngram4 runs out of memory on my computer. I will figure out a solution in the future.
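
One workaround I plan to try (sketched below, untested on the full data) is to build the 4-grams from a random sample of the corpus instead of the whole thing:

# Untested sketch: build 4-grams from a 10% sample of the lines to fit in memory
set.seed(1234)
all.lines <- c(blog, news, twit)
sample.lines <- sample(all.lines, length(all.lines) %/% 10)
corpus.sample <- corpus(paste(iconv(sample.lines, "UTF-8", "ASCII"), collapse = "\n"))
ngram4 <- tokens(corpus.sample, what = "word", ngrams = 4L, remove_punct = TRUE)
ngram4.dfm <- dfm(ngram4)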

7. Conclusion and future work