Summary

This is a milestone report for the Coursera Data Science Specialization Capstone project (March-May 2016). Students were given three text files, from blogs, news media and Twitter, to analyze.

This report shows what I’ve done and found so far, and what I think I must, should or could do next. The full text files were loaded and measured. The samples this report is built on are only 1% of each of the three full files, to keep the processing time manageable. The most pressing outstanding tokenization work is cleaning the Twitter file more thoroughly while retaining sentence boundaries. The next step is to create training and test data sets, and then build the first version of the prediction models. Most of the work-in-progress code is included in the appendix.

Done

  1. Downloaded the Coursera-SwiftKey dataset
    • Only looked at the en_US files, not the other languages
    • Had to read the news file in binary mode to get all the lines
  2. Measured the size and counts of the full files:
File Size Rows Words UniqueWords Characters AvgWordsPerLine AvgWordLength
en_US.blogs.txt 248.5 Mb 899288 37546246 409829 206824505 41.75108 5.508527
en_US.news.txt 249.6 Mb 1010242 34762395 384992 203223159 34.40997 5.846063
en_US.twitter.txt 301.4 Mb 2360148 30093410 442624 162096241 12.75065 5.386437
  3. Did some preprocessing with regex
    • Removed non-ASCII characters
    • Removed hashtags, usernames, urls
    • Removed punctuation without removing apostrophes and in-word dashes (which my tm package version seemed to do)
    • Tried to split texts into sentences using qdap::sent_detect(), but no success (way too many mistakes)
  4. Used tm::Corpus(), VectorSource() and tm_map() to further improve the data quality
    • Lowercased it all
    • Did profanity filtering by removing (most of the) words found in a not-too-strict and not-too-tolerant list kindly hosted by GitHub user ryanlewis
    • Removed numbers, but noted that this also removes e.g. 9th, a simplification with a cost that I may have to revisit
    • Tried word stemming, but decided not to do it for now (saw too many mistakes)
    • Tried stopword removal, but decided not to do that either, as I am uncertain of the consequences for later prediction quality
    • Stripped unnecessary white space
  5. Measured the size and counts of the sample objects in the Corpus:
Sample Size Rows Words UniqueWords Characters AvgWordsPerLine AvgWordLength
blog 2.5 Mb 8992 382155 29654 2098440 42.49944 5.491070
news 2.4 Mb 10102 340993 31204 2008911 33.75500 5.891356
twtr 3 Mb 23601 295027 23755 1626481 12.50061 5.512990
  6. Tokenized the samples into 1, 2 and 3 word strings (uni-, bi- and trigrams)
    • First I made a quanteda corpus of the tm Corpus, because I could not get either NLP or RWeka to do a correct job on my setup. Quanteda does a better job with intra-word punctuation (RWeka wrong: “I don t”, quanteda correct: “I don’t”; this may well be my incorrect usage)
    • Then created the n-grams on the corpus with dfm()
  7. Inspected the n-grams
    • Tested removing sparsely occurring terms, but decided to leave that code unused for now, until I understand the memory handling issues
    • Looked at and plotted cumulative distributions and the most frequent terms
    • Spent some time with quanteda’s topfeatures() and kwic() looking at different words (a short example follows below)

The raw code for most of these steps (and the plots below) is in the appendix.
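
That quick inspection is not shown in the appendix, so here is a minimal example of the kind of calls I used (assuming the qcorp corpus and the unigram dfm built in the appendix; “love” is just an arbitrary query word):

topfeatures(unigram, 20)         # the 20 most frequent unigrams across the samples
kwic(qcorp, "love", window = 3)  # keyword-in-context view of an arbitrary word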

Findings

What I’ve found so far in the exploratory data analysis is as expected: a hard-to-analyze “Twitter language”, quite different from the blog and news texts. Blogs and news display a wider vocabulary. In all three documents we can observe Zipf’s law: a few words occur very frequently, while a multitude of other words occur very rarely. This is especially true for single words (due in part to stopwords), but it is also visible for bigrams. For trigrams the effect is much weaker.
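
As a rough visual check of Zipf’s law, a minimal sketch (assuming the unigram dfm built in the appendix, with term frequencies summed over the three documents) is to plot term rank against frequency on log-log axes, where the relationship should be roughly linear:

# Rough Zipf check: term rank vs. frequency on log-log axes (should be ~linear)
zipf.freq <- sort(colSums(as.matrix(unigram)), decreasing=TRUE)
zipf.df <- data.frame(Rank=seq_along(zipf.freq), Freq=as.numeric(zipf.freq))
ggplot(zipf.df, aes(x=Rank, y=Freq)) +
    geom_line() +
    scale_x_log10() + scale_y_log10() +
    theme_bw() +
    ggtitle("Unigram rank vs. frequency (log-log)")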

These cumulative distributions combined show, for example, that we need only 105 words to cover 50% of the single-word occurrences in the samples. Number of terms needed for 50, 64, 90, 95 and 100% coverage:

Coverage Unigram Bigram Trigram
50% 105 21892 142989
64% 388 59776 196175
90% 5906 158550 294949
95% 12318 177545 313944
Full 29561 196539 332938

Plotting the most frequent terms:

We see that the vocabularies in the three documents complement each other. The most frequent trigrams differ between the three.

To do

Must
Search wide and deep for information on all the topics below, and develop what’s necessary.

  1. Rewrite to split samples into training and test data sets
  2. Build a basic Markov chain n-gram model to predict the next word (a rough sketch of the idea follows after this list)
  3. Develop a simple model to handle situations where the user input is not observed in the data, probably some efficient form of smoothing; maybe peek into the Coursera/Stanford NLP course and use a Good-Turing model
  4. Improve the treatment of unseen n-grams by building a Katz back-off model
  5. Evaluate the model on the test data set by checking how many next words it predicts correctly, and also measure RAM consumption and runtime. Will look into object.size(), Rprof() and gc() as recommended in our Task 3 description
  6. Build the Shiny app and publish it as soon as possible. The input should be truncated to the last few words, which are then used to subset the data against which the prediction is applied
  7. Determine if I need to separate sentences, remove stopwords and/or revisit other preprocessing steps (like sparse items removal) to improve efficiency
  8. Iterate, finding the balance between annoying response time and silly predictions
  9. Prepare a 5-page slide deck and publish it
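
To make items 2-4 and 6 a bit more concrete, here is a minimal sketch of the lookup I have in mind. Everything in it is hypothetical: predict_next is a helper I have not written yet, and unigrams, bigrams and trigrams are assumed to be named frequency vectors (space-separated n-grams as names, counts as values) built from the training sample only. It is a plain frequency-based back-off, without the discounting that Good-Turing or Katz back-off would add.

# Hypothetical sketch of a back-off next-word lookup (not yet part of the project code).
# Assumes unigrams/bigrams/trigrams are named count vectors from a training set, e.g.
#   trigrams <- colSums(as.matrix(dfm(training.corpus, ngrams=3, concatenator=" ")))
predict_next <- function(input, unigrams, bigrams, trigrams, n=3) {
    words <- tail(unlist(strsplit(tolower(input), "\\s+")), 2)  # keep only the last two words
    # 1. Try trigrams starting with the last two words
    if (length(words) == 2) {
        hits <- trigrams[grepl(paste0("^", words[1], " ", words[2], " "), names(trigrams))]
        if (length(hits) > 0) {
            top <- names(sort(hits, decreasing=TRUE))[seq_len(min(n, length(hits)))]
            return(sapply(strsplit(top, " "), `[`, 3))
        }
    }
    # 2. Back off to bigrams starting with the last word
    hits <- bigrams[grepl(paste0("^", tail(words, 1), " "), names(bigrams))]
    if (length(hits) > 0) {
        top <- names(sort(hits, decreasing=TRUE))[seq_len(min(n, length(hits)))]
        return(sapply(strsplit(top, " "), `[`, 2))
    }
    # 3. Fall back to the overall most frequent words
    names(sort(unigrams, decreasing=TRUE))[seq_len(min(n, length(unigrams)))]
}

# Usage idea, including the evaluation hooks mentioned in item 5:
# predict_next("i want to", unigrams, bigrams, trigrams)
# object.size(trigrams); system.time(predict_next("i want to", unigrams, bigrams, trigrams))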

Other must-do steps? Please leave a comment when you review this, my dear data science peer.

Should
1. Clean up the preprocessing code, refactor to functions, skip using the tm package
2. Look for better models, fixes and tricks than the ones suggested in the task descriptions
3. The profanity filtering currently removes single words; it should instead remove whole sentences, so that prediction results are not distorted (a sketch follows below)
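
A minimal sketch of that change, assuming badwords ends up as a plain character vector of profane words (as in the appendix) and that the filtering is moved to the sample data frames before the corpus is created:

# Drop whole lines containing any profane word, instead of blanking single words
badpattern <- paste0("\\b(", paste(badwords, collapse="|"), ")\\b")
blog <- blog[!grepl(badpattern, blog$Text, ignore.case=TRUE), , drop=FALSE]
news <- news[!grepl(badpattern, news$Text, ignore.case=TRUE), , drop=FALSE]
twtr <- twtr[!grepl(badpattern, twtr$Text, ignore.case=TRUE), , drop=FALSE]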

Could
1. Try to find other corpora on the web to get better datasets
2. Visualize more, maybe get inspired by ISOVIS’ collection

Appendix: Raw code

The code is included here at the end for your convenience, in case you want to dig deeper into the why and how.

knitr::opts_chunk$set(fig.width=9.8, dpi=150, 
                      echo=FALSE, warning=FALSE, message=FALSE)
library(quanteda) # ntype, corpus
library(tm) # tm_map, Corpus, tdm, dtm
library(stringi) # stri_count_words
library(knitr) # kable
library(ggplot2) # ggplot
library(grid) #unit
library(reshape2) #melt
library(gridExtra) #grid.arrange()
#############   Download and unzip data if not already done    ###############

if (!file.exists("./data")) { # Create a data folder if needed
    dir.create("./data") 
} 

url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip?accessType=DOWNLOAD"

destfile = file.path( "./data" , "Coursera-SwiftKey.zip" )

if (!file.exists(destfile)) { # Download zipped file if it is not already done
    download.file( url = url, 
                   destfile = destfile, 
                   mode = "wb") # Windows. Other OS may use method=curl
} 

datafolder <- file.path("./data" , "final")

if (!file.exists(datafolder)) { # Unzip it if is not already done
    unzip(destfile, exdir='./data')  
}

##############   Read all rows from original files and create sample files   #################

datasubfolder <- file.path(datafolder, "en_US")

blogfile <- file.path(datasubfolder, "en_US.blogs.txt")
newsfile <- file.path(datasubfolder, "en_US.news.txt")
twitterfile <- file.path(datasubfolder, "en_US.twitter.txt")


blogsamplefile <- file.path(datasubfolder, "blog.Rdata")

if (!file.exists(blogsamplefile)) {
    blog.full <- readLines(blogfile, encoding="UTF-8", skipNul=TRUE)
    blogrows <- as.integer(length(blog.full))  # 899288L used both in sampling and in summary
    # Other key measures for summary of the file
    blog.fullwords <- stri_count_words(blog.full) 
    blog.fullsize <- format(object.size(blog.full), units="Mb")
    blog.fullsumwords <- sum(blog.fullwords)
    blog.uniquewords <- ntype(toLower(toString(blog.full)), removePunct=TRUE) #quanteda package 
    blog.fullsumnchar <- sum(nchar(blog.full))
    blog.fullavgwords <- mean(blog.fullwords)
    blog.fullwordlength <- blog.fullsumnchar / blog.fullsumwords
    
    set.seed(899)
    blog <- as.data.frame(blog.full[sample(blogrows, blogrows * 0.01, replace=FALSE)])
    rm(blog.full)
    blog[,1] <- as.character(blog[,1])
    colnames(blog) <- "Text"
    save(blog, file=blogsamplefile) 
} else {
    load(blogsamplefile, envir=.GlobalEnv)
}


newssamplefile <- file.path(datasubfolder, "news.Rdata")

# The news data are more troublesome. Had to use binary here (read_lines {readr} also worked).
# With readLines + write.table the file was cut at 77259 rows (it has > 1M)

if (!file.exists(newssamplefile)) {
    con <- file(newsfile, open="rb")
    news.full <- readLines(con, encoding="UTF-8", skipNul=TRUE)
    close(con)
    newsrows <- as.integer(length(news.full))  # 1010242L
    news.fullwords = stri_count_words(news.full)
    news.fullsize <- format(object.size(news.full), units="Mb")
    news.fullsumwords <- sum(news.fullwords)
    news.uniquewords <- ntype(toLower(toString(news.full)), removePunct=TRUE) #quanteda package 
    news.fullsumnchar <- sum(nchar(news.full))
    news.fullavgwords <- mean(news.fullwords)
    news.fullwordlength <- news.fullsumnchar / news.fullsumwords
    
    set.seed(1010)
    news <- as.data.frame(news.full[sample(newsrows, newsrows * 0.01, replace=FALSE)])
    rm(news.full)
    news[,1] <- as.character(news[,1])
    colnames(news) <- "Text"
    save(news, file=newssamplefile)
} else {
    load(newssamplefile, envir=.GlobalEnv)
}


twittersamplefile <- file.path(datasubfolder, "twtr.Rdata")

if (!file.exists(twittersamplefile)) {
    twtr.full <- readLines(twitterfile, encoding="UTF-8", skipNul=TRUE)
    twitterrows <- as.integer(length(twtr.full))  # 2360148L
    twtr.fullwords <- stri_count_words(twtr.full)
    twtr.fullsize <- format(object.size(twtr.full), units="Mb")
    twtr.fullsumwords <- sum(twtr.fullwords)
    twtr.uniquewords <- ntype(toLower(toString(twtr.full)), removePunct=TRUE) #quanteda package 
    twtr.fullsumnchar <- sum(nchar(twtr.full))
    twtr.fullavgwords <- mean(twtr.fullwords)
    twtr.fullwordlength <- twtr.fullsumnchar / twtr.fullsumwords
    
    set.seed(2360)
    twtr <- as.data.frame(twtr.full[sample(twitterrows, twitterrows * 0.01, replace=FALSE)])
    rm(twtr.full)
    twtr[,1] <- as.character(twtr[,1])
    colnames(twtr) <- "Text"
    save(twtr, file=twittersamplefile)
} else {
    load(twittersamplefile, envir=.GlobalEnv)
}
##############        Summarize full file key sizes           ##################
fullfilesummaryfile <- file.path(datasubfolder, "fullfilesummary.Rdata")

if (!file.exists(fullfilesummaryfile)) {
    fullfilesummary <- data.frame(
        File = c("en_US.blog.txt","en_US.news.txt","en_US.twitter.txt")
        ,Size = c(blog.fullsize, news.fullsize, twtr.fullsize) 
        ,Rows = c(blogrows, newsrows, twitterrows) 
        ,Words = c(blog.fullsumwords, news.fullsumwords, twtr.fullsumwords)
        ,UniqueWords = c(blog.uniquewords, news.uniquewords, twtr.uniquewords)
        ,Characters = c(blog.fullsumnchar, news.fullsumnchar, twtr.fullsumnchar) 
        ,AvgWordsPerLine = c(blog.fullavgwords, news.fullavgwords, twtr.fullavgwords)
        ,AvgWordLength = c(blog.fullwordlength, news.fullwordlength, twtr.fullwordlength)
    )
    save(fullfilesummary, file=fullfilesummaryfile)
} else {
    load(fullfilesummaryfile, envir=.GlobalEnv)
}
##############     Preprocess texts with regex     ##############

#Remove non-ASCII characters like <f0><U+009F>
blog$Text <-  gsub("[^\x20-\x7E]", "", blog$Text) 
news$Text <-  gsub("[^\x20-\x7E]", "", news$Text) 
twtr$Text <-  gsub("[^\x20-\x7E]", "", twtr$Text) 

#Prevent & from being removed as punctuation
blog$Text <- gsub("&", " and ",blog$Text)
news$Text <- gsub("&", " and ",news$Text)
twtr$Text <- gsub("&", " and ",twtr$Text)

#Remove hash tags, user names, urls
twtr$Text <- gsub("#\\w+", " ", twtr$Text)
twtr$Text <- gsub("@\\w+", " ", twtr$Text)
twtr$Text <- gsub("http\\w+", " ", twtr$Text)
twtr$Text <- gsub("www\\w+", " ", twtr$Text)
blog$Text <- gsub("http\\w+", " ", blog$Text)
blog$Text <- gsub("www\\w+", " ", blog$Text)
news$Text <- gsub("http\\w+", " ", news$Text)
news$Text <- gsub("www\\w+", " ", news$Text)

#Split texts into sentences using qdap - too many mistakes
# tmp <- as.data.frame(sent_detect(blog$Text, incomplete.sub = F), stringsAsFactors = FALSE)
# colnames(tmp) <- "Text"
# blog <- tmp
# tmp <- as.data.frame(sent_detect(news$Text, incomplete.sub = F), stringsAsFactors = FALSE)
# colnames(tmp) <- "Text"
# news <- tmp
# tmp <- as.data.frame(sent_detect(twtr$Text, incomplete.sub = T, rm.bracket = F), stringsAsFactors = FALSE)
# colnames(tmp) <- "Text"
# twtr <- tmp
# rm(tmp)

#Remove punctuation without removing apostrophes and in-word dashes (which my tm package version does)
blog$Text <- gsub("[^[:alnum:]['-]", " ",blog$Text)
news$Text <- gsub("[^[:alnum:]['-]", " ",news$Text)
twtr$Text <- gsub("[^[:alnum:]['-]", " ",twtr$Text)

#To do: 
# Refine further, eg. replace " u s " with " U.S. "
blog$Text <- gsub(" [Uu] [Ss] ", " U.S. ",blog$Text)
news$Text <- gsub(" [Uu] [Ss] ", " U.S. ",news$Text)
twtr$Text <- gsub(" [Uu] [Ss] ", " U.S. ",twtr$Text)

#To do: Add extra treatment of misplaced apostrophes
#To do: Compress into a function
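# A possible shape for that function (hedged sketch only, re-using the regexes above;
# the hashtag/user-name rules were only applied to the twitter sample above, but are
# harmless for the other two):
# clean_text <- function(x) {
#     x <- gsub("[^\x20-\x7E]", "", x)                  # drop non-ASCII characters
#     x <- gsub("&", " and ", x)                        # keep "&" as a word
#     x <- gsub("#\\w+|@\\w+|http\\w+|www\\w+", " ", x) # hashtags, user names, urls
#     x <- gsub("[^[:alnum:]'-]", " ", x)               # punctuation except ' and -
#     gsub(" [Uu] [Ss] ", " U.S. ", x)
# }
# blog$Text <- clean_text(blog$Text)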
##############   Create corpus and perform further cleaning using {tm}   #################
corp <- Corpus(VectorSource(c(blog, news, twtr)))
meta(corp[[1]], tag="id") <- "blog"
meta(corp[[2]], tag="id") <- "news"
meta(corp[[3]], tag="id") <- "twtr"

# Lowercase letters
corp <- tm_map(corp, content_transformer(tolower))

# To do: How to deal with stopwords (common words of questionable analytic value)
# corp <- tm_map(corp, removeWords, stopwords("english")) 
# No, needed in prediction? 

# Profanity filtering, to prevent suggesting bad words
# To do: Should probably remove whole sentences/lines, instead of removing
# words, which will change prediction results
if(!file.exists(file.path(datasubfolder,"bad-words.txt"))){
    download.file("https://gist.githubusercontent.com/ryanlewis/a37739d710ccdb4b406d/raw/0fbd315eb2900bb736609ea894b9bde8217b991a/google_twunter_lol", 
                  destfile = file.path(datasubfolder,"bad-words.txt"))
}
badwords <- read.table(file.path(datasubfolder,"bad-words.txt"), stringsAsFactors = F)
badwords <- subset(badwords, nchar(badwords$V1)>1)
badwords <- badwords[!(badwords$V1 %in% c("job", "hit", "God", "blow")), ]
corp <- tm_map(corp, removeWords, c(unlist(badwords))) 
rm(badwords)

# Numbers, note that this removes eg. 9th, and is a simplification with a cost
corp <- tm_map(corp, removeNumbers) 

#To do: Remove common word endings to make the words more recognizable? At what stage in the process?
#corp <- tm_map(corp, stemDocument) 

# Unnecessary spaces removal
corp <- tm_map(corp, stripWhitespace)

# To do: Determine if this is necessary
#corp <- tm_map(corp, PlainTextDocument) 

# To do: Clean up the above and make a function of it

#head(corp[[1]]$content, 3)
#head(corp[[2]]$content, 3)
#head(corp[[3]]$content, 3)

# Need to break into sentences? sent_detect (tm) or tokenize(text, what="sentence") in quanteda
# Differ between function/structure words and context/lexical words?
#########     Counts and sizes of the samples     ##############
# Transpose too, for use in the sample summary table
tdm <- TermDocumentMatrix(corp) #Maybe add weighting here also
#tdm
#dim(tdm)
#inspect(tdm[11:20, 1:3]) #Term 11 to 20, all three docs

samplesizes <- sapply(corp, function(x) format(object.size(x), units = "Mb"))
samplewords <- sapply(corp, stri_count_words)
samplelength <- sapply(samplewords, length)
samplesumwords <- sapply(samplewords, sum)
samplesumnchar <- sapply(corp,  function(x) sum(nchar(x)))
sampleavgwords <- sapply(samplewords, mean)

samplenames = c("blog","news","twtr")
sampleuniquewords <- vector("integer")
for (i in 1:3) {
    sampleuniquewords[i] <- nrow(subset(as.matrix(tdm), as.matrix(tdm)[, i] >0))
}
names(sampleuniquewords) <- samplenames

samplesummary <- data.frame(
    Sample = samplenames 
    ,Size = samplesizes 
    ,Rows = samplelength 
    ,Words = samplesumwords
    ,UniqueWords = sampleuniquewords
    ,Characters = samplesumnchar 
    ,AvgWordsPerLine = sampleavgwords
    ,AvgWordLength = samplesumnchar / samplesumwords
)
rownames(samplesummary) <- NULL
# Couldn't prevent RWeka tokenizer from removing punctuation, ended up with "i don t" etc.
# quanteda does a better job with intra-word punctuation
qcorp <- corpus(corp)
#summary(qcorp)
unigram <- dfm(qcorp, ngrams=1)
bigram <- dfm(qcorp, ngrams=2, concatenator=" ")
trigram <- dfm(qcorp, ngrams=3, concatenator=" ")
  
#####################   EDA plots   ###########################
ggprep <- function(mydfm, cols, t30=FALSE) {
    # Prepares a quanteda dfm and returns the wanted columns as a melted data frame for ggplots
    #
    # Args: 
    #    mydfm: A quanteda n-gram dfm from dfm()
    #    cols:  The columns to be returned
    #    t30:   If TRUE, keep only the 30 most frequent terms per document
    # Returns:
    #    df: The prepared data frame, melted for ggplot
    df <- as.data.frame(t(mydfm))
    df <- cbind(Terms = rownames(df), df)
    rownames(df) <- NULL
    df <- melt(df, id=1, variable.name="Docs", value.name="Freq")
    df <- subset(df, Freq>0)
    df <- df[with(df, order(Docs, -Freq)), ]
    df <- transform(df, Cumulative.frequency=ave(Freq, Docs, FUN=cumsum))
    df <- transform(df, Sum.group.frequency=ave(Freq, Docs, FUN=sum))
    df$Cumulative.share <- df$Cumulative.frequency / df$Sum.group.frequency
    df$Share <- df$Freq / df$Sum.group.frequency
    # The term's rank in the document based on its share within the doc
    df <- transform(df, Term.index=ave(Freq==Freq, Docs, FUN=cumsum)) 
    if(t30==TRUE) {df <- subset(df, Term.index<=30)}
    df <- df[, cols]
}

cumulativeplot <- function(df, title) {
    #Plot function for the cumulative word distribution
    #
    #Args:
    #   df:     A melted data frame ready for ggplot()
    #   title:  Plot title
    #Returns:
    #   pl:     The ggplot
    pl <<- ggplot(df, aes(x=Term.index, y=Cumulative.share, colour=Docs, group=Docs )) + 
        geom_line(size=1.05, show_guide=FALSE) + 
        theme_bw() +
        ggtitle(title) +
        scale_color_brewer(palette="Dark2") + # http://colorbrewer2.org/ - Qualitative, color blind safe
        theme(axis.title.x=element_blank(),
              axis.title.y=element_blank())
    pl
}
    
top30plot <- function(df, type) {
    #Plot function showing the most frequent terms as bars of size=share
    #
    #Args:
    #   df:     A melted data frame ready for ggplot()
    #   type:   "unigram", "bigram" or "trigram", used in annotation
    #Returns:
    #   gg:     The ggplot
    gg <<- ggplot(df, aes(x=reorder(Terms, Share), y=Share, fill=Docs)) +
        geom_bar(stat="identity") +
        facet_wrap(~ Docs) +
        coord_flip() +
        theme_bw() +
        scale_fill_brewer(palette="Dark2") +
        ggtitle(paste0("30 most occuring ", type, " from each of the three samples (", length(unique(df$Terms)), " ", type, " total)")) +
        ylab(paste0("The ", type, "' share (in %) of the total number of ", type, " occurrences in its document")) +
        theme(axis.title.y=element_blank(),
              panel.margin=unit(1, "lines"), # A bit extra space between the facets
              strip.text.x = element_text(size=12)) +
        guides(fill=FALSE)
    gg
}

###########   Prep df's for the cumulative plot  #############
# and behold https://en.wikipedia.org/wiki/Zipf%27s_law
cols <-  c("Docs", "Cumulative.share", "Term.index")
df1 <- ggprep(unigram, cols)
pl1 <- cumulativeplot(df1, title="Unigram") 
df2 <- ggprep(bigram, cols)
pl2 <-cumulativeplot(df2, title="Bigram")
df3 <- ggprep(trigram, cols)
pl3 <-cumulativeplot(df3, title="Trigram")
  
###########    Cumulative plot    #############
grid.arrange(pl1, pl2, pl3, ncol=3, 
             top="Cumulative share per term in each document\n(green=blog, orange=news, purple=twtr)",
             bottom="x = Term rank, from most to least frequent per document",
             left="y = Cumulative share")

###########     Table of cumulative coverage    ###########
termcumcoverage <- function(df, coverage) {
    min(which(df$Cumulative.share >= coverage))
}

coverage <- data.frame(Coverage=character(),
                       Unigram=integer(),
                       Bigram=integer(),
                       Trigram=integer(), 
                       stringsAsFactors = FALSE)


coverage[1, ] <- c("50%", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(0.50)))
coverage[2, ] <- c("64%", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(0.64)))
coverage[3, ] <- c("90%", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(0.90)))
coverage[4, ] <- c("95%", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(0.95)))
coverage[5, ] <- c("Full", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(1.00)))
  
#########     Plot the top 30 term occurrences per document      ##########
cols <- c("Docs", "Terms", "Share")
df <- ggprep(unigram, cols, t30=TRUE)
print(top30plot(df, type="unigrams"))
df <- ggprep(bigram, cols, t30=TRUE)
print(top30plot(df, type="bigrams"))
df <- ggprep(trigram, cols, t30=TRUE)
print(top30plot(df, type="trigrams"))

###############       Citations      #################
# Citations are due to these and more
#citation("tm")
#citation("quanteda")

# IEEE PacificVis 2015 short paper
#http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7156366 

############   Session info     ###############

# 24 GB RAM, Windows Server 2008 R2
Sys.getlocale()
R.version.string
sessionInfo()