This is a milestone report for the Coursera Data Science Specialization Capstone project (March-May 2016). Students were given three text files, from blogs, news media and Twitter, to analyze.
This report shows what I have done and found so far, and what I think I must, should or could do next. The full text files were loaded and measured. The sample this report is built on is only 1% of each of the three full files, to keep the processing fast enough. The most problematic tokenizing work still outstanding is cleaning the Twitter file more thoroughly while retaining sentence boundaries. The next step is to create training and test data sets, and then build the first version of the prediction models. Most of the work-in-progress code is included in the appendix.
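The split into training and test sets could start from something like the minimal sketch below. It assumes the cleaned sample lines sit in a character vector called `lines` (a placeholder name, not taken from the appendix code), and the 80/20 proportion is only a first guess:

```r
# Minimal train/test split sketch - `lines` is a placeholder for the cleaned sample lines
set.seed(42)
n <- length(lines)
train_idx <- sample.int(n, size = floor(0.8 * n))
train <- lines[train_idx]
test  <- lines[-train_idx]
```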
| File | Size | Rows | Words | UniqueWords | Characters | AvgWordsPerLine | AvgWordLength |
|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 248.5 Mb | 899288 | 37546246 | 409829 | 206824505 | 41.75108 | 5.508527 |
| en_US.news.txt | 249.6 Mb | 1010242 | 34762395 | 384992 | 203223159 | 34.40997 | 5.846063 |
| en_US.twitter.txt | 301.4 Mb | 2360148 | 30093410 | 442624 | 162096241 | 12.75065 | 5.386437 |
| Sample | Size | Rows | Words | UniqueWords | Characters | AvgWordsPerLine | AvgWordLength |
|---|---|---|---|---|---|---|---|
| blog | 2.5 Mb | 8992 | 382155 | 29654 | 2098440 | 42.49944 | 5.491070 |
| news | 2.4 Mb | 10102 | 340993 | 31204 | 2008911 | 33.75500 | 5.891356 |
| twtr | 3 Mb | 23601 | 295027 | 23755 | 1626481 | 12.50061 | 5.512990 |
The raw code for most of these steps (and the plots below) is in the appendix.
What I have found so far in the exploratory data analysis is as expected: a tough-to-analyze “Twitter language”, quite different from the blog and news texts. Blogs and news display a wider vocabulary. In all three documents we can observe Zipf’s law: a few words occur very frequently, while a multitude of other words occur very seldom. This is especially true for single words (due in part to stopwords), but it is also visible for bigrams. For trigrams the effect is much weaker.
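A quick way to eyeball Zipf’s law is a log-log plot of term frequency against frequency rank, which should be roughly a straight line. A minimal sketch, assuming a numeric vector of term counts called `freqs` (a placeholder name, e.g. the column sums of a unigram dfm):

```r
# Zipf check: frequency vs. rank on log-log axes (`freqs` is a placeholder vector of term counts)
freqs <- sort(freqs, decreasing = TRUE)
plot(log10(seq_along(freqs)), log10(freqs), type = "l",
     xlab = "log10(rank)", ylab = "log10(frequency)",
     main = "Term frequency vs. rank (Zipf's law)")
```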
Combined, the cumulative term distributions show, for example, that only 105 words are needed to cover 50% of the single-word occurrences in the samples. Number of terms needed for 50, 64, 90, 95 and 100% coverage:
| Coverage | Unigram | Bigram | Trigram |
|---|---|---|---|
| 50% | 105 | 21892 | 142989 |
| 64% | 388 | 59776 | 196175 |
| 90% | 5906 | 158550 | 294949 |
| 95% | 12318 | 177545 | 313944 |
| Full | 29561 | 196539 | 332938 |
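The counts in this table come from the cumulative shares prepared for the plots (see `termcumcoverage` in the appendix). The idea in isolation, assuming a numeric vector of term counts called `freqs` (a placeholder name):

```r
# How many of the most frequent terms are needed to reach a given coverage level?
coverage_terms <- function(freqs, coverage = 0.5) {
  cum_share <- cumsum(sort(freqs, decreasing = TRUE)) / sum(freqs)
  min(which(cum_share >= coverage))
}
# coverage_terms(freqs, 0.90)
```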
Plotting the most frequent terms:
We see that the vocabularies in the three documents complement each other; the most frequent trigrams differ between the three.
Must
Search wide and deep for information on all the topics below, and develop what’s necessary.
Other must-do steps? Please leave a comment when you review this, my dear data science peer.
Should
1. Clean up the preprocessing code, refactor it into functions, and drop the tm package
2. Look for better models, fixes and tricks than the ones suggested in the task descriptions
3. The profanity filter currently removes single words; it should instead drop whole sentences, so that prediction results are not distorted (see the sketch after this list)
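A minimal sketch of line-level profanity filtering (lines as a stand-in for sentences), assuming the text lines sit in a character vector `lines` (a placeholder name) and `badwords` is the word list loaded in the appendix; the word-boundary regex is just one possible way to match whole words:

```r
# Drop whole lines containing any word from the profanity list,
# instead of blanking individual words (`lines` is a placeholder)
pattern <- paste0("\\b(", paste(badwords$V1, collapse = "|"), ")\\b")
lines <- lines[!grepl(pattern, lines, ignore.case = TRUE)]
```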
Could
1. Try to find other corpora on the web to get better datasets
2. Visualize more, maybe get inspired by ISOVIS’ collection
The code is included here at the end for your convenience, in case you want to dig deeper into the why and how.
knitr::opts_chunk$set(fig.width=9.8, dpi=150,
echo=FALSE, warning=FALSE, message=FALSE)
library(quanteda) # ntype, corpus
library(tm) # tm_map, Corpus, tdm, dtm
library(stringi) # stri_count_words
library(knitr) # kable
library(ggplot2) # ggplot
library(grid) #unit
library(reshape2) #melt
library(gridExtra) #grid.arrange()
############# Download and unzip data if not already done ###############
if (!file.exists("./data")) { # Create a data folder if needed
dir.create("./data")
}
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip?accessType=DOWNLOAD"
destfile <- file.path("./data", "Coursera-SwiftKey.zip")
if (!file.exists(destfile)) { # Download zipped file if it is not already done
download.file( url = url,
destfile = destfile,
mode = "wb") # Windows. Other OS may use method=curl
}
datafolder <- file.path("./data" , "final")
if (!file.exists(datafolder)) { # Unzip it if is not already done
unzip(destfile, exdir='./data')
}
############## Read all rows from original files and create sample files #################
datasubfolder <- file.path(datafolder, "en_US")
blogfile <- file.path(datasubfolder, "en_US.blogs.txt")
newsfile <- file.path(datasubfolder, "en_US.news.txt")
twitterfile <- file.path(datasubfolder, "en_US.twitter.txt")
blogsamplefile <- file.path(datasubfolder, "blog.Rdata")
if (!file.exists(blogsamplefile)) {
blog.full <- readLines(blogfile, encoding="UTF-8", skipNul=TRUE)
blogrows <- as.integer(length(blog.full)) # 899288L used both in sampling and in summary
# Other key measures for summary of the file
blog.fullwords <- stri_count_words(blog.full)
blog.fullsize <- format(object.size(blog.full), units="Mb")
blog.fullsumwords <- sum(blog.fullwords)
blog.uniquewords <- ntype(toLower(toString(blog.full)), removePunct=TRUE) #quanteda package
blog.fullsumnchar <- sum(nchar(blog.full))
blog.fullavgwords <- mean(blog.fullwords)
blog.fullwordlength <- blog.fullsumnchar / blog.fullsumwords
set.seed(899)
blog <- as.data.frame(blog.full[sample(blogrows, blogrows * 0.01, replace=FALSE)])
rm(blog.full)
blog[,1] <- as.character(blog[,1])
colnames(blog) <- "Text"
save(blog, file=blogsamplefile) # same path as tested above, so the cache is found on case-sensitive file systems
} else {
load(blogsamplefile, envir=.GlobalEnv)
}
newssamplefile <- file.path(datasubfolder, "news.Rdata")
# The news data are more troublesome. Had to use binary here (read_lines {readr} also worked).
# With readLines + write.table the file was cut at 77259 rows (it has > 1M)
if (!file.exists(newssamplefile)) {
con <- file(newsfile, open="rb")
news.full <- readLines(con, encoding="UTF-8", skipNul=TRUE)
close(con)
newsrows <- as.integer(length(news.full)) # 1010242L
news.fullwords <- stri_count_words(news.full)
news.fullsize <- format(object.size(news.full), units="Mb")
news.fullsumwords <- sum(news.fullwords)
news.uniquewords <- ntype(toLower(toString(news.full)), removePunct=TRUE) #quanteda package
news.fullsumnchar <- sum(nchar(news.full))
news.fullavgwords <- mean(news.fullwords)
news.fullwordlength <- news.fullsumnchar / news.fullsumwords
set.seed(1010)
news <- as.data.frame(news.full[sample(newsrows, newsrows * 0.01, replace=FALSE)])
rm(news.full)
news[,1] <- as.character(news[,1])
colnames(news) <- "Text"
save(news, file=newssamplefile)
} else {
load(newssamplefile, envir=.GlobalEnv)
}
twittersamplefile <- file.path(datasubfolder, "twtr.Rdata")
if (!file.exists(twittersamplefile)) {
twtr.full <- readLines(twitterfile, encoding="UTF-8", skipNul=TRUE)
twitterrows <- as.integer(length(twtr.full)) # 2360148L
twtr.fullwords <- stri_count_words(twtr.full)
twtr.fullsize <- format(object.size(twtr.full), units="Mb")
twtr.fullsumwords <- sum(twtr.fullwords)
twtr.uniquewords <- ntype(toLower(toString(twtr.full)), removePunct=TRUE) #quanteda package
twtr.fullsumnchar <- sum(nchar(twtr.full))
twtr.fullavgwords <- mean(twtr.fullwords)
twtr.fullwordlength <- twtr.fullsumnchar / twtr.fullsumwords
set.seed(2360)
twtr <- as.data.frame(twtr.full[sample(twitterrows, twitterrows * 0.01, replace=FALSE)])
rm(twtr.full)
twtr[,1] <- as.character(twtr[,1])
colnames(twtr) <- "Text"
save(twtr, file=twittersamplefile)
} else {
load(twittersamplefile, envir=.GlobalEnv)
}
############## Summarize full file key sizes ##################
fullfilesummaryfile <- file.path(datasubfolder, "fullfilesummary.Rdata")
if (!file.exists(fullfilesummaryfile)) {
fullfilesummary <- data.frame(
File = c("en_US.blog.txt","en_US.news.txt","en_US.twitter.txt")
,Size = c(blog.fullsize, news.fullsize, twtr.fullsize)
,Rows = c(blogrows, newsrows, twitterrows)
,Words = c(blog.fullsumwords, news.fullsumwords, twtr.fullsumwords)
,UniqueWords = c(blog.uniquewords, news.uniquewords, twtr.uniquewords)
,Characters = c(blog.fullsumnchar, news.fullsumnchar, twtr.fullsumnchar)
,AvgWordsPerLine = c(blog.fullavgwords, news.fullavgwords, twtr.fullavgwords)
,AvgWordLength = c(blog.fullwordlength, news.fullwordlength, twtr.fullwordlength)
)
save(fullfilesummary, file=fullfilesummaryfile)
} else {
load(fullfilesummaryfile, envir=.GlobalEnv)
}
############## Preprocess texts with regex ##############
#Remove non-ASCII characters like <f0><U+009F>
blog$Text <- gsub("[^\x20-\x7E]", "", blog$Text)
news$Text <- gsub("[^\x20-\x7E]", "", news$Text)
twtr$Text <- gsub("[^\x20-\x7E]", "", twtr$Text)
#Prevent & from being removed as punctuation
blog$Text <- gsub("&", " and ",blog$Text)
news$Text <- gsub("&", " and ",news$Text)
twtr$Text <- gsub("&", " and ",twtr$Text)
#Remove hash tags, user names, urls
twtr$Text <- gsub("#\\w+", " ", twtr$Text)
twtr$Text <- gsub("@\\w+", " ", twtr$Text)
twtr$Text <- gsub("http\\w+", " ", twtr$Text)
twtr$Text <- gsub("www\\w+", " ", twtr$Text)
blog$Text <- gsub("http\\w+", " ", blog$Text)
blog$Text <- gsub("www\\w+", " ", blog$Text)
news$Text <- gsub("http\\w+", " ", news$Text)
news$Text <- gsub("www\\w+", " ", news$Text)
#Split texts into sentences using qdap - too many mistakes
# tmp <- as.data.frame(sent_detect(blog$Text, incomplete.sub = F), stringsAsFactors = FALSE)
# colnames(tmp) <- "Text"
# blog <- tmp
# tmp <- as.data.frame(sent_detect(news$Text, incomplete.sub = F), stringsAsFactors = FALSE)
# colnames(tmp) <- "Text"
# news <- tmp
# tmp <- as.data.frame(sent_detect(twtr$Text, incomplete.sub = T, rm.bracket = F), stringsAsFactors = FALSE)
# colnames(tmp) <- "Text"
# twtr <- tmp
# rm(tmp)
#Remove punctuation without removing apostrophes and in-word dashes (which my tm package version does)
blog$Text <- gsub("[^[:alnum:]['-]", " ",blog$Text)
news$Text <- gsub("[^[:alnum:]['-]", " ",news$Text)
twtr$Text <- gsub("[^[:alnum:]['-]", " ",twtr$Text)
#To do:
# Refine further, eg. replace " u s " with " U.S. "
blog$Text <- gsub(" [Uu] [Ss] ", " U.S. ",blog$Text)
news$Text <- gsub(" [Uu] [Ss] ", " U.S. ",news$Text)
twtr$Text <- gsub(" [Uu] [Ss] ", " U.S. ",twtr$Text)
#To do: Add extra treatment of misplaced apostrophes
#To do: Compress into a function
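# One possible shape for that function (a sketch, defined but not wired in yet):
# it just bundles the regex steps above so every sample gets identical treatment.
clean_text <- function(x) {
  x <- gsub("[^\x20-\x7E]", "", x)          # drop non-ASCII characters
  x <- gsub("&", " and ", x)                # keep ampersands as the word "and"
  x <- gsub("http\\S+|www\\.\\S+", " ", x)  # drop URLs
  x <- gsub("#\\w+|@\\w+", " ", x)          # hashtags/handles (only relevant for twtr, harmless elsewhere)
  x <- gsub("[^[:alnum:]'-]", " ", x)       # punctuation, keeping apostrophes and in-word dashes
  gsub(" [Uu] [Ss] ", " U.S. ", x)          # restore "U.S."
}
# Not called yet; the per-sample gsub() calls above remain the active path.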
############## Create corpus and perform further cleaning using {tm} #################
corp <- Corpus(VectorSource(c(blog, news, twtr)))
meta(corp[[1]], tag="id") <- "blog"
meta(corp[[2]], tag="id") <- "news"
meta(corp[[3]], tag="id") <- "twtr"
# Lowercase letters
corp <- tm_map(corp, content_transformer(tolower))
# To do: How to deal with stopwords (common words of questionable analytic value)
# corp <- tm_map(corp, removeWords, stopwords("english"))
# Leaving them in for now - stopwords are probably needed for next-word prediction
# Profanity filtering, to prevent suggesting bad words
# To do: Should probably remove whole sentences/lines, instead of removing
# words, which will change prediction results
if(!file.exists(file.path(datasubfolder,"bad-words.txt"))){
download.file("https://gist.githubusercontent.com/ryanlewis/a37739d710ccdb4b406d/raw/0fbd315eb2900bb736609ea894b9bde8217b991a/google_twunter_lol",
destfile = file.path(datasubfolder,"bad-words.txt"))
}
badwords <- read.table(file.path(datasubfolder,"bad-words.txt"), stringsAsFactors = F)
badwords <- subset(badwords, nchar(badwords$V1)>1)
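# Drop a few ordinary words that appear in this profanity list, so legitimate uses are not filtered out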
badwords <- badwords[!(badwords$V1 %in% c("job", "hit", "God", "blow")), ]
corp <- tm_map(corp, removeWords, c(unlist(badwords)))
rm(badwords)
# Numbers; note that this turns e.g. "9th" into "th" - a simplification with a cost
corp <- tm_map(corp, removeNumbers)
#To do: Stem common word endings so that word variants map to the same token? At what stage in the process?
#corp <- tm_map(corp, stemDocument)
# Unnecessary spaces removal
corp <- tm_map(corp, stripWhitespace)
# To do: Determine if this is necessary
#corp <- tm_map(corp, PlainTextDocument)
# To do: Clean up the above and make a function of it
#head(corp[[1]]$content, 3)
#head(corp[[2]]$content, 3)
#head(corp[[3]]$content, 3)
# Need to break into sentences? sent_detect (qdap) or tokenize(text, what="sentence") in quanteda
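# A possible quanteda route (untried here; it would have to run before punctuation is stripped,
# e.g. right after reading the raw samples):
# sentences <- unlist(tokenize(blog$Text, what = "sentence"))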
# Differ between function/structure words and context/lexical words?
######### Counts and sizes of the samples ##############
# Transpose too, for use in the sample summary table
tdm <- TermDocumentMatrix(corp) #Maybe add weighting here also
#tdm
#dim(tdm)
#inspect(tdm[11:20, 1:3]) #Term 11 to 20, all three docs
samplesizes <- sapply(corp, function(x) format(object.size(x), units = "Mb"))
samplewords <- sapply(corp, stri_count_words)
samplelength <- sapply(samplewords, length)
samplesumwords <- sapply(samplewords, sum)
samplesumnchar <- sapply(corp, function(x) sum(nchar(x)))
sampleavgwords <- sapply(samplewords, mean)
samplenames <- c("blog","news","twtr")
sampleuniquewords <- vector("integer")
tdmmatrix <- as.matrix(tdm)
for (i in 1:3) {
  sampleuniquewords[i] <- sum(tdmmatrix[, i] > 0)
}
names(sampleuniquewords) <- samplenames
samplesummary <- data.frame(
Sample = samplenames
,Size = samplesizes
,Rows = samplelength
,Words = samplesumwords
,UniqueWords = sampleuniquewords
,Characters = samplesumnchar
,AvgWordsPerLine = sampleavgwords
,AvgWordLength = samplesumnchar / samplesumwords
)
rownames(samplesummary) <- NULL
# Couldn't prevent RWeka tokenizer from removing punctuation, ended up with "i don t" etc.
# quanteda does a better job with intra-word punctuation
qcorp <- corpus(corp)
#summary(qcorp)
unigram <- dfm(qcorp, ngrams=1)
bigram <- dfm(qcorp, ngrams=2, concatenator=" ")
trigram <- dfm(qcorp, ngrams=3, concatenator=" ")
##################### EDA plots ###########################
ggprep <- function(mydfm, cols, t30=FALSE) {
# Prepares a quanteda dfm and returns the wanted columns as a melted data frame for ggplots
#
# Args:
# mydfm: A quanteda n-gram dfm from dfm()
# cols: The columns to be returned
# t30: If TRUE, keep only the 30 most frequent terms per document
# Returns:
# df: The prepared dataframe
df <- as.data.frame(t(mydfm))
df <- cbind(Terms = rownames(df), df)
rownames(df) <- NULL
df <- melt(df, id=1, variable.name="Docs", value.name="Freq")
df <- subset(df, Freq>0)
df <- df[with(df, order(Docs, -Freq)), ]
df <- transform(df, Cumulative.frequency=ave(Freq, Docs, FUN=cumsum))
df <- transform(df, Sum.group.frequency=ave(Freq, Docs, FUN=sum))
df$Cumulative.share <- df$Cumulative.frequency / df$Sum.group.frequency
df$Share <- df$Freq / df$Sum.group.frequency
# The term's rank in the document based on its share within the doc
df <- transform(df, Term.index=ave(Freq==Freq, Docs, FUN=cumsum))
if(t30==TRUE) {df <- subset(df, Term.index<=30)}
df <- df[, cols]
}
cumulativeplot <- function(df, title) {
#Plot function for the cumulative word distribution
#
#Args:
# df: A melted data frame ready for ggplot()
# title: The plot title
#Returns:
# pl: The ggplot
pl <- ggplot(df, aes(x=Term.index, y=Cumulative.share, colour=Docs, group=Docs)) +
geom_line(size=1.05, show_guide=FALSE) +
theme_bw() +
ggtitle(title) +
scale_color_brewer(palette="Dark2") + # http://colorbrewer2.org/ - Qualitative, color blind safe
theme(axis.title.x=element_blank(),
axis.title.y=element_blank())
}
top30plot <- function(df, type) {
#Plot function showing the most frequent terms as bars of size=share
#
#Args:
# df: A melted data frame ready for ggplot()
# type: "unigram", "bigram" or "trigram", used in annotation
#Returns:
# gg: The ggplot
gg <- ggplot(df, aes(x=reorder(Terms, Share), y=Share, fill=Docs)) +
geom_bar(stat="identity") +
facet_wrap(~ Docs) +
coord_flip() +
theme_bw() +
scale_fill_brewer(palette="Dark2") +
ggtitle(paste0("30 most occuring ", type, " from each of the three samples (", length(unique(df$Terms)), " ", type, " total)")) +
ylab(paste0("The ", type, "' share (in %) of the total number of ", type, " occurrences in its document")) +
theme(axis.title.y=element_blank(),
panel.margin=unit(1, "lines"), # A bit extra space between the facets
strip.text.x = element_text(size=12)) +
guides(fill=FALSE)
}
########### Prep df's for the cumulative plot #############
# and behold https://en.wikipedia.org/wiki/Zipf%27s_law
cols <- c("Docs", "Cumulative.share", "Term.index")
df1 <- ggprep(unigram, cols)
pl1 <- cumulativeplot(df1, title="Unigram")
df2 <- ggprep(bigram, cols)
pl2 <-cumulativeplot(df2, title="Bigram")
df3 <- ggprep(trigram, cols)
pl3 <-cumulativeplot(df3, title="Trigram")
########### Cumulative plot #############
grid.arrange(pl1, pl2, pl3, ncol=3,
top="Cumulative share per term in each document\n(green=blog, orange=news, purple=twtr)",
bottom="x = Term rank, from most to least frequent per document",
left="y = Cumulative share")
########### Table of cumulative coverage ###########
termcumcoverage <- function(df, coverage) {
min(which(df$Cumulative.share >= coverage))
}
coverage <- data.frame(Coverage=character(),
Unigram=integer(),
Bigram=integer(),
Trigram=integer(),
stringsAsFactors = FALSE)
coverage[1, ] <- c("50%", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(0.50)))
coverage[2, ] <- c("64%", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(0.64)))
coverage[3, ] <- c("90%", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(0.90)))
coverage[4, ] <- c("95%", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(0.95)))
coverage[5, ] <- c("Full", mapply(termcumcoverage, df=list(df1, df2, df3), coverage=c(1.00)))
######### Plot the top 30 term occurences per document ##########
cols <- c("Docs", "Terms", "Share")
df <- ggprep(unigram, cols, t30=TRUE)
print(top30plot(df, type="unigrams"))
df <- ggprep(bigram, cols, t30=TRUE)
print(top30plot(df, type="bigrams"))
df <- ggprep(trigram, cols, t30=TRUE)
print(top30plot(df, type="trigrams"))
############### Citations #################
# Citations are due to these and more
#citation("tm")
#citation("quanteda")
# IEEE PacificVis 2015 short paper
#http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7156366
############ Session info ###############
# 24 GB RAM, Windows Server 2008 R2
Sys.getlocale()
R.version.string
sessionInfo()