The company SwiftKey develops smart-prediction software that makes typing on smartphones and similar devices easier. This project analyzes their mammoth, publicly available Twitter/news/blogs database for suitability in developing word-prediction algorithms. The key challenge was to find a software and hardware combination that could complete the basic data-processing tasks of generating n-grams (sequences of word phrases), on which a word-prediction algorithm can then be applied.
This report fulfills the requirements of the Week 2 milestone for Johns Hopkins University's Data Science Capstone class. The code and output demonstrate that the SwiftKey data has been downloaded and unzipped, that basic line-count and word-count statistics have been generated, and that the quanteda Natural Language Processing (NLP) package is sufficient to explore the database and report frequency-matrix data on word tokens and n-grams. The final section details the further work required to complete word prediction.
The zip file location for the corpus has been provided by the course instructors.
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destFile <- "Coursera-SwiftKey.zip"
dataDir <- "final"
if (!file.exists(dataDir)) {
  cat("Downloading ZIP file from web.... \n")
  download.file(fileUrl, destfile = destFile, method = "curl")
  unzip(destFile)
  cat("unzip complete....\n")
  # Did the download and unzip work as expected?
  if (!file.exists(dataDir))
    stop("ZIP file download and unzip process failed: data directory not created.")
} else cat("Using existing data directory: ", dataDir, "\n")
## Using existing data directory: final
Text data has been provided in German, Finnish, Russian, and English. We will only be using the English data.
list.files("final", include.dirs = T)
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
How big is this data?
finfo <- file.info(list.files(file.path("final", "en_US"), full.names = TRUE, pattern = "\\.txt$"))
fsizeMB <- finfo$size/(1024^2)
names(fsizeMB) <- basename(rownames(finfo))
fsizeMB
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt
##          200.4242          196.2775          159.3641
This is a rather large corpus (>500 MB), requiring long processing times on an 8 GB RAM Mac if read in as one big blob. It presents a whole slew of problems related to memory size, CPU run times, and non-ASCII characters, in addition to needing the NLP features.
Switching from a personal MacBook Pro to a more capable Google Compute Engine machine, and using the quanteda NLP package (instead of tm), resolved all of these issues.
library(readr)     # read_lines()
library(quanteda)  # corpus(), dfm(), topfeatures(), textplot_wordcloud()
twit <- read_lines("final/en_US/en_US.twitter.txt")
news <- read_lines("final/en_US/en_US.news.txt")
blog <- read_lines("final/en_US/en_US.blogs.txt")
lc <- c(length(blog), length(news), length(twit))
names(lc) <- c("blog", "news", "twit")
barplot(lc/1000, col = c("pink", "green", "lightblue"), main="Line Count in Corpus", ylab = "(in thousands)")
The news file is larger on disk, but the line count is higher for Twitter, most likely because of Twitter's character limit.
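As a quick sanity check on the memory concern mentioned earlier, base R's object.size() reports the in-memory footprint of the raw character vectors (a minimal check; exact figures vary by machine and R version):
# In-memory size of the raw character vectors; typically larger than the
# on-disk file sizes because of R's per-string overhead.
print(object.size(blog), units = "Mb")
print(object.size(news), units = "Mb")
print(object.size(twit), units = "Mb")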
To determine word counts, I could have written my own function, but decided to use NLP packages to get familiar with them.
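For reference, a hand-rolled word count needs only a few lines of base R. The sketch below (countWords is just an illustrative helper name) splits on whitespace, so its totals would differ somewhat from the token counts quanteda reports:
# Naive word count: split each line on runs of whitespace and count the pieces.
countWords <- function(lines) {
  sum(lengths(strsplit(lines, "\\s+")))
}
# e.g. countWords(twit)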
# Read the text data into the NLP package.
# Note: iconv() returns NA for lines it cannot convert to ASCII; those NAs are
# pasted in as literal "NA" strings, which is where the "na" token seen later comes from.
corpus.blog <- corpus(paste(iconv(blog, "UTF-8", "ASCII"),collapse = "\n"))
corpus.twit <- corpus(paste(iconv(twit, "UTF-8", "ASCII"),collapse = "\n"))
corpus.news <- corpus(paste(iconv(news, "UTF-8", "ASCII"),collapse = "\n"))
# Generate document-feature matrices. Long run times on this huge corpus as the text is split into word tokens.
dfm.blog <- dfm(corpus.blog)
dfm.twit <- dfm(corpus.twit)
dfm.news <- dfm(corpus.news)
wc <- c(rowSums(dfm.blog), rowSums(dfm.news), rowSums(dfm.twit))
names(wc) <- c("blog","news","twit")
barplot(wc/1000^2, col = c("pink", "green", "lightblue"), main="Word Count in Corpus", ylab = "(in millions)")
wc
## blog news twit
## 24736471 33517462 35340301
cat("Total ", sum(wc)/(1000^2), " million words in the database.\n")
## Total 93.59423 million words in the database.
The next task was to combine all the data into one file. Since the goal of the capstone project is to predict words, we don't necessarily need to keep the sources of our training data separate.
Combining corpora is very easy in quanteda with the "+" operator:
corpus.all <- corpus.blog + corpus.news + corpus.twit
dfm.all <- dfm(corpus.all)
# Get word counts from a document-feature matrix.
# Define a function for reuse later.
getwc <- function(mydfm) {
  wc <- rowSums(mydfm)
  names(wc) <- c("blog", "news", "twit")
  wc
}
wc.all <- getwc(dfm.all)
wc.all
## blog news twit
## 24736471 33517462 35340301
The combined word counts match the individual ones exactly, so we can proceed with generating some interesting information.
We can answer some simple, fun, and academic questions to demonstrate that processing in quanteda will be adequate for our purpose.
topfeatures.all <- topfeatures(dfm.all, 20)
topfeatures.all
## . the , to a and of ! i
## 5424403 3604592 3339277 2117364 1836267 1800544 1473181 1382555 1306250
## in " for is you that it on :
## 1262582 1005690 884585 822494 776799 760898 699970 653700 576062
## with ?
## 550084 531829
It looks like we will need to clean out punctuation and single characters later.
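Single-character tokens can be dropped with quanteda's dfm_select() and its min_nchar argument; a minimal sketch (dfm.min2 is just an illustrative name):
# Keep only features that are at least two characters long, dropping stray
# single letters and punctuation marks.
dfm.min2 <- dfm_select(dfm.all, min_nchar = 2)
topfeatures(dfm.min2, 10)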
We can remove profane words. A file with profane words has already been downloaded locally.
proflist <- scan("profanity_list.txt", what="")
dfm.noprofanity <- dfm(corpus.all, remove = c(proflist))
wc.noprofanity <- getwc(dfm.noprofanity)
wc.profanity <- wc.all - wc.noprofanity
wc.profanity
## blog news twit
## 26443 9150 137850
Based on these raw counts, Twitter users appear to be more profane than users of the other media.
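Since the three sources differ in total size, a fairer comparison is the profanity rate per word, computed from the counts above:
# Profane words as a percentage of all words in each source.
round(wc.profanity / wc.all * 100, 3)
Twitter's rate remains the highest by a wide margin, so the conclusion holds even after normalizing for corpus size.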
dfm.nopun <- dfm(corpus.all, remove_punct = T, remove_numbers = T, remove_separators = T,
remove = c(proflist))
wc.nopun <- getwc(dfm.nopun)
topfeatures.nopun <- topfeatures(dfm.nopun, 20)
topfeatures.nopun
## the to a and of i in for is
## 3604592 2117364 1836267 1800544 1473181 1306250 1262582 884585 822494
## you that it on with my na was at
## 776799 760898 699970 653700 550084 486497 476976 466440 459849
## be have
## 423970 412184
dfm.nostop <- dfm(corpus.all, remove_punct = T, remove_numbers = T,remove_separators = T,
remove = c(stopwords("english"), proflist))
topfeatures.nostop <- topfeatures(dfm.nostop, 20)
wc.nostop <- getwc(dfm.nostop)
topfeatures.nostop
## na just said one like can get time new good
## 476976 247627 242483 218769 213320 191906 185902 167965 156514 150006
## now day love know people go back see great first
## 146297 141700 137087 128968 119301 115294 114347 111689 106293 103717
textplot_wordcloud(dfm.nostop, min.freq=50000, random.order=F, rot.per=0.25, colors=RColorBrewer::brewer.pal(8,"Dark2"))
We need to clean the data some more, for example removing "na" (an artifact of the iconv() conversion noted above), but that's a task for another milestone.
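When we do get to it, removing such stray tokens is a one-liner with dfm_remove(); a minimal sketch (dfm.cleaner is just an illustrative name):
# Drop the spurious "na" feature (and any other junk tokens identified later).
dfm.cleaner <- dfm_remove(dfm.nostop, "na")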
barcolumns <- c("All words", "No profanity", "No punctuation", "No stop words")
wc.sums <- c(sum(wc.all), sum(wc.noprofanity), sum(wc.nopun), sum(wc.nostop))
wc.sums <- wc.sums / 1000^2   # scale to millions
names(wc.sums) <- barcolumns
barplot(wc.sums, col = c("red", "blue", "green", "yellow"),
        main = "Corpus word count after cleaning", ylab = "(in millions)")
We can create n-grams, i.e. frequency matrices of word sequences in our corpus. Once these matrices are available, the final implementation of the project can look up n-gram prefixes while the user is typing. The quanteda package makes this extremely easy.
# Memory & runtime issue: have to save ngram2.dfm and load it for knitr
#ngram2 <- tokens(corpus.all, what="word", ngrams=2L, remove_punct=T)
#ngram2.dfm <- dfm(ngram2)
ngram2.dfm <- readRDS("ngram2.trim.RDS")
barplot(topfeatures(ngram2.dfm, 20)/1000, las=2, main="Top 20 2-grams in the database", ylab = "(in thousands)", col="green")
rm(ngram2.dfm)
# Memory & runtime issue: have to save ngram3.dfm and load for knit
#ngram3 <- tokens(corpus.all, what="word", ngrams=3L, remove_punct=T)
#ngram3.dfm <- dfm(ngram3)
ngram3.dfm <- readRDS("ngram3.trim.RDS")
barplot(topfeatures(ngram3.dfm, 20)/1000, las=2, main="Top 20 3-grams in the database", ylab = "(in thousands)", col="green")
rm(ngram3.dfm)
# Memory & runtime issue: have to save ngram4.dfm and load for knit
#ngram4 <- tokens(corpus.all, what="word", ngrams=4L, remove_punct=T)
#ngram4.dfm <- dfm(ngram4)
ngram4.dfm <- readRDS("ngram4.trim.RDS")
par(mar = c(9,4,4,2) + 0.1)
barplot(topfeatures(ngram4.dfm, 20)/1000, las=2, main="Top 20 4-grams in the database", ylab = "(in thousands)", col="green")
rm(ngram4.dfm)
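As a preview of how these n-grams will feed the prediction step, the sketch below turns the trimmed 2-gram dfm into a simple prefix lookup table with data.table. This is only an illustrative approach, not the final algorithm; it assumes the features are word pairs joined by "_" (quanteda's default concatenator) and that ngram2.trim.RDS fits comfortably in memory.
library(data.table)
ngram2.dfm <- readRDS("ngram2.trim.RDS")
# Build a lookup table: for each first word (the prefix), keep the most
# frequent second word observed in the corpus.
ng2 <- data.table(feature = featnames(ngram2.dfm),
                  count   = colSums(ngram2.dfm))
ng2[, c("w1", "w2") := tstrsplit(feature, "_", fixed = TRUE, keep = 1:2)]
lookup2 <- ng2[order(-count), .SD[1], by = w1]   # top completion per prefix
# Example: the most likely next word after "thanks" (if present in the trimmed dfm)
lookup2[w1 == "thanks", .(w1, w2, count)]
rm(ngram2.dfm, ng2)
Higher-order n-grams would get the same treatment, with the prefix being the first n-1 words and a back-off scheme falling through to shorter prefixes when no match is found.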