This report attempts to fulfill the requirements of the Week 2 milestone for Coursera's Data Science Capstone. The code and output show that the SwiftKey data has been downloaded and unzipped, basic line-count and word-count statistics have been generated, and the quanteda package has been used to explore the corpus and report frequency data on word tokens and n-grams. The final section details further work that remains to be done.
The zip file location for the corpus has been provided by Coursera.
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destFile <- "Coursera-SwiftKey.zip"
dataDir <- "final"
if (!file.exists(dataDir)) {
  cat("Downloading ZIP file from web....\n")
  download.file(fileUrl, destfile = destFile, method = "curl")
  unzip(destFile)
  cat("unzip complete....\n")
  # Did the download and unzip work as expected?
  if (!file.exists(dataDir))
    stop("ZIP file download and unzip process failed: data directory not created.")
} else cat("Using existing data directory: ", dataDir, "\n")
## Using existing data directory: final
Text data has been provided in German, Finnish, Russian, and English. We will only be using the English data.
list.files("final", include.dirs = T)
## [1] "de_DE" "en_us" "fi_FI" "ru_RU"
Later in this report the three English text files are combined into a single corpus for ease of processing. First, how big is this data?
finfo <- file.info(list.files(file.path("final","en_US"),full.names=T, pattern="*.txt"))
fsizeMB <- finfo$size/(1024^2)
names(fsizeMB) <- basename(rownames(finfo))
fsizeMB
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt
##          200.4242          196.2775          159.3641
This is a rather large corpus requiring long processing times on my Mac if read as one big blob. It presented a whole slew of problems I never encountered earlier.
The first problem was that the news data was read just fine on my Mac, but not on the Windows machine that I had to migrate to later. I switched to the read_lines() function from the readr package, which gets around this problem.
library(readr)  # for read_lines()
twit <- read_lines("final/en_US/en_US.twitter.txt")
news <- read_lines("final/en_US/en_US.news.txt")
blog <- read_lines("final/en_US/en_US.blogs.txt")
lc <- c(length(blog), length(news), length(twit))
names(lc) <- c("blog", "news", "twit")
barplot(lc, col = c("pink", "green", "lightblue"), main="Line Count in Corpus")
The news file is larger on disk, but the line count is higher for Twitter. We can guess that is because tweets are short.
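As a quick check of that guess, we can compare average line lengths across the three sources (a small sketch using the line vectors already in memory; byte counts are used to sidestep any remaining invalid multibyte characters):
# Average characters (bytes) per line in each source -- tweets should be the shortest
avg.chars <- c(blog = mean(nchar(blog, type = "bytes")),
               news = mean(nchar(news, type = "bytes")),
               twit = mean(nchar(twit, type = "bytes")))
round(avg.chars)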
To determine word counts, I could have written my own function but decided to use off-the-shelf packages to get familiar with them, starting with the tm package. This worked fine until I ran the TermDocumentMatrix() function on the news data: it never came back after 12-14 hours and the R session died multiple times.
So, I switched to using the “quanteda” package as recommended by mentors and others for further analysis.
library(quanteda)
# Note: iconv() returns NA for lines it cannot convert to ASCII; paste() then turns
# those into the literal string "NA", which is most likely why "na" shows up later as a frequent token.
corpus.blog <- corpus(paste(iconv(blog, "UTF-8", "ASCII"), collapse = "\n"))
corpus.twit <- corpus(paste(iconv(twit, "UTF-8", "ASCII"), collapse = "\n"))
corpus.news <- corpus(paste(iconv(news, "UTF-8", "ASCII"), collapse = "\n"))
dfm.blog <- dfm(corpus.blog)
dfm.twit <- dfm(corpus.twit)
dfm.news <- dfm(corpus.news)
wc <- c(rowSums(dfm.blog), rowSums(dfm.news), rowSums(dfm.twit))
names(wc) <- c("blog","news","twit")
barplot(wc, col = c("pink", "green", "lightblue"), main="Word Count in Corpus")
wc
## blog news twit
## 24736471 33517462 35340301
cat("Total ", sum(wc)/(1024^2), " million words in the database.\n")
## Total 89.25842 million words in the database.
The quanteda package has a similar flow to the tm package, but it processed all of the files without problems.
The next task was to combine all the data into one corpus. Since the goal of the capstone project is to predict words, we do not necessarily need to keep the sources of our training data separate.
Combining corpora is very easy in quanteda with the "+" operator:
corpus.all <- corpus.blog + corpus.news + corpus.twit
dfm.all <- dfm(corpus.all)
# Get Word count from Corpus.
# Define function for reuse later.
getwc <- function(mydfm) {
  wc <- rowSums(mydfm)
  names(wc) <- c("blog", "news", "twit")
  wc
}
wc.all <- getwc(dfm.all)
wc.all
## blog news twit
## 24736471 33517462 35340301
The combined word counts match up exactly with the individual ones earlier. We can proceed with generating interesting info.
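A one-line sanity check (a minimal sketch) confirms the match programmatically:
# The per-source totals from the combined dfm should equal the individual ones
all.equal(wc.all, wc)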
We can answer some simple/fun/interesting/academic questions to demonstrate that processing in quanteda will be adequate for our purpose.
topfeatures.all <- topfeatures(dfm.all, 20)
topfeatures.all
## . the , to a and of ! i
## 5424403 3604592 3339277 2117364 1836267 1800544 1473181 1382555 1306250
## in " for is you that it on :
## 1262582 1005690 884585 822494 776799 760898 699970 653700 576062
## with ?
## 550084 531829
It looks like we will need to clean out punctuation and single characters later.
We can remove profane words. A file with profane words has already been downloaded locally.
proflist <- scan("profanity_list.txt", what="")
dfm.noprofanity <- dfm(corpus.all, remove = c(proflist))
wc.noprofanity <- getwc(dfm.noprofanity)
wc.profanity <- wc.all - wc.noprofanity
wc.profanity
## blog news twit
## 26443 9150 137850
From these raw counts, Twitter users appear to be more profane than users of the other media.
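Raw counts favor the larger sources, so a fairer comparison (a quick sketch using the counts already computed) normalizes by each source's total word count:
# Profane words per million words, by source
round(wc.profanity / wc.all * 1e6)
Even on a per-million-word basis, Twitter comes out well ahead of blogs and news.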
dfm.nopun <- dfm(corpus.all, remove_punct = T, remove_numbers = T, remove_separators = T,
remove = c(proflist))
wc.nopun <- getwc(dfm.nopun)
topfeatures.nopun <- topfeatures(dfm.nopun, 20)
topfeatures.nopun
## the to a and of i in for is
## 3604592 2117364 1836267 1800544 1473181 1306250 1262582 884585 822494
## you that it on with my na was at
## 776799 760898 699970 653700 550084 486497 476976 466440 459849
## be have
## 423970 412184
dfm.nostop <- dfm(corpus.all, remove_punct = T, remove_numbers = T,remove_separators = T,
remove = c(stopwords("english"), proflist))
topfeatures.nostop <- topfeatures(dfm.nostop, 20)
wc.nostop <- getwc(dfm.nostop)
topfeatures.nostop
## na just said one like can get time new good
## 476976 247627 242483 218769 213320 191906 185902 167965 156514 150006
## now day love know people go back see great first
## 146297 141700 137087 128968 119301 115294 114347 111689 106293 103717
textplot_wordcloud(dfm.nostop, min.freq=100000, random.order=F, rot.per=0.25, colors=RColorBrewer::brewer.pal(8,"Dark2"))
We need to clean the data some more, for example removing "na" (most likely an artifact of the iconv() step above, where unconvertible lines become the literal string "NA"), but that is a task for another milestone.
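As a preview of that cleanup, the token can simply be dropped from the document-feature matrix with quanteda's dfm_remove() (a one-line sketch; dfm.nostop.clean is just a placeholder name):
# Drop the "na" artifact token from the cleaned dfm
dfm.nostop.clean <- dfm_remove(dfm.nostop, "na")
A better long-term fix is probably to call iconv() with sub = "" so that unconvertible characters are dropped instead of whole lines turning into NA.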
barcolumns <- c("All words", "No profanity", "No punctuations", "No stop words")
wc.sums <- c(sum(wc.all), sum(wc.noprofanity), sum(wc.nopun), sum(wc.nostop))
wc.sums <- wc.sums/1e6
names(wc.sums) <- barcolumns
barplot(wc.sums, col = c("red","blue","green","yellow"),
main="Corpus word count after cleaning", ylab = "in millions")
We can create n-grams, which give a frequency matrix of commonly occurring word sequences in our corpus. Once this matrix is available, the final implementation of the capstone can look up the most frequently occurring continuations of a phrase. The quanteda package makes this extremely easy.
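To illustrate on a toy sentence before running it on the full corpus (a minimal sketch; tokens_ngrams() is the stand-alone equivalent of the ngrams argument used below):
# Toy example: 2-grams are adjacent word pairs, joined with "_" by quanteda
# Expected tokens: "to_be" "be_or" "or_not" "not_to" "to_be"
tokens_ngrams(tokens("to be or not to be"), n = 2)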
# Memory & runtime issue: have to save nfgram2.dfm and load for knitr
#ngram2 <- tokens(corpus.all, what="word", ngrams=2L, remove_punct=T)
#ngram2.dfm <- dfm(ngram2)
load("ngram2.dfm")
barplot(topfeatures(ngram2.dfm, 20)/1000, las=2, main ="Top 20 2-grams in the corpus", ylab = "(in thousands)", col="green")
rm(ngram2.dfm)
# Memory & runtime issue: have to save ngram3.dfm and load for knit
#ngram3 <- tokens(corpus.all, what="word", ngrams=3L, remove_punct=T)
#ngram3.dfm <- dfm(ngram3)
load("ngram3.dfm")
barplot(topfeatures(ngram3.dfm, 20)/1000, las=2, main ="Top 20 3-grams in the corpus", ylab = "(in thousands)", col="green")
rm(ngram3.dfm)
At the time of completing this assignment, creating the 4-gram tokens runs out of memory on my computer. I will figure out a solution in the future.
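One likely workaround (a sketch only, not yet run on this corpus; sample.lines, corpus.sample and the 10% sample size are placeholders) is to build the 4-grams from a random sample of lines rather than the full files:
# Sketch: sample ~10% of the lines from each source before building the corpus,
# which should make 4-gram tokenization fit in memory.
set.seed(1234)
sample.lines <- c(sample(blog, length(blog) %/% 10),
                  sample(news, length(news) %/% 10),
                  sample(twit, length(twit) %/% 10))
corpus.sample <- corpus(paste(iconv(sample.lines, "UTF-8", "ASCII", sub = ""), collapse = "\n"))
ngram4 <- tokens(corpus.sample, what = "word", ngrams = 4L, remove_punct = TRUE)
ngram4.dfm <- dfm(ngram4)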