March 15, 2016
This is the milestone report for the Data Science Capstone Project.
The goal of this report was to build a simple model for the relationship between words, as a first step in creating a predictive text mining application.
The following sections describe my methods for analysing the datasets.
Load the necessary libraries
library(tm)
library(knitr)
library(ggplot2)
The dataset can be downloaded from here
Included are three data files containing text sampled from blogs, news articles, and Twitter feeds. Only the English versions were used.
## Read in the files
blog <- readLines("data/final/en_US/en_US.blogs.txt")
news <- readLines("data/final/en_US/en_US.news.txt")
twitter <- readLines("data/final/en_US/en_US.twitter.txt")
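As a side note, on some platforms the number of lines returned by readLines() can depend on how the connection is opened, for example when a file contains embedded NULs or control characters. The following is a minimal sketch, not the code used for this report, of reading through a binary connection. The helper `read_text_file` and the temporary demo file are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: read a text file through a binary connection.
# Assumption: the file is UTF-8 encoded; embedded NULs are skipped.
read_text_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Small self-contained demo using a temporary file instead of the real data.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)

demo_lines <- read_text_file(demo_path)
print(length(demo_lines))
```

Comparing the line counts from this kind of helper with those from a plain readLines() call is a cheap sanity check before any further processing.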
Combined, the datasets contain more than 3.3 million lines of text and roughly 386 million characters.
kable(data.frame(
"Data File" = c("Blogs", "News", "Twitter"),
"Line Count" = c(length(blog), length(news), length(twitter)),
"Character Count" = c( sum(nchar(blog)), sum(nchar(news)), sum(nchar(twitter)))
))
| Data.File | Line.Count | Character.Count |
|---|---|---|
| Blogs | 899288 | 208361438 |
| News | 77259 | 15683765 |
| Twitter | 2360148 | 162384825 |
To predict the next word as accurately as possible while staying reasonably efficient, the data set needed to be cleaned up. Numbers, punctuation, special characters, and stop words were removed. In addition, words were reduced to their stems.
The next sections use the tm package; you can find an introduction here
A random sample of 7,500 lines, drawn from the combined data sets, is used to reduce the memory footprint.
set.seed(90210)
combined_set <- c(blog, news, twitter)
sample_set <- sample(combined_set, size = 7500, replace = TRUE)
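# convert to ASCII, dropping characters that cannot be represented
sample_set <- iconv(sample_set, "latin1", "ASCII", sub = "")
# create the corpus; VCorpus is used because in recent tm versions Corpus()
# returns a SimpleCorpus, which ignores custom tokenizer functions
swiftkey <- VCorpus(VectorSource(sample_set))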
# clean up objects
rm(blog, news, twitter, sample_set, combined_set)
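To make the encoding step concrete, here is a small sketch of what the iconv() call does to text containing non-ASCII characters. The sample strings are hypothetical and are not taken from the corpus. With `sub = ""`, bytes that cannot be converted to ASCII are expected to be dropped rather than causing an NA result.

```r
# Hypothetical sample strings containing non-ASCII characters.
raw_text <- c("caf\u00e9 society", "plain ascii", "na\u00efve approach")

# Same call shape as in the report: non-convertible bytes are replaced by `sub`.
ascii_text <- iconv(raw_text, "latin1", "ASCII", sub = "")

# Check that every remaining character is in the ASCII range.
all_ascii <- all(vapply(
  ascii_text,
  function(s) all(utf8ToInt(s) < 128L),
  logical(1)
))

print(ascii_text)
print(all_ascii)
```

The trade-off is that accented words lose characters, which is acceptable for an exploratory look at English text but may need revisiting for the final model.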
Tidy up the sample by converting it to lower case and removing unwanted elements.
swiftkey <- tm_map(swiftkey, content_transformer(tolower))
# remove the most commonly used words in the English language
swiftkey <- tm_map(swiftkey, removeWords, stopwords("english"))
# reduce inflected (or sometimes derived) words to their word stem
swiftkey <- tm_map(swiftkey, stemDocument)
# clean up; whitespace is collapsed last because the removals leave gaps
swiftkey <- tm_map(swiftkey, removePunctuation)
swiftkey <- tm_map(swiftkey, removeNumbers)
swiftkey <- tm_map(swiftkey, stripWhitespace)
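The effect of these cleaning steps is easier to see on a single sentence. The following base-R sketch mimics the lower-casing, punctuation, number, stop word, and whitespace steps without tm. It is an illustration only: `clean_text` and `toy_stopwords` are hypothetical names, the stop word list is a tiny stand-in for stopwords("english"), the order of steps differs slightly from the tm pipeline, and stemming is left out.

```r
# Hypothetical, tiny stop word list (the report uses tm's stopwords("english")).
toy_stopwords <- c("the", "a", "of", "and", "in")

# Base-R approximation of the cleaning steps; stemming is not included.
clean_text <- function(x, stopwords = toy_stopwords) {
  x <- tolower(x)                                   # common case
  x <- gsub("[[:punct:]]+", "", x)                  # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)                  # drop numbers
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # drop whole-word stop words
  x <- gsub("\\s+", " ", x)                         # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_text("The 3 Cats sat in the garden, and purred!")
print(cleaned)
# → "cats sat garden purred"
```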
The final section models n-grams and plots the frequencies of the most common unigrams, bigrams, and trigrams.
For reference, see the following links if you are interested in learning more about text mining with R
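# tokenizers that split a document into unigrams, bigrams, and trigrams
# (ngrams() and words() come from the NLP package, which is loaded with tm)
unigram <- function(x) unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)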
bigram <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
trigram <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
tdm1 <- TermDocumentMatrix(swiftkey, control = list(tokenize = unigram))
tdm2 <- TermDocumentMatrix(swiftkey, control = list(tokenize = bigram))
tdm3 <- TermDocumentMatrix(swiftkey, control = list(tokenize = trigram))
rm(swiftkey)
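# keep the 15 most frequent terms for each n-gram size;
# each matrix is removed right after use to free memory
freq1 <- sort(rowSums(as.matrix(tdm1)), decreasing = TRUE)[1:15]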
rm(tdm1)
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)[1:15]
rm(tdm2)
freq3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)[1:15]
rm(tdm3)
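For readers unfamiliar with n-grams, the tokenizers and frequency counts above boil down to sliding a window of n words over each document and tallying the results. The sketch below shows that idea in base R on two toy documents. `make_ngrams` and the example sentences are hypothetical; the report itself relies on ngrams() and TermDocumentMatrix().

```r
# Hypothetical helper: build all n-grams from a vector of tokens
# by sliding a window of length n across it.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Toy documents (not from the corpus), split on single spaces.
docs <- c("thank you very much", "thank you for coming")
token_list <- strsplit(docs, " ", fixed = TRUE)

# Collect bigrams across all documents and count how often each occurs.
bigrams <- unlist(lapply(token_list, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

print(head(bigram_freq, 3))
```

Here "thank you" appears in both documents, so it ends up at the top of the sorted table, which is the same ranking that the bar charts visualise for the real sample.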
Unigrams ordered by frequency
x1 <- data.frame(shingles = names(freq1), fq = freq1)
ggplot(x1, aes(x=reorder(shingles, fq), y=fq)) +
geom_bar(stat="identity") + xlab("Term(s)") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Bigrams ordered by frequency
x2 <- data.frame(shingles = names(freq2), fq = freq2)
ggplot(x2, aes(x=reorder(shingles, fq), y=fq)) +
geom_bar(stat="identity") + xlab("Term(s)") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Trigrams ordered by frequency
x3 <- data.frame(shingles = names(freq3), fq = freq3)
ggplot(x3, aes(x=reorder(shingles, fq), y=fq)) +
geom_bar(stat="identity") + xlab("Term(s)") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
This was a good first step towards the final goal of creating a prediction algorithm. I can see some large hurdles ahead, such as the quality of the data and performance, as well as choosing the overall prediction methodology. As next steps, I will continue to clean and model the data and look for ways to make it more useful.
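One possible direction for the prediction step, shown here only as a hedged sketch and not as a chosen methodology, is to look up the most frequent bigram that starts with the last typed word. The counts and the `predict_next` function below are hypothetical and use toy data rather than the frequencies computed in this report.

```r
# Toy bigram counts (hypothetical data, not taken from the corpus).
bigram_counts <- c("thank you" = 5, "thank god" = 2, "you are" = 4, "you can" = 3)

# Hypothetical predictor: return the most frequent word that follows `word`,
# or NA if the word never starts a known bigram.
predict_next <- function(word, counts) {
  parts  <- strsplit(names(counts), " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  matches <- first == word
  if (!any(matches)) return(NA_character_)
  second[matches][which.max(counts[matches])]
}

print(predict_next("thank", bigram_counts))
# → "you"
print(predict_next("unseen", bigram_counts))
# → NA
```

A real application would need a fallback for unseen words, and it would have to account for the fact that stop words and word endings were removed during cleaning.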