The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.
This document describes the major features identified in the SwiftKey data and briefly summarizes our plans for creating the prediction algorithm and Shiny app.
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this report is to describe the basic relationships we observed in the data and to prepare for building our first linguistic models.
First we need to download and extract the data:
# Download the SwiftKey corpus if it is not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
}

# Extract the archive into a "data" directory if it has not been unzipped yet
if (!dir.exists("data")) {
  unzip("Coursera-SwiftKey.zip", exdir = "data")
}
The data sets consist of text in four languages: German, English, Finnish, and Russian. Each language contains data from three sources: news, blogs, and Twitter. In this project, we focus only on the English data sets:
# Read the English data sets (UTF-8 encoding, skipping embedded NUL characters)
blogs <- readLines("data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Drop any characters that cannot be represented in UTF-8
twitter <- iconv(twitter, to = "UTF-8", sub = "")
news <- iconv(news, to = "UTF-8", sub = "")
blogs <- iconv(blogs, to = "UTF-8", sub = "")
First, let's explore the data sets in terms of file size:
# File sizes in megabytes
mb <- 1024 * 1024
blogs.size <- file.info("data/final/en_US/en_US.blogs.txt")$size / mb
news.size <- file.info("data/final/en_US/en_US.news.txt")$size / mb
twitter.size <- file.info("data/final/en_US/en_US.twitter.txt")$size / mb
Then, let us examine the data sets in terms of word count:
library(stringi)

# Word counts per line for each source
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
library(knitr)

summary <- data.frame(source = c("blogs", "news", "twitter"),
                      file.size.MB = c(blogs.size, news.size, twitter.size),
                      num.lines = c(length(blogs), length(news), length(twitter)),
                      num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
                      mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))

kable(summary, caption = "Data Summary",
      col.names = c("Dataset", "File Size (MB)", "Number of Lines",
                    "Number of Words", "Mean Number of Words"))
Dataset | File Size (MB) | Number of Lines | Number of Words | Mean Number of Words |
---|---|---|---|---|
blogs | 200.4242 | 899288 | 37541795 | 41.74613 |
news | 196.2775 | 1010242 | 34762303 | 34.40988 |
twitter | 159.3641 | 2360148 | 30092907 | 12.75043 |
The full data sets are large and slow to process, so we decided to work with a small random sample (0.5% of each source):
blogs.sample <- sample(blogs,length(blogs)*.5/100)
news.sample <- sample(news,length(news)*.5/100)
twitter.sample <- sample(twitter,length(twitter)*.5/100)
sample <- c(blogs.sample,news.sample,twitter.sample)
# remove the full data sets and per-source samples to free memory
rm(twitter,news,blogs,blogs.sample,news.sample,twitter.sample)
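Because sample() is random, the exact subset (and therefore the counts below) will differ between runs. For reproducibility, a seed could be set immediately before the sampling step above; a minimal sketch (the seed value is an arbitrary choice of ours, not part of the original analysis):

# Fix the random number generator state so that the 0.5% sample is reproducible;
# place this call before the calls to sample() above (the value 1234 is arbitrary)
set.seed(1234)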
After examining the size of the data sets and getting a feel for the amount of data they contain, we take an empirical approach to cleaning the data.
Using the tm package, we build a corpus from the sampled data and then apply the following transformations: convert to lower case, remove punctuation, remove numbers, remove English stop words, and strip extra whitespace.
library(tm)
library(RWeka)
library(ggplot2)

# Build a corpus from the sampled text and clean it
sample.corpus <- Corpus(VectorSource(sample))
sample.corpus <- tm_map(sample.corpus, content_transformer(tolower))
sample.corpus <- tm_map(sample.corpus, removePunctuation)
sample.corpus <- tm_map(sample.corpus, removeNumbers)
sample.corpus <- tm_map(sample.corpus, removeWords, stopwords("en"))
sample.corpus <- tm_map(sample.corpus, stripWhitespace)
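To sanity-check these transformations, we can inspect one cleaned document; the exact output depends on the random sample drawn above, so it is not reproduced here:

# Show the content of the first document in the cleaned corpus
as.character(sample.corpus[[1]])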
An n-gram is a contiguous sequence of n elements from a text or speech sample; the elements can be words, syllables, or letters. N-grams are typically collected from a text or speech corpus. An n-gram of size 1 is referred to as a unigram, size 2 as a bigram (or, less commonly, a digram), and size 3 as a trigram.
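As a quick illustration, applying the same RWeka tokenizer used below to a short made-up sentence yields its bigrams:

# Extract all bigrams from a toy sentence
NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
# yields: "the quick", "quick brown", "brown fox"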
The data structure used for this analysis is the term-document matrix, which relates each term (unigram, bigram, or trigram) to the documents in which it appears. The analysis below filters each term-document matrix to select the most frequent terms, and we split the analysis by the number of words in the n-grams.
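To make this structure concrete, here is a minimal toy example (the two short documents are made up for illustration and are not drawn from the SwiftKey data):

# A tiny corpus of two documents and its term-document matrix:
# rows are terms, columns are documents, and cells hold raw term counts
toy.corpus <- Corpus(VectorSource(c("the cat sat on the mat", "the dog sat")))
toy.tdm <- TermDocumentMatrix(toy.corpus)
inspect(toy.tdm)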
# RWeka tokenizers can fail when tm processes documents in parallel, so force a single core
options(mc.cores = 1)

# Tokenize the corpus into unigrams and build the term-document matrix
uniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
uniGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = uniGramTokenizer))
After collecting the unigrams and assembling the term-document matrix, we can rank the terms (unigrams) by their frequency:
# Keep only unigrams that occur at least 1000 times in the sample
freqTerms <- findFreqTerms(uniGramMatrix, lowfreq = 1000)
termFrequency <- rowSums(as.matrix(uniGramMatrix[freqTerms, ]))
termFrequency <- data.frame(unigram = names(termFrequency), frequency = termFrequency)
Then, we can plot the result as a bar chart, where each bar represents the number of occurrences of the term in the data collection.
g <- ggplot(termFrequency, aes(x=reorder(unigram, frequency), y=frequency)) +
geom_bar(stat = "identity") +
theme(legend.title=element_blank()) +
xlab("Unigram") + ylab("Frequency") +
labs(title = "Top Unigrams by Frequency")
print(g)
# Repeat the analysis for bigrams, keeping terms that occur at least 50 times
biGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
biGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = biGramTokenizer))
freqTerms <- findFreqTerms(biGramMatrix, lowfreq = 50)
termFrequency <- rowSums(as.matrix(biGramMatrix[freqTerms,]))
termFrequency <- data.frame(bigram=names(termFrequency), frequency=termFrequency)
g <- ggplot(termFrequency, aes(x=reorder(bigram, frequency), y=frequency)) +
geom_bar(stat = "identity") +
theme(legend.title=element_blank()) +
xlab("Bigram") + ylab("Frequency") +
labs(title = "Top Bigrams by Frequency")
print(g)
# Repeat the analysis for trigrams, keeping terms that occur at least 10 times
triGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
triGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = triGramTokenizer))
freqTerms <- findFreqTerms(triGramMatrix, lowfreq = 10)
termFrequency <- rowSums(as.matrix(triGramMatrix[freqTerms,]))
termFrequency <- data.frame(trigram=names(termFrequency), frequency=termFrequency)
g <- ggplot(termFrequency, aes(x=reorder(trigram, frequency), y=frequency)) +
geom_bar(stat = "identity") +
theme(legend.title=element_blank()) +
xlab("Trigram") + ylab("Frequency") +
labs(title = "Top Trigrams by Frequency")
print(g)
Another popular way of summarizing the data is a word cloud:
library(wordcloud)
wordcloud(sample.corpus, max.words = 30, scale=c(4,0.2), random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE)
Now that we have performed some exploratory analysis, we are ready to start building the predictive model(s) and eventually the data product. Below are high-level plans to achieve this goal: