This is an exploratory analysis of the SwiftKey dataset. The goal of the analysis is to identify features that can be used in a predictive model. Based on the analysis, n-gram frequencies appear to be useful features for a future predictive text model.
We will download the archive from the URL provided by Coursera and extract the data from it.
setwd('~/coursera/capstone')
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# download and unpack the archive only if it is not already present
if (!file.exists("./Coursera-SwiftKey.zip")) { download.file(url, "./Coursera-SwiftKey.zip") }
if (!file.exists("./final")) { unzip("./Coursera-SwiftKey.zip") }
list.files("./final", include.dirs=FALSE, recursive=TRUE)
## [1] "de_DE/de_DE.blogs.txt" "de_DE/de_DE.news.txt"
## [3] "de_DE/de_DE.twitter.txt" "en_US/en_US.blogs.txt"
## [5] "en_US/en_US.news.txt" "en_US/en_US.twitter.txt"
## [7] "fi_FI/fi_FI.blogs.txt" "fi_FI/fi_FI.news.txt"
## [9] "fi_FI/fi_FI.twitter.txt" "ru_RU/ru_RU.blogs.txt"
## [11] "ru_RU/ru_RU.news.txt" "ru_RU/ru_RU.twitter.txt"
We can see that the archive contains 3 different datasets (blogs, news and tweets) for 4 different languages (de, en, fi, ru). We will concentrate on the English language, so we will only use the data from the en_US directory.
We will read the data from the files into 3 different variables: blogs, news and tweets, skipping embedded NUL characters.
blogs <- readLines('final/en_US/en_US.blogs.txt', skipNul = TRUE)
news <- readLines('final/en_US/en_US.news.txt', skipNul = TRUE)
tweets <- readLines('final/en_US/en_US.twitter.txt', skipNul = TRUE)
Let’s see how big those documents are and how many lines of text they contain.
files_stats <- rbind(
c('blogs', format(object.size(blogs), units = 'MB'), length(blogs)),
c('news', format(object.size(news), units = 'MB'), length(news)),
c('tweets', format(object.size(tweets), units = 'MB'), length(tweets))
)
colnames(files_stats) <- c('File name', 'Size', '# of lines')
knitr::kable(files_stats, caption = 'File statistics')
| File name | Size | # of lines |
|---|---|---|
| blogs | 248.5 Mb | 899288 |
| news | 249.6 Mb | 1010242 |
| tweets | 301.4 Mb | 2360148 |
Let’s take a look at what’s in those files.
head(blogs, n=3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
head(news, n=3)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
head(tweets, n=3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
Now let’s go deeper and see whether the frequencies of different words and phrases can help us build a strategy for our model.
# give the JVM more heap; this must be set before rJava/RWeka are loaded
options(java.parameters = "-Xmx8192m")
library(rJava)
library(tm)
library(RWeka)
library(SnowballC)
library(stringr)
The data can contain a lot of information we don’t really need right now, such as numbers, punctuation, and words and characters from other languages. We also have to prepare our training and test data in the same way, so it’s better to write a function for it.
Below is the list of steps we need to take to clean our data:
- use the iconv function to convert the text from latin1 to ASCII, removing all non-convertible bytes
- convert the text to lower case
- remove punctuation
- remove numbers
- collapse repeated whitespace
- trim leading and trailing whitespace
prepareString <- function (x) {
x <- iconv(x, 'latin1', 'ASCII', sub='')
x <- tolower(x)
x <- removePunctuation(x)
x <- removeNumbers(x)
x <- stripWhitespace(x)
str_trim(x)
}
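As a quick sanity check, applying the function to a short made-up string (not from the corpus) should lower-case the text and strip punctuation, digits and extra whitespace:
prepareString("Hello, World!! Meet me at 5 o'clock.")
## [1] "hello world meet me at oclock"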
We will also add a helper function that builds a Document Term Matrix of n-grams, to keep our code cleaner.
buildDTMatrix <- function (corpus, n, language = 'english') {
  DocumentTermMatrix(
    corpus,
    control = list(
      # split the text into n-grams of exactly length n
      tokenize = function(x) {
        NGramTokenizer(x, Weka_control(min = n, max = n))
      },
      language = language,
      stemming = FALSE
    )
  )
}
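To illustrate what the tokenizer produces (a made-up phrase, not from the corpus), setting min = max = 2 in Weka_control yields only the overlapping word pairs:
NGramTokenizer("to be or not to be", Weka_control(min = 2, max = 2))
## [1] "to be"  "be or"  "or not" "not to" "to be"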
Different text formats can have their own particular word and phrase frequencies, so to get a common picture we will combine all the documents we have. We also have to limit the amount of data we put into our training set because of memory and time constraints. We will take 100,000 random lines from the combined dataset.
all <- c(blogs, news, tweets)
set.seed(1)
inTrain <- sample(length(all), 100000)
training <- all[inTrain]
We will also collapse all the documents into one, so that we get a single row with the summed-up frequency of every word.
corpus <- Corpus(VectorSource(paste(training, collapse = ' ')))
# content_transformer keeps the documents in the plain-text format tm expects
corpus <- tm_map(corpus, content_transformer(prepareString))
We will calculate frequencies for 1-, 2- and 3-grams.
options(mc.cores=1)  # run the Java-based tokenizer on a single core
unigram_dtm <- buildDTMatrix(corpus, 1)
bigram_dtm <- buildDTMatrix(corpus, 2)
trigram_dtm <- buildDTMatrix(corpus, 3)
Let’s sort the frequencies to see how fast they decline.
# each DTM has a single row (the collapsed document); sort its counts in decreasing order
matrix <- as.matrix(unigram_dtm)
uni_sorted <- matrix[1,order(matrix[1,], decreasing = TRUE)]
matrix <- as.matrix(bigram_dtm)
bi_sorted <- matrix[1,order(matrix[1,], decreasing = TRUE)]
matrix <- as.matrix(trigram_dtm)
tri_sorted <- matrix[1,order(matrix[1,], decreasing = TRUE)]
barplot(uni_sorted[1:200], main = 'Single word frequency')
barplot(bi_sorted[1:200], main = 'BiGram frequency')
barplot(tri_sorted[1:200], main = 'TriGram frequency')
We can see that the frequencies of the n-grams decline very fast. Let’s check what percentage of the n-grams is enough to cover about 70% of the text.
# the cutoffs (the top 1600, 250000 and 1180000 n-grams) were chosen by inspection
# so that each covers roughly 70% of all n-gram occurrences
coverage <- rbind(
  c(round(1600 * 100 / length(uni_sorted)), round(sum(uni_sorted[1:1600]) * 100 / sum(uni_sorted))),
  c(round(250000 * 100 / length(bi_sorted)), round(sum(bi_sorted[1:250000]) * 100 / sum(bi_sorted))),
  c(round(1180000 * 100 / length(tri_sorted)), round(sum(tri_sorted[1:1180000]) * 100 / sum(tri_sorted)))
)
colnames(coverage) <- c('percent of ngrams', 'coverage')
rownames(coverage) <- c('unigrams', 'bigrams', 'trigrams')
knitr::kable(coverage, caption = 'Coverage')
| | percent of ngrams | coverage |
|---|---|---|
| unigrams | 2 | 70 |
| bigrams | 27 | 71 |
| trigrams | 63 | 70 |
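The cutoffs above were found by trial and error; as a hypothetical alternative (not part of the original analysis), a small helper could compute the minimal number of top-ranked n-grams needed for a given coverage directly:
ngramsForCoverage <- function (sorted_freqs, target = 0.7) {
  # index of the first position where the cumulative share of occurrences reaches the target
  which(cumsum(sorted_freqs) / sum(sorted_freqs) >= target)[1]
}
ngramsForCoverage(uni_sorted)  # should be close to 1600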
We can see that the frequencies of the n-grams decline very fast (especially for unigrams and bigrams), which means that a small fraction of all n-grams is enough to cover the major part of a text. With the frequencies we have obtained, we can build a model using the Stupid Back-off algorithm, given by the formula: \[P(\omega_{i}\mid\omega^{i-1}_{i-k+1})= \begin{cases} p(\omega^{i}_{i-k+1}), & \text{if } \omega^{i}_{i-k+1} \text{ is found}\\ \lambda(\omega^{i-1}_{i-k+1})\,P(\omega_{i}\mid\omega^{i-1}_{i-k+2}), & \text{otherwise} \end{cases}\]
Where \(p(\cdot)\) are pre-computed and stored probabilities, and \(\lambda(\cdot)\) are back-off weights.
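As a rough illustration of how this could be applied to the tables computed above (a hypothetical sketch, not the final model: it uses relative frequencies instead of stored probabilities and a constant back-off factor, with 0.4 being a commonly used value):
scoreNextWord <- function (history, word, tables, lambda = 0.4) {
  # tables is a list of named count vectors, e.g. list(uni_sorted, bi_sorted, tri_sorted)
  hist_words <- tail(strsplit(history, ' ')[[1]], length(tables) - 1)
  for (k in seq(length(hist_words), 0)) {
    ngram <- paste(c(tail(hist_words, k), word), collapse = ' ')
    freq <- tables[[k + 1]][ngram]
    if (!is.na(freq)) {
      # relative frequency of the n-gram given its context, discounted once per back-off step
      denom <- if (k == 0) sum(tables[[1]]) else tables[[k]][paste(tail(hist_words, k), collapse = ' ')]
      return(as.numeric(lambda^(length(hist_words) - k) * freq / denom))
    }
  }
  0
}
scoreNextWord('thanks for', 'the', list(uni_sorted, bi_sorted, tri_sorted))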