Summary

During the Data Science Capstone, we aim to mimic SwiftKey's best-known feature: predicting what you'll write next based on the last few words you just typed. The simplest approach moves from one word to the next; more context-aware approaches analyse text sentence by sentence and paragraph by paragraph, trying to find the best possible match for what you'll write next.

In this milestone, the task is to explore the training dataset, perform the usual exploratory data analysis, start assessing how I would build my own prediction algorithm, and publish the report to RPubs.

Data analysis

Loading the tools and data

I start my report by loading all the packages I will use throughout the analysis, along with the data itself - I chose the English files, as English is easiest for me - and converting it into a more manageable form: a complete dataset of words and sentences that is easier to manipulate.

library(knitr)
library(stringr)
library(ggplot2)
suppressMessages(library(tm)) # suppressing messages to shorten output
library(RWeka)
# I already did this, so I'm commenting this part out, though you can remove the "#" character in order to run it.
# download and load all 3 files
# if(!file.exists("./data")){dir.create("./data")}
# fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# download.file(fileUrl, destfile = "./data/Coursera-SwiftKey.zip")
# unzip('./data/Coursera-SwiftKey.zip', exdir = "./data")

# loading the text datasets
blogs_txt <- readLines("./Coursera-SwiftKey/en_US/en_US.blogs.txt")
twitter_txt <- readLines("./Coursera-SwiftKey/en_US/en_US.twitter.txt")
## Warning in readLines("./Coursera-SwiftKey/en_US/en_US.twitter.txt"): line
## 167155 appears to contain an embedded nul
## Warning in readLines("./Coursera-SwiftKey/en_US/en_US.twitter.txt"): line
## 268547 appears to contain an embedded nul
## Warning in readLines("./Coursera-SwiftKey/en_US/en_US.twitter.txt"): line
## 1274086 appears to contain an embedded nul
## Warning in readLines("./Coursera-SwiftKey/en_US/en_US.twitter.txt"): line
## 1759032 appears to contain an embedded nul
news_txt <- readLines("./Coursera-SwiftKey/en_US/en_US.news.txt")

# processing the text
blogs <- iconv(blogs_txt, "latin1", "ASCII", sub="")
twitter <- iconv(twitter_txt, "latin1", "ASCII", sub="")
news <- iconv(news_txt, "latin1", "ASCII", sub="")

Get a basic feel of the data

Before diving into a deeper analysis, we should first assess our documents’ size in terms of Megabytes (MB), as well as lines of text and total number of characters.

# Check file sizes in MBytes and create dataframes
blogs_size <- round(file.info("./Coursera-SwiftKey/en_US/en_US.blogs.txt")$size/1024^2, 1)
twitter_size <- round(file.info("./Coursera-SwiftKey/en_US/en_US.twitter.txt")$size/1024^2, 1)
news_size <- round(file.info("./Coursera-SwiftKey/en_US/en_US.news.txt")$size/1024^2, 1)
all_sizes <- c(blogs_size, twitter_size, news_size)
# Count total number of lines for all 3 text files
blogs_length <- length(blogs_txt)
twitter_length <- length(twitter_txt)
news_length <- length(news_txt)
all_length <- c(blogs_length, twitter_length, news_length)
# Count total number of characters per line for all files (nchar is vectorised, so no loop is needed)
blgs <- nchar(blogs_txt)
twtt <- nchar(twitter_txt)
nws <- nchar(news_txt) # named "nws" so the cleaned "news" text vector above is not overwritten
# Merging all the info for a basic summary statistics of all 3 documents
basic_summary <- data.frame(all_sizes, all_length, rbind(sum(blgs), sum(twtt), sum(nws)), row.names = c("blogs", "twitter", "news"))
colnames(basic_summary) <- c("Size (in MB)", "Number of lines", "Number of chars")
kable(basic_summary)
          Size (in MB)   Number of lines   Number of chars
blogs            200.4            899288         206824505
twitter          159.4           2360148         162096031
news             196.3           1010242         203223159

Finally, after these calculations, we can see that, unsurprisingly, "blogs" is our biggest file both in size and in number of characters, totalling 200.4 MB and roughly 207 million characters.

The twitter dataset, despite having more than double the entries of our second-largest dataset - close to 2.4 million tweets! - is the only one well under 200 million characters. This was to be expected, as until fairly recently the platform capped tweets at 140 characters per entry.
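As a quick sanity check of that cap, one can look at the maximum tweet length and the share of tweets sitting exactly at the limit (a minimal sketch using the per-line character counts computed above):

# quick check of the 140-character cap: maximum tweet length and share of tweets exactly at the cap
max(twtt)
mean(twtt == 140)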

Character count analysis

# creating matrix with most important stats from character counts in the three files
blogs_char_summary <- pastecs::stat.desc(blgs)[c(4, 5, 8, 9, 13)]
twitter_char_summary <- pastecs::stat.desc(twtt)[c(4, 5, 8, 9, 13)]
news_char_summary <- pastecs::stat.desc(nws)[c(4, 5, 8, 9, 13)]
mid_char_stats <- round(rbind(blogs_char_summary, twitter_char_summary, news_char_summary), 2)
mid_char_stats_matrix <- matrix(mid_char_stats, 3, dimnames = list(c("blogs", "twitter", "news"), c("Minimum","Maximum", "Median", "Mean", "Standard deviations")))
quantiles <- matrix(c(quantile(blgs, probs = 0.999), quantile(twtt, probs = 0.999), quantile(nws, probs = 0.999)), 3, dimnames = list(c(), c("99.9% quantile")))
final_char_stat <- cbind(mid_char_stats_matrix, quantiles)
kable(final_char_stat)
          Minimum   Maximum   Median     Mean   Standard deviations   99.9% quantile
blogs           1     40833      156   229.99                258.66             1994
twitter         2       140       64    68.68                 37.23              140
news            1     11384      185   201.16                133.22             1007

Further analysis of the number of characters per line confirms that our twitter dataset has a maximum of 140 characters per line, while news reaches roughly 80 times that, at 11,384 characters, and blogs nearly quadruples the news value with an astonishing maximum of 40,833 characters in its longest entry!

On the other end of the spectrum, 23 blog posts consist of just one word; whether this is genuine content or a text-processing artefact is hard to tell. News also has 12 entries with a single character, which makes me lean more towards a processing error than towards correct data.

It's hard to convey any valuable information in two characters or fewer; the twitter dataset has 2 such entries, the most informative being "D;".
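A quick way to judge these suspicious entries is to pull them out and eyeball them directly (a minimal sketch using the per-line character counts computed above; the thresholds simply target the extremes discussed here):

# inspect the shortest entries directly to judge whether they are genuine content
blogs_txt[blgs == 1]     # the single-character blog entries
news_txt[nws == 1]       # the single-character news entries
twitter_txt[twtt <= 2]   # the shortest tweets, e.g. "D;"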

Word count analysis

# Count total number of words per line for all files (str_count is vectorised, so no loop is needed)
blgs_w <- str_count(blogs_txt, "\\w+")
twtt_w <- str_count(twitter_txt, "\\w+")
news_w <- str_count(news_txt, "\\w+")

# creating matrix with most important stats from word counts in the three files
mid_word_stats <- round(rbind(pastecs::stat.desc(blgs_w), pastecs::stat.desc(twtt_w), pastecs::stat.desc(news_w))[,c(4, 5, 8, 9, 13)], 2)
mid_word_stats_matrix <- matrix(mid_word_stats, 3, dimnames = list(c("blogs", "twitter", "news"), c("Minimum","Maximum", "Median", "Mean", "Standard deviations")))
quantiles <- matrix(c(quantile(blgs_w, probs = 0.999), quantile(twtt_w, probs = 0.999), quantile(news_w, probs = 0.999)), 3, dimnames = list(c(), c("99.9% Quantiles")))
final_word_stat <- cbind(mid_word_stats_matrix, quantiles)
kable(final_word_stat)
          Minimum   Maximum   Median    Mean   Standard deviations   99.9% Quantiles
blogs           1      6851       29   42.60                 47.43               366
twitter         1        47       12   13.14                  7.12                32
news            1      1928       32   35.26                 23.37               179

A simple word-count analysis further confirms our expectations about post sizes on each platform: twitter is the shortest due to its character limit, while blogs are the longest.

Characters to number of words ratio

I thought it would be interesting to check the ratio of characters to words per line on each platform, as twitter users might shorten words in the old SMS style, packing the most information into the fewest characters.

# check characters/word ratio per platform
blgs_r <- blgs/blgs_w
twtt_r <- twtt/twtt_w
news_r <- nws/news_w
char_to_word_stats <- round(rbind(pastecs::stat.desc(blgs_r), pastecs::stat.desc(twtt_r), pastecs::stat.desc(news_r))[,c(4, 5, 8, 9, 13)], 2)
char_to_word_stats_matrix <- matrix(char_to_word_stats, 3, dimnames = list(c("blogs", "twitter", "news"), c("Minimum","Maximum", "Median", "Mean", "Standard deviations")))
quantiles <- matrix(c(quantile(blgs/blgs_w, probs = 0.999), quantile(twtt/twtt_w, probs = 0.999), quantile(nws/news_w, probs = 0.999)), 3, dimnames = list(c(), c("99.9% quantile")))
final_char_to_word_stats_matrix <- cbind(char_to_word_stats_matrix, quantiles)
kable(final_char_to_word_stats_matrix)
          Minimum   Maximum   Median   Mean   Standard deviations   99.9% quantile
blogs         1.0    136.00     5.33   5.45                  1.07            12.00
twitter       1.5     42.33     5.17   5.28                  0.92            10.50
news          1.0     96.00     5.70   5.73                  0.73            10.25

We can confirm our suspicions from this table, but let's try to visualise it a bit better. As an example, I took a sample of tweets and picked one that shows the phenomenon I was talking about: "I'm coo… Jus at work hella tired r u ever in cali".
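For the curious, here is a rough sketch of how such tweets could be pulled out (the ratio and word-count thresholds are arbitrary choices for illustration only):

# sample a few short tweets with a low character-to-word ratio to eyeball the SMS-style shortening
set.seed(123)  # arbitrary seed, for illustration only
sample(twitter_txt[twtt_r < 4 & twtt_w > 8], 3)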

Visualising our data

Let's now look at this ratio in a more visual form.

blgs_df <- data.frame("chars" = blgs, "words" = blgs_w, "char2word" = blgs_r, "Type" = "blogs")
twtt_df <- data.frame("chars" = twtt, "words" = twtt_w, "char2word" = twtt_r, "Type" = "twitter")
news_df <- data.frame("chars" = nws, "words" = news_w, "char2word" = news_r, "Type" = "news")

combined <- rbind(blgs_df, twtt_df, news_df)
# charts
g_chars <- ggplot(combined, aes(chars, fill = Type)) + geom_density(alpha = 0.2) + xlim(0,1000) + labs(x = "Number of characters per line", y = "Density", title = "Density function for total number of characters per line")
g_words <- ggplot(combined, aes(words, fill = Type)) + geom_density(alpha = 0.2) + xlim(0,150) + labs(x = "Number of words per line", y = "Density", title = "Density function for total number of words per line")
g_chars2word <- ggplot(combined, aes(char2word, fill = Type)) + geom_density(alpha = 0.2) + xlim(0,12) + labs(x = "Character to word ratio per line", y = "Density", title = "Density function for character to word ratio per line")
g_chars2word # display the ratio density plot discussed below
## Warning: Removed 1692 rows containing non-finite values (stat_density).

Based on this graph, it seems clear that news articles use longer and, presumably, more complex words than the other, less formal platforms.

plot(ecdf(blgs_r), xlim = c(0,10), verticals = TRUE, do.points = FALSE, main = "Cumulative Distribution Functions per document", xlab = "Characters to word ratio", ylab = "% of document")
plot(ecdf(news_r), xlim = c(0,10), verticals = TRUE, do.points = FALSE, add = TRUE, col = 'red', main = "Cumulative Distribution Functions per document", xlab = "Characters to word ratio", ylab = "% of document")
plot(ecdf(twtt_r), xlim = c(0,10), verticals = TRUE, do.points = FALSE, add = TRUE, col = 'blue', main = "Cumulative Distribution Functions per document", xlab = "Characters to word ratio", ylab = "% of document")
legend(0.2, 0.9, legend = c("blogs", "twitter", "news"),
       col = c("black", "blue", "red"), lty = 1, cex = 0.8) # all three curves are drawn as solid lines

As for twitter and blogs, the two curves track each other closely, though twitter's rises slightly earlier and more steeply, as the cumulative distribution function plot shows.

For the sake of being thorough, let’s visualize the characters and word count per line of all 3 datasets.

g_chars
## Warning: Removed 14419 rows containing non-finite values (stat_density).
g_words
## Warning: Removed 29447 rows containing non-finite values (stat_density).

Sampling from the data

The full data is a bit too much for my current processing setup, so I'll randomly sample 5% of each dataset to build a corpus. For reproducibility purposes, I'll set the seed to the value of the new year - 2018!

set.seed(2018)
sample_data <- c(sample(blogs, length(blogs) * 0.05),
                 sample(twitter, length(twitter) * 0.05),
                 sample(news, length(news) * 0.05))

Using the tm package on the sampled data, I now convert everything to lower case and remove punctuation, numbers and extra whitespace. Since we're trying to predict text, I'm keeping the stopwords (i.e. common words such as "a" or "the"), as people type them constantly in day-to-day writing.

# build corpus
corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower)) # content_transformer keeps the documents as PlainTextDocuments
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

Building tokenizers

Now it’s time to build tokenizers using n-grams.

N-grams are sequences of N consecutive words. For example, a 1-gram tokenizer simply counts individual words, a 2-gram tokenizer groups consecutive words into pairs, and a 3-gram tokenizer groups them into triples.

Let’s take a simple example using the following sentence: “I love data science”.

Using n-grams, we get the following counts and combinations (a quick code check follows the list):

  • 1-gram: count of 4, i.e. “I” and “love” and “data” and “science”;

  • 2-gram: count of 3, i.e. “I love”, “love data” and “data science”;

  • 3-gram: count of 2, i.e. “I love data” and “love data science”;

  • 4-gram: count of 1, i.e. “I love data science”.
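As a check of those counts, here is a minimal sketch applying RWeka's NGramTokenizer (the same function used for the real tokenizers below) to the example sentence:

# tokenize the example sentence into 1-, 2- and 3-grams with RWeka
example <- "I love data science"
NGramTokenizer(example, Weka_control(min = 1, max = 1)) # "I" "love" "data" "science"
NGramTokenizer(example, Weka_control(min = 2, max = 2)) # "I love" "love data" "data science"
NGramTokenizer(example, Weka_control(min = 3, max = 3)) # "I love data" "love data science"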

For the current project, I'll go up to 3-grams and build a matrix of all the combinations. Higher-order n-grams or more complex approaches such as neural networks could likely do better; the latter are in fact used in the newer, improved versions of many translation applications.

uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# sample data
uni_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = uni_tokenizer))
bi_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = bi_tokenizer))
tri_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = tri_tokenizer))

Taking these matrices, I keep only the n-grams that appear at least 50 times, as in theory the most frequent combinations will prove the most useful.

uni_corpus <- findFreqTerms(uni_matrix,lowfreq = 50)
bi_corpus <- findFreqTerms(bi_matrix,lowfreq=50)
tri_corpus <- findFreqTerms(tri_matrix,lowfreq=50)

I then turn these matrices into data frames with word and frequency columns.

uni_corpus_freq <- rowSums(as.matrix(uni_matrix[uni_corpus,]))
uni_corpus_freq <- data.frame(word=names(uni_corpus_freq), frequency=uni_corpus_freq, row.names = NULL)
bi_corpus_freq <- rowSums(as.matrix(bi_matrix[bi_corpus,]))
bi_corpus_freq <- data.frame(word=names(bi_corpus_freq), frequency=bi_corpus_freq, row.names = NULL)
tri_corpus_freq <- rowSums(as.matrix(tri_matrix[tri_corpus,]))
tri_corpus_freq <- data.frame(word=names(tri_corpus_freq), frequency=tri_corpus_freq, row.names = NULL)
kable(head(tri_corpus_freq))
word              frequency
a beautiful day          52
a bit more               74
a bit of                225
a bunch of              155
a chance to             129
a couple of             358

For simplicity's sake, I created a function for plotting the n-gram frequency tables I set up.

plot_n_grams <- function(data, title, num) {
  # keep the "num" most frequent n-grams and plot them as a bar chart
  df2 <- data[order(-data$frequency),]
  df2 <- df2[1:num,]
  df2$word <- factor(df2$word, levels = unique(as.character(df2$word)))
  ggplot(data = df2, aes(x = word, y = frequency)) +
    geom_bar(stat = "identity", fill = "red", colour = "black", width = 0.80) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}

Firstly I’ll plot the 1-gram frequency table:

plot_n_grams(uni_corpus_freq, "Top 10 Unigrams from sample", 10)

Secondly, the 2-gram frequency table:

plot_n_grams(bi_corpus_freq, "Top 10 Bigrams from sample", 10)

And finally, the 3-gram frequency table:

plot_n_grams(tri_corpus_freq, "Top 10 Trigrams from sample", 10)

Conclusion

As expected, the most common words are the so-called stopwords mentioned earlier, with "the" taking the top spot by far. Of the whole top 10, only 3 are anything other than stopwords.

Even in the 2-gram plot, we see that 6 of the top 10 are combinations of these stopwords.

It is only when we observe the 3-gram plots that more valuable information seems to arise as we start to see common expressions such as “a lot of” or “looking forward to”.
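As a rough illustration of where this leads, here is a minimal sketch of a naive next-word lookup built on the tri_corpus_freq table above. This is only a starting point, not the final algorithm: the eventual app would need backoff to bigrams and unigrams, plus some form of smoothing.

# naive trigram lookup: given the last two typed words, return the most frequent third word
predict_next_word <- function(last_two, ngram_df = tri_corpus_freq) {
  pattern <- paste0("^", tolower(last_two), " ")
  matches <- ngram_df[grepl(pattern, ngram_df$word), ]
  if (nrow(matches) == 0) return(NA_character_) # no match: a real app would back off to bigrams
  best <- as.character(matches$word[which.max(matches$frequency)])
  tail(strsplit(best, " ")[[1]], 1)             # keep only the predicted third word
}
predict_next_word("looking forward")            # should return "to" if the trigram made the frequency cut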

Limitations:

Opportunities for improvement on building the app: