This report presents an exploratory analysis of the Coursera-SwiftKey English data sets. There are three data sets, containing snippets from blog posts, news articles and tweets. First, I describe how the data sets were loaded and preprocessed. Then, I examine the distribution of words per sentence and list the top 20 words with and without stop words for each data set, discussing some interesting findings. Lastly, I present my future plans for the prediction algorithm and Shiny application.
The required libraries are loaded.
library(readr)
library(quanteda)
There are three data sets involved: the first (en_US.blogs.txt) contains snippets from blog posts, the second (en_US.news.txt) contains snippets from news articles, and the third (en_US.twitter.txt) contains tweets from Twitter. A look at the first few pages of each data set suggests that they have different writing styles and focus on different topics, so they will be treated separately in this analysis.
The data sets are read in using readr::read_lines, in binary mode and assuming UTF-8 encoding. Binary mode is used so that the operating system does not perform any transformations while loading the data sets. UTF-8 is assumed because the Linux file command-line tool reports that encoding for all three data sets, which I verified by visual inspection in Vim.
blogs <- read_lines(file("final/en_US/en_US.blogs.txt", "rb", encoding="UTF-8"))
news <- read_lines(file("final/en_US/en_US.news.txt", "rb", encoding="UTF-8"))
twitter <- read_lines(file("final/en_US/en_US.twitter_noNUL.txt", "rb", encoding="UTF-8"))
The Twitter data set required special preprocessing because it contains embedded NUL characters, on which readr::read_lines aborts. These NULs were replaced with spaces using the following Linux command line:
tr '\000' ' ' < en_US.twitter.txt > en_US.twitter_noNUL.txt
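For completeness, the same replacement could also have been done within R by working on the raw bytes; a minimal sketch, assuming the original file is at final/en_US/en_US.twitter.txt:
# Read the raw bytes, replace each NUL (0x00) with a space (0x20), and write
# the cleaned bytes back out. The file paths are assumptions for illustration.
twitter_path <- "final/en_US/en_US.twitter.txt"
raw_bytes <- readBin(twitter_path, what = "raw", n = file.info(twitter_path)$size)
raw_bytes[raw_bytes == as.raw(0x00)] <- as.raw(0x20)
writeBin(raw_bytes, "final/en_US/en_US.twitter_noNUL.txt")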
To verify that the data sets have been loaded successfully, I compare the size in bytes of each loaded data set with the size reported by ls (the Linux directory listing command):
210160014 en_US.blogs.txt
205811889 en_US.news.txt
167105338 en_US.twitter_noNUL.txt
From the Linux file command, it is known that the data sets are CRLF-terminated, i.e. each line ends with two bytes, which read_lines strips when loading. Therefore, the following statements should all be true:
sum(nchar(blogs, type="bytes")) + 2*length(blogs) == 210160014
## [1] TRUE
sum(nchar(news, type="bytes")) + 2*length(news) == 205811889
## [1] TRUE
sum(nchar(twitter, type="bytes")) + 2*length(twitter) == 167105338
## [1] TRUE
Voila! The data sets appear to have been successfully loaded.
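As an additional check that was not part of the verification above, base R’s validUTF8 function (available since R 3.3.0) could confirm that every loaded line is valid UTF-8; a minimal sketch (output not shown):
# Each sum counts lines that are not valid UTF-8 and should therefore be 0.
sum(!validUTF8(blogs))
sum(!validUTF8(news))
sum(!validUTF8(twitter))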
In the Unicode UTF-8 encoding, punctuation marks such as double quotation marks, single quotation marks, apostrophes and hyphens can each be represented by several different code points. I normalize the data so that each of these punctuation marks is encoded only by its ASCII symbol. This is done because any R text manipulation function can be expected to handle ASCII, whereas I do not want to assume that it also handles the other Unicode representations; for example, a tool that removes punctuation might recognize only the ASCII punctuation symbols and miss the alternative encodings.
gsub_fast <- function(pattern, replacement, x) {
  gsub(pattern, replacement, x, fixed = TRUE, useBytes = TRUE)
}
normalize_data <- function(data) {
  # Double quotation marks
  data <- gsub_fast("\u0093", "\"", data)
  data <- gsub_fast("\u0094", "\"", data)
  data <- gsub_fast("\u201c", "\"", data)
  data <- gsub_fast("\u201d", "\"", data)
  # Single quotation marks and apostrophes
  data <- gsub_fast("\u0091", "'", data)
  data <- gsub_fast("\u0092", "'", data)
  data <- gsub_fast("\u2018", "'", data)
  data <- gsub_fast("\u2019", "'", data)
  # Dashes
  data <- gsub_fast("\u0096", "-", data)
  data <- gsub_fast("\u0097", "--", data)
  data <- gsub_fast("\u2013", "-", data)
  data <- gsub_fast("\u2014", "--", data)
  # Bullet points
  data <- gsub_fast("\u0095", "", data) # \u0095 is a bullet point.
  data <- gsub_fast("\u00F8", "", data) # \u00F8 is probably standing in for a bullet point.
  # To be continued...
  # Since gsub(useBytes = TRUE) drops the encoding declaration, redeclare it as UTF-8.
  Encoding(data) <- "UTF-8"
  data
}
blogs <- normalize_data(blogs)
news <- normalize_data(news)
twitter <- normalize_data(twitter)
The tools used to identify the non-ASCII punctuation were tools::showNonASCII and the r12a Unicode code converter. Due to time constraints, this normalization is not yet complete, but I expect that completing it will not significantly affect any findings.
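For reference, a minimal sketch of how tools::showNonASCII can be applied to a random sample of lines to spot the remaining non-ASCII characters (the sample size of 10,000 is an arbitrary illustrative choice):
# Print sampled blog lines that still contain non-ASCII characters,
# with those characters shown as escaped byte codes.
set.seed(1)
tools::showNonASCII(sample(blogs, 10000))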
For profanity removal, I chose the list from the Banned Word List website, the reasons being that the list is free and conservative, i.e. it does not include double entendres or words that are only offensive in context. I may add to the list later.
profanity <- read_lines(file("swearWords.txt", "rb", encoding="UTF-8"))
A basic line count summary table can quickly be produced as follows:
c(blogs = length(blogs), news = length(news), twitter = length(twitter))
## blogs news twitter
## 899288 1010242 2360148
However, each line can contain multiple sentences, which makes lines a less than ideal unit of analysis. Therefore, I use the quanteda::tokenize function to split the data sets into sentences. This sentence segmenter is not perfect, but I expect that any mistakes will not be statistically significant.
blogs.sentence <- tokenize(blogs, what = "sentence", simplify = TRUE)
news.sentence <- tokenize(news, what = "sentence", simplify = TRUE)
twitter.sentence <- tokenize(twitter, what = "sentence", simplify = TRUE)
For the word count summary, I present histograms of words per sentence (i.e. sentence length) for each data set. I do so because Twitter has a 140-character limit and I expect this to show up clearly in the histograms.
I split the sentences into words…
blogs.sentence.words <- tokenize(blogs.sentence, removePunct = TRUE)
news.sentence.words <- tokenize(news.sentence, removePunct = TRUE)
twitter.sentence.words <- tokenize(twitter.sentence, removePunct = TRUE)
… and count the number of words per sentence, or sentence length.
blogs.sentence.length <- sapply(blogs.sentence.words, length)
news.sentence.length <- sapply(news.sentence.words, length)
twitter.sentence.length <- sapply(twitter.sentence.words, length)
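As an aside, base R’s lengths function (available since R 3.2.0) computes the same per-element counts and is typically faster than sapply; an equivalent sketch:
# Equivalent, vectorised word-per-sentence counts (requires R >= 3.2.0).
blogs.sentence.length <- lengths(blogs.sentence.words)
news.sentence.length <- lengths(news.sentence.words)
twitter.sentence.length <- lengths(twitter.sentence.words)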
I take the median and 95% quantile of the sentence lengths:
c(median = median(blogs.sentence.length), quantile(blogs.sentence.length, probs = 0.95))
## median 95%
## 13 37
c(median = median(news.sentence.length), quantile(news.sentence.length, probs = 0.95))
## median 95%
## 16 36
c(median = median(twitter.sentence.length), quantile(twitter.sentence.length, probs = 0.95))
## median 95%
## 7 19
The results show that news sentences have the highest median length (16 words), followed by blog sentences (13 words) and tweet sentences (7 words). Also, in every data set at least 95% of the sentences are 50 words or fewer, which motivates the cutoff used in the blog and news histograms below.
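This can be checked directly; a quick sketch (output not shown):
# Proportion of sentences with at most 50 words in each data set;
# each value should be at least 0.95.
mean(blogs.sentence.length <= 50)
mean(news.sentence.length <= 50)
mean(twitter.sentence.length <= 50)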
The histograms follow:
hist(blogs.sentence.length[blogs.sentence.length <= 50], freq = FALSE,
main = "Histogram of blog sentence length, with cutoff at 50 words",
xlab = "Blog sentence length")
hist(news.sentence.length[news.sentence.length <= 50], freq = FALSE,
main = "Histogram of news sentence length, with cutoff at 50 words",
xlab = "News sentence length")
hist(twitter.sentence.length, freq = FALSE,
main = "Histogram of tweet sentence length",
xlab = "Tweet sentence length")
The histograms show that tweet sentences are usually very short (fewer than 10 words), while blog and news sentences are usually longer (more than 10 words).
In this report, I will also show the top 20 words for each data set with and without stop words. I do so because I expect another interesting finding here, due to the different data set sources.
First, keeping stop words but removing profanity, I create a document-feature matrix for each data set:
blogs.dfm.withStop <- dfm(blogs, ignoredFeatures = profanity)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 899,288 documents
## ... indexing features: 373,541 feature types
## ... removed 65 features, from 77 supplied (glob) feature types
## ... created a 899288 x 373476 sparse dfm
## ... complete.
## Elapsed time: 65.82 seconds.
news.dfm.withStop <- dfm(news, ignoredFeatures = profanity)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1,010,242 documents
## ... indexing features: 326,752 feature types
## ... removed 50 features, from 77 supplied (glob) feature types
## ... created a 1010242 x 326702 sparse dfm
## ... complete.
## Elapsed time: 61.5 seconds.
twitter.dfm.withStop <- dfm(twitter, ignoredFeatures = profanity)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2,360,148 documents
## ... indexing features: 418,945 feature types
## ... removed 65 features, from 77 supplied (glob) feature types
## ... created a 2360148 x 418880 sparse dfm
## ... complete.
## Elapsed time: 120.69 seconds.
I then obtain the top 20 words:
topfeatures(blogs.dfm.withStop, 20)
## the and to a of i in that is
## 1858296 1092671 1066341 898210 875328 774102 594283 460576 432533
## it for you with was on my this as
## 403372 363319 298244 286590 278308 274437 270716 258893 223697
## have be
## 218770 208491
topfeatures(news.dfm.withStop, 20)
## the to and a of in for that is
## 1972403 901187 885352 875429 771095 674225 351258 346828 284181
## on with said was he it at as his
## 266847 254792 250408 228959 228695 219262 212783 187425 157648
## i be
## 157390 152197
topfeatures(twitter.dfm.withStop, 20)
## the to i a you and for in of is
## 936567 787269 722297 609341 547678 438192 385056 377946 359149 358708
## it my on that me be at with your have
## 294728 291780 277000 234579 202302 187634 186486 173414 171153 168602
The word lists appear quite similar to each other, though there are differences: “my” is common in both the blog and Twitter data sets, “me” is common in the Twitter data set, while “said”, “he” and “his” are common in the news data set.
Second, removing both stop words and profanity, I create a document-feature matrix for each data set:
blogs.dfm.noStop <- dfm(blogs, ignoredFeatures = c(profanity, stopwords("english")))
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 899,288 documents
## ... indexing features: 373,541 feature types
## ... removed 239 features, from 251 supplied (glob) feature types
## ... created a 899288 x 373302 sparse dfm
## ... complete.
## Elapsed time: 62.39 seconds.
news.dfm.noStop <- dfm(news, ignoredFeatures = c(profanity, stopwords("english")))
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1,010,242 documents
## ... indexing features: 326,752 feature types
## ... removed 224 features, from 251 supplied (glob) feature types
## ... created a 1010242 x 326528 sparse dfm
## ... complete.
## Elapsed time: 63.51 seconds.
twitter.dfm.noStop <- dfm(twitter, ignoredFeatures = c(profanity, stopwords("english")))
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2,360,148 documents
## ... indexing features: 418,945 feature types
## ... removed 239 features, from 251 supplied (glob) feature types
## ... created a 2360148 x 418706 sparse dfm
## ... complete.
## Elapsed time: 97.09 seconds.
Again, I obtain the top 20 words:
topfeatures(blogs.dfm.noStop, 20)
## one will just like can time get know now people
## 124812 112700 100642 98715 98303 88605 70873 60277 60148 59450
## also new even first day make back us see really
## 55337 54480 52073 50890 50669 50656 50571 50217 50058 49948
topfeatures(news.dfm.noStop, 20)
## said will one new also can two year just
## 250408 108168 83226 70341 58763 58696 57355 54815 53168
## first last time like state people years get city
## 52651 51529 51159 49434 48058 47589 46828 43503 37118
## now percent
## 36108 34427
topfeatures(twitter.dfm.noStop, 20)
## just like get love good will day can thanks rt
## 151022 122022 112321 106176 100750 94683 90074 89735 89532 89301
## now one know u great time today go lol new
## 83797 82017 79821 77139 75995 75587 72850 72405 70026 69624
The word lists now appear less similar to each other, possibly reflecting the different topics and writing styles of each data set source.
Regarding sentence length, tweet sentences are distinctly shorter than news and blog sentences, which is likely due to the 140-character limit for tweets. This may be significant because I expect the limit to change writing style and therefore the words used.
Regarding the top 20 word lists, the differences suggest that prediction would be more accurate with a separate predictor for each data set source. This assumes, though, that a predictor application can detect whether the user is writing a blog post, a news article or a tweet.
I plan to build a combination (ensemble) predictor that is a weighted combination of predictors for each category: blog, news or Twitter. The predictors will be based on Markov bigram models; I will also test trigram models if hardware permits. The Shiny application will let the user choose the sentence category, which gives a higher weight to the corresponding predictor, or leave the category as unknown. The application will show the top 5 predictions.
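To make the bigram idea concrete, here is a minimal sketch of a single category predictor; bigram_counts and its columns are hypothetical placeholders rather than part of the current analysis:
# Hypothetical bigram Markov predictor: given the last word typed, return the
# top n most frequent following words. bigram_counts is assumed to be a
# data.frame with columns word1, word2 and count built from a training corpus.
predict_next <- function(last_word, bigram_counts, n = 5) {
  candidates <- bigram_counts[bigram_counts$word1 == tolower(last_word), ]
  candidates <- candidates[order(candidates$count, decreasing = TRUE), ]
  head(candidates$word2, n)
}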
If hardware permits, the predictor will be trained on the whole training data set, since more data means that the predictor’s model contains more knowledge. If hardware is an issue, then sampling will be needed.
Testing will be done in two ways. First, test data will be taken from the training set. This reflects the case where the training set is a person’s writing history and that person is in a situation that he or she has been in before. Second, test data will be taken from very recent news articles and tweets to test the predictor’s general knowledge.
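A minimal sketch of the first kind of test, sampling test sentences directly from the training data (the sample size is an arbitrary illustrative choice):
# Draw an in-sample test set of 10,000 sentences from the blog training data.
set.seed(2)
blogs.test.insample <- sample(blogs.sentence, 10000)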
I do not yet know how to handle words that are not in the predictor’s vocabulary. Perhaps individual word frequencies will be considered? Certainly, a category predictor that is asked to predict from words outside its vocabulary will be weighted lower.