The goal of Coursera's Data Science Capstone project is to develop a Shiny application that recommends the next word on the fly, while the user is typing, based on the words entered so far. The application will be built on a predictive text model trained on English text datasets from various sources. Text models and their distinctive features, such as n-grams, belong to the field of Natural Language Processing, for which R offers packages such as tm and RWeka.
This report presents an exploratory analysis of the proposed English text datasets and lays the groundwork for further analysis, model building, and model selection. The first part covers setup (loading packages and getting the data), the second part shows some immediate dataset features and the cleaning of the data, and the third part builds simple n-grams from the datasets.
The analysis shows that Twitter data has fewer words per line and slightly shorter words on average, and that the cleaned data does not differ much from the full unprocessed dataset, except that it contains no words shorter than 3 characters. The 2-grams and 3-grams calculated on smaller samples show that the most frequent combinations contain common stop words. The conclusion highlights some remarks on how to proceed with prediction models.
The data we are going to use is compiled from 3 sources: news, blogs, and tweets. The dataset also contains similar texts in other languages, but we are going to use only the English datasets and build an English-only predictive model.
# Download and unzip the dataset only if it is not already present locally
handle <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")) {
  download.file(url = handle, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}
# Read each source as a character vector, one element per line
blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
The tm and stringi packages are used for text mining and for manipulating the text datasets, quanteda for fast word frequency calculation, SnowballC for stemming, RWeka for building n-grams, and dplyr, scales and ggplot2 for basic data manipulation and plotting.
library(tm)
library(stringi)
library(quanteda)
library(SnowballC)
library(RWeka)
library(dplyr)
library(scales)
library(ggplot2)
This report was built in the following environment (shown here for reproducibility):
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.5 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_2.1.0 scales_0.4.0 dplyr_0.4.3 RWeka_0.4-29
## [5] SnowballC_0.5.1 quanteda_0.9.6-9 stringi_1.1.1 tm_0.6-2
## [9] NLP_0.1-9
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.5 ca_0.64 knitr_1.13
## [4] magrittr_1.5 RWekajars_3.9.0-1 munsell_0.4.3
## [7] colorspace_1.2-6 lattice_0.20-33 R6_2.1.2
## [10] plyr_1.8.4 stringr_1.0.0 tools_3.3.0
## [13] parallel_3.3.0 grid_3.3.0 gtable_0.2.0
## [16] data.table_1.9.6 DBI_0.4-1 htmltools_0.3.5
## [19] assertthat_0.1 yaml_2.1.13 digest_0.6.9
## [22] Matrix_1.2-6 rJava_0.9-8 formatR_1.4
## [25] evaluate_0.9 slam_0.1-34 rmarkdown_0.9.6
## [28] chron_2.3-47
The most important features we need concern the words in the dataset: the total number of words, the number of characters per word, the number of words per line, and so on. These and other features are given below, separately for each text source:
# Mean number of words per line, per source
Words.per.Line <- sapply(list(blogs, news, twitter),
                         function(x) summary(stri_count_words(x))['Mean'])
# Total characters divided by total words, per source
Chars.per.Word <- sapply(list(blogs, news, twitter),
                         function(x) stri_stats_general(x)[c("Chars")] / stri_stats_latex(x)[c("Words")])
# Combine line, character and word counts with the derived ratios
stats <- data.frame(
  Source = c("blogs", "news", "twitter"),
  t(rbind(
    sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words', ],
    Words.per.Line,
    Chars.per.Word)
  ))
print(stats)
##    Source   Lines     Chars    Words Words.per.Line Chars.per.Word
## 1   blogs  899288 206824382 37570839          41.75       5.504918
## 2    news 1010242 203223154 34494539          34.41       5.891459
## 3 twitter 2360148 162096241 30451170          12.75       5.323153
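To make the two derived columns concrete, here is a minimal base R sketch on a toy vector. The three lines are hypothetical, and splitting on whitespace is only a rough stand-in for stringi's word counting, which follows Unicode word-boundary rules. It also assumes, as stringi's documentation describes, that the Chars count includes whitespace, so the ratio is somewhat larger than the true average word length.

```r
# Toy stand-in for one text source (hypothetical lines, not from the dataset).
lines <- c("the quick brown fox", "hello world", "a b c d e f")

# Split each line on runs of whitespace (an approximation of stri_count_words).
tokens <- strsplit(trimws(lines), "\\s+")

words_per_line <- lengths(tokens)       # words in each line
total_words    <- sum(words_per_line)   # all words in the source
total_chars    <- sum(nchar(lines))     # all characters, spaces included

mean_words_per_line <- mean(words_per_line)
chars_per_word      <- total_chars / total_words

print(c(Words.per.Line = mean_words_per_line, Chars.per.Word = chars_per_word))
```

Because spaces and punctuation are counted as characters, values such as 5.5 in the table should be read as characters of text per word rather than as letters per word.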
As expected, Twitter data has fewer words per line, while the number of characters per word is similar across all three sources. For further analysis the sources are merged and used as a single corpus, and the overall word frequency is plotted.
# Document-feature matrix over the merged corpus, then the most frequent features
word_freq <- dfm(c(blogs, news, twitter), verbose = FALSE)
most_freq <- topfeatures(word_freq)
freq.df <- data.frame(words = names(most_freq), freq = most_freq)
ggplot(freq.df, aes(words, freq)) + geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = comma)
Cleaning the text corpus can be a very tiresome process. For the purposes of this report it is useful to show which steps are likely to be included in the final prediction model, but on a much smaller subset of the data. The sample is 0.1% of the corpus.
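Before the tm pipeline that follows, the effect of the basic cleaning steps can be previewed on a single string. This is a base R sketch, not the report's method: the regular expressions only roughly mirror tm's removePunctuation, removeNumbers and stripWhitespace, the input line and the helper name clean_line are hypothetical, and stemming is left out because it needs SnowballC.

```r
# Hypothetical raw line, used only for illustration.
raw <- "Hello,   World! It's 2016 -- R is GREAT :)"

# Apply the cleaning steps in the same order as the tm pipeline (minus stemming).
clean_line <- function(x) {
  x <- tolower(x)                     # lower-case everything
  x <- gsub("[[:punct:]]+", "", x)    # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)    # drop digits
  x <- gsub("\\s+", " ", x)           # collapse runs of whitespace
  trimws(x)                           # strip leading and trailing spaces
}

print(clean_line(raw))
# [1] "hello world its r is great"
```

Note that removing punctuation turns "It's" into "its", which is one reason more careful normalization may be needed for the final model.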
set.seed(2121)  # fixed seed so the sample is reproducible
# Normalize encoding; "utf-8-mac" is a macOS encoding name and may not be
# available on other platforms
blogs <- iconv(blogs, to = "utf-8-mac")
news <- iconv(news, to = "utf-8-mac")
twitter <- iconv(twitter, to = "utf-8-mac")
# Draw 0.1% of the lines from each source
sample_corpus <- c(sample(blogs, length(blogs) * 0.001),
                   sample(news, length(news) * 0.001),
                   sample(twitter, length(twitter) * 0.001))
corpus <- VCorpus(VectorSource(sample_corpus))
# Cleaning steps: lower-case, strip punctuation and numbers, collapse whitespace, stem
corpus <- tm_map(corpus, content_transformer(tolower), lazy = TRUE)
corpus <- tm_map(corpus, removePunctuation, lazy = TRUE)
corpus <- tm_map(corpus, removeNumbers, lazy = TRUE)
corpus <- tm_map(corpus, stripWhitespace, lazy = TRUE)
corpus <- tm_map(corpus, stemDocument, lazy = TRUE)
corpus <- tm_map(corpus, PlainTextDocument, lazy = TRUE)
The word frequency of the cleaned data (from the sample) is plotted below:
words_matrix <- TermDocumentMatrix(corpus)
words_corpus <- findFreqTerms(words_matrix, lowfreq=100)
words_corpus <- words_corpus[!is.na(words_corpus)]
words_corpus_freq <- rowSums(as.matrix(words_matrix[words_corpus,]))
words_corpus_freq <- data.frame(word=names(words_corpus_freq), frequency=words_corpus_freq)
data <- words_corpus_freq
num <- 10
title <- "Top words by freq"
df <- data[order(-data$frequency),][1:num,]
ggplot(df, aes(x = seq(1:num), y = frequency)) +
geom_bar(stat = "identity", width = 0.80) +
coord_cartesian(xlim = c(0, num+1)) +
labs(title = title) +
xlab("Words") +
ylab("Count") +
scale_x_discrete(breaks=c(1:num), limits=c(1:num), labels=rownames(df)[1:num]) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The sample is then tokenized into 2-grams and 3-grams, and the frequency of each token is calculated:
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bi_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = bi_tokenizer))
bi_corpus <- findFreqTerms(bi_matrix, lowfreq=40)
bi_corpus <- bi_corpus[!is.na(bi_corpus)]
bi_corpus_freq <- rowSums(as.matrix(bi_matrix[bi_corpus,]))
bi_corpus_freq <- data.frame(word=names(bi_corpus_freq), frequency=bi_corpus_freq)
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tri_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = tri_tokenizer))
tri_corpus <- findFreqTerms(tri_matrix, lowfreq=2)
tri_corpus <- tri_corpus[!is.na(tri_corpus)]
tri_corpus_freq <- rowSums(as.matrix(tri_matrix[tri_corpus,]))
tri_corpus_freq <- data.frame(word=names(tri_corpus_freq), frequency=tri_corpus_freq)
data2 <- bi_corpus_freq
num <- 10
title <- "Top 2-grams"
df2 <- data2[order(-data2$frequency),][1:num,]
ggplot(df2, aes(x = seq(1:num), y = frequency)) +
geom_bar(stat = "identity", width = 0.80) +
coord_cartesian(xlim = c(0, num+1)) +
labs(title = title) +
xlab("Words") +
ylab("Count") +
scale_x_discrete(breaks=c(1:num), limits=c(1:num), labels=rownames(df2)[1:num]) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
data3 <- tri_corpus_freq
num <- 10
title <- "Top 3-grams"
df3 <- data3[order(-data3$frequency),][1:num,]
ggplot(df3, aes(x = seq(1:num), y = frequency)) +
geom_bar(stat = "identity", width = 0.80) +
coord_cartesian(xlim = c(0, num+1)) +
labs(title = title) +
xlab("Words") +
ylab("Count") +
scale_x_discrete(breaks=c(1:num), limits=c(1:num), labels=rownames(df3)[1:num]) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This report shows some basic features of the datasets. A profanity filter is not considered, nor is advanced normalization or cleaning of the data. For example, the 2-grams and 3-grams are mostly composed of stop words, while n-grams containing non-stop words are the most interesting cases for prediction. Prediction models should tackle these problems, as well as the problem of optimizing model training, because even for these small datasets n-gram tokenization takes a very long time.
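As an illustration of the remarks above, the following base R sketch counts 2-grams in a toy sentence, flags those made up entirely of stop words, and looks up the most frequent continuation of a word. It is a sketch under stated assumptions, not the model this project will use: the text, the stop-word list, and the helper names make_ngrams and predict_next are all hypothetical, and the lookup does no smoothing or backoff.

```r
# Build n-grams from a token vector by pasting n consecutive tokens together.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical toy text and a tiny hand-picked stop-word list.
text <- "i want to go to the park and i want to see the lake"
stop_words <- c("i", "to", "the", "and", "of", "a")

tokens  <- strsplit(text, " ", fixed = TRUE)[[1]]
bigrams <- make_ngrams(tokens, 2)
counts  <- sort(table(bigrams), decreasing = TRUE)

# Flag 2-grams in which every word is a stop word.
all_stop <- vapply(strsplit(names(counts), " ", fixed = TRUE),
                   function(w) all(w %in% stop_words), logical(1))

print(data.frame(bigram = names(counts), frequency = as.vector(counts),
                 all_stop = all_stop))

# Most frequent continuation of a previous word (first match in sorted counts).
predict_next <- function(prev) {
  hits <- counts[startsWith(names(counts), paste0(prev, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^\\S+ ", "", names(hits)[1])
}

print(predict_next("want"))
# [1] "to"
```

Counting with table() on a plain character vector avoids building a dense term-document matrix, which may be one way to reduce the long tokenization times noted above, although this has not been measured on the real datasets.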