This report is part of the capstone project of the Johns Hopkins Data Science Specialization offered on Coursera. It was developed in cooperation with SwiftKey, who build so-called smart keyboards and are experts in predictive text models. While the ultimate goal is to develop and present a prediction algorithm that suggests options for the next word while a text is being written, this intermediate report aims to provide an overview of the available datasets and their major features.
More specifically, the goals are to:
The training data on which this report is based can be downloaded from the Coursera site. It contains text files from news, blogs and Twitter messages in four languages (English, German, Russian, Finnish). For this exercise we will only consider the English text files. As a first step, we have a look at the files and their sizes.
setwd("~/coursera/capstone/final/en_US/")
file.info(dir())[,1:2]
##                        size isdir
## en_US.blogs.txt   210160014 FALSE
## en_US.news.txt    205811889 FALSE
## en_US.twitter.txt 167105338 FALSE
Some basic statistics can be obtained in R with the stringi package.
library(stringi)
blogs <- stri_read_lines("~/coursera//capstone/final/en_US/en_US.blogs.txt")
stri_stats_general(blogs)
##       Lines LinesNEmpty       Chars CharsNWhite
##      899288      899288   206824382   170389539
rm(blogs)
news <- stri_read_lines("~/coursera//capstone/final/en_US/en_US.news.txt")
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite
##     1010242     1010242   203223154   169860866
rm(news)
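To make the four reported quantities more concrete, the following base-R sketch recomputes them on a small made-up vector. It assumes that Lines is the number of lines, LinesNEmpty the number of lines containing at least one non-whitespace character, Chars the total number of characters, and CharsNWhite the number of non-whitespace characters; the variable names are illustrative only.

```r
# Toy input standing in for the lines of one text file (made-up data).
lines <- c("Hello world", "", "  two  spaces ")

# Number of lines, and number of lines with at least one non-whitespace character.
n_lines    <- length(lines)
n_nonempty <- sum(grepl("[^[:space:]]", lines))

# Total characters, and characters left after dropping all whitespace.
n_chars    <- sum(nchar(lines, type = "chars"))
n_nonwhite <- sum(nchar(gsub("[[:space:]]", "", lines), type = "chars"))

print(c(Lines = n_lines, LinesNEmpty = n_nonempty,
        Chars = n_chars, CharsNWhite = n_nonwhite))
```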
Even a small subset of the data gives a first impression of the word and bigram frequencies in the whole data set. For this report, we use the first 5000 lines of each file.
connection <- file("~/coursera//capstone/final//en_US/en_US.twitter.txt", "r", encoding = "UTF-8")
myTwitter <- readLines(connection,5000)
close(connection)
connection <- file("~/coursera//capstone/final//en_US/en_US.blogs.txt", "r", encoding = "UTF-8")
myBlogs <- readLines(connection,5000)
close(connection)
connection <- file("~/coursera//capstone/final//en_US/en_US.news.txt", "r", encoding = "UTF-8")
myNews <- readLines(connection,5000)
close(connection)
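Because the first lines of a file are not necessarily representative of the rest, a random sample is a possible alternative. The following is a minimal sketch, not the approach used in this report: sample_lines is a hypothetical helper, the seed is arbitrary, and the temporary file only stands in for the real data. Note that it reads the whole file into memory before sampling.

```r
# Hypothetical helper: draw up to n random lines from a text file,
# keeping them in their original order.
sample_lines <- function(path, n, seed = 42) {
  con <- file(path, "r", encoding = "UTF-8")
  on.exit(close(con))                      # always release the connection
  all_lines <- readLines(con, warn = FALSE)
  set.seed(seed)                           # make the sample reproducible
  idx <- sample.int(length(all_lines), min(n, length(all_lines)))
  all_lines[sort(idx)]
}

# Demonstration on a temporary file with 100 made-up lines.
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:100), tmp)
demo_sample <- sample_lines(tmp, 5)
unlink(tmp)
print(demo_sample)
```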
The tm package provides a very useful set of functions for text mining. After loading the lines into a corpus, i.e. an object database for text documents, we perform some basic pre-processing steps to clean the text. For the moment, we choose to remove punctuation, to transform all content to lower case and to strip any additional white space from the lines. We explicitly decide to keep numbers and stopwords, since they are typical input when typing a text.
library(NLP)
library(tm)
myCorpus <- Corpus(VectorSource(myTwitter))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, stripWhitespace)
The same steps are performed for all three files.
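To show the effect of the three cleaning steps without needing the tm package, here is a base-R approximation. The function clean_lines is a hypothetical helper and the regular expressions are assumptions about what removePunctuation and stripWhitespace do by default; the corpus above is still cleaned with tm.

```r
# Hypothetical base-R approximation of the three tm cleaning steps.
clean_lines <- function(x) {
  x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
  x <- tolower(x)                    # transform to lower case
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of white space
  x
}

cleaned <- clean_lines(c("Hello,   World!", "It's 5 o'clock"))
print(cleaned)
```

Numbers are kept, in line with the decision described above.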
To get familiar with the data, we build a term-document matrix for each text file and look at the most frequent terms and bigrams.
tdm <- TermDocumentMatrix(myCorpus)
freq <- rowSums(as.matrix(tdm))
ordered <- order(freq)
freq[tail(ordered,n=10)]
## have your  are this with that  for  and  you  the
##  333  334  335  358  368  531  800  911 1136 1959
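As expected, the most frequent terms are mostly stopwords. Note that, by default, TermDocumentMatrix only keeps terms of at least three characters, which is why shorter words such as "to" or "a" do not appear in the list.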
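The row sums of the term-document matrix are simply the number of times each term occurs across all documents. A minimal base-R sketch of the same idea on two made-up lines is given below; the length filter mimics the minimum term length of three characters that tm applies by default, and all names are illustrative.

```r
# Two made-up, already cleaned lines.
docs <- c("the cat and the dog", "the dog and you")

# Split into words and keep only terms with at least three characters.
tokens <- unlist(strsplit(docs, " ", fixed = TRUE))
tokens <- tokens[nchar(tokens) >= 3]

# Count occurrences and sort in increasing order, as done with order(freq) above.
term_freq <- sort(table(tokens))
print(tail(term_freq, 3))
```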
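# Build bigrams by joining each pair of neighbouring words with a single space.
# Note: with recent tm versions, Corpus() may return a SimpleCorpus, for which
# custom tokenizers are ignored; in that case create the corpus with VCorpus().
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
tdm2 <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))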
freq <- rowSums(as.matrix(tdm2))
wf <- data.frame(word=names(freq), freq=freq)
The same holds for the bigrams. The following plot shows those with a frequency greater than 50.
library(ggplot2)
p <- ggplot(subset(wf, freq>50), aes(word, freq))
p <- p + geom_bar(stat="identity", fill="blue", colour="darkblue")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
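To illustrate how the bigrams counted above are formed, the following base-R sketch pairs each word with its successor and tabulates the result. The helper make_bigrams and the example lines are made up for illustration; the report itself uses the tokenizer passed to TermDocumentMatrix.

```r
# Hypothetical helper: return all bigrams (neighbouring word pairs) of one line.
make_bigrams <- function(line) {
  w <- strsplit(line, " ", fixed = TRUE)[[1]]
  w <- w[nzchar(w)]                        # drop empty strings from extra spaces
  if (length(w) < 2) return(character(0))  # a single word has no bigram
  paste(head(w, -1), tail(w, -1))          # word i joined with word i + 1
}

example_lines <- c("thanks for the follow", "for the win")
bigrams <- unlist(lapply(example_lines, make_bigrams))

# Frequency table, most frequent bigram first.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(bigram_freq)
```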
To avoid memory and performance problems, I intend to do the pre-processing in Python and to perform the final analysis in R. To get rid of misspelled words and wrong encodings, I will try to remove sparse bigrams and words in which the same letter is repeated more than three times. In addition, a profanity dictionary will be used to remove the corresponding words.
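The planned filtering could look roughly as follows. This is only a sketch under stated assumptions: it is written in R for consistency with the rest of this report, although the pre-processing is planned in Python; "more than three times" is interpreted as four or more identical letters in a row; and the profanity vector is a harmless placeholder, not a real dictionary.

```r
# Placeholder list standing in for a real profanity dictionary (assumption).
profanity <- c("darn", "heck")

# TRUE for words in which one letter occurs four or more times in a row.
has_long_run <- function(w) {
  grepl("([[:alpha:]])\\1{3,}", w, perl = TRUE)
}

# Hypothetical helper: drop words with long letter runs and banned words.
filter_tokens <- function(tokens, banned) {
  tokens[!has_long_run(tokens) & !(tolower(tokens) %in% banned)]
}

kept <- filter_tokens(c("sooooo", "good", "heck", "hmmm", "yes"), profanity)
print(kept)
```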