Milestone Report for Data Science Capstone

Background and Tasks

This report is part of the capstone project of the Johns Hopkins Data Science Specialization offered on Coursera. The project was developed in cooperation with SwiftKey, a company that builds so-called smart keyboards and specializes in predictive text models. While the ultimate goal is to develop and present a prediction algorithm that suggests options for the next word while typing, this intermediate report provides an overview of the available datasets and their major features.

More specifically, the goals are to:

Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.

Data Source and Basic Statistics

The training data on which this report is based can be downloaded from the Coursera site. It contains text files with news, blog posts and Twitter messages in four languages (English, German, Russian, Finnish). For this exercise we only consider the English text files. As a first step, we look at the files and their sizes.

setwd("~/coursera/capstone/final/en_US/")
file.info(dir())[,1:2]

##                        size isdir
## en_US.blogs.txt   210160014 FALSE
## en_US.news.txt    205811889 FALSE
## en_US.twitter.txt 167105338 FALSE

Basic statistics of the files

Some basic statistics can be obtained in R with the stringi package.

library(stringi)
blogs <- stri_read_lines("~/coursera/capstone/final/en_US/en_US.blogs.txt")
stri_stats_general(blogs)

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539

rm(blogs)
news <- stri_read_lines("~/coursera/capstone/final/en_US/en_US.news.txt")
stri_stats_general(news)

##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866

rm(news)
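The per-file calls above could be wrapped in a small helper so that every file is summarised in the same way. The following is only a minimal sketch: the function name summarise_lines and its column names are hypothetical, and toy data stands in for the actual files.

```r
library(stringi)

# Hypothetical helper: one row of summary statistics for a character vector
# of lines (e.g. the result of stri_read_lines()).
summarise_lines <- function(lines) {
  general <- stri_stats_general(lines)
  data.frame(
    lines          = unname(general["Lines"]),
    chars          = unname(general["Chars"]),
    words          = sum(stri_count_words(lines)),
    max_line_chars = max(stri_length(lines))
  )
}

# Toy data instead of the real files
toy <- c("The quick brown fox", "jumps over", "the lazy dog")
toy_stats <- summarise_lines(toy)
print(toy_stats)
```

Calling the helper once per file and combining the rows with rbind() would give a single overview table.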

Loading a subset of the data

Even a small subset of the data can give a first impression of the word and bigram frequencies in the whole data set. For this report, we use the first 5000 lines of each file.

# Read the first 5000 lines of the Twitter file
connection <- file("~/coursera/capstone/final/en_US/en_US.twitter.txt", "r", encoding = "UTF-8")
myTwitter <- readLines(connection, 5000)
close(connection)
# Read the first 5000 lines of the blogs file
connection <- file("~/coursera/capstone/final/en_US/en_US.blogs.txt", "r", encoding = "UTF-8")
myBlogs <- readLines(connection, 5000)
close(connection)
# Read the first 5000 lines of the news file
connection <- file("~/coursera/capstone/final/en_US/en_US.news.txt", "r", encoding = "UTF-8")
myNews <- readLines(connection, 5000)
close(connection)
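Note that readLines(connection, 5000) returns the first 5000 lines of a file, not a random selection. If the order of the lines in a file is not random, a random sample may be more representative. The following is a minimal sketch of that alternative, not the method used in this report; the helper name sample_lines and the seed are hypothetical, and the sketch assumes the whole file fits into memory.

```r
# Hypothetical helper: draw n lines uniformly at random from a text file.
sample_lines <- function(path, n, seed = 1234) {
  con <- file(path, "r", encoding = "UTF-8")
  on.exit(close(con))
  lines <- readLines(con, skipNul = TRUE)  # reads the whole file
  set.seed(seed)                           # make the sample reproducible
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Demonstration on a temporary file instead of the real data
tmp <- tempfile(fileext = ".txt")
all_lines <- paste("line", 1:100)
writeLines(all_lines, tmp)
subset_lines <- sample_lines(tmp, 10)
print(subset_lines)
```

Fixing the seed keeps the subset identical between runs, so that the reported frequencies stay reproducible.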

Pre-processing

The tm package provides a useful set of functions for text mining. After loading the text into a corpus, i.e. an object database for text documents, we perform some basic pre-processing steps to clean it. For the moment, we remove punctuation, convert all content to lower case and strip superfluous white space. We deliberately keep numbers and stopwords, since they are typical input when typing a text.

library(NLP)
library(tm)

# VCorpus: recent tm versions ignore custom tokenizers (needed for the bigrams) on a SimpleCorpus
myCorpus <- VCorpus(VectorSource(myTwitter))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, stripWhitespace)

The same steps are performed for all three files.
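One way to apply identical steps to all three sources is to wrap them in a function. The following is a sketch under assumptions: the helper name clean_corpus is hypothetical, and short toy vectors stand in for myTwitter, myBlogs and myNews.

```r
library(tm)

# Hypothetical helper bundling the three cleaning steps used in this report.
clean_corpus <- function(lines) {
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Toy stand-ins for myTwitter, myBlogs and myNews
sources <- list(
  twitter = c("Hello,   World!", "R is FUN :)"),
  blogs   = c("Numbers like 42 are kept."),
  news    = c("Stopwords such as THE stay, too.")
)
corpora <- lapply(sources, clean_corpus)

# Inspect the first cleaned Twitter document
print(as.character(corpora$twitter[[1]]))
```

Numbers and stopwords pass through unchanged, in line with the decision described above.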

Exploratory Analysis

To get familiar with the data, we transform each corpus into term-document matrices of single terms and of bigrams.

tdm <- TermDocumentMatrix(myCorpus)
freq <- rowSums(as.matrix(tdm))
ordered <- order(freq)
freq[tail(ordered,n=10)]

## have your  are this with that  for  and  you  the 
##  333  334  335  358  368  531  800  911 1136 1959

As expected, the most frequent terms are mostly stopwords.
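The list above contains no terms shorter than three characters, such as "to" or "a". A likely reason is that TermDocumentMatrix by default only keeps terms with at least three characters (control option wordLengths). The following sketch on a toy corpus shows how shorter words could be included; the object names are hypothetical.

```r
library(tm)

# Toy corpus; the real analysis would use myCorpus
toy <- VCorpus(VectorSource(c("i go to the park", "to be or not to be")))

# Default settings: terms shorter than three characters are dropped
tdm_default <- TermDocumentMatrix(toy)

# Keep terms of any length
tdm_all <- TermDocumentMatrix(toy, control = list(wordLengths = c(1, Inf)))

print(Terms(tdm_default))
print(Terms(tdm_all))
```

Since short words are frequent input when typing, keeping them is probably preferable for a next-word prediction model.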

# Join each pair of adjacent words into one bigram string
BigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tdm2 <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))
freq <- rowSums(as.matrix(tdm2))
wf <- data.frame(word=names(freq), freq=freq)

The same holds for the bigrams that occur more than 50 times, which are plotted below.

library(ggplot2)
p <- ggplot(subset(wf, freq>50), aes(word, freq))    
p <- p + geom_bar(stat="identity", fill="blue", colour="darkblue")   
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))   
p
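To make explicit what the bigram counts contain, the following sketch counts bigrams on three toy lines with base R only. It illustrates the idea and is not a replacement for the tm-based code; the helper name bigrams is hypothetical.

```r
# Hypothetical helper: adjacent word pairs of one cleaned line.
bigrams <- function(line) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  tokens <- tokens[tokens != ""]              # drop empty tokens
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))   # pair word i with word i + 1
}

toy <- c("i love you", "i love the weekend", "love you too")

# Count how often each bigram occurs over all lines
counts <- sort(table(unlist(lapply(toy, bigrams))), decreasing = TRUE)
print(counts)
```

A table of this kind is the raw material for a simple predictor: for a given first word, the most frequent bigrams starting with that word supply the candidate next words.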

Creating a prediction algorithm

To avoid memory and performance problems, I intend to do the pre-processing in Python and to perform the final analysis in R. To get rid of misspelled words and wrong encodings, I will try to remove sparse bigrams and words in which the same letter is repeated more than three times. In addition, a profanity dictionary will be used to remove offensive words.
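The planned filters can be sketched as follows. The sketch is written in R for consistency with the rest of this report, although the plan is to do this step in Python. All function names are hypothetical, the banned-word list is a placeholder for a real profanity dictionary, and "letters more than three times" is interpreted here as the same character occurring four or more times in a row.

```r
# TRUE for words in which one character is repeated four or more times in a row
has_long_repeat <- function(words) grepl("(.)\\1{3,}", words, perl = TRUE)

# Remove banned words and words with long character repeats
drop_words <- function(tokens, banned) {
  tokens[!(tokens %in% banned) & !has_long_repeat(tokens)]
}

# Remove sparse entries from a named vector of bigram counts
drop_sparse <- function(counts, min_count = 2) counts[counts >= min_count]

# Toy examples; "badword" stands in for an entry of a profanity dictionary
tokens  <- c("this", "is", "sooooo", "badword", "good")
kept    <- drop_words(tokens, banned = c("badword"))
bigram_counts <- c("i love" = 2, "love the" = 1, "love you" = 3)
frequent <- drop_sparse(bigram_counts)
print(kept)
print(frequent)
```

The threshold min_count = 2 is an arbitrary choice for illustration; a suitable value would have to be determined on the full data set.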
