CapstoneMilestone1

Summary

The final goal of this capstone’s project is to build a predictive model for predicting the next word in a sentence (or clause) based on its previous words. However, the goal of this milestone project is just to display that we’ve gotten used to working with the data. Concretely, in this report we would: 1. Demonstrate that we have successfully downloaded and read the data, 2. Create a basic report of summary statistics of the data, 3. Report some feedback on our plans for creating a prediction algorithm. The dataset provided for this project is downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The thorough dataset is multi-language and we are going to use only the English sub-datasets, namely “en_US_twitter.txt”, “en_US_news.txt” and “en_US_blogs.txt”.

Data Processing

Downloading the data

if(!file.exists("./data")){
  dir.create("./data")
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(Url, destfile="./data/Coursera-SwiftKey.zip", mode = "wb")
  unzip(zipfile="./data/Coursera-SwiftKey.zip", exdir="./data")
}

Read the datasets

dataBlogs <- readLines("./data/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
dataNews <- readLines("./data/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)

## Warning in readLines("./data/en_US/en_US.news.txt", encoding = "UTF-8", :
## incomplete final line found on './data/en_US/en_US.news.txt'

dataTwitter <- readLines("./data/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Display statistics of the three datasets

stri_stats_general(dataBlogs)

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539

stri_stats_general(dataNews)

##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15639408    13072698

stri_stats_general(dataTwitter)

##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096241   134082806

Data Preparation

Sample the data and create the corpus

subdataBlogs <- sample(dataBlogs, size = 1000)
subdataNews <- sample(dataNews, size = 1000)
subdataTwitter <- sample(dataTwitter, size = 1000)
sampledData <- c(subdataBlogs, subdataNews, subdataTwitter)
corpus <- VCorpus(VectorSource(sampledData))

Remove stopwords, punctuation, whitespaces, numbers etc. from the corpuses

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/|@|//|$|:|:)|*|&|!|?|_|-|#|")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

Create the DocumentTermMatrizes

dtm1 <- TermDocumentMatrix(corpus)
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))

Data Exploration

1-gram Frequency

freq1 <- rowSums(as.matrix(dtm1))
freq1 <- sort(freq1, decreasing = TRUE)
dfFreq1 <- data.frame(word = names(freq1), freq=freq1)
ggplot(dfFreq1[1:20, ], aes(word, freq)) +
  geom_bar(stat="identity", fill="red", colour="red") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("1-gram Frequency")

2-gram Frequency

freq2 <- rowSums(as.matrix(dtm2))
freq2 <- sort(freq2, decreasing = TRUE)
dfFreq2 <- data.frame(word = names(freq2), freq=freq2)
ggplot(dfFreq2[1:20, ], aes(word, freq)) +
  geom_bar(stat="identity", fill="red", colour="red") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("2-gram Frequency")

3-gram Frequency

freq3 <- rowSums(as.matrix(dtm3))
freq3 <- sort(freq3, decreasing = TRUE)
dfFreq3 <- data.frame(word = names(freq3), freq=freq3)
ggplot(dfFreq3[1:20, ], aes(word, freq)) +
  geom_bar(stat="identity", fill="red", colour="red") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("3-gram Frequency")

Generating the Word Cloud

US News Word Cloud

wordcloud(dfFreq1$word[1:100], dfFreq1$freq[1:100], colors=brewer.pal(8, "Dark2"))

## Warning in wordcloud(dfFreq1$word[1:100], dfFreq1$freq[1:100], colors =
## brewer.pal(8, : will could not be fit on page. It will not be plotted.

Observations and reflections

From observing the dataset, and looking in discussion forumas there are number of issues to solve to achieve the final goal of this project:

It appears not to complicated at first; however, natural language is unlike a programming language. It does not follow strict rules, or not everyone follows those rules;
There are exceptions to the rule, making it harder; e.g. “can’t” is the same as cannot; so how to take that into account makes the algorithm more difficult;
Punctuation characters aren’t always used as punctuation anymore; the biggest example is the use of smiley characters - such as :-). They do have meaning or at least add information.
n-grams seems only interesting in a sentence context. That however complicates reading the data considerably. How to define end of sentence? At first “.!?” seems to do it, but nowadays people use combinations of these as well. And how to ‘.’ as part of abbrevations (for instance e.g.);
How to determine which encoding is used for a text file without prior knowlegde of the maker(s); could be pretty essential when you want to take monitary characters into account like the euro sign;
I am tempted to drop stopwords (and, the, a, etcetera), because they add little information; the question is if it is doable. how to deal with URLs, website names, e-mail addresses etc.

Plans for the Future

For the further analysis, text modelling, and text prediction, we need to do the following studies:

N-Gram modelling of the full text data sets
Optimizing model for low memory utilization
Implement model as a Shiny App and more user-friendly environment