Report summary

This is the first milestone report for the capstone project that concludes the JHU Data Science Specialization on Coursera. According to the instructions, the purpose of this report is “just to demonstrate that we’ve gotten used to working with the textual data and that we are on track to create our prediction algorithm.” Hence, here you will find an initial exploratory data analysis of the data set provided by Coursera and SwiftKey, as well as ideas on how the modelling task could be tackled.

We’ll be working with the English data, which is a subset of a corpus called HC Corpora. The full version of the data set used in this analysis can be found HERE. It contains data in four languages: English, German, Russian and Finnish. For each language there are three corpora, containing text collected from Twitter, blogs and news feeds.

The code which was used to generate this report can be found in my GitHub repo.

Setting up the working environment and loading the data

# Loading needed packages 
library(tm)        
library(stringr)     
library(qdap)
library(RColorBrewer)
library(SnowballC)
library(wordcloud)
library(RWeka)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(graphics)

Loading the data, i.e. the corpora that will be used in the analysis:

  con1 <- file("data/en_US.blogs.txt", "r") 
  blogs <- readLines(con1, encoding = "UTF-8", skipNul = TRUE)
  close(con1)
  con2 <- file("data/en_US.news.txt", "r") 
  news <- readLines(con2, encoding = "UTF-8", skipNul = TRUE)
  close(con2)
  con3 <- file("data/en_US.twitter.txt", "r") 
  twitter <- readLines(con3, encoding = "UTF-8", skipNul = TRUE)
  close(con3)

Summary statistics

The table below provides a basic summary of our corpora. As you can see, the data sets are fairly large, so we will sample 10000 lines from each corpus and combine them for further analysis. This approach is recommended by the good people at JHU.

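| File Name | File Size [MB] | Number of Lines | Number of Words |
|:----------|---------------:|----------------:|----------------:|
| Blogs     |         200.42 |          899288 |        37334131 |
| News      |         196.28 |           77259 |         2643969 |
| Twitter   |         159.36 |         2360148 |        30373583 |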

Forming the sample corpus

As mentioned above, we’ll be using an aggregated sample of the initial data set for further analysis.

sample_blogs <- blogs[sample(1:length(blogs),10000)]
sample_news <- news[sample(1:length(news),10000)]
sample_twitter <- twitter[sample(1:length(twitter),10000)]
text_sample <- c(sample_blogs,sample_news,sample_twitter)

Let’s check the summary statistics for our sample corpus:

| File Name | File Size [MB] | Number of Lines | Number of Words |
|:----------|---------------:|----------------:|----------------:|
| Sample    |           4.86 |           30000 |          890090 |

Data preprocessing - cleaning the corpus

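Before the corpus analysis, certain preprocessing steps are usually performed. These include the following:

- Text normalization:
    - text lowercasing
    - removal of numbers
    - removal of punctuation signs
    - removal of URLs
    - removal of stop words
    - whitespace stripping
    - profanity filtering
    - stemming
- Replacement of special characters such as emoticons, special UTF-8 characters and control characters

All these operations are carried out with the help of the tm package, which is probably the best-known and most widely used package for text mining in R.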

Initial corpus exploration - frequent terms

To explore the frequency distribution of words, i.e. terms, in our corpus, we first need to build a term-document matrix. This matrix has one row per term and one column per document, and each entry holds the number of times the term occurs in that document.

# Let's build a term-document matrix out of our clean corpus
sample_tdm <- TermDocumentMatrix(clean_sample)
sample_tdm
<<TermDocumentMatrix (terms: 40350, documents: 30000)>>
Non-/sparse entries: 431951/1210068049
Sparsity           : 100%
Maximal term length: 64
Weighting          : term frequency (tf)
# An easy way to start analyzing the information contained in TDM is to change it into a simple matrix 
sample_m <- as.matrix(sample_tdm)
# Let's check what our matrix looks like
dim(sample_m)
[1] 40350 30000
sample_m[10000:10007, 2000:2010]
                 Docs
Terms             2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
  dictatori          0    0    0    0    0    0    0    0    0    0    0
  dictatorship       0    0    0    0    0    0    0    0    0    0    0
  dictatorshiplit    0    0    0    0    0    0    0    0    0    0    0
  dictionari         0    0    0    0    0    0    0    0    0    0    0
  dictionaryush      0    0    0    0    0    0    0    0    0    0    0
  didact             0    0    0    0    0    0    0    0    0    0    0
  diddi              0    0    0    0    0    0    0    0    0    0    0
  diddley            0    0    0    0    0    0    0    0    0    0    0

Let’s check the 20 most common terms in our cleaned sample corpus:

# Calculate the rowSums: term_frequency
term_frequency <- rowSums(sample_m)
# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)

View the top 20 most common words:

term_frequency[1:20]
 said  will   one  like   get  just  time  year   can  make   new   day  love  work  good 
 2935  2819  2658  2354  2310  2271  2189  2018  1976  1743  1633  1632  1464  1460  1382 
  say peopl   now  know  want 
 1366  1350  1348  1337  1335 

Plot a bar chart of the 20 most common words:

barplot(term_frequency[20:1], col = "steelblue", las = 2)

Word cloud

A word cloud is a popular, if often overused, way to visualize term frequencies. In a word cloud, word size is usually scaled to frequency, and in some cases the colors encode another measurement.

Let’s look at the 50 most frequently occurring single words, i.e. unigrams, in our clean corpus:

# Create word_freqs
word_freqs <- data.frame(term = names(term_frequency),
                          num = term_frequency)
# Print the wordcloud with the specified colors
wordcloud(word_freqs$term,
          word_freqs$num,
          max.words = 50,
          colors = c("grey60", "darkorange", "steelblue")
          )

N-Gram tokenization

Now we will shift our focus to tokens consisting of two and three words, i.e. bigrams and trigrams. These can help extract useful phrases, which may lead to additional insights or provide better predictive features for a machine learning algorithm.

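Insights

- Stemming needs to be adjusted so that we don’t end up with trigrams like “happi mother day” or “presid barack obama”.
- Lowercasing causes a loss of information about the presence of personal names, city names, state names and the like.
- Increasing the n-gram order, i.e. going from bigrams to trigrams, leads to a drastic decrease in observed counts.
- The corpora are huge. More research is needed into how to handle them efficiently, especially when building a model trained on the complete data set.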

Next steps

Future work on the capstone project will focus on developing a modelling strategy: choosing and constructing an adequate set of features, then choosing and implementing a prediction algorithm that makes it possible to build a fast and user-friendly app.

---
title: 'Data Science Specialization Milestone Report: Initial EDA' 
author: "Igor Hut"
date: "February 19, 2017"
output:
  html_notebook:
    highlight: textmate
    theme: cerulean
    toc: yes
  html_document:
    toc: yes
---

## Report summary

This is the first milestone report for the capstone project that concludes the JHU Data Science Specialization on Coursera. According to the instructions, the purpose of this report is "just to demonstrate that we've gotten used to working with the textual data and that we are on track to create our prediction algorithm." Hence, here you will find an initial exploratory data analysis of the data set provided by Coursera and SwiftKey, as well as ideas on how the modelling task could be tackled.

We'll be working with the English data, which is a subset of a corpus called [HC Corpora](http://data.danetsoft.com/corpora.heliohost.org). The full version of the data set used in this analysis can be found [here](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). It contains data in four languages: English, German, Russian and Finnish. For each language there are three corpora, containing text collected from Twitter, blogs and news feeds.

The code which was used to generate this report can be found in my [GitHub repo](https://github.com/IgorHut/JHU_Data_Science_Capstone).

## Setting up the working environment and loading the data

```{r, message=FALSE, warning=FALSE}
# Loading needed packages 
library(tm)        
library(stringr)     
library(qdap)
library(RColorBrewer)
library(SnowballC)
library(wordcloud)
library(RWeka)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(graphics)
```

Loading the data, i.e. the corpora that will be used in the analysis:
```{r, message=FALSE, warning=FALSE}

  con1 <- file("data/en_US.blogs.txt", "r") 
  blogs <- readLines(con1, encoding = "UTF-8", skipNul = TRUE)
  close(con1)

  con2 <- file("data/en_US.news.txt", "r") 
  news <- readLines(con2, encoding = "UTF-8", skipNul = TRUE)
  close(con2)

  con3 <- file("data/en_US.twitter.txt", "r") 
  twitter <- readLines(con3, encoding = "UTF-8", skipNul = TRUE)
  close(con3)

```
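
The summary statistics hint at a possible problem with this loading step: the news file is nearly as large as the blogs file (196.28 MB vs. 200.42 MB), yet only 77259 lines and 2643969 words are read from it. One possible explanation, stated here as an assumption rather than a verified diagnosis, is that the file contains a control character that some platforms treat as end-of-file when a connection is opened in text mode (`"r"`). The sketch below opens the connection in binary mode (`"rb"`) instead. `read_corpus` is a hypothetical helper name that does not appear in the original code, and the demonstration uses a small temporary file rather than the real data.

```r
# Hypothetical helper (not part of the original report): read a text file
# through a connection opened in binary mode and always close the connection.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file whose second line
# contains the SUB control character (0x1A).
demo_path <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond \x1a line\nthird line\n"), demo_path)

demo_lines <- read_corpus(demo_path)
length(demo_lines)
```

If this assumption holds for the real file, `news <- read_corpus("data/en_US.news.txt")` should return noticeably more lines than the text-mode read, and the summary table would need to be regenerated.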

## Summary statistics

```{r, include=FALSE}
## Checking the size and length of the files and calculate the word count
blogsFile <- file.info("data/en_US.blogs.txt")$size / 1024^2 
newsFile <- file.info("data/en_US.news.txt")$size / 1024^2 
twitterFile <- file.info("data/en_US.twitter.txt")$size / 1024^2 


blogsLength <- length(blogs)
newsLength <- length(news)
twitterLength <- length(twitter)


blogsWords <- sum(sapply(gregexpr("\\S+", blogs), length))
newsWords <- sum(sapply(gregexpr("\\S+", news), length))
twitterWords <- sum(sapply(gregexpr("\\S+", twitter), length))

file_summary <- data.frame(
        fileName = c("Blogs","News","Twitter"),
        fileSize = c(round(blogsFile, digits = 2), 
                     round(newsFile,digits = 2), 
                     round(twitterFile, digits = 2)),
                     lineCount = c(blogsLength, newsLength, twitterLength),
                     wordCount = c(blogsWords, newsWords, twitterWords)                  
)

colnames(file_summary) <- c("File Name", "File Size [MB]", "Number of Lines", "Number of Words")

saveRDS(file_summary, file = "summary.Rda")

summary_DF <- readRDS("summary.Rda")

```
The table below provides a basic summary of our corpora. As you can see, the data sets are fairly large, so we will sample 10000 lines from each corpus and combine them for further analysis. This approach is recommended by the good people at JHU.

```{r, echo=FALSE}
knitr::kable(head(summary_DF, 10))
```
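
One detail of the word count is worth spelling out. The counts above are taken as the length of each `gregexpr("\\S+", ...)` result, and `gregexpr()` returns a single `-1` when nothing matches, so an empty line contributes one "word". The effect on corpora of this size is probably small, but a minimal sketch of a stricter count is shown below. `count_words` is a hypothetical helper name, and the three demo lines are made up for illustration.

```r
# Hypothetical helper: count whitespace-separated tokens per line,
# returning 0 for lines that contain no token at all.
count_words <- function(x) {
  lengths(regmatches(x, gregexpr("\\S+", x)))
}

demo <- c("just two", "", "three more words")

# Approach used in the report: the empty line is counted as one word
naive_counts <- sapply(gregexpr("\\S+", demo), length)

# Stricter count: the empty line is counted as zero words
strict_counts <- count_words(demo)

naive_counts
strict_counts
```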

## Forming the sample corpus

As mentioned above, we'll be using an aggregated sample of the initial data set for further analysis.
```{r, message=FALSE, warning=FALSE}
sample_blogs <- blogs[sample(1:length(blogs),10000)]
sample_news <- news[sample(1:length(news),10000)]
sample_twitter <- twitter[sample(1:length(twitter),10000)]
text_sample <- c(sample_blogs,sample_news,sample_twitter)
```
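
Because `sample()` draws different lines on every run, the numbers in the rest of this report will change slightly each time the notebook is knitted. A minimal sketch of a reproducible draw is shown below, assuming a fixed seed is acceptable. The seed value `2017`, the helper name `draw_lines` and the toy corpus are all assumptions made for illustration.

```r
# Hypothetical helper: draw up to n lines without replacement.
# sample.int() avoids the 1:length(x) idiom and handles short inputs.
draw_lines <- function(x, n) {
  x[sample.int(length(x), size = min(n, length(x)))]
}

demo_corpus <- paste("line", 1:50)

# Setting the same seed before each draw gives the same sample
set.seed(2017)  # arbitrary seed, chosen only for this example
first_draw <- draw_lines(demo_corpus, 10)

set.seed(2017)
second_draw <- draw_lines(demo_corpus, 10)

identical(first_draw, second_draw)
```

In the notebook itself, a single `set.seed()` call before the first `sample()` would be enough.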

```{r, include=FALSE}
## Save sample
writeLines(text_sample, "text_sample.txt")
con <- file("text_sample.txt")
sample <- readLines(con)
close(con)
```

Let's check the summary statistics for our sample corpus:

```{r, echo=FALSE}
sampleFile <- file.info("text_sample.txt")$size / 1024^2
sampleLength <- length(sample)
sampleWords <- sum(sapply(gregexpr("\\S+", sample), length))

file_summary_sample <- data.frame(
        fileName = "Sample",
        fileSize = round(sampleFile, digits = 2), 
                     lineCount = sampleLength,
                     wordCount = sampleWords                  
                     )

colnames(file_summary_sample) <- c("File Name", "File Size [MB]", "Number of Lines", "Number of Words")

saveRDS(file_summary_sample, file = "summary_sample.Rda")

summary_DF_sample <- readRDS("summary_sample.Rda")

knitr::kable(head(summary_DF_sample))
```

## Data preprocessing - cleaning the corpus

Before the corpus analysis, certain preprocessing steps are usually performed. These include the following:

- Text normalization:
    - text lowercasing
    - removal of numbers
    - removal of punctuation signs
    - removal of URLs
    - removal of stop words
    - whitespace stripping
    - profanity filtering
    - stemming

- Replacement of special characters such as emoticons, special UTF-8 characters and control characters

All these operations are carried out with the help of the [`tm` package](http://tm.r-forge.r-project.org/index.html), which is probably the best-known and most widely used package for text mining in R. 

```{r, include=FALSE}
bad_words <- read.table("bad_words.txt", header = FALSE, stringsAsFactors = FALSE)$V1

## Build the corpus, and specify the source to be character vectors 
clean_sample <- Corpus(VectorSource(sample))

## Convert to lower case
clean_sample <- tm_map(clean_sample, content_transformer(tolower), lazy = TRUE)

## Remove punctuation, numbers, URLs, stop words and profanity, then perform stemming
clean_sample <- tm_map(clean_sample, content_transformer(removePunctuation))
clean_sample <- tm_map(clean_sample, content_transformer(removeNumbers))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) 
clean_sample <- tm_map(clean_sample, content_transformer(removeURL))
clean_sample <- tm_map(clean_sample, stripWhitespace)
clean_sample <- tm_map(clean_sample, removeWords, stopwords("english"))
clean_sample <- tm_map(clean_sample, removeWords, bad_words)
clean_sample <- tm_map(clean_sample, stemDocument)
clean_sample <- tm_map(clean_sample, stripWhitespace)

## Saving the final corpus
saveRDS(clean_sample, file = "clean_corpus.RDS")

```
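
The order of the cleaning steps matters. In the hidden chunk above, punctuation is removed before URLs, so a link has already been collapsed into a single alphanumeric string by the time the `http[[:alnum:]]*` pattern is applied. The base-R sketch below illustrates the same kind of pipeline with URLs removed first, while `://` is still intact. It is not the `tm` pipeline used in this report: `clean_text`, its `banned` argument and the example sentence are hypothetical, stop-word removal and stemming are left out, and the banned words are assumed to be plain words without regular-expression metacharacters.

```r
# Hypothetical base-R cleaning function, for illustration only.
clean_text <- function(x, banned = character(0)) {
  x <- tolower(x)
  # Remove URLs first, while "://" and dots are still present
  x <- gsub("(https?://|www\\.)\\S+", " ", x)
  # Remove numbers and punctuation
  x <- gsub("[[:digit:]]+", " ", x)
  x <- gsub("[[:punct:]]+", " ", x)
  # Remove banned words (assumed to be plain words)
  if (length(banned) > 0) {
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  # Collapse repeated whitespace and trim the ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cleaned <- clean_text("Check THIS out: http://t.co/abc123 , 100% darn cool!!",
                      banned = "darn")
cleaned
```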

## Initial corpus exploration - frequent terms

To explore the frequency distribution of words, i.e. terms, in our corpus, we first need to build a term-document matrix. This matrix has one row per term and one column per document, and each entry holds the number of times the term occurs in that document.

```{r}
# Let's build a term-document matrix out of our clean corpus
sample_tdm <- TermDocumentMatrix(clean_sample)
sample_tdm

# An easy way to start analyzing the information contained in TDM is to change it into a simple matrix 
sample_m <- as.matrix(sample_tdm)

# Let's check what our matrix looks like
dim(sample_m)
sample_m[10000:10007, 2000:2010]
```
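
Converting the term-document matrix with `as.matrix()` materializes all 40350 × 30000 cells, which means several gigabytes of memory even though only 431951 entries are non-zero. Term totals can be computed from the sparse representation instead. A term-document matrix from `tm` stores its non-zero entries as row indices `i`, column indices `j` and values `v`. The sketch below mimics that layout with a small hand-made list, so it runs without `tm`. The object `tdm_like` and the function `sparse_term_frequency` are hypothetical names introduced for this example.

```r
# Hypothetical stand-in for a sparse term-document matrix in triplet form:
# entry k says that term i[k] occurs v[k] times in document j[k].
tdm_like <- list(
  i = c(1L, 1L, 2L, 3L, 3L, 3L),
  j = c(1L, 2L, 2L, 1L, 2L, 3L),
  v = c(2, 1, 4, 1, 1, 3),
  dimnames = list(Terms = c("day", "love", "time"),
                  Docs  = c("1", "2", "3"))
)

# Sum the stored values per term without building the dense matrix
sparse_term_frequency <- function(m) {
  totals <- numeric(length(m$dimnames$Terms))
  names(totals) <- m$dimnames$Terms
  per_term <- tapply(m$v, m$i, sum)
  totals[as.integer(names(per_term))] <- as.vector(per_term)
  sort(totals, decreasing = TRUE)
}

demo_frequency <- sparse_term_frequency(tdm_like)
demo_frequency
```

If the `slam` package that `tm` builds on is available, `slam::row_sums(sample_tdm)` should give the same totals directly from the sparse matrix. That call is not used in the original report.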


Let's check the 20 most common terms in our cleaned sample corpus:

```{r}
# Calculate the rowSums: term_frequency
term_frequency <- rowSums(sample_m)

# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)
```
View the top 20 most common words:
```{r}
term_frequency[1:20]
```
Plot a bar chart of the 20 most common words:
```{r}
barplot(term_frequency[20:1], col = "steelblue", las = 2)
```

### Word cloud
A word cloud is a popular, if often overused, way to visualize term frequencies. In a word cloud, word size is usually scaled to frequency, and in some cases the colors encode another measurement.

Let's look at the 50 most frequently occurring single words, i.e. unigrams, in our clean corpus:
```{r}
# Create word_freqs
word_freqs <- data.frame(term = names(term_frequency),
                          num = term_frequency)

# Print the wordcloud with the specified colors
wordcloud(word_freqs$term,
          word_freqs$num,
          max.words = 50,
          colors = c("grey60", "darkorange", "steelblue")
          )
```

## N-Gram tokenization
Now we will shift our focus to tokens consisting of two and three words, i.e. bigrams and trigrams. These can help extract useful phrases, which may lead to additional insights or provide better predictive features for a machine learning algorithm.

```{r, echo=FALSE, fig.height=5}

bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigrams <- unlist(lapply(clean_sample, bigramTokenizer))
nbBigrams <- length(table(bigrams))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigrams <- unlist(lapply(clean_sample, trigramTokenizer))
nbTrigrams <- length(table(trigrams))
```
```{r, echo=FALSE}

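# Count n-grams and order the factor levels by frequency so the bars plot in sorted order.
# as_tibble() replaces the deprecated tbl_df().
dfBigramFreq <- as_tibble(data.frame(table(bigrams))) %>% arrange(Freq) %>%
                mutate(bigrams = factor(bigrams,levels = as.character(bigrams)))

dfTrigramFreq <- as_tibble(data.frame(table(trigrams))) %>% arrange(Freq) %>%
                 mutate(trigrams = factor(trigrams,levels = as.character(trigrams)))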

gbigram <- ggplot(tail(dfBigramFreq, 10), aes(x = bigrams, y = Freq, fill = Freq)) +
           geom_bar(stat = "identity") +
           theme(axis.title.x = element_text(colour = "steelblue", size = 14),
                 axis.text.x  = element_text(angle = 90, face = "bold", size = 12)) +
           ggtitle("Most frequent bigrams") +
           scale_fill_gradient(low = "grey60", high = "steelblue")

gtrigram <- ggplot(tail(dfTrigramFreq, 10), aes(x = trigrams, y = Freq, fill = Freq)) +
            geom_bar(stat = "identity") +
            theme(axis.title.x = element_text(colour = "steelblue", size = 14),
                  axis.text.x  = element_text(angle = 90, face = "bold", vjust = 0.5, size = 12)) +
            ggtitle("Most frequent trigrams") +
            scale_fill_gradient(low = "grey60", high = "steelblue")

grid.arrange(gbigram,gtrigram,ncol = 2)
```
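
To make the tokenization step concrete, here is a minimal base-R sketch of how bigrams and trigrams are formed from a cleaned line and then counted. It is a stand-in for the `NGramTokenizer()` call from `RWeka` used in the hidden chunk above, not a reimplementation of it. The function `make_ngrams` and the three toy documents are hypothetical.

```r
# Hypothetical helper: split a line on whitespace and paste together
# every run of n consecutive words.
make_ngrams <- function(text, n) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(s) paste(words[s:(s + n - 1)], collapse = " "),
         character(1))
}

# Toy documents, already lowercased and stemmed
demo_docs <- c("happi mother day", "happi mother day mom", "love mother day")

bigram_counts  <- sort(table(unlist(lapply(demo_docs, make_ngrams, n = 2))),
                       decreasing = TRUE)
trigram_counts <- sort(table(unlist(lapply(demo_docs, make_ngrams, n = 3))),
                       decreasing = TRUE)

bigram_counts
trigram_counts
```

Even in this tiny example the most frequent trigram occurs less often than the most frequent bigram, because a longer sequence has fewer chances to repeat exactly.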

## Insights
- Stemming needs to be adjusted so that we don't end up with trigrams like "happi mother day" or "presid barack obama".
- Lowercasing causes a loss of information about the presence of personal names, city names, state names and the like.
- Increasing the n-gram order, i.e. going from bigrams to trigrams, leads to a drastic decrease in observed counts.
- The corpora are huge. More research is needed into how to handle them efficiently, especially when building a model trained on the complete data set.

## Next steps
Future work on the capstone project will focus on developing a modelling strategy: choosing and constructing an adequate set of features, then choosing and implementing a prediction algorithm that makes it possible to build a fast and user-friendly app.

