---
title: "Milestone Report"
author: Sandra Meneses
output: html_notebook
---

```{r warning=FALSE, error=FALSE}
library('knitr')
setwd('~/Data_Science/R/Tasks/SwiftKey')
opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```


# Project SwiftKey: German Corpora

The main goal of this project is to build a predictive text model. I will use natural language processing (NLP) to predict the next words a user intends to type, which reduces the time a person needs to write a message. This document presents the data exploration that starts the project.

*Packages*

```{r warning = FALSE, message = FALSE}
library('tm')
library('stringi')
library('wordcloud')
library('ggplot2')

```

- tm: A framework for text mining applications within R.
- stringi: Allows for fast, correct, consistent, portable, as well as convenient character string/text processing in every locale and any native encoding.
- wordcloud: Pretty word clouds.
- ggplot2: A system for 'declaratively' creating graphics, based on "The Grammar of Graphics".


# Exploratory data analysis

In this section, we examine the distribution of words and the relationships between words in the corpora so that we can answer the following questions:

1. Some words are more frequent than others - what are the distributions of word frequencies?
2. What are the frequencies of 2-grams and 3-grams in the dataset?
3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
4. How do you evaluate how many of the words come from foreign languages?
5. Can you think of a way to increase the coverage -- identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

## Loading the data

The corpora are loaded to compute basic statistics for each type of source: Twitter, news and blogs.

```{r cache=TRUE, warning = FALSE, message = FALSE}
f <- file.path(getwd(), "Coursera-SwiftKey.zip")

# Read one text file from the zip archive, one line per row (column V1).
# quote = "" and comment.char = "" keep lines that contain quotes or '#'
# (for example hashtags) intact instead of merging or truncating them.
read_corpus <- function(path) {
  read.table(unz(f, path), header = FALSE, sep = "\n", quote = "",
             comment.char = "", stringsAsFactors = FALSE)
}
de_twitter <- read_corpus("final/de_DE/de_DE.twitter.txt")
de_news <- read_corpus("final/de_DE/de_DE.news.txt")
de_blogs <- read_corpus("final/de_DE/de_DE.blogs.txt")

# Number of words per line in each source.
words_blogs <- stri_count_words(de_blogs$V1)
words_news <- stri_count_words(de_news$V1)
words_twitter <- stri_count_words(de_twitter$V1)

# Uncompressed file sizes in MB, taken from the archive listing because the
# files are read from the zip and never extracted to disk.
zip_listing <- unzip(f, list = TRUE)
size_mb <- function(path) zip_listing$Length[zip_listing$Name == path] / 1024^2
size_blogs <- size_mb("final/de_DE/de_DE.blogs.txt")
size_news <- size_mb("final/de_DE/de_DE.news.txt")
size_twitter <- size_mb("final/de_DE/de_DE.twitter.txt")

basic_stats <- data.frame(filename = c("de_blogs", "de_news", "de_twitter"),
                          file_size_MB = c(size_blogs, size_news, size_twitter),
                          lines = c(length(de_blogs$V1), length(de_news$V1), length(de_twitter$V1)),
                          num_words = c(sum(words_blogs), sum(words_news), sum(words_twitter)),
                          mean_num_words = c(mean(words_blogs), mean(words_news), mean(words_twitter)))
basic_stats

```

To keep the more complex calculations tractable, we take a random sample of 1,000 lines from each source.

```{r}
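# Taking a sample of 1,000 lines per source.
# c() concatenates the three samples into one character vector of 3,000 lines
# (rbind() would build a 3 x 1000 matrix instead).
set.seed(85)
de_sample <- c(de_twitter$V1[sample(length(de_twitter$V1), 1000)],
               de_news$V1[sample(length(de_news$V1), 1000)],
               de_blogs$V1[sample(length(de_blogs$V1), 1000)])
remove(de_twitter, de_news, de_blogs)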

```

A corpus is created from the sample and then cleaned by removing numbers, punctuation and repeated whitespace characters. Next, sparse terms are removed with a maximum allowed sparsity of 0.999, which drops terms that are absent from more than 99.9% of the documents. This helps to remove very uncommon words, including many words from other languages.

```{r}

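# Creating the corpus.
de_corpus <- VCorpus(VectorSource(de_sample))

# Cleaning the text: convert the encoding (sub = "" drops characters that
# cannot be represented in latin1), then remove numbers, punctuation and
# extra whitespace. content_transformer() keeps every element a text
# document, so no PlainTextDocument conversion is needed afterwards.
de_corpus <- tm_map(de_corpus, content_transformer(function(x) iconv(x, from = "UTF-8", to = "latin1", sub = "")))
de_corpus <- tm_map(de_corpus, removeNumbers)
de_corpus <- tm_map(de_corpus, removePunctuation)
de_corpus <- tm_map(de_corpus, stripWhitespace)

# Creating a term-document matrix (terms in rows, documents in columns).
# Note: with the default settings, words shorter than 3 characters are dropped.
de_tdm <- TermDocumentMatrix(de_corpus)
nTerms(de_tdm)
de_tdm <- removeSparseTerms(de_tdm, 0.999)
nTerms(de_tdm)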

```
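
To make the sparsity threshold concrete, the sketch below recomputes it by hand on a toy term-document matrix in base R. The matrix `toy_tdm` and the threshold `max_sparsity` are made up for illustration. It only mirrors the idea behind `removeSparseTerms` (drop terms that are absent from too large a share of documents) and is not its implementation.

```{r}
# Toy term-document matrix: rows are terms, columns are documents (made-up data).
toy_tdm <- matrix(
  c(1, 2, 1, 1,   # "und" appears in all 4 documents
    0, 1, 0, 1,   # "haus" appears in 2 of 4 documents
    0, 0, 0, 1),  # "hello" appears in 1 of 4 documents
  nrow = 3, byrow = TRUE,
  dimnames = list(c("und", "haus", "hello"), paste0("doc", 1:4))
)

# Sparsity of a term = share of documents in which it does not occur.
term_sparsity <- rowMeans(toy_tdm == 0)

# Keep only terms whose sparsity does not exceed the (hypothetical) threshold.
max_sparsity <- 0.5
kept_terms <- names(term_sparsity)[term_sparsity <= max_sparsity]
kept_terms
```

With a threshold of 0.5, the rare term "hello" is dropped while "und" and "haus" are kept. A threshold of 0.999 on 3,000 documents is far more permissive and removes only terms that occur in a handful of documents.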


## Analyzing frequent words

We start with the most frequent words in our corpora, since they are likely to appear in the most frequent 2-grams, 3-grams and 4-grams as well. I'm not removing stop words, because the purpose of the SwiftKey model is to predict the next word to be typed, and stop words are among the words users type most.

```{r}

# Finding frequent terms
de_freq <- sort(rowSums(as.matrix(de_tdm)), decreasing = TRUE)
de_wc <- data.frame(term = names(de_freq), frequency = de_freq)
de_wc$cum_freq <- cumsum(de_wc$frequency)
total_instances <- sum(de_wc$frequency)

# Number of most frequent words needed to cover 50% of all word instances
words_50 <- which(de_wc$cum_freq >= 0.5 * total_instances)[1]
words_50

# Number of most frequent words needed to cover 90% of all word instances
words_90 <- which(de_wc$cum_freq >= 0.9 * total_instances)[1]
words_90

# Bar chart of the most frequent terms
p <- ggplot(subset(de_wc, frequency > 1000), aes(x = reorder(term, frequency), y = frequency))
p <- p + geom_bar(aes(fill = frequency), stat = "identity") + coord_flip() + xlab('words')
p

# Wordcloud
wordcloud(names(de_freq), de_freq, min.freq = 300, colors = brewer.pal(6, "Accent"))

```
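
The coverage question (how many words are needed to reach 50% or 90% of all word instances) can be illustrated on a tiny example. The sketch below uses made-up counts in `toy_counts` and a hypothetical helper `words_needed`. It only shows the cumulative-share idea and is not tied to the corpus above.

```{r}
# Made-up word counts, for illustration only (they sum to 100).
toy_counts <- c(die = 40, der = 25, und = 15, haus = 12, katze = 8)

# Sort by frequency and compute the cumulative share of all word instances.
toy_sorted <- sort(toy_counts, decreasing = TRUE)
toy_coverage <- cumsum(toy_sorted) / sum(toy_sorted)

# Smallest number of top-ranked words whose cumulative share reaches the target.
words_needed <- function(coverage, target) unname(which(coverage >= target)[1])

words_needed(toy_coverage, 0.5)
words_needed(toy_coverage, 0.9)
```

Here the two most frequent words cover 65% of the instances, so two words are enough for 50% coverage, while 90% coverage needs four of the five words. Real word frequencies are skewed in the same way, which is why a small dictionary can cover a large share of the text.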


## Analyzing frequent n-grams

So far, we have explored the behaviour of individual words in the corpora. Next, we look at the most frequent 2-grams and 3-grams.

```{r}
# Creating tokenizers. ngrams() and words() come from the NLP package,
# which is attached together with tm.
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

# 2-grams
de_tdm_2g <- TermDocumentMatrix(de_corpus, control = list(tokenize = BigramTokenizer))
de_tdm_2g <- removeSparseTerms(de_tdm_2g, 0.999)

# Finding frequent 2-grams
de_freq_2g <- sort(rowSums(as.matrix(de_tdm_2g)), decreasing = TRUE)
findFreqTerms(de_tdm_2g, lowfreq = 100)
de_wc_2g <- data.frame(term = names(de_freq_2g), occurrences = de_freq_2g)

# Wordcloud
wordcloud(names(de_freq_2g), de_freq_2g, min.freq = 75, colors = brewer.pal(6, "Accent"))

# 3-grams
de_tdm_3g <- TermDocumentMatrix(de_corpus, control = list(tokenize = TrigramTokenizer))
de_tdm_3g <- removeSparseTerms(de_tdm_3g, 0.999)

# Finding frequent 3-grams
de_freq_3g <- sort(rowSums(as.matrix(de_tdm_3g)), decreasing = TRUE)
findFreqTerms(de_tdm_3g, lowfreq = 12)
de_wc_3g <- data.frame(term = names(de_freq_3g), occurrences = de_freq_3g)

# Wordcloud. min.freq matches the lowfreq used above: 3-grams are much rarer
# than single words, so a higher threshold would leave little or nothing to plot.
wordcloud(names(de_freq_3g), de_freq_3g, min.freq = 12, scale = c(2, .25), max.words = 10, colors = brewer.pal(3, "Accent"))

```
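
To show what the tokenizers produce, here is a minimal base R sketch that builds n-grams from a short, made-up token sequence. The helper `make_ngrams` and the example sentence are hypothetical and are only meant to mirror the sliding-window idea, not the `NLP::ngrams` implementation.

```{r}
# Build n-grams from a character vector of tokens using base R only.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
}

# Made-up example sentence, already lower-cased and split on spaces.
toy_tokens <- strsplit("in den letzten jahren in den letzten tagen", " ", fixed = TRUE)[[1]]
toy_bigrams <- make_ngrams(toy_tokens, 2)
toy_trigrams <- make_ngrams(toy_tokens, 3)

# Frequency table of the 2-grams, most frequent first.
sort(table(toy_bigrams), decreasing = TRUE)
```

A sequence of 8 tokens yields 7 overlapping 2-grams and 6 overlapping 3-grams. Counting them gives the same kind of frequency table that the term-document matrices above aggregate over the whole sample.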

# Plan for further steps

Goals of the prediction model:

 - Explore different methods to predict text in German language.
 - Validate different models using a metric or accuracy measure.
 - Build a final text prediction model using the most frequent n-grams.


To build a prediction model, I have to find answers to these questions:

 - How can you efficiently store an n-gram model?
 - How can you use the knowledge about word frequencies to make your model smaller and more efficient?
 - How many parameters do you need (i.e. how big is n in your n-gram model)?
 - Can you think of simple ways to "smooth" the probabilities?
 - How do you evaluate whether your model is any good?
 - How can you use backoff models to estimate the probability of unobserved n-grams?
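
As a starting point for the backoff question, the sketch below shows one very simple way to back off from 3-grams to 2-grams to single words when predicting the next word. Everything in it (the training sentence and the names `count_ngrams`, `best_continuation` and `predict_next`) is hypothetical. It is a plain lookup without discounting or smoothing, so it should be read as an illustration of the idea rather than as the model this project will use.

```{r}
# Made-up training text, for illustration only.
train_tokens <- strsplit(
  "ich gehe nach hause und ich gehe nach berlin und ich bleibe hier",
  " ", fixed = TRUE)[[1]]

# Count n-grams of order n; the result is a table named by "w1 w2 ...".
count_ngrams <- function(tokens, n) {
  starts <- seq_len(length(tokens) - n + 1)
  table(vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1)))
}
uni <- count_ngrams(train_tokens, 1)
bi <- count_ngrams(train_tokens, 2)
tri <- count_ngrams(train_tokens, 3)

# Most frequent continuation of `prefix` in `counts`, or NA if the prefix was never seen.
best_continuation <- function(counts, prefix) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  top <- names(hits)[which.max(hits)]
  sub(".* ", "", top)  # keep only the last word of the winning n-gram
}

# Back off from 3-grams to 2-grams to single words until a prediction is found.
predict_next <- function(history) {
  h <- tail(history, 2)
  if (length(h) == 2) {
    guess <- best_continuation(tri, paste(h, collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  guess <- best_continuation(bi, tail(h, 1))
  if (!is.na(guess)) return(guess)
  names(uni)[which.max(uni)]  # fall back to the most frequent single word
}

predict_next(c("ich", "gehe"))    # found among the 3-grams
predict_next(c("hallo", "und"))   # unseen 3-gram prefix, falls back to 2-grams
predict_next(c("hallo", "welt"))  # nothing matches, falls back to single words
```

A real backoff model would also assign probabilities and discount the higher-order counts, which is where smoothing comes in.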

Plan for creating the prediction algorithm and the Shiny app:

 1. Define a metric to evaluate the performance of the model.
 2. Design the Shiny application.
 3. Clean and transform the data following the steps done in this data exploration.
 4. Set a simple baseline and build the model by improving over that baseline.
 5. Explore different parameters (like n in the n-gram model) to balance simplicity vs. accuracy.
 6. Explore cluster analysis (for example, k-means) to find similar words for unobserved words or n-grams.
 7. Build and validate different models to find an adequate solution.
 8. Evaluate the final model (which should be better than the baseline defined in step 4) using the metric in step 1.
 9. Develop a Shiny application according to its design.
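
For step 1, one candidate metric is top-k accuracy: the share of held-out cases in which the word the user actually typed is among the first k suggestions. The sketch below is a minimal base R version under that assumption. The data and the function name `top_k_accuracy` are made up, and the final choice of metric is still open.

```{r}
# Hypothetical held-out next words and the top-3 suggestions a model produced for each.
actual_next <- c("nach", "ich", "hier", "berlin")
suggestions <- list(
  c("nach", "zu", "in"),
  c("er", "ich", "sie"),
  c("da", "dort", "jetzt"),
  c("hause", "berlin", "oben")
)

# Top-k accuracy: share of cases where the true word is among the first k suggestions.
top_k_accuracy <- function(suggestions, actual, k) {
  hits <- mapply(function(s, a) a %in% head(s, k), suggestions, actual)
  mean(hits)
}

top_k_accuracy(suggestions, actual_next, k = 1)
top_k_accuracy(suggestions, actual_next, k = 3)
```

In this toy data the first suggestion is right in 1 of 4 cases (0.25), and the true word is among the top three in 3 of 4 cases (0.75). Reporting both values fits a keyboard that shows several suggestions at once.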

