Executive Summary

This report demonstrates the first steps of building a predictive text model for the Coursera Data Science Certificate program. It covers obtaining the data, cleaning it, and an exploratory analysis.

Getting and Loading Data

The first step in the project is to get the data and import it. A list of profane words is also required; I used a list of words banned by Google, which can be found at https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/.

Once I downloaded the datasets, I unzipped them and loaded them into my R session.

library(tidytext); library(dplyr); library(caret); library(quanteda);library(stringr)
library(tidyr);library(stringi);library(igraph);library(ggraph);library(widyr)
setwd("D:/Personal/Coursea/10.  Capstone Project")
con <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#Create Connections to load data
con1 <- file("en_US.blogs.txt", "r")
con2 <- file("en_US.news.txt", "r")
con3 <- file("en_US.twitter.txt", "r")
con4 <- file("profane_list.txt","r")
blogdata <- readLines(con1)
newsdata <- readLines(con2)
twitterdata <- readLines(con3)
profanedata <- readLines(con4)
profanedata <- as.data.frame(profanedata)

Corpora Summaries

Before cleaning the data for modelling, let’s take a look at some basic summary statistics of the three corpora used in the project.

Blog data has 38,601,176 words and 899,288 lines.

News data has 2,755,797 words and 77,259 lines.

Twitter data has 31,130,580 words and 2,360,148 lines.

Cleaning the Data

Because the three corpora contain so many lines and words, I decided to combine them and take a random sample of 5% of the lines to use from here on. Here is the code:

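#Combine the 3 datasets into 1, labelling each line with its source corpus (labels are assumed)
textdata <- bind_rows(
    data.frame(corpus = "blogs", text = blogdata, stringsAsFactors = FALSE),
    data.frame(corpus = "news", text = newsdata, stringsAsFactors = FALSE),
    data.frame(corpus = "twitter", text = twitterdata, stringsAsFactors = FALSE))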

textdata <- filter(textdata,!is.na(text))

#Randomly sample 5% of rows (166835 of 3336695 rows), then write the data to a csv
set.seed(123987)
sample_data <- textdata %>%
    sample_n(166835)

#remove non ascii characters
sample_data$text <- gsub("[^\x20-\x7E]", "", sample_data$text)

write.csv(sample_data,"sample_textdata.csv")

#Load sample data
sample_data <- read.csv("sample_textdata.csv")

Now that we have a smaller sample of the text, I used the tidytext R package to tokenize it. A token is a meaningful unit of text, most often a word, that can be used for further analysis. Tokenization splits the text into a one-token-per-row format. The unnest_tokens function of the tidytext package does this for me; it also strips punctuation, converts all words to lower case, and retains the line number each word came from. Here are the first few rows of the result:

head(toktextdata)

Exploratory Data Analysis

Basic Exploration

Now that we have tidied up the data, let’s explore it. First, I want to explore word frequency. One way to measure the importance of a word is by how often it occurs, which is called term frequency (tf). The problem with this measure is that common English words like “the”, “of”, and “and” would be classified as the most important. There are various approaches to dealing with this, but we will use inverse document frequency (idf), which reduces the weight of words that appear in many documents and increases the weight of words that appear in few. Here, each corpus (blogs, news, Twitter) is treated as one document. Finally, multiplying the term frequency by the inverse document frequency gives a statistic (called tf-idf) that is intended to measure how important a word is to a document.

To start, let’s count the occurrences of each word in each corpus, and the total number of words. Below are the first few results.

head(toktextdata_words)

As expected, “the” is the most common word, with 146,809 occurrences out of 3,550,923 total words. We can then calculate the term frequency (tf) as n/total. After that, we calculate the inverse document frequency, then the tf-idf statistic, and plot the results.

plot_tf

For the three corpora, the results were not unexpected. With the Twitter data, I was expecting to see things like abbreviations among the important words. For the news corpus, I was expecting to see more nouns (people, places), since news reports on specific events. For the blogs data, I was expecting more varied content since anyone can write a blog, and that appears to be what happened.

N-Grams

Moving on, we will look at relationships between words, starting with n-grams. An n-gram is a sequence of n adjacent words. In this analysis, we will look at bi-grams (two adjacent words) and tri-grams (three adjacent words). A first look at our bigrams yields the following:

bigrams %>% count(bigram, sort = TRUE)

The most frequent bigrams are made up of the most frequent words, which is not very interesting. One way to find more interesting pairs is to remove stop words, which are common words that add little information about the text. So, next we remove the stop words.

bigrams

These bigrams are far more interesting than the unfiltered ones. Now, let’s look at the tri-grams. Following the same process as above and removing the stop words, we get the following:

trigrams

Tri-grams are a little less interesting because most of the entries in the top 10 are either numbers or special occasions.

As a next step, let’s build a network map of our bigrams so we can visualize all the relationships between words.

markovchain

The visualization above is an example of a Markov chain in text analysis. In our context, a Markov chain is a model in which each word depends solely on the previous word. The graph shows the links between pairs of words. Only bigrams that appear more than 50 times are shown; otherwise the visualization would be unreadable. Note also that there are several clusters of words. For example, “minutes” sits in the middle of a cluster and is surrounded by numbers. This indicates that phrases such as “10 minutes” or “30 minutes” were very common in the corpora.

Next Steps

The next step in this project is to start building the first predictive model, so stay tuned!

---
title: "Data Science Capstone Project - Milestone 1"
output: html_notebook
author:  Chris Selig
---

## Executive Summary
This report demonstrates the first steps of building a predictive text model for the Coursera Data Science Certificate program.  It covers obtaining the data, cleaning it, and an exploratory analysis.

## Getting and Loading Data
The first step in the project is to get the data and import it.  A list of profane words is also required; I used a list of words banned by Google, which can be found [here](https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/).

Once I downloaded the datasets, I unzipped them and loaded them into my R session.

```{r, cache=TRUE, message=FALSE, warning=FALSE}
library(tidytext); library(dplyr); library(caret); library(quanteda);library(stringr)
library(tidyr);library(stringi);library(igraph);library(ggraph);library(widyr)

setwd("D:/Personal/Coursea/10.  Capstone Project")

con <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

#Create Connections to load data
con1 <- file("en_US.blogs.txt", "r")
con2 <- file("en_US.news.txt", "r")
con3 <- file("en_US.twitter.txt", "r")
con4 <- file("profane_list.txt","r")

blogdata <- readLines(con1)
newsdata <- readLines(con2)
twitterdata <- readLines(con3)
profanedata <- readLines(con4)

profanedata <- as.data.frame(profanedata)
```
```{r message=FALSE, include=FALSE}
setwd("D:/Personal/Coursea/10.  Capstone Project")
sample_data <- read.csv("sample_textdata.csv")
```
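
The connections above are opened but never closed.  As a minimal sketch (not part of the original analysis), a small helper could read a file and release its connection in one step.  The name `read_corpus` is hypothetical, and the sketch assumes the files are UTF-8 encoded.  Opening in binary mode is a common precaution, since text-mode reads on Windows can stop early at an embedded control character.

```r
# Hypothetical helper (not in the original analysis): read all lines of a
# text file and make sure the connection is closed afterwards.
read_corpus <- function(path) {
    con <- file(path, open = "rb")       # binary mode: avoids early end-of-file
    on.exit(close(con))                  # always release the connection
    readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Small demonstration on a temporary file instead of the real corpora
demo_file <- tempfile(fileext = ".txt")
writeLines(c("First line of text.", "Second line of text."), demo_file)
demo_lines <- read_corpus(demo_file)
length(demo_lines)
```

With a helper like this, each corpus could be loaded with a single call such as `read_corpus("en_US.blogs.txt")`.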
## Corpora Summaries
Before cleaning the data for modelling, let's take a look at some basic summary statistics of the three corpora used in the project.
```{r, include=FALSE, cache=TRUE, message=FALSE, warning=FALSE}
blogwordcount <- prettyNum(sum(str_count(blogdata,'\\w+'),na.rm = TRUE), big.mark = ",")
newswordcount <- prettyNum(sum(str_count(newsdata,'\\w+'),na.rm = TRUE), big.mark = ",")
twitterwordcount <- prettyNum(sum(str_count(twitterdata,'\\w+'),na.rm = TRUE), big.mark = ",")
bloglinecount <- prettyNum(length(blogdata), big.mark = ",")
newslinecount <- prettyNum(length(newsdata), big.mark = ",")
twitterlinecount <- prettyNum(length(twitterdata), big.mark = ",")
```

Blog data has `r blogwordcount` words and `r bloglinecount` lines.

News data has `r newswordcount` words and `r newslinecount` lines.

Twitter data has `r twitterwordcount` words and `r twitterlinecount` lines.


## Cleaning the Data

Because the three corpora contain so many lines and words, I decided to combine them and take a random sample of 5% of the lines to use from here on.  Here is the code:

```{r, cache=TRUE, warning=FALSE, message=FALSE, eval=FALSE}
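#Combine the 3 datasets into 1, labelling each line with its source corpus (labels are assumed)
textdata <- bind_rows(
    data.frame(corpus = "blogs", text = blogdata, stringsAsFactors = FALSE),
    data.frame(corpus = "news", text = newsdata, stringsAsFactors = FALSE),
    data.frame(corpus = "twitter", text = twitterdata, stringsAsFactors = FALSE))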

textdata <- filter(textdata,!is.na(text))

#Randomly sample 5% of rows (166835 of 3336695 rows), then write the data to a csv
set.seed(123987)
sample_data <- textdata %>%
    sample_n(166835)

#remove non ascii characters
sample_data$text <- gsub("[^\x20-\x7E]", "", sample_data$text)

write.csv(sample_data,"sample_textdata.csv")

#Load sample data
sample_data <- read.csv("sample_textdata.csv")
```

Now that we have a smaller sample of the text, I used the tidytext R package to tokenize it.  A token is a meaningful unit of text, most often a word, that can be used for further analysis.  Tokenization splits the text into a one-token-per-row format.  The unnest_tokens function of the tidytext package does this for me; it also strips punctuation, converts all words to lower case, and retains the line number each word came from.  Here are the first few rows of the result:

```{r, cache=TRUE, message = FALSE, include = FALSE}
names(sample_data) <- c("row","corpus","text")
sample_data$text <- as.character(sample_data$text)

toktextdata <- sample_data %>%
    unnest_tokens(word, text)
```
```{r}
head(toktextdata)
```
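
To make the tokenization steps concrete, here is a rough base R sketch applied to two made-up lines.  It only approximates what `unnest_tokens` does (the real tokenizer handles punctuation and word boundaries more carefully), and `tokenize_lines` is a hypothetical helper name.

```r
# Hypothetical base R approximation of one-token-per-row tokenization.
tokenize_lines <- function(lines) {
    cleaned <- tolower(lines)                          # lower-case everything
    cleaned <- gsub("[^a-z0-9' ]", " ", cleaned)       # replace punctuation with spaces
    words <- strsplit(trimws(cleaned), " +")           # split on runs of spaces
    data.frame(
        line = rep(seq_along(words), lengths(words)),  # source line number
        word = unlist(words),
        stringsAsFactors = FALSE
    )
}

example_lines <- c("The quick brown fox.", "It's a Dog, not a fox!")
tokens <- tokenize_lines(example_lines)
tokens
```

Each row of the result holds one word together with the number of the line it came from, which is the same shape as `toktextdata`.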
## Exploratory Data Analysis


### Basic Exploration
Now that we have tidied up the data, let's explore it.  First, I want to explore word frequency.  One way to measure the importance of a word is by how often it occurs, which is called term frequency (tf).  The problem with this measure is that common English words like "the", "of", and "and" would be classified as the most important.  There are various approaches to dealing with this, but we will use inverse document frequency (idf), which reduces the weight of words that appear in many documents and increases the weight of words that appear in few.  Here, each corpus (blogs, news, Twitter) is treated as one document.  Finally, multiplying the term frequency by the inverse document frequency gives a statistic (called tf-idf) that is intended to measure how important a word is to a document.
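
As a small worked example of the arithmetic, the sketch below computes tf, idf, and tf-idf by hand in base R.  The counts are made up for illustration, and the sketch assumes the natural logarithm for idf, which I believe matches what `bind_tf_idf` does.

```r
# Toy word counts for two "documents" (made-up numbers, for illustration)
counts <- data.frame(
    corpus = c("blogs", "blogs", "news", "news"),
    word   = c("the", "recipe", "the", "senate"),
    n      = c(8, 2, 9, 1),
    stringsAsFactors = FALSE
)

# Term frequency: share of each document's words taken up by the term
counts$tf <- counts$n / ave(counts$n, counts$corpus, FUN = sum)

# Inverse document frequency: log(total documents / documents with the term)
n_docs <- length(unique(counts$corpus))
docs_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_word)

# tf-idf is the product of the two
counts$tf_idf <- counts$tf * counts$idf
counts
```

Because "the" appears in both toy documents, its idf is log(2/2) = 0, so its tf-idf is zero however often it occurs.  Words that appear in only one document keep a positive weight.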


To start, let's count the occurrences of each word in each corpus, and the total number of words.  Below are the first few results.

```{r, cache=TRUE, message=FALSE, include=FALSE}
#Count words in document
toktextdata_words <- toktextdata %>%
    count(corpus, word,sort = TRUE) %>%
    ungroup()

#Count total words, to be used later
toktextdata_total_words <- toktextdata_words %>%
    group_by(corpus) %>%
    summarize(
        total = sum(n))

toktextdata_words <- left_join(toktextdata_words, toktextdata_total_words)
```
```{r message=FALSE}
head(toktextdata_words)
```

As expected, "the" is the most common word, with 146,809 occurrences out of 3,550,923 total words.  We can then calculate the term frequency (tf) as n/total.  After that, we calculate the inverse document frequency, then the tf-idf statistic, and plot the results.

```{r, include=FALSE, message=FALSE, cache=TRUE}
#Calculate the term frequency, inverse document frequency, and tf-idf
termfreq <- toktextdata_words %>%
    bind_tf_idf(word, corpus, n) %>%
    arrange(desc(tf_idf))

#Plot tf-idf
plot_tf <- termfreq %>%
    arrange(desc(tf_idf)) %>%
    mutate(word = factor(word, levels = rev(unique(word))))

plot_tf <- plot_tf %>%
    group_by(corpus)%>%
    top_n(10, tf_idf) %>%
    ungroup() %>%
    ggplot(aes(word, tf_idf, fill = corpus)) +
    geom_col() +
    labs(x = NULL, y = "tf-idf") +
    coord_flip()+
    facet_wrap(~corpus, ncol = 2, scales = "free")+
    theme_classic()
```
```{r}
plot_tf
```

For the three corpora, the results were not unexpected.  With the Twitter data, I was expecting to see things like abbreviations among the important words.  For the news corpus, I was expecting to see more nouns (people, places), since news reports on specific events.  For the blogs data, I was expecting more varied content since anyone can write a blog, and that appears to be what happened.

### N-Grams

Moving on, we will look at relationships between words, starting with n-grams.  An n-gram is a sequence of n adjacent words.  In this analysis, we will look at bi-grams (two adjacent words) and tri-grams (three adjacent words).  A first look at our bigrams yields the following:

```{r cache=TRUE, include=FALSE, message=FALSE, echo=FALSE}
bigrams <- sample_data %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) 
```
```{r}
#Show the most frequent bigrams
bigrams %>%
    count(bigram,sort = TRUE)
```

The most frequent bigrams are made up of the most frequent words, which is not very interesting.  One way to find more interesting pairs is to remove stop words, which are common words that add little information about the text.  So, next we remove the stop words.
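
The idea can be sketched in base R on a single made-up sentence: pair each word with the one that follows it, then drop any pair containing a stop word.  The three-word stop list here is purely illustrative; the analysis itself uses the `stop_words` data set from tidytext.

```r
# Build bigrams from a vector of tokens by pairing each word with the next
toy_tokens <- c("the", "coffee", "shop", "on", "main", "street", "is", "open")
bigram_df <- data.frame(
    word1 = head(toy_tokens, -1),   # every word except the last
    word2 = tail(toy_tokens, -1),   # every word except the first
    stringsAsFactors = FALSE
)

# Tiny illustrative stop word list (the analysis uses tidytext's stop_words)
toy_stop_words <- c("the", "on", "is")

# Keep only bigrams where neither word is a stop word
kept <- bigram_df[!(bigram_df$word1 %in% toy_stop_words) &
                  !(bigram_df$word2 %in% toy_stop_words), ]
kept
```

Of the seven bigrams in the toy sentence, only "coffee shop" and "main street" survive the filter.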

```{r cache=TRUE, include=FALSE}
#Remove stop words so we can see more interesting pairs of words
data("stop_words")
bigrams_separated <- bigrams %>%
    separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word)

# new bigram counts:
bigrams <- bigrams_filtered %>% 
    count(word1, word2, sort = TRUE)
```
```{r}
bigrams
```
These bigrams are far more interesting than the unfiltered ones.  Now, let's look at the tri-grams.  Following the same process as above and removing the stop words, we get the following:

```{r cache=TRUE, echo=FALSE, message=FALSE, include=FALSE}
#Tri-grams
trigrams <- sample_data %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) 

trigrams %>%
    count(trigram,sort = TRUE)

#Remove stop words so we can see more interesting pairs of words
trigrams_separated <- trigrams %>%
    separate(trigram, c("word1", "word2","word3"), sep = " ")

trigrams_filtered <- trigrams_separated %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>%
    filter(!word3 %in% stop_words$word)

# new trigram counts:
trigrams <- trigrams_filtered %>% 
    count(word1, word2,word3, sort = TRUE)
```
```{r}
trigrams
```

Tri-grams are a little less interesting because most of the entries in the top 10 are either numbers or special occasions.

As a next step, let's build a network map of our bigrams so we can visualize all the relationships between words.
```{r cache=TRUE, echo= FALSE}
bigram_graph <- bigrams %>%
    filter(n > 50) %>%
    graph_from_data_frame()

markovchain <- ggraph(bigram_graph, layout = "fr") +
    geom_edge_link() +
    geom_node_point() +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```
```{r}
markovchain
```

The visualization above is an example of a Markov chain in text analysis.  In our context, a Markov chain is a model in which each word depends solely on the previous word.  The graph shows the links between pairs of words.  Only bigrams that appear more than 50 times are shown; otherwise the visualization would be unreadable.  Note also that there are several clusters of words.  For example, "minutes" sits in the middle of a cluster and is surrounded by numbers.  This indicates that phrases such as "10 minutes" or "30 minutes" were very common in the corpora.
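
To illustrate how such a chain could be used for prediction, here is a minimal base R sketch that picks the most likely next word from a table of bigram counts.  The counts are made up, and `predict_next` is a hypothetical helper, not the model this project will eventually build.

```r
# Toy bigram counts (made-up numbers standing in for the real bigram table)
toy_bigrams <- data.frame(
    word1 = c("10", "10", "30", "minutes", "minutes"),
    word2 = c("minutes", "years", "minutes", "ago", "late"),
    n     = c(120, 40, 90, 75, 25),
    stringsAsFactors = FALSE
)

# Hypothetical helper: most likely next word given only the previous word
predict_next <- function(bigram_counts, previous) {
    candidates <- bigram_counts[bigram_counts$word1 == previous, ]
    if (nrow(candidates) == 0) return(NA_character_)   # previous word never seen
    # Transition probability = count of the pair divided by the count of
    # all pairs that start with the previous word
    candidates$prob <- candidates$n / sum(candidates$n)
    candidates$word2[which.max(candidates$prob)]
}

predict_next(toy_bigrams, "10")
predict_next(toy_bigrams, "minutes")
```

The helper looks only at the previous word, which is exactly the Markov assumption described above.  A word that never starts a bigram returns `NA`.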


## Next Steps
The next step in this project is to start building the first predictive model, so stay tuned!

