Introduction

The goal of this project is to demonstrate that I have become familiar with the data and that I am on track to create the prediction algorithm. This submission explains the exploratory data analysis and the goals for the eventual app and algorithm. The document is concise: it explains only the major features of the data identified so far and briefly summarizes the plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager. The report uses tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Load the libraries

library(stringi)    # string statistics and word counts
library(tm)         # text mining: corpus creation and cleaning
library(wordcloud)  # word cloud plots (also attaches RColorBrewer)
library(RWeka)      # n-gram tokenization
library(ggplot2)    # frequency bar plots

Download and Import the Data

The dataset contains text in four languages, but only English will be used. The English dataset consists of three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

# Download and unzip file if it doesn't exist
if(!file.exists('Coursera-SwiftKey.zip')){
    download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',
                  destfile = './Coursera-SwiftKey.zip', method = 'curl', quiet = TRUE)
    unzip('./Coursera-SwiftKey.zip')
}

# Load blogs data
blogsFileName <- "final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Load news data
newsFileName <- "final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Load twitter data
twitterFileName <- "final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Remove variables no longer needed to free up memory
rm(con)

Original Data Summary

For each file, the following summary statistics were calculated: size in megabytes, number of lines (rows), total number of characters, average number of characters per line, total number of words, and average number of words per line.

original_summary <- data.frame(
                'File' = c("Blogs","News","Twitter"),
                'FileSizeinMB' = round(file.info(c(blogsFileName,                            
                                                   newsFileName,
                                                   twitterFileName))$size / 1024 ^ 2),
                'NumberofLines' = sapply(list(blogs, news, twitter), length),
                'TotalCharacters' = sapply(list(blogs, news, twitter), function(x){sum(nchar(x))}),
                'MeanCharactersinLine' = sapply(list(blogs, news, twitter), 
                                               function(x){mean(nchar(x))}), 
                'TotalWords' = sapply(list(blogs, news, twitter), stri_stats_latex)[4,],  # row 4 = "Words"
                'MeanWordsinLine' = sapply(list(blogs, news, twitter),
                                    function(x) mean(stri_count_words(x)))
                               )
original_summary
##      File FileSizeinMB NumberofLines TotalCharacters MeanCharactersinLine
## 1   Blogs          200        899288       206824505            229.98695
## 2    News          196         77259        15639408            202.42830
## 3 Twitter          159       2360148       162096241             68.68054
##   TotalWords MeanWordsinLine
## 1   37570839        41.75107
## 2    2651432        34.61779
## 3   30451170        12.75065

Sample Data Summary

To build the models, it is not necessary to load and use all of the data; a relatively small random sample is enough to infer facts about the population. Since the files are large (see the summary table above), the analysis proceeds with a subset of the original dataset (1% of each file), because running the calculations on the full files would be very slow.

# Set seed for reproducibility
set.seed(93)

# Assign sample size
sampleSize = 0.01

# Sample all three data sets
blogs <- sample(blogs, length(blogs) * sampleSize, replace = FALSE)
news <- sample(news, length(news) * sampleSize, replace = FALSE)
twitter <- sample(twitter, length(twitter) * sampleSize, replace = FALSE)

# Remove all non-English characters as they cause issues
blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")

# Combine all three data sets into a single data set 
sampleData <- c(blogs, news, twitter)

# Save this sample dataset
sampleDataFileName <- "final/en_US/en_US.sample.txt"
write(sampleData, sampleDataFileName)

# Remove variables no longer needed to free up memory
rm(blogs, news, twitter)

# Stats of sample data
sample_summary <- data.frame(
                'File' = c("Sample"),
                'FileSizeinMB' = round(file.info(c(sampleDataFileName))$size / 1024 ^ 2),
                'NumberofLines' = sapply(list(sampleData), length),
                'TotalCharacters' = sapply(list(sampleData), function(x){sum(nchar(x))}),
                'MeanCharactersinLine' = sapply(list(sampleData), 
                                               function(x){mean(nchar(x))}), 
                'TotalWords' = sapply(list(sampleData), stri_stats_latex)[4,],  # row 4 = "Words"
                'MeanWordsinLine' = sapply(list(sampleData),
                                    function(x) mean(stri_count_words(x))),
                row.names = NULL
                             )
sample_summary
##     File FileSizeinMB NumberofLines TotalCharacters MeanCharactersinLine
## 1 Sample            4         33365         3823203             114.5872
##   TotalWords MeanWordsinLine
## 1     701325         21.0071

Data Preprocessing

The selected text data needs to be cleaned before it can be used in the word prediction model. For this reason, the sample data will be cleaned by removing URLs, Twitter handles, email addresses, profanity, stopwords, punctuation, numbers, and extra whitespace, and by converting all text to lowercase.

# Function to build and clean the corpus
buildCorpus <- function (dataSet) {
    data <- VCorpus(VectorSource(dataSet))
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    
    # Remove URL, Twitter handles and email patterns
    data <- tm_map(data, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
    data <- tm_map(data, toSpace, "@[^\\s]+")
    data <- tm_map(data, toSpace, "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b")
    
    # Convert all words to lowercase
    data <- tm_map(data, tolower)
    
    # Remove profane words from the sample data set
    ### Download the file if it doesn't exist
    if(!file.exists('bad-words.txt')){
        download.file('http://www.cs.cmu.edu/~biglou/resources/bad-words.txt',
                      destfile = 'bad-words.txt', quiet = TRUE)
    }
    ### Load profanity words
    con <- file("bad-words.txt", open = "r")
    profanityWords <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
    close(con)
    ### Remove all non-English characters as they cause issues
    profanityWords <- iconv(profanityWords, "latin1", "ASCII", sub = "")
    ### Remove profane words
    data <- tm_map(data, removeWords, profanityWords)
    
    ### Remove stopwords, punctuation, numbers and extra whitespace
    data <- tm_map(data, removeWords, stopwords("english"))
    data <- tm_map(data, removePunctuation)
    data <- tm_map(data, removeNumbers)
    data <- tm_map(data, stripWhitespace)
    data <- tm_map(data, PlainTextDocument)
    
    return(data)
}

# Build the corpus 
sampleCorpus <- buildCorpus(sampleData)

# Remove variables no longer needed to free up memory
rm(sampleData)

Exploratory Data Analysis

Now that the corpus has been cleaned, I can perform exploratory analysis on the tidy data.

Word Frequencies

The word cloud below illustrates the most frequently occurring words in the sample data.

wordcloud(sampleCorpus, max.words = 200, random.order = FALSE, rot.per = 0.15,
          use.r.layout = FALSE, colors=brewer.pal(8, "Dark2"), scale = c(2.9, 0.4))

Tokenizing and N-Gram Generation

The cleaned data needs to be converted into a format that is useful for NLP. A convenient format is N-grams stored in term-document matrices, where the terms (words or word combinations) are the rows, the documents are the columns, and the frequency of each term in each document is the entries. Because the number of unique words in the corpus can be very large, n-gram models are created to explore word frequencies.
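
As a small toy illustration of this representation (not part of the analysis, and using two made-up sentences), bigrams can be stored in a term-document matrix as follows:

# Toy illustration only: bigrams stored in a term-document matrix
toyCorpus <- VCorpus(VectorSource(c("the cat sat on the mat",
                                    "the cat ran to the door")))
toyBigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
toyTDM <- TermDocumentMatrix(toyCorpus, control = list(tokenize = toyBigramTokenizer))
inspect(toyTDM)
# Bigrams such as "the cat" (present in both documents) and "cat sat" appear as
# the rows, the two documents as the columns, and the frequency of each bigram
# in each document as the entries.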

# Function to compute n-gram word frequencies ('val' is the maximum sparsity allowed by removeSparseTerms)
getMatrixFreq <- function (corpus, n, val) {
  
    ntokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
    
    # Create term document matrix for the corpus
    termDocMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = ntokenizer))
    
    # Eliminate sparse terms for each n-gram and get frequencies of most common n-grams
    matrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(termDocMatrix, val))), decreasing = TRUE)
    matrixFreq <- data.frame(word = names(matrixFreq), freq = matrixFreq)
    
    return(matrixFreq)
}

Unigrams

The unigram analysis shows the most frequent individual words and their respective frequencies.

unigramMatrixFreq <- getMatrixFreq(sampleCorpus, 1, 0.99)

# Plot
ggplot(unigramMatrixFreq[1:20,], aes(x = reorder(word, freq), y = freq)) +
       geom_bar(stat = "identity", fill = I("grey50")) +
       coord_flip() +
       geom_text(aes(label = freq ), hjust = -0.10, size = 3) +
       ggtitle("Most Common Unigrams") + xlab("") + ylab("Frequency") +
       theme(plot.title = element_text(size = 14, hjust = 0.5))

Bigrams

The bigram analysis shows the most frequent two-word combinations and their respective frequencies.

bigramMatrixFreq <- getMatrixFreq(sampleCorpus, 2, 0.999)

# Plot
ggplot(bigramMatrixFreq[1:20,], aes(x = reorder(word, freq), y = freq)) +
       geom_bar(stat = "identity", fill = I("grey50")) +
       coord_flip() +
       geom_text(aes(label = freq ), hjust = -0.10, size = 3) +
       ggtitle("Most Common Bigrams") + xlab("") + ylab("Frequency") +
       theme(plot.title = element_text(size = 14, hjust = 0.5))

Trigrams

The trigram analysis shows the most frequent three-word combinations and their respective frequencies.

trigramMatrixFreq <- getMatrixFreq(sampleCorpus, 3, 0.9999)

# Plot
ggplot(trigramMatrixFreq[1:20,], aes(x = reorder(word, freq), y = freq)) +
       geom_bar(stat = "identity", fill = I("grey50")) +
       coord_flip() +
       geom_text(aes(label = freq ), hjust = -0.10, size = 3) +
       ggtitle("Most Common Trigrams") + xlab("") + ylab("Frequency") +
       theme(plot.title = element_text(size = 14, hjust = 0.5))

Conclusion

Through this analysis, we can see the most common words in the corpus and infer the main topics of the texts.

Future Goals

The next steps of this capstone project are to finalize the predictive algorithm and deploy it as a Shiny application. For the Shiny application, the plan is to create an app with a simple interface where the user can enter a string of text; the prediction model will then return a list of suggested next words.
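
As a rough sketch of the planned next-word lookup, the bigram and trigram frequency tables built above could be queried directly. The helper below (its name, predictNextWord, and its scoring by raw n-gram frequency are assumptions for illustration, not the final design) looks for trigrams starting with the last two words of the input and backs off to bigrams; the eventual algorithm will also need proper smoothing and handling of unseen words.

# Illustrative sketch only: suggest up to 'n' next words for an input phrase
# by matching its last words against the n-gram frequency tables above
predictNextWord <- function(phrase, n = 3) {
    phrase <- trimws(tolower(phrase))
    if (nchar(phrase) == 0) return(character(0))
    tokens <- tail(unlist(strsplit(phrase, "\\s+")), 2)
    # First look for trigrams that start with the last two words
    if (length(tokens) == 2) {
        prefix <- paste(tokens, collapse = " ")
        hits <- trigramMatrixFreq[grepl(paste0("^", prefix, " "), trigramMatrixFreq$word), ]
        if (nrow(hits) > 0) {
            return(head(sapply(strsplit(as.character(hits$word), " "), `[`, 3), n))
        }
    }
    # Otherwise back off to bigrams that start with the last word
    prefix <- tail(tokens, 1)
    hits <- bigramMatrixFreq[grepl(paste0("^", prefix, " "), bigramMatrixFreq$word), ]
    head(sapply(strsplit(as.character(hits$word), " "), `[`, 2), n)
}

# Example usage (results depend on the sampled data)
predictNextWord("happy new")

Note that the exploratory corpus above has stopwords removed, so the n-gram tables used by the final model will likely be rebuilt from a less aggressively cleaned corpus. For the Shiny application itself, a minimal interface sketch (assuming the shiny package and the illustrative predictNextWord helper above) could look like this:

library(shiny)

ui <- fluidPage(
    titlePanel("Next Word Prediction (planned)"),
    textInput("phrase", "Enter a phrase:"),
    textOutput("suggestions")
)

server <- function(input, output) {
    output$suggestions <- renderText({
        paste(predictNextWord(input$phrase), collapse = ", ")
    })
}

# Run with: shinyApp(ui = ui, server = server)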