Capstone Milestone Report - Data Exploration Analysis

Summary

The purpose of this project is to build predictive text models for a smart keyboard that make it easier for people to type on their mobile devices. In this report, we run an exploratory analysis to better understand the dataset, then outline a plan for developing the prediction algorithm and Shiny app.

The dataset comes from the HC Corpora corpus. The report includes all the steps to download, process and explore the data to ensure reproducibility. We use base R, tm and other packages, which are listed in the appendix.

Downloading the data

First, the dataset is downloaded and unzipped into the folder data/swiftkey. The script then runs through each file in the final/en_US folder and prints a basic summary of it.

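# Load the packages used throughout the report
library(tm)       # corpus handling and document-term matrices
library(NLP)      # ngrams() and words() for the bi-gram tokenizer
library(ngram)    # wordcount()
library(ggplot2)  # plots

# Create the data folder (and any missing parent folders) if it doesn't exist
if (!dir.exists("data/swiftkey")) {
  dir.create("data/swiftkey", recursive = TRUE)
}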

# Download the archive if the data is not already present, extract it to the
# designated folder, and then delete the downloaded archive.
if(!file.exists("data/swiftkey/final/en_US/en_US.twitter.txt")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
                destfile = "data/swiftkey/Coursera-SwiftKey.zip", mode = "wb")
  unzip("data/swiftkey/Coursera-SwiftKey.zip", exdir = "data/swiftkey")
  unlink("data/swiftkey/Coursera-SwiftKey.zip")
}

# Get the names of all the .txt files in the folder using the dir function.
filenames <- dir("data/swiftkey/final/en_US", pattern = "\\.txt$")

# Create an empty list for storing all the file content.
data <- list()

# Loop over the files, reading each one once and reusing the result.
for (f in filenames) {
  path <- file.path("data/swiftkey/final/en_US", f)
  con <- file(path) # Create a connection to the file
  # Note: on Windows, a text-mode read may stop early at an embedded control
  # character; opening the connection with open = "rb" is one way to avoid that.
  data[[f]] <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con) # Close the connection once the file has been read
  # Print a summary of the file.
  print(paste0("File: ", f, ", ",
               "Size: ", sprintf("%.1f", file.size(path) / 1048576), " MB, ",
               "Number of lines: ", length(data[[f]]), ", ",
               "Max # of characters in 1 sentence: ", max(nchar(data[[f]])), ", ",
               "Number of words: ", wordcount(data[[f]])))
}

## [1] "File: en_US.blogs.txt, Size: 200.4 MB, Number of lines: 899288, Max # of characters in 1 sentence: 40833, Number of words: 37334131"
## [1] "File: en_US.news.txt, Size: 196.3 MB, Number of lines: 77259, Max # of characters in 1 sentence: 5760, Number of words: 2643969"
## [1] "File: en_US.twitter.txt, Size: 159.4 MB, Number of lines: 2360148, Max # of characters in 1 sentence: 140, Number of words: 30373583"

Pre-processing the data

As the data is huge, instead of using all of the content, we sample 10,000 lines from each file and then create a corpus from the sampled content.
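The sampling step can be sketched on a toy vector. This is a minimal illustration rather than the report's own code: the helper name sample_lines and the toy data are assumptions, and the cap on the sample size is an extra safeguard for inputs shorter than the requested sample.

```r
# Hypothetical helper: draw up to n lines from a character vector,
# without replacement.
sample_lines <- function(lines, n = 10000) {
  lines[sample.int(length(lines), min(n, length(lines)))]
}

toy <- paste("line", 1:25)       # stand-in for the lines of one file

set.seed(12345)
first <- sample_lines(toy, 10)

set.seed(12345)
second <- sample_lines(toy, 10)

identical(first, second)         # TRUE: the same seed gives the same sample
length(sample_lines(toy, 100))   # 25: capped at the number of available lines
```

Fixing the seed before sampling is what makes the sampled corpus, and every figure derived from it, reproducible.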

Thereafter, we will clean the data in the following manner:

Replace punctuation that joins words, such as ., - and :, with whitespace.
Replace non-printable characters (garbage text) with whitespace.
Convert all the content to lowercase.
Remove punctuation and numbers.
Collapse repeated whitespace into a single space.
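To show what these steps do to a piece of text, here is a base R sketch applied to a single string. The helper clean_text and the sample sentence are assumptions made for illustration. The regular expressions approximate the tm transformations used in the report rather than reproduce them exactly, and the final trimws call is an extra step.

```r
# Hypothetical helper mirroring the cleaning steps on a plain character vector.
clean_text <- function(x) {
  x <- gsub("[-.:]", " ", x)          # joining punctuation -> space
  x <- gsub("[^[:graph:]]", " ", x)   # non-printable characters -> space
  x <- tolower(x)                     # lowercase
  x <- gsub("[[:punct:]]", "", x)     # remove remaining punctuation
  x <- gsub("[[:digit:]]", "", x)     # remove numbers
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # extra step: drop leading/trailing space
}

clean_text("Well-known fact: R 3.4 is GREAT!!")
# [1] "well known fact r is great"
```

The order matters: joining punctuation is turned into spaces before the general punctuation removal, so "Well-known" becomes two words instead of "wellknown".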

set.seed(12345)
data.sample <- list()

for(i in 1:length(data)) {
  # Looping through the list, sample 10,000 lines of content from each file. 
  data.sample[[i]] <- data[[i]][sample(1:length(data[[i]]), 10000, replace = FALSE)]
}

# Create corpus and give metadata for traceability purposes
content <- VCorpus(VectorSource(data.sample))
meta(content, tag ="From") <- c("blogs", "news", "twitter")

# Creating a helper function to find something and replace it with a space. 
toSpace <- content_transformer( function(x, pattern) {
    return (gsub(pattern, " ", x))
})

# Replace punctuation that joins words with a space
content <- tm_map(content, toSpace, "\\.")
content <- tm_map(content, toSpace, "-")
content <- tm_map(content, toSpace, ":")

# Replace garbage text with a space
content <- tm_map(content, content_transformer(gsub), pattern = "[^[:graph:]]", replacement = " ")

# Standard steps: convert to lower case, remove punctuation and numbers, collapse whitespace.
content <- tm_map(content, content_transformer(tolower))
content <- tm_map(content, removePunctuation)
content <- tm_map(content, removeNumbers)
content <- tm_map(content, stripWhitespace)

Document Term Matrix

Next, we create a document-term matrix, which lists the occurrences of every word in the corpus.

# Generating a document term matrix from the corpus
dtm_content <- DocumentTermMatrix(content)
dtm_content

## <<DocumentTermMatrix (documents: 3, terms: 46490)>>
## Non-/sparse entries: 71335/68135
## Sparsity           : 49%
## Maximal term length: 98
## Weighting          : term frequency (tf)

There are 46,490 terms in 3 documents, and 49% of the document-term entries are zero (sparse). The longest term is 98 characters, which is likely a combination of 2 or more words.
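The sparsity figure follows directly from the entry counts in the printed summary. The short check below only redoes that arithmetic; the variable names are illustrative.

```r
# Figures copied from the printed DocumentTermMatrix summary.
n_docs     <- 3
n_terms    <- 46490
non_sparse <- 71335
sparse     <- 68135

total_cells <- n_docs * n_terms    # one cell per document/term pair
stopifnot(total_cells == non_sparse + sparse)

sparsity_pct <- round(100 * sparse / total_cells)
sparsity_pct
# [1] 49
```

In other words, about half of all term and document combinations never occur, which is expected when only three large documents share a vocabulary.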

Word Frequency

Using the matrix, we can find the words that appear most frequently across these 3 documents.

# Sum all the occurrences of terms across the documents
freq <- colSums(as.matrix(dtm_content))

# Then, sort the freq in descending order of the occurrences of terms.
ord <- order(freq, decreasing = TRUE)

# List the top 20 terms and plot them as a bar chart
dtm_content_wf <- data.frame(terms = names(freq[head(ord, 20)]), occurrences = freq[head(ord, 20)])
p <- ggplot(dtm_content_wf, aes(terms, occurrences))
p <- p + geom_bar(stat = "identity")
p <- p + labs(title = "Most Frequent Words")
p

Bi-gram Exploration

Following the same process, we can also explore the most frequently used bi-grams (pairs of consecutive words) in these 3 documents. To generate a bi-gram matrix, we first need to construct a bi-gram tokenizer using the ngrams function from the NLP package.
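As a quick illustration of what a bi-gram tokenizer produces, the base R sketch below pairs each word of a sentence with the word that follows it. The helper make_bigrams and the sample sentence are assumptions for illustration; the report itself relies on ngrams from the NLP package.

```r
# Hypothetical helper: split on whitespace, then pair each word with its successor.
make_bigrams <- function(sentence) {
  w <- strsplit(trimws(sentence), "[[:space:]]+")[[1]]
  if (length(w) < 2) return(character(0))   # no pair can be formed
  paste(head(w, -1), tail(w, -1))
}

make_bigrams("thanks for the follow")
# [1] "thanks for" "for the"    "the follow"
```

A sentence of n words yields n - 1 bi-grams, which is why the bi-gram matrix has far more distinct terms than the single-word matrix.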

# Helper function to generate bi-gram matrix using the ngrams function from NLP package.
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x),2), paste, collapse=" "), use.names = FALSE)
}

# Then generate a Document Term Matrix
dtm_content_bigram <- DocumentTermMatrix(content, control = list(tokenize = BigramTokenizer))
dtm_content_bigram

## <<DocumentTermMatrix (documents: 3, terms: 414590)>>
## Non-/sparse entries: 480563/763207
## Sparsity           : 61%
## Maximal term length: 104
## Weighting          : term frequency (tf)

# Sum all the occurrences of bi-gram terms across the documents
freqbi <- colSums(as.matrix(dtm_content_bigram))

# Then, sort them in the descending order.
ordbi <- order(freqbi, decreasing = TRUE)

## Create a data frame with the top 20 bi-grams
dtm_content_wf_bigram <- data.frame(terms = names(freqbi[head(ordbi, 20)]), occurrences = freqbi[head(ordbi, 20)])

pb <- ggplot(dtm_content_wf_bigram, aes(terms, occurrences))
pb <- pb + geom_bar(stat = "identity")
pb <- pb + theme(axis.text.x = element_text(angle = 45, hjust = 1))
pb <- pb + labs(title = "Most Frequent Bi-gram Words")
pb

Next Step

The next step is to build basic bi-gram and tri-gram models that predict the next 1 - 2 words from the words the user has typed. We will also develop a back-off model to handle unseen words that don’t appear in the corpora.
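The back-off idea can be sketched as follows. This is a deliberately simplified version under stated assumptions: the counts are invented toy values, predict_next is a hypothetical helper, and the fallback simply returns the most frequent single words, without the discounting that a full back-off scheme such as Katz back-off applies.

```r
# Toy counts, invented for illustration only (not taken from the corpus).
bigram_counts  <- c("from the" = 8, "from a" = 3, "of the" = 9)
unigram_counts <- c("to" = 25, "the" = 20, "of" = 10, "a" = 7)

# Hypothetical helper: suggest the next word after `prev`, backing off to the
# most frequent single words when no bi-gram starts with `prev`.
predict_next <- function(prev, n = 1) {
  hits <- bigram_counts[startsWith(names(bigram_counts), paste0(prev, " "))]
  if (length(hits) > 0) {
    best <- names(sort(hits, decreasing = TRUE))
    return(head(sub("^[^ ]+ ", "", best), n))   # keep only the second word
  }
  head(names(sort(unigram_counts, decreasing = TRUE)), n)   # back-off step
}

predict_next("from")    # "the": found in the bi-gram table
predict_next("zebra")   # "to": unseen word, falls back to unigram counts
```

The same pattern extends one level up: try the tri-gram table first, then the bi-gram table, then single words.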

Then, we will apply these models in the Shiny app. The app should include an area where the user types in a word. The app will then suggest a few of the most frequently used next words from the bi-gram dictionary. When the user types a second word, the app will suggest the next word from the bi-gram and tri-gram dictionaries.

An example

If a user types the first word, from, in the Shiny app, it will suggest the words that most frequently follow it in the bi-gram dictionary. The six most frequent bi-grams are listed below.

## Creating a dataframe with full bi-gram words
dtm_content_wff_bigram <- data.frame(terms = names(freqbi), occurrences = freqbi)

## Extract the bi-grams whose first word is "from" (the trailing space in the
## pattern keeps out longer words that merely start with "from")
x <- subset(dtm_content_wff_bigram, grepl("^from ", dtm_content_wff_bigram$terms))

## Order the results by occurrences, in decreasing order
ordx <- order(x$occurrences, decreasing = TRUE)

head(x[ordx,])

##                 terms occurrences
## from the     from the         806
## from a         from a         220
## from my       from my          69
## from his     from his          61
## from to       from to          54
## from their from their          45
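The lookup above could be wrapped in a small function for the app to call. The sketch below is one possible shape, not a final design: the helper suggest and its arguments are assumptions, and the data frame just re-enters the counts printed above so the example runs on its own.

```r
# Bi-gram counts copied from the example output above.
bigrams <- data.frame(
  terms = c("from the", "from a", "from my", "from his", "from to", "from their"),
  occurrences = c(806, 220, 69, 61, 54, 45),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the top n words that follow `word`.
suggest <- function(word, counts, n = 5) {
  prefix <- paste0(word, " ")
  hits <- counts[startsWith(counts$terms, prefix), ]
  hits <- hits[order(hits$occurrences, decreasing = TRUE), ]
  head(substring(hits$terms, nchar(prefix) + 1), n)   # drop the first word
}

suggest("from", bigrams)
# [1] "the" "a"   "my"  "his" "to"
```

Returning only the second word of each bi-gram gives the app a list it can display directly as suggestions.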

Appendix

These are the packages loaded during the report creation.

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ngram_3.0.4   ggplot2_2.2.1 magrittr_1.5  tm_0.7-1      NLP_0.1-11   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13     knitr_1.17       munsell_0.4.3    colorspace_1.3-2
##  [5] rlang_0.1.2      stringr_1.2.0    plyr_1.8.4       tools_3.4.2     
##  [9] parallel_3.4.2   grid_3.4.2       gtable_0.2.0     htmltools_0.3.6 
## [13] yaml_2.1.14      lazyeval_0.2.0   rprojroot_1.2    digest_0.6.12   
## [17] tibble_1.3.4     evaluate_0.10.1  slam_0.1-40      rmarkdown_1.6   
## [21] labeling_0.3     stringi_1.1.5    compiler_3.4.2   scales_0.5.0    
## [25] backports_1.1.1