2. Main flow steps involved in the Capstone Project
2.1 Downloading data
The training data for this SwiftKey Capstone project is downloaded from the Coursera site. Only the en_US* files are unzipped into the ‘data’ directory for NLP analysis.
## Load libraries needed
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(dplyr))
## Code to download
source("scripts/extFiles.R")
source("scripts/processLines.R")
source("scripts/getFileInfo.R")
# download_data("en_US", "original")
<- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile extFiles(zpath=zipFile, datadir="data", lname="en_US")
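For reference, here is a minimal sketch of what the sourced helper is assumed to do (download the archive if it is not already present and extract only the en_US files); the actual logic lives in scripts/extFiles.R:
## Sketch only - the real implementation is in scripts/extFiles.R
if (!file.exists("data/Coursera-SwiftKey.zip")) {
  dir.create("data", showWarnings = FALSE)
  download.file(zipFile, destfile = "data/Coursera-SwiftKey.zip", mode = "wb")
}
zipContents <- unzip("data/Coursera-SwiftKey.zip", list = TRUE)$Name   ## list archive contents
enFiles <- grep("en_US", zipContents, value = TRUE)                    ## keep only the en_US.* files
unzip("data/Coursera-SwiftKey.zip", files = enFiles, exdir = "data", junkpaths = TRUE)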
## Preliminary Data Analysis and Setup for Processing
## Create tibbles and gather files information
news <- processLines(file = "data/news.txt", num = Inf)
news_df <- tibble(line=1:length(news), news)
names(news_df) <- c("line","word")
newsInfo <- getFileInfo(file = "data/news.txt", txtFile = news)
## Create blogs tibble
blogs <- processLines(file = "data/blogs.txt", num = Inf)
blogs_df <- tibble(line=1:length(blogs), blogs)
names(blogs_df) <- c("line","word")
blogsInfo <- getFileInfo(file = "data/blogs.txt", txtFile = blogs)
## Create twitter tibble
twitter <- processLines(file = "data/twitter.txt", num = Inf)
twitter_df <- tibble(line=1:length(twitter), twitter)
names(twitter_df) <- c("line","word")
twitterInfo <- getFileInfo(file = "data/twitter.txt", txtFile = twitter)
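The two helpers used above are sourced from the scripts/ directory and are not shown in this report; a minimal sketch of what they are assumed to do (read a file's lines, then summarise its size, line count, and longest line) is given below. The function bodies and the MB unit for file size are assumptions:
## Sketch only - the real versions live in scripts/processLines.R and scripts/getFileInfo.R
processLines <- function(file, num = Inf) {
  n <- if (is.finite(num)) num else -1L              ## -1 reads the whole file
  con <- file(file, open = "rb")
  on.exit(close(con))
  readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE)
}
getFileInfo <- function(file, txtFile) {
  list(FileSize          = round(file.size(file) / 1024^2, 1),  ## size in MB (assumed unit)
       FileLength        = length(txtFile),                     ## number of lines
       LongestLineLength = max(nchar(txtFile)))                 ## characters in the longest line
}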
2.2 Gathering dataset statistics
### Information about the text files
## Create data.frame with file size, length, longest line info
`File Size` <- c(newsInfo$FileSize,blogsInfo$FileSize,twitterInfo$FileSize)
`File Length` <- c(newsInfo$FileLength,blogsInfo$FileLength,twitterInfo$FileLength)
`Longest Line` <- c(newsInfo$LongestLineLength,blogsInfo$LongestLineLength,twitterInfo$LongestLineLength)
## Combine files into a dataframe
all_files_info <- data.frame(`File Size`, `File Length`, `Longest Line`)
row.names(all_files_info) <- c("News", "Blogs", "Twitter")
print(all_files_info)
## Remove unneeded VARs from memory
rm(news, newsInfo,twitter,blogs,twitterInfo, blogsInfo)
2.3 Preprocessing data to create tibbles
This preprocessing step includes cleaning (removing profane words, numbers, punctuation, extra white space, etc.), tokenization to create n-grams, capturing word frequencies for prediction, and merging data frames (needed for the Shiny app).
The following steps are involved in preprocessing the text files (a cleaning sketch follows the list):
1. Convert capital letters to lower case
2. Remove numbers
3. Remove extra white space
4. Remove emojis
5. Remove alphanumeric tokens
6. Remove profane words
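A minimal cleaning sketch covering the steps above; the regular expressions and the profanity vector are illustrative assumptions, not the project's exact rules:
## Sketch only - regexes and the 'profanity' character vector are assumptions
cleanText <- function(x, profanity = character(0)) {
  x <- tolower(x)                                         ## 1. lower case
  x <- gsub("\\S*\\d\\S*", " ", x)                        ## 2 & 5. drop numbers and alphanumeric tokens
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")  ## 4. drop emojis and other non-ASCII
  if (length(profanity) > 0)                              ## 6. drop profane words
    x <- gsub(paste0("\\b(", paste(profanity, collapse = "|"), ")\\b"), " ", x)
  trimws(gsub("\\s+", " ", x))                            ## 3. squeeze white space
}
Applying a function like this to each character vector before tokenization keeps noise tokens out of the downstream n-gram tables.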
2.4 Creating N-Grams
For creating n-grams, we will be looking at two different approaches:
1. Using tidy
2. Using tm
Note: Creating trigrams and quadgrams is very memory intensive, especially for blogs.txt. If huge files really must be processed in full, it is best to split the data frame into manageable chunks, or to skip lines longer than a certain length.
Since the final step in the project is to create a Shiny app, only 33% of blogs.txt and twitter.txt will be used in this project to showcase the Shiny app.
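A sketch of that sampling step, assuming the raw character vectors are re-read with processLines(); the seed value and the exact 1/3 fraction are arbitrary choices:
## Keep roughly 33% of blogs and twitter lines to fit the Shiny app's memory budget
set.seed(1234)                                    ## arbitrary seed for reproducibility
sampleLines <- function(x, frac = 1/3) {
  x[sample.int(length(x), size = floor(length(x) * frac))]
}
blogs_sample   <- sampleLines(processLines(file = "data/blogs.txt",   num = Inf))
twitter_sample <- sampleLines(processLines(file = "data/twitter.txt", num = Inf))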
2.4.1 Using tidy unnest_tokens (‘tidytext’ package)
Citation: Text Mining with R, a book by Julia Silge & David Robinson, and “The Life-Changing Magic of Tidying Text” by Julia Silge.
Here is sample tidy code for creating unigrams:
library(tidytext)   ## provides unnest_tokens() and stop_words
data(stop_words)
df_sort <- df %>%
  unnest_tokens(word, word) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
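The same tidy pipeline extends to higher-order n-grams through the token = "ngrams" argument of unnest_tokens(); a bigram sketch, with df and column names following the unigram example above:
## Bigram counts with tidytext (stop words are usually kept when predicting the next word)
bigram_sort <- df %>%
  unnest_tokens(bigram, word, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)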
2.4.2 Using TermDocumentMatrix (‘tm’ package)
Citation: Jaehyeon’s blog, Part 1, Part 2, and Part 3
require(parallel); require(tm)
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores, type="PSOCK")              ## PSOCK clusters also work on Windows
invisible(clusterEvalQ(cl, library(tm)))
clusterExport(cl, "vdocs", envir = environment())     ## export the corpus variable to the cluster
vdocs <- parLapply(cl, 1, function(x) {
  vdocs <- tm_map(vdocs, content_transformer(tolower))  ## to lower case
  vdocs <- tm_map(vdocs, stripWhitespace)               ## strip white space
  ## You can add more tm_map steps here
  vdocs                                                 ## return the cleaned corpus
})[[1]]
stopCluster(cl)
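Once the corpus is cleaned, an n-gram term-document matrix can be built with a custom tokenizer; the sketch below uses the NLP-based bigram tokenizer commonly paired with tm (the object names here are assumptions):
## Sketch only - builds a bigram term-document matrix from the cleaned corpus
library(NLP)
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tdm_bigram <- TermDocumentMatrix(vdocs, control = list(tokenize = BigramTokenizer))
bigram_freq <- sort(slam::row_sums(tdm_bigram), decreasing = TRUE)  ## most frequent bigrams first
head(bigram_freq, 10)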
2.4.3 Merging each corpus’s n-grams into one file
After creating the n-gram (bigram, trigram, quadgram) document matrices for each corpus, the following steps will be performed (a sketch follows the list):
1. Merge the 3 ngram dataframes into one
2. Reduce them by ‘full_join’ based on ‘word’ column
3. Transform the merged dataframe and create ‘total’ column
4. Separate the words, and drop the counts for the individual n-grams
5. Merge the final dataframe for each corpus into one
6. Store the n-gram model efficiently (Markov chains)
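One possible reading of steps 1-4 for the bigram tables of the three corpora, sketched with dplyr/tidyr; the per-corpus table names and count columns are assumptions:
## Sketch only - news_bigrams, blogs_bigrams, twitter_bigrams each hold a 'word' column and a count
library(dplyr); library(tidyr)
bigram_list <- list(news_bigrams, blogs_bigrams, twitter_bigrams)
bigrams_all <- Reduce(function(a, b) full_join(a, b, by = "word"), bigram_list) %>%  ## steps 1 & 2
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
  mutate(total = rowSums(across(where(is.numeric)))) %>%      ## 3. combined frequency
  separate(word, into = c("word1", "word2"), sep = " ") %>%   ## 4. split the bigram into words
  select(word1, word2, total)                                 ## drop the per-corpus counts
saveRDS(bigrams_all, "data/bigrams_all.rds")                  ## 6. compact storage for the Shiny app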
2.4.4 Prediction
- To make the model more efficient, some rows may have to be dropped based on frequency, depending on the Shiny app’s memory limits on the server
- Use backoff models to estimate the probability of unobserved n-grams (see the sketch below)
- Create train and test datasets to build and evaluate the next-word prediction algorithm
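As one concrete option for the backoff bullet above, a minimal “stupid backoff”-style lookup over the merged n-gram tables might look like this; the table names (quadgrams, trigrams, bigrams, unigrams) and their word/total columns are illustrative assumptions:
## Sketch only - assumes quadgrams/trigrams/bigrams/unigrams tables with wordN columns and 'total'
library(dplyr)
predictNext <- function(w1, w2, w3) {
  hit <- quadgrams %>% filter(word1 == w1, word2 == w2, word3 == w3) %>%
    arrange(desc(total)) %>% slice_head(n = 1)
  if (nrow(hit) > 0) return(hit$word4)                 ## observed quadgram
  hit <- trigrams %>% filter(word1 == w2, word2 == w3) %>%
    arrange(desc(total)) %>% slice_head(n = 1)
  if (nrow(hit) > 0) return(hit$word3)                 ## back off to trigram
  hit <- bigrams %>% filter(word1 == w3) %>%
    arrange(desc(total)) %>% slice_head(n = 1)
  if (nrow(hit) > 0) return(hit$word2)                 ## back off to bigram
  unigrams$word[which.max(unigrams$total)]             ## fall back to the most frequent unigram
}
predictNext("thanks", "for", "the")                    ## returns the most likely next word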