The objective of this project is to create a predictive text model that reduces the number of required keystrokes and effectively predicts the next word typed based on word frequency and context. Natural language processing techniques will be used to perform the analysis and build the predictive model.
This milestone report describes the major features of the training data through our exploratory data analysis and summarizes our plans for creating the predictive model, addressing the milestone benchmarks for the capstone project.
The data for this project come from a corpus called HC Corpora. See the corpus readme file for details.
The corpus provides three types of sources: blogs, news and Twitter. For the purposes of this project, all sources are treated as being of equal quality, though there are some notable differences. For example, the Twitter data contain more grammatical errors and misspellings. On the other hand, Twitter's focus on short, topical phrases may make it ideal for predicting phrases of 2-4 words, which is the focus of this project.
All text data are provided in four languages: 1) German, 2) English - United States, 3) Finnish and 4) Russian. In this project we focus only on the English - United States data sets.
For this report we load the quanteda package, which works together with several other R libraries.
# clear any prior values in environment
rm(list = ls())
# Load or install packages used
library(ggplot2) # enhanced graphics
library(ggthemes) # advanced themes
library(quanteda) # corpus tokenizer and more
Since the Twitter data contain emojis and symbols, it is important to remove non-ASCII characters and clean the data. Fortunately, the quanteda package provides the functionality needed to compute word frequencies without extensive manual regex construction.
It allows us to remove URLs, special characters, punctuation, numbers, excess whitespace and stopwords, to stem words, and to convert the text to lower case.
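To make the cleaning steps concrete, here is a rough base-R sketch (illustrative only; the toy string and patterns are made up, and the actual analysis relies on quanteda's built-in options rather than this code):
# Illustrative sketch only: rough base-R equivalents of some of the cleaning steps
toy <- "Check out http://example.com!! Visit us at 10am :) #fun"
toy <- tolower(toy)                                 # lower case
toy <- gsub("http[[:alnum:][:punct:]]*", "", toy)   # drop URLs
toy <- gsub("[[:punct:]]", "", toy)                 # drop punctuation and symbols
toy <- gsub("[[:digit:]]", "", toy)                 # drop numbers
toy <- gsub("\\s+", " ", trimws(toy))               # squeeze excess whitespace
toy
# "check out visit us at am fun"
# (stopword removal and stemming are handled later via dfm() options)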
# Download and unzip the data to local disk
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
# Define a function to read a file and strip out lines containing non-ASCII characters
remove_nonasc <- function(file){
print(file)
# Read the data and force UTF-8 encoding
text <- readLines(file, encoding = "UTF-8", skipNul = TRUE)
# print the length (in characters) of the longest line
print(max(nchar(text)))
# find indices of lines with non-ASCII characters: iconv marks them with "text.tmp"
nonascIndex <- grep("text.tmp", iconv(text, "latin1", "ASCII", sub = "text.tmp"), fixed = TRUE)
# drop those lines (only if any were found) and return the cleaned vector
if (length(nonascIndex) > 0) text <- text[-nonascIndex]
text
}
# load local files
blogsData <- remove_nonasc("./Coursera-SwiftKey/final//en_US//en_US.blogs.txt")
newsData <- remove_nonasc("./Coursera-SwiftKey/final//en_US//en_US.news.txt")
twitterData <- remove_nonasc("./Coursera-SwiftKey/final//en_US//en_US.twitter.txt")
# get file size info (in MB) for the three source files
dir <- "./Coursera-SwiftKey/final//en_US//"
filelist <- list.files(dir)
round(file.info(file.path(dir, filelist))$size / 1024^2, 1)
To speed up processing, we extract a small percentage of the records as a subset for exploratory purposes.
# Sample the data: take a 3% random subset of each source
set.seed(416)
if (!file.exists("data.sample.Rdata")) {
# replace = FALSE, so each line can be selected at most once
data.sample <- c(sample(blogsData, round(length(blogsData) * 0.03), replace = FALSE),
sample(newsData, round(length(newsData) * 0.03), replace = FALSE),
sample(twitterData, round(length(twitterData) * 0.03), replace = FALSE))
save("data.sample", file = "data.sample.Rdata")
# Free the memory for processing needed later
rm(blogsData)
rm(newsData)
rm(twitterData)
} else {
load("data.sample.Rdata")
}
To continue the analysis, we create three term-document matrices for a) unigrams, b) bigrams and c) trigrams. These are commonly referred to as n-grams: contiguous sequences of n items from a given sequence of text or speech. These matrices will serve as the basis for word prediction in the algorithm to be built in the next phase of our capstone project.
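As a quick illustration of what these n-grams look like (a toy example, not taken from the corpus):
# Toy example of unigrams, bigrams and trigrams for a four-word phrase
toy_tokens <- c("thanks", "for", "the", "follow")
toy_tokens                                          # unigrams: the tokens themselves
paste(head(toy_tokens, -1), tail(toy_tokens, -1))   # bigrams: "thanks for" "for the" "the follow"
paste(toy_tokens[1:2], toy_tokens[2:3], toy_tokens[3:4])  # trigrams: "thanks for the" "for the follow"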
For the purposes of this report, we filter out high-frequency stopwords such as "the". We also use stemming to combine words with common root meanings, and we remove the Twitter hashtag symbol so that hashtagged topics are counted together with their plain-word equivalents, which increases the Twitter influence on the word counts.
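For example, stemming collapses related word forms to a common root; the small sketch below calls the SnowballC stemmer directly (quanteda performs this step for us when stem = TRUE):
# Illustrative only: stemming reduces related word forms to one root
library(SnowballC)
wordStem(c("connect", "connected", "connecting", "connection"), language = "english")
# all four reduce to "connect"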
Here are the top twenty words that appear most frequently in our sample:
## dfm() S3 method for class 'character' creates a sparse document-feature matrix of unigrams
mydf1 <- dfm(data.sample, verbose = TRUE, toLower = TRUE,
removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
removeTwitter = TRUE, stem = TRUE, ignoredFeatures = c("will", stopwords("english")),
keptFeatures = NULL, language = "english", thesaurus = NULL,
dictionary = NULL, valuetype = c("glob", "regex", "fixed"))
# use quanteda to get a quick frequency count
top20unigrams <- topfeatures(mydf1, 20) # 20 top words
uni20_df <- data.frame(word=names(top20unigrams), freq=top20unigrams, row.names=NULL)
rm(mydf1)
# Define frequency plot function
makePlot <- function(data, label) {
ggplot(data[1:20,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme_economist() + theme(axis.text.x = element_text(angle = 60, size = 11, hjust = 1)) +
coord_flip() +
geom_bar(stat = "identity", fill = I("grey50"))
}
# show plot
makePlot(uni20_df, "20 Most Common Unigrams")
The two-word combinations called for more adjustments. In addition to the switch to the faster "fastestword" tokenizer, which trades some accuracy for speed, the concatenator argument ensures that the character between the words of a multi-word feature is a blank rather than an underscore, which would otherwise change the results returned.
So, here is the histogram of the 20 most common bigrams in the data sample, with these adjustments:
# S3 method for class 'character' creates a sparse document-feature matrix of bigrams
mydf2 <- dfm(data.sample, ngrams=2, concatenator = " ",
what = "fastestword",
verbose = FALSE, toLower = TRUE,
removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
removeTwitter = FALSE,
stem = FALSE, ignoredFeatures = c("will", stopwords("english")),
keptFeatures = NULL, language = "english", thesaurus = NULL,
dictionary = NULL, valuetype = "fixed")
# use quanteda to get a quick frequency count
top20bigrams <- topfeatures(mydf2, 20)
bi20_df <- data.frame(word=names(top20bigrams), freq=top20bigrams, row.names=NULL)
rm(mydf2)
# show plot
makePlot(bi20_df, "20 Most Common Bigrams")
Using a similar configuration, here is a histogram of the 20 most common trigrams in the data sample:
# S3 method for class 'character' creates a sparse document-feature matrix of trigrams
mydf3 <- dfm(data.sample, ngrams=3, concatenator = " ",
what = "fastestword",
verbose = FALSE, toLower = TRUE,
removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
removeTwitter = FALSE, stem = FALSE, ignoredFeatures = c("will", stopwords("english")),
keptFeatures = NULL, language = "english", thesaurus = NULL,
dictionary = NULL, valuetype = "fixed")
# use quanteda to get a quick frequency count
top20trigrams <- topfeatures(mydf3, 20)
tri20_df <- data.frame(word=names(top20trigrams), freq=top20trigrams, row.names=NULL)
# Optional word cloud of the most frequent terms; if a minimum frequency is not set it will plot all terms
# plot(mydf1, max.words = 20,
# random.order = FALSE,
# rot.per = .25,
# colors = RColorBrewer::brewer.pal(8,"Dark2"))
rm(mydf3)
makePlot(tri20_df, "20 Most Common Trigrams")
Having a data management strategy is key to being able to build the model.
Initially we attempted to use the more traditional tm package. However, after running into memory issues on this data set, we switched to quanteda's clearly faster dfm() functionality.
The tokenization in quanteda is very conservative: by default it only removes separator characters, without additional definitions. So there are still strings and word combinations that are candidates for more regex scrubbing.
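For instance, a later cleaning pass might add patterns like the ones below (a hypothetical sketch; the scrub() helper and its patterns are illustrative and not part of the current pipeline):
# Hypothetical sketch of additional regex scrubbing (not run in this report)
scrub <- function(x) {
  x <- gsub("(http|https|ftp)[^ ]*", " ", x)               # residual URLs
  x <- gsub("(.)\\1{2,}", "\\1\\1", x, perl = TRUE)        # collapse long runs of repeated letters
  gsub("\\s+", " ", trimws(x))                             # tidy whitespace
}
scrub("this is sooooo funnnny http://t.co/abc")
# "this is soo funny"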
On a positive note, for fast content analysis the quanteda package also lets us look at similarities in the data and other features, such as building dictionaries of terms and meta-tagging content to create a richer search experience.
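As a small illustration of the dictionary idea (the categories and terms below are invented for the example), quanteda lets us define named groups of terms that can then be supplied through the dictionary argument of dfm() to count matches per category rather than per word:
# Illustrative sketch: a toy quanteda dictionary grouping terms into categories
myDict <- dictionary(list(greeting = c("thanks", "thank you", "happy birthday"),
                          time = c("right now", "last night", "this morning")))
# dfm(data.sample, dictionary = myDict) would tally category totals instead of raw words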
Model development will include:
Creating the prediction algorithm
Increasing the sample size
Optimizing the final corpus to achieve appropriate coverage and improve prediction accuracy (a quick sketch of this coverage idea follows below)
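To make the coverage point concrete, here is a minimal sketch (the coverage_count() helper is hypothetical, and freqs is assumed to be a named vector of word counts, such as the output of topfeatures() on a full dfm):
# Sketch: how many top-ranked words cover a given share of all word instances?
coverage_count <- function(freqs, target = 0.9) {
  freqs <- sort(freqs, decreasing = TRUE)
  cum_share <- cumsum(freqs) / sum(freqs)
  which(cum_share >= target)[1]   # rank of the first word reaching the target share
}
coverage_count(c(just = 50, like = 30, one = 15, time = 5), target = 0.9)
# 3  (the top three words cover at least 90% of this toy sample)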
The Shiny app server will then receive the typed or pasted text, clean it in the same way as the training data, and return the most likely next word based on the n-gram frequencies.
The idea here is to keep this simple. The Shiny application is not geared toward long sentences or paragraphs, which would require a different modeling approach.
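To make that plan concrete, here is a minimal, hypothetical sketch of the kind of frequency backoff lookup the server could perform (predict_next() and the *_df tables are illustrative names, assuming data frames of space-separated n-grams with their frequencies; the real algorithm will be developed in the next phase):
# Hypothetical sketch of a simple frequency backoff lookup (illustrative names)
predict_next <- function(input, tri_df, bi_df, uni_df) {
  tokens <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  n <- length(tokens)
  # 1. try trigrams whose first two words match the last two typed words
  if (n >= 2) {
    prefix2 <- paste0(tokens[n - 1], " ", tokens[n], " ")
    hits <- tri_df[startsWith(as.character(tri_df$word), prefix2), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[which.max(hits$freq)])))
  }
  # 2. back off to bigrams starting with the last typed word
  prefix1 <- paste0(tokens[n], " ")
  hits <- bi_df[startsWith(as.character(bi_df$word), prefix1), ]
  if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[which.max(hits$freq)])))
  # 3. final fallback: the single most frequent unigram
  as.character(uni_df$word[which.max(uni_df$freq)])
}
# e.g. predict_next("thanks for the", tri_df, bi_df, uni_df)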