The objective of the overall project is to create a predictive text model that reduces the number of required keystrokes and effectively predicts the next word typed, based on word frequency and context. This milestone report describes the major features of the training data through exploratory data analysis and summarizes our plans for creating the prediction model. In this report, we endeavor to:
- Properly download and clean the text data.
- Create a basic report of summary statistics about the data sets.
- Perform exploratory data analysis.
- Report on some interesting findings.
- Give some feedback on a plan to create a prediction algorithm and Shiny app.
Data for this project is from a corpus called HC Corpora.
The corpus provides three types of text data: blogs, news and twitter. For the purposes of this project, all sources will be assumed to be of equal quality, though there are some notable differences. For example, the twitter text data may contain more grammar errors and misspellings. On the other hand, its focus on short, topical phrases may make twitter text ideal for predicting phrases of 2-4 words, the focus of this project.
All text data are provided in four different languages: German, English (United States), Finnish and Russian. In this project, we focus only on the English (United States) data sets.
For this report we load the quanteda package, which works alongside several other R packages used below.
#### Load or install packages used
library(ggplot2) # enhanced graphics
library(ggthemes) # advanced themes
library(quanteda) # corpus tokenizer and more
## quanteda version 0.99.22
## Using 3 of 4 threads for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(stringi)
# load local files
blogs_size <- file.info("en_US.blogs.txt")$size
news_size <- file.info("en_US.news.txt")$size
twitter_size <- file.info("en_US.twitter.txt")$size
# File size on disk (in bytes)
blogs_size
## [1] 210160014
news_size
## [1] 205811889
twitter_size
## [1] 167105338
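Since file.info() reports sizes in bytes, it can help to convert them to megabytes for readability, for example:
# Convert the byte counts to megabytes (MB) for readability
round(c(blogs = blogs_size, news = news_size, twitter = twitter_size) / 1024^2, 1)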
#### Words in lines
# Note: stri_count_words() is applied here to the numeric file sizes (a single
# value each), not to the text itself, which is why the maximum is 1. Per-line
# word counts on the actual text are computed after the data are loaded below.
blogs_words <- stri_count_words(blogs_size)
news_words <- stri_count_words(news_size)
twitter_words <- stri_count_words(twitter_size)
max(blogs_words)
## [1] 1
max(news_words)
## [1] 1
max(twitter_words)
## [1] 1
Since the twitter data contains emojis and symbols, it is important to remove non-ASCII characters and clean the data. Fortunately, the quanteda package provides the functionality needed to compute word frequencies without extensive manual regex construction.
Natural language processing techniques will be used to perform the analysis and build the predictive model. This includes removing URLs, special characters, punctuation, numbers, excess whitespace and stop words, as well as stemming words and converting the text to lower case.
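As an aside, a cleaning pipeline along these lines could be sketched with quanteda's tokens functions roughly as follows (a minimal sketch; clean_tokens is our own illustrative name, and argument names may vary slightly across quanteda releases):
# Minimal cleaning-pipeline sketch (illustrative; quanteda tokens API assumed)
clean_tokens <- function(txt) {
  toks <- tokens(txt,
                 remove_punct = TRUE,       # drop punctuation
                 remove_numbers = TRUE,     # drop numbers
                 remove_url = TRUE,         # drop URLs
                 remove_symbols = TRUE)     # drop symbols / emoji remnants
  toks <- tokens_tolower(toks)                        # convert to lower case
  toks <- tokens_remove(toks, stopwords("english"))   # filter English stop words
  tokens_wordstem(toks, language = "english")         # stem words to root forms
}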
# Download and unzip the data to local disk
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
# Define function to read a file and strip out lines containing non-ASCII characters
remove_nonasc <- function(file){
  print(file)
  # Read the data and force UTF-8 encoding
  text <- readLines(file, encoding = "UTF-8", skipNul = TRUE)
  # print the maximum number of characters on a line
  print(max(nchar(text)))
  # find indices of lines with non-ASCII characters
  nonascIndex <- grep("text.tmp", iconv(text, "latin1", "ASCII", sub = "text.tmp"))
  # subset the original vector to exclude lines with non-ASCII characters
  # (guard against the case where no such lines are found)
  if (length(nonascIndex) > 0) text <- text[-nonascIndex]
  text
}
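If dropping whole lines turns out to be too aggressive, an alternative would be to strip only the offending characters instead. A sketch (not what was run for this report):
# Alternative sketch: remove only the non-ASCII characters, keeping the lines
strip_nonasc <- function(text) {
  iconv(text, from = "UTF-8", to = "ASCII", sub = "")  # unmappable characters are dropped
}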
## Summary Stats
# load local files, deleting lines with non-ASCII characters
blogsData <- remove_nonasc("en_US.blogs.txt")
## [1] "en_US.blogs.txt"
## [1] 40833
newsData <- remove_nonasc("en_US.news.txt")
## [1] "en_US.news.txt"
## [1] 11384
twitterData <- remove_nonasc("en_US.twitter.txt")
## [1] "en_US.twitter.txt"
## [1] 140
# In-memory object sizes (bytes)
object.size(blogsData)
## 154428320 bytes
object.size(newsData)
## 220678896 bytes
object.size(twitterData)
## 304364720 bytes
## Max words per line
max(stri_count_words(blogsData))
## [1] 1657
max(stri_count_words(newsData))
## [1] 1796
max(stri_count_words(twitterData))
## [1] 47
# Length of text files (number of lines)
length(blogsData)
## [1] 636261
length(newsData)
## [1] 874278
length(twitterData)
## [1] 2282717
# Sample the data: take 3% of each source to keep memory use manageable
set.seed(416)
# sampling is without replacement, so no line can be selected twice
data.sample <- c(sample(blogsData, length(blogsData) * 0.03, replace = FALSE),
                 sample(newsData, length(newsData) * 0.03, replace = FALSE),
                 sample(twitterData, length(twitterData) * 0.03, replace = FALSE))
# cache the sample so later runs reuse the same lines
if (!file.exists("data.sample.Rdata")) {
  save("data.sample", file = "data.sample.Rdata")
} else {
  load("data.sample.Rdata")
}
To continue the analysis, we create three document-feature matrices for a) unigrams, b) bigrams and c) trigrams. These are commonly referred to as n-grams: contiguous sequences of n items from a given sequence of text or speech. The matrices will serve as the basis for word prediction in the algorithm to be built in the next phase of our capstone project.
For the purposes of this report, we filter out high-frequency stop words such as "the". We also use stemming to combine words with common root meanings, and we remove the Twitter hashtag character so that hashtagged topics are merged with their plain-word counterparts. This has the effect of increasing the twitter influence on the word counts.
Here are the top fifteen words that appear most frequently in our data sample, with these adjustments:
## Creates sparse data frame of unigrams
mydf1 <- dfm(data.sample, ngrams=1, verbose = TRUE, toLower = TRUE,
removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
removeTwitter = TRUE, stem = TRUE, ignoredFeatures = stopwords("english"),
keptFeatures = NULL, language = "english", thesaurus = NULL,
dictionary = NULL, valuetype = c("glob", "regex", "fixed"))
## Creating a dfm from a character input...
## Warning: Arguments toLower, removeNumbers, removePunct, removeSeparators,
## removeTwitter, ignoredFeatures, keptFeatures, language not used.
## ... lowercasing
## ... found 113,796 documents, 98,858 features
## ... stemming features (English)
## , trimmed 23730 feature variants
## ... created a 113,796 x 75,128 sparse dfm
## ... complete.
## Elapsed time: 15.8 seconds.
# use quanteda to get a quick frequency count
top15unigrams <- topfeatures(mydf1, 15) # 15 top words
uni15_df <- data.frame(word=names(top15unigrams), freq=top15unigrams, row.names=NULL)
rm(mydf1)
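The warning above indicates that, in this quanteda version, the cleaning options are not honoured when passed directly to dfm(); a tokens-based pipeline like the clean_tokens() sketch shown earlier makes them explicit. A sketch of how that could feed the n-gram matrices below (mydf1_alt is our own illustrative name):
# Sketch: explicit tokens pipeline feeding the document-feature matrices
toks <- clean_tokens(data.sample)   # from the earlier sketch
mydf1_alt <- dfm(toks)              # unigram dfm with the cleaning actually applied
# for bigrams/trigrams: dfm(tokens_ngrams(toks, n = 2, concatenator = " ")), etc.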
# Define frequency plot function
makePlot <- function(data, label) {
ggplot(data[1:15,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 11, hjust = 1)) + theme_economist() +
geom_bar(stat = "identity", fill = ("blue")) + coord_flip()
}
# show plot
makePlot(uni15_df, "15 Most Common Unigrams")
The two-word combinations called for more adjustments. In addition to switching to the "fastestword" tokenizer, which trades some accuracy for speed, the concatenator setting ensures that the separator between the words of a multi-word feature is a blank rather than an underscore, which would substantially change the results returned.
Here is a histogram showing the 15 most common bigrams in the data sample, with adjustments:
# create sparse data frame of bigrams
mydf2 <- dfm(data.sample, ngrams=2, concatenator = " ",
what = "fastestword",
verbose = FALSE, toLower = TRUE,
removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
removeTwitter = FALSE,
stem = FALSE, ignoredFeatures = stopwords("english"),
keptFeatures = NULL, language = "english", thesaurus = NULL,
dictionary = NULL, valuetype = "fixed")
## Warning: Arguments toLower, removeNumbers, removePunct, removeSeparators,
## removeTwitter, ignoredFeatures, keptFeatures, language not used.
# use quanteda to get freq count
top15bigrams <- topfeatures(mydf2, 15)
bi15_df <- data.frame(word=names(top15bigrams), freq=top15bigrams, row.names=NULL)
rm(mydf2)
# show plot
makePlot(bi15_df, "15 Most Common Bigrams")
Using a similar configuration, here is a histogram of the 15 most common trigrams in the data sample:
#### Creating sparse data frame of trigrams
mydf3 <- dfm(data.sample, ngrams=3, concatenator = " ",
what = "fastestword",
verbose = FALSE, toLower = TRUE,
removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
removeTwitter = FALSE,
stem = FALSE, ignoredFeatures = stopwords("english"),
keptFeatures = NULL, language = "english", thesaurus = NULL,
dictionary = NULL, valuetype = "fixed")
## Warning: Arguments toLower, removeNumbers, removePunct, removeSeparators,
## removeTwitter, ignoredFeatures, keptFeatures, language not used.
# use quanteda to get freq count
top15trigrams <- topfeatures(mydf3, 15)
tri15_df <- data.frame(word=names(top15trigrams), freq=top15trigrams, row.names=NULL)
rm(mydf3)
# show plot
makePlot(tri15_df, "15 Most Common Trigrams")
Having a data management strategy is key to being able to build a model.
Initially we attempted to use the more traditional tm package. However, after running into memory issues on this data set, we switched to quanteda's clearly faster dfm() functionality.
The tokenization in quanteda is quite conservative: by default it only removes separator characters, without additional definitions. So there are still strings and word combinations that are candidates for further regex scrubbing.
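For example, the kind of additional scrubbing we have in mind could look roughly like this in base R (a sketch; the patterns are illustrative and would need tuning against the real data):
# Sketch: extra regex scrubbing with base R (illustrative patterns only)
scrub <- function(text) {
  text <- gsub("http[s]?://\\S+", " ", text)   # remove URLs
  text <- gsub("\\S+@\\S+", " ", text)         # remove email addresses
  text <- gsub("(.)\\1{2,}", "\\1\\1", text)   # collapse long character runs, e.g. "soooo" -> "soo"
  text <- gsub("\\s+", " ", text)              # squeeze repeated whitespace
  trimws(text)
}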
On a positive note, for fast content analysis the quanteda package also allows us to look at similarities in the data and other features, such as building dictionaries of terms and meta-tagging content to create a richer search experience.
Our next steps are:
Creating the prediction algorithm
Increasing the sample size
Optimizing the final corpus to achieve appropriate coverage and improve prediction accuracy (a simple coverage check is sketched below)
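A simple way to check coverage (a sketch only; it assumes the unigram dfm mydf1 is still in memory, whereas we removed it above, so it would need to be rebuilt first) is to ask how many of the top-ranked words are needed to cover, say, 90% of all word instances in the sample:
# Sketch: how many top-ranked words cover 90% of all word instances?
word_freq <- colSums(mydf1)                                 # total count of each feature
coverage <- cumsum(sort(word_freq, decreasing = TRUE)) / sum(word_freq)
words_for_90 <- which(coverage >= 0.90)[1]                  # rank at which 90% coverage is reached
words_for_90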
Then the Shiny app server algorithm will receive the typed or pasted text and perform the following actions:
Converting to lowercase
Removing non-ASCII characters, numbers, punctuation and white spaces
Filtering English stop words
Searching the last 2 words in the trigrams and retrieve matching patterns to send back to the UI
If not found, search the last word in the bigrams and retrieve matching word patterns to send back to the UI
If not found, use smoothed probabilities to estimate the most likely words to follow
If not found, use back-off models to estimate the probability of unobserved n-grams (a simple back-off lookup is sketched after this list)
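As a rough illustration of this lookup-and-back-off logic, a minimal sketch follows. The function name predict_next and the use of the small top-15 tables are our own illustrative choices; the final app would use full bigram and trigram frequency tables (with a space as the concatenator, as above).
# Sketch: simple frequency-based back-off lookup (illustrative names and tables)
predict_next <- function(phrase, trigram_freq, bigram_freq, n = 3) {
  words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  k <- length(words)
  # 1. try trigrams whose first two words match the last two words typed
  if (k >= 2) {
    prefix <- paste(words[k - 1], words[k])
    hits <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
    if (length(hits) > 0) return(head(sort(hits, decreasing = TRUE), n))
  }
  # 2. back off to bigrams whose first word matches the last word typed
  if (k >= 1) {
    hits <- bigram_freq[startsWith(names(bigram_freq), paste0(words[k], " "))]
    if (length(hits) > 0) return(head(sort(hits, decreasing = TRUE), n))
  }
  # 3. otherwise fall back to the most frequent unigrams (not shown here)
  NULL
}
# Example usage with the small frequency tables computed above:
# predict_next("thanks for the", top15trigrams, top15bigrams)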
The idea here is to keep this simple. The Shiny application is not geared toward long sentences or paragraphs, which would require another modeling approach.