The purpose of this project is to build predictive text models for a smart keyboard that make it easier for people to type on their mobile devices. In this report, we run an exploratory data analysis to better understand the dataset, and then outline a plan for developing the prediction algorithm and Shiny app.
The dataset comes from the HC Corpora corpus. The report includes all the steps to download, process and explore the data to ensure reproducibility. We use base R, tm and other packages, which are listed in the appendix.
First, the dataset is downloaded and unzipped into the folder data/swiftkey. Then, the script runs through each file in the final/en_US folder and prints a basic summary of it.
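# Load the packages used in this report (versions are listed in the appendix)
library(tm)      # corpus handling and document term matrices
library(ngram)   # wordcount()
library(ggplot2) # plots
# Create the folder if it doesn't exist
if(!dir.exists("data/swiftkey")) {
  dir.create("data/swiftkey", recursive = TRUE)
}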
# Download the archive if the data doesn't exist yet, extract it to the designated folder,
# and then delete the downloaded archive.
if(!file.exists("data/swiftkey/final/en_US/en_US.twitter.txt")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "data/swiftkey/Coursera-SwiftKey.zip", mode = "wb")
  unzip("data/swiftkey/Coursera-SwiftKey.zip", exdir = "data/swiftkey")
  unlink("data/swiftkey/Coursera-SwiftKey.zip")
}
# Get the names of all the files in the particular folder using dir function.
filenames <- dir("data/swiftkey/final/en_US", pattern = "\\.txt$")
# Create an empty list for storing all the file content.
data <- list()
# Loop through the files.
for (i in seq_along(filenames)) {
  path <- file.path("data/swiftkey/final/en_US", filenames[i])
  con <- file(path) # Create a connection to the file
  # Read the content once and save it to the list
  data[[filenames[i]]] <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con) # Close the connection once the file has been read
  txt <- data[[filenames[i]]]
  # Print a summary of the file information.
  print(paste0("File: ", filenames[i], ", ",
               "Size: ", sprintf("%.1f", file.size(path) / 1048576), " MB", ", ",
               "Number of lines: ", length(txt), ", ",
               "Max # of characters in 1 sentence: ", max(nchar(txt)), ", ",
               "Number of words: ", wordcount(txt)))
}
## [1] "File: en_US.blogs.txt, Size: 200.4 MB, Number of lines: 899288, Max # of characters in 1 sentence: 40833, Number of words: 37334131"
## [1] "File: en_US.news.txt, Size: 196.3 MB, Number of lines: 77259, Max # of characters in 1 sentence: 5760, Number of words: 2643969"
## [1] "File: en_US.twitter.txt, Size: 159.4 MB, Number of lines: 2360148, Max # of characters in 1 sentence: 140, Number of words: 30373583"
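The news file reports 77259 lines for 196.3 MB, far fewer lines per megabyte than the blogs file (899288 lines for 200.4 MB), which suggests the read may have stopped early. One possible cause on Windows, assumed here and not verified for this dataset, is an embedded control character that ends a text-mode read. The sketch below shows how a file could be read through a connection opened in binary mode ("rb") instead; the helper name read_all_lines and the temporary demo file are hypothetical.

```r
# Hypothetical helper: read every line of a file through a binary-mode connection.
read_all_lines <- function(path) {
  con <- file(path, open = "rb")   # binary mode: no text-mode handling by the OS
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a small temporary file that contains a SUB (0x1A) control character.
demo_path <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond \x1a line\nthird line\n"), demo_path)
demo_lines <- read_all_lines(demo_path)
length(demo_lines)
```

If the line count of the news file changes when it is read this way, the summary figures and the sample drawn from that file would need to be regenerated.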
As the data is huge, instead of using all of the content, we sample 10,000 lines from each file and create a corpus out of the sampled content.
Thereafter, we clean the data in the following manner: convert ., - and : into whitespace; replace non-printable (garbage) characters with a space; change the text to lower case; remove punctuation and numbers; and strip extra whitespace.
set.seed(12345)
data.sample <- list()
for(i in 1:length(data)) {
# Looping through the list, sample 10,000 lines of content from each file.
data.sample[[i]] <- data[[i]][sample(1:length(data[[i]]), 10000, replace = FALSE)]
}
# Create corpus and give metadata for traceability purposes
content <- VCorpus(VectorSource(data.sample))
meta(content, tag ="From") <- c("blogs", "news", "twitter")
# Creating a helper function to find something and replace it with a space.
toSpace <- content_transformer( function(x, pattern) {
return (gsub(pattern, " ", x))
})
# Replace punctuation that joins words with a space
content <- tm_map(content, toSpace, "\\.")
content <- tm_map(content, toSpace, "-")
content <- tm_map(content, toSpace, ":")
# Replace garbage (non-printable) characters with a space
content <- tm_map(content, content_transformer(gsub), pattern = "[^[:graph:]]", replacement = " ")
# Standard steps: change text to lower case, remove punctuation and numbers, and strip extra whitespace.
content <- tm_map(content, content_transformer(tolower))
content <- tm_map(content, removePunctuation)
content <- tm_map(content, removeNumbers)
content <- tm_map(content, stripWhitespace)
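To make the effect of the cleaning steps concrete, here is a minimal base R sketch that applies the same sequence to a single string. It only approximates the tm transformations with plain gsub calls; the function name clean_text, the sample sentence and the final trimws step are assumptions for illustration.

```r
# Hypothetical helper: apply the cleaning steps to a character vector (base R only).
clean_text <- function(x) {
  x <- gsub("\\.|-|:", " ", x)         # ., - and : become spaces
  x <- gsub("[^[:graph:]]", " ", x)    # non-printable characters become spaces
  x <- tolower(x)                      # lower case
  x <- gsub("[[:punct:]]+", "", x)     # drop remaining punctuation
  x <- gsub("[[:digit:]]+", "", x)     # drop numbers
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  trimws(x)                            # extra step: trim leading/trailing spaces
}

cleaned <- clean_text("Well-known fact: R 3.4.2 was released in 2017!")
cleaned
```

Replacing the hyphen with a space before punctuation is removed keeps "well" and "known" as separate words instead of fusing them into "wellknown".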
Next, we create a matrix that lists the occurrences of every word in the corpus, using a Document Term Matrix.
# Generating a document term matrix from the corpus
dtm_content <- DocumentTermMatrix(content)
dtm_content
## <<DocumentTermMatrix (documents: 3, terms: 46490)>>
## Non-/sparse entries: 71335/68135
## Sparsity : 49%
## Maximal term length: 98
## Weighting : term frequency (tf)
There are 46490 terms in 3 documents, and 49% of the entries in the matrix are zero (sparse). The longest term is 98 characters, which is likely a combination of 2 or more words.
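The sparsity figure can be reproduced from the printed summary of the matrix: it is the share of document/term cells that hold a zero count. A short worked check, with variable names chosen here for illustration:

```r
# Figures copied from the printed Document Term Matrix summary.
n_docs     <- 3
n_terms    <- 46490
non_sparse <- 71335
sparse     <- 68135

total_cells <- n_docs * n_terms       # every document/term combination
sparsity    <- sparse / total_cells   # share of cells that are zero
round(100 * sparsity)
```

The two entry counts add up to total_cells, so each cell is counted exactly once as either sparse or non-sparse.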
Using the matrix, we can find the words that appear most frequently in these 3 documents.
# Sum all the occurrences of terms across the documents
freq <- colSums(as.matrix(dtm_content))
# Then, sort the freq in descending order of the occurrences of terms.
ord <- order(freq, decreasing = TRUE)
# List the top 20 terms and create a bar chart
dtm_content_wf <- data.frame(terms = names(freq[head(ord, 20)]), occurrences = freq[head(ord, 20)])
p <- ggplot(dtm_content_wf, aes(terms, occurrences))
p <- p + geom_bar(stat = "identity")
p <- p + labs(title = "Most Frequent Words")
p
Following the same process, we can also explore the most frequently used bi-grams in these 3 documents. To generate a bi-gram matrix, we first need to construct a bi-gram tokenizer using the NLP package.
# Helper function to generate bi-gram matrix using the ngrams function from NLP package.
BigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x),2), paste, collapse=" "), use.names = FALSE)
}
# Then generate a Document Term Matrix
dtm_content_bigram <- DocumentTermMatrix(content, control = list(tokenize = BigramTokenizer))
dtm_content_bigram
## <<DocumentTermMatrix (documents: 3, terms: 414590)>>
## Non-/sparse entries: 480563/763207
## Sparsity : 61%
## Maximal term length: 104
## Weighting : term frequency (tf)
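To see what a bi-gram tokenizer produces, the sketch below builds the bi-grams of one short string in base R. It is a simplified stand-in for the NLP-based tokenizer, and the function name bigrams_of and the sample phrase are hypothetical.

```r
# Hypothetical base R equivalent of a bi-gram tokenizer for a single string.
bigrams_of <- function(x) {
  w <- strsplit(x, " +")[[1]]               # split into words on spaces
  w <- w[nzchar(w)]                         # drop empty tokens
  if (length(w) < 2) return(character(0))   # no pair can be formed
  paste(head(w, -1), tail(w, -1))           # pair each word with its successor
}

toy_bigrams <- bigrams_of("thanks for the follow")
toy_bigrams
```

A string of n words yields n - 1 bi-grams, which is why the bi-gram matrix has many more terms than the single-word matrix.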
# Sum all the occurrences of bi-gram terms across the documents
freqbi <- colSums(as.matrix(dtm_content_bigram))
# Then, sort them in the descending order.
ordbi <- order(freqbi, decreasing = TRUE)
# Create a data frame with the top 20 bi-grams
dtm_content_wf_bigram <- data.frame(terms = names(freqbi[head(ordbi, 20)]), occurrences = freqbi[head(ordbi, 20)])
pb <- ggplot(dtm_content_wf_bigram, aes(terms, occurrences))
pb <- pb + geom_bar(stat = "identity")
pb <- pb + theme(axis.text.x = element_text(angle = 45, hjust = 1))
pb <- pb + labs(title = "Most Frequent Bi-gram Words")
pb
The next step is to build basic bi-gram and tri-gram models that predict the next 1 - 2 words based on the words given by the user, and to develop a back-off model to handle unseen words that don’t appear in the corpora.
Then, we will apply these models in the Shiny app. The app should include an area where the user types in a word, after which the app suggests a few of the most frequently used following words from the bi-gram dictionary. When the user types in a second word, the app suggests the next word from the bi-gram and tri-gram dictionaries.
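One way the planned back-off could work is sketched below: try the tri-gram dictionary first, fall back to the bi-gram dictionary, and finally to the most frequent single words. This is a plain fall-through with no discounting, not Katz back-off, and it is not the final design. All counts and the function names continuations and predict_next are hypothetical.

```r
# Toy n-gram counts (hypothetical values, for illustration only).
unigrams <- c(the = 50, a = 30, to = 20)
bigrams  <- c("from the" = 8, "from a" = 3, "the end" = 5)
trigrams <- c("far from the" = 4, "away from home" = 2)

# Final words of the n-grams that start with `prefix`, most frequent first.
continuations <- function(counts, prefix, n = 3) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  hits <- sort(hits, decreasing = TRUE)
  head(sub(".* ", "", names(hits)), n)
}

# Try the longest context first, then back off to shorter ones.
predict_next <- function(input, n = 3) {
  w <- strsplit(tolower(trimws(input)), " +")[[1]]
  if (length(w) >= 2) {
    out <- continuations(trigrams, paste(tail(w, 2), collapse = " "), n)
    if (length(out) > 0) return(out)
  }
  if (length(w) >= 1) {
    out <- continuations(bigrams, tail(w, 1), n)
    if (length(out) > 0) return(out)
  }
  head(names(sort(unigrams, decreasing = TRUE)), n)   # unseen context
}

predict_next("far from")    # found in the tri-gram counts
predict_next("gift from")   # backs off to the bi-gram counts
predict_next("zebra")       # backs off to the most frequent single words
```

Returning the most frequent single words as a last resort means the app always has something to suggest, even for input that never appeared in the sample.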
For example, if a user types the first word, from, in the Shiny app, it will suggest the most frequently used words that follow it in the bi-gram dictionary. The top six matches are listed here.
# Create a data frame with all bi-grams
dtm_content_wff_bigram <- data.frame(terms = names(freqbi), occurrences = freqbi)
# Extract the bi-grams whose first word is "from"
# (the trailing space excludes longer words that merely start with "from")
x <- subset(dtm_content_wff_bigram, grepl("^from ", dtm_content_wff_bigram$terms))
# Order the results by occurrences, in decreasing order
ordx <- order(x$occurrences, decreasing = TRUE)
head(x[ordx,])
## terms occurrences
## from the from the 806
## from a from a 220
## from my from my 69
## from his from his 61
## from to from to 54
## from their from their 45
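The lookup above can be wrapped in a small function that the app could call for any first word. The sketch below is one possible shape, using the counts printed above as sample data; the names bigram_df and suggest_next are hypothetical.

```r
# Counts copied from the output above; the layout mirrors dtm_content_wff_bigram.
bigram_df <- data.frame(
  terms = c("from the", "from a", "from my", "from his", "from to", "from their"),
  occurrences = c(806, 220, 69, 61, 54, 45),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n most frequent second words after `first_word`.
suggest_next <- function(df, first_word, n = 5) {
  terms <- as.character(df$terms)                    # also works if terms is a factor
  keep  <- startsWith(terms, paste0(first_word, " "))
  hits  <- df[keep, ]
  hits  <- hits[order(hits$occurrences, decreasing = TRUE), ]
  head(sub("^[^ ]+ ", "", as.character(hits$terms)), n)
}

suggestions <- suggest_next(bigram_df, "from")
suggestions
```

Matching on the first word plus a space with startsWith avoids treating the typed word as a regular expression, so input containing characters such as "." or "(" is matched literally.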
These are the packages loaded during the report creation.
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ngram_3.0.4 ggplot2_2.2.1 magrittr_1.5 tm_0.7-1 NLP_0.1-11
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.13 knitr_1.17 munsell_0.4.3 colorspace_1.3-2
## [5] rlang_0.1.2 stringr_1.2.0 plyr_1.8.4 tools_3.4.2
## [9] parallel_3.4.2 grid_3.4.2 gtable_0.2.0 htmltools_0.3.6
## [13] yaml_2.1.14 lazyeval_0.2.0 rprojroot_1.2 digest_0.6.12
## [17] tibble_1.3.4 evaluate_0.10.1 slam_0.1-40 rmarkdown_1.6
## [21] labeling_0.3 stringi_1.1.5 compiler_3.4.2 scales_0.5.0
## [25] backports_1.1.1