Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
The purpose of this report is to provide you with a basic understanding of the data loading, preparation and exploration activities that I completed this week, in preparation for the model-building and Shiny app development that will commence next week.
Three English-language datasets were downloaded from the URL below and evaluated using RStudio. The datasets consisted of a Twitter dataset, a blogs dataset and a US news dataset. I evaluated these three text-based datasets using R's text-mining tools and the Stanford Open NLP package. The datasets are very large (a combined ~556 MB, per the table below), and as a result I explored only three one-percent samples (one percent of each dataset). I was able to explore larger samples of the Twitter and news datasets, but increasing the size of the blogs sample would cause R to crash with memory-overload errors. Despite the 1% limitation, I was able to complete a number of exploration activities and found that 147 words (<1% of the rows of the word-frequency table) provided 50% coverage of my sample, and 7,286 words (~25% of the rows) provided 90% coverage. This leads me to believe that very good text-prediction models can result from analyzing 1-, 2- and 3-word tokens.
The data for this project was downloaded from this URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Obtaining the data - The data was downloaded from the URL above and loaded into RStudio.
setwd('C:/Users/jgpolanc/Desktop/Coursera/Capstone')
folder <- "C:/Users/jgpolanc/Desktop/Coursera/Capstone/data"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fname <- "Coursera-SwiftKey.zip"
path <- paste(folder, fname, sep="/")
if (!dir.exists(folder)) dir.create(folder, recursive=TRUE)  # create the data folder if needed
if (!file.exists(path)){
  download.file(url, destfile=path, mode="wb")  # mode="wb" keeps the zip binary-safe on Windows
}
unzip(zipfile=path, exdir=folder)
Data Evaluation - This section summarizes the three datasets and shows what the raw files look like.
folder <- "C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final"
flist <- list.files(path=folder, recursive=T, pattern=".*en_.*.txt")
l <- lapply(paste(folder, flist, sep="/"), function(f) {
fsize <- file.info(f)[1]/1024/1024
con <- file(f, open="rb")
lines <- readLines(con)
nchars <- lapply(lines, nchar)
maxchars <- which.max(nchars)
nwords <- sum(sapply(strsplit(lines, "\\s+"), length))
close(con)
return(c(f, format(round(fsize, 2), nsmall=2), length(lines), maxchars, nwords))
})
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
colnames(df) <- c("file", "size(MB)", "num.of.lines", "longest.line.idx", "num.of.words")
df
## file
## 1 C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.blogs.txt
## 2 C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.news.txt
## 3 C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.twitter.txt
##   size(MB) num.of.lines longest.line.idx num.of.words
## 1 200.42 899288 483415 37334441
## 2 196.28 1010242 123628 34372598
## 3 159.36 2360148 1484357 30373792
## Warning in readLines(twit_data): line 167155 appears to contain an embedded nul
## Warning in readLines(twit_data): line 268547 appears to contain an embedded nul
## Warning in readLines(twit_data): line 1274086 appears to contain an embedded nul
## Warning in readLines(twit_data): line 1759032 appears to contain an embedded nul
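These warnings indicate that a handful of Twitter lines contain embedded nul characters. They are harmless here, but base R's readLines() can drop the nuls via its skipNul argument; a minimal sketch (the connection name twit_data is taken from the warning output and is assumed to point at en_US.twitter.txt):

twit_data <- file("C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.twitter.txt", open="rb")
twit <- readLines(twit_data, skipNul=TRUE)  # silently drop embedded nuls
close(twit_data)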
Exploratory Analysis - This section seeks to understand the frequencies of words and word pairs in the data, using figures and tables to show how those frequencies vary.
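The tokenization code was not echoed in this report, so what follows is a minimal base-R sketch of the approach, assuming the en_US.twitter.txt path used above and the one-percent sampling rate described in the summary; the cleaning rules are simplified assumptions:

set.seed(1234)  # make the 1% sample reproducible
path <- "C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.twitter.txt"
lines <- readLines(path, skipNul=TRUE)
samp <- sample(lines, round(length(lines) * 0.01))  # one-percent sample
samp <- tolower(samp)
samp <- gsub("[^a-z' ]", " ", samp)                 # crude cleaning: keep letters and apostrophes
words <- unlist(strsplit(samp, "\\s+"))
words <- words[words != ""]
unigrams <- sort(table(words), decreasing=TRUE)     # one-word token frequencies
bigrams <- sort(table(paste(head(words, -1), tail(words, -1))), decreasing=TRUE)  # word pairs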
(Figures g1-g4: frequency plots of words and word pairs in the one-percent samples.)
The following calculation shows how many unique words, and what percentage of the rows of the word-frequency table, it would take to reach 50% coverage in our one-word model.
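The coverage code itself was not echoed in the report; a minimal sketch of the calculation, assuming the sorted unigrams table built above (running it with threshold 0.9 yields the 90% figures further below):

coverage <- function(freq, threshold) {
  cum <- cumsum(freq) / sum(freq)   # cumulative share of all word occurrences
  n <- which(cum >= threshold)[1]   # number of top words needed to reach the threshold
  c(words_counted = n, percent_rows = 100 * n / length(freq))
}
coverage(unigrams, 0.5)  # 50% coverage
coverage(unigrams, 0.9)  # 90% coverage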
words_counted
## [1] 147
percent_rows
## [1] 0.4922314
The next calculation shows how many unique words and what percentage of the rows it would take to reach 90% coverage in our one-word model.
words_counted
## [1] 7286
percent_rows
## [1] 24.39727
Starting next week I will build prediction models that predict the next word from 2-, 3- and 4-word phrases. From there I will build a Shiny app that displays the results of those predictions.
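As a preview of the approach, here is a minimal sketch of how a frequency-sorted table of word pairs could back a next-word lookup; predict_next is a hypothetical helper, and the real model will extend this to 3- and 4-word phrases with a back-off strategy:

predict_next <- function(phrase, bigrams, n = 3) {
  prefix <- paste0(phrase, " ")
  cand <- bigrams[startsWith(names(bigrams), prefix)]  # word pairs that begin with the phrase
  head(substring(names(cand), nchar(prefix) + 1), n)   # bigrams is already sorted by frequency
}
predict_next("to", bigrams)  # e.g. three candidate next words after "to"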