Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
The purpose of this report is to provide you with a basic understanding of the data loading, preparation and exploration activities that I completed this week, in preparation for the model-building and Shiny app development that will commence next week.
Three English-language datasets were downloaded from the URL below and evaluated using RStudio. The datasets consisted of a Twitter dataset, a blogs dataset and a US news dataset. I evaluated these three text-based datasets using R's text-mining tools and the Stanford Open NLP package. The datasets are very large (a combined ~556 MB, per the table below), and as a result I explored only three one-percent samples (one percent of each dataset). I was able to explore larger samples of the Twitter and news datasets, but increasing the size of the blogs sample would cause R to crash with memory-overload errors. Despite the 1% limitation, I was able to complete a number of exploration activities and found that 147 words (<1% of the rows of the word-frequency table) provided 50% coverage of my sample, and 7,286 words (~25% of the rows) provided 90% coverage. This leads me to believe that very good text-prediction models can result from analyzing 1-, 2- and 3-word tokens.
The data for this project was downloaded from this URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Obtaining the data - The data was downloaded from the URL above and loaded into RStudio.
setwd('C:/Users/jgpolanc/Desktop/Coursera/Capstone')
folder <- "C:/Users/jgpolanc/Desktop/Coursera/Capstone/data"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fname <- "Coursera-SwiftKey.zip"
path <- paste(folder, fname, sep="/")
if (!dir.exists(folder)) dir.create(folder, recursive=TRUE)  # create the data folder if needed
if (!file.exists(path)){
  download.file(url, destfile=path, mode="wb")  # mode="wb" keeps the zip binary-safe on Windows
}
unzip(zipfile=path, exdir=folder)
Data Evaluation - This section summarizes the three datasets and shows what the raw files look like.
folder <- "C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final"
flist <- list.files(path=folder, recursive=T, pattern=".*en_.*.txt")
l <- lapply(paste(folder, flist, sep="/"), function(f) {
fsize <- file.info(f)[1]/1024/1024
con <- file(f, open="rb")
lines <- readLines(con)
nchars <- lapply(lines, nchar)
maxchars <- which.max(nchars)
nwords <- sum(sapply(strsplit(lines, "\\s+"), length))
close(con)
return(c(f, format(round(fsize, 2), nsmall=2), length(lines), maxchars, nwords))
})
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
colnames(df) <- c("file", "size(MB)", "num.of.lines", "longest.line.idx", "num.of.words")
df
## file
## 1 C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.blogs.txt
## 2 C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.news.txt
## 3 C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.twitter.txt
##   size(MB) num.of.lines longest.line.idx num.of.words
## 1 200.42 899288 483415 37334441
## 2 196.28 1010242 123628 34372598
## 3 159.36 2360148 1484357 30373792
## Warning in readLines(twit_data): line 167155 appears to contain an embedded nul
## Warning in readLines(twit_data): line 268547 appears to contain an embedded nul
## Warning in readLines(twit_data): line 1274086 appears to contain an embedded nul
## Warning in readLines(twit_data): line 1759032 appears to contain an embedded nul
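These warnings indicate that a handful of Twitter lines contain embedded nul characters. They are harmless here, but base R's readLines() can drop the nuls via its skipNul argument; a minimal sketch (the connection name twit_data is taken from the warning output and is assumed to point at en_US.twitter.txt):

twit_data <- file("C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.twitter.txt", open="rb")
twit <- readLines(twit_data, skipNul=TRUE)  # silently drop embedded nuls
close(twit_data)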
Exploratory Analysis - This section seeks to understand the frequencies of words and word pairs in the data, using figures and tables to show how those frequencies vary.
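The tokenization code was not echoed in this report, so what follows is a minimal base-R sketch of the approach, assuming the en_US.twitter.txt path used above and the one-percent sampling rate described in the summary; the cleaning rules are simplified assumptions:

set.seed(1234)  # make the 1% sample reproducible
path <- "C:/Users/jgpolanc/Desktop/Coursera/Capstone/data/final/en_US/en_US.twitter.txt"
lines <- readLines(path, skipNul=TRUE)
samp <- sample(lines, round(length(lines) * 0.01))  # one-percent sample
samp <- tolower(samp)
samp <- gsub("[^a-z' ]", " ", samp)                 # crude cleaning: keep letters and apostrophes
words <- unlist(strsplit(samp, "\\s+"))
words <- words[words != ""]
unigrams <- sort(table(words), decreasing=TRUE)     # one-word token frequencies
bigrams <- sort(table(paste(head(words, -1), tail(words, -1))), decreasing=TRUE)  # word pairs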
(Figures g1-g4: frequency plots of words and word pairs in the one-percent samples.)
The following calculation shows how many unique words, and what percentage of the rows of the word-frequency table, it would take to reach 50% coverage in our one-word model.
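The coverage code itself was not echoed in the report; a minimal sketch of the calculation, assuming the sorted unigrams table built above (running it with threshold 0.9 yields the 90% figures further below):

coverage <- function(freq, threshold) {
  cum <- cumsum(freq) / sum(freq)   # cumulative share of all word occurrences
  n <- which(cum >= threshold)[1]   # number of top words needed to reach the threshold
  c(words_counted = n, percent_rows = 100 * n / length(freq))
}
coverage(unigrams, 0.5)  # 50% coverage
coverage(unigrams, 0.9)  # 90% coverage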
words_counted
## [1] 147
percent_rows
## [1] 0.4922314
The next calculation shows how many unique words and what percentage of the rows it would take to reach 90% coverage in our one-word model.
words_counted
## [1] 7286
percent_rows
## [1] 24.39727
Starting next week I will build prediction models that predict the next word from 2-, 3- and 4-word phrases. From there I will build a Shiny app that displays the results of those predictions.
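As a preview of the approach, here is a minimal sketch of how a frequency-sorted table of word pairs could back a next-word lookup; predict_next is a hypothetical helper, and the real model will extend this to 3- and 4-word phrases with a back-off strategy:

predict_next <- function(phrase, bigrams, n = 3) {
  prefix <- paste0(phrase, " ")
  cand <- bigrams[startsWith(names(bigrams), prefix)]  # word pairs that begin with the phrase
  head(substring(names(cand), nchar(prefix) + 1), n)   # bigrams is already sorted by frequency
}
predict_next("to", bigrams)  # e.g. three candidate next words after "to"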