Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
The goal of this project is to analyze a large corpus of text documents and build a text prediction module based on the sample data provided by SwiftKey.
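To make the idea of next-word prediction concrete, here is a minimal sketch based on simple bigram counts. It is not SwiftKey's method and not the model built in this project; the toy corpus and the function name `predictNext` are made up purely for illustration.

```r
# Toy corpus (invented for illustration; not taken from the SwiftKey data)
corpus <- c(
  "i went to the gym",
  "she went to the gym",
  "we went to the gym",
  "i went to the store",
  "he went to the store",
  "they went to the restaurant"
)

# Split each line into lower-case words
tokens <- strsplit(tolower(corpus), "\\s+")

# Build aligned vectors of (previous word, next word) pairs within each line
prevWords <- unlist(lapply(tokens, function(w) head(w, -1)))
nextWords <- unlist(lapply(tokens, function(w) tail(w, -1)))

# Hypothetical helper: return up to n words that most often follow
# the last word of the phrase in the toy corpus
predictNext <- function(phrase, n = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  lastWord <- tail(words, 1)
  idx <- prevWords == lastWord
  if (!any(idx)) return(character(0))
  counts <- sort(table(nextWords[idx]), decreasing = TRUE)
  head(names(counts), n)
}

predictNext("I went to the")
```

A real model would likely use longer contexts and some form of smoothing or back-off for unseen phrases; this sketch only looks at the single preceding word.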
The training data is provided by SwiftKey: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Below is a quick summary of the three training files (Twitter, Blogs, News) used in this project.
## Get the sizes of the training files in MB
tSize <- round(file.info("./Data/final/en_US/en_US.twitter.txt")$size / 1024^2)
bSize <- round(file.info("./Data/final/en_US/en_US.blogs.txt")$size / 1024^2)
nSize <- round(file.info("./Data/final/en_US/en_US.news.txt")$size / 1024^2)
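## Define simple helper functions for the word count, the maximum number
## of words on a single line, and the line count
wordCount <- function(lns) {
  ## total number of whitespace-separated tokens across all lines
  sum(sapply(gregexpr("\\S+", lns), length))
}
maxSentenceLength <- function(lns) {
  ## largest number of whitespace-separated tokens on any one line
  max(sapply(gregexpr("\\S+", lns), length))
}
lineCount <- function(lns) {
  length(lns)
}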
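As a quick sanity check, the counting helpers can be run on a small hand-made vector. One detail worth knowing: `gregexpr` returns -1 for a line with no match, so counting with `length` treats an empty line as one word. The sample lines and the variant `wordCountStrict` below are hypothetical and are not used elsewhere in this report.

```r
# Helpers as defined in this report
wordCount <- function(lns) sum(sapply(gregexpr("\\S+", lns), length))
maxSentenceLength <- function(lns) max(sapply(gregexpr("\\S+", lns), length))

# Hypothetical variant: count only real matches, since gregexpr()
# returns -1 (a vector of length 1) when a line contains no match
wordCountStrict <- function(lns) {
  sum(sapply(gregexpr("\\S+", lns), function(m) sum(m > 0)))
}

# Made-up sample with one empty line
sampleLines <- c("I went to the gym", "", "See you at the store tomorrow")

wordCount(sampleLines)          # empty line contributes 1
maxSentenceLength(sampleLines)  # longest line, in words
wordCountStrict(sampleLines)    # empty line contributes 0
```

For corpora with few empty lines the difference is small, so the totals reported here should be read as approximate word counts.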
## read the 3 files separately
tLines <- readLines(con <- file("./Data/final/en_US/en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
bLines <- readLines(con <- file("./Data/final/en_US/en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
nLines <- readLines(con <- file("./Data/final/en_US/en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con <- file("./Data/final/en_US/en_US.news.txt"),
## encoding = "UTF-8", : incomplete final line found on './Data/final/en_US/
## en_US.news.txt'
close(con)
tSummary <- c(tSize, lineCount(tLines), wordCount(tLines), maxSentenceLength(tLines))
bSummary <- c(bSize, lineCount(bLines), wordCount(bLines), maxSentenceLength(bLines))
nSummary <- c(nSize, lineCount(nLines), wordCount(nLines), maxSentenceLength(nLines))
summary <- rbind(tSummary, bSummary, nSummary)
rownames(summary) <- c("Twitter", "Blogs", "News")
colnames(summary) <- c("File Size (MB)", "Lines", "Words", "Max Words Per Line")
summary
##         File Size (MB)   Lines    Words Max Words Per Line
## Twitter            159 2360148 30373583                 47
## Blogs              200  899288 37334131               6630
## News               196   77259  2643969               1031
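The News figures deserve a second look. At about 34 words per line (2,643,969 / 77,259), 77,259 lines would account for only a small part of a 196 MB file, and the read also raised a warning. One possible cause, which is an assumption and not something this report confirms, is that on some platforms a text-mode connection stops reading at an embedded control character. A minimal sketch of reading through a binary-mode connection instead; `readLinesBinary` is a hypothetical helper name, and the demo file is generated on the fly.

```r
# Hypothetical helper: read all lines through a binary-mode ("rb") connection
readLinesBinary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # make sure the connection is closed even on error
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demo: a temporary file with an embedded SUB (0x1A) character
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("first line\nsecond"), as.raw(0x1A), charToRaw(" line\nthird line\n")),
  tmp
)

demoLines <- readLinesBinary(tmp)
length(demoLines)  # number of lines read from the demo file
unlink(tmp)
```

If re-reading `en_US.news.txt` this way yields a different line count, the summary table and the plot would need to be regenerated from the new values.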
Below is a plot of the number of lines in each file.
numlines <- c(lineCount(tLines), lineCount(bLines), lineCount(nLines))
numlines <- data.frame(numlines)
numlines$names <- c("Twitter","Blogs","News")
ggplot(numlines, aes(x = names, y = numlines)) +
  geom_bar(stat = 'identity', color = 'blue') +
  xlab('File source') +
  ylab('Total No. of Lines') +
  ggtitle('Total Line Count per File Source')