Data Science Capstone Project: SwiftKey

Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

The goal of this project is to analyze a large corpus of text documents and build a text prediction model based on the sample data provided by SwiftKey.
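To make the idea of next-word prediction concrete, here is a minimal sketch in base R that suggests the most frequent words following a given word, using bigram counts. It is only an illustration under stated assumptions: the toy corpus is made up, `predict_next` is a hypothetical helper name, and nothing here is claimed to be how SwiftKey's keyboard actually works.

```r
# Toy corpus, invented for illustration (not taken from the SwiftKey data)
toy_corpus <- c(
  "i went to the gym",
  "i went to the store",
  "we went to the gym",
  "she went to the restaurant",
  "he walked to the store",
  "they drove to the gym"
)

# Split each line into lowercase words
tokens <- strsplit(tolower(toy_corpus), "\\s+")

# Build a table of (previous word, next word) pairs
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Hypothetical helper: the n most frequent words observed after `word`
predict_next <- function(word, n = 3) {
  candidates <- bigrams$nxt[bigrams$prev == tolower(word)]
  if (length(candidates) == 0) return(character(0))
  counts <- sort(table(candidates), decreasing = TRUE)
  head(names(counts), n)
}

predict_next("the")
```

In this toy corpus, "the" is followed by gym three times, store twice and restaurant once, so the call returns those three words in that order. A word that never appears as a predecessor yields an empty character vector.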

Data Source

The training data is provided by SwiftKey: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Loading Library

library(tm)

## Loading required package: NLP

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

Data Summary

Below is a quick summary of the three training files (Twitter, Blogs, News) that we will be using in this project.

## Get the sizes of the training files in MB
tSize <- round(file.info("./Data/final/en_US/en_US.twitter.txt")$size / 1024^2)
bSize <- round(file.info("./Data/final/en_US/en_US.blogs.txt")$size / 1024^2)
nSize <- round(file.info("./Data/final/en_US/en_US.news.txt")$size / 1024^2)

## Define simple helper functions for the total word count, the maximum number of words per line, and the line count
wordCount <- function(lns){
  sum(sapply(gregexpr("\\S+", lns), length))
}

maxSentenceLength <- function(lns){
  max(sapply(gregexpr("\\S+", lns), length))
}

lineCount <- function(lns){
  length(lns)
}
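One detail of the `gregexpr` approach is worth knowing: for a line with no match, `gregexpr` returns a single `-1`, so an empty or whitespace-only line is counted as one word. The sketch below shows an alternative that counts such lines as zero. `wordCountStrict` and `example_lines` are hypothetical names introduced for illustration; the figures reported in this document come from `wordCount` as defined above.

```r
# Hypothetical variant of wordCount: regmatches() returns an empty vector
# for lines without a match, so those lines contribute zero words
wordCountStrict <- function(lns) {
  sum(lengths(regmatches(lns, gregexpr("\\S+", lns))))
}

# Small made-up example with an empty line and a whitespace-only line
example_lines <- c("I went to the gym", "", "   ", "two words")

# Counts only the actual words
wordCountStrict(example_lines)

# The gregexpr-length approach also counts the two blank lines
sum(sapply(gregexpr("\\S+", example_lines), length))
```

On a corpus with few blank lines the two approaches give nearly the same total, so the difference matters mainly when the input contains many empty lines.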


## read the 3 files separately
tLines <- readLines(con <- file("./Data/final/en_US/en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

bLines <- readLines(con <- file("./Data/final/en_US/en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

nLines <- readLines(con <- file("./Data/final/en_US/en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)

## Warning in readLines(con <- file("./Data/final/en_US/en_US.news.txt"),
## encoding = "UTF-8", : incomplete final line found on './Data/final/en_US/
## en_US.news.txt'

close(con)


tSummary <- c(tSize, lineCount(tLines), wordCount(tLines), maxSentenceLength(tLines))
bSummary <- c(bSize, lineCount(bLines), wordCount(bLines), maxSentenceLength(bLines))
nSummary <- c(nSize, lineCount(nLines), wordCount(nLines), maxSentenceLength(nLines))


summary <- rbind(tSummary, bSummary, nSummary)
rownames(summary) <- c("Twitter", "Blogs", "News")
colnames(summary) <- c("File Size (MB)", "Lines", "Words", "Max Words Per Line")

summary

##         File Size (MB)   Lines    Words Max Words Per Line
## Twitter            159 2360148 30373583                 47
## Blogs              200  899288 37334131               6630
## News               196   77259  2643969               1031
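The News row deserves a second look: the file is about as large as the Blogs file (196 MB against 200 MB), yet it reports far fewer lines and words, and the read produced an "incomplete final line" warning. That pattern is consistent with the read stopping early, which can happen on Windows when a text-mode connection meets an end-of-file control character. Assuming that is the cause here, which we have not verified, opening the connection in binary mode is one possible workaround. `readAllLines` is a hypothetical helper name, and the temporary file only stands in for the real data.

```r
# Hypothetical helper: read every line through a binary-mode connection
readAllLines <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained check: a temporary file with a Ctrl-Z (0x1A) byte mid-file
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("first line\nsecond"), as.raw(0x1A), charToRaw(" line\nthird line\n")),
  tmp
)
demo_lines <- readAllLines(tmp)
length(demo_lines)  # all three lines are read despite the control byte
unlink(tmp)
```

Applied to the training data, the call would be `readAllLines("./Data/final/en_US/en_US.news.txt")`; the summary table would then need to be regenerated from the new result.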

Below is a plot of the number of lines in each of the files.

numlines <- c(lineCount(tLines), lineCount(bLines), lineCount(nLines))
numlines <- data.frame(numlines)
numlines$names <- c("Twitter","Blogs","News")
ggplot(numlines, aes(x = names, y = numlines)) +
  geom_bar(stat = 'identity', color = 'blue') +
  xlab('File source') + ylab('Total No. of Lines') +
  ggtitle('Total Line Count per File Source')

Data Science Capstone Project: SwiftKey - Text Prediction

Wing Chum

October 22, 2017
