Overview

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is its predictive text model. When someone types:

I went to the

the keyboard presents three options for what the next word might be; for example, gym, store, and restaurant. Here we focus on analyzing a large corpus of text documents to discover the structure in the data and how words are put together.
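To make the idea concrete, here is a minimal sketch of the kind of bigram lookup such a model rests on. The toy_corpus and predict_next names are illustrative assumptions of mine, not SwiftKey's implementation or code from this project.

# A toy bigram model, for illustration only
toy_corpus <- c("i went to the gym", "i went to the store",
                "i went to the store again", "we went to the restaurant")
# Build bigrams within each sentence
bigrams <- unlist(lapply(strsplit(toy_corpus, " ", fixed = TRUE),
                         function(w) paste(head(w, -1), tail(w, -1))))
# Suggest the n most frequent words observed after `word`
predict_next <- function(word, n = 3) {
  followers <- sub("^\\S+ ", "", bigrams[grepl(paste0("^", word, " "), bigrams)])
  head(names(sort(table(followers), decreasing = TRUE)), n)
}
predict_next("the")   # "store" "gym" "restaurant"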

Downloading and Loading in the Data

Let’s start by downloading the data zip file provided by SwiftKey. On unzipping it, we find data in three different languages; this project focuses on the English dataset. We then read in data from all three sources, namely blogs, news, and Twitter.

# Download and unzip the SwiftKey dataset (only if not already present)
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                "Coursera-SwiftKey.zip")
}
if (!file.exists("final")) {
  unzip("Coursera-SwiftKey.zip")
}

# Read the blog and Twitter files line by line, skipping embedded nuls
BlogCon <- file("final/en_US/en_US.blogs.txt", "r")
blog_text <- readLines(BlogCon, skipNul = TRUE)
close(BlogCon)

TwitterCon <- file("final/en_US/en_US.twitter.txt", "r")
twitter_text <- readLines(TwitterCon, skipNul = TRUE)
close(TwitterCon)

# The news file contains control characters that can cut readLines() short,
# so read it in one pass with readChar() and split on newlines instead
ReadinNews <- function(news_data) {
  size <- file.info(news_data)$size
  characters <- readChar(news_data, size, useBytes = TRUE)
  strsplit(characters, "\n", fixed = TRUE, useBytes = TRUE)[[1]]
}
news_text <- ReadinNews("final/en_US/en_US.news.txt")
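Before sampling, a quick sanity check (my own addition, not part of the original pipeline) shows how much memory each source occupies and confirms the news split worked:

# How large are the loaded sources in memory?
format(object.size(blog_text), units = "MB")
format(object.size(twitter_text), units = "MB")
format(object.size(news_text), units = "MB")
head(news_text, 1)   # peek at the first news line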

Sampling the Data

Now let’s take a random sample from each source (roughly 5% of blog lines, 3% of tweets, and 10% of news lines) and combine them into one variable called sample.

# Fix the RNG seed so the sample is reproducible
set.seed(999)

# Keep each line with the given probability (5% blogs, 3% tweets, 10% news)
PartitionBlogData <- rbinom(length(blog_text), 1, 0.05)
sample_blog_text <- blog_text[PartitionBlogData == 1]
PartitionTwitterData <- rbinom(length(twitter_text), 1, 0.03)
sample_twitter_text <- twitter_text[PartitionTwitterData == 1]
PartitionNewsData <- rbinom(length(news_text), 1, 0.1)
sample_news_text <- news_text[PartitionNewsData == 1]

# Combine the three samples into a single corpus sample
sample <- c(sample_blog_text, sample_twitter_text, sample_news_text)
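To reuse this sample in later stages without re-reading the raw files, one option (an assumption on my part; the assignment does not require it) is to write it out once:

length(sample)   # realized sample size
# Persist the combined sample; the file name is my own choice
writeLines(sample, "en_US.sample.txt")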

Exploratory Analysis

Here I plot the line count and word count for all three data sources and explore the data some more.

# Count lines and words (non-whitespace tokens) in each source
sources <- list(blog = blog_text, twitter = twitter_text, news = news_text)
words <- function(x) sum(stringr::str_count(x, "\\S+"))
df <- data.frame(source = c("blog", "twitter", "news"), lines = NA, words = NA)
df$lines <- sapply(sources, length)
df$words <- sapply(sources, words)
summary(df)
##      source      lines             words         
##  blog   :1   Min.   : 899288   Min.   :30373832  
##  news   :1   1st Qu.: 954766   1st Qu.:32373215  
##  twitter:1   Median :1010243   Median :34372598  
##              Mean   :1423226   Mean   :34026957  
##              3rd Qu.:1685196   3rd Qu.:35853520  
##              Max.   :2360148   Max.   :37334441
str(df)
## 'data.frame':    3 obs. of  3 variables:
##  $ source: Factor w/ 3 levels "blog","news",..: 1 3 2
##  $ lines : int  899288 2360148 1010243
##  $ words : int  37334441 30373832 34372598
head(df)
##    source   lines    words
## 1    blog  899288 37334441
## 2 twitter 2360148 30373832
## 3    news 1010243 34372598
library(ggplot2)

# Bar chart of line counts (in millions) per source
lines <- ggplot(df, aes(x = factor(source), y = lines / 1e+06))
lines <- lines + geom_bar(stat = "identity") +
  labs(x = "File", y = "Lines (in millions)", title = "Line Count")
lines

# Bar chart of word counts (in millions) per source
Words <- ggplot(df, aes(x = factor(source), y = words / 1e+06))
Words <- Words + geom_bar(stat = "identity") +
  labs(x = "File", y = "Words (in millions)", title = "Word Count")
Words
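As one more exploratory step (a base-R addition of mine), we can tabulate the most frequent words in the combined sample, which previews the frequency structure a prediction model would draw on:

# Lower-case, split on anything that is not a letter or apostrophe, and tabulate
tokens <- unlist(strsplit(tolower(sample), "[^a-z']+"))
tokens <- tokens[nzchar(tokens)]   # drop empty strings produced by the split
head(sort(table(tokens), decreasing = TRUE), 10)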