Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be; for example, gym, store, or restaurant. In this project we analyze a large corpus of text documents to discover the structure in the data and how words are put together.
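To make the goal concrete, here is a minimal, purely illustrative sketch of the core idea behind such a predictive model: count how often each word follows the word just typed, and offer the most frequent followers as suggestions. The toy corpus and the function suggest_next are placeholders of my own, not SwiftKey's actual method.
# Toy next-word predictor: rank candidate words by how often they
# follow the typed word in the corpus
suggest_next <- function(corpus, word, n = 3) {
  tokens <- unlist(strsplit(tolower(corpus), "\\s+"))
  followers <- tokens[which(head(tokens, -1) == tolower(word)) + 1]
  head(names(sort(table(followers), decreasing = TRUE)), n)
}
toy_corpus <- c("i went to the gym", "i went to the store", "we went to the gym")
suggest_next(toy_corpus, "the")
## [1] "gym"   "store"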
Let’s start by downloading the data zip file provided by SwiftKey. On unzipping it, we find data in three different languages; this project focuses on the English dataset. We then read in the data from all three sources, namely news, blogs, and Twitter.
if(!file.exists("Coursera-SwiftKey.zip")){
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip","Coursera-SwiftKey.zip")
}
if(!file.exists("final")){
unzip("Coursera-SwiftKey.zip")
}
BlogCon<-file("final/en_US/en_US.blogs.txt","r")
blog_text<-readLines(BlogCon,skipNul = TRUE)
close(BlogCon)
TwitterCon<-file("final/en_US/en_US.twitter.txt","r")
twitter_text<-readLines(TwitterCon,skipNul = TRUE)
close(TwitterCon)
# The news file contains stray control characters that can cut readLines() short
# on some platforms, so read the whole file as raw bytes and split on newlines
ReadinNews <- function(news_data) {
  size <- file.info(news_data)$size
  characters <- readChar(news_data, size, useBytes = TRUE)
  strsplit(characters, "\n", fixed = TRUE, useBytes = TRUE)[[1]]
}
news_text <- ReadinNews("final/en_US/en_US.news.txt")
Now let’s draw three random samples, one from each source, and combine them into a single variable called sample. Each line is kept or dropped independently with a fixed probability, so the sample stays representative of its source while being small enough to work with quickly.
set.seed(999)  # fix the seed so the sampling is reproducible
# Keep each line with a fixed probability: 5% of blog lines, 3% of tweets, 10% of news
PartitionBlogData <- rbinom(length(blog_text), 1, 0.05)
sample_blog_text <- blog_text[PartitionBlogData == 1]
PartitionTwitterData <- rbinom(length(twitter_text), 1, 0.03)
sample_twitter_text <- twitter_text[PartitionTwitterData == 1]
PartitionNewsData <- rbinom(length(news_text), 1, 0.1)
sample_news_text <- news_text[PartitionNewsData == 1]
sample <- c(sample_blog_text, sample_twitter_text, sample_news_text)
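As a quick sanity check, and so that later milestones can reuse the sample without re-reading the full files, we can count the sampled lines and write them out. This is an optional step added for illustration; the file name sample.txt is an arbitrary choice, not part of the original pipeline.
# Lines contributed by each source, and the combined total
sapply(list(blog = sample_blog_text, twitter = sample_twitter_text,
            news = sample_news_text), length)
length(sample)
# Persist the combined sample for later steps (file name is arbitrary)
writeLines(sample, "sample.txt")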
Next I compute line and word counts for all three data sources, plot them, and explore the data a little further.
sources <- list(blog = blog_text, twitter = twitter_text, news = news_text)
# Count the non-whitespace tokens in a character vector
count_words <- function(text) { sum(stringr::str_count(text, "\\S+")) }
df <- data.frame(source = c("blog", "twitter", "news"), lines = NA, words = NA)
df$lines <- sapply(sources, length)
df$words <- sapply(sources, count_words)
summary(df)
## source lines words
## blog :1 Min. : 899288 Min. :30373832
## news :1 1st Qu.: 954766 1st Qu.:32373215
## twitter:1 Median :1010243 Median :34372598
## Mean :1423226 Mean :34026957
## 3rd Qu.:1685196 3rd Qu.:35853520
## Max. :2360148 Max. :37334441
str(df)
## 'data.frame': 3 obs. of 3 variables:
## $ source: Factor w/ 3 levels "blog","news",..: 1 3 2
## $ lines : int 899288 2360148 1010243
## $ words : int 37334441 30373832 34372598
head(df)
## source lines words
## 1 blog 899288 37334441
## 2 twitter 2360148 30373832
## 3 news 1010243 34372598
library(ggplot2)
line_plot <- ggplot(df, aes(x = factor(source), y = lines/1e+06)) +
  geom_bar(stat = "identity") +
  labs(x = "File", y = "Lines (in millions)", title = "Line Count")
line_plot
word_plot <- ggplot(df, aes(x = factor(source), y = words/1e+06)) +
  geom_bar(stat = "identity") +
  labs(x = "File", y = "Words (in millions)", title = "Word Count")
word_plot
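One more view that the counts in df already support: the average number of words per line differs sharply by source, with tweets short by design and blog posts running long. A quick ratio makes that visible (values follow from the line and word counts shown above):
# Average words per line: tweets are short by design, blog posts run long
df$words_per_line <- round(df$words / df$lines, 1)
df[, c("source", "words_per_line")]
##    source words_per_line
## 1    blog           41.5
## 2 twitter           12.9
## 3    news           34.0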