Executive Summary

The Data Science Capstone is the final project of the Coursera Data Science specialization track, with the goal of building an auto-complete predictive text model that produces a list of the words or phrases most likely to follow a given input string. The training dataset used in this project is from HC Corpora, and it can also be downloaded as Coursera-SwiftKey.zip. Note that Coursera-SwiftKey.zip contains corpora in four different languages (English, German, Russian, and Finnish); we will only be using the three files in the English corpus:

  * en_US.blogs.txt
  * en_US.news.txt
  * en_US.twitter.txt
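For completeness, the data can be fetched directly from R. A minimal sketch (the URL below is the one distributed with the Coursera capstone and should be treated as an assumption; adjust it if the file has moved):

##url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
##download.file(url, destfile = "Coursera-SwiftKey.zip")   # download the archive
##unzip("Coursera-SwiftKey.zip")                           # unpacks into final/<language>/ directories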

This report serves as the milestone report. It presents the exploratory analysis performed on the training datasets and some interesting findings that might lead to further research opportunities. Note that some of the analysis is performed on samples of the raw datasets, since the full datasets are very large.

Constraint: the code in this report was only run partially, in order to keep the notebook from hanging; the code chunks are therefore shown commented out.

Milestone Chart

[Figure: project milestone chart]

START: Understand the Instructions

Instructions:

  1. To display the Capstone milestone.
  2. To submit a report on R Pubs (http://rpubs.com/) explaining the exploratory analysis and the goals for the eventual app and algorithm.
  3. To explain only the major features of the data identified.
  4. To briefly summarize, in an understandable way, the plans for creating the prediction algorithm and Shiny app.
  5. To illustrate summaries of the data sets using tables and plots.

Motivation:

  1. Demonstrate that the data was downloaded successfully.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings amassed so far.
  4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

DATA PREPARATION

  1. Set the working directory
##setwd("C:/Users/Azlena Haron/Desktop/datascience/capstone/Data/Coursera-SwiftKey/final/en_US/")
  2. Read the data: twitter, blogs, and news
##twitter <- readLines(con <- file("en_US.twitter.txt", encoding = "UTF-8"), skipNul = TRUE)
##close(con)
##blogs <- readLines(con <- file("en_US.blogs.txt", encoding = "UTF-8"), skipNul = TRUE)
##close(con)
##news <- readLines(con <- file("en_US.news.txt", encoding = "UTF-8"), skipNul = TRUE)
##close(con)
  3. Install and load the required packages

To ensure the code meets our needs, we first install and load the related packages and libraries.

##install.packages("tm")
##install.packages("wordcloud")
##install.packages("ggplot2")
##library(tm)
##library(wordcloud)   # also loads RColorBrewer, which provides brewer.pal() below
##library(ggplot2)

EXPLORATORY DATA ANALYSIS

Twitter, blogs, and news represent different crowds, but the same basic features can be computed for each dataset: file size in megabytes, the number of lines, the length of the longest line, word count, and character count. The exploratory analysis here focuses on these basic features.

1. Basic exploratory analysis

  1. Length (number of lines)
##twitterlength<-length(twitter)
##blogslength<-length(blogs)
##newslength<-length(news)
  2. File size (megabytes)
##twitterSize<-file.info("en_US.twitter.txt")$size / 1024 / 1024   # bytes to megabytes
##newsSize<-file.info("en_US.news.txt")$size / 1024 / 1024
##blogsSize<-file.info("en_US.blogs.txt")$size / 1024 / 1024
  3. Word count
##twitterWords <- sum(sapply(gregexpr("\\S+", twitter), length))
##blogsWords <- sum(sapply(gregexpr("\\S+", blogs), length))
##newsWords <- sum(sapply(gregexpr("\\S+", news), length))
##words<-rbind(twitterWords, blogsWords, newsWords)
##lengths<-rbind(twitterlength, blogslength, newslength)
##sizes<-rbind(twitterSize, blogsSize, newsSize)
##df<-data.frame(c("twitter","blogs","news"))
##df<-data.frame(cbind(df,words,lengths,sizes))
##names(df)<-c("data","words","length","sizes")
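The instructions also ask for tables, so the summary data frame can simply be printed, or (optionally, assuming the knitr package is installed) rendered as a formatted table:

##df                 # print the summary data frame
##knitr::kable(df)   # optional: formatted table via knitr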

2. Plot graphs

  1. Bar plot of file sizes
##ggplot(data=df, aes(x=data, y=sizes)) +
##  geom_bar(stat="identity", color='grey60', fill='#FFE6FF') +
##  geom_text(aes(label = format(sizes, big.mark=",")), size = 3, vjust=-0.3) +
##  theme_bw() + xlab('Source') + ylab('File Size (MB)') +
##  theme(legend.position='none') +
##  ggtitle("File sizes for three datasets")

[Figure: bar plot of file sizes for the three datasets]

The graph above shows that the blogs file is the largest, followed by news and twitter. This suggests that blogs contain more words and a richer vocabulary.

  2. Bar plot of word counts
##ggplot(data=df, aes(x=data, y=words)) +
##  geom_bar(stat="identity", color='grey60', fill='#FFE6FF') +
##  geom_text(aes(label = format(words, big.mark=",")), size = 3, vjust=-0.3) +
##  theme_bw() + xlab('Source') + ylab('Total Word Count') +
##  theme(legend.position='none') +
##  ggtitle("Total word count for three datasets")

[Figure: bar plot of total word counts for the three datasets]

The graph above shows that the blogs file contains more words than the news and twitter files.

  3. Bar plot of line counts
##ggplot(data=df, aes(x=data, y=length)) +
##  geom_bar(stat="identity", color='grey60', fill='#FFE6FF') +
##  geom_text(aes(label = format(length, big.mark=",")), size = 3, vjust=-0.3) +
##  theme_bw() + xlab('Source') + ylab('Total Number of Lines') +
##  theme(legend.position='none') +
##  ggtitle("Total number of lines for three datasets")

[Figure: bar plot of the number of lines for the three datasets]

The graph above shows that the twitter file has more lines than the blogs and news files, even though the blogs file contains more words.
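One quick, illustrative way to confirm this from the summary data frame built earlier is to compute the average number of words per line; given the 140-character limit on tweets, twitter should have the smallest value:

##df$words / df$length   # mean words per line for each source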

3. Playing with the words

The purpose of this part is to display the words that make up the twitter, blogs, and news data. It consists of several steps: 1. convert uppercase letters to lowercase; 2. remove punctuation and numbers; 3. strip extra whitespace; and 4. remove common English stop words.

  1. Twitter
##twitter <- readLines(con <- file("en_US.twitter.txt", encoding = "UTF-8"), skipNul = TRUE)
##close(con)
##cleanedTwitter <- sapply(twitter, function(x) iconv(enc2utf8(x), sub="byte"))  # repair stray encodings
##twitterSample <- sample(cleanedTwitter, 10000)   # work with a 10,000-line sample
##doc.vec <- VectorSource(twitterSample)
##doc.corpus <- Corpus(doc.vec)
##doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))   # lowercase
##doc.corpus <- tm_map(doc.corpus, removePunctuation)
##doc.corpus <- tm_map(doc.corpus, removeNumbers)
##doc.corpus <- tm_map(doc.corpus, stripWhitespace)
##doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))  # drop stop words
##wordcloud(doc.corpus, max.words=200, colors=brewer.pal(8,"Dark2"))

[Figure: word cloud of the twitter sample]

  2. Blogs
##blogs <- readLines(con <- file("en_US.blogs.txt", encoding = "UTF-8"), skipNul = TRUE)
##close(con)
##cleanedBlogs <- sapply(blogs, function(x) iconv(enc2utf8(x), sub="byte"))
##BlogsSample <- sample(cleanedBlogs, 10000)
##doc.vec <- VectorSource(BlogsSample)
##doc.corpus <- Corpus(doc.vec)
##doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
##doc.corpus <- tm_map(doc.corpus, removePunctuation)
##doc.corpus <- tm_map(doc.corpus, removeNumbers)
##doc.corpus <- tm_map(doc.corpus, stripWhitespace)
##doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
##wordcloud(doc.corpus, max.words=200, colors=brewer.pal(8,"Dark2"))

[Figure: word cloud of the blogs sample]

  3. News
##news <- readLines(con <- file("en_US.news.txt", encoding = "UTF-8"), skipNul = TRUE)
##close(con)
##cleanedNews <- sapply(news, function(x) iconv(enc2utf8(x), sub="byte"))
##NewsSample <- sample(cleanedNews, 10000)
##doc.vec <- VectorSource(NewsSample)
##doc.corpus <- Corpus(doc.vec)
##doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
##doc.corpus <- tm_map(doc.corpus, removePunctuation)
##doc.corpus <- tm_map(doc.corpus, removeNumbers)
##doc.corpus <- tm_map(doc.corpus, stripWhitespace)
##doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
##wordcloud(doc.corpus, max.words=200, colors=brewer.pal(8,"Dark2"))

[Figure: word cloud of the news sample]
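Beyond word clouds, the same corpus object can be used to inspect raw term frequencies, which is what the remark about frequency distributions in the conclusion below refers to. A minimal sketch, assuming the tm corpus built in the step above (here, the news sample):

##tdm <- TermDocumentMatrix(doc.corpus)                     # term-by-document counts
##freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # total count per term
##head(freq, 10)                                            # the ten most frequent terms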

Conclusion

This exploratory data report examines three corpora of US English text (blogs, twitter, and news). All three files are roughly 200 MB in size. Nevertheless, the blogs and news files appear to contain a similar number of lines (around one million each), while the twitter line count is larger. The larger line count is most likely a consequence of the 140-character limit on tweets. This difference is not observed in the word counts, which are of the same order of magnitude (tens of millions of words) across the three files. Finally, the word frequency distribution of the twitter data differs from those of the blogs and news data.
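As a preview of the planned prediction algorithm (a hypothetical sketch, not results from the analysis above), a simple bigram table built from the cleaned twitter sample already illustrates the idea behind next-word prediction: count how often each word pair occurs, then return the most frequent follower of the input word. The helper predictNext is an illustrative name; the eventual Shiny app would wrap a function like this behind a text input box.

##words <- unlist(strsplit(tolower(twitterSample), "\\s+"))   # tokenize (pairs may cross line breaks; fine for a sketch)
##bigrams <- paste(head(words, -1), tail(words, -1))          # adjacent word pairs
##bigramFreq <- sort(table(bigrams), decreasing = TRUE)       # most frequent pairs first
##predictNext <- function(word) {
##  hits <- grep(paste0("^", word, " "), names(bigramFreq), value = TRUE)
##  if (length(hits) == 0) return(NA)                         # no match: a real model would back off
##  sub(paste0("^", word, " "), "", hits[1])                  # follower in the most frequent pair
##}
##predictNext("new")   # on typical English text this might return "york"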