The Data Science Project is the capstone project for the Coursera Data Science specialization track, with the goal of building an auto-complete predictive text model that produces a list of words or phrases most likely to follow a given input string. The training dataset used in this project is from HC Corpora and can also be downloaded as Coursera-SwiftKey.zip. Note that Coursera-SwiftKey.zip contains databases in four different languages (English, German, Russian and Finnish); we will only be using the three files in the English database: * en_US.blogs.txt * en_US.news.txt * en_US.twitter.txt
This report serves as the milestone report. It presents the exploratory analysis performed on the training datasets and some interesting findings that might lead to further research opportunities. Note that some of the analysis is performed on samples of the raw datasets, which is reasonable given how large the raw files are.
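Since later steps draw random samples of 10,000 lines from the raw text, seeding the random number generator makes those samples reproducible. A minimal sketch, assuming the twitter object has already been read in as shown below (the seed value is arbitrary):
##set.seed(1234)                           # fix the RNG so the same sample is drawn on every run
##twitterSample <- sample(twitter, 10000)  # draw 10,000 random lines from the full corpus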
Constraint: I ran this program only partially in order to prevent the notebook from hanging.
##setwd("C:/Users/Azlena Haron/Desktop/datascience/capstone/Data/Coursera-SwiftKey/final/en_US/")
##twitter <- readLines(con <- file("en_US.twitter.txt", encoding = "UTF-8"), skipNul = TRUE)
##close(con)
##blogs <- readLines(con <- file("en_US.blogs.txt", encoding = "UTF-8"), skipNul = TRUE)
##close(con)
##news <- readLines(con <- file("en_US.news.txt", encoding = "UTF-8"), skipNul = TRUE)
##close(con)
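If reading the full files makes the notebook unresponsive, only the first portion of each file can be loaded instead. A minimal sketch using the n argument of readLines (the line count of 100,000 is arbitrary):
##twitter <- readLines(con <- file("en_US.twitter.txt", encoding = "UTF-8"), n = 100000, skipNul = TRUE)
##close(con)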
To run the analysis, we need to install and load the required packages and libraries.
##install.packages("tm")
##install.packages("wordcloud")
##library(tm)
##library(wordcloud)
##library(ggplot2)
Twitter, blogs and news represent different audiences, but the same basic features can be compared across the datasets: file size in megabytes, number of lines, length of the longest line, word count and character count. Here, the exploration focuses on these basic features.
##twitterlength<-length(twitter)
##blogslength<-length(blogs)
##newslength<-length(news)
##twitterSize <- file.info("en_US.twitter.txt")$size / 1024 / 1024
##newsSize <- file.info("en_US.news.txt")$size / 1024 / 1024
##blogsSize <- file.info("en_US.blogs.txt")$size / 1024 / 1024
##twitterWords <- sum(sapply(gregexpr("\\S+", twitter), length))
##blogsWords <- sum(sapply(gregexpr("\\S+", blogs), length))
##newsWords <- sum(sapply(gregexpr("\\S+", news), length))
##words<-rbind(twitterWords, blogsWords, newsWords)
##lengths<-rbind(twitterlength, blogslength, newslength)
##sizes<-rbind(twitterSize, blogsSize, newsSize)
##df<-data.frame(c("twitter","blogs","news"))
##df<-data.frame(cbind(df,words,lengths,sizes))
##names(df)<-c("data","words","length","sizes")
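Two of the features mentioned above, the length of the longest line and the total character count, are not computed in the code above. A minimal sketch for the twitter lines (the same pattern applies to blogs and news):
##twitterChars   <- sum(nchar(twitter))   # total number of characters
##twitterLongest <- max(nchar(twitter))   # length, in characters, of the longest line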
##ggplot(data=df, aes(x=data,y=sizes))+geom_bar(stat="identity",color='grey60',fill='#FFE6FF')+geom_text(aes(label = format(sizes, big.mark=",")), size = 3, vjust=-0.3)+theme_bw() + xlab('Source')+ylab('File Size (MB)') + theme(legend.position='none')+ ggtitle("File sizes for three datasets")
The above graph shows that the blogs file is larger than the news and twitter files, which suggests that blogs contain more words/vocabulary.
##ggplot(data=df, aes(x=data,y=words))+geom_bar(stat="identity",color='grey60',fill='#FFE6FF')+geom_text(aes(label = format(words, big.mark=",")), size = 3, vjust=-0.3)+theme_bw() + xlab('Source')+ylab('Total Words Count') + theme(legend.position='none')+ ggtitle("Total word count for three datasets")
The above graph shows that blogs have more words than news and twitter.
##ggplot(data=df, aes(x=data,y=length))+geom_bar(stat="identity",color='grey60',fill='#FFE6FF')+geom_text(aes(label = format(length, big.mark=",")), size = 3, vjust=-0.3)+theme_bw() + xlab('Source')+ylab('Total Number of Lines') + theme(legend.position='none')+ ggtitle("Total Number of Lines for three datasets")
The above graph shows that twitter has more lines than blogs, even though blogs have more words overall, so twitter lines are much shorter on average.
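The contrast between line counts and word counts can be quantified as the average number of words per line; a minimal sketch using the data frame built above:
##df$wordsPerLine <- df$words / df$length   # average words per line for each source
##df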
The purpose of this part is to display the words contained in the twitter, blogs and news samples as word clouds. The cleaning involves several steps: 1. convert uppercase letters to lower case; 2. remove numbers, URLs and non-word symbols; 3. remove punctuation and stem the documents; and 4. remove stop words. (The code below lowercases the text and removes punctuation and numbers; stop-word removal and stemming are sketched after the word clouds.)
##twitter <- readLines(con <- file("en_US.twitter.txt", encoding = "UTF-8"), skipNul = TRUE)
##close (con)
##cleanedTwitter <-sapply(twitter,function(x) iconv (enc2utf8(x),sub="byte"))
##twitterSample<-sample(cleanedTwitter,10000)
##doc.vec<-VectorSource(twitterSample)
##doc.corpus<-Corpus(doc.vec)
##doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
##doc.corpus <- tm_map(doc.corpus, removePunctuation)
##doc.corpus <- tm_map(doc.corpus, removeNumbers)
##doc.corpus <- tm_map(doc.corpus, stripWhitespace)
##wordcloud(doc.corpus, max.words=200, colors=brewer.pal(8,"Dark2"))
##blogs <- readLines(con <- file("en_US.blogs.txt", encoding = "UTF-8"), skipNul = TRUE)
##close (con)
##cleanedBlogs <-sapply(blogs,function(x) iconv (enc2utf8(x),sub="byte"))
##BlogsSample<-sample(cleanedBlogs,10000)
##doc.vec<-VectorSource(BlogsSample)
##doc.corpus<-Corpus(doc.vec)
##doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
##doc.corpus <- tm_map(doc.corpus, removePunctuation)
##doc.corpus <- tm_map(doc.corpus, removeNumbers)
##doc.corpus <- tm_map(doc.corpus, stripWhitespace)
##wordcloud(doc.corpus, max.words=200, colors=brewer.pal(8,"Dark2"))
##news <- readLines(con <- file("en_US.news.txt", encoding = "UTF-8"), skipNul = TRUE)
##close (con)
##cleanedNews <- sapply(news, function(x) iconv(enc2utf8(x), sub = "byte"))
##NewsSample<-sample(cleanedNews,10000)
##doc.vec<-VectorSource(NewsSample)
##doc.corpus<-Corpus(doc.vec)
##doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
##doc.corpus <- tm_map(doc.corpus, removePunctuation)
##doc.corpus <- tm_map(doc.corpus, removeNumbers)
##doc.corpus <- tm_map(doc.corpus, stripWhitespace)
##wordcloud(doc.corpus, max.words=200, colors=brewer.pal(8,"Dark2"))
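Steps 3 and 4 described earlier (stemming and stop-word removal) are not applied in the code above. A minimal sketch using tm, assuming the SnowballC package is installed for stemming:
##doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))  # drop common English stop words
##doc.corpus <- tm_map(doc.corpus, stemDocument)                       # reduce words to their stems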
This exploratory data report examined three corpora of US English text (blogs, twitter, news). All three files are approximately 200 MB in size. The blogs and news files contain a similar number of items (on the order of a million lines each), while the twitter file contains considerably more lines; this larger item count is likely due to the 140-character limit on tweets. The difference is far smaller for word counts, as all three files contain word counts of the same order of magnitude. Finally, the word-frequency distribution of the twitter text differs from those of the blogs and news.
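One way to inspect those word-frequency distributions is a term-document matrix built from one of the cleaned samples above (doc.corpus); a minimal sketch using tm, kept to the 10,000-line samples since the dense matrix can be memory-hungry:
##tdm <- TermDocumentMatrix(doc.corpus)                     # term counts per sampled document
##freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # overall frequency of each term
##head(freq, 20)                                            # the 20 most frequent terms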