This report is part of the coursework for the Data Science Specialization offered by Johns Hopkins University on Coursera.
The goal of this project is to create a Shiny app that predicts the next word after the user inputs three words.
This document gives an overview of the features of the data used for this project and briefly summarizes plans for creating the prediction algorithm.
Download the data from the Coursera site: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The data come from a corpus called HC Corpora (http://www.corpora.heliohost.org).
See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available.
The data is available in four locales (en_US, de_DE, ru_RU and fi_FI); for this exercise only en_US is used.
Further, each language folder contains three text files extracted from news, blogs and Twitter.
The code to download the data in R:
## download and unzip the dataset only if it is not already present
if (!file.exists("Coursera-SwiftKey.zip")){
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
Loading the documents in R:
con<-file('en_US/en_US.blogs.txt','r')
blogs<-readLines(con)
close(con)
## the news file is opened in binary mode ('rb') so that an embedded
## control character does not cut the read short
con<-file('en_US/en_US.news.txt','rb')
news<-readLines(con)
close(con)
con<-file('en_US/en_US.twitter.txt','r')
twitter<-readLines(con)
close(con)
Getting more information about the loaded documents: file size, number of lines, number of characters and word count.
library(NLP)
library(tm)
library(RWeka)
library(stringi)
library(SnowballC)
library(plyr)
library(wordcloud)    ## for wordcloud()
library(RColorBrewer) ## for brewer.pal()
sizeBlogs<-file.info('en_US/en_US.blogs.txt')$size/1024^2 ## size in MB
linesBlogs<-length(blogs) ## number of lines
charsBlogs<-sum(nchar(blogs)) ## number of characters
wordsBlogs<-sum(stri_count_words(blogs)) ## number of words
sizeNews<-file.info('en_US/en_US.news.txt')$size/1024^2 ## size in MB
linesNews<-length(news) ## number of lines
charsNews<-sum(nchar(news)) ## number of characters
wordsNews<-sum(stri_count_words(news)) ## number of words
sizeTwitter<-file.info('en_US/en_US.twitter.txt')$size/1024^2 ## size in MB
linesTwitter<-length(twitter) ## number of lines
charsTwitter<-sum(nchar(twitter)) ## number of characters
wordsTwitter<-sum(stri_count_words(twitter)) ## number of words
infoTable<-data.frame(TextType=c("blogs",'news','twitter'),
Size_MB=c(sizeBlogs,sizeNews,sizeTwitter),
NoOfLines=c(linesBlogs,linesNews,linesTwitter),
NoOfChars=c(charsBlogs,charsNews,charsTwitter),
NoOfWords=c(wordsBlogs,wordsNews,wordsTwitter))
infoTable
## TextType Size_MB NoOfLines NoOfChars NoOfWords
## 1 blogs 200.4242 899288 208361438 38154238
## 2 news 196.2775 1010242 203791405 35010782
## 3 twitter 159.3641 2360148 162384825 30218125
Pre-processing the data to remove punctuation, digits and extra whitespace, and to transform all characters to lower case.
## remove punctuation
blogsProcessed<- gsub('[[:punct:]]+','',blogs)
newsProcessed<- gsub('[[:punct:]]+','',news)
twitterProcessed<- gsub('[[:punct:]]+','',twitter)
## remove digits
blogsProcessed<- gsub('[[:digit:]]+','',blogsProcessed)
newsProcessed<- gsub('[[:digit:]]+','',newsProcessed)
twitterProcessed<- gsub('[[:digit:]]+','',twitterProcessed)
## convert to lower case
blogsProcessed <- tolower(blogsProcessed)
newsProcessed <- tolower(newsProcessed)
twitterProcessed <- tolower(twitterProcessed)
## collapse repeated whitespace (stripWhitespace() from tm)
blogsProcessed<-stripWhitespace(blogsProcessed)
newsProcessed<-stripWhitespace(newsProcessed)
twitterProcessed<-stripWhitespace(twitterProcessed)
After all the files were processed, kfNgram was used to generate the n-grams.
More details about kfNgram are available from its author's website.
An alternative would have been to build the n-grams directly in R with the tm and RWeka packages; a sketch of that approach follows this paragraph.
Unigrams, bigrams and trigrams were generated with kfNgram for each of the processed files.
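As a rough illustration of the in-R alternative mentioned above (not the approach actually used in this report), the sketch below tokenizes a small random sample of the processed blogs text into bigrams with RWeka's NGramTokenizer and tabulates their frequencies. The sample size of 10,000 lines is an arbitrary choice to keep the example fast.
## Illustrative only: build bigram frequencies in R with RWeka instead of kfNgram
set.seed(1)
sampleText <- sample(blogsProcessed, 10000)
bigramTokens <- NGramTokenizer(paste(sampleText, collapse = " "),
                               Weka_control(min = 2, max = 2))
bigramFreq <- sort(table(bigramTokens), decreasing = TRUE)
head(bigramFreq, 10)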
The n-gram files from blogs, news and twitter were combined and aggregated into a single table, sorted in descending order of the frequency of occurrence of each n-gram.
“the” is the most frequent word in the text.
“in the” and “of the” are the most frequent bigrams in the text.
“one of the” is the most frequent trigram in the text.
This section excerpts the code for handling the n-grams and creating the word clouds.
## kfNgram frequency files: V1 holds the n-gram, V2 its count
blogsUnigram <- read.table("blogsProcessed.txt-01-ngrams-Freq.txt",sep="\t")
newsUnigram <- read.table("newsProcessed.txt-01-ngrams-Freq.txt",sep="\t")
twitterUnigram <- read.table("twitterProcessed.txt-01-ngrams-Freq.txt",sep="\t")
Unigram <- rbind(blogsUnigram, newsUnigram, twitterUnigram)
UnigramSum <- aggregate(V2~V1, data=Unigram, sum)
UnigramSort <- arrange(UnigramSum, desc(V2))
##WordCloud
pal <- brewer.pal(6,"Dark2")
wordcloud(UnigramSort$V1, UnigramSort$V2,c(8,.6),max.words=200,random.order=F,rot.per=0.15, colors=pal,use.r.layout=FALSE)
For bigrams:
newsBigram <- read.table("newsProcessed.txt-02-ngrams-Freq.txt", sep ="\t")
blogsBigram <- read.table("blogsProcessed.txt-02-ngrams-Freq.txt", sep ="\t")
twitterBigram <- read.table("twitterProcessed.txt-02-ngrams-Freq.txt", sep ="\t")
Bigram <- rbind(blogsBigram,newsBigram,twitterBigram)
BigramSum <- aggregate(V2~V1, data=Bigram, sum)
BigramSort <- arrange(BigramSum, desc(V2))
wordcloud(BigramSort$V1, BigramSort$V2,c(8,.3),max.words=100,random.order=F,rot.per=0.15, colors=pal,use.r.layout=FALSE)
For trigrams:
blogsTrigram <- read.table("blogsProcessed.txt-03-ngrams-Freq.txt", sep ="\t")
newsTrigram <- read.table("newsProcessed.txt-03-ngrams-Freq.txt", sep ="\t")
twitterTrigram <- read.table("twitterProcessed.txt-03-ngrams-Freq.txt", sep ="\t")
Trigram <- rbind(blogsTrigram,newsTrigram,twitterTrigram)
TrigramSum <- aggregate(V2~V1, data=Trigram, sum)
TrigramSort <- arrange(TrigramSum, desc(V2))
wordcloud(TrigramSort$V1, TrigramSort$V2,c(4,.2),max.words=100,random.order=F,rot.per=0.15, colors=pal,use.r.layout=FALSE)
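To hint at how these frequency tables feed into the planned prediction algorithm, the sketch below looks up the most frequent trigram that begins with a given two-word prefix and returns its last word. The function predictNextWord and its prefix handling are illustrative assumptions for this report, not the final app's implementation.
## Illustrative sketch: given two preceding words, return the most likely
## next word according to the aggregated, sorted trigram frequencies.
predictNextWord <- function(w1, w2, trigrams = TrigramSort) {
  prefix <- paste(w1, w2, "")              ## e.g. "one of "
  hits <- trigrams[startsWith(as.character(trigrams$V1), prefix), ]
  if (nrow(hits) == 0) return(NA_character_)
  ## the table is already sorted by frequency, so the first hit wins;
  ## keep only its third (last) word
  tail(strsplit(as.character(hits$V1[1]), " ")[[1]], 1)
}
predictNextWord("one", "of")   ## should return "the" given the counts above
A fuller version would back off to the bigram and unigram tables when no trigram matches the input; that refinement is left for the prediction algorithm itself.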