Initialisation

The goal of this Capstone project is to create an application that can predict the next word a user is most likely to type, based on the previous one, two or three words. To complete this project, we shall take advantage of the Coursera-SwiftKey data set. The following activities will be carried out:

  1. Collection and cleaning of data from the source.
  2. Run some exploratory analysis on word frequency, and explore variations across documents for different sample sizes.
  3. Create a dictionary based on the most frequent words.
  4. Run some exploratory analysis on n-grams.
  5. Explore bi-gram and tri-gram frequencies.
  6. Devise a strategy and prototype for a word prediction model (a toy sketch follows this list).
  7. For reproducibility, the main functions developed for this report are displayed in the appendix.
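As a preview of item 6, here is a purely illustrative sketch of the frequency-lookup idea behind such a model; predict_next() and the sample tokens are hypothetical names invented for this example, not the final prototype.

# Toy bigram lookup: return the most frequent word seen immediately after `word`
predict_next <- function(word, tokens) {
  followers <- tokens[which(head(tokens, -1) == word) + 1]
  if (length(followers) == 0) return(NA_character_)
  names(sort(table(followers), decreasing = TRUE))[1]
}
tokens <- c("i", "love", "you", "i", "love", "it", "i", "am")
predict_next("i", tokens)   # returns "love"

The real model will instead be trained on the n-gram frequencies explored in the rest of this report.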

Data Collection

The data is publicly available as the Capstone Dataset and comes from a corpus called HC Corpora. It is organized into four directories, one per language:

English: en_US, German: de_DE, Finnish: fi_FI, Russian: ru_RU
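Downloading and unpacking can be scripted as well; a minimal sketch, assuming the Coursera-SwiftKey link published on the course page (verify the URL before use):

# Assumption: the download link published on the course page
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # extracts the final/<lang>/ directories
}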

# Packages used throughout this report
library(stringi); library(tm); library(textclean)
library(reshape2); library(ggplot2); library(wordcloud)

folder<-"/Users/RajeevAK/Downloads/project1/Capston Project/final/en_US/"
setwd(folder)
fileslist<-list.files(folder)
# Read each file and collapse its lines into one long string
blogtext<-readLines(fileslist[1])
blogtext<-paste(blogtext,collapse = " ")
newstext<-readLines(fileslist[2])
newstext<-paste(newstext,collapse = " ")
twittertext<-readLines(fileslist[3])
## Warning in readLines(fileslist[3]): line 167155 appears to contain an
## embedded nul
## Warning in readLines(fileslist[3]): line 268547 appears to contain an
## embedded nul
## Warning in readLines(fileslist[3]): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(fileslist[3]): line 1759032 appears to contain an
## embedded nul
twittertext<-paste(twittertext,collapse = " ")
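The warnings above come from embedded NUL characters in the Twitter file. If silently dropping those bytes is acceptable, base R's readLines() accepts a skipNul argument that suppresses them:

# Alternative read, assuming dropped NULs are acceptable for this analysis
twittertext <- readLines(fileslist[3], skipNul = TRUE)
twittertext <- paste(twittertext, collapse = " ")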
#Blog
stri_stats_general(blogtext)
##       Lines LinesNEmpty       Chars CharsNWhite 
##           1           1   207723792   170389662
#News
stri_stats_general(newstext)
##       Lines LinesNEmpty       Chars CharsNWhite 
##           1           1   204233400   169860871
#Twitter
stri_stats_general(twittertext)
##       Lines LinesNEmpty       Chars CharsNWhite 
##           1           1   164456178   134082634
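Each file contains on the order of 200 million characters, so the exploratory steps that follow (plan item 2) can also be run on random samples of the lines. A minimal sketch, assuming a 10% line-level sample; sample_lines() is a helper invented for this illustration:

# Keep each line independently with probability `rate`
sample_lines <- function(x, rate = 0.1) x[rbinom(length(x), 1, rate) == 1]
set.seed(1234)   # make the sample reproducible
blogsample <- sample_lines(readLines(fileslist[1]), rate = 0.1)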

Data Cleaning

The three texts are combined into a single vector and then cleaned: converted to lower case, stripped of English stopwords, numbers, punctuation and unwanted whitespace, and finally scrubbed of non-ASCII characters and symbols.

# Combine all 3 texts into one vector (one element per source)

corp<-c(blogtext,newstext,twittertext)
corp1<- tolower(corp)                              # lower-case everything
corp1<- removeWords(corp1,stopwords("english"))    # drop English stopwords
corp1<- removeNumbers(corp1)                       # drop digits
corp1<- stripWhitespace(corp1)                     # collapse repeated whitespace
corp1<- removePunctuation(corp1)                   # drop punctuation
corp1<-replace_non_ascii(corp1, remove.nonconverted = TRUE)  # drop non-ASCII characters
corp1<-replace_symbol(corp1)                       # replace symbols ($, %, @, ...) with words

Data Representation

Now the data will be converted into a corpus and a term-document matrix for further exploration.

corp1<-Corpus(VectorSource(corp1))                 # one document per source
termdoc<- TermDocumentMatrix(corp1, control=list())
colnames(termdoc)<-c("Blog","News","Twitter")
m<-as.matrix(termdoc)
cloudata<-as.data.frame(m)
cloudata$Total<-rowSums(cloudata,na.rm=TRUE)       # overall frequency per term
cloudata1<-cloudata[order(-cloudata$Total),]       # sort terms, most frequent first
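The term-document matrix above counts single words; the same frequency idea extends to the bi-grams and tri-grams of plan items 4 and 5. A minimal base-R sketch of bi-gram counting on a toy string (the string and variable names are invented for illustration):

txt     <- "the quick brown fox jumps over the lazy dog the quick fox"
tokens  <- unlist(strsplit(txt, "\\s+"))
bigrams <- paste(head(tokens, -1), tail(tokens, -1))   # adjacent word pairs
sort(table(bigrams), decreasing = TRUE)[1:3]           # "the quick" appears twice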

Identify top 20 words

List of top 20 words.

cloudata2<-cloudata1[(1:20),(1:3)]
# Top 20 words
print(cloudata2)
##          Blog   News Twitter
## will   112404 108162   93837
## said    36553 250362   17923
## just   100035  53151  148969
## one    125908  83465   81165
## like    98289  49400  120747
## can    107841  60324   89173
## get     70667  43513  111395
## time    88580  52009   74884
## new     54298  70337   69056
## now     59888  36093   82078
## good    48613  29863   99429
## day     50923  28999   90105
## know    60227  23507   79188
## love    44783   9591  105195
## people  60258  47700   51325
## back    50501  33391   57335
## see     49915  22037   66251
## first   50730  52661   31063
## make    50882  32443   47137
## also    55159  58740   16115
# Plot per-source counts of the top 20 words
cloudata2$Words<-rownames(cloudata2)   # add the words as a column for melting
data.m <- melt(cloudata2, id.vars='Words')
ggplot(data.m, aes(Words, value)) + geom_bar(aes(fill = variable), position = "dodge", stat="identity")

Comparative Word Cloud

Let's see what a comparative word cloud looks like.
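Since the chunk itself is not reproduced in this section, here is a minimal sketch of how one might be drawn with wordcloud::comparison.cloud(), which expects a term matrix with one column per document (the 200-term cut-off is an arbitrary choice for this illustration):

# Comparison cloud of the most frequent terms across the three sources
comparison.cloud(as.matrix(cloudata1[1:200, 1:3]),
                 max.words = 200, random.order = FALSE, title.size = 1.5)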