This is the milestone report for the Coursera Data Science capstone project. I will first load the data, which contain Twitter, news and blog entries, clean them, and explore several aspects of the three datasets. Subsequently, I will take a small sample of the data and look at combinations of words, in preparation for the prediction algorithm.
The data files are obtained from the website and read in with `readLines`.
## Warning in readLines(con = "./final/en_US/en_US.news.txt", encoding =
## "UTF-8", : incomplete final line found on './final/en_US/en_US.news.txt'
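The warning above says that the last line read had no trailing newline. With text-mode connections on some platforms, an embedded control character can also end the read early, so it may be worth checking that the whole file was read. A minimal sketch, assuming a binary-mode connection is acceptable here; `read_lines_binary` is a hypothetical helper, not part of the original analysis, and the temporary file only stands in for the course data:

```r
# Hypothetical helper: read a text file through a binary-mode connection,
# so that control characters inside the file are not treated as end-of-file.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained check with a temporary file (not the course data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

Comparing `length()` of the result with a line count from another tool is one way to confirm that nothing was dropped.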
Below you can find the number of lines, the number of characters, and the number of words for each of the three datasets. Note that the word count is the total number of space-separated words in the unique lines, not the number of distinct words.
The statistics for the Twitter dataset, in that order:
length(twitter)
## [1] 2360148
sum(nchar(twitter))
## [1] 162096241
twit<-unique(twitter)
twit1<-as.data.table(sapply(strsplit(twit, " "), length))
sum(twit1$V1)
## [1] 30094580
The statistics for the news dataset, in the same order: number of lines, number of characters, and number of words in the unique lines.
length(news)
## [1] 77259
sum(nchar(news))
## [1] 15639408
new<-unique(news)
news1<-as.data.table(sapply(str_split(new, " "), length))
sum(news1$V1)
## [1] 2643969
The statistics for the blogs dataset, in the same order: number of lines, number of characters, and number of words in the unique lines.
## [1] 899288
## [1] 206824505
## [1] 37334131
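The word counts above add up every space-separated token in the unique lines. If the number of distinct words is wanted instead, one possible approach is sketched below; the toy vector `toy_lines` is an assumption standing in for any of the three datasets:

```r
# Toy input standing in for one of the datasets (assumption, not course data).
toy_lines <- c("the cat sat", "the dog sat", "the cat sat")

# Split on runs of whitespace and drop empty tokens.
tokens <- unlist(strsplit(tolower(toy_lines), "\\s+"))
tokens <- tokens[nzchar(tokens)]

total_words  <- length(tokens)          # every token, repeats included
unique_words <- length(unique(tokens))  # distinct words only
```

On the full datasets the two figures can differ by orders of magnitude, so it helps to state which one is reported.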
Below I sample the Twitter dataset to show which words are most frequent, and visualize the result as a word cloud.
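The sampling and counting step could look roughly like the sketch below. The toy vector `tweets`, the seed and the sample size are assumptions for illustration; the plot is only drawn if the `wordcloud` package happens to be installed:

```r
set.seed(1234)  # arbitrary seed, used only for reproducibility

# Toy stand-in for the `twitter` character vector (assumption).
tweets <- c("good morning world", "good night", "hello world", "good luck today")

# Draw a random sample of lines; the sample size here is illustrative.
tweet_sample <- sample(tweets, size = 3)

# Count how often each word occurs in the sample, most frequent first.
words <- unlist(strsplit(tolower(tweet_sample), "\\s+"))
word_freq <- sort(table(words), decreasing = TRUE)

# Draw the word cloud only when the wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(word_freq), as.integer(word_freq), min.freq = 1)
}
```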
Below I will clean the dataset in order to avoid predicting irrelevant word combinations.
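corp <- VCorpus(VectorSource(twit2), readerControl = list(language = "en"))
# Keep only letters, digits and spaces. Wrapping the function in
# content_transformer() keeps the documents valid tm objects, so no
# PlainTextDocument repair step is needed afterwards.
removeSpecial <- content_transformer(function(x) gsub("[^a-zA-Z0-9 ]", "", x))
corp <- tm_map(corp, removeSpecial)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("en"))
corp <- tm_map(corp, stemDocument)
# Collapse whitespace last, because the removal steps leave gaps behind.
corp <- tm_map(corp, stripWhitespace)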
Below I summarize bigrams and trigrams. Any prediction method will be conditional on the words already typed, so unigrams can be ignored at this stage.
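As an illustration of what such a summary involves, the sketch below counts bigrams and trigrams in base R. It is not the code used for this report; `make_ngrams` is a hypothetical helper and the two sentences are toy data:

```r
# Hypothetical helper: return all n-grams of one tokenised sentence.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy cleaned sentences (assumption, not course data).
sentences  <- c("i love new york", "i love new shoes")
token_list <- strsplit(sentences, " ", fixed = TRUE)

# Build n-grams within each sentence so they never span two sentences.
bigrams  <- unlist(lapply(token_list, make_ngrams, n = 2))
trigrams <- unlist(lapply(token_list, make_ngrams, n = 3))

# Frequency tables, most frequent n-gram first.
bigram_freq  <- sort(table(bigrams), decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
```

For the full sample, a dedicated tokenizer would likely be faster, but the counting logic is the same.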
The next steps in the project will be to further explore the probabilities of the different word combinations, which is important for the actual prediction of words. Furthermore, I will create a training and a test dataset, and develop the prediction algorithm(s) on the training dataset. One important consideration will be the speed of the prediction, as the prediction should be interactive. Finally, the prediction algorithm will be built into a Shiny app with a simple interface.
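To make the idea of word-combination probabilities concrete, here is a minimal sketch of a maximum-likelihood bigram model. It is one possible starting point rather than the final algorithm; the counts are made-up toy values and `predict_next` is a hypothetical helper:

```r
# Toy bigram counts (assumption, not results from the course data).
bigram_counts <- data.frame(
  first  = c("i", "i", "love", "love", "new"),
  second = c("love", "like", "new", "you", "york"),
  n      = c(3, 1, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Maximum-likelihood estimate:
# P(second | first) = count(first, second) / count(first, any word).
bigram_counts$prob <- bigram_counts$n /
  ave(bigram_counts$n, bigram_counts$first, FUN = sum)

# Hypothetical helper: most probable next word after `word`, or NA if unseen.
predict_next <- function(word, counts) {
  candidates <- counts[counts$first == word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$prob)]
}

predict_next("i", bigram_counts)
```

A lookup like this returns `NA` for unseen words, so a practical model would need some form of back-off or smoothing.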