Introduction

This is the milestone report for the Coursera Data Science capstone project. I first load the Twitter, news and blog data, clean it, and explore several aspects of the three datasets. I then take a small sample of the data and look at combinations of words, in preparation for the prediction algorithm.

## Loading required package: NLP

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

Reading in data

The data, obtained from the website, are read into R line by line with readLines. The warning below was raised while reading the news file.

## Warning in readLines(con = "./final/en_US/en_US.news.txt", encoding =
## "UTF-8", : incomplete final line found on './final/en_US/en_US.news.txt'
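A minimal sketch of how such a file might be read. The helper name read_lines_utf8 is hypothetical, and the Twitter and blogs paths in the comments are assumed to follow the pattern of the news path in the warning above. Opening the connection in binary mode and setting skipNul = TRUE is a common precaution, because embedded control characters can otherwise cut a text-mode read short on some platforms; warn = FALSE only silences the incomplete-final-line warning.

```r
# Hypothetical helper: read a text file as UTF-8 lines through a binary connection
read_lines_utf8 <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Demonstration on a temporary file whose last line has no trailing newline
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond line"), tmp)
demo_lines <- read_lines_utf8(tmp)
print(demo_lines)
unlink(tmp)

# Assumed usage with the course files:
# news    <- read_lines_utf8("./final/en_US/en_US.news.txt")
# twitter <- read_lines_utf8("./final/en_US/en_US.twitter.txt")  # assumed path
# blogs   <- read_lines_utf8("./final/en_US/en_US.blogs.txt")    # assumed path
```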

Summary

Below you can find three statistics for each dataset: the number of lines, the total number of characters, and the number of space-separated words in the de-duplicated lines.

Below are the statistics for the Twitter dataset, in the following order: number of lines, number of characters, and number of words.

length(twitter)

## [1] 2360148

sum(nchar(twitter))

## [1] 162096241

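# Drop duplicate tweets, then count the space-separated words in each remaining tweet
twit <- unique(twitter)
twit1 <- as.data.table(sapply(strsplit(twit, " "), length))  # as.data.table() is from data.table

sum(twit1$V1)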

## [1] 30094580

Below are the statistics for the news dataset, in the following order: number of lines, number of characters, and number of words.

length(news)

## [1] 77259

sum(nchar(news))

## [1] 15639408

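# Drop duplicate news lines, then count the space-separated words in each remaining line
new <- unique(news)
news1 <- as.data.table(sapply(str_split(new, " "), length))  # str_split() is from stringr
sum(news1$V1)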

## [1] 2643969

Below are the statistics for the blogs dataset, in the following order: number of lines, number of characters, and number of words.

## [1] 899288

## [1] 206824505

## [1] 37334131
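The three statistics above can be wrapped in one function, sketched here in base R. The name corpus_stats and the toy input are hypothetical. The words entry mirrors the computation shown above (space-separated tokens in the de-duplicated lines); distinct_words is an addition for comparison and counts each lowercased word only once, which is a different and much smaller quantity.

```r
# Hypothetical helper: summary statistics for a character vector of lines
corpus_stats <- function(x) {
  tokens_per_line <- strsplit(unique(x), " ")      # tokens of de-duplicated lines
  all_tokens <- unlist(tokens_per_line)
  all_tokens <- all_tokens[nzchar(all_tokens)]     # drop empty tokens from double spaces
  c(lines          = length(x),
    characters     = sum(nchar(x)),
    words          = sum(lengths(tokens_per_line)),
    distinct_words = length(unique(tolower(all_tokens))))
}

# Toy input: three lines, one of them a duplicate
example_lines <- c("the cat sat", "the dog sat", "the cat sat")
stats <- corpus_stats(example_lines)
print(stats)
```

With the real data this might be called as corpus_stats(twitter), corpus_stats(news) and corpus_stats(blogs).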

Sampling

Below I take a random sample of the Twitter dataset to show which words are the most frequent, and visualize the result with a word cloud.
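A rough base-R sketch of the sampling and frequency-counting step. The seed, the sample size in the comment, and the helper names sample_lines and word_frequencies are assumptions made for illustration; the report does not state which values were used. The resulting table holds the words and counts that a word-cloud function would take as input.

```r
set.seed(1234)  # hypothetical seed, only to make the sample reproducible

# Hypothetical helper: draw up to n lines at random, without replacement
sample_lines <- function(x, n) {
  x[sample.int(length(x), size = min(n, length(x)))]
}

# Hypothetical helper: lowercase, split on anything that is not a letter
# or apostrophe, and count how often each word occurs
word_frequencies <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  sort(table(tokens), decreasing = TRUE)
}

toy_tweets <- c("Good morning", "good night", "Good luck")
freq <- word_frequencies(sample_lines(toy_tweets, 3))
print(freq)  # "good" appears 3 times, every other word once

# Assumed usage with the real data (sample size is an assumption):
# twit2 <- sample_lines(twitter, 10000)
```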

Make corpus and clean corpus

Below I will clean the dataset in order to avoid predicting irrelevant word combinations.

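# Build a corpus from the sampled tweets (twit2)
corp <- VCorpus(VectorSource(twit2), readerControl = list(language = "en"))

# Keep only letters, digits and spaces. Wrapping the function in
# content_transformer() keeps the documents intact, so no conversion back
# with PlainTextDocument is needed; this step also removes all punctuation.
keep_alnum <- content_transformer(function(x) gsub("[^a-zA-Z0-9 ]", "", x))
corp <- tm_map(corp, keep_alnum)

corp <- tm_map(corp, content_transformer(tolower))  # lowercase before stop-word removal
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("en"))
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, stripWhitespace)  # last, so gaps left by removed words collapse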

Creating ngrams

Below I will summarize bigrams and trigrams. Any prediction method used will be conditional on the words already typed, so unigrams are set aside at this stage.
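The report does not show how the n-grams were built, so the following is only a base-R sketch of the idea, not the method actually used. The helper names make_ngrams and ngram_counts and the toy sentences are hypothetical. N-grams are formed within each line, so no n-gram spans two tweets.

```r
# Hypothetical helper: all n-grams of one line, as space-joined strings
make_ngrams <- function(line, n) {
  tokens <- strsplit(line, " +")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))  # line too short for an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: frequency table of n-grams over all lines
ngram_counts <- function(lines, n) {
  grams <- as.character(unlist(lapply(lines, make_ngrams, n = n)))
  sort(table(grams), decreasing = TRUE)
}

toy_lines <- c("thanks for the follow", "thanks for the love", "for the win")
bigrams  <- ngram_counts(toy_lines, 2)
trigrams <- ngram_counts(toy_lines, 3)
print(head(bigrams, 3))
print(head(trigrams, 3))
```

On the toy input the most frequent bigram is "for the" (3 occurrences) and the most frequent trigram is "thanks for the" (2 occurrences).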

Next steps

The next steps in the project will be to further explore the probabilities of the different word combinations, which is important for the actual prediction of the words. Furthermore, I will create a training and a test dataset, and fit the prediction algorithm(s) on the training dataset. One important consideration will be improving the speed of the prediction, as the prediction should be interactive. Finally, the prediction algorithm will be built into a Shiny app with a simple interface.
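As an illustration of what "probabilities of word combinations" could mean in practice, the sketch below estimates the probability of the next word given the previous one as count(previous, next) divided by the total count of bigrams starting with the previous word. This simple relative-frequency estimate is an assumption chosen for illustration; the prediction algorithm for the project has not been decided yet. The function name next_word_probs and the toy counts are hypothetical.

```r
# Hypothetical helper: relative-frequency estimate of P(next | previous)
# bigram_counts: named numeric vector, names are "word1 word2"
next_word_probs <- function(bigram_counts, previous) {
  parts  <- strsplit(names(bigram_counts), " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  keep <- first == previous
  if (!any(keep)) return(numeric(0))  # unseen word: nothing to predict from
  counts <- as.numeric(bigram_counts[keep])
  probs <- counts / sum(counts)
  names(probs) <- second[keep]
  sort(probs, decreasing = TRUE)
}

toy_counts <- c("for the" = 3, "thanks for" = 2, "for a" = 1)
probs <- next_word_probs(toy_counts, "for")
print(probs)  # "the" gets 0.75, "a" gets 0.25
```

Returning an empty vector for an unseen word makes the gap explicit; a real predictor would need some fallback for that case.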

Capstone milestone

Peter Makai