Summary

This report presents a basic exploratory analysis prepared for the Milestone Report submission in the Data Science Capstone course offered by Johns Hopkins University on Coursera. The course objective is to apply data science to the area of natural language processing. The final deliverable of the course is a Shiny application that accepts text entered by the user and tries to predict the next word. As a starting point, the course provides a set of files containing text extracted from blogs, news/media sites and Twitter, to be used as input for building a prediction algorithm. In the following sections we analyze a subset of the data provided.

DATA

The data provided on the course site comprises four sets of files (de_DE - German, en_US - English, fi_FI - Finnish and ru_RU - Russian), with each set containing three text files with texts from blogs, news/media sites and Twitter. In this analysis we focus on the English set of files:

* en_US.blogs.txt
* en_US.news.txt
* en_US.twitter.txt

setwd("C:/CapstoneProject/final/en_US")
# Read files into variables
newsData <- readLines(file("en_US.news.txt"))
blogData <- readLines(file("en_US.blogs.txt"))
twitterData <- readLines(file("en_US.twitter.txt"))

print(paste("News Data Length = ", length(newsData),
            ", News Blog Length = ", length(blogData),
            ", News twitter Length = ", length(twitterData)
            ))
## [1] "News Data Length =  77259 , News Blog Length =  899288 , News twitter Length =  2360148"

Interesting findings

Build a merged corpus

Data from the three files is merged and a text corpus is built using the tm library. For performance reasons, only the first 5,000 lines of each file are used for this milestone report.

library(tm)
## Loading required package: NLP
# Take the first 5000 lines from each source and combine them into one
# vector; c() keeps each line as its own document, so that n-grams never
# span unrelated lines from different sources (paste() would glue them).
merged <- c(newsData[1:5000], blogData[1:5000], twitterData[1:5000])
corpus <- VCorpus(VectorSource(merged))
corpus <- tm_map(corpus, stripWhitespace)            # collapse repeated whitespace
corpus <- tm_map(corpus, removeNumbers)              # drop digits
corpus <- tm_map(corpus, removePunctuation)          # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords())   # drop English stop words
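To sanity-check the cleaning steps, it helps to inspect one transformed document; a minimal sketch using tm's as.character accessor (document 1 is an arbitrary choice):

# Show the first document after all transformations have been applied
as.character(corpus[[1]])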

Find n-grams on the data

The following code defines a function that builds an n-gram frequency data frame using the RWeka library. It is applied to build 2-gram, 3-gram and 4-gram models. Only the top 100 word sequences of each model are kept.

library(RWeka)

corpusDf <- data.frame(text = unlist(sapply(corpus, `[`, "content")),
                       stringsAsFactors = FALSE)

findNGrams <- function(corp, grams) {
  ngram <- NGramTokenizer(corp, Weka_control(min = grams, max = grams,
                      delimiters = " \\r\\n\\t.,;:\"()?!"))
  ngram2 <- data.frame(table(ngram))
  # keep only the 100 most frequent n-grams
  ngram3 <- ngram2[order(ngram2$Freq,decreasing = TRUE),][1:100,]
  colnames(ngram3) <- c("String","Count")
  ngram3
}

TwoGrams <- findNGrams(corpusDf, 2)
ThreeGrams <- findNGrams(corpusDf, 3)
FourGrams <- findNGrams(corpusDf, 4)
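Before plotting, a quick look at the head of one table confirms that the tokenizer behaves as expected (a usage sketch; the rows shown depend on the sampled data):

# Five most frequent 2-grams and their counts
head(TwoGrams, 5)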

Exploratory Data Analysis

Plotting word clouds

library(wordcloud)
## Loading required package: RColorBrewer

par(mfrow = c(1, 3))
palette <- brewer.pal(8,"Dark2")

wordcloud(TwoGrams[, 1], TwoGrams[, 2], min.freq = 1,
          random.order = FALSE, ordered.colors = FALSE, colors = palette)
text(x = 0.5, y = 0, "2-gram cloud")

wordcloud(ThreeGrams[, 1], ThreeGrams[, 2], min.freq = 1,
          random.order = FALSE, ordered.colors = FALSE, colors = palette)
text(x = 0.5, y = 0, "3-gram cloud")

wordcloud(FourGrams[, 1], FourGrams[, 2], min.freq = 1,
          random.order = FALSE, ordered.colors = FALSE, colors = palette)
text(x = 0.5, y = 0, "4-gram cloud")

Bar plots of the top n-grams

par(mfrow = c(1, 1))

barplot(TwoGrams[1:20,2], cex.names=0.5, names.arg=TwoGrams[1:20,1], col="red", main="2-Grams", las=2)

barplot(ThreeGrams[1:20,2], cex.names=0.5, names.arg=ThreeGrams[1:20,1], col="green", main="3-Grams", las=2)

barplot(FourGrams[1:20,2], cex.names=0.5, names.arg=FourGrams[1:20,1], col="blue", main="4-Grams", las=2)

Plan for Creating Prediction Algorithm and Application

To train the model:

* Increase the number of documents (i.e., increase the sample size) in order to obtain more 3-, 4- and 5-grams; a sampling sketch follows below.
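A minimal sketch of how the sample could be grown while keeping memory in check: randomly sample a fixed fraction of lines from each source instead of taking the first 5000. The sampleLines helper, the 10% fraction and the seed are illustrative assumptions, and corpusDfLarge stands in for a corpus data frame rebuilt from the larger sample:

set.seed(1234)  # make the sample reproducible

# Randomly sample a fraction of lines from each source; random sampling
# avoids the bias of always reading the beginning of each file.
sampleLines <- function(lines, fraction = 0.10) {
  lines[sample(length(lines), size = floor(length(lines) * fraction))]
}

mergedLarge <- c(sampleLines(newsData),
                 sampleLines(blogData),
                 sampleLines(twitterData))

# The larger sample feeds the same corpus-cleaning and n-gram pipeline
# as above, now extended to 5-grams:
# FiveGrams <- findNGrams(corpusDfLarge, 5)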