The data science capstone project is in the area of natural language processing. The data can be obtained from the [Coursera site](https://class.coursera.org/dsscapstone-006/wiki/Task_0). This report contains an exploratory data analysis of the given dataset. The dataset covers four languages (English, Russian, German and Finnish); only the English (en_US) files are analysed here.

if(!file.exists("data/Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "/Users/skumaran/Documents/coursera/Coursera-SwiftKey.zip")
}
if(!file.exists("/Users/skumaran/Documents/coursera/final/en_US/en_US.blogs.txt")){
  unzip("/Users/skumaran/Documents/coursera/Coursera-SwiftKey.zip")
}
## [1] 200
## [1] 196
## [1] 159

The blogs file is about 200 MB, the news file about 196 MB and the Twitter file about 159 MB.
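The sizes shown in the output above were presumably obtained with something like the following (a sketch, assuming the same local paths used elsewhere in this report):

# file sizes in megabytes, rounded to the nearest MB
round(file.size("/Users/skumaran/Documents/coursera/final/en_US/en_US.blogs.txt") / 1024^2)
round(file.size("/Users/skumaran/Documents/coursera/final/en_US/en_US.news.txt") / 1024^2)
round(file.size("/Users/skumaran/Documents/coursera/final/en_US/en_US.twitter.txt") / 1024^2)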

Reading Data

Each file is opened in binary mode ("rb") so that readLines() reads the whole file without stopping at embedded control characters, and the text is read as UTF-8.

conn <- file("/Users/skumaran/Documents/coursera/final/en_US/en_US.blogs.txt",open="rb")
blogs <- readLines(conn,encoding="UTF-8")
close(conn)
conn <- file("/Users/skumaran/Documents/coursera/final/en_US/en_US.news.txt",open="rb")
news <- readLines(conn,encoding="UTF-8")
close(conn)
conn <- file("/Users/skumaran/Documents/coursera/final/en_US/en_US.twitter.txt",open="rb")
twitter <- readLines(conn,encoding="UTF-8")
close(conn)
rm(conn)

Line counts and maximum line lengths for the blogs, news and Twitter data are obtained as below.

blogs_lineCount <- length(blogs)
news_lineCount <- length(news)
twitter_lineCount <- length(twitter)

The blogs, news and Twitter data contain 899,288, 1,010,242 and 2,360,148 lines respectively.

blogs_maxChar <- max(nchar(blogs)) 
news_maxChar <- max(nchar(news))
twitter_maxChar <- max(nchar(twitter))

The longest lines in the blogs, news and Twitter data contain 40,833, 11,384 and 140 characters respectively.

Sampling

A sample dataset is created by randomly selecting 10,000 lines from each of the blogs, news and Twitter datasets.

set.seed(1)
samTwitter <- twitter[sample(1:length(twitter),10000)]
samNews <- news[sample(1:length(news),10000)]
samBlogs <- blogs[sample(1:length(blogs),10000)]

combinedSample <- c(samTwitter,samNews,samBlogs)
if(!dir.exists("/Users/skumaran/Documents/coursera/sample")) dir.create("/Users/skumaran/Documents/coursera/sample")
writeLines(combinedSample, "/Users/skumaran/Documents/coursera/sample/CombinedSample.txt")

Tokenization

The sample text file is loaded into an R-readable corpus. The sample data is then processed in the following ways:

  1. All text is converted to lower case.
  2. Symbols such as "/", "@" and "|" are replaced with blank space.
  3. Punctuation and numbers are removed.
  4. Extra whitespace is stripped.
  5. Common English stopwords (articles and other function words such as "a", "an", "the", "i") are removed, and the remaining words are stemmed.

library(tm) # text mining package in R
library(SnowballC) # stemming
cname <- file.path("/Users/skumaran/Documents/coursera/sample")
docs <- Corpus(DirSource(cname)) # convert the sample file into an R-readable corpus

# data processing
docs <- tm_map(docs, content_transformer(tolower))        # lower case
toSpace <- content_transformer(function(x,pattern) gsub(pattern," ",x))
docs <- tm_map(docs, toSpace, "/|@|\\|")                   # replace "/", "@" and "|" with spaces
docs <- tm_map(docs, removePunctuation)                    # remove punctuation
docs <- tm_map(docs, removeNumbers)                        # remove numbers
docs <- tm_map(docs, stripWhitespace)                      # collapse extra whitespace
docs <- tm_map(docs, removeWords, stopwords("english"))    # remove English stopwords
docs <- tm_map(docs, stemDocument)                         # stem the remaining words

Document-term matrices are created from the sample corpus as below. The frequencies of unigrams, bigrams and trigrams are then obtained and tabulated.

library(tm)
library(SnowballC)
library(RWeka)
UniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1,max=1))
unigram_dtm <- DocumentTermMatrix(docs,control = list(tokenize = UniTokenizer))

BiGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2,max=2))
bigram_dtm <- DocumentTermMatrix(docs,control = list(tokenize = BiGramTokenizer))

TriGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3,max=3))
trigram_dtm <- DocumentTermMatrix(docs,control = list(tokenize = TriGramTokenizer))

The term frequencies are computed and the 20 most common unigrams, bigrams and trigrams are extracted for plotting as bar plots and word clouds.

tm_unigram_freq <- sort(colSums(as.matrix(unigram_dtm)),decreasing = TRUE)
tm_bigram_freq <- sort(colSums(as.matrix(bigram_dtm)),decreasing = TRUE)
tm_trigram_freq <- sort(colSums(as.matrix(trigram_dtm)),decreasing = TRUE)

unigrams <- data.frame(word = names(tm_unigram_freq),frequency = tm_unigram_freq)
unigrams_plot <- unigrams[1:20,]
bigrams <- data.frame(word = names(tm_bigram_freq),frequency = tm_bigram_freq)
bigrams_plot <- bigrams[1:20,]
trigrams <- data.frame(word = names(tm_trigram_freq),frequency = tm_trigram_freq)
trigrams_plot <- trigrams[1:20,]

Data Visualisation

The bar plots and word clouds below show the most frequent unigrams, bigrams and trigrams.

library(ggplot2)
ggplot(unigrams_plot, aes(x = reorder(word, frequency), y = frequency, fill = frequency)) +
  guides(fill = guide_legend(title = "Frequency")) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(y = "Frequency", x = "Unigrams", title = "Most common unigrams in the sample text")

ggplot(bigrams_plot, aes(x = reorder(word, frequency), y = frequency, fill = frequency)) +
  guides(fill = guide_legend(title = "Frequency")) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(y = "Frequency", x = "Bigrams", title = "Most common bigrams in the sample text")

ggplot(trigrams_plot, aes(x = reorder(word, frequency), y = frequency, fill = frequency)) +
  guides(fill = guide_legend(title = "Frequency")) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(y = "Frequency", x = "Trigrams", title = "Most common trigrams in the sample text")

library(wordcloud)
set.seed(1)
wordcloud(names(tm_unigram_freq),tm_unigram_freq,max.words=50,scale=c(5,.1))

wordcloud(names(tm_bigram_freq),tm_bigram_freq,max.words=50,scale=c(5,.1))

wordcloud(names(tm_trigram_freq),tm_trigram_freq,max.words=50,scale=c(5,.1))

Improvements

The strategy for modelling and prediction has not been finalised. A possible method of predicting the next word will start from trigrams; if no matching trigram is found, bigrams will be used next (a simple back-off approach).
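As a rough illustration only, such a back-off lookup over the frequency tables built above might look like the sketch below; predict_next is a hypothetical helper, not the final model.

# minimal back-off sketch: look for the last words of the input in the trigram table first,
# then fall back to the bigram table (the table names are space-separated n-grams)
predict_next <- function(input, bigram_freq = tm_bigram_freq, trigram_freq = tm_trigram_freq) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  last_two <- paste(tail(words, 2), collapse = " ")
  hits <- trigram_freq[grep(paste0("^", last_two, " "), names(trigram_freq))]
  if (length(hits) == 0) {
    last_one <- tail(words, 1)
    hits <- bigram_freq[grep(paste0("^", last_one, " "), names(bigram_freq))]
  }
  if (length(hits) == 0) return(NA_character_)
  # return the final word of the most frequent matching n-gram
  top <- names(hits)[which.max(hits)]
  tail(unlist(strsplit(top, " ")), 1)
}

predict_next("new york")  # returns the most frequent word following "new york" in the sample, if any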

The Shiny app will consist of a simple user interface where the user enters a few words; the application will then suggest a list of choices for the next word.
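A minimal sketch of such an interface is shown below; the ui/server layout and the predict_next helper from the sketch above are placeholders, not the final design.

library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction (sketch)"),
  textInput("phrase", "Enter a few words:"),
  verbatimTextOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    if (nchar(input$phrase) == 0) return("")
    predict_next(input$phrase)  # hypothetical helper from the back-off sketch above
  })
}

shinyApp(ui = ui, server = server)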

References

  1. Coursera Stanford Natural Language Processing Lectures
  2. [Building a Word Cloud using Text Mining in R](http://www.analyticsvidhya.com/blog/2014/05/build-word-cloud-text-mining-tools/?utm_source=Linkedinstatus&utm_medium=Social&utm_campaign=070514)