This report is for the Coursera Capstone Project offered by Johns Hopkins Bloomberg School of Public Health. It describes a basic exploratory analysis of the Capstone dataset.

The main objective of this course is to apply data science to natural language processing.

The final deliverable of this course is a Shiny application that accepts text input from the user and attempts to predict the next word.

The purpose of this report is to:

Demonstrate that the data has been downloaded and successfully loaded.
Create a basic report of summary statistics about the data sets.
Report any interesting findings observed so far.
Get feedback on the plans for creating a prediction algorithm and Shiny app.

Report contents and approach:

For the basic statistics, file sizes, general statistics, and word distributions are presented.
The corpus is built from the first 10,000 lines of each file.
For the basic plot requirement, bar plots of the most frequent unigrams, bigrams and trigrams are included.

The Data

The data specified for the project can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

Setting the environment

setwd("C:/project/capstone")
library(NLP) 
## Warning: package 'NLP' was built under R version 3.2.3
library(tm)
## Warning: package 'tm' was built under R version 3.2.3
library(stringi)
## Warning: package 'stringi' was built under R version 3.2.3
library(xtable)
## Warning: package 'xtable' was built under R version 3.2.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.3
library(RWekajars)
## Warning: package 'RWekajars' was built under R version 3.2.3
library(rJava)
## Warning: package 'rJava' was built under R version 3.2.3
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.3
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
library(knitr)

Loading Data

We use readLines to load only the English corpora, in UTF-8 encoding.

File sizes of the downloaded data are reported below.

fileURL <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, destfile = "Dataset.zip", method = "curl")
## Warning: running command 'curl "http://
## d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" -o
## "Dataset.zip"' had status 127
## Warning in download.file(fileURL, destfile = "Dataset.zip", method =
## "curl"): download had nonzero exit status
unzip("Dataset.zip")
## Warning in unzip("Dataset.zip"): error 1 in extracting from zip file

File Size (MB)

cat("en_US.news.txt: " , file.info("C:/project/capstone/en_US.news.txt")$size / (1024*1024) ,"mb")
## en_US.news.txt:  196.2775 mb
cat("en_US.blogs.txt: " , file.info("C:/project/capstone/en_US.blogs.txt")$size / (1024*1024) ,"mb")
## en_US.blogs.txt:  200.4242 mb
cat("en_US.twitter.txt: " ,file.info("C:/project/capstone/en_US.twitter.txt")$size / (1024*1024) ,"mb")
## en_US.twitter.txt:  159.3641 mb

Load the first 10,000 lines of each of the three documents

news <- readLines("C:/project/capstone/en_US.news.txt",n=10000,encoding="UTF-8")
blogs <- readLines("C:/project/capstone/en_US.blogs.txt",n=10000,encoding="UTF-8")
twitter <- readLines("C:/project/capstone/en_US.twitter.txt",n=10000,encoding="UTF-8")

Simple summary statistics

DataStats <- rbind(stri_stats_general(news), stri_stats_general(blogs), stri_stats_general(twitter))
DataStats <- as.data.frame(DataStats)
row.names(DataStats) <- c("news", "blogs", "twitter")
DataStats
##         Lines LinesNEmpty   Chars CharsNWhite
## news    10000       10000 2035687     1701758
## blogs   10000       10000 2277383     1876763
## twitter 10000       10000  681544      563870

We can see that the blogs sample is the largest, followed by news, then Twitter.

Creating the Corpus and cleaning the data

SampleData <- paste(news, blogs, twitter)   # paste() combines line i of news, blogs and twitter into one document
corpus <- VCorpus(VectorSource(SampleData)) # build a volatile tm corpus, one document per combined line

Transformations

Once we have a corpus we typically want to modify the documents in it, e.g., stemming, stopword removal, et cetera. In tm, all this functionality is subsumed into the concept of a transformation. Transformations are done via the tm_map() function, which applies (maps) a function to all elements of the corpus.
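
As an illustration of this mechanism (a sketch only, not one of the cleaning steps applied below), an arbitrary function can be wrapped in content_transformer() and applied with tm_map(); the toSpace helper here is a hypothetical example:

toSpace <- function(x, pattern) gsub(pattern, " ", x)  # replace every match of pattern with a space
# corpus <- tm_map(corpus, content_transformer(toSpace), "/|@|\\|")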

Eliminating Extra Whitespace

corpus <- tm_map(corpus, stripWhitespace)

Removal of Numbers

corpus <- tm_map(corpus, removeNumbers)

Removal of Punctuation

corpus <- tm_map(corpus, removePunctuation)

Removal of stopwords

corpus <- tm_map(corpus, removeWords, stopwords("english"))

Convert to Lower Case

corpus <- tm_map(corpus, content_transformer(tolower))

Inspecting Corpora

inspect(corpus[1:2])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 177
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 219
meta(corpus[[2]], "id")
## [1] "2"
writeLines(as.character(corpus[[2]]))
## the st louis plant   close it  die  old age workers   making cars  since  onset  mass automotive production   s we love  mr brown when  meet someone special youll know your heart will beat  rapidly  youll smile   reason
lapply(corpus[1:2], as.character)
## $`1`
## [1] "he wasnt home alone apparently in  years thereafter    oil fields  platforms  named  pagan <U+0093>gods<U+0094> how   btw thanks   rt you gonna   dc anytime soon love  see  been way way  long"
## 
## $`2`
## [1] "the st louis plant   close it  die  old age workers   making cars  since  onset  mass automotive production   s we love  mr brown when  meet someone special youll know your heart will beat  rapidly  youll smile   reason"

Creating Term-Document Matrices

dtm <- DocumentTermMatrix(corpus)

Operations on Term-Document Matrices

Analyzing word frequencies: finding terms that occur at least 1,000 times

findFreqTerms(dtm, lowfreq=1000)
##  [1] "also"   "and"    "back"   "but"    "can"    "day"    "first" 
##  [8] "get"    "going"  "good"   "just"   "know"   "last"   "like"  
## [15] "love"   "make"   "much"   "new"    "now"    "one"    "people"
## [22] "said"   "see"    "the"    "time"   "two"    "well"   "will"  
## [29] "year"

We want to find associations (i.e., correlated terms) with at least 0.15 correlation for the term “research”

findAssocs(dtm, "research", 0.15) # just looking for some associations 
## $research
##              allgrain                  baun            bedfellows 
##                  0.19                  0.19                  0.19 
##            changefind             charlie<U+0092>s          compensatory 
##                  0.19                  0.19                  0.19 
##         complementing          espnradiocom           evaluation<U+0094> 
##                  0.19                  0.19                  0.19 
##                  fmri            freemasons               geology 
##                  0.19                  0.19                  0.19 
##                   hci              headings heeeeellllllloooooooo 
##                  0.19                  0.19                  0.19 
##            jamburrito                 kappa oneofmyfavoritemovies 
##                  0.19                  0.19                  0.19 
##              overheat            overheated                pisano 
##                  0.19                  0.19                  0.19 
##              revision              sparging                 vents 
##                  0.19                  0.19                  0.19 
##              vermonts 
##                  0.19

Create Tokenizers

We find that the data contains many non-English characters. We have to identify such tokens and remove them because we do not want to predict them. The cleaning above relies on the “tm” package, which is widely used for text mining; the tokenization below uses RWeka’s NGramTokenizer.
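
One possible way to strip such characters (a sketch only; removeNonASCII is a hypothetical helper, not part of the pipeline in this report) is to convert the raw text to ASCII with iconv before building the corpus:

removeNonASCII <- function(x) {
  # sub = "" drops any character that cannot be represented in ASCII
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}
# e.g. news <- removeNonASCII(news)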

Convert the sample corpus to a data frame to feed into the RWeka tokenizer

corpus.df <- data.frame(text=unlist(sapply(corpus,'[',"content")),stringsAsFactors=F)
TokenizersDelimiters <- "\"\'\\t\\r\\n ().,;!?"

UniTokenizer <- NGramTokenizer(corpus.df, Weka_control(min = 1, max = 1))
BiTokenizer  <- NGramTokenizer(corpus.df, Weka_control(min = 2, max = 2, delimiters = TokenizersDelimiters))
TriTokenizer <- NGramTokenizer(corpus.df, Weka_control(min = 3, max = 3, delimiters = TokenizersDelimiters))

Converting the tokenized n-grams to frequency data frames

UniGram.df <- data.frame(table(UniTokenizer))
BiGram.df  <- data.frame(table(BiTokenizer))
TriGram.df <- data.frame(table(TriTokenizer))

Sorting by decreasing frequency

UniGram.df <- UniGram.df[order(UniGram.df$Freq, decreasing = TRUE), ]
BiGram.df  <- BiGram.df[order(BiGram.df$Freq, decreasing = TRUE), ]
TriGram.df <- TriGram.df[order(TriGram.df$Freq, decreasing = TRUE), ]

Top 20 UniGrams, BiGrams and TriGrams

top20.UniGram <- UniGram.df[1:20, ]
top20.BiGram  <- BiGram.df[1:20, ]
top20.TriGram <- TriGram.df[1:20, ]

Exploratory Analysis

Plots give a more visual understanding of the data used in this analysis.

Here we show the top 20 most frequent unigrams, bigrams and trigrams.
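
A minimal sketch of one such plot, using the top20.UniGram data frame built above (the same pattern applies to top20.BiGram and top20.TriGram):

ggplot(top20.UniGram, aes(x = reorder(UniTokenizer, Freq), y = Freq)) +
  geom_bar(stat = "identity") +   # draw raw frequencies as bars
  coord_flip() +                  # horizontal bars keep long n-grams readable
  labs(x = "Unigram", y = "Frequency", title = "Top 20 Unigrams")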

Predictive algorithm

The simplest reasonable prediction model is a back-off model such as Katz back-off. Several other back-off models were discussed in the course forums. The prediction algorithm also has to handle the case where the n-gram being predicted was never observed in the corpora; handling these unseen n-grams is the next step.
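
As a rough illustration of the back-off idea (a sketch only, closer to a “stupid back-off” lookup than to full Katz back-off; predictNextWord is a hypothetical helper, not code from this report), the n-gram frequency tables built above can be queried from the longest context to the shortest:

predictNextWord <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  # 1. Try trigrams whose first two words match the last two words of the phrase
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- TriGram.df[grepl(paste0("^", prefix, " "), TriGram.df$TriTokenizer), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$TriTokenizer[which.max(hits$Freq)])
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # 2. Back off to bigrams keyed on the last word
  if (n >= 1) {
    hits <- BiGram.df[grepl(paste0("^", words[n], " "), BiGram.df$BiTokenizer), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$BiTokenizer[which.max(hits$Freq)])
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # 3. Fall back to the most frequent unigram (tables are sorted by frequency)
  as.character(UniGram.df$UniTokenizer[1])
}

# e.g. predictNextWord("thanks for the")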