This report is for the Coursera Capstone project offered by the Johns Hopkins Bloomberg School of Public Health. It describes a basic exploratory analysis of the Capstone dataset.
The main objective of this course is to apply data science in the area of natural language processing.
The final result of this course will be a Shiny application that accepts text typed by the user and tries to predict the next word.
The purpose of this report is to demonstrate the following:
Demonstrate that the data has been downloaded and successfully loaded in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings amassed so far.
Get feedback on the plans for creating a prediction algorithm and Shiny app.
Report contents and approach:
For the basic statistics, file sizes, general line and character counts, and word distributions are presented.
A corpus is created from a sample of each file (the first 10,000 lines of each, as read below).
For the basic plot requirement, simple plots of the top word and n-gram frequencies are included.
The Data
The data specified by the project can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
setwd("C:/project/capstone")
library(NLP)
## Warning: package 'NLP' was built under R version 3.2.3
library(tm)
## Warning: package 'tm' was built under R version 3.2.3
library(stringi)
## Warning: package 'stringi' was built under R version 3.2.3
library(xtable)
## Warning: package 'xtable' was built under R version 3.2.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.3
library(RWekajars)
## Warning: package 'RWekajars' was built under R version 3.2.3
library(rJava)
## Warning: package 'rJava' was built under R version 3.2.3
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.3
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
library(knitr)
The readLines function is used to load only the English corpora, in UTF-8 encoding. The file sizes of the downloaded data sources are shown below.
fileURL <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, destfile = "Dataset.zip", method = "curl")
## Warning: running command 'curl "http://
## d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" -o
## "Dataset.zip"' had status 127
## Warning in download.file(fileURL, destfile = "Dataset.zip", method =
## "curl"): download had nonzero exit status
unzip("Dataset.zip")
## Warning in unzip("Dataset.zip"): error 1 in extracting from zip file
File Size (MB)
cat("en_US.news.txt: " , file.info("C:/project/capstone/en_US.news.txt")$size / (1024*1024) ,"mb")
## en_US.news.txt: 196.2775 mb
cat("en_US.blogs.txt: " , file.info("C:/project/capstone/en_US.blogs.txt")$size / (1024*1024) ,"mb")
## en_US.blogs.txt: 200.4242 mb
cat("en_US.twitter.txt: " ,file.info("C:/project/capstone/en_US.twitter.txt")$size / (1024*1024) ,"mb")
## en_US.twitter.txt: 159.3641 mb
# Read only the first 10,000 lines of each English corpus, in UTF-8 encoding
news <- readLines("C:/project/capstone/en_US.news.txt", n = 10000, encoding = "UTF-8")
blogs <- readLines("C:/project/capstone/en_US.blogs.txt", n = 10000, encoding = "UTF-8")
twitter <- readLines("C:/project/capstone/en_US.twitter.txt", n = 10000, encoding = "UTF-8")
DataStats <- rbind(stri_stats_general(news), stri_stats_general(blogs), stri_stats_general(twitter))
DataStats <- as.data.frame(DataStats)
row.names(DataStats) <- c("news", "blogs", "twitter")
DataStats
## Lines LinesNEmpty Chars CharsNWhite
## news 10000 10000 2035687 1701758
## blogs 10000 10000 2277383 1876763
## twitter 10000 10000 681544 563870
We can see that blogs is the largest file, followed by news and then Twitter.
# paste() is vectorised: it combines the i-th line of news, blogs and twitter
# into one string, giving 10,000 mixed documents for the sample corpus
SampleData <- paste(news, blogs, twitter)
corpus <- VCorpus(VectorSource(SampleData))
Once we have a corpus we typically want to modify the documents in it, e.g., stemming, stopword removal, et cetera. In tm, all this functionality is subsumed into the concept of a transformation. Transformations are done via the tm_map() function which applies (maps) a function to all elements of the corpus.
Eliminating Extra Whitespace
corpus <- tm_map(corpus, stripWhitespace)
Removal of Numbers
corpus <- tm_map(corpus, removeNumbers)
Removal of Punctuation
corpus <- tm_map(corpus, removePunctuation)
Removal of stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
Convert to Lower Case
corpus <- tm_map(corpus, content_transformer(tolower))
inspect(corpus[1:2])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 177
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 219
meta(corpus[[2]], "id")
## [1] "2"
writeLines(as.character(corpus[[2]]))
## the st louis plant close it die old age workers making cars since onset mass automotive production s we love mr brown when meet someone special youll know your heart will beat rapidly youll smile reason
lapply(corpus[1:2], as.character)
## $`1`
## [1] "he wasnt home alone apparently in years thereafter oil fields platforms named pagan <U+0093>gods<U+0094> how btw thanks rt you gonna dc anytime soon love see been way way long"
##
## $`2`
## [1] "the st louis plant close it die old age workers making cars since onset mass automotive production s we love mr brown when meet someone special youll know your heart will beat rapidly youll smile reason"
dtm <- DocumentTermMatrix(corpus)
Analyzing word frequencies: finding terms that occur at least 1,000 times. Note that stop words such as "the" and "and" still appear below because removeWords() was applied before the corpus was converted to lower case, so capitalised occurrences (e.g. "The") were not matched and survive as "the" after tolower().
findFreqTerms(dtm, lowfreq=1000)
## [1] "also" "and" "back" "but" "can" "day" "first"
## [8] "get" "going" "good" "just" "know" "last" "like"
## [15] "love" "make" "much" "new" "now" "one" "people"
## [22] "said" "see" "the" "time" "two" "well" "will"
## [29] "year"
findAssocs(dtm, "research", 0.15) # just looking for some associations
## $research
## allgrain baun bedfellows
## 0.19 0.19 0.19
## changefind charlie<U+0092>s compensatory
## 0.19 0.19 0.19
## complementing espnradiocom evaluation<U+0094>
## 0.19 0.19 0.19
## fmri freemasons geology
## 0.19 0.19 0.19
## hci headings heeeeellllllloooooooo
## 0.19 0.19 0.19
## jamburrito kappa oneofmyfavoritemovies
## 0.19 0.19 0.19
## overheat overheated pisano
## 0.19 0.19 0.19
## revision sparging vents
## 0.19 0.19 0.19
## vermonts
## 0.19
We find that the data contains many non-English characters (for example the <U+0092> and <U+0093> sequences above). We have to identify such tokens and remove them, because we do not want to predict them. For this purpose we will use the tm package, which is extensively used for text mining.
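One way this cleanup could be done, sketched below under the assumption that simply dropping non-ASCII characters is acceptable (the helper name removeNonASCII is only illustrative), is to apply base R's iconv() as a tm content transformation:
# Illustrative only: strip non-ASCII characters from every document.
# Kept in a separate object so the results shown above are unchanged.
removeNonASCII <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = ""))
cleanCorpus <- tm_map(corpus, removeNonASCII)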
# Flatten the corpus back into a plain-text data frame for the RWeka tokenizers
corpus.df <- data.frame(text = unlist(sapply(corpus, '[', "content")), stringsAsFactors = FALSE)
TokenizersDelimiters <- "\"\'\\t\\r\\n ().,;!?"
UniTokenizer <- NGramTokenizer(corpus.df$text, Weka_control(min = 1, max = 1))
BiTokenizer <- NGramTokenizer(corpus.df$text, Weka_control(min = 2, max = 2, delimiters = TokenizersDelimiters))
TriTokenizer <- NGramTokenizer(corpus.df$text, Weka_control(min = 3, max = 3, delimiters = TokenizersDelimiters))
UniGram.df <- data.frame(table(UniTokenizer))
BiGram.df <- data.frame(table(BiTokenizer))
TriGram.df <- data.frame(table(TriTokenizer))
UniGram.df <- UniGram.df[order(UniGram.df$Freq, decreasing = TRUE), ]
BiGram.df <- BiGram.df[order(BiGram.df$Freq, decreasing = TRUE), ]
TriGram.df <- TriGram.df[order(TriGram.df$Freq, decreasing = TRUE), ]
top20.UniGram <- UniGram.df[1:20, ]
top20.BiGram <- BiGram.df[1:20, ]
top20.TriGram <- TriGram.df[1:20, ]
Plots give a more visual view of the data used in this analysis report. Here we show the top 20 frequencies of unigrams, bigrams and trigrams.
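A sketch of how these plots can be produced with the already-loaded ggplot2 package is shown below for the unigrams; the same pattern applies to top20.BiGram and top20.TriGram.
# Bar plot of the 20 most frequent unigrams (bigrams and trigrams are analogous)
ggplot(top20.UniGram, aes(x = reorder(UniTokenizer, Freq), y = Freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "Top 20 unigrams")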
The simplest reasonable prediction model is a back-off model such as Katz back-off. Several other back-off models were discussed in the forums. The prediction algorithm has to handle the case where the n-gram being predicted was never observed in the corpora, and that is the next step.
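As a deliberately simplified sketch of that next step, the function below performs a greedy back-off over the raw frequency tables built above (most frequent matching trigram, then bigram, then unigram). It is not Katz back-off proper, since it applies no discounting or redistribution of probability mass, and the function name predictNextWord is only illustrative.
# Rough back-off sketch over the n-gram frequency tables; no smoothing or discounting
predictNextWord <- function(phrase) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(words) == 0) return(as.character(UniGram.df$UniTokenizer[1]))
  # 1. Try trigrams whose first two words match the end of the input
  if (length(words) == 2) {
    hits <- TriGram.df[grepl(paste0("^", words[1], " ", words[2], " "),
                             TriGram.df$TriTokenizer), ]
    if (nrow(hits) > 0)
      return(sub(".* ", "", as.character(hits$TriTokenizer[which.max(hits$Freq)])))
  }
  # 2. Back off to bigrams starting with the last word
  hits <- BiGram.df[grepl(paste0("^", tail(words, 1), " "), BiGram.df$BiTokenizer), ]
  if (nrow(hits) > 0)
    return(sub(".* ", "", as.character(hits$BiTokenizer[which.max(hits$Freq)])))
  # 3. Fall back to the single most frequent unigram overall
  as.character(UniGram.df$UniTokenizer[1])
}
predictNextWord("thanks for the")   # example call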