This is an exploratory analysis for the first milestone of the Data Science Capstone Project. If you are interested in the full code, see the last section of this document.
source('textPred.R')
## filehash: Simple Key-Value Database (2.3 2015-08-12)
## Loading required package: RColorBrewer
We have three data sets, all of them in English (except for a few words). The first one contains tweets, the second one blog posts, and the last one news articles. The final goal is to use these data sets to build an application that, given a few words, can predict the next one(s). This document, however, contains only a basic exploratory analysis of the data.
fn <- 'data/final/en_US/en_US.twitter.txt'
system(paste("wc -l", fn), intern = TRUE)
## [1] "2360148 data/final/en_US/en_US.twitter.txt"
dt <- getTokensCsv("tus.tokens")
## Read 30177130 rows and 2 (of 2) columns from 0.366 GB file in 00:00:06
dim(dt)[1]
## [1] 30177130
dtSummary <- tokens.freq(dt)
rm(dt)
dim(dtSummary)[1]
## [1] 466857
wordcloud(dtSummary[1:500, tokens1], dtSummary[1:500, counts],
          scale=c(5,0.5),
          random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE,
          colors=brewer.pal(8, "Dark2"))
plotFreq(dtSummary[1:40, .(words=tokens1, freq)])
The following number of words covers 50% of the word instances in the data set:
dtSummary[, cumulativeFreq := cumsum(counts)/sum(counts)] %>%
    .[cumulativeFreq < 0.5, tokens1] %>%
    length
## [1] 132
The following number of words covers 90% of the word instances in the data set:
dtSummary[cumulativeFreq<0.9, tokens1] %>%
length
## [1] 6143
rm(dtSummary)
fn <- 'data/final/en_US/en_US.news.txt'
system(paste("wc -l", fn), intern = TRUE)
## [1] "1010242 data/final/en_US/en_US.news.txt"
dt <- getTokensCsv("nus.tokens")
## Read 34588227 rows and 2 (of 2) columns from 0.436 GB file in 00:00:06
dim(dt)[1]
## [1] 34588227
dtSummary <- tokens.freq(dt)
rm(dt)
dim(dtSummary)[1]
## [1] 391919
wordcloud(dtSummary[1:500, tokens1], dtSummary[1:500, counts],
          scale=c(5,0.5),
          random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE,
          colors=brewer.pal(8, "Dark2"))
plotFreq(dtSummary[1:40, .(words=tokens1, freq)])
The following number of words covers 50% of the word instances in the data set:
dtSummary[, cumulativeFreq := cumsum(counts)/sum(counts)] %>%
    .[cumulativeFreq < 0.5, tokens1] %>%
    length
## [1] 220
The following number of words covers 90% of the word instances in the data set:
dtSummary[cumulativeFreq<0.9, tokens1] %>%
length
## [1] 9639
rm(dtSummary)
fn <- 'data/final/en_US/en_US.blogs.txt'
system(paste("wc -l", fn), intern = TRUE)
## [1] "899288 data/final/en_US/en_US.blogs.txt"
dt <- getTokensCsv("bus.tokens")
## Read 37469821 rows and 2 (of 2) columns from 0.461 GB file in 00:00:06
dim(dt)[1]
## [1] 37469821
dtSummary <- tokens.freq(dt)
rm(dt)
dim(dtSummary)[1]
## [1] 488169
wordcloud(dtSummary[1:500, tokens1], dtSummary[1:500, counts],
          scale=c(5,0.5),
          random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE,
          colors=brewer.pal(8, "Dark2"))
plotFreq(dtSummary[1:40, .(words=tokens1, freq)])
The following number of words covers 50% of the word instances in the data set:
dtSummary[, cumulativeFreq := cumsum(counts)/sum(counts)] %>%
    .[cumulativeFreq < 0.5, tokens1] %>%
    length
## [1] 116
The following number of words covers 90% of the word instances in the data set:
dtSummary[cumulativeFreq<0.9, tokens1] %>%
length
## [1] 7608
rm(dtSummary)
Once the 2-, 3- and 4-grams are calculated, we are planning to use Katz's back-off model together with Good-Turing estimation. This allows us to choose smoothly between predictions obtained from n-gram models with different n. We plan to use n up to 5.
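To make the plan concrete: Good-Turing re-estimates the count of an n-gram seen c times as c* = (c+1) * N(c+1) / N(c), where N(c) is the number of distinct n-grams seen exactly c times, and Katz's model redistributes the probability mass freed by this discounting to lower-order n-grams. Below is a minimal sketch of the back-off lookup itself, assuming a list of data.tables ngrams[[n]] with word columns w1..wn and a counts column; these names, and the simplified rule of returning the most frequent continuation at the highest matching order, are illustrative assumptions rather than the final implementation.
library(data.table)

## Sketch only: ngrams[[n]] is assumed to be a data.table with columns
## w1..wn (words) and counts; the real model will apply Good-Turing
## discounting instead of simply taking the most frequent match.
predictNext <- function(prefix, ngrams) {
  for (n in rev(seq(2, length(ngrams)))) {
    context <- tail(prefix, n - 1)
    if (length(context) < n - 1) next        # not enough history for this order
    keyCols <- paste0("w", seq_len(n - 1))
    hits <- ngrams[[n]][as.list(context), on = keyCols, nomatch = 0]
    if (nrow(hits) > 0) {
      ## highest-order model that has seen this context wins; Katz's model
      ## would instead mix in lower orders with the discounted mass
      return(hits[order(-counts)][1, get(paste0("w", n))])
    }
  }
  ngrams[[1]][order(-counts)][1, w1]          # fall back to the most frequent unigram
}
For example, predictNext(c("thanks", "for", "the"), ngrams) would first look for a 4-gram whose first three words are "thanks for the" and back off to shorter contexts only if none is found.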
At the moment the code that creates the n-grams is slow and buggy. We would like to improve it by using external scripts written in bash or C.
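One way to speed this up without leaving R would be to build the n-grams with data.table's shift(). The sketch below computes bigram counts this way; it assumes a tokens table with columns line (message id) and token (one word per row, in order), and these column names should be adapted to the actual output of basicDT.
library(data.table)

## Illustrative bigram builder; the column names `line` and `token` are assumptions.
bigramCounts <- function(tokens) {
  tokens[, nextToken := shift(token, type = "lead"), by = line]   # next word within the same message
  tokens[!is.na(nextToken),
         .(counts = .N),
         by = .(w1 = token, w2 = nextToken)][order(-counts)]
}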
The code is available at https://github.com/sbartek/textPrediction. In particular, the script procUS.R is responsible for downloading and preprocessing the data, using functions included in the file textPred.R.
First, we download the data using the function downloadCourseraSwiftKey, which we implemented in textPred.R.
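A minimal sketch of that step is shown below; the dataset URL and the destination paths are assumptions made for illustration, since the actual logic lives in downloadCourseraSwiftKey.
## Sketch of the download step; URL and paths are assumptions,
## the real logic is in downloadCourseraSwiftKey() in textPred.R.
downloadData <- function(destDir = "data") {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  zipFile <- file.path(destDir, "Coursera-SwiftKey.zip")
  if (!dir.exists(destDir)) dir.create(destDir, recursive = TRUE)
  if (!file.exists(zipFile)) {
    download.file(url, zipFile, mode = "wb")
    unzip(zipFile, exdir = destDir)   # yields data/final/en_US/en_US.*.txt among others
  }
}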
Now it is time for cleaning. We read the file and then transform the resulting character vector into a data.table, since its operations are faster (the most remarkable being fread).
Next, we lower-case all letters and then deal with punctuation. We treat the symbols . , ? ... ; ! : ( ) " as ones that divide the message (another possible strategy is to simply remove them). We also include a lonely - here. Then we remove extra empty spaces, and finally we tokenize the text. Here we use the function basicDT, also implemented in textPred.R.
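A condensed sketch of this pipeline is shown below; the function name, the <s> divider token and the output columns are illustrative assumptions, the actual implementation being basicDT in textPred.R.
library(data.table)

## Illustrative version of the cleaning described above; the real code is basicDT().
cleanAndTokenize <- function(lines) {
  dt <- data.table(text = tolower(lines))                     # lower-case all letters
  ## symbols that divide a message, plus a lonely -, become a divider token
  dt[, text := gsub('[.,?;!:()"]+|\\s-\\s', ' <s> ', text)]
  dt[, text := gsub('\\s+', ' ', trimws(text))]               # remove extra empty spaces
  dt[, line := .I]
  dt[, .(token = unlist(strsplit(text, ' '))), by = line]     # one token per row
}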