This is an exploratory analysis for the first milestone of the Data Science Capstone Project. If you are interested in the full code, see the last section of this document.
source('textPred.R')
## filehash: Simple Key-Value Database (2.3 2015-08-12)
## Loading required package: RColorBrewer
We have three data sets, all of them in English (except for a few words). The first one contains tweets, the second one blog posts, and the last one news articles. The final goal is to use these data sets to build an application that, given a few words, can predict the next one(s). This document, however, contains only a basic exploratory analysis of the data.
fn <- 'data/final/en_US/en_US.twitter.txt'
system(paste("wc -l", fn), intern = TRUE)
## [1] "2360148 data/final/en_US/en_US.twitter.txt"
dt <- getTokensCsv("tus.tokens")
## Read 30177130 rows and 2 (of 2) columns from 0.366 GB file in 00:00:06
dim(dt)[1]
## [1] 30177130
dtSummary <- tokens.freq(dt)
rm(dt)
dim(dtSummary)[1]
## [1] 466857
wordcloud(dtSummary[1:500, tokens1], dtSummary[1:500, counts],
          scale=c(5,0.5),
          random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE,
          colors=brewer.pal(8, "Dark2"))
plotFreq(dtSummary[1:40, .(words=tokens1, freq)])
The following number of words covers 50% of the word instances in the data set:
dtSummary[, cumulativeFreq := cumsum(counts)/sum(counts)] %>%
    .[cumulativeFreq < 0.5, tokens1] %>%
    length
## [1] 132
The following number of words covers 90% of the word instances in the data set:
dtSummary[cumulativeFreq<0.9, tokens1] %>%
length
## [1] 6143
rm(dtSummary)
fn <- 'data/final/en_US/en_US.news.txt'
system(paste("wc -l", fn), intern = TRUE)
## [1] "1010242 data/final/en_US/en_US.news.txt"
dt <- getTokensCsv("nus.tokens")
## Read 34588227 rows and 2 (of 2) columns from 0.436 GB file in 00:00:06
dim(dt)[1]
## [1] 34588227
dtSummary <- tokens.freq(dt)
rm(dt)
dim(dtSummary)[1]
## [1] 391919
wordcloud(dtSummary[1:500, tokens1], dtSummary[1:500, counts],
          scale=c(5,0.5),
          random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE,
          colors=brewer.pal(8, "Dark2"))
plotFreq(dtSummary[1:40, .(words=tokens1, freq)])
The following number of words covers 50% of the word instances in the data set:
dtSummary[, cumulativeFreq := cumsum(counts)/sum(counts)] %>%
    .[cumulativeFreq < 0.5, tokens1] %>%
    length
## [1] 220
The following number of words covers 90% of the word instances in the data set:
dtSummary[cumulativeFreq<0.9, tokens1] %>%
length
## [1] 9639
rm(dtSummary)
fn <- 'data/final/en_US/en_US.blogs.txt'
system(paste("wc -l", fn), intern = TRUE)
## [1] "899288 data/final/en_US/en_US.blogs.txt"
dt <- getTokensCsv("bus.tokens")
## Read 37469821 rows and 2 (of 2) columns from 0.461 GB file in 00:00:06
dim(dt)[1]
## [1] 37469821
dtSummary <- tokens.freq(dt)
rm(dt)
dim(dtSummary)[1]
## [1] 488169
wordcloud(dtSummary[1:500, tokens1], dtSummary[1:500, counts],
          scale=c(5,0.5),
          random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE,
          colors=brewer.pal(8, "Dark2"))
plotFreq(dtSummary[1:40, .(words=tokens1, freq)])
The following number of words covers 50% of the word instances in the data set:
dtSummary[, cumulativeFreq := cumsum(counts)/sum(counts)] %>%
    .[cumulativeFreq < 0.5, tokens1] %>%
    length
## [1] 116
The following number of words covers 90% of the word instances in the data set:
dtSummary[cumulativeFreq<0.9, tokens1] %>%
length
## [1] 7608
rm(dtSummary)
Once the 2-, 3- and 4-grams are calculated, we are planning to use Katz's back-off model together with Good-Turing estimation. This allows us to choose smoothly between predictions obtained from n-gram models with different n. We plan to use n up to 5.
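To make the plan concrete: Good-Turing re-estimates the count of an n-gram seen c times as c* = (c+1) * N(c+1) / N(c), where N(c) is the number of distinct n-grams seen exactly c times, and Katz's model redistributes the probability mass freed by this discounting to lower-order n-grams. Below is a minimal sketch of the back-off lookup itself, assuming a list of data.tables ngrams[[n]] with word columns w1..wn and a counts column; these names, and the simplified rule of returning the most frequent continuation at the highest matching order, are illustrative assumptions rather than the final implementation.
library(data.table)

## Sketch only: ngrams[[n]] is assumed to be a data.table with columns
## w1..wn (words) and counts; the real model will apply Good-Turing
## discounting instead of simply taking the most frequent match.
predictNext <- function(prefix, ngrams) {
  for (n in rev(seq(2, length(ngrams)))) {
    context <- tail(prefix, n - 1)
    if (length(context) < n - 1) next        # not enough history for this order
    keyCols <- paste0("w", seq_len(n - 1))
    hits <- ngrams[[n]][as.list(context), on = keyCols, nomatch = 0]
    if (nrow(hits) > 0) {
      ## highest-order model that has seen this context wins; Katz's model
      ## would instead mix in lower orders with the discounted mass
      return(hits[order(-counts)][1, get(paste0("w", n))])
    }
  }
  ngrams[[1]][order(-counts)][1, w1]          # fall back to the most frequent unigram
}
For example, predictNext(c("thanks", "for", "the"), ngrams) would first look for a 4-gram whose first three words are "thanks for the" and back off to shorter contexts only if none is found.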
At the moment the code that creates the n-grams is slow and buggy. We would like to improve it by using external scripts written in bash or C.
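One way to speed this up without leaving R would be to build the n-grams with data.table's shift(). The sketch below computes bigram counts this way; it assumes a tokens table with columns line (message id) and token (one word per row, in order), and these column names should be adapted to the actual output of basicDT.
library(data.table)

## Illustrative bigram builder; the column names `line` and `token` are assumptions.
bigramCounts <- function(tokens) {
  tokens[, nextToken := shift(token, type = "lead"), by = line]   # next word within the same message
  tokens[!is.na(nextToken),
         .(counts = .N),
         by = .(w1 = token, w2 = nextToken)][order(-counts)]
}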
The code is available at https://github.com/sbartek/textPrediction. In particular, the script procUS.R is responsible for downloading and preprocessing the data, using functions included in the file textPred.R.
First, we download the data using the function downloadCourseraSwiftKey, which we implemented in textPred.R.
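A minimal sketch of that step is shown below; the dataset URL and the destination paths are assumptions made for illustration, since the actual logic lives in downloadCourseraSwiftKey.
## Sketch of the download step; URL and paths are assumptions,
## the real logic is in downloadCourseraSwiftKey() in textPred.R.
downloadData <- function(destDir = "data") {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  zipFile <- file.path(destDir, "Coursera-SwiftKey.zip")
  if (!dir.exists(destDir)) dir.create(destDir, recursive = TRUE)
  if (!file.exists(zipFile)) {
    download.file(url, zipFile, mode = "wb")
    unzip(zipFile, exdir = destDir)   # yields data/final/en_US/en_US.*.txt among others
  }
}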
Now it is time for cleaning. We read the file and then transform the resulting character vector into a data.table, since its operations are faster (the most remarkable being fread).
Next, we lower-case all letters and then deal with punctuation. We treat the symbols . , ? ... ; ! : ( ) " as ones that divide the message (another possible strategy is to simply remove them). We also include a lonely - here. Then we remove extra empty spaces, and finally we tokenize the text. Here we use the function basicDT, also implemented in textPred.R.
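A condensed sketch of this pipeline is shown below; the function name, the <s> divider token and the output columns are illustrative assumptions, the actual implementation being basicDT in textPred.R.
library(data.table)

## Illustrative version of the cleaning described above; the real code is basicDT().
cleanAndTokenize <- function(lines) {
  dt <- data.table(text = tolower(lines))                     # lower-case all letters
  ## symbols that divide a message, plus a lonely -, become a divider token
  dt[, text := gsub('[.,?;!:()"]+|\\s-\\s', ' <s> ', text)]
  dt[, text := gsub('\\s+', ' ', trimws(text))]               # remove extra empty spaces
  dt[, line := .I]
  dt[, .(token = unlist(strsplit(text, ' '))), by = line]     # one token per row
}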