The aim of this report is to describe the three files that will be used to build the corpus for a predictive text algorithm for SwiftKey. The data come from the HC Corpora collection (www.corpora.heliohost.org) and contain samples from news, blogs, and Twitter in English, German, Russian, and Finnish.
In this report, the English datasets are explored and prepared for the modelling.
#Load required libraries
library(tm)
## Loading required package: NLP
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
#Load data
myfile <- "./final/en_US/en_US.blogs.txt"
en_US.blogs <- scan(file=myfile, what="character", sep="\n", quote="")
myfile <- "./final/en_US/en_US.news.txt"
en_US.news <- scan(file=myfile, what="character", sep="\n", quote="")
myfile <- "./final/en_US/en_US.twitter.txt"
en_US.twitter <- scan(file=myfile, what="character", sep="\n", quote="")
## Warning in scan(file = myfile, what = "character", sep = "\n", quote = ""):
## embedded nul(s) found in input
rm(myfile) #used to save memory
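The warning above comes from embedded nul characters in en_US.twitter.txt. As an alternative, the file could be read with readLines() and skipNul = TRUE, which silently drops the nuls; a minimal sketch, not used for the results below:
#Alternative loading (sketch): open the Twitter file in binary mode and
#drop embedded nul characters while reading it line by line
con <- file("./final/en_US/en_US.twitter.txt", open = "rb")
en_US.twitter <- readLines(con, skipNul = TRUE)
close(con)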
#Some basic summaries of the three files
##Number of elements/documents (lines)
length(en_US.blogs)
## [1] 899288
length(en_US.news)
## [1] 1010242
length(en_US.twitter)
## [1] 2360148
##Descriptive stats on the number of characters of the documents (basic data tables)
summary(nchar(en_US.blogs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1      47     156     230     329   40830
summary(nchar(en_US.news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0   110.0   185.0   201.2   268.0 11380.0
summary(nchar(en_US.twitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.00   37.00   64.00   68.68  100.00  140.00
##Word count
sum(sapply(strsplit(en_US.blogs, " "), length))
## [1] 37334131
sum(sapply(strsplit(en_US.news, " "), length))
## [1] 34372530
sum(sapply(strsplit(en_US.twitter, " "), length))
## [1] 30373543
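The counts above can also be collected into a single overview table; a minimal sketch reusing the three vectors loaded above:
#Overview table (sketch): lines, total characters, and rough word counts per file
overview <- data.frame(
  file  = c("blogs", "news", "twitter"),
  lines = c(length(en_US.blogs), length(en_US.news), length(en_US.twitter)),
  chars = c(sum(nchar(en_US.blogs)), sum(nchar(en_US.news)), sum(nchar(en_US.twitter))),
  words = c(sum(sapply(strsplit(en_US.blogs, " "), length)),
            sum(sapply(strsplit(en_US.news, " "), length)),
            sum(sapply(strsplit(en_US.twitter, " "), length))))
overview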
#Select a sample of 10% of the total documents
blogs <- sample(en_US.blogs, size = ceiling(length(en_US.blogs)/10),
                replace = FALSE)
news <- sample(en_US.news, size = ceiling(length(en_US.news)/10),
               replace = FALSE)
twitter <- sample(en_US.twitter, size = ceiling(length(en_US.twitter)/10),
                  replace = FALSE)
rm(en_US.blogs, en_US.news, en_US.twitter)
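Because sample() is random, a different 10% subset (and therefore slightly different counts below) is drawn on every run. For a reproducible report, a seed could be fixed before the three sample() calls above; a minimal sketch with an arbitrary seed value:
#Sketch (would go before the sample() calls above): fix the random seed so that
#the same 10% subsets are drawn every time the report is built
set.seed(1234)  #arbitrary value chosen only for illustration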
#Generate the corpus (the text that will be used to build the algorithm)
corp.source <- VectorSource(paste(blogs, news, twitter))
rm(blogs, news, twitter)
corpus <- VCorpus(corp.source, readerControl = list(language = "English"))
rm(corp.source)
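Note that paste(blogs, news, twitter) combines the three samples element-wise, recycling the shorter blog and news samples to the length of the Twitter sample, so each document in the corpus is one blog line, one news line, and one tweet pasted together. A quick sanity check (a sketch) is to print the start of the first document:
#Sanity check (sketch): show the first 200 characters of the first document
writeLines(substr(content(corpus[[1]]), 1, 200))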
#Clean the corpus; use getTransformations() to see all available transformations.
##every word to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
##remove numbers
corpus <- tm_map(corpus, removeNumbers)
##strip whitespaces
corpus <- tm_map(corpus, stripWhitespace)
##remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
##remove custom stop words (profanity)
##profanity list from http://www.cs.cmu.edu/~biglou/resources/
url <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanity <- read.csv(url, header=FALSE)
ownStopWords <- as.character(profanity[[1]])
corpus <- tm_map(corpus, removeWords, ownStopWords)
rm(url, profanity, ownStopWords)
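getTransformations() also lists removePunctuation, which is not applied above, so punctuation and Twitter symbols such as # and @ are still present in the corpus. A possible extra cleaning step would be the following sketch, not used for the results below:
#Possible extra step (sketch, not applied above): strip punctuation characters
corpus <- tm_map(corpus, removePunctuation)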
#Get the cleaned term matrix
dtm <- DocumentTermMatrix(corpus)
dim(dtm)
## [1] 236015 430726
#Remove sparse terms (sparse = 0.95 keeps only terms that appear in at least ~5% of the documents)
dtm2 <- removeSparseTerms(dtm, sparse=0.95)
dim(dtm2)
## [1] 236015 50
rm(dtm)
#Show frequent terms
freq <- sort(colSums(as.matrix(dtm2)), decreasing=TRUE)
word.freq <- data.frame(word=names(freq), freq=freq)
barplot(height=word.freq$freq, names.arg=word.freq$word)
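Since ggplot2 is already loaded, the same term frequencies can also be drawn as horizontal bars, which keeps the 50 term labels readable; a sketch:
#Alternative plot (sketch): horizontal bar chart of the same term frequencies
ggplot(word.freq, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Term", y = "Frequency")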
Considering the results shown above and the size of the corpus, we are planning the following milestones for developing the prediction algorithm:
* Work with different sample sizes and sampling techniques for the three files.
* Improve the cleaning of the Twitter documents.
* Tokenise the corpus and build N-grams (a first sketch follows below).
* Identify a suitable technique for modelling the English language.
* Deploy the algorithm in a Shiny web app.
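As a first step towards the N-gram milestone, RWeka (loaded above but not yet used) provides an N-gram tokenizer that plugs into tm. A minimal sketch for bigrams, assuming the cleaned corpus built above is still in memory:
#Sketch of the planned N-gram step: build a bigram tokenizer with RWeka and
#pass it to tm when constructing the document-term matrix
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm.bigram <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
findFreqTerms(dtm.bigram, lowfreq = 50)  #list bigrams occurring at least 50 times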