The aim of this report is to describe the three files that will be used to build the corpus for a predictive text algorithm for SwiftKey. The data come from the HC Corpora collection (www.corpora.heliohost.org) and contain samples from news, blogs, and Twitter in English, German, Russian, and Finnish.
In this report, the English datasets are explored and prepared for the modelling.
#Load required libraries
library(tm)
## Loading required package: NLP
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
#Load data
myfile <- "./final/en_US/en_US.blogs.txt"
en_US.blogs <- scan(file=myfile, what="character", sep="\n", quote="")
myfile <- "./final/en_US/en_US.news.txt"
en_US.news <- scan(file=myfile, what="character", sep="\n", quote="")
myfile <- "./final/en_US/en_US.twitter.txt"
en_US.twitter <- scan(file=myfile, what="character", sep="\n", quote="")
## Warning in scan(file = myfile, what = "character", sep = "\n", quote = ""):
## embedded nul(s) found in input
rm(myfile) #used to save memory
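The warning above comes from embedded nul characters in en_US.twitter.txt. As an alternative, the file could be read with readLines() and skipNul = TRUE, which silently drops the nuls; a minimal sketch, not used for the results below:
#Alternative loading (sketch): open the Twitter file in binary mode and
#drop embedded nul characters while reading it line by line
con <- file("./final/en_US/en_US.twitter.txt", open = "rb")
en_US.twitter <- readLines(con, skipNul = TRUE)
close(con)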
#Some basic summaries of the three files
##Number of elements/documents (lines)
length(en_US.blogs)
## [1] 899288
length(en_US.news)
## [1] 1010242
length(en_US.twitter)
## [1] 2360148
##Descriptive stats on the number of characters of the documents (basic data tables)
summary(nchar(en_US.blogs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1      47     156     230     329   40830
summary(nchar(en_US.news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0   110.0   185.0   201.2   268.0 11380.0
summary(nchar(en_US.twitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.00   37.00   64.00   68.68  100.00  140.00
##Word count
sum(sapply(strsplit(en_US.blogs, " "), length))
## [1] 37334131
sum(sapply(strsplit(en_US.news, " "), length))
## [1] 34372530
sum(sapply(strsplit(en_US.twitter, " "), length))
## [1] 30373543
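The counts above can also be collected into a single overview table; a minimal sketch reusing the three vectors loaded above:
#Overview table (sketch): lines, total characters, and rough word counts per file
overview <- data.frame(
  file  = c("blogs", "news", "twitter"),
  lines = c(length(en_US.blogs), length(en_US.news), length(en_US.twitter)),
  chars = c(sum(nchar(en_US.blogs)), sum(nchar(en_US.news)), sum(nchar(en_US.twitter))),
  words = c(sum(sapply(strsplit(en_US.blogs, " "), length)),
            sum(sapply(strsplit(en_US.news, " "), length)),
            sum(sapply(strsplit(en_US.twitter, " "), length))))
overview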
#Select a sample of 10% of the total documents
blogs <- sample(en_US.blogs, size = ceiling(length(en_US.blogs)/10),
                replace = FALSE)
news <- sample(en_US.news, size = ceiling(length(en_US.news)/10),
               replace = FALSE)
twitter <- sample(en_US.twitter, size = ceiling(length(en_US.twitter)/10),
                  replace = FALSE)
rm(en_US.blogs, en_US.news, en_US.twitter)
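Because sample() is random, a different 10% subset (and therefore slightly different counts below) is drawn on every run. For a reproducible report, a seed could be fixed before the three sample() calls above; a minimal sketch with an arbitrary seed value:
#Sketch (would go before the sample() calls above): fix the random seed so that
#the same 10% subsets are drawn every time the report is built
set.seed(1234)  #arbitrary value chosen only for illustration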
#Generate the corpus (the text that will be used to build the algorithm)
corp.source <- VectorSource(paste(blogs, news, twitter))
rm(blogs, news, twitter)
corpus <- VCorpus(corp.source, readerControl = list(language = "English"))
rm(corp.source)
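Note that paste(blogs, news, twitter) combines the three samples element-wise, recycling the shorter blog and news samples to the length of the Twitter sample, so each document in the corpus is one blog line, one news line, and one tweet pasted together. A quick sanity check (a sketch) is to print the start of the first document:
#Sanity check (sketch): show the first 200 characters of the first document
writeLines(substr(content(corpus[[1]]), 1, 200))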
#Clean the corpus; use getTransformations() to see all available transformations.
##every word to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
##remove numbers
corpus <- tm_map(corpus, removeNumbers)
##strip whitespaces
corpus <- tm_map(corpus, stripWhitespace)
##remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
##remove custom stop words (profanity)
##profanity list from http://www.cs.cmu.edu/~biglou/resources/
url <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanity <- read.csv(url, header=FALSE)
ownStopWords <- as.character(profanity[[1]])
corpus <- tm_map(corpus, removeWords, ownStopWords)
rm(url, profanity, ownStopWords)
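getTransformations() also lists removePunctuation, which is not applied above, so punctuation and Twitter symbols such as # and @ are still present in the corpus. A possible extra cleaning step would be the following sketch, not used for the results below:
#Possible extra step (sketch, not applied above): strip punctuation characters
corpus <- tm_map(corpus, removePunctuation)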
#Get the cleaned term matrix
dtm <- DocumentTermMatrix(corpus)
dim(dtm)
## [1] 236015 430726
#Remove sparse terms (sparse = 0.95 keeps only terms that appear in at least ~5% of the documents)
dtm2 <- removeSparseTerms(dtm, sparse=0.95)
dim(dtm2)
## [1] 236015 50
rm(dtm)
#Show frequent terms
freq <- sort(colSums(as.matrix(dtm2)), decreasing=TRUE)
word.freq <- data.frame(word=names(freq), freq=freq)
barplot(height=word.freq$freq, names.arg=word.freq$word)
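Since ggplot2 is already loaded, the same term frequencies can also be drawn as horizontal bars, which keeps the 50 term labels readable; a sketch:
#Alternative plot (sketch): horizontal bar chart of the same term frequencies
ggplot(word.freq, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Term", y = "Frequency")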
Considering the results shown above and the size of the corpus, we are planning the following milestones for developing the prediction algorithm:
* Work with different sample sizes and sampling techniques for the three files.
* Improve the cleaning of the Twitter documents.
* Tokenise the corpus and build N-grams (a first sketch follows below).
* Identify a suitable technique for modelling the English language.
* Deploy the algorithm in a Shiny web app.
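As a first step towards the N-gram milestone, RWeka (loaded above but not yet used) provides an N-gram tokenizer that plugs into tm. A minimal sketch for bigrams, assuming the cleaned corpus built above is still in memory:
#Sketch of the planned N-gram step: build a bigram tokenizer with RWeka and
#pass it to tm when constructing the document-term matrix
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm.bigram <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
findFreqTerms(dtm.bigram, lowfreq = 50)  #list bigrams occurring at least 50 times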