Introduction

The goal of this report is to show that I am comfortable working with the data and on track to build the prediction algorithm, by presenting an exploratory analysis and outlining the goals for the eventual app and algorithm.

The dataset was downloaded from the link below and unzipped manually.

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
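For reproducibility, the download and extraction could also be scripted. A minimal sketch, assuming the zip is saved to the working directory and that the English files sit under final/en_US/ inside the archive (adjust if the archive layout differs):

zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

# Download the archive only if it is not already present
if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
}

# Extract only the English files into data/, dropping the internal folder structure
en_files <- c("final/en_US/en_US.blogs.txt",
              "final/en_US/en_US.news.txt",
              "final/en_US/en_US.twitter.txt")
if (!file.exists("data/en_US.blogs.txt")) {
  unzip(zip_file, files = en_files, exdir = "data", junkpaths = TRUE)
}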

Loading libraries

library(knitr)
library(stringi)
library(NLP)
library(tm)
library(RWeka)

Preparing directories

workingDir <- getwd()
dataDir <- file.path(workingDir, "data/")
resultsDir <- file.path(workingDir, "results")

Checking files in directories

dir(path = dataDir)
[1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Exploratory Data Analysis

Loading files

The dataset contains 3 files with text extracted from three types of sources: blogs, news, and Twitter.

blogs_lines <- readLines(paste0(dataDir, "en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
news_lines <- readLines(paste0(dataDir, "en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
twitter_lines <- readLines(paste0(dataDir, "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)

Summary and basic statistics of the files

Showing the number of lines, characters, and words for each of the 3 files, as well as words per line (min, mean, and max).

words <- sapply(list(blogs_lines, news_lines, twitter_lines),
                function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(words) <- c('WPL_Min', 'WPL_Mean', 'WPL_Max')

abstract <- data.frame(
  FileName = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  t(rbind(sapply(list(blogs_lines, news_lines, twitter_lines), stri_stats_general)[c('Lines', 'Chars'), ],
          Words = sapply(list(blogs_lines, news_lines, twitter_lines), stri_stats_latex)['Words', ],
          words)))

print(abstract)
           FileName   Lines     Chars    Words WPL_Min WPL_Mean WPL_Max
1   en_US.blogs.txt  899288 206824382 37570839       0 41.75107    6726
2    en_US.news.txt 1010242 203223154 34494539       1 34.40997    1796
3 en_US.twitter.txt 2360148 162096241 30451170       1 12.75065      47

Building the algorithm

Due to the size of the files, only 1% of the data is used for this initial analysis.

Sampling 1% of each file.

set.seed(12345)

s_blogs <- blogs_lines[sample(1:length(blogs_lines), 0.01 * length(blogs_lines), replace = FALSE)]
s_blogs <- paste(s_blogs, collapse = " ")

s_news <- news_lines[sample(1:length(news_lines), 0.01 * length(news_lines), replace = FALSE)]
s_news <- paste(s_news, collapse = " ")

s_twitter <- twitter_lines[sample(1:length(twitter_lines), 0.01 * length(twitter_lines), replace = FALSE)]
s_twitter <- paste(s_twitter, collapse = " ")

Cleaning the sampled data by removing non-ASCII characters

s_blogs <- iconv(s_blogs, "UTF-8", "ASCII", sub="")
s_news <- iconv(s_news, "UTF-8", "ASCII", sub="")
s_twitter <- iconv(s_twitter, "UTF-8", "ASCII", sub="")

Joining the 3 samples into a single dataset

s_data <- c(s_blogs, s_news, s_twitter)

Building the corpus (a standard structure in NLP: https://en.wikipedia.org/wiki/Natural_language_processing)

s_corpus <- VCorpus(VectorSource(s_data))

Cleaning the corpus (converting to lower case, removing numbers, punctuation, English stop words, and extra whitespace)

# tolower is not a tm transformation, so it is wrapped in content_transformer
# to keep each document as a PlainTextDocument
s_corpus <- tm_map(s_corpus, content_transformer(tolower))
s_corpus <- tm_map(s_corpus, removeNumbers)
s_corpus <- tm_map(s_corpus, removePunctuation)
s_corpus <- tm_map(s_corpus, removeWords, stopwords("english"))
s_corpus <- tm_map(s_corpus, stripWhitespace)
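
A quick sanity check on the first cleaned document helps confirm that the transformations behaved as expected. A minimal sketch:

# Look at the first 200 characters of the cleaned blogs sample
substr(as.character(s_corpus[[1]]), 1, 200)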

Building a simple model of the relationships between words using basic n-grams (unigrams, bigrams, and trigrams) as the basis for predicting the next word from the previous one or two words.

uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

uni_matrix <- TermDocumentMatrix(s_corpus, control = list(tokenize = uni_tokenizer))
bi_matrix <- TermDocumentMatrix(s_corpus, control = list(tokenize = bi_tokenizer))
tri_matrix <- TermDocumentMatrix(s_corpus, control = list(tokenize = tri_tokenizer))
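
The dimensions of the resulting term-document matrices give a quick sense of how many distinct n-grams the 1% sample produces. A minimal check:

# Rows are distinct n-grams, columns are the 3 sampled documents
sapply(list(unigrams = uni_matrix, bigrams = bi_matrix, trigrams = tri_matrix), dim)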

Calculating the frequency of terms in each of the 3 matrices and building sorted frequency vectors of the n-grams, keeping only terms that appear at least 10 times.

uni_corpus <- findFreqTerms(uni_matrix,lowfreq = 10)
uni_corpus_freq <- rowSums(as.matrix(uni_matrix[uni_corpus,]))
uni_corpus_freq <- sort(uni_corpus_freq, decreasing = TRUE)

bi_corpus <- findFreqTerms(bi_matrix,lowfreq=10)
bi_corpus_freq <- rowSums(as.matrix(bi_matrix[bi_corpus,]))
bi_corpus_freq <- sort(bi_corpus_freq, decreasing = TRUE)

tri_corpus <- findFreqTerms(tri_matrix,lowfreq=10)
tri_corpus_freq <- rowSums(as.matrix(tri_matrix[tri_corpus,]))
tri_corpus_freq <- sort(tri_corpus_freq, decreasing = TRUE)
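
The sorted frequency vectors can also be converted into small data frames, which makes them easier to feed into kable tables or other plotting functions later on. A minimal sketch; freq_df and the column names are my own choices, not part of the original code:

# Hypothetical helper: turn a named frequency vector into a tidy data frame
freq_df <- function(freq) {
  data.frame(Term = names(freq), Frequency = as.integer(freq), row.names = NULL)
}

uni_df <- freq_df(uni_corpus_freq)
bi_df  <- freq_df(bi_corpus_freq)
tri_df <- freq_df(tri_corpus_freq)

head(uni_df, 3)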

Results and Plots

Unigram

kable(head(uni_corpus_freq, 10))
Term    Frequency
will         3160
just         3079
said         3051
one          2805
like         2670
can          2458
get          2352
time         2089
new          1966
dont         1867

barplot(uni_corpus_freq[1:20], col = "deepskyblue", las = 2, cex.names = 0.6)

Bigram

kable(head(bi_corpus_freq, 10))
Term           Frequency
right now            242
cant wait            215
dont know            205
last year            191
new york             177
last night           155
im going             150
feel like            145
high school          143
first time           125

barplot(bi_corpus_freq[1:20], col = "deepskyblue", las = 2, cex.names = 0.6)

Trigram

kable(head(tri_corpus_freq, 10))
Term                    Frequency
cant wait see                  51
happy mothers day              33
call call call                 23
happy new year                 21
new york city                  21
let us know                    20
italy lakes holidays           18
little italy boston            17
magianos little italy          17
im pretty sure                 16

barplot(tri_corpus_freq[1:20], col = "deepskyblue", las = 2, cex.names = 0.6)

Next steps

It would be useful to run several tests with different sample sizes of the dataset to find a balance between computational cost and accuracy of the results. It will also be necessary to plan how to build a Shiny web app that incorporates the prediction algorithm.
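
As a rough illustration of where this is heading, the n-gram frequency vectors built above could already drive a simple backoff-style lookup: search the trigram table for the last two words typed, fall back to the bigram table for the last word, and finally fall back to the most frequent unigrams. A minimal sketch, not the final algorithm; predict_next and its internals are my own names:

# Hypothetical next-word lookup based on uni_corpus_freq, bi_corpus_freq and tri_corpus_freq
predict_next <- function(phrase, n = 3) {
  # Apply roughly the same cleaning used for the corpus: lower case, letters and spaces only
  words <- unlist(strsplit(gsub("[^a-z ]", "", tolower(phrase)), "\\s+"))
  words <- words[words != ""]

  candidates <- character(0)

  # Trigram match on the last two words (names are already sorted by frequency,
  # so matches come out most-frequent first)
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- grep(paste0("^", prefix, " "), names(tri_corpus_freq), value = TRUE)
    candidates <- c(candidates, sub(paste0("^", prefix, " "), "", hits))
  }

  # Bigram match on the last word
  if (length(words) >= 1) {
    prefix <- tail(words, 1)
    hits <- grep(paste0("^", prefix, " "), names(bi_corpus_freq), value = TRUE)
    candidates <- c(candidates, sub(paste0("^", prefix, " "), "", hits))
  }

  # Fall back to the most frequent unigrams
  candidates <- c(candidates, names(uni_corpus_freq))
  head(unique(candidates), n)
}

predict_next("cant wait")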