1. Introduction

The objective of this project is to perform an exploratory analysis of the dataset, examining its statistical properties and the distribution of the data. The results are presented with plots and tables in order to show word frequencies and how they vary across sources.

Later, this analysis will be used to build a prediction model, trained on a dataset compiled from three different sources: blogs, news and Twitter.

2. Loading Data

We use the zip file directly and let R do the unzipping. Assuming that the file Coursera-SwiftKey.zip is placed in the current working directory, I load the three data sources into their respective variables.

#loading packages
library(stringi)
library(kableExtra)
library(NLP)
library(tm)
library(rJava) 
## if R cannot install or load this package, first install the latest Java version from the official website
library(RWeka)

#unzip the data
trainData <- "Coursera-SwiftKey.zip"
unzip(trainData)

# blogs
conn <- file("final/en_US/en_US.blogs.txt", open = "r")
blogs_en <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

# news
# The news file is opened in binary mode ("rb"): with open = "r" the read stops early with the warning 'incomplete final line found on final/en_US/en_US.news.txt'
conn <- file("final/en_US/en_US.news.txt", open = "rb")
news_en <- readLines(conn, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
close(conn)

# twitter
conn <- file("final/en_US/en_US.twitter.txt", open = "r")
twitter_en <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

# remove the connection object, which is no longer needed
rm(conn)
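
As a quick optional check (the counts are summarized more fully in the next section), the number of lines read from each source can be verified directly:

# optional sanity check: number of lines read from each source
sapply(list(blogs = blogs_en, news = news_en, twitter = twitter_en), length)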

3. Exploratory Data Analysis

3.1 Descriptive Table

The table below shows a brief summary after loading data into three separate variables.

# A sample size of 1% is used to represent the three populations, due to memory constraints.
sampleSize <- 0.01

# file sizes in MB
fileSizeMB <- round(file.info(c('final/en_US/en_US.blogs.txt',
                                'final/en_US/en_US.news.txt',
                                'final/en_US/en_US.twitter.txt'))$size / 1024 ^ 2)

#lines per file
nLines <- sapply(list(blogs_en, news_en, twitter_en), length)

#characters per file
nChars <- sapply(list(nchar(blogs_en), nchar(news_en), nchar(twitter_en)), sum)

#words per file
nWords <- sapply(list(blogs_en, news_en, twitter_en), stri_stats_latex)[4,]

# words per line
wpline <- lapply(list(blogs_en, news_en, twitter_en), function(x) stri_count_words(x))

# summary
wplSummary = sapply(list(blogs_en, news_en, twitter_en),function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(wplSummary) = c('Words/line Min', 'Words/line Mean', 'Words/line Max')

summary <- data.frame(File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
                      FileSize = paste(fileSizeMB, " MB"),
                      Lines = nLines,
                      Characters = nChars,
                      Words = nWords,
                      t(rbind(round(wplSummary))))

kable(summary, row.names = FALSE, align = c("l", rep("r", 7)), caption = "") %>%
  kable_styling(position = "left")
File                 FileSize   Lines     Characters   Words      Words/line Min   Words/line Mean   Words/line Max
en_US.blogs.txt      200 MB      899288   206824505    37570839                0                42             6726
en_US.news.txt       196 MB     1010242   203223159    34494539                1                34             1796
en_US.twitter.txt    159 MB     2360148   162096241    30451170                1                13               47

As the table shows, the average number of words per line differs across the three sources: blogs have the most words per line, Twitter has the fewest, and news falls in between.

As for the number of lines, Twitter has by far the most, since individual tweets are limited in length. File sizes are comparable for blogs and news, while the Twitter file is the smallest.
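
The same comparison can be made visually with the wpline list computed above; this optional boxplot is not part of the summary table (outliers are hidden so that the very long blog lines do not dominate the scale):

# optional: compare the words-per-line distributions of the three sources
names(wpline) <- c("blogs", "news", "twitter")
boxplot(wpline, outline = FALSE, las = 1, main = "Words per line by source")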

3.2 Subset Building

In this section I define subsets of the main dataset, sampling 1% of each source due to memory constraints.
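
Since the subsets are drawn at random, it can be useful to fix the seed first so that the 1% samples are reproducible (the seed value below is arbitrary):

# fix the random seed so that the samples below are reproducible (arbitrary value)
set.seed(1234)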

#subset of three variables: blogs_en, news_en, twitter_en
s_blogs <- blogs_en[sample(1:length(blogs_en), sampleSize*length(blogs_en), replace=FALSE)]
s_blogs<- paste(s_blogs, collapse = " ")

s_news <- news_en[sample(1:length(news_en), sampleSize*length(news_en), replace=FALSE)]
s_news <- paste(s_news, collapse = " ")

s_twitter <- twitter_en[sample(1:length(twitter_en), sampleSize*length(twitter_en), replace=FALSE)]
s_twitter<- paste(s_twitter, collapse = " ")

# remove non-ASCII characters, then merge the three subsets into one dataset called s_data
s_blogs <- iconv(s_blogs, "UTF-8", "ASCII", sub="")
s_news <- iconv(s_news, "UTF-8", "ASCII", sub="")
s_twitter <- iconv(s_twitter, "UTF-8", "ASCII", sub="")

s_data <- c(s_blogs, s_news, s_twitter)
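
An optional check confirms that the merged sample contains one collapsed character string per source and gives a rough idea of its size in memory:

# optional check: one collapsed string per source, plus the approximate size in memory
length(s_data)                             # expected: 3
format(object.size(s_data), units = "MB")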

3.3 Corpus Building and Cleaning

In this section I create the corpus from the three subsets defined above.

s_corpus <- VCorpus(VectorSource(s_data))

# clean the corpus: lowercase, then remove numbers, punctuation, English stop words and extra whitespace
# (tolower is wrapped in content_transformer() so the documents keep their PlainTextDocument class)
s_corpus <- tm_map(s_corpus, content_transformer(tolower))
s_corpus <- tm_map(s_corpus, removeNumbers)
s_corpus <- tm_map(s_corpus, removePunctuation)
s_corpus <- tm_map(s_corpus, removeWords, stopwords("english"))
s_corpus <- tm_map(s_corpus, stripWhitespace)
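
An optional spot check shows the effect of the cleaning steps on the first document of the corpus:

# optional spot check: first 200 characters of the cleaned blogs document
substr(as.character(s_corpus[[1]]), 1, 200)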

3.4 Tokenizing and N-Grams

In this section I extract the most common terms (uni-, bi- and tri-grams) from the sampled data; after defining the tokenization functions and computing the frequencies, the results are displayed as barplots.

#Tokenization functions
uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
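
As a quick illustration, applying the bigram tokenizer to a short toy string returns its consecutive word pairs ("this is", "is a", and so on):

# quick illustration of the bigram tokenizer on a toy string
bi_tokenizer("this is a simple test")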

# build a term-document matrix for each n-gram size
uni_matrix <- TermDocumentMatrix(s_corpus, control = list(tokenize = uni_tokenizer))
bi_matrix <- TermDocumentMatrix(s_corpus, control = list(tokenize = bi_tokenizer))
tri_matrix <- TermDocumentMatrix(s_corpus, control = list(tokenize = tri_tokenizer))

# calculate n-gram frequencies, keeping only terms that appear at least 10 times
uni_corpus <- findFreqTerms(uni_matrix,lowfreq = 10)
uni_corpus_freq <- rowSums(as.matrix(uni_matrix[uni_corpus,]))
uni_corpus_freq <- sort(uni_corpus_freq, decreasing = TRUE)

bi_corpus <- findFreqTerms(bi_matrix,lowfreq=10)
bi_corpus_freq <- rowSums(as.matrix(bi_matrix[bi_corpus,]))
bi_corpus_freq <- sort(bi_corpus_freq, decreasing = TRUE)

tri_corpus <- findFreqTerms(tri_matrix,lowfreq=10)
tri_corpus_freq <- rowSums(as.matrix(tri_matrix[tri_corpus,]))
tri_corpus_freq <- sort(tri_corpus_freq, decreasing = TRUE)
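
Before plotting, the sorted frequency vectors can also be inspected directly; for example, the ten most frequent bigrams (the actual terms depend on the random sample):

# optional: ten most frequent bigrams in the sample
head(data.frame(bigram = names(bi_corpus_freq),
                frequency = bi_corpus_freq,
                row.names = NULL), 10)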

3.5 Results

The frequencies computed above are shown below as barplots of the 20 most frequent terms of each type.

Unigram Results

barplot(uni_corpus_freq[1:20], col = "#CCFFFF", las = 2)

Bigram Results

barplot(bi_corpus_freq[1:20], col = "#003399", las = 2)

Trigram Results

barplot(tri_corpus_freq[1:20], col = "#666666", las = 2)

4. Concluding Remarks

In the coming weeks this project will be finalized by building a Shiny app with a predictive algorithm based on an n-gram model with a frequency look-up. A possible strategy for the app is to use the tri-gram model to predict the next word, or possibly a more complex model for longer, more complex phrases.
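
As a rough illustration of the frequency look-up idea, the sketch below (not the final algorithm; predict_next is a hypothetical helper name) looks up the most frequent trigram that starts with a given two-word prefix and returns its last word:

# minimal sketch of a frequency look-up on the trigram counts computed above
predict_next <- function(w1, w2, trigram_freq = tri_corpus_freq) {
  prefix <- paste(w1, w2)
  # trigram_freq is already sorted by decreasing frequency, so the first match wins
  matches <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
  if (length(matches) == 0) return(NA_character_)
  # trigram names look like "w1 w2 w3"; return the third word
  tail(strsplit(names(matches)[1], " ")[[1]], 1)
}
# example call (the result depends on the sampled data):
# predict_next("one", "of")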