Summary

This report gives an initial analysis of the data. First I demonstrate the steps for downloading and loading the data; after that I present some basic summaries of the dataset.

Downloading the Data

The training data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and saved as Coursera-SwiftKey.zip in the ../../data/ folder.
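For completeness, the download step itself can be scripted; a minimal sketch (assuming the ../../data/ folder already exists):

# download the zip file only if it is not already present
url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- "../../data/Coursera-SwiftKey.zip"
if (!file.exists(dest)) download.file(url, destfile = dest, mode = "wb")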

# list the contents of the zip file without extracting it
dest <- "../../data/Coursera-SwiftKey.zip"
unzip(dest, list = TRUE)
##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00

We have three different data sources: news, blogs and twitter. In addition, the data comes in four different languages: English, German, Russian and Finnish. In this initial report we will focus on the English data.
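Since only the English data is used in this report, it is enough to extract the en_US files; a small sketch, reusing dest from the chunk above:

# extract just the English files into the data folder
en_files <- c("final/en_US/en_US.blogs.txt",
              "final/en_US/en_US.news.txt",
              "final/en_US/en_US.twitter.txt")
unzip(dest, files = en_files, exdir = "../../data")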

Loading the Data

# load the libraries needed for processing the text data
library(NLP)
library(tm)
library(RWeka)
library(stringi)
# read in the files and show a basic summary of each
us_blogs <- readLines('../../data/final/en_US/en_US.blogs.txt', encoding = "UTF-8")
summary(us_blogs)
##    Length     Class      Mode 
##    899288 character character
us_tweets <- readLines('../../data/final/en_US/en_US.twitter.txt', encoding = "UTF-8", skipNul = TRUE)
summary(us_tweets)
##    Length     Class      Mode 
##   2360148 character character
us_news <- readLines('../../data/final/en_US/en_US.news.txt', encoding = "UTF-8")
summary(us_news)
##    Length     Class      Mode 
##   1010242 character character
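On some platforms readLines() stops early on the news file because of a stray embedded control character; if the line count above looks too low, a workaround is to read the file through a binary-mode connection:

# open the news file in binary mode so an embedded control character
# does not cut the text-mode read short
con <- file('../../data/final/en_US/en_US.news.txt', open = "rb")
us_news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)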

Exploring the Data

We will use the stringi package to get some basic information about the data. The functions used (stri_stats_general and stri_stats_latex) return a bit more information than we need, so we keep only the number of lines and the number of words.

# stri_stats_general returns:
# 1. Lines - number of lines (number of non-missing strings in the vector);
# 2. LinesNEmpty - number of lines with at least one non-WHITE_SPACE character;
# 3. Chars - total number of Unicode code points detected;
# 4. CharsNWhite - number of Unicode code points that are not WHITE_SPACEs;

# save the summary info in summary dataframe
summary <- data.frame(data = c("blogs", "twitter", "news"))
rownames(summary) <- summary[,1]
# blogs summary
summary_blogs <- stri_stats_general(us_blogs)
# news summary
summary_news <- stri_stats_general(us_news)
# tweets summary
summary_tweets <- stri_stats_general(us_tweets)
# number of lines, in the same order as the rownames (blogs, twitter, news)
summary[,1] <- c(summary_blogs[1], summary_tweets[1], summary_news[1])

#stri_stats_latex returns an integer vector with the following named elements:
# 1. CharsWord - number of word characters;
# 2. CharsCmdEnvir - command and words characters;
# 3. CharsWhite - LaTeX white spaces, including { and } in some contexts;
# 4. Words - number of words;
# 5. Cmds - number of commands;
# 6. Envirs - number of environments;

# blogs summary
summ_blogs <- stri_stats_latex(us_blogs)
# news summary
summ_news <- stri_stats_latex(us_news)
# tweets summary
summ_tweets <- stri_stats_latex(us_tweets)
# number of words, in the same order as the rownames (blogs, twitter, news)
summary[,2] <- c(summ_blogs[4], summ_tweets[4], summ_news[4])
colnames(summary) <- c("#lines", "#words")
summary
##          #lines   #words
## blogs    899288 37570839
## twitter 2360148 30451170
## news    1010242 34494539

All three datasets contain roughly the same number of words; the twitter dataset, however, has far more lines, which means its lines are much shorter on average.
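A quick back-of-the-envelope check of the average line length, computed from the table above, makes this concrete:

# average number of words per line (roughly: blogs ~42, twitter ~13, news ~34)
round(summary[, "#words"] / summary[, "#lines"], 1)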

N-grams for the sample data

For our initial analysis we will subsample a small portion of the entire corpus and save it as an .RData file. This subsample should be large enough to represent the entire dataset reasonably well, yet small enough that the exploratory analysis can be performed in a reasonable amount of time.

# sample the data (50,000 lines of each) and save it as an RData file
set.seed(1234)  # fix the seed so the sample is reproducible
dir.create("sample", showWarnings = FALSE)  # make sure the output folder exists
sample_blogs   <- sample(us_blogs, 50000)
sample_news    <- sample(us_news, 50000)
sample_twitter <- sample(us_tweets, 50000)
save(sample_blogs, sample_twitter, sample_news, file = "./sample/sample.RData")

Next we will create the n-gram tokenizer functions.

# ngram tokenizers
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
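A quick sanity check of the tokenizers on a toy sentence (the expected result is sketched in the comment):

# should return something like: "this is" "is a" "a simple" "simple test"
BigramTokenizer("this is a simple test")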

The following lines perform some basic text preprocessing:
* convert all words to lower case
* remove all numbers
* remove all punctuation

files <- DirSource(directory = "sample/", encoding = "UTF-8")
corpus <- VCorpus(x = files)
# wrap base functions in content_transformer() so the corpus structure is preserved
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
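Note that DirSource() picks up whatever files sit in sample/, which here is the serialized sample.RData file; an alternative sketch builds the corpus directly from the in-memory sample vectors and avoids that dependency:

# alternative: build the corpus straight from the sampled character vectors
corpus <- VCorpus(VectorSource(c(sample_blogs, sample_news, sample_twitter)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)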

Once the corpus is clean we can create term-document matrices.

uni_tdm <- TermDocumentMatrix(corpus)
bi_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tri_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
# transform them to data frames (as.matrix avoids the printing that inspect() does)
uni <- as.data.frame(as.matrix(uni_tdm), stringsAsFactors = FALSE)
bi  <- as.data.frame(as.matrix(bi_tdm), stringsAsFactors = FALSE)
tri <- as.data.frame(as.matrix(tri_tdm), stringsAsFactors = FALSE)

Let's get some basic summaries of these 1-, 2- and 3-grams.

# number of distinct unigrams in our sample
dim(uni)
## [1] 6978    1
# number of distinct bigrams in our sample
dim(bi)
## [1] 22418     1
# number of distinct trigrams in our sample
dim(tri)
## [1] 27258     1
# total number of word tokens in the sample
total <- sum(uni)
total
## [1] 24729
# let's take a look at the ngrams
most_freq_terms <- cbind(findFreqTerms(uni_tdm,lowfreq=100), uni[findFreqTerms(uni_tdm,lowfreq=100),])
# most frequent unigrams and how many times they occurred in the corpus
most_freq_terms
##       [,1]    [,2]  
##  [1,] "about" "107" 
##  [2,] "and"   "836" 
##  [3,] "are"   "131" 
##  [4,] "but"   "173" 
##  [5,] "for"   "303" 
##  [6,] "had"   "116" 
##  [7,] "have"  "169" 
##  [8,] "her"   "120" 
##  [9,] "his"   "122" 
## [10,] "not"   "132" 
## [11,] "said"  "104" 
## [12,] "she"   "118" 
## [13,] "that"  "390" 
## [14,] "the"   "1601"
## [15,] "they"  "105" 
## [16,] "this"  "182" 
## [17,] "was"   "280" 
## [18,] "will"  "106" 
## [19,] "with"  "223" 
## [20,] "you"   "209"
most_freq_terms <- cbind(findFreqTerms(bi_tdm,lowfreq=30), bi[findFreqTerms(bi_tdm,lowfreq=30),])
# most frequent bigrams and how many times they occurred in the corpus
most_freq_terms
##       [,1]        [,2] 
##  [1,] "and i"     "31" 
##  [2,] "and the"   "52" 
##  [3,] "at the"    "40" 
##  [4,] "for the"   "55" 
##  [5,] "i was"     "60" 
##  [6,] "in a"      "33" 
##  [7,] "in the"    "145"
##  [8,] "is a"      "32" 
##  [9,] "it was"    "46" 
## [10,] "of the"    "148"
## [11,] "on the"    "75" 
## [12,] "thank you" "42" 
## [13,] "to be"     "48" 
## [14,] "to the"    "71" 
## [15,] "with the"  "36"
most_freq_terms <- cbind(findFreqTerms(tri_tdm,lowfreq=10), tri[findFreqTerms(tri_tdm,lowfreq=10),])
# most frequent trigrams and how many times they occurred in the corpus
most_freq_terms
##      [,1]       [,2]
## [1,] "a lot of" "14"
## [2,] "i had to" "10"
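Since cbind() coerces the counts to character strings (hence the quoted numbers above), a small helper that keeps the counts numeric and sorts by frequency might look like this (the helper name is just illustrative):

# helper: frequent terms as a data frame with numeric counts, sorted by frequency
freq_terms_df <- function(tdm, lowfreq) {
  counts <- rowSums(as.matrix(tdm))   # term frequencies across all documents
  keep   <- counts >= lowfreq
  out    <- data.frame(term = names(counts)[keep], count = unname(counts[keep]))
  out[order(-out$count), ]
}
freq_terms_df(bi_tdm, lowfreq = 30)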

Next we will create histograms of the n-gram frequencies. We can expect them to be heavily skewed, since some words (and word pairs) occur very frequently in the English language while most occur only rarely. In future analysis we will consider how to deal with these highly frequent words.

hist(x = uni[,1], col = "blue", xlab = "term frequency", main = "Histogram of unigrams")

hist(x = bi[,1], col = "green", xlab = "term frequency", main = "Histogram of bigrams")

hist(x = tri[,1], col = "purple", xlab = "term frequency", main = "Histogram of trigrams")
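As a complementary view to the histograms, a barplot of the most frequent unigrams (a quick sketch reusing the uni data frame) makes the skew easier to read:

# barplot of the 20 most frequent unigrams
ord <- order(uni[, 1], decreasing = TRUE)[1:20]
barplot(uni[ord, 1], names.arg = rownames(uni)[ord], las = 2,
        col = "blue", main = "Top 20 unigrams", ylab = "frequency")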

Conclusions

This brief analysis shows how to load the data into R and create a corpus that can be easily manipulated. We have also seen how to build uni-, bi- and tri-grams. Due to the nature of text data we have to be careful with any kind of analysis; for example, the n-gram counts show that the term frequencies are highly skewed.

Next, we will tackle the entire corpus, remove the sparsest terms, remove profanities, apply some kind of smoothing to the n-grams, implement a back-off model and build a predictor (perhaps Naive Bayes). Also, in order to make a fast and efficient Shiny app, we will have to optimize the code as much as possible.