As part of the Data Science Capstone Project, this milestone report presents the work done so far on exploratory data analysis and modeling. To get started, I downloaded the Coursera SwiftKey dataset. After extraction, I chose to work with the en_US folder, which contains the following three files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
# Preload necessary R libraries
library(dplyr)
library(doParallel)
library(stringi)
library(tm)
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.5
library(ggplot2)
library(wordcloud)
library(SnowballC)
# Set up a parallel cluster to accelerate execution time
jobcluster <- makeCluster(detectCores())
registerDoParallel(jobcluster)  # register the cluster as the doParallel backend
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(stringi)))
invisible(clusterEvalQ(jobcluster, library(wordcloud)))
Before starting to work with the files mentioned above, it is important to look at their basic characteristics, such as the number of lines, characters and words.
blogs <- readLines("Coursera-Swiftkey/final/en_US/en_US.blogs.txt", encoding = 'UTF-8', skipNul = TRUE)
news <- readLines("Coursera-Swiftkey/final/en_US/en_US.news.txt", encoding = 'UTF-8', skipNul = TRUE)
## Warning in readLines("Coursera-Swiftkey/final/en_US/en_US.news.txt",
## encoding = "UTF-8", : incomplete final line found on 'Coursera-Swiftkey/
## final/en_US/en_US.news.txt'
twitter <- readLines("Coursera-Swiftkey/final/en_US/en_US.twitter.txt", encoding = 'UTF-8', skipNul = TRUE)
rawstats <- data.frame(
File = c("blogs","news","twitter"),
t(rbind(sapply(list(blogs,news,twitter),stri_stats_general),
TotalWords = sapply(list(blogs,news,twitter),stri_stats_latex)[4,]))
)
print(rawstats)
## File Lines LinesNEmpty Chars CharsNWhite TotalWords
## 1 blogs 899288 899288 206824382 170389539 37570839
## 2 news 77259 77259 15639408 13072698 2651432
## 3 twitter 2360148 2360148 162096241 134082806 30451170
The blogs file has about 900,000 lines, the news file about 77,000 lines, and the twitter file about 2.4 million lines.
Since the raw data set is very large, it is better to take a sample before starting the analysis.
set.seed(39)
sampleTwitter <- twitter[sample(1:length(twitter), 10000)]
sampleNews <- news[sample(1:length(news), 10000)]
sampleBlogs <- blogs[sample(1:length(blogs), 10000)]
sampleData <- c(sampleTwitter, sampleNews, sampleBlogs)
# Make sure the output directory exists before writing the sample
dir.create("./sample", showWarnings = FALSE)
writeLines(sampleData, "./sample/sampleData.txt")
rm(rawstats, blogs, news, twitter, sampleData, sampleBlogs, sampleNews, sampleTwitter)
The tm package in R is a good tool for analyzing this data. Loading the tm library and creating a corpus is the first step of the analysis. The main structure for managing documents in tm is the so-called Corpus, which represents a collection of text documents.
directory <- file.path(".", "sample")
us_files <- Corpus(DirSource(directory))
# define a transformer that replaces slashes, @ signs and pipes with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
us_files <- tm_map(us_files, toSpace, "/|@|\\|")
# convert to lowercase
us_files <- tm_map(us_files, content_transformer(tolower))
# remove punctuation
us_files <- tm_map(us_files, removePunctuation)
# remove numbers
us_files <- tm_map(us_files, removeNumbers)
# strip whitespace
us_files <- tm_map(us_files, stripWhitespace)
# remove english stop words
us_files <- tm_map(us_files, removeWords, stopwords("english"))
# initiate stemming
us_files <- tm_map(us_files, stemDocument)
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
Word frequencies play a vital role in building n-gram models. The us_files corpus will be used to create unigrams, bigrams and trigrams.
unigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
unigrams <- DocumentTermMatrix(us_files, control = list(tokenize = unigramTokenizer))
BigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
bigrams <- DocumentTermMatrix(us_files, control = list(tokenize = BigramTokenizer))
TrigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
trigrams <- DocumentTermMatrix(us_files, control = list(tokenize = TrigramTokenizer))
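For intuition about what these tokenizers produce, they can also be applied directly to a short piece of text. The sentence below is made up purely for illustration and is not taken from the corpus.
# Illustrative only: the bigram tokenizer applied to a made-up sentence
BigramTokenizer("to be or not to be")
# should return something like: "to be" "be or" "or not" "not to" "to be"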
After creating document-term matrices for unigrams, bigrams and trigrams, their term frequencies can be summed across documents to obtain total frequencies. Let's start by finding the most frequent unigrams.
Unigrams
unigrams_frequency <- sort(colSums(as.matrix(unigrams)),decreasing = TRUE)
unigrams_freq_df <- data.frame(word = names(unigrams_frequency), frequency = unigrams_frequency)
head(unigrams_freq_df, 10)
## word frequency
## said said 2912
## will will 2801
## one one 2613
## like like 2397
## get get 2289
## time time 2215
## just just 2201
## can can 2071
## year year 2052
## make make 1729
Plotting the unigram frequencies as a bar chart
unigrams_freq_df %>%
filter(frequency > 1000) %>%
ggplot(aes(reorder(word,-frequency), frequency)) +
geom_bar(stat = "identity") +
ggtitle("Unigrams with frequencies > 1000") +
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
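The wordcloud package loaded at the start has not been used so far. As an optional illustration (a sketch added here, not part of the original analysis), the same unigram frequencies could also be visualized as a word cloud.
# Optional sketch: word cloud of the most frequent unigrams
wordcloud(words = unigrams_freq_df$word,
          freq = unigrams_freq_df$frequency,
          max.words = 100,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))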
Bigrams
bigrams_frequency <- sort(colSums(as.matrix(bigrams)),decreasing = TRUE)
bigrams_freq_df <- data.frame(word = names(bigrams_frequency), frequency = bigrams_frequency)
head(bigrams_freq_df, 10)
## word frequency
## last year last year 211
## new york new york 176
## high school high school 167
## right now right now 158
## look like look like 154
## year ago year ago 147
## last week last week 117
## dont know dont know 110
## feel like feel like 106
## st loui st loui 101
Plotting the bigram frequencies as a bar chart
bigrams_freq_df %>%
filter(frequency > 100) %>%
ggplot(aes(reorder(word,-frequency), frequency)) +
geom_bar(stat = "identity") +
ggtitle("Bigrams with frequencies > 100") +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Trigrams
trigrams_frequency <- sort(colSums(as.matrix(trigrams)),decreasing = TRUE)
trigrams_freq_df <- data.frame(word = names(trigrams_frequency), frequency = trigrams_frequency)
head(trigrams_freq_df, 10)
## word frequency
## new york citi new york citi 29
## none repeat scroll none repeat scroll 25
## repeat scroll yellow repeat scroll yellow 25
## stylebackground none repeat stylebackground none repeat 25
## cant wait see cant wait see 17
## two year ago two year ago 17
## u u u u u u 17
## presid barack obama presid barack obama 16
## happi mother day happi mother day 15
## st loui counti st loui counti 14
Plotting the trigram frequencies as a bar chart
trigrams_freq_df %>%
filter(frequency > 10) %>%
ggplot(aes(reorder(word,-frequency), frequency)) +
geom_bar(stat = "identity") +
ggtitle("Trigrams with frequencies > 10") +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Based on this analysis, I plan to use the n-gram data frames to calculate the probability of the next word occurring given the previous words. For the Shiny app, the plan is to create a simple interface where the user can enter a string of text; the prediction model will then suggest a list of likely next words.
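As a rough sketch of that idea (the names trigram_model and predictNextWord are hypothetical, introduced only for illustration and not part of the analysis above), the trigram frequency table could be split into a two-word prefix and a predicted last word, with candidates ranked by their observed counts. A complete model would also need to back off to bigrams and unigrams when a prefix has not been seen.
library(dplyr)
library(stringr)

# Split each trigram "w1 w2 w3" into a two-word prefix and the word that follows it
trigram_model <- trigrams_freq_df %>%
  mutate(word = as.character(word),
         prefix = str_replace(word, "\\s+\\S+$", ""),
         prediction = str_extract(word, "\\S+$"))

# Hypothetical helper: given the last two words typed, return the most
# frequent continuations observed in the sampled corpus
predictNextWord <- function(input, n = 3) {
  last_two <- str_extract(tolower(str_trim(input)), "\\S+\\s+\\S+$")
  trigram_model %>%
    filter(prefix == last_two) %>%
    arrange(desc(frequency)) %>%
    head(n) %>%
    pull(prediction)
}

predictNextWord("new york")  # should suggest "citi", given the trigram counts above
In practice, the input string would also need the same cleaning steps (lowercasing, punctuation removal, stemming) that were applied to the corpus, so that the typed prefix matches the stemmed n-grams.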