This milestone report explores and analyses the SwiftKey data. The main goal of this project is to create an algorithm that predicts the next possible word while a fragment of text is being typed into an input field. The n-gram analysis will be done on three different data sources, i.e. blogs, news and Twitter. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams are typically collected from a text or speech corpus (source: Wikipedia).
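As a quick illustration of the idea (using the same RWeka tokenizer applied later in this report), the following toy call extracts word bigrams from a single example sentence; the sentence is just an illustration and is not part of the SwiftKey data.
library(RWeka)
# bigrams (n = 2) of a toy sentence: "to be", "be or", "or not", "not to", "to be"
NGramTokenizer("to be or not to be", Weka_control(min = 2, max = 2))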
The data consists of text files in four languages: English, Russian, German and Finnish. For each language there are three files, namely blogs, news and twitter. Of these four languages, I chose to analyze English as it is the only language I am familiar with.
setwd("C:/Users/DELL 1/Documents/Capstone Project/")
library(RWeka)
library(stringi)
library(stringr)
library(ggplot2)
library(dplyr)
library(reshape2)
library(tm)
twitter <- readLines(con <- file("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")); close(con)
blogs <- readLines(con <- file("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")); close(con)
news <- readLines(con <- file("./final/en_US/en_US.news.txt", encoding = "UTF-8")); close(con)
The following code computes the file size, line count and word count for the English blogs, news and twitter files.
blogs_words <- stri_count_words(blogs)
news_words <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)
blogs_size <- file.info("final/en_US/en_US.blogs.txt")$size/1024^2
news_size <- file.info("final/en_US/en_US.news.txt")$size/1024^2
twitter_size <- file.info("final/en_US/en_US.twitter.txt")$size/1024^2
summary_table <- data.frame(filename = c("blogs","news","twitter"),
file_size_MB = c(blogs_size, news_size, twitter_size),
num_of_lines = c(length(blogs),length(news),length(twitter)),
num_of_words = c(sum(blogs_words),sum(news_words),sum(twitter_words)),
mean_num_Of_words = c(mean(blogs_words),mean(news_words),mean(twitter_words)))
summary_table
## filename file_size_MB num_of_lines num_of_words mean_num_Of_words
## 1 blogs 200.4242 899288 37546246 41.75108
## 2 news 196.2775 77259 2674536 34.61779
## 3 twitter 159.3641 2360148 30093369 12.75063
From each data source, a 1% sample is taken. Sampling is done to get a quick analysis of the data and to reduce the time needed to pre-process and clean it, as well as to tokenize the words of the corpus into different n-grams. This is done in the hope that the chosen sample is sufficient to represent the whole data population.
set.seed(1000)
blogs_sample <- sample(blogs, length(blogs)*0.01)
news_sample <- sample(news, length(news)*0.01)
twitter_sample <- sample(twitter, length(twitter)*0.01)
twitter_sample <- sapply(twitter_sample,
function(row) iconv(row, "latin1", "ASCII", sub=""))
#Creating corpus
#The three samples taken from blogs, news and tweets are now combined for further analysis.
text_sample <- c(blogs_sample,news_sample,twitter_sample)
length(text_sample) #number of lines
## [1] 33365
sum(stri_count_words(text_sample)) #number of words
## [1] 706286
#remove all weird characters
cleanedTwitter <- sapply(twitter, function(x) iconv(enc2utf8(x), sub = "byte"))
cleanedBlogs <- sapply(blogs, function(x) iconv(enc2utf8(x), sub = "byte"))
cleanedNews <- sapply(news, function(x) iconv(enc2utf8(x), sub = "byte"))
The following code cleans the chosen data sample by removing other “noise”.
doc.vec <- VectorSource(text_sample)
doc.corpus <- Corpus(doc.vec)
#convert to lower case (wrapped in content_transformer so the corpus structure is preserved)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
#remove all punctuation
doc.corpus <- tm_map(doc.corpus, removePunctuation )
#remove all numbers
doc.corpus <- tm_map(doc.corpus, removeNumbers )
#remove all white spaces
doc.corpus <- tm_map(doc.corpus, stripWhitespace )
#convert to plain text document
doc.corpus <- tm_map(doc.corpus, PlainTextDocument )
# Remove stopwords
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
The following code computes word frequencies for each n-gram model and orders them from largest to smallest. The top 25 most frequent terms from each model are reported in the form of a histogram. A document-term matrix or term-document matrix (TDM) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents; in the term-document matrix used here, rows correspond to terms and columns correspond to documents in the collection (source: Wikipedia).
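As a small illustration of that structure, separate from the sampled corpus, a term-document matrix can be built for two toy documents with the same tm functions used below; rows are terms, columns are the two documents, and each cell holds a count.
toy_corpus <- Corpus(VectorSource(c("the cat sat", "the cat ran")))
inspect(TermDocumentMatrix(toy_corpus))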
Get_Freq <- function(TDM)
{
freq <- sort(rowSums(as.matrix(TDM)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
createPlot <- function(data, label)
{
ggplot(data[1:25,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = "blue")
}
# Get frequencies of most common n-grams in data sample
freq1 <- Get_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus), 0.999))
freq2 <- Get_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus, control = list(tokenize = bigram)), 0.999))
freq3 <- Get_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus, control = list(tokenize = trigram)), 0.9999))
Here is a histogram of the 25 most common unigrams in the data sample.
createPlot(freq1, "25 Most Common Unigram")
The next histogram shows the 25 most common bigrams in the data sample.
createPlot(freq2, "25 Most Common Bigrams")
Here is a final histogram depicting the 25 most common trigrams in the data sample.
createPlot(freq3, "25 Most Common Trigrams")
A few observations can be made from the above analysis:
-The n-grams show good predictive potential even with a small sample size.
-Processing is very slow despite the small sample used.
-The bigram and trigram results make more sense than the unigram ones.
The next plan is to:
-Perform the remaining pre-processing data cleanup
-Remove profanity (I am still considering this due to the concern of losing the actual meaning of words if profanity is removed)
-Determine if word stemming is beneficial
-Build a prediction model based on a larger sample of the data (perhaps 2%); a minimal sketch follows this list
-Build a Shiny web app that allows users to type in a phrase of words and hit submit
-Create an R presentation describing the application
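To make the planned prediction model concrete, here is a minimal, hypothetical sketch of a frequency-based backoff predictor built on the bigram and trigram tables computed above (freq2, freq3). The function name predict_next_word is my own placeholder, not part of any package, and the final model will need smoothing and better handling of stopwords.
# Minimal backoff sketch (assumes freq2 and freq3 are the frequency tables
# computed above, with columns `word` and `freq`, sorted by decreasing freq)
predict_next_word <- function(phrase, freq2, freq3) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(tokens)
  if (n == 0) return(NA_character_)
  # try trigrams first: look for trigrams starting with the last two words
  if (n >= 2) {
    prefix <- paste(tokens[n - 1], tokens[n])
    hits <- freq3[startsWith(as.character(freq3$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      # tables are sorted by frequency, so the first match is the most frequent
      return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
    }
  }
  # back off to bigrams starting with the last word
  hits <- freq2[startsWith(as.character(freq2$word), paste0(tokens[n], " ")), ]
  if (nrow(hits) > 0) {
    return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  }
  NA_character_
}
# example call: returns the most frequent continuation, or NA if none is found
predict_next_word("happy new", freq2, freq3)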