Synopsis

The main objective of this report is to demonstrate that we have understood the underlying principles of Natural Language Processing (NLP) and can apply them to text mining. The report also presents basic summary statistics for the input dataset provided in the course and outlines our plan for eventually building a predictive algorithm in a Shiny app.

The report is organized around the criteria outlined in the course: data processing and preparation, exploratory data analysis, n-gram analysis, and plans for the prediction algorithm.

Data Processing and Preparation

The training datasets provided for this course are collections of text lines taken from Twitter, news articles and blogs written in US English, as well as in other languages. The dataset was downloaded from the course website: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
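The following minimal sketch shows how the archive can be downloaded and unpacked in R (the file paths inside the zip are assumed from the standard Coursera-SwiftKey layout):

# Download and unpack the Coursera-SwiftKey dataset (run once).
# The "final/en_US/..." paths inside the zip are assumed from the usual archive layout.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip",
      files = c("final/en_US/en_US.twitter.txt",
                "final/en_US/en_US.news.txt",
                "final/en_US/en_US.blogs.txt"),
      junkpaths = TRUE)  # place the US English files in the working directory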

All R code is provided in the Appendix section.

Data Exploration

First we read the US English datasets using the readLines() function in R.

To begin studying the data, we compute some basic summaries. The datasets are very large and available RAM is limited, so we extract a 1% random sample of each dataset for exploration purposes. Here are some initial statistics and counts (Appendix 1):

Summary Stats of Database Sample

                     Count_Lines   Object_Size   Top_Words
US English Twitter   23601 lines   3.06 Mb       the, to, i, a, you, and, for, in, of, is, my, it, on, that, me
US English News      772 lines     0.19 Mb       the, and, to, a, of, in, for, that, said, is, on, it, he, with, as
US English Blogs     8992 lines    2.46 Mb       the, and, to, a, of, i, in, that, is, it, for, you, with, on, was

The most frequent words are stop words, which add little meaning to the analysis, so it is clear that they should be removed from the corpora.
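As a minimal sketch of how stop-word removal works with the tm package (the full cleaning pipeline is in Appendix 2):

library(tm)

# Toy example: common English stop words such as "the", "is" and "to" are dropped
toy_corpus <- Corpus(VectorSource("the weather is nice and we want to go to the park"))
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("en"))
toy_corpus <- tm_map(toy_corpus, stripWhitespace)   # collapse the gaps left behind
writeLines(as.character(toy_corpus[[1]]))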

The following histograms display the distribution of word counts per line in the sampled Twitter, News and Blogs datasets. As expected, they show that a typical blog line contains the most words, compared to a typical tweet or news line:

N-gram Analysis

N-gram analysis allows us to study sequences of n words from a corpus. An n-gram of one word is a unigram, of two words a bigram, and of three words a trigram; longer sequences are referred to as four-grams, five-grams, and so on. N-gram frequencies are valuable for predicting the next word in a sequence.
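As a small illustration of what these sequences look like (the sentence below is made up), the ngrams() helper from the NLP package can be used directly:

library(NLP)

# Toy illustration: extract the bigrams of a short sentence
tokens  <- strsplit("thanks for the follow", "\\s+")[[1]]
bigrams <- vapply(ngrams(tokens, 2L), paste, character(1), collapse = " ")
bigrams  # "thanks for" "for the" "the follow"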

We use various NLP packages in R such as tm, NLP and RWeka for data cleaning and exploratory text mining. The following steps were applied to make the data ready for n-gram analysis (Appendix 2): convert the text to lowercase; remove punctuation and numbers; remove common English stop words; filter out profanity; and strip extra whitespace.

After performing these transformations, the resulting words are cleaner and much more descriptive of the domain. The following plots show the top 15 n-grams (Appendix 3):

Future Plans and Progress

With the initial analysis done, the next step is to determine whether the sample is too small. A larger sample could slow down processing significantly, while probabilities estimated from a small sample may not generalize well to the domain. The goal is to strike a balance between the two.

We then need to build predictive models on the sample, trying techniques such as Markov chains. We also have to decide how large the n-grams in our models should be, and then validate that the models perform well.
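As a rough sketch of what a first-order Markov-chain (bigram) predictor might look like, using the bigrams_freq table computed in Appendix 2 (the split into first/second word assumes the bigrams are stored as "word1 word2" strings, which is how the tokenizer in Appendix 2 builds them):

# Sketch only: split each bigram into its first and second word
parts <- strsplit(as.character(bigrams_freq$words), " ")
bigram_table <- data.frame(first  = vapply(parts, `[`, character(1), 1),
                           second = vapply(parts, `[`, character(1), 2),
                           freq   = bigrams_freq$freq,
                           stringsAsFactors = FALSE)

# Given one word, return the k most frequent following words
predict_next <- function(word, k = 3) {
  cand <- bigram_table[bigram_table$first == tolower(word), ]
  cand <- cand[order(-cand$freq), ]
  head(cand$second, k)
}

predict_next("happy")   # hypothetical query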

Another inevitable scenario is that n-grams which do not appear in the training set will show up during actual prediction. We will need to investigate which techniques can be applied to handle these unseen n-grams.
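One commonly used option is a simple back-off: when a longer context has not been seen, fall back to shorter n-grams. The sketch below is only illustrative; trigram_table is a hypothetical lookup table with columns first/second/third/freq, and bigram_table is the table from the sketch above:

# Illustrative back-off: trigrams first, then bigrams, then the overall top unigrams
predict_backoff <- function(w1, w2, k = 3) {
  tri <- trigram_table[trigram_table$first == w1 & trigram_table$second == w2, ]
  if (nrow(tri) > 0) return(head(tri$third[order(-tri$freq)], k))
  bi <- bigram_table[bigram_table$first == w2, ]
  if (nrow(bi) > 0) return(head(bi$second[order(-bi$freq)], k))
  head(as.character(unigrams_freq$words), k)   # overall most frequent words
}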

APPENDIX

Appendix 1

#######################
# Read data and basic summaries
# English twitter
con <- file("en_US.twitter.txt", "r") 

en_US.twitter <- readLines(con,  skipNul=TRUE)
en_US.twitter[1:5]


# sampling (keep 1% of the number of lines)
en_US.twitter1 <- sample(en_US.twitter, length(en_US.twitter) * .01)


# basic summaries of the file (printed for quick inspection)
cat(length(en_US.twitter1), "lines\n")
cat(round(object.size(en_US.twitter1) / (1024^2), digits = 2), "Mb\n")


# collapse the lines into single stream/vector
# Change to lowercase
en_US.twitter2 <- tolower(paste(en_US.twitter1,collapse=" "))
substr(en_US.twitter2,1,200)


# keep only alphabets and spaces
# en_US.twitter2 <- gsub("[?.;!¡·,'\\]","", en_US.twitter2)
en_US.twitter2 <- gsub( "[^[:alpha:] ]", "", en_US.twitter2 )
substr(en_US.twitter2,1,200)




# decompose into words
pattern <- "\\s+"
en_US_words <- strsplit(en_US.twitter2,pattern,perl=T)
en_US_words <- unlist(en_US_words)

# Check top 15 words that appear in the sample
en_US_words1 <- as.data.frame(table(en_US_words))
en_US_words1 <- en_US_words1[order(-en_US_words1$Freq),]




# Combine all stats
en_US_twitter_stats <- data.frame( Count_Lines = paste(length(en_US.twitter1),"lines ",sep=' '),
                                   Object_Size = paste(round(object.size(en_US.twitter1)/(1024^2),digits=2),"Mb", sep=' '), 
                                   Top_Words = paste(as.character(en_US_words1[1:15,1]),collapse=", ")
                                   )

rownames(en_US_twitter_stats) <-  "US English Twitter"




#######################
# English news
con2 <- file("en_US.news.txt", "r") 
en_US.news <- readLines(con2,  skipNul=TRUE)
en_US.news[1:5]

# sampling (keep 1% of the number of lines)
en_US.news1 <- sample(en_US.news,length(en_US.news)*.01)


# collapse the lines into single stream/vector
# Change to lowercase
en_US.news2 <- tolower(paste(en_US.news1,collapse=" "))
substr(en_US.news2,1,200)


# keep only alphabets and spaces
en_US.news2 <- gsub("[^[:alpha:] ]", "", en_US.news2)
substr(en_US.news2,1,200)



# decompose into words
pattern <- "\\s+"
en_US_words_news <- strsplit(en_US.news2,pattern,perl=T)
en_US_words_news <- unlist(en_US_words_news)

# Check top 20 words that appear in the sample
en_US_words_news1 <- as.data.frame(table(en_US_words_news))
en_US_words_news1 <- en_US_words_news1[order(-en_US_words_news1$Freq),]
head(en_US_words_news1,20)




# Combine all stats
en_US_news_stats <- data.frame( Count_Lines = paste(length(en_US.news1),"lines ",sep=' '),
                                   Object_Size = paste(round(object.size(en_US.news1)/(1024^2),digits=2),"Mb", sep=' '), 
                                   Top_Words = paste(as.character(en_US_words_news1[1:15,1]),collapse=", ")
)

rownames(en_US_news_stats) <-  "US English News"








# English blogs
con3 <- file("en_US.blogs.txt", "r") 
en_US.blogs <- readLines(con3,  skipNul=TRUE)
en_US.blogs[1:5]

# sampling (keep 1% of the number of lines)
en_US.blogs1 <- sample(en_US.blogs,length(en_US.blogs)*.01)


# collapse the lines into single stream/vector
# Change to lowercase
en_US.blogs2 <- tolower(paste(en_US.blogs1,collapse=" "))
substr(en_US.blogs2,1,200)


# keep only alphabets and spaces
en_US.blogs2 <- gsub("[^[:alpha:] ]", "", en_US.blogs2)
substr(en_US.blogs2,1,200)




# decompose into words
pattern <- "\\s+"
en_US_words_blogs <- strsplit(en_US.blogs2,pattern,perl=T)
en_US_words_blogs <- unlist(en_US_words_blogs)

# Check top 20 words that appear in the sample
en_US_words_blogs1 <- as.data.frame(table(en_US_words_blogs))
en_US_words_blogs1 <- en_US_words_blogs1[order(-en_US_words_blogs1$Freq),]
head(en_US_words_blogs1,20)



# Combine all stats
en_US_blogs_stats <- data.frame( Count_Lines = paste(length(en_US.blogs1),"lines ",sep=' '),
                                Object_Size = paste(round(object.size(en_US.blogs1)/(1024^2),digits=2),"Mb", sep=' '), 
                                Top_Words = paste(as.character(en_US_words_blogs1[1:15,1]),collapse=", ")
)

rownames(en_US_blogs_stats) <-  "US English Blogs"

#-------------------------------------------------------------------
#-------------------------------------------------------------------
# Print basic stats of the sampled datasets
library(knitr)
kable( data.frame(rbind(en_US_twitter_stats, en_US_news_stats, en_US_blogs_stats)),caption= "Summary Stats of database sample"  )

########## Output Histograms of the sampled datasets: Count of words in each line
require(stringi)

par(mfrow=c(3,1))


#### English Twitter: Count words in each line and create histogram

# stri_stats_latex() row 4 ("Words") gives the word count of each line
word_count <- sapply(en_US.twitter1, function(x) data.frame(stri_stats_latex(x))[4, ])
names(word_count) <- NULL

hist(word_count, xlab = "Count of words in each line", col = "blue2", main = "Fig 1. US English Twitter: Histogram of Count of words in each line ", breaks = 20, freq=T,prob=F)



#### English News: Count words and histogram
word_count <- sapply(en_US.news1, function(x) data.frame(stri_stats_latex(x))[4,])
names(word_count) <- NULL

hist(word_count, xlab = "Count of words in each line", col = "blue2", main = "Fig 2. US English News: Histogram of Count of words in each line ", breaks = 20)




#### English Blogs: Count words and histogram
word_count <- sapply(en_US.blogs1, function(x) data.frame(stri_stats_latex(x))[4,])
names(word_count) <- NULL

hist(word_count, xlab = "Count of words in each line", col = "blue2", main = "Fig 3. US English Blogs: Histogram of Count of words in each line ", breaks = 20)

Appendix 2

#-------------------------------------------------------------------
#-------------------------------------------------------------------

# Load required packages

library(NLP)
library(tm)
library(RCurl)
library(ggplot2)
library(wordcloud)
library(RWeka)
library(data.table)
library(SnowballC)
library(slam)


set.seed(1555)

# Combine all sources
all_en <- c(en_US.twitter,en_US.news,en_US.blogs)
rm(en_US.twitter,en_US.news,en_US.blogs) # Clean up large files


# Take .3% sample for test run
all_en_sample <- sample(all_en,length(all_en)*.003, replace=F)
length(all_en_sample)

# Memory cleanup (the large raw objects were already removed above)
gc()


# Create corpus from the sample data. Using package tm
en_us_corpus <-  Corpus(VectorSource(all_en_sample))
en_us_corpus

# Output some document within the corpus
writeLines(as.character(en_us_corpus[[30]]))


# What options are available for transformations?

getTransformations()

# perform available transformations
en_us_corpus <- tm_map(en_us_corpus,  content_transformer(tolower))                  # Convert to lowercase
en_us_corpus <- tm_map(en_us_corpus,  removePunctuation)                             # Remove punctuation
en_us_corpus <- tm_map(en_us_corpus,  removeNumbers)                                 # Remove numbers
en_us_corpus <- tm_map(en_us_corpus,  removeWords, stopwords("en"))                  # Remove common English stop words
en_us_corpus <- tm_map(en_us_corpus,  removeWords, c(stopwords("SMART"), "i", "I"))  # Remove the larger SMART stop word list

# Filter out Profanity
# Read list of profanities
profanity <- read.csv("profanity_filter.csv", header=F)[,1]
en_us_corpus <- tm_map(en_us_corpus, removeWords, profanity)  # Remove profanity
writeLines(as.character(en_us_corpus[[30]]))

en_us_corpus <- tm_map(en_us_corpus, stripWhitespace)       # Remove whitespace
writeLines(as.character(en_us_corpus[[30]]))





#-------------------------------------------------------------------
#-------------------------------------------------------------------
# n-gram analysis: build document-term matrices with tm, using custom NLP::ngrams-based tokenizers for bigrams and trigrams


# UNIGRAM ANALYSIS. 
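# Note: wordLengths = c(4, 20) keeps only terms between 4 and 20 characters long,
# and removeSparseTerms(..., 0.9999) drops terms that are absent from at least
# 99.99% of the documents in the sample.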

unigrams <-removeSparseTerms(DocumentTermMatrix(en_us_corpus, control=list(wordLengths=c(4,20))), 0.9999)
inspect(unigrams[1:2,66:100])
unigrams_freq <-(colSums(as.matrix(unigrams)) )
unigrams_freq[1:10]
class(unigrams_freq)

unigrams_freq  <- data.frame(words=names(unigrams_freq),freq=unigrams_freq)

unigrams_freq <- unigrams_freq[order(-unigrams_freq[,2]),]
head(unigrams_freq)


unigrams_freq1 <- unigrams_freq[1:15,]
unigrams_freq1$words <- factor(unigrams_freq1$words[1:15], levels = unigrams_freq1$words)








# BIGRAM ANALYSIS

BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
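# The tokenizer pastes consecutive word pairs returned by NLP::ngrams() into "word1 word2" strings.
# Note (assumption about tm versions): Corpus(VectorSource(...)) may create a SimpleCorpus in newer
# releases of tm, which ignores custom tokenizers; using VCorpus(VectorSource(...)) instead should
# ensure the bigram/trigram tokenizers take effect.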

bigrams <-removeSparseTerms(DocumentTermMatrix(en_us_corpus, control=list( tokenize = BigramTokenizer, wordLengths=c(4,20))), 0.9999)
inspect(bigrams[1:2,66:200])
bigrams_freq <-(colSums(as.matrix(bigrams)) )
bigrams_freq[1:10]
class(bigrams_freq)

bigrams_freq  <- data.frame(words=names(bigrams_freq),freq=bigrams_freq)

bigrams_freq <- bigrams_freq[order(-bigrams_freq[,2]),]
head(bigrams_freq)


bigrams_freq1 <-bigrams_freq[1:15,]
bigrams_freq1$words <- factor(bigrams_freq1$words[1:15], levels = bigrams_freq1$words)





# TRIGRAM ANALYSIS

TrigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

trigrams <-removeSparseTerms(DocumentTermMatrix(en_us_corpus, control=list(wordLengths=c(4,20), tokenize = TrigramTokenizer)), 0.9999)
inspect(trigrams[1:2,66:100])
trigrams_freq <-(colSums(as.matrix(trigrams)) )
trigrams_freq[1:10]
class(trigrams_freq)

trigrams_freq  <- data.frame(words=names(trigrams_freq),freq=trigrams_freq)

trigrams_freq <- trigrams_freq[order(-trigrams_freq[,2]),]
head(trigrams_freq)


trigrams_freq1 <-trigrams_freq[1:15,]
trigrams_freq1$words <- factor(trigrams_freq1$words[1:15], levels = trigrams_freq1$words)

Appendix 3

#### Output Graphs for n-grams

# Unigram
graph  <- ggplot(unigrams_freq1, aes(x=words, y=freq)) + 
  geom_bar(stat = "identity", fill = "dodgerblue3") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Unigrams") +
  ylab("Frequency") +
  ggtitle("Unigrams: Top 15 recurring words")
graph

# Bigram
graph1  <- ggplot(bigrams_freq1, aes(x=words, y=freq)) + 
  geom_bar(stat = "identity", fill = "dodgerblue3") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  
  xlab("Bigrams") +
  ylab("Frequency") +
  ggtitle("Bigrams: Top 15 recurring words")
graph1


# Trigram
graph2  <- ggplot(trigrams_freq1, aes(x=words, y=freq)) + 
  geom_bar(stat = "identity", fill = "dodgerblue3") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Trigrams") +
  ylab("Frequency") +
  ggtitle("Trigrams: Top 15 recurring words")
graph2