The main objective of this task is to understand the basic relationships we observe in the data and to prepare to build our first text analytics (linguistic) models.
We plan to achieve this objective using the following steps:
- Exploratory analysis with plots and statistics showing various features of the dataset, including word counts.
- Extending the analysis to the most frequent sequences of n words, also called 1-grams, 2-grams, and 3-grams.
- Concluding with next steps for predictive modelling, leveraging the validation and refinement of our expectations based on the data.
This report has been created to fulfill the requirements of the exploratory analysis milestone of the Data Science Capstone offered by Johns Hopkins University on Coursera.
In this project, we will use the English-language dataset from the data provided by SwiftKey, belonging to a corpus called HC Corpora.
The dataset used in this project can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
There are 3 major subsets of the Data: 1. Blogs 2. News 3. Twitter
Set the working directory in R
## [1] "C:/Users/somannam.AUTH/Documents/Personal files/Personal files/PMP/DATA Science- Journey/R/Data Science-Coursera/Capstone Project"
Load the required Packages in R
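The package list is not shown explicitly; a plausible set, inferred from the functions used later in this report (an assumption, adjust to your setup):
# Packages assumed from the functions used below
library(tm)           # VCorpus(), tm_map(), TermDocumentMatrix()
library(stringi)      # stri_count_words()
library(hunspell)     # hunspell_parse(), hunspell_stem()
library(wordcloud)    # wordcloud()
library(wordcloud2)   # wordcloud2()
library(RColorBrewer) # brewer.pal()
library(ggplot2)      # frequency plots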
Loading the Data set using R
# Download and unzip the dataset if it is not already present locally
download_url = 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
data_file = 'Coursera-SwiftKey.zip'
if (!file.exists(data_file)) {
  cat('Downloading Dataset...\n')
  download.file(download_url, destfile=data_file, method="curl")
  cat('Unzipping Dataset...\n')
  unzip(data_file)
} else {
  cat('Dataset is already downloaded!\n')
}
Loading the English Dataset from the Zip file
blog_data <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter_data <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")
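# The news file contains an embedded control character, so it is opened in
# binary mode to keep readLines() from stopping early on some platforms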
con <- file("Coursera-SwiftKey/final/en_US/en_US.news.txt", open="rb")
news_data <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
Let’s do a simple exploratory analysis for a high-level understanding of the data.
# Summary of the blog data
summary(blog_data) # length (number of lines), class, and mode
round(file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size/1024/1024, digits = 1) # file size in MB
# Summary of the news data
summary(news_data) # length (number of lines), class, and mode
round(file.info("Coursera-SwiftKey/final/en_US/en_US.news.txt")$size/1024/1024, digits = 1) # file size in MB
# Summary of the Twitter data
summary(twitter_data) # length (number of lines), class, and mode
round(file.info("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size/1024/1024, digits = 1) # file size in MB
# Disk size (in MB)
blogs_dsize <- file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024 / 1024
news_dsize <- file.info("Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / 1024 / 1024
twitter_dsize <- file.info("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024 / 1024
#In-memory size (in MB)
blogs_msize<-object.size(blog_data) / 1024 / 1024
news_msize<-object.size(news_data) / 1024 / 1024
twitter_msize<-object.size(twitter_data) / 1024 / 1024
# Words in lines
blogs_words <- stri_count_words(blog_data)
news_words <- stri_count_words(news_data)
twitter_words <- stri_count_words(twitter_data)
# Summary
data.frame(source = c("blogs", "news", "twitter"),
           files_MB = c(blogs_dsize, news_dsize, twitter_dsize),
           in_memory_MB = c(blogs_msize, news_msize, twitter_msize),
           lines = c(length(blog_data), length(news_data), length(twitter_data)),
           words_num = c(sum(blogs_words), sum(news_words), sum(twitter_words)),
           mean_words_num = c(mean(blogs_words), mean(news_words), mean(twitter_words)))
## source files_MB in_memory_MB lines words_num mean_words_num
## 1 blogs 200.4242 248.4935 899288 37546246 41.75108
## 2 news 196.2775 249.6329 1010242 34762395 34.40997
## 3 twitter 159.3641 301.3967 2360148 30093369 12.75063
As we can see, the blogs subset has the highest word count and the largest file size, at roughly 200 MB on disk.
We can also look at some sample data to determine the next set of actions.
blogs_sample <- sample(blog_data, 5000, replace = FALSE)
news_sample <- sample(news_data, 5000, replace = FALSE)
twitter_sample <- sample(twitter_data, 5000, replace = FALSE)
bnt_data <- c(blogs_sample, news_sample, twitter_sample)
As shown above, we take a sample of about 5,000 observations from each dataset to perform the next set of analyses.
Various data cleaning methods are used to set up the data for further analysis.
A series of data cleaning and transformation steps were performed on the sampled data so that profanity and other words not needed for prediction are removed:
- Remove extra whitespace
- Remove numbers
- Remove unnecessary punctuation
- Change text to all lowercase
- Remove stop words
- Remove swear/profanity words (the word list was sourced from Shutterstock's GitHub page and is usable under the Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/)
if (!file.exists("badwords.txt")){
download.file(url="https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en", destfile="badwords.txt", method = "curl")
}
badWords <- readLines("badwords.txt")
corpus <- VCorpus(VectorSource(bnt_data))
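# Wrap base R functions in content_transformer() so that tm_map() keeps
# returning proper text documents rather than bare character vectors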
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, badWords)
corpus <- tm_map(corpus, PlainTextDocument)
wordcloud(corpus, scale=c(5,.5), min.freq=5, max.words=100,
          random.order=TRUE, rot.per=.35, colors=brewer.pal(8, "Dark2"))
We use the 'hunspell' package to parse the data.
allwords <- hunspell_parse(bnt_data)
print(allwords[[3]])
## [1] "Crochet" "Pattern" "Central" "has" "many" "free" "patterns" "you" "can"
## [10] "browse" "through" "to" "get" "patterns" "for" "many" "different" "types"
## [19] "of" "pets" "Crochet" "For" "Pets"
# Summarizing words after stemming
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- sort(table(stems), decreasing = TRUE)
print(head(words, 30))
## stems
## the to and a of i in I it that s is for on you with was he this at
## 22041 12318 11613 10674 9452 7828 7567 7550 5586 5212 4725 4670 4557 3979 3693 3160 2915 2480 2453 2414
## have be my as t are but we not from
## 2413 2350 2274 2257 2240 2196 2148 2109 1842 1733
Since most of the top words are stop words, we filter them out:
df <- as.data.frame(words)
df$stems <- as.character(df$stems)
stopwords <- hunspell_parse(readLines('http://jeroenooms.github.io/files/stopwords.txt'))
stops <- df$stems %in% unlist(stopwords)
wcdata <- head(df[!stops,], 150)
print(wcdata, max = 40)
## stems Freq
## 52 time 1187
## 53 It 1164
## 69 day 924
## 85 He 707
## 87 love 687
## 88 people 687
## 101 aft 585
## 113 bee 532
## 115 In 526
## 119 week 512
## 120 A 508
## 122 ally 496
## 125 play 489
## 130 call 469
## 135 home 443
## 137 start 439
## 138 school 437
## 147 feel 404
## 150 game 401
## 155 life 384
## [ reached getOption("max.print") -- omitted 130 rows ]
names(wcdata) <- c("word", "freq")
wordcloud2(wcdata)
blog_words <- hunspell_parse(blogs_sample)
blog_stems <- unlist(hunspell_stem(unlist(blog_words)))
blog_stem_words <- sort(table(blog_stems), decreasing = TRUE)
print(head(blog_stem_words, 30))
# Remove stop words from the blog stems
df_blogs <- as.data.frame(blog_stem_words)
df_blogs$blog_stems <- as.character(df_blogs$blog_stems)
stopwords <- hunspell_parse(readLines('http://jeroenooms.github.io/files/stopwords.txt'))
blog_stops <- df_blogs$blog_stems %in% unlist(stopwords)
blog_wcdata <- head(df_blogs[!blog_stops,], 150)
print(blog_wcdata, max = 40)
## blog_stems Freq
## 52 there 5372
## 53 had 5253
## 69 would 3968
## 85 think 3112
## 87 want 3031
## 88 now 3022
## 101 good 2564
## 113 these 2345
## 115 He 2292
## 119 need 2266
## 120 most 2156
## 122 two 2070
## 125 life 2022
## 130 feel 1973
## 135 being 1910
## 137 book 1905
## 138 read 1901
## 147 said 1816
## 150 something 1779
## 155 its 1712
## [ reached getOption("max.print") -- omitted 130 rows ]
# Word cloud for Blogs
library(wordcloud2)
names(blog_wcdata) <- c("blog_word", "freq")
wordcloud2(blog_wcdata)
news_words <- hunspell_parse(news_sample)
news_stems <- unlist(hunspell_stem(unlist(news_words)))
news_stem_words <- sort(table(news_stems), decreasing = TRUE)
print(head(news_stem_words, 30))
## news_stems
## the to and a of in s that for it on is with he said was at i I as
## 98453 45326 44395 44340 38775 34040 22891 18489 17688 17208 15373 14171 12795 12743 12623 11440 10728 9748 9725 9382
## his but be have from are t by they year
## 7801 7755 7705 7541 7533 6882 6814 6623 6442 6364
# Remove stop words from the news stems
df_news <- as.data.frame(news_stem_words)
df_news$news_stems <- as.character(df_news$news_stems)
stopwords <- hunspell_parse(readLines('http://jeroenooms.github.io/files/stopwords.txt'))
news_stops <- df_news$news_stems %in% unlist(stopwords)
news_wcdata <- head(df_news[!news_stops,], 150)
print(news_wcdata, max = 40)
## news_stems Freq
## 52 were 3688
## 53 time 3660
## 69 what 3061
## 85 some 2531
## 87 A 2506
## 88 school 2505
## 101 could 2059
## 113 take 1903
## 115 now 1873
## 119 only 1756
## 120 million 1729
## 122 how 1695
## 125 way 1692
## 130 even 1625
## 135 show 1593
## 137 don 1560
## 138 good 1558
## 147 through 1489
## 150 down 1446
## 155 help 1401
## [ reached getOption("max.print") -- omitted 130 rows ]
# Word cloud for News
names(news_wcdata) <- c("news_word", "freq")
wordcloud2(news_wcdata)
twitter_words <- hunspell_parse(twitter_sample)
twitter_stems <- unlist(hunspell_stem(unlist(twitter_words)))
twitter_stem_words <- sort(table(twitter_stems), decreasing = TRUE)
print(head(twitter_stem_words, 30))
## twitter_stems
## the i I to a you and it for in is of on s my that t me be at
## 46768 45892 40040 39161 30890 30154 21902 21026 19378 18732 18118 18111 15984 15907 14664 13949 11497 10261 9439 9330
## have your with this so we are just m can
## 9220 8870 8507 8356 8354 8265 7895 7639 7158 6741
# Remove stop words from the Twitter stems
df_twitter <- as.data.frame(twitter_stem_words)
df_twitter$twitter_stems <- as.character(df_twitter$twitter_stems)
stopwords <- hunspell_parse(readLines('http://jeroenooms.github.io/files/stopwords.txt'))
twitter_stops <- df_twitter$twitter_stems %in% unlist(stopwords)
twitter_wcdata <- head(df_twitter[!twitter_stops,], 150)
print(twitter_wcdata, max = 40)
# Word cloud for Twitter
library(wordcloud2)
names(twitter_wcdata) <- c("twitter_word", "freq")
wordcloud2(twitter_wcdata)
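# ngrams() and words() come from the NLP package (attached with tm);
# the tokenizers below join consecutive words into 2- and 3-word phrases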
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
# Turn a term-document matrix into a frequency table sorted by count
freq_df <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_df <- data.frame(word=names(freq), freq=freq)
  return(freq_df)
}
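# removeSparseTerms() drops terms that appear in almost no documents
# (here, sparsity above 99.99%), which keeps the matrices manageable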
unigram <- removeSparseTerms(TermDocumentMatrix(corpus), 0.9999)
unigram_freq <- freq_df(unigram)
bigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)), 0.9999)
bigram_freq <- freq_df(bigram)
trigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer)), 0.9999)
trigram_freq <- freq_df(trigram)
# Frequency plot helper (the bar fill colour is passed as an argument)
freq_plot <- function(data, title, fill) {
  ggplot(data[1:25,], aes(reorder(word, -freq), freq)) +
    labs(x = "Words/Phrases", y = "Frequency") +
    ggtitle(title) +
    theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", color = "black", fill = fill, alpha = .4)
}
freq_plot(unigram_freq, "Top-25 Unigrams", "green")
freq_plot(bigram_freq, "Top-25 Bigrams", "blue")
freq_plot(trigram_freq, "Top-25 Trigrams", "orange")
With this exploratory data analysis we now understand the distribution of, and relationships between, the words, tokens, and phrases in the text.
We observed large differences between the frequency plots of the different n-gram models.
Stop words occur far more often than regular words, so managing stop words plays an important role in text mining.
Now that we have a basic understanding of the data and of how n-grams work, this knowledge will be very handy in further developing the predictive model.
The exploratory data analysis helped to validate and refine our expectations about the data, which is critical to lay the foundation for the predictive model.
The next step is to develop a Shiny app that takes an n-gram as input and predicts the next word with reasonable accuracy.
Performance will be an important consideration for the next stage of model development.
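As a pointer toward that model, here is a minimal sketch (an illustration, not the project's actual implementation) of how the n-gram frequency tables built above could drive a simple back-off next-word lookup; it assumes the bigram_freq and trigram_freq data frames from the previous section, and predict_next is a hypothetical helper name:
# Minimal back-off sketch: try trigrams whose first two words match the last
# two words typed, then back off to bigrams (assumes bigram_freq/trigram_freq
# are sorted by decreasing frequency, as built by freq_df above)
predict_next <- function(phrase, bigram_freq, trigram_freq) {
  tokens <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(tokens) == 2) {
    prefix <- paste0(paste(tokens, collapse = " "), " ")
    hits <- trigram_freq[startsWith(as.character(trigram_freq$word), prefix), ]
    if (nrow(hits) > 0)
      return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  }
  # back off to bigrams whose first word matches the last word typed
  prefix <- paste0(tail(tokens, 1), " ")
  hits <- bigram_freq[startsWith(as.character(bigram_freq$word), prefix), ]
  if (nrow(hits) > 0)
    return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  NA_character_  # no matching n-gram found
}
# Example usage (hypothetical input):
# predict_next("one of the", bigram_freq, trigram_freq)
A real model would also need smoothing and better handling of unseen prefixes, which is where the validated expectations from this analysis come in.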
More information is available from the website here: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
The dataset used in this project can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Word cloud and histogram for unigram, bigram, and trigram occurrences using an n-gram tokenizer, source: https://gist.github.com/nonsleepr/0c1d7f1bdd0953dabf2f
http://stackoverflow.com/questions/37817975/error-in-rweka-in-r-package