SYNOPSIS

The goal of this report is to demonstrate that I have become comfortable working with the data and that I am on track to create my prediction algorithm. This report will address the following:

  1. Demonstrate that I have downloaded the data and loaded it successfully.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings I have made so far.
  4. Provide my plans for creating a prediction algorithm and a Shiny app to gather feedback.

LOADING DATA

Assumption:

The data files are assumed to be in the working directory.

Reading and Sampling the data:

Since the data files are very large, I took a randomized 10% sample from each source (news, blogs, and Twitter) and explored it.

news <- readLines("en_US.news_mod.txt", encoding="UTF-8")
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8")
suppressWarnings(twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8"))

# For modelling purposes, sample 10% of the data from each source
set.seed(1234)
news <- news[as.logical(rbinom(length(news), 1, 0.1))]
blogs <- blogs[as.logical(rbinom(length(blogs), 1, 0.1))]
twitter <- twitter[as.logical(rbinom(length(twitter), 1, 0.1))]

PRE-PROCESSING DATA

  1. Remove decimal numbers (floats) so that the decimal point is not confused with a sentence-ending period.
  2. Split sentences and count the number of lines in each file.
  3. Remove punctuation and numbers from the data.
  4. Remove 1-2 letter words, e.g. the "s" left over from apostrophes, "I", "a", etc.
  5. Remove extra spaces.
  6. Count the words.
  7. Save the cleaned data to the "Sample" folder as text files (a precursor for creating the Corpus).
# Remove numbers containing a decimal point (e.g. 5.12) so that the decimal is not confused with a sentence-ending period during sentence splitting.
news <- gsub(news, pattern = "[0-9]+\\.[0-9]+", replacement = " ")
blogs <- gsub(blogs, pattern = "[0-9]+\\.[0-9]+", replacement = " ")
twitter <- gsub(twitter, pattern = "[0-9]+\\.[0-9]+", replacement = " ")

# Split Sentences
split_sentences <- function(text) {
  sentence_list <- strsplit(text, split = ";|\\.|!|\\?")
  sentences <- unlist(sentence_list)
  return(sentences)
}

news <- split_sentences(news)
blogs <- split_sentences(blogs)
twitter <- split_sentences(twitter)


# Line Count
newsLines <- length(news)
blogsLines <- length(blogs)
twitterLines <- length(twitter)


# Keep only alphabetical part
news <- gsub(news, pattern = "[^[:alpha:]]", replacement = " ")
blogs <- gsub(blogs, pattern = "[^[:alpha:]]", replacement = " ")
twitter <- gsub(twitter, pattern = "[^[:alpha:]]", replacement = " ")

# Save the sampled data in text files for future use in creating n-gram
# For creating n-grams, stopwords and 2-letter words like of, it, on, etc. will be kept in the Corpus
writeLines(news, "news.txt")
writeLines(blogs, "blogs.txt")
writeLines(twitter, "twitter.txt")

# Remove 1-2 letter words and extra spaces for exploratory purposes
news <- gsub(news, pattern="\\W*\\b\\w{1,2}\\b", replacement=" ")
blogs <- gsub(blogs, pattern="\\W*\\b\\w{1,2}\\b", replacement=" ")
twitter <- gsub(twitter, pattern="\\W*\\b\\w{1,2}\\b", replacement=" ")
news <- gsub(news, pattern="\\s+", replacement=" ")
blogs <- gsub(blogs, pattern="\\s+", replacement=" ")
twitter <- gsub(twitter, pattern="\\s+", replacement=" ")

# Word Count
word_count <- function(text) {
  words <- sum(sapply(strsplit(text, split = " "), length))
  return(words)
}
newsWords <- word_count(news)
blogsWords <- word_count(blogs)
twitterWords <- word_count(twitter)

# Save the sampled data in text files for creating Corpus for Exploratory analysis
writeLines(news, "./Sample/news.txt")
writeLines(blogs, "./Sample/blogs.txt")
writeLines(twitter, "./Sample/twitter.txt")

PREPARING CORPUS & DOCUMENT-FEATURE MATRIX

  1. Load the quanteda library.
  2. Prepare a Corpus from the text files in the Sample directory using the textfile and corpus functions.
  3. Look at the summary of the Corpus.
  4. Create a Document-Feature Matrix (DFM) for the Corpus while:
    1. Filtering profanity and English stopwords.
    2. Stemming, i.e. reducing words to their root so that, e.g., "happy" and "happiness" are counted as one.
    3. Removing URLs, i.e. text starting with http(s)://
    4. Removing separators such as extra spaces.
  5. Look at the DFM summary.
  6. Print the top 20 most frequent words.
  7. Finally, prepare a word cloud to give a preview of the words (size proportional to frequency of occurrence).
# Loading Libraries
library(quanteda)
## quanteda version 0.9.6.9
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:base':
## 
##     sample
# Prepare Corpus for Exploratory Analysis
mytf <- textfile(list.files(path = "./Sample", pattern = "\\.txt$", full.names = TRUE, recursive = TRUE))
myCorpus <- corpus(mytf)
myCorpus
## Corpus consisting of 3 documents.
# Cleaning Corpus: Filter profanity, "will", stopwords, do stemming, remove URL (http(s)://), separators
suppressWarnings(profane <- read.csv("swearWords.csv"))
profane <- tolower(colnames(profane))
cleanCorpusDFM <- dfm(myCorpus, stem = TRUE, removeURL = TRUE, ignoredFeatures = c("will", stopwords("english"), profane), language = "english", removeSeparators = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 159,256 feature types
##    ... removed 162 features, from 252 supplied (glob) feature types
##    ... stemming features (English), trimmed 36666 feature variants
##    ... created a 3 x 122429 sparse dfm
##    ... complete. 
## Elapsed time: 24.66 seconds.
cleanCorpusDFM
## Document-feature matrix of: 3 documents, 122,429 features.
# Top 20 frequent words
topfeatures(cleanCorpusDFM, 20)
##   can   one  said  just  like   get  time  year   day  make  love   new 
## 32538 31764 30480 30411 30246 30101 26742 24200 23225 20878 20273 19511 
##  good  know   now  work   don peopl   say  want 
## 18927 18295 18056 17778 16591 16569 16284 16089
# Plot DFM
# Loading Library
library(RColorBrewer)
# Word-Cloud
plot(cleanCorpusDFM, max.words = 100, colors = brewer.pal(6, "Dark2"), scale = c(3.5, 0.5))

SUMMARY STATISTICS OF DATA

  1. Number of lines and words for each source and for the combined data (see the sketch after this list for how the summary table is assembled).
  2. Histogram plots showing frequent words, overall and segregated by source.
  3. Top 10 frequent words in each source and in the combined data.
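
Below is a short sketch of how the line/word summary in Figure 1 could be assembled from the counts computed during pre-processing; the data frame name corpusInfo is an assumption, not necessarily what the report code uses.

corpusInfo <- data.frame(
  Source     = c("Blogs", "News", "Twitter", "Combined"),
  Line_Count = c(blogsLines, newsLines, twitterLines,
                 blogsLines + newsLines + twitterLines),
  Word_Count = c(blogsWords, newsWords, twitterWords,
                 blogsWords + newsWords + twitterWords)
)
print("Summary Statistics for Word Count and Line Count")
print(corpusInfo)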

Figure 1: Number of Lines and Words in the Sampled Data

## [1] "Summary Statistics for Word Count and Line Count"
##     Source Line_Count Word_Count
## 1    Blogs     278802    3089899
## 2     News     243724    2930799
## 3  Twitter     531319    2520348
## 4 Combined    1053845    8541046
## [1] "Data Set for Frequency of Words"
##     Word Blogs News Twitter Combined
## 1 anyway   753  123     390     1266
## 2     go  3882 3121    5637    12640
## 3  share  1640 1199    1035     3874
## 4   home  2980 3545    2584     9109
## 5  decor   332  148      52      532

Figure 2: Blogs Data summary: Plot & Table

##     df.Word df.Blogs
## 91      one    13748
## 175     can    11909
## 176    like    11187
## 262    time    10707
## 159    just    10103
## 16      get     9489
## 186    make     8212
## 190     day     7279
## 586    year     7024
## 272    know     6805

Figure 3: News Data summary: Plot & Table

##      df.Word df.News
## 192     said   25002
## 586     year   12846
## 91       one    9224
## 175      can    7145
## 262     time    7136
## 700      new    7107
## 1775   state    6759
## 452      say    6302
## 772      two    6243
## 176     like    6062

Figure 4: Twitter Data summary: Plot & Table

##      df.Word df.Twitter
## 159     just      15152
## 16       get      14578
## 175      can      13484
## 1322   thank      13244
## 176     like      12997
## 199     love      12545
## 190      day      11195
## 116     good      10385
## 262     time       8899
## 91       one       8792

Figure 5: Combined Data summary: Plot & Table

##     df.Word df.Combined
## 175     can       32538
## 91      one       31764
## 192    said       30480
## 159    just       30411
## 176    like       30246
## 16      get       30101
## 262    time       26742
## 586    year       24200
## 190     day       23225
## 186    make       20878

N-GRAMS

An n-gram is a contiguous sequence of n words. N-grams help in predicting the next word, as required by the project algorithm, and they help in understanding how words are grouped together and what combinations are available for a word or group of words entered by the user. N-gram data from the sampled corpora will be used to train the model for predicting the next word. The n in n-gram can be 2, 3, 4, 5, and so on; I will collect n-grams up to n = 5 for this project. So far I have created and explored 2-grams and 3-grams, and I intend to extend this to 4-grams and 5-grams for the final project.
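
To make the n-gram construction concrete, here is a minimal base-R sketch (not the exact code used for this report) that builds n-grams from a character vector of cleaned sentences; the helper name make_ngrams is my own.

make_ngrams <- function(sentences, n = 2) {
  unlist(lapply(strsplit(sentences, "\\s+"), function(words) {
    words <- words[words != ""]                # drop empty tokens
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
  }), use.names = FALSE)
}

# Example: the 10 most frequent 2-grams in the sampled news data
news_bigrams <- make_ngrams(news, n = 2)
head(sort(table(news_bigrams), decreasing = TRUE), 10)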

SUMMARY STATISTICS FOR 2- AND 3-GRAMS

There are 3,042,089 2-grams identified in one or more of the three source samples, and 140,974 2-grams are common to all three. The figure below shows the top 25 most frequently used 2-grams overall, ranked by frequency.

There are 6,091,208 3-grams identified in one or more of the three source samples, and 98,842 3-grams are common to all three. The figure below shows the top 25 most frequently used 3-grams overall, ranked by frequency.
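
The shared-n-gram counts above could, for example, be computed along these lines, assuming news_bigrams, blogs_bigrams and twitter_bigrams are character vectors of 2-grams built with the make_ngrams sketch from the previous section.

all_bigrams    <- union(news_bigrams, union(blogs_bigrams, twitter_bigrams))
common_bigrams <- Reduce(intersect, list(news_bigrams, blogs_bigrams, twitter_bigrams))
length(all_bigrams)     # unique 2-grams found in at least one source
length(common_bigrams)  # 2-grams common to all three sources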

INTERESTING FINDINGS

I encountered the following interesting findings while working on the project so far:

  1. "en_US.news.txt" could not be fully read. When I checked the last element of the character vector that was read in, I found a right-arrow symbol at three places in the original file, which caused the readLines function to stop reading when it encountered an arrow. After removing those 3 arrows I could read the whole file, and I saved the modified text file as "en_US.news_mod.txt".
  2. The whole sample could not be tokenized at once, so I split the data into smaller vectors and ran the ngram function on them (see the sketch after this list). Processing was much faster (the elapsed time was much smaller) as the vector size decreased, even though the number of splits increased.
  3. Misspelled and abbreviated words like "aa", "zzzd", "yoursss", "plssss", "txt" were found.
  4. The write.csv function writes "x" as the first element (the default column header) when saving an unnamed vector.
  5. Foreign words made it through pre-processing.
  6. The ngram function works much faster on character vectors than on a Corpus, so I read the saved cleaned data back in, repeated some of the pre-processing that the dfm function had applied to the Corpus object, and then operated on the character vectors.
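
The chunking approach referenced in finding 2 can be sketched as follows; this is a simplified illustration that reuses the hypothetical make_ngrams() helper from the N-GRAMS section rather than the actual ngram call.

chunk_ngrams <- function(text, n = 2, chunk_size = 10000) {
  # Split the character vector into pieces of at most chunk_size lines
  chunks <- split(text, ceiling(seq_along(text) / chunk_size))
  # Build n-grams piece by piece and combine the results
  unlist(lapply(chunks, make_ngrams, n = n), use.names = FALSE)
}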

PLAN OF ACTION

I am still developing my final plans; however, the following is a glimpse of what I have planned so far:

1. Develop a methodology to access more of the data

So far I have worked on a 10% randomized sample from each source: blogs, news, and Twitter. I intend to increase my reach and make use of more of the available data to train my model. For this I will either use bootstrapping techniques taught in the Statistical Inference class, or I will repeatedly take a further 10% sample from the remaining data, process it, and combine the results into the n-gram tables (a minimal sketch of this second option follows).
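
A minimal sketch of the second option, assuming news here holds the full (unsampled) source vector; the same pattern would apply to blogs and twitter.

set.seed(1234)
remaining <- seq_along(news)          # indices not yet sampled
batches <- list()
for (i in 1:3) {
  # Take 10% of the lines that have not been sampled yet
  take <- sample(remaining, size = round(0.1 * length(remaining)))
  batches[[i]] <- news[take]
  remaining <- setdiff(remaining, take)
}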

2. Create a Prediction Algorithm

If the user has entered 3 or more words, I will search for the last 4 words of the input (if the input is longer than 4 words) in the 5-grams and display the result if a match is found. Similarly, for 4-grams I will match the last 3 words and display the 4th word of the matched 4-gram. If I get more than one hit in the 4- or 5-grams, I will go with the most frequent one.
For inputs of fewer than 3 words, I will search the 2-grams and 3-grams using the last word and the last 2 words of the input, respectively. If I get hits from both, I will use a weighted average of the corresponding frequencies and display the next best option according to my model. The weights for this weighted average will be calculated beforehand using cross-validation techniques taught in the Practical Machine Learning course.
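
Below is a simplified sketch of this lookup: it always prefers the longest available context and returns the most frequent match, and it omits the weighted-average step for short inputs. The table layout (a named list of data frames with columns prefix, next_word and freq, one per n) is an assumption, not the final design.

predict_next <- function(input, ngram_tables) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  words <- words[words != ""]
  for (n in c(5, 4, 3, 2)) {
    k <- n - 1                                      # words of context needed
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    tab <- ngram_tables[[as.character(n)]]
    hits <- tab[tab$prefix == prefix, ]
    if (nrow(hits) > 0) {
      return(hits$next_word[which.max(hits$freq)])  # most frequent match
    }
  }
  NA_character_                                     # no match in any table
}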

3. Shiny App

I will then use the Developing Data Products class notes to develop my Shiny app so that it efficiently displays next-word options for a given input string.
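
A minimal Shiny sketch (an assumed layout, not the final app) wiring a text input to the predict_next() sketch above; ngram_tables is the same assumed lookup structure.

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predict_next(input$phrase, ngram_tables)
  })
}

shinyApp(ui = ui, server = server)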

Issues to Overcome

  • Misspellings, abbreviations, and foreign words need to be corrected or excluded.
  • The model needs to execute promptly and efficiently.
  • I need to determine which of the two methods described in point 1 above works better.

CONCLUSION

This is my first exposure to text analytics, and I confess that I initially found it a very intimidating task. The Capstone is very different from the projects I have done so far as part of this Specialization. However, once I started searching for and reading about the packages and the different ways to analyse text data, I became more and more interested in it. I hope I am able to do justice to the knowledge this course has imparted to me over the last six months. I would also like to thank you, the reader, for your patience in going through my report.