The goal of this report is to demonstrate that I have become comfortable working with the data and that I am on track to create my prediction algorithm. This report will address the following:
The data files are assumed to be in the working directory. Since they are very large, I have taken a randomized 10% sample from each source (news, blogs and Twitter) and explored it.
news <- readLines("en_US.news_mod.txt", encoding="UTF-8")
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8")
suppressWarnings(twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8"))
# For modelling purpose, I sampled 10% of the data from each source
set.seed(1234)
news <- news[as.logical(rbinom(length(news), 1, 0.1))]
blogs <- blogs[as.logical(rbinom(length(blogs), 1, 0.1))]
twitter <- twitter[as.logical(rbinom(length(twitter), 1, 0.1))]
# Remove decimal numbers (e.g. 5.12) so that the decimal point is not
# confused with a sentence-ending period during sentence splitting
news <- gsub("[0-9]+\\.[0-9]+", " ", news)
blogs <- gsub("[0-9]+\\.[0-9]+", " ", blogs)
twitter <- gsub("[0-9]+\\.[0-9]+", " ", twitter)
# Split Sentences
split_sentences <- function(text) {
  sentence_list <- strsplit(text, split = ";|\\.|!|\\?")
  sentences <- unlist(sentence_list)
  return(sentences)
}
news <- split_sentences(news)
blogs <- split_sentences(blogs)
twitter <- split_sentences(twitter)
# Line Count
newsLines <- length(news)
blogsLines <- length(blogs)
twitterLines <- length(twitter)
# Keep only alphabetic characters; replace everything else with a space
news <- gsub("[^[:alpha:]]", " ", news)
blogs <- gsub("[^[:alpha:]]", " ", blogs)
twitter <- gsub("[^[:alpha:]]", " ", twitter)
# Save the sampled data to text files for later use in creating n-grams
# For n-gram creation, stopwords and short words such as "of", "it" and "on" will be kept in the Corpus
writeLines(news, "news.txt")
writeLines(blogs, "blogs.txt")
writeLines(twitter, "twitter.txt")
# For exploratory analysis, remove 1-2 letter words and collapse extra whitespace
news <- gsub("\\W*\\b\\w{1,2}\\b", " ", news)
blogs <- gsub("\\W*\\b\\w{1,2}\\b", " ", blogs)
twitter <- gsub("\\W*\\b\\w{1,2}\\b", " ", twitter)
news <- gsub("\\s+", " ", news)
blogs <- gsub("\\s+", " ", blogs)
twitter <- gsub("\\s+", " ", twitter)
# Word Count
word_count <- function(text) {
  words <- sum(sapply(strsplit(text, split = " "), length))
  return(words)
}
newsWords <- word_count(news)
blogsWords <- word_count(blogs)
twitterWords <- word_count(twitter)
# Save the cleaned sampled data in text files for building the corpus used in the exploratory analysis
# Create the Sample directory if it does not already exist
if (!dir.exists("./Sample")) dir.create("./Sample")
writeLines(news, "./Sample/news.txt")
writeLines(blogs, "./Sample/blogs.txt")
writeLines(twitter, "./Sample/twitter.txt")
# Loading Libraries
library(quanteda)
## quanteda version 0.9.6.9
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:base':
##
## sample
# Prepare Corpus for Exploratory Analysis
mytf <- textfile(list.files(path = "./Sample", pattern = "\\.txt$", full.names = TRUE, recursive = TRUE))
myCorpus <- corpus(mytf)
myCorpus
## Corpus consisting of 3 documents.
# Clean the corpus: filter profanity, "will" and English stopwords; stem; remove URLs (http(s)://) and separators
suppressWarnings(profane <- read.csv("swearWords.csv"))
profane <- tolower(colnames(profane))
cleanCorpusDFM <- dfm(myCorpus, stem = TRUE, removeURL = TRUE, ignoredFeatures = c("will", stopwords("english"), profane), language = "english", removeSeparators = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 159,256 feature types
## ... removed 162 features, from 252 supplied (glob) feature types
## ... stemming features (English), trimmed 36666 feature variants
## ... created a 3 x 122429 sparse dfm
## ... complete.
## Elapsed time: 24.66 seconds.
cleanCorpusDFM
## Document-feature matrix of: 3 documents, 122,429 features.
# Top 20 most frequent words
topfeatures(cleanCorpusDFM, 20)
## can one said just like get time year day make love new
## 32538 31764 30480 30411 30246 30101 26742 24200 23225 20878 20273 19511
## good know now work don peopl say want
## 18927 18295 18056 17778 16591 16569 16284 16089
# Plot DFM
# Loading Library
library(RColorBrewer)
# Word-Cloud
plot(cleanCorpusDFM, max.words = 100, colors = brewer.pal(6, "Dark2"), scale = c(3.5, 0.5))
## [1] "Summary Statistics for Word Count and Line Count"
## Source Line_Count Word_Count
## 1 Blogs 278802 3089899
## 2 News 243724 2930799
## 3 Twitter 531319 2520348
## 4 Combined 1053845 8541046
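For reference, a summary table like the one above can be assembled directly from the line and word counts computed earlier; the sketch below is illustrative and not necessarily the exact code that produced the printed output.
# Assemble the summary table from the counts computed above (illustrative sketch)
summaryStats <- data.frame(Source = c("Blogs", "News", "Twitter", "Combined"),
                           Line_Count = c(blogsLines, newsLines, twitterLines,
                                          blogsLines + newsLines + twitterLines),
                           Word_Count = c(blogsWords, newsWords, twitterWords,
                                          blogsWords + newsWords + twitterWords))
print(summaryStats)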
## [1] "Data Set for Frequency of Words"
## Word Blogs News Twitter Combined
## 1 anyway 753 123 390 1266
## 2 go 3882 3121 5637 12640
## 3 share 1640 1199 1035 3874
## 4 home 2980 3545 2584 9109
## 5 decor 332 148 52 532
## df.Word df.Blogs
## 91 one 13748
## 175 can 11909
## 176 like 11187
## 262 time 10707
## 159 just 10103
## 16 get 9489
## 186 make 8212
## 190 day 7279
## 586 year 7024
## 272 know 6805
## df.Word df.News
## 192 said 25002
## 586 year 12846
## 91 one 9224
## 175 can 7145
## 262 time 7136
## 700 new 7107
## 1775 state 6759
## 452 say 6302
## 772 two 6243
## 176 like 6062
## df.Word df.Twitter
## 159 just 15152
## 16 get 14578
## 175 can 13484
## 1322 thank 13244
## 176 like 12997
## 199 love 12545
## 190 day 11195
## 116 good 10385
## 262 time 8899
## 91 one 8792
## df.Word df.Combined
## 175 can 32538
## 91 one 31764
## 192 said 30480
## 159 just 30411
## 176 like 30246
## 16 get 30101
## 262 time 26742
## 586 year 24200
## 190 day 23225
## 186 make 20878
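The per-source frequency tables above come from a data frame of word counts per document. The sketch below shows one way such a table could be built from the cleaned DFM; it is illustrative only, and the frequency columns take their names from the document names in the corpus.
# Build a word-frequency data frame from the DFM (illustrative sketch;
# the frequency columns are named after the corpus document names)
freqMat <- t(as.matrix(cleanCorpusDFM))   # rows = words, columns = documents
df <- data.frame(Word = rownames(freqMat), freqMat, Combined = rowSums(freqMat), row.names = NULL)
# Top 10 words overall
head(df[order(df$Combined, decreasing = TRUE), ], 10)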
An n-gram is a contiguous sequence of n words. It helps in predicting the next word, as required by the project's algorithm, and in understanding how words are grouped together and which combinations are possible for a word or group of words entered by the user. N-gram data from the sampled corpora will be used to train the model for predicting the next word. N can be 2, 3, 4, 5 and so on; I will collect n-grams up to n = 5 for my project. So far I have created and explored 2-grams and 3-grams, and I intend to extend this to 4-grams and 5-grams for the final project.
There are 3,042,089 2-grams identified in one or more of the three source samples, and 140,974 2-grams are common to all three. The figure below shows the top 25 most frequently used 2-grams overall, ranked by frequency.
There are 6,091,208 3-grams identified in one or more of the three source samples, and 98,842 3-grams are common to all three. The figure below shows the top 25 most frequently used 3-grams overall, ranked by frequency.
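As a point of reference, the following is a minimal base-R sketch of how 2-grams can be counted from the saved sample files; it is illustrative only and not the exact code used to produce the counts above (the helper name count_ngrams is my own).
# Count n-grams in a character vector of sentences (illustrative helper)
count_ngrams <- function(lines, n = 2) {
  tokens <- strsplit(tolower(lines), "\\s+")
  ngrams <- unlist(lapply(tokens, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1), function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}
# Example: top 25 2-grams in the sampled news data
# head(count_ngrams(readLines("news.txt"), n = 2), 25)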
I encountered the following interesting findings while working on the project so far:
I am still developing my final plans; however, the following is a glimpse of what I have planned so far:
So far I have worked on a 10% randomized sample from each source: blogs, news and Twitter. I intend to increase my coverage and make use of more of the available data to train my model. To do this, I will either use the bootstrapping techniques taught in the Statistical Inference class, or repeatedly draw further 10% samples from the remaining data, process them, and combine the results into the n-gram tables.
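A rough sketch of the second-pass sampling idea is shown below, assuming the same rbinom() selection used earlier; the variable names are placeholders and the blogs file is used only as an example.
# Illustrative second-pass sampling: draw another 10% from the lines
# that were not selected in the first pass
allLines <- readLines("en_US.blogs.txt", encoding = "UTF-8")
set.seed(1234)
firstPick <- as.logical(rbinom(length(allLines), 1, 0.1))
remaining <- allLines[!firstPick]
secondPick <- remaining[as.logical(rbinom(length(remaining), 1, 0.1))]
# secondPick can then be cleaned in the same way and its n-grams combined with the existing tables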
If the user has entered three or more words, I will search the 5-grams using the last four words of the input (if the input has more than four words) and display the result if a match is found. Similarly, for 4-grams I will match the last three words and display the fourth word of the matched 4-gram. If I get more than one hit in the 4- or 5-grams, I will go with the most frequent one.
For inputs of fewer than three words, I will search the 2-grams and 3-grams using the last word and the last two words of the input, respectively. If I get hits from both, I will use a weighted average of the corresponding frequencies and display the next best option according to my model. The weights for this weighted average will be calculated beforehand using the cross-validation techniques taught in the Practical Machine Learning course.
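The sketch below illustrates the higher-order part of this back-off idea. The n-gram lookup tables are assumed, purely for illustration, to be data frames with columns prefix, next_word and freq; this is not necessarily how I will store them in the final model.
# Hypothetical back-off lookup over 5-gram and 4-gram tables
# (assumed structure: data frames with columns prefix, next_word, freq)
predict_next <- function(input, ngram5, ngram4) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  if (length(words) >= 4) {
    key <- paste(tail(words, 4), collapse = " ")
    hits <- ngram5[ngram5$prefix == key, ]
    if (nrow(hits) > 0) return(hits$next_word[which.max(hits$freq)])
  }
  if (length(words) >= 3) {
    key <- paste(tail(words, 3), collapse = " ")
    hits <- ngram4[ngram4$prefix == key, ]
    if (nrow(hits) > 0) return(hits$next_word[which.max(hits$freq)])
  }
  NA_character_   # fall back to the 2-/3-gram weighted lookup described above
}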
I will then use the Developing Data Products class notes to develop my Shiny app, which will display next-word options for a given input string.
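A very rough sketch of how the Shiny app might be wired up is shown below; predict_next() refers to the hypothetical back-off helper sketched above, and the layout is just a placeholder.
# Minimal Shiny prototype (illustrative only)
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction (prototype)"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predict_next(input$phrase, ngram5, ngram4)   # hypothetical n-gram tables
  })
}
# shinyApp(ui, server)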
This is my first exposure to text analytics, and I confess that initially I found it a very intimidating task. The Capstone is very different from the projects I have done so far as part of this Specialization. However, once I started searching for and reading about the packages and the different ways to analyse text data, I became more and more interested in it. I hope I am able to do justice to the knowledge this course has imparted to me over the last six months. I would also like to thank you, the reader, for your patience in going through my report.