Introduction

The goal of this assignment was to familiarize the students with the text files provided and to start using the natural language processing libraries. The first step is to read the data from the files.

text_news_path <- '/Users/mehipour/Documents/R-working directory/Coursera-R Programming/Course 10/final/en_US/en_US.news.txt'
text_blog_path <- '/Users/mehipour/Documents/R-working directory/Coursera-R Programming/Course 10/final/en_US/en_US.blogs.txt'
text_twitter_path <- '/Users/mehipour/Documents/R-working directory/Coursera-R Programming/Course 10/final/en_US/en_US.twitter.txt'
text_news <- readLines(text_news_path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
text_blog <- readLines(text_blog_path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
text_twitter <- readLines(text_twitter_path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
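
The corpus, term-document matrix, and word cloud functions used later in this report come from the tm package (which attaches its NLP dependency) and the wordcloud package. A minimal setup sketch, assuming these packages are installed, so that the later chunks run as written:

# packages used for tokenization, term-document matrices and word clouds below
library(tm)        # VCorpus, TermDocumentMatrix; attaches NLP for ngrams() and words()
library(wordcloud) # wordcloud()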

Exploratory Data Analysis

File Assessment

The second step was to examine the file size and the number of lines and words in each file. This information was summarized in a data frame:

b_to_Mb = 1/1024/1024
# file size
news_file_size <- file.info(text_news_path)$size * b_to_Mb
blog_file_size <- file.info(text_blog_path)$size * b_to_Mb
twitter_file_size <- file.info(text_twitter_path)$size * b_to_Mb
# how many lines
news_line_count = length(text_news)
blog_line_count = length(text_blog)
twitter_line_count = length(text_twitter)
# how many words (split each line on whitespace and count the pieces)
news_word_count = sum(lengths(strsplit(text_news, "\\s+")))
blog_word_count = sum(lengths(strsplit(text_blog, "\\s+")))
twitter_word_count = sum(lengths(strsplit(text_twitter, "\\s+")))

files_df = data.frame(C1 =c(news_file_size, blog_file_size, twitter_file_size),
                      C2 =c(news_line_count, blog_line_count, twitter_line_count),
                      C3 =c(news_word_count, blog_word_count, twitter_word_count))
colnames(files_df) = c('File Size (Mb)','Line Count','Word Count')
rownames(files_df) = c('News','Blogs','Twitter')
print(files_df)
##         File Size (Mb) Line Count Word Count
## News          196.2775    1010242    1010242
## Blogs         200.4242     899288     899288
## Twitter       159.3641    2360148    2360148
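
As a more compact way to build the same summary (a sketch reusing the vectors and paths defined above; the names paths, texts and files_df2 are illustrative), the three statistics can be computed by iterating over a named list instead of repeating each line per file:

# sketch: build the same summary table without repeating each statistic three times
paths <- c(News = text_news_path, Blogs = text_blog_path, Twitter = text_twitter_path)
texts <- list(News = text_news, Blogs = text_blog, Twitter = text_twitter)
files_df2 <- data.frame(
    `File Size (Mb)` = sapply(paths, function(p) file.info(p)$size / 1024^2),
    `Line Count`     = sapply(texts, length),
    `Word Count`     = sapply(texts, function(x) sum(lengths(strsplit(x, "\\s+")))),
    check.names = FALSE)
print(files_df2)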

Generating A Sample Dataset

Given the large number of words in the data and the limited RAM available on my personal computer, I randomly selected 2,000 lines from each text file and combined them into a single sample vector:

n_sample = 2000
set.seed(1234)
news_sample = sample(text_news, n_sample, replace = FALSE)
blog_sample = sample(text_blog, n_sample, replace = FALSE)
twitter_sample = sample(text_twitter, n_sample, replace = FALSE)
text_sample = c(news_sample, blog_sample, twitter_sample)
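
If more memory were available, a proportional sample could be drawn instead of a fixed number of lines. A sketch of that alternative, using a hypothetical 1% sampling fraction (sample_fraction, keep_lines and text_sample_frac are illustrative names):

# sketch: keep each line with a fixed probability instead of drawing a fixed count
sample_fraction <- 0.01
set.seed(1234)
keep_lines <- function(x, frac) x[as.logical(rbinom(length(x), 1, frac))]
text_sample_frac <- c(keep_lines(text_news, sample_fraction),
                      keep_lines(text_blog, sample_fraction),
                      keep_lines(text_twitter, sample_fraction))
length(text_sample_frac)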

Tokenization

To perform text prediction, we have to assess sequences of words and how likely they are. To do that, we have to parse the text into 1-gram, 2-gram and 3-gram sequences. For that I first wrote the following functions:

# build a corpus from the sampled lines
corpus <- VCorpus(VectorSource(text_sample))
# n-gram tokenizers to pass to TermDocumentMatrix (1-, 2- and 3-word sequences)
tokenizer1 <- function(x) 
    unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
tokenizer2 <- function(x) 
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tokenizer3 <- function(x) 
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
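
As a quick illustration of what these tokenizers build on, ngrams() from the NLP package turns a sequence of words into overlapping word windows. A small sketch on a made-up word vector (toy_words is illustrative):

# sketch: 2-grams of a hand-made word vector
toy_words <- c("the", "quick", "brown", "fox")
unlist(lapply(ngrams(toy_words, 2), paste, collapse = " "), use.names = FALSE)
## expected: "the quick" "quick brown" "brown fox"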

Let us have a look at the ten most frequent words in our sample data using a bar chart.

# 1-gram term-document matrix and word counts
tdm1 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer1))
sorted_with_frequency1 <- sort(rowSums(as.matrix(tdm1)), decreasing = TRUE)
barplot(sorted_with_frequency1[1:10], main = "Most Frequent 1-gram Sequence of Words",
        xlab = "Sequence of Words", ylab = "Count", las = 2)

A commonly used approach to qualitatively highlight the more frequently used words in a text is a word cloud, which shows information similar to the bar plot.

df1 <- data.frame(word=names(sorted_with_frequency1), freq=sorted_with_frequency1)
wordcloud(df1$word, df1$freq, min.freq= 250)

We see that the words “the” and “and” are used considerably more often than any other word. This is not surprising! Now let’s make similar bar plots for the 2-gram and 3-gram sequences. We will not be looking at word clouds for these two:

# 2-gram
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer2))
sorted_with_frequency2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
barplot(sorted_with_frequency2[1:10], main = "Most Frequent 2-gram Sequence of Words",
        xlab = "Sequence of Words", ylab = "Count", las = 2)

# 3-gram
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer3))
sorted_with_frequency3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)
barplot(sorted_with_frequency3[1:5], main = "Most Frequent 3-gram Sequence of Words",
        xlab = "Sequence of Words", ylab = "Count", las = 2)

Interestingly, we see that in our sample almost all of the most frequently used 2-gram sequences include the word “the” along with another word. Given the small size of the text sample, it is probably difficult to make a prediction from the 3-gram sequences, so I believe a more comprehensive analysis on a larger sample will be useful.
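
As a first glimpse of how these frequency counts could feed a predictor, the 2-gram table can be filtered for sequences that start with a chosen word. A sketch, with a hypothetical starting word:

# sketch: most frequent 2-grams in the sample that begin with a chosen word
start_word <- "in"   # hypothetical example word
matches <- grep(paste0("^", start_word, " "), names(sorted_with_frequency2), value = TRUE)
head(sorted_with_frequency2[matches], 5)   # vector is already sorted, so the top matches come first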

Summary of Observations

Here I gave a quick overview of the techniques that I will be using to assess how frequently words and sequences of words appear. Notice that I did not perform much data cleaning, since that step is more involved and is left for the next phase. For the prediction phase I will be removing periods and commas and will ensure that all text is lower case for ease of prediction.
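
A sketch of the kind of cleaning I have in mind for that phase, using tm's standard transformations on the corpus built above (corpus_clean is an illustrative name; these steps were not applied to the analysis shown here):

# sketch: planned cleaning steps, not applied to the analysis above
corpus_clean <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
corpus_clean <- tm_map(corpus_clean, removePunctuation)       # drop periods, commas, etc.
corpus_clean <- tm_map(corpus_clean, stripWhitespace)         # collapse extra spaces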

Plan of Action

The prediction algorithm will work based on which word most frequently (or most likely) follows a given word or sequence of words. To develop it, I will first train the algorithm on a much larger dataset. I plan to build a Shiny app with a text box where the user can type; after pressing the space bar, the app will suggest the word most likely to be typed next in a separate box right next to where the user is typing.
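
A minimal skeleton of what that app could look like (layout only; the names ui, server, "typed" and "suggestion" are illustrative, and the placeholder logic will be replaced by the trained n-gram model):

# sketch: skeleton of the planned Shiny app; prediction logic to be plugged in later
library(shiny)
ui <- fluidPage(
    textInput("typed", "Type your text here:"),
    textOutput("suggestion")   # suggested next word shown next to the input
)
server <- function(input, output) {
    output$suggestion <- renderText({
        words_typed <- strsplit(trimws(input$typed), "\\s+")[[1]]
        if (length(words_typed) == 0) {
            ""
        } else {
            # placeholder: the trained n-gram model will map the last word to a suggestion
            paste("suggestion for:", tail(words_typed, 1))
        }
    })
}
# shinyApp(ui, server)   # not run inside the report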