The goal of this assignment was to become familiar with the provided text files and to start using natural language processing libraries in R. The first step is to read the data from the files.
library(NLP)       # ngrams() and words(), used by the tokenizers below
library(tm)        # VCorpus() and TermDocumentMatrix()
library(wordcloud) # word cloud plot
text_news_path <- '/Users/mehipour/Documents/R-working directory/Coursera-R Programming/Course 10/final/en_US/en_US.news.txt'
text_blog_path <- '/Users/mehipour/Documents/R-working directory/Coursera-R Programming/Course 10/final/en_US/en_US.blogs.txt'
text_twitter_path <- '/Users/mehipour/Documents/R-working directory/Coursera-R Programming/Course 10/final/en_US/en_US.twitter.txt'
text_news <- readLines(text_news_path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
text_blog <- readLines(text_blog_path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
text_twitter <- readLines(text_twitter_path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
The second step was to examine the file size and the number of lines and words in each file, summarizing this information in a data frame:
b_to_Mb = 1/1024/1024
# file size
news_file_size <- file.info(text_news_path)$size * b_to_Mb
blog_file_size <- file.info(text_blog_path)$size * b_to_Mb
twitter_file_size <- file.info(text_twitter_path)$size * b_to_Mb
# how many lines
news_line_count = length(text_news)
blog_line_count = length(text_blog)
twitter_line_count = length(text_twitter)
# how many words (split each line on whitespace and count the tokens)
news_word_count = sum(lengths(strsplit(text_news, "\\s+")))
blog_word_count = sum(lengths(strsplit(text_blog, "\\s+")))
twitter_word_count = sum(lengths(strsplit(text_twitter, "\\s+")))
files_df = data.frame(C1 =c(news_file_size, blog_file_size, twitter_file_size),
C2 =c(news_line_count, blog_line_count, twitter_line_count),
C3 =c(news_word_count, blog_word_count, twitter_word_count))
colnames(files_df) = c('File Size (Mb)','Line Count','Word Count')
rownames(files_df) = c('News','Blogs','Twitter')
print(files_df)
## File Size (Mb) Line Count Word Count
## News 196.2775 1010242 1010242
## Blogs 200.4242 899288 899288
## Twitter 159.3641 2360148 2360148
Given the large number of words in the data and the limited RAM on my personal computer, I randomly selected 2,000 lines from each text file and combined them into a single sample vector:
n_sample = 2000
set.seed(1234)
news_sample = sample(text_news, n_sample, replace = FALSE)
blog_sample = sample(text_blog, n_sample, replace = FALSE)
twitter_sample = sample(text_twitter, n_sample, replace = FALSE)
text_sample = c(news_sample, blog_sample, twitter_sample)
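As a rough check on the memory concern mentioned above, the size of the combined sample can be inspected with object.size(); this is only an illustrative check, and the reported value depends on the lines that were sampled.
# rough check of the sample's memory footprint (value depends on the sampled lines)
format(object.size(text_sample), units = "Mb")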
To perform text prediction, we have to assess sequences of words and their likelihood. To do that, we parse the text into 1-gram, 2-gram, and 3-gram sequences. For that I first wrote the following tokenizer functions:
# build a corpus from the sampled lines
corpus <- VCorpus(VectorSource(text_sample))
# n-gram tokenizers: split each document into words and paste consecutive n-word sequences
tokenizer1 <- function(x)
  unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
tokenizer2 <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tokenizer3 <- function(x)
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
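As a quick illustration of what these tokenizers produce, ngrams() applied to a toy word vector (made up here purely for illustration) returns every contiguous pair, which is then pasted into a single string:
# toy example: 2-grams of a short word vector
unlist(lapply(ngrams(c("to", "be", "or", "not", "to", "be"), 2),
              paste, collapse = " "), use.names = FALSE)
# yields "to be", "be or", "or not", "not to", "to be"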
Let us have a look at the ten most frequent words in our sample data using a bar chart.
## 1-gram corpus and word count
sorted_with_frequency1 <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = tokenizer1)))), decreasing=TRUE)
barplot(sorted_with_frequency1[1:10], main = "Most Frequent 1-gram Sequence of Words", xlab="Sequence of Words", ylab = "Count", las=2)
A commonly used approach to qualitatively highlight the more frequently used words in a text is a word cloud, which shows information similar to the bar plot.
df1 <- data.frame(word=names(sorted_with_frequency1), freq=sorted_with_frequency1)
wordcloud(df1$word, df1$freq, min.freq= 250)
We see that the words “the” and “and” are used considerably more often than any other word. This is not surprising! Now let’s produce similar bar plots for the 2-gram and 3-gram sequences. We will not be looking at word clouds for these two:
# 2-gram
sorted_with_frequency2 <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = tokenizer2)))), decreasing=TRUE)
barplot(sorted_with_frequency2[1:10], main = "Most Frequent 2-gram Sequence of Words", xlab="Sequence of Words", ylab = "Count", las=2)
# 3-gram
sorted_with_frequency3 <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = tokenizer3)))), decreasing=TRUE)
barplot(sorted_with_frequency3[1:5], main = "Most Frequent 3-gram Sequence of Words", xlab="Sequence of Words", ylab = "Count", las=2)
Interestingly, in our sample almost all of the most frequent 2-gram sequences pair the word “the” with another word. Given the small size of the text sample, it is perhaps difficult to make predictions from the 3-gram sequences, so a more comprehensive analysis on a larger sample will be useful.
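One quick way to gauge how sparse the n-grams are in this small sample is to compare how many distinct 1-, 2-, and 3-grams were actually observed (the values depend on the random sample, so this is shown only as a check one could run):
# number of distinct 1-, 2- and 3-grams observed in the sample
c(length(sorted_with_frequency1), length(sorted_with_frequency2), length(sorted_with_frequency3))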
Here I demonstrated a simple and quick overview of the techniques I will use to assess how frequently words and sequences of words appear. Note that I did not perform much data cleaning at this stage. For the prediction phase, however, I will remove periods and commas and convert all text to lower case to simplify prediction.
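A minimal sketch of that cleaning step using tm’s standard transformations might look like the following (not applied to the results above; corpus_clean is only illustrative):
# sketch of the planned cleaning: lower-case, strip punctuation, collapse whitespace
corpus_clean <- tm_map(corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)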
The prediction algorithm will work by determining which word most frequently (or most likely) follows a given word or sequence of words. To develop it, I will first train the algorithm on a much larger dataset. I then plan to build a Shiny app with a text box where the user can type; after pressing space, the app will suggest the word most likely to be typed next in a separate box beside the input.
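As a toy illustration of the bigram-based lookup this will build on, the sorted 2-gram frequencies computed above can already be queried for the most frequent continuation of a given word (predict_next is a hypothetical helper written only for this sketch, not the final algorithm):
# toy bigram lookup: return the most frequent word that follows `word` in the sample
predict_next <- function(word, bigram_freq = sorted_with_frequency2) {
  hits <- bigram_freq[grepl(paste0("^", word, " "), names(bigram_freq))]
  if (length(hits) == 0) return(NA_character_)
  # bigram_freq is sorted by decreasing frequency, so the first hit is the best guess
  strsplit(names(hits)[1], " ")[[1]][2]
}
predict_next("in")  # e.g. might return "the" for this sample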