Introduction:

This is an exploratory data analysis of the English blogs, news, and Twitter text files for the Coursera capstone project.

# setup: load required libraries
library(tm)
library(wordcloud)
library(ggplot2)
library(ggthemes)
library(dplyr)

# load files
blog_lines <- readLines("en_US.blogs.txt", skipNul = TRUE)
news_lines <- readLines("en_US.news.txt", skipNul = TRUE)
twitter_lines <- readLines("en_US.twitter.txt", skipNul = TRUE)

Functions for cleaning:

Functions were created to clean, tokenize, count words, and remove profanity.
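The definitions of these helpers are not reproduced in this report. The sketch below shows one possible implementation that is consistent with how they are used later (a tm VCorpus pipeline); the exact cleaning steps and the definition of the word count are assumptions, and profanity removal is covered in its own section below.

clean_and_tokenize <- function(lines) {
  # build a corpus and apply standard tm cleaning transformations
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

count_words <- function(corpus) {
  # total number of word occurrences across the corpus
  sum(as.matrix(TermDocumentMatrix(corpus)))
}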

Initial Statistics:

length_blogs <- length(blog_lines)
length_news <- length(news_lines)
length_twitter <- length(twitter_lines)
results <- tibble(length_blogs, length_news, length_twitter)
print(as.data.frame(results))
  length_blogs length_news length_twitter
1       899288     1010242        2360148
# Read 2,000 lines from each file and make a summary to understand the data profile
lines_to_read <- 2000
blogs_file <- readLines("en_US.blogs.txt", n=lines_to_read)
news_file <- readLines("en_US.news.txt", n=lines_to_read)
twitter_file <- readLines("en_US.twitter.txt", n=lines_to_read)
# create data frame with each document included as a column
aggregate_blogs_news_twitter <- data.frame(blogs   = blogs_file,
                                           news    = news_file,
                                           twitter = twitter_file,
                                           stringsAsFactors = FALSE)

Initial Data Cleaning and Processing:

Using the previously created functions, the 2,000 lines loaded from each of the three files are converted to a VCorpus for text mining, lower-cased, and stripped of punctuation, extra whitespace, stopwords, and numbers. We also create a term-document matrix for each source, a table containing the frequency of each word in the lines.

# Process blogs data
blogs_token <- clean_and_tokenize(aggregate_blogs_news_twitter[,c("blogs")])
blogs_words <- count_words(blogs_token)
tdm_blogs <- TermDocumentMatrix(blogs_token)
m_blogs <- as.matrix(tdm_blogs)
v_blogs <- sort(rowSums(m_blogs),decreasing=TRUE)
d_blogs <- data.frame(word=names(v_blogs),freq=v_blogs)

# Process news data
news_token <- clean_and_tokenize(aggregate_blogs_news_twitter[,c("news")])
news_words <- count_words(news_token)
tdm_news <- TermDocumentMatrix(news_token)
m_news <- as.matrix(tdm_news)
v_news <- sort(rowSums(m_news),decreasing=TRUE)
d_news <- data.frame(word=names(v_news),freq=v_news)

# Process twitter data
twitter_token <- clean_and_tokenize(aggregate_blogs_news_twitter[,c("twitter")])
twitter_words <- count_words(twitter_token)
tdm_twitter <- TermDocumentMatrix(twitter_token)
m_twitter <- as.matrix(tdm_twitter)
v_twitter <- sort(rowSums(m_twitter),decreasing=TRUE)
d_twitter <- data.frame(word=names(v_twitter),freq=v_twitter)

EDA Plots:

Word clouds and bar graphs of the most frequent words in each sample are plotted below.
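The plotting code is not reproduced here; the sketch below shows one way the blogs word cloud and a bar graph of its top terms could be generated from the frequency table d_blogs, with the same pattern applied to d_news and d_twitter. The max.words value, colour palette, and ggplot styling are assumptions rather than the settings used for the figures.

library(RColorBrewer)
set.seed(1234)
# Word cloud of the most frequent blog terms (parameters are illustrative)
wordcloud(words = d_blogs$word, freq = d_blogs$freq, min.freq = 2,
          max.words = 100, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

# Bar graph of the 20 most frequent blog terms
ggplot(head(d_blogs, 20), aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Top 20 Words - Blogs") +
  theme_few()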

Wordcloud - Blogs Words

Wordcloud - News Words

Wordcloud - Twitter Words

Removal of Profanity and Other Words:

We used the badwords file from https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt as the list of profanity to remove from the blogs, news, and Twitter lines.
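The filtering code itself is not shown above; the sketch below illustrates one way the before-and-after counts (blogs_words_len, blogs_words_len2, and so on) used in the comparison below could be produced by dropping the bad words from each term-frequency table. Saving the list locally as badwords.txt and counting distinct terms are assumptions.

badwords_url <- "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt"
if (!file.exists("badwords.txt")) download.file(badwords_url, "badwords.txt")
# Any stray punctuation in the downloaded list may need additional cleanup
badwords <- trimws(readLines("badwords.txt", skipNul = TRUE))

# Distinct terms before and after dropping bad words (illustrative definitions)
blogs_words_len  <- nrow(d_blogs)
d_blogs          <- d_blogs[!(d_blogs$word %in% badwords), ]
blogs_words_len2 <- nrow(d_blogs)
# The same steps are repeated for d_news and d_twitter to obtain
# news_words_len/news_words_len2 and twitter_words_len/twitter_words_len2.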

blogs_delta <- blogs_words_len - blogs_words_len2
news_delta <- news_words_len - news_words_len2
twitter_delta <- twitter_words_len - twitter_words_len2
print(matrix(c("Blogs words removed", "News words removed", "Twitter words removed", 
               blogs_delta, news_delta, twitter_delta),
             nrow=3))
     [,1]                    [,2]
[1,] "Blogs words removed"   "18"
[2,] "News words removed"    "8" 
[3,] "Twitter words removed" "13"

Plans for machine learning:

I plan to use 3-grams (trigrams) to predict the next word: the user enters one or more words, and the algorithm predicts the most probable next word based on the observed frequencies in the data.
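As a rough illustration of the idea (not the final model), the sketch below builds trigram counts from the sampled blog lines and returns the most frequent completion of a two-word prefix. The function names, the simple tokenization, and the pure maximum-likelihood lookup are assumptions; the final model will also need smoothing or backoff for prefixes that were never observed.

# dplyr (loaded above) supplies count, filter, arrange, and the pipe
build_trigrams <- function(lines) {
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  rows <- lapply(tokens, function(w) {
    w <- w[w != ""]
    if (length(w) < 3) return(NULL)
    data.frame(word1 = head(w, -2),
               word2 = head(w[-1], -1),
               word3 = w[-(1:2)],
               stringsAsFactors = FALSE)
  })
  count(do.call(rbind, rows), word1, word2, word3, name = "n")
}

# Most frequent third word following the two-word prefix (w1, w2)
predict_next_word <- function(trigrams, w1, w2) {
  cand <- trigrams %>% filter(word1 == w1, word2 == w2) %>% arrange(desc(n))
  if (nrow(cand) == 0) return(NA_character_)
  cand$word3[1]
}

trigrams <- build_trigrams(blogs_file)
predict_next_word(trigrams, "one", "of")   # example query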

References:

I used the following website to better understand how to complete this analysis; its walkthrough of the tokenizing and word-cloud process was very useful. Many thanks to the original blogger!

http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

For the prediction algorithm, I am considering this blog post as a key resource:

https://sookocheff.com/post/nlp/ngram-modeling-with-markov-chains/