This is an exploratory data analysis for the capstone project on Coursera.
# Setup: load the required packages
library(tm)
library(wordcloud)
library(ggplot2)
library(ggthemes)
library(dplyr)
# load files
blog_lines <- readLines("en_US.blogs.txt", skipNul = TRUE)
news_lines <- readLines("en_US.news.txt", skipNul = TRUE)
twitter_lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
Helper functions were created to clean and tokenize the text, count words, and remove profanity.
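The exact implementations are not reproduced in this report. A minimal sketch of what clean_and_tokenize and count_words might look like with the tm package is shown below; the function bodies are assumptions for illustration rather than the original code (the profanity filter is sketched later, alongside the bad-words comparison).
# Sketch (assumed implementation): build a cleaned VCorpus from a character vector
clean_and_tokenize <- function(text_lines) {
  corpus <- VCorpus(VectorSource(text_lines))
  corpus <- tm_map(corpus, content_transformer(tolower))        # lower case
  corpus <- tm_map(corpus, removePunctuation)                   # remove punctuation
  corpus <- tm_map(corpus, removeNumbers)                       # remove numbers
  corpus <- tm_map(corpus, removeWords, stopwords("english"))   # remove stop words
  corpus <- tm_map(corpus, stripWhitespace)                     # collapse extra whitespace
  corpus
}
# Sketch (assumed implementation): total number of word tokens in a corpus
count_words <- function(corpus) {
  sum(sapply(corpus, function(doc) length(unlist(strsplit(content(doc), "\\s+")))))
}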
length_blogs <- length(blog_lines)
length_news <- length(news_lines)
length_twitter <- length(twitter_lines)
results <- tibble(length_blogs, length_news, length_twitter)
print(as.data.frame(results))
  length_blogs length_news length_twitter
1       899288     1010242        2360148
# Read 2,000 lines from each file and summarize them to understand the data profile
lines_to_read <- 2000
blogs_file <- readLines("en_US.blogs.txt", n=lines_to_read)
news_file <- readLines("en_US.news.txt", n=lines_to_read)
twitter_file <- readLines("en_US.twitter.txt", n=lines_to_read)
# create data frame with each document included as a column
aggregate_blogs_news_twitter <- data.frame(blogs = blogs_file,
                                           news = news_file,
                                           twitter = twitter_file,
                                           stringsAsFactors = FALSE)
The previously created functions convert the 2,000 lines loaded from each of the three files into a VCorpus for text mining, convert the text to lower case, and remove punctuation, extra whitespace, stop words, and numbers. We also create a term-document matrix for each source, a table containing the frequency of each word across the lines.
# Process blogs data
blogs_token <- clean_and_tokenize(aggregate_blogs_news_twitter[, c("blogs")])
blogs_words <- count_words(blogs_token)
tdm_blogs <- TermDocumentMatrix(blogs_token)
m_blogs <- as.matrix(tdm_blogs)
v_blogs <- sort(rowSums(m_blogs), decreasing = TRUE)
d_blogs <- data.frame(word = names(v_blogs), freq = v_blogs)
# Process news data
news_token <- clean_and_tokenize(aggregate_blogs_news_twitter[, c("news")])
news_words <- count_words(news_token)
tdm_news <- TermDocumentMatrix(news_token)
m_news <- as.matrix(tdm_news)
v_news <- sort(rowSums(m_news), decreasing = TRUE)
d_news <- data.frame(word = names(v_news), freq = v_news)
# Process twitter data
twitter_token <- clean_and_tokenize(aggregate_blogs_news_twitter[, c("twitter")])
twitter_words <- count_words(twitter_token)
tdm_twitter <- TermDocumentMatrix(twitter_token)
m_twitter <- as.matrix(tdm_twitter)
v_twitter <- sort(rowSums(m_twitter), decreasing = TRUE)
d_twitter <- data.frame(word = names(v_twitter), freq = v_twitter)
Plotting word clouds and bar graphs of the sampled words.
Word cloud - Blog words
Word cloud - News words
Word cloud - Twitter words
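The plotting calls are not shown above. A minimal sketch of how the word clouds and bar graphs could be produced from the frequency tables built earlier (d_blogs here; d_news and d_twitter are handled the same way) follows; this is an assumed reconstruction, not the exact code behind the figures.
# Sketch (assumed plotting code): word cloud of the 100 most frequent blog words
set.seed(1234)
wordcloud(words = d_blogs$word, freq = d_blogs$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
# Sketch (assumed plotting code): bar graph of the 15 most frequent blog words
ggplot(head(d_blogs, 15), aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Most frequent blog words") +
  theme_few()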
We used the bad-words file from https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt to provide the list of profanity to remove from the blogs, news, and Twitter lines.
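The code that builds the before-and-after counts compared below is not included in the report. One plausible sketch, reusing the frequency tables built above and the variable names used below (the local file name and the exact counting logic are assumptions), is:
# Sketch (assumed code): distinct word counts before and after profanity removal
badwords <- readLines("badwords.txt", skipNul = TRUE)
blogs_words_len  <- nrow(d_blogs)                              # distinct words before filtering
d_blogs_clean    <- d_blogs[!(d_blogs$word %in% badwords), ]   # drop profane words
blogs_words_len2 <- nrow(d_blogs_clean)                        # distinct words after filtering
# news_words_len/news_words_len2 and twitter_words_len/twitter_words_len2 are built the same way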
blogs_delta <- blogs_words_len - blogs_words_len2
news_delta <- news_words_len - news_words_len2
twitter_delta <- twitter_words_len - twitter_words_len2
print(matrix(c("Blogs words removed", "News words removed", "Twitter words removed",
blogs_delta, news_delta, twitter_delta),
nrow=3))
     [,1]                     [,2]
[1,] "Blogs words removed"    "18"
[2,] "News words removed"     "8"
[3,] "Twitter words removed"  "13"
I plan to use 3-grams (trigrams) to predict the next word: the user enters one or more words, and the algorithm predicts the most probable next word based on the data.
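As a rough illustration of the planned lookup (the data structure and function name below are assumptions; this is not implemented yet), a trigram frequency table could be queried like this:
# Sketch (assumed structure): trigram table with columns word1, word2, word3, freq
predict_next_word <- function(trigrams, w1, w2) {
  candidates <- trigrams[trigrams$word1 == w1 & trigrams$word2 == w2, ]
  if (nrow(candidates) == 0) return(NA_character_)   # no match: back off or return NA
  candidates$word3[which.max(candidates$freq)]       # most frequent continuation
}
# Toy example
trigrams <- data.frame(word1 = c("thanks", "thanks"),
                       word2 = c("for", "for"),
                       word3 = c("the", "following"),
                       freq  = c(25, 7),
                       stringsAsFactors = FALSE)
predict_next_word(trigrams, "thanks", "for")   # returns "the"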
I used the following website to better understand how to complete this analysis; its walkthrough of tokenization and analysis was very useful. Many thanks to the original blogger!
For the prediction algorithm, I am considering this blog as a key resource:
https://sookocheff.com/post/nlp/ngram-modeling-with-markov-chains/