Blog Data

Executive Summary

This report demonstrates that I have downloaded the data and loaded it successfully. It also summarizes basic statistics about the data sets, reports interesting findings, and proposes next steps for the prediction algorithm and app.

Exploratory Data Analysis

Loading the data from the source files.

blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain an
## embedded nul
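The embedded nul warnings above come from stray zero bytes in the Twitter file. One way to avoid them, sketched here as an assumption rather than as the approach this report used, is to open the file in binary mode and pass `skipNul = TRUE` to `readLines()`. The helper name `read_corpus` and the temporary demo file are hypothetical.

```r
# Hypothetical helper (not part of the original report): read a text file
# line by line, dropping embedded nul bytes instead of warning about them.
read_corpus <- function(path) {
  con <- file(path, open = "rb")  # binary mode avoids platform-specific text handling
  on.exit(close(con))             # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small demo: write a file whose second line contains a nul byte.
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("first line\n"),
    charToRaw("sec"), as.raw(0), charToRaw("ond line\n")),
  tmp
)
demo_lines <- read_corpus(tmp)
print(demo_lines)
```

With a helper like this, the three source files could be loaded as `twitter <- read_corpus("en_US.twitter.txt")` and so on, which should leave the line counts unchanged while silencing the warnings.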

Building a data frame to show the line count and word count for each source file.

library(stringr)

# Count lines with length() and words as runs of non-whitespace characters.
summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  LineCount = c(length(blogs), length(news), length(twitter)),
  WordCount = c(sum(str_count(blogs, "\\S+")),
                sum(str_count(news, "\\S+")),
                sum(str_count(twitter, "\\S+")))
)
summary

Source    LineCount   WordCount
Blogs        899288    37334131
News        1010206    34371031
Twitter     2360148    30373543

Visualizing the line counts, the word counts, and the most frequent words in each source.

library(ggplot2)

# Bar chart of the number of lines in each source file.
ggplot(summary, aes(x = Source, y = LineCount, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Line Count by Source", x = "Source File", y = "Line Count")

ggplot(summary, aes(x = Source, y = WordCount, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Word Count by Source", x = "Source File", y = "Word Count")

library(dplyr)
library(tidytext)

# Tokenize the blog text into words, drop stop words, and plot the 20 most frequent words.
blogs_data <- tibble(Text = blogs)
blogs_words <- blogs_data %>% unnest_tokens(output = word, input = Text)
blogs_words <- blogs_words %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
blogs_wordcounts <- blogs_words %>% count(word, sort = TRUE)
blogs_wordcounts <- as.data.frame(blogs_wordcounts)
top_blog_word <- head(blogs_wordcounts, 20)
top_blog_word <- mutate(top_blog_word, word = reorder(word, n))
ggplot(top_blog_word, aes(x = word, y = n)) +
  labs(x = "Word", y = "Word Count", title = "Most Frequent Words in Blogs") +
  geom_bar(stat = "identity")

news_data <- tibble(Text = news)
news_words <- news_data %>% unnest_tokens(output = word, input = Text)
news_words <- news_words %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
news_wordcounts <- news_words %>% count(word, sort = TRUE)
news_wordcounts <- as.data.frame(news_wordcounts)
top_news_word <- head(news_wordcounts, 20)
top_news_word <- mutate(top_news_word, word = reorder(word, n))
ggplot(top_news_word, aes(x = word, y = n)) +
  labs(x = "Word", y = "Word Count", title = "Most Frequent Words in News") +
  geom_bar(stat = "identity")

twitter_data <- tibble(Text = twitter)
twitter_words <- twitter_data %>% unnest_tokens(output = word, input = Text)
twitter_words <- twitter_words %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
twitter_wordcounts <- twitter_words %>% count(word, sort = TRUE)
twitter_wordcounts <- as.data.frame(twitter_wordcounts)
top_twitter_word <- head(twitter_wordcounts, 20)
top_twitter_word <- mutate(top_twitter_word, word = reorder(word, n))
ggplot(top_twitter_word, aes(x = word, y = n)) +
  labs(x = "Word", y = "Word Count", title = "Most Frequent Words in Twitter") +
  geom_bar(stat = "identity")
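The three word-frequency blocks above repeat the same steps for each source. As a possible refactoring, and only a sketch, the steps could be wrapped in two small functions. The names `top_words` and `plot_top_words` are hypothetical, and the use of `geom_col()` with `coord_flip()` (so that word labels do not overlap on the axis) is a design choice that differs from the plots above.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Hypothetical helper: tokenize lines into words, drop stop words,
# and return the n most frequent remaining words with their counts.
top_words <- function(lines, n = 20) {
  tibble(Text = lines) %>%
    unnest_tokens(output = word, input = Text) %>%
    anti_join(stop_words, by = "word") %>%  # explicit key avoids the join message
    count(word, sort = TRUE) %>%
    head(n)
}

# Hypothetical helper: horizontal bar chart of the counts from top_words().
plot_top_words <- function(word_counts, source_name) {
  ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
    geom_col() +
    coord_flip() +
    labs(x = "Word", y = "Word Count",
         title = paste("Most Frequent Words in", source_name))
}

# Toy input standing in for one of the corpora.
demo <- c("The cat sat on the mat", "A cat chased another cat")
demo_counts <- top_words(demo, n = 3)
demo_plot <- plot_top_words(demo_counts, "Demo")
```

With these helpers, each source would need a single line such as `plot_top_words(top_words(blogs), "Blogs")`.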

Findings and Next Steps

Blogs have the highest word count (about 37.3 million) despite having the fewest lines (899,288), while Twitter has the most lines (2,360,148) but the fewest words (about 30.4 million). Blog entries therefore average roughly 42 words per line, compared with about 13 for tweets. The next steps are to build a basic n-gram model, as well as a model that can handle unseen n-grams. The goal for the prediction model is to keep both its size and its runtime small enough to give the user a responsive experience.
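To make the planned next step concrete, the following is a minimal base R sketch of a bigram model that backs off to the most frequent single word when the previous word has not been seen. It is an illustration under simplifying assumptions, not the model this report will necessarily use: the tokenizer, the function names `tokenize`, `build_model`, and `predict_next`, and the toy corpus are all hypothetical.

```r
# Simplified tokenizer (an assumption): lowercase, split on anything
# that is not a letter or apostrophe, and drop empty tokens.
tokenize <- function(lines) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens[nzchar(tokens)]
}

# Build unigram and bigram frequency tables. Bigrams are formed within
# each line so that they never span two unrelated lines.
build_model <- function(lines) {
  per_line <- lapply(lines, tokenize)
  unigrams <- sort(table(unlist(per_line)), decreasing = TRUE)
  bigrams <- unlist(lapply(per_line, function(w) {
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
  }))
  list(unigrams = unigrams, bigrams = table(bigrams))
}

# Predict the next word: take the most frequent bigram that starts with
# the previous word; if there is none, back off to the top unigram.
predict_next <- function(model, previous_word) {
  prefix <- paste0(tolower(previous_word), " ")
  hits <- model$bigrams[startsWith(names(model$bigrams), prefix)]
  if (length(hits) > 0) {
    best <- names(hits)[which.max(hits)]
    return(sub("^\\S+ ", "", best))  # strip the first word, keep the continuation
  }
  names(model$unigrams)[1]
}

toy_corpus <- c("i love new york", "i love new music", "new york is busy")
model <- build_model(toy_corpus)
seen_prediction <- predict_next(model, "new")      # "york"
unseen_prediction <- predict_next(model, "zebra")  # "new" (backoff)
```

Storing full frequency tables like this would be too large for the real corpora, so in practice a sample of the data and pruning of rare n-grams would likely be needed to meet the size and runtime goals.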