Executive Summary

The primary objective of this report is to show that I can work with the data and am on track to create a prediction algorithm. It demonstrates that the data have been downloaded and loaded successfully, summarizes basic statistics about the data sets, reports interesting findings, and proposes next steps for the prediction algorithm and app.

Exploratory Data Analysis

Loading the data from the source files.

  # Packages used throughout this report: wrangling, string counts, tokenizing, plots
  library(dplyr); library(stringr); library(tidytext); library(ggplot2)

  blogs <- readLines("en_US.blogs.txt")
  news <- readLines("en_US.news.txt")
  twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain an
## embedded nul
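
The warnings point to a handful of embedded nul bytes in the Twitter file. A minimal fix, should the warnings become a nuisance, is the `skipNul` argument to base R's `readLines()`, sketched below; the rest of this report uses the original read.

  # Optional: re-read the Twitter file, silently skipping embedded nul bytes
  twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)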

Building a data frame to show the line count and word count for each source file.

  summary <- data.frame(
    Source    = c("Blogs", "News", "Twitter"),
    LineCount = c(length(blogs), length(news), length(twitter)),
    WordCount = c(sum(str_count(blogs, "\\S+")),
                  sum(str_count(news, "\\S+")),
                  sum(str_count(twitter, "\\S+")))
  )
  summary
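
A derived column such as average words per line makes the contrast between the sources easier to read; this small sketch simply builds on the `summary` data frame above.

  # Average words per line highlights how line lengths differ across sources
  summary$WordsPerLine <- round(summary$WordCount / summary$LineCount, 1)
  summary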

Visualizing the data with plots.

  ggplot(summary, aes(x = Source, y = LineCount, fill = Source)) +
    geom_bar(stat = "identity") +
    labs(title = "Line Count by Source", x = "Source File", y = "Line Count")

  ggplot(summary, aes(x = Source, y = WordCount, fill = Source)) +
    geom_bar(stat = "identity") +
    labs(title = "Word Count by Source", x = "Source File", y = "Word Count")

  blogs_data <- tibble(Text = blogs)
  blogs_words <- blogs_data %>%
    unnest_tokens(output = word, input = Text) %>%
    anti_join(stop_words)
## Joining with `by = join_by(word)`
  blogs_wordcounts <- blogs_words %>%
    count(word, sort = TRUE) %>%
    as.data.frame()
  top_blog_word <- head(blogs_wordcounts, 20) %>%
    mutate(word = reorder(word, n))
  ggplot(top_blog_word, aes(x = word, y = n)) +
    geom_bar(stat = "identity") +
    labs(x = "Word", y = "Word Count", title = "Most Frequent Words in Blogs")

  news_data <- tibble(Text = news)
  news_words <- news_data %>%
    unnest_tokens(output = word, input = Text) %>%
    anti_join(stop_words)
## Joining with `by = join_by(word)`
  news_wordcounts <- news_words %>%
    count(word, sort = TRUE) %>%
    as.data.frame()
  top_news_word <- head(news_wordcounts, 20) %>%
    mutate(word = reorder(word, n))
  ggplot(top_news_word, aes(x = word, y = n)) +
    geom_bar(stat = "identity") +
    labs(x = "Word", y = "Word Count", title = "Most Frequent Words in News")

  twitter_data <- tibble(Text = twitter)
  twitter_words <- twitter_data %>%
    unnest_tokens(output = word, input = Text) %>%
    anti_join(stop_words)
## Joining with `by = join_by(word)`
  twitter_wordcounts <- twitter_words %>%
    count(word, sort = TRUE) %>%
    as.data.frame()
  top_twitter_word <- head(twitter_wordcounts, 20) %>%
    mutate(word = reorder(word, n))
  ggplot(top_twitter_word, aes(x = word, y = n)) +
    geom_bar(stat = "identity") +
    labs(x = "Word", y = "Word Count", title = "Most Frequent Words in Twitter")

Findings and Next Steps

Blogs appear to have the highest word count, while Twitter has the highest line count, consistent with its short-message format. The next steps are to build a basic n-gram model and to add a back-off or smoothing strategy so that the model can handle unseen n-grams. Throughout, the goal is to keep both the size and the runtime of the prediction model small enough to give the user a responsive experience.
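
As a concrete starting point, bigram counts can be built with the same tidytext pipeline used above, and a toy lookup with a unigram fallback shows one simple way to handle unseen n-grams. This is a minimal sketch rather than the final model: `predict_next` is a hypothetical helper, and `separate()` comes from the tidyr package.

  library(tidyr)  # separate()

  # Bigram counts from the blog text, split into "previous word" / "next word"
  blogs_bigrams <- blogs_data %>%
    unnest_tokens(output = bigram, input = Text, token = "ngrams", n = 2) %>%
    filter(!is.na(bigram)) %>%  # lines too short to form a bigram yield NA
    count(bigram, sort = TRUE) %>%
    separate(bigram, into = c("word1", "word2"), sep = " ")

  # Toy predictor: most frequent follower of w, backing off to the overall
  # most frequent word when w was never observed (the unseen-n-gram case)
  predict_next <- function(w) {
    hits <- filter(blogs_bigrams, word1 == w)
    if (nrow(hits) > 0) hits$word2[1] else blogs_wordcounts$word[1]
  }
  predict_next("happy")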