The primary objective of this report is to demonstrate that I can work with the data and am on track to build a prediction algorithm. The report shows that the data have been downloaded and loaded successfully, summarizes basic statistics about each data set, reports interesting findings, and proposes next steps for the prediction algorithm and app.
Loading the data from the source files.
library(stringr)   # str_count()
library(dplyr)     # tibble(), mutate(), count(), anti_join()
library(tidytext)  # unnest_tokens(), stop_words
library(ggplot2)   # plots

blogs   <- readLines("en_US.blogs.txt")
news    <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain an
## embedded nul
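The embedded-nul warnings indicate stray null bytes in the Twitter file, and readLines() truncates the affected lines at the nul byte. A minimal fix, assuming the same file name, is to re-read the file with skipNul = TRUE (opening the connection in binary mode also avoids platform-specific end-of-line surprises):

# Sketch: skipNul = TRUE drops embedded nul bytes, so the warnings above
# disappear and the affected lines are read in full.
con <- file("en_US.twitter.txt", open = "rb")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)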
Building a data frame to show the line count and word count for each source file.
# Named file_summary rather than summary to avoid masking base::summary().
file_summary <- data.frame(
  Source    = c("Blogs", "News", "Twitter"),
  LineCount = c(length(blogs), length(news), length(twitter)),
  WordCount = c(sum(str_count(blogs,   "\\S+")),   # "\\S+" matches whitespace-separated tokens
                sum(str_count(news,    "\\S+")),
                sum(str_count(twitter, "\\S+"))))
file_summary
Visualizing the data with plots.
ggplot(file_summary, aes(x = Source, y = LineCount, fill = Source)) +
  geom_col() +   # equivalent to geom_bar(stat = "identity")
  labs(title = "Line Count by Source", x = "Source File", y = "Line Count")

ggplot(file_summary, aes(x = Source, y = WordCount, fill = Source)) +
  geom_col() +
  labs(title = "Word Count by Source", x = "Source File", y = "Word Count")
blogs_words <- tibble(Text = blogs) %>%
  unnest_tokens(output = word, input = Text) %>%   # one row per token
  anti_join(stop_words)                            # drop common English stop words
## Joining with `by = join_by(word)`
blogs_wordcounts <- blogs_words %>% count(word, sort = TRUE)
top_blog_word <- blogs_wordcounts %>%
  head(20) %>%
  mutate(word = reorder(word, n))   # order bars by frequency
ggplot(top_blog_word, aes(x = word, y = n)) +
  geom_col() +
  labs(x = "Word", y = "Word Count", title = "Most Frequent Words in Blogs")
news_words <- tibble(Text = news) %>%
  unnest_tokens(output = word, input = Text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
news_wordcounts <- news_words %>% count(word, sort = TRUE)
top_news_word <- news_wordcounts %>%
  head(20) %>%
  mutate(word = reorder(word, n))
ggplot(top_news_word, aes(x = word, y = n)) +
  geom_col() +
  labs(x = "Word", y = "Word Count", title = "Most Frequent Words in News")
twitter_words <- tibble(Text = twitter) %>%
  unnest_tokens(output = word, input = Text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
twitter_wordcounts <- twitter_words %>% count(word, sort = TRUE)
top_twitter_word <- twitter_wordcounts %>%
  head(20) %>%
  mutate(word = reorder(word, n))
ggplot(top_twitter_word, aes(x = word, y = n)) +
  geom_col() +
  labs(x = "Word", y = "Word Count", title = "Most Frequent Words in Twitter")
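The blogs, news, and Twitter blocks above repeat the same tokenize, count, and plot steps. They could be factored into a single helper; the function name plot_top_words below is my own, not part of the original analysis:

# Hypothetical helper reproducing the per-source blocks above.
plot_top_words <- function(lines, source_name, n_top = 20) {
  tibble(Text = lines) %>%
    unnest_tokens(output = word, input = Text) %>%
    anti_join(stop_words, by = "word") %>%   # explicit `by` silences the join message
    count(word, sort = TRUE) %>%
    head(n_top) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(x = word, y = n)) +
    geom_col() +
    labs(x = "Word", y = "Word Count",
         title = paste("Most Frequent Words in", source_name))
}
plot_top_words(twitter, "Twitter")   # same plot as above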
The blogs file appears to have the highest word count, while the Twitter file has the highest line count. The next steps are to build a basic n-gram model, along with a backoff or smoothing strategy so the model can handle n-grams that never appear in the training data. The prediction model should keep both its size and its runtime small so that the final app remains responsive for the user.
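As a concrete starting point for the n-gram model, tidytext's unnest_tokens() can emit n-grams directly. A minimal sketch for bigram counts over the blogs corpus (stop words are deliberately kept here, since a next-word predictor must be able to suggest them):

# First step toward the prediction model: count two-word sequences.
blogs_bigrams <- tibble(Text = blogs) %>%
  unnest_tokens(output = bigram, input = Text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # lines with fewer than two words yield NA
  count(bigram, sort = TRUE)
head(blogs_bigrams, 10)        # the most common bigrams

The same call with n = 3 yields trigrams; comparing bigram and trigram coverage will inform the backoff strategy for unseen n-grams.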