Blog Data
Executive Summary
The primary objective of this report is to demonstrate that I have downloaded the data and successfully loaded it. The report also summarizes basic statistics about the data sets, reports interesting findings, and proposes next steps for the prediction algorithm and app.
Exploratory Data Analysis
Loading the data from the source files.
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain an
## embedded nul
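The embedded nul warnings above come from stray nul bytes in the Twitter file. One possible way to read the files without them is sketched below; it is not what this report ran. The helper name `read_corpus` is hypothetical, and the sketch assumes the files are UTF-8 encoded. It relies on the `skipNul` argument of base R's `readLines`, and opens the connection in binary mode so that platform-specific text-mode handling does not interfere.

```r
# Hypothetical helper (not part of the original report): read a text file
# line by line while skipping embedded nul bytes.
read_corpus <- function(path) {
  con <- file(path, open = "rb")      # binary mode: no text-mode translation
  on.exit(close(con))                 # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file that contains a nul byte.
demo_path <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("hello"), as.raw(0), charToRaw(" world\nsecond line\n")),
  demo_path
)
demo_lines <- read_corpus(demo_path)
unlink(demo_path)

# With the real data this would be used as, for example:
# twitter <- read_corpus("en_US.twitter.txt")
```

Skipping the nul bytes keeps the rest of each affected line, so the line counts should be unaffected; only the stray bytes are dropped.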
Building a data frame to show the line count and word count for each source file.
library(stringr)

summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  LineCount = c(length(blogs), length(news), length(twitter)),
  # Count whitespace-delimited tokens on each line, then total them per source
  WordCount = c(sum(str_count(blogs, "\\S+")),
                sum(str_count(news, "\\S+")),
                sum(str_count(twitter, "\\S+")))
)
summary
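To make the word-count definition concrete: the pattern `\S+` matches each run of non-whitespace characters, so punctuation attached to a word does not create an extra word, and empty lines contribute nothing. The sketch below shows this on toy input using only base R, as a cross-check that needs no packages; for ordinary strings it should give the same counts as `str_count`. The toy lines and variable names are illustrative only.

```r
# Toy lines standing in for the real corpus (illustrative only).
toy_lines <- c("Hello, world!", "don't   stop  believing", "")

# Each run of non-whitespace characters counts as one word.
matches <- regmatches(toy_lines, gregexpr("\\S+", toy_lines))
words_per_line <- lengths(matches)

# Total word count across all lines, analogous to sum(str_count(x, "\\S+")).
total_words <- sum(words_per_line)
```

Here the first line yields two words, the second yields three despite the repeated spaces, and the empty line yields none.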
library(ggplot2)

ggplot(summary, aes(x = Source, y = LineCount, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Line Count by Source", x = "Source File", y = "Line Count")

ggplot(summary, aes(x = Source, y = WordCount, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Word Count by Source", x = "Source File", y = "Word Count")
library(dplyr)
library(tidytext)

# Split the blog text into one word per row and drop common stop words
blogs_data <- tibble(Text = blogs)
blogs_words <- blogs_data %>% unnest_tokens(output = word, input = Text)
blogs_words <- blogs_words %>% anti_join(stop_words)
## Joining with `by = join_by(word)`

# Count the words and plot the 20 most frequent
blogs_wordcounts <- blogs_words %>% count(word, sort = TRUE)
blogs_wordcounts <- as.data.frame(blogs_wordcounts)
top_blog_word <- head(blogs_wordcounts, 20)
top_blog_word <- mutate(top_blog_word, word = reorder(word, n))
ggplot(top_blog_word, aes(x = word, y = n)) +
  labs(x = "Word", y = "Word Count", title = "Most Frequent Words in Blogs") +
  geom_bar(stat = "identity")
news_data <- tibble(Text = news)
news_words <- news_data %>% unnest_tokens(output = word, input = Text)
news_words <- news_words %>% anti_join(stop_words)
## Joining with `by = join_by(word)`

news_wordcounts <- news_words %>% count(word, sort = TRUE)
news_wordcounts <- as.data.frame(news_wordcounts)
top_news_word <- head(news_wordcounts, 20)
top_news_word <- mutate(top_news_word, word = reorder(word, n))
ggplot(top_news_word, aes(x = word, y = n)) +
  labs(x = "Word", y = "Word Count", title = "Most Frequent Words in News") +
  geom_bar(stat = "identity")
twitter_data <- tibble(Text = twitter)
twitter_words <- twitter_data %>% unnest_tokens(output = word, input = Text)
twitter_words <- twitter_words %>% anti_join(stop_words)
## Joining with `by = join_by(word)`

twitter_wordcounts <- twitter_words %>% count(word, sort = TRUE)
twitter_wordcounts <- as.data.frame(twitter_wordcounts)
top_twitter_word <- head(twitter_wordcounts, 20)
top_twitter_word <- mutate(top_twitter_word, word = reorder(word, n))
ggplot(top_twitter_word, aes(x = word, y = n)) +
  labs(x = "Word", y = "Word Count", title = "Most Frequent Words in Twitter") +
  geom_bar(stat = "identity")
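The blogs, news, and Twitter steps above repeat the same pipeline: tokenize, remove stop words, count, and keep the top 20. As a sketch only, that pipeline could be factored into a single helper. The name `top_words` and its parameter `k` are hypothetical and not part of the original analysis; the sketch assumes the dplyr and tidytext packages are installed, and it passes `by = "word"` explicitly, which avoids the "Joining with" message.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: return the k most frequent non-stop-words
# in a character vector of text lines.
top_words <- function(lines, k = 20) {
  tibble(Text = lines) %>%
    unnest_tokens(output = word, input = Text) %>%  # one lowercase word per row
    anti_join(stop_words, by = "word") %>%          # drop common stop words
    count(word, sort = TRUE) %>%                    # frequency table, descending
    head(k) %>%                                     # keep the k most frequent
    mutate(word = reorder(word, n))                 # order factor levels by count
}

# Toy input standing in for the real corpus (illustrative only).
toy_top <- top_words(c("the cat sat", "the cat ran", "a dog barked"), k = 2)
```

With the real data, `top_words(blogs)`, `top_words(news)`, and `top_words(twitter)` would each return a data frame that can be passed to `ggplot` as in the blocks above.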
Findings and Next Steps
Blogs appear to have the highest word count, while Twitter has the highest line count. The next steps are to build a basic n-gram model and to extend it so that it can handle unseen n-grams. The goal for the prediction model is to minimize both its size and its runtime so that the app gives the user a reasonably responsive experience.
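As a rough illustration of what such a model might look like, the sketch below builds bigram counts from a toy corpus and predicts the next word, falling back to the most frequent single word when the input word was never seen. This simple fallback is an assumed strategy chosen for illustration, not a method this report has settled on. The corpus, the function name `predict_next`, and all variable names are hypothetical, and only base R is used.

```r
# Toy corpus standing in for the real text (illustrative only).
corpus <- c("i love new york", "i love new music", "you love old music")

# Tokenize each line into lowercase words.
tokens <- strsplit(tolower(corpus), "\\s+")

# Unigram counts, most frequent first; used as the fallback.
unigrams <- sort(table(unlist(tokens)), decreasing = TRUE)

# Bigram counts: pair each word with the word that follows it on the same line.
first  <- unlist(lapply(tokens, head, -1))
second <- unlist(lapply(tokens, tail, -1))
bigrams <- as.data.frame(table(first, second), stringsAsFactors = FALSE)
bigrams <- bigrams[bigrams$Freq > 0, ]

# Hypothetical predictor: most frequent follower of `word`, or the most
# frequent word overall when `word` has no observed follower.
predict_next <- function(word) {
  candidates <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(candidates) > 0) {
    return(candidates$second[which.max(candidates$Freq)])
  }
  names(unigrams)[1]  # back off to the unigram level
}

next_seen   <- predict_next("love")   # "love" is followed by "new" twice
next_unseen <- predict_next("zebra")  # unseen word: falls back to "love"
```

A table of counts like this grows quickly on the full data, which is where the size and runtime goals come in; pruning rare n-grams is one common way to keep such a table small.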