The goal of this project is to build a next-word prediction application using text data from blogs, news articles, and Twitter. This milestone report describes the exploratory data analysis performed so far and outlines the plan for building the final prediction model and Shiny application.
The data comprise three English-language text files, one each for blogs, news articles, and Twitter. The following code reads each file into R as a character vector with one element per line.
blogs <- readLines("en_US.blogs.txt", encoding="UTF-8", skipNul=TRUE)
news <- readLines("en_US.news.txt", encoding="UTF-8", skipNul=TRUE)
twitter <- readLines("en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)
We calculate the number of lines and total words in each dataset.
library(stringi)
summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
summary_table
##      File   Lines    Words
## 1   Blogs  899288 37546806
## 2    News 1010206 34761151
## 3 Twitter 2360148 30096690
The following histogram shows the distribution of the number of words per line in the blogs dataset.
blog_lengths <- stri_count_words(blogs)
hist(blog_lengths, breaks = 50,
     main = "Distribution of Blog Line Word Counts",
     xlab = "Words per line")
The exploratory analysis shows that the three files together contain about 4.3 million lines and about 102 million words. Twitter has the most lines (about 2.4 million) but the fewest words (about 30 million), so its lines are much shorter on average than those in blogs and news. The next steps are to clean the text, build n-gram models, and use them to predict the next word in a sentence. A Shiny application will then provide an interactive interface where users enter text and receive predicted next words.
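The planned cleaning, n-gram, and prediction steps could look roughly like the sketch below. It is only an illustration in base R, not the final model: the toy `sample_lines` vector and the helper names `clean_tokens`, `build_bigrams`, and `predict_next` are hypothetical, the cleaning rules are an assumed minimal choice, and the model is limited to bigrams with no smoothing or back-off.

```r
# Toy corpus standing in for a sample of the real files (hypothetical data).
sample_lines <- c(
  "I love this project!",
  "I love R and I love data.",
  "This project is fun."
)

# Assumed cleaning rules: lowercase, keep only letters, apostrophes, and
# spaces, then split on runs of spaces.
clean_tokens <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]", " ", line)
  tokens <- strsplit(line, " +")[[1]]
  tokens[tokens != ""]
}

# Count adjacent word pairs within each line; pairs never span two lines.
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    tokens <- clean_tokens(line)
    if (length(tokens) < 2) return(NULL)
    data.frame(first = head(tokens, -1), second = tail(tokens, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pairs))
  pairs$n <- 1
  counts <- aggregate(n ~ first + second, data = pairs, FUN = sum)
  counts[order(-counts$n), ]  # most frequent bigrams first
}

# Return up to k candidate next words, ranked by how often they followed
# the last word of the input in the corpus.
predict_next <- function(bigrams, text, k = 3) {
  tokens <- clean_tokens(text)
  if (length(tokens) == 0) return(character(0))
  last_word <- tokens[length(tokens)]
  matches <- bigrams[bigrams$first == last_word, ]
  as.character(head(matches$second, k))
}

bigrams <- build_bigrams(sample_lines)
predict_next(bigrams, "I", k = 1)
```

A data frame of counts is convenient at this scale, but the full corpus would likely need sampling and a more compact lookup structure before a model like this could run inside a Shiny app.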