Introduction

The goal of this project is to build a next-word prediction application using text data from blogs, news articles, and Twitter. This milestone report describes the exploratory data analysis performed so far and outlines the plan for building the final prediction model and Shiny application.

Loading the Data

The dataset consists of three English-language (en_US) text files, one each for blogs, news, and Twitter. The following code reads each file into R as a character vector with one element per line.

# Read each file as UTF-8; skipNul = TRUE skips embedded NUL bytes.
blogs <- readLines("en_US.blogs.txt", encoding="UTF-8", skipNul=TRUE)
news <- readLines("en_US.news.txt", encoding="UTF-8", skipNul=TRUE)
twitter <- readLines("en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)
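
Text-mode reads can behave differently across platforms; on Windows in particular, an embedded control character has been reported to end a read early. As a precaution, the files could instead be read through a binary-mode connection. The helper below is a minimal sketch of that idea; the name read_corpus is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original report): read a text file
# through a binary-mode connection so platform-specific text-mode handling
# cannot cut the read short.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)  # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Usage, assuming the file sits in the working directory:
# news <- read_corpus("en_US.news.txt")
```

Comparing length(news) from both approaches is a quick way to check whether any lines were lost.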

Basic Summary Statistics

We count the lines and the total number of words in each file, using stri_count_words() from the stringi package to count the words on each line.

library(stringi)

summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

summary_table
##      File   Lines    Words
## 1   Blogs  899288 37546806
## 2    News 1010206 34761151
## 3 Twitter 2360148 30096690
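
The table shows that Twitter has the most lines but the fewest words, which suggests much shorter lines. One way to make this explicit is to add an average words-per-line column. The sketch below rebuilds the table from the figures reported above so that it runs on its own; the column name WordsPerLine is an addition for illustration.

```r
# Figures copied from the summary table above.
summary_table <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37546806, 34761151, 30096690)
)

# Derived column (an addition for illustration): average words per line,
# rounded to one decimal place.
summary_table$WordsPerLine <- round(summary_table$Words / summary_table$Lines, 1)

summary_table
```

With these figures the averages come out at about 41.8 words per line for blogs, 34.4 for news, and 12.8 for Twitter.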

Basic Plot

The following histogram shows the distribution of the number of words per line in the blogs dataset.

blog_lengths <- stri_count_words(blogs)
hist(blog_lengths, breaks = 50, main = "Distribution of Blog Line Word Counts",
     xlab = "Words per line")
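
Line lengths in text data often have a long right tail, in which case a few very long lines can compress most of the histogram into the first bins. If that happens here, trimming the extreme upper tail before plotting may give a more readable picture. The helper below is a sketch under that assumption; plot_trimmed_hist and its upper argument are hypothetical names, not part of the original analysis.

```r
# Hypothetical helper (an illustration, not the report's method): histogram of
# word counts with the extreme upper tail trimmed at a chosen quantile.
plot_trimmed_hist <- function(word_counts, upper = 0.99, breaks = 50) {
  # Value below which the chosen share of lines falls.
  cap <- quantile(word_counts, probs = upper, names = FALSE)
  kept <- word_counts[word_counts <= cap]
  hist(kept, breaks = breaks,
       main = sprintf("Word counts per line (up to the %g%% quantile)", upper * 100),
       xlab = "Words per line")
}

# Usage with the vector computed above:
# plot_trimmed_hist(blog_lengths)
```

Trimming only changes what is drawn; the untrimmed blog_lengths vector stays available for summary statistics.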

Conclusion and Next Steps

The exploratory analysis shows that the three files together contain about 4.3 million lines and just over 100 million words. Twitter has the most lines but the fewest words, averaging about 13 words per line compared with roughly 42 for blogs and 34 for news. The next step is to clean the text, build n-gram models, and use them to predict the next word in a sentence. A Shiny application will then provide an interactive interface where users enter text and receive predicted next words.
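
To illustrate the planned approach, the sketch below builds a bigram (two-word) frequency table in base R and uses it to suggest the most frequent followers of a given word. It is a minimal illustration and not the final model: the toy corpus, the cleaning rule, and the function names tokenize, count_bigrams, and predict_next are all assumptions made for this example.

```r
# Toy corpus standing in for a cleaned sample of the real data (assumption).
toy_lines <- c("I love data science", "I love R!", "Data science is fun")

# Lower-case a line, replace everything except letters, apostrophes, and
# spaces with a space, then split on runs of spaces.
tokenize <- function(line) {
  line <- gsub("[^a-z' ]", " ", tolower(line))
  words <- strsplit(line, " +")[[1]]
  words[nzchar(words)]
}

# Count every adjacent word pair (bigram) across all lines.
count_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < 2) return(NULL)  # a single word forms no bigram
    data.frame(first = head(w, -1), second = tail(w, -1), n = 1L,
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pairs))
  aggregate(n ~ first + second, data = pairs, FUN = sum)
}

# Return up to k of the most frequent words seen after `word`.
predict_next <- function(counts, word, k = 3) {
  hits <- counts[counts$first == tolower(word), ]
  as.character(head(hits$second[order(-hits$n)], k))
}

counts <- count_bigrams(toy_lines)
predict_next(counts, "I")     # "love"
predict_next(counts, "data")  # "science"
```

A word that never appears as the first half of a bigram returns an empty result, which is the case that longer n-grams combined with a fallback to shorter ones would need to handle in the final model.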