Introduction

This report summarizes my exploratory data analysis for the Data Science Capstone project.
The goal is to show that the data has been downloaded, loaded, and explored, and to outline plans for the prediction algorithm.

Setup

Loading the Data

Important: adjust the file paths below if your data is in a different location.

# Example paths used in the course dataset
# If files are in folder 'final/en_US/' use these exact names
blogs   <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE)
news    <- readLines("final/en_US/en_US.news.txt", warn = FALSE)
twitter <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE)
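
The path note above can be made less error-prone by building all three paths from a single base directory and checking that the files exist before reading them. The helper below is a minimal sketch: the name corpus_paths() and its base_dir argument are hypothetical and not part of the course materials.

```r
# Hypothetical helper (not from the course materials): build the three
# file paths from one base directory and fail early if any is missing.
corpus_paths <- function(base_dir = "final/en_US") {
  files <- c(blogs   = "en_US.blogs.txt",
             news    = "en_US.news.txt",
             twitter = "en_US.twitter.txt")
  paths <- file.path(base_dir, files)
  names(paths) <- names(files)

  # Report every missing file at once instead of failing on the first one
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing data files: ", paste(missing, collapse = ", "))
  }
  paths
}
```

With this in place, a call such as readLines(corpus_paths()["blogs"], warn = FALSE) would read the same file as above, and only base_dir needs to change if the data lives elsewhere.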


Basic Summaries

library(stringi)

data_summary <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
data_summary

##      File   Lines    Words
## 1   Blogs  899288 37546250
## 2    News 1010242 34762395
## 3 Twitter 2360148 30093372

Small Samples and Corpus

set.seed(123)
sample_data <- c(
  sample(blogs,   min(2000, length(blogs))),
  sample(news,    min(2000, length(news))),
  sample(twitter, min(2000, length(twitter)))
)
length(sample_data)

## [1] 6000

Frequent Terms (small example)

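library(tm)
# Build a corpus from the 6,000 sampled lines and normalize the text
corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Note: TermDocumentMatrix() keeps only terms of 3+ characters by default
# (wordLengths = c(3, Inf)), so short words such as "to" or "of" are excluded.
tdm <- TermDocumentMatrix(corpus)
# as.matrix() is fine for this small sample but would be memory-heavy on the full data
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq_df <- data.frame(term = names(freq), freq = freq)
head(freq_df, 10)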

##      term freq
## the   the 8724
## and   and 4557
## that that 1906
## for   for 1868
## with with 1330
## you   you 1296
## was   was 1128
## have have  943
## this this  899
## are   are  849

Simple Plot (Top 15 words)

library(ggplot2)
top15 <- head(freq_df, 15)
ggplot(top15, aes(x = reorder(term, freq), y = freq)) +
  geom_col() + coord_flip() + labs(x = "", y = "Frequency",
    title = "Top 15 words (sample)")

Plans for Prediction Algorithm

- Tokenize text to build unigram/bigram/trigram frequency tables.
- Use a backoff model (Katz or Stupid Backoff) to predict next words.
- Build a predictNextWord() function.
- Wrap the predictor in a Shiny app: text input → ranked predicted words.
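
To make the plan concrete, here is a rough base-R sketch of the first three steps. It is illustrative only, not the final implementation: tokenize(), ngram_counts(), and build_model() are hypothetical helper names, and the backoff is deliberately simplified. It returns candidates from the highest n-gram order that matches the input, whereas full Stupid Backoff would also score lower-order candidates with a fixed discount (commonly 0.4).

```r
# Assumed tokenizer: lowercase, keep letters and apostrophes, split on spaces
tokenize <- function(x) {
  x <- gsub("[^a-z' ]", " ", tolower(x))
  lapply(strsplit(x, "\\s+"), function(tok) tok[nzchar(tok)])
}

# Count n-grams of order n; returns a named count vector, most frequent first
ngram_counts <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    idx <- seq_len(length(tok) - n + 1)
    vapply(idx, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  counts <- as.vector(tab)
  names(counts) <- names(tab)
  sort(counts, decreasing = TRUE)
}

# Model = list of unigram, bigram, and trigram counts (positions 1, 2, 3)
build_model <- function(lines) {
  tokens <- tokenize(lines)
  lapply(1:3, function(n) ngram_counts(tokens, n))
}

# Simplified backoff: try trigrams, then bigrams, then fall back to unigrams
predictNextWord <- function(text, model, k = 3) {
  words <- tokenize(text)[[1]]
  for (n in 3:2) {
    counts <- model[[n]]
    if (length(words) >= n - 1 && length(counts) > 0) {
      prefix <- paste(tail(words, n - 1), collapse = " ")
      hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
      if (length(hits) > 0) {
        top <- names(hits)[seq_len(min(k, length(hits)))]
        return(sub(".* ", "", top))  # keep only the last word of each n-gram
      }
    }
  }
  # No higher-order match: return the most frequent single words
  names(model[[1]])[seq_len(min(k, length(model[[1]])))]
}
```

For example, model <- build_model(sample_data) followed by predictNextWord("thanks for the", model) would return up to three candidate next words. Scanning n-gram names with startsWith() is simple but slow on large tables, so a production version would likely index n-grams by prefix instead.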

Conclusion

This milestone report demonstrates:
- Data was loaded and inspected.
- Basic counts and small plots are provided.
- A clear roadmap for the prediction algorithm and Shiny app is included.

Milestone Report

Manju Vidyananda Gowda