Data Science Capstone - Milestone Report

1. Introduction

This report presents the exploratory data analysis (EDA) of the text datasets used for building a next word prediction model. The datasets include blogs, news, and Twitter data in English. The goal of this analysis is to understand the structure, size, and basic characteristics of the data.

2. Loading the Data

# Read only the first 10,000 lines of each file to keep the analysis fast
blogs <- readLines("en_US.blogs.txt", n = 10000)
news <- readLines("en_US.news.txt", n = 10000)
twitter <- readLines("en_US.twitter.txt", n = 10000)

3. Basic Summary of the Data

library(stringi)

summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Line_Count = c(length(blogs), length(news), length(twitter)),
  Word_Count = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

summary_table

##   Dataset Line_Count Word_Count
## 1   Blogs      10000     412805
## 2    News      10000     348070
## 3 Twitter      10000     126511

Explanation

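The table above shows the number of lines and words in the first 10,000 lines read from each dataset. Blog and news lines are long, averaging about 41 and 35 words per line, while tweets are much shorter at about 13 words per line. Blogs and Twitter tend to contain informal text, while news data is more structured and formal.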
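
One way to compare the three sources is the average number of words per line, which follows directly from the counts in the table. The sketch below recomputes it in base R; the `Words_Per_Line` column is an illustrative addition, not part of the original table.

```r
# Counts copied from the summary table above (first 10,000 lines per file)
summary_table <- data.frame(
  Dataset    = c("Blogs", "News", "Twitter"),
  Line_Count = c(10000, 10000, 10000),
  Word_Count = c(412805, 348070, 126511)
)

# Average words per line (illustrative extra column, not in the original table)
summary_table$Words_Per_Line <- round(
  summary_table$Word_Count / summary_table$Line_Count, 1
)

print(summary_table)
```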

4. Sampling the Data

To keep computation manageable, 1,000 lines are randomly sampled from each of the three sources (3,000 lines in total) for further analysis.

set.seed(123)

blogs_sample <- sample(blogs, 1000)
news_sample <- sample(news, 1000)
twitter_sample <- sample(twitter, 1000)

sample_data <- c(blogs_sample, news_sample, twitter_sample)

Explanation

A subset of the data is taken to reduce computation time while still preserving the dominant patterns. Setting the seed with set.seed(123) makes the sample reproducible.
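
The effect of fixing the seed can be checked with a small sketch. It uses made-up placeholder lines (the `fake_lines` vector is an assumption for illustration, not the capstone data) and shows that repeating the same seed yields the same sample, drawn without replacement.

```r
# Placeholder data standing in for 10,000 lines of text (illustrative only)
fake_lines <- paste("line", 1:10000)

# Same seed, same call: the two samples should match exactly
set.seed(123)
first_draw <- sample(fake_lines, 1000)

set.seed(123)
second_draw <- sample(fake_lines, 1000)

same_sample <- identical(first_draw, second_draw)

# sample() draws without replacement by default, so no line repeats
n_unique <- length(unique(first_draw))

print(same_sample)
print(n_unique)
```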

5. Text Cleaning and Processing

library(tm)

# Create corpus
corpus <- Corpus(VectorSource(sample_data))

# Convert to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# Remove numbers
corpus <- tm_map(corpus, removeNumbers)

# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)

Explanation

The text data is cleaned by converting all characters to lowercase and removing punctuation, numbers, and extra whitespace. Common stopwords such as “the”, “and”, and “is” are also removed. This reduces noise in the data and makes it easier to identify meaningful patterns during analysis.
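
To make the effect of each cleaning step concrete, here is a minimal base-R sketch that approximates the same sequence on a single made-up sentence. The `clean_text` helper and its three-word stopword list are assumptions for illustration; the tm functions used above may differ in detail (for example, in the stopword list and in how whitespace is trimmed).

```r
# Hypothetical helper: a base-R approximation of the cleaning steps
clean_text <- function(x, stop_words = c("the", "and", "is")) {
  x <- tolower(x)                                   # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)                  # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)                  # remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # remove whole-word stopwords
  x <- gsub("\\s+", " ", x)                         # collapse repeated whitespace
  trimws(x)                                         # drop leading/trailing spaces
}

cleaned <- clean_text("The weather is GREAT, and 2 dogs agree!")
print(cleaned)
```

Note that the order matters: lowercasing first lets a lowercase stopword list match “The” as well as “the”.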

6. Word Frequency Analysis

dtm <- DocumentTermMatrix(corpus)
freq <- colSums(as.matrix(dtm))

freq <- sort(freq, decreasing = TRUE)

top_words <- head(freq, 10)
top_words

##   said   will   just    one    can   like   time    get people   good 
##    327    281    259    259    211    202    189    159    139    133

Explanation

Because stopwords were removed during cleaning, the most frequent words are general-purpose terms such as “said”, “will”, “just”, and “one” rather than function words like “the” or “and”. The word “said” leads with 327 occurrences, which likely reflects reported speech in the news sample.
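
As a rough illustration of what summing the columns of a document-term matrix represents, the sketch below counts word occurrences across two made-up documents using base R only. The documents and variable names are assumptions for illustration, not part of the capstone data.

```r
# Two toy documents (illustrative only)
toy_docs <- c("good time good people", "people said good")

# Split into words and pool the tokens from all documents
toy_tokens <- unlist(strsplit(toy_docs, " ", fixed = TRUE))

# Count each word across the whole collection and sort by frequency
toy_freq <- sort(table(toy_tokens), decreasing = TRUE)

print(head(toy_freq, 2))
```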

7. Visualization

barplot(top_words,
        main = "Top 10 Most Frequent Words",
        las = 2)

Explanation

The bar plot highlights the ten most frequent words that remain after cleaning. “said” and “will” stand out, and the counts fall off gradually from 327 to 133.

8. Key Findings

The datasets are large: the first 10,000 lines of each file alone contain about 887,000 words in total.
Blogs and Twitter data are more informal in nature.
News data is more structured and formal.
A small sample is sufficient for initial analysis.
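After stopword removal, general-purpose words such as “said”, “will”, and “just” dominate the sample.
Stopwords were removed for this exploratory analysis; whether to keep them in the prediction model still needs to be evaluated, since they are among the words a user is most likely to type next.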

9. Plan for Prediction Model

Use n-gram models (unigram, bigram, trigram)
Predict next word based on previous words
Use frequency-based approach
Improve performance using smoothing techniques
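
The plan above can be sketched in a few lines of base R. This is a minimal illustration on a made-up three-sentence corpus, not the final model: it builds bigram counts, applies add-one (Laplace) smoothing as one possible smoothing technique, and falls back to the most frequent single word when the previous word is unseen. The `predict_next` helper and all variable names are hypothetical.

```r
# Toy corpus (made-up sentences, not the capstone data)
toy_corpus <- c("i love data science", "i love r", "data science is fun")

# Split each line into lowercase word tokens
tokens <- strsplit(tolower(toy_corpus), "\\s+")

# Build (previous word, next word) pairs within each line
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Bigram counts: rows = previous word, columns = next word
bigram_counts <- table(pairs$prev, pairs$nxt)
unigram_counts <- table(unlist(tokens))

# Hypothetical helper: most likely next word for a given previous word
predict_next <- function(prev_word) {
  # Fall back to the most frequent single word for unseen contexts
  if (!prev_word %in% rownames(bigram_counts)) {
    return(names(which.max(unigram_counts)))
  }
  row <- bigram_counts[prev_word, ]
  # Add-one (Laplace) smoothing over the next-word vocabulary
  probs <- (row + 1) / (sum(row) + length(row))
  names(which.max(probs))
}

print(predict_next("i"))
print(predict_next("data"))
```

Add-one smoothing does not change the top prediction here; it only keeps unseen word pairs from receiving zero probability. A trigram version would follow the same pattern with the two previous words as the context.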

10. Conclusion

This exploratory analysis provides a clear understanding of the dataset. It helps in preparing the data for building a next word prediction model and developing a Shiny application.

Dhivya R

2026-04-24
