Data Science Capstone - Milestone Report

1. Introduction

This report presents the exploratory data analysis (EDA) of the text datasets used for building a next word prediction model. The datasets include blogs, news, and Twitter data in English. The goal is to understand the size, structure, and basic characteristics of the data, with a focus on word frequency patterns.

2. Loading the Data

# Read only the first 10,000 lines of each file to keep the analysis fast
blogs <- readLines("en_US.blogs.txt", n = 10000)
news <- readLines("en_US.news.txt", n = 10000)
twitter <- readLines("en_US.twitter.txt", n = 10000)

3. Basic Summary of the Data

library(stringi)

summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Line_Count = c(length(blogs), length(news), length(twitter)),
  Word_Count = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

summary_table

##   Dataset Line_Count Word_Count
## 1   Blogs      10000     412805
## 2    News      10000     348070
## 3 Twitter      10000     126511

Explanation

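The table above shows the number of lines and words in the first 10,000 lines of each file. For the same number of lines, blogs contain more than three times as many words as Twitter, so tweets are much shorter on average. Blogs and Twitter contain a large amount of informal text, while news data is more structured and formal.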
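The difference in line length can be made explicit by dividing word counts by line counts. The following is a minimal sketch that re-enters the reported figures by hand; the `Words_Per_Line` column is a name introduced here for illustration and is not part of the original analysis.

```r
# Figures copied from the summary table above (first 10,000 lines per file).
summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Line_Count = c(10000, 10000, 10000),
  Word_Count = c(412805, 348070, 126511)
)

# Average number of words per line, rounded to one decimal place.
summary_table$Words_Per_Line <- round(
  summary_table$Word_Count / summary_table$Line_Count, 1
)

summary_table[, c("Dataset", "Words_Per_Line")]
```

On these figures a blog line averages about 41 words, a news line about 35, and a tweet about 13.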

Additional Summary

# Total characters per dataset (nchar() is base R, so no extra package is needed)

char_count <- c(
  sum(nchar(blogs)),
  sum(nchar(news)),
  sum(nchar(twitter))
)

data.frame(
  Dataset = c("Blogs","News","Twitter"),
  Character_Count = char_count
)

##   Dataset Character_Count
## 1   Blogs         2277384
## 2    News         2035687
## 3 Twitter          681674

Explanation

Character count provides an estimate of the overall size of each dataset. This helps understand how large the text data is and how much processing power may be required.

4. Sampling the Data

Due to the large size of the dataset, a small sample is taken for further analysis.

set.seed(123)

blogs_sample <- sample(blogs, 1000)
news_sample <- sample(news, 1000)
twitter_sample <- sample(twitter, 1000)

sample_data <- c(blogs_sample, news_sample, twitter_sample)

Sample Size Check

length(sample_data)

## [1] 3000

Explanation

A random subset of 1,000 lines from each dataset (3,000 lines in total) is drawn from the 10,000 lines loaded per file. This reduces computation time while preserving the overall structure and patterns of the data, so the analysis can run efficiently without processing the entire dataset.
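The role of `set.seed()` in the sampling step can be illustrated with a small sketch. It uses a toy character vector (`toy_lines`, a hypothetical stand-in for one of the loaded datasets) rather than the real files.

```r
# Toy stand-in for one dataset; the real vectors hold 10,000 lines each.
toy_lines <- paste("line", 1:10000)

# Same seed, same draw: the sample can be reproduced exactly.
set.seed(123)
first_draw <- sample(toy_lines, 1000)
set.seed(123)
second_draw <- sample(toy_lines, 1000)
identical(first_draw, second_draw)

# sample() draws without replacement by default, so no line appears twice.
anyDuplicated(first_draw) == 0

# Fraction of the loaded lines that ends up in the sample.
length(first_draw) / length(toy_lines)
```

Fixing the seed means anyone rerunning the report gets the same 3,000 lines and therefore the same frequency tables.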

5. Text Cleaning and Processing

library(tm)

# Create corpus
corpus <- Corpus(VectorSource(sample_data))

# Convert to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# Remove numbers
corpus <- tm_map(corpus, removeNumbers)

# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)

Explanation

The text is cleaned by converting all characters to lowercase and removing punctuation, numbers, and extra whitespace. Common stopwords such as “the”, “and”, and “is” are also removed. This reduces noise in the data and makes meaningful patterns easier to identify during analysis.
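To see what each cleaning step does, the same sequence can be approximated in base R on a single line. This is only a sketch: the example sentence and the four-word `toy_stopwords` list are made up for illustration, whereas the report itself relies on the tm transformations and the full `stopwords("english")` list.

```r
# Hypothetical example line and a tiny stopword list, for illustration only.
text <- "The 3 cats ARE sitting, and the dog is barking!"
toy_stopwords <- c("the", "and", "is", "are")

text <- tolower(text)                    # convert to lowercase
text <- gsub("[[:punct:]]", "", text)    # remove punctuation
text <- gsub("[[:digit:]]", "", text)    # remove numbers

# Remove stopwords as whole words only (\\b marks a word boundary).
pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
text <- gsub(pattern, "", text, perl = TRUE)

# Collapse repeated spaces; trimws() also drops leading and trailing spaces,
# which goes slightly beyond what the whitespace step in the report does.
text <- trimws(gsub("\\s+", " ", text))
text
# [1] "cats sitting dog barking"
```

The order of the steps matters: lowercasing first ensures that “The” and “ARE” match the lowercase stopword list.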

6. Word Frequency Analysis

dtm <- DocumentTermMatrix(corpus)
freq <- colSums(as.matrix(dtm))

freq <- sort(freq, decreasing = TRUE)

top_words <- head(freq, 10)
top_words

##   said   will   just    one    can   like   time    get people   good 
##    327    281    259    259    211    202    189    159    139    133

Explanation

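Because stopwords such as “the”, “and”, and “to” were removed during cleaning, the most frequent remaining words are general-purpose terms such as “said”, “will”, “just”, and “one”. Words like “will”, “just”, and “can” carry little topical meaning but are not on the stopword list used here, so they remain in the counts.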
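Summing the columns of a document-term matrix amounts to counting how often each distinct term occurs across all documents. A minimal base R sketch of that idea, using three made-up cleaned lines (`docs` is hypothetical and not the report's corpus):

```r
# Toy cleaned lines (hypothetical), standing in for the corpus documents.
docs <- c("said will just", "will said one", "said time")

# Split each line on whitespace and pool the tokens.
tokens <- unlist(strsplit(docs, "\\s+"))

# Count each distinct word and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)
head(freq, 2)
```

Here “said” is counted three times and “will” twice. One difference worth keeping in mind: with its default settings, tm's `DocumentTermMatrix()` also drops terms shorter than three characters, which this sketch does not do.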

7. Visualization

7.1 Bar Plot of Top Words

barplot(top_words,
        main = "Top 10 Most Frequent Words",
        las = 2)

Explanation

This bar plot shows the most frequently occurring words in the dataset. It helps identify commonly used words and overall patterns in text data.

7.2 Horizontal Bar Plot

barplot(rev(top_words),
        horiz = TRUE,
        las = 1,  # draw the word labels horizontally so they are easy to read
        main = "Top Words (Horizontal View)")

Explanation

The horizontal bar plot improves readability of word labels and provides a clearer comparison of word frequencies.

7.3 Word Cloud

library(wordcloud)

wordcloud(names(freq), freq, max.words = 100)

Explanation

The word cloud visually highlights the most frequent words. Larger words indicate higher frequency, making it easy to identify dominant terms.

7.4 Dataset Comparison (Line Count)

barplot(summary_table$Line_Count,
        names.arg = summary_table$Dataset,
        col = "lightblue",
        main = "Line Count Comparison")

Explanation

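This plot compares the number of lines across the Blogs, News, and Twitter datasets. Because each file was read with n = 10000, all three bars have the same height, so the plot reflects the loading limit rather than the true size of the files.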
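Since the line counts are capped at 10,000 per file by the loading step, a comparison of word counts may be more informative. A minimal sketch, assuming the word counts reported in Section 3 (the `word_counts` vector is re-entered by hand here):

```r
# Word counts copied from the summary table in Section 3.
word_counts <- c(Blogs = 412805, News = 348070, Twitter = 126511)

# barplot() returns the bar midpoints; they are stored only so the
# result of the call can be inspected afterwards.
mids <- barplot(word_counts,
                col = "lightblue",
                main = "Word Count Comparison (first 10,000 lines)",
                ylab = "Words")
```

With equal line counts, the bar heights show directly how much more text a typical blog or news line carries than a tweet.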

8. Key Findings

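The datasets are large: the first 10,000 lines of each file alone contain about 887,000 words in total.
Blogs and Twitter data are more informal in nature.
News data is more structured and formal.
A small sample is sufficient for initial analysis.
After stopword removal, general-purpose words such as “said”, “will”, and “just” dominate the sample.
Removing stopwords lets more informative words surface in the frequency analysis; whether to remove them for the prediction model is a separate decision, since the model may need to predict them.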

9. Plan for Prediction Model

Use n-gram models (unigram, bigram, trigram)
Predict next word based on previous words
Use frequency-based approach
Improve performance using smoothing techniques
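The plan above can be sketched in a few lines of base R. This is an illustrative toy, not the planned implementation: the training lines are made up, `predict_next()` is a name introduced for this sketch, and the fallback to the most frequent word is a crude back-off standing in for a proper smoothing method.

```r
# Toy training text (hypothetical); a real model would use the cleaned corpus.
train_lines <- c("i love data science", "i love r", "i like data")
words_by_line <- strsplit(train_lines, "\\s+")

# Bigram table: every word paired with the word that follows it.
bigrams <- do.call(rbind, lapply(words_by_line, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Unigram counts serve as a fallback when the previous word was never seen.
unigrams <- sort(table(unlist(words_by_line)), decreasing = TRUE)

# Return the most frequent follower of `word`, or back off to unigrams.
predict_next <- function(word) {
  candidates <- bigrams$nxt[bigrams$prev == word]
  if (length(candidates) == 0) {
    return(names(unigrams)[1])
  }
  names(sort(table(candidates), decreasing = TRUE))[1]
}

predict_next("i")     # "love": it follows "i" twice, "like" only once
predict_next("data")  # "science": the only word seen after "data"
predict_next("zzz")   # "i": unseen word, falls back to the top unigram
```

A trigram model would follow the same pattern with two preceding words as the key, falling back to the bigram table when the pair is unseen.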

10. Conclusion

This exploratory analysis provides a clear understanding of the dataset. It helps in preparing the data for building a next word prediction model and developing a Shiny application.

Dhivya R

2026-04-25
