Exploratory Data Analysis – Milestone Report

This report presents an exploratory analysis of the text data used in the Data Science Capstone project. The purpose of this analysis is to understand the structure of the data, summarize its main characteristics, and outline plans for building a word prediction algorithm and Shiny application.

Loading Required Libraries

library(tm)

## Loading required package: NLP

library(stringi)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading the Dataset


``` r
# Read each corpus as UTF-8 text, skipping embedded NUL characters
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```

Basic Summary Statistics

The following table summarizes the number of lines and words in each dataset.

## Summary Statistics Table


``` r
data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
```

```
##   Dataset   Lines    Words
## 1   Blogs  899288 37546806
## 2    News 1010206 34761151
## 3 Twitter 2360148 30096690
```

Sampling the Data

Because the datasets are large, a random sample of 5,000 blog lines, drawn with a fixed seed for reproducibility, is used for exploratory visualization.

``` r
set.seed(123)
blogs_sample <- sample(blogs, 5000)
blog_words <- stri_count_words(blogs_sample)
```
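
The sample above covers only the blogs corpus. If the same step were later applied to all three sources, it could look roughly like the sketch below. The helper `sample_lines` and the `toy` list are hypothetical and used only for illustration; they are not part of the report's code, and the toy vectors stand in for the real corpora.

``` r
# Hypothetical helper: draw up to n lines from a character vector, reproducibly.
sample_lines <- function(lines, n, seed = 123) {
  set.seed(seed)
  # Guard against asking for more lines than the corpus contains.
  sample(lines, min(n, length(lines)))
}

# Toy stand-ins for the real corpora (assumed data, for illustration only).
toy <- list(
  blogs   = paste("blog line", 1:20),
  news    = paste("news line", 1:10),
  twitter = paste("tweet", 1:3)
)

# Apply the same sampling step to every source.
samples <- lapply(toy, sample_lines, n = 5)
lengths(samples)
```

Capping the sample size with `min()` keeps the call from failing on a source that has fewer lines than requested.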

## Distribution of Words per Blog Line


``` r
ggplot(data.frame(words = blog_words), aes(x = words)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  labs(
    title = "Distribution of Words per Blog Line",
    x = "Words per Line",
    y = "Frequency"
  )
```

## Future Work and Prediction Plan

The next phase of this project will focus on building a word prediction model using n-gram techniques. A Shiny web application will be developed to allow users to interact with the prediction system.
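
As a rough illustration of the n-gram idea, the sketch below builds bigram counts from a tiny toy corpus in base R and predicts the most frequent follower of a word. The corpus, the whitespace tokenization, and the helper `predict_next` are assumptions made for this example; the eventual model may use longer n-grams, a different tokenizer, and some form of smoothing or back-off.

``` r
# Toy corpus (assumed data); a real model would be trained on the sampled lines.
corpus <- c("i love data science", "i love r", "data science is fun")

# Lowercase each line and split it into word tokens on whitespace.
tokens <- strsplit(tolower(corpus), "\\s+")

# Build (first, second) word pairs within each line.
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Count how often each bigram occurs and drop pairs that never occur.
bigram_counts <- as.data.frame(
  table(first = pairs$first, second = pairs$second),
  stringsAsFactors = FALSE
)
bigram_counts <- bigram_counts[bigram_counts$Freq > 0, ]

# Hypothetical helper: most frequent follower of `word`, or NA if unseen.
predict_next <- function(word, counts) {
  hits <- counts[counts$first == tolower(word), ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$second[which.max(hits$Freq)]
}

predict_next("i", bigram_counts)     # "love"
predict_next("data", bigram_counts)  # "science"
```

A word that never appears as the first element of a bigram returns `NA`, which is the case a back-off strategy would need to handle.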

Srikanth Reddy

2026-01-05