This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.
This milestone report describes the major features of the training data with our exploratory data analysis and summarizes our plans for creating the predictive model.
## Load CRAN Packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stringr)
library(tidytext)
library(knitr)
library(tibble)
Once the dataset is downloaded, we start reading it. Because this is a huge dataset, we read it line by line and keep only the amount of data we need. Before doing that, we first list all the files in the final/en_US folder. The data sets consist of text from three different sources: 1) News, 2) Blogs, and 3) Twitter feeds. In this project, we will focus only on the English (US) data sets.
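A minimal sketch of how the files could be listed is shown below; the relative path final/en_US and the use of file.size() for a quick size check are assumptions, not part of the original analysis.
data_dir <- "final/en_US"  # assumed location of the unzipped en_US files
files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)
# Quick overview: file name and size in megabytes
file_info <- tibble(
  file = basename(files),
  size_MB = round(file.size(files) / 1024^2, 1)
)
kable(file_info, caption = "Files in final/en_US")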
Before performing exploratory analysis, we must clean the data. This involves removing URLs, special characters, punctuation, numbers, and excess whitespace, removing stop words, and converting the text to lower case; we also need to handle problematic UTF-8 characters. Since the data sets are quite large, we randomly sample 2% of the data to demonstrate the data cleaning and exploratory analysis.
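The raw_data object used below could be built along these lines. This is a minimal sketch: the file names, the fixed seed, and the exact cleaning regexes are assumptions, and stop words and pure numbers are handled later, at the tokenization stage.
set.seed(1234)  # assumed seed for reproducible sampling
read_sample <- function(path, source, frac = 0.02) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")  # drop problematic UTF-8 characters
  tibble(source = source,
         text   = sample(lines, size = ceiling(frac * length(lines))))
}
raw_data <- bind_rows(
  read_sample("final/en_US/en_US.blogs.txt",   "Blogs"),
  read_sample("final/en_US/en_US.news.txt",    "News"),
  read_sample("final/en_US/en_US.twitter.txt", "Twitter")
) %>%
  mutate(
    text = str_to_lower(text),                           # lower case
    text = str_remove_all(text, "http\\S+|www\\.\\S+"),  # URLs
    text = str_remove_all(text, "[^a-z0-9' ]"),          # punctuation and special characters
    text = str_squish(text)                              # excess whitespace
  )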
summary_stats <- raw_data %>%
group_by(source) %>%
summarise(
lines = n(),
words = sum(str_count(text, "\\S+")),
avg_words_per_line = mean(str_count(text, "\\S+"))
)
kable(summary_stats, caption = "Summary Statistics of Input Data")
| source | lines | words | avg_words_per_line |
|---|---|---|---|
| Blogs | 10 | 158 | 15.8 |
| News | 10 | 131 | 13.1 |
| Twitter | 10 | 129 | 12.9 |
## Exploratory Analysis
Now it is time to do some exploratory analysis on the data. It would be interesting and helpful to find the most frequently occurring words in the data. Here we list the most common n-grams: uni-grams, bi-grams, and tri-grams.
# Tokenize and remove stop words
tidy_text <- raw_data %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word") %>%
filter(!str_detect(word, "^\\d+$")) # Remove pure numbers
# Count frequencies
top_words <- tidy_text %>%
count(word, sort = TRUE) %>%
top_n(15)
## Selecting by n
# Plot
ggplot(top_words, aes(x = reorder(word, n), y = n, fill = n)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Top 15 Most Frequent Words (All Sources)",
x = NULL, y = "Frequency") +
theme_minimal()
# Bigram Tokenization
bigrams <- raw_data %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram))
# Count bigram frequencies (stop words are kept; optionally, one could separate the words and drop bigrams where BOTH are stop words)
bigrams_filtered <- bigrams %>%
count(bigram, sort = TRUE) %>%
top_n(15)
## Selecting by n
# Plot
ggplot(bigrams_filtered, aes(x = reorder(bigram, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 15 Most Frequent Bigrams",
subtitle = "Includes stop words as they are vital for sentence structure",
x = NULL, y = "Frequency") +
theme_minimal()
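The exploratory analysis above also mentions tri-grams; a minimal sketch mirroring the bigram code is shown below (the top-15 cutoff and fill colour are arbitrary choices).
# Trigram Tokenization (same approach as the bigrams)
trigrams_filtered <- raw_data %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  top_n(15)
# Plot
ggplot(trigrams_filtered, aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Trigrams",
       x = NULL, y = "Frequency") +
  theme_minimal()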