This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.
This milestone report describes the major features of the training data with our exploratory data analysis and summarizes our plans for creating the predictive model.
## Load CRAN Packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stringr)
library(tidytext)
library(knitr)
library(tibble)
Once the dataset is downloaded, we start reading it. Because this is a huge dataset, we read it line by line and keep only the amount of data we need. Before doing that, we first list all the files in the final/en_US folder. The data sets consist of text from three different sources: 1) News, 2) Blogs, and 3) Twitter feeds. In this project, we will focus only on the English (US) data sets.
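A minimal sketch of how the files could be listed is shown below; the relative path final/en_US and the use of file.size() for a quick size check are assumptions, not part of the original analysis.
data_dir <- "final/en_US"  # assumed location of the unzipped en_US files
files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)
# Quick overview: file name and size in megabytes
file_info <- tibble(
  file = basename(files),
  size_MB = round(file.size(files) / 1024^2, 1)
)
kable(file_info, caption = "Files in final/en_US")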
Before performing exploratory analysis, we must clean the data. This involves removing URLs, special characters, punctuation, numbers, and excess whitespace, removing stop words, and converting the text to lower case; we also need to handle problematic UTF-8 characters. Since the data sets are quite large, we randomly sample 2% of the data to demonstrate the data cleaning and exploratory analysis.
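The raw_data object used below could be built along these lines. This is a minimal sketch: the file names, the fixed seed, and the exact cleaning regexes are assumptions, and stop words and pure numbers are handled later, at the tokenization stage.
set.seed(1234)  # assumed seed for reproducible sampling
read_sample <- function(path, source, frac = 0.02) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")  # drop problematic UTF-8 characters
  tibble(source = source,
         text   = sample(lines, size = ceiling(frac * length(lines))))
}
raw_data <- bind_rows(
  read_sample("final/en_US/en_US.blogs.txt",   "Blogs"),
  read_sample("final/en_US/en_US.news.txt",    "News"),
  read_sample("final/en_US/en_US.twitter.txt", "Twitter")
) %>%
  mutate(
    text = str_to_lower(text),                           # lower case
    text = str_remove_all(text, "http\\S+|www\\.\\S+"),  # URLs
    text = str_remove_all(text, "[^a-z0-9' ]"),          # punctuation and special characters
    text = str_squish(text)                              # excess whitespace
  )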
summary_stats <- raw_data %>%
group_by(source) %>%
summarise(
lines = n(),
words = sum(str_count(text, "\\S+")),
avg_words_per_line = mean(str_count(text, "\\S+"))
)
kable(summary_stats, caption = "Summary Statistics of Input Data")
| source | lines | words | avg_words_per_line |
|---|---|---|---|
| Blogs | 10 | 158 | 15.8 |
| News | 10 | 131 | 13.1 |
| Twitter | 10 | 129 | 12.9 |
## Exploratory Analysis
Now it is time to do some exploratory analysis on the data. It would be interesting and helpful to find the most frequently occurring words in the data. Here we list the most common n-grams: uni-grams, bi-grams, and tri-grams.
# Tokenize and remove stop words
tidy_text <- raw_data %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word") %>%
filter(!str_detect(word, "^\\d+$")) # Remove pure numbers
# Count frequencies
top_words <- tidy_text %>%
count(word, sort = TRUE) %>%
top_n(15)
## Selecting by n
# Plot
ggplot(top_words, aes(x = reorder(word, n), y = n, fill = n)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Top 15 Most Frequent Words (All Sources)",
x = NULL, y = "Frequency") +
theme_minimal()
# Bigram Tokenization
bigrams <- raw_data %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram))
# Count bigram frequencies (stop words are kept; optionally, one could separate the words and drop bigrams where BOTH are stop words)
bigrams_filtered <- bigrams %>%
count(bigram, sort = TRUE) %>%
top_n(15)
## Selecting by n
# Plot
ggplot(bigrams_filtered, aes(x = reorder(bigram, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 15 Most Frequent Bigrams",
subtitle = "Includes stop words as they are vital for sentence structure",
x = NULL, y = "Frequency") +
theme_minimal()
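The exploratory analysis above also mentions tri-grams; a minimal sketch mirroring the bigram code is shown below (the top-15 cutoff and fill colour are arbitrary choices).
# Trigram Tokenization (same approach as the bigrams)
trigrams_filtered <- raw_data %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  top_n(15)
# Plot
ggplot(trigrams_filtered, aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Trigrams",
       x = NULL, y = "Frequency") +
  theme_minimal()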