Data Science Capstone - Milestone Report

Author: Deekshith N

Date: December 4, 2025

1) Introduction

This document corresponds to the Milestone Report, the week 2 assignment of the Data Science Capstone course on Coursera. The course is the 10th of the 10 courses that make up the Data Science Specialization (DSS) from Johns Hopkins University. It is designed to allow students to create a usable, public data product that showcases their skills to potential employers, drawing on real-world problems.

2) Dataset

2.1) Description

The data originates from a corpus known as HC Corpora (archived at https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html). Corpora are collected from publicly available sources via a web crawler.

The dataset can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Once uncompressed, it consists of four folders corresponding to four different languages (German, English, Finnish, and Russian), with each folder containing three files from three text sources (blogs, news, and Twitter).

File List (HC Corpora)

| # | File |
|---|------|
| 1 | HC_Corpora/de_DE/de_DE.blogs.txt |
| 2 | HC_Corpora/de_DE/de_DE.news.txt |
| 3 | HC_Corpora/de_DE/de_DE.twitter.txt |
| 4 | HC_Corpora/en_US/en_US.blogs.txt |
| 5 | HC_Corpora/en_US/en_US.news.txt |
| 6 | HC_Corpora/en_US/en_US.twitter.txt |
| 7 | HC_Corpora/fi_FI/fi_FI.blogs.txt |
| 8 | HC_Corpora/fi_FI/fi_FI.news.txt |
| 9 | HC_Corpora/fi_FI/fi_FI.twitter.txt |
| 10 | HC_Corpora/ru_RU/ru_RU.blogs.txt |
| 11 | HC_Corpora/ru_RU/ru_RU.news.txt |
| 12 | HC_Corpora/ru_RU/ru_RU.twitter.txt |

Note on cleaning: special characters, such as question marks and other non-standard symbols, appear in the texts and must be taken into account when filtering during the data cleaning stage.

2.2) Dataset Details

The initial analysis summarizes each file's statistics: size, line count, word count, and the words-per-line (W/L) ratio, as obtained by running the `wc` command. An equivalent computation in R is sketched below.
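A minimal sketch of how these figures could be reproduced directly in R, assuming the `HC_Corpora/` directory layout from the file list above (the report itself used `wc`; the object names here are illustrative):

```r
# Hypothetical helper: compute size, line count, word count and W/L ratio per file
files <- list.files("HC_Corpora", pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
stats <- do.call(rbind, lapply(files, function(f) {
  lines <- readLines(f, skipNul = TRUE, warn = FALSE)
  words <- sum(lengths(strsplit(lines, "\\s+")))
  data.frame(File    = f,
             Size    = file.size(f),
             Size.MB = round(file.size(f) / 1024^2, 2),
             Lines   = length(lines),
             Words   = words,
             W.L     = round(words / length(lines), 2))
}))
stats
```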

Full Dataset Statistics

| File | Size (bytes) | Size (MB) | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| HC_Corpora/de_DE/de_DE.blogs.txt | 283838271 | 270.69 | 868434 | 37048742 | 42.66 | German |
| HC_Corpora/de_DE/de_DE.news.txt | 250346360 | 238.74 | 808455 | 33814041 | 41.83 | German |
| HC_Corpora/de_DE/de_DE.twitter.txt | 196531587 | 187.43 | 5352052 | 47348421 | 8.85 | German |
| HC_Corpora/en_US/en_US.blogs.txt | 210160014 | 200.43 | 899288 | 37334139 | 41.51 | English |
| HC_Corpora/en_US/en_US.news.txt | 205811889 | 196.28 | 772594 | 34494541 | 44.65 | English |
| HC_Corpora/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30373583 | 12.87 | English |
| HC_Corpora/fi_FI/fi_FI.blogs.txt | 2174622 | 2.07 | 43632 | 415309 | 9.52 | Finnish |
| HC_Corpora/fi_FI/fi_FI.news.txt | 2364024 | 2.25 | 44926 | 451978 | 10.06 | Finnish |
| HC_Corpora/fi_FI/fi_FI.twitter.txt | 2772504 | 2.64 | 67439 | 520261 | 7.71 | Finnish |
| HC_Corpora/ru_RU/ru_RU.blogs.txt | 28414343 | 27.09 | 74012 | 2673898 | 36.13 | Russian |
| HC_Corpora/ru_RU/ru_RU.news.txt | 24043940 | 22.93 | 55767 | 2548238 | 45.69 | Russian |
| HC_Corpora/ru_RU/ru_RU.twitter.txt | 43876779 | 41.85 | 179786 | 3209866 | 17.85 | Russian |

English Language Files Only

For the Capstone project, only the **English language** files will be used.

| File | Size (bytes) | Size (MB) | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| HC_Corpora/en_US/en_US.blogs.txt | 210160014 | 200.43 | 899288 | 37334139 | 41.51 | English |
| HC_Corpora/en_US/en_US.news.txt | 205811889 | 196.28 | 772594 | 34494541 | 44.65 | English |
| HC_Corpora/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30373583 | 12.87 | English |

The English dataset totals approximately **556 MB** of data. Due to potential performance issues, a **1% subset** of the original dataset will be used for processing, as suggested in Task 1 of the course.

2.3) Dataset Cleaning

The cleaning process begins with loading and sampling. Text from the blogs, news, and Twitter files is loaded, skipping embedded nulls (`skipNul = TRUE`) and using the appropriate connection options (e.g., `open = 'rb'` for `en_US.news.txt`) to avoid warnings. A **1% sample** of the lines is then drawn, as sketched below.
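A minimal sketch of this step, assuming the English files sit under `HC_Corpora/en_US/` (the seed and object names are illustrative):

```r
set.seed(1234)                        # illustrative seed, for a reproducible sample
sampleRate <- 0.01                    # 1% of the lines, as suggested in Task 1

blogs   <- readLines("HC_Corpora/en_US/en_US.blogs.txt",   skipNul = TRUE, warn = FALSE)
twitter <- readLines("HC_Corpora/en_US/en_US.twitter.txt", skipNul = TRUE, warn = FALSE)
con     <- file("HC_Corpora/en_US/en_US.news.txt", open = "rb")   # binary mode avoids warnings
news    <- readLines(con, skipNul = TRUE, warn = FALSE)
close(con)

sampledText <- c(sample(blogs,   round(length(blogs)   * sampleRate)),
                 sample(news,    round(length(news)    * sampleRate)),
                 sample(twitter, round(length(twitter) * sampleRate)))
```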

The initial sampled text object is significantly smaller than the original set, approximately **4.6 MB** in size.

A corpus is created from the sampled text using the **tm** package to leverage its text mining functionalities. The original corpus has approximately **1,281,993 words**.
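A one-line sketch of the corpus creation with **tm**, assuming `sampledText` from the sampling step above:

```r
library(tm)
corpus <- VCorpus(VectorSource(sampledText))   # one document per sampled line
```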

Transformation Steps:

The cleaning is performed via the following sequential transformations (a code sketch follows the list):

  1. Converting to lowercase: Converts all text to lowercase using `tolower()`.
  2. Removing punctuation, numbers, and special characters: Uses `removePunctuation()`, `removeNumbers()`, and a custom `toSpace()` function to handle non-standard punctuation, URIs, Twitter users/hashtags, and email addresses.
  3. Stripping whitespace: Collapses runs of whitespace into a single blank using `stripWhitespace()`.
  4. Removing stop words: Removes English stop words using `removeWords` with `stopwords("english")`.
  5. Profanity filtering: Removes swear words using a list sourced from "A list of 723 bad words to blacklist..." (URL).
  6. Stemming the text: Reduces words to their stem or root (e.g., "working" to "work") using `stemDocument()` from the **SnowballC** package.
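A sketch of how these transformations could be chained with `tm_map()`, assuming the `corpus` object created above (the regular expressions and the `profanityWords` vector, read from the downloaded blacklist, are illustrative assumptions):

```r
library(tm)
library(SnowballC)

# Custom transformer: replace anything matching 'pattern' with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://\\S+")    # URIs
corpus <- tm_map(corpus, toSpace, "[@#]\\w+")             # Twitter users and hashtags
corpus <- tm_map(corpus, toSpace, "\\S+@\\S+")            # email addresses
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, profanityWords)     # profanityWords: vector from the blacklist
corpus <- tm_map(corpus, stemDocument)
```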

After all transformations, the corpus has **739,471 words**, which is **542,522** fewer than the original word count before cleaning.

3) Analysis

3.1) Exploratory Analysis

The exploratory analysis involves creating Document Term Matrices (DTMs) for 1-Grams (words), 2-Grams, and 3-Grams. This is done using the `DocumentTermMatrix()` function from the **tm** package and N-Gram tokenizers from the **RWeka** package. Frequencies are calculated, sorted, and the top 10 are plotted.
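A minimal sketch of the tokenization and frequency counts, assuming the cleaned `corpus` from the previous section (object names are illustrative; the same pattern is repeated for the 3-gram matrix):

```r
library(tm)
library(RWeka)

bigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

dtm1 <- DocumentTermMatrix(corpus)
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = bigramTokenizer))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = trigramTokenizer))

# Frequencies, sorted in decreasing order; the top 10 are then plotted
freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)
head(freq1, 10)
```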

Top 10 Frequencies

For 1-Grams (Words)

| Word | Frequency |
|---|---|
| can | 3994 |
| said | 3481 |
| one | 3203 |
| will | 2973 |
| just | 2907 |
| like | 2819 |
| get | 2603 |
| time | 2566 |
| new | 2480 |
| know | 2392 |

For 2-Grams

| 2-Gram | Frequency |
|---|---|
| look forward | 239 |
| can find | 228 |
| ralli point | 221 |
| feel like | 216 |
| new york | 196 |
| right now | 188 |
| dont know | 188 |
| last year | 173 |
| can get | 167 |
| stori time | 158 |

For 3-Grams

| 3-Gram | Frequency |
|---|---|
| new york citi | 44 |
| let know know | 44 |
| happi mother day | 40 |
| year old boy | 37 |
| happi new year | 36 |
| cinco de mayo | 33 |
| unit state america | 33 |
| let know soon | 32 |
| five year old | 31 |
| dont know dont | 30 |

Coverage Analysis (Task 2 Questions)

The following addresses the questions posed in Task 2:

The total number of word instances is 739,471, and the total number of unique words is 52,143.

Using the descending word frequencies, the cumulative frequency can be used to determine how many unique words are needed to cover a given share (e.g., 50% or 90%) of all word instances; a sketch of this computation follows.
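A minimal sketch of the coverage computation, assuming `freq1` is the sorted 1-gram frequency vector built in the exploratory analysis:

```r
coverage <- function(freq, fraction) {
  cumShare <- cumsum(freq) / sum(freq)   # cumulative share of all word instances
  which(cumShare >= fraction)[1]         # number of unique words needed
}
coverage(freq1, 0.5)   # unique words covering 50% of all instances
coverage(freq1, 0.9)   # unique words covering 90% of all instances
```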

Given that the English language primarily uses ASCII characters, non-English words can be detected by checking for characters with accents, umlauts, or non-English letters (such as Slavic letters), which are typically found in encodings like **ISO/IEC 8859-1** (Latin-1).

A function `detectNonEnglishWords` can be used to convert characters using `iconv()` and flag any words that fail the conversion, allowing them to be detected and removed.
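The report does not reproduce the bodies of these helpers; a minimal sketch, assuming a Latin-1 to ASCII conversion with `iconv()` (the implementations below are illustrative, not the author's exact code):

```r
# Illustrative implementations of the helpers described above
detectNonEnglishWords <- function(text) {
  words <- unlist(strsplit(text, " "))
  # iconv() returns NA when a word cannot be converted to plain ASCII
  !is.na(iconv(words, "latin1", "ASCII"))
}

removeNonEnglishWords <- function(text) {
  words <- unlist(strsplit(text, " "))
  paste(words[!is.na(iconv(words, "latin1", "ASCII"))], collapse = " ")
}
```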

```r
# Example: detecting non-English words
originalText <- 'The Fußball is the King of Sports'   # 'Fußball' is German
detectNonEnglishWords(originalText)
# 'Fußball' is flagged as FALSE (non-English); the remaining words return TRUE

# Example: removing non-English words
removeNonEnglishWords(originalText)
# Output: "The is the King of Sports"
```

3.2) Further Steps

The next steps for the project will involve:

  1. **Building a Predictive Algorithm**: Developing a predictive model using N-Gram lookups to compute probabilities for the next word's occurrence, backing off to a lower N-gram level (e.g., from 3-gram to 2-gram) as needed (back-off model; a brief sketch follows this list).
  2. **Developing a Web Application**: Creating a web application using **Shiny** that integrates the predictive algorithm to suggest the next word to the user in real-time.
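As an illustration only, a back-off lookup over the frequency tables built earlier could look roughly like this (assuming `freq1`, `freq2`, and `freq3` are the sorted 1-, 2-, and 3-gram frequency vectors built analogously to `freq1` above; this is a sketch of the idea, not the final algorithm):

```r
predictNextWord <- function(phrase, freq1, freq2, freq3) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # Try the 3-gram table first (frequencies are sorted, so the first hit is the best one)
  hits <- grep(paste0("^", paste(words, collapse = " "), " "), names(freq3), value = TRUE)
  if (length(hits) == 0 && length(words) > 1)        # back off to 2-grams
    hits <- grep(paste0("^", words[2], " "), names(freq2), value = TRUE)
  if (length(hits) == 0)                             # back off to the most frequent word
    return(names(freq1)[1])
  tail(strsplit(hits[1], " ")[[1]], 1)               # last word of the best-matching N-gram
}

predictNextWord("new york", freq1, freq2, freq3)     # e.g. "citi" with the stemmed counts above
```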