Data Science Capstone - Milestone Report

Author: Deekshith N

Date: December 4, 2025

1) Introduction

This document corresponds to the Milestone Report, the week 2 assignment of the Data Science Capstone course on Coursera. The course is the 10th of the 10 courses that make up the Data Science Specialization (DSS) from Johns Hopkins University. It is designed to allow students to create a usable, public data product that showcases their skills to potential employers, drawing on real-world problems.

2) Dataset

2.1) Description

The data originates from a corpus known as HC Corpora (archived at https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html). Corpora are collected from publicly available sources via a web crawler.

The dataset can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Once uncompressed, it consists of four folders corresponding to four different languages (German, English, Finnish, and Russian), with each folder containing three files from three text sources (blogs, news, and Twitter).

File List (HC Corpora)

| # | File |
|---|------|
| 1 | HC_Corpora/de_DE/de_DE.blogs.txt |
| 2 | HC_Corpora/de_DE/de_DE.news.txt |
| 3 | HC_Corpora/de_DE/de_DE.twitter.txt |
| 4 | HC_Corpora/en_US/en_US.blogs.txt |
| 5 | HC_Corpora/en_US/en_US.news.txt |
| 6 | HC_Corpora/en_US/en_US.twitter.txt |
| 7 | HC_Corpora/fi_FI/fi_FI.blogs.txt |
| 8 | HC_Corpora/fi_FI/fi_FI.news.txt |
| 9 | HC_Corpora/fi_FI/fi_FI.twitter.txt |
| 10 | HC_Corpora/ru_RU/ru_RU.blogs.txt |
| 11 | HC_Corpora/ru_RU/ru_RU.news.txt |
| 12 | HC_Corpora/ru_RU/ru_RU.twitter.txt |

Note on cleaning: special characters, such as question marks and other non-standard symbols, appear in the texts and must be taken into account when filtering during the data cleaning stage.

2.2) Dataset Details

The initial analysis summarizes each file's statistics: size, line count, word count, and the words-per-line (W/L) ratio, as obtained by running the `wc` command. An equivalent computation in R is sketched below.
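A minimal sketch of how these figures could be reproduced directly in R, assuming the `HC_Corpora/` directory layout from the file list above (the report itself used `wc`; the object names here are illustrative):

```r
# Hypothetical helper: compute size, line count, word count and W/L ratio per file
files <- list.files("HC_Corpora", pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
stats <- do.call(rbind, lapply(files, function(f) {
  lines <- readLines(f, skipNul = TRUE, warn = FALSE)
  words <- sum(lengths(strsplit(lines, "\\s+")))
  data.frame(File    = f,
             Size    = file.size(f),
             Size.MB = round(file.size(f) / 1024^2, 2),
             Lines   = length(lines),
             Words   = words,
             W.L     = round(words / length(lines), 2))
}))
stats
```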

Full Dataset Statistics

| File | Size (bytes) | Size (MB) | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| HC_Corpora/de_DE/de_DE.blogs.txt | 283838271 | 270.69 | 868434 | 37048742 | 42.66 | German |
| HC_Corpora/de_DE/de_DE.news.txt | 250346360 | 238.74 | 808455 | 33814041 | 41.83 | German |
| HC_Corpora/de_DE/de_DE.twitter.txt | 196531587 | 187.43 | 5352052 | 47348421 | 8.85 | German |
| HC_Corpora/en_US/en_US.blogs.txt | 210160014 | 200.43 | 899288 | 37334139 | 41.51 | English |
| HC_Corpora/en_US/en_US.news.txt | 205811889 | 196.28 | 772594 | 34494541 | 44.65 | English |
| HC_Corpora/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30373583 | 12.87 | English |
| HC_Corpora/fi_FI/fi_FI.blogs.txt | 2174622 | 2.07 | 43632 | 415309 | 9.52 | Finnish |
| HC_Corpora/fi_FI/fi_FI.news.txt | 2364024 | 2.25 | 44926 | 451978 | 10.06 | Finnish |
| HC_Corpora/fi_FI/fi_FI.twitter.txt | 2772504 | 2.64 | 67439 | 520261 | 7.71 | Finnish |
| HC_Corpora/ru_RU/ru_RU.blogs.txt | 28414343 | 27.09 | 74012 | 2673898 | 36.13 | Russian |
| HC_Corpora/ru_RU/ru_RU.news.txt | 24043940 | 22.93 | 55767 | 2548238 | 45.69 | Russian |
| HC_Corpora/ru_RU/ru_RU.twitter.txt | 43876779 | 41.85 | 179786 | 3209866 | 17.85 | Russian |

English Language Files Only

For the Capstone project, only the **English language** files will be used.

| File | Size (bytes) | Size (MB) | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| HC_Corpora/en_US/en_US.blogs.txt | 210160014 | 200.43 | 899288 | 37334139 | 41.51 | English |
| HC_Corpora/en_US/en_US.news.txt | 205811889 | 196.28 | 772594 | 34494541 | 44.65 | English |
| HC_Corpora/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30373583 | 12.87 | English |

The English dataset totals approximately **556 MB** of data. Due to potential performance issues, a **1% subset** of the original dataset will be used for processing, as suggested in Task 1 of the course.

2.3) Dataset Cleaning

The cleaning process begins with loading and sampling. Text from the blogs, news, and Twitter files is loaded, skipping embedded nulls (`skipNul = TRUE`) and using the appropriate connection options (e.g., `open = 'rb'` for `en_US.news.txt`) to avoid warnings. A **1% sample** of the lines is then drawn, as sketched below.
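A minimal sketch of this step, assuming the English files sit under `HC_Corpora/en_US/` (the seed and object names are illustrative):

```r
set.seed(1234)                        # illustrative seed, for a reproducible sample
sampleRate <- 0.01                    # 1% of the lines, as suggested in Task 1

blogs   <- readLines("HC_Corpora/en_US/en_US.blogs.txt",   skipNul = TRUE, warn = FALSE)
twitter <- readLines("HC_Corpora/en_US/en_US.twitter.txt", skipNul = TRUE, warn = FALSE)
con     <- file("HC_Corpora/en_US/en_US.news.txt", open = "rb")   # binary mode avoids warnings
news    <- readLines(con, skipNul = TRUE, warn = FALSE)
close(con)

sampledText <- c(sample(blogs,   round(length(blogs)   * sampleRate)),
                 sample(news,    round(length(news)    * sampleRate)),
                 sample(twitter, round(length(twitter) * sampleRate)))
```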

The initial sampled text object is significantly smaller than the original set, approximately **4.6 MB** in size.

A corpus is created from the sampled text using the **tm** package to leverage its text mining functionalities. The original corpus has approximately **1,281,993 words**.
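A one-line sketch of the corpus creation with **tm**, assuming `sampledText` from the sampling step above:

```r
library(tm)
corpus <- VCorpus(VectorSource(sampledText))   # one document per sampled line
```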

Transformation Steps:

The cleaning is performed via the following sequential transformations (a code sketch follows the list):

  1. Converting to lowercase: Converts all text to lowercase using `tolower()`.
  2. Removing punctuation, numbers, and special characters: Uses `removePunctuation()`, `removeNumbers()`, and a custom `toSpace()` function to handle non-standard punctuation, URIs, Twitter users/hashtags, and email addresses.
  3. Stripping whitespace: Collapses runs of whitespace into a single blank using `stripWhitespace()`.
  4. Removing stop words: Removes English stop words using `removeWords` with `stopwords("english")`.
  5. Profanity filtering: Removes swear words using a list sourced from "A list of 723 bad words to blacklist..." (URL).
  6. Stemming the text: Reduces words to their stem or root (e.g., "working" to "work") using `stemDocument()` from the **SnowballC** package.
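A sketch of how these transformations could be chained with `tm_map()`, assuming the `corpus` object created above (the regular expressions and the `profanityWords` vector, read from the downloaded blacklist, are illustrative assumptions):

```r
library(tm)
library(SnowballC)

# Custom transformer: replace anything matching 'pattern' with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://\\S+")    # URIs
corpus <- tm_map(corpus, toSpace, "[@#]\\w+")             # Twitter users and hashtags
corpus <- tm_map(corpus, toSpace, "\\S+@\\S+")            # email addresses
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, profanityWords)     # profanityWords: vector from the blacklist
corpus <- tm_map(corpus, stemDocument)
```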

After all transformations, the corpus has **739,471 words**, which is **542,522** fewer than the original word count before cleaning.

3) Analysis

3.1) Exploratory Analysis

The exploratory analysis involves creating Document Term Matrices (DTMs) for 1-Grams (words), 2-Grams, and 3-Grams. This is done using the `DocumentTermMatrix()` function from the **tm** package and N-Gram tokenizers from the **RWeka** package. Frequencies are calculated, sorted, and the top 10 are plotted.
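A minimal sketch of the tokenization and frequency counts, assuming the cleaned `corpus` from the previous section (object names are illustrative; the same pattern is repeated for the 3-gram matrix):

```r
library(tm)
library(RWeka)

bigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

dtm1 <- DocumentTermMatrix(corpus)
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = bigramTokenizer))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = trigramTokenizer))

# Frequencies, sorted in decreasing order; the top 10 are then plotted
freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)
head(freq1, 10)
```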

Top 10 Frequencies

For 1-Grams (Words)

| Word | Frequency |
|---|---|
| can | 3994 |
| said | 3481 |
| one | 3203 |
| will | 2973 |
| just | 2907 |
| like | 2819 |
| get | 2603 |
| time | 2566 |
| new | 2480 |
| know | 2392 |

For 2-Grams

| 2-Gram | Frequency |
|---|---|
| look forward | 239 |
| can find | 228 |
| ralli point | 221 |
| feel like | 216 |
| new york | 196 |
| right now | 188 |
| dont know | 188 |
| last year | 173 |
| can get | 167 |
| stori time | 158 |

For 3-Grams

| 3-Gram | Frequency |
|---|---|
| new york citi | 44 |
| let know know | 44 |
| happi mother day | 40 |
| year old boy | 37 |
| happi new year | 36 |
| cinco de mayo | 33 |
| unit state america | 33 |
| let know soon | 32 |
| five year old | 31 |
| dont know dont | 30 |

Coverage Analysis (Task 2 Questions)

The following addresses the questions posed in Task 2:

The total number of word instances is 739,471, and the total number of unique words is 52,143.

Using the descending word frequencies, the cumulative frequency can be used to determine how many unique words are needed to cover a given share (e.g., 50% or 90%) of all word instances; a sketch of this computation follows.
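A minimal sketch of the coverage computation, assuming `freq1` is the sorted 1-gram frequency vector built in the exploratory analysis:

```r
coverage <- function(freq, fraction) {
  cumShare <- cumsum(freq) / sum(freq)   # cumulative share of all word instances
  which(cumShare >= fraction)[1]         # number of unique words needed
}
coverage(freq1, 0.5)   # unique words covering 50% of all instances
coverage(freq1, 0.9)   # unique words covering 90% of all instances
```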

Given that the English language primarily uses ASCII characters, non-English words can be detected by checking for characters with accents, umlauts, or non-English letters (such as Slavic letters), which are typically found in encodings like **ISO/IEC 8859-1** (Latin-1).

A function `detectNonEnglishWords` can be used to convert characters using `iconv()` and flag any words that fail the conversion, allowing them to be detected and removed.
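The report does not reproduce the bodies of these helpers; a minimal sketch, assuming a Latin-1 to ASCII conversion with `iconv()` (the implementations below are illustrative, not the author's exact code):

```r
# Illustrative implementations of the helpers described above
detectNonEnglishWords <- function(text) {
  words <- unlist(strsplit(text, " "))
  # iconv() returns NA when a word cannot be converted to plain ASCII
  !is.na(iconv(words, "latin1", "ASCII"))
}

removeNonEnglishWords <- function(text) {
  words <- unlist(strsplit(text, " "))
  paste(words[!is.na(iconv(words, "latin1", "ASCII"))], collapse = " ")
}
```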

```r
# Example: detecting non-English words
originalText <- 'The Fußball is the King of Sports'   # 'Fußball' is German
detectNonEnglishWords(originalText)
# 'Fußball' is flagged as FALSE (non-English); the remaining words return TRUE

# Example: removing non-English words
removeNonEnglishWords(originalText)
# Output: "The is the King of Sports"
```

3.2) Further Steps

The next steps for the project will involve:

  1. **Building a Predictive Algorithm**: Developing a predictive model using N-Gram lookups to compute probabilities for the next word's occurrence, backing off to a lower N-gram level (e.g., from 3-gram to 2-gram) as needed (back-off model; a brief sketch follows this list).
  2. **Developing a Web Application**: Creating a web application using **Shiny** that integrates the predictive algorithm to suggest the next word to the user in real-time.
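As an illustration only, a back-off lookup over the frequency tables built earlier could look roughly like this (assuming `freq1`, `freq2`, and `freq3` are the sorted 1-, 2-, and 3-gram frequency vectors built analogously to `freq1` above; this is a sketch of the idea, not the final algorithm):

```r
predictNextWord <- function(phrase, freq1, freq2, freq3) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # Try the 3-gram table first (frequencies are sorted, so the first hit is the best one)
  hits <- grep(paste0("^", paste(words, collapse = " "), " "), names(freq3), value = TRUE)
  if (length(hits) == 0 && length(words) > 1)        # back off to 2-grams
    hits <- grep(paste0("^", words[2], " "), names(freq2), value = TRUE)
  if (length(hits) == 0)                             # back off to the most frequent word
    return(names(freq1)[1])
  tail(strsplit(hits[1], " ")[[1]], 1)               # last word of the best-matching N-gram
}

predictNextWord("new york", freq1, freq2, freq3)     # e.g. "citi" with the stemmed counts above
```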