Date: December 4, 2025
This document corresponds to the Milestone Report, an assignment for week 2 of the Data Science Capstone course from Coursera. This course is the tenth of the ten courses comprising the Data Science Specialization from Johns Hopkins University. It is designed to allow students to create a usable, public data product that showcases their skills to potential employers, drawing on real-world problems.
The data originates from a corpus known as HC Corpora (archived at https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html). Corpora are collected from publicly available sources via a web crawler.
The dataset can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Once uncompressed, it consists of four folders corresponding to four different languages (German, English, Finnish, and Russian), with each folder containing three files from three text sources (blogs, news, and Twitter).
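As a sketch, the download and extraction can be scripted in R (the local file name is an assumption):

```r
# Download and unpack the Coursera-SwiftKey dataset (run once).
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}
```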
| # | File |
|---|---|
| 1 | HC_Corpora/de_DE/de_DE.blogs.txt |
| 2 | HC_Corpora/de_DE/de_DE.news.txt |
| 3 | HC_Corpora/de_DE/de_DE.twitter.txt |
| 4 | HC_Corpora/en_US/en_US.blogs.txt |
| 5 | HC_Corpora/en_US/en_US.news.txt |
| 6 | HC_Corpora/en_US/en_US.twitter.txt |
| 7 | HC_Corpora/fi_FI/fi_FI.blogs.txt |
| 8 | HC_Corpora/fi_FI/fi_FI.news.txt |
| 9 | HC_Corpora/fi_FI/fi_FI.twitter.txt |
| 10 | HC_Corpora/ru_RU/ru_RU.blogs.txt |
| 11 | HC_Corpora/ru_RU/ru_RU.news.txt |
| 12 | HC_Corpora/ru_RU/ru_RU.twitter.txt |
The initial analysis summarizes each file: size, line count, word count, and words-per-line (W/L) ratio, obtained by running the `wc` command.
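A sketch of how these statistics can be reproduced for a single file from R, assuming a Unix-like system where `wc` is available:

```r
# Reproduce size, line count, word count, and W/L ratio for one file.
f  <- "HC_Corpora/en_US/en_US.blogs.txt"
file.size(f) / 2^20                              # size in MB
wc <- system2("wc", c("-lw", f), stdout = TRUE)  # "lines words file"
counts <- as.numeric(strsplit(trimws(wc), "\\s+")[[1]][1:2])
counts                                           # line and word counts
counts[2] / counts[1]                            # words-per-line (W/L) ratio
```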
| File | Size (bytes) | Size (MB) | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| HC_Corpora/de_DE/de_DE.blogs.txt | 283838271 | 270.69 | 868434 | 37048742 | 42.66 | German |
| HC_Corpora/de_DE/de_DE.news.txt | 250346360 | 238.74 | 808455 | 33814041 | 41.83 | German |
| HC_Corpora/de_DE/de_DE.twitter.txt | 196531587 | 187.43 | 5352052 | 47348421 | 8.85 | German |
| HC_Corpora/en_US/en_US.blogs.txt | 210160014 | 200.43 | 899288 | 37334139 | 41.51 | English |
| HC_Corpora/en_US/en_US.news.txt | 205811889 | 196.28 | 772594 | 34494541 | 44.65 | English |
| HC_Corpora/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30373583 | 12.87 | English |
| HC_Corpora/fi_FI/fi_FI.blogs.txt | 2174622 | 2.07 | 43632 | 415309 | 9.52 | Finnish |
| HC_Corpora/fi_FI/fi_FI.news.txt | 2364024 | 2.25 | 44926 | 451978 | 10.06 | Finnish |
| HC_Corpora/fi_FI/fi_FI.twitter.txt | 2772504 | 2.64 | 67439 | 520261 | 7.71 | Finnish |
| HC_Corpora/ru_RU/ru_RU.blogs.txt | 28414343 | 27.09 | 74012 | 2673898 | 36.13 | Russian |
| HC_Corpora/ru_RU/ru_RU.news.txt | 24043940 | 22.93 | 55767 | 2548238 | 45.69 | Russian |
| HC_Corpora/ru_RU/ru_RU.twitter.txt | 43876779 | 41.85 | 179786 | 3209866 | 17.85 | Russian |
For the Capstone project, only the **English language** files will be used.
| File | Size (bytes) | Size (MB) | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| HC_Corpora/en_US/en_US.blogs.txt | 210160014 | 200.43 | 899288 | 37334139 | 41.51 | English |
| HC_Corpora/en_US/en_US.news.txt | 205811889 | 196.28 | 772594 | 34494541 | 44.65 | English |
| HC_Corpora/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30373583 | 12.87 | English |
The English dataset totals approximately **556 MB**. To avoid performance issues, a **1% subset** of the original dataset will be used for processing, as suggested in Task 1 of the course.
The cleaning process begins with loading and sampling. Text from the blogs, news, and Twitter files is loaded, handling embedded nulls (`skipNul = TRUE`) and opening `en_US.news.txt` in binary mode (`open = 'rb'`) to avoid read warnings. A **1% sample** of the lines is then drawn.
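A minimal sketch of this step; the seed and object names are illustrative:

```r
set.seed(2025)  # illustrative seed for reproducible sampling
blogs <- readLines("HC_Corpora/en_US/en_US.blogs.txt", skipNul = TRUE)
con   <- file("HC_Corpora/en_US/en_US.news.txt", open = "rb")
news  <- readLines(con, skipNul = TRUE)
close(con)
twitter <- readLines("HC_Corpora/en_US/en_US.twitter.txt", skipNul = TRUE)
# Draw a 1% sample of the lines from each source.
sampledText <- c(sample(blogs,   round(0.01 * length(blogs))),
                 sample(news,    round(0.01 * length(news))),
                 sample(twitter, round(0.01 * length(twitter))))
```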
The initial sampled text object is significantly smaller than the original set, approximately **4.6 MB** in size.
A corpus is created from the sampled text using the **tm** package to leverage its text-mining functionality. Before cleaning, the corpus contains **1,281,993 words**.
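A sketch of the corpus construction, assuming the `sampledText` vector from the sampling step:

```r
library(tm)
# Build a volatile corpus from the sampled character vector.
corpus <- VCorpus(VectorSource(sampledText))
```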
The cleaning is performed via the following sequential transformations:

1. Conversion of all text to lowercase.
2. Removal of punctuation.
3. Removal of numbers.
4. Removal of common English stopwords.
5. Stripping of extra whitespace.
6. Stemming, using the **SnowballC** package.

After all transformations, the corpus contains **739,471 words**, which is **542,522** fewer than the original word count before cleaning.
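In **tm**, the pipeline can be expressed roughly as follows (a minimal sketch):

```r
library(SnowballC)  # provides the Snowball stemmer used by stemDocument()
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
```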
The exploratory analysis involves creating Document Term Matrices (DTMs) for 1-Grams (words), 2-Grams, and 3-Grams. This is done using the `DocumentTermMatrix()` function from the **tm** package and N-Gram tokenizers from the **RWeka** package. Frequencies are calculated, sorted, and the top 10 are plotted.
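A sketch of the 2-Gram case; the 1- and 3-Gram matrices follow the same pattern, and `freq2` is an illustrative name:

```r
library(RWeka)
# Tokenizer that emits 2-Grams; change min/max for other values of N.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2  <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
head(freq2, 10)  # top 10 most frequent 2-Grams
```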
| Word | Frequency |
|---|---|
| can | 3994 |
| said | 3481 |
| one | 3203 |
| will | 2973 |
| just | 2907 |
| like | 2819 |
| get | 2603 |
| time | 2566 |
| new | 2480 |
| know | 2392 |
| 2-Gram | Frequency |
|---|---|
| look forward | 239 |
| can find | 228 |
| ralli point | 221 |
| feel like | 216 |
| new york | 196 |
| right now | 188 |
| dont know | 188 |
| last year | 173 |
| can get | 167 |
| stori time | 158 |
| 3-Gram | Frequency |
|---|---|
| new york citi | 44 |
| let know know | 44 |
| happi mother day | 40 |
| year old boy | 37 |
| happi new year | 36 |
| cinco de mayo | 33 |
| unit state america | 33 |
| let know soon | 32 |
| five year old | 31 |
| dont know dont | 30 |
The following addresses the questions posed in Task 2:
The total number of word instances is 739,471, and the total number of unique words is 52,143.
Using the descending word frequencies, the number of unique words needed to cover a given share of all word instances (e.g., 50% or 90%) can be found by accumulating the sorted frequencies until the target coverage is reached.
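A hypothetical helper for that computation, where `freq1` stands for the sorted 1-Gram frequencies from the exploratory step:

```r
# How many of the most frequent words cover a given share of all instances?
wordsForCoverage <- function(freq, coverage) {
  freq <- sort(freq, decreasing = TRUE)
  unname(which(cumsum(freq) / sum(freq) >= coverage)[1])
}
wordsForCoverage(freq1, 0.5)  # unique words covering 50% of all instances
wordsForCoverage(freq1, 0.9)  # unique words covering 90% of all instances
```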
Given that English text is written almost entirely in ASCII characters, non-English words can be detected by checking for characters outside that range, such as accents, umlauts, or letters from other alphabets, which appear in extended encodings like **ISO/IEC 8859-1** (Latin-1).
A function `detectNonEnglishWords` can convert characters using `iconv()` and tag any words that fail conversion as invalid; a companion `removeNonEnglishWords` drops them. A minimal sketch of both helpers, assuming whitespace tokenization:
```r
# Detecting non-English words: words that iconv() cannot convert to
# plain ASCII are flagged as non-English (FALSE).
detectNonEnglishWords <- function(text) {
  words <- unlist(strsplit(text, " "))
  !is.na(iconv(words, to = "ASCII"))
}
detectNonEnglishWords('The Fußball is the King of Sports')
# 'Fußball' (German) is tagged as invalid (FALSE)
# Removing non-English words keeps only the convertible ones:
removeNonEnglishWords <- function(text) {
  words <- unlist(strsplit(text, " "))
  paste(words[!is.na(iconv(words, to = "ASCII"))], collapse = " ")
}
removeNonEnglishWords('The Fußball is the King of Sports')
# [1] "The is the King of Sports"
```
The next steps for the project will involve building a word-prediction algorithm based on the observed N-Gram frequencies and deploying it as a Shiny application, the public data product required by the Capstone.