Milestone Report - Data Science Capstone - Data Science Specialization

2025-04-17

Basic Statistics on the Working Files

For each of the three files provided for this Data Science Capstone, the graphs below show the number of lines, the maximum line length, the number of words, and the number of characters.

Count of Lines

Max Length of Line

Number of Words

Number of Characters
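
As a rough illustration of how these four statistics could be computed, here is a minimal Python sketch. It is not the code used for this report; the names `FileStats`, `summarize_lines`, and `summarize_file` are hypothetical, and words are assumed to be whitespace-delimited tokens.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FileStats:
    lines: int = 0
    max_line_length: int = 0
    words: int = 0
    characters: int = 0


def summarize_lines(lines: Iterable[str]) -> FileStats:
    """Accumulate the four summary statistics over an iterable of text lines."""
    stats = FileStats()
    for raw in lines:
        line = raw.rstrip("\n")  # do not count the line terminator
        stats.lines += 1
        stats.max_line_length = max(stats.max_line_length, len(line))
        stats.words += len(line.split())  # assumption: words are whitespace-delimited
        stats.characters += len(line)
    return stats


def summarize_file(path: str) -> FileStats:
    """Stream a file line by line so a large corpus need not fit in memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize_lines(handle)


# Small in-memory sample standing in for one of the corpus files.
sample = ["Hello world\n", "A much longer second line\n", "ok\n"]
stats = summarize_lines(sample)
print(stats)
# FileStats(lines=3, max_line_length=25, words=8, characters=38)
```

Calling `summarize_file` once per corpus file would give one row of values per file, which is the kind of data the four graphs summarize.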

Some Strategies for Preparing the Datasets

As in any data analysis project, the first step is data wrangling, which prepares the raw data for the subsequent steps.

In this case, since we are dealing with files of text, the following tasks were performed to extract the words that carry the context of the texts.

Some Strategies for Preparing the Datasets cont.

The following cleaning steps were applied to the sets that will later be used to build our predictive word model:

All text was converted to lowercase.

Numbers and punctuation marks were removed.

Some of the most common abbreviations were expanded.

Symbols found in the text, such as $, were translated into words.

Profanity filtering was performed.
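
The cleaning steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the lookup tables `ABBREVIATIONS`, `SYMBOLS`, and `PROFANITY` are hypothetical placeholders (the report does not list the actual ones), and the function `clean_line` is not the author's implementation. The sketch expands abbreviations and symbols before stripping punctuation, because the apostrophes and symbols must still be present when they are looked up.

```python
import re
from typing import Dict, List, Set

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS: Dict[str, str] = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
}
SYMBOLS: Dict[str, str] = {"$": " dollar ", "%": " percent ", "&": " and "}
PROFANITY: Set[str] = {"darn", "heck"}  # mild placeholders for a real list


def clean_line(text: str) -> List[str]:
    """Apply the cleaning steps to one line and return its tokens."""
    text = text.lower()                      # 1. lowercase
    text = text.replace("\u2019", "'")       # normalize curly apostrophes
    for short, full in ABBREVIATIONS.items():
        # 2. expand abbreviations on word boundaries only
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)    # 3. translate symbols into words
    text = re.sub(r"[^a-z\s]", " ", text)    # 4. drop numbers and punctuation
    tokens = text.split()
    return [t for t in tokens if t not in PROFANITY]  # 5. profanity filter


cleaned = clean_line("I'm paying $20, darn it!")
print(cleaned)
# ['i', 'am', 'paying', 'dollar', 'it']
```

Note that the character class `[^a-z\s]` also discards accented letters; a corpus with non-English text would need a wider pattern.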

Some Strategies for Preparing the Datasets cont.

Stop words, that is, short words that add little context to the text (pronouns, articles, prepositions, etc.), were removed.

Lemmatization was performed, which maps each inflected word to its base dictionary form (its lemma).
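
A toy Python sketch of these two steps is shown below. The tables `STOP_WORDS` and `LEMMAS` are tiny hypothetical examples; a real pipeline would rely on a complete stop-word list and a proper lemmatizer rather than a hand-written dictionary.

```python
from typing import Dict, List, Set

# Tiny hypothetical tables, for illustration only.
STOP_WORDS: Set[str] = {"the", "a", "an", "of", "to", "in", "it", "i", "and"}
LEMMAS: Dict[str, str] = {
    "dogs": "dog",
    "running": "run",
    "ran": "run",
    "were": "be",
    "was": "be",
}


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens: List[str]) -> List[str]:
    """Replace each token with its lemma; unknown tokens pass through unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]


tokens = ["the", "dogs", "were", "running", "in", "the", "park"]
content_words = lemmatize(remove_stop_words(tokens))
print(content_words)
# ['dog', 'be', 'run', 'park']
```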

Some Strategies for Preparing the Datasets cont.

After performing these actions, a set composed of all the resulting words was obtained.

From this set, low-frequency words were removed; these are considered rare words that are unlikely to appear often in the texts.
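
The rare-word removal could look like the following Python sketch. The cutoff `min_count` is a hypothetical parameter, since the report does not state the threshold that was used, and `prune_rare_words` is an illustrative name.

```python
from collections import Counter
from typing import List, Tuple


def prune_rare_words(
    documents: List[List[str]], min_count: int = 2
) -> Tuple[List[List[str]], Counter]:
    """Remove tokens whose corpus-wide frequency is below min_count."""
    counts = Counter(token for doc in documents for token in doc)
    keep = {word for word, count in counts.items() if count >= min_count}
    pruned = [[t for t in doc if t in keep] for doc in documents]
    return pruned, counts


docs = [
    ["good", "morning", "sunshine"],
    ["good", "night"],
    ["morning", "coffee", "good"],
]
pruned, counts = prune_rare_words(docs, min_count=2)
print(pruned)
# [['good', 'morning'], ['good'], ['morning', 'good']]
```

Raising `min_count` shrinks the vocabulary and the later n-gram tables, at the cost of never being able to predict the discarded words.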

Some basic plots of the resulting dataset

Next, the words were studied, and the two-word and three-word sequences (bigrams and trigrams) that appear in the text were constructed. The goal is to find word groupings that let us deduce which word most often completes a grouping when the preceding words of that grouping are present in our text. Below are graphs of the most frequent words, as well as the most frequent two-word and three-word groupings.

Plot word frequencies

Plot bigram frequencies

Plot trigram frequencies
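
The word groupings described above, and the way they can suggest a next word, can be sketched in Python as follows. This is a minimal frequency-count sketch, not the final predictive model; the function names `ngrams`, `build_model`, and `predict_next` and the sample sentences are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Optional, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_model(
    documents: List[List[str]], n: int
) -> DefaultDict[Tuple[str, ...], Counter]:
    """Map each (n-1)-word context to a counter of the words that follow it."""
    model: DefaultDict[Tuple[str, ...], Counter] = defaultdict(Counter)
    for doc in documents:
        for gram in ngrams(doc, n):
            model[gram[:-1]][gram[-1]] += 1
    return model


def predict_next(model, context: Sequence[str]) -> Optional[str]:
    """Return the most frequent follower of the context, or None if unseen."""
    followers = model.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


docs = [
    ["thank", "you", "very", "much"],
    ["thank", "you", "so", "much"],
    ["thank", "you", "very", "kindly"],
]

# Frequency tables like the ones behind the bigram plot.
bigram_counts = Counter(g for doc in docs for g in ngrams(doc, 2))
print(bigram_counts.most_common(1))
# [(('thank', 'you'), 3)]

trigram_model = build_model(docs, 3)
print(predict_next(trigram_model, ["thank", "you"]))
# very
```

When a two-word context was never seen, `predict_next` returns `None`; one common design choice is then to fall back to a bigram model built with `build_model(docs, 2)` and the last word only.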

To Be Continued…