For each of the three files provided for this Data Science Capstone, the following includes graphs indicating the number of lines, maximum length, number of words, and number of characters.
2025-04-17
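As a rough illustration of how the four per-file figures mentioned above could be computed, here is a minimal Python sketch. The function name `summarize_lines` and the sample lines are hypothetical, and "maximum length" is assumed to mean the length of the longest line; the report does not say which tool produced its numbers.

```python
def summarize_lines(lines):
    """Count lines, longest line length, words and characters.

    `lines` is any iterable of strings, for example an open text file.
    """
    n_lines = max_length = n_words = n_chars = 0
    for line in lines:
        text = line.rstrip("\n")              # ignore the trailing newline
        n_lines += 1
        max_length = max(max_length, len(text))
        n_words += len(text.split())          # words are whitespace-separated
        n_chars += len(text)
    return {
        "lines": n_lines,
        "max_length": max_length,
        "words": n_words,
        "characters": n_chars,
    }


if __name__ == "__main__":
    # Hypothetical sample standing in for one of the three files.
    sample = ["hello world\n", "foo\n"]
    print(summarize_lines(sample))
```

Passing an open file handle instead of a list keeps memory use low, since the file is read one line at a time.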
The first step in any data analysis is data wrangling, which prepares the data for the subsequent steps.
In this case, since we are dealing with text files, the following tasks were performed to extract the words that carry the context of the texts.
The resulting word sets will later be used to build our word prediction model:
All text was converted to lowercase.
Numbers and punctuation marks were removed.
Some of the most common abbreviations were expanded.
Some symbols, such as $, were found in the text and translated into words.
Profanity filtering was performed.
Stop words (common words such as pronouns, articles, and prepositions that add little context to the text) were removed.
Lemmatization was performed, which maps each word to its base dictionary form (lemma).
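The cleaning steps listed above could be chained roughly as in the following Python sketch. Every lookup table here (`ABBREVIATIONS`, `SYMBOLS`, `PROFANITY`, `STOP_WORDS`, `LEMMAS`) is a tiny hypothetical stand-in, and the dictionary-based lemmatizer is a simplification; the report does not state which word lists or libraries were used. The order of the steps is also an assumption: abbreviations and symbols are handled before punctuation is stripped so that apostrophes and `$` are still present when they are needed.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
ABBREVIATIONS = {"don't": "do not", "i'm": "i am", "can't": "cannot"}
SYMBOLS = {"$": " dollar ", "%": " percent ", "&": " and "}
PROFANITY = {"darn"}
STOP_WORDS = {"i", "the", "a", "an", "of", "to", "in", "and",
              "do", "not", "am", "it", "is"}
LEMMAS = {"cars": "car", "running": "run", "costs": "cost"}


def clean_text(text):
    """Return the list of cleaned tokens for one piece of text."""
    text = text.lower().replace("\u2019", "'")   # normalise curly apostrophes
    # Expand abbreviations while the apostrophes are still in the text.
    for short, full in ABBREVIATIONS.items():
        pattern = r"(?<![\w'])" + re.escape(short) + r"(?![\w'])"
        text = re.sub(pattern, full, text)
    # Translate symbols into words before punctuation is removed.
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    # Remove numbers and punctuation: keep only letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in PROFANITY]    # profanity filter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [LEMMAS.get(t, t) for t in tokens]             # toy lemmatization


if __name__ == "__main__":
    print(clean_text("I don't think the 2 cars cost $5!"))
```

In a real pipeline the toy dictionaries would be replaced by full stop word and profanity lists and by a proper lemmatizer.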
After performing these actions, a set composed of all the resulting words was obtained.
From this set, words with a low frequency of occurrence were removed; these are considered rare words that are unlikely to appear often in the texts.
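A minimal Python sketch of this rare-word filter follows. The function name `drop_rare_words` and the `min_count` threshold are assumptions made for illustration; the report does not state the cutoff that was applied.

```python
from collections import Counter


def drop_rare_words(tokens, min_count=2):
    """Keep only tokens that occur at least `min_count` times overall."""
    counts = Counter(tokens)   # frequency of every distinct word
    return [t for t in tokens if counts[t] >= min_count]


if __name__ == "__main__":
    words = ["car", "cost", "car", "zyzzyva"]
    print(drop_rare_words(words))   # tokens seen fewer than twice are dropped
```

The original token order is preserved, which matters for the word groupings built in the next step.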
Next, the words were analyzed, and the two-word and three-word sequences (bigrams and trigrams) that appear in the text were constructed. The goal is to find word groupings that let us infer, when some of the words in a grouping are present in our text, which word most frequently completes it. Below are some graphs with the most frequent words, as well as the most frequent two-word and three-word groupings.
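The idea of using two-word and three-word groupings to guess the next word can be sketched in Python as follows. This is only one simple way to do it, under the assumption that the prediction is the most frequent continuation of the preceding words; the helper names `ngrams`, `build_next_word_table` and `predict_next` are hypothetical and do not come from the report.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_next_word_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for gram in ngrams(tokens, n):
        table[gram[:-1]][gram[-1]] += 1
    return table


def predict_next(table, context):
    """Return the most frequent follower of `context`, or None if unseen."""
    followers = table.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    tokens = "new york city new york times new york city".split()
    print(Counter(ngrams(tokens, 2)).most_common(3))   # most frequent bigrams
    trigram_table = build_next_word_table(tokens, 3)
    print(predict_next(trigram_table, ["new", "york"]))
```

A fuller model would typically fall back from the three-word table to the two-word table when a context has not been seen.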
Thank you!