The goal of this task was to familiarize myself with the content of the three datasets provided. Each file was loaded into R, and I first ran some summary statistics to get a general understanding of their size and shape:
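A minimal sketch of how this loading and sizing step might look in base R is shown below. The file names and paths are assumptions (the report does not list them), and the word count is a rough whitespace-based approximation.

```r
# Sketch of the loading and summary step; the file names here are
# assumptions and should be replaced with the actual dataset paths.
files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

summaries <- lapply(files, function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    size_mb    = round(file.info(path)$size / 1024^2, 1),  # file size on disk
    line_count = length(lines),                            # number of lines
    word_count = sum(lengths(strsplit(lines, "\\s+")))     # rough word count
  )
})

do.call(rbind, summaries)
```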
I then cleaned the data and analyzed aggregate word counts, 2-gram counts and 3-gram counts in each dataset.
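The report does not name the packages used for this step, so the sketch below shows one possible approach using tidytext and dplyr. The object `lines` stands in for a character vector of cleaned text from one dataset; the names are illustrative rather than the exact ones used in the analysis.

```r
library(dplyr)
library(tidytext)

text_df <- tibble(text = lines)  # `lines` = character vector of one dataset's text

count_ngrams <- function(df, n) {
  df %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%  # lowercases and strips punctuation
    count(ngram, sort = TRUE)                                # frequency table, most common first
}

word_counts    <- count_ngrams(text_df, 1)
bigram_counts  <- count_ngrams(text_df, 2)
trigram_counts <- count_ngrams(text_df, 3)
```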
When reviewing the outcomes of the EDA process, several consistent themes emerged:
The most common words are, on their own, not particularly indicative of the semantic content of the larger body of text they belong to. Words such as “a”, “the” and “with” offer relatively little value. In other words, the frequency of a word's occurrence appears to be strongly negatively correlated with its “value”.
The top 20 2-grams and 3-grams represented a relatively small share of all n-gram occurrences, so any model that is developed will need to rely on a much broader swathe of the data to be useful (a quick way to check this coverage is sketched below).
The larger the “n” in an n-gram, the more valuable the insights might be for a predictive text model; however, this would likely come at the cost of greater computational intensity, so there will be trade-offs.
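To make the first two themes concrete, here is a minimal sketch of how they could be quantified, assuming the frequency tables produced in the earlier step (`word_counts` and `bigram_counts`, with columns `ngram` and `n`); the top-20 cutoff simply mirrors the tables reviewed during the EDA.

```r
library(dplyr)
library(tidytext)

# Share of all word occurrences accounted for by stop words such as
# "a", "the" and "with" (high frequency, low predictive value).
stopword_share <- word_counts %>%
  mutate(is_stop = ngram %in% stop_words$word) %>%
  summarise(share = sum(n[is_stop]) / sum(n))

# Share of all 2-gram occurrences covered by the 20 most frequent 2-grams
# (the counts are already sorted in descending order).
top20_coverage <- sum(head(bigram_counts$n, 20)) / sum(bigram_counts$n)
```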
If the EDA is indicative of calculation times, the computational load of building the model will be very high. Because of this, I will need to develop one or more ways to process the information efficiently, possibly in chunks that are then re-aggregated (one possible approach is sketched below).
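As an illustration only, one way the chunking could work is to count n-grams on manageable slices of the text and then sum the partial counts. The chunk size and the `count_ngrams()` helper from the earlier sketch are assumptions, not the method actually settled on.

```r
library(dplyr)

chunk_size <- 50000  # assumed chunk size; tune to available memory
chunks <- split(lines, ceiling(seq_along(lines) / chunk_size))

# Count 2-grams within each chunk separately.
partial_counts <- lapply(chunks, function(chunk) {
  count_ngrams(tibble(text = chunk), 2)
})

# Re-aggregate: sum the counts of identical 2-grams across chunks.
bigram_counts <- bind_rows(partial_counts) %>%
  group_by(ngram) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  arrange(desc(n))
```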
I will need to strike a balance between accuracy and speed.
I may need to run some of the calculations externally on Google Colab, where I can scale up processing capacity while experimenting with different methodologies to complete this task.