Buzz words are one of my least favorite things, but as buzz words go, I can appreciate the term “Data Lake.” It is one of the few buzz words that communicates a meaning very close to its intended definition. As you might imagine, with the advent of large scale data processing, there would be a need to name the location where lots of data resides, ergo, data lake. I personally prefer to call it a series of redundant commodity servers with Direct-Attached Storage, or hyperscale computing with large storage capacity, but I see the need for a shorter name. Also, the person that first coined the phrase could just have easily called it Data Ocean, but they were modest and settled for Data Lake. I like that.

Not to be outdone, and technology vendors being who they are, came up with their own “original” buzz word, “Data Reservoir”, implying a place where clean, processed data resides. Whatever you want to call it is fine with me as long as you understand that unless you follow a solid data management process, even your data reservoir could end up being a data landfill (also a popular buzz word). The point of this post is to demonstrate what a real data mangement process might look like by providing real examples, with real code, and, most importantly, to emphasize the importance of having a good process in place. Once implemented, you will have good data, from which solid analysis can be performed, and from which good decisions can be made.