This book is intended for those wishing to pursue a career in analytics. At the time this book was written, Data Science was the latest and greatest of job titles. However, the basics of analytics have been in use for a very long time.
The day that mankind started making decisions based on past observations is the day analytics was born. It may very well have been when the wise man in the cave saw clouds and heard thunder and then said: “it will rain”. Perhaps instead of calling him the wise man or chief, they should have called him vice president of data science. After all, it wasn’t a lucky guess. The wise man had observed that, in the past, when there were clouds and thunder, rain followed shortly after.
The book starts by defining the sequential steps to follow when performing an analysis. Each subsequent chapter then provides the basic tools for performing the technical steps of an analysis. The chapters do not cover every possible tool and technique. Nonetheless, once the basics outlined in this book have been mastered, any other test or algorithm can be learned.
The software required is R, which is open source. Additionally, RStudio is highly recommended. It is assumed that the reader has basic knowledge of R and of data manipulation.
Enjoy the journey to becoming the wise man in the cave.
For most analytical endeavors you will follow five basic steps:
Next, these steps are explained. Bypassing or omitting any of these steps will likely result in either more work or erroneous results.
Quoting Alice in Wonderland:
“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to.”
“I don’t much care where –”
“Then it doesn’t matter which way you go.”
― Lewis Carroll, Alice in Wonderland.
The research question is the most important step for any analytical project. An ill-defined research question will result in an analysis that is of no use to the customer. A well-defined research question will serve as the blueprint for the entire study.
Think of a scenario in which you know where you need to go. From that moment you can plan how to get there and evaluate the alternative routes, but one thing is certain: when you arrive, you will have succeeded. Likewise, once your research question is defined, answering it means your analysis has succeeded.
Defining the research question will not be easy. This will require the researcher, meaning you, to have direct access to the person or people requesting the study. At times, the request for an analysis will trickle down to the researcher through layers of managers. This situation all but guarantees that the research question will be ill-defined and the analysis an exercise in futility.
Say you have managed to get direct access to the person requesting the study. The next challenge is to get that person to think through exactly what they’d like to see in terms of results. One strategy is to understand how the results will be used. In addition, it is very helpful to sketch out the final report with mock graphs and tables, leaving blanks to be filled in later. In short, you will be building the outline for step five: Disseminate Results.
Chasing the data may, at times, be a more appropriate description. It will be your responsibility to match the available data to the research question. At times, it may be necessary to go back to the champion of the study to inform them that certain variables are not available.
Once all data is collected, from as many sources as needed, it must all be joined into one table. This process will entail, at times, asymmetric joins, such as left joins. Careful attention must be paid to avoid duplicating observations unintentionally. Taking notes as the joins are performed is a good practice: specifically, note how many observations are in each table prior to the join, and then the count in the resulting table.
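The note-taking habit described above can be sketched in base R. This is only an illustration: the tables customers and orders, and their columns, are made up for the example. It uses merge() with all.x = TRUE as the left join, and shows how a key that repeats in the right-hand table increases the row count of the result.

```r
# Hypothetical tables for illustration only.
customers <- data.frame(
  customer_id = c(1, 2, 3),
  region      = c("East", "West", "East")
)
orders <- data.frame(
  customer_id = c(1, 1, 2),   # customer 1 appears twice
  amount      = c(50, 75, 20)
)

# Note the row counts before the join.
n_left  <- nrow(customers)
n_right <- nrow(orders)

# Left join: keep every row of customers, matched or not.
joined   <- merge(customers, orders, by = "customer_id", all.x = TRUE)
n_joined <- nrow(joined)

# Keys that repeat in the right-hand table are what multiply rows.
dup_keys <- unique(orders$customer_id[duplicated(orders$customer_id)])

cat("customers:", n_left, "orders:", n_right, "joined:", n_joined, "\n")
if (n_joined > n_left) {
  cat("Row count grew; repeated keys in orders:", dup_keys, "\n")
}
```

If the joined table is expected to keep one row per customer, a result with more rows than customers is the signal to aggregate or de-duplicate the right-hand table before joining.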
Last, knowing which algorithm(s) will be used will dictate how the data must be organized for modeling. Unfortunately, there is no single set of steps, as each algorithm, in a particular software package, may require a different setup. It is imperative to ensure that the final table, sometimes called the modeling or data mining table, is correct. A modeling table that is built incorrectly will give incorrect results.
Variables can be classified as follows, but not limited to:
Minimum: The minimum value within a variable.
Maximum: The maximum value within a variable.
Mean: Also commonly known as the average. It is the sum of all the values divided by the number of values.
Mode: The value that occurs most often.
Median: The middle value when the values are sorted. When there is an even number of observations, it is the mean of the two middle values.
Frequency: The number of times each distinct value occurs.
Skewness: A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, zero, or undefined.
Kurtosis: A measure of the “tailedness” of the probability distribution of a real-valued random variable.
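As a rough illustration, the statistics above can be computed in base R on a small made-up vector x. Base R has no built-in functions for the mode, skewness, or kurtosis, so this sketch derives the mode from a frequency table and uses the simple moment-based formulas for skewness and kurtosis. Other definitions with sample-size corrections exist, and add-on packages may report slightly different values.

```r
# Hypothetical sample for illustration only.
x <- c(2, 4, 4, 5, 7, 9, 18)

x_min    <- min(x)
x_max    <- max(x)
x_mean   <- mean(x)     # sum of the values divided by their count
x_median <- median(x)   # middle value of the sorted data

# Frequency: count of each distinct value.
freq <- table(x)

# Mode: the value(s) with the highest frequency.
x_mode <- as.numeric(names(freq)[freq == max(freq)])

# Central moments (dividing by n, not n - 1).
m2 <- mean((x - x_mean)^2)
m3 <- mean((x - x_mean)^3)
m4 <- mean((x - x_mean)^4)

# Moment-based skewness and kurtosis (an assumption of this sketch).
x_skewness <- m3 / m2^1.5
x_kurtosis <- m4 / m2^2

# With an even number of observations, the median is the
# mean of the two middle values.
even_median <- median(c(1, 2, 3, 4))

print(c(min = x_min, max = x_max, mean = x_mean, median = x_median))
print(freq)
print(c(mode = x_mode, skewness = x_skewness, kurtosis = x_kurtosis))
```

Here the single large value, 18, pulls the mean above the median and stretches the right tail, which is why the skewness comes out positive.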