Introduction

This book is intended for those wishing to pursue a career in analytics. At the time this book was written, Data Science was the latest and greatest of job titles. However, the basics of analytics have been in use for a very long time.

The day that mankind started making decisions based on past observations is the day analytics was born. It may very well have been when the wise man in the cave saw clouds and heard thunder and then said: “it will rain”. Perhaps instead of calling him the wise man or chief, they should have called him vice president of data science. After all, it wasn’t a lucky guess. The wise man had observed that, in the past, when there were clouds and thunder, rain followed shortly after.

The book starts by defining the sequential steps to follow when performing an analysis. Each subsequent chapter then gives the basic tools needed to perform the technical steps of an analysis. The chapters do not cover every possible tool and technique. Nonetheless, once the basics outlined in this book have been mastered, any other test or algorithm can be learned.

The software required is R, which is open source. Additionally, RStudio is highly recommended. It is assumed that the reader has basic knowledge of R and of data manipulation.

Enjoy the journey to becoming the wise man in the cave.

Five Steps to Analytics

For most analytical endeavors you will follow five basic steps, in this order:

Define the Research Question
Get the Data
Explore and Clean the Data
Model
Disseminate the Results

Next, these steps are explained. Bypassing or omitting any of these steps will likely result in either more work or erroneous results.

Define the research question

Quoting Alice in Wonderland:
“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to.”
“I don’t much care where –”
“Then it doesn’t matter which way you go.”
― Lewis Carroll, Alice in Wonderland.

The research question is the most important step for any analytical project. An ill-defined research question will result in an analysis that is of no use to the customer. A well-defined research question will serve as the blueprint for the entire study.

Think of a scenario in which you know where you need to go. From that moment you can plan how to get there and evaluate the alternative routes. One thing is certain: when you arrive, you will have succeeded. Likewise, once your research question is defined, answering it means your analysis has succeeded.

Defining the research question will not be easy. This will require the researcher, meaning you, to have direct access to the person, or people, requesting the study. At times, the request for an analysis will trickle down to the researcher through layers of managers. This situation is all but a guarantee that the research question will be ill defined, and the analysis an exercise in futility.

Say you have managed to get direct access to the person requesting the study. The next challenge is to get that person to think through exactly what they’d like to see in terms of results. One strategy is to understand how the results will be used. In addition, it is very helpful to sketch out the final report with mock graphs and tables, along with fill-in-the-blank text. In short, you will be building the outline for step five: Disseminate Results.

Get the data

At times, chasing the data may be a more appropriate description. It will be your responsibility to match the available data to the research question. It may be necessary to go back to the champion of the study to inform them that certain variables are not available.

Once all data is collected, from as many sources as needed, it must all be joined into one table. This process will entail, at times, asymmetric joins, such as left joins. Careful attention must be paid to avoid duplicating observations unintentionally. Taking notes as the joins are performed is a good practice: specifically, note how many observations are in each table prior to the join, and then the count of the resulting table.
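
As a minimal sketch of the note-taking habit described above, the following base R code performs a left join on two small tables and records the row counts before and after. The table and column names (customers, orders, customer_id) and all values are hypothetical, made up only for illustration.

```r
# Hypothetical example tables; names and values are made up for illustration.
customers <- data.frame(
  customer_id = c(1, 2, 3),
  region      = c("East", "West", "East")
)
orders <- data.frame(
  customer_id = c(1, 1, 3),
  amount      = c(20, 35, 50)
)

# Note the number of observations in each table before the join.
n_customers <- nrow(customers)
n_orders    <- nrow(orders)

# Left join: keep every customer, matched or not (all.x = TRUE).
joined <- merge(customers, orders, by = "customer_id", all.x = TRUE)

# Note the count of the resulting table.
n_joined <- nrow(joined)

# More rows than customers means some customer matched more than one order.
rows_added <- n_joined - n_customers
cat("customers:", n_customers, "orders:", n_orders,
    "joined:", n_joined, "rows added:", rows_added, "\n")
```

In this made-up case the joined table has one more row than the customers table, because customer 1 has two orders. Whether that extra row is intended depends on the research question: if the analysis needs one row per customer, the orders would have to be summarized before joining.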

Last, knowing which algorithm(s) will be used will dictate how the data must be organized for modeling. Unfortunately, there is no single set of steps, as each algorithm, in a particular software package, may require a different setup. Ensuring that the final table, sometimes called the modeling or data mining table, is correct is imperative. A modeling table built incorrectly will give incorrect results.

Explore and Clean the Data

Model

Disseminate Results

Univariate Statistics

Types of Variables

Variables can be classified in many ways, including but not limited to the following:

Categorical variable: a variable whose values fall into categories. For example, the variable “Toothpaste Brand” might take the values Colgate and Aquafresh.
Confounding variable: an extra variable that has a hidden effect on your experimental results.
Continuous variable: a variable that can take an infinite number of values within a range, like “time” or “weight”.
Control variable: a factor in an experiment which must be held constant. For example, in an experiment to determine whether light makes plants grow faster, you would have to control for soil quality and water.
Dependent variable: the outcome of an experiment. As you change the independent variable, you watch what happens to the dependent variable.
Discrete variable: a variable that can only take on a countable number of values. For example, “number of cars in a parking lot” is discrete because cars are counted in whole numbers.
Independent variable: a variable that is not affected by the other variables in the study; in an experiment, it is the one the researcher changes. Usually plotted on the x-axis.
Lurking variable: a “hidden” variable that affects the relationship between the independent and dependent variables.
Measurement variable: a variable that has a number associated with it. It is an “amount” of something or a “number” of something.
Nominal variable: another name for categorical variable.
Ordinal variable: similar to a categorical variable, but with a clear order among the categories. For example, income levels of low, middle, and high could be considered ordinal.
Qualitative variable: a broad category for any variable that can’t be counted (i.e. has no numerical value). Nominal and ordinal variables fall under this umbrella term.
Quantitative variable: a broad category that includes any variable that can be counted, or has a numerical value associated with it. Examples of variables that fall into this category include discrete variables and ratio variables.
Random variable: a variable associated with a random process that assigns numbers to the outcomes of random events.
Ranked variable: an ordinal variable; a variable where every data point can be put in order (1st, 2nd, 3rd, etc.).
Ratio variable: similar to an interval variable (one measured on a scale with equal spacing between values), but with a meaningful zero.
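
To make a few of these types concrete, here is a minimal sketch of how they might be stored in R. The variable names and values are made up for illustration; the mapping from variable type to R class is a common convention, not the only possible one.

```r
# Made-up values for illustration only.
brand  <- factor(c("Colgate", "Aquafresh", "Colgate"))   # categorical (nominal)
income <- factor(c("low", "high", "middle"),
                 levels = c("low", "middle", "high"),
                 ordered = TRUE)                          # ordinal
cars   <- c(12L, 0L, 7L)                                  # discrete (counts)
weight <- c(70.5, 82.3, 64.0)                             # continuous

# Order comparisons are meaningful for an ordinal variable ...
income[1] < income[2]   # compares "low" with "high"

# ... while a nominal factor carries no order.
is.ordered(brand)

# class() shows how R is storing each variable.
sapply(list(brand = brand, income = income, cars = cars, weight = weight),
       function(v) class(v)[1])
```

Declaring the levels explicitly for the ordinal variable matters: without the levels argument, R would order the categories alphabetically (high, low, middle) rather than by meaning.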

Univariate Statistics

Minimum: The smallest value of a variable.
Maximum: The largest value of a variable.
Mean: Also commonly known as the average. It is the sum of all the values divided by the number of values.
Mode: The value that occurs most often.
Median: The middle value when the values are sorted. When there is an even number of observations, it is the mean of the two middle values.
Frequency: The number of times each distinct value occurs.
Skewness: A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, zero, or undefined.
Kurtosis: A measure of the “tailedness” of the probability distribution of a real-valued random variable.
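
The statistics above can be sketched in base R as follows. The sample values are made up for illustration. Base R has no built-in function for the statistical mode, skewness, or kurtosis, so those are written out by hand here using the simple moment-based definitions; this is one common choice, and add-on packages may use slightly different formulas.

```r
# Made-up sample for illustration only.
x <- c(2, 4, 4, 5, 7, 9, 11)

min(x)
max(x)
mean(x)
median(x)

# Frequency: count of each distinct value.
freq <- table(x)

# Mode: the most frequent value.
# (If several values tie, this keeps only the first one.)
mode_x <- as.numeric(names(freq)[which.max(freq)])

# Moment-based skewness and kurtosis. Other definitions (for example,
# with small-sample corrections) give slightly different numbers.
n  <- length(x)
m2 <- sum((x - mean(x))^2) / n   # second central moment
m3 <- sum((x - mean(x))^3) / n   # third central moment
m4 <- sum((x - mean(x))^4) / n   # fourth central moment
skewness_x <- m3 / m2^(3/2)      # positive means a longer right tail
kurtosis_x <- m4 / m2^2          # a normal distribution has kurtosis 3
```

For a quick first look at a numeric variable, summary(x) reports the minimum, quartiles, median, mean, and maximum in one call.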

Table of Statistics by Variable Type

Univariate Plots

Data Cleaning: Extreme Values

Bivariate Statistics

Hypothesis Testing

Correlations

t-test

Chi-Square Test

Table of Variable Type by Test

Modeling

Models

Training and Validation Datasets

Linear Models

Logistic Regression

Classification and Regression Trees

Random Forest

Choosing the Best Model

Exercise Answers

References

https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/types-of-variables/

I’m a Data Scientist Part II

Illya Mowerman, Ph.D.
