Before proceeding, let’s describe a taxonomy for types of models, grouped by purpose. While the taxonomy is not exhaustive, most models fall into at least one of these categories:
The purpose of a descriptive model is to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data.
Another example of a descriptive model is the locally estimated scatterplot smoothing model, more commonly known as LOESS. Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data.
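To make the idea of a local, flexible fit concrete, here is a minimal sketch of a LOESS-style smoother. It is written in plain Python purely for illustration and is not the software discussed in this book. The function names, the tricube weighting, the `span` default, and the toy data are all assumptions of this sketch; production LOESS implementations add refinements such as robustness iterations.

```python
import math


def tricube(u):
    """Tricube kernel: 1 at u = 0, falling smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, xs, ys, span=0.5):
    """Fit a weighted straight line around x0 and return its value at x0."""
    k = max(2, min(len(xs), round(span * len(xs))))  # points in the local window
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth: distance to the k-th nearest point, nudged up so it is never zero
    h = dists[k - 1] * (1.0 + 1e-9) or 1e-12
    w = [tricube((x - x0) / h) for x in xs]

    # Weighted least squares for a straight line, in centered form
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def loess(xs, ys, span=0.5):
    """Smoothed value at every observed x (one local fit per point)."""
    return [loess_point(x0, xs, ys, span) for x0 in xs]


# Hypothetical data: a sine trend plus a deterministic alternating wiggle
xs = [i / 4 for i in range(41)]
ys = [math.sin(x) + 0.3 * (-1) ** i for i, x in enumerate(xs)]
smooth = loess(xs, ys, span=0.3)
```

A smaller `span` follows the data more closely, while a larger one gives a smoother curve. Choosing it is a judgment call about which trend the analyst wants to emphasize.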
The goal of an inferential model is to produce a decision for a research question or to test a specific hypothesis. The aim is to make some statement of truth regarding a predefined conjecture or idea. In many (but not all) cases, a qualitative statement is produced (for example, that a result is ‘statistically significant’).
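As one illustration of how an inferential analysis ends in a qualitative statement, the sketch below runs a two-sided permutation test for a difference in group means. The permutation test is our choice for the example; the text above does not prescribe any particular test. The data values, the 0.05 threshold, and all function names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    return (extreme + 1) / (n_perm + 1)


# Hypothetical measurements for two groups (made up for this sketch)
control = [5.1, 4.9, 5.6, 5.0, 5.3, 4.8]
treated = [5.9, 6.1, 5.7, 6.4, 5.8, 6.0]

p_value = permutation_test(control, treated)
# The 0.05 cutoff is an assumed convention, not something the data dictate
verdict = "statistically significant" if p_value < 0.05 else "not statistically significant"
print(f"p = {p_value:.4f}: {verdict}")
```

The numeric p-value is the quantitative result. The final label is the qualitative statement, and it depends on the threshold chosen before looking at the data.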
Sometimes data are modeled to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data.
A simple example would be for a book buyer to predict how many copies of a particular book should be shipped to their store for the next month. An over-prediction wastes space and money due to excess books. If the prediction is smaller than it should be, there is opportunity loss and less profit.
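The two kinds of error in the book buyer example need not cost the same. A small sketch makes the arithmetic explicit; the per-copy costs (`holding_cost`, `lost_margin`) and the demand figure are invented for illustration only.

```python
def misprediction_cost(predicted, demand, holding_cost=2.0, lost_margin=5.0):
    """Cost of ordering `predicted` copies when `demand` copies would have sold.

    holding_cost: assumed cost per unsold copy (space and money tied up).
    lost_margin:  assumed profit forgone per copy that could not be sold.
    """
    if predicted >= demand:
        return (predicted - demand) * holding_cost  # over-prediction: excess books
    return (demand - predicted) * lost_margin  # under-prediction: missed sales


# Hypothetical month in which 100 copies would actually have sold
for predicted in (80, 100, 120):
    print(predicted, misprediction_cost(predicted, demand=100))
```

Under these assumed costs, being 20 copies short is more expensive than being 20 copies over, which is one reason prediction accuracy is judged in the context of how the predictions will be used.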
What are the most important factors affecting predictive models? There are many different ways that a predictive
These are imperfect definitions and do not account for all possible types of models. In Chapter 7, we refer to this characteristic of supervised techniques as the model mode.
There are always a few critical phases of data analysis that come before modeling.
Import → Tidy → Transform, Visualize, Model, Understand → Communicate
This iterative process is especially true for modeling. Figure 1.3 is meant to emulate the typical path to determining an appropriate model. The general phases are:
Exploratory data analysis (EDA): Initially there is a back and forth between numerical analysis and visualization of the data (represented in Figure 1.2), where different discoveries lead to more questions and data analysis ‘side-quests’ to gain more understanding.
Feature engineering: The understanding gained from EDA results in the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (e.g., the ratio of two predictors). Chapter 6 focuses entirely on this important step.
Model tuning and selection (circles with blue and yellow segments): A variety of models are generated and their performance is compared. Some models require parameter tuning, where some structural parameters must be specified or optimized. The colored segments within the circles signify the repeated data splitting used during resampling.
Model evaluation: During this phase of model development, we assess the model’s performance metrics, examine residual plots, and conduct other EDA-like analyses to understand how well the models work. In some cases, formal between-model comparisons (Chapter 11) help understand whether any differences in models are within experimental noise.
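The later phases above can be sketched end to end in a few lines. The example below is a plain Python illustration, not the software this book develops. It engineers a ratio feature, tunes the number of neighbors in a nearest-neighbor regression with repeated data splitting (5-fold cross-validation), and scores each candidate by root mean squared error. The simulated data, the choice of model, the candidate values, and every function name are assumptions made for this sketch.

```python
import math
import random


def add_ratio(rows):
    """Feature engineering: replace two raw predictors (a, b) with their ratio."""
    return [(a / b, y) for a, b, y in rows]


def knn_predict(train, x0, k):
    """Average the outcomes of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(pairs):
    """Root mean squared error over (observed, predicted) pairs."""
    return math.sqrt(sum((y - yhat) ** 2 for y, yhat in pairs) / len(pairs))


def cv_rmse(data, k, n_folds=5):
    """Resampling: hold out each fold once, fit on the rest, average the RMSE."""
    scores = []
    for f in range(n_folds):
        held_out = data[f::n_folds]
        fit_set = [p for i, p in enumerate(data) if i % n_folds != f]
        scores.append(rmse([(y, knn_predict(fit_set, x, k)) for x, y in held_out]))
    return sum(scores) / n_folds


def tune_k(data, candidates, n_folds=5):
    """Tuning: pick the candidate k with the lowest resampled RMSE."""
    results = {k: cv_rmse(data, k, n_folds) for k in candidates}
    return min(results, key=results.get), results


# Simulated data (an assumption): the outcome depends on the ratio a / b plus noise
rng = random.Random(1)
rows = []
for _ in range(200):
    a, b = rng.uniform(1, 10), rng.uniform(1, 5)
    rows.append((a, b, 3 * a / b + rng.gauss(0, 0.2)))

data = add_ratio(rows)
best_k, cv_results = tune_k(data, candidates=[1, 3, 5, 9, 15])
for k, score in cv_results.items():
    print(f"k = {k:2d}  resampled RMSE = {score:.3f}")
print("selected k:", best_k)
```

Because every candidate is scored only on rows that were held out of its fit, the comparison between values of `k` is not flattered by rows the model has already seen.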
For all kinds of modeling, software for building models must support good scientific methodology and ease of use for practitioners from diverse backgrounds. The software we develop approaches this with the ideas and syntax of the tidyverse, which we introduce (or review) in Chapter 2. Chapter 3 is a quick tour of conventional base R modeling functions and summarizes the unmet needs in that area.