Types of Models

Before proceeding, let’s describe a taxonomy for types of models, grouped by purpose. While not exhaustive, most models fall into at least one of these categories:

Descriptive Models

The purpose of a descriptive model is to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data.

An example of a descriptive model is the locally estimated scatterplot smoothing model, more commonly known as LOESS. Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data.
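To make the idea of local smoothing concrete, here is a minimal sketch in plain Python. It is not the reference LOESS algorithm or any library's API: it assumes a single predictor, a local straight-line fit, tricube weights, and no robustness iterations. The names `tricube`, `loess_point`, and `span` are hypothetical, chosen only for illustration.

```python
import random


def tricube(u):
    """Tricube kernel: weight 1 at u = 0, shrinking to 0 once |u| reaches 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_point(x0, xs, ys, span=0.5):
    """Estimate the smooth curve at x0 with a locally weighted straight line."""
    # Assumption: `span` is the fraction of points treated as the neighborhood.
    k = min(len(xs), max(3, int(round(span * len(xs)))))
    h = sorted(abs(x - x0) for x in xs)[k - 1]   # distance to the k-th nearest x
    weights = [tricube((x - x0) / h) for x in xs] if h > 0 else []
    sw = sum(weights)
    if sw == 0:  # degenerate neighborhood (e.g. heavily duplicated x values)
        return sum(ys) / len(ys)
    # Weighted least squares for a straight line, written out by hand.
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


# Simulated data (made up for this sketch): a linear trend plus noise.
rng = random.Random(1)
xs = [i / 2 for i in range(40)]
ys = [0.3 * x + rng.gauss(0, 1) for x in xs]
smoothed = [loess_point(x, xs, ys) for x in xs]
```

Plotting `smoothed` against `xs` over the raw points would give the kind of trend line the paragraph describes. A larger `span` gives a smoother, less flexible curve.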

Inferential Models

The goal of an inferential model is to produce a decision for a research question or to test a specific hypothesis. The aim is to make some statement of truth regarding a predefined conjecture or idea. In many (but not all) cases, a qualitative statement is produced (e.g., that a difference is ‘statistically significant’).
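As one illustration of turning a predefined conjecture into a qualitative statement, the sketch below runs a two-sided permutation test for a difference in group means. This is only one of many inferential approaches, and the function name `permutation_p_value`, the made-up measurements, and the conventional 0.05 cutoff are all assumptions for the example.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any real link between value and group
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups.
control = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
treated = [12.0, 11.8, 12.3, 12.1, 11.9, 12.2]
p_value = permutation_p_value(control, treated)
is_significant = p_value < 0.05  # assumed, conventional threshold
```

The final boolean is the qualitative statement. The p-value behind it depends on assumptions made before looking at the data, such as which groups to compare and which statistic to use.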

Predictive Models

Sometimes data are modeled to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data.

A simple example would be for a book buyer to predict how many copies of a particular book should be shipped to their store for the next month. An over-prediction wastes space and money due to excess books. If the prediction is smaller than it should be, there is opportunity loss and less profit.
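The two kinds of error in the book example need not cost the same. A rough sketch of that asymmetry, using hypothetical per-copy figures (`unit_cost` for an unsold copy, `unit_profit` for a missed sale), could look like this:

```python
def stocking_cost(predicted, actual, unit_cost=6.0, unit_profit=4.0):
    """Cost of a stocking decision; both per-copy values are made up."""
    if predicted >= actual:
        return (predicted - actual) * unit_cost    # excess copies on the shelf
    return (actual - predicted) * unit_profit      # sales that could not be made


# Same absolute error of 20 copies, different consequences.
over_cost = stocking_cost(predicted=120, actual=100)
under_cost = stocking_cost(predicted=80, actual=100)
```

Under these assumed numbers, over-ordering by 20 copies is more expensive than under-ordering by 20, which is one reason the right accuracy measure depends on how the predictions will be used.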

What are the most important factors affecting predictive models? There are many different ways that a predictive

Some Terminology

  • Regression: predicts a numeric outcome.
  • Classification: predicts an outcome that is an ordered or unordered set of qualitative values.

These are imperfect definitions and do not account for all possible types of models. In Chapter 7, we refer to this characteristic of supervised techniques as the model mode.
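The distinction can be seen by running one technique in both modes. The sketch below uses a toy nearest-neighbor predictor with a single numeric predictor. The function `knn_predict` and its `mode` argument are hypothetical names for this illustration, not an existing API.

```python
from collections import Counter


def knn_predict(train_x, train_y, x_new, k=3, mode="regression"):
    """Predict from the k nearest training points, in one of two modes."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x_new))[:k]
    outcomes = [y for _, y in nearest]
    if mode == "regression":
        return sum(outcomes) / len(outcomes)           # numeric: average
    if mode == "classification":
        return Counter(outcomes).most_common(1)[0][0]  # qualitative: majority vote
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up training data with a numeric and a qualitative outcome.
train_x = [1, 2, 3, 10, 11, 12]
numeric_y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
class_y = ["low", "low", "low", "high", "high", "high"]

numeric_pred = knn_predict(train_x, numeric_y, 2, mode="regression")
class_pred = knn_predict(train_x, class_y, 11, mode="classification")
```

The neighbor search is identical in both calls. Only the way the neighbors' outcomes are combined changes with the mode.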

How does modeling fit into the data analysis process?

There are always a few critical phases of data analysis that come before modeling.

  1. Cleaning the data
  2. Understanding the data

Import –> Tidy –> Transform, Visualize, Model, Understand –> Communicate

This iterative process is especially true for modeling. Figure 1.3 is meant to emulate the typical path to determining an appropriate model. The general phases are:

  • Exploratory Data Analysis (EDA): Initially, there is a back and forth between numerical analysis and visualization of the data (represented in Figure 1.2), where different discoveries lead to more questions and data analysis ‘side-quests’ to gain more understanding.

  • Feature engineering: The understanding gained from EDA results in the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (using the ratio of two predictors). Chapter 6 focuses entirely on this important step.

  • Model tuning and selection (circles with blue and yellow segments): A variety of models are generated and their performance is compared. Some models require parameter tuning, in which some structural parameters must be specified or optimized. The colored segments within the circles signify the repeated data splitting used during resampling.

  • Model evaluation: During this phase of model development, we assess the model’s performance metrics, examine residual plots, and conduct other EDA-like analyses to understand how well the models work. In some cases, formal between-model comparisons (Chapter 11) help understand whether any differences in models are within experimental noise.
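The repeated data splitting mentioned in the tuning and selection phase can be sketched with a small k-fold cross-validation loop that compares two candidate models. This is a plain-Python illustration under stated assumptions, not the workflow or API used later in the book: the helpers `k_fold_indices`, `fit_mean`, `fit_line`, and `cv_rmse` are hypothetical, the data are simulated, and RMSE is just one possible performance metric.

```python
import random
from statistics import mean


def k_fold_indices(n, k=5, seed=0):
    """Shuffle row indices and deal them into k held-out folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Candidate 1: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate 2: simple least-squares straight line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cv_rmse(fit, xs, ys, k=5, seed=0):
    """Average held-out RMSE over k folds for one candidate model."""
    scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - model(xs[i])) ** 2 for i in fold]
        scores.append(mean(sq_err) ** 0.5)
    return mean(scores)


# Simulated data (made up for this sketch): linear trend plus noise.
rng = random.Random(42)
xs = list(range(30))
ys = [3 + 0.5 * x + rng.gauss(0, 1) for x in xs]

rmse_mean = cv_rmse(fit_mean, xs, ys)
rmse_line = cv_rmse(fit_line, xs, ys)
```

Because both candidates are scored on the same folds, the difference between `rmse_line` and `rmse_mean` reflects the models rather than the luck of a single split. Whether such a difference exceeds experimental noise is the question that formal between-model comparisons address.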

Chapter Summary

  • This chapter focused on how models describe relationships in data, and on different types of models such as descriptive models, inferential models, and predictive models. The predictive capacity of a model can be used to evaluate it, even when its main goal is not prediction. Modeling itself sits within the broader data analysis process, and exploratory data analysis is a key part of building high-quality models.

For all kinds of modeling, software for building models must support good scientific methodology and ease of use for practitioners from diverse backgrounds. The software we develop approaches this with the ideas and syntax of the tidyverse, which we introduce (or review) in Chapter 2. Chapter 3 is a quick tour of conventional base R modeling functions and summarizes the unmet needs in that area.