Data and Models

M. Drew LaMar
February 14, 2020


https://xkcd.com/2110/

Course Announcements

  • Reading for next Monday:
    • Model Based Inference in the Life Sciences: A Primer on Evidence by David R. Anderson
      Chapter 2: Data and Models: Sections 2.1 and 2.2 only!

Wrapping up Multiple Linear Regression

These were our steps for fitting a linear model:

  • Model Building: What type(s) of model to use?
  • Model Selection: Of these model types, which model fits the data the best?
  • Model Validation: Does the resulting model meet the assumptions?

Winning model

Using backward elimination with the adjusted \( R^2 \) method:

winner <- lm(price ~ cond + stock_photo + wheels, data = mario_kart)
summary(winner)

Call:
lm(formula = price ~ cond + stock_photo + wheels, data = mario_kart)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.454  -2.959  -0.949   2.712  14.061 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     36.0483     0.9745  36.990  < 2e-16 ***
condnew          5.1763     0.9961   5.196 7.21e-07 ***
stock_photoyes   1.1177     1.0192   1.097    0.275    
wheels           7.2984     0.5448  13.397  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.884 on 137 degrees of freedom
Multiple R-squared:  0.719, Adjusted R-squared:  0.7128 
F-statistic: 116.8 on 3 and 137 DF,  p-value: < 2.2e-16
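
The slides do not show how the elimination itself was run. As a rough sketch of the idea only, the code below applies backward elimination by adjusted \( R^2 \) to simulated data. The toy data frame and the helpers adj_r2() and backward_adj_r2() are hypothetical; they are not the course code and do not use the mario_kart data.

```r
# Hypothetical stand-in data (NOT the mario_kart data set).
set.seed(1)
n <- 100
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)   # pure noise: unrelated to y by construction
)
toy$y <- 3 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

# Adjusted R^2 of the model y ~ vars (intercept-only if vars is empty).
adj_r2 <- function(vars, data) {
  rhs <- if (length(vars) > 0) paste(vars, collapse = " + ") else "1"
  summary(lm(as.formula(paste("y ~", rhs)), data = data))$adj.r.squared
}

# Drop one variable at a time for as long as doing so raises adjusted R^2.
backward_adj_r2 <- function(vars, data) {
  best <- adj_r2(vars, data)
  while (length(vars) > 0) {
    # adjusted R^2 after removing each remaining variable in turn
    candidates <- sapply(vars, function(v) adj_r2(setdiff(vars, v), data))
    if (max(candidates) <= best) break   # no single removal helps: stop
    best <- max(candidates)
    vars <- setdiff(vars, names(which.max(candidates)))
  }
  list(vars = vars, adj_r2 = best)
}

result <- backward_adj_r2(c("x1", "x2", "x3"), toy)
print(result)
```

Because adjusted \( R^2 \) penalizes extra terms, a variable is dropped only when the fit does not get worse by more than the penalty saved.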

Model Validation and Diagnostics

Okay, I've got a best-fit model (i.e., “Good squeezy squeezy”), BUT…

ARE MY ASSUMPTIONS MET? (i.e., is it the right fruit?!?!)

Assumptions of Multiple Linear Regression

  1. the residuals of the model are nearly normal,
  2. the variability of the residuals is nearly constant,
  3. the residuals are independent, and
  4. each variable is linearly related to the outcome.

Nearly normal residuals?

Make a normal Q-Q plot of the residuals:

qqnorm(residuals(winner))

[Figure: normal Q-Q plot of the residuals]
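
A Q-Q plot is easier to read with a reference line. The sketch below assumes simulated data (the toy data frame is hypothetical) and adds qqline() to the same kind of plot; points that stay close to the line are consistent with nearly normal residuals.

```r
# Hypothetical simulated data with normal errors (not the mario_kart data).
set.seed(2)
toy <- data.frame(x = runif(80, 0, 10))
toy$y <- 1 + 0.5 * toy$x + rnorm(80)

fit <- lm(y ~ x, data = toy)
r <- residuals(fit)

qqnorm(r)   # sample quantiles of the residuals vs. normal quantiles
qqline(r)   # reference line through the first and third quartiles
```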

Constant variability?

Residuals vs. fitted values:

plot(fitted(winner), residuals(winner))

[Figure: residuals vs. fitted values]
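
For contrast, here is a sketch of what a violation can look like. It assumes simulated data in which the noise grows with the predictor (all names are hypothetical), so the residuals fan out as the fitted values increase.

```r
# Hypothetical simulated data: error standard deviation grows with x.
set.seed(3)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # residuals should scatter evenly around this line

# Crude numeric check: residual spread below vs. above the median fitted value.
upper <- fitted(fit) > median(fitted(fit))
spread <- tapply(residuals(fit), upper, sd)
print(spread)
```

A clearly larger spread in one half than in the other suggests that the constant-variability assumption is doubtful.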

Independence?

Residuals vs. order of collection (ID):

plot(mario_kart$ID, residuals(winner))

[Figure: residuals vs. order of collection (ID)]
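
As a sketch of what dependence can look like, the code below assumes simulated data whose errors are autocorrelated in collection order (the variable names and the AR(1) coefficient of 0.8 are illustrative assumptions). Neighboring residuals then drift together instead of scattering randomly.

```r
# Hypothetical simulated data with AR(1) errors in collection order.
set.seed(4)
n <- 150
id <- seq_len(n)                      # order of collection
x <- rnorm(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 1 + 2 * x + e

fit <- lm(y ~ x)
r <- residuals(fit)

plot(id, r, type = "b", xlab = "Order of collection", ylab = "Residuals")
abline(h = 0, lty = 2)

# Correlation between each residual and the one collected just before it.
lag1 <- cor(r[-1], r[-n])
print(lag1)
```

A lag-1 correlation near zero is what independent residuals would tend to give; a value far from zero points to dependence.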

Linearity

par(mfrow=c(1,3))
boxplot(residuals(winner) ~ cond, data = mario_kart)
boxplot(residuals(winner) ~ stock_photo, data = mario_kart)
plot(residuals(winner) ~ wheels, data = mario_kart)

[Figure: residuals by cond, stock_photo, and wheels]
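
A sketch of a linearity violation, assuming simulated data with a quadratic relationship (names are hypothetical): when a straight line is fit, the residuals plotted against the predictor show a clear curve rather than a flat band.

```r
# Hypothetical simulated data: the true relationship is quadratic in x.
set.seed(6)
n <- 120
toy <- data.frame(x = runif(n, -3, 3))
toy$y <- 1 + toy$x + toy$x^2 + rnorm(n)

fit <- lm(y ~ x, data = toy)          # deliberately omits the x^2 term
plot(toy$x, residuals(fit), xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)

# The leftover curvature shows up as correlation between residuals and x^2.
curvature <- cor(residuals(fit), toy$x^2)
print(curvature)
```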

Models and Modeling

Discuss: What is a model?

Models and Modeling

Question: Is this a model?

Models and Modeling

Discuss: What are the components of a model?

Answer:

  • Objects (nouns)
  • Processes or relationships (verbs)
  • Simplified, abstract/concrete (adjectives)
  • Function (use case) - Not strictly necessary for the definition of a model. This answers the “Why model?” question.

Definition: A model is a simplified, abstract (or concrete) representation of objects and their relationships and/or processes in the real world.

Bits-and-pieces

Question: What are the model components for this model?

  • Objects
  • Processes/Relationships
  • Functions

Models - General Definition

Definition: A model is a simplified, abstract (or concrete) representation of objects and their relationships and/or processes in the real world.

The Importance of Good Data

Quote: “Meaningful data of sufficient quantity are the grist of scientific bread.”

  1. Is the study sound so that an inductive inference can be justified?
    • Experimental design should be able to address predictions from…
    • …one or a number of scientific hypotheses that have been well thought out.
  2. Are the data analysis methods sound? Relies on…
    • …adequate modeling and
    • objective approaches to model selection.

Information and Statistics

Quote: “If data are collected in an appropriate manner, then there is information in the sample data about the process or system under study.”

  • A mathematical model is required (in most cases) to obtain information from the data.
  • Inductive vs. deductive reasoning
    • Inductive: “inference of a generalized conclusion from particular instances” (sample -> population)
  • Statistics adds rigor to the inductive process.
  • The inference comes from a model that approximates the system or process of interest.

Models in Science

Quantification is essential due to variation and complexity.

Quote: “Unless one is engaged in simple descriptive studies, they [the empirical sciences] must deal with mathematical models.”

Quote: “We are not trying to model the data; instead, we are trying to model the information in the data.”

Quote: “Data contain both information and noise; fitting the data perfectly would include modeling the noise and this is counter to our science objective.”
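
The last quote can be illustrated with a small sketch. It assumes simulated data from a straight line plus noise (all names, the sample size, and the polynomial degree are illustrative choices) and compares a simple model with a very flexible one on the data used for fitting and on fresh data.

```r
# Hypothetical simulated data: a straight line plus noise.
set.seed(5)
n <- 20
make_data <- function(n) {
  d <- data.frame(x = runif(n, 0, 1))
  d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.3)
  d
}
train <- make_data(n)   # data used to fit the models
fresh <- make_data(n)   # new data from the same process

simple   <- lm(y ~ x, data = train)            # matches the true structure
flexible <- lm(y ~ poly(x, 10), data = train)  # enough freedom to chase noise

# Residual sum of squares of a fitted model on a given data set.
rss <- function(fit, data) sum((data$y - predict(fit, newdata = data))^2)

print(c(train_simple = rss(simple, train), train_flexible = rss(flexible, train)))
print(c(fresh_simple = rss(simple, fresh), fresh_flexible = rss(flexible, fresh)))
```

The flexible model can never fit the training data worse than the simple one, because the simple model is a special case of it. What matters is how the two compare on the fresh data, where fitted noise no longer helps.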