Concept Study Guide

Statistical Modeling: A Fresh Approach

Daniel Kaplan

The questions are organized not by topic but by cognitive level, according to Bloom's Taxonomy.


  1. What is a standard deviation?
  2. What is a median?
  3. What is a percentile?
  4. What is an outlier?
  5. What's a categorical variable?
  6. What is a correlation coefficient?
  7. What is a model value?
  8. What is a variance?
  9. \( R^2 \) is a ratio. Of what?
  10. What's an indicator variable (sometimes called a “dummy” variable)?
  11. What's a backdoor pathway?
  12. What's a null hypothesis?
  13. What are the two possible outcomes of a hypothesis test?
  14. What is a probability?
  15. What is an odds?
  16. What is a sampling frame?
  17. What is a random sample?
  18. What is a “rank transform”?
  19. What is resampling?
  20. What is shuffling?
  21. What is the form of a confidence interval?
  22. What's a Type I error? A Type II error?
  23. What is “bootstrapping”?
  24. What is a vector?
  25. What does it mean to be “orthogonal”?
  26. What is a skew distribution?
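
Several of the terms above (median, percentile, variance, standard deviation, outlier) can be made concrete with a small calculation. The sketch below uses Python's standard `statistics` module on made-up numbers; both the data and the choice of Python are assumptions for illustration, not something the guide prescribes.

```python
import statistics

# Made-up data for illustration; 40 sits far from the rest (a candidate outlier).
values = [2, 4, 4, 5, 7, 9, 40]

median = statistics.median(values)        # middle value of the sorted data
variance = statistics.variance(values)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)        # square root of the sample variance

# Quartiles: the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Drop the extreme value and see how much each summary moves.
trimmed = values[:-1]
median_change = median - statistics.median(trimmed)
mean_change = statistics.mean(values) - statistics.mean(trimmed)

print("median:", median, "50th percentile:", q2)
print("variance:", variance, "std dev squared:", std_dev ** 2)
print("change in median:", median_change, "change in mean:", mean_change)
```

In this sketch the 50th percentile coincides with the median, the variance is the square of the standard deviation, and removing the extreme value moves the mean much more than the median.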


  1. For a given variable, how are the variance and the standard deviation related?
  2. Why does it make no sense to calculate the mean of a categorical variable?
  3. How do backdoor pathways potentially confuse interpretation of correlations as causation?
  4. Which percentile is the median?
  5. Why are indicator variables used to represent categorical variables?
  6. What is the criterion used for fitting a model (in linear models)?
  7. Why is the median robust to outliers?
  8. Why is the range of a variable not robust to outliers?
  9. Is the variance robust to outliers?
  10. What are the highest and lowest possible values for \( R^2 \)? Why?
  11. What's the difference between a null hypothesis and an alternative hypothesis?
  12. In ANOVA, F is a ratio. Of what?
  13. In a regression report, t is a ratio. Of what?
  14. An odds and a probability relate the same information. How are they related?
  15. How does sampling introduce variation?
  16. What is a sampling distribution?
  17. What is a confidence interval?
  18. What's the relationship between a standard error, a sampling distribution, and a margin of error?
  19. What's an intercept term?
  20. To which hypothesis does a Type I error relate? A Type II error?
  21. How do model vectors relate to a quantitative variable?
  22. What model vectors are created for a categorical variable?
  23. What kind of number is each of these things (e.g., an integer, positive, negative, and so on)?
      1. a degree of freedom?
      2. a variance?
      3. a standard deviation?
      4. a correlation coefficient \( r \)?
      5. a coefficient of determination \( R^2 \)?
  24. How does an experiment differ from an observational study?
  25. What is a conditional probability and how does it differ from a joint or marginal probability?
  26. What is a probability density? How is it related to a cumulative probability?
  27. What is the link between these two conditional probabilities: \( p(a|b) \) and \( p(b|a) \)?
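
Several of the relationships asked about above can be written compactly. The symbols here are assumptions chosen for illustration (\( s \) for a standard deviation, \( p \) for a probability, \( f \) for a probability density, \( F \) for a cumulative probability); the guide itself does not fix a notation. The bounds on \( R^2 \) assume a least squares model that includes an intercept term.

```latex
\[ \text{variance} = s^2, \qquad -1 \le r \le 1, \qquad 0 \le R^2 \le 1 \]
\[ \text{odds} = \frac{p}{1-p}, \qquad p = \frac{\text{odds}}{1+\text{odds}} \]
\[ p(a|b) = \frac{p(b|a)\, p(a)}{p(b)}, \qquad F(x) = \int_{-\infty}^{x} f(u)\, du \]
```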


  1. How do you calculate the variance of a variable?
  2. How do you calculate the variance of residuals from a model?
  3. How do you include an intercept term in a model? How do you exclude it?
  4. How would you display a distribution density for a quantitative variable?
  5. How do you include a main effect in a model?
  6. How do you include an interaction term in a model?
  7. How do you include a covariate in a model?
  8. How do you produce a regression report from a model?
  9. How do you produce an ANOVA report from a model?
  10. How does a p-value guide the outcome of a hypothesis test?
  11. If you fail to reject the null hypothesis, does that mean the alternative hypothesis is right?
  12. What is a risk ratio/probability ratio?
  13. Why are random samples preferred?
  14. What's the difference in the uses of \( r \) and \( R^2 \)?
  15. What are main effects and how do they differ from interaction terms?
  16. How do you resample a data set?
  17. How do you shuffle a variable in a data set (when constructing a model)?
  18. How would you measure the angle between two variables?
  19. Broadly speaking, how do you perform an experiment? What are the essential things to do?
  20. What is a rank transformation and what does it do to the distribution of a variable?
  21. What does the term “least squares” refer to?
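
A few of the "how do you" questions above (variance of a variable, variance of residuals, least squares, the angle between two variables) can be sketched for the simplest case of one quantitative explanatory variable plus an intercept. This is a minimal sketch on made-up data using only Python's standard library; it is not a substitute for the modeling software used in a course.

```python
import math
import statistics

# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Least squares: choose slope and intercept to minimize the sum of
# squared residuals for the model y ~ 1 + x.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

model_values = [intercept + slope * xi for xi in x]
residuals = [yi - mi for yi, mi in zip(y, model_values)]

var_response = statistics.variance(y)
var_model = statistics.variance(model_values)
var_resid = statistics.variance(residuals)
r_squared = var_model / var_response

# Angle between the two variables after subtracting their means:
# its cosine is the correlation coefficient r.
r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))
angle_degrees = math.degrees(math.acos(r))

print("variance of residuals:", var_resid)
print("R^2:", r_squared, "r:", r, "angle (degrees):", angle_degrees)
```

For a least squares fit with an intercept, the variance of the model values plus the variance of the residuals equals the variance of the response, and with a single explanatory variable \( R^2 \) equals \( r^2 \).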


  1. What's the relationship between the variance of model values, of residuals, and of the response?
  2. Why use a covariate in a model?
  3. What is the difference between an explanatory and a response variable?
  4. What is an alternative hypothesis used for?
  5. How can one use the power of a hypothesis test in interpreting the outcome of the test?
  6. How would you use a hypothetical causal network to identify covariates to include or exclude in a model?
  7. What's the relationship between the degrees of freedom listed in an ANOVA table?
  8. Given the units of a response variable A and explanatory variables B and C:
  9. What's the difference between a p-value and a significance level?
  10. For what purpose is resampling used?
  11. For what purpose is shuffling used?
  12. What does it mean for covariates to “eat variance”?
  13. How are degrees of freedom used?
  14. What does it mean to say that two model terms are “collinear”?
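
The different purposes of resampling and shuffling can be seen side by side in a small simulation. The sketch below is one possible illustration on made-up data: resampling (drawing cases with replacement) shows how much a statistic varies from sample to sample, while shuffling the explanatory variable breaks its link to the response and so simulates a world in which the null hypothesis holds. The helper `group_difference`, the trial counts, and the seed are all hypothetical choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up data: a response measured in two groups of six cases each.
group = ["a"] * 6 + ["b"] * 6
response = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 6.0, 6.4, 5.9, 6.2, 6.6, 6.1]

def group_difference(labels, values):
    """Mean response in group b minus mean response in group a."""
    a = [v for lab, v in zip(labels, values) if lab == "a"]
    b = [v for lab, v in zip(labels, values) if lab == "b"]
    return statistics.mean(b) - statistics.mean(a)

observed = group_difference(group, response)

# Resampling: draw cases with replacement, same size as the original,
# and recompute the statistic (here the mean response) each time.
boot_means = sorted(
    statistics.mean(rng.choices(response, k=len(response)))
    for _ in range(2000)
)
# Middle 95% of the 2000 resampled means: drop 50 values from each end.
interval = (boot_means[50], boot_means[1949])

# Shuffling: permute the group labels so that group cannot be related
# to the response, then see how often chance alone matches the observed gap.
n_trials = 2000
count = 0
for _ in range(n_trials):
    shuffled = group[:]
    rng.shuffle(shuffled)
    if abs(group_difference(shuffled, response)) >= abs(observed):
        count += 1
p_value = count / n_trials

print("observed difference:", observed)
print("rough 95% interval for the mean:", interval)
print("shuffling p-value:", p_value)
```

The interval reflects sampling variation in the mean, and the p-value is the fraction of shuffled trials with a difference at least as large as the one observed.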


  1. What will happen to \( R^2 \) when you add a new explanatory term to a model?
  2. What factors influence the size of standard errors of coefficients?
  3. What's the point of converting an F statistic to a p-value?
  4. Why is orthogonality advantageous when interpreting a model?
  5. How could you change a model (not the data!) to reduce the size of the residuals?
  6. When would you use a rank transform?
  7. When would you use logistic regression instead of ordinary least squares regression?
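
The first question in this list can be explored numerically. The sketch below adds an explanatory term that is unrelated to the response by construction and compares \( R^2 \) before and after. To stay within Python's standard library, it fits the larger model in two steps: it first removes from the new term everything already explained by the intercept and `x`, then fits the orthogonal remainder to the residuals of the smaller model. For least squares this gives the same residuals as fitting all terms at once. The data, seed, and helper names are hypothetical.

```python
import random
import statistics

def residuals_after(y, x):
    """Residuals of y after a least squares fit of y ~ 1 + x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [(yi - my) - slope * (xi - mx) for xi, yi in zip(x, y)]

def sum_sq(v):
    """Sum of squares of a list of numbers."""
    return sum(vi ** 2 for vi in v)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
x = [float(i) for i in range(1, 11)]
y = [2.0 * xi + rng.gauss(0, 2) for xi in x]
junk = [rng.gauss(0, 1) for _ in x]  # unrelated to y by construction

mean_y = statistics.mean(y)
total = sum_sq([yi - mean_y for yi in y])

# Model 1: y ~ 1 + x
resid1 = residuals_after(y, x)
r2_model1 = 1 - sum_sq(resid1) / total

# Model 2: y ~ 1 + x + junk, fitted via the part of junk that is
# orthogonal to the intercept and x.
junk_perp = residuals_after(junk, x)
resid2 = residuals_after(resid1, junk_perp)
r2_model2 = 1 - sum_sq(resid2) / total

print("R^2 without junk:", r2_model1)
print("R^2 with junk:   ", r2_model2)
```

Because the extra term can only absorb some of the remaining residual variation, \( R^2 \) for the larger model is never smaller than for the smaller one, even when the added term is pure noise.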


  1. How would you judge whether including a covariate in a model improves the model?
  2. How would you decide which is more appropriate for a given purpose: an ANOVA or a regression report?
  3. What's the advantage of using log odds in modeling probabilities?
  4. Why use a 95% coverage interval?
  5. How would you decide on an appropriate alternative hypothesis?
  6. How would you decide if a random sample is really random?
  7. Are smaller residuals always a sign that a model is better?
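
One way to think about the question on log odds is to look at what the transformation does numerically. In the sketch below, a straight-line model is written on the log-odds scale and converted back to probabilities. The coefficients and the function names `log_odds` and `probability` are made up for illustration.

```python
import math

def log_odds(p):
    """Log odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def probability(log_odds_value):
    """Inverse of log_odds: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds_value))

# A straight-line model on the log-odds scale (coefficients are made up).
intercept, slope = -3.0, 0.5
inputs = [0, 6, 12, 20]
probs = [probability(intercept + slope * x) for x in inputs]

for x, p in zip(inputs, probs):
    print(x, round(p, 3))
```

However large or small the linear expression becomes, the converted value stays strictly between 0 and 1, which a straight line fitted directly to probabilities cannot guarantee.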