Agenda for today

  1. Check in understanding (20 mins)
  2. SAS Demo and practice interpreting SLR (30 mins)
  3. Regression to the mean (10 mins)


Check in understanding


Q1: What does OLS in regression stand for?

  1. Optical landing system
  2. Ordinary linear statistics
  3. Ordinary least squares

Q2: Among the following items, which ones do we want to minimize to get a “best fitting” line?

  1. SSM (Sum of Squares Model or variance explained)
  2. SST (Total Sum of Squares or total variance in Y)
  3. SSE (Sum of Squares Error or variance unexplained)
  4. -2LL (-2 log likelihood)
  5. R-squared

Q3: Which number is a possible value of R squared in a linear regression model?

  1. -0.22
  2. 0.001
  3. 1.05

Q4: Which intervals/bands tell us the range of predicted values of y (y_hat) for a given x?

  1. Prediction bands
  2. Confidence bands
  3. Sampling intervals

Q5: Typically, which band is narrower?

  1. Prediction bands
  2. Confidence bands

Suppose SAS tells you that the estimate for b in an SLR (y = a + b*x) is 0.82 (SE = 0.54).

Q6: Do you think there is a positive association between x and y? (Hint: use the 95% CI.) a. Yes b. No


Q7: Suppose this effect is significant. Does this result indicate that x has a causal effect on y? a. Yes b. No c. It depends!


Simple linear regression (SLR)


SLR examples in SAS: behavioral factors associated with mental health.

Data source: 2018 North Carolina Behavioral Risk Factor Surveillance System (BRFSS) study (n=4,526)
BRFSS uses a complex survey design to represent the state/national populations.


DV/Outcome - MUD: number of mentally unhealthy days in the past 30 days (response range: 0~30)
Note that MUD is a count variable, but let’s treat it as continuous until we move to generalized linear regression.


Example #1: behavioral factor - PA

IV/Predictor #1 - PA: binary variable indicating whether the respondent had any physical activity in the past month (1 = yes, 0 = no, matching the estimate statements below)

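proc glm data=temp1;
    title "SLR: DV=MUD; IV=PA";
    /* solution prints the parameter estimates; clparm adds their 95% CIs */
    model MUD = PA / solution clparm;
    /* predicted mean MUD for each group: intercept + slope * PA */
    estimate "MUDs - no PA" intercept 1 PA 0;
    estimate "MUDs - PA" intercept 1 PA 1;
run; quit;

proc reg data=temp1;
    title "SLR: DV=MUD; IV=PA";
    model MUD = PA / clb; /* clb prints 95% CIs for the parameter estimates */
run; quit;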


Key outputs:

Overall model fit: R2 & F statistic.

My interpretations:

  • R2=0.0117: 1.17% of the variance in MUD (DV) was explained by this model.
  • F value=53.68, p<0.0001: the variance in MUD explained by the specified model was significantly greater than zero.
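
For reference, both fit statistics come from the same sums of squares listed in Q2. A brief sketch in standard SLR notation, where n is the number of observations used in the model (n is notation added here, not taken from the SAS output):

```latex
\begin{aligned}
SST &= SSM + SSE \\
R^2 &= \frac{SSM}{SST} = 1 - \frac{SSE}{SST} \\
F &= \frac{SSM / 1}{SSE / (n - 2)}
\end{aligned}
```

With R2 = 0.0117, the unexplained share SSE/SST is 1 - 0.0117 = 0.9883.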


Parameter estimates: intercept & slope.

My interpretations:

  • Intercept: 5.47, SE=0.25, 95% CI: 4.99, 5.95, p-value<0.0001
    • The estimated average MUD for individuals who did not participate in PA.
  • Slope: -2.07, SE=0.28, 95% CI: -2.62, -1.51, p-value<0.0001
    • Compared with individuals who did not participate in PA, those who participated in PA had, on average, 2.07 fewer mentally unhealthy days in the past 30 days, a statistically significant difference (i.e., the group difference in MUD: PA vs. non-PA).
  • Estimate statements: predicted group means
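
The two estimate statements simply plug PA = 0 and PA = 1 into the fitted line. A worked version using the rounded estimates reported above (the 1.96 multiplier is the usual large-sample approximation, so the limits can differ from the SAS output in the last digit):

```latex
\begin{aligned}
\widehat{MUD}_{\text{no PA}} &= 5.47 + (-2.07)(0) = 5.47 \\
\widehat{MUD}_{\text{PA}} &= 5.47 + (-2.07)(1) = 3.40 \\
95\%\ \text{CI for the slope} &\approx -2.07 \pm 1.96 \times 0.28 = (-2.62,\ -1.52)
\end{aligned}
```

The interval excludes zero, which agrees with the small p-value for the slope.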


Example #2: behavioral factor - sleep

IV/Predictor #2 - sleep: average daily sleep time in hours

proc glm data=temp1;
    title "SLR: DV=MUD; IV=sleep";
    model MUD = sleep / solution clparm;
    /* predicted mean MUD at 8 and at 6 hours of sleep */
    estimate "MUDs - sleep 8 hrs" intercept 1 sleep 8;
    estimate "MUDs - sleep 6 hrs" intercept 1 sleep 6;
run; quit;


Key outputs:

  • How is the general model fit?
  • Interpretation of the intercept?
  • Interpretation of the slope?


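Note that an average daily sleep time of zero hours does not make sense, so the intercept (the predicted MUD at sleep = 0) is not meaningful. We can center sleep at a realistic value, such as the sample mean, to make the intercept more interpretable.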


/* check the sample mean of sleep before centering */
proc means data=temp1;
    var sleep;
run;

/* center sleep at 7 hrs so that sleep_c = 0 corresponds to 7 hrs of sleep */
data temp2;
    set temp1;
    sleep_c = sleep - 7;
run;

proc glm data=temp2;
    title "SLR: DV=MUD; IV=sleep_c";
    model MUD = sleep_c / solution clparm;
    estimate "MUDs - sleep 7 hrs (sample mean sleep time)" intercept 1 sleep_c 0;
    estimate "MUDs - sleep 8 hrs" intercept 1 sleep_c 1;
    estimate "MUDs - sleep 6 hrs" intercept 1 sleep_c -1;
run; quit;


How would you interpret the intercept now?
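
One way to check your answer is to rewrite the uncentered line in terms of sleep_c. A short derivation, writing a and b for the intercept and slope of the uncentered model (the same symbols as in y=a+b*x earlier):

```latex
\begin{aligned}
\widehat{MUD} &= a + b \cdot \text{sleep} \\
&= a + b\,(\text{sleep} - 7) + 7b \\
&= (a + 7b) + b \cdot \text{sleep\_c}
\end{aligned}
```

The slope b is unchanged. The new intercept, a + 7b, is the predicted MUD at 7 hours of sleep, so it should equal the first estimate statement in the centered model.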



Regression to the mean (RTM): will extreme observations “move” toward the average on future measurements?


Sir Francis Galton: “It appeared from these experiments that the offspring did not tend to resemble their parents in size, but always to be more mediocre than they – to be smaller than the parents, if the parents were large; to be larger than the parents, if the parents were small.”


Example in textbook: Will daughters be more average than their mothers?


daughter_height = 63.9 + 0.54*mother_height_mc + error (mc: mean centered)

  • The slope is less than 1: for a mother who is 10 inches taller than average, her daughter is predicted to be 5.4 inches taller than average. Applying the same equation again, that daughter’s own daughter is predicted to be only about 2.9 inches taller than average.
  • For shorter-than-average mothers, the daughters’ predicted heights are likewise pulled toward the mean.
  • So will daughters, as a group, be more average than their mothers? No, because actual height is not the same thing as predicted height. Every prediction comes with an error, which adds variation and keeps the total variation in height roughly constant across generations.
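
The arithmetic behind the first bullet, followed by a sketch of why the spread of heights does not shrink. The second part rests on two assumptions that are not stated in the textbook example as quoted here: the error is uncorrelated with the mother's height (as in OLS), and heights have the same variance, written σ², in both generations.

```latex
\begin{aligned}
0.54 \times 10 &= 5.4, \qquad 0.54 \times 5.4 \approx 2.9 \\
\operatorname{Var}(\text{daughter}) &= 0.54^2\,\operatorname{Var}(\text{mother}) + \operatorname{Var}(\text{error}) \\
\sigma^2 &\approx 0.29\,\sigma^2 + \operatorname{Var}(\text{error})
\;\Rightarrow\; \operatorname{Var}(\text{error}) \approx 0.71\,\sigma^2
\end{aligned}
```

Under these assumptions, the predicted heights vary much less than actual heights, and the error term supplies the remaining variation.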


Imperfect correlation and chance

  • Regression to the mean (RTM) is a statistical phenomenon that is common in repeated measures. It is not a natural law or a rule.

  • RTM arises when the correlation between X and Y is not equal to 1 (i.e., these two variables are not perfectly correlated).

  • In other words, many factors other than mothers’ heights could explain daughters’ heights. Daughters’ heights are not fully determined by mothers’ heights.

  • RTM is also a reminder not to jump to causal explanations of the phenomena we observe.
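
The points above can be seen in a small simulation. This is a minimal sketch with made-up data, not the textbook example: the dataset name rtm_sim, the variable names, the seed, and the cutoff of 2 are all hypothetical. Each person has a stable trait, and each measurement adds independent chance variation, so the two measurements correlate at about 0.5 by construction.

```sas
/* Simulated repeated measures: same stable trait, new chance error each time */
data rtm_sim;
    call streaminit(2024);                    /* fixed seed for reproducibility */
    do id = 1 to 5000;
        trait = rand("normal", 0, 1);         /* stable underlying level */
        time1 = trait + rand("normal", 0, 1); /* first measurement */
        time2 = trait + rand("normal", 0, 1); /* second measurement, fresh error */
        extreme = (time1 > 2);                /* 1 = extreme high score at time 1 */
        output;
    end;
run;

/* correlation between the two measurements (imperfect, well below 1) */
proc corr data=rtm_sim;
    var time1 time2;
run;

/* compare mean scores at time 1 and time 2 for extreme vs. other cases */
proc means data=rtm_sim mean;
    class extreme;
    var time1 time2;
run;
```

In the extreme group, the time 2 mean should fall roughly halfway back toward the overall mean of 0, even though nothing happened to these cases between the two measurements.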


Variation and causality: What is the true effect of an intervention?

  • Sometimes, RTM can make natural variation in repeated measures look like real change. This is a threat to validity that we want to address.

  • Thus, we often want to have a control/comparison group OR to adjust for (i.e., control for, condition on) the baseline measure when we evaluate the effect of an intervention. I can talk about analyzing two-time point data if that is relevant to your research interests.
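
As a rough illustration of adjusting for the baseline measure, here is a sketch in the same proc glm style as the examples above. The dataset twotime and the variables pre, post, and group are hypothetical names used only for illustration; they are not part of the BRFSS data.

```sas
/* Hypothetical two-time-point data: twotime, with variables
   pre   (baseline score),
   post  (follow-up score),
   group (1 = intervention, 0 = control).
   These names are illustrative and not from the BRFSS example. */
proc glm data=twotime;
    title "Follow-up score adjusted for baseline";
    /* the group coefficient is the difference in post between groups,
       holding the baseline score constant */
    model post = group pre / solution clparm;
run; quit;
```

Because both groups are subject to RTM, comparing them at the same baseline level helps keep regression to the mean from being mistaken for an intervention effect.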


More explanations if you are still confused: Misunderstanding Regression to the Mean
What is Regression to the Mean? Misunderstanding Statistics


A fun fact about Galton: Galton introduced the term “regression” into statistics, but by “regression” he actually meant “regression to/toward the mean.” Galton did not run regression models.

Next week: Linear regression model with multiple predictors

  • Updates:
    • I uploaded a formula sheet for Class 1 and Exercise 1.1. I think it includes the relevant information that you need for Problem Set 1.1.
  • Reminders:
    • Exercise 1.3 (Due next Tuesday)
    • Exercise 1.4 DAG exercise (post and comment prior to next Friday’s recitation)
    • Check out the updated schedule for week 4