Q1: What does OLS in regression stand for?
Q2: Among the following items, which ones do we want to minimize to get a “best fitting” line?
Q3: Which number is a possible value of R squared in a linear regression model?
Q4: Which intervals/bands tell us the range of predicted values of y (y_hat) for a given x?
Q5: Typically, which band is narrower?
Suppose SAS tells you that the estimate for b in an SLR (y=a+b*x) is 0.82 (SE=0.54).
Q6: Do you think there is a positive association between x and y (Hint: use the 95% CI)? a. Yes b. No
Q7: Suppose this effect is significant. Does this result indicate that x has a causal effect on y? a. Yes b. No c. It depends!
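The hint in Q6 can be worked through with a short DATA step. This is only a sketch: it assumes a large-sample normal critical value (about 1.96), whereas SAS would use a t critical value with n-2 degrees of freedom, and n is not given here. The dataset name ci_check and the variable names are placeholders.

```sas
/* Approximate 95% CI for the slope: estimate +/- critical value * SE */
data ci_check;                      /* hypothetical dataset name */
  b  = 0.82;                        /* slope estimate from the question */
  se = 0.54;                        /* its standard error */
  z  = quantile('NORMAL', 0.975);   /* assumed large-sample critical value, about 1.96 */
  lower = b - z*se;
  upper = b + z*se;
run;

proc print data=ci_check;
  var b se lower upper;
run;
```

If the printed interval covers 0, the data are compatible with no association at the 5% level, even though the point estimate is positive.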
Data source: 2018 North Carolina Behavioral Risk Factor Surveillance System (BRFSS) study (n=4,526)
BRFSS uses a complex survey design to represent the state/national populations.
DV/Outcome - MUD: number of mentally unhealthy days in the past 30 days (response range: 0-30)
Note that MUD is a count variable, but let’s treat it as continuous until we move on to generalized linear regression.
IV/Predictor #1 - PA: binary variable indicating whether the respondent had any physical activity in the past month (coded 1 = yes, 0 = no in the ESTIMATE statements below)
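proc glm data=temp1;
  title "SLR: DV=MUD; IV=PA";
  model MUD = PA / solution clparm; /* solution prints the estimates; clparm adds 95% CIs */
  estimate "MUDs - no PA" intercept 1 PA 0; /* predicted mean MUD when PA=0 (intercept) */
  estimate "MUDs - PA" intercept 1 PA 1; /* predicted mean MUD when PA=1 (intercept + slope) */
run; quit;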
proc reg data=temp1;
  title "SLR: DV=MUD; IV=PA";
  model MUD = PA / clb; /* clb prints 95% confidence limits for the estimates */
run; quit;
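Because PA is binary, the slope in this SLR is the difference in mean MUD between the two PA groups, and the intercept is the mean MUD for PA=0. One way to check this, sketched here on the same temp1 dataset and without survey weights (as in the models above), is to compare the group means directly:

```sas
/* Group means of MUD by PA: the PA=0 mean should match the intercept,
   and the PA=1 mean minus the PA=0 mean should match the slope for PA */
proc means data=temp1 n mean std;
  class PA;
  var MUD;
run;

/* The pooled (equal-variance) t test for the group difference has the same
   magnitude and p-value as the t test for the PA slope in the SLR */
proc ttest data=temp1;
  class PA;
  var MUD;
run;
```

PROC TTEST reports the difference as the first group minus the second (here PA=0 minus PA=1), so its sign is the opposite of the regression slope.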
My interpretations:
IV/Predictor #2 - sleep: daily sleep time in hours
proc glm data=temp1;
  title "SLR: DV=MUD; IV=sleep";
  model MUD = sleep / solution clparm;
  estimate "MUDs - sleep 8 hrs" intercept 1 sleep 8; /* predicted MUD at 8 hours: a + 8*b */
  estimate "MUDs - sleep 6 hrs" intercept 1 sleep 6; /* predicted MUD at 6 hours: a + 6*b */
run; quit;
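/* Check the sample mean of sleep to choose a centering value */
proc means data=temp1;
  var sleep;
run;
/* Center sleep at 7 hours so the intercept is the predicted MUD at 7 hours of sleep */
data temp2;
  set temp1;
  sleep_c = sleep - 7;
run;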
proc glm data=temp2;
  title "SLR: DV=MUD; IV=sleep_c";
  model MUD = sleep_c / solution clparm;
  estimate "MUDs - sleep 7 hrs (sample mean sleep time)" intercept 1 sleep_c 0;
  estimate "MUDs - sleep 8 hrs" intercept 1 sleep_c 1;
  estimate "MUDs - sleep 6 hrs" intercept 1 sleep_c -1;
run; quit;
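Centering at 7 uses a rounded value of the sample mean. Centering changes only the intercept (the centered intercept equals the original a + 7*b); the slope and its SE stay the same. If you would rather center at the exact sample mean, one possible sketch is below. The macro variable sleep_mean, the dataset temp2b, and the variable sleep_mc are hypothetical names, and the mean is taken over all non-missing sleep values in temp1, which may differ slightly from the analysis sample if MUD has missing values.

```sas
/* Store the sample mean of sleep in a macro variable (hypothetical name) */
proc sql noprint;
  select mean(sleep) into :sleep_mean from temp1;
quit;

/* Mean-center sleep using the stored value */
data temp2b;
  set temp1;
  sleep_mc = sleep - &sleep_mean;
run;

/* Same SLR as before; the intercept is now the predicted MUD at the mean sleep time */
proc glm data=temp2b;
  title "SLR: DV=MUD; IV=sleep_mc";
  model MUD = sleep_mc / solution clparm;
run; quit;
```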
Sir Francis Galton: “It appeared from these experiments that the offspring did not tend to resemble their parents in size, but always to be more mediocre than they – to be smaller than the parents, if the parents were large; to be larger than the parents, if the parents were small.”
daughter_height = 63.9 + 0.54*mother_height_mc + error (mc: mean centered)
Regression to the mean (RTM) is a statistical phenomenon that is common in repeated measures. It is not a natural law or a rule.
RTM arises when the correlation between X and Y is less than 1 in absolute value (i.e., these two variables are not perfectly correlated).
In other words, many other factors could explain daughters’ heights. Daughters’ heights are not fully determined by mothers’ heights.
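The link between RTM and the correlation can be made explicit with the standard OLS identities for an SLR. In the sketch below, r is the sample correlation and s_x, s_y are the sample standard deviations; these symbols are introduced here for illustration and are not in the original notes. The last line plugs a mother who is 2 units above the mothers' mean into the height equation above (the units are not stated in the notes).

```latex
\hat{y} = \bar{y} + b\,(x - \bar{x}), \qquad b = r\,\frac{s_y}{s_x}
\quad\Longrightarrow\quad
\frac{\hat{y} - \bar{y}}{s_y} = r\,\frac{x - \bar{x}}{s_x}

% Worked example with the fitted equation above (mother_height_mc = 2):
\widehat{\text{daughter\_height}} = 63.9 + 0.54 \times 2 = 64.98
```

So whenever |r| < 1, the predicted y is fewer standard deviations from its mean than x is from its own. In the example, a mother 2 units above average has a predicted daughter only 1.08 units above the daughters' average of 63.9.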
RTM is also a reminder for us not to jump to causal explanations of phenomena we observe.
Sometimes RTM can make natural variation in repeated measures look like real change, which is a threat to validity that we want to address.
Thus, we often want to have a control/comparison group OR to adjust for (i.e., control for or condition on) the baseline measure when we evaluate the effect of an intervention. I can talk about analyzing two-time point data if that is relevant to your research interests.
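To see how RTM can mimic a real change, here is a small simulation sketch. Everything in it is made up for illustration (the dataset rtm_sim, the variable names, the seed, the scale, and the cutoff of 14): each person has a stable true score, is measured twice with independent noise, and receives no intervention at all.

```sas
/* Simulate two noisy measurements per person with NO intervention and no true change */
data rtm_sim;
  call streaminit(2018);                        /* arbitrary seed for reproducibility */
  do id = 1 to 5000;
    true_score = rand('NORMAL', 10, 3);         /* stable person-level value */
    time1 = true_score + rand('NORMAL', 0, 3);  /* measurement 1 = truth + noise */
    time2 = true_score + rand('NORMAL', 0, 3);  /* measurement 2 = truth + new noise */
    high_t1 = (time1 > 14);                     /* 1 = selected for a high baseline */
    change = time2 - time1;
    output;
  end;
run;

/* Compare mean time1, time2, and change for the selected vs. unselected groups */
proc means data=rtm_sim n mean;
  class high_t1;
  var time1 time2 change;
run;
```

In the group selected for a high baseline (high_t1=1), the time2 mean should fall back toward the overall mean even though nothing was done to anyone. Without a comparison group or an adjustment for baseline, that drop could be mistaken for an intervention effect.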
More explanations if you are still confused: Misunderstanding Regression to the Mean
What is Regression to the Mean? Misunderstanding Statistics
A fun fact about Galton: he introduced the term “regression” into statistics, but by “regression” he actually meant “regression to/toward the mean.” Galton did not run regression models.