Harold Nelson
2026-04-21
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Create x and y as vectors of length 100 by sampling from normal distributions with different means and standard deviations.
Note that x and y are independent by construction. Run a linear model to verify that x has no significant effect on y.
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6461 -0.6705 0.0209 0.6469 3.1868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.7282 0.5953 26.42 <2e-16 ***
## x -0.1394 0.1124 -1.24 0.218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.002 on 98 degrees of freedom
## Multiple R-squared: 0.01545, Adjusted R-squared: 0.005399
## F-statistic: 1.537 on 1 and 98 DF, p-value: 0.218
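The code that produced this summary is not shown. A minimal sketch in base R follows. The seed, the means (5 and 15), and the standard deviations (0.9 and 1) are guesses, not the author's values, so the numbers it prints will differ from the summary above.

```r
# Assumed parameters: the seed, means, and standard deviations are illustrative guesses.
set.seed(123)
x <- rnorm(100, mean = 5, sd = 0.9)
y <- rnorm(100, mean = 15, sd = 1)

# Regress y on x. Because the two samples are drawn independently,
# the slope on x is usually not statistically significant.
mod_xy <- lm(y ~ x)
summary(mod_xy)
```

With independent draws, a "significant" slope should still appear in roughly 5% of runs at the 0.05 level, so a single run is a check, not a proof.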
Create sumx and sumy as the cumulative (partial) sums of x and y. Also create time as the integers 1 to 100.
Run a linear model to see if sumx and sumy are correlated.
##
## Call:
## lm(formula = sumy ~ sumx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3556 -6.6631 -0.6702 6.5908 17.0703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.986356 1.632932 1.216 0.227
## sumx 2.895652 0.005408 535.404 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.105 on 98 degrees of freedom
## Multiple R-squared: 0.9997, Adjusted R-squared: 0.9997
## F-statistic: 2.867e+05 on 1 and 98 DF, p-value: < 2.2e-16
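A sketch of how this step might be coded in base R, using `cumsum()` for the partial sums. The seed and the distribution parameters are assumptions, so the output will not match the summary above digit for digit.

```r
# Assumed parameters: the seed, means, and standard deviations are illustrative guesses.
set.seed(123)
x <- rnorm(100, mean = 5, sd = 0.9)
y <- rnorm(100, mean = 15, sd = 1)

# Partial sums: element i of sumx is x[1] + ... + x[i].
sumx <- cumsum(x)
sumy <- cumsum(y)
time <- 1:100

# Regress one running total on the other.
mod_sums <- lm(sumy ~ sumx)
summary(mod_sums)
```

Although x and y are independent, both running totals grow steadily, so the fit is nearly perfect and the slope lands near the ratio of the two means.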
Create models to show that sumx and sumy both depend on time.
##
## Call:
## lm(formula = sumx ~ time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.810 -1.710 0.113 1.754 4.122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.053574 0.442933 -0.121 0.904
## time 5.191303 0.007615 681.744 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.198 on 98 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 4.648e+05 on 1 and 98 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = sumy ~ time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6011 -2.1125 -0.2492 2.0871 5.1409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.633951 0.507325 3.221 0.00174 **
## time 15.036111 0.008722 1723.980 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.518 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.972e+06 on 1 and 98 DF, p-value: < 2.2e-16
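The two models above could be fitted along these lines. This is a sketch under assumed parameters (seed, means of 5 and 15, standard deviations of 0.9 and 1), not the author's code.

```r
# Assumed parameters: the seed, means, and standard deviations are illustrative guesses.
set.seed(123)
x <- rnorm(100, mean = 5, sd = 0.9)
y <- rnorm(100, mean = 15, sd = 1)
sumx <- cumsum(x)
sumy <- cumsum(y)
time <- 1:100

# Each running total grows by roughly its mean per time step,
# so the slope on time should be close to that mean.
mod_x_time <- lm(sumx ~ time)
mod_y_time <- lm(sumy ~ time)

coef(mod_x_time)
coef(mod_y_time)
```

The slopes approximate the means of x and y, which is why both running totals track time almost perfectly.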
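When two variables, A and B, are both highly correlated with a third variable C, A and B will be highly correlated with each other statistically, even if neither has any effect on the other. The third variable C is called a lurking variable. In the example above, time plays the role of C.
This is a real problem in economics because many economic time series are highly correlated with time.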
A randomized controlled trial is a type of study designed to eliminate the possibility of a lurking variable.
Here’s how it would be used to determine the efficacy of an experimental drug.
A group of subjects is selected at random from the target population.
The group is divided into two subgroups called the treatment group and the control group using some random selection device, like flipping a coin.
The subjects in the treatment group are given the real drug. The subjects in the control group are given a placebo with no real drug content. No subject is aware of their status as treatment or control.
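The difference in outcome between the treatment and control groups measures the efficacy of the drug. Because assignment was random, any lurking variable is balanced between the two groups on average.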
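The steps above can be sketched as a small simulation in base R. Everything here is hypothetical: the population, the `baseline` health score, the sample size of 200, and the true drug effect of 2 units are made-up values chosen only to illustrate the procedure.

```r
set.seed(42)

# Hypothetical target population. The baseline score stands in for a
# characteristic that could act as a lurking variable without randomization.
population <- data.frame(id = 1:10000,
                         baseline = rnorm(10000, mean = 50, sd = 1))

# Step 1: select subjects at random from the target population.
subjects <- population[sample(nrow(population), 200), ]

# Step 2: assign each subject to treatment (1) or control (0) by a coin flip.
subjects$treated <- rbinom(nrow(subjects), size = 1, prob = 0.5)

# Step 3: simulate outcomes, assuming a true drug effect of 2 units.
true_effect <- 2
subjects$outcome <- subjects$baseline +
  true_effect * subjects$treated +
  rnorm(nrow(subjects))

# Step 4: the difference in mean outcomes estimates the drug's efficacy.
estimate <- mean(subjects$outcome[subjects$treated == 1]) -
  mean(subjects$outcome[subjects$treated == 0])
estimate
```

Because the coin flip ignores `baseline`, the two groups have similar baseline scores on average, and the estimate should fall close to the assumed effect.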
Observational studies lack the benefit of random assignment to treatment and control groups, so the danger of lurking variables is very high.
A study based on birth records and the exposure of the mother during pregnancy to air pollution reveals that air pollution is associated with premature birth. The study concludes that air pollution makes premature birth more likely.
Identify a likely lurking variable.
Low income is a likely candidate. It leads people to live in less desirable areas, where air pollution tends to be worse. It also reduces access to medical care, such as prenatal visits.
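A toy simulation can show how such a lurking variable produces an association on its own. In this sketch, all probabilities are invented, and premature birth is made to depend on income only. It illustrates the mechanism and makes no claim about the real effect of air pollution.

```r
set.seed(7)
n <- 20000

# Hypothetical lurking variable: 1 = low income, 0 = otherwise.
low_income <- rbinom(n, 1, 0.5)

# Assumed probabilities: low income raises the chance of pollution exposure.
pollution <- rbinom(n, 1, ifelse(low_income == 1, 0.7, 0.2))

# Assumed probabilities: premature birth depends on income only,
# not on pollution, in this toy model.
premature <- rbinom(n, 1, ifelse(low_income == 1, 0.20, 0.08))

# Difference in premature-birth rates, exposed minus unexposed,
# within the rows flagged by `keep`.
rate_diff <- function(keep) {
  mean(premature[keep & pollution == 1]) -
    mean(premature[keep & pollution == 0])
}

crude_diff <- rate_diff(rep(TRUE, n))    # ignoring income
diff_low   <- rate_diff(low_income == 1) # low-income mothers only
diff_high  <- rate_diff(low_income == 0) # other mothers only

c(crude = crude_diff, low_income = diff_low, high_income = diff_high)
```

The crude comparison shows more premature births among exposed mothers, while the comparisons within each income group show differences near zero.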
A survey asks mothers two questions.
Did you take Tylenol while you were pregnant?
Does your child have autism?
The study concludes that Tylenol causes autism.
Identify a potential lurking variable.
Why does a person take Tylenol?
Whatever condition prompted the mother to take it, such as a fever or pain, could be the underlying cause of autism.