Lurking Variables

Harold Nelson

2026-04-21

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(moderndive)

Create Two Independent Series

Create x and y as vectors of length 100 by sampling from normal distributions with different means and standard deviations.

Note that x and y are totally independent. Run a linear model to verify that there is no significant relationship between them.

Solution

x = rnorm(100,mean = 5, sd = 1)

y = rnorm(100,mean = 15, sd = 1)

xy_mod = lm(y~x)
summary(xy_mod)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6461 -0.6705  0.0209  0.6469  3.1868 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.7282     0.5953   26.42   <2e-16 ***
## x            -0.1394     0.1124   -1.24    0.218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 98 degrees of freedom
## Multiple R-squared:  0.01545,    Adjusted R-squared:  0.005399 
## F-statistic: 1.537 on 1 and 98 DF,  p-value: 0.218

Partial Sums

Create sumx and sumy as the partial sums of x and y. Also create time with the integers 1 to 100.

sumx = rep(0,100)
sumx[1] = x[1]
for(i in 2:100){
  sumx[i] = sumx[i - 1] + x[i]
}

sumy = rep(0,100)
sumy[1] = y[1]
for(i in 2:100){
  sumy[i] = sumy[i - 1] + y[i]
}

time = 1:100
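The loops above spell out what a partial sum is. As a minimal sketch, assuming freshly simulated stand-ins for x and y (the seed and the name sumx_loop are illustrative, not from the original), base R's cumsum() gives the same result in one call:

```r
set.seed(123)  # arbitrary seed, used only to make this sketch reproducible

# Stand-ins for the x and y created earlier
x = rnorm(100, mean = 5, sd = 1)
y = rnorm(100, mean = 15, sd = 1)

# Partial sums with an explicit loop, as in the text
sumx_loop = rep(0, 100)
sumx_loop[1] = x[1]
for (i in 2:100) {
  sumx_loop[i] = sumx_loop[i - 1] + x[i]
}

# The same partial sums with cumsum()
sumx = cumsum(x)
sumy = cumsum(y)
time = seq_along(x)

# Check that the loop and cumsum() agree
all.equal(sumx, sumx_loop)
```

Either version can be used for the models that follow.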

Dependence??

Run a linear model to see if sumx and sumy are correlated.

Solution

sums_mod = lm(sumy~sumx)
summary(sums_mod)
## 
## Call:
## lm(formula = sumy ~ sumx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3556  -6.6631  -0.6702   6.5908  17.0703 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.986356   1.632932   1.216    0.227    
## sumx        2.895652   0.005408 535.404   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.105 on 98 degrees of freedom
## Multiple R-squared:  0.9997, Adjusted R-squared:  0.9997 
## F-statistic: 2.867e+05 on 1 and 98 DF,  p-value: < 2.2e-16

Time Dependence

Create models to show that sumx and sumy both depend on time.

Solution

time_x = lm(sumx~time)
summary(time_x)
## 
## Call:
## lm(formula = sumx ~ time)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.810 -1.710  0.113  1.754  4.122 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.053574   0.442933  -0.121    0.904    
## time         5.191303   0.007615 681.744   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.198 on 98 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 4.648e+05 on 1 and 98 DF,  p-value: < 2.2e-16
time_y = lm(sumy~time)
summary(time_y)
## 
## Call:
## lm(formula = sumy ~ time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6011 -2.1125 -0.2492  2.0871  5.1409 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  1.633951   0.507325    3.221  0.00174 ** 
## time        15.036111   0.008722 1723.980  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.518 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.972e+06 on 1 and 98 DF,  p-value: < 2.2e-16

Moral of the Story

When two variables, A and B, are both highly correlated with a third variable, C, A and B will also be highly correlated with each other statistically, even when neither has any direct effect on the other.

This is a real problem in economics because many economic time series are highly correlated with time.
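One common way to check whether a correlation between two time series is only a shared time trend is to compare period-to-period changes instead of levels. The sketch below is an illustration under assumptions, not part of the original exercise: it rebuilds the series with an arbitrary seed, uses cumsum() in place of the loops, and introduces the names dx, dy, cor_levels and cor_changes.

```r
set.seed(2026)  # arbitrary seed, chosen only so the sketch is reproducible

# Independent series and their partial sums, built as in the earlier sections
x = rnorm(100, mean = 5, sd = 1)
y = rnorm(100, mean = 15, sd = 1)
sumx = cumsum(x)
sumy = cumsum(y)

# Levels: both series climb with time, so they look strongly related
cor_levels = cor(sumx, sumy)

# Changes: diff() undoes the accumulation and recovers x[2:100] and y[2:100]
dx = diff(sumx)
dy = diff(sumy)
cor_changes = cor(dx, dy)

c(levels = cor_levels, changes = cor_changes)

# Regression on the changes, comparable to the first model of y on x
summary(lm(dy ~ dx))
```

Because the changes are just the original independent draws, the apparent relationship between sumx and sumy should largely disappear once the accumulation over time is removed.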

Randomized Controlled Trials (RCT)

This is a type of study designed to eliminate the possibility of a lurking variable.

Here’s how it would be used to determine the efficacy of an experimental drug.

  1. A group of subjects is selected at random from the target population.

  2. The group is divided into two subgroups called the treatment group and the control group using some random selection device, like flipping a coin.

  3. The subjects in the treatment group are given the real drug. The subjects in the control group are given a placebo with no real drug content. No subject is aware of their status as treatment or control.

  4. The difference in outcome between the treatment group and the control group measures the efficacy of the drug.
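The steps above can be sketched as a small simulation. Everything here is assumed for illustration: the sample size, a hypothetical lurking variable called health, a true drug effect of 5 points, and the noise levels. Step 1 (sampling from the target population) is represented only by drawing health at random.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n = 2000

# Hypothetical lurking variable: baseline health of each subject
health = rnorm(n, mean = 50, sd = 10)

# Step 2: a coin flip assigns each subject to treatment or control
group = sample(c("treatment", "control"), size = n, replace = TRUE)

# Step 3: assumed true drug effect of 5 points; outcome also depends on health
true_effect = 5
outcome = health + true_effect * (group == "treatment") + rnorm(n, sd = 2)

# Random assignment balances health across the two groups on average
tapply(health, group, mean)

# Step 4: the difference in mean outcome estimates the drug effect
est = mean(outcome[group == "treatment"]) - mean(outcome[group == "control"])
est
```

Because the coin flip ignores health, the two groups should have similar average health, and the difference in outcomes should land near the assumed effect rather than reflecting the lurking variable.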

Observational Studies

These are studies without the benefit of random assignment to treatment and control groups. The danger of lurking variables is very high.

Example 1

A study based on birth records and the exposure of the mother during pregnancy to air pollution reveals that air pollution is associated with premature birth. The study concludes that air pollution makes premature birth more likely.

Identify a likely lurking variable.

Solution

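Low income is a likely lurking variable. It leads people to live in less desirable areas, where air pollution tends to be worse. It also reduces access to medical care, such as prenatal visits, which could itself make premature birth more likely.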
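A simulation can show how a lurking variable of this kind produces an association by itself. The sketch below uses made-up data and assumed coefficients: income drives both pollution exposure and gestational age, and pollution is deliberately given no direct effect. It illustrates the mechanism only and says nothing about the real effect of air pollution.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

n = 5000

# Hypothetical standardized income (the lurking variable)
income = rnorm(n)

# Assumed: lower income means more pollution exposure
pollution = -0.8 * income + rnorm(n)

# Assumed: gestational age in weeks depends on income only, not on pollution
gest_weeks = 39 + 0.5 * income + rnorm(n, sd = 1)

# Naive model: pollution appears to shorten pregnancy
naive = lm(gest_weeks ~ pollution)

# Adjusted model: with income included, the pollution coefficient shrinks toward zero
adjusted = lm(gest_weeks ~ pollution + income)

naive_slope = coef(naive)["pollution"]
adjusted_slope = coef(adjusted)["pollution"]
c(naive = unname(naive_slope), adjusted = unname(adjusted_slope))
```

In an actual observational study the lurking variable is often unmeasured, so this kind of adjustment is not always possible.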

Example 2

A survey asks mothers two questions:

  1. Did you take Tylenol while you were pregnant?

  2. Does your child have autism?

The study concludes that Tylenol causes autism.

Identify a potential lurking variable.

Solution

Why does a person take Tylenol?

The condition that led the mother to take Tylenol, rather than the Tylenol itself, could be the underlying cause of autism.