Lurking Variables

Harold Nelson

2025-12-02

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(moderndive)

Create Two Independent Series

Create x and y as vectors of length 100 by sampling from normal distributions with different means and standard deviations.

Note that x and y are independent by construction. Run a linear model to verify that there is no significant relationship between them.

Solution

x = rnorm(100,mean = 5, sd = 1)

y = rnorm(100,mean = 15, sd = 1)

xy_mod = lm(y~x)
summary(xy_mod)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3246 -0.6310  0.0063  0.6159  2.8186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.73946    0.50295   31.29   <2e-16 ***
## x           -0.13018    0.09941   -1.31    0.193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.028 on 98 degrees of freedom
## Multiple R-squared:  0.0172, Adjusted R-squared:  0.007169 
## F-statistic: 1.715 on 1 and 98 DF,  p-value: 0.1934

Partial Sums

Create sumx and sumy as the partial sums of x and y. Also create time as the integers 1 to 100.

sumx = rep(0,100)
sumx[1] = x[1]
for(i in 2:100){
  sumx[i] = sumx[i - 1] + x[i]
}

sumy = rep(0,100)
sumy[1] = y[1]
for(i in 2:100){
  sumy[i] = sumy[i - 1] + y[i]
}

time = 1:100
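
The loops above build running totals one element at a time. Base R's cumsum() returns the same partial sums in a single call. The sketch below uses a small made-up vector v (a hypothetical name, not part of the exercise) to check that the two approaches agree.

```r
# Hypothetical example vector standing in for x or y
v = c(2, 4, 6, 8)

# Loop version, following the pattern used in the exercise
loop_sum = rep(0, length(v))
loop_sum[1] = v[1]
for(i in 2:length(v)){
  loop_sum[i] = loop_sum[i - 1] + v[i]
}

# Vectorized version using base R
vec_sum = cumsum(v)

# Compare the two results
all.equal(loop_sum, vec_sum)
```

Under this assumption, sumx = cumsum(x) and sumy = cumsum(y) would give the same vectors as the loops.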

Dependence??

Run a linear model to see if sumx and sumy are correlated.

Solution

sums_mod = lm(sumy~sumx)
summary(sums_mod)
## 
## Call:
## lm(formula = sumy ~ sumx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.727  -6.678  -0.396   7.139  26.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.64715    2.19931  -0.749    0.456    
## sumx         3.00862    0.00755 398.494   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.89 on 98 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 1.588e+05 on 1 and 98 DF,  p-value: < 2.2e-16

Time Dependence

Create models to show that sumx and sumy both depend on time.

Solution

time_x = lm(sumx~time)
summary(time_x)
## 
## Call:
## lm(formula = sumx ~ time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1843 -1.9542  0.0365  2.0181  5.8354 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.93235    0.59511   1.567     0.12    
## time         4.99375    0.01023 488.101   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.953 on 98 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 2.382e+05 on 1 and 98 DF,  p-value: < 2.2e-16
time_y = lm(sumy~time)
summary(time_y)
## 
## Call:
## lm(formula = sumy ~ time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1508 -3.2456 -0.4177  3.0869  7.7560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.79513    0.74977    1.06    0.292    
## time        15.03149    0.01289 1166.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.721 on 98 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 1.36e+06 on 1 and 98 DF,  p-value: < 2.2e-16
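
The time slopes reported above (about 4.99 and 15.03) are close to the means of x and y, because each partial sum grows by roughly its mean at every step. Their ratio, 15.03 / 4.99, is about 3.01, which is close to the sumx coefficient in the earlier model of sumy on sumx. The sketch below repeats the calculation on freshly simulated data. The seed and the names ending in _sim are assumptions made for illustration, so the numbers will not match the output above exactly.

```r
set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible

n = 100
x_sim = rnorm(n, mean = 5, sd = 1)
y_sim = rnorm(n, mean = 15, sd = 1)
t_sim = 1:n

# Partial sums of the two independent series
sumx_sim = cumsum(x_sim)
sumy_sim = cumsum(y_sim)

# Slope of each partial sum on time, and of sumy on sumx
slope_x = coef(lm(sumx_sim ~ t_sim))[["t_sim"]]
slope_y = coef(lm(sumy_sim ~ t_sim))[["t_sim"]]
slope_xy = coef(lm(sumy_sim ~ sumx_sim))[["sumx_sim"]]

# The ratio of the time slopes should be close to slope_xy
c(slope_x = slope_x, slope_y = slope_y,
  ratio = slope_y / slope_x, slope_xy = slope_xy)
```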

Moral of the Story

When two variables, A and B, are both highly correlated with a third variable, C, A and B will also be highly correlated with each other, even if there is no direct relationship between them. Here C is time, and it acts as the lurking variable.

This is a real problem in economics because many economic time series are highly correlated with time.
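
One common way to check whether a shared trend is driving a correlation is to look at the changes from one period to the next instead of the levels. For partial sums, the first differences are just the original values, so differencing recovers the independent series. The sketch below is an illustration under assumed names (the _sim vectors, dx, dy) and an arbitrary seed, not part of the original exercise.

```r
set.seed(2)  # arbitrary seed, chosen only so the sketch is reproducible

x_sim = rnorm(100, mean = 5, sd = 1)
y_sim = rnorm(100, mean = 15, sd = 1)

# Levels: partial sums that both trend upward with time
sumx_sim = cumsum(x_sim)
sumy_sim = cumsum(y_sim)

# Changes: first differences of the partial sums
dx = diff(sumx_sim)
dy = diff(sumy_sim)

# The differences match the original values (from the second element on),
# up to floating-point tolerance
all.equal(dx, x_sim[-1])

# Correlation of the levels versus correlation of the changes
cor(sumx_sim, sumy_sim)
cor(dx, dy)
```

The correlation of the levels should be very close to 1, while the correlation of the changes should be near 0, since the changes are independent draws.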