Confounding variable

Suppose x1 is the money I earn, which increases day by day; x2 is the money I spend, which rises slowly with my earnings and fluctuates with my mood; and y is my happiness, defined as the money earned minus the money spent, plus some random noise.

library(datasets)
library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(tidyverse)

## ── Attaching packages ────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ✓ purrr   0.3.4

## ── Conflicts ───────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

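n <- 100
x1 <- 1:n                            # earnings: increase steadily
x2 <- .01 * x1 + runif(n, -.1, .1)   # spending: tracks earnings, plus mood noise
y <- x1 - x2 + rnorm(n, sd = .01)    # happiness: earned minus spent, plus noise
summary(lm(y ~ x2))$coef             # regress y on x2 only (x1 left out)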

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  3.441732   1.112777  3.092921 2.582086e-03
## x2          92.632806   1.897388 48.821229 1.389191e-70

If only x2 is included in the model, x2 and y appear to be positively correlated, and the regression coefficient is very large (about 92.6). Something is clearly wrong, because by construction the partial regression coefficient of x2 should be -1. Let's plot the data and take a look.
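Before turning to the plot, the size of this coefficient can be roughly accounted for from the simulation settings alone. The following is a back-of-the-envelope sketch, not an exact result: it writes the mood noise in x2 as u (a symbol introduced here), treats u and the noise in y as uncorrelated with x1, and uses the sample variance of 1, ..., 100 for x1.

$$
\begin{aligned}
\tilde\beta_{2} &= \frac{\operatorname{Cov}(y, x_2)}{\operatorname{Var}(x_2)}
 = \frac{\operatorname{Cov}(x_1 - x_2 + \varepsilon,\; x_2)}{\operatorname{Var}(x_2)}
 \approx \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_2)} - 1, \\
x_2 &= 0.01\,x_1 + u
 \;\Rightarrow\;
 \operatorname{Cov}(x_1, x_2) \approx 0.01\,\operatorname{Var}(x_1), \qquad
 \operatorname{Var}(x_2) \approx 0.0001\,\operatorname{Var}(x_1) + \operatorname{Var}(u), \\
\operatorname{Var}(x_1) &\approx 841.7, \qquad
 \operatorname{Var}(u) = \frac{(0.2)^2}{12} \approx 0.0033, \\
\tilde\beta_{2} &\approx \frac{8.417}{0.0875} - 1 \approx 95 .
\end{aligned}
$$

So a slope in the nineties is what this setup should produce when x1 is left out: the true effect of -1 is swamped by the term that comes from x1 and x2 moving together. The estimate from any single simulated sample will differ somewhat from 95 because of sampling noise.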

dat <- data.frame(y = y, x1 = x1, x2 = x2)
ggplot(dat, aes(y = y, x = x2, colour = x1)) +
  geom_smooth(method = lm, se = FALSE, colour = "black") +
  geom_point(size = 4)

## `geom_smooth()` using formula 'y ~ x'

The x-axis is x2 and the y-axis is y, and the two do appear to be positively correlated. The color of each point represents the value of x1: the larger x1 is, the lighter the color. As x2 increases, x1 also increases, so could x1 be what is driving the upward trend? When we include both x1 and x2 in the model, we find:

summary(lm(y ~ x1 + x2))$coef

##                 Estimate   Std. Error      t value      Pr(>|t|)
## (Intercept) -0.001093745 0.0018039289   -0.6063128  5.457234e-01
## x1           0.999910432 0.0001562869 6397.9148255 1.202499e-274
## x2          -0.989286381 0.0149248177  -66.2846541  1.359839e-82

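Now the results look right. Because of the random noise, the coefficient of x1 is not exactly 1 and the coefficient of x2 is not exactly -1, but both are very close. Following the proof at the beginning, we can also show this in a plot using residuals: regress y on x1 and x2 on x1, then plot the two sets of residuals against each other. This removes the influence of x1 from both y and x2.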

# ey and ex2: what is left of y and x2 after regressing each on x1
dat2 <- data.frame(y = y, x1 = x1, x2 = x2,
                   ey = resid(lm(y ~ x1)), ex2 = resid(lm(x2 ~ x1)))
ggplot(dat2, aes(y = ey, x = ex2, colour = x1)) +
  geom_smooth(method = lm, se = FALSE, colour = "black") +
  geom_point(size = 4)

## `geom_smooth()` using formula 'y ~ x'

The two are negatively correlated, and the partial regression coefficient is approximately -1. So in this simulation happiness comes from earning money (x1); spending money (x2) does not bring happiness but reduces it. The happiness that appeared to come from spending is actually due to earning: when I spend more, I have usually also earned more, and the happiness gained from earning outweighs the happiness lost by spending. Accounting for x1 in this way is called adjustment.
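The residual-based adjustment can also be checked numerically. The sketch below re-simulates data with the same structure as above and compares the coefficient of x2 from the full model with the slope from regressing the residuals of y on the residuals of x2. The seed and the names `b_full` and `b_resid` are choices made for this illustration, so the numbers will not match the output shown earlier.

```r
# Re-simulate data with the same structure as the example
# (the seed is an arbitrary choice, used only for reproducibility).
set.seed(1)
n  <- 100
x1 <- 1:n
x2 <- .01 * x1 + runif(n, -.1, .1)
y  <- x1 - x2 + rnorm(n, sd = .01)

# Coefficient of x2 when x1 is included in the model
b_full <- coef(lm(y ~ x1 + x2))[["x2"]]

# Two-step version: remove x1 from both y and x2, then regress
ey  <- resid(lm(y ~ x1))    # part of y not explained by x1
ex2 <- resid(lm(x2 ~ x1))   # part of x2 not explained by x1
b_resid <- coef(lm(ey ~ ex2))[["ex2"]]

# Print both estimates side by side and compare them
print(c(full = b_full, residual = b_resid))
print(all.equal(b_full, b_resid))
```

The two estimates should agree up to floating-point error, which is why the slope of the black line in the residual plot can be read as the adjusted effect of x2.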

Yue Chen

10/11/2020