- Simulate a random sample of 1,000 cases. Construct a variable A drawn from a
normal distribution. Construct a variable X that equals A plus independent
normal noise, and a variable Y that equals A plus independent normal noise, so
that A is the only common cause of X and Y. (A quick check of the correlations
implied by these settings follows the cor() output below.)
set.seed(5673)
n <- 1000
A <- rnorm(n, 110, 31)
X <- A + rnorm(n, 124, 42)
Y <- A + rnorm(n, 150, 50)
cor(A,X)
## [1] 0.5996215
cor(A,Y)
## [1] 0.5317214
cor(X,Y)
## [1] 0.3564466
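As a sanity check, the sample correlations can be compared with the population
values implied by the simulation settings (sd(A) = 31, noise standard
deviations 42 and 50). This is a minimal sketch of that arithmetic; the sample
values above are broadly in line with it.
sd_A <- 31; sd_eX <- 42; sd_eY <- 50
# Population correlations implied by X = A + eX and Y = A + eY:
# cor(A, X) = sd_A / sqrt(sd_A^2 + sd_eX^2), and analogously for Y;
# cor(X, Y) = sd_A^2 / sqrt((sd_A^2 + sd_eX^2) * (sd_A^2 + sd_eY^2)).
c(AX = sd_A / sqrt(sd_A^2 + sd_eX^2),    # about 0.59
  AY = sd_A / sqrt(sd_A^2 + sd_eY^2),    # about 0.53
  XY = sd_A^2 / sqrt((sd_A^2 + sd_eX^2) * (sd_A^2 + sd_eY^2)))  # about 0.31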
par(
mfrow=c(2,2),
mar=c(4,4,1,0)
)
hist(A, xlab = "Random value (A)", col = "#404080",
main = "Distribution of A", cex.lab = 1, cex.axis = 1, cex.main = 1)
hist(X, xlab = "Random value (X)", col = "#69b3a2",
main = "Distribution of X", cex.lab = 1, cex.axis = 1, cex.main = 1)
hist(Y, xlab = "Random value (Y)", col = "#a6bddb",
main = "Distribution of Y", cex.lab = 1, cex.axis = 1, cex.main = 1)

- The slope coefficient in a bivariate regression of Y on A should be positive
and significant, because we built a positive relationship into Y when we
constructed it (Y = A + noise). Indeed, after running the code we see that a
one-unit increase in A is associated with an increase of about 1.03 in Y, and
the coefficient is statistically significant (a confidence-interval sketch
follows the model summary below).
summary(mod1 <- lm(Y ~ A))
##
## Call:
## lm(formula = Y ~ A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -156.345 -34.134 -0.047 33.100 170.635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.71636 5.94961 25.33 <2e-16 ***
## A 1.02916 0.05189 19.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.02 on 998 degrees of freedom
## Multiple R-squared: 0.2827, Adjusted R-squared: 0.282
## F-statistic: 393.4 on 1 and 998 DF, p-value: < 2.2e-16
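To make the uncertainty around that slope explicit, one could also look at its
confidence interval; given the estimate of 1.03 and standard error of 0.05
above, the interval brackets the true value of 1 used in the simulation. A
sketch (output not shown):
confint(mod1)  # 95% confidence intervals for the intercept and the slope on A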
library(ggplot2)
library(hrbrthemes)
dat <- data.frame(A,X,Y)
ggplot(data = dat, aes(x=A, y=Y)) +
geom_point()+
geom_smooth(method=lm , color="red", se=FALSE) +
theme_ipsum()

- Unlike the previous association, which is direct, the association between X
and Y is spurious: A "influences" both X and Y, so it is a confounding
variable. We can therefore expect a positive and significant association
between X and Y, but a weaker one than in the model Y ~ A (a
partial-correlation sketch follows the scatterplot below).
summary(mod2 <- lm(Y ~ X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -198.940 -37.204 0.102 35.947 213.627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 166.14581 8.33617 19.93 <2e-16 ***
## X 0.42088 0.03492 12.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.29 on 998 degrees of freedom
## Multiple R-squared: 0.1271, Adjusted R-squared: 0.1262
## F-statistic: 145.3 on 1 and 998 DF, p-value: < 2.2e-16
ggplot(data = dat, aes(x=X, y=Y)) +
geom_point(color = "#636363")+
geom_smooth(method=lm , color="red", se=FALSE) +
theme_ipsum()
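A quick way to see that the X-Y association is carried by A is the partial
correlation of X and Y controlling for A, i.e. the correlation between the
parts of X and Y not explained by A. A minimal sketch; with the correlations
reported above it comes out to roughly 0.06:
# Partial correlation of X and Y given A: correlate the residuals of X ~ A and
# Y ~ A. A value near zero means the raw X-Y correlation of 0.36 is mostly due
# to the shared cause A.
cor(resid(lm(X ~ A)), resid(lm(Y ~ A)))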

- Since the two predictors, A and X, are correlated (r ≈ 0.6), it is difficult
to separate their individual effects on the dependent variable, and the
regression coefficients will change depending on which variables are included
in or excluded from the model. We can expect that when both predictors are
included, the spurious association between X and Y will become insignificant
(the VIF sketch after the 3D scatterplot quantifies the collinearity).
summary(mod3 <- lm(Y ~ X+A))
##
## Call:
## lm(formula = Y ~ X + A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.363 -33.491 -0.843 33.397 176.274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142.06654 7.72103 18.400 <2e-16 ***
## X 0.06935 0.03951 1.755 0.0796 .
## A 0.96100 0.06477 14.837 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.97 on 997 degrees of freedom
## Multiple R-squared: 0.2849, Adjusted R-squared: 0.2835
## F-statistic: 198.6 on 2 and 997 DF, p-value: < 2.2e-16
library(scatterplot3d)
scatterplot3d(dat[,1:3], pch = 16)
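The degree of collinearity can also be quantified with variance inflation
factors. With cor(A, X) ≈ 0.6, the VIF is about 1 / (1 − 0.6²) ≈ 1.6 for both
predictors, i.e. moderate rather than severe. A sketch (output not shown):
library(car)  # vif() for variance inflation factors
vif(mod3)     # roughly 1.6 for both X and A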

- I expect the root mean squared error in Model 2 (1b, Y ~ X) to be larger
than in Model 1 (1a, Y ~ A), since X is related to Y only through A, so the
spurious association leaves more of the variation in Y unexplained.
sq1 <- sqrt(mean(mod1$residuals^2))
sq2 <- sqrt(mean(mod2$residuals^2))
RMSE <- c(sq1, sq2)
Models <- c('Model 1', 'Model 2')
print(fstat <- data.frame(Models, RMSE))
## Models RMSE
## 1 Model 1 50.97267
## 2 Model 2 56.23270
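The same comparison can be read off the residual standard errors: sigma()
returns the degrees-of-freedom-adjusted analogue of these RMSE values,
matching the 51.02 and 56.29 reported in the model summaries above. A sketch:
# Residual standard error (divides by n - 2 rather than n); for n = 1000 it is
# nearly identical to the RMSE values computed above.
c(sigma(mod1), sigma(mod2))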
- I expect the adjusted R2 in Model 1 (1a) to be larger than in Model 2 (1b)
because, again, the relationship between X and Y is confounded by the
influence of A.
r1 <- summary(mod1)$adj.r.squared
r2 <- summary(mod2)$adj.r.squared
Adj.R2 <- c(r1, r2)
print(fstat <- data.frame(Models, Adj.R2))
## Models Adj.R2
## 1 Model 1 0.2820090
## 2 Model 2 0.1261795
- By including the covariate A in the regression model, we see that the
spurious association between X and Y disappears: the coefficient on X shrinks
from 0.42 to 0.07 and is no longer significant. A nested-model F test sketched
after the table below makes the same point.
| Predictors       | Model 2: Estimates | Model 2: std. Error | Model 3: Estimates | Model 3: std. Error |
|------------------|--------------------|---------------------|--------------------|---------------------|
| (Intercept)      | 166.15 ***         | 8.34                | 142.07 ***         | 7.72                |
| X                | 0.42 ***           | 0.03                | 0.07               | 0.04                |
| A                |                    |                     | 0.96 ***           | 0.06                |
| Observations     | 1000               |                     | 1000               |                     |
| R2 / R2 adjusted | 0.127 / 0.126      |                     | 0.285 / 0.284      |                     |

* p<0.05   ** p<0.01   *** p<0.001
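The same conclusion can be checked with a nested-model F test of whether X
adds explanatory power once A is already in the model; it is equivalent to the
t test on X in Model 3 (p ≈ 0.08). A sketch (output not shown):
# Compare Y ~ A against Y ~ X + A: does adding X significantly improve the fit?
anova(mod1, mod3)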
- Zero conditional mean of the error, often referred to as exogeneity, implies
that the error has mean zero and is unrelated to the independent variables. I
plotted the (partial) residuals against the predictors in all models to
inspect whether there are obvious patterns: the residuals should be randomly
and symmetrically distributed around zero across the range of each predictor.
Visually, the plots suggest some association between the residuals and the
predictors (the weakest is in the last graph, between the Model 3 residuals
and X). When this assumption fails, a model can systematically under- or
over-predict the observed values. This is not surprising, because the
assumption can be violated for several reasons: omitted-variable bias
(confounding), measurement error in the independent variables, simultaneity
between the independent and dependent variables, etc. A short numeric check
follows the plots.
res1 <- resid(mod1)
res2 <- resid(mod2)
res3 <- resid(mod3)
library(car)
par(
mfrow=c(2,2),
mar=c(4,4,1,0)
)
crPlots(mod1,
~ A,
ylab = "Partial residuals",
xlab = "A (Model 1)")
crPlots(mod2,
~ X,
ylab = "Partial residuals",
xlab = "X (Model 2)")
crPlots(mod3,
~ A,
ylab = "Partial residuals",
xlab = "A (Model 3)")
crPlots(mod3,
~ X,
ylab = "Partial residuals",
xlab = "X (Model 3)")

- The homoscedasticity assumption states that the error term has constant
variance, i.e. the same variance regardless of the values of the predictors.
If it fails, heteroscedasticity reduces the precision of the estimates in OLS
linear regression: the coefficients remain unbiased, but the usual standard
errors become unreliable. In the spread-level plots we do not see a pronounced
dependence of the residual spread on the fitted values, although it is
probably not ideal homoscedasticity. Heteroscedasticity usually occurs when
the data contain a wide range of values or skewed distributions, and it is
also connected with the presence of outliers. Our data are simulated from
normal distributions and do not have such problems. A score-test sketch
follows the plots.
spreadLevelPlot(mod1, main = "Homo(hetero)scedasticity Model 1: Y ~ A",
cex.lab = 1, cex.axis = 1)

##
## Suggested power transformation: 0.9269855
spreadLevelPlot(mod2, main = "Homo(hetero)scedasticity Model 2: Y ~ X",
cex.lab = 1, cex.axis = 1)

##
## Suggested power transformation: 1.205107
spreadLevelPlot(mod3, main = "Homo(hetero)scedasticity Model 3: Y ~ X + A",
cex.lab = 1, cex.axis = 1)

##
## Suggested power transformation: 0.9789594
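A formal complement to the visual check is the score test for non-constant
error variance from the car package (a sketch; output not shown):
# Tests constant error variance against the alternative that the variance
# changes with the level of the fitted values.
ncvTest(mod1)
ncvTest(mod2)
ncvTest(mod3)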