1. Simulate a random sample of 1000 cases. Construct a variable A with a standard normal distribution. Construct a variable X that equals A plus standard normal noise. Construct a variable Y that equals A plus standard normal noise. (The code below draws A and the noise terms with non-standard means and standard deviations, but the structure is the same: A is a common component of both X and Y.)
set.seed(5673)
n <- 1000
A <- rnorm(n, 110, 31)
X <- A + rnorm(n, 124, 42)
Y <- A + rnorm(n, 150, 50)

cor(A,X)
## [1] 0.5996215
cor(A,Y)
## [1] 0.5317214
cor(X,Y)
## [1] 0.3564466
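As a sanity check (an addition, not part of the original exercise), the correlations implied by the chosen standard deviations can be computed directly: with independent noise, cor(A, A + e) = sd(A) / sqrt(sd(A)^2 + sd(e)^2), and cor(X, Y) = var(A) / (sd(X) * sd(Y)).

# Population correlations implied by the simulation parameters
sd_A <- 31; sd_eX <- 42; sd_eY <- 50
sd_X <- sqrt(sd_A^2 + sd_eX^2)        # sd of X = A + noise
sd_Y <- sqrt(sd_A^2 + sd_eY^2)        # sd of Y = A + noise
c(cor_AX = sd_A / sd_X,               # ~0.59
  cor_AY = sd_A / sd_Y,               # ~0.53
  cor_XY = sd_A^2 / (sd_X * sd_Y))    # ~0.31

The sample correlations above are close to these values; the differences are sampling variation.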
par(
  mfrow=c(2,2),
  mar=c(4,4,1,0)
)
hist(A, xlab = "Random value (A)", col = "#404080",
     main = "Distribution of A", cex.lab = 1, cex.axis = 1, cex.main = 1)
hist(X, xlab = "Random value (X)", col = "#69b3a2",
     main = "Distribution of X", cex.lab = 1, cex.axis = 1, cex.main = 1)
hist(Y, xlab = "Random value (Y)", col = "#a6bddb",
     main = "Distribution of Y", cex.lab = 1, cex.axis = 1, cex.main = 1)

  1. The slope coefficient in a bivariate regression of Y on A should be positive and significant, because we built a positive relationship into Y when we constructed it (Y = A + noise). After running the model, we see that a one-unit increase in A is associated with an increase of about 1.03 in Y, and the coefficient is statistically significant (a quick confidence-interval check is sketched after the summary output below).
summary(mod1 <- lm(Y ~ A))
## 
## Call:
## lm(formula = Y ~ A)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -156.345  -34.134   -0.047   33.100  170.635 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150.71636    5.94961   25.33   <2e-16 ***
## A             1.02916    0.05189   19.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.02 on 998 degrees of freedom
## Multiple R-squared:  0.2827, Adjusted R-squared:  0.282 
## F-statistic: 393.4 on 1 and 998 DF,  p-value: < 2.2e-16
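As a quick check of the claim above (an addition, not in the original output), a 95% confidence interval for the slope can be obtained with confint(); based on the printed estimate and standard error it should be roughly (0.93, 1.13), which includes the true slope of 1 built into the simulation.

# 95% confidence interval for the coefficients of Model 1 (sketch)
confint(mod1)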
library(ggplot2)
library(hrbrthemes)
 
dat <- data.frame(A,X,Y)
ggplot(data = dat, aes(x=A, y=Y)) + 
  geom_point()+
  geom_smooth(method=lm , color="red", se=FALSE) +
  theme_ipsum()

  1. Unlike the previous association, which is direct, the association between X and Y is spurious: A ‘influences’ both X and Y, so it is a confounding variable. We can therefore expect a positive and significant association between X and Y, but a weaker one than in the model Y ~ A.
summary(mod2 <- lm(Y ~ X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -198.940  -37.204    0.102   35.947  213.627 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 166.14581    8.33617   19.93   <2e-16 ***
## X             0.42088    0.03492   12.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.29 on 998 degrees of freedom
## Multiple R-squared:  0.1271, Adjusted R-squared:  0.1262 
## F-statistic: 145.3 on 1 and 998 DF,  p-value: < 2.2e-16
ggplot(data = dat, aes(x=X, y=Y)) + 
  geom_point(color = "#636363")+
  geom_smooth(method=lm , color="red", se=FALSE) +
  theme_ipsum()

  1. Since the two independent variables, A and X, are correlated (r ≈ 0.6), it becomes harder to separate their individual effects on the dependent variable, and the regression coefficients change depending on which predictors are included in the model. We can expect that, once A is included, the spurious association between X and Y will become insignificant (a collinearity check with variance inflation factors is sketched after the plots below).
summary(mod3 <- lm(Y ~ X+A))
## 
## Call:
## lm(formula = Y ~ X + A)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -155.363  -33.491   -0.843   33.397  176.274 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 142.06654    7.72103  18.400   <2e-16 ***
## X             0.06935    0.03951   1.755   0.0796 .  
## A             0.96100    0.06477  14.837   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.97 on 997 degrees of freedom
## Multiple R-squared:  0.2849, Adjusted R-squared:  0.2835 
## F-statistic: 198.6 on 2 and 997 DF,  p-value: < 2.2e-16
library(scatterplot3d) 

scatterplot3d(dat[,1:3], pch = 16)
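As a quick collinearity check (an addition, not in the original), variance inflation factors for Model 3 can be computed with car::vif(); with r ≈ 0.6 between A and X, the VIFs should be around 1 / (1 - 0.6^2) ≈ 1.6, i.e. noticeable but far from severe.

# Collinearity check for Model 3 (sketch)
library(car)
vif(mod3)   # expect values near 1 / (1 - cor(A, X)^2) ≈ 1.6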

  1. I expect the root mean squared error in model 2 (1b) to be larger than in model 1 (1a): because the association between X and Y is spurious and runs only through A, X explains less of the variation in Y, leaving larger residuals.
sq1 <- sqrt(mean(mod1$residuals^2))
sq2 <- sqrt(mean(mod2$residuals^2))
RMSE <- c(sq1,sq2)
Models <- c('Model 1', 'Model 2')
print(fstat <- data.frame(Models, RMSE))
##    Models     RMSE
## 1 Model 1 50.97267
## 2 Model 2 56.23270
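Note that these RMSE values are slightly smaller than the residual standard errors reported by summary() (51.02 and 56.29), because summary() divides the residual sum of squares by the residual degrees of freedom (n - 2) rather than by n. The built-in sigma() extractor returns the same quantity as summary():

# Residual standard error (divides by n - 2 rather than n)
sigma(mod1)   # ~51.02, vs RMSE ~50.97
sigma(mod2)   # ~56.29, vs RMSE ~56.23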
  1. I expect the R2 in model 1a to be larger than the R2 in model 1b because, again, the relationship between X and Y is only due to the confounding influence of A.
r1 <- summary(mod1)$adj.r.squared
r2 <- summary(mod2)$adj.r.squared
rr <- c(r1,r2)
print(fstat <- data.frame(Models, rr))
##    Models        rr
## 1  Model1 0.2820090
## 2 Model 2 0.1261795
  1. By including the covariate A in the regression model, we see that the spurious association between X and Y disappears: the X coefficient drops from 0.42 to 0.07 and is no longer significant at the 5% level.
                    Model 2                   Model 3
Predictors          Estimates    std. Error   Estimates    std. Error
(Intercept)         166.15 ***   8.34         142.07 ***   7.72
X                   0.42 ***     0.03         0.07         0.04
A                                             0.96 ***     0.06
Observations        1000                      1000
R2 / R2 adjusted    0.127 / 0.126             0.285 / 0.284
* p<0.05   ** p<0.01   *** p<0.001
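The comparison table above is typical of a regression-table helper; the code that produced it is not shown, but a similar table can be generated, for example, with sjPlot::tab_model (an assumption, not necessarily the original approach):

# One way to build a side-by-side comparison of Model 2 and Model 3 (sketch)
library(sjPlot)
tab_model(mod2, mod3,
          dv.labels = c("Model 2", "Model 3"),
          show.ci = FALSE, show.se = TRUE)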
  1. The zero conditional mean assumption, often referred to as exogeneity, implies that the errors have mean zero and are not linearly related to the independent variables. I plotted partial residuals against the predictors in all models to check for obvious patterns: the residuals should be randomly and symmetrically scattered around zero, regardless of the value of the predictor. Visually, the plots suggest some remaining association between the residuals and the predictors (the weakest association is in the last panel, for X in Model 3), so the models can systematically under- or over-predict the observed values. This is not surprising, because the assumption can be violated for several reasons: omitted-variable bias (confounding), measurement error in the independent variables, simultaneity between the independent and dependent variables, and so on. A simple numerical complement to these plots is sketched after them below.
eruption.res1 <- resid(mod1)
eruption.res2 <- resid(mod2)
eruption.res3 <- resid(mod3)

library(car)
par(
  mfrow=c(2,2),
  mar=c(4,4,1,0)
)
crPlots(mod1, ~ A,
        ylab = "Partial residuals",
        xlab = "A (Model 1)")
crPlots(mod2, ~ X,
        ylab = "Partial residuals",
        xlab = "X (Model 2)")
crPlots(mod3, ~ A,
        ylab = "Partial residuals",
        xlab = "A (Model 3)")
crPlots(mod3, ~ X,
        ylab = "Partial residuals",
        xlab = "X (Model 3)")

  1. The homoscedasticity assumption states that the error term has constant variance, i.e. the same variance regardless of the values of the predictors. If this does not hold, heteroscedasticity reduces the precision of the estimates in OLS regression. In the spread-level plots we do not see a pronounced dependence of the residual spread on the fitted values (although it is probably not perfect homoscedasticity). Heteroscedasticity usually arises when the data span a very wide range of values or have skewed distributions, and it is often connected with outliers. Our data are simulated from normal distributions and do not have such problems. A formal test is sketched after the plots below.
spreadLevelPlot(mod1, main = "Homo(hetero)scedasticity Model 1: Y ~ A",
                cex.lab = 1, cex.axis = 1)

## 
## Suggested power transformation:  0.9269855
spreadLevelPlot(mod2, main = "Homo(hetero)scedasticity Model 2: Y ~ X",
                cex.lab = 1, cex.axis = 1)

## 
## Suggested power transformation:  1.205107
spreadLevelPlot(mod3, main = "Homo(hetero)scedasticity Model 3: Y ~ X+A",
                cex.lab = 1, cex.axis = 1)

## 
## Suggested power transformation:  0.9789594
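As a formal complement to the visual inspection (an addition; the original relies on the spread-level plots only), the score test for non-constant error variance implemented in car::ncvTest() can be run on each model; given that the data were simulated with constant error variance, the null hypothesis of homoscedasticity should generally not be rejected.

# Score test for non-constant error variance (sketch)
library(car)
ncvTest(mod1)
ncvTest(mod2)
ncvTest(mod3)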