set.seed(20200914)
N<-100
n<-2
y1<-rnorm(N,0,1)
y2<-rnorm(N,0,0.6)
y2<- y2 + 0.5 + 0.8 * y1
plot(y1, y2, main = "scatterplot of y1 vs. y2")
# Pearson correlation computed by hand from the centered samples
d1 <- y1 - mean(y1)
d2 <- y2 - mean(y2)
result1 <- sum(d1 * d2) / sqrt(sum(d1^2) * sum(d2^2))
# Built-in correlation for comparison
result2 <- cor(y1, y2)
result<-c(result1,result2)
names(result) <- c("mine","cor(y1,y2)")
(result)
## mine cor(y1,y2)
## 0.8141861 0.8141861
##
## Call:
## lm(formula = y ~ . - 1, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.02354 -0.68563 0.01647 0.77282 2.49934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## beta0 0.1059 0.1075 0.985 0.32598
## beta1 0.4682 0.1521 3.079 0.00237 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 198 degrees of freedom
## Multiple R-squared: 0.1296, Adjusted R-squared: 0.1208
## F-statistic: 14.74 on 2 and 198 DF, p-value: 1.082e-06
## The estimate of beta1 is 0.468167
## The p-value for beta1 is 0.002373549
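Under \(H_0: \beta_1 = 0\), where \(n\) is the total number of stacked observations (here \(n - 2 = 198\), the residual degrees of freedom in the summary above):
\[ \frac{\hat{\beta}_1}{ \widehat{\mathrm{se}}(\hat{\beta}_1) } \sim t_{n-2}\]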
## The p-value for hypothesis testing H0: beta1=0 is 0.002373549
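The chunk that produced this fit is not shown. The following is a minimal sketch of one setup consistent with the printed call `lm(formula = y ~ . - 1, data = dat)`. It assumes that `dat` stacks `y1` and `y2` into a single response `y`, with a constant column `beta0` and a 0/1 indicator `beta1` for the second sample. That layout, and the helper names `est`, `se`, `t_stat` and `p_val`, are assumptions made for illustration. If the assumption holds, the values should agree with the summary above.

```r
# Regenerate the data exactly as in the simulation at the top.
set.seed(20200914)
N  <- 100
y1 <- rnorm(N, 0, 1)
y2 <- rnorm(N, 0, 0.6)
y2 <- y2 + 0.5 + 0.8 * y1

# Assumed layout: stack both samples and treat them as 2N independent rows.
dat <- data.frame(
  y     = c(y1, y2),
  beta0 = 1,                       # constant column (common intercept)
  beta1 = rep(c(0, 1), each = N)   # indicator for the second sample
)
fit <- lm(y ~ . - 1, data = dat)

# t statistic and two-sided p-value for H0: beta1 = 0,
# computed from the estimate and its standard error.
coefs  <- coef(summary(fit))
est    <- coefs["beta1", "Estimate"]
se     <- coefs["beta1", "Std. Error"]
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = fit$df.residual)
cat("estimate:", est, " t:", t_stat, " p-value:", p_val, "\n")
```

With this coding, `beta0` estimates the mean of `y1` and `beta1` estimates the difference `mean(y2) - mean(y1)`, so the test of `beta1 = 0` is the pooled two-sample t-test that treats the samples as independent.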
## Result from R's t.test:
##
## One Sample t-test
##
## data: z
## t = -7.1395, df = 99, p-value = 1.588e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.5982809 -0.3380530
## sample estimates:
## mean of x
## -0.468167
## my result:
## mean of z : -0.468167 , t = -7.139473 , p-value = 1.587818e-10
## 95 percent confidence interval:
## -0.5982809 -0.338053
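The code behind the two results above is not shown. A minimal sketch of the paired analysis, assuming the differences are defined as `z <- y1 - y2` (an assumption, chosen because the reported mean of `z` is negative), could look like this. The hand computation uses the one-sample t statistic for `z`, with `N - 1` degrees of freedom.

```r
# Regenerate the data exactly as in the simulation at the top.
set.seed(20200914)
N  <- 100
y1 <- rnorm(N, 0, 1)
y2 <- rnorm(N, 0, 0.6)
y2 <- y2 + 0.5 + 0.8 * y1

# Assumed direction of differencing; each pair (y1[i], y2[i]) gives one z[i].
z <- y1 - y2

# Built-in one-sample t-test on the differences.
tt <- t.test(z)

# The same quantities by hand.
z_bar  <- mean(z)
se_z   <- sd(z) / sqrt(N)                        # standard error of the mean
t_stat <- z_bar / se_z
p_val  <- 2 * pt(-abs(t_stat), df = N - 1)       # two-sided p-value
ci     <- z_bar + c(-1, 1) * qt(0.975, df = N - 1) * se_z

cat("mean of z :", z_bar, ", t =", t_stat, ", p-value =", p_val, "\n")
cat("95 percent confidence interval:\n", ci, "\n")
```

Calling `t.test(y1, y2, paired = TRUE)` is an equivalent way to run the same test.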
If we ignore the correlation, the 95% confidence interval for \(\beta_1\) has length of about 2 * 0.30 (half-width \(t_{0.975,\,198} \times 0.1521 \approx 0.30\)).
If we take the correlation into account, the 95% confidence interval for \(\beta_1\) has length of about 2 * 0.13.
Both methods give the same point estimate of \(\beta_1\), 0.468 in absolute value. The paired output reports -0.468, presumably because the differences are taken in the opposite direction.
Thus, ignoring the correlation gives a less precise estimate: a larger standard error, a wider confidence interval, a larger p-value, and therefore less power to reject the null hypothesis \(\beta_1 = 0\). The reason is that ignoring the correlation relies on a stronger assumption (that the two samples are independent) and discards the pairing information in the data.
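The gap between the two intervals follows from the identity \(\mathrm{Var}(y_2 - y_1) = \mathrm{Var}(y_1) + \mathrm{Var}(y_2) - 2\,\mathrm{Cov}(y_1, y_2)\): the independent-samples analysis drops the covariance term, which is positive here. The sketch below compares the two half-widths on the simulated data. The variable names are illustrative, and the pooled standard error is written for equal group sizes.

```r
# Regenerate the data exactly as in the simulation at the top.
set.seed(20200914)
N  <- 100
y1 <- rnorm(N, 0, 1)
y2 <- rnorm(N, 0, 0.6)
y2 <- y2 + 0.5 + 0.8 * y1

# Ignoring the correlation: pooled two-sample standard error (equal group sizes).
s_pooled   <- sqrt((var(y1) + var(y2)) / 2)
se_indep   <- s_pooled * sqrt(2 / N)
half_indep <- qt(0.975, df = 2 * N - 2) * se_indep

# Using the pairing: standard error of the mean difference.
se_pair   <- sd(y2 - y1) / sqrt(N)
half_pair <- qt(0.975, df = N - 1) * se_pair

# The variance of the differences includes the covariance term.
var_diff     <- var(y2 - y1)
var_identity <- var(y1) + var(y2) - 2 * cov(y1, y2)

cat("half-width ignoring correlation:", half_indep, "\n")
cat("half-width using the pairing   :", half_pair, "\n")
cat("Var(y2 - y1):", var_diff, " identity:", var_identity, "\n")
```

When the two samples are positively correlated, as in this simulation, the paired standard error is the smaller one. With negative correlation the ordering would reverse.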