Problem 1)

Assume \(D_1 = \{ (x_i, y_i)\}^n_{i=1}\) is generated from \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\). Suppose some goofy-goober messed up and doubled the data set, duplicating every point: \(D_2=\{(x_i, y_i)\}^{2n}_{i=1}\).

A) How does this affect the parameter estimates?

Consider the following two equations:

\[ \hat{\beta_1} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2} \\ \hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x} \]

First, notice that the averages stay the same when the data is simply repeated.

\[ \text{Recall: } \bar{x}=\frac{1}{n}\sum^n_{i=1}{x_i} \\ \implies \\ \frac{1}{2n}\sum^{2n}_{i=1}{x_i} = \frac{1}{2n}\left(2\sum^{n}_{i=1}{x_i}\right) = \frac{1}{n}\sum^n_{i=1}{x_i} = \bar{x} \]

(each point appears twice in \(D_2\), so the sum over \(2n\) terms is just twice the sum over the original \(n\)).

So the averages don't change. Moreover, duplicating the data doubles both the numerator \(\sum(x_i-\bar{x})(y_i-\bar{y})\) and the denominator \(\sum(x_i-\bar{x})^2\) of \(\hat{\beta_1}\), so the factors of 2 cancel and \(\hat{\beta_1}\) is unchanged. And since \(\hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x}\) depends only on the (unchanged) averages and \(\hat{\beta_1}\), the estimates don't change.
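As a sanity check, here is a minimal numpy sketch (simulated data; the coefficients, seed, and sample size are arbitrary choices of ours, not part of the problem) showing that duplicating every point leaves the OLS estimates unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)   # true beta0 = 2, beta1 = 3

def ols(x, y):
    # Closed-form simple-linear-regression estimates from the formulas above.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x2, y2 = np.tile(x, 2), np.tile(y, 2)     # the doubled data set D2
print(ols(x, y))                          # estimates on D1
print(ols(x2, y2))                        # identical estimates on D2
```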

B) How does this affect the K% confidence intervals?

Consider the following for the intercept:

\[ \hat{\beta_0} \pm t_{\frac{\alpha}{2}, n-2} * SE(\hat{\beta_0}) \\ = \hat{\beta_0} \pm t_{\frac{\alpha}{2}, n-2} * S\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\sum(x_i-\bar{x})^2}} \\ = \hat{\beta_0} \pm t_{\frac{\alpha}{2}, n-2} * \sqrt{\frac{\sum\hat{e}_i^2}{n-2}}\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\sum(x_i-\bar{x})^2}} \]

Notice that for the critical value \(t_{\frac{\alpha}{2}, n-2}\): doubling the data raises the degrees of freedom from \(n-2\) to \(2n-2\), and as \(n\) becomes large the t-distribution approaches the normal distribution, so the critical value shrinks toward the corresponding \(Z\)-value.

Furthermore, as \(n\) becomes larger, \(SE(\hat{\beta_0})\) becomes smaller: under doubling, the \(\frac{1}{n}\) term halves, \(\sum(x_i-\bar{x})^2\) doubles, and \(S\) stays roughly the same, so the standard error shrinks by roughly a factor of \(\frac{1}{\sqrt{2}}\).

Therefore, as \(n\) becomes larger, we should expect our confidence interval to become narrower.
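Here is a short sketch of this (simulated data again; `intercept_ci_halfwidth` is just a helper name of our own) comparing the intercept CI's half-width on \(D_1\) versus the doubled \(D_2\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

def intercept_ci_halfwidth(x, y, alpha=0.05):
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))  # sqrt(RSS/(n-2))
    se_b0 = s * np.sqrt(1 / n + x.mean() ** 2 / sxx)
    return stats.t.ppf(1 - alpha / 2, n - 2) * se_b0

print(intercept_ci_halfwidth(x, y))                          # on D1
print(intercept_ci_halfwidth(np.tile(x, 2), np.tile(y, 2)))  # narrower on D2
```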

Consider the following for the slope:

\[ \hat{\beta_1} \pm t_{\frac{\alpha}{2}, n-2} * SE(\hat{\beta_1}) \\ = \hat{\beta_1} \pm t_{\frac{\alpha}{2}, n-2} * \frac{S}{\sqrt{S_{XX}}} \\ = \hat{\beta_1} \pm t_{\frac{\alpha}{2}, n-2} * \frac{\sqrt{\frac{\sum\hat{e}_i^2}{n-2}}}{\sqrt{\sum(x_i-\bar{x})^2}} \]

As before, when \(n\) becomes large the t-distribution approaches the normal distribution and the critical value shrinks toward the \(Z\)-value.

Additionally, the standard error decreases because the residual variance \(S^2\) stabilizes while the denominator \(\sqrt{\sum(x_i-\bar{x})^2}\) grows. Therefore, as \(n\) grows, the confidence interval for \(\beta_1\) becomes narrower, giving a more precise estimate.
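The same kind of sketch works for the slope (same simulated setup as above; we'd expect the half-width to shrink by roughly \(\frac{1}{\sqrt{2}}\) under doubling):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

def slope_ci_halfwidth(x, y, alpha=0.05):
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))  # S
    return stats.t.ppf(1 - alpha / 2, n - 2) * s / np.sqrt(sxx)

print(slope_ci_halfwidth(x, y))                          # on D1
print(slope_ci_halfwidth(np.tile(x, 2), np.tile(y, 2)))  # narrower on D2
```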

C) Hypothesis Testing

Consider the following for the intercept:

\[ T = \frac{\hat{\beta_0}}{SE(\hat{\beta_0})} \]

Therefore, as \(n\) becomes larger, our \(T\)-value becomes larger (recall that the parameter estimate remains unchanged while the standard error shrinks). As our \(T\)-value becomes larger, we become more likely to reject the null hypothesis because our \(P\)-value becomes smaller and smaller. This makes us more subject to a Type I Error (False Positive): the duplicated points are not new, independent observations, so the shrinking standard error overstates our actual evidence.

Consider the following for the slope:

\[ T = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} \]

Therefore, as \(n\) becomes larger, our T-value increases (recall that the parameter remains unchanged while the standard error decreases). As our T-value becomes larger, we become more likely to reject the null hypothesis because our P-value decreases. This makes us more susceptible to a Type I Error (False Positive), meaning we might incorrectly conclude that the slope is significant when it is not.
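A quick numerical illustration of this (simulated data with a deliberately weak true slope; the intercept statistic behaves the same way):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 1.0 + 0.2 * x + rng.normal(size=30)   # weak true slope

def slope_t_and_p(x, y):
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    t = b1 / (s / np.sqrt(sxx))
    return t, 2 * stats.t.sf(abs(t), n - 2)  # two-sided p-value

print(slope_t_and_p(x, y))                          # original D1
print(slope_t_and_p(np.tile(x, 2), np.tile(y, 2)))  # |T| inflated, p deflated on D2
```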

Problem 2)

Suppose we have the same data set \(D_1\); however, a student accidentally switches \((x_i, y_i)\) to \((y_i, x_i)\). Call the resulting models \(\text{MDL}_1\) and \(\text{MDL}_2\), respectively.

A) What is the relationship between the slopes of the two models?

Recall:

\[ \beta_1 = \frac{Cov(X,Y)}{Var(X)} \\ \beta_1^{'} = \frac{Cov(Y, X)}{Var(Y)} \\ \]

Note that \(Cov(X,Y) = Cov(Y,X)\), so the only difference is the denominator.

Furthermore, consider the product:

\[ \beta_1 * \beta_1^{'} \\ = \frac{Cov(X,Y)}{Var(X)} * \frac{Cov(Y, X)}{Var(Y)} \\ = \frac{Cov(X,Y)}{Var(X)} * \frac{Cov(X, Y)}{Var(Y)} \\ = \frac{Cov^2(X,Y)}{Var(X)Var(Y)} \]

This product should look familiar; specifically, recall the correlation coefficient:

\[ r = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} \\ \implies \\ r^2 = \frac{Cov^2(X,Y)}{Var(X)Var(Y)} = \beta_1 * \beta_1^{'} \\ \therefore \\ \beta_1 = \frac{r^2}{\beta_1^{'}} \]

In particular, if we have a perfect linear association (\(r^2 = 1\)), it's clear that the two slopes are just reciprocals of one another.
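A quick numerical check of \(\beta_1 * \beta_1^{'} = r^2\) (simulated data again):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.5 + 1.5 * x + rng.normal(size=100)

b1  = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # slope of MDL1 (y on x)
b1p = np.cov(x, y)[0, 1] / np.var(y, ddof=1)   # slope of MDL2 (x on y)
r   = np.corrcoef(x, y)[0, 1]
print(b1 * b1p, r ** 2)                        # equal up to float error
```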

B) What is the relationship between the statistical significance of each model's slope?

Recall:

\[ T_1 = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} \\ \]

If \(\hat{\beta_1}\) is statistically significant, then \(|T_1| > t_{\frac{\alpha}{2}, n-2}\); abbreviate \(t_1 = t_{\frac{\alpha}{2}, n-2}\).

So consider:

\[ T_2 = \frac{\hat{\beta_1^{'}}}{SE(\hat{\beta_1^{'}})} \\ = \frac{\frac{r^2}{\hat{\beta_1}}}{SE(\hat{\beta_1^{'}})} \\ = \frac{r^2}{\hat{\beta_1}SE(\hat{\beta_1^{'}})} \]
and notice that \(\sum\hat{e}_i^2 = S_{YY}(1-r^2)\) for \(\text{MDL}_1\) and, symmetrically, \(S_{XX}(1-r^2)\) for \(\text{MDL}_2\) (writing \(S_{YY} = \sum(y_i-\bar{y})^2\)). This gives \(SE(\hat{\beta_1})^2 = \frac{S_{YY}(1-r^2)}{(n-2)S_{XX}}\), \(SE(\hat{\beta_1^{'}})^2 = \frac{S_{XX}(1-r^2)}{(n-2)S_{YY}}\), and \(\hat{\beta_1}^2 = \frac{r^2S_{YY}}{S_{XX}}\), so:

\[ SE(\hat{\beta_1^{'}}) = r^2\frac{SE(\hat{\beta_1})}{\hat{\beta_1}^2} \\ \implies \\ T_2 = \frac{r^2}{\hat{\beta_1}SE(\hat{\beta_1^{'}})} = \frac{r^2}{\hat{\beta_1} * r^2\frac{SE(\hat{\beta_1})}{\hat{\beta_1}^2}} = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} \]

Meaning,

\[ T_2 = T_1 \]

(indeed, both statistics reduce to \(\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\), which is symmetric in \(X\) and \(Y\)).

So if \(|T_1| > t_1\), then \(|T_2| = |T_1| > t_1\) as well, and vice versa: whenever \(\text{MDL}_1\)'s slope is statistically significant, so is \(\text{MDL}_2\)'s.

Thus, statistical significance is always preserved when switching \(X\) and \(Y\): the slope test in simple linear regression is really a test of the correlation \(r\), and \(r\) does not care which variable we call the response.
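A quick numerical check that the two t-statistics agree (simulated data; `slope_t` is a helper of our own):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = 1.0 + 0.8 * x + rng.normal(size=40)

def slope_t(x, y):
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    resid = y - (y.mean() + b1 * (x - x.mean()))   # residuals of the fit
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))
    return b1 / (s / np.sqrt(sxx))

# Same statistic whichever variable plays the response:
print(slope_t(x, y), slope_t(y, x))   # both equal r*sqrt(n-2)/sqrt(1-r^2)
```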