October 4th, 2023
Suppose the true samples were distributed this way; our samples would come from this distribution. We don’t know what this true sampling distribution looks like, but, for a second, let’s pretend that we do. Our goal is to infer, from a single sample, what number this true sampling distribution is centered around. NB: the sampling distribution is not the distribution of a sample.
While we don’t know the real sampling distribution, we do know the properties of a hypothetical distribution, i.e., of our null. For one, we know that it is centered around zero, because we assumed it. Whatever conclusion we draw about the real sampling distribution will only make sense relative to this hypothetical sampling distribution, i.e., our null.
Let’s pick one sample. Of course, we’re picking that sample from the real distribution, not the hypothetical distribution. Let’s add the mean of that sample to our chart (solid vertical line).
The area to the right of that vertical line and under the solid bell curve is our p-value (red shaded area). It is the probability of observing a value more extreme than our estimate if the sample had come from the sampling distribution centered around the null hypothesis. The further our realized sample mean is from the hypothesized value, the smaller that red shaded area, and the smaller the p-value.
Let’s add the lower bound of the confidence interval around our sample mean (slightly thicker vertical line). It moves together with our sample mean, and so, of course, does our p-value.
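The outputs below were presumably produced along these lines; here is a minimal R sketch (the distribution parameters and seed are my assumptions, not the author’s code):

```r
# Minimal sketch: draw one sample from a hypothetical "real" distribution,
# then report a two-sided 95% CI and the one-tailed p-value against
# the null that the true mean is zero.
set.seed(42)                          # assumption: any seed works
x <- rnorm(30, mean = 0.2, sd = 0.5)  # assumption: the "real" distribution

tt <- t.test(x, mu = 0, conf.level = 0.95)
ci <- tt$conf.int
p_one_tailed <- pt(tt$statistic, df = tt$parameter, lower.tail = FALSE)

cat("95% confidence interval is [", ci[1], ",", ci[2],
    "] ; p-value is", p_one_tailed, "\n")
```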
## 95% confidence interval is [0.03051649,0.4225165] ; p-value is 0.01175128
The p-value is small, i.e., our observed sample estimate is quite far from the null hypothesis. The confidence interval does not contain zero (i.e., the null hypothesis value). We may safely reject the null.
## 99% confidence interval is [-0.03148351,0.4845165] ; p-value is 0.01175128
The p-value is the same. The confidence interval is now wider and does contain zero. We are more confident, but we are much less precise. I am 100% confident that the average human height is somewhere between 0 and 3 meters: confident, but imprecise. I’m only about 90% confident that the average human height is between one and two meters.
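To see the confidence/precision trade-off numerically, we can re-run the interval at several confidence levels on the same sample (a sketch, reusing the hypothetical `x` from above):

```r
# Same sample, widening intervals as the confidence level rises.
for (cl in c(0.90, 0.95, 0.99)) {
  ci <- t.test(x, mu = 0, conf.level = cl)$conf.int
  cat(sprintf("%.0f%% CI: [%.4f, %.4f]\n", 100 * cl, ci[1], ci[2]))
}
```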
## 95% confidence interval is [-0.05069999,0.3413] ; p-value is 0.07311186
This time our p-value is “large” (it is bigger than 0.025, the one-tailed cutoff that matches a two-sided 95% confidence interval); our estimate is quite close to the null hypothesis, and our confidence interval, of course, captures the null. We may not reject the null hypothesis. Our observed value is too close to the hypothesized value.
## 95% confidence interval is [0.07642825,0.4684283] ; p-value is 0.003222068
## 95% confidence interval is [-0.1792298,0.2127702] ; p-value is 0.433409
## 95% confidence interval is [0.01720735,0.4092074] ; p-value is 0.0165004
Of course, we’ll never know the true sampling distribution, i.e., the “Truth”, but thanks to statistical inference we can make good guesses about it.
Let’s generate some data.
Suppose our \(x_1\) comes from a uniform distribution bounded below by 1 and bounded above by 3; suppose \(x_2\) is dependent on \(x_1\). Let’s further create \(y\) that is dependent on \(x_1\) and \(x_2\).
\[X_1 \sim U[1, 3]\] \[X_2 = 0.5*X_1 + \varepsilon\]
\[\varepsilon \sim N(0, 0.5)\]
\[Y = \alpha + \beta_1x_1 + \beta_2x_2 + \varepsilon'\]
Let’s set the values of \(\alpha\), \(\beta_1\) and \(\beta_2\) at 1, 1 and 3 respectively.
\[Y = 1 + 1*x_1 + 3*x_2 + \varepsilon'\]
\[\varepsilon' \sim N(0, 1)\]
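A minimal R sketch of this data-generating process (the seed is my assumption; 0.5 and 1 are treated as standard deviations):

```r
set.seed(1)                                   # assumption: no seed given in the text
n  <- 30                                      # matches N = 30 in the table below
x1 <- runif(n, min = 1, max = 3)              # X1 ~ U[1, 3]
x2 <- 0.5 * x1 + rnorm(n, sd = 0.5)           # X2 = 0.5*X1 + eps
y  <- 1 + 1 * x1 + 3 * x2 + rnorm(n, sd = 1)  # Y = 1 + x1 + 3*x2 + eps'
```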
Our aim is to estimate \(\widehat{\alpha}\), \(\widehat{\beta_1}\) and \(\widehat{\beta_2}\):
\[Y = \widehat{\alpha} + \widehat{\beta_1}x_1 + \widehat{\beta_2}x_2\]
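The first table below regresses \(Y\) on \(X_1\) alone, leaving \(X_2\) out. Something like the following would produce it (stargazer is my guess for the table formatter):

```r
short_fit <- lm(y ~ x1)  # deliberately omits x2
summary(short_fit)
# stargazer::stargazer(short_fit, type = "text")  # assumed formatting step
```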
|  | Y |
|---|---|
| X1 | 2.337*** |
|  | (0.463) |
| Constant | 1.351 |
|  | (1.028) |
| N | 30 |
| R2 | 0.476 |
| Adjusted R2 | 0.458 |
| \*p < .1; \*\*p < .05; \*\*\*p < .01 |  |
Note that the coefficient on \(X_1\) is over-estimated. This is omitted-variable bias: \(X_2\) both affects \(Y\) and is correlated with \(X_1\), so \(\widehat{\beta_1}\) absorbs part of \(X_2\)’s effect; in expectation it picks up \(1 + 3 \times 0.5 = 2.5\) rather than 1.
A regression line is a better summary of the data than the average of the dependent variable, and we can even quantify how much better it is.
\[ 1 - \frac{ \color{blue} {\sum_{i}(y_{i} - \widehat{y}_{i})^2} } { \color{red} {\sum_{i}(y_{i} - \overline{y})^2} } \]
\(\color{red} {\sum_{i}(y_{i} - \overline{y})^2}\) is the total squared distance from the actual values of the dependent variable to their average.
\(\color{blue} {\sum_{i}(y_{i} - \widehat{y}_{i})^2}\) is the total squared distance from the actual values of the dependent variable to the predicted values.
You could also think of \(\color{red} {\sum_{i}(y_{i} - \overline{y})^2}\) as the error made when using the average of the dependent variable as a summary. \(\color{blue} {\sum_{i}(y_{i} - \widehat{y}_{i})^2}\) would then, of course, be the error made when using the regression line as a summary.
The error from using the least-squares regression line as a summary is never higher. Put simply, the regression line is an improvement over the simple mean of the dependent variable as a summary, and \(1 - \frac{ \color{blue} {\sum_{i}(y_{i} - \widehat{y}_{i})^2} } { \color{red} {\sum_{i}(y_{i} - \overline{y})^2} }\) measures that improvement.
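Computing this measure by hand takes three lines; a sketch, reusing `short_fit` from above:

```r
rss <- sum((y - fitted(short_fit))^2)  # blue: distance to the regression line
tss <- sum((y - mean(y))^2)            # red: distance to the mean line
1 - rss / tss                          # same as summary(short_fit)$r.squared
```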
## [1] 0.4762929
Let’s compare the squared summed errors of our “guess” line and the mean line. As we drag the guess line around, the improvement measure changes:
## [1] 0.4416787
## [1] 0.4247651
## [1] 0.06663751
## [1] -0.9416435
## [1] -0.3616776
## [1] 0.06663751
## [1] 0.3433019
## [1] 0.4683157
## [1] 0.4762929
Note that the measure can even turn negative when a guess line fits worse than the flat mean line, and that no guess line beats 0.4762929, the value achieved by the least-squares fit.
This number has a very specific name…
\[Y = 1 + 1*x_1 + 3*x_2 + \varepsilon' \text{ (this is Truth)}\] \[Y = \widehat{\alpha} + \widehat{\beta_1}*x_1 \text{ (this is the model)}\]
|  | Y |
|---|---|
| X1 | 2.337*** |
|  | (0.463) |
| Constant | 1.351 |
|  | (1.028) |
| N | 30 |
| R2 | 0.476 |
| Adjusted R2 | 0.458 |
| \*p < .1; \*\*p < .05; \*\*\*p < .01 |  |
Note that there is a relationship between \(X_1\) and \(X_2\).
\[Y = 1 + 1*x_1 + 3*x_2 + \varepsilon' \text{ (this is Truth)}\] \[Y = \widehat{\alpha} + \widehat{\beta_1}*x_1 + \widehat{\beta_2}*x_2 \text{ (this is the model)}\]
|  | Y |
|---|---|
| X1 | 1.694*** |
|  | (0.286) |
| X2 | 2.404*** |
|  | (0.327) |
| Constant | 0.123 |
|  | (0.627) |
| N | 30 |
| R2 | 0.826 |
| Adjusted R2 | 0.813 |
| \*p < .1; \*\*p < .05; \*\*\*p < .01 |  |
Observe that R-squared is not 1 even though the model specification is correct: the irreducible noise \(\varepsilon'\) caps how much of the variation in \(Y\) even the true model can explain.
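To see why, we can shrink the noise in a hypothetical rerun; R-squared then climbs toward 1 (a sketch):

```r
full_fit <- lm(y ~ x1 + x2)               # correct specification
summary(full_fit)$r.squared               # about 0.83: capped by Var(eps')

y_quiet <- 1 + x1 + 3 * x2 + rnorm(length(x1), sd = 0.1)  # much less noise
summary(lm(y_quiet ~ x1 + x2))$r.squared  # now close to 1
```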
Let’s generate some more data.
Suppose our \(x_1\) comes from a uniform distribution bounded below by -5 and bounded above by 5; suppose \(x_2\) is independent of \(x_1\). Let’s further create \(y\) that is dependent on \(x_1\) and \(x_2\).
\[X_1 \sim U[-5, 5]\] \[X_2 \sim U[-5, 5]\] \[Y = \alpha + 2*X_1 + 3*X_2 + X_1*X_2 + \varepsilon\]
\[\varepsilon \sim N(0, 5)\]
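A sketch of this second data-generating process (the value of \(\alpha\) is never stated; the fitted constant of about 0.95 later on suggests 1, so I assume it, and 5 is treated as a standard deviation):

```r
set.seed(2)                            # assumption: no seed given in the text
n  <- 100                              # matches N = 100 in the tables below
x1 <- runif(n, min = -5, max = 5)      # X1 ~ U[-5, 5]
x2 <- runif(n, min = -5, max = 5)      # independent of x1 this time
y  <- 1 + 2 * x1 + 3 * x2 + x1 * x2 + rnorm(n, sd = 5)  # alpha assumed = 1
```

As before, the first regression below uses only \(X_1\).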
|  | Y |
|---|---|
| X1 | 1.572*** |
|  | (0.492) |
| Constant | -0.129 |
|  | (1.429) |
| N | 100 |
| R2 | 0.094 |
| Adjusted R2 | 0.085 |
| \*p < .1; \*\*p < .05; \*\*\*p < .01 |  |
Note that the coefficient on \(X_1\) is under-estimated. Because \(X_2\) is independent of \(X_1\) and centered at zero, the omission does not bias \(\widehat{\beta_1}\) in expectation; the omitted \(3X_2\) and \(X_1 X_2\) terms simply end up in the residual, which makes the estimate noisy.
## [1] -0.3590537
Let’s compare the squared summed errors of our “guess” line and the mean line. As we drag the guess line around, the improvement measure changes:
## [1] 0.0148009
## [1] 0.06029736
## [1] 0.07838553
## [1] 0.09446374
## [1] 0.08561324
## [1] -0.1584601
## [1] 0.08561324
## [1] 0.09446374
This number has a very specific name…
|  | Y |
|---|---|
| X1 | 1.572*** |
|  | (0.492) |
| Constant | -0.129 |
|  | (1.429) |
| N | 100 |
| R2 | 0.094 |
| Adjusted R2 | 0.085 |
| \*p < .1; \*\*p < .05; \*\*\*p < .01 |  |
Note that there is no relationship between \(X_1\) and \(X_2\).
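The next table adds \(X_2\) but still omits the interaction; presumably something like:

```r
additive_fit <- lm(y ~ x1 + x2)  # no interaction term yet
summary(additive_fit)
```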
|  | Y |
|---|---|
| X1 | 1.287*** |
|  | (0.355) |
| X2 | 3.649*** |
|  | (0.380) |
| Constant | 1.854* |
|  | (1.049) |
| N | 100 |
| R2 | 0.536 |
| Adjusted R2 | 0.526 |
| \*p < .1; \*\*p < .05; \*\*\*p < .01 |  |
\[Y = \alpha + 2*X_1 + 3*X_2 + X_1*X_2 + \varepsilon\]
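Fitting the correctly specified model with the interaction term (a sketch):

```r
int_fit <- lm(y ~ x1 * x2)  # shorthand for x1 + x2 + x1:x2
summary(int_fit)
```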
|  | Y |
|---|---|
| X1 | 2.084*** |
|  | (0.197) |
| X2 | 3.325*** |
|  | (0.205) |
| X1 * X2 | 1.047*** |
|  | (0.067) |
| Constant | 0.950* |
|  | (0.566) |
| N | 100 |
| R2 | 0.868 |
| Adjusted R2 | 0.863 |
| \*p < .1; \*\*p < .05; \*\*\*p < .01 |  |
\[X_1 \sim U[-5, 5]\] \[X_2 \sim U[-5, 5]\] \[Y = \alpha + 2*X_1 + 3*X_2 + X_1*X_2 + \varepsilon\]
\[\varepsilon \sim N(0, 5)\]
Note that even though \(X_1\) and \(X_2\) were not correlated, the effect of each was dependent on the value of the other. Whether the use of interaction terms is justified or not has nothing to do with the correlation of the independent variables.