October 4th, 2023

A way to look at inference

Another way to look at it

Suppose the true sampling distribution looked this way; our samples would come from it. We don’t know what this true sampling distribution looks like, but, for a second, let’s pretend that we do. Our goal is to infer, from a single sample, what number this true sampling distribution is centered around. NB: a sampling distribution is not the distribution of a sample.

Another way to look at it

While we don’t know the real sampling distribution, we do know the properties of a hypothetical one, i.e., of our null. For one, we know that it is centered around zero, because we assumed it. Whatever conclusion we reach about the real sampling distribution will only make sense relative to this hypothetical sampling distribution, i.e., our null.

Sample

Let’s pick one sample. Of course, we’re picking that sample from the real distribution, not the hypothetical distribution. Let’s add the mean of that sample to our chart (solid vertical line).

p-value as a measure of distance

The area to the right of that vertical line and below the solid bell curve would be our p-value (red shaded area). It is the probability of observing a value more extreme than our estimate if the sample indeed came from the sampling distribution centered around the null hypothesis. The further our realized sample mean is from the hypothesized value, the smaller that red area, and the smaller the p-value.
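As a rough sketch, that red area is just a normal tail probability. All numbers below are made up for illustration:

```r
# Made-up illustration: null centered at 0, standard error of 0.1,
# observed sample mean of 0.23.
se   <- 0.1
xbar <- 0.23
1 - pnorm(xbar, mean = 0, sd = se)  # area to the right of xbar under the null
```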

Confidence interval as a measure of distance

Let’s add the lower bound of the confidence interval around our sample mean (slightly thicker vertical line). It moves together with our sample mean, and so, of course, does our p-value.

## 95% confidence interval is [0.03051649,0.4225165] ; p-value is 0.01175128

The p-value is small, i.e., our observed sample estimate is quite far from the null hypothesis. The confidence interval does not contain zero (i.e., the null hypothesis). We may safely reject the null.
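Output in this style comes from R’s t.test(); here is a minimal sketch, with the sample’s size, distribution, and seed all assumed (it will not reproduce the exact numbers above):

```r
set.seed(1)                            # assumed seed
s <- rnorm(30, mean = 0.2, sd = 0.5)   # one sample from a hypothetical "real" distribution
t.test(s, mu = 0, conf.level = 0.95)   # 95% CI around the sample mean; p-value against a null of 0
```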

Confidence interval and a precision trade-off

## 99% confidence interval is [-0.03148351,0.4845165] ; p-value is 0.01175128

The p-value is the same. The confidence interval is now wider and does contain zero. We are more confident, but much less precise. I am 100% confident that average human height is somewhere between 0 and 3 meters: confident, but imprecise. I’m only about 90% confident that average human height is between one and two meters.
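Reusing the assumed sample s from the sketch above: only conf.level changes, so the interval widens while the p-value stays put.

```r
t.test(s, mu = 0, conf.level = 0.99)   # wider interval, identical p-value
```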

Another sample

## 95% confidence interval is [-0.05069999,0.3413] ; p-value is 0.07311186

This time our p-value is “large” (it’s bigger than 0.025); our estimate is quite close to the null hypothesis, and our confidence interval, of course, captures the null. We may not reject the null hypothesis. Our observed value is a bit too close to the hypothesized value.

And another one

## 95% confidence interval is [0.07642825,0.4684283] ; p-value is 0.003222068

DJ Khaled

## 95% confidence interval is [-0.1792298,0.2127702] ; p-value is 0.433409

Expected disadvantage of ignorance

## 95% confidence interval is [0.01720735,0.4092074] ; p-value is 0.0165004

Of course, we’ll never know the true sampling distribution, i.e., the “Truth”, but thanks to statistical inference we can make good guesses about it.

Data Generating Process

Let’s generate some data.

Suppose our \(x_1\) comes from a uniform distribution bounded below by 1 and above by 3; suppose \(x_2\) depends on \(x_1\). Let’s further create a \(y\) that depends on both \(x_1\) and \(x_2\).

\[X_1 \sim U[1, 3]\] \[X_2 = 0.5*X_1 + \varepsilon\]

\[\varepsilon \sim N(0, 0.5)\]

\[Y = \alpha + \beta_1x_1 + \beta_2x_2 + \varepsilon'\]

Let’s set the values of \(\alpha\), \(\beta_1\) and \(\beta_2\) at 1, 1 and 3 respectively.

\[Y = 1 + 1*x_1 + 3*x_2 + \varepsilon'\]

\[\varepsilon' \sim N(0, 1)\]

Our aim is to estimate \(\widehat{\alpha}\), \(\widehat{\beta_1}\) and \(\widehat{\beta_2}\):

\[Y = \widehat{\alpha} + \widehat{\beta_1}x_1 + \widehat{\beta_2}x_2\]
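A minimal sketch of this data generating process in R (the seed is assumed; N = 30 matches the tables below; \(N(0, 0.5)\) is read here as a standard deviation of 0.5):

```r
set.seed(2023)                                  # assumed seed
n  <- 30
x1 <- runif(n, min = 1, max = 3)                # X1 ~ U[1, 3]
x2 <- 0.5 * x1 + rnorm(n, mean = 0, sd = 0.5)   # X2 depends on X1
y  <- 1 + 1 * x1 + 3 * x2 + rnorm(n, mean = 0, sd = 1)

lm(y ~ x1)        # the short model: x2 omitted
lm(y ~ x1 + x2)   # the full model
```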

Summarizing

Summarizing with more information

\[Y = 1 + 1*x_1 + 3*x_2 + \varepsilon' \text{ (this is Truth)}\] \[Y = \widehat{\alpha} + \widehat{\beta_1}*x_1 \text{ (this is the model)}\]

                        Y
------------------------------
X1                   2.337***
                     (0.463)
Constant             1.351
                     (1.028)
------------------------------
N                    30
R2                   0.476
Adjusted R2          0.458
------------------------------
* p < .1; ** p < .05; *** p < .01

Note that the coefficient on \(X_1\) is over-estimated: the omitted \(X_2\) is positively correlated with \(X_1\) and itself has a positive effect on \(Y\), so \(X_1\) gets credit for some of \(X_2\)’s work. This is omitted variable bias.

A regression line is a better summary

A regression line is a better summary than the average of the dependent variable, and we can even quantify how much better it is.

\[ 1 - \frac{ \color{blue} {\sum_{i}(y_{i} - \widehat{y}_{i})^2} } { \color{red} {\sum_{i}(y_{i} - \overline{y})^2} } \]

\(\color{red} {\sum_{i}(y_{i} - \overline{y})^2}\) is the sum of squared distances from the actual values of the dependent variable to their average.

\(\color{blue} {\sum_{i}(y_{i} - \widehat{y}_{i})^2}\) is the sum of squared distances from the actual values of the dependent variable to the predicted values.

You could also think of \(\color{red} {\sum_{i}(y_{i} - \overline{y})^2}\) as the total error when using the average of the dependent variable as a summary. \(\color{blue} {\sum_{i}(y_{i} - \widehat{y}_{i})^2}\) would then be the total error when using the regression line as a summary.

The error when using the regression line as a summary will normally be lower. Put simply, the regression line improves on the simple mean of the dependent variable as a summary, and \(1 - \frac{ \color{blue} {\sum_{i}(y_{i} - \widehat{y}_{i})^2} } { \color{red} {\sum_{i}(y_{i} - \overline{y})^2} }\) measures that improvement.
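Computing this measure by hand in R, assuming the x1 and y generated earlier, and checking it against lm()’s own R-squared:

```r
fit    <- lm(y ~ x1)
ss_res <- sum((y - fitted(fit))^2)   # blue: errors around the regression line
ss_tot <- sum((y - mean(y))^2)       # red: errors around the mean line
1 - ss_res / ss_tot                  # equals summary(fit)$r.squared
```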

Could we have done better?

Let’s compare the summed squared errors of our “guess” line and the mean line. Tilting the guess line step by step, the improvement measure takes the following values, one per frame:

## [1] 0.4762929
## [1] 0.4416787
## [1] 0.4247651
## [1] 0.06663751
## [1] -0.9416435
## [1] -0.3616776
## [1] 0.06663751
## [1] 0.3433019
## [1] 0.4683157
## [1] 0.4762929

The measure turns negative whenever the guess line fits worse than the mean line itself, and it peaks at 0.4762929, the value delivered by the regression line.
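A minimal sketch of this measure for an arbitrary guess line, assuming the x1 and y from earlier (the function name improvement is ours):

```r
# Improvement of arbitrary predictions over the mean line:
# 1 - SSE(guess) / SSE(mean)
improvement <- function(y, y_guess) {
  1 - sum((y - y_guess)^2) / sum((y - mean(y))^2)
}

# Hypothetical usage: a guess line with intercept 2 and slope 1.5
improvement(y, 2 + 1.5 * x1)
```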

This number has a very specific name…

R-squared and SSR

R-squared

\[Y = 1 + 1*x_1 + 3*x_2 + \varepsilon' \text{ (this is Truth)}\] \[Y = \widehat{\alpha} + \widehat{\beta_1}*x_1 \text{ (this is the model)}\]

                        Y
------------------------------
X1                   2.337***
                     (0.463)
Constant             1.351
                     (1.028)
------------------------------
N                    30
R2                   0.476
Adjusted R2          0.458
------------------------------
* p < .1; ** p < .05; *** p < .01

Summarizing with even more information

Note that there is a relationship between \(X_1\) and \(X_2\).

\[Y = 1 + 1*x_1 + 3*x_2 + \varepsilon' \text{ (this is Truth)}\] \[Y = \widehat{\alpha} + \widehat{\beta_1}*x_1 + \widehat{\beta_2}*x_2 \text{ (this is the model)}\]

                        Y
------------------------------
X1                   1.694***
                     (0.286)
X2                   2.404***
                     (0.327)
Constant             0.123
                     (0.627)
------------------------------
N                    30
R2                   0.826
Adjusted R2          0.813
------------------------------
* p < .1; ** p < .05; *** p < .01

Observe that R-squared is not 1 even though the model specification is correct: the irreducible noise \(\varepsilon'\) guarantees that even the true model cannot predict \(Y\) perfectly.

A regression line turns into a regression plane

Data Generating Process II

Let’s generate some more data.

Suppose our \(x_1\) comes from a uniform distribution bounded below by -5 and above by 5; suppose \(x_2\) is independent of \(x_1\). Let’s further create a \(y\) that depends on \(x_1\), \(x_2\), and their interaction.

\[X_1 \sim U[-5, 5]\] \[X_2 \sim U[-5, 5]\] \[Y = \alpha + 2*X_1 + 3*X_2 + X_1*X_2 + \varepsilon\]

\[\varepsilon \sim N(0, 5)\]
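A minimal sketch of this second data generating process in R (the seed and \(\alpha = 1\) are assumed; N = 100 matches the tables below; \(N(0, 5)\) is read here as a standard deviation of 5):

```r
set.seed(2023)                      # assumed seed
n  <- 100
x1 <- runif(n, min = -5, max = 5)   # X1 ~ U[-5, 5]
x2 <- runif(n, min = -5, max = 5)   # X2 ~ U[-5, 5], independent of X1
y  <- 1 + 2 * x1 + 3 * x2 + x1 * x2 + rnorm(n, mean = 0, sd = 5)  # alpha assumed to be 1
```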

Summarizing

Summarizing with more information

\[Y = \alpha + 2*X_1 + 3*X_2 + X_1*X_2 + \varepsilon \text{ (this is Truth)}\] \[Y = \widehat{\alpha} + \widehat{\beta_1}*X_1 \text{ (this is the model)}\]

                        Y
------------------------------
X1                   1.572***
                     (0.492)
Constant             -0.129
                     (1.429)
------------------------------
N                    100
R2                   0.094
Adjusted R2          0.085
------------------------------
* p < .1; ** p < .05; *** p < .01

Note that the coefficient on \(X_1\) is under-estimated.

A regression line is a better summary

Could we have done better?

Let’s again compare the summed squared errors of our “guess” line and the mean line. Frame by frame, the improvement measure takes the following values:

## [1] -0.3590537
## [1] 0.0148009
## [1] 0.06029736
## [1] 0.07838553
## [1] 0.09446374
## [1] 0.08561324
## [1] -0.1584601
## [1] 0.08561324
## [1] 0.09446374

The best we can do with a single line here is 0.09446374, which matches the R2 of 0.094 in the table above.

This number has a very specific name…

R-squared

\[Y = \alpha + 2*X_1 + 3*X_2 + X_1*X_2 + \varepsilon \text{ (this is Truth)}\] \[Y = \widehat{\alpha} + \widehat{\beta_1}*X_1 \text{ (this is the model)}\]

                        Y
------------------------------
X1                   1.572***
                     (0.492)
Constant             -0.129
                     (1.429)
------------------------------
N                    100
R2                   0.094
Adjusted R2          0.085
------------------------------
* p < .1; ** p < .05; *** p < .01

Summarizing with even more information

Note that there is no relationship between \(X_1\) and \(X_2\).

\[Y = \alpha + 2*X_1 + 3*X_2 + X_1*X_2 + \varepsilon \text{ (this is Truth)}\] \[Y = \widehat{\alpha} + \widehat{\beta_1}*X_1 + \widehat{\beta_2}*X_2 \text{ (this is the model)}\]

                        Y
------------------------------
X1                   1.287***
                     (0.355)
X2                   3.649***
                     (0.380)
Constant             1.854*
                     (1.049)
------------------------------
N                    100
R2                   0.536
Adjusted R2          0.526
------------------------------
* p < .1; ** p < .05; *** p < .01

A regression line turns into a regression plane

Could we have done better?

\[Y = \alpha + 2*X_1 + 3*X_2 + X_1*X_2 + \varepsilon \text{ (this is Truth)}\] \[Y = \widehat{\alpha} + \widehat{\beta_1}*X_1 + \widehat{\beta_2}*X_2 + \widehat{\beta_3}*X_1*X_2 \text{ (this is the model)}\]

                        Y
------------------------------
X1                   2.084***
                     (0.197)
X2                   3.325***
                     (0.205)
X1 * X2              1.047***
                     (0.067)
Constant             0.950*
                     (0.566)
------------------------------
N                    100
R2                   0.868
Adjusted R2          0.863
------------------------------
* p < .1; ** p < .05; *** p < .01

Could we have done better?

3D to 2D

Interaction terms and correlation among independent variables

\[X_1 \sim U[-5, 5]\] \[X_2 \sim U[-5, 5]\] \[Y = \alpha + 2*X_1 + 3*X_2 + X_1*X_2 + \varepsilon\]

\[\varepsilon \sim N(0, 5)\]

Note that even though \(X_1\) and \(X_2\) were not correlated, the effect of each depended on the value of the other. Whether the use of interaction terms is justified has nothing to do with whether the independent variables are correlated.
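In R, such a model is fit with the * formula operator, which expands x1 * x2 into x1 + x2 + x1:x2 (using the x1, x2, y generated above):

```r
lm(y ~ x1 * x2)   # main effects plus the interaction term
```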