hubble
##    distance velocity
## 1     0.032      170
## 2     0.034      290
## 3     0.214     -130
## 4     0.263      -70
## 5     0.275     -185
## 6     0.275     -220
## 7     0.450      200
## 8     0.500      290
## 9     0.500      270
## 10    0.630      200
## 11    0.800      300
## 12    0.900      -30
## 13    0.900      650
## 14    0.900      150
## 15    0.900      500
## 16    1.000      920
## 17    1.100      450
## 18    1.100      500
## 19    1.400      500
## 20    1.700      960
## 21    2.000      500
## 22    2.000      850
## 23    2.000      800
## 24    2.000     1090
plot(hubble$distance ~ hubble$velocity, pch = 20,
     xlab = "recession velocity (km/sec)", ylab = "distance (megaparsecs)",
     main = "Scatter plot of distance against recession velocity", 
     bty = "n")
# Here, I plot distance against velocity

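# Mark the origin (zero velocity, zero distance) in red
points(0, 0, col = "red", pch = 16)

# Dashed grey vertical line at zero velocity
abline(v = 0, col = "grey", lwd = 2, lty = 2)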

It appears that distance and velocity are positively associated: the values of distance tend to be larger for nebulae with large velocities than for nebulae with smaller velocities. Also, this relationship appears to be approximately linear, at least over the range of velocities available.

Scientists then wondered how the positive linear association between distance and velocity could have arisen. The result was ‘Big Bang’ theory. This theory proposes that the Universe started with a Big Bang at a single point in space a very long time ago, scattering material around the surface of an ever-expanding sphere. If Big Bang theory is correct then the relationship between distance (Y) and recession velocity (X) should be of the form

\[ Y = TX \] where T is the age of the Universe when the observations were made. This is called Hubble’s Law. In other words, distance, Y, should depend linearly on velocity, X. \(H = 1/T\) is called Hubble’s constant.

The points in the scatterplot above do not lie exactly on a straight line, partly because the values of distance are not exact: they include measurement error.

Also, there may have been astronomical events since the Big Bang that have further weakened the supposed linear relationship between distance and velocity.

If we look at nebulae with the same value, x, of velocity, the measured value of distance, Y, varies from one nebula to another.

For example, the 4 nebulae with velocities of 500 km/sec have distances 0.9, 1.1, 1.4 and 2.0 Mpc. So, for a given value of velocity there is variability in their distances from the Earth. Therefore, Y ∣ X = x is a random variable, with conditional mean E(Y ∣ X = x) and conditional variance var(Y ∣ X = x).
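To make the conditional mean and variance concrete, here is a minimal sketch that pulls out the nebulae with velocity 500 km/sec and computes sample versions of E(Y ∣ X = 500) and var(Y ∣ X = 500). The data frame is retyped from the printout at the top of this section, and the name `d500` is only illustrative.

```r
# Hubble's 24 nebulae, retyped from the data printed at the top of this section
hubble <- data.frame(
  distance = c(0.032, 0.034, 0.214, 0.263, 0.275, 0.275, 0.450, 0.500,
               0.500, 0.630, 0.800, 0.900, 0.900, 0.900, 0.900, 1.000,
               1.100, 1.100, 1.400, 1.700, 2.000, 2.000, 2.000, 2.000),
  velocity = c(170, 290, -130, -70, -185, -220, 200, 290,
               270, 200, 300, -30, 650, 150, 500, 920,
               450, 500, 500, 960, 500, 850, 800, 1090)
)

# Distances of the nebulae whose recorded velocity is exactly 500 km/sec
d500 <- hubble$distance[hubble$velocity == 500]
d500

mean(d500)  # sample estimate of E(Y | X = 500)
var(d500)   # sample estimate of var(Y | X = 500)
```

With only four observations these are rough estimates (a mean of 1.35 Mpc and a variance of 0.23 Mpc squared), but they show that distance is not determined by velocity alone.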

In the figure it looks possible that there is a straight line relationship between E(Y ∣ X = x) and x. Therefore we consider fitting a simple linear regression model of Y on x. You could think of this as a way to draw a ‘line of best fit’ through the points in the figure.

I skip some parts here and go directly to the conclusion.

Suppose that we have paired data (\(x_1\), \(y_1\)), … , (\(x_n\), \(y_n\)). How can we fit a simple linear regression model to these data? Initially, our aim is to use estimators \(\hat{\alpha}\) and \(\hat{\beta}\) of \(\alpha\) and \(\beta\) to produce an estimated regression line:

\[ y = \hat{\alpha} + \hat{\beta} \times x\]

There are many possible estimators of \(\alpha\) and \(\beta\) that could be used. A standard approach, which produces estimators with some nice properties, is least squares estimation. Firstly, we rearrange the equation above to define residuals

\[ R_{i} = Y_{i}-(\hat{\alpha} + \hat{\beta} \times x_{i}) = Y_{i} - \hat{Y}_{i}, \qquad i = 1, \ldots, n, \] the differences between the observed values \(Y_{i}\) and the fitted values \(\hat{Y}_{i} = \hat{\alpha} + \hat{\beta} \times x_{i}\), \(i = 1, \ldots, n\), given by the estimated regression line.

The least squares estimators have the property that they minimise the sum of squared residuals:

\[\sum_{i = 1}^{n} (Y_{i} -\hat{\alpha} - \hat{\beta} \times x_{i})^2\]
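For reference, here is a short sketch of where the minimisers come from. It is the standard textbook result, not something derived in these notes, and it assumes that the \(x_{i}\) are not all equal; \(\bar{x}\) and \(\bar{Y}\) denote the sample means. Setting the partial derivatives of the sum of squares with respect to \(\hat{\alpha}\) and \(\hat{\beta}\) to zero gives

\[ \sum_{i = 1}^{n} (Y_{i} -\hat{\alpha} - \hat{\beta} \times x_{i}) = 0, \qquad \sum_{i = 1}^{n} x_{i} (Y_{i} -\hat{\alpha} - \hat{\beta} \times x_{i}) = 0, \]

and solving these two equations yields

\[ \hat{\beta} = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x})(Y_{i} - \bar{Y})}{\sum_{i = 1}^{n} (x_{i} - \bar{x})^2}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta} \times \bar{x}. \]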

The following figures show the least squares regression lines under 3 different models:

where \(\beta_{2} = T\) is the age of the Universe.

These figures also show the sizes of the residuals and the residual sums of squares (RSS). From the plots, and the relative sizes of the RSS values, it seems clear that velocity x explains some of the variability in the values of distance Y. The residual sum of squares, \(RSS_{3}\), of Model 3 is smaller than the residual sum of squares, \(RSS_{2}\), of Model 2. It is impossible that \(RSS_{3} > RSS_{2}\), because Model 2 is the special case of Model 3 in which the intercept is fixed at 0, so the least squares fit of Model 3 can always do at least as well.
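The comparison of \(RSS_{2}\) and \(RSS_{3}\) can be checked numerically. The sketch below retypes the data from the printout at the top of this section and fits both models with `lm()`. The object names `model2`, `model3`, `rss2`, `rss3` and `rss_at_model2` are illustrative only.

```r
# Hubble's 24 nebulae, retyped from the data printed at the top of this section
hubble <- data.frame(
  distance = c(0.032, 0.034, 0.214, 0.263, 0.275, 0.275, 0.450, 0.500,
               0.500, 0.630, 0.800, 0.900, 0.900, 0.900, 0.900, 1.000,
               1.100, 1.100, 1.400, 1.700, 2.000, 2.000, 2.000, 2.000),
  velocity = c(170, 290, -130, -70, -185, -220, 200, 290,
               270, 200, 300, -30, 650, 150, 500, 920,
               450, 500, 500, 960, 500, 850, 800, 1090)
)

model2 <- lm(distance ~ velocity - 1, data = hubble)  # no intercept (through the origin)
model3 <- lm(distance ~ velocity, data = hubble)      # intercept and slope

rss2 <- sum(residuals(model2)^2)
rss3 <- sum(residuals(model3)^2)
c(RSS2 = rss2, RSS3 = rss3)

# Model 3's sum of squares, evaluated at intercept 0 and Model 2's slope,
# reproduces RSS2, so the minimum over both parameters cannot exceed it
rss_at_model2 <- sum((hubble$distance - 0 - coef(model2) * hubble$velocity)^2)
all.equal(rss_at_model2, rss2)
rss3 <= rss2
```

The last two lines illustrate the nesting argument: Model 2's fit is one of the candidates that Model 3 minimises over.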

A key question is whether \(RSS_{3}\) is so much smaller than \(RSS_{2}\) that we would choose Model 3 over Model 2. This question is considered in STAT0003 (not my current course). Model 2 is an example of regression through the origin, where it is assumed that the intercept equals 0. We should only fit this kind of model if we have a good reason to. Here Hubble’s Law gives us a good reason.

I omit Model 1 and look only at Models 2 and 3.

usual <- lm(distance ~ velocity, data = hubble)
coef_values <- coef(usual)
coef_values
## (Intercept)    velocity 
## 0.399098216 0.001372936
rss <- sum(residuals(usual) ^ 2)

plot(distance ~ velocity, data = hubble, pch = 16, main = paste("Fit the (usual) model \n RSS =", round(rss, 2)))
abline(coef = coef_values)
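As a cross-check on the `lm()` output, the sketch below computes the intercept and slope by hand using the standard closed-form least squares expressions (slope equal to the sum of cross-deviations divided by the sum of squared deviations of x, intercept equal to mean of y minus slope times mean of x). The data are retyped from the printout at the top of this section. The names `alpha_hat` and `beta_hat` are illustrative.

```r
# Hubble's 24 nebulae, retyped from the data printed at the top of this section
hubble <- data.frame(
  distance = c(0.032, 0.034, 0.214, 0.263, 0.275, 0.275, 0.450, 0.500,
               0.500, 0.630, 0.800, 0.900, 0.900, 0.900, 0.900, 1.000,
               1.100, 1.100, 1.400, 1.700, 2.000, 2.000, 2.000, 2.000),
  velocity = c(170, 290, -130, -70, -185, -220, 200, 290,
               270, 200, 300, -30, 650, 150, 500, 920,
               450, 500, 500, 960, 500, 850, 800, 1090)
)
x <- hubble$velocity
y <- hubble$distance

# Closed-form least squares estimates for the model with an intercept
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)
c(intercept = alpha_hat, slope = beta_hat)

# Should agree with lm() up to numerical precision
all.equal(c(alpha_hat, beta_hat),
          unname(coef(lm(distance ~ velocity, data = hubble))))
```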

unusual <- lm(distance ~ velocity - 1, data = hubble) 
coef(unusual)
##    velocity 
## 0.001921806
rss <- sum(residuals(unusual) ^ 2)
plot(distance ~ velocity, data = hubble, pch = 16, main = paste("Regression through the origin \n RSS =", round(rss, 2)))
abline(a = 0, b = coef(unusual))
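For regression through the origin there is only one parameter, and minimising the sum of squared residuals gives the standard estimator: the sum of the products of x and y divided by the sum of the squared x values. The sketch below checks this against `lm()`, using data retyped from the printout at the top of this section. The name `beta2_hat` is illustrative.

```r
# Hubble's 24 nebulae, retyped from the data printed at the top of this section
hubble <- data.frame(
  distance = c(0.032, 0.034, 0.214, 0.263, 0.275, 0.275, 0.450, 0.500,
               0.500, 0.630, 0.800, 0.900, 0.900, 0.900, 0.900, 1.000,
               1.100, 1.100, 1.400, 1.700, 2.000, 2.000, 2.000, 2.000),
  velocity = c(170, 290, -130, -70, -185, -220, 200, 290,
               270, 200, 300, -30, 650, 150, 500, 920,
               450, 500, 500, 960, 500, 850, 800, 1090)
)
x <- hubble$velocity
y <- hubble$distance

# Least squares slope when the intercept is fixed at 0
beta2_hat <- sum(x * y) / sum(x^2)
beta2_hat

# Should agree with the slope from lm() fitted without an intercept
all.equal(beta2_hat, unname(coef(lm(distance ~ velocity - 1, data = hubble))))
```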

We fit this model because Hubble’s Law gives us a reason to. The fitted slope, \(\hat{\theta}\), estimates the age T of the Universe: \[ \hat{\theta} \approx 0.00192 ~\text{Mpc/(km/sec)} \approx 2 ~\text{billion years} \] Freedman et al. (2001) updated Hubble’s dataset …

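# Regression through the origin for the updated dataset, hubble2
unusual2 <- lm(distance ~ velocity - 1, data = hubble2) 
# Plot both datasets together: by default matplot() draws the first dataset
# as black "1"s and the second as red "2"s, matching the line colours below
matplot(cbind(hubble[,2], hubble2[,2]), cbind(hubble[,1], hubble2[,1]),
        xlab = "velocity (km/sec)", ylab = "distance (megaparsecs)")
abline(a = 0, b = coef(unusual), col = "black")
abline(a = 0, b = coef(unusual2), col = "red")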

\[ \hat{\theta} \approx 0.0123 ~\text{Mpc/(km/sec)} \approx 12 ~\text{billion years} \]
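The conversions from Mpc/(km/sec) to years quoted above can be sketched as follows. The constants are assumptions not stated in these notes: one megaparsec is taken to be approximately 3.0857e19 km, and a year is taken to be 365.25 days. The helper `slope_to_years()` is a hypothetical name used only for illustration.

```r
km_per_mpc   <- 3.0857e19          # approximate kilometres in one megaparsec (assumed)
sec_per_year <- 365.25 * 24 * 3600 # seconds in a year of 365.25 days (assumed)

# A slope in Mpc/(km/sec) is a time: (Mpc -> km) divided by (km/sec) leaves seconds
slope_to_years <- function(slope) slope * km_per_mpc / sec_per_year

slopes <- c(hubble_1929 = 0.00192, updated = 0.0123)  # rounded estimates quoted above
ages_billion_years <- slope_to_years(slopes) / 1e9
ages_billion_years

# Corresponding estimates of Hubble's constant H = 1/T, in km/sec per Mpc
1 / slopes
```

Under these assumptions the two slopes correspond to roughly 1.9 and 12 billion years, in line with the rounded figures quoted above.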