hubble
## distance velocity
## 1 0.032 170
## 2 0.034 290
## 3 0.214 -130
## 4 0.263 -70
## 5 0.275 -185
## 6 0.275 -220
## 7 0.450 200
## 8 0.500 290
## 9 0.500 270
## 10 0.630 200
## 11 0.800 300
## 12 0.900 -30
## 13 0.900 650
## 14 0.900 150
## 15 0.900 500
## 16 1.000 920
## 17 1.100 450
## 18 1.100 500
## 19 1.400 500
## 20 1.700 960
## 21 2.000 500
## 22 2.000 850
## 23 2.000 800
## 24 2.000 1090
# Plot distance against recession velocity
plot(distance ~ velocity, data = hubble, pch = 20,
     xlab = "recession velocity (km/sec)", ylab = "distance (megaparsecs)",
     main = "Scatter plot of distance against recession velocity",
     bty = "n")
points(0, 0, col = "red", pch = 16)             # mark the origin
abline(v = 0, col = "grey", lwd = 2, lty = 2)   # zero-velocity reference line
It appears that distance and velocity are positively associated: distances tend to be larger for nebulae with larger velocities than for nebulae with smaller velocities. The relationship also appears to be approximately linear, at least over the range of velocities available.
Scientists then wondered how the positive linear association between distance and velocity could have arisen. The result was the ‘Big Bang’ theory. This theory proposes that the Universe started with a Big Bang at a single point in space a very long time ago, scattering material around the surface of an ever-expanding sphere. If the Big Bang theory is correct, then the relationship between distance (Y) and recession velocity (X) should be of the form
\[ Y = TX \] where T is the age of the Universe when the observations were made. This is called Hubble’s Law. In other words, distance, Y, should depend linearly on velocity, X. \(H = 1/T\) is called Hubble’s constant.
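To make the units concrete, here is a tiny sketch (my own illustration, using a made-up value of T) of how Hubble’s Law maps velocity to distance and how H relates to T:
T_hyp <- 0.002     # hypothetical age of the Universe in Mpc/(km/sec), for illustration only
1 / T_hyp          # the corresponding Hubble's constant H, in (km/sec)/Mpc
T_hyp * 1000       # predicted distance (Mpc) of a nebula receding at 1000 km/sec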
The points in the scatterplot above do not lie exactly on a straight line, partly because the values of distance are not exact: they include measurement error.
Also, there may have been astronomical events since the Big Bang which have further weakened the supposed linear relationship between distance and velocity.
If we look at nebulae with the same value, x, of velocity, the measured value of distance, Y, varies from one nebula to another.
For example, the 4 nebulae with velocities of 500 km/sec have distances 0.9, 1.1, 1.4 and 2.0 Mpc. So, for a given value of velocity there is variability in the distances from the Earth. Therefore, Y ∣ X = x is a random variable, with conditional mean E(Y ∣ X = x) and conditional variance var(Y ∣ X = x).
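As a quick check, we can pull these values out of the data and compute their sample mean and variance as estimates of E(Y ∣ X = 500) and var(Y ∣ X = 500) (this snippet is my own addition):
d500 <- hubble$distance[hubble$velocity == 500]
d500          # 0.9 1.1 1.4 2.0
mean(d500)    # 1.35, an estimate of E(Y | X = 500)
var(d500)     # 0.23, an estimate of var(Y | X = 500)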
In the figure it looks possible that there is a straight line relationship between E(Y ∣ X = x) and x. Therefore we consider fitting a simple linear regression model of Y on x. You could think of this as a way to draw a ‘line of best fit’ through the points in the figure.
I skip some of the derivation here and go directly to how the model is fitted.
Suppose that we have paired data (\(x_1\), \(y_1\)), … , (\(x_n\), \(y_n\)). How can we fit a simple linear regression model to these data? Initially, our aim is to use estimators \(\hat{\alpha}\) and \(\hat{\beta}\) of \(\alpha\) and \(\beta\) to produce an estimated regression line:
\[ y = \hat{\alpha} + \hat{\beta} \times x\]
There are many possible estimators of \(\alpha\) and \(\beta\) that could be used. A standard approach, which produces estimators with some nice properties, is least squares estimation. First, we rearrange the equation above to define the residuals
\[ R_{i} = Y_{i}-(\hat{\alpha} + \hat{\beta} \times x_{i}) = Y_{i} - \hat{Y}_{i}, ~~~~~~~~ i = 1, ..., n \] the differences between the observed values \(Y_{i}\), i = 1, … , n, and the fitted values \(\hat{Y}_{i} = \hat{\alpha} + \hat{\beta} \times x_{i}\), i = 1, … , n, given by the estimated regression line.
The least squares estimators have the property that they minimise the sum of squared residuals:
\[\sum_{i = 1}^{n} (Y_{i} -\hat{\alpha} - \hat{\beta} \times x_{i})^2\]
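This minimisation has the well-known closed-form solution \(\hat{\beta} = S_{xy}/S_{xx}\) and \(\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{x}\). A minimal sketch of computing these by hand (my own addition; the lm() fits below do the same thing):
x <- hubble$velocity
y <- hubble$distance
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)
c(alpha_hat, beta_hat)                  # matches coef(lm(distance ~ velocity))
sum((y - alpha_hat - beta_hat * x)^2)   # the minimised residual sum of squares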
The following plots show the least squares regression lines under 3 different models:
Model 1. Y does not depend on X, so that: \[ Y_{i} = \alpha_1 + \epsilon_i, ~~~~~i = 1,...,n\]
Model 2. Y depends on X according to Hubble’s law, so that \[ Y_{i} = \beta_2\times x_{i} + \epsilon_i, ~~~~~i = 1,...,n\]
where \(\beta_{2} = T\) is the age of the Universe.
Model 3. Y depends linearly on X, with an intercept that need not be zero, so that \[ Y_{i} = \alpha_3 + \beta_3\times x_{i} + \epsilon_i, ~~~~~i = 1,...,n\]
These figures also show the sizes of the residuals and the residual sums of squares (RSS). From the plots, and the relative sizes of the RSS, it seems clear that velocity x explains some of the variability in the values of distance Y. The residual sum of squares, \(RSS_{3}\), of Model 3 is smaller than the residual sum of squares, \(RSS_{2}\), of Model 2. In fact it is impossible that \(RSS_{3} > RSS_{2}\): Model 2 is the special case of Model 3 with \(\alpha_3 = 0\), so the best fit under Model 3 can never be worse than the best fit under Model 2.
A key question is whether \(RSS_{3}\) is so much smaller than \(RSS_{2}\) that we would choose Model 3 over Model 2, which is a question that is considered in STAT0003 (not my current course); a quick sketch of such a comparison appears after the model fits below. Model 2 is an example of regression through the origin, where the intercept is assumed to equal 0. We should only fit this kind of model if we have a good reason to, and here Hubble’s Law gives us one.
I omit Model 1 here and fit Models 3 and 2 in turn.
usual <- lm(distance ~ velocity, data = hubble)
coef_values <- coef(usual)
coef_values
## (Intercept) velocity
## 0.399098216 0.001372936
rss <- sum(residuals(usual) ^ 2)
# rev(hubble) reverses the column order, so velocity goes on the x-axis
plot(rev(hubble), pch = 16, main = paste("Fit the (usual) model \n RSS =", round(rss, 2)))
abline(coef = coef_values)
unusual <- lm(distance ~ velocity - 1, data = hubble)
coef(unusual)
## velocity
## 0.001921806
rss <- sum(residuals(unusual) ^ 2)
plot(rev(hubble), pch = 16, main = paste("Regression through the origin \n RSS =", round(rss, 2)))
abline(a = 0, b = coef(unusual))
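Because Model 2 is nested within Model 3, the two fits can be compared formally with an F-test; this is the kind of comparison referred to above. A quick sketch (my own addition):
# F-test of the intercept: does Model 3 fit significantly better than Model 2?
anova(unusual, usual)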
We fit this model because Hubble’s Law gives us a reason. The estimated slope (our estimate of the age of the Universe, T) gives \[ \hat{\theta} \approx 0.00192 ~\mathrm{Mpc/(km/sec)} \approx 2 ~billion~ years\] Freedman et al. (2001) updated Hubble’s dataset …
unusual2 <- lm(distance ~ velocity - 1, data = hubble2)
# Overlay both datasets: column 2 holds velocity, column 1 holds distance
matplot(cbind(hubble[, 2], hubble2[, 2]), cbind(hubble[, 1], hubble2[, 1]),
        xlab = "velocity (km/sec)", ylab = "distance (megaparsecs)")
abline(a = 0, b = coef(unusual), col = "black")
abline(a = 0, b = coef(unusual2), col = "red")
\[ \hat{\theta} \approx 0.0123 ~\mathrm{Mpc/(km/sec)} \approx 12 ~billion~ years\]
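The conversion from Mpc/(km/sec) to years is just unit arithmetic: multiply by the number of kilometres in a megaparsec to get seconds, then divide by the number of seconds in a year. A sketch (my own addition):
mpc_in_km   <- 3.086e19               # kilometres in one megaparsec
secs_per_yr <- 60 * 60 * 24 * 365.25  # seconds in a (Julian) year
age_in_years <- function(slope) slope * mpc_in_km / secs_per_yr
age_in_years(coef(unusual))           # about 1.9e9, i.e. roughly 2 billion years
age_in_years(coef(unusual2))          # about 1.2e10, i.e. roughly 12 billion years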