A group of hikers is monitored at checkpoints located along the course of a mountain walk. The data collected are shown below:
distance <- c(1.3, 3.3, 5.8, 8.2, 10.7, 12.4) # in km
arrival <- c(0.35, 1.2, 3.0, 5.1, 6.3, 7.2) # in hours
data.frame(distance, arrival)
##   distance arrival
## 1      1.3    0.35
## 2      3.3    1.20
## 3      5.8    3.00
## 4      8.2    5.10
## 5     10.7    6.30
## 6     12.4    7.20
Plot these data on a piece of graph paper and sketch a line of best fit. Using the points on the line, indicate how the line of best fit is calculated. Hint: What value is minimized in a line of best fit? Indicate this on your graph.
plot(distance, arrival, xlab = "Checkpoint distance / km",
ylab = "Arrival time / hours", main = "Arrival time at different checkpoints")
model <- lm(arrival ~ distance)
abline(model)
segments(distance, arrival, distance, model$fitted.values, col = "red")
The sum of the squared lengths of the red segments (the vertical residuals, i.e. the differences between observed and fitted arrival times) is minimised in least squares linear regression.
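As a rough check on this statement, the sketch below recomputes the slope and intercept from the standard closed-form least squares expressions and shows that moving the line away from the fitted one only increases the sum of squared vertical residuals. The names `slope`, `intercept` and `rss` are introduced here for illustration and are not part of the original solution.

```r
distance <- c(1.3, 3.3, 5.8, 8.2, 10.7, 12.4) # in km
arrival <- c(0.35, 1.2, 3.0, 5.1, 6.3, 7.2) # in hours
model <- lm(arrival ~ distance)

# Closed-form least squares estimates for arrival = intercept + slope * distance
slope <- cov(distance, arrival) / var(distance)
intercept <- mean(arrival) - slope * mean(distance)
all.equal(unname(coef(model)), c(intercept, slope)) # agrees with lm()

# Illustrative helper: sum of squared vertical residuals for a candidate line
rss <- function(a, b) sum((arrival - (a + b * distance))^2)

rss(intercept, slope)        # the minimum, equal to sum(resid(model)^2)
rss(intercept, slope + 0.05) # a slightly steeper line gives a larger sum
rss(intercept + 0.1, slope)  # a shifted line also gives a larger sum
```

The perturbation sizes (0.05 and 0.1) are arbitrary; any other line gives a larger sum than the fitted one.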
If, instead of recording arrival times at fixed distance checkpoints, you had recorded the location of the group at the times indicated above, would your calculation of a best fit line change for the same data? If so, why?
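Yes, the calculation of a best fit line would change. Time is now the independent variable and distance the dependent one, so the sum being minimised is that of the squared errors in distance (horizontal on this plot) rather than in arrival time (vertical).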
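model2 <- lm(distance ~ arrival)
plot(distance, arrival, xlab = "Checkpoint distance / km",
     ylab = "Arrival time / hours", main = "Arrival time at different checkpoints")
# model2 gives distance = a + b * arrival; rearrange to arrival = -a/b + distance/b
# so that its line can be drawn on the same axes
cf <- unname(coef(model2))
abline(a = -cf[1] / cf[2], b = 1 / cf[2])
segments(distance, arrival, model2$fitted.values, arrival, col = "blue")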
It is the sum of squares of the blue segments (the horizontal residuals in distance) that is minimised in the second model, which gives a different equation for the line of best fit. If we tried to predict arrival times by inverting the second model, we would obtain a different result from the one given by the first model. Least squares linear regression is therefore asymmetric in its treatment of the dependent and independent variables: swapping them changes the fitted line unless the points lie exactly on a straight line.
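The asymmetry can be seen numerically with a short sketch along the following lines. It expresses both fitted lines as hours per km and compares their predicted arrival times at a hypothetical 12 km checkpoint. The value `d` and the names `slope1`, `slope2`, `pred1` and `pred2` are assumptions made for illustration only.

```r
distance <- c(1.3, 3.3, 5.8, 8.2, 10.7, 12.4) # in km
arrival <- c(0.35, 1.2, 3.0, 5.1, 6.3, 7.2) # in hours
model <- lm(arrival ~ distance)  # minimises vertical (time) residuals
model2 <- lm(distance ~ arrival) # minimises horizontal (distance) residuals

# Slope of each line in hours per km
slope1 <- unname(coef(model)["distance"])
slope2 <- 1 / unname(coef(model2)["arrival"]) # model2 inverted onto the same axes
c(slope1, slope2)

# Predicted arrival time at a hypothetical checkpoint d km along the course
d <- 12
pred1 <- unname(predict(model, data.frame(distance = d)))
# Invert distance = a + b * arrival to solve for arrival
pred2 <- (d - unname(coef(model2)[1])) / unname(coef(model2)["arrival"])
c(pred1, pred2)

# The ratio of the two slopes equals the squared correlation,
# so the lines coincide only when the correlation is exactly +1 or -1
slope1 / slope2
cor(distance, arrival)^2
```

Both lines pass through the point of means, so the two predictions are closest near the mean distance and drift apart towards the ends of the course.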