The Coast Starlight, Part II. Exercise 7.13 introduces data on the Coast Starlight Amtrak train that runs from Seattle to Los Angeles. The mean travel time from one stop to the next on the Coast Starlight is 129 minutes, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles, with a standard deviation of 99 miles. The correlation between travel time and distance is 0.636.
Scatterplot from Question 7.13
# The general equation for the regression line: y = B0 + B1*x, where B0 (the intercept) and B1 (the slope) are the two model parameters.
# x is the explanatory (predictor) variable and y is the response.
# The slope B1 equals R * (sample standard deviation of y) / (sample standard deviation of x).
# R is the correlation between the two variables.
# R ranges from -1 to 1, with -1 being a perfect negative linear correlation and +1 being a perfect positive linear correlation.
# Distance is the explanatory variable
distance.mean <- 108 # in miles
distance.SD <- 99 # in miles
# Time is the response
time.mean <- 129 # minutes
time.SD <- 113 # minutes
# R for correlation
R <- 0.636
# Calculate the slope, B1
B1 <- R * (time.SD/distance.SD)
# To calculate B0, use the point (x, y) = (108, 129). These are the mean distance and mean time, and the least-squares regression line always passes through the point of means.
# Rearranging the equation to solve for B0 gives B0 = y - B1*x.
B0 <- time.mean - B1 * distance.mean
# With all of the pieces placed together, the equation for the regression line is:
paste("Regression line equation: Time =", round(B0,3), "+",round(B1,3),"* Distance")
## [1] "Regression line equation: Time = 50.599 + 0.726 * Distance"
# The slope is 0.726. This means that each additional mile of distance is associated with about 0.726 additional minutes of travel time, or about 7.26 minutes for every 10 miles. The intercept says that a travel distance of 0 miles corresponds to about 51 minutes of travel time, which does not make sense in this context; it only sets the height of the line.
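The slope and intercept formulas used above can be checked against R's built-in `lm()`. The sketch below uses simulated data, not the Amtrak observations (only summary statistics are given here), and the names `x`, `y`, `b0`, `b1`, and `fit` as well as the simulation parameters are illustrative assumptions.

```r
# Simulated data for illustration only; these are NOT the Coast Starlight observations.
set.seed(42)
x <- runif(50, min = 10, max = 350)      # hypothetical distances (miles)
y <- 50 + 0.7 * x + rnorm(50, sd = 60)   # hypothetical travel times (minutes)

# Slope and intercept from summary statistics, as in the exercise
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

# Least-squares fit for comparison
fit <- lm(y ~ x)

# Both comparisons should be TRUE (up to floating-point tolerance)
isTRUE(all.equal(unname(coef(fit)[2]), b1))
isTRUE(all.equal(unname(coef(fit)[1]), b0))
```

Agreement between the two approaches is what justifies computing the line from the means, standard deviations, and correlation alone.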
R.squared <- R^2
paste("R squared: ", round(R.squared,3))
## [1] "R squared: 0.404"
# This means that about 40.4% of the variation in travel time is explained by the linear model, i.e. by the distance traveled.
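To see why squaring the correlation gives the share of variation explained, the following sketch compares R squared computed from sums of squares with the squared correlation. It again relies on simulated data rather than the Amtrak observations; all variable names and simulation settings are assumptions made for illustration.

```r
# Simulated data for illustration only; these are NOT the Coast Starlight observations.
set.seed(1)
x <- runif(40, min = 10, max = 350)      # hypothetical distances (miles)
y <- 50 + 0.7 * x + rnorm(40, sd = 60)   # hypothetical travel times (minutes)

fit <- lm(y ~ x)

# Variation left unexplained by the line vs. total variation in y
sse <- sum(residuals(fit)^2)
sst <- sum((y - mean(y))^2)

r2_from_sums <- 1 - sse / sst   # proportion of variation explained
r2_from_cor  <- cor(x, y)^2     # squared correlation

# In simple linear regression the three values should agree
c(r2_from_sums, r2_from_cor, summary(fit)$r.squared)
```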
SBtoLA.dist <- 103 # miles
SBtoLA.time <- B0 + B1 * SBtoLA.dist
paste("According to the model, the estimated time to travel between Santa Barbara and Los Angeles is: ", round(SBtoLA.time,3), "minutes.")
## [1] "According to the model, the estimated time to travel between Santa Barbara and Los Angeles is: 125.37 minutes."
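# Residual = observed time - predicted time. The observed travel time used for this leg is 168 minutes.
# A positive residual means the model underestimates the travel time.
SBtoLA.actual <- 168 # minutes
SBtoLA.residual <- SBtoLA.actual - SBtoLA.time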
paste("The residual is: ", round(SBtoLA.residual,3))
## [1] "The residual is: 42.63"
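The prediction and residual steps can also be wrapped in small helper functions, which makes it easier to repeat the calculation for other legs of the route. This is a minimal sketch; `predict_time` and `model_residual` are hypothetical names introduced here, and the coefficients are rebuilt from the summary statistics quoted in the exercise.

```r
# Coefficients rebuilt from the summary statistics in the exercise
b1 <- 0.636 * 113 / 99   # slope: R * (SD of time) / (SD of distance)
b0 <- 129 - b1 * 108     # intercept: line passes through (mean distance, mean time)

# Predicted travel time (minutes) for a given distance (miles)
predict_time <- function(distance) b0 + b1 * distance

# Residual (minutes): observed time minus predicted time
model_residual <- function(distance, observed_time) {
  observed_time - predict_time(distance)
}

round(predict_time(103), 2)
round(model_residual(103, 168), 2)
```

Called with the Santa Barbara to Los Angeles values (103 miles, 168 minutes), the two calls should reproduce the 125.37 and 42.63 shown in the output above.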
# Predicting the response for values of the explanatory variable that lie beyond the range of the observed data is called `extrapolation`. We have to be careful when we extrapolate, because the linear relationship has only been examined within the observed range and may not hold outside it. In this situation, it may be possible to apply the model at 500 miles, as this destination is still in California. However, at such distances a completely different company or a whole new set of railroad regulations may apply, so the travel times may not follow the linear regression model.
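One rough way to gauge how far 500 miles sits from the data behind the model is to express it in standard deviations from the mean stop-to-stop distance. The sketch below does that and also computes what the line would predict at 500 miles. The observed range of distances is not stated in the text, so the z-score is only an indirect indicator, and the variable names are illustrative.

```r
# Summary statistics quoted in the exercise
distance_mean <- 108   # miles
distance_sd   <- 99    # miles
new_distance  <- 500   # miles, the distance under consideration

# How many standard deviations above the mean distance is 500 miles?
z <- (new_distance - distance_mean) / distance_sd

# What the fitted line would predict at that distance (minutes)
b1 <- 0.636 * 113 / 99
b0 <- 129 - b1 * 108
extrapolated_time <- b0 + b1 * new_distance

round(z, 2)
round(extrapolated_time, 1)
```

A distance nearly four standard deviations above the mean is unlikely to be well represented in the data, which is a reason to treat the corresponding prediction with caution.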