Exercise 7.13 introduces data on the Coast Starlight Amtrak train that runs from Seattle to Los Angeles. The mean travel time from one stop to the next on the Coast Starlight is 129 mins, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. The correlation between travel time and distance is 0.636.
path <- "https://raw.githubusercontent.com/jbryer/DATA606Fall2017/master/Data/Data%20from%20openintro.org/Ch%207%20Exercise%20Data/CoastStarlight.txt"
starlight <- read.delim(path)
mean_t <- mean(starlight$travel_time)
s_t <- sd(starlight$travel_time)
mean_d <- mean(starlight$dist)
s_d <- sd(starlight$dist)
R <- cor(starlight$travel_time,starlight$dist)
tbl <- data.frame(MEAN=c(mean_t, mean_d), SD=c(s_t, s_d), COR=c(R, ""))
row.names(tbl) <- c("time", "distance")
tbl
## MEAN SD COR
## time 128.8750 113.35306 0.635931061735658
## distance 107.4375 99.30154
plot(x=starlight$dist, y=starlight$travel_time, xlab="Distance (miles)", ylab="Travel Time (minutes)")
(a) Write the equation of the regression line for predicting travel time.
\(t={ b }_{ 0 }+{ b }_{ 1 }d\)
where t = predicted travel time in minutes, d = distance between the stops in miles, \({ b }_{ 1 }\) = slope of the regression line, and \({ b }_{ 0 }\) = intercept of the regression line.
The slope of the line can be found as:
\({ b }_{ 1 }=\frac { { s }_{ t } }{ { s }_{ d } } R\)
The point-slope form for the line is:
\(t-\overline { t } ={ b }_{ 1 }(d-\overline { d } )\)
where \(\overline { t }\) = mean travel time, and \(\overline { d }\) = mean distance.
Solve for the intercept:
\({ b }_{ 0 }=-{ b }_{ 1 }\overline { d } +\overline { t }\)
(slope <- (s_t/s_d) * R)
## [1] 0.7259176
(intercept <- - slope * mean_d + mean_t)
## [1] 50.88423
So the regression line is:
\(t = 50.88 + 0.73d\)
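As a sanity check on the formulas above, the following sketch compares the slope and intercept computed from summary statistics with the coefficients returned by R's `lm()`. The data are simulated, not the Coast Starlight data. The names `d_sim`, `t_sim`, `slope_sim`, `intercept_sim` and `fit_sim`, and the values used to generate the data, are assumptions made only for illustration.

```r
# Simulated data (hypothetical; not the Coast Starlight data)
set.seed(1)
d_sim <- runif(20, min = 10, max = 350)          # distances in miles
t_sim <- 50 + 0.7 * d_sim + rnorm(20, sd = 30)   # travel times in minutes

# Slope and intercept from summary statistics, as in part (a)
slope_sim <- (sd(t_sim) / sd(d_sim)) * cor(t_sim, d_sim)
intercept_sim <- mean(t_sim) - slope_sim * mean(d_sim)

# Least-squares fit for comparison
fit_sim <- lm(t_sim ~ d_sim)

# Both comparisons should return TRUE
all.equal(unname(coef(fit_sim)[2]), slope_sim)
all.equal(unname(coef(fit_sim)[1]), intercept_sim)
```

The two approaches agree because the least-squares line always passes through the point of means and has slope equal to the ratio of standard deviations times the correlation.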
(b) Interpret the slope and the intercept in this context.
The slope means that for each additional mile between stops, the expected travel time increases by about 0.73 minutes, or about 7.3 minutes for every 10 additional miles.
From the textbook:
“The intercept describes the average outcome of y if x = 0 and the linear model is valid all the way to x = 0, which in many applications is not the case.”
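The intercept is not meaningful in this application: plugging in d = 0 gives a predicted travel time of t = 50.88 minutes, but a distance of 0 miles means the train has not moved between stops, so a travel time of about 51 minutes does not make sense.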
(c) Calculate \({ R }^{ 2 }\) of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret \({ R }^{ 2 }\) in the context of the application.
(R_square <- R^2)
## [1] 0.4044083
Therefore, \({ R }^{ 2 }\) = 0.4.
The formula for \({ R }^{ 2 }\) is:
\({ R }^{ 2 }=\frac { { { s }_{ t } }^{ 2 }-{ { s }_{ RES } }^{ 2 } }{ { { s }_{ t } }^{ 2 } }\)
\({ R }^{ 2 }\) measures how much the prediction improves when the regression line is used to predict travel time instead of the mean travel time.
You could have used the green line (mean travel time) to predict the travel time, but it would not be very accurate. Predicting with the regression line is more accurate, so the variance of the prediction errors becomes smaller. In this case, the variance is reduced by about 40%; that is, distance explains about 40% of the variability in travel time.
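The variance-reduction reading of \({ R }^{ 2 }\) can be checked numerically. The sketch below uses simulated data (not the Coast Starlight data) and shows that the proportional drop from the variance of the response to the variance of the residuals equals the squared correlation. All object names and generating values are assumptions for illustration.

```r
# Simulated data (hypothetical; not the Coast Starlight data)
set.seed(2)
d_sim <- runif(30, min = 10, max = 350)          # distances in miles
t_sim <- 50 + 0.7 * d_sim + rnorm(30, sd = 60)   # travel times in minutes

fit_sim <- lm(t_sim ~ d_sim)

# Variance of the response and of the residuals
var_t <- var(t_sim)
var_res <- var(resid(fit_sim))

# R^2 as the proportional reduction in variance
r2_from_variance <- (var_t - var_res) / var_t

# R^2 as the squared correlation
r2_from_cor <- cor(t_sim, d_sim)^2

# Should return TRUE
all.equal(r2_from_variance, r2_from_cor)
```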
(d) The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.
# Note: naming this function t masks base R's t() (transpose) for the session
t <- function(d) {
  intercept + slope * d
}
(t(103))
## [1] 125.6537
(e) It actually takes the Coast Starlight about 168 mins to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value.
(e <- 168 - t(103))
## [1] 42.34626
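The residual is the error in the prediction: the observed value minus the predicted value, here 168 - 125.65 ≈ 42.35 minutes. Because the residual is positive, the model underestimates the actual travel time from Santa Barbara to Los Angeles by about 42 minutes.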
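The sign convention matters when reading a residual. The following minimal sketch recomputes the residual from the coefficients printed in part (a) and checks its sign. The variable names `intercept_a`, `slope_a`, `observed`, `predicted` and `residual` are introduced here only for illustration.

```r
# Coefficients copied from the output in part (a)
intercept_a <- 50.88423
slope_a <- 0.7259176

observed <- 168                             # actual travel time, minutes
predicted <- intercept_a + slope_a * 103    # model prediction, minutes

# Residual = observed - predicted
residual <- observed - predicted

# A positive residual means the model underestimated the observed time
residual > 0
round(residual, 2)
```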
(f) Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?
plot(x=starlight$dist, y=starlight$travel_time, xlab="Distance (miles)", ylab="Travel Time (minutes)")
(maximum <- max(starlight$dist))
## [1] 352
No, it would not be appropriate. A distance of 500 miles is well outside the range of the observed data: the largest distance between stops in the dataset is 352 miles, so the prediction would be an extrapolation.
From the textbook:
“Generally, a linear model is only an approximation of the real relationship between two variables. If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.”
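One way to guard against accidental extrapolation is to wrap the prediction in a function that warns when the input lies beyond the observed data. The sketch below is a hypothetical helper, `predict_time()`, not part of the original solution. It uses the coefficients from part (a) and the maximum distance of 352 miles found above. Only the upper bound is checked because the smallest observed distance is not reported here.

```r
# Coefficients from part (a) and the largest observed distance from part (f)
intercept_a <- 50.88423
slope_a <- 0.7259176
max_dist <- 352

# Hypothetical helper: predicts travel time and warns on extrapolation
predict_time <- function(d, max_d = max_dist) {
  if (any(d > max_d)) {
    warning("Distance exceeds the largest observed value; prediction is an extrapolation.")
  }
  intercept_a + slope_a * d
}

predict_time(103)   # within the observed range
predict_time(500)   # still returns a number, but triggers the warning
```

The function still returns a value for 500 miles. The warning only signals that the linear relationship has not been checked at that distance.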