CUNY 606 Presentation

7.25

The Coast Starlight, Part II. Exercise 7.13 introduces data on the Coast Starlight Amtrak train that runs from Seattle to Los Angeles. The mean travel time from one stop to the next on the Coast Starlight is 129 mins, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. The correlation between travel time and distance is 0.636.

Scatterplot from Question 7.13

Write the equation of the regression line for predicting travel time.

# The general equation for the regression line is y = B0 + B1*x, where B0 (the intercept) and B1 (the slope) are the two model parameters.
# x is the explanatory (predictor) variable and y is the response.
# B1 is the slope, which equals R * (sample standard deviation of y) / (sample standard deviation of x).
# R is the correlation between the two variables.
# R ranges from -1 to 1, with -1 being a perfect negative linear relationship and +1 a perfect positive linear relationship.

# Distance is the explanatory variable
distance.mean <- 108 # in miles
distance.SD <- 99 # in miles

# Time is the response
time.mean <- 129 # minutes
time.SD <- 113 # minutes

# R for correlation
R <- 0.636

# Calculate the slope (also known as B1)
B1 <- R * (time.SD/distance.SD)

# To calculate B0, we use the point (x, y) = (108, 129). These are the mean values, and the least-squares line always passes through the point of means.
# Rearranging the equation to solve for B0 gives B0 = y - B1*x.
B0 <- time.mean - B1 * distance.mean

# With all of the pieces placed together, the equation for the regression line is:
paste("Regression line equation: Time =", round(B0,3), "+",round(B1,3),"* Distance")

## [1] "Regression line equation: Time = 50.599 + 0.726 * Distance"
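
The calculation above can be packaged as a small helper. This is a minimal sketch rather than part of the original solution: the function name `fit_from_summary` and its argument names are hypothetical, and it uses only the summary statistics given in the exercise.

```r
# Hypothetical helper: least-squares line from summary statistics only.
fit_from_summary <- function(x_mean, x_sd, y_mean, y_sd, r) {
  slope <- r * y_sd / x_sd               # B1 = R * s_y / s_x
  intercept <- y_mean - slope * x_mean   # the line passes through (x_mean, y_mean)
  c(intercept = intercept, slope = slope)
}

# Distance (miles) is x, travel time (minutes) is y, as in the exercise.
coefs <- fit_from_summary(x_mean = 108, x_sd = 99,
                          y_mean = 129, y_sd = 113, r = 0.636)
print(round(coefs, 3))
```

Returning a named vector keeps the intercept and slope together, so later steps can refer to `coefs[["intercept"]]` and `coefs[["slope"]]` instead of separate globals.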

Interpret the slope and the intercept in this context.

# The slope is 0.726, which means that each additional mile between stops is associated with about 0.726 additional minutes of travel time on average (or about 7.26 minutes for every 10 miles). The intercept is about 50.6: the model predicts ~51 minutes of travel time for a distance of 0 miles, which does not make sense in this context, so the intercept only serves to position the line.

Calculate R(squared) of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret R(squared) in the context of the application.

R.squared <- R^2
paste("R squared: ", round(R.squared,3))

## [1] "R squared:  0.404"

# This means that about 40.4% of the variability in travel time is explained by the linear model, i.e. by the distance traveled.
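
As a rough companion to R squared, the share of variation left unexplained can be turned into an approximate typical residual size. This is a sketch under an assumption not made in the original: it ignores the degrees-of-freedom correction, so `resid.SD.approx` is only an approximation to the residual standard deviation.

```r
# Summary statistics from the exercise
R <- 0.636
time.SD <- 113 # minutes

R.squared <- R^2
unexplained <- 1 - R.squared   # share of variation in time NOT explained by distance

# Approximate spread of the residuals, ignoring the degrees-of-freedom correction
resid.SD.approx <- time.SD * sqrt(unexplained)

paste("Unexplained share:", round(unexplained, 3))
paste("Approximate residual SD (minutes):", round(resid.SD.approx, 1))
```

Even with distance in the model, predictions for a single leg can be off by a large margin, which is consistent with a moderate R squared.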

The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.

SBtoLA.dist <- 103 # miles
SBtoLA.time <- B0 + B1 * SBtoLA.dist
paste("According to the model, the estimated time to travel between Santa Barbara and Los Angeles is: ", round(SBtoLA.time,3), "minutes.")

## [1] "According to the model, the estimated time to travel between Santa Barbara and Los Angeles is:  125.37 minutes."
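
The same prediction can be written as a reusable, vectorized function. This is a minimal sketch: `predict_time` is a hypothetical name, and the 50 and 200 mile distances are illustrative values that do not come from the exercise.

```r
# Slope and intercept rebuilt from the exercise's summary statistics
B1 <- 0.636 * (113 / 99)
B0 <- 129 - B1 * 108

# Hypothetical helper: predicted travel time (minutes) for one or more distances (miles)
predict_time <- function(distance_miles) {
  B0 + B1 * distance_miles
}

# 103 miles is the Santa Barbara to Los Angeles leg; 50 and 200 are illustrative only
distances <- c(50, 103, 200)
predicted <- predict_time(distances)
print(data.frame(distance_miles = distances,
                 predicted_minutes = round(predicted, 1)))
```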

It actually takes the Coast Starlight about 168 mins to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value.

# The residual = time(observed) - time(predicted)
SBtoLA.residual <- 168 - SBtoLA.time
paste("The residual is: ", round(SBtoLA.residual,3))

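## [1] "The residual is:  42.63"

# The residual is positive, so the observed travel time is about 42.6 minutes longer than the model predicts. In other words, the model underestimates the travel time for this leg.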
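
The sign convention for the residual can be made explicit in code. This is a minimal sketch; the names `residual_for`, `direction`, and `summary_text` are hypothetical and not part of the original solution.

```r
# Slope and intercept rebuilt from the exercise's summary statistics
B1 <- 0.636 * (113 / 99)
B0 <- 129 - B1 * 108

# Hypothetical helper: residual = observed - predicted
residual_for <- function(distance_miles, observed_minutes) {
  predicted <- B0 + B1 * distance_miles
  observed_minutes - predicted
}

res <- residual_for(distance_miles = 103, observed_minutes = 168)

# A positive residual means the model underestimates; a negative one means it overestimates
direction <- if (res > 0) "underestimates" else "overestimates"
summary_text <- sprintf("The model %s the observed time by %.1f minutes.",
                        direction, abs(res))
print(summary_text)
```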

Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?

# Any time we make predictions for values of the explanatory variable beyond the range of the observed data, we are performing something called `extrapolation`. We have to be careful when we extrapolate, because the linear relationship has only been checked within the observed range and may not hold outside it. In this situation, it might be possible to apply this model at 500 miles, as this destination would still be in California. However, at such a distance a different company or a whole new set of railroad regulations may apply, so the travel times may not follow the linear regression model.
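
One way to make the extrapolation concern concrete is to flag predictions that fall outside the observed range of distances. This is a sketch under stated assumptions: `predict_time_checked` is a hypothetical helper, and the range of 10 to 350 miles is a placeholder for illustration only. The exercise does not state the observed range, so it would need to be read from the actual data.

```r
# Slope and intercept rebuilt from the exercise's summary statistics
B1 <- 0.636 * (113 / 99)
B0 <- 129 - B1 * 108

# Hypothetical guard: x_min and x_max must come from the actual data
predict_time_checked <- function(distance_miles, x_min, x_max) {
  outside <- distance_miles < x_min | distance_miles > x_max
  if (any(outside)) {
    warning("Some distances fall outside the observed range; the prediction is an extrapolation.")
  }
  B0 + B1 * distance_miles
}

# Placeholder range, NOT from the exercise; replace with the range seen in the data
observed_min <- 10
observed_max <- 350

# Capture whether a 500-mile prediction triggers the extrapolation warning
flagged <- tryCatch({
  predict_time_checked(500, observed_min, observed_max)
  FALSE
}, warning = function(w) TRUE)

# The model still returns a number; the warning only signals that it is less trustworthy
est <- suppressWarnings(predict_time_checked(500, observed_min, observed_max))
paste("Extrapolation flagged:", flagged, "| estimate (minutes):", round(est, 1))
```

The guard does not block the prediction. It only signals that the number should be treated with more caution than one made inside the observed range.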


Joel Park

4/3/2017
